# Fundamentals of Artificial Neural Networks

by Mohamad H. Hassoun (MIT Press, 1995) http://neuron.eng.wayne.edu/

Preface

My purpose in writing this book has been to give a systematic account of major concepts and methodologies of artificial neural networks and to present a unified framework that makes the subject more accessible to students and practitioners. This book emphasizes fundamental theoretical aspects of the computational capabilities and learning abilities of artificial neural networks. It integrates important theoretical results on artificial neural networks and uses them to explain a wide range of existing empirical observations and commonly used heuristics.

The main audience is first-year graduate students in electrical engineering, computer engineering, and computer science. This book may be adapted for use as a senior undergraduate textbook by selective choice of topics. It may also be used as a valuable resource for practicing engineers, computer scientists, and others involved in research in artificial neural networks. This book has evolved from lecture notes of two courses on artificial neural networks, a senior-level course and a graduate-level course, which I have taught during the last 6 years in the Department of Electrical and Computer Engineering at Wayne State University.

The background needed to understand this book is general knowledge of some basic topics in mathematics, such as probability and statistics, differential equations, linear algebra, and some multivariate calculus. The reader is also assumed to have enough familiarity with the concept of a system and the notion of "state," as well as with the basic elements of Boolean algebra and switching theory. The required technical maturity is that of a senior undergraduate in electrical engineering, computer engineering, or computer science.

Artificial neural networks are viewed here as parallel computational models, with varying degrees of complexity, comprised of densely interconnected adaptive processing units.
These networks are fine-grained parallel implementations of nonlinear static or dynamic systems. A very important feature of these networks is their adaptive nature, where "learning by example" replaces traditional "programming" in solving problems. This feature makes such computational models very appealing in application domains where one has little or incomplete understanding of the problem to be solved but where training data is readily available. Another key feature is the intrinsic parallelism that allows for fast computation of solutions when these networks are implemented on parallel digital computers or, ultimately, when implemented in customized hardware.

Artificial neural networks are viable computational models for a wide variety of problems, including pattern classification, speech synthesis and recognition, adaptive interfaces between humans and complex physical systems, function approximation, image data compression, associative memory, clustering, forecasting and prediction, combinatorial optimization, nonlinear system modeling, and control. These networks are "neural" in the sense that they may have been inspired by neuroscience, but not because they are faithful models of biological neural or cognitive phenomena. In fact, the majority of the network models covered in this book are more closely related to traditional mathematical and/or statistical models such as optimization algorithms, nonparametric pattern classifiers, clustering algorithms, linear and nonlinear filters, and statistical regression models than they are to neurobiological models.

The theories and techniques of artificial neural networks outlined here are fairly mathematical, although the level of mathematical rigor is relatively low. In my exposition I have used mathematics to provide insight and understanding rather than to establish rigorous mathematical foundations. The selection and treatment of material reflect my background as an electrical and computer engineer.
The operation of artificial neural networks is viewed as that of nonlinear systems: static networks are viewed as mapping or static input/output systems, and recurrent networks are viewed as dynamical systems with evolving "state." The systems approach is also evident when it comes to discussing the stability of learning algorithms and recurrent network retrieval dynamics, as well as in the adopted classifications of neural networks as discrete-state or continuous-state and discrete-time or continuous-time. The neural network paradigms (architectures and their associated learning rules) treated here were selected because of their relevance, mathematical tractability, and/or practicality. Omissions have been made for a number of reasons, including complexity, obscurity, and space.

This book is organized into eight chapters. Chapter 1 introduces the reader to the most basic artificial neural net, consisting of a single linear threshold gate (LTG). The computational capabilities of linear and polynomial threshold gates are derived. A fundamental theorem, the function counting theorem, is proved and is applied to study the capacity and the generalization capability of threshold gates. The concepts covered in this chapter are crucial because they lay the theoretical foundations for justifying and exploring the more general artificial neural network architectures treated in later chapters.

Chapter 2 mainly deals with theoretical foundations of multivariate function approximation using neural networks. The function counting theorem of Chapter 1 is employed to derive upper bounds on the capacity of various feedforward nets of LTGs. The necessary bounds on the size of LTG-based multilayer classifiers for the cases of training data in general position and in arbitrary position are derived.

Theoretical results on continuous function approximation capabilities of feedforward nets, with units employing various nonlinearities, are summarized. The chapter concludes with a discussion of the computational effectiveness of neural net architectures and the efficiency of their hardware implementations.

Learning rules for single-unit and single-layer nets are covered in Chapter 3. More than 20 basic discrete-time learning rules are presented. Supervised rules are considered first, followed by reinforcement, Hebbian, competitive, and feature mapping rules. The presentation of these learning rules is unified in the sense that they may all be viewed as realizing incremental steepest-gradient-descent search on a suitable criterion function. Examples of single-layer architectures are given to illustrate the application of unsupervised learning rules (e.g., principal component analysis, clustering, vector quantization, and self-organizing feature maps).

Chapter 4 is concerned with the theoretical aspects of supervised, unsupervised, and reinforcement learning rules. The chapter starts by developing a unifying framework for the characterization of various learning rules (supervised and unsupervised). Under this framework, a continuous-time learning rule is viewed as a first-order stochastic differential equation/dynamical system whereby the state of the system evolves so as to minimize an associated instantaneous criterion function. Statistical approximation techniques are employed to study the dynamics and stability, in an "average" sense, of the stochastic system. This approximation leads to an "average learning equation" that, in most cases, can be cast as a globally, asymptotically stable gradient system whose stable equilibria are minimizers of a well-defined criterion function. Formal analysis is provided for supervised, reinforcement, Hebbian, competitive, and topology-preserving learning.
Also, the generalization properties of deterministic and stochastic neural nets are analyzed. The chapter concludes with an investigation of the complexity of learning in multilayer neural nets.

Chapter 5 deals with learning in multilayer artificial neural nets. It extends gradient descent-based learning to multilayer feedforward nets, which results in the back error propagation learning rule (or backprop). An extensive number of methods and heuristics for improving backprop's convergence speed and solution quality are presented, and an attempt is made to give a theoretical basis for such methods and heuristics. Several significant applications of backprop-trained multilayer nets are described. These applications include conversion of English text to speech, mapping of hand gestures to speech, recognition of handwritten ZIP codes, continuous vehicle navigation, and medical diagnosis. The chapter also extends backprop to recurrent networks capable of temporal association, nonlinear dynamical system modeling, and control.

Chapter 6 is concerned with other important adaptive multilayer net architectures, such as the radial basis function (RBF) net and the cerebellar model articulation controller (CMAC) net, and their associated learning rules. These networks often have similar computational capabilities to feedforward multilayer nets of sigmoidal units, but with the potential for faster learning. Adaptive multilayer unit-allocating nets such as hyperspherical classifiers, the restricted Coulomb energy (RCE) net, and the cascade-correlation net are discussed. The chapter also addresses the issue of unsupervised learning in multilayer nets, and it describes two specific networks [adaptive resonance theory (ART) net and the autoassociative clustering net] suitable for adaptive data clustering. The clustering capabilities of these nets are demonstrated through examples, including the decomposition of complex electromyogram signals.

Chapter 7 discusses associative neural memories. Various models of associative learning and retrieval are presented and analyzed, with emphasis on recurrent models. The stability, capacity, and error-correction capabilities of these models are analyzed. The chapter concludes by describing the use of one particular recurrent model (the continuous Hopfield model) for solving combinatorial optimization problems.

Global search methods for optimal learning and retrieval in multilayer neural networks are the topic of Chapter 8. It covers the use of simulated annealing, mean-field annealing, and genetic algorithms for optimal learning. Simulated annealing is also discussed in the context of local-minima-free retrievals in recurrent neural networks (Boltzmann machines). Finally, a hybrid genetic algorithm/gradient-descent search method that combines optimal and fast learning is described.

Each chapter concludes with a set of problems designed to allow the reader to further explore the concepts discussed. More than 200 problems of varying degrees of difficulty are provided. The problems can be divided roughly into three categories. The first category consists of problems that are relatively easy to solve. These problems are designed to directly reinforce the topics discussed in the book. The second category of problems, marked with an asterisk (*), is relatively more difficult. These problems normally involve mathematical derivations and proofs and are intended to be thought provoking. Many of these problems include references to technical papers in the literature that may give complete or partial solutions. This second category of problems is intended mainly for readers interested in exploring advanced topics for the purpose of stimulating original research ideas.
Problems marked with a dagger (†) represent a third category of problems that are numerical in nature and require the use of a computer. Some of these problems are mini programming projects, which should be especially useful for students.

This book contains enough material for a full-semester course on artificial neural networks at the first-year graduate level. I have also used this material selectively to teach an upper-level undergraduate introductory course. For the undergraduate course, one may choose to skip all or a subset of the following material: Sections 1.4-1.6, 2.1-2.2, 4.3-4.8, 5.1.2, 5.4.3-5.4.5, 6.1.2, 6.2-6.4, 6.4.2, 7.2.2, 7.4.1-7.4.4, 8.3.2, 8.4.2, and 8.6.

I hope that this book will prove useful to those students and practicing professionals who are interested not only in understanding the underlying theory of artificial neural networks but also in pursuing research in this area. A list of about 700 relevant references is included with the aim of providing guidance and direction for the readers' own search of the research literature. Even though this reference list may seem comprehensive, the published literature is too extensive to allow such a list to be complete.

Acknowledgments

First and foremost, I acknowledge the contributions of the many researchers in the area of artificial neural networks on which most of the material in this text is based. It would have been extremely difficult (if not impossible) to write this book without the support and assistance of a number of organizations and individuals. I would first like to thank the National Science Foundation (through a PYI Award), the Electric Power Research Institute (EPRI), Ford Motor Company, Mentor Graphics, Sun Microsystems, Unisys Corporation, the Whitaker Foundation, and Zenith Data Systems for supporting my research. I am also grateful for the support I have received for this project from Wayne State University through a Career Development Chair Award.

I thank my students, who have made classroom use of preliminary versions of this book and whose questions and comments have definitely enhanced it. In particular, I would like to thank Raed Abu Zitar, David Clark, Mike Finta, Jing Song, Agus Sudjianto, Chuanming (Chuck) Wang, Hui Wang, Paul Watta, and Abbas Youssef. I also would like to thank my many colleagues in the artificial neural networks community and at Wayne State University, especially Dr. A. Robert Spitzer, for many enjoyable and productive conversations and collaborations. I am indebted to Mike Finta, who very capably and enthusiastically typed the complete manuscript and helped with most of the artwork, and to Dr. Paul Watta of the Computation and Neural Networks Laboratory, Wayne State University, for his critical reading of the manuscript and assistance with the simulations that led to Figures 5.3.8 and 5.3.9.

My deep gratitude goes to the reviewers for their critical and constructive suggestions. They are Professors Shun-Ichi Amari of the University of Tokyo, James Anderson of Brown University, Thomas Cover of Stanford University, Richard Golden of the University of Texas-Dallas, Laveen Kanal of the University of Maryland, John Taylor of King's College London, Francis T. S. Yu of the Pennsylvania State University, Dr. Granino Korn of G. A. and T. M. Korn Industrial Consultants, and other anonymous reviewers.

Finally, let me thank my wife Amal, daughter Lamees, and son Tarek for their quiet patience through the many lonely hours during the preparation of the manuscript.

Mohamad H. Hassoun
Detroit, 1994

Table of Contents

Chapter 1 Threshold Gates
1.0 Introduction
1.1 Threshold Gates
1.1.1 Linear Threshold Gates
1.1.2 Quadratic Threshold Gates
1.1.3 Polynomial Threshold Gates
1.2 Computational Capabilities of Polynomial Threshold Gates
1.3 General Position and the Function Counting Theorem
1.3.1 Weierstrass's Approximation Theorem
1.3.2 Points in General Position
1.3.3 Function Counting Theorem
1.3.4 Separability in f-Space
1.4 Minimal PTG Realization of Arbitrary Switching Functions
1.5 Ambiguity and Generalization
1.6 Extreme Points
1.7 Summary
Problems
Chapter 2 Computational Capabilities of Artificial Neural Networks
2.0 Introduction
2.1 Some Preliminary Results on Neural Network Mapping Capabilities
2.1.1 Network Realization of Boolean Functions
2.1.2 Bounds on the Number of Functions Realizable by a Feedforward Network of LTG's
2.2 Necessary Lower Bounds on the Size of LTG Networks
2.2.1 Two Layer Feedforward Networks
2.2.2 Three Layer Feedforward Networks
2.2.3 Generally Interconnected Networks with no Feedback
2.3 Approximation Capabilities of Feedforward Neural Networks for Continuous Functions
2.3.1 Kolmogorov's Theorem
2.3.2 Single Hidden Layer Neural Networks are Universal Approximators
2.3.3 Single Hidden Layer Neural Networks are Universal Classifiers
2.4 Computational Effectiveness of Neural Networks
2.4.1 Algorithmic Complexity
2.4.2 Computational Energy
2.5 Summary
Problems
Chapter 3 Learning Rules

3.0 Introduction
3.1 Supervised Learning in a Single Unit Setting
3.1.1 Error Correction Rules
Perceptron Learning Rule
Generalizations of the Perceptron Learning Rule
The Perceptron Criterion Function
Mays Learning Rule
Widrow-Hoff (alpha-LMS) Learning Rule
3.1.2 Other Gradient Descent-Based Learning Rules
mu-LMS Learning Rule
The mu-LMS as a Stochastic Process
Correlation Learning Rule
3.1.3 Extension of the mu-LMS Rule to Units with Differentiable Activation Functions: Delta Rule
3.1.4 Adaptive Ho-Kashyap (AHK) Learning Rules
3.1.5 Other Criterion Functions
3.1.6 Extension of Gradient Descent-Based Learning to Stochastic Units
3.2 Reinforcement Learning
3.2.1 Associative Reward-Penalty Reinforcement Learning Rule
3.3 Unsupervised Learning
3.3.1 Hebbian Learning
3.3.2 Oja's Rule
3.3.3 Yuille et al. Rule
3.3.4 Linsker's Rule
3.3.5 Hebbian Learning in a Network Setting: Principal Component Analysis (PCA)
PCA in a Network of Interacting Units
PCA in a Single Layer Network with Adaptive Lateral Connections
3.3.6 Nonlinear PCA
3.4 Competitive Learning
3.4.1 Simple Competitive Learning
3.4.2 Vector Quantization
3.5 Self-Organizing Feature Maps: Topology Preserving Competitive Learning
3.5.1 Kohonen's SOFM
3.5.2 Examples of SOFMs
3.6 Summary
Problems
Chapter 4 Mathematical Theory of Neural Learning
4.0 Introduction
4.1 Learning as a Search Mechanism
4.2 Mathematical Theory of Learning in a Single Unit Setting
4.2.1 General Learning Equation
4.2.2 Analysis of the Learning Equation
4.2.3 Analysis of some Basic Learning Rules
4.3 Characterization of Additional Learning Rules
4.3.1 Simple Hebbian Learning
4.3.2 Improved Hebbian Learning
4.3.3 Oja's Rule
4.3.4 Yuille et al. Rule

4.3.5 Hassoun's Rule
4.4 Principal Component Analysis (PCA)
4.5 Theory of Reinforcement Learning
4.6 Theory of Simple Competitive Learning
4.6.1 Deterministic Analysis
4.6.2 Stochastic Analysis
4.7 Theory of Feature Mapping
4.7.1 Characterization of Kohonen's Feature Map
4.7.2 Self-Organizing Neural Fields
4.8 Generalization
4.8.1 Generalization Capabilities of Deterministic Networks
4.8.2 Generalization in Stochastic Networks
4.9 Complexity of Learning
4.10 Summary
Problems
Chapter 5 Adaptive Multilayer Neural Networks I
5.0 Introduction
5.1 Learning Rule for Multilayer Feedforward Neural Networks
5.1.1 Error Backpropagation Learning Rule
5.1.2 Global Descent-Based Error Backpropagation
5.2 Backprop Enhancements and Variations
5.2.1 Weights Initialization
5.2.2 Learning Rate
5.2.3 Momentum
5.2.4 Activation Function
5.2.5 Weight Decay, Weight Elimination, and Unit Elimination
5.2.6 Cross-Validation
5.2.7 Criterion Functions
5.3 Applications
5.3.1 NetTalk
5.3.2 Glove-Talk
5.3.3 Handwritten ZIP Code Recognition
5.3.4 ALVINN: A Trainable Autonomous Land Vehicle
5.3.5 Medical Diagnosis Expert Net
5.3.6 Image Compression and Dimensionality Reduction
5.4 Extensions of Backprop for Temporal Learning
5.4.1 Time-Delay Neural Networks
5.4.2 Backpropagation Through Time
5.4.3 Recurrent Back-Propagation
5.4.4 Time-Dependent Recurrent Back-Propagation
5.4.5 Real-Time Recurrent Learning
5.5 Summary
Problems
Chapter 6 Adaptive Multilayer Neural Networks II
6.0 Introduction
6.1 Radial Basis Function (RBF) Networks

6.1.1 RBF Networks versus Backprop Networks
6.1.2 RBF Network Variations
6.2 Cerebellar Model Articulation Controller (CMAC)
6.2.1 CMAC Relation to Rosenblatt's Perceptron and Other Models
6.3 Unit-Allocating Adaptive Networks
6.3.1 Hyperspherical Classifiers
Restricted Coulomb Energy (RCE) Classifier
Real-Time Trained Hyperspherical Classifier
6.3.2 Cascade-Correlation Network
6.4 Clustering Networks
6.4.1 Adaptive Resonance Theory (ART) Networks
6.4.2 Autoassociative Clustering Network
6.5 Summary
Problems
Chapter 7 Associative Neural Memories
7.0 Introduction
7.1 Basic Associative Neural Memory Models
7.1.1 Simple Associative Memories and their Associated Recording Recipes
Correlation Recording Recipe
A Simple Nonlinear Associative Memory Model
Optimal Linear Associative Memory (OLAM)
OLAM Error Correction Capabilities
Strategies for Improving Memory Recording
7.1.2 Dynamic Associative Memories (DAM)
Continuous-Time Continuous-State Model
Discrete-Time Continuous-State Model
Discrete-Time Discrete-State Model
7.2 DAM Capacity and Retrieval Dynamics
7.2.1 Correlation DAMs
7.2.2 Projection DAMs
7.3 Characteristics of High-Performance DAMs
7.4 Other DAM Models
7.4.1 Brain-State-in-a-Box (BSB) DAM
7.4.2 Non-Monotonic Activations DAM
Discrete Model
Continuous Model
7.4.3 Hysteretic Activations DAM
7.4.4 Exponential Capacity DAM
7.5 Sequence Generator DAM
7.6 Heteroassociative DAM
7.7 The DAM as a Gradient Net and its Application to Combinatorial Optimization
7.8 Summary
Problems
Chapter 8 Global Search Methods for Neural Networks
8.0 Introduction
8.1 Local versus Global Search
8.1.1 A Gradient Descent/Ascent Search Strategy
8.1.2 Stochastic Gradient Search: Global Search via Diffusion
8.2 Simulated Annealing-Based Global Search
8.3 Simulated Annealing for Stochastic Neural Networks
8.3.1 Global Convergence in a Stochastic Recurrent Neural Net: The Boltzmann Machine
8.3.2 Learning in Boltzmann Machines
8.4 Mean-Field Annealing and Deterministic Boltzmann Machines
8.4.1 Mean-Field Retrieval
8.4.2 Mean-Field Learning
8.5 Genetic Algorithms in Neural Network Optimization
8.5.1 Fundamentals of Genetic Search
8.5.2 Application of Genetic Algorithms to Neural Networks
8.6 Genetic Algorithm Assisted Supervised Learning
8.6.1 Hybrid GA/Gradient Descent Method for Feedforward Multilayer Net Training
8.6.2 Simulations
8.7 Summary
Problems
References
Index

1. THRESHOLD GATES

1.0 Introduction

Artificial neural networks are parallel computational models, comprised of densely interconnected adaptive processing units. These networks are fine-grained parallel implementations of nonlinear static or dynamic systems. A very important feature of these networks is their adaptive nature, where "learning by example" replaces "programming" in solving problems. This feature makes such computational models very appealing in application domains where one has little or incomplete understanding of the problem to be solved, but where training data is available. Another key feature is the intrinsic parallel architecture, which allows for fast computation of solutions when these networks are implemented on parallel digital computers or, ultimately, when implemented in customized hardware.

Artificial neural networks are viable computational models for a wide variety of problems. These include pattern classification, speech synthesis and recognition, adaptive interfaces between humans and complex physical systems, function approximation, image compression, associative memory, clustering, forecasting and prediction, combinatorial optimization, nonlinear system modeling, and control. These networks are "neural" in the sense that they may have been inspired by neuroscience, but not necessarily because they are faithful models of biological neural or cognitive phenomena. In fact, the majority of the networks covered in this book are more closely related to traditional mathematical and/or statistical models such as non-parametric pattern classifiers, clustering algorithms, nonlinear filters, and statistical regression models than they are to neurobiological models.

The "artificial neuron" is the basic building block/processing unit of an artificial neural network. It is necessary to understand the computational capabilities of this processing unit as a prerequisite for understanding the function of a network of such units. Here, an approximation to the function of a biological neuron is captured by the linear threshold gate (McCulloch and Pitts, 1943). The artificial neuron model considered here is closely related to an early model used in threshold logic (Winder, 1962; Brown, 1964; Dertouzos, 1965; Hu, 1965; Lewis and Coates, 1967; Sheng, 1969; Muroga, 1971).

This chapter investigates the computational capabilities of a linear threshold gate (LTG). Then, the polynomial threshold gate (PTG) is developed as a generalization of the LTG, and its computational capabilities are studied. An important theorem, known as the Function Counting Theorem (Cover, 1965), is proved and is used to determine the statistical capacity of LTG's and PTG's. Also in this chapter, a method for minimal parameter PTG synthesis is developed for the realization of arbitrary binary mappings (switching functions). Finally, the chapter concludes by defining the concepts of ambiguous and extreme points and applies them to study the generalization capability of threshold gates and to determine the average amount of information necessary for characterizing large data sets by threshold gates.

1.1 Threshold Gates

1.1.1 Linear Threshold Gates

The basic function of a linear threshold gate (LTG) is to discriminate between labeled points (vectors) belonging to two different classes. An LTG maps a vector of input data, x, into a single binary output, y. The transfer function of an LTG is given analytically by

(1.1.1)    y = 1 if Σ (i = 1..n) wi xi ≥ T;  y = 0 if Σ (i = 1..n) wi xi < T

where x = [x1 x2 ... xn]T and w = [w1 w2 ... wn]T are the input and weight (column) vectors, respectively, and T is a threshold constant. The vector x in Equation (1.1.1) is n-dimensional with binary or real components (i.e., x ∈ {0, 1}n or x ∈ Rn). Thus, the LTG output y may assume either of the following mapping forms:

y: {0, 1}n → {0, 1}

or

y: Rn → {0, 1}

An LTG performs a linear weighted-sum operation followed by a nonlinear hard clipping/thresholding operation, as described in Equation (1.1.1). Figure 1.1.1 (a) shows a symbolic representation of an LTG with n inputs. A graphical representation of Equation (1.1.1) is shown in Figure 1.1.1 (b).

Figure 1.1.1. (a) Symbolic representation of a linear threshold gate and (b) its transfer function.

Figure 1.1.2 shows an example LTG which realizes the Boolean function y given by:

(1.1.2)    y = x̄1 x2 + x2 x̄3

Equation (1.1.2) reads y = [(NOT x1) AND x2] OR [x2 AND (NOT x3)]. Here, the weight vector w = [−1 2 −1]T and threshold T = 1/2 lead to a correct realization of y. One way of arriving at the solution for w and T is to directly solve the following set of eight inequalities:

0 < T
w3 < T
w2 ≥ T
w2 + w3 ≥ T
w1 < T
w1 + w3 < T
w1 + w2 ≥ T
w1 + w2 + w3 < T

These inequalities are obtained by substituting all eight binary input combinations (x1, x2, x3) and their associated y values from Equation (1.1.2). That is, for input (x1, x2, x3) = (0, 0, 0), the output y [using Equation (1.1.2)] is given by y = 0. Hence, for a proper operation of the LTG, we require 0w1 + 0w2 + 0w3 < T, which gives the first of the above eight inequalities: 0 < T. The other seven inequalities are obtained similarly for each of the remaining cases, (x1, x2, x3) = (0, 0, 1) through (1, 1, 1). It should be noted that the solution given in Figure 1.1.2 is one of an infinite number of possible solutions for the above set of inequalities.

Figure 1.1.2. Example of an LTG realization of the Boolean function y = x̄1 x2 + x2 x̄3.

There exists a total of 2^(2^n) unique Boolean functions (switching functions) of n variables; there are 2^n combinations of n independent binary variables, which lead to 2^(2^n) unique ways of labeling these 2^n combinations into two distinct categories (i.e., 0 or 1). It can be shown (see Section 1.2) that a single n-input LTG is capable of realizing only a small subset of these Boolean functions (refer to Figure 1.1.3). A Boolean function which can be realized by a single LTG is known as a threshold function. A threshold function is a linearly separable function; that is, a function with inputs belonging to two distinct categories (classes) such that the inputs corresponding to one category may be perfectly, geometrically separated from the inputs corresponding to the other category by a hyperplane. Any function that is not linearly separable, such as the exclusive-OR (XOR) function x1 ⊕ x2, where x1 and x2 belong to {0, 1}, cannot be realized using a single LTG and is termed a non-threshold function. Linear and non-linear separability are illustrated in Figure 1.1.4 (a) and (b), respectively.

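The eight inequalities above can be checked mechanically. The following minimal Python sketch (the helper names `ltg` and `target` are mine, and the weights w = (−1, 2, −1) with T = 1/2 are one admissible solution of the inequalities, not the only one) verifies an LTG realization of the function in Equation (1.1.2) over all eight inputs:

```python
from itertools import product

def ltg(x, w, T):
    """Linear threshold gate: output 1 iff the weighted sum reaches T."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= T else 0

def target(x1, x2, x3):
    """Boolean function of Equation (1.1.2): (NOT x1 AND x2) OR (x2 AND NOT x3)."""
    return ((1 - x1) & x2) | (x2 & (1 - x3))

w, T = (-1, 2, -1), 0.5
for x in product((0, 1), repeat=3):
    # Compare the gate's output against the truth table for every input.
    assert ltg(x, w, T) == target(*x), f"mismatch at {x}"
print("w =", w, "T =", T, "realizes the function for all 8 inputs")
```

Any other (w, T) satisfying the eight inequalities would pass the same check, which is what makes the solution set infinite.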
Figure 1.1.3. Pictorial representation depicting the set of threshold functions as a small subset of the set of all Boolean functions.

Figure 1.1.4. Linear versus non-linear separability: (a) linearly separable function and (b) non-linearly separable function. Filled circles and open squares designate points in the first class and second class, respectively.

Threshold functions have been exhaustively enumerated for small n (Cameron, 1960; Muroga, 1971), as shown in Table 1.1.1.

| n | Number of threshold functions, Bn | Total number of Boolean functions, 2^(2^n) |
|---|---|---|
| 1 | 4 | 4 |
| 2 | 14 | 16 |
| 3 | 104 | 256 |
| 4 | 1,882 | 65,536 |
| 5 | 94,572 | ≈ 4.3 × 10^9 |
| 6 | 15,028,134 | ≈ 1.8 × 10^19 |
| 7 | 8,378,070,864 | ≈ 3.4 × 10^38 |
| 8 | 17,561,539,552,946 | ≈ 1.16 × 10^77 |

Table 1.1.1. Comparison of the number of threshold functions versus the number of all possible Boolean functions for selected values of n.

This table shows the limitation of a single LTG with regard to the realization of an arbitrary Boolean function. More formally, the ratio of the number of LTG-realizable Boolean functions, Bn, to the total number of Boolean functions approaches zero as n → ∞:

(1.1.3)    lim (n → ∞) Bn / 2^(2^n) = 0

This result is verified in Section 2.1 of Chapter 2.

Although a single LTG cannot represent all Boolean functions, it is capable of realizing the universal NAND (or NOR) logic operation. Hence, the LTG is a universal logic gate: any Boolean function is realizable using a network of LTG's (only two logic levels are needed). Besides the basic NAND and NOR functions, though, an LTG is capable of realizing many more Boolean functions. Therefore, a single n-input LTG is a much more powerful gate than a single n-input NAND or NOR gate.

For n ≤ 5, a Karnaugh map (or K-map) may be employed to identify threshold functions or to perform the decomposition of non-threshold functions into two or more factors, each of which will be a threshold function. This decomposition allows for obtaining an LTG network realization for Boolean functions, as illustrated later in Section 2.1. Figure 1.1.5 shows the admissible K-map threshold patterns for n = 3. Each admissible pattern may be in any position on the map, provided that its basic topological structure is preserved. Note that the complements of such patterns also represent admissible threshold patterns (refer to Example 1.1.2). Admissible patterns of Boolean functions of n variables are also admissible for functions of n + 1 or more variables (Kohavi, 1978). The K-map for the threshold function in Equation (1.1.2) is shown in Figure 1.1.6 along with its corresponding threshold pattern. The 1's of this function can be grouped as shown to form one of the threshold patterns depicted in Figure 1.1.5.

Figure 1.1.5. Admissible Karnaugh map threshold patterns for n = 3.

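The first rows of Table 1.1.1 can be reproduced by exhaustive search. The sketch below (a brute-force check of my own devising, not the enumeration method of Cameron or Muroga) tests each of the 16 Boolean functions of n = 2 variables against a small grid of integer weights and half-integer thresholds; exactly 14 turn out to be threshold functions, the two exceptions being XOR and XNOR:

```python
from itertools import product

def is_threshold(truth, inputs):
    """True if some LTG (w, T) from a small search grid realizes the labeling."""
    for w in product(range(-3, 4), repeat=2):
        for T2 in range(-7, 8):          # thresholds -3.5, -3.0, ..., 3.5
            T = T2 / 2
            if all((1 if w[0] * x1 + w[1] * x2 >= T else 0) == y
                   for (x1, x2), y in zip(inputs, truth)):
                return True
    return False

inputs = list(product((0, 1), repeat=2))            # the 2^2 = 4 input points
count = sum(is_threshold(truth, inputs)             # 2^(2^2) = 16 labelings
            for truth in product((0, 1), repeat=4))
print(count)  # 14: every 2-variable Boolean function except XOR and XNOR
```

The small grid suffices for n = 2 because every 2-variable threshold function admits integer weights of magnitude at most 1 with a half-integer threshold; for larger n the required weight range grows quickly, which is why exhaustive enumeration stops at small n.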
Figure 1.1.6. A Karnaugh map representation for the threshold function y = x̄1 x2 + x2 x̄3.

1.1.2 Quadratic Threshold Gates

Given that for large n the number of threshold functions is very small in comparison to the total number of available Boolean functions, one might try to design a yet more powerful "logic" gate which can realize non-threshold functions. This can be accomplished by expanding the number of inputs to an LTG. For example, one can do this by feeding the products or ANDings of inputs as new inputs to the LTG. In this case, we require a fixed preprocessing layer of AND gates which artificially increases the dimensionality of the input space. We expect that the resulting Boolean function (which is now only partially specified) becomes a threshold function and hence realizable using a single LTG. This phenomenon is illustrated through an example (Example 1.1.1) given in the next section. The realization of a Boolean function by the above process leads to a quadratic threshold gate (QTG). The general transfer characteristics for an n-input QTG are given by

(1.1.4)    y = 1 if Σ (i = 1..n) wi xi + Σ (i = 1..n) Σ (j = i+1..n) wij xi xj ≥ T;  y = 0 otherwise

for x ∈ {0, 1}n, and

(1.1.5)    y = 1 if Σ (i = 1..n) wi xi + Σ (i = 1..n) Σ (j = i..n) wij xi xj ≥ T;  y = 0 otherwise

for x ∈ Rn. Note that the only difference between Equations (1.1.4) and (1.1.5) is the range of the index j of the second summation in the double summation term. The bounds on the double summations in Equations (1.1.4) and (1.1.5) eliminate the wij xi xj and wji xj xi duplications. QTG's greatly increase the number of realizable Boolean functions as compared to LTG's.

By comparing the number of degrees of freedom (number of weights plus threshold) listed in Table 1.1.2, we find an increased flexibility of a QTG over an LTG.

Table 1.1.2. Comparison of the number of degrees of freedom in an LTG versus a QTG.

Threshold gate        Number of degrees of freedom/parameters (including threshold)
LTG                   n + 1
QTG (x in R^n)        n + n(n + 1)/2 + 1
QTG (x in {0, 1}^n)   n + n(n - 1)/2 + 1

1.1.3 Polynomial Threshold Gates

Although the QTG greatly increases the number of functions that can be realized, a single QTG still cannot realize all Boolean functions of n variables. Knowing that a second-order polynomial expansion of inputs offers some improvement, it makes sense to extend this concept to r-order polynomials. This results in a polynomial threshold gate of order r, denoted PTG(r). Note that the LTG and QTG are special cases, where LTG = PTG(1) and QTG = PTG(2). The general transfer equation for a PTG(r) with binary inputs is given by

y = 1 if Σ_S w_S Π_{i∈S} x_i ≥ T, and y = 0 otherwise    (1.1.7)

where the sum runs over all subsets S of {1, 2, ..., n} containing at least one and at most r indices; that is, the weighted sum includes one product term for each distinct ANDing of up to r of the inputs. In this case, the number of degrees of freedom is given by

d = Σ_{i=0}^{r} C(n, i)    (1.1.8)

where C(n, i) = n!/[i!(n − i)!] and the i = 0 term accounts for the threshold.
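The count in Equation (1.1.8) is easy to evaluate; the short helper below (a Python sketch; the function name is ours) confirms that it reduces to the LTG and binary-input QTG entries of Table 1.1.2, and that it reaches 2^n when r = n:

```python
from math import comb

def ptg_dof(n, r):
    """Degrees of freedom of a binary-input PTG(r), Equation (1.1.8):
    one weight per product of 1..r distinct inputs, plus the threshold
    (the i = 0 term)."""
    return sum(comb(n, i) for i in range(r + 1))

print(ptg_dof(4, 1))  # 5  = n + 1 (an LTG)
print(ptg_dof(4, 2))  # 11 = n + n*(n-1)//2 + 1 (a binary-input QTG)
print(ptg_dof(4, 4))  # 16 = 2**4; for r = n, d reaches its maximum 2**n
```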

The PTG appears to be a powerful gate. It is worthwhile to investigate its capabilities.

1.2 Computational Capabilities of Polynomial Threshold Gates

Next, we consider the capabilities of PTG's in realizing arbitrary Boolean functions. Let us start with a theorem which establishes the universality of a single n-input PTG(r) in the realization of arbitrary Boolean functions.

Theorem 1.2.1 (Nilsson, 1965; Krishnan, 1966): Any Boolean function of n variables can be realized using a PTG of order r ≤ n.

The proof of this theorem follows from the discussion in Section 1.4 (see Section 1.4 for details). Theorem 1.2.1 indicates that r = n is an upper bound on the order of the PTG for realizing arbitrary Boolean functions. It implies that the most difficult n-variable Boolean function to implement by a PTG(r) requires, in the worst case, r = n, and this may require a number of parameters which increases exponentially in n (d = 2^n parameters for r = n).

Winder (1963) gave the following upper bound on the number of realizable Boolean functions, Bn(r):

Bn(r) ≤ C(2^n, d) = 2 Σ_{i=0}^{d−1} C(2^n − 1, i)    (1.2.1)

where d is given by Equation (1.1.8). Here, the term C(m, k) gives the number of different combinations of m different things, k at a time, without repetitions. The right-hand side of Equation (1.2.1) should not exceed the number of n-variable Boolean functions, 2^(2^n), if it is to be a tight upper bound. Indeed, this can be shown to be the case by first starting with Equation (1.2.1) and then finding the limit of C(2^n, d) as r approaches n.

A couple of special cases are interesting to examine. The first is when r = n. From Theorem 1.2.1, any n-variable Boolean function is realizable using a single PTG(n). For r = n, d takes on its largest possible value of 2^n, and Equation (1.2.1) can be evaluated by taking the following limit: since Σ_{i=0}^{m−1} C(m − 1, i) = 2^(m−1), setting d = 2^n gives

Bn(n) ≤ C(2^n, 2^n) = 2 · 2^(2^n − 1) = 2^(2^n)    (1.2.4)

which is the desired result.

The other interesting case is for r = 1, which leads to the case of an LTG. Employing Equation (1.2.1) with d = n + 1 gives the following upper bound on the number of n-input threshold functions:

Bn(1) ≤ C(2^n, n + 1) = 2 Σ_{i=0}^{n} C(2^n − 1, i)    (1.2.5)

It can be shown (Winder, 1963; see Problem 1.2.3) that a yet tighter upper bound on Bn(1), for n ≥ 2, is

Bn(1) ≤ 2^(n^2)    (1.2.6)

Table 1.2.1 extends Table 1.1.1 by evaluating the above upper bounds on the number of threshold functions. Note that, despite the pessimistic fact implied by Equation (1.2.6), a single LTG remains a powerful logic gate by being able to realize a very large number of Boolean functions. This can be seen from the enumerated results in Table 1.2.1 (see the column labeled Bn(1)). In fact, Bn(1) scales exponentially in n, as can be deduced from the following lower bound (Muroga, 1965):

Bn(1) ≥ 2^[n(n − 1)/2]    (1.2.7)

Table 1.2.1. Enumeration of threshold functions and evaluations of various upper bounds for n ≤ 8.

n    Bn(1)                   Bound of Eq. (1.2.5)    Bound of Eq. (1.2.6)    All Boolean functions (2^(2^n))
1    4                       4                       2                       4
2    14                      14                      16                      16
3    104                     128                     512                     256
4    1,882                   3,882                   65,536                  65,536
5    94,572                  412,736                 33,554,432              ~4.3 x 10^9
6    15,028,134              ~1.5 x 10^8             ~6.9 x 10^10            ~1.8 x 10^19
7    8,378,070,864           ~1.9 x 10^11            ~5.6 x 10^14            ~3.4 x 10^38
8    17,561,539,552,946      ~8.2 x 10^14            ~1.8 x 10^19            ~1.16 x 10^77
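The bound of Equation (1.2.5) can be checked directly against the exact counts of Table 1.2.1 with a few lines of Python (the exact counts below are quoted from the table; the function name is ours):

```python
from math import comb

def C(m, d):
    """C(m, d) = 2 * sum_{i < d} C(m - 1, i), as in Equation (1.2.1)."""
    return 2 * sum(comb(m - 1, i) for i in range(d))

# Exact numbers of threshold functions, from Table 1.2.1:
exact_Bn1 = {1: 4, 2: 14, 3: 104, 4: 1882, 5: 94572}
for n, exact in exact_Bn1.items():
    bound = C(2 ** n, n + 1)   # Equation (1.2.5), with d = n + 1
    print(n, exact, bound)
    assert exact <= bound      # the upper bound holds
```

Note that the bound is exact for n = 1 and 2 and loosens rapidly thereafter.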

By its very definition, a PTG may be thought of as a two-layer network with a fixed preprocessing layer followed by a high fan-in LTG. Kaszerman (1963) showed that a PTG(r) with binary inputs can be realized as the cascade of a layer of AND gates (one gate for each product term of order two through r), each having a fan-in of between two and r input lines, and a single LTG with d − 1 inputs (representing the inputs received from the AND gates plus the original n inputs). The resulting architecture is shown in Figure 1.2.2.

Figure 1.2.2. PTG realization as a cascade of one layer of AND gates and a single LTG.

Example 1.2.1: Consider the XNOR function (also known as the equivalence function), whose output is 1 if and only if x1 = x2. Using a Karnaugh map (shown in Figure 1.2.1), it can be verified that this function is not a threshold function; therefore, it cannot be realized using a single 2-input LTG (a diagonal pattern of 1's in the K-map is a nonthreshold pattern and indicates a nonthreshold/non-linearly separable function).

Figure 1.2.1. Karnaugh map of the XNOR function.

Since n = 2, Theorem 1.2.1 implies that a PTG(2) is sufficient; i.e., the QTG with three weights and a threshold, shown in Figure 1.2.3, should be sufficient to realize the XNOR function. By defining x3 as the product x1 x2, we can treat the PTG as a 3-input LTG and generate the K-map shown in Figure 1.2.4. Since x3 is defined as the AND of x1 and x2, there are some undefined states which we will refer to as "don't care" states. These states can be assigned either a 1 or a 0, to our liking, and give us the flexibility to identify the threshold pattern shown in Figure 1.2.5 (this pattern is the complement of one of the threshold patterns shown in Figure 1.1.5, and therefore it is an admissible threshold pattern).

Figure 1.2.3. A PTG(2) (or QTG) with binary inputs. Figure 1.2.4. Karnaugh map for XNOR in the expanded input space. Figure 1.2.5. Karnaugh map after an appropriate assignment of 1's and 0's to the "don't care" states of the map in Figure 1.2.4.

The K-map of Figure 1.2.5 verifies that the expanded-input function is realizable using a single LTG. The QTG in Figure 1.2.3 will therefore realize the desired XNOR function with weight assignments as shown in Figure 1.2.6. Dertouzos (1965) describes a tabular method for determining weights for LTG's with small n; this method works well here, but will not be discussed. The topic of adaptive weight computation for LTG's is treated in Chapter Three.

Figure 1.2.6. Weight assignment for a QTG for realizing the XNOR function.
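The QTG realization of XNOR is easily verified numerically. The weights used below (w1 = w2 = −1, w12 = 2, T = 0) are one admissible assignment chosen for illustration, not necessarily the one shown in Figure 1.2.6:

```python
def qtg_xnor(x1, x2):
    """A QTG of the form of Equation (1.1.5) with n = 2 and an illustrative
    weight assignment: g(x) = -x1 - x2 + 2*x1*x2, output 1 iff g(x) >= 0."""
    g = -x1 - x2 + 2 * x1 * x2
    return 1 if g >= 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, qtg_xnor(x1, x2))  # output is 1 exactly when x1 == x2
```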

A geometric interpretation of the synthesized QTG may be obtained by first employing Equation (1.2.8): the weighted sum g(x) (the signal just before thresholding) defines a separating surface that discriminates between the 1 and 0 vertices. For instance, with the admissible weight assignment w1 = w2 = −1, w12 = 2, and T = 0, we have

g(x) = −x1 − x2 + 2 x1 x2    (1.2.8)

with y = 1 if and only if g(x) ≥ 0. A plot of this function is shown in Figure 1.2.7, which illustrates how the QTG employs a nonlinear separating surface in order to correctly classify all vertices of the XNOR problem. In arriving at this result, we took the liberty of interpreting the AND operation as multiplication; note that this interpretation does not affect the QTG of Figure 1.2.6 when the inputs are in {0, 1}.

Figure 1.2.7. Separating surface realized by the QTG of Equation (1.2.8).

Another way of geometrically interpreting the operation of the QTG described by Equation (1.2.8) is possible if we define the product x1 x2 as a new input x3, which gives

g(x) = −x1 − x2 + 2 x3    (1.2.9)

Now, we can visualize the surface of Equation (1.2.9) in 3 dimensions as a plane which properly separates the four vertices (patterns) of the XNOR function (see Problem 1.2.4 for an exploration of this idea).

1.3 General Position and the Function Counting Theorem

In this section, we first present a classical theorem on the capability of polynomials as approximators of continuous functions. Next, we try to answer the following fundamental question: Given m points in {0, 1}^n, how many dichotomies of these m points are realizable with a single PTG(r), r ≤ n? We are also interested in answering the question for the more general case of m points in R^n.

1.3.1 Weierstrass' Approximation Theorem

Theorem 1.3.1 (Weierstrass' Approximation Theorem): Let g be a continuous real-valued function defined on a closed interval [a, b]. Then, given any positive ε, there exists a polynomial y (which may depend on ε) with real coefficients such that

|g(x) − y(x)| < ε    (1.3.1)

for every x ∈ [a, b].

The proof of this theorem can be found in Apostol (1957). Theorem 1.3.1 is described by the statement: every continuous function can be "uniformly approximated" by a polynomial. Note that the order of the polynomial depends on the function being approximated and the desired accuracy of the approximation. Weierstrass' Approximation Theorem also applies for the case of a continuous multivariate function g which maps a compact set X ⊂ R^n to a compact set Y ⊂ R (Narendra and Parthasarathy, 1990). Theorem 1.3.1 can also be extended to nonpolynomial functions. Let {φ1(x), φ2(x), ..., φd(x)} be a complete orthonormal set; then any g satisfying the requirements of Theorem 1.3.1 can be approximated by a function

y(x) = Σ_{j=1}^{d} w_j φ_j(x)    (1.3.2)

where the w_j's are real constants.

Polynomials may also be used to approximate binary-valued functions defined on a finite set of points. Examples of such functions are Boolean functions and functions of the form g: S → {0, 1}, where S is a finite set of arbitrary points in R^n. A PTG differs from a polynomial in that it has an intrinsic quantizer (threshold nonlinearity) for the output state. Thus, a single PTG of unrestricted order r which employs no thresholding is a universal approximator for continuous functions g: X → Y, and the PTG has the natural capability for realizing Boolean functions and complex dichotomies of arbitrary points in R^n.

1.3.2 Points in General Position

Let us calculate the number of dichotomies of m points in R^n (ways of labeling m points into two distinct categories) achieved by a linear separating surface (i.e., an LTG). We call each of these dichotomies a linear dichotomy. For m n-dimensional patterns, the number of linear dichotomies is equal to twice the number of ways in which the m points can be partitioned by an (n − 1)-dimensional hyperplane (for each distinct partition, there are two different classifications). As an example, Figure 1.3.1 shows four points in a two-dimensional space. The lines l_i, i = 1, 2, ..., 7, give all possible linear partitions of these four points; thus, one may enumerate all possible linear dichotomies to be equal to 14. In particular, consider l5. It could be the decision surface implementing either of the following: (1) x1 and x4 in class 1, and x2 and x3 in class 2, or (2) x1 and x4 in class 2, and x2 and x3 in class 1. If three of the points belong to the same line (Figure 1.3.2), there are only six linear partitions.

For m > n, we say that a set of m points is in general position in R^n if and only if no subset of n + 1 points lies on an (n − 1)-dimensional hyperplane. Equivalently, a set of m points in R^n is in general position if every subset of n or fewer points (vectors) is linearly independent. For m ≤ n, a set of m points is in general position if no (m − 2)-dimensional hyperplane contains the set. Thus, the four points of Figure 1.3.1 are in general position, whereas the four points of Figure 1.3.2 are not. Note that general position requires a stringent rank condition on the matrix [x1 x2 ... xm] (the matrix [x1 x2 ... xm] has maximal rank n if at least one n × n submatrix has a nonzero determinant). Note also that, as n → ∞, a set of m random points in R^n is in general position with probability approaching one.

Figure 1.3.1. Points in general position. Figure 1.3.2. Points not in general position.

It can be shown (see Section 1.3.3) that for m points in general position with m ≤ n + 1, the total number of linear dichotomies is 2^m. This means that a hyperplane is not constrained by the requirement of correctly classifying n + 1 or fewer points in general position. The remainder of this section develops some theoretical results needed to answer the questions raised at the beginning of this section.

Proof of Theorem 1. This means that a hyperplane is not constrained by the requirement of correctly classifying n + 1 or fewer points in general position.3) Note that general position requires a stringent rank condition on the matrix [x1 x2 .3.. This theorem is also useful in giving an upper bound on the number of linear dichotomies for points in arbitrary position. Theorem 1.3. xm] has maximal rank n if at least one n × n submatrix has a nonzero determinant).3..3. xm] (the matrix [x1 x2 . which counts the number of linearly separable dichotomies of m points in general position in Rd. Points not in general position.1 Points in general position.3 Function Counting Theorem The so-called Function Counting Theorem. Note that as n . Cover.3.3) that for m points in general position with m n+1.3.2 (Function Counting Theorem..2:
.3. Figure 1. 1.3. the total number of linear dichotomies is 2m.4) The general position requirement on the m points is a necessary and sufficient condition. 1965): The number of linearly separable dichotomies of m points in general position in Euclidean d-space is
(1. a set of m random points in Rn is in general position with probability approaching one.(1. It can be shown (see Section 1.2.
Figure 1.. is essential for estimating the separating capacity of an LTG or a PTG and is considered next.

Proof of Theorem 1.3.2: Consider a set X = {x1, x2, ..., xm} of m points in general position in d-space, and let C(m, d) be the number of linearly separable dichotomies {X+, X−} of X. Here, X+ (X−) is a subset of X consisting of all points which lie above (below) the separating hyperplane. Consider a new point xm+1, such that the set of m + 1 points X' = {x1, x2, ..., xm+1} is in general position. Some of the linear dichotomies of the set X can be achieved by hyperplanes which pass through xm+1; let the number of such dichotomies be D. For each of these D linear dichotomies there will be two new dichotomies of X'. This is because, when the points are in general position, any hyperplane through xm+1 that realizes the dichotomy {X+, X−} can be shifted infinitesimally to allow arbitrary classification of xm+1 without affecting the separation of the dichotomy {X+, X−}. For the remaining C(m, d) − D dichotomies, there will be one new linear dichotomy for each old one. This observation allows us to obtain the recursion relation

C(m + 1, d) = C(m, d) + D

Here, D is the number of linear dichotomies of X that could have had the dividing hyperplane drawn through xm+1. But this number is simply C(m, d − 1), because constraining the hyperplane to go through a particular point xm+1 makes the problem effectively (d − 1)-dimensional. Therefore,

C(m + 1, d) = C(m, d) + C(m, d − 1)

The repeated iteration of the above relation for m, m − 1, m − 2, ..., 1 yields

C(m, d) = Σ_{i=0}^{m−1} C(m − 1, i) C(1, d − i)

from which the theorem follows immediately on noting that C(1, N) = 2 for N ≥ 1 (one point can be linearly separated into one category or the other) and C(1, N) = 0 for N ≤ 0. Note that C(m, d) = 2^m for m ≤ d, and C(m, d) < 2^m for m ≥ d + 1.

Theorem 1.3.2 may now be employed to study the ability of a single LTG to separate m points in general position in R^n. Since the total number of possible dichotomies in R^n is 2^m, the probability of a single n-input LTG to separate m points in general position (assuming equal probability for the 2^m dichotomies) is

P_LS = C(m, n + 1)/2^m    (1.3.5)

Equation (1.3.5) is plotted in Figure 1.3.3. Note that if m < 2(n + 1), then as n approaches infinity, P_LS approaches one; the LTG almost always separates the m points. At m = 2(n + 1), exactly one half of all possible dichotomies of the m points are linearly separable. We refer to m = 2(n + 1) as the statistical capacity of the LTG. It is noted from Figure 1.3.3 that a single LTG is essentially not capable of handling the classification of m points in general position when m > 3(n + 1).

Figure 1.3.3. Probability of linear separability of m points in general position in R^n.
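The capacity behavior can be observed numerically. The sketch below evaluates Equation (1.3.5) for a planar LTG (function names are ours):

```python
from math import comb

def C(m, d):
    """Function Counting Theorem, Equation (1.3.4)."""
    return 2 * sum(comb(m - 1, i) for i in range(d))

def p_ls(m, n):
    """Equation (1.3.5): probability that a random dichotomy of m points
    in general position in R^n is realizable by a single n-input LTG."""
    return C(m, n + 1) / 2 ** m

print(p_ls(3, 2))   # 1.0  : m <= n + 1, every dichotomy is separable
print(p_ls(6, 2))   # 0.5  : m = 2(n + 1), the statistical capacity
print(p_ls(12, 2))  # ~0.03: well beyond capacity, separability is rare
```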

1.3.4 Separability in Φ-Space

A PTG(r) with inputs x ∈ R^n may be viewed as an LTG with d − 1 preprocessed inputs (see Figure 1.3.4). We refer to the mapping φ(x) from the input space R^n to the φ-space R^(d−1) as the φ-mapping. A dichotomy of m points in R^n is said to be "φ-separable" by a PTG(r) if there exists a set of d − 1 PTG weights and a threshold which correctly classify all m points; equivalently, if there exists a (d − 2)-dimensional hyperplane in φ-space which correctly classifies all m points. The inverse image of this hyperplane in the input space R^n defines a polynomial separating surface of order r, which will be referred to as the φ-surface. Similarly, we say X = {x1, x2, ..., xm} is in φ-general position if no d points lie on the same φ-surface in the input space.

Figure 1.3.4. Block diagram of a PTG represented as the cascade of a fixed preprocessing layer and a single LTG.

Note that the Function Counting Theorem still holds true if the set of m points is in φ-general position in R^n or, equivalently, if the set of mapped points φ(x1), ..., φ(xm) is in general position in the (d − 1)-dimensional φ-space. Therefore, the probability of a single PTG with d degrees of freedom (including threshold) to separate m points in φ-general position is given by

P = C(m, d)/2^m    (1.3.6)

Since, for an arbitrary set of m points in R^n, the number of φ-separable dichotomies is less than or equal to the number of φ-separable dichotomies of m points in φ-general position, the number of Boolean functions, Bn(r), of n variables which can be realized by an n-input PTG(r) satisfies

Bn(r) ≤ C(2^n, d)    (1.3.7)

which is exactly the inequality of Equation (1.2.1). Note also that for m = 2^n, which is the largest possible value for m, d takes on its largest possible value of 2^n; therefore, for r = n, any Boolean function of n variables is realizable by a single PTG(r ≤ n), which proves Theorem 1.2.1. More generally, a PTG with d ≥ m is sufficient for guaranteed realizability of m arbitrary points in {0, 1}^n.

1.4 Minimal PTG Realization of Arbitrary Switching Functions

Theorem 1.4.1 (Kashyap, 1966): Any n-variable switching (Boolean) function of m points, with m ≤ 2^n, is realizable using a PTG(r) with m or less terms (degrees of freedom).

Proof of Theorem 1.4.1: Let g(x) represent the weighted-sum signal (the signal just before thresholding) for a PTG with m degrees of freedom (here, the threshold T will be set to zero). We take g of the form

g(x) = Σ_{i=0}^{m−1} w_i φ_i(x)    (1.4.1)

where

φ_i(x) = Π_{j=1}^{n} x_j^(x_j(i)), for i = 0, 1, ..., m − 1    (1.4.2)

and x_j(i) denotes the jth component of the vertex x(i). Note that by convention 0^0 = 1. For example, if n = 4, x = [x1 x2 x3 x4]^T, and x(i) = [1 1 0 1]^T, Equation (1.4.2) gives φ_i(x) = x1 x2 x4. To find the separating surface g(x) = 0 that separates the m points into the two designated classes (0 and 1), we first rearrange the m points (vertices) of the given switching function as x(0), x(1), ..., x(m − 1), so that the number of 1's in x(i) is larger than or equal to those in x(j) if i > j. Requiring g(x) to be positive on the class-1 vertices and negative on the class-0 vertices, the function g(x) must satisfy:

g(x(i)) > 0 if x(i) belongs to class 1, and g(x(i)) < 0 if x(i) belongs to class 0, for i = 0, 1, ..., m − 1    (1.4.3)

Equation (1.4.3) may be rewritten in the following compact form:

Aw = b > 0    (1.4.5)

where b is an arbitrary positive margin vector (the inequalities associated with patterns of class 0 having been normalized by multiplying by −1), w = [w0 w1 w2 ... w_(m−1)]^T is the PTG weight vector, and A = TB. Here, B is defined as the m × m matrix whose ijth component is given by

B_ij = φ_j(x(i))    (1.4.6)

and T is an m × m diagonal matrix given by

T_ii = +1 if x(i) belongs to class 1, and T_ii = −1 if x(i) belongs to class 0    (1.4.7)

The A matrix is a square matrix and its singularity depends on the B matrix. Since we have assumed that 0^0 = 1, Equation (1.4.6) gives B_ii = 1. Moreover, since x(i) cannot contain 1's in all the positions where x(j) has 1's when i < j (recall the ordering of the vertices), B_ij = 0 for i < j. Thus, B is a lower triangular and nonsingular matrix, and hence A = TB is a triangular nonsingular matrix. Accordingly, the solution vector w exists and can be easily calculated by forward substitution (e.g., see Gerald, 1978) in Equation (1.4.5). This completes the proof of the theorem.

Example 1.4.1: Consider the partially specified nonthreshold Boolean function in Table 1.4.1. The patterns in Table 1.4.1 are shown sorted in ascending order in terms of the number of 1's in each pattern.

Table 1.4.1. A partially specified Boolean function for Example 1.4.1.

Ordered patterns    x1    x2    x3    Class
x(0)                0     0     0     1
x(1)                1     0     0     0
x(2)                0     1     0     0
x(3)                1     1     0     1
x(4)                0     1     1     1
x(5)                1     1     1     1

From Equation (1.4.6), the B matrix is computed as

B =
[ 1  0  0  0  0  0 ]
[ 1  1  0  0  0  0 ]
[ 1  0  1  0  0  0 ]
[ 1  1  1  1  0  0 ]
[ 1  0  1  0  1  0 ]
[ 1  1  1  1  1  1 ]

and from Equation (1.4.7), the T matrix is given by

T = diag(1, −1, −1, 1, 1, 1)

Thus, A = TB is given as

A =
[  1   0   0   0   0   0 ]
[ −1  −1   0   0   0   0 ]
[ −1   0  −1   0   0   0 ]
[  1   1   1   1   0   0 ]
[  1   0   1   0   1   0 ]
[  1   1   1   1   1   1 ]

Now, using forward substitution in Equation (1.4.5) with b = [1 1 1 1 1 1]^T, we arrive at the solution:

w = [1 −2 −2 4 2 −2]^T

Substituting this w in Equation (1.4.1) allows us to write the equation for the separating surface (φ-surface) realized by the above PTG as

g(x) = 1 − 2x1 − 2x2 + 4x1x2 + 2x2x3 − 2x1x2x3 = 0
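The construction of Example 1.4.1 can be reproduced with a few lines of NumPy (a sketch; the variable and function names are ours):

```python
import numpy as np

# Vertices of Table 1.4.1, sorted by ascending number of 1's, with classes.
X = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0],
              [1, 1, 0], [0, 1, 1], [1, 1, 1]])
cls = np.array([1, 0, 0, 1, 1, 1])
m = len(X)

def phi(i, x):
    """phi_i(x) of Equation (1.4.2): the product of the components of x
    selected by the 1's of vertex x(i), with the convention 0**0 = 1."""
    return int(all(x[j] == 1 for j in range(len(x)) if X[i][j] == 1))

B = np.array([[phi(j, X[i]) for j in range(m)] for i in range(m)])  # (1.4.6)
T = np.diag(np.where(cls == 1, 1, -1))                              # (1.4.7)
A = T @ B
w = np.linalg.solve(A, np.ones(m))   # Aw = b with margin b = [1 ... 1]^T

print(w)                                     # [ 1. -2. -2.  4.  2. -2.]
print(np.all(((B @ w) > 0) == (cls == 1)))   # True: all vertices classified
```

Since A is lower triangular, `np.linalg.solve` is doing the work of the forward substitution described in the proof.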

In general, the margin vector b may be chosen in such a way that some of the components of the weight vector w are forced to zero, thus resulting in a simpler realization of the PTG. Experimental evidence reported by Kashyap (1966) gives rough ranges for the number of nonzero components of the vector w achievable in this way.

1.5 Ambiguity and Generalization

Consider the problem of generalizing from a training set with respect to a given admissible family of decision surfaces (e.g., the family of surfaces that can be implemented by a PTG). Consider the training set {x1, x2, ..., xm} and a class of decision surfaces [e.g., φ-surfaces associated with a PTG(r)] which separate this set. During the training phase, a decision (separating) surface from the admissible class is synthesized which correctly assigns members of the training set to the desired categories. Generalization is the ability of a network (here, a PTG) to correctly classify new patterns: a new pattern will be assigned to the category lying on the same side of the decision surface. The classification of a new pattern y is ambiguous, relative to the given class of decision surfaces and with respect to the training set, if there exist two decision surfaces that both correctly classify the training set but yield different classifications of the new pattern. In Figure 1.5.1, points y1 and y3 are unambiguous, but point y2 is ambiguous for the class of linear decision surfaces.

It is generally known that if a large number of training patterns are used, the decision surface will be sufficiently constrained to yield a unique response to a new pattern. In the context of a single threshold gate, it can be shown (Cover, 1965) that the number of training patterns must exceed the statistical capacity of the threshold gate before ambiguity is eliminated. Next, we show that the number of training patterns must exceed the statistical capacity of the PTG before unique response becomes probable.

Theorem 1.5.1 (Cover, 1965): Let X ∪ {y} = {x1, x2, ..., xm, y} be in φ-general position. If each of the φ-separable dichotomies of X has equal probability, then the probability that y is ambiguous with respect to a random φ-separable dichotomy of X is

A(m, d) = C(m, d − 1)/C(m, d)    (1.5.1)

Proof of Theorem 1.5.1: The point y is ambiguous with respect to a dichotomy {X+, X−} of X = {x1, x2, ..., xm} if and only if there exists a φ-surface containing y which separates {X+, X−}. This is because, when X is in φ-general position, any φ-surface through y that realizes the dichotomy {X+, X−} can be shifted infinitesimally to allow arbitrary classification of y without affecting the separation of {X+, X−}. Equivalently, in φ-space, the point φ(y) is ambiguous with respect to the linear dichotomy {Z+, Z−} of Z = {φ(x1), φ(x2), ..., φ(xm)} if and only if {Z+, Z−} can be separated by a (d − 2)-dimensional hyperplane constrained to pass through φ(y). Constraining the hyperplane to pass through a point effectively reduces the dimension of the space by 1. Now, since φ-general position of X ∪ {y} implies that the set of points Z ∪ {φ(y)} = {φ(x1), φ(x2), ..., φ(xm), φ(y)} is in general position in R^(d−1), Theorem 1.3.2 implies that the point φ(y) is ambiguous with respect to D = C(m, d − 1) linear dichotomies {Z+, Z−}. Each linear dichotomy {Z+, Z−} corresponds to a unique φ-separable dichotomy {X+, X−} in the input space. Thus, by noting the one-to-one correspondence between the linearly separable dichotomies in φ-space and the φ-separable dichotomies in the input space, we establish that y is ambiguous with respect to C(m, d − 1) of the C(m, d) φ-separable dichotomies of X, from which Equation (1.5.1) follows when each φ-separable dichotomy has equal probability.

Figure 1.5.1. Ambiguous generalization. The points y1 and y3 are uniquely classified regardless of which of the two linear decision surfaces shown is used. The point y2 has an ambiguous classification.

Example 1.5.1: Let us apply the above theorem to the case m = 10 and d = 3 (i.e., lines in the plane, d including the bias input). C(10, 3) = 92 dichotomies of the m points are separable by the class of all lines in the plane. A new point y is ambiguous with respect to C(10, 2) = 20 dichotomies of the m points relative to the class of all lines in the plane. Thus, a new point y is ambiguous with respect to a random, linearly separable dichotomy of the m points with probability A(10, 3) = 20/92 ≈ 0.22.

Now, assume that we choose m points (patterns) from a given distribution such that these points are in general position, and that they are classified independently, with equal probability, into one of two categories. Then, the probability of generalization error on a new pattern similarly chosen, conditioned on the separability of the entire set, is equal to one-half of A(m, d), since an ambiguous pattern is misclassified with probability one-half. The behavior of A(m, d) for large d is given by (Cover, 1965)

A(m, d) → A*(λ), with A*(λ) = 1 for λ ≤ 2 and A*(λ) = 1/(λ − 1) for λ > 2    (1.5.2)

where λ = m/d. The function A*(λ) is plotted in Figure 1.5.2. To eliminate ambiguity, we need λ ≥ 2, that is, m ≥ 2d; ambiguous response is reduced only when m is larger than the statistical capacity of the PTG (or, more generally, of the separating surface used). Thus, to eliminate ambiguity, a large number of training samples is required. If we define 0 ≤ ε ≤ 1 as the probability of generalization error, then one can employ Equation (1.5.2) in order to bound the probability of generalization error below ε and determine the number of samples required to achieve a desired ε. It is interesting to note that, according to Equation (1.5.2), the probability of generalization error for a single threshold gate with d degrees of freedom approaches zero as fast as 1/λ in the limit of m >> d.

Figure 1.5.2. Asymptotic probability of ambiguous response.
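The numbers in Example 1.5.1 follow directly from Equations (1.3.4) and (1.5.1); a quick Python check (function names are ours):

```python
from math import comb

def C(m, d):
    """Function Counting Theorem, Equation (1.3.4)."""
    return 2 * sum(comb(m - 1, i) for i in range(d))

def ambiguity(m, d):
    """Equation (1.5.1): probability that a new point is ambiguous with
    respect to a random separable dichotomy (d degrees of freedom)."""
    return C(m, d - 1) / C(m, d)

print(C(10, 3))                       # 92 separable dichotomies
print(C(10, 2))                       # 20 of them leave y ambiguous
print(round(ambiguity(10, 3), 2))     # 0.22
print(round(ambiguity(200, 10), 3))   # small: lambda = 20, far past capacity
```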

1.6 Extreme Points

For any linear dichotomy {X+, X−} of a set of points, there exists a minimal sufficient subset of extreme points such that any hyperplane correctly separating this subset must separate the entire set correctly. From this definition it follows that a point is an extreme point of the linear dichotomy {X+, X−} if and only if it is ambiguous with respect to the dichotomy of the remaining m − 1 points obtained by deleting it from {X+, X−}. Hence, each of the m points is ambiguous with respect to precisely C(m − 1, d − 1) linear dichotomies of the remaining m − 1 points. Each of these dichotomies is the restriction of two linearly separable dichotomies of the original m points, the two dichotomies which differ only in the classification of the remaining point. Therefore, each of the m points is an extreme point with respect to 2C(m − 1, d − 1) linearly separable dichotomies of the m points. Since there are a total of C(m, d) linearly separable dichotomies, the probability, Pe, of a point to be an extreme point with respect to a randomly generated dichotomy is (assuming equiprobable dichotomies)

Pe = 2 C(m − 1, d − 1)/C(m, d)    (1.6.1)

Since these probabilities are equal for all m points, the expected number of extreme points, E, in a set of m points in general position in d-space is equal to the sum of the m probabilities that each point is an extreme point. Thus, the expected number of extreme points can be written as

E = 2m C(m − 1, d − 1)/C(m, d)    (1.6.2)

Figure 1.6.1 plots the normalized expected number of extreme points (Cover, 1965) and illustrates the effect of the size of the training set m on this quantity; in the limit of large m, E approaches 2(d − 1), roughly twice the number of degrees of freedom.

Figure 1.6.1. Normalized expected number of extreme points for d = 5 and 20, as a function of λ = m/d.
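Equation (1.6.2) and its large-m behavior can be checked numerically (a Python sketch; the helper names are ours):

```python
from math import comb

def C(m, d):
    """Function Counting Theorem, Equation (1.3.4)."""
    return 2 * sum(comb(m - 1, i) for i in range(d))

def expected_extreme(m, d):
    """Equation (1.6.2): expected number of extreme points of m points in
    general position under a random separable dichotomy."""
    return 2 * m * C(m - 1, d - 1) / C(m, d)

# As m grows, the expectation approaches roughly twice the number of
# degrees of freedom (2(d - 1) exactly, in this parameterization):
for m in (20, 100, 1000):
    print(m, round(expected_extreme(m, 5), 2))
```

For d = 5 the printed values climb toward 8, the asymptote 2(d − 1).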

Finally, the limit of Equation (1.6.2) for large m, which agrees with the simulation results in Figure 1.6.1 for d → ∞, implies, roughly, that for a random separable set of m patterns, the average number of patterns needed for complete characterization of a given training set is only about twice the number of degrees of freedom d of a threshold gate, assuming the separability of this training set by the threshold gate. The implication for pattern recognition is that the essential information in an infinite training set can be expected to be loaded into a network of finite capacity (e.g., a PTG of finite order or a finite network of simple gates).

1.7 Summary

This chapter introduces the LTG as the basic computational unit in a binary state neural network, and analyzes its computational capabilities in realizing binary mappings. The notion of a separating surface is introduced, and it is used to describe the operation of a threshold gate as a two-class classifier. It is found that a single LTG is characterized by a linear separating surface (hyperplane); hence, a single LTG can only realize threshold (linearly separable) functions. The LTG is then extended to the more powerful PTG. A single PTG is capable of realizing a highly flexible nonlinear (polynomial) separating surface; in particular, a PTG of order r = n is capable of realizing any Boolean function of n inputs. The flexibility of a PTG's separating surface is due to a dimensionality-expanding polynomial mapping (φ-mapping) which is realized by a preprocessing layer intrinsic to the PTG. The impressive computational power of a PTG comes, though, at the expense of increased implementation complexity.

A fundamental theorem, known as the Function Counting Theorem, is proved and is used to derive the following important properties of threshold gates, assuming data in general position in R^n. First, the statistical capacity of a single LTG is, roughly, equal to twice the number of its degrees of freedom (weights); this result is also true for a PTG. Second, a threshold gate with d degrees of freedom trained to correctly classify m patterns will, on the average, respond ambiguously to a new input pattern as long as m ≤ 2d. On the other hand, in the limit as m/d → ∞, it is found that the probability of generalization error approaches zero asymptotically as d/m. Finally, for a random separable set of m patterns, the average number of patterns (information) which need to be stored for a complete characterization of the whole set is only twice the number of degrees of freedom of the class of separating surfaces being used.

Problems:

1.1.1 Verify that the patterns in Figure 1.1.5 are admissible threshold patterns.

1.1.2 Derive the formulas given in Table 1.1.2 for the degrees of freedom for real- and binary-input QTG's.

1.1.3 Prove that a PTG(r) with n binary inputs has Σ_{i=1}^{r} C(n, i) degrees of freedom, not including the threshold.

1.1.4 Derive the closed-form expression for the number of degrees of freedom for a PTG(r) given in Equation (1.1.8).

Problems:

1.1.1 Show that … . Prove algebraically that … . (Hint: use … .)

1.1.2
a. Show that … .
b. Using the recursion relation in part (a), show that … .
c. Also, show that … for a = 2 and 3.

1.1.3 Verify the inequality in Equation (1.…) by showing that … .

1.1.4 Plot the function … for n ≥ 2.

1.2.1 Show that the four patterns of the XNOR function of Example 1.… are properly separated by g(x1, x2, x3) = 0, where x3 = x1x2. Plot g(x1, x2, x3) = 0.

1.3.1 Prove by induction the Binomial Theorem

(a + b)^n = Σ_{i=0}^{n} C(n, i) a^i b^(n−i)

where a and b are real numbers and n is a positive integer. Use the Binomial Theorem to prove that … .

1.3.2 Given m arbitrary points in Rn space, the number of dichotomies of the m points which are realizable with a single LTG is bounded from above by … . Show that for m ≥ 3n + 2 and n ≥ 2, this number is bounded from above by …, which represents a tighter upper bound than that of Equation (1.…). Note that this bound reduces to … for m = 2n. See Cover (1965) for hints.

*1.3.3 Assume n < m and d = … . Is it true that m points in general position in Rn are mapped by the φ-mapping of a PTG(r) into points in general position in Rd−1 space (i.e., with d − 1 = n)? Why? Can this PTG map m arbitrary points in Rn into m points in φ-general position? Explain.

1.3.4 Consider a mapping f : R → R defined as the set of input/output pairs {xi, yi}, i = 1, 2, ..., m. Assume that xi ≠ xj for all i ≠ j. Show that this mapping can be exactly realized by a polynomial of order r = m − 1 (this is referred to as "strict" interpolation). Hint: the determinant … is Vandermonde's determinant, which is nonzero if zi ≠ zj for all i ≠ j = 1, 2, ..., m.

1.3.5 Consider the mapping from R1 to {0, 1} defined by the input/output pairs (0, …), (1, …), (2, …), (3, …), and (4, …). Use the result of Problem 1.3.4 to synthesize a polynomial g(x) of minimal order which realizes this mapping. Plot g(x) to verify your solution. Next, assume that a PTG(r) is used to realize the same mapping. Is it possible for the order r of this PTG to be smaller than the order of the polynomial g(x)? If the answer is yes, then synthesize the appropriate weights for this PTG (assume a zero threshold) and plot the weighted sum (i.e., the PTG output without thresholding) versus x.

*1.3.6 Derive Equation (1.3.4) from the recursion relation … .

1.3.7 Let y(x) = a + bx, where a, b ∈ R, be an approximation of the function g(x) = sin x over [0, π]. Find the parameters a and b which minimize … . Compare graphically the functions g(x), its approximation y(x), and its power series approximation g(x) ≈ x − … .

1.3.8 Find the polynomial y(x) = ax + bx^3 which approximates the function g(x) = sin x over [0, π] by minimizing … . Plot g(x) and y(x) to verify your solution.

1.4.1 Use the K-map technique to show that the Boolean function in Table 1.… is not a threshold function.

1.4.2 Use the method of Section 1.4 to synthesize a PTG for realizing the Boolean function … . Use a margin vector b which results in the minimal number of nonzero PTG weights.

1.4.3 Find the solution vector w in Example 1.… if b = [1 1 1 1 1 3]^T. Does a b > 0 exist which leads to a solution vector w with four or fewer nonzero components?

1.5.1 Derive Equation (1.5.1).

1.5.2 Plot the probability of ambiguous response, A(m, d), given in Equation (1.5.1), versus … for d = 2, 10, and 20.

†1.5.3 Derive Equation (1.…), adapted for the LTG case.

1.6.1 Derive Equation (1.…), starting from Equation (1.…).
2. COMPUTATIONAL CAPABILITIES OF ARTIFICIAL NEURAL NETWORKS

2.0 Introduction

In the previous chapter, the computational capabilities of single LTG's and PTG's were investigated. In this chapter, networks of LTG's are considered and their mapping capabilities are investigated. In particular, some basic LTG network architectures are defined and bounds on the number of arbitrary functions which they can realize are derived. The function approximation capabilities of networks of units (artificial neurons) with continuous nonlinear activation functions are also investigated. Here, some important theoretical results on the approximation of arbitrary multivariate continuous functions by feedforward multilayer neural networks are presented. This chapter concludes with a brief section on neural network computational complexity and the efficiency of neural network hardware implementation. In the remainder of this book, the terms artificial neural network, neural network, network, and net will be used interchangeably, unless noted otherwise.

2.1 Some Preliminary Results on Neural Network Mapping Capabilities

In this section, the realization of both Boolean and multivariate functions of the form f : Rn → {0, 1} is considered. Before proceeding any further, note that the n-input PTG(r) of Chapter One can be considered as a form of a neural network with a "fixed" preprocessing (hidden) layer feeding into a single LTG in its output layer. The theorems of Chapter One (extended to multivariate functions) establish the "universal" realization capability of this architecture for continuous functions of the form … (assuming that the output unit has a linear activation function) and for Boolean functions of the form …, respectively. For continuous functions, universality means that the approximation of an arbitrary continuous function can be made to any degree of accuracy; note, though, that the order r of the PTG may become very large. On the other hand, for Boolean functions, universality means that the realization is exact, and r ≤ n is sufficient, as was shown in Figure 1.… . The following sections consider other more interesting neural net architectures and present important results on their computational capabilities.

2.1.1 Network Realization of Boolean Functions

A well known result from switching theory (e.g., see Kohavi, 1978) is that any switching function (Boolean function) of n variables can be realized using a two layer network with at most 2^(n−1) AND gates in the first (hidden) layer and a single OR gate in the second (output) layer, assuming that the inputs and their complements are available as inputs. This network is known as the AND-OR network and is shown in Figure 2.1.1. The parity function is a Boolean function which requires the largest network, i.e., 2^(n−1) AND gates. For the case n = 4, the K-map of the parity function is shown in Figure 2.1.2.

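The worst-case role of parity in the AND-OR construction can be checked by brute force; the helper `parity` and the choice n = 4 below are ours. Because two adjacent K-map cells always have opposite parity, no two minterms can be merged, so each true pattern costs one AND gate:

```python
from itertools import product

def parity(bits):
    # n-input parity: 1 iff an odd number of inputs are 1
    return sum(bits) % 2

n = 4
# Patterns with output 1 are the minterms of the AND-OR realization; for
# parity no two of them can be merged (adjacent patterns differ in one bit
# and therefore have opposite parity), so each needs its own AND gate.
minterms = [bits for bits in product([0, 1], repeat=n) if parity(bits) == 1]
print(len(minterms) == 2 ** (n - 1))   # True: 8 AND gates for n = 4
```

The same count holds for any n, which is why parity is the extreme case of the 2^(n−1) gate bound.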
though.. dissertations have been written on threshold logic networks and their synthesis (refer to the introduction of Chapter One for references).D.1 with LTG's and still retain the universal logic property of the AND-OR network [The universality of a properly interconnected network of simple threshold gates was first noted in a classic paper by McCulloch and Pitts (1943)]. Given the same switching function. is an example where the required number of AND gates is equal to the number of threshold gates in both nets. as for the AND-OR net. one may replace the hidden layer AND gates in Figure 2.
Figure 2. A recommended brief introduction to this subject appears in the book by Kohavi (1978). It was shown in Chapter One that a single LTG is a more powerful logic gate than a single AND (or OR) gate.Figure 2. Thus. However. LTG's are sufficient.1) that hidden LTG's would still be necessary for realizing arbitrary Boolean functions in the limit of large n.2. Another interesting result is that a two hidden layer net with
LTG's is necessary for realizing arbitrary Boolean functions (see Section 2.2.1.1.e.
The synthesis of threshold-OR and other LTG nets for the realization of arbitrary switching functions was extensively studied in the sixties and early seventies. it can be shown (see Section 2. Here. we give only a simple illustrative example employing the K-map technique (the K-
. In fact.1.2 for details). 2n−1 gates. The resulting net with the LTG's does not require the complements of the input variables as inputs.2. The parity function. AND-OR network structure. Note that a fewer number of hidden LTG's than AND gates are needed if the LTG net employs an LTG for the output unit. This LTG net will be referred to as a threshold-OR net. K-map for the 4-input parity function.1. the threshold-OR net may require a smaller number of gates compared to an AND-OR net. i. Here. several books and Ph.

1.1.3. Proof of Theorem 2. Theorem 2.3(b) shows one possible decomposition of f(x) into a minimal number of threshold patterns (single LTG realizable patterns) f1(x) and f2(x).1.1.map was introduced in Section 1. Here. This K-map-based synthesis technique may be extended to multiple output nets (see Problem 2.2). First.1: This theorem can be viewed as a corollary of Theorem 1. 1}n. and finally a two hidden layer feedforward network. and (c) Threshold-OR realization of f(x).1.1 can be easily extended to the case of multiple output switching functions of the form . Threshold-OR realization of a 3-input switching function f(x): (a) K-map for f(x). Theorem 2. a two layer net of LTG's employing m LTG's in its hidden layer.1.2 Bounds on the Number of Functions Realizable by a Feedforward Network of LTG's Consider the following question: How many functions f : Rn {0. 1} defined on m arbitrary points in Rn are realizable by a layered net of k LTG's? In the following.1: Consider a partially specified switching function f(x) defined on a set of m arbitrary points in {0.1.
. All three nets are assumed to be fully interconnected.1. The following theorem establishes an upper bound on the size of a feedforward net of LTG's for realizing arbitrary.3(a). is sufficient to realize f(x). an arbitrarily interconnected net with no feedback. the worst case scenario is to duplicate the above network L times where L is the number of output functions! This leads to a sufficient LTG net realization having mL LTG's in its hidden layer and L LTG's in its output layer. feeding into a single output LTG.1. 2. (b) Kmap decomposition of f(x) into two threshold functions. this question is answered for three different network architectures.1 of Chapter One (see Problem 2.1) to realize the Boolean function f(x) in Figure 2.2).4.1.3(c). Figure 2. partially specified switching functions. The corresponding architecture for the threshold-OR net realization is depicted in Figure 2.
(a) (b) (c) Figure 2. Then. second. a single hidden layer feedforward network. with m 2n. but is only practical for .1.
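The sufficiency construction behind Theorem 2.1.1 is easy to sketch numerically. The partially specified 3-input function below is a hypothetical example of ours, and we allocate one hidden LTG per point with f(x) = 1 — a special case of the theorem's m-unit construction that already suffices here:

```python
import numpy as np

def ltg(w, theta, x):
    # linear threshold gate: fires iff w.x >= theta
    return int(np.dot(w, x) >= theta)

# Hypothetical partially specified 3-input switching function (4 of 8 points)
points = {(0, 0, 1): 1, (1, 0, 1): 1, (1, 1, 0): 0, (0, 1, 1): 0}

# Hidden LTG matched to a binary point p: weight +1 where p_i = 1 and -1
# where p_i = 0, with threshold equal to the number of ones in p, so the
# gate fires on p and on no other binary point.
hidden = [([1 if b else -1 for b in p], sum(p))
          for p, y in points.items() if y == 1]

def net(x):
    h = [ltg(np.array(w), th, np.array(x)) for w, th in hidden]
    return int(sum(h) >= 1)   # the output LTG acts as an OR gate

print(all(net(p) == y for p, y in points.items()))   # True
```

Each hidden unit acts as a "grandmother cell" for one specified point, which is exactly why m hidden LTG's are always enough.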

We can show that for m arbitrary points in Rn, a two layer feedforward net with k LTG's in its hidden layer, all feeding into a single LTG in the output layer as shown in Figure 2.1.4, can realize at most F1(n, m, k) functions, with F1 given by (Winder, 1963)

(2.1.1)

Proof: Each of the k hidden LTG's can realize at most (Winder, 1963)

…

functions (dichotomies), and the final LTG can realize at most … functions. Therefore, the network can realize no more than … functions or, recalling the tighter bound of Problem 1.3.… (valid for m ≥ 3n + 2 and n ≥ 2), no more than … functions. This last result can be simplified to …, which is less than … .

Figure 2.1.4. A two layer feedforward net with k hidden LTG's feeding into a single output LTG.

Another interesting type of network is the generally fully interconnected net of k LTG's with no feedback, shown in Figure 2.1.5. This type of network can realize at most G(n, m, k) functions, with G given by

(2.1.2)

Proof: Since there is no feedback, the gates (LTG's) can be ordered as first gate, second gate, ..., kth gate, so that the jth gate receives inputs only from the n independent variables and the j − 1 gates labeled 1, 2, ..., j − 1 (see Figure 2.1.5). The jth gate can then realize at most … functions, so the total network can realize at most … functions.

Figure 2.1.5. Generally fully interconnected net of k LTG's with no feedback.

Similarly, it can be shown that the feedforward net shown in Figure 2.1.6, with two hidden layers of k/2 LTG's each (k even) and a single output LTG, is capable of realizing at most

(2.1.3) functions.
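The counting argument for the two layer net can be evaluated numerically. The sketch below uses the single-LTG dichotomy bound 2·Σ_{i=0}^{n} C(m−1, i) for each gate and multiplies the per-gate counts — our reading of the proof of the F1 bound — then reports the smallest hidden layer size k that the bound no longer rules out:

```python
from math import comb

def ltg_bound(m, n):
    # Upper bound on the dichotomies of m points realizable by one n-input
    # LTG: 2 * sum_{i=0}^{n} comb(m - 1, i)
    return 2 * sum(comb(m - 1, i) for i in range(min(n, m - 1) + 1))

def F1(n, m, k):
    # k hidden n-input LTG's feeding one k-input output LTG
    return ltg_bound(m, n) ** k * ltg_bound(m, k)

def k_necessary(n, m):
    # smallest k for which F1 reaches the total count 2**m of binary
    # functions of m points (a necessary, not a sufficient, condition)
    k = 1
    while F1(n, m, k) < 2 ** m:
        k += 1
    return k

k = k_necessary(10, 200)
print(F1(10, 200, k) >= 2 ** 200 > F1(10, 200, k - 1))   # True
```

This is exactly the style of argument used in Section 2.2 to turn the counting bounds into necessary network sizes.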

Figure 2.1.6. A three layer LTG net having k/2 units in each of the two hidden layers and a single output LTG.

Similar bounds can be derived for points in general position by simply replacing the bound … on the number of realizable functions by the tighter bound … for input units (units receiving direct connections from all components of the input vector), since the statistical capacity of a single d-input LTG is 2(d + 1) points, for large d (refer to Section 1.3.3).
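The capacity figure of 2(d + 1) points can be illustrated with the counting function from Chapter One (Cover, 1965): at m = 2(n + 1), exactly half of all dichotomies of m points in general position are realizable by a single LTG.

```python
from math import comb

def cover_count(m, n):
    # Dichotomies of m points in general position in R^n realizable by a
    # single LTG (n weights plus a threshold):
    #   C(m, n) = 2 * sum_{i=0}^{n} comb(m - 1, i)
    return 2 * sum(comb(m - 1, i) for i in range(n + 1))

n = 10
m = 2 * (n + 1)   # the statistical capacity of an n-input LTG
print(cover_count(m, n) / 2 ** m)   # 0.5
```

Below this m the realizable fraction approaches one, and above it the fraction collapses toward zero — the sharp threshold behind the capacity result.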

**2.2 Necessary Bounds on the Size of LTG Networks**

In this section, we derive necessary lower bounds on the number of gates, k, in a network of LTG's for realizing any function f : Rn → {0, 1} of m points arbitrarily chosen in Rn. For the case of functions defined on m points in general position, similar bounds are also derived. Again, we consider the three network architectures of Section 2.1.2.

2.2.1 Two Layer Feedforward Networks

Case 1: m arbitrary points

Consider the function f : Rn → {0, 1} defined on m arbitrary points in Rn. Here, we show that a two layer feedforward net with fewer than … LTG's in its hidden layer is not sufficient to realize such an arbitrary function (in the limit of large n). Recalling Equation (2.1.1) and requiring that F1(n, m, k) be larger than or equal to the total number of possible binary functions of m points, we get (assuming m ≥ 3n + 2 and n ≥ 2)

(2.2.1)

We may solve for k as a function of m and n. Taking the logarithm (base 2) of Equation (2.2.1) and rearranging terms, we get

(2.2.2)

By employing Stirling's approximation …, we can write

…

where a large n is assumed. Similarly, we employ the approximation … . Now, Equation (2.2.2) reduces to

(2.2.3)

with … and … . Equation (2.2.3) gives a necessary condition on the size of the net for realizing any function of m arbitrary n-dimensional points. It is interesting to note the usefulness of this bound by comparing it to the sufficient net constructed by Baum (1988), which requires … hidden LTG's. As a special case, we may consider an arbitrary completely specified (m = 2^n) Boolean function … . Here, Equation (2.2.3) with m = 2^n gives

(2.2.4)

which means that as n → ∞, an infinitely large net is required. A limiting case of Equation (2.2.3) is for …, which leads to the bound

(2.2.5)

Case 2: m points in general position

Recalling Equation (1.3.5) and the discussion in Section 1.3.3 on the statistical capacity of a single LTG, one can determine an upper bound, FGP, on the number of possible functions (dichotomies) of m points in general position in Rn, for a two layer feedforward net with k hidden units, as

(2.2.6)

for m ≥ 2n and n → ∞. The first term in the right hand side of Equation (2.2.6) represents the total number of functions realizable by the k hidden layer LTG's, where each LTG is capable of realizing 2^(2(n+1)) functions. On the other hand, the term … in Equation (2.2.6) represents an upper bound on the number of functions realized by the output LTG. It assumes that the hidden layer transforms the m points in general position in Rn to points which are not necessarily in general position in the hidden space, Rk. For the net to be able to realize any one of the above functions, its hidden layer size k must satisfy

(2.2.7)

or

(2.2.8)

This bound is tight: for m close to 2n it gives k = 1, which is the optimal net (recall that any dichotomy of m points in general position in Rn has a probability approaching one of being realized by a single LTG as long as m < 2n, for large n). Note that when there is only a single LTG in the hidden layer, the output LTG becomes redundant.
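The dominant part of this bound can be sketched from the single fact quoted above, that each hidden LTG realizes at most 2^(2(n+1)) functions of points in general position; the helper name and the neglect of the output-LTG factor are our simplifications:

```python
from math import ceil

def k_lower_bound(m, n):
    # Counting argument: (2**(2*(n + 1)))**k >= 2**m  =>  k >= m / (2*(n + 1)).
    # The subdominant contribution of the output LTG is ignored here.
    return ceil(m / (2 * (n + 1)))

n = 10
print(k_lower_bound(2 * (n + 1), n))   # 1: near the capacity a single LTG is enough
print(k_lower_bound(10000, n))         # 455: at least 455 hidden units for m = 10000
```

As the text notes, the bound degenerates gracefully to the optimal single-LTG net when m is close to the capacity.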

Equation (2.2.8) agrees with the early experimental results reported by Widrow and Angell (1962). Also, it is interesting to note that in the limit as …, Equation (2.2.8) gives a lower bound on k which is equal to one half the number of hidden LTG's of the optimal net reported by Baum (1988). It is important to note that the bounds on k derived using the above approach are relatively tighter for the case of m points in general position than for the case of m points in arbitrary position. This is because we are using the actual statistical capacity of an LTG for the general position case, as opposed to the upper bound on the number of dichotomies used for the arbitrary position case. This observation is also valid for the bounds derived in the remainder of this section.

2.2.2 Three Layer Feedforward Networks

Case 1: m arbitrary points

Consider a two hidden layer net having k/2 (k is even) LTG's in each of its hidden layers and a single LTG in its output layer. From Equation (2.1.3), the net is capable of realizing at most

arbitrary functions. Following earlier derivations, we can bound the necessary net size by

(2.2.9) for m ≥ 3n and n → ∞. Next, taking the logarithm of both sides of Equation (2.2.9) and employing Stirling's approximation gives

(2.2.10) which can be solved for k as

(2.2.11) Assuming that m >> n2, Equation (2.2.11) can be approximated as

(2.2.12)

For the special case of arbitrary Boolean functions with m = 2^n, Equation (2.2.12) gives a lower bound of … hidden LTG's. An upper bound of … on the number of hidden LTG's was reported by Muroga (1959).

Case 2: m points in general position

Here, we start from the relation

(2.2.13)

which assumes that k is large and that the two hidden layer mappings preserve the general position property of the m points. Now, in the limit of … and n → ∞, Equation (2.2.13) can be solved for k to give

(2.2.14)

2.2.3 Generally Interconnected Networks with no Feedback

It is left as an exercise for the reader to verify that the necessary size of an arbitrarily fully interconnected network of LTG's (with no feedback) for realizing any function f : Rn → {0, 1} defined on m arbitrary points is given by (see Problem 2.2.3)

(2.2.15) and for points in general position

(2.2.16) It is of interest to compare Equations (2.2.12) and (2.2.14) with Equations (2.2.15) and (2.2.16), respectively. Note that for these two seemingly different network architectures, the number of necessary LTG's for realizing arbitrary functions is of the same order. This agrees with the results of Baum (1988), who showed that these bounds are of the same order for any layered feedforward net with two or more hidden layers and the same number of units, irrespective of the number of layers used. This suggests that if one were able to compute an arbitrary function using a two hidden layer net with only … units, there would not be much to gain, in terms of random or arbitrary function realization capability, by using more than two hidden layers! Also, comparing Equations (2.2.3), (2.2.12), and (2.2.15) [or Equations (2.2.8), (2.2.14), and (2.2.16)] shows that when the size of the training set is much larger than the dimension of the input exemplars, networks with two or more hidden layers may require substantially fewer units than networks with a single hidden layer. In practice, the actual points (patterns) that we want to discriminate between are not arbitrary or random; rather, they are likely to have natural regularities and redundancies. This may make them easier to realize with networks of substantially smaller size than these enumerational statistics would indicate. For convenience, the results of this section are summarized in Table 2.2.1.

NETWORK ARCHITECTURE | LOWER BOUND, ARBITRARY POINTS | LOWER BOUND, POINTS IN GENERAL POSITION
One hidden layer feedforward net with k hidden units | … | …
Two hidden layer feedforward net with k/2 units in each layer | … | …
Generally interconnected net with k units (no feedback) | … | …

Table 2.2.1. Lower bounds on the size of a net of LTG's for realizing any function of m points, f : Rn → {0, 1}, in the limit of large n.

**2.3 Approximation Capabilities of Feedforward Neural Networks for Continuous Functions**

This section summarizes some fundamental results, in the form of theorems, on the continuous function approximation capabilities of feedforward nets. The main result is that a two layer feedforward net with a sufficient number of hidden units of the sigmoidal activation type and a single linear output unit is capable of approximating any continuous function f : Rn → R to any desired accuracy. Before formally stating this result, we consider some early observations on the implications of a classical theorem on function approximation, Kolmogorov's theorem, which motivates the use of layered feedforward nets as function approximators.

2.3.1 Kolmogorov's Theorem

It has been suggested (Hecht-Nielsen, 1987 and 1990; Lippmann, 1987; Sprecher, 1993) that Kolmogorov's theorem concerning the realization of arbitrary multivariate functions provides theoretical support for neural networks that implement such functions.

Theorem 2.3.1 (Kolmogorov, 1957): Any continuous real-valued function f(x1, x2, ..., xn) defined on [0, 1]^n, n ≥ 2, can be represented in the form

f(x1, x2, ..., xn) = Σ_{j=1}^{2n+1} gj( Σ_{i=1}^{n} φij(xi) )    (2.3.1)

where the gj's are properly chosen continuous functions of one variable, and the φij's are continuous monotonically increasing functions independent of f.

The basic idea in Kolmogorov's theorem is captured in the network architecture of Figure 2.3.1, where a universal transformation M maps Rn into several uni-dimensional transformations. The theorem states that one can express a continuous multivariate function on a compact set in terms of sums and compositions of a finite number of single variable functions.

Figure 2.3.1. Network representation of Kolmogorov's theorem.

Others, such as Girosi and Poggio (1989), have criticized this interpretation of Kolmogorov's theorem as irrelevant to neural networks by pointing out that the φij functions are highly non-smooth and the functions gj are not parameterized. On the other hand, Kůrková (1992) supported the relevance of this theorem to neural nets by arguing that non-smooth functions can be approximated as sums of infinite series of smooth functions; thus, one should be able to approximately implement φij and gj with parameterized networks. More recently, Lin and Unbehauen (1993) argued that an "approximate" implementation of the gj does not, in general, deliver an approximate implementation of the original function f(x). As this debate continues, the importance of Kolmogorov's theorem might not be in its direct application to proving the universality of neural nets as function approximators as much as in its pointing to the feasibility of using parallel and layered network structures for multivariate function mappings.

2.3.2 Single Hidden Layer Neural Networks are Universal Approximators

Rigorous mathematical proofs for the universality of feedforward layered neural nets employing continuous sigmoid type, as well as other more general, activation units were given, independently, by Cybenko (1989), Hornik et al. (1989), and Funahashi (1989). Cybenko's proof is distinguished by being mathematically concise and elegant [it is based on the Hahn-Banach theorem (Luenberger, 1969)]. The following is the statement of Cybenko's theorem (the reader is referred to the original paper by Cybenko (1989) for the proof).

Theorem 2.3.2 (Cybenko, 1989): Let ϕ be any continuous sigmoid type function (e.g., …). Then, given any continuous real-valued function f on [0, 1]^n (or any other compact subset of Rn) and ε > 0, there exist vectors w1, w2, ..., wN, α, and θ, and a parameterized function G(·, w, α, θ) : [0, 1]^n → R such that

| G(x, w, α, θ) − f(x) | < ε  for all x ∈ [0, 1]^n

where

G(x, w, α, θ) = Σ_{j=1}^{N} αj ϕ( wj^T x + θj )    (2.3.2)

and wj ∈ Rn, αj, θj ∈ R, w = (w1, w2, ..., wN), α = (α1, α2, ..., αN), and θ = (θ1, θ2, ..., θN).

Hornik et al. (1989) [employing the Stone-Weierstrass theorem (Rudin, 1964)] and Funahashi (1989) [using an integral formula presented by Irie and Miyake (1988)] independently proved similar theorems stating that a one hidden layer feedforward neural network is capable of approximating uniformly any continuous multivariate function, to any desired degree of accuracy. This implies that any failure of a function mapping by a multilayer network must arise from an inadequate choice of parameters (i.e., poor choices for w1, w2, ..., wN, α, and θ) or an insufficient number of hidden nodes. Hornik et al. (1990) proved another important result relating to the approximation capability of multilayer feedforward neural nets employing sigmoidal hidden unit activations: these networks can approximate not only an unknown function, but also its derivative. In fact, Hornik et al. (1990) also showed that these networks can approximate functions that are not differentiable in the classical sense, but possess a generalized derivative, as in the case of piecewise differentiable functions. Using a theorem by Sun and Cheney (1992), Light (1992a) extended Cybenko's results to any continuous function f on Rn and showed that signed integer weights and thresholds are sufficient for accurate approximation. In another version of the theorem, Light shows that the sigmoid can be replaced by any continuous function ϑ on R satisfying:

(2.3.3) where n is odd and n ≥ 3. A similar result can be established for n even. Examples of activation functions satisfying the above conditions are given by the family of functions

(2.3.4) Note that the cosine term is the Chebyshev polynomial of degree n. Figure 2.3.2 shows two plots of this activation function for n = 3 and n = 7, respectively.
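The approximation property itself is easy to observe numerically (an illustration, not a proof). Below, N sigmoidal hidden units are fixed on a grid of centers — an arbitrary choice of ours — and only the output weights αj of G(x) = Σj αj ϕ(wj x + θj) are fitted by least squares:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

f = lambda x: np.sin(2 * np.pi * x)   # an arbitrary continuous target on [0, 1]
x = np.linspace(0.0, 1.0, 200)

N = 25
centers = np.linspace(0.0, 1.0, N)
w, theta = 30.0, -30.0 * centers      # unit j computes sigmoid(30 * (x - c_j))
H = sigmoid(w * x[:, None] + theta)   # hidden layer outputs, shape (200, N)

# Fit only the output weights alpha_j by least squares
alpha, *_ = np.linalg.lstsq(H, f(x), rcond=None)
err = np.max(np.abs(H @ alpha - f(x)))
print(err < 0.05)                     # True: small uniform error on the grid
```

Increasing N (and the sampling density) drives the error down further, in line with the "sufficient number of hidden units" qualifier in the theorems above.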

Ito (1991) showed that any function belonging to the class of rapidly decreasing continuous functions in Rn (i.e., functions f(x) satisfying … for any kj ≥ 0) can be approximated arbitrarily well by a two layer architecture with a finite number of LTG's in the hidden layer. In fact, the requirement of rapid decrease in f is not necessary and can be weakened. Baldi (1991) showed that a wide class of continuous multivariate functions can be approximated by a weighted sum of bell-shaped functions (multivariate Bernstein polynomials); i.e., a single hidden layer net with bell-shaped activations for its hidden units and a single linear output unit is a possible approximator of functions f : Rn → R. The universality of single hidden layer nets with units having non-sigmoid activation functions was formally proved by Stinchcombe and White (1989). Hornik (1991; see also Hornik, 1993) proved that a sufficient condition for universal approximation can be obtained by using continuous, bounded, and nonconstant hidden unit activation functions. Recently, Leshno et al. (1993) extended these results by showing that such a neural network with locally bounded piecewise continuous activation functions for its hidden units is a universal approximator if and only if the network's activation function is not a polynomial.

(a) (b)

Figure 2.3.2. Activation function ϑn(·) for the case (a) n = 3, and (b) n = 7.

2.3.3 Single Hidden Layer Neural Networks are Universal Classifiers

Theorem 2.3.2 may also be extended to classifier-type mappings (Cybenko, 1989) of the form

f(x) = j iff x ∈ Pj    (2.3.5)

where f : An → {1, 2, ..., k}, An is a compact (closed and bounded) subset of Rn, and P1, P2, ..., Pk partition An into k disjoint measurable subsets; i.e., … and Pi ∩ Pj is empty for i ≠ j. Thus, a single hidden layer net with sigmoidal activation units and a single linear output unit is a universal classifier. This result was confirmed empirically by Huang and Lippmann (1988) on several examples, including the ones shown in Figure 2.3.3 for n = 2 and k = 2 (i.e., two-class problems in a two-dimensional space).

Figure 2.3.3. Ten complex decision regions formed by a neural net classifier with a single hidden layer. (Adapted from W. Y. Huang and R. P. Lippmann, 1988, with permission of the American Institute of Physics.)

We conclude this section by noting that in Equation (2.3.5) the class label is explicitly generated as the output of a linear unit. This representation of class labels via quantized levels of a single linear output unit, although useful for theoretical considerations, is not practical: it imposes unnecessary constraints on the hidden layer, which in turn leads to a large number of hidden units for the realization of complex mappings. In practical implementations, a local encoding of classes is more commonly used (Rumelhart et al., 1986a and 1986b). This relaxes the constraints on the hidden layer mapping by adding several output units, each of which is responsible for representing a unique class. Here, LTG's or sigmoid type units may be used as output units.

2.4 Computational Effectiveness of Neural Networks

2.4.1 Algorithmic Complexity

A general way of looking at the efficiency of embedding a problem in a neural network comes from a computational complexity point of view (Abu-Mostafa, 1986a). Solving a problem on a sequential computer requires a certain number of steps (time complexity), memory size (space complexity), and length of algorithm (Kolmogorov complexity). In a neural net simulation, the number of computations is a measure of time complexity, the number of units is a measure of space complexity, and the number of weights (degrees of freedom) where the algorithm is "stored" is a measure of Kolmogorov complexity. In formulating neural network solutions for practical problems, we seek to minimize, simultaneously, the resulting time, space, and Kolmogorov complexities of the network. If a given problem is very demanding in terms of space complexity, then the required network size is large and thus the number of weights is large, even if the Kolmogorov complexity of the algorithm is very modest! This spells inefficiency: neural net solutions of problems with short algorithms and high space complexity are very inefficient. The same is true for problems where time complexity is very demanding, while the other complexities may not be.
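These three measures can be tallied for a concrete layered net; the layer sizes and the one multiply-add per weight accounting below are illustrative assumptions of ours:

```python
def net_complexities(layer_sizes):
    # Proxies for a fully interconnected layered feedforward net:
    #   space      ~ number of units (all layers past the input)
    #   kolmogorov ~ number of weights, counting one threshold per unit
    #   time       ~ number of multiply-adds in a forward pass (one per weight)
    units = sum(layer_sizes[1:])
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:])) + units
    return {"time": weights, "space": units, "kolmogorov": weights}

print(net_complexities([4, 8, 8, 1]))
# {'time': 121, 'space': 17, 'kolmogorov': 121}
```

The coupling visible here — time and Kolmogorov complexity both scale with the weight count — is what makes short-algorithm, high-space problems an inefficient fit for neural nets.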

At this point. A typical microprocessor chip can perform about 10 million operations per second and uses about 1 watt of power. than the best digital technology that we can hope to attain. etc. The ultimate silicon technology we can envision today will dissipate on the order of 10−9 joules of energy for each operation at the single chip level. roughly. we switch about 104 transistors to do one operation. Thus. [For examples of very large optical neural networks. From this point of view.]
The above complexity discussion leads us to identify certain problems which best match the computational characteristics of artificial neural networks. Problems which would require a very long algorithm if run on sequential machines make the most use of neural nets, since the capacity of the net grows faster than the number of units. Such problems are called random problems (Abu-Mostafa, 1990). Examples are pattern recognition problems in natural environments and AI problems requiring huge data bases; these problems make the most use of the large capacity of neural nets, as will be shown in Chapters 5 and 6. The other interesting property of random problems is that we do not have explicit algorithms for solving them; a neural net can develop one by learning from examples. It is interesting to note that humans are very good at such random problems but not at structured ones. Here, one may recall the results of Chapter One, based on Cover's concept of extreme patterns/inequalities, which point to the effectiveness of threshold units in loading a large set of labeled patterns (data/prototypes) by only learning on extreme patterns. These results and those presented here show that neural networks can be very efficient classifiers.

2.4.2 Computation Energy

For any computational device, be it biologically based or microprocessor based, the cost-per-computation, or computation energy, of the device can be directly measured in terms of the energy required (in units of joules) to perform an operation. In round numbers, we may then compare computational devices in terms of the total energy required to solve a given problem. The brain has about 10^15 synapses, and a nerve pulse arrives at each synapse on the average of 10 times per second. Thus, the brain accomplishes about 10^16 complex operations per second. Since its power dissipation is a few watts, each operation costs only 10^−16 joules! The brain is more efficient than conventional digital technology by a factor of 10 million or more; it costs about 10^−7 joules to do one operation on a typical digital chip.

One reason for this inefficiency in computation energy is the way devices are used in a system. As technology evolved, it always moved in the direction of lower energy per unit computation, in order to allow for more computations per unit time in a practically sized computing machine (note the trend from vacuum tubes, to transistors, to integrated circuits). But we neglected a basic fact: the transistor does an exponential all by itself. Carver Mead (1991) wrote: "We pay a factor of 10^4 for taking all of the beautiful physics that is built into these transistors, mashing it down into a one or a zero, and then painfully building it back, with AND and OR gates to reinvent the multiply. We then string together those multiplications and additions to get an exponential." Using the physics of the device (e.g., analog computing in a single transistor) can save us these four orders of magnitude in computation energy.

Based on the above, one can very efficiently realize analog artificial neurons with existing technology by employing device physics (Hopfield, 1986b, 1990). The type of computations required by neural networks may lend themselves nicely to efficient implementation with current analog VLSI technology. For example, in a typical silicon implementation, analog addition is done for free at a node, multiplication may be performed with a small MOS transistor circuit, and nonlinear activations (sigmoids) can be realized using a single MOS transistor operating in its subthreshold region. Therefore, there is a potential for effectively and efficiently realizing very large analog neural nets on single chips.

Analog optical implementations of neural networks could also have a competitive advantage over digital implementations. Optics has the unique advantage that beams of light can cross each other without affecting their information contents. Similarly, optical interconnections may be used in conjunction with electronic VLSI chips to efficiently implement very large and richly interconnected neural networks. The richness of optical device physics, such as holograms, optical filters, analog spatial light modulators, and Fourier lenses, suggests that optical implementation of analog neural nets could be very efficient. Here, the reader is referred to the works of Paek and Psaltis (1987), Abu-Mostafa and Psaltis (1987), and Anderson and Erie (1987).
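The brain's figure of roughly 10^−16 joules per operation follows from simple arithmetic on the round numbers quoted above. As a quick check (Python; "a few watts" is taken here to be 3 W, an assumed value):

```python
# Round-number estimate of the brain's computation energy, as in the text.
synapses = 1e15            # total synapses in the brain
pulses_per_second = 10     # average nerve pulses arriving at each synapse
power_watts = 3.0          # "a few watts" of dissipation (assumed value)

ops_per_second = synapses * pulses_per_second      # ~1e16 operations/s
joules_per_op = power_watts / ops_per_second       # energy per operation

print(f"{ops_per_second:.0e} ops/s, {joules_per_op:.0e} J/op")  # → 1e+16 ops/s, 3e-16 J/op
```

Any value of "a few watts" between 1 and 10 W leaves the estimate at the 10^−16 J order of magnitude quoted in the text.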

2.5 Summary

Lower bounds on the size of multilayer feedforward neural networks for the realization of arbitrary dichotomies of points are derived. The derived bounds suggest that no matter what network architecture is used (assuming no feedback), the size of such networks must always be on the order of these bounds or larger for the realization of arbitrary or random functions of m points in Rn, in the limit m >> n^2. It is found that networks with two or more hidden layers are potentially more size-efficient than networks with a single hidden layer. Fortunately, though, the functions encountered in practice are likely to have natural regularities and redundancies which make them easier to realize with substantially smaller networks than these bounds would indicate.

Single hidden layer feedforward neural networks are universal approximators for arbitrary multivariate continuous functions. These networks are also capable of implementing arbitrarily complex dichotomies; thus, they are suitable as pattern classifiers. However, it is crucial that the parameters of these networks be carefully chosen in order to exploit the full function approximation potential of these networks. In the next chapter, we explore learning algorithms which may be used to adaptively discover optimal values for these parameters.

We also find such networks appealing from a computation energy point of view when implemented in analog VLSI and/or optical technologies that utilize the properties of device physics. Finally, we find that from an algorithmic complexity point of view, neural networks are best fit for solving random problems such as pattern recognition problems in noisy environments.

Problems

2.1.1 Use the K-map-based threshold-OR synthesis technique illustrated in Figure 2.1.3 to identify all possible decompositions of the Boolean function f(x1, x2, x3) = x1x3 + x2x3 + … into the ORing of two threshold functions. (Hint: Recall the admissible K-map threshold patterns of Figure 1.1.5.)

2.1.2 Find a minimal threshold-OR network realization for the three-input, two-output, completely specified switching function given by the K-maps in Figure P2.1.2.

2.2.1 Plot Equation (2.2.4) for n = 9.

2.2.2 Show that the activation in Equation (2.2.4) with n = 3 satisfies the three conditions in Equation (2.2.3).

*2.2.3 Derive the bound on k in Equation (2.2.14).

*2.2.4 Derive the bound on k in Equation (2.2.15). [Hint: Use Equation (2.2.2).]

*2.2.5 Derive the bound on k in Equation (2.2.16).

2.3.1 Estimate the maximum number of Boolean functions, defined on m points in general position, that can be realized using the generally interconnected LTG network shown in Figure 2.…

*2.3.2 Consider an arbitrarily fully interconnected network of LTG's with no feedback. Show that this network must have more than … LTG's for realizing arbitrary, completely specified Boolean functions of n variables.

2.3.3 Employ Theorem 1.… and the equivalence of the two nets in Figures 1.4 and 1.5 to show that a two-layer LTG net with m LTG's in the hidden layer, feeding into a single output LTG, is sufficient for realizing any Boolean function of m points in {0, 1}n. (Hint: Use the result of Problem 2.3.1.) Is the requirement of full interconnectivity between the input vector and the hidden LTG layer necessary? Why?

LEARNING RULES

3.0 Introduction

One of the most significant attributes of a neural network is its ability to learn by interacting with its environment or with an information source. Learning in a neural network is normally accomplished through an adaptive procedure, known as a learning rule or algorithm, whereby the weights of the network are incrementally adjusted so as to improve a predefined performance measure over time.

In the context of artificial neural networks, the process of learning is best viewed as an optimization process. More precisely, the learning process can be viewed as "search" in a multidimensional parameter (weight) space for a solution, which gradually optimizes a prespecified objective (criterion) function. This view is adopted in this chapter, and it allows us to unify a wide range of existing learning rules, which otherwise would have looked more like a diverse variety of learning procedures.

This chapter presents a number of basic learning rules for supervised, reinforcement, and unsupervised learning tasks. In supervised learning (also known as learning with a teacher or associative learning), each input pattern/signal received from the environment is associated with a specific desired target pattern. Usually, the weights are synthesized gradually, and at each step of the learning process they are updated so that the error between the network's output and a corresponding desired target is reduced. On the other hand, unsupervised learning involves the clustering of (or the detection of similarities among) unlabeled patterns of a given training set. The idea here is to optimize (maximize or minimize) some criterion or performance function defined in terms of the output activity of the units in the network; the weights and the outputs of the network are usually expected to converge to representations which capture the statistical regularities of the input data. Reinforcement learning involves updating the network's weights in response to an "evaluative" teacher signal; this differs from supervised learning, where the teacher signal is the "correct answer". Reinforcement learning rules may be viewed as stochastic search mechanisms that attempt to maximize the probability of positive external reinforcement for a given training set.

Throughout this chapter, simple single-layer architectures are assumed; in most cases, the learning rules are presented in the basic form appropriate for single unit training. Exceptions are cases involving unsupervised (competitive or feature mapping) learning schemes, where an essential competition mechanism necessitates the use of multiple units. Later chapters of this book (Chapters 5, 6, and 7) extend some of the learning rules discussed here to networks with multiple units and multiple layers.

3.1 Supervised Learning in a Single Unit Setting

Supervised learning is treated first. Here, two groups of rules are discussed: error correction rules and gradient descent-based rules, such as the ones presented in Section 3.1.2. In all cases, an attempt is made to point out criterion functions that are minimized by using each rule.

3.1.1 Error Correction Rules

Error correction rules were initially proposed as ad hoc rules for single unit training. These rules essentially drive the output error of a given unit to zero. By the end of this section, it will be established that all of these learning rules can be systematically derived as minimizers of an appropriate criterion function, thus unifying them with the other gradient-based search rules. We will also cast these learning rules as relaxation rules. We start with the classical perceptron learning rule and give a proof of its convergence. Then, other error correction rules, such as Mays' rule and the α-LMS rule, are covered.

Perceptron Learning Rule

Consider the version of the linear threshold gate shown in Figure 3.1.1; we will refer to it as the perceptron. The perceptron maps an input vector x = [x1 x2 ... xn+1]T to a bipolar binary output y, and thus it may be viewed as a simple two-class classifier. The input signal xn+1 is usually set to 1 and plays the role of a bias to the perceptron. We will denote by w = [w1 w2 ... wn+1]T ∈ Rn+1 the vector consisting of the free parameters (weights) of the perceptron. The input/output relation for the perceptron is given by y = sgn(xTw), where sgn is the "sign" function, which returns +1 or −1 depending on whether the sign of its scalar argument is positive or negative.

Figure 3.1.1. The perceptron computational unit.

Assume we are training the above perceptron to load (learn) the training pairs {x1, d1}, {x2, d2}, ..., {xm, dm}, where xk is the kth input vector and dk ∈ {−1, +1} is the desired target for the kth input vector (usually, the order of presentation of these training pairs is random). The entire collection of these pairs is called the training set. The goal, then, is to design a perceptron such that, for each input vector xk of the training set, the perceptron output yk matches the desired target dk; that is, we require yk = sgn((xk)Tw) = dk for each k = 1, 2, ..., m. In this case, we say that the perceptron correctly classifies the training set. Thus, "designing" an appropriate perceptron to correctly classify the training set amounts to determining a weight vector w* such that the following relations are satisfied:

(xk)Tw* > 0  if dk = +1
(xk)Tw* < 0  if dk = −1    (3.1.1)

Recall that the set of all x which satisfy xTw* = 0 defines a hyperplane in Rn. Thus, finding a solution vector w* for Equation (3.1.1) is equivalent to finding a separating hyperplane which correctly classifies all vectors xk, k = 1, 2, ..., m. In other words, we desire a hyperplane xTw* = 0 which partitions the input space into two distinct regions, one containing all points xk with dk = +1 and the other containing all points xk with dk = −1.

One possible incremental method for arriving at a solution w* is to invoke the perceptron learning rule (Rosenblatt, 1962):

wk+1 = wk + ρ [dk − sgn((xk)Twk)] xk,  k = 1, 2, ...    (3.1.2)

where ρ is a positive constant, called the learning rate. The incremental learning process given in Equation (3.1.2) proceeds as follows. First, an initial weight vector, w1, is selected (usually at random) to begin the process. Then the m pairs {xk, dk} of the training set are used to successively update the weight vector until (hopefully) a solution w* is found which correctly classifies the training set. This process of sequentially presenting the training patterns is usually referred to as "cycling" through the training set, and a complete presentation of the m training pairs is referred to as a cycle (or pass) through the training set. In general, more than one cycle through the training set is required to determine an appropriate solution vector.

Here, the superscript k in wk refers to the iteration number, while the superscript k in xk (and dk) labels the training pair presented at the kth iteration. To be more precise, if the number of training pairs, m, is finite, then the superscripts in xk and dk should be replaced by [(k − 1) mod m] + 1, where a mod b returns the remainder of the division of a by b (e.g., 5 mod 8 = 5, 8 mod 8 = 0, and 19 mod 8 = 3).

In Equation (3.1.2), a correction is done if and only if a misclassification occurs, i.e., if and only if sgn((xk)Twk) ≠ dk; otherwise, the weight vector is left unchanged. Since dk − sgn((xk)Twk) = 2dk whenever a misclassification occurs, the perceptron learning rule can be written as

wk+1 = wk + 2ρ zk    (3.1.3)

where

zk = dk xk    (3.1.4)

That is, a correction is made at step k if and only if

(zk)Twk ≤ 0    (3.1.5)

Hence, the perceptron learning rule attempts to find a solution w* for the following system of inequalities:

(zk)Tw* > 0,  for k = 1, 2, ..., m    (3.1.6)

The addition of the vector zk (scaled by 2ρ) to wk in Equation (3.1.3) moves the weight vector directly toward, and perhaps across, the hyperplane (zk)Tw = 0. The new inner product (zk)Twk+1 is larger than (zk)Twk, and the correction Δwk = wk+1 − wk is clearly moving wk in a good direction, the direction of increasing (zk)Tw, as can be seen from Figure 3.1.2. Notice that for ρ = 0.5 the correction reduces to Δwk = zk. The observation that each correction moves wk in a direction that reduces the output error is valid for all incremental learning rules presented in this chapter.
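As a concrete illustration, the incremental rule of Equation (3.1.2) can be sketched in a few lines of Python/NumPy. The function name and the toy data below are illustrative choices, not from the text:

```python
import numpy as np

def perceptron_train(X, d, rho=0.5, max_cycles=100):
    """Incremental perceptron learning rule, Eq. (3.1.2).
    X: m-by-(n+1) array of input vectors (last component 1, the bias input);
    d: length-m array of bipolar targets (+1/-1)."""
    w = np.zeros(X.shape[1])                 # initial weight vector w1
    for _ in range(max_cycles):              # cycles (passes) through the training set
        errors = 0
        for x, t in zip(X, d):
            y = 1 if x @ w > 0 else -1       # y = sgn(x^T w)
            if y != t:                       # correct only on misclassification
                w = w + rho * (t - y) * x    # (t - y) = +/-2, so the step is 2*rho*t*x
                errors += 1
        if errors == 0:                      # training set correctly classified
            return w
    return w

# A linearly separable toy set: class +1 above the line x2 = x1, class -1 below.
X = np.array([[0., 1., 1.], [1., 2., 1.], [1., 0., 1.], [2., 1., 1.]])
d = np.array([1, 1, -1, -1])
w = perceptron_train(X, d)
print((np.where(X @ w > 0, 1, -1) == d).all())  # → True
```

Because the toy set is linearly separable, the convergence proof below guarantees that the loop terminates after a finite number of corrections.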

Figure 3.1.2. Geometric representation of the perceptron learning rule with ρ = 0.5.

In an analysis of any learning algorithm, and in particular the perceptron learning algorithm of Equation (3.1.2), there are two main issues to consider: (1) the existence of solutions and (2) convergence of the algorithm to the desired solutions (if they exist). In the case of the perceptron, it is clear that a solution vector (i.e., a vector w* which correctly classifies the training set) exists if and only if the given training set is linearly separable. Assuming, then, that the training set is linearly separable, we may proceed to show that the perceptron learning rule converges to a solution (Novikoff, 1962; Ridgway, 1962; Nilsson, 1965) as follows.

Let w* be any solution vector, so that

(zk)Tw* > 0,  for k = 1, 2, ..., m    (3.1.7)

Then, taking ρ = 0.5 and counting only the steps at which a pattern is misclassified (and hence a correction is made), we may write

wk+1 − αw* = wk − αw* + zk    (3.1.8)

where α is a positive scale factor. Taking the squared norm of both sides gives

||wk+1 − αw*||^2 = ||wk − αw*||^2 + 2(zk)T(wk − αw*) + ||zk||^2    (3.1.9)

Since zk is misclassified, we have (zk)Twk ≤ 0, and hence

||wk+1 − αw*||^2 ≤ ||wk − αw*||^2 − 2α(zk)Tw* + ||zk||^2    (3.1.10)

Now, let β^2 = max_k ||zk||^2 and γ = min_k (zk)Tw* (γ is positive since (zk)Tw* > 0), and substitute into Equation (3.1.10) to get

||wk+1 − αw*||^2 ≤ ||wk − αw*||^2 − 2αγ + β^2    (3.1.11)

If we choose α sufficiently large, in particular α = β^2/γ, then

||wk+1 − αw*||^2 ≤ ||wk − αw*||^2 − β^2    (3.1.12)

Thus, the square distance between wk and αw* is reduced by at least β^2 at each correction, and after k corrections we may write Equation (3.1.12) as

||wk+1 − αw*||^2 ≤ ||w1 − αw*||^2 − kβ^2    (3.1.13)

It follows that the sequence of corrections must terminate after no more than k0 corrections, where

k0 = ||w1 − αw*||^2 / β^2    (3.1.14)

That is, when corrections cease, the resulting weight vector must classify all the samples correctly, since a correction occurs whenever a sample is misclassified and since each sample appears infinitely often in the sequence. Therefore, if a solution exists, it is achieved in a finite number of iterations.

In general, a linearly separable problem admits an infinite number of solutions. The perceptron learning rule in Equation (3.1.2) converges to one of these solutions. This solution, though, is sensitive to the value of the learning rate ρ used and to the order of presentation of the training pairs. This sensitivity is responsible for the varying quality of the perceptron-generated separating surface observed in simulations.

The bound on the number of corrections, k0, given by Equation (3.1.14) depends on the choice of the initial weight vector w1. If w1 = 0, we get

k0 = α^2||w*||^2/β^2 = β^2||w*||^2/γ^2    (3.1.15)

Here, k0 is a function of the initially unknown solution weight vector w*. Therefore, Equation (3.1.15) is of no help for predicting the maximum number of corrections. However, the denominator of Equation (3.1.15) implies that the difficulty of the problem is essentially determined by the samples most nearly orthogonal to the solution vector.

Generalizations of the Perceptron Learning Rule

The perceptron learning rule may be generalized to include a variable increment ρk and a fixed, positive margin b. This generalized learning rule updates the weight vector whenever (zk)Twk fails to exceed the margin b. Here, the algorithm for weight vector update is given by

wk+1 = wk + ρk zk  if (zk)Twk ≤ b;  otherwise, wk+1 = wk    (3.1.16)

The margin b is useful because it gives a dead-zone robustness to the decision boundary: the perceptron's decision hyperplane is constrained to lie in a region between the two classes such that sufficient clearance is realized between this hyperplane and the extreme points (boundary patterns) of the training set. This makes the perceptron robust with respect to noisy inputs.

It can be shown (Duda and Hart, 1973) that if the training set is linearly separable and if the following three conditions are satisfied:

ρk ≥ 0    (3.1.17a)

Σ(k=1..∞) ρk = ∞    (3.1.17b)

lim(m→∞) [Σ(k=1..m) (ρk)^2] / [Σ(k=1..m) ρk]^2 = 0    (3.1.17c)

(e.g., ρk = ρ or even ρk = ρk, with ρ a positive constant), then wk converges to a solution w* which satisfies (zi)Tw* > b for i = 1, 2, ..., m. Furthermore, when ρk is fixed at a positive constant ρ, this learning rule converges in finite time.

In the nonlinearly separable case, the above algorithms do not converge. Few theoretical results are available on the behavior of these algorithms for nonlinearly separable problems [see Minsky and Papert (1969) and Block and Levin (1970) for some preliminary results]. Here, it is known that the length of w in the perceptron rule is bounded and, with fixed ρ, tends to fluctuate near some limiting value ||w*|| (Efron, 1964). This information may be used to terminate the search for w*. Another approach is to average the weight vectors near the fluctuation point w*.

Another variant of the perceptron learning rule is given by the "batch" update procedure

wk+1 = wk + ρ Σz∈Z(wk) z    (3.1.18)

where Z(wk) is the set of patterns z misclassified by wk. Here, the weight vector change is along the direction of the resultant vector of all misclassified patterns. In general, this update procedure converges faster than the perceptron rule, but it requires more storage.

Butz (1967) proposed the use of a reinforcement factor γ, 0 ≤ γ ≤ 1, in the perceptron learning rule. This reinforcement places w in a region that tends to minimize the probability of error for nonlinearly separable cases. Butz's rule is as follows:

wk+1 = wk + ρ zk   if (zk)Twk ≤ 0
wk+1 = wk + γρ zk  if (zk)Twk > 0    (3.1.19)

The Perceptron Criterion Function

It is interesting to see how the above error correction rules can be derived by performing gradient descent on an appropriate criterion (objective) function.

For the perceptron, we may define the following criterion function (Duda and Hart, 1973):

J(w) = Σz∈Z(w) (−zTw)    (3.1.20)

where Z(w) is the set of samples misclassified by w (i.e., zTw ≤ 0). Note that if Z(w) is empty, then J(w) = 0; otherwise, J(w) > 0. Geometrically, J(w) is proportional to the sum of the distances from the misclassified samples to the decision boundary. The smaller J is, the better the weight vector w will be.

Given this objective function J(w), we can incrementally improve the search point wk at each iteration by sliding downhill on the surface defined by J(w) in w space. Specifically, we may use J to perform a discrete gradient descent search, which updates wk so that a step is taken downhill in the "steepest" direction along the search surface J(w) at wk. This can be achieved by making Δwk proportional to the gradient of J at the present location wk; formally, we may write

wk+1 = wk − ρ ∇J(w)|w=wk    (3.1.21)

Here, the initial search point, w1, and the learning rate (step size) ρ are to be specified by the user. We will refer to Equation (3.1.21) as the steepest gradient descent search rule or, simply, gradient descent. Next, substituting the gradient

∇J(w) = Σz∈Z(w) (−z)    (3.1.22)

in Equation (3.1.21) leads to the weight update rule

wk+1 = wk + ρ Σz∈Z(wk) z    (3.1.23)

The learning rule given in Equation (3.1.23) is identical to the multiple-sample (batch) perceptron rule of Equation (3.1.18). The original perceptron learning rule of Equation (3.1.3) can be thought of as an "incremental" gradient descent search rule for minimizing the perceptron criterion function in Equation (3.1.20).

Before moving on, we should note that the gradient of J in Equation (3.1.22) is not mathematically precise. Due to the piecewise linear nature of J, sudden changes in the gradient of J occur every time the perceptron output y goes through a transition at zTw = 0; thus, the gradient is not defined at such "transition" points. However, because of the discrete nature of Equation (3.1.21), the likelihood of wk overlapping with one of these transition points is negligible, and thus we may still express the gradient of J as in Equation (3.1.22). Following a similar procedure as in Equations (3.1.21) through (3.1.23), it can be shown that

J(w) = Σz∈Z(w) (b − zTw)    (3.1.24)

is the appropriate criterion function for the modified perceptron rule in Equation (3.1.16). The reader is referred to Problem 3.1.3 for further exploration into gradient descent on the perceptron criterion function.

Mays Learning Rule

The criterion functions in Equations (3.1.20) and (3.1.24) are by no means the only functions we can construct that are minimized when w is a solution vector. For example, an alternative function is the quadratic function

J(w) = (1/2) Σz∈Z(w) (zTw − b)^2    (3.1.25)

where b is a positive constant margin and Z(w) is the set of samples for which zTw ≤ b. Like the previous criterion functions, the function J(w) in Equation (3.1.25) focuses attention on the misclassified samples. Its major difference is that its gradient is continuous, whereas the gradient of the perceptron criterion function, with or without the use of margin, is not. Unfortunately, the present function can be dominated by the input vectors with the largest magnitudes. We may eliminate this undesirable effect by dividing by ||z||^2:

J(w) = (1/2) Σz∈Z(w) (zTw − b)^2 / ||z||^2    (3.1.26)

The gradient of J(w) in Equation (3.1.26) is given by

∇J(w) = Σz∈Z(w) [(zTw − b)/||z||^2] z    (3.1.27)

which, upon substituting in Equation (3.1.21), leads to the following learning rule:

wk+1 = wk + ρ Σz∈Z(wk) [(b − zTwk)/||z||^2] z    (3.1.28)

If we consider the incremental update version of Equation (3.1.28), we arrive at Mays' rule (Mays, 1964):

wk+1 = wk + ρ [(b − (zk)Twk)/||zk||^2] zk  if (zk)Twk ≤ b;  otherwise, wk+1 = wk    (3.1.29)

If the training set is linearly separable, Mays' rule converges in a finite number of iterations, for 0 < ρ < 2 (Duda and Hart, 1973). In the case of a nonlinearly separable training set, the training procedure in Equation (3.1.29) will never converge. To fix this problem, a decreasing learning rate such as ρk = 1/k may be used to force convergence to some approximate separating surface (Duda and Singleton, 1964).
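The incremental rule of Equation (3.1.29) can be sketched as follows (Python/NumPy; the margin b, the learning rate, and the toy data are illustrative choices, not values from the text):

```python
import numpy as np

def mays_train(Z, b=0.1, rho=1.0, max_cycles=100):
    """Incremental Mays rule, in the spirit of Eq. (3.1.29). Z holds the
    normalized patterns z^k = d^k x^k, so a margin solution must satisfy
    z^T w > b for every row of Z."""
    w = np.zeros(Z.shape[1])
    for _ in range(max_cycles):
        updated = False
        for z in Z:
            if z @ w <= b:                             # pattern violates the margin
                w = w + rho * (b - z @ w) / (z @ z) * z
                updated = True
        if not updated:                                # full clean pass: done
            return w
    return w

# Same linearly separable toy problem as before, pre-multiplied by the targets.
X = np.array([[0., 1., 1.], [1., 2., 1.], [1., 0., 1.], [2., 1., 1.]])
d = np.array([1, 1, -1, -1])
Z = d[:, None] * X
w = mays_train(Z)
print((Z @ w > 0).all())  # → True
```

With 0 < rho < 2 and a linearly separable set, finite convergence is guaranteed by the result quoted above; note that the normalization by ||z||^2 makes each step independent of the pattern's magnitude.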

Widrow-Hoff (α-LMS) Learning Rule

Another example of an error correcting rule with a quadratic criterion function is the Widrow-Hoff rule (Widrow and Hoff, 1960). This rule was originally used to train the linear unit, also known as the adaptive linear combiner element (ADALINE), shown in Figure 3.1.3. In this case, the output of the linear unit in response to the input xk is simply yk = (xk)Twk. The rule is sometimes referred to as the α-LMS rule (the "α" is used here to distinguish this rule from another very similar rule, which is discussed in Section 3.1.2). The α-LMS rule is given by

wk+1 = wk + α (dk − yk) xk / ||xk||^2    (3.1.30)

or, explicitly in terms of the weights, as

wk+1 = wk + α [dk − (xk)Twk] xk / ||xk||^2    (3.1.31)

where dk ∈ R is the desired response and α > 0. Note that the error in Equation (3.1.30) is measured at the linear output, not after the nonlinearity as in the perceptron.

Figure 3.1.3. Adaptive linear combiner element (ADALINE).

Equation (3.1.30) is similar to the perceptron rule if one sets ρ = α/||xk||^2. Though, as for Mays' rule, this rule is self-normalizing in the sense that the choice of α does not depend on the magnitude of the input vectors. Since the α-LMS rule selects Δwk to be collinear with xk, the desired error correction is achieved with a weight change of the smallest possible magnitude. Thus, when adapting to learn a new training sample, the responses to previous training samples are, on the average, minimally disturbed. This is the basic idea behind the minimal disturbance principle on which the α-LMS rule is founded. The constant α controls the stability and speed of convergence (Widrow and Stearns, 1985; Widrow and Lehr, 1990); stability is insured for most practical purposes if 0 < α < 2.

The Widrow-Hoff rule was originally proposed as an ad hoc rule, which embodies the minimal disturbance principle. Later, it was discovered (Widrow and Stearns, 1985) that this rule converges in the mean square to the solution w* which corresponds to the least-mean-square (LMS) output error, if the input vectors are independent over time. Alternatively, one can show that the α-LMS learning rule is a gradient descent minimizer of an appropriate quadratic criterion function (see Problem 3.1.4).

3.1.2 Other Gradient Descent-Based Learning Rules

In the following, additional learning rules for single unit training are derived. These rules are systematically derived by first defining an appropriate criterion function and then optimizing such a function by an iterative gradient search procedure.

µ-LMS Learning Rule

The µ-LMS learning rule (Widrow and Hoff, 1960) represents the most analyzed and most applied simple learning rule. It is also of special importance due to its possible extension to learning in multiple unit neural nets; therefore, special attention is given to this rule in this chapter. In the following, the µ-LMS rule is described in the context of the linear unit in Figure 3.1.3. Let

J(w) = (1/2) Σ(i=1..m) (di − yi)^2    (3.1.32)

be the sum of squared error (SSE) criterion function, where

yi = (xi)Tw    (3.1.33)

The criterion function J(w) in Equation (3.1.32) is quadratic in the weights because of the linear relation between yi and w. In fact, J(w) defines a convex hyperparaboloidal surface with a single minimum w* (the global minimum). Now, using steepest gradient descent search to minimize J(w) in Equation (3.1.32) gives

wk+1 = wk + µ Σ(i=1..m) (di − yi) xi    (3.1.34)

Therefore, if the positive constant µ is chosen sufficiently small, the gradient descent search implemented by Equation (3.1.34) will asymptotically converge toward the solution w*, regardless of the setting of the initial search point, w1. The learning rule in Equation (3.1.34) is sometimes referred to as the "batch" LMS rule. The incremental version of Equation (3.1.34), known as the µ-LMS or LMS rule, is given by

wk+1 = wk + µ (dk − yk) xk    (3.1.35)

Note that this rule becomes identical to the α-LMS learning rule in Equation (3.1.30) upon setting µ as

µ = α / ||xk||^2    (3.1.36)

Also, when the input vectors have the same length, as would be the case when x ∈ {−1, +1}n, the µ-LMS rule becomes identical to the α-LMS rule. Since the α-LMS learning algorithm converges when 0 < α < 2, the µ-LMS rule converges in this case when 0 < µ||xk||^2 < 2.
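As a sketch (Python/NumPy, with synthetic noise-free data in place of a real training set; the data, step size, and iteration counts are assumptions), the batch rule of Equation (3.1.34) and the incremental rule of Equation (3.1.35) can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear data: targets generated exactly by a known weight vector.
w_true = np.array([1.0, -2.0, 0.5])
X = np.column_stack([rng.normal(size=(50, 2)), np.ones(50)])  # bias input = 1
d = X @ w_true

mu = 0.01

# Batch LMS, Eq. (3.1.34): step along the summed error gradient.
w = np.zeros(3)
for _ in range(500):
    w = w + mu * (d - X @ w) @ X      # (d - Xw) @ X  =  sum_i (d_i - y_i) x_i

# Incremental (mu-LMS) rule, Eq. (3.1.35): one pattern at a time.
v = np.zeros(3)
for _ in range(500):                  # 500 cycles through the training set
    for x, t in zip(X, d):
        v = v + mu * (t - x @ v) * x

print(np.allclose(w, w_true), np.allclose(v, w_true))
```

Since the targets here are noise-free and µ is small enough for stability, both runs recover w_true; with noisy targets, the incremental rule would instead fluctuate in a small neighborhood of the minimum.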

When the learning rate µ is sufficiently small, the µ-LMS rule becomes a "good" approximation to the gradient descent rule in Equation (3.1.34). This means that the weight vector wk will tend to move towards the global minimum w* of the convex SSE criterion function. For input patterns independent over time and generated by a stationary process, convergence of the mean of the weight vector, <wk>, is ensured if the fixed learning rate µ is chosen to be sufficiently small; one sufficient condition is

0 < µ < 2/λmax    (3.1.37)

where λmax is the largest eigenvalue of the autocorrelation matrix <xxT> of the input vectors. Here, <·> signifies the "mean" or expected value. Horowitz and Senne (1981) showed that an appropriate bound on µ guarantees the convergence of w in the mean square (i.e., <wk> approaches a solution w* as k → ∞, with bounded variance) for input patterns generated by a zero-mean Gaussian process independent over time; their bound is less restrictive than the one in Equation (3.1.37). It should be noted that convergence in the mean square implies convergence in the mean; the converse, though, is not necessarily true. The assumptions of decorrelated patterns and stationarity are not necessary conditions for the convergence of µ-LMS (Widrow et al., 1976). Macchi and Eweda (1983) have a much stronger result regarding convergence of the µ-LMS rule, which is even valid when a finite number of successive training patterns are strongly correlated. Also see Problem 4.8 in the next chapter for further exploration.

A decreasing learning rate (e.g., ρk proportional to 1/k) causes wk to essentially stop changing for large k, which precludes the tracking of time variations. Therefore, in applications such as linear filtering, the decreasing step size is not very valuable, because it cannot accommodate nonstationarity in the input signal. The fixed increment LMS learning rule, with µ set to a small positive constant µ0, on the other hand, has the advantage of limited memory, which enables it to track time fluctuations in the input data.

In practical problems, m > n + 1; hence, it becomes impossible to satisfy all the requirements (xk)Tw = dk, k = 1, 2, ..., m, and we seek instead the weight vector that minimizes the SSE criterion function. Next, we show that this minimum SSE solution, w*, is given by

w* = X†d    (3.1.38)

where X = [x1 x2 ... xm], d = [d1 d2 ... dm]T, and X† = (XXT)−1X is the generalized inverse or pseudoinverse (Penrose, 1955) of X for m > n + 1. The extreme points (minima and maxima) of the function J(w) are solutions to the equation

∇J(w) = 0    (3.1.39)

That is, any minimum of the SSE criterion function in Equation (3.1.32) must satisfy

Σ(k=1..m) [dk − (xk)Tw] xk = 0    (3.1.40)

which, in matrix form, can be rewritten as

X XT w = X d    (3.1.41)

For a nonsingular matrix XXT, Equation (3.1.41) gives the minimum SSE solution

w* = (XXT)−1X d = X†d    (3.1.42)

Recall that just because w* in Equation (3.1.42) satisfies the condition ∇J(w*) = 0, this does not guarantee that w* is a local minimum of the criterion function J. It does, however, considerably narrow the choices, in that such a w* represents (in a local sense) either a point of minimum, maximum, or saddle point of J. To verify that w* is actually a minimum of J(w), we may evaluate the second derivative, or Hessian matrix, of J at w* and show that it is positive definite. This can be readily achieved after noting that the Hessian of J is equal to the positive-definite matrix XXT. Therefore, w* is a minimum of J.

The LMS rule may also be applied to synthesize the weight vector, w, of a perceptron for solving two-class classification problems. Here, one starts by training the linear unit in Figure 3.1.3 with the given training pairs {xk, dk}, k = 1, 2, ..., m, using the LMS rule. During training, the desired target dk is set to +1 for one class and to −1 for the other class. (In fact, any positive constant can be used as the target for one class, and any negative constant can be used as the target for the other class.) After convergence of the learning process, the solution vector obtained may then be used in the perceptron for classification. Due to the thresholding nonlinearity in the perceptron, the output of the classifier will now be properly restricted to the set {−1, +1}. However, this solution does not necessarily represent a linearly separable solution, even when the training set is linearly separable (this is further explored in Section 3.1.5); the minimum SSE solution in Equation (3.1.42) does not generally minimize the perceptron classification error rate. This should not be surprising, since the SSE criterion function is not designed to constrain its minimum inside the linearly separable solution region. Thus, by employing the LMS rule for perceptron training, linear separability is sacrificed for good compromise performance on both separable and nonseparable problems; when the training set is nonlinearly separable, the solution arrived at may still be a useful approximation.
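The normal equations above are easy to verify numerically. A minimal check (Python/NumPy, with randomly generated data standing in for a training set; the text's convention of one training vector per column of X is followed):

```python
import numpy as np

rng = np.random.default_rng(1)

# X is (n+1)-by-m with one training vector per column, last row = bias inputs.
n_plus_1, m = 4, 20
X = np.vstack([rng.normal(size=(n_plus_1 - 1, m)), np.ones((1, m))])
d = rng.normal(size=m)        # arbitrary real targets; m > n+1, so overdetermined

# Minimum SSE solution, Eq. (3.1.42): w* = (X X^T)^{-1} X d.
w_star = np.linalg.solve(X @ X.T, X @ d)

# It satisfies the normal equations X X^T w = X d, Eq. (3.1.41) ...
assert np.allclose(X @ X.T @ w_star, X @ d)
# ... and matches NumPy's pseudoinverse-based least-squares solution.
assert np.allclose(w_star, np.linalg.pinv(X.T) @ d)
print(w_star.shape)  # → (4,)
```

For well-conditioned XXT the two computations agree to round-off; for nearly singular XXT, the pseudoinverse route is the numerically safer choice.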

The µ-LMS Rule as a Stochastic Process

Stochastic approximation theory may be employed as an alternative to the deterministic gradient descent analysis presented thus far. It has the advantage of naturally arriving at a learning rate schedule ρk for asymptotic convergence in the mean square. Here, one starts with the mean-square error (MSE) criterion function

J(w) = (1/2) <(d − xTw)^2>    (3.1.43)

where <·> denotes the mean (expectation) over all training vectors. Note that the expected value of a vector or a matrix is found by taking the expected values of its components. Now, one may compute the gradient of J as

∇J(w) = −<(d − xTw) x> = Cw − P    (3.1.44)

where C = <xxT> and P = <dx>, which, upon setting to zero, allows us to find the minimum w* of J in Equation (3.1.43) as the solution of Cw = P. This gives

w* = C−1P    (3.1.45)

Here, the determinant of C, |C|, is assumed different from zero. We refer to C as the autocorrelation matrix of the input vectors and to P as the cross-correlation vector between the input vector x and its associated desired target d (more to follow on the properties of C later in this chapter). The solution w* in Equation (3.1.45) is sometimes called the Wiener weight vector (Widrow and Stearns, 1985). It represents the minimum MSE solution, also known as the least mean square (LMS) solution.

It is interesting to note here the close relation between the minimum SSE solution in Equation (3.1.42) and the LMS or minimum MSE solution in Equation (3.1.45). First, let us express XXT as the sum of vector outer products

XXT = Σ(k=1..m) xk(xk)T

We can also rewrite Xd as

Xd = Σ(k=1..m) dk xk

This representation allows us to express Equation (3.1.42) as

w* = [Σ(k=1..m) xk(xk)T]−1 Σ(k=1..m) dk xk

Now, multiplying the right-hand side of the above equation by (1/m)(1/m)−1 allows us to express it as

w* = [(1/m) Σ(k=1..m) xk(xk)T]−1 [(1/m) Σ(k=1..m) dk xk]

Thus, if m is large, the averages (1/m)Σ xk(xk)T and (1/m)Σ dk xk become very good approximations of the expectations C = <xxT> and P = <dx>. In fact, one can show that when the size of the training set m is large, the minimum SSE solution converges to the minimum MSE solution. Thus, we have established the equivalence of the minimum SSE and minimum MSE solutions for a large training set.

Finally, in order to minimize the MSE criterion in Equation (3.1.43), one may employ a gradient descent procedure where, instead of the expected gradient in Equation (3.1.44), the instantaneous gradient [(xk)Twk − dk] xk is used. Here, at each learning step the input vector x is drawn at random. This leads to the stochastic process

wk+1 = wk + ρk [dk − (xk)Twk] xk    (3.1.46)

which is the same as the µ-LMS rule in Equation (3.1.35), except for a variable learning rate ρk. The iterative algorithm in Equation (3.1.46) is also known as a stochastic approximation procedure (or Kiefer-Wolfowitz or Robbins-Monro procedure). The criterion function in Equation (3.1.43) is of the form of an expectation and is known as a regression function. It can be shown that if |C| ≠ 0 and ρk satisfies the three conditions

1. ρk > 0    (3.1.47a)

2. Σ(k=1..∞) ρk = ∞    (3.1.47b)

3. Σ(k=1..∞) (ρk)^2 < ∞    (3.1.47c)

then wk converges to w* in Equation (3.1.45) asymptotically in the mean square. For a thorough discussion of stochastic approximation theory, the reader is referred to Wasan (1969).

Example 3.1.1: In this example, we present the results of a set of simulations which should help give some insight into the dynamics of the batch and incremental LMS learning rules. Specifically, we are interested in comparing the convergence behavior of the discrete-time dynamical systems in Equations (3.1.34) and (3.1.35). Consider the training set depicted in Figure 3.1.4 for a simple mapping problem. The ten squares and ten filled circles in this figure are positioned at the points whose coordinates (x1, x2) specify the two components of the input vectors. The squares and circles are to be mapped to the targets +1 and −1, respectively. For example, the left-most square in the figure represents the training pair {[0, 2]T, 1}, and the right-most circle represents the training pair {[2, 1]T, −1}. For the incremental LMS rule, the training examples are selected randomly from the training set. In both simulations, the learning rate (step size) was set to 0.005, and the initial search point w1 was set to 0.

Figure 3.1.5 shows plots of the evolution of the square of the distance between the vector wk and the (computed) minimum SSE solution w*, for batch LMS (dashed line) and incremental LMS (solid line). The batch LMS rule converges to the optimal solution w* in less than 100 steps. Incremental LMS requires more learning steps, on the order of 2,000 steps, to converge to a small neighborhood of w*.

Here, in order to allow for a more meaningful comparison between the two LMS rule versions, one learning step of incremental LMS is taken to mean a full cycle through the 20 samples; i.e., one cycle corresponds to 20 consecutive learning iterations, and w^k represents the weight vector after the completion of the kth learning "cycle." The fluctuations in ||w^k − w*||^2 in the neighborhood of w* are less than 0.02, as can be seen from Figure 3.1.5; this is so because of the small step size used. These results indicate a very similar behavior in the convergence characteristics of incremental and batch LMS learning, but with a relatively faster convergence of the batch LMS rule near w*; this is attributed to the use of more accurate gradient information.

The effect of a deterministic order of presentation of the training examples on the incremental LMS rule is shown by the solid line in Figure 3.1.6. Here, the training examples are presented in a predefined, fixed order, which did not change during training. The same initialization and step size are used as before. For comparison, the simulation result with batch LMS learning is plotted in the same figure (dashed line); the result for the batch LMS rule shown here is identical to the one shown in Figure 3.1.5 (it looks different only because of the present use of a linear scale for the horizontal axis). Both cases show asymptotic convergence toward the optimal solution w*.

Figure 3.1.4. A 20-sample training set used in the simulations associated with Example 3.1.1. Points signified by a square and a filled circle should map into +1 and −1, respectively.

Figure 3.1.5. Plots (learning curves) for the distance square between the search point w^k and the minimum SSE solution w*, generated using two versions of the LMS learning rule. The dashed line corresponds to the batch LMS rule in Equation (3.1.34). The solid line corresponds to the incremental LMS rule in Equation (3.1.35) with a random order of presentation of the training patterns. In both cases, w^1 = 0 and ρ = 0.005 are used. Note the logarithmic scale for the iteration number k.

Figure 3.1.6. Learning curves for the batch LMS (dashed line) and incremental LMS (solid line) learning rules for the data in Figure 3.1.4. The incremental LMS results shown assume a deterministic, fixed order of presentation of the training patterns.
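The batch-versus-incremental comparison in Example 3.1.1 can be sketched in code. The following is a minimal NumPy illustration on hypothetical data (the point coordinates, random seed, and cycle counts are assumptions for illustration, not the book's Figure 3.1.4 set); both variants should approach the computed minimum SSE solution w*:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 20-sample set in the spirit of Figure 3.1.4:
# ten "square" points mapped to +1 and ten "circle" points mapped to -1.
X = np.vstack([rng.normal([0.5, 1.5], 0.3, size=(10, 2)),   # squares -> +1
               rng.normal([1.5, 0.5], 0.3, size=(10, 2))])  # circles -> -1
d = np.array([1.0] * 10 + [-1.0] * 10)

def batch_lms(X, d, rho=0.005, cycles=2000):
    """Batch LMS: one update per full pass, using the summed gradient."""
    w = np.zeros(X.shape[1])
    for _ in range(cycles):
        w += rho * (d - X @ w) @ X        # summed error over all samples
    return w

def incremental_lms(X, d, rho=0.005, cycles=2000):
    """Incremental (mu-)LMS: one update per sample, cycling through the set."""
    w = np.zeros(X.shape[1])
    for _ in range(cycles):
        for x, t in zip(X, d):
            w += rho * (t - x @ w) * x
    return w

# Minimum SSE solution, written for X with samples stored as rows.
w_star = np.linalg.solve(X.T @ X, X.T @ d)
w_b = batch_lms(X, d)
w_i = incremental_lms(X, d)
print(np.linalg.norm(w_b - w_star), np.linalg.norm(w_i - w_star))
```

Batch LMS steps along the summed (more accurate) gradient once per cycle, while incremental LMS applies 20 noisier per-sample steps per cycle, which is consistent with the smoother batch behavior near w* reported in the example.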

Correlation Learning Rule

The correlation rule is derived by starting from the criterion function

J(w) = −d^i y^i,  where y^i = (x^i)^T w,  i = 1, 2, ..., m     (3.1.49)

Note that minimizing J(w) is equivalent to maximizing the correlation between the desired target and the corresponding linear unit's output for all x^i. Employing steepest gradient descent to minimize J(w) leads to the learning rule

w^{k+1} = w^k + ρ d^k x^k     (3.1.50)

By setting ρ to 1 and completing one learning cycle using Equation (3.1.50), we arrive at the weight vector w* given by

w* = Σ_{i=1}^{m} d^i x^i = Xd     (3.1.51)

where X and d are as defined earlier. Note that Equation (3.1.51) leads to the minimum SSE solution in Equation (3.1.38) if X† = X. This is only possible if the training vectors x^k are encoded such that XX^T is the identity matrix (i.e., the x^k's are orthonormal).

Another version of this type of learning is the covariance learning rule. This rule is obtained by steepest gradient descent on the criterion function J(w) = −<(y − <y>)(d − <d>)>, where <y> and <d> are computed averages, over all training pairs, for the unit's output and the desired target, respectively. Covariance learning provides the basis of the cascade-correlation net presented in Section 6.3. Correlation learning is further explored in Chapter 7.

3.1.3 Extension of µ-LMS Rule to Units with Differentiable Activation Functions: Delta Rule

The following rule is similar to the µ-LMS rule except that it allows for units with a differentiable nonlinear activation function f. Figure 3.1.7 illustrates a unit with a sigmoidal activation function; the unit's output is y = f(net), with net defined as the vector inner product net = x^T w. Again, consider the training pairs {x^i, d^i}, i = 1, 2, ..., m, with x^i ∈ R^{n+1} (the last component of x^i equal to 1 for all i) and d^i ∈ [−1, +1]. Performing gradient descent on the instantaneous SSE criterion function J(w) = (1/2)(d^i − y^i)^2, whose gradient is given by

∇J(w) = −(d^i − y^i) f'(net^i) x^i     (3.1.52)

leads to the delta rule

w^{k+1} = w^k + ρ (d^k − y^k) f'(net^k) x^k     (3.1.53)

where y^k = f(net^k) and net^k = (x^k)^T w^k. If f is defined by the hyperbolic tangent activation function f(net) = tanh(β net), then its derivative is f'(net) = β [1 − f^2(net)]. Similarly, for the "logistic" function f(net) = 1/(1 + e^{−β net}), the derivative is f'(net) = β f(net)[1 − f(net)]. Figure 3.1.8 plots f and its derivative f' for the hyperbolic tangent activation function, plotted for a fixed value of β. Note how f asymptotically approaches +1 and −1 in the limit as net approaches +∞ and −∞, respectively.

Figure 3.1.7. A computational unit with a differentiable sigmoidal activation function.

Figure 3.1.8. Hyperbolic tangent activation function f and its derivative f'.
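The delta rule of Equation (3.1.53) for a tanh unit can be sketched as follows. The training data, learning rate, and cycle count here are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy separable data with an appended bias component of 1 (x in R^{n+1}).
X = np.hstack([rng.uniform(-1, 1, size=(40, 2)), np.ones((40, 1))])
d = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)

beta, rho = 1.0, 0.1
w = np.zeros(3)
for _ in range(500):
    for x, t in zip(X, d):
        net = x @ w
        y = np.tanh(beta * net)
        fprime = beta * (1.0 - y ** 2)     # derivative of tanh(beta * net)
        w += rho * (t - y) * fprime * x    # delta rule, Eq. (3.1.53)

acc = np.mean(np.sign(X @ w) == d)
print(acc)
```

Note the f'(net) factor: once net saturates (|net| large), updates become tiny, which is exactly the "flat spot" issue discussed next.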

One disadvantage of the delta learning rule is immediately apparent upon inspection of the graph of f'(net) in Figure 3.1.8. In particular, notice how f'(net) ≈ 0 when net has large magnitude (e.g., |net| > 3); these regions are called "flat spots" of f'. In these flat spots, we expect the delta learning rule to progress very slowly (i.e., very small weight changes even when the error (d − y) is large), because the magnitude of the weight change in Equation (3.1.53) directly depends on the magnitude of f'(net). Since slow convergence results in excessive computation time, it would be advantageous to try to eliminate the flat spot phenomenon when using the delta learning rule. One common flat spot elimination technique involves replacing f' by f' plus a small positive bias ε. In this case, the weight update equation reads as

w^{k+1} = w^k + ρ (d^k − y^k)[f'(net^k) + ε] x^k     (3.1.54)

One of the primary advantages of the delta rule is that it has a natural extension which may be used to train multilayered neural nets. This extension, known as back error propagation, will be discussed in Chapter 5.

3.1.4 Adaptive Ho-Kashyap (AHK) Learning Rules

Hassoun and Song (1992) proposed a set of adaptive learning rules for classification problems as enhanced alternatives to the LMS and perceptron learning rules. Three learning rules, AHK I, AHK II, and AHK III, are derived based on gradient descent strategies on an appropriate criterion function. Two of the proposed learning rules, AHK I and AHK II, are well suited for generating robust decision surfaces for linearly separable problems. The third learning rule, AHK III, extends these capabilities to find "good" approximate solutions for nonlinearly separable problems. The three AHK learning rules preserve the simple incremental nature found in the LMS and perceptron learning rules. They also possess additional processing capabilities, such as the ability to automatically identify critical cluster boundaries and place a linear decision surface in such a way that it leads to enhanced classification robustness.

Consider a two-class {c1, c2} classification problem with m labeled feature vectors (training vectors) {x^i, d^i}, i = 1, 2, ..., m. Assume that x^i belongs to R^{n+1} (with the last component of x^i being a constant bias of value 1) and that d^i = +1 (−1) if x^i ∈ c1 (c2). Then, a single perceptron can be trained to correctly classify the above training pairs if an (n + 1)-dimensional weight vector w is computed which satisfies the following set of m inequalities (the sgn function is assumed to be the perceptron's activation function):

(x^i)^T w > 0 if d^i = +1,  and  (x^i)^T w < 0 if d^i = −1;  i = 1, 2, ..., m     (3.1.55)

Next, if we define a set of m new vectors z^i, i = 1, 2, ..., m, according to

z^i = d^i x^i     (3.1.56)

and we let

Z = [z^1 z^2 ... z^i ... z^m]     (3.1.57)

then Equation (3.1.55) may be rewritten as the single matrix inequality

Z^T w > 0     (3.1.58)

Now, defining an m-dimensional positive-valued margin vector b (b > 0) and using it in Equation (3.1.58), we arrive at the following equivalent form of Equation (3.1.55):

Z^T w = b,  b > 0     (3.1.59)

Thus, the training of the perceptron is now equivalent to solving Equation (3.1.59) for w, subject to the constraint b > 0. Starting with the criterion function J(w, b) = (1/2) ||Z^T w − b||^2, gradient descent may be performed (Slansky and Wassel, 1981) with respect to b and w so that J is minimized subject to the constraint b > 0.

Ho and Kashyap (1965) proposed an iterative algorithm for solving Equation (3.1.59). In the Ho-Kashyap algorithm, the components of the margin vector are first initialized to small positive values, and the pseudo-inverse is used to generate a solution for w (based on the initial guess of b) which minimizes the SSE criterion function ||Z^T w − b||^2:

w = Z† b     (3.1.60)

where Z† = (ZZ^T)^{−1} Z, for m > n + 1. Next, a new estimate for the margin vector is computed by performing the constrained (b > 0) gradient descent

b^{k+1} = b^k + ρ (ε^k + |ε^k|),  with ε^k = Z^T w^k − b^k     (3.1.61)

where |.| denotes the absolute value of the components of the argument vector and b^k is the "current" margin vector. A new estimate of w can now be computed using Equation (3.1.60) and employing the updated margin vector from Equation (3.1.61). This process continues until all the components of ε are zero (or are sufficiently small and positive), which is an indication of linear separability of the training set, or until ε < 0, which is an indication of nonlinear separability of the training set (no solution is found). It can be shown (Ho and Kashyap, 1965; Slansky and Wassel, 1981) that the Ho-Kashyap procedure converges in a finite number of steps if the training set is linearly separable. For simulations comparing the above training algorithm to the LMS and perceptron training procedures, the reader is referred to Hassoun and Clark (1988), Hassoun and Youssef (1989), and Hassoun (1989a).

We will refer to the above algorithm as the direct Ho-Kashyap (DHK) algorithm. The direct synthesis of the w estimate in Equation (3.1.60) involves a one-time computation of the pseudo-inverse of Z. However, such computation can be computationally expensive and requires special treatment when ZZ^T is ill-conditioned (i.e., |ZZ^T| close to zero). An alternative algorithm that is based on gradient descent principles and which does not require the direct computation of Z† can be derived. This derivation is presented next.

The gradient of J with respect to w and b is given by

∇_w J = Z (Z^T w − b)     (3.1.62a)

and

∇_b J = −(Z^T w − b)     (3.1.62b)
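The DHK iteration of Equations (3.1.60) and (3.1.61) can be sketched as follows on a small synthetic linearly separable set; the data, learning rate, and iteration count are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(2)

# Linearly separable toy problem; rows of Z play the role of the z^i
# (so the array below corresponds to Z^T in the text's notation).
X = np.hstack([rng.normal(0, 1, size=(30, 2)), np.ones((30, 1))])
d = np.where(X[:, 0] - X[:, 1] + 0.2 > 0, 1.0, -1.0)
Z = d[:, None] * X                        # z^i = d^i * x^i

b = np.full(30, 0.1)                      # small positive initial margins
Zpinv = np.linalg.pinv(Z)                 # one-time pseudo-inverse computation
rho = 0.9
for _ in range(200):
    w = Zpinv @ b                         # w = Z† b, Eq. (3.1.60)
    eps = Z @ w - b                       # margin error vector
    b = b + rho * (eps + np.abs(eps))     # constrained update, Eq. (3.1.61)

separable = bool(np.all(Z @ w > 0))
print(separable)
```

The margin components only grow (the update adds rho * (eps + |eps|), which is nonnegative), so b stays positive throughout, as the algorithm requires.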

One analytic method for imposing the constraint b > 0 is to replace the gradient in Equation (3.1.62b) by −0.5 (ε + |ε|), with ε as defined in Equation (3.1.61). This leads to the following gradient descent formulation of the Ho-Kashyap procedure:

b^{k+1} = b^k + (ρ1/2)(ε^k + |ε^k|)     (3.1.63a)

and

w^{k+1} = w^k − ρ2 Z (Z^T w^k − b^{k+1})     (3.1.63b)

where the superscripts k and k + 1 represent current and updated values, respectively, and ρ1 and ρ2 are strictly positive constant learning rates. In all of the above Ho-Kashyap learning procedures, the margin values are initialized to small positive values, and the perceptron weights are initialized to zero (or small random) values. If full margin error correction is assumed in Equation (3.1.63a), i.e., ρ1 = 1, convergence can be guaranteed (Duda and Hart, 1973) if 0 < ρ2 < 2/λmax, where λmax is the largest eigenvalue of the positive definite matrix ZZ^T. Because of the requirement that all training vectors z^k (or x^k) be present and included in Z, we will refer to the above procedure as the batch mode adaptive Ho-Kashyap (AHK) procedure.

A completely adaptive Ho-Kashyap procedure for solving Equation (3.1.59) is arrived at by starting from the instantaneous criterion function

J(w, b^i) = (1/2) [(z^i)^T w − b^i]^2

which leads to the following incremental update rules:

b^{i,new} = b^i + (ρ1/2)(ε^i + |ε^i|)     (3.1.64a)

and

w^{new} = w − ρ2 [(z^i)^T w − b^{i,new}] z^i     (3.1.64b)

where ε^i = (z^i)^T w − b^i. Here, b^i represents a scalar margin associated with the x^i input. It can be easily shown that if ρ1 = 0 and b^1 = 1, the incremental learning procedure in Equation (3.1.64) reduces to the µ-LMS learning rule. If full margin error correction (ρ1 = 1) is assumed in Equation (3.1.64a), Equation (3.1.64) reduces to the heuristically derived procedure reported in Hassoun and Clark (1988). An alternative way of writing Equation (3.1.64) is

∆b^i = (ρ1/2)(ε^i + |ε^i|)     (3.1.65a)

and

∆w = ρ2 (∆b^i − ε^i) z^i     (3.1.65b)

where ∆b and ∆w signify the difference between the updated and current values of b and w, respectively.
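A minimal sketch of the incremental adaptive procedure of Equations (3.1.64a)-(3.1.64b), assuming a hypothetical separable data set and illustrative learning rates (the margin initialization and cycle count are simplifications):

```python
import numpy as np

rng = np.random.default_rng(3)

# Separable toy data; z^i = d^i * x^i with a bias component appended.
X = np.hstack([rng.normal(0, 1, size=(30, 2)), np.ones((30, 1))])
d = np.where(X[:, 0] + X[:, 1] > 0.3, 1.0, -1.0)
Z = d[:, None] * X

rho1, rho2 = 0.5, 0.05
w = np.zeros(3)
b = np.full(30, 0.1)                        # per-sample scalar margins, b^i > 0
for _ in range(300):
    for i in range(30):
        eps = Z[i] @ w - b[i]               # instantaneous margin error
        db = 0.5 * rho1 * (eps + abs(eps))  # nonnegative margin change, Eq. (3.1.64a)
        b[i] += db
        w += rho2 * (db - eps) * Z[i]       # weight update, Eq. (3.1.65b)

frac = np.mean((Z @ w) > 0)
print(frac)
```

Because each db is nonnegative, the margins never decrease here; this corresponds to the constrained behavior that the AHK I rule (discussed next) enforces explicitly.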

In the general case of an adaptive margin, the implied constraint b^i > 0 in Equation (3.1.65) can be realized by starting with a positive initial margin and restricting the change ∆b to positive real values. We will refer to this procedure as the AHK I learning rule. Here, it may be noted that the µ-LMS rule in Equation (3.1.35) can be written as Equation (3.1.65b) with b^i held fixed at +1 (i.e., ∆b^i = 0). An alternative, more flexible way to realize the constraint is to allow both positive and negative changes in ∆b, except for the cases where a decrease in b^i would result in a negative margin. This modification results in the alternative AHK II learning rule [Equations (3.1.66a) and (3.1.66b)], in which the margin is updated by ∆b^i = ρ1 ε^i whenever this change keeps b^i positive, and ∆b^i = 0 otherwise, with ∆w given as in Equation (3.1.65b). Another variation results in the AHK III rule [Equations (3.1.67a) and (3.1.67b)], which is appropriate for both linearly separable and nonlinearly separable problems. Here, ∆w is set to 0 for those training vectors whose margin update would result in a negative margin. The advantages of the AHK III rule are that (1) it is capable of adaptively identifying difficult-to-separate class boundaries and (2) it uses such information to discard nonseparable training vectors and speed up convergence (Hassoun and Song, 1992). Hassoun and Song (1992) also showed a sufficient condition on the learning rates ρ1 and ρ2 for the convergence of the AHK rules.

Example 3.1.2: In this example, the perceptron, LMS, and AHK learning rules are compared in terms of the quality of the solutions they generate. Consider the simple two-class linearly separable problem shown in Figure 3.1.9. Four examples of linearly separable solutions are shown as solid lines in the figure. These solutions are generated using the perceptron learning rule, with varying order of input vector presentations. For comparison purposes, the µ-LMS rule of Equation (3.1.35) is used to obtain the solution shown as a dashed line in Figure 3.1.9; here, the initial weight vector was set to 0 and a learning rate ρ = 0.005 is used. Note that this solution is not one of the linearly separable solutions for this problem. Finally, it should be noted that the most robust solution, in the sense of tolerance to noisy input, is the one shown as a dotted line in Figure 3.1.9. This robust solution was in fact automatically generated by the AHK I learning rule of Equation (3.1.65). The reader is invited to apply the AHK III rule, as in Problem 3.1.7, for gaining insight into the dynamics and separation behavior of this learning rule.

Figure 3.1.9. LMS-generated decision boundary (dashed line) for a two-class linearly separable problem. The dotted line is the solution generated by the AHK I rule. For comparison, four solutions generated using the perceptron learning rule are shown (solid lines).

3.1.5 Other Criterion Functions

The SSE criterion function in Equation (3.1.32) is not the only possible choice. In general, any differentiable function that is minimized upon setting y^i = d^i, for i = 1, 2, ..., m, could be used. We have already seen other alternative functions, such as the ones in Equations (3.1.20), (3.1.24), and (3.1.25). One possible generalization of SSE is the Minkowski-r criterion function (Hanson and Burr, 1988) given by

J(w) = (1/r) Σ_{i=1}^{m} |d^i − y^i|^r     (3.1.68)

or its instantaneous version

J(w) = (1/r) |d^i − y^i|^r     (3.1.69)

Figure 3.1.10 shows a plot of |d − y|^r for r = 1, 2, and 20. A small r gives less weight for large deviations and tends to reduce the influence of the outermost points in the input space during learning. For r → ∞, a supremum error measure is approached. The general form of the gradient of this criterion function is given by

∇J(w) = −|d^i − y^i|^{r−1} sgn(d^i − y^i) f'(net^i) x^i     (3.1.70)

Note that for r = 2 this reduces to the gradient of the SSE criterion function given by Equation (3.1.52). If r = 1, the criterion function in Equation (3.1.68) is known as the Manhattan norm; then J(w) = |d − y| with the gradient (note that the gradient of J(w) does not exist at the solution points d = y)

∇J(w) = −sgn(d^i − y^i) f'(net^i) x^i     (3.1.71)

Figure 3.1.10. A family of instantaneous Minkowski-r criterion functions.

The r = 1 criterion function is known as robust regression, since it is more robust to an outlier training sample than r = 2. If the distribution of the training patterns has a heavy tail, such as a Laplace-type distribution, r = 1 will be a better criterion function choice. Values 1 < r < 2 are appropriate to use for pseudo-Gaussian distributions where the distribution tails are more pronounced than in the Gaussian. On the other hand, it can be shown that, for a linear unit with normally distributed inputs, r = 2 is an appropriate choice in the sense of both minimum SSE and minimum probability of prediction error (maximum likelihood). The proof is as follows.

Consider the training pairs {x^i, d^i}, i = 1, 2, ..., m, and assume that the vectors x^i are drawn randomly and independently from a normal distribution. A linear unit with a fixed but unknown weight vector w outputs the estimate y^i = (x^i)^T w when presented with input x^i. Since a weighted sum of independent normally distributed random variables is itself normally distributed [e.g., see Mosteller et al. (1970)], y^i is normally distributed; thus, the prediction error d^i − y^i is normally distributed with mean zero and some variance σ^2. This allows us to express the conditional probability density for observing target d^i, given w, upon the presentation of x^i, as

p(d^i | w) = (2πσ^2)^{−1/2} exp[−(d^i − (x^i)^T w)^2 / (2σ^2)]     (3.1.72)

This function is also known as the likelihood of w with respect to observation d^i. The maximum likelihood estimate of w is that value of w which maximizes the probability of occurrence of observation d^i for input x^i. The likelihood of w with respect to the whole training set is the joint distribution

L(w) = Π_{i=1}^{m} p(d^i | w)     (3.1.73)

Maximizing the above likelihood is equivalent to maximizing the log-likelihood function

ln L(w) = −(1/(2σ^2)) Σ_{i=1}^{m} (d^i − (x^i)^T w)^2 − (m/2) ln(2πσ^2)     (3.1.74)

Since the term (m/2) ln(2πσ^2) is a constant, maximizing the log-likelihood function in Equation (3.1.74) is equivalent to minimizing the SSE criterion

J(w) = (1/2) Σ_{i=1}^{m} (d^i − (x^i)^T w)^2     (3.1.75)

Therefore, with the assumption of a linear unit (ADALINE) with normally distributed inputs, the SSE criterion is optimal in the sense of minimizing prediction error. However, if the input distribution is non-Gaussian, then the SSE criterion will not possess maximum likelihood properties. See Mosteller and Tukey (1980) for a more thorough discussion of the maximum likelihood estimation technique.
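The criterion functions above differ only in how the error d − y is weighted inside the gradient. The following small sketch implements the instantaneous Minkowski-r gradient of Equation (3.1.70) and checks it against the r = 2 (SSE) special case for a linear unit (where f'(net) = 1); the numerical values are illustrative:

```python
import numpy as np

def minkowski_r_grad(d, y, x, r):
    # Gradient of J = (1/r) |d - y|^r for a unit with f'(net) = 1:
    # dJ/dw = -|d - y|^(r-1) * sgn(d - y) * x, per Eq. (3.1.70).
    err = d - y
    return -np.abs(err) ** (r - 1) * np.sign(err) * x

x = np.array([0.5, -1.0, 1.0])
d, w = 1.0, np.array([0.2, 0.1, -0.3])
y = x @ w                       # linear unit: y = net
g2 = minkowski_r_grad(d, y, x, r=2)
g_sse = -(d - y) * x            # SSE gradient for the same unit
print(np.allclose(g2, g_sse))
```

For r = 1 the magnitude factor |d − y|^{r−1} becomes 1, leaving only the sign of the error, which is why large deviations (outliers) no longer dominate the update.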

Another criterion function that can be used (Baum and Wilczek, 1988; Hopfield, 1987; Solla et al., 1988) is the instantaneous relative entropy error measure (Kullback, 1959) defined by

J(w) = (1/2) [(1 + d) ln((1 + d)/(1 + y)) + (1 − d) ln((1 − d)/(1 − y))]     (3.1.76)

where d belongs to the open interval (−1, +1) and, as before, y = f(net) = tanh(β net). Note that J(w) ≥ 0, and if y = d then J(w) = 0. The gradient of Equation (3.1.76) is

∇J(w) = −β (d − y) x     (3.1.77)

The factor f'(net) appearing in Equations (3.1.53) and (3.1.70) is missing from Equation (3.1.77). This eliminates the flat spot encountered in the delta rule and makes the training here more like µ-LMS (note, however, that y here is given by y = tanh(β net)). This entropy criterion is "well formed" in the sense that gradient descent over such a function will result in a linearly separable solution, if one exists (Wittner and Denker, 1988; Hertz et al., 1991). On the other hand, gradient descent on the SSE criterion function does not share this property, since it may fail to find a linearly separable solution, as demonstrated in Example 3.1.2.

In order for a gradient descent search to find a solution w* in the desired linearly separable region, provided that such a region exists, we need to use a well-formed criterion function. Consider the following general criterion function

J(w) = Σ_{i=1}^{m} g(s^i)     (3.1.78)

where s^i = d^i (x^i)^T w. Formally, we say J is "well-formed" if g(s) is differentiable and satisfies the following conditions (Wittner and Denker, 1988):

1. For all s, g'(s) ≤ 0; i.e., g does not push in the wrong direction.

2. There exists ε > 0 such that g'(s) ≤ −ε for all s ≤ 0; i.e., g keeps pushing if there is a misclassification.

3. g(s) is bounded below.

For a single unit with weight vector w, it can be shown (Wittner and Denker, 1988) that if the criterion function is well-formed, then gradient descent is guaranteed to enter the region of linearly separable solutions w*, provided that such a region exists.

Example 3.1.3: The perceptron criterion function in Equation (3.1.20) is a well-formed criterion function, since it satisfies the above conditions:

1. For s ≤ 0, g(s) = −s and thus g'(s) = −1; for s > 0, g(s) = 0 and g'(s) = 0. Hence g'(s) ≤ 0 for all s, and g does not push in the wrong direction.

2. g'(s) = −1 for all s ≤ 0; thus ε = 1 > 0 exists.

3. g(s) ≥ 0 for all s, so g(s) is bounded below.

3.1.6 Extension of Gradient-Descent-Based Learning to Stochastic Units

The linear threshold gate, perceptron, and ADALINE are examples of deterministic units: for a given input, the unit always responds with the same output. On the other hand, a stochastic unit, as depicted in Figure 3.1.11, has a binary-valued output which is a probabilistic function of the input activity net:

y = +1 with probability P(net);  y = −1 with probability 1 − P(net)     (3.1.79)

One possible probability function is

P(net) = 1 / (1 + e^{−2β net})     (3.1.80)

Note that with this choice the expected value of the unit's output is <y> = tanh(β net). Stochastic units are the basis for reinforcement learning networks, as is shown in the next section. Also, these units allow for a natural mapping of optimal stochastic learning and retrieval methods onto neural networks, as discussed in Chapter 8.

Figure 3.1.11. A stochastic unit.

Let us now define an SSE criterion function in terms of the mean output of the stochastic unit:
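A stochastic unit with the probability function of Equation (3.1.80) can be sampled as follows; averaging many samples should recover <y> = tanh(β net). The input, weights, and sample count here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
beta = 1.0

def stochastic_unit(x, w, rng):
    """Binary stochastic unit: y = +1 with probability P(net), else -1."""
    net = x @ w
    p = 1.0 / (1.0 + np.exp(-2.0 * beta * net))   # P(net), Eq. (3.1.80)
    return 1.0 if rng.random() < p else -1.0

x = np.array([0.5, 1.0])
w = np.array([0.8, -0.2])
samples = np.array([stochastic_unit(x, w, rng) for _ in range(20000)])
mean_y, expected = samples.mean(), np.tanh(beta * (x @ w))
print(mean_y, expected)
```

The identity <y> = 2P(net) − 1 = tanh(β net) is what links this probability function to the tanh unit of the delta rule.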

J(w) = (1/2) (d^k − <y^k>)^2     (3.1.81)

Employing gradient descent, and noting that <y> = tanh(β net), we arrive at the weight update

∆w^k = ρ (d^k − <y^k>) β [1 − (<y^k>)^2] x^k     (3.1.82)

In the incremental update mode, we have the following update rule:

w^{k+1} = w^k + ρ (d^k − <y^k>) β [1 − (<y^k>)^2] x^k     (3.1.83)

This learning rule is identical in form to the delta learning rule given in Equation (3.1.53), which used a deterministic unit with an activation function f(net) = tanh(β net). Therefore, in an average sense, the stochastic unit learning rule in Equation (3.1.83) leads to a weight vector which is equal to that obtained using the delta rule for a deterministic unit with a hyperbolic tangent activation function.

3.2 Reinforcement Learning

Reinforcement learning is a process of trial and error which is designed to maximize the expected value of a criterion function known as a "reinforcement signal." The basic idea of reinforcement learning has its origins in psychology, in connection with experimental studies of animal learning (Thorndike, 1911): if an action is followed by an "improvement" in the state of affairs, then the tendency to produce that action is strengthened, i.e., reinforced. Otherwise, the tendency of the system to produce that action is weakened (Barto and Singh, 1991; Sutton et al., 1991).

Consider a training set of the form {x^k, r^k}, k = 1, 2, ..., m, where r^k is an evaluative signal (normally r^k ∈ {−1, +1}) which is supplied by a "critic." The idea here is not to associate x^k with r^k as in supervised learning. Rather, r^k is a reinforcement signal which informs the unit being trained about its performance on the input x^k; r^k evaluates the "appropriateness" of the unit's output y^k, due to the input x^k, but gives no indication of what y^k should be. It is therefore important for the unit to be stochastic, so that a mechanism of exploration of the output space is present. One may view supervised learning in a stochastic unit as an extreme case of reinforcement learning, where the output of the unit is binary and there is one correct output for each input; in this case, r^k becomes the desired target d^k, and we may use the gradient descent learning rule of Equation (3.1.83) to train the stochastic unit. In general, the reinforcement signal itself may be stochastic, such that the pair {x^k, r^k} only provides the "probability" of positive reinforcement. The most general extreme for reinforcement learning (and the most difficult) is where both the reinforcement signal and the input patterns depend arbitrarily on the past history of the stochastic unit's output. An example would be the stabilization of an unstable dynamical system, or even a game of chess where the reinforcement signal arrives at the end of a sequence of player moves.

3.2.1 Associative Reward-Penalty Reinforcement Learning Rule

We now present a reinforcement learning rule due to Barto and Anandan (1985), which is known as the associative reward-penalty (Arp) algorithm. We discuss it here in the context of a single stochastic unit. Motivated by Equation (3.1.83), we may express the Arp reinforcement rule as

w^{k+1} = w^k + ρ^k (d^k − <y^k>) β [1 − (<y^k>)^2] x^k     (3.2.1)

where

d^k = y^k if r^k = +1,  and  d^k = −y^k if r^k = −1     (3.2.2)

and

ρ^k = ρ+ if r^k = +1,  and  ρ^k = ρ− if r^k = −1     (3.2.3)

with ρ+ >> ρ− > 0. The setting of d^k according to Equation (3.2.2) guides the unit to do what it just did if y^k is "good" and to do the opposite if not (Widrow et al., 1973). The dependence of ρ^k on r^k in Equation (3.2.3) makes the dynamics of w^k in Equation (3.2.1) substantially different from that of w^k in the supervised stochastic learning rule in Equation (3.1.83). The derivative term in Equation (3.2.1) may be eliminated without affecting the general behavior of this learning rule. When learning converges, the unit's output approaches the state providing the largest average reinforcement on the training set, and the output probabilities approach 0 or 1, making the unit effectively deterministic.

One variation of Arp (Barto and Jordan, 1987; Ackley and Littman, 1990) utilizes a continuous-valued, or graded, reinforcement signal. Another variation uses a single learning rate and has the simple form

w^{k+1} = w^k + ρ r^k (y^k − <y^k>) x^k     (3.2.5)

This latter rule is more amenable to theoretical analysis, as is shown in Chapter 4, where it will be shown that the rule tends to maximize the average reinforcement signal; it can also be shown that the resulting learning rule corresponds to steepest descent on the relative entropy criterion function. For theoretical and practical considerations, reinforcement learning speed may be improved if batch mode training is used (Barto and Jordan, 1987). Here, a given pattern x^k is presented several times, and the accumulation of all the weight changes is used to update w^k; then pattern x^{k+1} is presented several times, and so on. For an overview treatment of the theory of reinforcement learning, see Barto (1985) and Williams (1992); see also the Special Issue on Reinforcement Learning edited by Sutton (1992).
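The Arp updates of Equations (3.2.1)-(3.2.3) can be sketched on a hypothetical two-pattern task in which a critic rewards one output sign per pattern. The patterns, rates, and trial count are assumptions for illustration, and the derivative term is dropped, as the text permits:

```python
import numpy as np

rng = np.random.default_rng(5)
beta, rho_plus, rho_minus = 1.0, 0.1, 0.01

# Hypothetical task: the critic rewards output +1 for x1 and -1 for x2.
patterns = [np.array([1.0, 0.5]), np.array([-0.5, 1.0])]
correct = [1.0, -1.0]

w = np.zeros(2)
for _ in range(5000):
    i = rng.integers(2)
    x, net = patterns[i], patterns[i] @ w
    prob = 1.0 / (1.0 + np.exp(-2.0 * beta * net))
    y = 1.0 if rng.random() < prob else -1.0     # stochastic unit output
    r = 1.0 if y == correct[i] else -1.0         # critic: reward or penalty
    target = y if r > 0 else -y                  # Eq. (3.2.2)
    rho = rho_plus if r > 0 else rho_minus       # Eq. (3.2.3), rho+ >> rho-
    w += rho * (target - np.tanh(beta * net)) * x

avg_r = np.mean([1.0 if np.sign(pat @ w) == c else -1.0
                 for pat, c in zip(patterns, correct)])
print(avg_r)
```

Note the asymmetry: rewarded actions are strongly reinforced (rho+), while penalized actions are only weakly pushed away (rho-), which preserves exploration early in training.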

3.3 Unsupervised Learning

In unsupervised learning, there is no teacher signal. We are given a training set {x^i, i = 1, 2, ..., m} of unlabeled vectors in R^n. The objective is to categorize or discover features or regularities in the training data. In some cases, the x^i's must be mapped into a lower-dimensional set of patterns such that any topological relations existing among the x^i's are preserved among the new set of patterns. Normally, the success of unsupervised learning hinges on some appropriately designed network which encompasses a task-independent criterion of the quality of the representation that the network is required to learn; the weights of the network are to be optimized with respect to this criterion. Our interest here is in training networks of simple units to perform the above tasks. In the remainder of this chapter, some basic unsupervised learning rules for a single unit and for simple networks are introduced. The following three classes of unsupervised rules are considered: Hebbian learning, competitive learning, and self-organizing feature map learning. Hebbian learning is treated first; competitive learning and self-organizing feature map learning are covered in Sections 3.4 and 3.5, respectively.

3.3.1 Hebbian Learning

The rules considered in this section are motivated by the classical Hebbian synaptic modification hypothesis (Hebb, 1949). Hebb suggested that biological synaptic efficacies (w) change in proportion to the correlation between the firing of the pre- and post-synaptic neurons, x and y, respectively. This may be stated formally as (Stent, 1973; Changeux and Danchin, 1976)

w^{k+1} = w^k + ρ y^k x^k     (3.3.1)

where ρ > 0 and the unit's output is y = x^T w. Let us now assume that the input vectors are drawn from an arbitrary probability distribution p(x), and that the network being trained consists of a single unit. At each time k, we present a vector x, randomly drawn from p(x), and employ the Hebbian rule in Equation (3.3.1) to update the weight vector w. The expected weight change can be evaluated by averaging Equation (3.3.1) over all inputs x. This gives

<∆w^k> = ρ <y x>     (3.3.2)

or, assuming x and w are statistically independent,

<∆w^k> = ρ <x x^T> w = ρ C w     (3.3.3)

where C = <x x^T> is known as the autocorrelation matrix. Since, at equilibrium, ρ C w = 0, and assuming |C| ≠ 0, w* = 0 is thus the only equilibrium state of the averaged learning dynamics.

The terms on the main diagonal of C are the mean squares of the input components, and the cross terms are the cross-correlations among the input components:

C = <x x^T> =
[ <x1 x1>  <x1 x2>  ...  <x1 xn>
  <x2 x1>  <x2 x2>  ...  <x2 xn>
  ...
  <xn x1>  <xn x2>  ...  <xn xn> ]     (3.3.4)

C is a Hermitian matrix (real and symmetric); its eigenvalues λ_i, i = 1, 2, ..., n, are positive real or zero, and it has orthogonal eigenvectors c^(i). From the definition of an eigenvector, each c^(i) satisfies the relation C c^(i) = λ_i c^(i). It can be shown that the solution w* = 0 is not stable (see Section 4.3.1); thus, the Hebbian rule in Equation (3.3.1) is unstable and drives w to infinite magnitude, with a direction parallel to that of the eigenvector of C with the largest eigenvalue. It will also be shown, in Section 4.3.1, that this learning rule tends, in an average sense, to maximize the mean square of y; assuming that y has zero mean, this amounts to maximizing the variance of the output of a linear unit. A zero-mean y can be achieved if the unit inputs are independent random variables with zero mean.

In the following, we briefly describe additional stable Hebbian-type learning rules. The detailed analysis of these rules is deferred to Chapter 4. One way to prevent the divergence of the Hebbian learning rule in Equation (3.3.1) is to normalize ||w|| to 1 after each learning step (von der Malsburg, 1973; Rubner and Tavan, 1989).

3.3.2 Oja's Rule

An alternative approach for stabilizing the Hebbian rule is to modify it by adding a weight decay proportional to y^2 (Oja, 1982). This results in Oja's rule:

w^{k+1} = w^k + ρ y^k (x^k − y^k w^k)     (3.3.6)

Oja's rule converges in the mean to a state w* which maximizes the mean value of y^2, subject to the constraint ||w|| = 1. In other words, this rule is driven to maximize the variance of the output of a linear unit. It can also be shown that the solution w* is the principal eigenvector (the one with the largest corresponding eigenvalue) of C (Oja, 1982; Oja and Karhunen, 1985). The analysis of this rule is covered in Chapter 4.

3.3.3 Yuille et al. Rule

Other modifications to the Hebbian rule have been proposed to prevent divergence. Yuille et al. (1989) proposed the rule

Δwᵏ = η (yᵏxᵏ − ||wᵏ||² wᵏ)   (3.3.7)

It can be shown (see Problem 3.3.4) that, in an average sense, the weight vector update in Equation (3.3.7) is given by gradient descent on the criterion function

J(w) = −½ wᵀCw + ¼ ||w||⁴   (3.3.8)

In Chapter 4, it is shown that Equation (3.3.7) converges to a vector w* that points in the same (or opposite) direction of the principal eigenvector of C, and whose norm is given by the square root of the largest eigenvalue of C.

Example 3.3.1: In this example, the convergence behavior of Oja's and Yuille et al.'s learning rules is demonstrated on zero-mean random data. Here, the data (training set) consists of sixty 15-dimensional vectors whose components are generated randomly and independently from a uniform distribution in the range [−0.5, +0.5]. During learning, the training vectors are presented in a fixed, cyclic order, and a learning rate η = 0.01 is used. Figures 3.3.1 and 3.3.2 show the behavior of the norm and the direction (cosine of the angle between w and the eigenvector of C with the largest eigenvalue) of the weight vector w as a function of the iteration number k, respectively. This data set leads to a correlation matrix having its two largest eigenvalues equal to 0.1578 and 0.1515, respectively. In this particular case, the length of wᵏ (practically) converges after 3,000 iterations; Oja's and Yuille et al.'s rules converge to their theoretically predicted norms of 1 and √0.1578 ≈ 0.40, respectively. Note that Figure 3.3.2 only shows the evolution of cos θ over the first 3,600 iterations. The figure shows an initial low overlap between the starting weight vector and the principal eigenvector of the data correlation matrix. This overlap increases slowly at first, but then increases fast towards 1. As the direction of w approaches that of the principal eigenvector (i.e., cos θ approaches 1), the convergence becomes very slow. This is due to the uniform nature of the data, which does not allow the principal eigenvector to dominate all other eigenvectors. Thus, a strong competition emerges among several eigenvectors, each attempting to align wᵏ along its own direction, the end result being the slow convergence of cos θ.

The second set of simulations involves a training set of sixty 15-dimensional vectors drawn randomly from a normal distribution with zero mean and variance of 1, but with a learning rate of 0.02. Here, the largest two eigenvalues of C are equal to 2.1172 and 1.6299, respectively; thus, this data leads to a correlation matrix C with a dominating eigenvector. Figures 3.3.3 and 3.3.4 show a comparable behavior for both rules, as compared to the earlier simulation. The two rules exhibit an almost identical weight vector direction evolution, as depicted in Figure 3.3.4; the existence of a dominating eigenvector for C is responsible for the relatively faster convergence of cos θ, relative to that of the previous data set. Figure 3.3.3 shows a better behaved convergence for Oja's rule as compared to Yuille et al.'s rule; the latter rule exhibits an oscillatory behavior in ||wᵏ|| about its theoretical asymptotic value. Finally, we note that the oscillatory behavior in Figures 3.3.3 and 3.3.4 can be significantly reduced by resorting to smaller constant learning rates or by using a decaying learning rate schedule. This, however, leads to slower convergence speeds for both ||wᵏ|| and cos θ.

Figure 3.3.1. Weight vector magnitude vs. time for Oja's rule (solid curve) and for the Yuille et al. rule (dashed curve) with η = 0.01. The training set consists of sixty 15-dimensional real-valued vectors whose components are generated according to a uniform random distribution in the range [−0.5, +0.5].

Figure 3.3.2. Evolution of the cosine of the angle between the weight vector and the principal eigenvector of the correlation matrix, for Oja's rule (solid curve) and for the Yuille et al. rule (dashed curve) with η = 0.01. The training set is the same as for Figure 3.3.1.

Figure 3.3.3. Weight vector magnitude vs. time for Oja's rule (solid curve) and for the Yuille et al. rule (dashed curve) with η = 0.02. The training set consists of sixty 15-dimensional real-valued vectors whose components are generated according to N(0, 1).

Figure 3.3.4. Evolution of the cosine of the angle between the weight vector and the principal eigenvector of the correlation matrix, for Oja's rule (solid curve) and for the Yuille et al. rule (dashed curve) with η = 0.02 (the dashed line is hard to see because it overlaps with the solid line). The training set is the same as for Figure 3.3.3.
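The two rules above can be simulated directly. The following short Python sketch (ours, not part of the original text; the data and learning-rate values follow the first data set of Example 3.3.1, but the variable names and iteration count are our own choices) trains a single linear unit with both rules and checks the predicted norms.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-0.5, 0.5, size=(60, 15))   # sixty 15-d training vectors
C = X.T @ X / len(X)                        # sample autocorrelation matrix
lam, V = np.linalg.eigh(C)
c1 = V[:, -1]                               # principal eigenvector of C

eta = 0.01
w_oja = rng.normal(scale=0.1, size=15)
w_yui = w_oja.copy()
for k in range(6000):                       # fixed, cyclic presentation
    x = X[k % len(X)]
    y = w_oja @ x
    w_oja += eta * y * (x - y * w_oja)      # Oja's rule
    y = w_yui @ x
    w_yui += eta * (y * x - (w_yui @ w_yui) * w_yui)   # Yuille et al. rule

print(np.linalg.norm(w_oja))                # approaches 1
print(np.linalg.norm(w_yui))                # approaches sqrt(lambda_1)
print(abs(w_oja @ c1) / np.linalg.norm(w_oja))   # cos of angle to c(1)
```

Note how the cosine converges slowly here, reflecting the weak eigenvalue gap of the uniform data discussed above.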

3.3.4 Linsker's Rule

Linsker (1986, 1988) proposed and studied a general unsupervised learning rule [Equation (3.3.9)]: a Hebbian rule for the linear unit y = wᵀx, augmented with constant bias parameters bᵢ and d, and subject to bounding constraints w⁻ ≤ wᵢ ≤ w⁺ on the weights. It can be shown that this rule is unstable; with the bounding restriction in place, however, each weight's final state will be clamped at a boundary value. If we assume that the components of x are random and are drawn from the same probability distribution with mean μ, then averaging Equation (3.3.10) gives the ith average weight change [Equations (3.3.11) and (3.3.12)], expressed in terms of Cᵢ, the ith row of C, the mean μ, and an n-dimensional column vector of ones, 1. If we set a > 0, bᵢ = a for i = 1, 2, ..., n, and d = 0 in Equation (3.3.10), then Equation (3.3.11) reduces to the simplified form

⟨Δw⟩ = η Cw   (3.3.13)

subject to the constraint w⁻ ≤ wᵢ ≤ w⁺. Equation (3.3.13) has the following associated criterion function:

J(w) = ½ wᵀCw   (3.3.14)

Therefore, by noting that wᵀCw is the mean square of the unit's output activity, Equation (3.3.13) corresponds to maximizing ⟨y²⟩ subject to the bounding constraints on the weights. If η is large enough, then the weights are driven to the boundary values; with w⁺ = +1, w⁻ = −1, and n odd, the rule converges to a weight vector with some of the weights equal to w⁺ and the remaining weights equal to w⁻. The weight vector configuration is such that ⟨y²⟩ is maximized. For the n even case, one of the weights at w⁻ will be pushed towards zero so that the constraint is maintained.
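A minimal sketch (ours, with assumed data and parameter values — not from the text) of the simplified averaged rule ⟨Δw⟩ = ηCw of Equation (3.3.13), with the weights clipped to the interval [w⁻, w⁺]. Because the unconstrained dynamics are unstable, the weights are driven to, and clamped at, the bounds, maximizing the output variance wᵀCw subject to the constraints:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
C = X.T @ X / len(X)                     # input autocorrelation matrix

eta, w_minus, w_plus = 0.05, -1.0, 1.0   # bounds on the weights (assumed)
w = rng.uniform(-0.1, 0.1, size=5)
for _ in range(500):
    # averaged update followed by clamping to the bounding constraints
    w = np.clip(w + eta * C @ w, w_minus, w_plus)

print(w, w @ C @ w)   # weights driven to the bounds; large output variance
```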

3.3.5 Hebbian Learning in a Network Setting: Principal Component Analysis (PCA)

Hebbian learning can lead to more interesting computational capabilities when applied to a network of units. In this section, we apply unsupervised Hebbian learning in a simple network setting to extract the m principal directions of a given set of data (i.e., the leading eigenvector directions of the input vectors' autocorrelation matrix). Amari (1977a) and later Linsker (1988) pointed out that principal component analysis (PCA) is equivalent to maximizing the information content in the outputs of a network of linear units.

The aim of PCA is to extract m normalized orthogonal vectors uᵢ, i = 1, 2, ..., m, m < n, in the input space that account for as much of the data's variance as possible. Subsequently, the n-dimensional input data (vectors x) may be transformed to a lower m-dimensional space without losing essential intrinsic information. This can be done by projecting the input vectors onto the m-dimensional subspace spanned by the extracted orthogonal vectors uᵢ, according to the inner products xᵀuᵢ. Since m is smaller than n, the data undergoes a dimensionality reduction. This, in turn, makes subsequent processing of the data (e.g., clustering or classification) much easier to handle.

The following is an outline of a direct optimization-based method for determining the uᵢ vectors. Let x ∈ Rⁿ be an input vector generated according to a zero-mean probability distribution p(x), and let u denote a vector in Rⁿ. First, the objective is to find the solution(s) u* which maximize ⟨(xᵀu)²⟩, the variance of the projection xᵀu with respect to p(x), subject to ||u|| = 1. Note that the projection xᵀu is the linear sum of n zero-mean random variables, and hence is itself a zero-mean random variable. Equivalently, we are interested in finding the maxima w* of the criterion function

J(w) = ⟨(xᵀw)²⟩ / ||w||²   (3.3.15)

from which the unity-norm solution(s) u* can be computed as u* = w*/||w*||, with ||w*|| ≠ 0. By noting that (xᵀw)² = wᵀxxᵀw, and recalling that C = ⟨xxᵀ⟩ is the autocorrelation matrix, Equation (3.3.15) may be expressed as

J(w) = (wᵀCw)/(wᵀw)   (3.3.16)
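The criterion of Equation (3.3.16) is the classical Rayleigh quotient, and its maximization can be checked numerically. The following small Python sketch (ours, on synthetic data) verifies that J attains its maximum, the largest eigenvalue of C, at the principal eigenvector:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
X -= X.mean(axis=0)                      # work with zero-mean data
C = X.T @ X / len(X)                     # autocorrelation matrix C

lam, V = np.linalg.eigh(C)               # eigenvalues in ascending order
c1 = V[:, -1]                            # principal eigenvector c(1)

def J(w):                                # criterion of Equation (3.3.16)
    return (w @ C @ w) / (w @ w)

print(J(c1) - lam[-1])                   # = 0: maximum value is lambda_1
for _ in range(5):                       # random directions score no higher
    u = rng.normal(size=4)
    assert J(u) <= lam[-1] + 1e-9
```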

The extreme points of J(w) are the solutions to ∇J(w) = 0, which gives

(wᵀw) Cw − (wᵀCw) w = 0,  or  Cw = J(w) w   (3.3.17)

The solutions to Equation (3.3.17) are w = ac(i), i = 1, 2, ..., n, a ∈ R, where c(i) denotes the ith eigenvector of C. In other words, the maxima w* of J(w) must point in the same or opposite direction of one of the eigenvectors of C. Upon careful examination of the Hessian of J(w) [see Chapter 4], we find that the only maximum exists at w* = ac(1) for some finite real-valued a; in this case, J(w*) = λ₁. Thus, the variance of the projection xᵀu is maximized for u = u₁ = ±c(1). Next, we repeat the above maximization of J(w) in Equation (3.3.15), but with the additional requirement that the vector w be orthogonal to c(1). Here, it can be readily seen that the maximum of J(w) is equal to λ₂ and occurs at w* = ac(2); hence u₂ = c(2). Similarly, the solution u₃ = c(3) maximizes J under the constraint that u₃ be orthogonal to u₁ and u₂, and so on. Continuing this way, we arrive at the m principal directions u₁ through u_m. These vectors are ordered so that u₁ points in the direction of the maximum data variance, the second vector u₂ points in the direction of maximum variance in the subspace orthogonal to u₁, and so on. The projections xᵀuᵢ, i = 1, 2, ..., m, are called the principal components of the data; these projections are equivalent to the ones obtained by the classical Karhunen-Loève transformation of statistics (Karhunen, 1947; Loève, 1963). If the data has a non-zero mean, then we subtract the mean from it before extracting the principal components.

Note that the previous Hebbian rules discussed in the single linear unit setting all maximize J(w), and hence they extract the first principal component of the zero-mean data. Here, however, an m-output network is desired which is capable of incrementally and efficiently computing the first m principal components of a given set of vectors in Rⁿ.

PCA in a Network of Interacting Units

Consider a network of m linear units, each receiving the input x and simultaneously producing the outputs yᵢ = wᵢᵀx, where wᵢ is the weight vector of the ith unit. Oja (1989) extended his learning rule in Equation (3.3.6) to the m-unit network according to (we will drop the k superscript here for clarity):

Δwᵢⱼ = η yᵢ (xⱼ − Σ(k = 1 to m) wₖⱼ yₖ)   (3.3.18)

where wᵢⱼ is the jth weight for unit i and yᵢ is its output. Another rule, proposed by Sanger (1989), is given by:

Δwᵢⱼ = η yᵢ (xⱼ − Σ(k = 1 to i) wₖⱼ yₖ)   (3.3.19)

The above two rules are identical to Oja's rule in Equation (3.3.6) for m = 1; they only differ by the upper limit on the summation. Equations (3.3.18) and (3.3.19) require communication between the units in the network during learning. Equation (3.3.18) requires the jth input signal xⱼ, as well as the output signals yₖ of all units, to be available when adjusting the jth weight of unit i. Each signal yₖ is modulated by the jth weight of unit k and is fed back as an inhibitory input to unit i. Thus, unit i can be viewed as employing the original Hebbian rule of Equation (3.3.1) to update its jth weight, but with an effective input signal whose jth component is given by the term inside parentheses in Equation (3.3.18). Sanger's rule employs similar feedback, except that the ith unit only receives modulated output signals generated by units with index k ≤ i.

Both rules converge to wᵢ's that are orthogonal. However, Oja's rule does not generally find the eigenvector directions of C; rather, in this case the m weight vectors converge to span the same subspace as the first m eigenvectors of C. For m > 1, the weight vectors depend on the initial conditions and on the order of presentation of the input data, and therefore differ individually from trial to trial. On the other hand, Sanger's rule is insensitive to initial conditions and to the order of presentation of the input data. It converges to wᵢ = ±c(i), in order, where the first unit (i = 1) extracts the first principal component, and so on. Sanger (1989) gave a convergence proof of the above PCA net employing Equation (3.3.19) to find the first m eigenvectors of the autocorrelation matrix C (assuming that the eigenvalues λ₁ through λ_m are distinct). The significance of this proof is that it guarantees that the dynamics in Equation (3.3.19) extract the first m eigenvectors of C, in order. Additional analysis is given in Chapter 4; some insights into the convergence behavior of this rule/PCA net can also be gained by exploring the analysis in Problems 3.3.5 and 3.3.6.

An equally important property of Sanger's PCA net is that there is no need to compute the full correlation matrix C. Rather, the first m eigenvectors of C are computed by the net adaptively and directly from the input vectors. This property can lead to significant savings in computational effort if the dimension of the input vectors is very large compared to the desired number of principal components to be extracted. For an interesting application of PCA to image coding/compression, the reader is referred to Gonzalez and Wintz (1987) and Sanger (1989). See also Chapter 5.

PCA in a Single Layer Network with Adaptive Lateral Connections

Another approach for PCA is to use a single layer network with m linear units and trainable lateral connections between the units, as shown in Figure 3.3.5 (Rubner and Tavan, 1989).

Figure 3.3.5. PCA network with adaptive lateral connections.

The lateral connections uᵢⱼ are present from unit j to unit i only if i > j. The weights wᵢⱼ connecting the inputs xₖ to the units are updated according to the simple normalized Hebbian learning rule [Equation (3.3.20)]. On the other hand, the lateral weights employ anti-Hebbian learning, in the form:

Δuᵢⱼ = −η yᵢ yⱼ   (3.3.21)

where η > 0. Note that the first unit, with index 1, extracts c(1), just as in Sanger's network. The second unit tries to do the same, except that the lateral connection u₂₁ from unit 1 inhibits y₂ from approaching c(1); hence y₂ is forced to settle for the second principal direction, namely c(2), and so on. Thus, this network extracts the first m principal data directions in descending order. Since the principal directions are orthogonal, the correlations yᵢyⱼ approach zero as convergence is approached, and thus the uᵢⱼ weights are driven to zero.

3.3.6 Nonlinear PCA

PCA networks, such as those discussed above, extract principal components which provide an optimal linear mapping from the original input space to a lower dimensional output space whose dimension is determined by the number of linear units in the network. The optimality of this mapping is with respect to the second-order statistics of the training set {xᵏ}. Optimal PCA mappings based on more complex statistical criteria are also possible if nonlinear units are used (Oja, 1991; Taylor and Coombes, 1993). Two natural ways of introducing nonlinearities into the PCA net are via higher-order units or via units with nonlinear activations. The nonlinearities implicitly introduce higher-order moments into the optimal solution. Here, the extracted principal components can be thought of as the eigenvectors of a matrix of higher-order statistics, which is a generalization of the second-order statistics matrix (the correlation matrix C). In order to see how higher-order units lead to higher-order-statistics PCA, consider the case of a simple network consisting of a single quadratic unit with n inputs xᵢ, as in a QTG [refer to Chapter 1].

The input/output relation for this quadratic unit is given by:

y = Σᵢ wᵢxᵢ + Σᵢ Σⱼ≥ᵢ wᵢⱼ xᵢxⱼ   (3.3.22)

Another way of interpreting Equation (3.3.22) is to write it in the form of a linear unit,

y = wᵀz   (3.3.23)

where

z = [x₁ x₂ x₃ ... xₙ x₁² x₁x₂ ... x₁xₙ x₂² x₂x₃ ... xₙ²]ᵀ

and

w = [w₁ w₂ ... wₙ w₁₁ w₁₂ ... w₁ₙ w₂₂ w₂₃ ... wₙₙ]ᵀ

is a vector of real-valued parameters. Therefore, the n-input quadratic unit is equivalent to a linear unit receiving its inputs from a fixed preprocessing layer; this preprocessing layer transforms the original input vectors xᵏ into higher dimensional vectors zᵏ. Now, if stable Hebbian learning is used to adapt the w parameter vector, this vector will stabilize at the principal eigenvector of the correlation matrix C_z = ⟨zzᵀ⟩. This matrix can be written in terms of the inputs xᵢ, and it involves third- and fourth-order moments of the input components. Thus, higher-order principal component extraction is expected when Hebbian-type learning is applied to this unit. Yet higher-order statistics can be realized by allowing for higher-order terms of the xᵢ's in z.
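A sketch (ours, on synthetic data) of this idea: expand each input x into the vector z of Equation (3.3.23) via a fixed preprocessing step, then run a stable Hebbian (Oja) rule on z. The weight vector stabilizes near the principal eigenvector of C_z:

```python
import numpy as np

def quad_expand(x):
    """Fixed preprocessing layer: map x to z of Equation (3.3.23)."""
    n = len(x)
    prods = [x[i] * x[j] for i in range(n) for j in range(i, n)]
    return np.concatenate([x, prods])

rng = np.random.default_rng(5)
X = rng.uniform(-1.0, 1.0, size=(1000, 3))
Z = np.array([quad_expand(x) for x in X])   # z-vectors (dimension 9 here)
Cz = Z.T @ Z / len(Z)                       # higher-order statistics matrix
lam, V = np.linalg.eigh(Cz)

eta = 0.01
w = rng.normal(scale=0.1, size=Z.shape[1])
for epoch in range(50):
    for z in Z:
        y = w @ z
        w += eta * y * (z - y * w)          # stable Hebbian (Oja) rule on z

cos = abs(w @ V[:, -1]) / np.linalg.norm(w)
print(round(cos, 3))                        # w aligns with the principal
                                            # eigenvector of Cz
```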

Extraction of higher-order statistics is also possible if units with nonlinear activation functions are used (e.g., sigmoidal activation units). This can be seen by employing a Taylor series expansion of the output of a nonlinear unit y = f(wᵀx) at wᵀx = 0; here, it is assumed that all derivatives of f exist. This expansion allows us to write this unit's output as

y = f(0) + f′(0)(wᵀx) + ½ f″(0)(wᵀx)² + ...   (3.3.24)

which may be interpreted as the output of a high-order (polynomial) unit (see Problem 3.3.7). Here, the matrix C₁ is the ordinary correlation matrix, the matrices C₂ and C₃ reflect third-order statistics, and C₄ reflects fourth-order statistics. For further exploration in nonlinear PCA, the reader may consult Karhunen (1994), Sudjianto and Hassoun (1994), and Xu (1994). See also Chapter 5.

3.4 Competitive Learning

In the previous section, we considered simple Hebbian-based networks of linear units which employed some degree of competition through lateral inhibition in order for each unit to capture a principal component of the training set. In this section, we extend this notion of competition among units and specialization of units to tackle a different class of problems involving clustering of unlabeled data, or vector quantization. Here, a network of binary-valued outputs, with only one "on" at a time, could tell which of several categories an input belongs to. These categories are to be discovered by the network on the basis of correlations in the input data; the network would then classify each cluster of "similar" input data as a single output class.

3.4.1 Simple Competitive Learning

Because we are now dealing with competition, it only makes sense to consider a group of interacting units. We assume the simplest architecture, where we have a single layer of units, each receiving the same input x ∈ Rⁿ and producing an output yᵢ. We also assume that only one unit is active at a given time. This active unit is called the "winner" and is determined as the unit with the largest weighted sum

netᵢ = wᵢᵀxᵏ   (3.4.1)

where wᵢ is the weight vector of unit i and xᵏ is the current input. Thus, unit i is the winning unit if

wᵢᵀxᵏ ≥ wⱼᵀxᵏ  for all j = 1, 2, ..., m   (3.4.2)

When all weight vectors have the same norm, this is equivalent to

||xᵏ − wᵢ|| ≤ ||xᵏ − wⱼ||  for all j   (3.4.3)

i.e., the winner is the node with the weight vector closest (in a Euclidean distance sense) to the input vector.

It is interesting to note that lateral inhibition may be employed here in order to implement the "winner-take-all" operation in Equation (3.4.2) or (3.4.3), as shown in Figure 3.4.1. This is similar to what we have described in the previous section, with a slight variation: each unit inhibits all other units and excites itself. In order to assure winner-take-all operation, a proper choice of lateral weights and unit activation functions must be made (e.g., see Grossberg, 1976, and Lippmann, 1987). One possible choice for the lateral weights is

wᵢₗ = 1 if l = i;  wᵢₗ = −ε if l ≠ i   (3.4.4)

where 0 < ε < 1/m and m is the number of units in the network. An appropriate activation function for this type of network is shown in Figure 3.4.2, where T is chosen such that the outputs yᵢ do not saturate at 1 before convergence of the winner-take-all competition. Thus, after convergence, only the winning unit will saturate at 1, with all other units having zero outputs. Note, however, that if one is training the net as part of a computer simulation, there is no need for the winner-take-all net to be implemented explicitly; it is more efficient, from a computation point of view, to perform the winner selection by direct search for the maximum netᵢ.

Figure 3.4.1. Single layer competitive network.

Figure 3.4.2. Activation function for units in the competitive network of Figure 3.4.1.

Thus far, we have only described the competition mechanism of the competitive learning technique. Next, we give a learning equation for weight updating. For a given input xᵏ drawn from a random distribution p(x), the weights of the winning unit i are updated (the weights of all other units are left unchanged) according to (Grossberg, 1969; von der Malsburg, 1973; Rumelhart and Zipser, 1985):

Δwᵢ = η (xᵏ − wᵢ)   (3.4.5)

If the magnitudes of the input vectors contain no useful information, a more appropriate rule to use is

Δwᵢ = η (xᵏ/||xᵏ|| − wᵢ)   (3.4.6)

The above rules tend to tilt the weight vector of the current winning unit in the direction of the current input. Let us view the input and weight vectors as points scattered on the surface of a hypersphere (or a circle, as in Figure 3.4.3); this effect is illustrated pictorially for a simple example in Figure 3.4.3. The cumulative effect of the repetitive application of the above rules is to sensitize certain units towards neighboring clusters of input data. Ultimately, some units (frequent winner units) will evolve so that their weight vectors point towards the "centers of mass" of the nearest significant dense clusters of data points.
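The whole procedure — winner selection by direct search [Equation (3.4.3)] plus the winner update of Equation (3.4.5) — fits in a few lines. The following Python sketch (ours; the cluster centers, spreads, learning rate, and iteration count are illustrative assumptions) also uses heuristic (1) below, initializing the weight vectors to sample input vectors:

```python
import numpy as np

rng = np.random.default_rng(6)
# three Gaussian clusters of unlabeled two-dimensional points (our own data)
centers = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 3.0]])
X = np.vstack([c + 0.3 * rng.normal(size=(100, 2)) for c in centers])

m, eta = 4, 0.05                       # excess units: overestimate clusters
W = X[::75][:m].copy()                 # initialize from data samples
for k in range(4000):
    x = X[rng.integers(len(X))]        # samples drawn uniformly at random
    i = np.argmin(np.linalg.norm(W - x, axis=1))   # winner, Eq. (3.4.3)
    W[i] += eta * (x - W[i])                       # update, Eq. (3.4.5)

print(np.round(W, 2))   # frequent winners sit near the cluster centers
```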

if x causes the unit with label "+".05) value is used. their probability of occurrence may be reduced by using some of the following heuristics: (1) Initialize the weight vectors to randomly selected sample input vectors. or "" to be a winner. Example 3. or.2) and (3. such units may still be desirable since they may capture new clusters if the underlying distribution p(x) changes in time. or "" printed at the exact position of the training sample x.4. Here. After learning has converged.5(a) with the initial weights marked by "*". Each of the remaining three units evolved its weight vector to a distinct region in the input space. and cluster membership.3 where the weight vector w1 does not significantly evolve throughout the computation.4. Here. and "" are used to tag the three winner units. in Figure 3. but with a learning rate much smaller than the one used for the winner unit. Since is small. the weight vectors fluctuate inside their respective terminal neighborhoods. "o". for example. Note how one of the units never evolved beyond its initial weight vector. "o".1: Consider the set of unlabeled two-dimensional data points plotted in Figure 3. which employs Equations (3. Simple competitive learning.5).6 are generated by the net with = 0.2) is used to determine the winning unit.4. only three units are found to ever become winners. "o". respectively. it is necessary to "calibrate" the above clustering network in order to determine the number of units representing the various learned clusters. So we normally overestimate this number by including excess units in the network.4.4. 1985). These fluctuations are amplified if a relatively larger ( = 0.01 is performed for 1000 iterations. The following example demonstrates the above ideas for a simple data clustering problem. Finally. respectively.4.4. (3) update the neighboring units of the winner unit. We should note that the exact same clusters in Figure 3. 
the weight vectors are shown to enter and stay inside "tiny" neighborhoods. all loser units are also updated in response to an input vector. some units will be redundant in the sense that they do not evolve significantly and thus do not capture any data clusters. we demonstrate the typical solution (cluster formation) generated by a four-unit competitive net. 1986). Equation (3. Here. Here. (2) use "leaky learning" (Rumelhart and Zipser. These are typically the units that are initialized to points in the weight space that have relatively low overlap with the training data points. (b) weight vector configuration after performing simple competitive learning.3. During calibration.4. as demonstrated in Figure 3. the input samples (data points) are selected uniformly at random from the data set. During training. and/or (4) smear the input vectors with added noise using a longtailed distribution (Szu.6 depicts the result of the calibration process. This can be seen. The cluster labeled "" looks interesting because of its obvious
. This means that after convergence. No normalization of either the weight vectors or the input vectors is used. we do not know the number of clusters in the data set. This is because this unit never became a winner. Training with = 0. The four weight vectors are initialized to random values close to the center of the plot in Figure 3.(a) (b) Figure 3.4. the trained network is calibrated by assigning a unique cluster label for any unit which becomes a winner for one or more training samples. The figure shows a "+". The resulting weight vector trajectories are plotted in Figure 3.4. (a) Initial weight vectors.5(b). Here.4.05. Figure 3.4.4. At the onset of learning. the weights of the net are held fixed and the training set is used to interrogate the network for sensitized units. if desired. the symbols "+".4.

Weight vector evolution trajectories for a four-unit competitive net employing Equations (3. and (b) learning rate equals 0.4. one may explain the three-cluster solution in this example as representing a suboptimal solution which corresponds to a local minimum of J(w). but failed. Unlabeled two-dimensional data used in training the simple competitive net of Example 3.4.4. However.4.
Figure 3.5).4. Therefore.
(a)
(b) Figure 3.4.01. this is what the net attempted to do.1.bimodal structure. did in fact result in this "intuitive" solution. These trajectories are shown superimposed on the plane containing the training data. A "*" is used to indicate the initial setting of the weights for each of the four units. Intuitively speaking. where is the weight vector of the winner unit.5.5) is shown to correspond to stochastic gradient
descent on the criterion . (a) Learning rate equals 0.

Figure 3.4.6. A three-cluster solution for the data shown in Figure 3.4.4. This solution represents a typical cluster formation generated by the simple four-unit competitive net of Example 3.4.1.

3.4.2 Vector Quantization

One of the common applications of competitive learning is adaptive vector quantization for data compression (e.g., of speech and image data). Vector quantization is a technique whereby the input space is divided into a number of distinct regions, and for each region a "template" (reconstruction vector) is defined (Linde et al., 1980; Gray, 1984). Here, we need to categorize a given set of xᵏ data points (vectors) into m "templates," so that later one may use an encoded version of the corresponding template of any input vector to represent the vector, as opposed to using the vector itself. When presented with a new input vector x, a vector quantizer first determines the region in which the vector lies; then, the quantizer outputs an encoded version of the reconstruction vector wᵢ representing that particular region containing x. The set of all possible reconstruction vectors wᵢ is usually called the "codebook" of the quantizer. This leads to efficient quantization (compression) for storage and for transmission purposes (albeit at the expense of some distortion).

When the Euclidean distance similarity measure is used to decide on the region to which the input x belongs, the quantizer is called a Voronoi quantizer. The Voronoi quantizer partitions its input space into Voronoi cells (Gray, 1984), and each cell is represented by one of the reconstruction vectors wᵢ, i = 1, 2, ..., m. The ith Voronoi cell contains those points of the input space that are closer (in a Euclidean sense) to the vector wᵢ than to any other vector wⱼ, j ≠ i. Figure 3.4.7 shows an example of the input space partitions of a Voronoi quantizer with four reconstruction vectors, shown as filled circles in the figure.

Figure 3.4.7. Input space partitions realized by a Voronoi quantizer with four reconstruction vectors.

Let x be distributed according to the probability density function p(x). The competitive learning rule in Equation (3.4.5), with a winning unit determination based on the Euclidean distance as in Equation (3.4.3), may now be used in order to allocate a set of m reconstruction vectors wᵢ to the input space of n-dimensional vectors x. Initially, we set the starting values of the vectors wᵢ to the first m randomly generated samples of x; additional samples x are then used for training. Thus, this competitive learning algorithm may be viewed as an "approximate" method for computing the reconstruction vectors wᵢ in an unsupervised manner. Based on empirical results, Kohonen (1989) conjectured that the asymptotic local point density of the wᵢ's (i.e., the number of wᵢ falling in a small volume of Rⁿ centered at x) obtained by the above competitive learning process takes the form of a continuous, monotonically increasing function of p(x).
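A minimal sketch (ours, with assumed data and settings) of this quantization scheme: train a small codebook with competitive learning, then encode each input as the index of its Voronoi cell and measure the resulting distortion:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 2))            # vectors to be compressed
m = 4
W = X[:m].copy()                         # codebook seeded with first m samples
for x in X[m:]:
    i = np.argmin(np.linalg.norm(W - x, axis=1))   # region containing x
    W[i] += 0.05 * (x - W[i])                      # competitive update

def encode(x, W):
    """Return the index of the Voronoi cell (codeword) containing x."""
    return int(np.argmin(np.linalg.norm(W - x, axis=1)))

codes = [encode(x, W) for x in X]        # compressed representation
distortion = np.mean([np.sum((x - W[c]) ** 2) for x, c in zip(X, codes)])
print(distortion)                        # average squared reconstruction error
```

Only the cell index (here, 2 bits per vector) needs to be stored or transmitted, at the expense of the distortion printed above.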

In pattern classification problems, it is the decision surface between pattern classes, and not the inside of the class distribution, which should be described most accurately. The above quantization process can be easily adapted in order to optimize the placement of the decision surfaces between different classes. Kohonen (1989) designed supervised versions of vector quantization (called learning vector quantization, LVQ) for adaptive pattern classification. Here, class information is used to fine tune the reconstruction vectors in a Voronoi quantizer, so as to improve the quality of the classifier decision regions.

Initially, one would start with a trained Voronoi quantizer and calibrate it using a set of labeled input samples (vectors). Each calibration sample is assigned to that wᵢ which is closest; each wᵢ is then labeled according to the majority of classes represented among those samples which have been assigned to wᵢ. The distribution of the calibration samples to the various classes, as well as the relative numbers of the wᵢ assigned to these classes, must comply with the a priori probabilities of the classes, if such probabilities are known. Next, let the training vector xᵏ belong to class cⱼ, and assume that the closest reconstruction vector wᵢ to xᵏ carries the label of class cₗ. Then only the vector wᵢ is updated, according to the following supervised rule (LVQ rule):

Δwᵢ = +ηₖ (xᵏ − wᵢ)  if cₗ = cⱼ;  Δwᵢ = −ηₖ (xᵏ − wᵢ)  otherwise   (3.4.7)

where ηₖ is assumed to be a monotonically decreasing function of the number of iterations k. Here, the tuning of the decision surfaces is accomplished by rewarding correct classifications and punishing incorrect ones. The primary effect of the reward/punish rule in Equation (3.4.7) is to minimize the number of misclassifications; at the same time, the vectors wᵢ are pulled away from the zones of class overlap where misclassifications persist. After convergence, the input space Rⁿ is again partitioned by a Voronoi tessellation corresponding to the tuned wᵢ vectors.

The convergence speed of LVQ can be improved if each vector wᵢ has its own adaptive learning rate, given by (Kohonen, 1990):

ηᵢᵏ = ηᵢᵏ⁻¹ / (1 + s(k) ηᵢᵏ⁻¹)   (3.4.8)

where s(k) = +1 if wᵢ classifies xᵏ correctly, and s(k) = −1 otherwise. This recursive rule causes ηᵢ to decrease if wᵢ classifies xᵏ correctly; otherwise, ηᵢ increases. Equations (3.4.7) and (3.4.8) define what is known as the "optimized learning rate" LVQ (OLVQ). Another improved algorithm, named LVQ2, has also been suggested by Kohonen et al. (1988), which approaches the performance predicted by Bayes decision theory (Duda and Hart, 1973). More general competitive networks with stable categorization behavior have been proposed by Carpenter and Grossberg (1987a, b). One of these networks, called ART1, is described in Chapter 6. Some theoretical aspects of competitive learning are considered in the next chapter.

3.5 Self-Organizing Feature Maps: Topology-Preserving Competitive Learning

Self-organization is a process of unsupervised learning whereby significant patterns or features in the input data are discovered. In the context of a neural network, self-organization learning consists of adaptively modifying the synaptic weights of a network of locally interacting units, in response to input excitations and in accordance with a learning rule, until a final useful configuration develops. The local interaction of units means that the changes in the behavior of a unit only (directly) affect the behavior of its immediate neighborhood. The key question here is how a useful configuration could evolve from self-organization. The answer lies essentially in a naturally observed phenomenon whereby global order can arise from local interactions (Turing, 1952). This phenomenon applies to neural networks (biological and artificial), where many originally random local interactions between neighboring units (neurons) of a network couple and coalesce into states of global order. This global order leads to coherent behavior, which is the essence of self-organization.
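The reward/punish rule of Equation (3.4.7) can be sketched as follows (ours, on two synthetic Gaussian classes; the simple 1/k-style learning-rate schedule and all other settings are illustrative assumptions, not the OLVQ schedule of Equation (3.4.8)):

```python
import numpy as np

rng = np.random.default_rng(8)
# two labeled Gaussian classes (synthetic data)
X0 = rng.normal(loc=[-2.0, 0.0], size=(200, 2))
X1 = rng.normal(loc=[+2.0, 0.0], size=(200, 2))
X = np.vstack([X0, X1])
labels = np.array([0] * 200 + [1] * 200)

W = np.array([[-1.0, 0.0], [1.0, 0.0]])   # calibrated codebook vectors
Wlab = np.array([0, 1])                   # class label carried by each w_i
for k in range(4000):
    j = rng.integers(len(X))
    x, c = X[j], labels[j]
    i = np.argmin(np.linalg.norm(W - x, axis=1))      # closest w_i
    eta = 0.1 / (1 + 0.01 * k)            # monotonically decreasing eta_k
    if Wlab[i] == c:
        W[i] += eta * (x - W[i])          # reward a correct classification
    else:
        W[i] -= eta * (x - W[i])          # punish an incorrect one

winners = np.argmin(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2),
                    axis=1)
acc = np.mean(Wlab[winners] == labels)
print(acc)   # near the Bayes rate for this well-separated problem
```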

3.5 Self-Organizing Feature Maps

Self-organization learning consists of adaptively modifying the synaptic weights of a network of locally interacting units, in response to input excitations and in accordance with a learning rule, until a final useful configuration develops. The local interaction of units means that the changes in the behavior of a unit only (directly) affect the behavior of its immediate neighborhood. The key question here is how a useful configuration could evolve from self-organization. The answer lies essentially in a naturally observed phenomenon whereby global order can arise from local interactions (Turing, 1952). This phenomenon applies to neural networks, biological and artificial, where many originally random local interactions between neighboring units (neurons) of a network couple and coalesce into states of global order. This global order leads to coherent behavior, which is the essence of self-organization. In the following, a modified version of the simple competitive learning network discussed in Section 3.4, which exhibits self-organization features, is presented.

3.5.1 Kohonen's SOFM

The purpose of Kohonen's self-organizing feature map (SOFM) is to capture the topology and the probability distribution of input data (Kohonen, 1982a and 1989). The idea is to develop a topographic map of the input vectors, so that similar input vectors trigger nearby units: the more related two patterns are in the input space, the closer we expect the positions of the two units representing these patterns to be in the array. In other words, if x1 and x2 are "similar" or are topological neighbors in Rn, and if r1 and r2 are the locations of the corresponding winner units in the net/array, then the Euclidean distance ||r1 − r2|| is expected to be small; ||r1 − r2|| approaches zero as x1 approaches x2.

An example of such topology-preserving self-organizing mappings which exists in animals is the somatosensory map from the skin onto the somatosensory cortex, where there exists an image of the body surface. The retinotopic map from the retina to the visual cortex is another example. It is believed that such biological topology-preserving maps are not entirely preprogrammed by the genes, and that some sort of (unsupervised) self-organizing learning phenomenon exists that tunes such maps during development. Two early models of topology-preserving competitive learning were proposed by von der Malsburg (1973) and Willshaw and von der Malsburg (1976) for the retinotopic map problem. In this section, we present a detailed topology-preserving model due to Kohonen (1982a), commonly referred to as the self-organizing feature map (SOFM).

This network attempts to map a set of input vectors xk in Rn onto an array of units (normally one- or two-dimensional) such that any topological relationships among the xk patterns are preserved and are represented by the network in terms of a spatial distribution of unit activities. The model generally involves an architecture consisting of a two-dimensional structure (array) of linear units, where each unit receives the same input xk ∈ Rn. Each unit in the array is characterized by an n-dimensional weight vector. The ith unit's weight vector wi is sometimes viewed as a position vector which defines a "virtual position" for unit i in Rn. This, in turn, allows us to interpret changes in wi as movements of unit i; however, one should keep in mind that no physical movements of units take place.

The learning rule is similar to that of simple competitive learning, with no weight vector normalization. The winner unit i* is determined according to the Euclidean distance, and all weight vectors are updated according to:

wi(k+1) = wi(k) + ρk Λ(i, i*) [xk − wi(k)]    (3.5.1)

where i* is the index of the winner unit. The major difference between this update rule and that of simple competitive learning is the presence of the neighborhood function Λ(i, i*). It is normally symmetric, with values close to one for units i close to i* in the array, and it decreases monotonically with the Euclidean distance ||ri − ri*|| between units i and i* in the array. This function is very critical for the successful preservation of topological properties.

At the onset of learning, Λ defines a relatively large neighborhood, whereby all units in the net are updated for any input xk. As learning progresses, the neighborhood is shrunk down until it ultimately goes to zero, where only the winner unit is updated. One may think of the initial large neighborhood as effecting an exploratory global search, which is then continuously refined to a local search as the variance of Λ approaches zero. The learning rate must also follow a monotonically decreasing schedule in order to achieve convergence. Through this process, a global organization of the units is expected to emerge.

Let the input vector x be a random variable with a stationary probability density function p(x). The essence of Equation (3.5.1), and of various other self-organizing algorithms, is captured by the following two-step process: (1) locate the best-matching unit i* for the input vector x; (2) increase matching at this unit and its topological neighbors. The computation achieved by a repetitive application of this process is rather surprising, and is captured by the following proposition due to Kohonen (1989): "The wi vectors tend to be ordered according to their mutual similarity, and the asymptotic local point density of the wi, in an average sense, is of the form g(p(x))," where g is some continuous, monotonically increasing function. This proposition is tested by the following experiments.

One possible choice for the neighborhood function is

Λ(i, i*) = exp( −||ri − ri*||² / 2σ²(k) )    (3.5.2)

where the variance σ²(k) controls the width of the neighborhood and decreases with the learning step k. Ritter and Schulten (1988a) proposed the following update schedules for σ and for the learning rate ρ, respectively:

σ(k) = σ0 (σf / σ0)^(k/kmax)    (3.5.3)

and

ρ(k) = ρ0 (ρf / ρ0)^(k/kmax)    (3.5.4)

with σ0 (ρ0) and σf (ρf) controlling the initial and final values of the neighborhood width (learning rate), respectively, and kmax the maximum number of learning steps anticipated. There is no theory to specify the parameters of these learning schedules for arbitrary training data. However, practical values are 0 < ρ0 ≤ 1 (typically 0.5), ρf << 1, σ0 = md/2 (md is the number of units along the largest diagonal of the array), or an equivalent value that will permit Λ at k = 0 to reach all units when i* is set close to the center of the array, and a small final width σf. Finally, kmax is usually set to 2 or more orders of magnitude larger than the total number of units in the net.

3.5.2 Examples of SOFMs

Figure 3.5.1 shows an example of mapping uniform random points inside the positive unit square onto a 10 × 10 planar array of units with rectangular topology. Here, each unit i has four immediate neighboring units, located symmetrically at a distance of 1 from unit i. In general, other array topologies, such as hexagonal topology, may be assumed. In the figure, the weight vectors of all units are shown as points superimposed on the input space. For improved visualization, connections are shown between points corresponding to topologically neighboring units in the array; that is, a line connecting two weight vectors wi and wj is only used to indicate that the two corresponding units i and j are adjacent (immediate neighbors) in the array.

During training, the inputs to the units in the array are selected randomly and independently from a uniform distribution p(x) over the positive unit square. The weights are initialized randomly near (0.5, 0.5), as shown in Figure 3.5.1(a). Figures 3.5.1(b)-(d) show snapshots of the time evolution of the feature map. Initially, the map untangles and orders its units, as in parts (b) and (c) of the figure. Ultimately, the map spreads to fill all of the input space except for the border region, which shows a slight contraction of the map, as in Figure 3.5.1(d). This aberration is due to the "pulling" by the units inside the map, and it leads to a higher density of weight vectors at the borders.

Here, it is useful to observe that there are two phases in the formation of the map: an ordering phase and a convergence phase. The ordering phase involves the initial formation of the correct topological ordering of the weight vectors; this is roughly accomplished during the first several hundred iterations of the learning algorithm. The fine tuning of the map is accomplished during the convergence phase, where the map converges asymptotically to a solution which approximates p(x). During this phase, the neighborhood width and the learning rate take on very small values. For good results, the convergence phase may take 10 to 1000 times as many steps as the ordering phase, thus contributing to slow convergence.

Another important result is that the density of the weight vectors in the weight space follows the uniform probability distribution of the input vectors. If the distribution p(x) had been non-uniform, we would have found more grid points of the map where p(x) was high (this is explored in Problem 3.5.5).

(a) Iteration 0  (b) Iteration 10,000  (c) Iteration 30,000  (d) Iteration 70,000
Figure 3.5.1. Mapping uniform random points from the positive unit square using a 10 × 10 planar array of units (ρ0 = 0.8, ρf = 0.01, σ0 = 8, σf = 1, and kmax = 70,000).

Figures 3.5.2 through 3.5.4 show additional examples of mapping uniformly distributed points from a disk, a triangle, and a hollow disk, respectively, onto a 15 × 15 array of units. The initial weight distribution is shown in part (a) of each of these figures, and the learning rule parameters are provided in the figure captions. The important thing, though, is to observe how in all cases the w vectors become ordered according to their mutual similarity, which preserves the topology of the Euclidean input space. Another common property of the above SOFMs is a border aberration effect, which causes a slight contraction of the map.

(a) Iteration 0  (b) Iteration 100  (c) Iteration 10,000  (d) Iteration 50,000
Figure 3.5.2. Mapping uniform random points from a disk onto a 15 × 15 array of units (ρ0 = 0.8, ρf = 0.01, σ0 = 5, σf = 1, and kmax = 50,000).

(a) Iteration 0  (b) Iteration 100  (c) Iteration 20,000  (d) Iteration 40,000
Figure 3.5.3. Mapping uniform random points from a triangle onto a 15 × 15 array of units (ρ0 = 0.8, ρf = 0.01, σ0 = 5, σf = 1, and kmax = 50,000).

(a) Iteration 0  (b) Iteration 5000  (c) Iteration 20,000  (d) Iteration 50,000
Figure 3.5.4. Mapping uniform random points from a hollow disk onto a 15 × 15 array of units (ρ0 = 0.8, ρf = 0.01, σ0 = 8, σf = 1, and kmax = 50,000).

Mappings from higher to lower dimensions are also possible with a SOFM, and are in general useful for dimensionality reduction of input data. An illustrative example is shown in Figure 3.5.5, where a mapping from a hollow disk region onto a linear array (chain) of sixty units is performed using Kohonen's feature map learning in Equation (3.5.1). Here, the units self-organize such that they cover the largest region possible (a space-filling curve).

(a) Iteration 0  (b) Iteration 1000  (c) Iteration 10,000  (d) Iteration 50,000
Figure 3.5.5. Mapping from a hollow disk region onto a linear array (chain) of sixty units using a SOFM.

Two additional interesting simulations of SOFMs, due to Ritter and Schulten (1988a), are considered next. The first simulation involves the formation of a somatosensory map between the tactile receptors of a hand surface and an "artificial cortex" formed by a 30 × 30 planar square array of linear units. The training set consisted of the activity patterns of the set of tactile receptors covering the hand surface. Figure 3.5.6 depicts the evolution of the feature map from an initial random map, shown in Figure 3.5.6(a). It is interesting to note the boundaries that developed [shown as dotted regions in the maps of Figure 3.5.6(c) and (d)] between the various sensory regions. Also note the correlation between the sizes of the sensory regions in the input data and their associated regions in the converged map.

(a) (b)
(c) (d)
Figure 3.5.6. An example of somatosensory map formation on a planar array of 30 × 30 units. (From H. Ritter and K. Schulten, 1988, Kohonen's Self-Organizing Maps: Exploring Their Computational Capabilities, Proceedings of the IEEE International Conference on Neural Networks, vol. I, pp. 109-116. © 1988 IEEE.)

The second simulation is inspired by the work of Durbin and Willshaw (1987) (see also Angeniol et al., 1988) related to solving the traveling salesman problem (TSP) by an elastic net. Here, 30 random city locations are chosen in the unit square, with the objective of finding the path of minimal length that visits all cities, where each city is visited only once. A linear array (chain) of 100 units is used for the feature map; the initial neighborhood size used is 20, and the initial learning rate is 1. Figure 3.5.7(a) shows the 30 randomly chosen city locations (filled circles) and the initial location and shape of the chain (open squares). The generated solution path is shown in Figures 3.5.7(b), (c), and (d), after 5,000, 7,500, and 10,000 steps, respectively.
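The TSP simulation just described can be sketched with a one-dimensional SOFM whose neighborhood distance is measured along the chain. The version below uses a closed ring instead of an open chain so the path returns to its start, and its parameter values and function name are illustrative assumptions rather than the simulation's actual settings.

```python
import numpy as np

def sofm_tsp(cities, n_units=100, k_max=20000,
             rho0=0.8, rhof=0.02, sig0=20.0, sigf=0.5, seed=1):
    """Approximate a TSP tour with a 1-D SOFM ring trained on city coordinates.

    Neighborhood distance is the circular index distance along the ring, so
    ring-adjacent units stay close in the plane and trace a short closed path.
    """
    rng = np.random.default_rng(seed)
    m = len(cities)
    W = cities.mean(axis=0) + 0.1 * rng.standard_normal((n_units, 2))  # small initial ring
    idx = np.arange(n_units)
    for k in range(k_max):
        x = cities[rng.integers(m)]                       # present a random city
        rho = rho0 * (rhof / rho0) ** (k / k_max)
        sig = sig0 * (sigf / sig0) ** (k / k_max)
        i = int(np.argmin(np.linalg.norm(W - x, axis=1)))
        ring = np.minimum(np.abs(idx - i), n_units - np.abs(idx - i))  # circular distance
        W += rho * np.exp(-(ring ** 2) / (2 * sig ** 2))[:, None] * (x - W)
    # read off the tour: visit cities in the order of their winner units on the ring
    order = np.argsort([int(np.argmin(np.linalg.norm(W - c, axis=1))) for c in cities])
    return order, W
```

As the neighborhood shrinks, the ring stretches toward the cities while neighboring units hold each other back, which is the elastic-net-like tension that keeps the resulting tour short.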
(a) (b)
(c) (d)
Figure 3.5.7. Solving the 30-city TSP using a SOFM consisting of a "chain" of one hundred units. Filled circles correspond to fixed city locations; open squares correspond to weight vector coordinates of units forming the chain.

Applications of SOFMs can be found in many areas, including trajectory planning for a robot arm (Ritter and Schulten, 1988), combinatorial optimization (Angeniol et al., 1988; Hueter, 1988), and automatic speech recognition (Kohonen, 1988). We conclude this section by highlighting an interesting practical application of feature maps as phonotopic maps for continuous speech processing (Kohonen, 1988a).

In practical applications of self-organizing feature maps, input vectors are often high dimensional. Such is the case in speech processing. In speech processing (transcription, recognition, etc.), a microphone signal is first digitally preprocessed and converted into a 15-channel spectral representation covering the range of frequencies from 200 Hz to 5 kHz. Let these channels together constitute the 15-dimensional input vector x(t) to a feature map (normally, each vector is preprocessed by subtracting the average from all components and normalizing its length). The input vectors x(t), representing short-time spectra of the speech waveform, are computed every 9.83 milliseconds.

Here, an 8 × 12 hexagonal array of units is assumed, as depicted in Figure 3.5.8; note how each unit in this array has six immediate neighbors. The self-organizing process of Equation (3.5.1) is used to create a "topographic," two-dimensional map of speech elements onto the hexagonal array. After training on sufficiently large segments of continuous speech (Finnish speech in this case), and subsequent calibration of the resulting map with standard reference phoneme spectra, the phonotopic map of Figure 3.5.8 emerges. In the figure, the units, shown as circles, are labeled with the symbols of the phonemes to which they "learned" to give the best responses; some units have double labels, which implies units that respond to two phonemes. A striking result is that the various units in the array become sensitized to the spectra of different phonemes and their variations in a two-dimensional order, although teaching was not done using the phonemes. One can attribute this to the fact that the input spectra are clustered around phonemes, and the self-organizing process finds these clusters.

Figure 3.5.8. Phonotopic feature map for Finnish speech. Units, shown as circles, are labeled with the symbols of the phonemes to which they adapted to give the best responses. (From T. Kohonen, The "Neural" Phonetic Typewriter, IEEE Computer Magazine, 21(3), pp. 11-22, 1988. © 1988 IEEE.)

This phonotopic map can be used as the basis for isolated-word recognition. The signal x(t) corresponding to an uttered word induces an ordered sequence of responses in the units of the array. Figure 3.5.9 shows the sequence of responses obtained from the phonotopic map when the Finnish word "humpilla" (name of a place) was uttered. This sequence of responses defines a phonemic transcription of the uttered word; the transcription can then be recognized by comparing it with reference transcriptions collected from a great many words. For some phonemes the map alone is not sufficient; for example, the distinction of 'k', 'p', and 't' from this map is not reliable [a solution for such a problem is described in Kohonen (1988)]. These phonemic trajectories also provide a means for the visualization of the phonemes of speech, which may be useful for speech training therapy. Further extensions of the use of SOFMs in automatic speech recognition can be found in Tattersal et al. (1990).
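The per-vector preprocessing mentioned above (subtract the average from all components, then normalize the length) can be sketched as follows; the function name and the zero-vector guard are assumptions of this sketch.

```python
import numpy as np

def preprocess_spectrum(v):
    """Zero-mean, then unit-length normalize one short-time spectral vector,
    as described for the 15-channel phonotopic-map inputs."""
    v = np.asarray(v, dtype=float)
    v = v - v.mean()               # subtract the average from all components
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

This removes overall loudness and spectral offset, so the map organizes on spectral shape rather than signal level.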

Figure 3.5.9. Sequence of the responses of units obtained from the phonotopic map when the Finnish word "humpilla" was uttered. The arrows correspond to intervals of 9.83 milliseconds. (From T. Kohonen, The "Neural" Phonetic Typewriter, IEEE Computer Magazine, 21(3), pp. 11-22, 1988. © 1988 IEEE.)

Kohonen (1988) employed the phonotopic map in a speech transcription system implemented in hardware in a PC environment. This system is used as a "phonetic typewriter" which can produce text from arbitrary dictation. When trained on speech from half a dozen speakers, using office text, names, and the most frequent words of the language (this amounts to several thousand words), the phonetic typewriter had a transcription accuracy between 92 and 97 percent, depending on the speaker and the difficulty of the text. The system can be adapted on-line to new speakers by fine tuning the phonotopic map on a dictation of about 100 words.

3.6 Summary

This chapter describes a number of basic learning rules for supervised, reinforcement, and unsupervised learning, and presents a unifying view of these rules for the single-unit setting: the learning process is viewed as a steepest gradient-based search for a set of weights that optimizes an associated criterion function. This is true for all learning rules, regardless of the usual taxonomy of these rules as supervised, reinforcement, or unsupervised. Various forms of criterion functions are discussed, including the perceptron, SSE, MSE, Minkowski-r, and relative entropy criterion functions. Preliminary characterizations of some learning rules are given, covering appropriate parameter initializations and some remarks on convergence behavior and the nature of the obtained solution, with the in-depth mathematical analysis deferred to the next chapter.

The issue of which learning rules are capable of finding linearly separable solutions led us to identify the class of well-formed criterion functions. These functions, when optimized by an iterative search process, guarantee that the search will enter the region in the weight space which corresponds to linearly separable solutions. Steepest gradient-based learning is also extended to the stochastic unit, which serves as the foundation for reinforcement learning; reinforcement learning is viewed as a stochastic process which attempts to maximize the average reinforcement.

Unsupervised learning is treated in the second half of the chapter. Hebbian learning is discussed, and examples of stable Hebbian-type learning rules are presented along with illustrative simulations. It is pointed out that Hebbian learning applied to a single linear unit tends to maximize the unit's output variance. Then, single-layer networks of multiple interconnected units are considered in the context of competitive learning, principal component analysis, learning vector quantization, and self-organizing feature maps. Simulations are included which are designed to illustrate the powerful emerging computational properties of these simple networks and their applications. It is demonstrated that local interactions in a competitive net can lead to global order; a case in point is the SOFM, where simple incremental interactions among locally neighboring units lead to a global map which preserves the topology and density of the input data.

Table 3.1 summarizes all learning rules considered in this chapter. It gives a quick reference to the learning equations and their associated criterion functions, the type of unit activation function employed, and remarks on convergence.

Table 3.1. SUMMARY OF BASIC LEARNING RULES

Each entry gives the learning rule and its type (supervised, reinforcement, or unsupervised), together with conditions, the activation function where relevant, and remarks. The learning equations and criterion functions themselves are given in the body of the chapter; the criterion functions associated with these rules are discussed in Chapter 4. Throughout, the general form of the learning equation is that the weight change equals the learning rate ρ times a learning vector; c(i) denotes the ith normalized eigenvector of the autocorrelation matrix C, with corresponding eigenvalue λ(i).

1. Perceptron rule (supervised). Finite convergence time if the training set is linearly separable; w stays bounded for arbitrary training sets.

2. Perceptron rule with variable learning rate and fixed margin (supervised); b > 0, with ρk satisfying the standard decreasing-rate conditions. Converges to a solution with margin b if the training set is linearly separable; finite convergence if ρk = ρ, a finite positive constant.

3. Mays' rule (supervised); b > 0. Finite convergence to the solution if the training set is linearly separable.

4. Butz's rule (supervised); γ > 0. Converges in the mean (the mean operation is over learning steps, with each training vector drawn at random). Places w in a region that tends to minimize the probability of error for non-linearly separable cases; finite convergence if the training set is linearly separable.

5. Widrow-Hoff rule (α-LMS) (supervised). Converges in the mean square to the minimum SSE (LMS) solution if ||xi|| = ||xj|| for all i and j.

6. µ-LMS rule (supervised). Converges in the mean square to the minimum SSE (LMS) solution.

7. Stochastic µ-LMS rule (supervised), with ρk satisfying the standard decreasing-rate conditions. Converges in the mean square to the minimum SSE (LMS) solution.

8. Correlation rule (supervised). Converges to the minimum SSE solution if the xk's are mutually orthonormal.

9. Delta rule (supervised); y = f(net), where f is a sigmoid function. Extends the µ-LMS rule to units with differentiable nonlinear activation functions.

10. Relative entropy delta rule (supervised); y = tanh(β net). Eliminates the slow convergence ("flat spot") suffered by the delta rule at saturated outputs; converges if the training set is linearly separable.

11. Minkowski-r delta rule (supervised); y = f(net), where f is a sigmoid function. 1 < r < 2 is appropriate for pseudo-Gaussian distributions p(x) with pronounced tails; r = 1 arises when p(x) is a Laplace distribution; r = 2 gives the delta rule.

12. AHK I rule (supervised), with margin vector b; the bi's can only increase from their initial values. Converges to a robust solution for linearly separable problems.

13. AHK II rule (supervised), with margin vector b; the bi's can take any positive value. Converges to a solution for linearly separable problems.

14. AHK III rule (supervised), with margin vector b. Converges for linearly separable as well as non-linearly separable cases, arriving at a solution which tends to minimize misclassifications; it automatically identifies and discounts the critical points affecting the non-linear separability.

15. Delta rule with stochastic neuron (supervised); stochastic activation. Performance in the average is equivalent to that of the delta rule applied to a unit with deterministic activation.

16. Simple associative reward/penalty rule (reinforcement); stochastic activation. wk evolves so as to maximize the average reinforcement signal.

17. Hebbian rule (unsupervised). Maximizes <y²>, with w pointing in the direction of c(1); the weight vector norm grows without bound.

18. Oja's rule (unsupervised). Does not have an exact criterion function J(w); however, the rule tends to maximize <y²> subject to the constraint ||w|| = 1. Converges in the mean to ±c(1) (see Chapter 4 for details).

19. Yuille et al. rule (unsupervised). Maximizes <y²>; converges in the mean to a vector parallel to c(1).

20. Linsker's rule (unsupervised). Converges to a w* whose components are clamped at the bounds w− and w+.

21. Hassoun's rule (unsupervised). Converges in the mean to a w* parallel to c(1).

22. Standard competitive learning rule (unsupervised); i* is the index of the winning unit. Converges to a local minimum of J representing some clustering configuration.

23. Kohonen's feature map rule (unsupervised); ρk and the neighborhood width evolve according to prescribed monotonically decreasing schedules. The weight vectors evolve to a solution which tends to preserve the topology of the input space; the point density of the solution is of the form g(p(x)), where g is a continuous, monotonically increasing function and p(x) is the stationary probability density function governing the input x.

Problems

3.1.1 Show that the choice ρ = ρopt minimizes the maximum number of corrections k0 if w1 = 0. Here, we follow the notation used in the convergence proof for the perceptron learning rule.
a. Show that if xk is misclassified, then the bound on k0 is minimized by setting ρ to its optimal value ρopt.
b. Show that the choice ρ = ρopt guarantees the convergence of the perceptron learning rule in a finite number of steps.
c. Show that the use of ρ = ρopt in Equation (3.1.2) leads to an explicit learning rule for wk+1.

3.1.2 This problem explores an alternative convergence proof (Kung, 1993) for the perceptron rule in Equation (3.1.2).
a. Show that if xk is misclassified by a perceptron with weight vector wk, then

(1)

where w* is a solution which separates all patterns xk correctly.
b. Show that if we restrict ρ to sufficiently small values, then wk+1 in Equation (1) stops changing after a finite number of corrections; that is, wk+1 converges in a finite number of steps (recall that no weight update is needed if all xk are classified correctly).
c. Does wk+1 have to converge to the particular solution w*? Explain.

(Note that this learning rule is impractical, since it requires a solution, w*, to be known!)

3.1.3 Show that the perceptron criterion function J in Equation (3.1.20) is piecewise linear, and comment on the differentiability of this function. Next, consider four training pairs of the form {x, d}, with d ∈ {−1, +1}, forming a linearly separable training set. The following is a guided exploration into the properties of the function J for this training set; it should give some additional insights into the convergence process of the perceptron learning rule.
a. Plot the criterion function J for a two-input perceptron with weights w1 and w2, where w2 is the weight associated with the bias input (assume a bias of +1).
b. Identify the solution region in the weight space containing all linearly separable solutions w*. What is the value of J in this region?
c. Based on the above results, describe the evolution process of the weight vector, starting from an arbitrary initial weight vector.
d. For comparative purposes, plot the quadratic criterion function in Equation (3.1.25) with b = 0.

3.1.4 Show that the α-LMS rule in Equation (3.1.30) can be obtained from an incremental gradient descent on an appropriate criterion function.

3.1.5 For the simple two-dimensional linearly separable two-class problem in Figure P3.1.5, compute and plot the dynamic margins (bi) vs. learning cycles using the AHK I learning rule, with the following initial values and parameters: w1 = 0, ρ1 = 0.1, ρ2 = 0.05, and initial margins of 0.05. Repeat using the AHK II rule. Compare the convergence speed and the quality of the solution of these two rules.

Figure P3.1.5. A simple two-class linearly separable pattern for Problem 3.1.5.

3.1.6 Draw the two decision surfaces for Problem 3.1.5 and compare them to the decision surfaces generated by the perceptron rule, the µ-LMS learning rule (use µ = 0.05), and Butz's rule (use γ = 0.05). Assume w1 = 0 for these rules.

3.1.7 Repeat Problem 3.1.5 for the two-class problem in Figure P3.1.7, using the AHK III rule.

3.1.8 Draw the decision surface for Problem 3.1.7 and compare it to the decision surfaces generated by the perceptron rule, Mays' rule, the µ-LMS learning rule (use µ = 0.05), and Butz's rule (use γ = 0.1). Assume w1 = 0 for these rules.

Figure P3.1.7. A simple two-class nonlinearly separable training set for Problem 3.1.7.

3.1.9 Check the "well-formedness" of the criterion functions of the following learning rules:
a. µ-LMS rule.
b. AHK I rule.
c. Is there a value of r for which the Minkowski-r criterion function is well-formed?
d. Is the relative entropy criterion function well-formed?

3.1.10 Employing statistical mechanics arguments, it is possible to show (Gordon et al., 1993) that the mean number of errors made by the perceptron on the training set at "temperature" T (T is a monotonically decreasing positive function of training time) takes the form given in the text, where b is an imposed stability factor specified by the user and w is a perceptron weight vector which is constrained to have a constant length. Explain the effects of b and T on the placement of the separating hyperplane.

3.1.11 Consider the training set {xk, dk}, k = 1, 2, ..., m.
a. Define a stability measure for pattern k.

b. Employing gradient descent on J(w), show that an appropriate update rule for the ith weight follows. Give a qualitative analysis of the behavior of this learning rule, and compare it to the correlation and AHK III rules. How would this rule behave for nonlinearly separable training sets?

3.1.12 Consider the learning procedure of Polyak (1990) given in the text. Discuss qualitatively the difference between this learning procedure and the µ-LMS learning rule.

3.1.13 Consider the Minkowski-r criterion function in Equation (3.1.35). If no prior knowledge is available about the distribution of the training data, then extensive experimentation with various r values must be done in order to estimate an appropriate value for r. Alternatively, an automatic method for estimating r is possible by adaptively updating r in the direction of decreasing E. Employ steepest gradient descent on E(r) in order to derive an update rule for r.

3.2.1 Use the Arp rule to find the stochastic unit separating surface wTx = 0 arrived at after 10, 50, 100, and 200 cycles through the training set in Figure P3.2.1 (assume x3 = +1). Use a reinforcement factor rk equal to +1 if xk is correctly classified, and −1 otherwise, with ρ+ = 0.1 and ρ− = 0.01. Start with an arbitrary initial weight vector, and plot the generated separating surface after the first twenty presentations. Repeat by subjectively assigning a +1 or a −1 to rk based on whether the movement of the separating surface after each presentation is good or bad; here, use ρ+ = 0.6 and ρ− = 0.06.

Figure P3.2.1. A two-class linearly separable training set for Problem 3.2.1.

3.3.1 Consider a data set of two-dimensional vectors x generated as follows: x1 and x2 are distributed randomly and independently according to the normal distributions N(0, 1) and N(0, 0.1), respectively. Generate and plot 1000 data points x in the input plane. Compute and plot (in the input plane) the solution trajectories wk generated by the normalized Hebbian rule, Oja's rule, and the Yuille et al. rule, upon training a single linear unit with weight vector w on samples x drawn randomly and independently from the above distribution. Verify graphically that, for large k, the average direction of wk is that of the maximum data variance. In your simulations, assume ρ = 0.01, stop training after 1500 iterations, and use the same initial weight vector w1 in all simulations. Also study, via simulation, the effects of larger learning rates (e.g., ρ = 0.05) and of various initial weight vectors on the wk trajectories.

3.3.2 Show that the average Oja rule has as its equilibria the eigenvectors of C.

3.3.3 Evaluate the Hessian matrix at the equilibria found in Problem 3.3.2, and study the stability of these equilibria in terms of the eigenvalues of the Hessian matrix. Also, show that for Oja's rule the stable equilibria are ±c(1), the principal eigenvector of C.

3.3.4 Show that the average weight change in the Yuille et al. rule is given by steepest gradient descent on the associated criterion function.

3.3.5 Consider Sanger's rule (Equation 3.3.19) with m = 2 units, and assume that the first unit has already converged to c(1). Derive the second unit's update equation for the average weight vector change, and show that c(1) is not an equilibrium point. Start by showing that the first part of the update is the average of Oja's rule in Equation (3.3.6).

3.3.6 Derive the average weight change of the ith unit for Sanger's rule (Hertz et al., 1991). Now assume that the weight vectors for the first i − 1 units have already converged to their appropriate eigenvectors, so that wk = c(k) for k < i. Show that the first two terms in the update give the projection of the Hebbian term onto the space orthogonal to the first i − 1 eigenvectors of C. Employ these results to show that wi converges to c(i).

3.3.7 Approximate the output of a sigmoid unit having the transfer characteristic y = tanh(wTx) by the first four terms in a Taylor series expansion. Show that this unit approximates a third-order unit if it is operated near wTx = 0. Comment on the effects of the saturation region on the quality of the higher-order principal component analysis realized when the unit is trained using Hebbian learning.

Assume a learning rate = 0.4. d. Draw the decision boundary for the classifier in part b. 0]T. 3]T.4. compared to the solution in Problem 3.01.3 Repeat Problem 3. 0]T. with no weight vector normalization. b.5. Stop training after 2000 iterations (steps). in a single layer competitive network. and [3. [0.4. [1. the vector xk is to be chosen from the data set at random. 3]T.
c. 2).4. 1]T. 2]T.4. Make sure that you use the same initial weight vectors as in the previous problem. Compare these tessellations to the ones drawn in part a. 3]T. as in Example 3. and start with random weight vectors distributed uniformly in the region |w1| 10 and |w2| 10. 3.4. Finally.4.6) and compare this solution to the real solution.
Consider the 200 sample two-dimensional data set {xk} generated randomly and independently as follows: 50 samples generated according to the normal distribution N([− +5]T.5 x2 − 2. and 75 samples generated according to a uniform distribution in the region and − 7. 75 samples generated 5. at each iteration. [3. for all i.5.4. Use a five unit competitive net in order to discover the underlying clusters in this data set.4.
†
3. 1]
†
a. Stop training after 1000 iterations.1 [use Equations (3.2)].4.2. +5]T.5) and (3.
Show that this rule preserves
† 3. initialized to . Starting with the Voronoi quantizer of part a. − T. 1. use the LVQ method described in Section 3. Assume equal a priori probability of the two classes. according to N([+5.4.3. use the adaptive learning rates in Equation (3.2 using the similarity measure in Equation (3. Draw the Voronoi tessellations realized by the weight vectors (reconstruction vectors) which resulted from the LVQ training in part b. [0. for determining the winning unit. [− 0]T. 2) for classes 1 and 2. 3]T. During training. plot the clusters discovered by the net (as in Figure 3. 0]T.4. 1) and p2(x) = N([3. Discuss any differences in the number and/or shape of the generated clusters. Draw the input space partitions (Voronoi tessellation) realized by the above quantizer.4 Consider a Voronoi quantizer with the following 10 reconstruction vectors: [0. respectively. [3.3). 1).2
for all k. Depict the evolution of the weight vectors of all five units as in Figure 3. 4]T. [4.2 in order to design a two-class classifier for data generated randomly according to the probability density functions p1(x) = N([0. 1}n and the initial weights Assume the following learning rule for the ith unit
. [2.1 Let xk {0.
.4.4.8).
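A compact winner-take-all sketch in the spirit of Problems 3.4.2 and 3.4.3, assuming numpy. The three Gaussian clusters, the deterministic initialization of the units at spread-out data samples, and all parameter values are illustrative assumptions rather than the problem's exact specification:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative clustered data set (three well-separated Gaussian clusters).
data = np.vstack([rng.normal([-5.0,  5.0], 0.5, (50, 2)),
                  rng.normal([ 5.0,  5.0], 0.5, (50, 2)),
                  rng.normal([ 0.0, -5.0], 0.5, (50, 2))])

def competitive_net(data, m=5, lr=0.01, steps=2000, seed=1):
    """Simple competitive learning: the unit whose weight vector is closest
    to the input x (the winner) moves toward x; all other units stay put."""
    rng = np.random.default_rng(seed)
    W = data[:: len(data) // m][:m].astype(float)  # init at spread-out samples
    for _ in range(steps):
        x = data[rng.integers(len(data))]
        i = int(np.argmin(np.linalg.norm(W - x, axis=1)))   # winning unit
        W[i] += lr * (x - W[i])
    return W

W = competitive_net(data)
# Each underlying cluster center should be closely approached by some unit.
for c in ([-5.0, 5.0], [5.0, 5.0], [0.0, -5.0]):
    print(round(float(min(np.linalg.norm(W - np.array(c), axis=1))), 2))
```

Units that never win simply stay where they were initialized; the adaptive learning rates of Problem 3.4.3 are one common remedy for such "dead" units.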

3.5.1 Consider the following criterion function, suitable for solving the TSP in an elastic net architecture (Durbin and Willshaw, 1987):

where m is the number of units in the net. Here, xk represents the position of city k and wi represents the ith stop.
a. Give qualitative interpretations for the various terms in J and in the weight update.
b. Show that gradient descent on J leads to the update rule:

where

c. Show that J is bounded below, and that as the scale parameter approaches zero and m approaches infinity, J is minimized by the shortest possible tour.

3.5.2 Consider the L-shaped region shown in Figure P3.5.1, |x1| < 1 and |x2| < 1. Train a two-dimensional Kohonen feature map on points generated randomly and uniformly from this region. Assume a 15 × 15 array of units and the following learning parameters: an initial learning rate of 0.1, a final learning rate of 0.01, and kmax = 80,000. Choose all initial weights randomly inside the region. Display the trained net (as in the figures of Section 3.5) at times 0, 20, 40, 80, 1000, and 80,000.

3.5.3 Repeat Problem 3.5.2 (but with an initial neighborhood width of 20) for a one-dimensional chain consisting of 60 units initialized randomly inside the L-shaped region. Display the trained map at various training times. What is the general shape of the point distribution p(w) of the weight vectors? Estimate the variance of p(w). Is p(w) proportional to p(x)?

3.5.4 The self-organizing map simulation in Figure 3.5.5 employs an initial learning rate of 0.1, a final learning rate of 0.01, and kmax = 70,000. (Note: the chain configuration in Figure 3.5.5(d) is not a stable configuration for the above parameters.) Repeat the simulation and compare the resulting chain configuration at 70,000 iterations to that in the previous simulation.

3.5.5 Repeat the SOFM simulation in Figure 3.5.1 (see Section 3.5.3 for details) assuming p(x) = N([0, 0]T, 0.1). Use the learning parameters given in the caption of Figure 3.5.1. Repeat the simulation and plot the map (chain) at iterations 50,000 and 70,000. Discuss your results.

Figure P3.5.1. L-shaped region representing a uniform distribution of inputs for the feature map of Problems 3.5.2 and 3.5.3.
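A bare-bones one-dimensional Kohonen chain in the spirit of the SOFM problems above, assuming numpy. The square input region (used here instead of the L-shaped one), the exponential decay schedules for the learning rate and neighborhood width, and all constants are illustrative assumptions:

```python
import numpy as np

def train_sofm_chain(data, n_units=20, kmax=20000, lr0=0.1, lrf=0.01,
                     sig0=5.0, sigf=0.5, seed=1):
    """1-D Kohonen chain: the best-matching unit and its chain neighbors
    (Gaussian neighborhood) move toward each input sample; learning rate
    and neighborhood width decay exponentially from initial to final values."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-1, 1, (n_units, 2))          # random initial weights
    idx = np.arange(n_units)
    for k in range(kmax):
        t = k / kmax
        lr = lr0 * (lrf / lr0) ** t
        sig = sig0 * (sigf / sig0) ** t
        x = data[rng.integers(len(data))]
        i = int(np.argmin(np.linalg.norm(w - x, axis=1)))    # winner
        h = np.exp(-((idx - i) ** 2) / (2.0 * sig ** 2))     # neighborhood
        w += lr * h[:, None] * (x - w)
    return w

rng = np.random.default_rng(0)
data = rng.uniform(-1, 1, (2000, 2))   # uniform inputs on |x1|, |x2| <= 1
w = train_sofm_chain(data)

# After ordering, units adjacent in the chain should also be close in the
# input plane, so the average segment length of the chain stays small.
seg = np.linalg.norm(np.diff(w, axis=0), axis=1)
print(float(seg.mean()))
```

Plotting w after training (joining consecutive units by line segments) displays the familiar space-filling chain of the figures referenced above.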

4. MATHEMATICAL THEORY OF NEURAL LEARNING

4.0 Introduction

This chapter deals with theoretical aspects of learning in artificial neural networks. It investigates mathematically the nature and stability of the asymptotic solutions obtained using the basic supervised, reinforcement, and unsupervised Hebbian learning rules which were introduced in the previous chapter. A unifying framework for the characterization of various learning rules is presented. This framework is based on the notion that learning in general neural networks can be viewed as search, in a multidimensional space, for a solution which optimizes a prespecified criterion function. Under this framework, and subject to certain assumptions, a continuous-time learning rule is viewed as a first-order stochastic differential equation/dynamical system, whereby the state of the system evolves so as to minimize an associated instantaneous criterion function. Approximation techniques are employed to determine, in an average sense, the nature of the asymptotic solutions of the stochastic system. This approximation leads to an "average learning equation" which, in most cases, can be cast as a globally, asymptotically stable gradient system whose stable equilibria are minimizers of a well-defined average criterion function. These stable equilibria can be taken as the possible limits (attractor states) of the stochastic learning equation. Formal analysis is also given for simple competitive learning and self-organizing feature map learning.

The chapter also treats two important issues associated with learning in a general feedforward neural network: learning generalization and learning complexity. The section on generalization presents a theoretical method for calculating the asymptotic probability of correct generalization of a neural network as a function of the training set size and the number of free parameters in the network. Here, generalization in deterministic and stochastic nets is investigated. The chapter concludes by reviewing some significant results on the complexity of learning in neural networks.

4.1 Learning as a Search Mechanism

Learning in an artificial neural network, whether it is supervised, reinforcement, or unsupervised, can be viewed as a process of searching a multidimensional parameter space for a state which optimizes a predefined criterion function J (J is also commonly referred to as an error function, a cost function, or an objective function). In fact, all of the learning rules considered in Chapter 3, except Oja's rule (refer to Table 3.1), have well-defined analytical criterion functions. These learning rules implement local search mechanisms (i.e., gradient search) to obtain weight vector solutions which (locally) optimize the associated criterion function. Therefore, it is the criterion function that determines the nature of a learning rule. For example, supervised learning rules are designed so as to minimize an error measure between the network's output and the corresponding desired output, unsupervised Hebbian learning is designed to maximize the variance of the output of a given unit, and reinforcement learning is designed to maximize the average reinforcement signal.

Supervised learning can be related to classical approximation theory (Poggio and Girosi, 1990a). Here, the idea is to approximate or interpolate a continuous multivariate function g(x), from samples {x, g(x)}, by an approximation function (or class of functions) G(w, x), where w ∈ Rd is a parameter vector with d degrees of freedom and x belongs to a compact subset of Rn. The set of samples {x, g(x)} is referred to as a training set. The approximation problem is to find an optimal parameter vector that provides the "best" approximation of g on the sample set for a given class of functions G. Formally stated, we desire a solution w* ∈ Rd such that

(4.1.1)

where the tolerance ε is a positive real number and ||·|| is any appropriate norm. For the case where the Euclidean norm is used, it is well known that an appropriate criterion function J is

(4.1.2)

whose global minimum represents the minimum sum of square error (SSE) solution. The choice of approximation function G, the criterion function J, and the search mechanism for w* all play critical roles in determining the quality and properties of the resulting solution/approximation. Among these three factors, the most critical in terms of affecting the quality of the approximation in Equation (4.1.1) is the choice of approximation function G. In classical analysis, polynomials and rational functions are typically used for function approximation. For artificial neural networks, the approximation functions are usually chosen from the class of smooth sigmoidal-type functions, and the approximation is constructed as a superposition of such sigmoidal functions.

In general, there are two important issues in selecting an appropriate class of basis functions: universality and generalization. By universality, we mean the ability of G to represent, to any desired degree of accuracy, the class of functions g being approximated; in Chapter 2, we established the universality of feedforward layered neural nets. Generalization, on the other hand, means the ability of G to correctly map new points x, which were not seen during the learning phase, drawn from the same underlying input distribution p(x). In general, the number of degrees of freedom (weights) in G plays a critical role in determining the degree of data overfitting, which directly affects generalization quality. An interesting question here is how well a neural network compares to other universal approximation functions in terms of generalization (some insight into answering this question is given in Chapter 5).

Another way to control generalization is through criterion function conditioning. A "regularization" term (Poggio and Girosi, 1990a) may be added to an initial criterion function according to

(4.1.3)

The ||P(w)||2 term in Equation (4.1.3) is used to imbed a priori information about the function g, such as smoothness, invariance, etc. It may also be thought of as a constraint satisfaction term: the initial criterion is the quantity to be minimized subject to the regularization constraint that ||P(w)||2 is kept "small," with the Lagrange multiplier λ determining the degree of regularization. In some cases, such a term may help condition the criterion so that the search process is stabilized. Examples of this regularization strategy have already been encountered in the unsupervised Hebbian-type learning rules presented in Chapter 3. These ideas can also be extended to unsupervised learning.

Usually we know our objective, and this knowledge is translated into an appropriate criterion function J. In terms of search mechanisms, gradient-based search is appropriate (and is simple to implement) for cases where it is known that J is differentiable and bounded. If gradient search is used, we must be willing to accept locally optimal solutions. Only global search mechanisms, which are computationally very expensive, may lead to globally optimal solutions (global search-based learning is the subject of Chapter 8).
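The effect of the regularization term in Equation (4.1.3) can be made concrete with a toy criterion J(w) = SSE + (λ/2)||w||², where the stabilizer is taken to be the simple weight-norm penalty ||P(w)||² = ||w||². The linear model, the sampled teacher function g, and all constants below are illustrative assumptions, not the text's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples {x, g(x)} of a linear teacher g(x) = a.x on a compact set.
a = np.array([2.0, -1.0])
X = rng.uniform(-1, 1, (100, 2))
z = X @ a + rng.normal(0, 0.1, 100)

def descend(X, z, lam, lr=0.1, steps=2000):
    """Gradient descent on J(w) = (1/2)sum_k (z_k - w.x_k)^2 + (lam/2)||w||^2
    (step scaled by 1/N for stability)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = -X.T @ (z - X @ w) + lam * w
        w -= lr * grad / len(X)
    return w

w_sse = descend(X, z, lam=0.0)     # pure SSE criterion, as in Equation (4.1.2)
w_reg = descend(X, z, lam=50.0)    # regularized criterion, as in Equation (4.1.3)
print(np.round(w_sse, 2), np.linalg.norm(w_reg) < np.linalg.norm(w_sse))
```

The regularizer shrinks the solution norm, trading a small increase in training error for a better-conditioned search.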

4.2 Mathematical Theory of Learning in a Single Unit Setting

In this section, instead of dealing separately with the various learning rules proposed in the previous chapter, we seek to study a single learning rule, called the general learning equation (Amari, 1977a and 1990), which captures the salient features of several of the different single-unit learning rules. Statistical analysis of the continuous-time version of the general learning equation is then performed for selected learning rules.

4.2.1 General Learning Equation

Consider a single unit which is characterized by a weight vector w ∈ Rn, an input vector x ∈ Rn, a scalar teacher signal z, and a scalar output y. In a supervised learning setting, the teacher signal is taken as the desired target associated with a particular input vector. The input vector (signal) is assumed to be generated by an environment or an information source according to the probability density p(x, z), or p(x) if z is missing (as in unsupervised learning). Now, consider the following discrete-time dynamical process which governs the evolution of the unit's weight vector w:

(4.2.1)

where μ and γ are positive real constants and r(w, x, z) is referred to as a "learning signal." Two forms of the general learning equation will be presented: a discrete-time version, in which the weight vector evolves according to a discrete dynamical system of the form w(k+1) = g(w(k)), and a continuous-time version, in which the weight vector evolves according to a smooth dynamical system. In the remainder of this section, the continuous-time version

(4.2.2)

is adopted and is referred to as the "general learning equation." One can easily verify that the above two equations lead to discrete-time and continuous-time versions, respectively, of several of the learning rules of Chapter 3, including the correlation, LMS, perceptron, and Hebbian learning rules. For example, setting γ = 0 and r(w, x, z) = z in Equation (4.2.1) leads to the simple correlation rule, while the μ-LMS (or Widrow-Hoff) rule is obtained by setting γ = 0 and r(w, x, z) = z - wTx. Similarly, substituting r(w, x, z) = z - y (here z is taken as bipolar binary) with y = sgn(wTx) leads to the perceptron learning rule of Chapter 3, and r(w, x, z) = y leads to the Hebbian rule. Note that in Equation (4.2.2) the state w* = 0 is an asymptotically stable equilibrium point if either r(w, x, z) and/or x are identically zero. The term -γw on the right-hand side of Equation (4.2.2) plays the role of a "forgetting term," which tends to "erase" those weights not receiving sufficient reinforcement during learning.

From the point of view of analysis, it is useful to think of Equation (4.2.2) as implementing a fixed-increment steepest gradient descent search of an instantaneous criterion function J satisfying

(4.2.3)

For the case r(w, x, z) = r(wTx, z) = r(u, z), Equation (4.2.3) can be integrated to yield
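The discrete-time form of the general learning equation can be written as one update with a pluggable learning signal r. The sketch below (assuming numpy; the teacher vector and constants are illustrative) recovers the correlation, LMS, and Hebbian rules from the same skeleton and runs the LMS instance:

```python
import numpy as np

def general_learning_step(w, x, z, r, mu=0.05, gamma=0.0):
    """One step of the discrete-time general learning equation:
    w(k+1) = w(k) + mu * r(w, x, z) * x - gamma * w(k)."""
    return w + mu * r(w, x, z) * x - gamma * w

# Learning signals r(w, x, z) for three of the rules discussed in the text:
r_corr = lambda w, x, z: z             # correlation rule
r_lms  = lambda w, x, z: z - w @ x     # LMS (Widrow-Hoff) rule
r_hebb = lambda w, x, z: w @ x         # Hebbian rule (r = y)

rng = np.random.default_rng(0)
a = np.array([1.0, 2.0])               # illustrative teacher weights
w = np.zeros(2)
for _ in range(4000):
    x = rng.normal(0.0, 1.0, 2)
    w = general_learning_step(w, x, x @ a, r_lms)
print(np.round(w, 2))
```

Swapping in r_corr or r_hebb (with a nonzero gamma for the latter) reproduces the other rules without changing the update skeleton.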

(4.2.4)

The criterion function J in Equation (4.2.4) fits the general form of the constrained criterion function in Equation (4.1.3); thus, we may view the task of minimizing J as an optimization problem with the objective of maximizing the learning signal term subject to regularization which penalizes solution vectors w* with large norm. This type of criterion function (which has the classic form of a potential function) is appropriate for learning rules such as the perceptron, LMS, Hebbian, and correlation rules. It is interesting to note that by maximizing the learning signal term, one is actually maximizing the amount of information learned from a given example pair {x, z}; in other words, the general learning equation is designed so that it extracts the maximum amount of "knowledge" present in the learning signal r(w, x, z).

4.2.2 Analysis of the Learning Equation

In a stochastic environment where the information source is ergodic, the sequence of inputs x(t) is an independent stochastic process governed by p(x). Equation (4.2.2) then becomes a stochastic differential equation (a stochastic approximation algorithm in the discrete-time case), and the weight vector w is changed in random directions depending on the random variable x. In practice, the stochastic learning equation is implemented directly, and its average convergence behavior is characterized by the "average learning equation"

(4.2.5)

where < > implies averaging over all possible inputs x with respect to the probability distribution p(x). With sufficiently small μ, one would expect the stochastic dynamics of the system in Equation (4.2.2) to approach, in an average sense, a local minimum of <J>. In the most general case, a suitable criterion function J satisfying Equation (4.2.3) may not readily be determined (or may not even exist). When such a J does exist, the linear nature of the averaging operation allows us to express Equation (4.2.5) as

(4.2.6)

Thus, the average weight change becomes proportional to the average gradient of the instantaneous criterion function J, and Equation (4.2.5) may be viewed as a steepest gradient descent search for a w* which (locally) minimizes the expected criterion function <J>. It is interesting to note that finding the equilibrium points w* of the average learning equation does not require the knowledge of J explicitly if the gradient of J is known: the equilibria w* are the solutions to

(4.2.7)

This means that the equilibria w* are local minima, local maxima, and/or saddle points of <J>. For gradient systems, it is a well-established result that the local minima are asymptotically stable points (attractors) and that the local maxima are unstable points (Hirsch and Smale, 1974). The gradient system in Equation (4.2.6) therefore has special properties that make its dynamics rather simple to analyze, and Equation (4.2.6) is useful from a theoretical point of view in determining the equilibrium state(s) and in characterizing the stochastic learning equation [Equation (4.2.2)] in an "average" sense.

In practice, discrete-time versions of the stochastic dynamical system in Equation (4.2.2) are used for weight adaptation. These discrete-time "learning rules" and their associated average learning equations have been extensively studied in a more general context than that of neural networks; the book by Tsypkin (1971) gives an excellent treatment of these iterative learning rules and their stability. In particular, the stability of the corresponding discrete-time average learning equation (discrete-time gradient system) is ensured if 0 < μ < 2/λmax, where λmax is the largest eigenvalue of the Hessian matrix H = ∇²J evaluated at the current point in the search space (the proof of this statement is outlined in the Problems).

4.2.3 Analysis of Some Basic Learning Rules

By utilizing Equation (4.2.5), we are now ready to analyze some basic learning rules. These are the correlation, LMS, and Hebbian learning rules.

Correlation Learning

Here, r(w, x, z) = z; that is, the learning signal is simply the teacher signal z, which represents the desired target associated with the input x. From Equation (4.2.2) we have the stochastic equation

(4.2.8)

which leads to the average learning equation

(4.2.9)

Now, by setting the average weight change in Equation (4.2.9) to zero, one arrives at the (only) equilibrium point

(4.2.10)

The stability of w* may now be systematically studied through the "expected" Hessian matrix <H(w*)>, which is computed, by first employing Equations (4.2.5) and (4.2.9) to identify <∇J>, as

(4.2.11)

This equation shows that the Hessian of <J> is positive definite; i.e., its eigenvalues are strictly positive. This makes the system locally asymptotically stable at the equilibrium solution w* by virtue of Liapunov's first method (see Gill et al., 1981; Dickinson, 1991). In fact, the positive definite Hessian implies that w* is a minimum of <J>, and therefore the gradient system converges globally and asymptotically to w*, its only minimum, from any initial state.
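The discrete-time stability condition 0 < μ < 2/λmax can be checked numerically on a quadratic criterion whose Hessian is fixed (an illustrative two-dimensional example):

```python
import numpy as np

H = np.diag([1.0, 4.0])        # Hessian with eigenvalues 1 and 4 (lambda_max = 4)
w0 = np.array([1.0, 1.0])

def run(mu, steps=100):
    """Discrete-time gradient system w(k+1) = w(k) - mu * H w(k) for the
    quadratic criterion <J>(w) = (1/2) w^T H w, whose minimum is w* = 0."""
    w = w0.copy()
    for _ in range(steps):
        w = w - mu * (H @ w)
    return float(np.linalg.norm(w))

print(run(0.4))   # mu < 2/lambda_max = 0.5: converges to w* = 0
print(run(0.6))   # mu > 2/lambda_max: diverges
```

Each weight component is scaled by (1 - μλi) per step, so the iteration contracts exactly when every |1 - μλi| < 1, i.e., when μ < 2/λmax.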

The instantaneous criterion function associated with correlation learning is given by

(4.2.12)

which may be minimized by maximizing the correlation zy subject to the regularization term; here, the regularization term is needed in order to keep the solution bounded. If we take γ > 0, then w* = (μ/γ)<zx> is the equilibrium state. Note that this analysis parallels that of the simple correlation rule defined in Chapter 3.

LMS Learning

For r(w, x, z) = z - wTx (the output error due to input x) and γ = 0, Equation (4.2.2) leads to the stochastic equation

(4.2.13)

and the average learning equation becomes

(4.2.14)

with equilibria satisfying

(4.2.15)

Let C denote the positive semidefinite autocorrelation matrix <xxT>, and let P denote the cross-correlation vector <zx>. If C is positive definite, then w* = C-1P is the only (asymptotically) stable solution for Equation (4.2.14). The Hessian matrix is

(4.2.16)

which is positive definite if C > 0. Thus, the trajectory w(t) of the stochastic system in Equation (4.2.13) is expected to approach and then fluctuate about the state w* = C-1P. Note that w* approaches the minimum SSE solution in the limit of a large training set, and that this analysis is identical to the analysis of the μ-LMS rule in Chapter 3. Finally, note that with γ = 0, Equation (4.2.4) gives the instantaneous criterion function
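A quick numerical check, under illustrative assumptions (numpy, a correlated Gaussian input stream, a noisy linear teacher), that the stochastic LMS rule settles near, and fluctuates about, w* = C-1P:

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated inputs and a noisy linear teacher (illustrative choices).
A = np.array([[1.0, 0.5], [0.0, 1.0]])
X = rng.normal(0.0, 1.0, (5000, 2)) @ A.T      # so C is not diagonal
z = X @ np.array([1.0, -2.0]) + rng.normal(0.0, 0.1, 5000)

C = X.T @ X / len(X)                 # sample autocorrelation matrix <x x^T>
P = X.T @ z / len(X)                 # sample cross-correlation vector <z x>
w_star = np.linalg.solve(C, P)       # w* = C^{-1} P

w = np.zeros(2)
mu = 0.01
for k in range(len(X)):              # stochastic LMS: w += mu*(z - w.x)*x
    w += mu * (z[k] - w @ X[k]) * X[k]

print(float(np.linalg.norm(w - w_star)))   # small residual fluctuation
```

With a constant μ the weight vector never settles exactly at w*; it hovers about it with a fluctuation that shrinks as μ is reduced.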

(4.2.17)

which is the instantaneous SSE (or MSE) criterion function.

Hebbian Learning

Here, upon setting r(w, x, z) = y = wTx, Equation (4.2.2) gives the Hebbian rule with decay

(4.2.18)

whose average is

(4.2.19)

Setting the average weight change in Equation (4.2.19) to zero leads to the equilibria

(4.2.20)

So if C happens to have γ/μ as an eigenvalue, then w* will be the eigenvector of C corresponding to that eigenvalue. In general, though, γ/μ will not be an eigenvalue of C, so Equation (4.2.19) will have only one equilibrium, at w* = 0. This equilibrium solution is asymptotically stable if γ/μ is greater than the largest eigenvalue of C, since this makes the Hessian

(4.2.21)

positive definite. Finally, employing Equation (4.2.4), we get the instantaneous criterion function minimized by the Hebbian rule in Equation (4.2.18):

(4.2.22)
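The stability condition for the Hebbian rule with decay can be visualized by integrating the average equation (Euler steps) for an illustrative 2 × 2 autocorrelation matrix: with γ/μ above the largest eigenvalue of C the weights decay to the sole equilibrium w* = 0, and below it they diverge:

```python
import numpy as np

C = np.diag([2.0, 0.5])    # illustrative autocorrelation; lambda_max = 2
mu = 1.0

def average_hebb_decay(gamma, steps=2000, dt=0.01):
    """Euler integration of the average Hebbian-with-decay dynamics
    dw/dt = mu*C*w - gamma*w, starting from w = [1, 1]^T."""
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w + dt * (mu * (C @ w) - gamma * w)
    return float(np.linalg.norm(w))

print(average_hebb_decay(gamma=3.0))   # gamma/mu > lambda_max: w -> 0
print(average_hebb_decay(gamma=1.0))   # gamma/mu < lambda_max: ||w|| diverges
```

Along each eigenvector of C the dynamics reduce to exponential growth or decay at rate μλi - γ, which is what the two runs display.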

The regularization term is not adequate here to stabilize the Hebbian rule at a solution which maximizes y². However, other, more appropriate regularization terms can ensure stability, as we will see in the next section.

4.3 Characterization of Additional Learning Rules

Equation (4.2.5) [or (4.2.6)] is a powerful tool which can be used in the characterization of the average behavior of stochastic learning equations. We will employ it in this section in order to characterize some unsupervised learning rules which were considered in Chapter 3. The following analysis is made easier if one assumes that w and x are uncorrelated and that the learning equation is averaged with respect to x (denoted by < >x), with w replaced by its mean. This assumption leads to the "approximate" average learning equation

(4.3.1)

The above approximation of the average learning equation is valid when the learning equation contains strongly mixing random processes (processes for which the "past" and the "future" are asymptotically independent) with the mixing rate high compared to the rate of change of the solution process; i.e., it can be assumed that the weights are uncorrelated with the patterns x and with themselves. Taking the expected (average) value of a stochastic equation, one obtains a deterministic equation whose solution approximates asymptotically the behavior of the original system as described by the stochastic equation (here, a constant learning rate μ(t) = μ is normally assumed). Roughly, the higher the mixing rate, the better the approximation in Equation (4.3.1) (Kushner and Clark, 1978). We shall frequently employ Equation (4.3.1) in the remainder of this chapter.

A more rigorous characterization of stochastic learning requires the more advanced theory of stochastic differential equations and will not be considered here [see Kushner (1977) and Ljung (1978)]. Rather, we may proceed with a deterministic analysis using the "average versions" of the stochastic equations. It may be shown that a necessary condition for the stochastic learning rule to converge (in the mean-square sense) is that the average version of the learning rule must converge. In addition, and under certain assumptions, the exact solution of a stochastic equation is guaranteed to "stay close," in a probabilistic sense, to the solution of the associated average equation. It has been shown (Geman, 1979) that under strong mixing conditions (and some additional assumptions), the solution of the stochastic equation converges, in the mean-square sense, to that of the average equation as the learning rate approaches zero. This result implies that if sufficiently small learning rates are used, the behavior of a stochastic learning equation may be well approximated, in a mean-square sense, by the deterministic dynamics of its corresponding average equation. Oja (1983) pointed out that the convergence of constrained gradient descent- (or ascent-) based stochastic learning equations (the type of equations considered in this chapter) can be studied with averaging techniques; i.e., the asymptotically stable equilibria of the average equation are the possible limits of the stochastic equation. Several examples of applying the averaging technique to the characterization of learning rules can be found in Kohonen (1989).
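The averaging approximation can be probed numerically with a scalar LMS-type rule (an illustrative construction): for each learning rate, compare the stochastic trajectory against its deterministic average equation and record the largest gap; shrinking the learning rate shrinks the gap:

```python
import numpy as np

a = 2.0          # teacher weight; inputs x ~ N(0, 1), so <x^2> = 1

def max_gap(mu, steps, seed):
    """Largest deviation between the stochastic rule w += mu*(a*x - w*x)*x
    and its average equation w += mu*(a - w)*<x^2> along the trajectory."""
    rng = np.random.default_rng(seed)
    w_sto, w_avg, gap = 0.0, 0.0, 0.0
    for _ in range(steps):
        x = rng.normal()
        w_sto += mu * (a * x - w_sto * x) * x    # stochastic equation
        w_avg += mu * (a - w_avg)                # average equation
        gap = max(gap, abs(w_sto - w_avg))
    return gap

g_large = np.mean([max_gap(0.1, 500, s) for s in range(5)])
g_small = np.mean([max_gap(0.005, 10000, s) for s in range(5)])
print(g_small < g_large)   # smaller learning rate: closer to the average path
```

This mirrors the Geman-style result quoted above: for small enough learning rates, the deterministic average equation is a faithful description of the stochastic rule.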

Before proceeding with further analysis of learning rules, we make the following important observations regarding the nature of the learning parameter in the stochastic learning equation (Heskes and Kappen, 1991). When a neural network interacts with a fixed, unchanging (stationary) environment, the aim of the learning algorithm is to adjust the weights of the network in order to produce an optimal response; i.e., an optimal representation of the environment. To produce such an optimal and static representation, we require the learning parameter, which controls the amount of learning, to eventually approach zero. Otherwise, fluctuations in the representation will persist, due to the stochastic nature of the learning equation, and asymptotic convergence to the optimal representation is never achieved. For a large class of stochastic algorithms, asymptotic convergence can be guaranteed (with high probability) by using a learning parameter that decays appropriately toward zero (Ljung, 1977; Kushner and Clark, 1978).

On the other hand, consider biological neural nets. Human beings, of course, are able to continually learn throughout their entire lifetime. In fact, human learning proceeds on two different time scales: humans learn with age (very large time scale adaptation/learning) and are also capable of discovering regularities and attending to details (short time scale learning). This constant tendency to learn accounts for the adaptability of natural neural systems to a changing environment. Therefore, it is clear that the learning processes in biological neural networks do not allow for asymptotically vanishing learning parameters. In order for artificial neural networks to be capable of adapting to a changing (nonstationary) environment, the learning parameter must take a constant nonzero value. The larger the learning parameter, the faster the response of the network to the changing environment. On the other hand, a large learning parameter has a negative effect on the accuracy of the network's representation of the environment at a given time; a large learning parameter gives rise to large fluctuations around the desired optimal representation. In practice, though, one might be willing to trade some degree of fluctuation about the optimal representation (solution) for adaptability to a nonstationary process. Similar ideas have been proposed in connection with stochastic adaptive linear filtering, where an adaptive algorithm with a constant step size is used because it has the advantage of a limited memory, which enables it to track time fluctuations in the incoming data. These ideas date back to Wiener (1956) in connection with his work on linear prediction theory.

4.3.1 Simple Hebbian Learning

We have already analyzed one version of the Hebbian learning rule in the previous section.
However, here we consider the simplest form of Hebbian learning, which is given by Equation (4.2.18) with γ = 0; namely,

(4.3.2)

The above equation is a continuous-time version of the unsupervised Hebbian learning rule introduced in Chapter 3. Employing Equation (4.3.1), we get the approximate average learning equation

(4.3.3)

In Equation (4.3.3) and in the remainder of this chapter, the subscript x in < >x is dropped in order to simplify notation. Now, the average gradient of J in Equation (4.3.3) may be written as

(4.3.4) from which we may determine the average instantaneous criterion function

(4.3.5)

Note that <J> is minimized by maximizing the variance of the unit's output. Again, C = <xxT> is the autocorrelation matrix, which is positive semidefinite, having orthonormal eigenvectors c(i) with corresponding eigenvalues λi; that is, Cc(i) = λic(i) for i = 1, 2, ..., n. The dynamics of Equation (4.3.3) are unstable. To see this, we first find the equilibrium points by setting the average weight change to zero, giving Cw* = 0 or, for positive definite C, w* = 0. This equilibrium is unstable because the Hessian (in an average sense) H = -C is non-positive-definite for all w. Therefore, Equation (4.3.3) is unstable, and ||w|| grows without bound. Note, however, that the direction of w is not random; it will tend to point in the direction of c(1), the eigenvector with the largest corresponding eigenvalue, since for a fixed weight vector magnitude <J> is minimized when w is parallel to c(1).
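The two claims above (divergence of the norm, convergence of the direction toward c(1)) can be checked by Euler integration of the average simple Hebbian dynamics dw/dt = Cw for an illustrative autocorrelation matrix:

```python
import numpy as np

C = np.array([[1.0, 0.3],
              [0.3, 0.4]])                 # illustrative autocorrelation matrix
c1 = np.linalg.eigh(C)[1][:, -1]           # principal eigenvector c(1)

w = np.array([0.0, 1.0])
dt = 0.01
n0 = np.linalg.norm(w)
for _ in range(3000):
    w = w + dt * (C @ w)                   # Euler step of dw/dt = C w

growth = np.linalg.norm(w) / n0
alignment = abs(w @ c1) / np.linalg.norm(w)
print(growth > 100, round(float(alignment), 3))
```

The norm explodes while the normalized weight vector locks onto c(1), exactly the behavior the stabilized rules of the next sections are designed to tame.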

In the following, we will characterize other versions of the Hebbian learning rule, some of which were introduced in Chapter 3. These rules are well behaved, and hence solve the divergence problem encountered with simple Hebbian learning. To simplify mathematical notation and terminology, the following sections will use J, ∇J, and H to designate the averaged quantities <J>, <∇J>, and <H>, respectively. We will refer to J simply as the criterion function, to ∇J as the gradient of J, and to H as the Hessian of J. Also, the quantity w in the following average equations should be interpreted as the state of the average learning equation.

4.3.2 Improved Hebbian Learning

Consider the criterion function

(4.3.6)

It is a well-established property of quadratic forms that if w is constrained to the surface of the unit hypersphere, then Equation (4.3.6) is minimized when w = c(1) (e.g., see Johnson and Wichern, 1988). Also, for any real symmetric n × n matrix A, the Rayleigh quotient

r(w) = (wTAw)/(wTw)

satisfies λn ≤ r(w) ≤ λ1, where λ1 and λn are the largest and smallest eigenvalues of A, respectively. Let us start from the above criterion and derive an average learning equation. Employing Equation (4.3.1), we get
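The Rayleigh quotient bounds can be spot-checked numerically for an illustrative symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])            # real symmetric (illustrative)
lam = np.linalg.eigvalsh(A)                # eigenvalues, sorted ascending

def rayleigh(A, w):
    return float(w @ A @ w) / float(w @ w)

# lambda_n <= r(w) <= lambda_1 for every w != 0, with the bounds attained
# at the corresponding eigenvectors.
samples = [rayleigh(A, rng.normal(size=3)) for _ in range(1000)]
print(lam[0] <= min(samples) and max(samples) <= lam[-1])
```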

(4.3.7) which can be shown to be the average version of the nonlinear stochastic learning rule

(4.3.8)

If we heuristically set wTw = 1 in the two terms of the above equation, Equation (4.3.8) reduces to the continuous version of Oja's rule [refer to Equation (3.3.6)]. Let us continue with the characterization of Equation (4.3.8) and defer the analysis of Oja's rule to Section 4.3.3. At equilibrium, Equation (4.3.7) gives

(4.3.9)

Hence, the equilibria of Equation (4.3.7) are the solutions of Equation (4.3.9), given by w* = c(i), i = 1, 2, ..., n. Here, J takes its smallest value at w* = c(1), which can be easily verified by direct substitution in Equation (4.3.6).

Next, consider the Hessian of J at w* = c(i) (assuming, without loss of generality, μ = 1) and multiply it by c(j); namely, H(c(i)) c(j). It can be shown that this quantity is given by (see Problem 4.3.1; for a reference on matrix differential calculus, the reader is referred to the book by Magnus and Neudecker, 1988):

(4.3.10)

This equation implies that H(w*) has the same eigenvectors as C but with different eigenvalues. H(w*) is positive semidefinite only when w* = c(1). Thus, by following the dynamics of Equation (4.3.7), w will eventually point in the direction of c(1) (since none of the other directions c(i) is stable). Although the direction of w will eventually stabilize, it is entirely possible for ||w|| to approach infinity, and Equation (4.3.7) will appear never to converge. We may artificially constrain ||w|| to finite values by normalizing w after each update of Equation (4.3.8). Alternatively, we may set the two wTw terms in Equation (4.3.8) equal to 1. This latter case is considered next.

4.3.3 Oja's Rule

Oja's rule was defined by Equation (3.3.6). Its continuous-time version is given by the nonlinear stochastic equation

(4.3.11) The corresponding average learning equation is thus (Oja, 1982, 1989)

(4.3.12) which has its equilibria at w satisfying

(4.3.13) The solutions of Equation (4.3.13) are w* = c(i), i = 1, 2, ..., n. All of these equilibria are unstable except for w* = c(1) (or -c(1)). This can be seen by noting that the Hessian

(4.3.14) is positive definite only at w* = c(1) (or -c(1)). Note that Equation (4.3.14) is derived starting from the relation H(w) = -∂(dw/dt)/∂w, with dw/dt given in Equation (4.3.12). Although J itself is not known, the positive definiteness of H can be seen from

(4.3.15) and by noting that the eigenvalues of C satisfy λ1 > λ2 ≥ ... ≥ λn (we assume λ1 ≠ λ2). Therefore, Oja's rule is equivalent to a stable version of the Hebbian rule given in Equation (4.3.8). A formal derivation of Oja's rule is explored in Problem 4.3.7.

A single unit employing Oja's rule (Oja's unit) is equivalent to a linear matched filter. To see this, assume that for all x, x = x̄ + v, where x̄ is a fixed vector (without loss of generality, let ||x̄|| = 1) and v is a vector of symmetrically distributed zero-mean noise with uncorrelated components having variance σ2. Then C = x̄x̄T + σ2I. The largest eigenvalue of C is 1 + σ2, and the corresponding eigenvector is x̄. Oja's unit then becomes a matched filter for the data, since w* = x̄ in Equation (4.3.12). Here, the unit responds maximally to the data mean. Further characterization of Oja's rule can be found in Xu (1993).

Oja's rule is interesting because it results in a local learning rule which is biologically plausible. The locality property is seen by considering the component weight adaptation rule of Equation (4.3.11), namely

(4.3.16) and by noting that the change in the ith weight is not an "explicit" function of any other weight except the ith weight itself. Of course, y does depend on w via y = wTx. However, this dependence does not violate the concept of locality.

It is also interesting to note that Oja's rule is similar to Hebbian learning with weight decay as in Equation (4.2.18). For Oja's rule, though, the growth in ||w|| is controlled by a "forgetting" or weight decay term, -y2w, which has nonlinear gain; the forgetting becomes stronger with stronger response, thus preventing ||w|| from diverging.
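The self-stabilizing effect of the -y2w term can be checked numerically. The following is a minimal sketch of the discrete-time Oja update (the data set, dimension, and learning rate below are illustrative assumptions, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data in R^5 with one dominant variance direction (an assumption
# chosen so the principal eigenvector is unambiguous).
e1 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
X = rng.normal(size=(200, 1)) * e1 + rng.normal(0.0, 0.2, size=(200, 5))

eta = 0.01                      # small constant learning rate
w = rng.normal(size=5)          # arbitrary initial weights

# Discrete-time Oja rule: w <- w + eta*(y*x - y^2*w), with y = w'x.
# The -y^2*w "forgetting" term keeps ||w|| from diverging.
for _ in range(200):            # epochs over the training set
    for x in X:
        y = w @ x
        w += eta * (y * x - y * y * w)

# Compare with the principal eigenvector of the sample autocorrelation matrix.
C = X.T @ X / len(X)
_, vecs = np.linalg.eigh(C)
c1 = vecs[:, -1]                # eigenvector of the largest eigenvalue
cos_angle = abs(w @ c1) / np.linalg.norm(w)
```

After training, w points (up to sign) along the principal eigenvector and ||w|| settles near unity, as the analysis above predicts.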

Example 4.3.1: This example shows typical simulation results comparing the evolution of the weight vector w according to the stochastic Oja rule in Equation (4.3.11) and its corresponding average rule in Equation (4.3.12). Consider a training set {x} of forty 15-dimensional column vectors with independent random components generated by a normal distribution N(0, 1). In the following simulations, the autocorrelation matrix of the training vectors has the following set of eigenvalues: {2.561, 2.254, 2.081, 1.786, 1.358, 1.252, 1.121, 0.963, 0.745, 0.633, 0.500, 0.460, 0.357, 0.288, 0.238}. During training, one of the forty vectors is selected at random and is used in the learning rule to compute the next weight vector. Discretized versions of Equations (4.3.11) and (4.3.12) are used, where dw/dt is replaced by wk+1 - wk. A learning rate η = 0.005 is used. This is equivalent to integrating these equations with Euler's method (e.g., see Gerald, 1978) using a time step Δt = 0.005 and unity gain. The initial weight vector is set equal to one of the training vectors. Figures 4.3.1 (a) and (b) show the evolution of the cosine of the angle between wk and c(1) and the evolution of the norm of wk, respectively. The solid line corresponds to the stochastic rule and the dashed line corresponds to the average rule.


Figure 4.3.1. (a) Evolution of the cosine of the angle between the weight vector wk and the principal eigenvector of the autocorrelation matrix C for the stochastic Oja rule (solid line) and for the average Oja rule (dashed line). (b) Evolution of the magnitude of the weight vector. The training set consists of forty 15-dimensional real-valued vectors whose components are independently generated according to a normal distribution N(0, 1). The presentation order of the training vectors is random during training.

The above simulation is repeated, but with a fixed presentation order of the training set. Results are shown in Figures 4.3.2 (a) and (b). Note that the results for the average learning equation (dashed line) are identical in both simulations, since they are not affected by the order of presentation of input vectors. These simulations agree with the theoretical results on the appropriateness of using the average learning equation to approximate the limiting behavior of its corresponding stochastic learning equation. Note that a monotonically decreasing learning rate (say, proportional to 1/k, with k ≥ 1) can be used to force the convergence of the direction of wk in the first simulation. It is also interesting to note that better approximations are possible when the training vectors are presented in a fixed deterministic order (or in a random order, but with each vector guaranteed to be selected once every training cycle of m = 40 presentations). Here, a sufficiently small, constant learning rate is sufficient for making the average dynamics approximate, in a practical sense, the stochastic dynamics for all time.
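The Euler discretization of the average rule described in this example can be sketched as follows (the random matrix below is an illustrative stand-in; it is not the data set used to produce the figures):

```python
import numpy as np

rng = np.random.default_rng(1)

# An assumed training set of forty 15-dimensional N(0, 1) vectors,
# as in the example's setting.
X = rng.normal(size=(40, 15))
C = X.T @ X / len(X)             # autocorrelation matrix

# Average Oja rule, Euler-discretized with time step dt:
#   w_{k+1} = w_k + dt * (C w - (w'Cw) w)
dt = 0.005
w = X[0].copy()                  # initialize at a training vector
for _ in range(20000):
    w += dt * (C @ w - (w @ C @ w) * w)

# The average dynamics are deterministic and converge to a unit-norm
# principal eigenvector of C.
_, vecs = np.linalg.eigh(C)
c1 = vecs[:, -1]
cos_angle = abs(w @ c1) / np.linalg.norm(w)
```

Because the average rule contains no sampling noise, its trajectory is smooth; the stochastic rule fluctuates around it, with the fluctuation size controlled by the learning rate.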


Figure 4.3.2. (a) Evolution of the cosine of the angle between the weight vector wk and the principal eigenvector of the autocorrelation matrix C for the stochastic Oja rule (solid line) and for the average Oja rule (dashed line). (b) Evolution of the magnitude of the weight vector. The training set consists of forty 15-dimensional real-valued vectors whose components are independently generated according to a normal distribution N(0, 1). The presentation order of the training vectors is fixed.

4.3.4 Yuille et al. Rule

The continuous-time version of the Yuille et al. (1989) learning rule is

(4.3.17) and the corresponding average learning equation is

(4.3.18) with equilibria at

(4.3.19) From Equation (4.3.18), the gradient of J is

(4.3.20) which leads to the Hessian

(4.3.21) Note that one could have also computed H directly from the known criterion function

(4.3.22) Now, evaluating H(wi*)wj* we get

(4.3.23) which implies that the wj* are eigenvectors of H(wi*) with eigenvalues λi - λj for i ≠ j and 2λi for i = j. Therefore, H(wi*) is positive definite if and only if λi > λj for all j ≠ i; that is, for i = 1. In this case, w* = ±√λ1 c(1) are the only stable equilibria, and the dynamics of the stochastic equation are expected to approach w*.

4.3.5 Hassoun's Rule

In the following, the unsupervised Hebbian-type learning rule

(4.3.24) with ||w|| ≠ 0 is analyzed. Another way of stabilizing Equation (4.3.2) is to start from a criterion function that explicitly penalizes the divergence of w. For example, if we desire ||w|| = 1 to be satisfied, we may utilize J given by

(4.3.25) with β > 0. It can be easily shown that steepest gradient descent on the above criterion function leads to the learning equation

(4.3.26) Equation (4.3.26) is the average learning equation for the stochastic rule of Equation (4.3.24). Its equilibria are solutions of the equation

(4.3.27) Thus, it can be seen that the solution vectors of Equation (4.3.27), denoted by w*, must be parallel to one of the eigenvectors of C, say c(i), i = 1, 2, ..., n; that is,

(4.3.28) where λi is the ith eigenvalue of C. From Equation (4.3.28), it can be seen that the norm of the ith equilibrium state is given by

(4.3.29) which requires β > λi for all i. Note that if β >> λi, then the norm of each equilibrium state approaches one. Therefore, the equilibria of Equation (4.3.26) approach unity-norm eigenvectors of the correlation matrix C when β is large. Like the Yuille et al. rule, this rule preserves the information about the size of λ1 [λ1 can be computed from Equation (4.3.29)].

Next, we investigate the stability of these equilibria. Starting from the Hessian

(4.3.30) we have

(4.3.31) which implies that H(w*) is positive definite if and only if w* is parallel to c(1) and β > λ1. Therefore, the only stable equilibria of Equation (4.3.26) are the ones parallel to c(1), which approach ±c(1) for β >> λ1.

Finally, it is interesting to note that, for the case β >> 1 and η = 1, the discrete-time version of Equation (4.3.24) reduces to

(4.3.32) which is very similar to the discrete-time simple Hebbian learning rule with weight vector normalization introduced in Chapter 3 and expressed here as

(4.3.33) Note that the weight vector in Equation (4.3.32) need not be normalized to prevent divergence.

In concluding this section, another interpretation of the regularization effects on the stabilization of Hebbian-type rules is presented. In the stochastic learning Equations (4.3.11), (4.3.17), and (4.3.24), regularization appears as weight decay terms of the form -y2w, -||w||2w, and a related nonlinear decay in w, respectively. Here, one may think of weight decay as a way of stabilizing unstable learning rules. However, it has been shown earlier that a simple positive constant decay gain in Equation (4.2.18) does not stabilize Hebb's rule; on the other hand, the nonlinear dynamic gains y2 and ||w||2 do lead to stability. Note that the weight decay gain in Oja's rule utilizes more information, in the form of y2, than that of the Yuille et al. rule or Hassoun's rule; the regularization in these latter rules is only a function of the current weight vector magnitude. Therefore, one should carefully design the gain coefficient in the weight decay term for proper performance.

In practice, discrete-time versions of the stochastic learning rules in Equations (4.3.11), (4.3.17), and (4.3.24) are used, where dw/dt is replaced by wk+1 - wk and w(t) by wk. Here, the stability of these discrete-time stochastic dynamical systems critically depends on the value of the learning rate η. Although the stability analysis is difficult, one can resort to the discrete-time versions of the average learning equations corresponding to these rules and derive conditions on η for asymptotic convergence (in the mean) in the neighborhood of equilibrium states w*. Such analysis is explored in Problems 4.3.8 and 4.3.9.

4.4 Principal Component Analysis (PCA)

The PCA network of Chapter 3, employing Sanger's rule, is analyzed in this section. We will assume, without any loss of generality, that m = 2, where m is the number of units in the PCA network. Recalling Sanger's rule from Chapter 3 and writing it in vector form for the continuous-time case, with i = 1, 2, ..., m, we get

(4.4.1) This leads to the following set of coupled learning equations for the two units:

(4.4.2) and

(4.4.3) Equation (4.4.2) is Oja's rule. It is independent of unit 2 and thus converges, asymptotically, to the principal eigenvector of the autocorrelation matrix of the input data (assuming zero-mean input vectors). Equation (4.4.3) is Oja's rule with an added inhibitory term due to unit 1. For clarity, we will assume a sequential operation of the two-unit net, where unit 1 is allowed to fully converge before evolving unit 2. This mode of operation is permissible since unit 1 is independent of unit 2. In fact, the weight vectors wi approach their final values simultaneously, not one at a time; the simultaneous evolution of the wi is advantageous since it leads to faster learning than if the units are trained one at a time. Though the unit-by-unit description presented here helps simplify the explanation of the PCA net behavior, the above analysis still applies.

With the sequential update assumption, Equation (4.4.3) becomes

(4.4.4) For clarity, we will drop the subscript on w. Now, the average learning equation for unit 2 is given by

(4.4.5) which has equilibria satisfying

(4.4.6) Hence, w* = 0 and the eigenvectors ±c(i), i = 2, 3, ..., n, are solutions. The Hessian is given by

(4.4.7) Since H is not positive definite at w* = 0, the equilibrium w* = 0 is not stable. For the remaining equilibria we have the Hessian matrix

(4.4.8) which is positive definite only at ±c(2). Thus, assuming λ2 > λ3, Equation (4.4.5) converges asymptotically to the unique stable vector w* = ±c(2), which is the eigenvector of C with the second largest eigenvalue λ2. Similarly, for a network with m interacting units trained according to Equation (4.4.1), the ith unit (i = 1, 2, ..., m) will extract the ith eigenvector of C.
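A discrete-time sketch of a two-unit Sanger network follows (the diagonal-covariance data, rates, and sizes are illustrative assumptions chosen so the first two principal directions are unambiguous):

```python
import numpy as np

rng = np.random.default_rng(2)

# Data with well-separated variances along the first two coordinates,
# so c(1) = e1 and c(2) = e2 up to sign.
std = np.array([3.0, 1.5, 0.5, 0.5])
X = rng.normal(size=(500, 4)) * std

eta = 0.002
W = rng.normal(scale=0.1, size=(2, 4))   # two units, four inputs

# Sanger's rule (discrete time): for unit i,
#   w_i <- w_i + eta * y_i * (x - sum_{j<=i} y_j w_j)
for _ in range(100):                     # epochs
    for x in X:
        y = W @ x
        for i in range(2):
            deflate = y[:i + 1] @ W[:i + 1]
            W[i] += eta * y[i] * (x - deflate)

# Unit 1 should extract the first principal direction, unit 2 the second.
align1 = abs(W[0, 0]) / np.linalg.norm(W[0])
align2 = abs(W[1, 1]) / np.linalg.norm(W[1])
```

Note that both units are updated on every presentation, illustrating the simultaneous evolution discussed above; unit 2 still settles on the second eigenvector because unit 1's converged output deflates the first principal direction from its effective input.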
4.5 Theory of Reinforcement Learning

Recall the simplified stochastic reinforcement learning rule introduced in Chapter 3. The continuous-time version of this rule is given by

(4.5.1) from which the average learning equation is given as

(4.5.2) Next, it is shown that the right-hand side of Equation (4.5.2) is proportional to the gradient of the expected reinforcement signal (Williams, 1987; Hertz et al., 1991). Thus, we may think of Equation (4.5.2) as implementing a gradient search on an average instantaneous criterion function whose gradient is given by

(4.5.3) In Equations (4.5.1) through (4.5.3), the output y is generated by a stochastic unit according to the probability function P(y|w, x).

First, we express the expected (average) reinforcement signal, for the kth input vector, with respect to all possible outputs y as

(4.5.4) with the expected output given by

(4.5.5) as in Chapter 3. We then evaluate the gradient of the expected reinforcement signal with respect to w. Employing Equation (4.5.5), we have

(4.5.6) whose gradient is given by

(4.5.7) which follows from Equation (4.5.6). We also have

(4.5.8) which can be used in Equation (4.5.7) to give

(4.5.9) Now, employing Equation (4.5.9), we get

(4.5.10) If we take the gradient of Equation (4.5.4) and use Equation (4.5.10), we arrive at

(4.5.11) which can also be written as

(4.5.12) Finally, by averaging Equation (4.5.12) over all inputs xk, we arrive at

(4.5.13) where now the averages are across all patterns and all outputs. Note that this gradient is proportional to the gradient in Equation (4.5.3) and has the opposite sign. Thus, Equation (4.5.2) can be written in terms of this gradient as

(4.5.14) which implies that the average weight vector converges to a local maximum of the average reinforcement signal; i.e., the stochastic rule of Equation (4.5.1) converges, on average, to a solution that locally maximizes the average reinforcement signal. Extensions of these results to a wider class of reinforcement algorithms can be found in Williams (1987). The characterization of the associative reward-penalty algorithm of Chapter 3 is more difficult, since it does not necessarily maximize the average reinforcement signal. In general, though, the above analysis should give some insight into the behavior of simple reinforcement learning.
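A minimal sketch of such a stochastic unit is given below: a Bernoulli-logistic unit trained with a reward-inaction rule of the REINFORCE family, Δw = η r ∂ln P(y|w, x)/∂w. The task, learning rate, and trial count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Two input patterns; a "critic" pays r = 1 when the sampled output matches
# the desired response. The unit only ever sees the scalar reinforcement r.
patterns = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
targets = [1, 0]

eta = 0.5
w = np.zeros(2)

for trial in range(2000):
    k = trial % 2
    x, t = patterns[k], targets[k]
    p = sigmoid(w @ x)              # P(y = 1 | w, x)
    y = 1 if rng.random() < p else 0
    r = 1.0 if y == t else 0.0      # scalar reinforcement signal
    # For the Bernoulli-logistic unit, d ln P(y|w,x)/dw = (y - p) x.
    w += eta * r * (y - p) * x

p1 = sigmoid(w @ patterns[0])       # should approach 1
p2 = sigmoid(w @ patterns[1])       # should approach 0
```

On rewarded trials, the update pushes the unit toward repeating the action it just took, so the average reinforcement climbs toward a local maximum, consistent with the gradient argument above.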

Here.6. An alternative way of expressing J in Equation (4.2) is through the use of a "cluster membership matrix" M defined for each unit i = 1. 1988a)
(4.it does not necessarily maximize .6.4) Now.6. P(xk). However.. where each unit uses the simple continuous-time competitive rule [based on Equations (3.1) Also. n by
(4. The cluster membership matrix allows us to express the criterion function J as
(4. the above analysis should give some insight into the behavior of simple reinforcement learning.3) and (3. .6.2) where is the weight vector of the winner unit upon the presentation of the input vector xk. 2.4. 4. should be inserted inside the summation in Equation (4.3) Here.4.4) yields
(4.5)]
(4. consider the criterion function (Ritter and Schulten. In general. Two approaches are described: one deterministic and the other statistical. a probability of occurrence of xk.6.6. all vectors xk are assumed to be equally probable..
4.2).6.6.6 Theory of Simple Competitive Learning
In this section we attempt to characterize simple competitive learning. is a dynamically evolving function of k and i which specifies whether or not unit i is the winning unit upon the presentation of input xk.5)
. performing gradient descent on J in Equation (4..1 Deterministic Analysis Consider a single layer of linear units.6.

the reader is referred to Bachmann et al. Here.6.6.6.6. only in the case of "sufficiently sparse" input data points can one prove stability and convergence theorems for the stochastic (incremental) competitive learning rule (Grossberg. However. (1987) for yet another suitable criterion function. which has the ability to reduce the effects of outlier data points by proper choice of the exponent r.6. may be employed which incorporate some interesting heuristics into the competitive rule for enhancing convergence speed or for altering the underlying "similarity measure" implemented by the learning rule. while being attracted by its own cluster.2 Stochastic Analysis The following is an analysis of simple competitive learning based on the stochastic approximation technique introduced in Section 4. Criterion functions. Let P(xk) be the probability that input xk is presented on any trial. the weight normalization is assumed for all units. the average learning equation may be expressed as
(4. Consider the following normalized discrete-time competitive rule (von der Malsberg. In practice. Rumelhart and Zipser.which is the batch mode version of the learning rule in Equation (4. 1973 . This causes the winning weight vector to be repelled by input vectors in other clusters.6) preserves this weight normalization at any iteration (this was explored in Problem 3.. 1985):
(4. (1991) that the above batch mode competitive learning rule corresponds to the k-means clustering algorithm when a finite training set is used. Another example is to employ a different similarity measure (norm) in J such as the Minkowski-r norm of Equation (3. For an example.4).68).1).6. Here. Also.6)
where.4. 1976). It was noted by Hertz et al.6.g.7)
.6.3. 4.
. It can be easily verified that Equation (4. we may replace by
in Equation (4. where and 0 is a positive constant) in order to stop weight evolution at one of the local solutions. Then.5) since stochastic noise due to the random presentation order of the input patterns may kick the solution out of "poor" minima towards minima which are more optimal. The data points are sparse enough if there exists a set of clusters so that the minimum overlap within a cluster exceeds the maximum overlap between that cluster and any other cluster. which enhances convergence. a damped learning rate k is used (e.1.4). again. Other criterion functions may also be employed. The local rule of Equation (4.1) may have an advantage over the batch mode rule in Equation (4.1). other than the one in Equation (4.6.
and typically. the setting is a single layer network of linear units. the relatively large initial learning rate allows for wide exploration during the initial phase of learning. .
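The incremental winner-take-all rule with a damped learning rate can be sketched numerically as follows (cluster locations, rate schedule, and initialization are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# Two well-separated Gaussian clusters; samples are interleaved so that
# X[0] and X[1] come from different clusters (used to initialize the units).
a = rng.normal([0.0, 0.0], 0.5, size=(100, 2))
b = rng.normal([5.0, 5.0], 0.5, size=(100, 2))
X = np.empty((200, 2))
X[0::2], X[1::2] = a, b

W = np.array([X[0], X[1]])          # one weight vector per unit

# Winner-take-all rule with a damped learning rate eta_k = eta0 / (1 + k/50).
eta0, k = 0.5, 0
for _ in range(20):                  # epochs
    for x in X:
        i = np.argmin(np.linalg.norm(W - x, axis=1))   # winner unit
        W[i] += eta0 / (1.0 + k / 50.0) * (x - W[i])   # move winner toward x
        k += 1

# k-means-like outcome: each unit settles near one cluster mean.
d0 = min(np.linalg.norm(W - np.array([0.0, 0.0]), axis=1))
d1 = min(np.linalg.norm(W - np.array([5.0, 5.0]), axis=1))
```

With the damped rate, each weight vector approaches a running average of the inputs it wins, which is the k-means correspondence noted above.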

4.6.2 Stochastic Analysis

The following is an analysis of simple competitive learning based on the stochastic approximation technique introduced earlier in this chapter. Again, the setting is a single layer network of linear units. Let P(xk) be the probability that input xk is presented on any trial. Consider the following normalized discrete-time competitive rule (von der Malsburg, 1973; Rumelhart and Zipser, 1985):

(4.6.6) Here, weight normalization is assumed for all units, and it can be easily verified that Equation (4.6.6) preserves this weight normalization at any iteration (this was explored in a problem of Chapter 3). Then, the average learning equation may be expressed as

(4.6.7) in which the conditional probability that unit i wins when input xk is presented appears explicitly. Using Equation (4.6.6) in Equation (4.6.7), we get

(4.6.8) which implies that at equilibrium

(4.6.9) Therefore, the jth component of vector wi is given as

(4.6.10) We now make the following observations. First, note that the denominator of Equation (4.6.10) is the probability that unit i wins, averaged over all stimulus patterns. Note further that the numerator involves the probability that the jth bit of a pattern is active and unit i is a winner. Next, we may employ Bayes' rule and write Equation (4.6.10) as

(4.6.11) which states that, at equilibrium, wij is expected to be proportional to the conditional probability that the jth bit of input xk is active given that unit i is a winner (assuming that all patterns have the same number of active bits).

Now, upon the presentation of a new pattern, and assuming the equilibrium weight values given by Equation (4.6.11), unit i will have a weighted sum (activity) of

(4.6.12) or

(4.6.13) where the inner product term represents the overlap between the new stimulus and the kth training pattern xk. Thus, at equilibrium, a unit responds most strongly to patterns that overlap other patterns to which the unit responds, and most weakly to patterns that are far from patterns to which it responds. Note that we may express the conditional probability of winning according to the winner-take-all mechanism

(4.6.14) Because of the dependency of this probability on wi, there are, generally, many solutions which satisfy the equilibrium Equation (4.6.9). Equation (4.6.6) leads the search to one of many stable equilibrium states satisfying Equations (4.6.9) and (4.6.14). In such a state, the unit activations become stable (fluctuate minimally). A sequence of stimuli might, however, be presented in such a way as to introduce relatively large fluctuations in the wi's. In this case, the system might move to a new equilibrium state which is more stable, in the sense that it becomes unlikely to change for a very long period of time.

Rumelhart and Zipser (1985) gave a measure of the stability of an equilibrium state as the average amount by which the output of the winning units is greater than the response of all of the other units, averaged over all patterns and all clusters. This stability measure is given by

(4.6.15) where the averaging is taken over all xk, and i* is the index of the winning unit. At equilibrium, Equation (4.6.15) can be written as

(4.6.16) The larger the value of J, the more stable the system is expected to be. Maximizing J can also be viewed as maximizing the overlap among patterns within a group (cluster) while minimizing the overlap among patterns between groups. In geometric terms, J is maximized when the weight vectors point toward maximally compact stimulus (input) regions that are as distant as possible from other such regions. Thus, this is exactly what is required for the clustering of unlabeled data.

4.7 Theory of Feature Mapping

The characterization of topology-preserving feature maps has received special attention in the literature (Kohonen, 1982b; Cottrell and Fort, 1986; Ritter and Schulten, 1986 and 1988b; Geszti, 1990; Tolat, 1990; Heskes and Kappen, 1993a and 1993b; Lo et al., 1993; Kohonen, 1993a). The characterization of a general feature map is difficult, and much of the analysis has been done under simplifying assumptions. Takeuchi and Amari (1979) and Amari (1980 and 1983) have extensively studied a continuous-time dynamical version of this map to investigate the topological relation between the self-organized map and the input space governed by the density p(x). In the following, a one-dimensional version of the self-organizing feature map of Kohonen is characterized following the approach of Ritter and Schulten (1986). A continuous dynamical version of Kohonen's map is also described and analyzed.

4.7.1 Characterization of Kohonen's Feature Map

Consider the criterion function J(w) defined by (Ritter and Schulten, 1988a)

(4.7.1) where i* is the label of the winner unit upon the presentation of stimulus (input) xk and Λ is the neighborhood function introduced in Chapter 3; the choice of Λ affects the resolution and stability of the map, as well as convergence speed. It can be seen that Equation (4.7.1) is an extension of the competitive learning criterion function of Equation (4.6.2). Performing gradient descent on (4.7.1) yields

(4.7.2) which is just the batch mode version of Kohonen's self-organizing rule. Thus, Kohonen's rule is a stochastic gradient descent search and leads, on average and for small η, to a local minimum of J in Equation (4.7.1). These minima are given as solutions to

(4.7.3)

This equation is not easy to solve. However, solutions of Equation (4.7.3) are relatively easy to find for the one-dimensional map with scalar r and a given input distribution p(x). Local minima of J are topological defects like kinks in one-dimensional maps and twists in two-dimensional maps [Kohonen, 1989; Geszti, 1990]. Ideally, we desire the global minimum of the criterion function J; whether it is reached depends on the choice of Λ and the distribution p(x).

The analysis of feature maps becomes more tractable if one replaces Equation (4.7.3) with a continuous version that assumes a continuum of units and where the distribution p(x) appears explicitly, namely

(4.7.4) where r* = r*(x) is the coordinate vector of the winning unit upon the presentation of input x. An implicit partial differential equation for w can be derived from Equation (4.7.4) [Ritter and Schulten, 1986]. For the case of two- or higher-dimensional maps, no explicit solutions exist for w(r) given an arbitrary p(x); one may obtain the equilibria w* by solving Equation (4.7.4) numerically. For the one-dimensional map, however, the equilibrium w* satisfies (Ritter and Schulten, 1986), assuming a sharply peaked symmetric Λ:

(4.7.5) which, in turn, satisfies the implicit differential equation corresponding to Equation (4.7.4), given by

(4.7.6) In Equations (4.7.5) and (4.7.6), the term p(w) is given by p(x)|x=w. Equation (4.7.5) shows that the density of the units in w space around point r grows with p(w), but only as a power of p(w) smaller than one. Ideally, we would have the density proportional to p itself for zero distortion. Thus, rather than exactly verifying the density-preserving feature of the map, this result shows that a self-organizing feature map tends to undersample high-probability regions and oversample low-probability ones.

Finally, the local stability of some of these equilibria is ensured (with a probability approaching 1) if the learning coefficient η = η(t) is sufficiently small, positive, and decays according to the following necessary and sufficient conditions (Ritter and Schulten, 1988b):

(4.7.7a) and

(4.7.7b)
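A one-dimensional Kohonen map matching this setting can be sketched as follows (scalar inputs, a chain of units; the schedule and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)

n_units = 10
pos = np.arange(n_units)             # unit coordinates r along the chain
w = rng.uniform(0.4, 0.6, n_units)   # scalar weights, unordered at start

steps = 4000
for t in range(steps):
    x = rng.uniform()                # p(x): uniform on [0, 1]
    i_star = np.argmin(np.abs(w - x))            # winner unit r*
    sigma = 3.0 * (0.5 / 3.0) ** (t / steps)     # shrinking neighborhood width
    Lam = np.exp(-((pos - i_star) ** 2) / (2 * sigma ** 2))
    w += 0.3 * Lam * (x - w)         # Kohonen's rule with neighborhood Lam

# The trained chain spreads over the input range; with a wide initial
# neighborhood it typically also unfolds into an ordered, kink-free map.
spread = w.max() - w.min()
```

The shrinking neighborhood width plays the role of the annealed Λ discussed above: wide early (untangling the map) and narrow late (refining unit positions).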

4.7.2 Self-Organizing Neural Fields

Consider a continuum of units arranged as an infinite two-dimensional array (neural field). Each point (unit) on this array may be represented by a position vector r and has an associated potential u(r). Associated with each unit r is a set of input weights w(r) and a set of lateral weights that depend only on the displacement r - r'. Here, a unit at position r makes excitatory connections with all of its neighbors located within a certain distance from r, and makes inhibitory connections with all other units. This lateral weight distribution is thus assumed to be of the on-center off-surround type, as shown in Figure 4.7.1 for the one-dimensional case. The output of the unit at r is assumed to be a nonlinear function of its potential, y(r) = f[u(r)], where f is either a monotonically non-decreasing positive saturating activation function or a step function.

Figure 4.7.1. One-dimensional plot of a neural field's lateral weight distribution.

The dynamics of the neural field potential u(r, t) are given by

(4.7.8) where

(4.7.9) and h is a constant bias field. In Equation (4.7.8), it is assumed that the potential u(r, t) decays with time constant τ to the resting potential h in the absence of any stimulation. Also, it is assumed that this potential increases in proportion to the total stimuli s(r, x), which is the sum of the lateral stimuli and the input stimuli wTx due to the input signal x ∈ Rn. The input signal (pattern) x is a random time sequence, and it is assumed that a pattern x is chosen according to probability density p(x).

Stable equilibrium solutions are the potential fields u(r) which the neural field can retain persistently under a constant input x. An initial excitation pattern applied to the neural field changes according to the dynamics given in Equation (4.7.8) and eventually converges to one of the equilibrium solutions, if any. The equilibria of Equation (4.7.8) must satisfy du/dt = 0, or

(4.7.10) where s(r, u*) is the total stimuli at equilibrium. When the lateral connections distribution is strongly off-surround inhibitory, then, given any x, only a local excitation pattern is aroused as a stable equilibrium which satisfies Equation (4.7.10) (Amari, 1990). Here, a local excitation is a pattern where the excitation is concentrated on units in a small local region; that is, u*(r) is positive only for a small neighborhood centered at a maximally excited unit r0. Thus, u*(r, x) represents a mapping from the input space onto the neural field.

Let us now look at the dynamics of the self-organizing process. We start by assuming a particular update rule for the input weights of the neural field. One biologically plausible update rule is the Hebbian rule:

(4.7.11) Here, we assume that inputs are applied to the neural field for a time duration which is longer than the time constant τ of the neural field potential, while the duration of stimulus x is assumed to be much shorter than the time constant τ' of the weights w; that is, the rate of change of the weights is assumed to be much slower than that of the neural field potential. Thus, the potential distribution u(r, t) can be considered to change in a quasi-equilibrium manner, denoted by u(r, x). A conceptual diagram for the above neural field is shown in Figure 4.7.2.

Figure 4.7.2. Self-organizing neural field.
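The quasi-equilibrium behavior of Equation (4.7.8) can be illustrated with a crude one-dimensional discretization (the kernel shape, bias, and stimulus below are illustrative assumptions, with f taken as a step function):

```python
import numpy as np

n = 101
pos = np.arange(n)

# On-center off-surround lateral kernel (the Figure 4.7.1 shape; the
# amplitudes and widths are assumed values).
d = pos[:, None] - pos[None, :]
kernel = 0.8 * np.exp(-(d / 3.0) ** 2) - 0.3 * np.exp(-(d / 10.0) ** 2)

h = -0.5                                        # constant bias field
s_in = 2.0 * np.exp(-((pos - 50) / 5.0) ** 2)   # input stimulus w'x, peaked at r0 = 50

u = np.zeros(n)
tau, dt = 1.0, 0.2
for _ in range(300):
    y = (u > 0).astype(float)                   # f(u) = step(u)
    u += (dt / tau) * (-u + kernel @ y + s_in + h)

active = int((u > 0).sum())                     # width of the local excitation
peak = int(np.argmax(u))                        # location of maximum potential
```

The field settles into a local excitation centered at the maximally stimulated unit, with the surround suppressed below resting level, mirroring the a-solution behavior discussed next.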

The dynamics of the potential field in Equation (4.7.8) for a one-dimensional neural field were analyzed in detail by Amari (1977b) and Kishimoto and Amari (1979), for a step activation function and for a continuous monotonically non-decreasing activation function, respectively. In general, it is difficult to solve Equation (4.7.8). However, some properties of the formation of feature maps are revealed for the special, but revealing, case of a one-dimensional neural field (Takeuchi and Amari, 1979; Amari, 1980 and 1983).

Here, we use the earlier assumption τ' >> τ and assume strong mixing in Equation (4.7.11). One then arrives at the average differential equation (absorbing τ' into η)

(4.7.12) where the averaging is over all possible x and the neural field's equilibrium output activity due to input x appears through u*(r, x). The equilibria of Equation (4.7.12) are given by

(4.7.13) If we now transpose Equation (4.7.12) and multiply it by an arbitrary input vector x', we arrive at an equation for the change in input stimuli

(4.7.14) The vector inner product xTx' represents the similarity of two input signals x and x', and hence the topology of the signal space (Takeuchi and Amari, 1979). Note how Equation (4.7.14) relates the topology of the input stimulus set {x} with that of the neural field.

On the other hand, if one assumes a learning rule where a unit r updates its input weight vector in proportion to the correlation of its equilibrium potential u*(r, x) and the difference x - w(r), one obtains

(4.7.15) This learning rule is equivalent to the averaged continuum version of Kohonen's self-organizing feature map in Equation (4.7.2), with the equilibrium potential distribution u*(r, x) playing the role of the weighting neighborhood function Λ.

Next, consider the equilibria of the field potential itself. It was shown that, with the lateral weight distribution of Figure 4.7.1 and f(u) = step(u), there exist stable equilibrium solutions u*(r) for the x = 0 case. Among these are the 0-solution potential field, u*(r) ≤ 0; the ∞-solution field, u*(r) ≥ 0; and local excitations (also known as a-solutions), where u*(r) is positive only over a finite interval [a1, a2] of the neural field, with a = a2 - a1. The 0-solution is stable if and only if h < 0. An a-solution exists if and only if h + A(a) = 0, and the ∞-solution is stable if and only if h > -2A(∞). Here, A(a) is the definite integral defined by

(4.7.16)

The width a of this active region shrinks monotonically as the bias field h is lowered, and the corresponding active region of the neural field is centered at the unit r receiving the maximum input. Thus, one may exploit the fact that the field potential/neighborhood function u*(r, x) is controlled by the bias field h in order to control the convergence of the self-organizing process. Typically, the uniform bias field h is started at a positive value h > −2A(∞) and is slowly decreased towards negative values. This causes u* to start at the ∞-solution, then gradually move through a-solutions with decreasing width a until, ultimately, u* becomes the 0-solution; the active region therefore acts as a natural "transient" neighborhood function. Furthermore, Amari (see also Krekelberg and Kok, 1993) showed that a single a-solution can exist for the case of a non-zero input stimulus. For further analysis of the self-organizing process in a neural field, the reader is referred to Zhang (1991).

Kohonen (1993a and 1993b) proposed a self-organizing map model for which he gives physiological justification. The model is similar to Amari's self-organizing neural field, except that it uses a discrete two-dimensional array of units. The model assumes sharp self-on, off-surround lateral interconnections, so that the neural activity of the map is stabilized where the unit receiving the maximum excitation becomes active and all other units are inactive. Kohonen uses a learning equation more complex than those in Equations (4.7.12) and (4.7.13); this equation is given for the ith unit weight vector by the pseudo-Hebbian learning rule of Equation (4.7.17), in which a term modeling "forgetting" effects acts as a stabilizing term. In Hebb's learning rule, forgetting effects in a weight wij are proportional to the weight wij itself. In addition, if the disturbance due to adjacent synapses is mediated through the postsynaptic potential, the forgetting effect must further be proportional to the postsynaptic activity. In Equation (4.7.17) this forgetting term models a natural "transient" neighborhood function: the output activity yi of the cell is replaced by a weighted sum of the output activities yl of nearby units, where the coefficient hil, which describes the strength of the diffuse chemical effect of cell l on cell i, is a function of the distance between these units. Here, the summation is taken over a subset of the synapses of unit i that are located near the jth synapse wij and that approximately act as one collectively interacting set.

Figure 4.7.3. A plot of A(a) of Equation (4.7.16).
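The effect of a shrinking transient neighborhood can be demonstrated with a minimal one-dimensional self-organizing map sketch. This is an illustration I constructed, not Kohonen's model itself: the Gaussian neighborhood plays the role of the contracting active region, and its width and the learning rate are annealed over training (all schedule constants are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

n_units = 20
W = rng.uniform(0.0, 1.0, size=n_units)   # 1-D map with scalar weights

def neighborhood(winner, width):
    # Gaussian activity profile centered on the maximally excited unit,
    # standing in for the transient active region of the field
    idx = np.arange(n_units)
    return np.exp(-0.5 * ((idx - winner) / width) ** 2)

n_steps = 3000
for t in range(n_steps):
    x = rng.uniform(0.0, 1.0)                      # scalar input sample
    winner = int(np.argmin(np.abs(W - x)))         # maximally excited unit
    frac = t / n_steps
    width = 5.0 * (0.05 / 5.0) ** frac             # shrinking neighborhood width
    lr = 0.5 * (0.02 / 0.5) ** frac                # decaying learning rate
    W += lr * neighborhood(winner, width) * (x - W)
```

After training, the weights are (up to reflection) topologically ordered along the map, the expected outcome of the self-organizing process.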

The major difference between the learning rules in Equations (4.7.17) and (4.7.12) is that in the former the neighborhood term is nonuniform and time varying, whereas in the latter it is determined by a stable region of neural field potential. In Equation (4.7.17) the "neighborhood" function is determined by a "transient" activity due to a diffusive chemical effect of nearby cell potentials, which makes the analysis of Equation (4.7.17) much more difficult. The following analysis, though, is general in nature, and thus it holds for a wide range of neural network paradigms. Under the assumption that the index r ranges over all components of the input signal x, and regarding the modulating activity term as a positive scalar independent of w and x, the vector form of Equation (4.7.17) takes the Riccati differential equation form

dw/dt = αx − βyw,   y = wTx        (4.7.18)

where α, β > 0. In general, the solution for the direction of w cannot be determined in closed form from the deterministic differential Equation (4.7.18), independent of the input signal x. However, multiplying both sides of Equation (4.7.18) by 2wT leads to the differential equation

d(wTw)/dt = 2y(α − β wTw)        (4.7.19)

Thus wTw converges to α/β for arbitrary x with wTx > 0. From the above analysis, it can be concluded that the synaptic weight vector w is automatically normalized to the length √(α/β), independent of the input signal x. Now, a solution for the expected value of w may be found if Equation (4.7.18) is treated as a stochastic differential equation with strong mixing, in accordance with the discussion of Section 4.3. Taking the expected value of both sides of Equation (4.7.18) and solving for its equilibrium points (by setting <dw/dt> = 0) gives

w* = √(α/β) <x> / ||<x>||        (4.7.20)

Furthermore, this equilibrium point can be shown to be stable (Kohonen, 1989). Thus w converges to the length √(α/β), and it rotates such that its average direction is aligned with the mean of x. This is the expected result of a self-organizing map when a uniform nondecreasing neighborhood function is used.
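The self-normalizing behavior of the Riccati-type rule can be checked directly by simulation. The sketch below (the gain constants and the input statistics are arbitrary illustrative choices) integrates dw/dt = αx − βyw by the Euler method with noisy inputs satisfying wTx > 0, and confirms that ||w|| approaches √(α/β) while the direction of w aligns with the mean of x:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 2.0, 0.5            # hypothetical gains; equilibrium norm sqrt(alpha/beta) = 2
mean_x = np.array([1.0, 0.5, 0.25])
w = np.array([0.1, 0.1, 0.1])     # small initial weight vector

dt = 0.001
for _ in range(100_000):
    x = mean_x + 0.1 * rng.standard_normal(3)   # inputs keeping w.T @ x > 0
    y = w @ x
    w += dt * (alpha * x - beta * y * w)        # Riccati-form learning rule
```

The norm constraint emerges automatically from the forgetting term; no explicit renormalization step is performed anywhere in the loop.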
4.8 Generalization

In Chapter 2, we analyzed the capabilities of some neural network architectures for realizing arbitrary mappings. In particular, we found that a feedforward neural net with a single hidden layer having an arbitrary number of sigmoidal activation units is capable of approximating any mapping (or continuous multivariate function) to within any desired degree of accuracy. However, the results of Chapter 2 on the computational capabilities of layered neural networks say nothing about the synthesis/learning procedure needed to set the interconnection weights of these networks. What remains to be seen is whether such networks are capable of finding the necessary weight configuration for a given mapping by employing a suitable learning algorithm. Later chapters in this book address the question of learning in specific neural network architectures by extending appropriate learning rules covered in Chapter 3. In the remainder of this chapter, we address two important issues related to learning in neural networks: generalization and complexity.

4.8.1 Generalization Capabilities of Deterministic Networks

One important performance measure of trainable neural networks is the size of the training set needed to bound their generalization error below some specified number. Generalization is measured by the ability of a trained network to generate the correct output for a new randomly chosen input belonging to the same probability density p(x) governing the training set. In this section, two cases are considered: average generalization and worst-case generalization. This section also considers the generalization capabilities of stochastic neural networks.

Schwartz et al. (1990) gave a theoretical framework for calculating the average probability of correct generalization for a neural net trained with a training set of size m. The following analysis is based on that framework, and it also draws on some clarifications given by Hertz et al. (1991). The main result of this analysis is rather surprising: one can calculate the average probability of correct generalization for any training set of size m if one knows a certain function that can (in theory) be calculated before training begins. Here, the averaging is over all possible networks (of fixed architecture) consistent with the training set, and the only assumptions about the network are that it is deterministic (employs deterministic units) and that it is a universal architecture (or faithful model) of the class of functions/mappings being learned. However, it should be kept in mind that this result is only meaningful when interpreted in an average sense, and does not necessarily represent the typical situation encountered in a specific training scheme.

Consider a class of networks with a certain fixed architecture specified by the number of layers, the number of units within each layer, and the interconnectivity pattern between layers. Each network is represented as a point w in weight space which implements a function fw(x). It is assumed that these functions are of the form f: Rn → {0, 1}, but the ideas can be extended to multiple and/or continuous-valued outputs as well. Let us define the quantity V0 as

V0 = ∫ dw ρ(w)        (4.8.1)

which stands for the total "volume" of the weight space; here, ρ(w) is some a priori weight probability density function. We may now partition the weight space into a set of disjoint regions, one for each function fw that this class of networks can implement. The volume of the region of weight space that implements a particular function f(x) is given by

V(f) = ∫ dw ρ(w) I[fw ≡ f]        (4.8.2)

where I[fw ≡ f] equals 1 if fw(x) = f(x) for all inputs x, and 0 otherwise. Each time an example {xk, fd(xk)} of a desired function fd is presented and is successfully learned (supervised learning is assumed), the weight vector w is modified so that it enters the region of weight space that is compatible with the presented example. If m examples are learned, then the volume of this region is given by

Vm = ∫ dw ρ(w) ∏_{k=1}^{m} I[fw(xk) = fd(xk)]        (4.8.3)

The region Vm represents the total volume of weight space which realizes the desired function fd, as well as all other functions f that agree with fd on the training set. Thus, if a new input is presented to the trained network, it will be ambiguous with respect to a number of functions represented by Vm (recall the discussion of ambiguity for the simple case of a single threshold gate in Section 1.5). As the number of learned examples m is increased, the expected ambiguity decreases.

Let us define the probability Pm(f) that a particular function f can be implemented after training on m examples of fd; good generalization requires that Pm(f) be small for functions f that differ appreciably from fd. This probability is equal to the average fraction of the remaining weight space that f occupies. Now, assuming independent input vectors xk generated randomly from the distribution p(x), the volume of weight space consistent with both the training examples and a "particular" function f is given by

Vm(f) = V(f) ∏_{k=1}^{m} I(f, xk)        (4.8.4)

where Equation (4.8.2) was used, and I(f, xk) equals 1 if f(xk) = fd(xk) and 0 otherwise. Note that fw in I has been replaced by f, and that the product term factors outside of the integral. Define the generalization ability g(f) of f as the probability that, for an input x randomly chosen from p(x), f(x) agrees with fd(x):

g(f) = Prob[f(x) = fd(x)]        (4.8.5)

The quantity g(f) takes on values between 0 and 1. As an example, for a completely specified n-input Boolean function fd, g(f) is given by (assuming that all 2^n inputs are equally likely)

g(f) = 1 − dH(f, fd)/2^n

where dH(f, fd) is the number of bits by which f and fd differ. Since the factors I(f, xk) in Equation (4.8.4) are independent, averaging Vm(f) over all xk gives

<Vm(f)> = V(f) [g(f)]^m        (4.8.6)

The resulting approximation is based on the assumption that Vm does not vary much with the particular training sequence; this assumption is expected to be valid as long as m is small compared to the total number of possible input combinations. Let us use Equation (4.8.6) to compute the distribution of generalization ability g(f) across all possible f's after successful training with m examples:

ρm(g) ≈ g^m ρ0(g) / ∫ g^m ρ0(g) dg        (4.8.7)

Equation (4.8.7) shows that the distribution ρm(g) tends to get concentrated at higher and higher values of g as more and more examples are learned; thus, although the allowed volume of weight (or function) space shrinks during learning, the remaining regions tend to have large generalization ability. Another useful measure of generalization is the average generalization ability G(m) given by

G(m) = ∫ g^{m+1} ρ0(g) dg / ∫ g^m ρ0(g) dg        (4.8.8)

which is the ratio between the (m + 1)st and the mth moments of ρ0(g), and can be computed if ρ0(g) is given or estimated. We may also define the average prediction error as 1 − G(m). G(m) gives the entire "learning curve"; i.e., it gives the average expected success rate as a function of m. The above result is interesting, since it allows us to compute, before learning, the distribution of generalization ability after training with m examples; the form of Equation (4.8.8) also allows us to predict the number of examples, m, necessary to train the network to a desired average generalization performance.

The asymptotic behavior (m → ∞) of the average prediction error is determined by the form of the initial distribution ρ0(g) near g = 1. If a finite gap exists between g = 1 and the next highest g for which ρ0(g) is nonzero, then the prediction error decays to zero exponentially in m. If, on the other hand, there is no such gap in ρ0(g), then the prediction error decays only as a power law in 1/m. These two behaviors of the learning curve have been verified through numerical experiments, and such gaps have been detected in experiments involving the learning of binary mappings (Cohn and Tesauro, 1991 and 1992). The nature of the gap in the distribution of generalization abilities near the region of perfect generalization (g = 1) is not completely understood.
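The two asymptotic behaviors of the learning curve can be reproduced numerically from Equation (4.8.8). The sketch below uses two invented initial distributions ρ0(g) — a uniform density with no gap at g = 1, and a density cut off at g = 0.9 plus a discrete probability mass at g = 1 (a gap) — and computes G(m) as a ratio of moments:

```python
import numpy as np

g = np.linspace(0.0, 1.0, 20001)

def integrate(y):
    # trapezoidal rule on the fixed grid g
    return float(np.sum((y[1:] + y[:-1]) * np.diff(g)) / 2.0)

def G(m, rho0):
    # Eq. (4.8.8): ratio of the (m+1)st to the mth moment of rho0(g)
    return integrate(g ** (m + 1) * rho0) / integrate(g ** m * rho0)

# Case 1: uniform rho0 on [0, 1] -- no gap at g = 1.
# Analytically G(m) = (m+1)/(m+2), so the prediction error 1 - G(m) = 1/(m+2): a power law.
rho_no_gap = np.ones_like(g)

# Case 2: continuous part only below g = 0.9, plus probability mass p1 at g = 1 (a gap).
rho_gap = np.where(g <= 0.9, 1.0, 0.0)
p1 = 0.05
def G_gap(m):
    return (p1 + integrate(g ** (m + 1) * rho_gap)) / (p1 + integrate(g ** m * rho_gap))
```

Doubling m roughly halves the error in the no-gap case, while in the gap case the error collapses by orders of magnitude, matching the exponential-versus-power-law dichotomy described above.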

It is speculated that such a gap could be due to the dynamic effects of the learning process, where the learning algorithm may, for some reason, avoid the observed near-perfect solutions. Another possibility is that the gap is inherent in the nature of the binary mappings themselves. The above approach, though theoretically interesting, is of little practical use for estimating m, since it requires knowledge of the distribution ρ0(g), whose estimation is computationally expensive. It also gives results that are only valid in an average sense, and does not necessarily represent the typical situation encountered in a specific training scheme.

Next, we summarize a result that tells us about the generalization ability of a deterministic feedforward neural network in the "worst" case. Here, the case of learning a binary-valued output function f: Rn → {0, 1} is treated. Consider a set of m labeled training example pairs (x, y) selected randomly from some arbitrary probability distribution p(x, y), with x ∈ Rn and y = f(x) ∈ {0, 1}. As an example, consider a single-hidden-layer feedforward neural net with k LTG's and d weights that has been trained on the m examples so that at least a fraction 1 − ε/2 of the examples are correctly classified. Then, with a probability approaching 1, this network will correctly classify a fraction 1 − ε of future random test examples drawn from p(x, y), as long as

m ≥ (32d/ε) ln(32k/ε)        (4.8.9)

(Baum and Haussler, 1989). Ignoring the log term, we may write Equation (4.8.9), to a first-order approximation, as

m = O(d/ε)        (4.8.10)

which requires m >> d for good generalization; what matters is the ratio of the number of degrees of freedom (weights) to the training set size. It is interesting to note that this is the same condition for "good" generalization (low ambiguity) for a single LTG derived by Cover (1965) (refer to Section 1.5) and obtained empirically by Widrow (1987). One may also note that, in the limit of large m, the architecture of the network is not important in determining the worst-case generalization behavior. It should be noted, however, that the architecture of the net can play an important role in determining the speed of convergence of a given class of learning methods; the size and architecture of the network and the learning scheme all play a role in determining generalization quality (see the next chapter for more details). Similar results for worst-case generalization are reported in Blumer et al. (1989). A more general learning curve theory based on statistical physics and VC dimension theories (Vapnik and Chervonenkis, 1971), which applies to a general class of networks, can be found in Haussler et al. (1992). However, none of the above theories may hold for the case of a small training set. For generalization results with noisy target signals, the reader is referred to Amari et al. (1992).

4.8.2 Generalization in Stochastic Networks

This section deals with the asymptotic learning behavior of a general stochastic learning dichotomy machine (classifier). The results in this section are based on the work of Amari and Murata (1993). Consider a parametric family of stochastic machines, where a machine is specified by a d-dimensional parameter vector w such that the probability of output y, for a given input x ∈ Rn, is specified by P(y | x, w). As an example, one may assume the machine to be a stochastic multilayer neural network parameterized by a weight vector w ∈ Rd which, given an input x, emits a binary output with probability specified by Equations (4.8.11) and (4.8.12); here, for instance, P(y = 1 | x, w) may be taken as the logistic function 1/(1 + exp[−g(x, w)]), where g(x, w) is a smooth deterministic function (e.g., a superposition of the multivariate sigmoid functions typically employed in layered neural nets). We desire a relation between the generalization error and the training error in terms of the number of free parameters of the machine (the machine complexity) and the size of the training set.
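The worst-case sample-size requirement of Equation (4.8.9) is easy to evaluate. The short sketch below (the network sizes are hypothetical, chosen only for illustration) compares it with the first-order estimate of Equation (4.8.10):

```python
import math

def baum_haussler_m(d, k, eps):
    # Eq. (4.8.9): m >= (32 d / eps) * ln(32 k / eps)
    return (32.0 * d / eps) * math.log(32.0 * k / eps)

# A single-hidden-layer net with n = 20 inputs and k = 10 LTG's:
# d = 10 * (20 + 1) + (10 + 1) = 221 weights and thresholds.
d, k = 221, 10
m_worst = baum_haussler_m(d, k, 0.1)   # worst-case sample size for eps = 0.1
m_rough = d / 0.1                      # first-order estimate, Eq. (4.8.10)
```

Even for this small net, the worst-case bound demands several hundred thousand examples, far above the d/ε estimate — illustrating how pessimistic worst-case guarantees are relative to average-case behavior.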

Assume that there exists a true machine that can be faithfully represented by one of the above family of stochastic machines, with parameter w0; the stochastic nature of the machine is determined by its stochastic output unit. The true machine receives inputs xk, k = 1, 2, ..., m, which are randomly generated according to a fixed but unknown probability distribution p(x), and emits the corresponding outputs yk. The maximum likelihood estimator ŵ (refer to Section 3.1.5 for its definition), and the machine it characterizes, will be our first candidate trained machine. An entropic loss function, the negative log-likelihood −ln P(y | x, w), is used to evaluate the performance of a trained machine on a new example (xm+1, ym+1). Let εgen(m) be the average predictive entropy (also known as the average entropic loss) of a trained machine parameterized by ŵ for a new example (xm+1, ym+1):

εgen(m) = <−ln P(ym+1 | xm+1, ŵ)>        (4.8.13)

Similarly, we define εtrain(m) as the average entropic loss over the m training examples used to obtain ŵ:

εtrain(m) = <−(1/m) Σ_{k=1}^{m} ln P(yk | xk, ŵ)>        (4.8.14)

Finally, let H0 be the average entropic error of the true machine:

H0 = <−ln P(y | x, w0)>        (4.8.15)

Amari and Murata proved the following theorem for the training and generalization errors.

Theorem 4.8.1 (Amari and Murata, 1993): The asymptotic learning curve for the entropic training error is given by

εtrain(m) ≅ H0 − d/(2m)        (4.8.16)

and for the entropic generalization error by

εgen(m) ≅ H0 + d/(2m)        (4.8.17)

The proof of Theorem 4.8.1 uses standard techniques of asymptotic statistics and is omitted here (the reader is referred to the original paper by Amari and Murata (1993) for the proof). Note that the limiting term in Equation (4.8.16) is H0, the entropic error of the true machine. In general, H0 is unknown, but it can be eliminated from Equation (4.8.17) by substituting its value from Equation (4.8.16). This gives

εgen(m) ≅ εtrain(m) + d/m        (4.8.18)

which shows that, for a faithful stochastic machine and in the limit of m >> d, the generalization error approaches the training error of the trained machine on m examples. The term d/m in Equation (4.8.18) may be viewed as the probability of an ambiguous response on the (m + 1)st input when the training error is zero. The result is also in agreement with Cover's result on classifier ambiguity (Section 1.5). Again, the particular network architecture is of no importance here, as long as it allows for a faithful realization of the true machine and m >> d. Amari (1993) also proved that the average predictive entropy εgen(m) for a general deterministic dichotomy machine (e.g., a feedforward neural net classifier) converges to 0 in the limit m → ∞.
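Theorem 4.8.1 can be checked by Monte Carlo simulation in the smallest possible case: a machine with a single parameter (d = 1) that emits y = 1 with probability w, trained by maximum likelihood (here, the sample mean). The sketch below is my own illustration, not an experiment from the text; it estimates the average entropic training and generalization losses and compares them with H0 ∓ 1/(2m):

```python
import numpy as np

rng = np.random.default_rng(7)
p0, m, trials = 0.3, 200, 20000

# Entropic error of the true machine (d = 1 Bernoulli machine)
H0 = -(p0 * np.log(p0) + (1 - p0) * np.log(1 - p0))

# Each trial: m Bernoulli(p0) outputs; the MLE of w is the sample mean.
counts = rng.binomial(m, p0, size=trials)
p_hat = counts / m

# Average entropic training loss: -(1/m) sum_k log P(y_k | p_hat) = empirical entropy
train = -(p_hat * np.log(p_hat) + (1 - p_hat) * np.log(1 - p_hat))
# Average entropic generalization loss on a fresh example from the true machine
gen = -(p0 * np.log(p_hat) + (1 - p0) * np.log(1 - p_hat))

train_mean, gen_mean = train.mean(), gen.mean()
# Theorem 4.8.1 with d = 1 predicts train ~ H0 - 1/(2m) and gen ~ H0 + 1/(2m)
```

The symmetric ±d/(2m) split around H0 — training error optimistically below, generalization error above — appears clearly even in this one-parameter machine.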

4.9 Complexity of Learning

This section deals with the computational complexity of learning: How much computation is required to learn (exactly, or approximately to some "acceptable" degree) an arbitrary mapping in a multilayer neural network? In other words, is there an algorithm that is computationally "efficient" for training layered neural networks? Here, we will assume that the desired learning algorithm is a supervised one, which implies that the training set is labeled. Also, we will assume that the neural network has an arbitrary architecture but with no feedback connections.

Learning in artificial neural networks is hard. Judd (1987, 1990) showed that the learning problem in neural networks is NP-complete, even for approximate learning. Thus, in the worst case, loading of an arbitrary mapping onto a "faithful" neural network architecture requires exponential time, irrespective of the learning algorithm used (batch or adaptive); that is, as the problem size increases (as the input pattern dimension or the number of input patterns increases), the training time scales up exponentially, and we will not be able to do much better than just randomly exhausting all combinations of weight settings to see if one happens to work, even if a solution exists. Also, it has been shown (Blum and Rivest, 1989) that training a simple n-input, three-unit, two-layer net of LTG's can be NP-complete in the worst case when learning a given set of examples. Blum and Rivest (1992) extend this result to the case of Boolean functions; they also showed that learning Boolean functions with a two-layer feedforward network of k hidden units (k bounded by some polynomial in n) and one output unit (which computes the AND function) is NP-complete. Similarly, consider the class of functions defined on a collection of m arbitrary points in Rn. It has been shown that the problem of deciding whether there exist two hyperplanes that separate them is NP-complete (Megiddo, 1986); hence, the training of a net with two hidden n-input LTG's and a single output LTG on examples of such functions is exponential in time, in the worst case.

However, these theoretical results do not rule out the possibility of finding polynomial-time algorithms for the training of certain classes of problems on certain carefully selected architectures. For example, the class of linearly separable mappings can be trained in polynomial time if single-layer LTG nets are employed (only a single unit is needed if the mapping has a single output). This is easy to prove, since one can use linear programming (Karmarkar, 1984) to compute the weights and thresholds of such nets in polynomial time. The efficiency of learning linearly separable classification tasks in a single threshold gate should not be surprising. We may recall from Chapter 1 that the average amount of necessary and sufficient information for the characterization of the set of separating surfaces for a random, separable dichotomy of m points grows slowly with m and asymptotically approaches 2d (twice the number of degrees of freedom of the class of separating surfaces). This implies that, for a random set of linear inequalities in d unknowns, the expected number of extreme inequalities which are necessary and sufficient to cover the whole set tends to 2d as the number of consistent inequalities tends to infinity; this limit is well within the learning capacity of a single d-input LTG, thus bounding the (expected) necessary number of training examples for learning algorithms in separable problems.

Moreover, Blum and Rivest (1992) gave an example of two networks trained on the same task, such that training the first is NP-complete while the second can be trained in polynomial time. One can use this fact to construct layered networks that have polynomial learning-time complexity for certain classes of nonlinearly separable mappings. This is illustrated next. Consider a set F of nonlinearly separable functions f(x) which has the following two properties: (1) there exists at least one layered neural net architecture for which loading m training pairs {x, yd} of f(x) is NP-complete; and (2) there exists a fixed dimensionality expansion process D that maps points x in Rn to points z in Rd, such that d is bounded by some polynomial in n [e.g., d = O(n²)], and such that the m training examples {z, yd} representing f(x) in the expanded space Rd are linearly separable. This set F is not empty; Blum and Rivest (1992) gave examples of functions in F. Figure 4.9.1 depicts a layered architecture which can realize any function in F. Here, a fixed preprocessing layer, labeled D in Figure 4.9.1, implements the above dimensionality expansion process, and the output node is a d-input LTG. It can be easily shown that the learning complexity of this network for functions in F is polynomial. This can be seen by noting that the training of the trainable part of this network (the output LTG) has polynomial complexity for the m linearly separable examples in Rd, and that as n increases, d remains polynomial in n.

Figure 4.9.1. A layered architecture consisting of a fixed preprocessing layer D followed by an adaptive LTG.
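The dimensionality-expansion idea can be made concrete with the classic XOR mapping, which no single LTG can realize in R² but which becomes linearly separable after a fixed quadratic expansion. The sketch below uses D: (x1, x2) → (x1, x2, x1·x2) — my own illustrative choice, not the specific construction of Blum and Rivest — and then loads the task into a single adaptive LTG with perceptron learning:

```python
import numpy as np

# XOR training set in bipolar {-1, +1} coding
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1], dtype=float)   # XOR of the two sign bits

def expand(x):
    # Fixed preprocessing layer D: R^2 -> R^3, z = (x1, x2, x1*x2)
    return np.array([x[0], x[1], x[0] * x[1]])

Z = np.array([expand(x) for x in X])        # expanded (linearly separable) patterns

# Perceptron learning on the trainable output LTG
w = np.zeros(3)
b = 0.0
for epoch in range(100):
    errors = 0
    for z, t in zip(Z, y):
        if t * (w @ z + b) <= 0:            # misclassified: apply perceptron update
            w += t * z
            b += t
            errors += 1
    if errors == 0:
        break

pred = np.sign(Z @ w + b)                   # reproduces y exactly
```

Since the expanded patterns are separable, perceptron convergence is guaranteed; the price paid is the extra weight introduced by D, in line with the trade-off discussed next.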

By using the network in Figure 4.9.1, we gain in a worst-case computational sense, but lose in that the number of weights increases from n to O(n²) or higher. This increase in the number of weights implies that the number of training examples must increase if the network is to generalize meaningfully on new examples (recall the results of the previous section); thus, there is a trade-off. Another intuitive reason that the network in Figure 4.9.1 is easier to train than a fully adaptive two-layer feedforward net is that we are giving it predefined nonlinearities: the former net does not have to start from scratch, but instead is given more powerful building blocks to work with.

The problem of NP-complete learning in multilayer neural networks may also be attributed to the use of fixed network resources (Baum, 1989). Learning an arbitrary mapping can be achieved in polynomial time by a network that allocates new computational units as more patterns are learned. Along these lines, Mukhopadhyay et al. (1993) gave a polynomial-time training algorithm for a general class of classification problems, based on clustering and linear programming models. The basic idea of this method is to cover class regions with a minimal number of dynamically allocated hyperquadratic volumes (e.g., hyperspheres) of varying size. The algorithm simultaneously designs and trains an appropriate network for a given classification task. The resulting network has a layered structure consisting of a simple fixed preprocessing layer, a hidden layer of LTG's, and an output layer of logical OR gates. This and other efficiently trainable nets are considered in detail in Section 6.3.

4.10 Summary
Learning in artificial neural networks is viewed as a search for parameters (weights) which optimize a predefined criterion function. A general learning equation is presented which implements a stochastic steepest gradient descent search on a general criterion function with (or without) a regularization term. This learning equation serves to unify a wide variety of learning rules, regardless of whether they are supervised, unsupervised, or reinforcement rules. The learning equation is a first-order stochastic differential equation, which allows us to employ an averaging technique to study its equilibria and its convergence characteristics. The use of averaging, under reasonable assumptions, allows us to approximate the stochastic learning equation by a deterministic first-order dynamical system. In most cases, a well-defined criterion function exists which allows us to treat the deterministic system as a gradient system; this enables us to exploit the global stability property of gradient systems and determine the nature of the solutions evolved by the average learning equation. These stable solutions are then taken to be the possible solutions sought by the associated stochastic learning equation. The averaging technique is employed in characterizing several basic rules for supervised, unsupervised, and reinforcement learning. In particular, we present analysis and insights into the theory of Hebbian, competitive, and self-organizing learning, and self-organizing neural fields are introduced and analyzed.

The chapter also looks at some important results on generalization of learning in general feedforward neural architectures. Generalization in the average and in the worst case are considered, and the asymptotic behavior of the generalization error is derived for deterministic and stochastic networks. The main result here is that the number of training examples necessary for "good" generalization on test samples must far exceed the number of adjustable parameters of the network. Finally, the issue of the complexity of learning in neural networks is addressed. It is found that learning an arbitrary mapping in a layered neural network is NP-complete in the worst case. However, efficient (polynomial-time) learning is possible if appropriate network architectures and corresponding learning algorithms are adopted for certain classes of mappings/learning tasks.

*
Problems

4.1.1 Identify the "regularization" term, if any, in the learning rules listed in Table 3.1 of Chapter 3.

4.2.1 Characterize the LMS learning rule with weight decay by analyzing its corresponding average differential equation, as in Section 4.2. Find the underlying instantaneous criterion J and its expected value <J>.

4.2.2 Study the stability of the equilibrium point(s) of the dynamical system dx/dt = f(x, a). Is each such point asymptotically stable? Why? (Hint: Simulate the dynamical system for the three cases a < 0, a = 0, and a > 0, for an initial condition x(0) of your choice.)
*
4. but not asymptotically.3 Study the stability of the lossless pendulum system with the nonlinear dynamics
about the equilibrium points x* = [0 0]T and x* = [ 0]T. 4. Find the underlying instantaneous criterion J and its expected value . and L is the length of the pendulum.3. that is. it is also found that efficient (polynomial-time) learning is possible if appropriate network architectures and corresponding learning algorithms are found for certain classes of mappings/learning tasks. Demonstrate this fact for the nonlinear system
at the equilibrium point x* = [0. if all eigenvalues of the system matrix A have nonpositive real parts.2. However.2. in the learning rules listed in Table 3. Here.2 about the equilibrium point x* = [0 1]T and write the system equations in the form . the issue of complexity of learning in neural networks is addressed. 4.1 Show that the Hessian of J in Equation (4. (Note that the asymptotic stability of a linear system requires the eigenvalues of its system matrix A to have strictly negative real parts). where J(w) is given by
.3. Linearize the second order nonlinear system in Problem 4. Show that the system matrix A is identical to the Jacobian matrix f '(x) at x = [0 1]T of the original nonlinear system and thus both matrices have the same eigenvalues. for an initial condition x(0) of your choice).2 Study the stability of the equilibrium point(s) of the dynamical system .4 Liapunov's first method for studying the stability of nonlinear dynamical systems (see footnote #4) is equivalent to studying the asymptotic stability of a linearized version of these systems about an equilibrium point.2.2.5 The linearization method for studying the stability of a nonlinear system at a given equilibrium point may fail when the linearized system is stable.2.6) is given by
Also.1.2.1 Identify the "regularization" term. if any. measures the angle of the pendulum with respect to its vertical rest position.2. and a > 0.3.1 of Chapter 3.1 Characterize the LMS learning rule with weight decay by analyzing its corresponding average differential equation as in Section 4. 4. 4. 0]T.2 Employ Liapunov's first method (see footnote #4) to study the stability of the nonlinear system
for the equilibrium point x* = [0 1]T. It is found that learning an arbitrary mapping in a layered neural network is NP-complete in the worst-case.Finally. g is gravitational acceleration. a = 0. and if the real part of one or more eigenvalues of A has zero real part.

4.3.5 Study the stability of the equilibrium points of the stochastic differential equation/learning rule of Riedel and Schild (1992).
†
4.3.3 Verify the Hessian matrices given in Section 4.3 for Oja's, Yuille et al.'s, and Hassoun's learning rules.

4.3.4 Show that the average learning equation for Hassoun's rule is given by Equation (4.3.30).

4.3.6 Study, in an average sense and via numerical simulations, the stability of the learning rule of Riedel and Schild (1992), where ν is a positive integer. Is there a relation between the stable point(s) w* (if such points exist) and the eigenvectors of the input data autocorrelation matrix? Is this learning rule local? Why? (Note: this rule is equivalent to Yuille et al.'s rule for ν = 2.)

4.3.7 Show that the discrete-time version of Oja's rule is a good approximation of the normalized Hebbian rule in Equation (4.3.33) for small η values. (Hint: Start by expanding the normalization factor in the normalized Hebbian rule to first order in η.)
*
4.3.8 Consider the general learning rule described by the following discrete-time gradient system:

w(k+1) = w(k) − η∇J(w(k))        (1)

with η > 0.
a. Assume that w* is an equilibrium point of this dynamical system. Show that, in the neighborhood of w*, the gradient ∇J(w) can be approximated as

∇J(w) ≈ H(w*)(w − w*)        (2)

where H(w*) is the Hessian of J evaluated at w*.
b. Show that the gradient in Equation (2) is exact when J(w) is quadratic, i.e., when J(w) = a + bTw + (1/2)wTQw, where Q is a symmetric matrix and b is a vector of constants. (Such a quadratic criterion arises, for example, in connection with the average learning equation of Linsker's learning rule; see Chapter 3.)
c. Show that the linearized gradient system at w* is given by

w(k+1) − w* = [I − ηH(w*)](w(k) − w*)        (3)

where I is the identity matrix.
d. What are the conditions on H(w*) and η for local asymptotic stability of w* in Equation (3)?
e. Use the above results to show that the μ-LMS rule of Chapter 3 converges asymptotically, in an average sense, to its equilibrium solution if 0 < η < 2/λmax, where λmax is the largest eigenvalue of the input autocorrelation matrix C. (Hint: Start with the gradient system in Equation (1), with J taken as the mean-square-error criterion.)
f. Show that η < 2/tr(C) is a sufficient condition for convergence of the μ-LMS rule. (Hint: The trace of a matrix, the sum of all of its diagonal elements, is equal to the sum of its eigenvalues.)

44) for J).24). respectively) to extract the principal eigenvector of this training set. (Hint: The trace of a matrix (the sum of all diagonal elements) is equal to the sum of its eigenvalues). 4.4. = 100.1 Show that Equation (4. 1989).
†
4. the "averaged" trajectories of w(t) are obtained by taking the expected value of . and give a justification for the choice = 1 which has led to Equation (4.14) to find the range of values for for which the discrete-time Oja's rule is stable (in an average sense). c.7) is the Hessian for the criterion function implied by Equation (4. Repeat using the corresponding discrete-time average learning equations with the same learning parameters and initial weight vector as before and compare the two sets of simulations.wk in Equations (4.3. Furthermore.005. (4.3. b.1. Show that.
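The stability condition in the gradient-system problem above can be checked numerically. The sketch below (sizes and rates are illustrative) runs steepest descent on the quadratic J(w) = (1/2)wTCw and shows that iterates shrink for a step size inside 0 < η < 2/λmax and blow up outside it:

```python
import numpy as np

# Illustrative check that steepest descent on a quadratic criterion
# J(w) = 0.5 * w^T C w converges iff 0 < eta < 2/lambda_max(C).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))          # 20 random vectors in R^10, N(0,1) entries
C = X.T @ X / len(X)                   # autocorrelation matrix
lam_max = np.linalg.eigvalsh(C).max()

def descend(eta, steps=200):
    w = rng.normal(size=10)
    for _ in range(steps):
        w = w - eta * (C @ w)          # w_{k+1} = w_k - eta * grad J(w_k)
    return np.linalg.norm(w)

print(descend(1.0 / lam_max))   # inside the stable range: norm decays toward 0
print(descend(2.5 / lam_max))   # outside the range: norm grows without bound
```

The mechanism is exactly the linearization of part b: the iteration matrix is I − ηC, whose eigenvalues 1 − ηλi all lie inside the unit circle precisely when η < 2/λmax.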
*
b. Repeat part a for Hassoun's rule, which has the Hessian matrix given in Section 4.4.
4.3.10 Consider a training set of 40 15-dimensional vectors whose components are independently generated according to a normal distribution N(0, 1). Employ the stochastic discrete-time versions of Oja's, Yuille et al.'s, and Hassoun's rules (replace the time derivative of w by wk+1 − wk in the corresponding differential equations) to extract the principal eigenvector of this training set. Use a fixed presentation order of the training vectors and a random initial w. Compare the convergence behavior of the three learning rules by generating plots similar to those in Figures 4.3.1 (a) and (b).
4.3.11 This problem illustrates an alternative approach to the one of Section 4.3.3 for proving the stability of the equilibrium points of an average learning equation (Kohonen, 1989). Consider a stochastic first-order differential equation of the form

where x(t) is governed by a stationary stochastic process. Assume that the vectors x are statistically independent from each other and that strong mixing exists, so that the "averaged" trajectories of w(t) are obtained by taking the expected value of the stochastic trajectories.
a. Let z be an arbitrary constant vector having the same dimension as x and w. Assume that η > 0, that y = wTx, and that g(y) is an arbitrary scalar function of y whose expectation exists. Show that

where θ is the angle between the vectors z and w.
b. Now, let z = c(i), the ith unity-norm eigenvector of the autocorrelation matrix C = <xxT>; note that the c(i)'s are the equilibria of Oja's rule. Show that the average rate of change of the cosine of the angle between w and c(i) for Oja's rule [Equation (4.3.12)] is given by

where λi is the eigenvalue associated with c(i).
c. Use the result in part b to show that if w(0) is not orthogonal to c(1), then w(t) will converge to the solution w* = c(1), where c(1) is the eigenvector with the largest eigenvalue λ1. (Hint: Recall the bounds on the Rayleigh quotient given in Section 4.3.2.)
4.3.12 Use the technique outlined in Problem 4.3.11 to study the convergence properties (in the average) of the following stochastic learning rules, which employ a generalized forgetting law:
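The Oja's-rule simulation problems above can be sketched as follows. This is an illustrative run (learning rate, iteration counts, and initialization are assumptions, not the text's prescribed values) showing that the weight vector aligns with the principal eigenvector c(1) of the data autocorrelation matrix:

```python
import numpy as np

# Hedged sketch of extracting the principal eigenvector with Oja's rule.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 15))            # 40 vectors in R^15 with N(0,1) components
C = X.T @ X / len(X)                     # autocorrelation matrix C = <x x^T>
_, vecs = np.linalg.eigh(C)
c1 = vecs[:, -1]                         # principal eigenvector c(1)

w = rng.normal(size=15)
w /= np.linalg.norm(w)                   # random unit-norm initial weight vector
eta = 0.005
for _ in range(200):                     # fixed presentation order, many cycles
    for x in X:
        y = w @ x
        w += eta * y * (x - y * w)       # Oja's rule: dw = eta * (y*x - y^2 * w)

cos = abs(w @ c1) / np.linalg.norm(w)    # |cos(angle between w and c(1))|
print(cos)                               # approaches 1 as w aligns with c(1)
```

Consistent with Problem 4.3.11, the cosine of the angle between w and c(1) drifts toward 1, and the norm of w self-normalizes near unity.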

4.6.1 Study (qualitatively) the competitive learning behavior which minimizes the criterion function

where the winning-unit index is as defined in Equation (4.6.2). Can you think of a physical system (for some integer value of N) which is governed by this "energy" function J?
4.6.2 Derive a stochastic competitive learning rule whose corresponding average learning equation maximizes the criterion function in Equation (4.6.4).
4.7.1 Show that, for the one-dimensional feature map, Equation (4.7.5) satisfies Equation (4.7.6).
4.7.2 Solve Equation (4.7.5) for p(x) proportional to x. For which input distribution p(x) do we have a zero-distortion feature map?
* 4.7.3 Prove the stability of the equilibrium point in Equation (4.7.16). (Hint: Employ the technique outlined in Problem 4.3.11.) See Hertz et al. (1991) for hints.
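A minimal code sketch of competitive learning may help here. The snippet uses the standard winner-take-all rule dw = η(x − w) applied to the winning (closest) unit only; it is an illustration, not necessarily the exact criterion-based rule of the problems above, and the cluster locations, η, and epoch count are assumptions:

```python
import numpy as np

# Winner-take-all competitive learning on two well-separated input clusters.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2.0, 0.3, size=(100, 2)),   # cluster A
               rng.normal(+2.0, 0.3, size=(100, 2))])  # cluster B
rng.shuffle(X)

W = X[rng.choice(len(X), size=2, replace=False)].copy()  # init from samples
eta = 0.05
for _ in range(20):
    for x in X:
        i = np.argmin(np.linalg.norm(W - x, axis=1))     # winning (closest) unit
        W[i] += eta * (x - W[i])                          # move winner toward x

print(np.sort(W[:, 0]))      # one weight vector settles near each cluster
```

Initializing the weight vectors from training samples (rather than at random) is a common guard against "dead units" that never win the competition.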

Adaptive Multilayer Neural Networks I

We devote most of this chapter to the study of backprop and its variations; it is one of the most frequently used learning rules in many applications of artificial neural networks. Several significant applications of backprop-trained multilayer neural networks are described, including function approximation, time-series prediction, and the recognition of hand-written zip codes. Several methods for improving backprop's convergence speed and avoidance of local minima are presented, and a version of backprop based on an enhanced criterion function with global search capability is described. The last part of this chapter deals with extensions of backprop to more general neural network architectures. These include multilayer feedforward nets whose inputs are generated by a tapped delay-line circuit, and fully recurrent neural networks. These adaptive networks are capable of extending the applicability of artificial neural networks to nonlinear dynamical system modeling and temporal pattern association.

5.1 Learning Rule for Multilayer Feedforward Neural Networks

Consider the two-layer feedforward architecture shown in Figure 5.1, with differentiable activation function units. The network input is a vector x in Rn+1, where x0 is a bias signal equal to 1. The output of the hidden layer is a (J+1)-dimensional real-valued vector z, z = [z0, z1, ..., zJ]T; here, z0 = 1 represents a bias input and can be thought of as being generated by a "dummy" unit (with index zero) whose output z0 is clamped at 1. The vector z supplies the input for the output layer of L units. The output layer generates an L-dimensional vector y in response to the input x which, when the network is fully trained, should be identical (or very close) to a "desired" output vector d associated with x.
Backprop provides a computationally efficient method for changing the weights in a feedforward network to learn a training set of input-output examples. Backpropagation is a gradient descent search algorithm which may suffer from slow convergence to local minima; when properly tuned, however, it allows for relatively fast convergence to good solutions. Backprop-trained multilayer neural nets have been applied successfully to solve some difficult and diverse problems, such as pattern classification, nonlinear system modeling, medical diagnosis, autonomous vehicle navigation, and image compression. These applications include the conversion of English text into speech and the mapping of hand gestures to speech. The backprop learning rule is central to much current work on learning in artificial neural networks, and for these reasons it is treated in detail in this chapter, together with its extensions; whenever possible, theoretical justification is given for these methods. The layer receiving the input signal is called the hidden layer. Figure 5.
.1 shows a hidden layer having J units. x1. The resulting learning rule is commonly known as error back propagation (or backprop). This set of signals constitutes an input vector x.0 Introduction
This chapter extends the gradient descent-based delta rule of Chapter 3 to multilayer feedforward neural networks. the development of backprop is one of the main reasons for the renewed interest in artificial neural networks. xn}.. to learn a training set of input-output examples. The layer receiving the input signal is called the hidden layer. mapping hand gestures to speech. For these reasons. This network receives a set of scalar signals {x0. These applications include the conversion of English text into speech. Figure 5. . and its extensions. Whenever possible. theoretical justification is given for these methods. when properly tuned.. medical diagnosis. nonlinear system modeling.
5. which.

This idea has been invented independently by Bryson and Ho (1969), Amari (1967, 1968), Werbos (1974), and Parker (1985).

Figure 5.1. A two-layer fully interconnected feedforward neural network architecture. For clarity, only selected connections are drawn.

We denote by wji the weight of the jth hidden unit associated with the input signal xi; similarly, wlj is the weight of the lth output unit associated with the hidden signal zj. The activation function fh of the hidden units is assumed to be a differentiable nonlinear function; typically, fh is the logistic function defined by fh(net) = 1/(1 + e^(−λ net)), or a hyperbolic tangent function fh(net) = tanh(λ net), with values for the slope parameter λ close to unity. It is important to note that if fh is linear, then one can always collapse the net in Figure 5.1 to a single-layer net and thus lose the universal approximation/mapping capabilities discussed in Chapter 2. Each unit of the output layer is assumed to have the same activation function, denoted fo; the functional form of fo is determined by the desired output signal/pattern representation or the type of application. For instance, if the desired output is real-valued (as in some function approximation applications), then a linear activation fo(net) = net may be used. On the other hand, if the network implements a pattern classifier with binary outputs, then a saturating nonlinearity similar to fh may be used for fo. In this case, the components of the desired output vector d must be chosen within the range of fo.

Next, consider a set of m input/output pairs {xk, dk}, where dk is an L-dimensional vector representing the desired network output upon the presentation of xk. The objective here is to adaptively adjust the J(n + 1) + L(J + 1) weights of this network such that the underlying function/mapping represented by the training set is approximated or learned. Since the learning here is supervised, i.e., target outputs are available, we may define an error function to measure the degree of approximation for any given setting of the network's weights. A commonly used error function is the SSE measure, but this is by no means the only possibility; later in this chapter, several other error functions will be discussed. Once a suitable error function is formulated, it serves as a criterion function, and the learning algorithm seeks to minimize it over the space of possible weight settings. That is, learning can be viewed (as was done in Chapters 3 and 4) as an optimization process, and if a differentiable criterion function is used, gradient descent on such a function will naturally lead to a learning rule. Next, we illustrate the above idea by deriving a supervised learning rule for adjusting the weights wji and wlj such that the following error function is minimized (in a local sense) over the training set (Rumelhart et al., 1986b):

(5.1.1)   E(w) = (1/2) Σl (dl − yl)²

Note that Equation (5.1.1) is the "instantaneous" SSE criterion of Equation (3.32), generalized for a multiple-output network. Here, w represents the set of all weights in the network.
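The forward pass and the instantaneous SSE criterion described above can be sketched directly. In this illustration the layer sizes, weight scales, and target vector are assumptions; hidden units are logistic and output units are linear, matching one of the activation choices discussed in the text:

```python
import numpy as np

# Forward pass and instantaneous SSE for a two-layer feedforward net.
def logistic(net, lam=1.0):            # f_h(net) = 1/(1 + exp(-lam*net))
    return 1.0 / (1.0 + np.exp(-lam * net))

n, J, L = 4, 3, 2                      # inputs, hidden units, output units
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.1, size=(J, n + 1))   # hidden weights w_ji (incl. bias)
W_o = rng.normal(scale=0.1, size=(L, J + 1))   # output weights w_lj (incl. bias)

x = np.append(1.0, rng.normal(size=n))         # x_0 = 1 is the bias signal
z = np.append(1.0, logistic(W_h @ x))          # z_0 = 1 is the hidden bias
y = W_o @ z                                    # linear output units: f_o(net) = net

d = np.array([1.0, -1.0])                      # desired output vector for x
E = 0.5 * np.sum((d - y) ** 2)                 # instantaneous SSE criterion
print(E)
```

Note how the weight-count bookkeeping J(n + 1) + L(J + 1) appears directly in the two weight-matrix shapes.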

where f'o is the derivative of fo with respect to net, and the partial derivative is to be evaluated at the current weight values.

5.1.1 Error Backpropagation Learning Rule

Since the targets for the output units are explicitly specified, one can directly use the delta rule, derived in Section 3.1, for updating the wlj weights. The learning rule for the hidden-layer weights wji is not as obvious as that for the output layer, since we do not have available a set of target values (desired outputs) for the hidden units. However, one may derive a learning rule for hidden units by attempting to minimize the output-layer error. This amounts to propagating the output errors (dl − yl) back through the output layer towards the hidden units, in an attempt to estimate "dynamic" targets for these units. Such a learning rule is termed error backpropagation (or the backprop learning rule) and may be viewed as an extension of the delta rule used for updating the output layer. To complete the derivation of backprop for the hidden-layer weights, and similar to the above derivation for the output-layer weights, we perform gradient descent on the criterion function in Equation (5.1.1), but this time the gradient is calculated with respect to the hidden weights. Using the chain rule for differentiation, one may express the required partial derivative as
(5. respectively.5. To complete the derivation of backprop for the hidden layer weights.1.2)
where
is the weighted sum for the lth output unit. This amounts to propagating the output errors (dl − yl) back through the output layer towards the hidden units in an attempt to estimate "dynamic" targets for these units. and similar to the above derivation for the output layer weights.4) as
(5. one can directly use the delta rule.1 Error Backpropagation Learning Rule Since the targets for the output units are explicitly specified.1.1.1.1. one may express the partial derivative in Equation (5. Such a learning rule is termed error back-propagation or the backprop learning rule and may be viewed as an extension of the delta rule (Equation 5.6)
. However. The zj's are computed by propagating the input vector x through the hidden layer according to:
(5. the gradient is calculated with respect to the hidden weights:
(5.1. and and represent the updated (new) and current weight values.2) used for updating the output layer.1).5) with
(5. derived in Section 3. Using the chain rule for differentiation.

1. we have
(5.5) and using Equation (5. one can immediately define an "estimated target" dj for the jth hidden unit implicitly in terms of the back propagated error signal dj− j as follows: z
(5.3).1. Set the learning rates o and h to small positive values (refer to Section 5. For example. we have (5.10) It is usually possible to express the derivatives of the activation functions in Equations (5.7) and
Now.1.2) and (5.1.2 for additional details).1.1.2.1. We will refer to this learning procedure as incremental backprop or just backprop:
1.1.9) By comparing Equation (5.8) into Equation (5.1 for details). upon substituting Equations (5.(5. Initialize all weights and refer to them as "current" weights
and
.1.1.1.2.4).1.9) in terms of the activations themselves.9) to (5. (see Section 5.1.12) The above learning equations may also be extended to feedforward nets with more than one hidden layer and/or nets with connections that jump over one or more layers (see Problems 5.
2. The complete procedure for updating the weights in a feedforward neural net utilizing the above rules is summarized below for the two layer architecture of Figure 5.2).2 and 5. for the logistic activation function. we arrive at the desired learning rule:
(5.1.1.6) through (5.
.1.1.11) and for the hyperbolic tangent function.

Thus. at each time step the input vector x is drawn at random) which allows for a wider exploration of the search space and. leads to better quality solutions. and how the various units work together to generate a desired solution.e. .2) and (5. the current weights are used in these computations. 1988) or by adding noise to the input patterns (Sietsma and Dow. and (2) it makes the search path in the weight space stochastic (here.. Use the desired target dk associated with xk and employ Equation (5. When backprop converges. However.9) to compute the hidden layer weights changes . Update all weights according to hidden layers. set and
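The incremental procedure above can be sketched in code. The following is a minimal illustration (layer sizes, learning rates, the tanh-hidden/linear-output choice, and the single random pattern are assumptions, not the text's values): it performs one incremental backprop update on one pattern and verifies that the instantaneous error for that pattern decreases, as gradient descent with a small step size should guarantee.

```python
import numpy as np

# One incremental backprop step: output-layer delta rule plus back-propagated
# hidden deltas, with the tanh derivative expressed via the activations.
rng = np.random.default_rng(3)
n, J, L = 3, 4, 2
W_h = rng.normal(scale=0.3, size=(J, n + 1))
W_o = rng.normal(scale=0.3, size=(L, J + 1))
x = np.append(1.0, rng.normal(size=n))        # bias x0 = 1
d = np.array([0.5, -0.5])
eta_o = eta_h = 0.05

def forward(W_h, W_o, x):
    z = np.append(1.0, np.tanh(W_h @ x))      # hidden activations (z0 = 1)
    return z, W_o @ z                         # linear output layer

z, y = forward(W_h, W_o, x)
E_before = 0.5 * np.sum((d - y) ** 2)

delta_o = d - y                               # output delta (f_o' = 1 for linear)
# hidden deltas: (1 - z^2) is tanh' written in terms of the activations
delta_h = (1.0 - z[1:] ** 2) * (W_o[:, 1:].T @ delta_o)

W_o = W_o + eta_o * np.outer(delta_o, z)      # output-layer weight update
W_h = W_h + eta_h * np.outer(delta_h, x)      # hidden-layer weight update

_, y_new = forward(W_h, W_o, x)
E_after = 0.5 * np.sum((d - y_new) ** 2)
print(E_before, E_after)                      # E_after < E_before
```

In the incremental procedure this update is simply repeated for each pattern drawn from the training set; the batch variant accumulates the weight changes over all m patterns before applying them.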
and go to step 3. Using stochastic approximation theory. solutions generated by incremental backprop are often practical ones. This amounts to gradient descent on the criterion function
(5. In this case. This problem should help give a good feel for what is learned by the hidden units in a feedforward neural network. This is done by checking some preselected function of the output errors to see if its magnitude is below some preset threshold. Select an input pattern xk from the training set (preferably at random) and propagate it through the network. it converges to a local minima of the criterion function (McInerny et al. Another alternative is to employ " batch" learning where weight updating is performed only after all patterns (assuming a finite training set) have been presented. In both cases. it admits local minima. some noise reduction schedule should be employed to dynamically reduce the added noise level towards zero as learning progresses. incremental backprop approaches batch backprop and produces essentially the same results. However. 1989). one may try to reinitialize the search process. which means that the weights are updated after every presentation of an input pattern. thus generating hidden and output unit activities based on the current weight settings. stop. Normally.
and
for the output and
7.1. In general.13) Even though batch updating moves the search point w in the direction of the true gradient at each update step. 4. Finnoff (1993.. 1988). the incremental backprop learning procedure is applied to solve a two-dimensional. enhanced error correction may be achieved if one employs the updated output layer weights of recomputing yl and fo'(netl). If convergence is met. This fact is true of any gradient descent-based learning rule when the surface being searched is nonconvex (Amari.1. The batch learning is formally stated by summing the right hand side of Equations (5. Next.3. for small constant learning rates there is a nonnegligible stochastic element in the training process which gives incremental backprop a "quasiannealing" character in which the cumulative gradient is continuously perturbed. The above procedure is based on "incremental" learning.9) over all patterns xk.. otherwise. potentially. tune the learning parameters. two-class pattern classification problem. the "approximate" incremental updating is more desirable for two reasons: (1) It requires less storage. allowing the search to escape local minima with small shallow basins of attraction.1. 1990).2) to compute the output layer weight changes .
. this comes at the added cost
6. 1994) showed that for "very small" learning rates (approaching zero).1. i. Employ Equation (5.
5. respectively. It should be noted that backprop may fail to find a solution which passes the convergence test. Test for convergence.1. and/or use more hidden units. The local minima problem can be further eased by heuristically adding random noise to the weights (von Lehman et al.

Example 5.1.1: Consider the two-class problem shown in Figure 5.1.2. The points inside the shaded region belong to class B, and all other points are in class A. A three-layer feedforward neural network trained with backprop is employed to learn to distinguish between these two classes. The network consists of an 8-unit first hidden layer, followed by a second hidden layer with 4 units, followed by a 1-unit output layer. We will refer to such a network as having an 8-4-1 architecture. (Here, we neglect the output unit when counting hidden layers and may view the remaining net as one with an 8-4 architecture.) All units employ a hyperbolic tangent activation function. The output unit should encode the class of each input vector: a positive output indicates class B, and a negative output indicates class A. The training set consists of 500 randomly chosen points, 250 from region A and another 250 from region B. In this training set, points representing class B and class A were assigned desired output (target) values of +1 and −1, respectively. Incremental backprop was used with learning rates set to 0.1, and training was performed for several hundred cycles over the training set.

Figure 5.1.2. Decision regions for the pattern classification problem in Example 5.1.1.

Figure 5.1.3 shows geometrical plots of all unit responses upon testing the network with a new set of 1000 uniformly randomly generated points inside the [−1, +1]2 region. In generating each plot, a black dot was placed at the exact coordinates of the test point (input) in the input space if and only if the corresponding unit response is positive. The boundaries between the dotted and the white regions in the plots represent approximate decision boundaries learned by the various units in the network. Figure 5.1.3 (a)-(h) represents the decision boundaries learned by the eight units in the first hidden layer; note the linear nature of the separating surfaces realized by these units. Figure 5.1.3 (i)-(l) shows the decision boundaries learned by the four units of the second hidden layer, from which complex nonlinear separating surfaces are realized, and ultimately by the output-layer unit, whose decision boundary is shown in Figure 5.1.3 (m). As can be seen from Figure 5.1.3 (i)-(l), this example also illustrates how a single-hidden-layer feedforward net (counting only the first two layers) is capable of realizing convex, concave, as well as disjoint decision regions. The present problem can also be solved with smaller networks (a smaller number of hidden units, or even a network with a single hidden layer); however, the training of such smaller networks with backprop may become more difficult. A smaller network with a 5-3-1 architecture utilizing a variant backprop learning procedure (Hassoun et al., 1990) is reported in Song (1992), with a separating surface comparable to the one in Figure 5.1.3 (m).
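The plotting procedure of Example 5.1.1 is easy to sketch: draw uniform random test points and record a "black dot" wherever a unit's response is positive. The unit below is an arbitrary tanh unit with made-up weights, purely to illustrate the mechanics:

```python
import numpy as np

# Sketch of the dot-plot procedure: sample test points in [-1, +1]^2 and mark
# the points where a (hypothetical) tanh unit responds positively.
rng = np.random.default_rng(4)
pts = rng.uniform(-1.0, 1.0, size=(1000, 2))

w, b = np.array([1.5, -2.0]), 0.25            # illustrative unit weights/bias
resp = np.tanh(pts @ w + b)                   # unit response at each test point
dots = pts[resp > 0]                          # points that would be plotted black

print(len(dots), len(pts) - len(dots))        # positive/negative point counts
```

For a first-hidden-layer unit like this one, the dotted region is bounded by a straight line, matching the linear separating surfaces seen in Figure 5.1.3 (a)-(h).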

Figure 5.1.3. (a)-(h): Separating surfaces realized by the units in the first hidden layer; (i)-(l): separating surfaces realized by the units in the second hidden layer; (m): separating surface realized by the output unit. Surfaces generated by the various units in the 8-4-1 network of Example 5.1.1.

Huang and Lippmann (1988) employed Monte Carlo simulations to investigate the capabilities of backprop in learning complex decision regions (see Figure 2.3). They reported no significant performance difference between two- and three-layer feedforward nets when forming complex decision regions using backprop. They also demonstrated that backprop's convergence time is excessive for complex decision regions, and that the performance of such trained classifiers is similar to that obtained with the k-nearest neighbors classifier (Duda and Hart, 1973). Villiers and Barnard (1993) reported similar simulations, but on data sets which consisted of a "distribution of distributions," where a typical class is a set of clusters (distributions) in the feature space, each of which can be more or less spread out and which might involve some or all of the dimensions of the feature space; the distribution of distributions thus assigns a probability to each distribution in the data set. It was found, for networks of equal complexity (same number of weights), that there is no significant difference between the quality of the "best" solutions generated by two- and three-layer backprop-trained feedforward networks; actually, on the average, the two-layer nets demonstrated better performance. As for the speed of convergence, three-layer nets converged faster if the number of units in the two hidden layers were roughly equal.

5.1.2 Global Descent-Based Error Backpropagation

Gradient descent search may be eliminated altogether in favor of a stochastic global search procedure that guarantees convergence to a global solution with high probability; genetic algorithms and simulated annealing are examples of such procedures and are considered in Chapter 8. However, the assured (in probability) optimality of these global search procedures comes at the expense of slow convergence. Here, a deterministic search procedure termed global descent is presented which helps backprop reach globally optimal solutions. This methodology is based on a global optimization scheme, acronymed TRUST (terminal repeller unconstrained subenergy tunneling), which formulates optimization in terms of the flow of a special deterministic dynamical system (Cetin et al., 1993a). Next, we describe a learning method in which the gradient descent rule in batch backprop is replaced with a "global descent" rule (Cetin et al., 1993b). Global descent is a gradient descent on a special criterion function C(w, w*) given by

(5.1.14)

where w* is a fixed weight vector which can be a local minimum of E(w) or an initial weight state w0; the transformation involves the unit step function and a shifting parameter (typically set to 2), and k is a small positive constant. The first term on the right-hand side of Equation (5.1.14) is a monotonic transformation of the original criterion function (e.g., the SSE criterion may be used) which preserves all critical points of E(w) and has the same relative ordering of the local and global minima of E(w); it also flattens the portion of E(w) above E(w*), with minimal distortion elsewhere. The second term is a "repeller term" which gives rise to a convex surface with a unique minimum located at w*; the "power" of this repeller is determined by the constant k. The overall effect of this energy transformation is schematically represented for a one-dimensional criterion function in Figure 5.1.4.

Performing gradient descent on C(w, w*) leads to the "global descent" update rule

(5.1.15)

where wi represents an arbitrary hidden-unit or output-unit weight. The first term on the right-hand side of Equation (5.1.15) is a "subenergy gradient", while the second term is a "non-Lipschitzian" terminal repeller (Zak, 1989). Upon replacing the gradient descent in Equation (5.1.4) by Equation (5.1.15), the modified backprop procedure may escape local minima of the original criterion function E(w) given in Equation (5.1.13). Here, batch training is required, since Equation (5.1.15) necessitates a unique error surface for all patterns.

The update rule in Equation (5.1.15) automatically switches between two phases: a tunneling phase and a gradient descent phase. The tunneling phase is characterized by E(w) ≥ E(w*). Since for this condition the subenergy gradient term is nearly zero in the vicinity of the local minimum w*, the terminal repeller term in Equation (5.1.15) dominates, leading to the dynamical system

(5.1.16)

This system has an unstable repeller equilibrium point at w*, i.e., at the local minimum of E(w). Thus, the dynamical system given by Equation (5.1.16), when initialized with a small perturbation from w*, is repelled from this local minimum until it reaches a lower-energy region E(w) < E(w*); i.e., tunneling through portions of E(w) where E(w) ≥ E(w*) is accomplished. The second phase is a gradient descent minimization phase, characterized by E(w) < E(w*). Here, the repeller term is identically zero, and Equation (5.1.15) becomes

(5.1.17)

where η(w) is a dynamic learning rate (step size); note that η(w) is approximately equal to η when E(w*) is larger than E(w).

Figure 5.1.4. A plot of a one-dimensional criterion function E(w) with local minimum at w*, as well as the global descent criterion function C(w, w*). The function E(w) − E(w*) is plotted below.
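The distinctive ingredient here is the "non-Lipschitzian" terminal repeller. A one-dimensional numerical sketch (not the text's exact equations; k, the step size, and the perturbation are assumptions) shows why: a cube-root repeller dw/dt = k(w − w*)^(1/3) expels the state from w* in finite time, while an ordinary linear repeller leaves a tiny perturbation nearly frozen for a long while:

```python
import numpy as np

# Compare a terminal (cube-root) repeller with a linear repeller, starting
# from the same tiny perturbation away from the minimum w* = 0.
k, w_star, dt = 1.0, 0.0, 1e-3
w_term = w_lin = w_star + 1e-9           # small perturbation from w*

for _ in range(2000):                    # Euler integration over 2 time units
    w_term += dt * k * np.cbrt(w_term - w_star)
    w_lin  += dt * k * (w_lin - w_star)

print(w_term, w_lin)                     # terminal repeller has already escaped
```

After two time units the terminal-repeller state is of order one, while the linear-repeller state is still of order 1e-9; this finite-time escape is what makes the tunneling phase effective at leaving the vicinity of a local minimum.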

Initially, w* is chosen as one corner of a domain of interest in the form of a hyperparallelepiped whose dimension is that of w in the architecture of Figure 5.1. A slightly perturbed version of w* is taken as the initial state of the dynamical system in Equation (5.1.15); here the perturbation is small and drives the system into the domain of interest. If E(w) < E(w*) holds in the neighborhood of w*, the system immediately enters a gradient descent phase which equilibrates at a local minimum. If, on the other hand, E(w) ≥ E(w*), then the system is initially in a tunneling phase. The tunneling proceeds to a lower basin of attraction; as the dynamical system enters the next basin, it automatically switches to gradient descent and equilibrates at the next lower local minimum. We then set w* equal to this new minimum and repeat the process. Every time a new equilibrium is reached, w* is set equal to this equilibrium and Equation (5.1.15) is reinitialized with a perturbation which assures a necessary consistency in the search flow direction; the repeller at w* then repels the system until it reaches a lower basin of attraction. Thus, the algorithm will always escape from one local minimum to another with a lower or equal functional value. Training can be stopped when a minimum w* with error smaller than a preset threshold is reached, or when E(w*) becomes sufficiently small.

The global descent method is guaranteed to find the global minimum for functions of one variable, but not for multivariate functions; in the multidimensional case, it is found that the direction of the perturbation vector is very critical in regard to successfully reaching a global minimum. Figure 5.1.5 compares the learning curve for the global descent-based backprop to that of batch backprop for the four-bit parity problem in a feedforward net with four hidden units and a single output unit. The same initial random weights are used in both cases. The figure depicts one tunneling phase for the global descent algorithm before convergence to a (perfect) global minimum solution. Batch backprop, on the other hand, converges to the first local minimum it reaches; this local solution represents a partial solution to the 4-bit parity problem (i.e., mapping error is present). Simulations using incremental backprop with the same initial weights were also performed, but are not shown in the figure. Incremental backprop was able to produce both of the solutions shown in Figure 5.1.5: very small learning rates often lead to imperfect local solutions, while relatively larger learning rates may lead to a perfect solution. This is due to the characteristics of the error surface, which is characterized by numerous flat and steep regions; in addition, it has many troughs which are flat in the direction of search.

Figure 5.1.5. Learning curves for global descent-based and gradient descent-based batch backprop for the 4-bit parity problem.
Learning with backprop is slow (Sutton, 1986; Huang and Lippmann,
Since w* is now a local minimum. This local solution represents a partial solution to the 4-bit parity problem (i.15) is reinitialized with which assures a necessary consistency in the search flow direction.15).1. it has many troughs which are flat in the direction of search. 1988). namely . Huang and Lippmann.
5. The same initial random weights are used in both cases. at the onset of training. The tunneling will proceed to a lower basin. batch backprop converges to the first local minimum it reaches.

1991) (this phenomenon is known as the "flat spot" problem and is considered later in this chapter). Many enhancements of and variations to backprop have been proposed. These are mostly heuristic modifications with goals of increased speed of convergence, avoidance of local minima, and/or improvement in the network's ability to generalize. In this section, we present some common heuristics which may improve these aspects of backprop learning in multilayer feedforward neural networks.

5.2.1 Weight Initialization

Due to its gradient descent nature, backprop is very sensitive to initial conditions. If the choice of the initial weight vector w0 (here w is a point in the weight space being searched by backprop) happens to be located within the attraction basin of a strong local minimum attractor (one where the minimum lies at the bottom of a steep-sided valley of the criterion/error surface), then the convergence of backprop will be fast and the solution quality will be determined by the depth of that valley relative to the depth of the global minima. On the other hand, backprop converges very slowly if w0 starts the search in a relatively flat region of the error surface. An alternative explanation for the sensitivity of backprop to initial weights (as well as to other learning parameters) is advanced by Kolen and Pollack (1991). Using Monte Carlo simulations on simple feedforward nets with incremental backprop learning of simple functions, they discovered a complex fractal-like structure for convergence as a function of initial weights, reporting regions of high sensitivity in the weight space where two very close initial points can lead to substantially different learning curves. They hypothesize that these fractal-like structures arise in backprop due to the nonlinear nature of the dynamic learning equations, which exhibit multiple attractors; thus, they advance a many-body metaphor where the search trajectory is determined by complex interactions with the system's attractors, rather than the gradient descent metaphor with local valleys to get stuck in.

Normally, the weights are initialized to small zero-mean random values (Rumelhart et al., 1986b), e.g., by generating uniform random weights within a small symmetric range. Here, randomness is introduced as a symmetry-breaking mechanism; it prevents units from adopting similar functions and becoming redundant. The motivation for starting from small weights is that large weights tend to prematurely saturate units in a network and render them insensitive to the learning process (Hush et al., 1991; Lee et al., 1991), especially when the size of the training set is small. A sensible strategy for choosing the magnitudes of the initial weights for avoiding premature saturation is to choose them such that an arbitrary unit i starts with a small and random weighted sum neti. This may be achieved by setting the initial weights of unit i to be on the order of the reciprocal of the square root of fi, where fi is the number of inputs (fan-in) for unit i (Wessels and Barnard, 1992). It can be easily shown that, for zero-mean random uniform weights scaled in this way, and assuming normalized inputs which are randomly and uniformly distributed, the input to unit i (neti) is a random variable with zero mean and a standard deviation on the order of unity, as desired. On the other hand, substantial improvements in backprop convergence speed and avoidance of "bad" local minima are possible by initializing the hidden-unit weight vectors to normalized vectors selected randomly from the training set itself (Denoeux and Lengellé, 1993); this has been observed in simulations involving single hidden layer feedforward networks for pattern classification and function approximation tasks.
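The fan-in heuristic above is easy to verify numerically. In the sketch below (sample counts and the uniform input model are illustrative assumptions), uniform weights scaled like 1/sqrt(fi) give the weighted sum neti a spread on the order of one for every fan-in:

```python
import numpy as np

# Check the fan-in initialization heuristic: with U(-a, a) weights where
# a = sqrt(3/fan_in), net_i starts with standard deviation near 1
# regardless of fan-in.
rng = np.random.default_rng(5)
stds = {}
for fan_in in (10, 100, 1000):
    a = np.sqrt(3.0 / fan_in)                     # U(-a, a) has std a/sqrt(3)
    w = rng.uniform(-a, a, size=(5000, fan_in))   # many trial weight vectors
    x = rng.uniform(-1, 1, size=fan_in)
    x /= np.linalg.norm(x) / np.sqrt(fan_in)      # inputs at unit-variance scale
    stds[fan_in] = (w @ x).std()                  # spread of net_i across trials

print(stds)   # all values stay near 1, independent of fan-in
```

Without the 1/sqrt(fi) scaling, the spread of neti would grow with fan-in, pushing saturating units into their flat regions at the start of training, which is exactly the premature-saturation problem the heuristic avoids.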
5.2.2 Learning Rate

The convergence speed of backprop is directly related to the learning rate parameter ρ (ρo and ρh in Equations (5.1.2) and (5.1.9), respectively). In general, if ρ is small, the search path will closely approximate the gradient path, but convergence will be very slow due to the large number of update steps needed to reach a local minimum. On the other hand, if ρ is large, convergence will initially be very fast, but the algorithm will eventually oscillate and thus not reach a minimum. Therefore, it is desirable to have large steps when the search point is far away from a minimum, with a decreasing step size as the search approaches a minimum; a procedure for selecting ρ is needed. In this section, we give a sample of the various approaches for selecting the proper learning rate.

One early proposed heuristic (Plaut et al., 1986) is to use constant learning rates which are inversely proportional to the fan-in of the corresponding units. Without this normalization, units with high fan-in have their input activity (net) changed by a larger amount than units with low fan-in; due to the nature of the sigmoidal activation function, the units with large fan-in tend to commit their output to a saturated state prematurely and are rendered difficult to adapt (see Section 5.2.4 for additional discussion). Thus, normalizing the learning rates of the various units by dividing by their corresponding fan-in helps speed up learning; such learning rate normalization can be intuitively thought of as maintaining a balance between the learning speeds of units with different fan-in. The increased convergence speed of backprop due to setting the individual learning rate of each unit inversely proportional to the number of inputs to that unit has been theoretically justified by analyzing the eigenvalue distribution of the Hessian matrix of the criterion function (Le Cun et al., 1991a). Extensions of the idea of fan-in dependence of the learning rate have also been proposed (Tesauro and Janssens, 1988).

The optimal learning rate for fast convergence of backprop/gradient descent search is the inverse of the largest eigenvalue λmax of the Hessian matrix H of the error function E, evaluated at the search point w. However, computing the full Hessian matrix is prohibitively expensive when large networks with thousands of parameters are involved, and explicitly extracting λmax from H for speedy convergence seems rather inefficient. Instead, one may employ a shortcut to efficiently estimate λmax using the power method (Le Cun et al., 1993). This shortcut is based on a simple way of approximating the product of H by an arbitrarily chosen (random) vector z through a Taylor expansion:

Hz ≈ [∇E(w + αz) − ∇E(w)] / α

where α is a small positive constant. Iterating the procedure

z ← Hz / ||z||

causes the vector z to converge to λmax cmax, where cmax is the normalized eigenvector of H corresponding to λmax. Thus, the norm of the converged vector z gives a good estimate of |λmax|, and its reciprocal may now be used as the learning rate in backprop. An on-line version of this procedure is reported by Le Cun et al. (1993).

Many heuristics have been proposed so as to adapt the learning rate automatically after each learning iteration. Franzini (1987) investigated a technique that heuristically adjusts ρ, increasing it whenever the current gradient direction is close to the previous one and decreasing it otherwise (see also Vogl et al., 1988). Chan and Fallside (1987) proposed an adaptation rule for ρ that is based on the cosine of the angle between the gradient vectors ∇E(t) and ∇E(t − 1) (here, t is an integer which represents the iteration number). Sutton (1986) presented a method which can increase or decrease the learning rate ρi for each weight wi according to the number of sign changes observed in the associated partial derivative ∂E/∂wi; this method was also studied empirically by Jacobs (1988). Cater (1987) suggested using separate learning rate parameters ρk, one for each training pattern xk. Silva and Almeida (1990) used a method where the learning rate ρi for a given weight wi is multiplied by a constant u > 1 if the partial derivative ∂E/∂wi retains the same sign over two consecutive iterations, and multiplied by a constant d < 1 if the partial derivatives have different signs.
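The power-method shortcut for |λmax| can be sketched as follows; the finite-difference Hessian-vector product only needs two gradient evaluations per iteration (names and defaults are mine, and a quadratic test function stands in for a network's error surface):

```python
import numpy as np

def estimate_lambda_max(grad, w, n_iters=100, alpha=1e-6, seed=0):
    """Estimate the largest Hessian eigenvalue of E at w by power
    iteration, using H z ~= (gradE(w + alpha*z) - gradE(w)) / alpha."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(w.shape)
    z /= np.linalg.norm(z)
    g0 = grad(w)
    lam = 0.0
    for _ in range(n_iters):
        Hz = (grad(w + alpha * z) - g0) / alpha  # H @ z without forming H
        lam = np.linalg.norm(Hz)                 # converges to |lambda_max|
        z = Hz / lam
    return lam

# Quadratic test surface E(w) = w.T A w / 2, so H = A and lambda_max = 10.
A = np.diag([1.0, 2.0, 10.0])
lam = estimate_lambda_max(lambda w: A @ w, np.zeros(3))
```

The learning rate would then be set to 1/lam, per the optimality result quoted above.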

When the input vectors are assumed to be randomly and independently chosen from a probability distribution, we may view incremental backprop as a stochastic gradient descent algorithm, along the lines of the theory in Chapter 4. Here, simply setting the learning rate to a constant results in persistent residual fluctuations around a local minimum w*; the variance of these fluctuations depends on the size of ρ, the criterion function being minimized, and the training set. Based on results from stochastic approximation theory (Ljung, 1977), the "running average" schedule ρ(t) = ρ0/t, with sufficiently small ρ0, guarantees asymptotic convergence to a local minimum w*. Unfortunately, this schedule leads to very slow convergence, and increasing ρ0 can lead to instability for small t. Ideally, one would like to start the search with a learning rate faster than the ρ0/t rate but then ultimately converge to the ρ0/t rate as w* is approached. Darken and Moody (1991) proposed the "search then converge" schedule

ρ(t) = ρ0 / (1 + t/τ)

which allows for faster convergence without compromising stability. In this schedule, the learning rate stays relatively high (close to ρ0) during a "search time" τ, in which it is hoped that the weights will hover about a good minimum; then, for times t much larger than τ, the learning rate decreases as ρ0τ/t and the learning converges. Note that for τ = 1 this schedule reduces to the running average schedule. A completely automatic "search then converge" schedule can be found in Darken and Moody (1992). A similar, theoretically justified method for increasing the convergence speed of incremental gradient descent search is to increase ρ(t) relative to ρ(t − 1) when successive changes in E carry the same sign, and to decrease it otherwise (Pflug, 1990).

5.2.3 Momentum

Another simple approach to speed up backprop is through the addition of a momentum term (Plaut et al., 1986) to the right-hand side of the weight update rules in Equations (5.1.2) and (5.1.9). The addition of momentum to gradient search is formally stated as

Δw(t) = −ρ∇E(t) + αΔw(t − 1)    (5.2.1)

where α is a momentum rate normally chosen between 0 and 1. Here, each weight change is given some momentum so that it accelerates in the average downhill direction, instead of fluctuating with every change in the sign of the associated partial derivative ∂E/∂wi. Equation (5.2.1) is a special case of multi-stage gradient methods, which have been proposed for accelerating convergence (Wegstein, 1958) and escaping local minima (Tsypkin, 1971).
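The momentum rule and the "search then converge" schedule combine naturally in a few lines (a sketch; the function name, defaults, and the quadratic test surface are assumptions of mine):

```python
import numpy as np

def sgd_momentum_stc(grad, w0, rho0=0.1, alpha=0.9, tau=100.0, n_steps=500):
    """Gradient descent with momentum (Eq. 5.2.1), with the learning
    rate following the 'search then converge' schedule
    rho(t) = rho0 / (1 + t/tau)."""
    w = np.array(w0, dtype=float)
    dw = np.zeros_like(w)
    for t in range(n_steps):
        rho_t = rho0 / (1.0 + t / tau)
        dw = -rho_t * grad(w) + alpha * dw   # Eq. (5.2.1)
        w = w + dw
    return w

# Minimize E(w) = ||w - 1||^2 / 2, whose gradient is w - 1.
w = sgd_momentum_stc(lambda w: w - 1.0, np.array([5.0, -3.0]))
```

The same loop with tau set very large reduces to plain constant-rate momentum descent.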

The momentum term can also be viewed as a way of increasing the effective learning rate in almost-flat regions of the error surface, while maintaining a learning rate close to ρ (here 0 < ρ < 1) in regions with high fluctuations. This can be seen by employing an N-step recursion and writing Equation (5.2.1) as

Δw(t) = −ρ Σ(k = 0 to N − 1) α^k ∇E(t − k) + α^N Δw(t − N)    (5.2.2)

For flat regions, ∇E will be about the same at each time step, and Equation (5.2.2) can be approximated as (with 0 < α < 1 and N large)

Δw(t) ≈ −[ρ/(1 − α)] ∇E(t)    (5.2.3)

Thus, in such regions, a momentum term leads to increasing the learning rate by a factor 1/(1 − α). On the other hand, if the search point is in a region of high fluctuation, the weight change will not gain momentum; i.e., the momentum effect vanishes. An empirical study of the effects of ρ and α on the convergence of backprop and on its learning curve can be found in Tollenaere (1990).

Adaptive momentum rates may also be employed. Fahlman (1989) proposed and extensively simulated a heuristic variation of backprop, called quickprop, which employs a dynamic momentum rate given by

α(t) = S(t) / [S(t − 1) − S(t)]    (5.2.4)

where S(t) denotes the slope ∂E/∂wi at step t. Substituting Equation (5.2.4) in (5.2.1) leads to the update rule

Δwi(t) = −ρS(t) + {S(t) / [S(t − 1) − S(t)]} Δwi(t − 1)    (5.2.5)

With this adaptive α(t), if the search point is caught in a flat region, where the current slope is persistently smaller than the previous one but has the same sign, then α(t) is positive and the weight change will accelerate; the acceleration rate is determined by the magnitude of successive differences between slope values. If the current slope is in the opposite direction from the previous one, it signals that the weights are crossing over a minimum; in this case, α(t) has a negative sign and the weight change starts to decelerate. Additional heuristics are used to handle the undesirable case where the current slope is in the same direction as the previous one but has the same or larger magnitude; otherwise, this scenario would lead to taking an infinite step, or to moving the search point backwards, up the current slope and toward a local maximum. Equation (5.2.5) may also be used with ρ = 0.

The use of error gradient information at two consecutive time steps in Equation (5.2.4) to improve convergence speed can be justified as being based on approximations of second-order search methods such as Newton's method. The Newton method (e.g., Dennis and Schnabel, 1983) is based on a quadratic model of the criterion E(w), and hence uses only the first three terms in a Taylor series expansion of E about the "current" weight vector wc:

E(w) ≈ E(wc) + ∇E(wc)T (w − wc) + (1/2)(w − wc)T H (w − wc)

Here H is the Hessian matrix with components ∂²E/∂wi∂wj, evaluated at wc. This quadratic function is minimized by solving the equation ∇E(w) = 0, which leads to Newton's method:

Δw = −H⁻¹ ∇E(wc)

Newton's algorithm iteratively computes the weight changes Δw and works well when initialized within a convex region of E; the algorithm converges quickly if the search region is quadratic or nearly so. However, this method is very computationally expensive, since the computation of H⁻¹ requires O(N³) operations at each iteration (here, N is the dimension of the search space). Bishop (1992) reported a somewhat efficient technique for computing the elements of the Hessian matrix exactly, using multiple feedforward propagations through the network followed by multiple backward propagations. Several authors have suggested computationally efficient ways of approximating Newton's method (Parker, 1987; Becker and Le Cun, 1988; Ricotti et al., 1988). Becker and Le Cun proposed an approach whereby the off-diagonal elements of H are neglected, thus arriving at the approximation

Δwi = −(∂E/∂wi) / (∂²E/∂wi²)    (5.2.6)

which is a "decoupled" form of Newton's rule where each weight is updated separately. Here, special heuristics must be used in order to prevent the search from moving in the wrong gradient direction, and in order to deal with regions of very small curvature, such as inflection points and plateaus, which cause Δwi in Equation (5.2.6) to blow up. A simple solution is to replace the term ∂²E/∂wi² in Equation (5.2.6) by ∂²E/∂wi² + µ, where µ is a small positive constant. Because it neglects the off-diagonal Hessian terms, the approximate Newton method described above is capable of scaling the descent step in each direction, but it is not able to rotate the search direction as in the exact Newton's method; in fact, this approximate rule is only efficient if the directions of maximal and minimal curvature of E happen to be aligned with the weight space axes.

It is interesting to note that Equation (5.2.5) can now be viewed as an approximation of Newton's rule, since the denominator of its second term is a crude finite-difference approximation of the second derivative of E at step t. In fact, Equation (5.2.5) corresponds to steepest gradient descent-based adaptation with a dynamically changing effective learning rate ρ(t), given by the sum of the original constant learning rate ρ and the reciprocal of the slope difference quotient [S(t) − S(t − 1)] / Δwi(t − 1) appearing in the second term of Equation (5.2.5).
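The decoupled, µ-regularized Newton rule can be sketched as follows. For illustration the diagonal second derivatives are approximated by central finite differences of the gradient, not by the backprop-based computation mentioned in the text:

```python
import numpy as np

def diagonal_newton_step(grad, w, mu=0.1, eps=1e-4):
    """'Decoupled' Newton update (Eq. 5.2.6 with a mu-regularized
    denominator): each weight is scaled by its own curvature estimate,
    here obtained by differencing the gradient along each axis."""
    g = grad(w)
    h_diag = np.empty_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        h_diag[i] = (grad(w + e)[i] - grad(w - e)[i]) / (2 * eps)
    return w - g / (h_diag + mu)

# Axis-aligned quadratic E = (w1**2 + 4*w2**2)/2, where the diagonal
# approximation is exact and convergence is rapid.
a = np.array([1.0, 4.0])
w = np.array([1.0, 1.0])
for _ in range(10):
    w = diagonal_newton_step(lambda v: a * v, w)
```

As the text notes, this rule can only rescale the axes; for curvature directions not aligned with the weight axes it loses its Newton-like behavior.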

Another approach for deriving theoretically justifiable update schedules for the momentum rate α(t) in Equation (5.2.1) is to adjust α(t) at each update step such that the gradient descent search direction is "locally" optimal. In "optimum" steepest descent (also known as best-step steepest descent), the learning rate is set at time t such that it minimizes the criterion function E at time step t + 1. Analytically, we desire a ρ which minimizes E[w(t) − ρ∇E(t)]; to simplify notation, we write E[w(t)] as E(t). When w(t) is specified, the necessary condition for minimizing E(t + 1) is (Tompkins, 1956; Brown, 1959)

∇E(t + 1)T ∇E(t) = 0    (5.2.7)

This implies that the search directions in two successive steps of optimum steepest descent are orthogonal. Unfortunately, this optimal learning step is impractical, since it requires the computation of the Hessian ∇²E at each time step (refer to Problem 5.12 for an expression for the optimal ρ). However, we may still use some of the properties of the optimal ρ in order to accelerate the search, as we demonstrate next.

Suppose that we know the search direction at time t − 1, denoted d(t − 1), and that we compute the "exact" gradient ∇E(t) (used in batch backprop) at time step t. The easiest method to enforce the orthogonality requirement on consecutive search directions is Gram-Schmidt orthogonalization (Yu et al., 1993), which computes the new search direction

d(t) = −∇E(t) + [∇E(t)T d(t − 1) / (d(t − 1)T d(t − 1))] d(t − 1)    (5.2.8)

Performing descent search in the direction d(t) in Equation (5.2.8) leads to the weight vector update rule

w(t + 1) = w(t) + ρ d(t)    (5.2.9)

where the relation Δw(t − 1) = w(t) − w(t − 1) = ρ d(t − 1) has been used. Comparing the component-wise weight update version of Equation (5.2.9) to Equation (5.2.1) reveals another adaptive momentum rate, given by

α(t) = ρ ∇E(t)T Δw(t − 1) / [Δw(t − 1)T Δw(t − 1)]

Another, similar approach is to set the current search direction d(t) to be a compromise between the current "exact" gradient ∇E(t) and the previous search direction d(t − 1). Here, the current search direction is chosen to be conjugate (with respect to H) to the previous search direction; i.e., we require d(t − 1)T H(t − 1) d(t) = 0, where the Hessian H is assumed to be positive definite. This is the basis for the conjugate gradient method, in which the search direction is chosen (by appropriately setting β) so that it distorts as little as possible the minimization achieved by the previous search step.

Near a local minimum, it is reasonable to assume that E is approximately quadratic. Under this assumption, the search direction in the conjugate gradient method at time t is given by

d(t) = −∇E(t) + β(t) d(t − 1)

where β(t), which plays the role of an adaptive momentum, is chosen according to the Polak-Ribiére rule (Polak and Ribiére, 1969; Press et al., 1986):

β(t) = ∇E(t)T [∇E(t) − ∇E(t − 1)] / [∇E(t − 1)T ∇E(t − 1)]

Thus, using the weight update rule Δw(t) = ρ d(t) and substituting the above expression for d(t) leads to

Δw(t) = −ρ∇E(t) + ρβ(t) d(t − 1)

When E is quadratic, the conjugate gradient method theoretically converges in N or fewer iterations. In practice, E is not quadratic, and therefore the method can be slower than what the theory predicts; still, conjugate gradient descent is expected to accelerate the convergence of backprop once the search enters a small neighborhood of a local minimum. The conjugate gradient method has been applied to multilayer feedforward neural net training (Kramer and Sangiovanni-Vincentelli, 1989; Makram-Ebeid et al., 1989; van der Smagt, 1994) and is shown to outperform backprop in speed of convergence. Battiti (1992) and van der Smagt (1994) gave additional characterization of second-order backprop (such as conjugate gradient-based backprop) from the point of view of optimization.

It is important to note that the above second-order modifications to backprop improve the speed of convergence of the weights to the "closest" local minimum. This faster convergence to local minima is the direct result of employing a better search direction as compared to incremental backprop. On the other hand, the stochastic nature of the search directions of incremental backprop and its fixed learning rates can be an advantage, since it allows the search to escape shallow local minima, which generally leads to better solution quality. These observations suggest the use of hybrid learning algorithms (Møller, 1993; Gorse and Shepherd, 1992), where one starts with incremental backprop and then switches to conjugate gradient-based backprop for the final convergence phase. This hybrid method has its roots in a technique from numerical analysis known as Levenberg-Marquardt optimization (Press et al., 1986). Additional modifications of the gradient descent method which enhance its convergence to global minima are discussed in Chapter 8.

As a historical note, we mention that the concept of gradient descent was first introduced by Cauchy (1847) for use in the solution of simultaneous equations, and the method has enjoyed popularity ever since; Beckman (1964) gives a good account of it. The basic idea of conjugate gradient search was introduced by Hestenes and Stiefel (1952). For a good survey of gradient search, the reader is referred to the book by Polyak (1987). It should also be noted that some of the above enhancements to gradient search date back to the fifties and sixties and are discussed in Tsypkin (1971).
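A compact sketch of conjugate gradient descent with the Polak-Ribiére rule follows. For simplicity a small fixed step size replaces the exact line search normally used, and the common "PR+" safeguard (resetting β to zero when it goes negative) is added; both are my choices, not the book's:

```python
import numpy as np

def conjugate_gradient_descent(grad, w, rho=0.05, n_steps=200):
    """Descent along conjugate directions d(t) = -gradE(t) + beta*d(t-1),
    with beta from the Polak-Ribiere formula (PR+ variant)."""
    g_prev = grad(w)
    d = -g_prev
    for _ in range(n_steps):
        w = w + rho * d
        g = grad(w)
        beta = max(0.0, g @ (g - g_prev) / (g_prev @ g_prev))
        d = -g + beta * d
        g_prev = g
    return w

# Quadratic E with Hessian diag(1, 3); gradient is A(w - 1).
A = np.diag([1.0, 3.0])
w = conjugate_gradient_descent(lambda w: A @ (w - 1.0), np.array([3.0, -2.0]))
```

With an exact line search and a quadratic E, the loop above would terminate at the minimum in at most N = 2 iterations, per the convergence result quoted in the text.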

5.2.4 Activation Function

As indicated earlier in this section, backprop suffers from premature convergence of some units to flat spots. If a unit in a multilayer network receives a weighted signal net with a large magnitude, this unit outputs a value close to one of the saturation levels of its activation function. If the corresponding target value (the desired target value for an output unit, or an unknown "correct" hidden target for a hidden unit) is substantially different from that of the saturated unit, we say that the unit is incorrectly saturated, or has entered a flat spot. When this happens, the size of the weight update due to backprop will be very small even when there is a large difference between the actual and desired output for the unit, and it will take an excessively long time for such incorrectly saturated units to reverse their states. This situation can be explained by referring to Figure 5.1.1, where the activation function f(net) = tanh(βnet) and its derivative f′ [given in Equation (5.1.12)] are plotted for β = 1: when net has large magnitude, f′ approaches zero, and so does the weight change in Equations (5.1.2) and (5.1.9).

Figure 5.1.1. Plots of f(net) = tanh(net) and its derivative f′(net).

A simple solution to the flat spot problem is to bias the derivative of the activation function (Fahlman, 1989); that is, to replace fo′ and fh′ in Equations (5.1.2) and (5.1.9) by fo′ + ε and fh′ + ε, respectively (a typical value for ε is 0.1). Alternatively, Hinton (1987a) suggested the use of a nonlinear error function that goes to infinity at the points where f′ goes to zero, resulting in a finite nonzero error value [see Franzini (1987) for an example of using such an error function]. The entropic criterion of Chapter 3, Equation (3.1.76), is a good choice for the error function, since it leads to an output unit update rule similar to that of Equation (5.1.2) but without the fo′ term; since the output error is nonzero for most of the training phase, flat spot effects at the output units are thus eliminated. (Note, however, that the update rule for the hidden units would still have the derivative term.)

One may also modify the basic sigmoid activation function in backprop in order to reduce flat spot effects. The use of the homotopy activation function is one such example (Yang and Yu, 1993). Here, the activation function forms a homotopy between a linear and a sigmoid function, with a parameter λ in [0, 1]. This can be seen by noting that when λ = 1, all nodes have linear activations; the corresponding error function Eλ(w) is a polynomial in the weights, which has a relatively smaller number of local minima than the error function of the sigmoidal net. Backprop is used to achieve a minimum in Eλ; then λ is decreased (monotonically), and backprop is continued until λ is zero, so that the activation function recovers its sigmoidal nature gradually as training progresses. The minimum achieved at each stage can provide a relatively better initial point for minimizing the next error function; in this way, many unwanted local minima are avoided. Besides reducing the effects of flat spots, the homotopy function thus also helps backprop escape some local minima. An alternative explanation of the effect of a gradually increasing activation function slope on the avoidance of local minima is given in Section 8.1.4, based on the concept of mean-field annealing.
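Fahlman's flat-spot fix amounts to one line of code (shown for the tanh unit of Figure 5.1.1):

```python
import numpy as np

def tanh_prime_biased(net, eps=0.1):
    """Fahlman's flat-spot fix: bias the activation derivative by eps so
    that incorrectly saturated units (|net| large, f' near 0) still pass
    back a usable error signal."""
    return (1.0 - np.tanh(net) ** 2) + eps
```

At net = 5 the true derivative 1 − tanh²(net) is below 2e-4, so the backpropagated delta of a saturated unit all but vanishes; the biased version keeps it above eps.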

Another method for reducing flat spot effects involves dynamically updating the activation slopes (the slope parameters of the hidden and output units in the hyperbolic tangent activations), such that the slope of each unit is adjusted, independently, in the direction of reduced output error (Tawel, 1989). Gradient descent on the error surface in the activation function's slope space leads to update rules of the form

Δβl = −ρo (∂E/∂βl)    (5.2.10)    and    Δβj = −ρh (∂E/∂βj)    (5.2.11)

for the lth output unit and the jth hidden unit, respectively; here, ρo and ρh are small positive constants. Typically, the slopes are updated after every weight update step, so that the slope adaptation process becomes a part of the backprop weight update procedure. When initialized with slopes near unity, Equations (5.2.10) and (5.2.11) initially reduce the activation slopes toward zero, which increases the effective dynamic range of the activation function; this, in turn, reduces flat spot effects and therefore allows the weights to update rapidly in the initial stages of learning. As the algorithm begins to converge, the slope starts to increase and thus restores the saturation properties of the units.

Other nonsigmoid activation functions may also be utilized, as long as they are differentiable (Robinson et al., 1989; Rezgui and Tepedelenlioglu, 1990; Kufudaki and Horejs, 1990; Kruschke and Movellan, 1991; Sperduti and Starita, 1993; Wieland and Leighton, 1987). Indeed, from the discussion on the approximation capabilities of multilayer feedforward networks in Chapter 2, a wide range of activation functions may be employed without compromising the universal approximation capabilities of such networks. However, the advantage of choosing one particular class of activation functions (or a mixture of various functions) is not completely understood. Moody and Yarvin (1992) reported an empirical study in which they compared feedforward networks with a single hidden layer feeding into a single linear output unit, each network employing a different type of differentiable nonlinear activation function: the sigmoid logistic function, polynomials, rational functions (ratios of polynomials), and Fourier series (sums of cosines). Benchmark simulations were performed on a few data sets representing noisy data with only mild nonlinearity, and noiseless data with a high degree of nonlinearity. It was found that the networks with nonsigmoidal activations attained superior performance on the highly nonlinear noiseless data; on the noisy data with mild nonlinearity, polynomials did poorly, whereas rationals and Fourier series showed better performance and were comparable to sigmoids.

Other methods for improving the training speed of feedforward multilayer networks involve replacing the sigmoid units by Gaussian or other units. These methods are covered in Chapter 6.

5.2.5 Weight Decay, Weight Elimination, and Unit Elimination

In Chapter 4, we saw that in order to guarantee good generalization, the number of degrees of freedom or number of weights (which determines a network's complexity) must be considerably smaller than the amount of information available for training. Some insight into this matter can be gained from considering an analogous problem in curve fitting (Duda and Hart, 1973). As an example, consider the rational function g(x) plotted as the solid line in Figure 5.2.2, and assume that we are given a set of 15 samples (shown as small circles) from which we are to find a "good" approximation to g(x). Note that the total number of possible training samples of the form (x, g(x)) is uncountably infinite; from this huge set of potential data, we choose only 15 samples to try to approximate the function. The approximations are computed by minimizing the SSE criterion over the sample points.

Two polynomial approximations are shown in Figure 5.2.2. The dashed line represents an eleventh-order polynomial. This higher-order polynomial has about the same number of parameters as the number of training samples, and thus gives a very close fit to the data; this is referred to as "memorization." However, it is clear from the figure that this polynomial does not provide good "generalization" (i.e., it does not provide reliable interpolation and/or extrapolation) over the full range of the data. Here, a solution which is globally (or near globally) optimal in terms of sum-squared error over the training set (the eleventh-order polynomial) may be hardly appropriate in terms of interpolation (generalization) between data points. A better overall approximation for g(x) is given by an eighth-order polynomial (dotted line). In this case, the number of free parameters (nine) is smaller than the number of training samples; this "underdetermined" nature leads to an approximation function that better matches the "smooth" function g(x) being approximated, and fitting the data by the eighth-order polynomial leads to relatively better interpolation over a wider range of x values. Trying to use a yet lower-order polynomial (e.g., fifth order or less) leads to a poor approximation, because such a polynomial would not have sufficient "flexibility" to capture the nonlinear structure in g(x). Thus, one should choose a class of approximation functions which penalizes unnecessary fluctuations between training sample points. The reader is advised to consider the nature and complexity of this simple approximation problem by carefully studying Figure 5.2.2.

Figure 5.2.2. Polynomial approximation for the function g(x) (solid line), based on the 15 samples shown (small circles): an eleventh-order polynomial (dashed line) and an eighth-order polynomial (dotted line).

Figure 5.2.3 shows the results of simulations involving the approximation of the function g(x), with the same set of samples used in the above simulations, using single hidden layer feedforward neural nets. Here, all hidden units employ the hyperbolic tangent activation function (with a slope of 1), and the output unit is linear. These nets are trained using the incremental backprop algorithm [given by Equations (5.1.2) and (5.1.9)] with ρo = ρh = 0.01; weights are initialized to small random values distributed uniformly about zero, and the training was stopped when the rate of change of the SSE became insignificantly small. The dotted line in Figure 5.2.3 is for a net with three hidden units (which amounts to 10 degrees of freedom); it shows substantial overlap with g(x). Surprisingly, increasing the number of hidden units to 12 (37 degrees of freedom) preserved the quality of the fit, as shown by the dashed line in the figure, even though the number of free parameters now exceeds the number of training samples. By comparing Figures 5.2.2 and 5.2.3, it is clear that the neural net approximation for g(x) is superior to that of the polynomials in terms of accurate interpolation and extrapolation.

Figure 5.2.3. Neural network approximation for the function g(x) (solid line): the dotted line was generated by a 3-hidden-unit feedforward net, and the dashed line by a 12-hidden-unit feedforward net.
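The polynomial side of this comparison is easy to reproduce numerically. In the sketch below, the classic Runge function stands in for the book's rational g(x) (an assumption of mine), and 15 equispaced samples are fit by least squares; the higher-order fit always matches the samples at least as well, while its error between the samples typically grows:

```python
import numpy as np

def poly_fit_errors(order, n_samples=15):
    """Least-squares polynomial fit (SSE over the samples) of a smooth
    rational function.  Returns (training MSE at the samples,
    test MSE at points between the samples)."""
    g = lambda x: 1.0 / (1.0 + 25.0 * x ** 2)   # Runge function, a stand-in for g(x)
    x_train = np.linspace(-1.0, 1.0, n_samples)
    x_test = np.linspace(-0.99, 0.99, 200)      # interpolation points
    c = np.polyfit(x_train, g(x_train), order)
    mse = lambda x: float(np.mean((np.polyval(c, x) - g(x)) ** 2))
    return mse(x_train), mse(x_test)

tr8, te8 = poly_fit_errors(8)
tr11, te11 = poly_fit_errors(11)
```

Plotting the two fitted polynomials over a dense grid reproduces the qualitative picture of Figure 5.2.2: the near-interpolating fit oscillates between the samples, especially toward the ends of the interval.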

The generalization superiority of the neural net can be attributed to the bounded and smooth nature of its hidden unit responses, as compared to the potentially divergent nature of polynomials. The bounded unit response localizes the nonlinear effects of individual hidden units in a neural network and allows the approximations in different regions of the input space to be independently tuned; this approximation process is similar in its philosophy to the traditional spline technique for curve fitting (Schumaker, 1981). Hornik et al. (1990) gave related theoretical justification for the usefulness of feedforward neural nets with sigmoidal hidden units in function approximation. They showed that, in addition to approximating the training set, the derivative of the output of the network evaluated at the training data points is also a good approximation of the derivative of the unknown function being approximated. This result explains the good extrapolation capability of neural nets observed in simulations; the behavior of the neural net output in Figure 5.2.3 for x > 10 and x < −10 is a case in point.

It should be noted, though, that in most practical situations the training data is noisy. In this case, an exact fit of the data must be avoided, since otherwise the net will have a tendency for overfitting; generalization can be improved if the number of free parameters in the net is optimized. Since it is difficult to estimate the optimal number of weights (or units) a priori, there has been much interest in techniques that automatically remove excess weights and/or units from a network. These techniques are sometimes referred to as network "pruning" algorithms, and are surveyed in Reed (1993).

One of the earliest and simplest approaches to removing excess degrees of freedom from a neural network is the use of simple weight decay (Plaut et al., 1986; Hinton, 1986), in which each weight decays toward zero at a rate proportional to its magnitude, so that connections disappear unless reinforced; the idea is to keep the network as small as possible. Weight decay can be accounted for in the weight update equations of backprop by adding a complexity (regularization) term to the criterion function E that penalizes large weights:

J(w) = E(w) + (λ/2) Σi wi²    (5.2.12)

Here, λ represents the relative importance of the complexity term with respect to the error term E(w) [note that the second term in Equation (5.2.12) is a regularization term of the kind encountered in Chapter 4]. Now, gradient search for minima of J(w) leads to the following weight update rule:

Δwi = −ρ(∂E/∂wi) − ρλwi    (5.2.13)

which shows an exponential decay in wi if no learning occurs: for ∂E/∂wi = 0, Equation (5.2.13) gives wi(t + 1) = (1 − ρλ)wi(t). Hinton (1987b) gave empirical justification for this approach by showing that such weight decay improves generalization in feedforward networks, and Krogh and Hertz (1992) gave some theoretical justification for this generalization phenomenon.

Because it penalizes more weights than necessary, the criterion function in Equation (5.2.12) overly discourages the use of large weights: a single large weight costs much more than many small ones. To remedy this, Weigend et al. (1991) proposed a procedure of weight elimination given by minimizing

J(w) = E(w) + λ Σi (wi²/w0²) / (1 + wi²/w0²)    (5.2.14)

where the penalty term on the right-hand side helps regulate weight magnitudes, and w0 is a positive free parameter which must be determined.

and Section 5..3) three adaptive networks having unit-allocation capabilities are discussed. 1990). which leads to the hidden
(5. The above ideas have been extended to unit elimination (e. In Chapter 6 (Section 6. 1991 . Weigend et al. 1989 . it is found that the validation (generalization) error decreases monotonically to a minimum but then starts to increase. and Mézard and Nadal (1989).13) by unit update rule for all weights of hidden units. For large w0. A heuristic for adjusting dynamically during learning is described in Weigend et al. even as the training error continues to decrease. Hergert et al. 5. In simulations involving backprop training of feedforward nets on noisy data. this procedure reduces to the weight decay procedure described above and hence favors many small weights. The idea is to keep the network as small as possible. one could penalize redundant units by replacing the weight decay term in Equation (5. 1989 .15) Generalization in feedforward networks can also be improved by utilizing network construction procedures. It should be noted that the above weight elimination procedure is very sensitive to the choice of .4. the reader is referred to Nowlan and Hinton (1992a and 1992b). As an example.g. (1990) .2. Also.(5. Here. Here.6 Cross Validation An alternative or complementary strategy to the above methods for improving generalization in feedforward neural networks is suggested by findings based on empirical results (Morgan and Bourlard. Further details on network construction procedures can be found in Marchand et al. note that when .2.2.7. Chauvin.14) where the penalty term on the right-hand side helps regulate weight magnitudes and w0 is a positive free parameter which must be determined..2. In practice. one would start with an excess of hidden units and dynamically discard redundant ones. and is illustrated through the computer simulation given next. fewer large weights are favored. Hassoun et al. For yet other forms of the complexity term. a w0 close to unity is used. Frean (1990) . 1990. as opposed to weight or unit pruning. 
we start with a small network and allow it to grow gradually (add more units) in response to incoming data. the cost of the weight approaches one (times ) which justifies the interpretation of the penalty term as a counter of large weights. whereas if w0 is small.
. This phenomenon is depicted in the conceptual plot in Figure 5... Fahlman and Lebiere (1990) . see Hanson and Pratt.2.2. (1991). 1992).
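The two penalized update rules can be illustrated concretely. The sketch below is not the book's code; the function name, defaults, and `mode` switch are illustrative. It applies one gradient step of Equation (5.2.13) or of the weight-elimination rule obtained from Equation (5.2.14):

```python
import numpy as np

def weight_update(w, grad_E, eta=0.1, lam=1e-3, w0=1.0, mode="decay"):
    """One backprop step with a complexity penalty added to E.

    mode="decay":      J = E + lam * sum(w_i^2)                     [Eq. 5.2.12]
    mode="eliminate":  J = E + lam * sum((w/w0)^2 / (1 + (w/w0)^2)) [Eq. 5.2.14]
    """
    if mode == "decay":
        # d/dw of lam*w^2 is 2*lam*w: with grad_E = 0 this gives
        # exponential decay of every weight toward zero
        penalty_grad = 2.0 * lam * w
    else:
        u = (w / w0) ** 2
        # d/dw of lam*u/(1+u): the pull saturates, so a weight that is
        # already large (|w| >> w0) is barely penalized further
        penalty_grad = lam * 2.0 * w / (w0 ** 2 * (1.0 + u) ** 2)
    return w - eta * (grad_E + penalty_grad)
```

Note how the elimination penalty realizes the "counter of large weights" interpretation: its gradient vanishes for |w| >> w_0, so established large weights survive while marginal ones are driven to zero.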

5.2.6 Cross Validation

An alternative (or complementary) strategy to the above methods for improving generalization in feedforward neural networks is suggested by empirical findings (Morgan and Bourlard, 1990; Weigend et al., 1991; Hergert et al., 1992). In simulations involving backprop training of feedforward nets on noisy data, it is found that the validation (generalization) error decreases monotonically to a minimum but then starts to increase, even as the training error continues to decrease; that is, excessive training usually leads to overfitting. This phenomenon is depicted in the conceptual plot of Figure 5.2.4 and is illustrated through the computer simulation given next.

Figure 5.2.4. Conceptual plot of the training error (dashed curve) and validation error (solid curve) encountered in training multilayer feedforward neural nets using backprop.

Consider the problem of approximating the rational function g(x), plotted in Figure 5.2.3, from a set of noisy sample points. The training set consists of 15 noisy samples, generated from 15 perfect samples of g(x), uniformly spaced in the interval [− 12], by adding zero-mean normally distributed random noise with 0.25 variance. A single-hidden-layer feedforward neural net with 12 sigmoidal hidden units and a single linear output unit is used, trained by incremental backprop from small random initial weights.

After 80 training cycles on the 15 noisy samples, the net is tested for uniformly sampled inputs x in the range [− 12]; the output of this 80-cycle net is shown as a dashed line in Figure 5.2.5. Training is then continued and stopped after 10,000 cycles, at which point the same net generates the dotted line output shown in the figure. Comparing the two approximations in Figure 5.2.5 leads to the conclusion that the partially trained net is superior to the excessively trained net in terms of overall interpolation and extrapolation capabilities. This suggests that, when training with noisy data, partial training may lead to a better approximation of the unknown function in the sense of improved interpolation and, possibly, improved extrapolation.

Figure 5.2.5. Two different neural network approximations of the rational function g(x), shown as a solid line, from noisy samples. Both approximations resulted from the same net with 12 hidden units and incremental backprop learning. The dashed line represents the output of the net after 80 learning cycles; the dotted line is the output of the same net after 10,000 learning cycles.

Further insight into the dynamics of the generalization process for this problem can be gained from Figure 5.2.6. Here, after every 10 training cycles, the validation RMS error is monitored by testing the net on a validation set of 294 perfect samples uniformly spaced in the interval [− 12]; this validation error is shown as the dashed line in Figure 5.2.6. The training error (RMS error on the training set of 15 points) is also shown in the figure, as a solid line. (The validation error starts lower than the training error mainly because perfect samples are used for validation.) Note that the optimal net in terms of overall generalization capability is the one obtained after about 80 to 90 training cycles; beyond this training point, the training error keeps decreasing while the validation error increases. It is also interesting to note the non-monotonic behavior of the validation error between training cycles 2,000 and 7,000. In general, multiple local minima may exist in the validation error curves of backprop-trained feedforward neural networks, and the location of these minima is a complex function of the network size, weight initialization, and learning parameters.

Figure 5.2.6. Training and validation RMS errors for the neural net approximation of the function g(x) from noisy samples.

A qualitative explanation of this generalization phenomenon was advanced by Weigend et al. (1991). They explain that, to a first approximation, backprop initially adapts the hidden units in the network such that they all attempt to fit the major features of the data. Later, as training proceeds, some of the units start to fit the noise in the data, and this later process continues as long as there is error and as long as training continues (this is exactly what happens in the simulation of Figures 5.2.5 and 5.2.6). The overall process suggests that the effective number of free parameters (weights) starts small (even if the network is oversized) and grows, approaching the true number of adjustable parameters in the network as training proceeds. Baldi and Chauvin (1991) derived analytical results on the behavior of the validation error in LMS-trained single-layer feedforward networks learning the identity map from noisy autoassociation pattern pairs; their results agree with the above generalization phenomenon in nonlinear multilayer feedforward nets. The generalization phenomenon depicted in Figure 5.2.4 (and illustrated by the simulation in Figures 5.2.5 and 5.2.6) does not currently have a complete theoretical justification, though. The reader is referred to Finnoff et al. (1993) for an empirical study of cross-validation-based generalization and its comparison with weight decay and other generalization-inducing methods.

To summarize, a suitable strategy for improving generalization in networks of non-optimal size is to avoid "overtraining" by carefully monitoring the evolution of the validation error during training and stopping just before it starts to increase. This strategy is based on one of the early criteria in model evaluation, known as cross-validation (e.g., see Stone, 1978). Here, the whole available data set is split into three parts: a training set, a validation set, and a prediction set. The training set is used to determine the values of the weights of the network. The validation set is used for deciding when to terminate training: training continues as long as the performance on the validation set keeps improving, and when it ceases to improve, training is stopped. The third part of the data, the prediction set, is used to estimate the expected performance (generalization) of the trained network on new data; therefore, the prediction set should not be used for validation during the training phase. Note that this heuristic requires the application to be data-rich; some applications suffer from scarcity of training data, which makes this method inappropriate.
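The three-way data split and stopping rule just described can be sketched as follows. This is an illustrative skeleton, not code from the text: `step` stands for one backprop training cycle and `val_error` for a validation pass, both left unspecified here.

```python
def train_with_early_stopping(step, val_error, max_epochs=10000,
                              check_every=10, patience=5):
    """Monitor validation error during training; keep the best weights.

    step()        -- performs one training cycle, returns current weights
    val_error(w)  -- RMS error of weights w on the held-out validation set
    Training stops after `patience` successive checks without improvement.
    """
    best_w, best_err, bad_checks = None, float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        w = step()
        if epoch % check_every:          # validate only every few cycles
            continue
        err = val_error(w)
        if err < best_err:
            best_w, best_err, bad_checks = w, err, 0
        else:
            bad_checks += 1
            if bad_checks >= patience:   # validation error stopped improving
                break
    return best_w, best_err
```

Because validation curves may have multiple local minima (as in Figure 5.2.6), a small `patience` window is used rather than stopping at the very first uptick; the prediction set plays no role in this loop.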

5.2.7 Criterion Functions

As seen earlier in Section 3.1, the backprop weight update rules were derived by gradient descent on the sum of squared error (SSE) criterion. Other criterion/error functions can be used, however, from which new versions of the backprop weight update rules can be derived. In the following, we consider two such criterion functions: (1) relative entropy and (2) Minkowski-r.

The instantaneous relative entropy criterion (Baum and Wilczek, 1988), assuming hyperbolic tangent activations at both layers, is

E(w) = (1/2) Σ_l [ (1 + d_l) ln((1 + d_l)/(1 + y_l)) + (1 − d_l) ln((1 − d_l)/(1 − y_l)) ]    (5.2.16)

Employing gradient descent search on Equation (5.2.16), we may obtain the learning equations

Δw_lj = η (d_l − y_l) z_j    (5.2.17)

and

Δw_ji = η [ Σ_l (d_l − y_l) w_lj ] (1 − z_j²) x_i    (5.2.18)

for the output and hidden layer units, respectively. From Equation (5.2.17), we see that the f_o' term present in the corresponding equation of standard backprop has now been eliminated; thus, the output units do not have a flat spot problem. However, f_h' still appears in Equation (5.2.18) for the hidden units [as the factor (1 − z_j²); this derivative appears implicitly in the corresponding term of the standard backprop equation]. Therefore, the flat spot problem is only partially solved by employing the entropy criterion.

The entropy criterion is a "well formed" error function (Wittner and Denker, 1988); the reader is referred to Section 3.1.5 for a definition and discussion of "well formed" error functions. Such functions have been shown in simulations to converge faster than standard backprop (Solla et al., 1988). The entropy-based backprop is also well suited to probabilistic training data: it has a natural interpretation in terms of learning the correct probabilities of a set of hypotheses represented by the outputs of units in a multilayer neural network. Here, the probability that the lth hypothesis is true, given an input pattern x^k, is determined by the output of the lth unit as (1 + y_l)/2.
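The cancellation of the output derivative under the entropy criterion can be checked numerically. The sketch below (illustrative names; tanh output units as assumed above) compares the analytic error term d_l − y_l of Equation (5.2.17) against a finite-difference gradient of the criterion:

```python
import numpy as np

def entropy_E(net, d):
    """Instantaneous relative entropy criterion, Eq. (5.2.16),
    for tanh output units y = tanh(net) and targets d in (-1, 1)."""
    y = np.tanh(net)
    return 0.5 * np.sum((1 + d) * np.log((1 + d) / (1 + y))
                        + (1 - d) * np.log((1 - d) / (1 - y)))

def entropy_delta(net, d):
    # output-layer error term: the f_o' factor cancels, leaving d - y
    return d - np.tanh(net)
```

Differentiating E with respect to net_l gives (y_l − d_l): the factor (1 − y_l²) from the tanh derivative exactly cancels the denominator of dE/dy_l, which is why no flat spot occurs at the output layer.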

Another choice is the Minkowski-r criterion function (Hanson and Burr, 1988):

E(w) = (1/r) Σ_l |d_l − y_l|^r    (5.2.19)

which leads to the following weight update equations:

Δw_lj = η |d_l − y_l|^(r−1) sgn(d_l − y_l) f_o'(net_l) z_j    (5.2.20)

and

Δw_ji = η [ Σ_l |d_l − y_l|^(r−1) sgn(d_l − y_l) f_o'(net_l) w_lj ] f_h'(net_j) x_i    (5.2.21)

where sgn is the sign function. These equations reduce to those of standard backprop for the case r = 2. The motivation behind the use of this criterion is that it can lead to maximum likelihood estimation of the weights for Gaussian and non-Gaussian input data distributions by appropriately choosing r (e.g., r = 2 for Gaussian-distributed data, and r = 1 for data with Laplace distributions). A small r (1 ≤ r < 2) gives less weight to large deviations and tends to reduce the influence of outlier points in the input space during learning; in fact, the case r = 1 is equivalent to minimizing the summed absolute error criterion, which is known to suppress outlier data points. On the other hand, when noise is negligible, the sensitivity of the separating surfaces implemented by the hidden units to the geometry of the problem may be increased by employing r > 2. Similarly, fewer hidden units are recruited when learning complex nonlinearly separable mappings for larger r values (Hanson and Burr, 1988).

If no a priori knowledge is available about the distribution of the training data, it would be difficult to estimate a value for r, unless extensive experimentation with various r values (e.g., r = 1, 2, 3) is done. Alternatively, an automatic method for estimating r is possible by adaptively updating r in the direction of decreasing E. Here, steepest gradient descent on E(r) results in the update rule

Δr = −ρ Σ_l [ (1/r) |d_l − y_l|^r ln|d_l − y_l| − (1/r²) |d_l − y_l|^r ]    (5.2.22)

which, when restricting r to be strictly greater than 1 (the metric error measure case), may be approximated as

Δr ≈ −(ρ/r) Σ_l |d_l − y_l|^r ln|d_l − y_l|    (5.2.23)

Note that it is important that the r update rule be invoked much less frequently than the weight update rule (for example, r may be updated once every 10 training epochs of backprop).

The idea of increasing the robustness of backprop learning in noisy environments can be placed in a more general statistical framework (White, 1989; Mao and Jain, 1993), in which the technique of robust statistics (Huber, 1981) takes effect. Here, robustness of learning refers to insensitivity to small perturbations in the underlying probability distribution p(x) of the training set. These statistical techniques motivate the replacement of the linear error term in the standard backprop update equations by a nonlinear error-suppressor function that is compatible with the underlying probability density function p(x). One example is to set the suppressor to |d_l − y_l|^(r−1) sgn(d_l − y_l) with 1 ≤ r < 2; this error suppressor leads to exactly the Minkowski-r weight update rules of Equations (5.2.20) and (5.2.21). Another selection (Kosko, 1992) leads to robust backprop if p(x) has long tails, such as a Cauchy distribution or some other infinite-variance density. This approach also helps speed up convergence by generating hidden layer weight distributions that have smaller variances than those generated by standard backprop.

Furthermore, regularization terms may be added to the above error functions E(w) in order to introduce desirable effects such as good generalization, faster learning, smaller weight magnitudes, and smaller effective network size. The regularization terms in Equations (5.2.12) and (5.2.14), used for enhancing generalization through weight pruning/elimination, are examples. Another possible regularization term penalizes the sensitivity of the network output to small changes in the input (Drucker and Le Cun, 1992); it has been shown to improve backprop generalization by forcing the output to be insensitive to such changes (see also Poggio and Girosi, 1990a, and the end-of-chapter problems for yet other forms of regularization). Finally, weight sharing, a method whereby several weights in a network are controlled by a single parameter, is another way of enhancing generalization (Rumelhart et al., 1986b); it imposes equality constraints among weights, thus reducing the number of free (effective) parameters in the network, which leads to improved generalization.
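The Minkowski-r error term described above is easy to state in code. The following sketch (illustrative names; targets and outputs as plain arrays) implements the criterion of Equation (5.2.19) and the error factor that replaces (d_l − y_l) in the deltas:

```python
import numpy as np

def minkowski_E(y, d, r):
    # Minkowski-r criterion, Eq. (5.2.19); r = 2 gives the usual
    # half-sum-of-squares form of the SSE criterion
    return np.sum(np.abs(d - y) ** r) / r

def minkowski_error_term(y, d, r):
    """Error factor |d - y|^(r-1) * sgn(d - y) appearing in the
    update rules (5.2.20)-(5.2.21); for r = 2 it is just d - y."""
    e = d - y
    return np.abs(e) ** (r - 1) * np.sign(e)
```

With r = 1.5 the term grows sublinearly in the deviation, which is what suppresses the influence of outlier points relative to the quadratic (r = 2) case.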

An automatic method for effecting weight sharing can be derived by adding the regularization term (Nowlan and Hinton, 1992a and 1992b)

R(w) = −λ Σ_i ln [ Σ_j π_j p_j(w_i) ]    (5.2.24)

to the error function. Here, each p_j(w_i) is a Gaussian density with mean μ_j and standard deviation σ_j, π_j is the mixing proportion of Gaussian p_j (the π_j must sum to 1), and w_i represents an arbitrary weight in the network. The μ_j, σ_j, and π_j parameters are assumed to adapt as the network learns. The use of multiple adaptive Gaussians allows the implementation of "soft" weight sharing, in which the learning algorithm decides for itself which weights should be tied together. If the Gaussians all start with high variance, the initial grouping of weights into subsets will be very soft; as the network learns and the variances shrink, those groupings become more and more distinct and converge to subsets influenced by the task being learned, thus reducing the number of free (effective) parameters in the network, which leads to improved generalization.

For gradient descent-based adaptation, one may employ the partial derivatives

∂R/∂w_i = λ Σ_j r_j(w_i) (w_i − μ_j)/σ_j²    (5.2.25)

∂R/∂μ_j = −λ Σ_i r_j(w_i) (w_i − μ_j)/σ_j²    (5.2.26)

∂R/∂σ_j = −λ Σ_i r_j(w_i) [ (w_i − μ_j)²/σ_j³ − 1/σ_j ]    (5.2.27)

and

∂R/∂γ_j = −λ Σ_i [ r_j(w_i) − π_j ]    (5.2.28)

with

r_j(w_i) = π_j p_j(w_i) / Σ_k π_k p_k(w_i)    (5.2.29)

It should be noted that the derivation of the partial of R with respect to the mixing proportions is less straightforward than those in Equations (5.2.25) through (5.2.27), since we must maintain the sum of the π_j's equal to 1. The result in Equation (5.2.28) has been obtained by appropriate use of a Lagrange multiplier method and a bit of algebraic manipulation [equivalently, by expressing the π_j through a softmax over auxiliary variables γ_j].

One may come up with simple interpretations for the derivatives in Equations (5.2.25) through (5.2.28). The term r_j(w_i) in these equations is the posterior probability of Gaussian j given weight w_i: it measures the responsibility of Gaussian j for the ith weight, and it realizes a competition mechanism among the various Gaussians for taking on responsibility for weight w_i. Equation (5.2.25) attempts to pull each weight toward the center of the "responsible" Gaussian. Similarly, the partial derivative for μ_j drives μ_j toward the weighted average of the set of weights for which Gaussian j is responsible. Thus, the penalty term in Equation (5.2.24) leads to unsupervised clustering of weights (weight sharing) driven by the biases in the training set.
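The responsibility mechanism and the weight-pulling gradient can be sketched numerically. The code below is an illustrative fragment (names and shapes are my own, not from the text): `responsibilities` computes Equation (5.2.29) and `dR_dw` computes Equation (5.2.25) for a vector of weights and a small Gaussian mixture.

```python
import numpy as np

def responsibilities(w, pi, mu, sigma):
    """r_j(w_i), Eq. (5.2.29): posterior probability of Gaussian j
    given weight w_i. Rows index weights i, columns index Gaussians j."""
    p = np.exp(-(w[:, None] - mu[None, :]) ** 2
               / (2 * sigma[None, :] ** 2)) \
        / (np.sqrt(2 * np.pi) * sigma[None, :])
    num = pi[None, :] * p
    return num / num.sum(axis=1, keepdims=True)

def dR_dw(w, pi, mu, sigma, lam=1.0):
    # Eq. (5.2.25): each weight is pulled toward the centers of the
    # Gaussians "responsible" for it, in proportion to r_j(w_i)
    r = responsibilities(w, pi, mu, sigma)
    return lam * np.sum(r * (w[:, None] - mu[None, :])
                        / sigma[None, :] ** 2, axis=1)
```

With narrow, well-separated Gaussians the responsibilities approach hard assignments, and the penalty behaves like clustering each weight to its nearest center — the "distinct groupings" described above.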
5.3 Applications

Backprop is by far the most popular supervised learning method for multilayer neural networks. Its most appealing feature is its adaptive nature, which allows complex processes to be modeled through learning from measurements or examples; the method does not require knowledge of specific mathematical models for, or expert knowledge of, the problem being solved. Backprop and its variations have been applied to a wide variety of problems, including pattern recognition, speech recognition, signal processing, image compression, medical diagnosis, prediction, nonlinear system modeling, and control. The purpose of this section is to give the reader a flavor of the various areas of application of backprop and to illustrate some strategies that might be used to enhance the training process in some nontrivial real-world problems.

5.3.1 NETtalk

One of the earliest applications of backprop was to train a network to convert English text into speech (Sejnowski and Rosenberg, 1987). A related system is Glove-Talk (Fels and Hinton, 1993), in which a bank of five feedforward neural networks with single hidden layers, trained by backprop, is used to map sensor signals generated by a data glove to appropriate commands (words), which in turn are sent to a speech synthesizer that then speaks the word. The hand-gesture data generated by the data glove consist of 16 parameters, representing the x, y, z, roll, pitch, and yaw of the hand relative to a fixed reference and ten finger flex angles; these parameters are measured every 1/60th of a second. A block diagram of the Glove-Talk system is shown in Figure 5.3.

Another successful application of backprop is the recognition of handwritten (ZIP code) digits (Le Cun et al., 1989). Locality and weight-sharing interconnections are used between the inputs and layer H1 and between layers H1 and H2; here, weight sharing refers to having several connections controlled by a single weight (feature maps). All units in a given map in H2 share their weights. Each unit in H2 receives input from a subset (eight, in this case) of the 12 maps in H1, and its receptive field is composed of a 5 by 5 window centered at identical positions within each of the selected subset of maps in H1. The connection scheme between layers H1 and H2 is thus quite similar to the one between H1 and the input layer, but with slightly more complications due to the multiple 8 by 8 feature maps in H1. Full interconnectivity was assumed between the second hidden layer and the output layer, and all units use a hyperbolic tangent activation function. As a result of the above structure, the network has 1,256 units, 64,660 connections, and 9,760 independent weights. The network is represented in Figure 5.3.

Before training, the weights were initialized with a uniform random distribution between −2.4 and +2.4 and further normalized by dividing each weight by the fan-in of the corresponding unit. Backprop based on the approximate Newton method (described earlier in this chapter) was employed in an incremental mode. The network was trained for 23 cycles through the training set (which required 3 days of CPU time on a Sun SparcStation 1). The percentage of misclassified patterns was 0.14% on the training set and 5.0% on the test set. Another performance test employed a rejection criterion, whereby an input pattern was rejected unless the difference between the levels of the two most active units in the output layer exceeded a given threshold (as opposed to simply taking the most active output unit). For this rejection threshold, the network classification error on the test patterns was reduced to 1%, but with a 12% rejection rate. Additional weight pruning based on information-theoretic ideas, in a four-hidden-layer architecture similar to the one described above, resulted in a network with only a fraction of the free parameters of the one described above, a generalization error of about 1%, and a rejection rate of about 9%.

Similar experiments were reported by Martin and Pittman (1991). Here, digits were automatically presegmented and size-normalized to a 15 by 24 gray-scale array, with pixel values from 0 to 1. A total of 35,200 samples were available for training and another 4,000 samples for testing. Three types of networks were trained; all nets had two hidden layers and ten units in their output layers, which employed 1-out-of-10 encoding. First, global fully interconnected nets were used, with 150 units in the first hidden layer and 50 units in the second hidden layer. Second, local nets were used, with 540 units in the first hidden layer receiving input from 5 by 8 local and overlapping regions (offset by 2 pixels) on the input array; these hidden units were fully interconnected to 100 units in the second hidden layer, which, in turn, were fully interconnected to the units in the output layer. Finally, local, shared-weight nets were used, with approximately the same number of units in each layer as the local nets; these employed a weight-sharing strategy, similar to the one described above, between the input and the first hidden layer and between the first and second hidden layers. With the full 35,200-sample training set, the various nets were trained using backprop to comparably low error rates of a few percent. When the size of the training set was reduced to the 1,000 to 4,000 range, however, the local shared-weight nets (with about 6,500 independent weights) were substantially better than the global (at 63,000 independent weights) and local (at approximately 79,000 independent weights) nets; employing a rejection criterion further improved performance, to 99% correct generalization with a rejection rate of only 9% (Martin and Pittman, 1991). All these results suggest another way of achieving good generalization: constraining the degrees of freedom of the network through local receptive fields and shared weights, as opposed to weight or unit pruning.

An impressive real-time application of backprop is ALVINN (Pomerleau, 1989, 1991), a network that learns to steer a vehicle along roads; it is an example of a successful application using sensor data in real time to perform a real-world perception task. ALVINN's architecture consists of a single-hidden-layer, fully interconnected feedforward net with 5 sigmoidal units in the hidden layer and 30 linear output units. The input is a 30 by 32 pattern reduced from the image of an on-board camera. During the training phase, the network is presented with road images as inputs and the corresponding steering signal generated by a human driver as the desired output. The desired steering angle is presented to the network as a Gaussian distribution of activation centered around the steering direction that will keep the vehicle centered on the road: the desired output d_l for unit l is a Gaussian function of D_l, the lth unit's distance from the correct steering direction point along the output vector, whose variance of 10 was determined empirically. The Gaussian target pattern makes the learning task easier than a "1-of-30" binary target pattern, since slightly different road images require the network to respond with only slightly different output vectors. The steering direction generated by the network is taken to be the center of mass of the activity pattern generated by the output units; this allows finer steering corrections, as compared to using the most active output unit.

Two potential problems arise in training ALVINN on live driving data. First, since the human driver tends to steer the vehicle down the center of the road, the network will not be presented with enough situations in which it must recover from misalignment errors. Second, when training the network with only the current image of the road, one runs the risk of overlearning from repetitive inputs, thus causing the network to "forget" what it had learned from earlier training. These two problems are handled as follows. First, each input image is laterally shifted to create 14 additional images, in which the vehicle appears to be shifted by various amounts relative to the road center; a correct steering direction is then generated and used as the desired target for each of the shifted images. These images are shown in Figure 5.3. Second, each training cycle consists of a pass through a buffer of 200 images, which includes the current original image and its 14 shifted versions. After each training cycle, a new road image and its 14 shifted versions are used to replace 15 patterns from the current set of 200 road scenes, in order to eliminate the problem of overtraining on repetitive images; ten of the fifteen patterns to be replaced are the ones with the lowest error, and the other five are chosen randomly. Backprop training is used with a constant learning rate for each weight, scaled by the fan-in of the unit to which the weight projects, and a steadily increasing momentum coefficient is used during training.
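The Gaussian target encoding and center-of-mass decoding described for ALVINN can be sketched as below. This is an illustrative reconstruction, not ALVINN's code: I assume the commonly quoted form d_l = exp(−D_l²/10) for the target, and the unit indexing is my own.

```python
import numpy as np

def steering_target(center, n_units=30, var=10.0):
    """Gaussian desired-activation pattern over the 30 output units,
    centered on the correct steering direction `center` (may be
    fractional); var = 10 follows the empirically chosen spread."""
    D = np.arange(n_units) - center        # distance along output vector
    return np.exp(-D ** 2 / var)

def steering_output(activations):
    # decoded steering direction: center of mass of the output activity,
    # giving finer corrections than the most-active-unit rule
    idx = np.arange(len(activations))
    return np.sum(idx * activations) / np.sum(activations)
```

Note that the center of mass recovers a steering direction *between* output units, which is exactly why it permits finer corrections than a 1-of-30 readout.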

Using this real-time learning technique, ALVINN quickly learned to autonomously control the van by observing the reactions of a human driver. ALVINN requires approximately 50 iterations through the dynamically evolving set of 200 patterns to learn to drive on roads it had been trained on (an on-board Sun-4 workstation took 5 minutes to do the training, during which a teacher driver drives at about 4 miles per hour over the test road). In the retrieval phase (autonomous driving), the system is able to process 25 images per second, allowing it to drive up to the van's maximum speed of 20 miles per hour (this maximum speed is due to constraints imposed by the hydraulic drive system); this is over twice the speed of any other sensor-based autonomous system that has driven this van. In addition to being able to drive along the same stretch of road it trained on, ALVINN can also generalize to drive along parts of the road it has never encountered, even under a wide variety of weather conditions.

Figure 5.3. Shifted video images from a single original video image, used to enrich the training set used to train ALVINN. (From D. Pomerleau, 1991, with permission of the MIT Press.)

Backprop has also been applied to medical diagnosis. Acute myocardial infarction (coronary occlusion) is an example of a disease that is difficult to diagnose, and there have been a number of attempts to automate the diagnosis process (e.g., Bounds et al., 1988). An illustrative example of a neural network-based medical diagnosis system applied to the diagnosis of coronary occlusion is described in Baxt (1990).

Another application area is image compression, where the most promising automated solutions are transform-based (see Gonzalez and Wintz, 1987). Consider the architecture of a single-hidden-layer feedforward neural network, shown in Figure 5.3.8a. This network has the same number of units in its output layer as inputs, and the number of hidden units is assumed to be much smaller than the dimension of the input vector, so that the hidden layer forms a "bottleneck." The hidden units are assumed to be of the bipolar sigmoid type, and the output units are linear. The network is trained on a set of n-dimensional, real-valued vectors (patterns) x^k such that each x^k is mapped to itself at the output layer, in an autoassociative mode; backprop may be used to learn such a mapping. Thus, the network is trained to act as an encoder of real-valued patterns, and backprop attempts to extract regularities (significant features) from the input vectors. In order to achieve true compression for the purpose of efficient transmission over a digital communication link, the outputs of the hidden units must be quantized. Quantization consists of transforming the outputs of the hidden units [which are in the open interval (−1, +1)] to some integer range corresponding to the number of bits required for transmission; this, effectively, restricts the information in the hidden unit outputs to the number of bits used.

In one simulation, an autoassociative net was trained on random 8 × 8 patches of an image using incremental backprop learning, with all pixel values normalized to the range [−1, +1]. The learning consisted of 50,000 to 100,000 iterations at learning rates of 0.01 and 0.1 for the hidden and output layer weights, respectively. Figure 5.3.8b shows the image reproduced by the autoassociative net when tested on the training image; the reproduced image is quite close (to the eye) to the training image, hence the reconstructed image is of good quality.

In general, the nonlinearity in the hidden units is theoretically of no help here (Bourlard and Kamp, 1988), and indeed Cottrell et al. (1987) and Cottrell and Munro (1988) found that the nonlinearity has little added advantage in their simulations. These results are further supported by Baldi and Hornik (1989), who showed that if J linear hidden units are used, the network learns to project the input onto the subspace spanned by the first J principal components of the input: the network's hidden units discard as little information as possible by evolving their respective weight vectors to point in the directions of the input's principal components. This means that autoassociative backprop learning in a two-layer feedforward neural network with linear units has no processing capability beyond that of the unsupervised Hebbian PCA nets of Chapter 3.
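The Baldi-Hornik result can be illustrated with a small numerical experiment (this is my own toy sketch, not the book's simulation; the data direction `v` and all hyperparameters are arbitrary): a linear two-layer autoassociator with a one-unit bottleneck, trained by batch gradient descent, converges to the projection onto the data's first principal component.

```python
import numpy as np

rng = np.random.default_rng(0)
v = np.array([0.6, 0.8, 0.0])                       # dominant data direction
X = np.outer(rng.standard_normal(500), v) \
    + 0.02 * rng.standard_normal((500, 3))          # near-1D cloud in R^3
X -= X.mean(axis=0)

# linear autoassociator with a single hidden (bottleneck) unit
W1 = 0.1 * rng.standard_normal((1, 3))              # encoder weights
W2 = 0.1 * rng.standard_normal((3, 1))              # decoder weights
eta = 0.05
for _ in range(4000):
    H = X @ W1.T                                    # hidden activity
    Y = H @ W2.T                                    # reconstruction
    E = Y - X                                       # reconstruction error
    W2 -= eta * (E.T @ H) / len(X)                  # gradient of 0.5*||E||^2
    W1 -= eta * ((E @ W2).T @ X) / len(X)
```

After training, the composite map W2·W1 approximates the rank-1 projection v·vᵀ onto the principal subspace, regardless of how the scale is split between the two layers — which is the sense in which the linear autoassociator "is" PCA.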

Here. backprop) into the temporal domain.5. In general. One such example is plant modeling in control applications. Cottrell. The classifier was trained to recognize thegender of a limited set of subjects.096 dimensional "pixel space. Each image is then aligned along the axes of the eyesand mouth. The overallencoder/classifiersystem resulted in a 95% correct gender recognition on both trainingand test sets which was found to be comparable to the recognitionrate of human beings on the same images. For another significantapplication of nonlinear PCA autoassociative nets." Here. All images are normalized to have equal brightnessand variance.4. the trainingset is comprised of 160 images of various facial impressionsof 10 male and 10 female subjects. with pruning of representationlayer units. encoder subnet of a four hiddenlayer autoassociativenet is used to supply five dimensional inputs to a feedforwardneural classifier. the autoassociativenet was first trained using backprop. the need arises to model dynamical processes where a time sequence is required in response to certain temporal input signal(s).
These principal manifolds can, in some cases, serve as low-dimensional representations of the data which are more useful than principal components. A three hidden layer autoassociative net can, theoretically, compute any continuous mapping from the inputs to the second hidden layer (representation layer), and another mapping from the second hidden layer to the output layer. Thus, a three hidden layer autoassociative net (with a linear or nonlinear representation layer) may in principle be considered as a universal nonlinear PCA net, with processing capability beyond those of the unsupervised Hebbian PCA nets of Section 3.2. However, such a highly nonlinear net may be problematic to train by backprop due to local minima. [For applications of such PCA nets to image compression and feature representation, the reader is referred to Kramer (1991) and Usui et al. (1991).]

Another way of interpreting the above autoassociative feedforward network is from the point of view of feature extraction (Kuczewski et al., 1991). In one simulation, such a net was trained to generate a five-dimensional representation from 50-dimensional inputs. The inputs were taken as the first 50 principal components of 64 × 64-pixel, 8-bit gray scale images; each image can be considered to be a point in a 4096-dimensional space. The images are captured by a frame grabber and reduced to 64 × 64 pixels by averaging, and the grey levels of the image pixels are linearly scaled to the range [0.3, 0.8] in order to prevent the use of first order statistics for discrimination. Of these images, 120 were used for training and 40 for testing. The high rate of correct classification in this simulation is a clear indication of the "richness" and significance of the representations/feature vectors discovered by the nonlinear PCA autoassociative net. A somewhat related recurrent multilayer autoassociative net for data clustering and signal decomposition is presented in Section 6.2.
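The bottleneck idea behind such autoassociative nets can be illustrated with a small sketch. The following is not the simulation described above: the data, the 3-8-1-8-3 layer sizes, the learning rate, and the iteration count are all arbitrary choices. A three hidden layer autoassociator with a one-unit representation layer is trained by plain batch backprop on data lying near a one-dimensional curve.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 200 points near a 1-D curve embedded in 3-D
# (a nonlinear "principal manifold"), plus a little noise.
t = rng.uniform(-1, 1, (200, 1))
X = np.hstack([t, t**2, np.sin(2 * t)]) + 0.01 * rng.standard_normal((200, 3))

# Three-hidden-layer autoassociator: 3 -> 8 -> 1 -> 8 -> 3.
# The middle (representation) layer extracts a 1-D nonlinear "component".
sizes = [3, 8, 1, 8, 3]
W = [rng.standard_normal((m, n)) * 0.3 for m, n in zip(sizes[1:], sizes[:-1])]
b = [np.zeros(m) for m in sizes[1:]]

def forward(X):
    acts = [X]
    for l, (Wl, bl) in enumerate(zip(W, b)):
        z = acts[-1] @ Wl.T + bl
        acts.append(z if l == len(W) - 1 else np.tanh(z))  # linear output layer
    return acts

def mse(X):
    return float(np.mean((forward(X)[-1] - X) ** 2))

err0 = mse(X)
lr = 0.05
for _ in range(300):                          # plain batch backprop
    acts = forward(X)
    delta = 2 * (acts[-1] - X) / X.shape[0]   # dE/dz at the linear output
    for l in range(len(W) - 1, -1, -1):
        gW = delta.T @ acts[l]
        gb = delta.sum(axis=0)
        if l > 0:
            delta = (delta @ W[l]) * (1 - acts[l] ** 2)  # through tanh
        W[l] -= lr * gW
        b[l] -= lr * gb

err1 = mse(X)
code = forward(X)[2]   # the 1-D representation found by the middle layer
print(err0, "->", err1)
```

The reconstruction error drops as the net learns to route the data through the single-unit representation layer; the middle-layer activations then serve as the low-dimensional feature values.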
5.4 Extensions of Backprop for Temporal Learning
Up to this point we have been concerned with "static" mapping networks which are trained to produce a spatial output pattern in response to a particular spatial input pattern. However, in many engineering, scientific, and economic applications, it would be very useful to extend the multilayer feedforward network and its associated training algorithm(s) to handle time sequences. Here, for example, it is desired to capture the dynamics of an unknown plant (usually nonlinear) by modeling a flexible-structured network that will imitate the plant by adaptively changing its parameters to track the plant's observable output signals when driven by the same input signals. The resulting model is referred to as a temporal association network. Temporal association networks must have a recurrent (as opposed to static) architecture in order to handle the time dependent nature of associations; this requires nets with feedback connections and proper associated learning algorithms.

Two special cases of temporal association networks are sequence reproduction and sequence recognition networks. For sequence reproduction, a network must be able to generate the rest of a sequence from a part of that sequence. This is appropriate, for example, for predicting the price trend of a given stock market from its past history or predicting the future course of a time series from examples. In sequence recognition, a network produces a spatial pattern or a fixed output in response to a specific input sequence. This is appropriate for speech recognition, where the output encodes the word corresponding to the speech signal. In the following, neural net architectures having various degrees of recurrency and their associated learning methods are introduced which are capable of processing time sequences.

5.4.1 Time-Delay Neural Networks

Consider the time-delay neural network architecture shown in Figure 5.4.1. This network maps a finite time sequence x(t), x(t - 1), ..., x(t - m) into a single output y (this can also be generalized for the case when x and/or y are vectors). The architecture in Figure 5.4.1 is equivalent to a single hidden layer feedforward neural network receiving the (m + 1)-dimensional "spatial" pattern x generated by a tapped delay line preprocessor from a temporal sequence. One may view this neural network as a discrete-time nonlinear filter (we may also use the borrowed terms finite-duration impulse response (FIR) filter or nonrecursive filter from the linear filtering literature). Thus, if target values for the output unit are specified for various times t, backprop may be used to train the above network to act as a sequence recognizer. The time-delay neural net has been successfully applied to the problem of speech recognition (e.g., Tank and Hopfield, 1987; Elman and Zipser, 1988; Lippmann, 1989; Waibel, 1989; Waibel et al., 1989) and time series prediction (Lapedes and Farber, 1987; Weigend et al., 1991). NETtalk and Glove-Talk of Section 5.3 are two other examples of sequence recognition networks. In the following, we discuss time series prediction since it captures the spirit of the type of processing done by the time-delay neural net.

Figure 5.4.1. A time-delay neural network for one-dimensional input/output signals.

Given observed values of the state x of a (nonlinear) dynamical system at discrete times less than t, the goal is to use these values to accurately predict x(t + p), where p is some prediction time step into the future (for simplicity, we assume a one dimensional state x). Theoretical justification for this approach is available in the form of a very powerful theorem by Takens (1981), which states that there exists a functional relation of the form

(5.4.1)    x(t + p) = g[x(t), x(t - τ), ..., x(t - mτ)]

as long as the trajectory x(t) evolves towards compact attracting manifolds of dimension d. This theorem, however, provides no information on the form of g or the value of τ. Thus, one may use the tapped delay line nonlinear filter of Figure 5.4.1 as the basis for predicting x(t + p). Here, as is normally done in linear signal processing applications (e.g., Widrow and Stearns, 1985), a training set is constructed of pairs {x(t), x(t + p)}, where x(t) = [x(t), x(t - τ), ..., x(t - mτ)]T. Backprop may now be employed to learn such a training set; the network approximates the general form of Equation (5.4.1) by the continuous, adaptive parameter model

(5.4.2)

where a linear activation is assumed for the output unit and fh is the nonlinear activation of the hidden units. The time-delay neural network approach provides a robust approximation for g in Equation (5.4.1). A method is robust if it can maintain prediction accuracy for a wide range of p values; clearly, as p increases the quality of the predicted value will degrade for any predictive method. Reported simulation results of this prediction method show comparable or better performance compared to other non-neural network-based techniques (Lapedes and Farber, 1988; Weigend et al., 1991; Weigend and Gershenfeld, 1993).

A simple modification to the time-delay net makes it suitable for sequence reproduction. Here, during retrieval, the output y [predicting x(t + p)] is propagated through a single delay element, with the output of this delay element connected to the input of the time-delay net as is shown in Figure 5.4.2. The training procedure is identical to the one for the above prediction network. This sequence reproduction net will only work if the prediction is very accurate, since any error in the predicted signal has a multiplicative effect due to the iterated scheme employed.
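The tapped delay line predictor and the iterated reproduction scheme can be sketched as follows. The series, the number of taps, the hidden layer size, and the learning parameters below are illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative scalar series (the signal and all sizes are arbitrary).
T = 400
x = np.sin(0.15 * np.arange(T)) + 0.5 * np.sin(0.041 * np.arange(T))

m, p = 4, 1          # m + 1 taps on the delay line, predict p steps ahead

# "Spatial" training pairs produced by the tapped delay line:
# [x(t), x(t-1), ..., x(t-m)]  ->  x(t+p)
X = np.array([x[t - m:t + 1][::-1] for t in range(m, T - p)])
d = x[m + p:T]

# Single hidden layer net with a linear output unit, trained by batch backprop.
H = 10
W1 = rng.standard_normal((H, m + 1)) * 0.5; b1 = np.zeros(H)
w2 = rng.standard_normal(H) * 0.5;          b2 = 0.0

def predict(V):
    return np.tanh(V @ W1.T + b1) @ w2 + b2

for _ in range(2000):
    h = np.tanh(X @ W1.T + b1)
    e = h @ w2 + b2 - d
    g = 2 * e / len(d)
    gh = np.outer(g, w2) * (1 - h ** 2)
    W1 -= 0.1 * (gh.T @ X); b1 -= 0.1 * gh.sum(axis=0)
    w2 -= 0.1 * (h.T @ g);  b2 -= 0.1 * g.sum()

rmse = float(np.sqrt(np.mean((predict(X) - d) ** 2)))

# Sequence reproduction (the Figure 5.4.2 idea): feed the prediction back
# through a unit delay into the tapped delay line and iterate.
taps = list(x[T - m - 1:T][::-1])     # most recent value first
free_run = []
for _ in range(20):
    xhat = float(predict(np.array([taps]))[0])
    free_run.append(xhat)
    taps = [xhat] + taps[:m]
print("one-step RMSE:", rmse)
```

The free-running continuation illustrates why accurate one-step prediction is essential: each predicted value becomes an input for the next step, so errors compound.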

Figure 5.4.2. Sequence reproduction network.

Further generalization of the above ideas can result in a network for temporal association. We present such modifications in the context of nonlinear dynamical plant identification/modeling of control theory. Consider the following general nonlinear single input, single output plant described by the difference equation:

(5.4.3)    x(t + 1) = g[x(t), x(t - 1), ..., x(t - n + 1); u(t), u(t - 1), ..., u(t - m + 1)]

where u(t) and x(t) are, respectively, the input and output signals of the plant at time t, and g is a nonlinear function. Equation (5.4.3) will generate (recursively) an output time sequence in response to an input time sequence. This may also be viewed as a (nonlinear) recursive filter, termed infinite-duration impulse response (IIR) filter in the linear filtering literature. Here, we assume that the order of the plant is known (m and n are known). The general form of Equation (5.4.3) suggests the use of a time-delay neural network, shown inside the dashed rectangle in Figure 5.4.3. We are interested in training a suitable layered neural network to capture the dynamics of the plant in Equation (5.4.3), thus modeling the plant.

During training, the neural network and the plant receive the same input u(t). The neural network also receives the plant's output x(t + 1) (switch S in the up position in Figure 5.4.3). Backprop can be used to update the weights of the neural network based on the "static" mapping pairs given by the delayed plant inputs and outputs and the target x(t + 1), for various values of t. This identification scheme is referred to as the series-parallel identification model (Narendra and Parthasarathy, 1990); theoretical justifications for the effectiveness of this neural network identification method can be found in Levin and Narendra (1992). After training, the neural network with the switch S in the down position (the network's own output is fed back as the input to the top delay line in Figure 5.4.3) will generate (recursively) an output time sequence in response to an input time sequence. If the training was successful, one would expect this output to approximate the actual output of the plant for the same input signal u(t) and same initial conditions.

Figure 5.4.3. A time-delay neural network setup for the identification of a nonlinear plant.

Narendra and Parthasarathy (1990) reported successful identification of nonlinear plants by time-delay neural networks similar to the one in Figure 5.4.3. In one of their simulations, the feedforward part of the neural network consisted of a two hidden layer network with five inputs and a single linear output unit. The two hidden layers consisted of 20 and 10 units, respectively, with bipolar sigmoid activations. This network was used to identify the unknown plant

(5.4.5)

The inputs to the neural network during training were the corresponding delayed samples of the plant output and input. Incremental backprop was used to train the network using a uniformly distributed random input signal whose amplitude was in the interval [-1, +1]. A learning rate of 0.25 was used. The training phase consisted of 100,000 training iterations, which amounts to one training cycle over the random input signal u(t). Figure 5.4.4(a) shows the output of the plant (solid line) and the model (dotted line) for the input signal given in Equation (5.4.4). Figure 5.4.4(b) shows simulation results with a single hidden layer net consisting of twenty bipolar sigmoid activation hidden units; here, incremental backprop with a learning rate of 0.25 was also used, and the training phase consisted of 5 × 10^6 iterations, which amounts to 10,000 cycles over a 500 sample random input signal. It should be noted that in the above simulations no attempt was made to optimize the network size or to tune the learning process. Other learning algorithms may be used for training the time-delay neural network discussed above, some of which are extensions of algorithms used in classical linear adaptive filtering or adaptive control; Nerrand et al. (1993) present examples of such algorithms.
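The series-parallel scheme can be sketched in a few lines. The plant below is an invented stable second-order example, not the plant of Equation (5.4.5), and the network sizes and learning parameters are likewise arbitrary. During training the net sees the plant's own delayed outputs (switch S up); during retrieval its own output is fed back (switch S down).

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented stable second-order plant (NOT the plant of Equation (5.4.5)):
def plant(x1, x0, u):
    return 0.2 * x1 + 0.3 * x0 + 0.6 * np.sin(np.pi * u)

# Drive the plant with a uniformly distributed random input in [-1, +1].
T = 2000
u = rng.uniform(-1, 1, T)
x = np.zeros(T + 1)
for t in range(1, T):
    x[t + 1] = plant(x[t], x[t - 1], u[t])

# Series-parallel training pairs (switch S "up"): the net is fed the
# plant's own delayed outputs and input, and must reproduce x(t+1).
inp = np.array([[x[t], x[t - 1], u[t]] for t in range(1, T)])
tgt = x[2:T + 1]

H = 20
W1 = rng.standard_normal((H, 3)) * 0.5; b1 = np.zeros(H)
w2 = rng.standard_normal(H) * 0.3;      b2 = 0.0

def net(v):
    return np.tanh(v @ W1.T + b1) @ w2 + b2

for _ in range(1500):                     # batch backprop
    h = np.tanh(inp @ W1.T + b1)
    e = h @ w2 + b2 - tgt
    g = 2 * e / len(tgt)
    gh = np.outer(g, w2) * (1 - h ** 2)
    W1 -= 0.2 * (gh.T @ inp); b1 -= 0.2 * gh.sum(axis=0)
    w2 -= 0.2 * (h.T @ g);    b2 -= 0.2 * g.sum()

train_rmse = float(np.sqrt(np.mean((net(inp) - tgt) ** 2)))

# Retrieval (switch S "down"): the model's own output is fed back, with
# the same test input driving both plant and model.
u_test = np.sin(2 * np.pi * np.arange(100) / 25)
xp, xm = [0.0, 0.0], [0.0, 0.0]
for t in range(1, 99):
    xp.append(plant(xp[t], xp[t - 1], u_test[t]))
    xm.append(float(net(np.array([xm[t], xm[t - 1], u_test[t]]))))
free_rmse = float(np.sqrt(np.mean((np.array(xp) - np.array(xm)) ** 2)))
print(train_rmse, free_rmse)
```

Training uses only one-step ("static") pairs, yet the recursively driven model can track the plant in free run, which is precisely the point of the series-parallel arrangement.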
Figure 5.4.4. Identification results for the plant in Equation (5.4.5). Plant output x(t) (solid line) and neural network output (dotted line) in response to the input signal in Equation (5.4.4), 0 < t ≤ 100. (a) The network has two hidden layers and is trained with incremental backprop for one cycle over a 100,000 sample random input signal. (b) The network has a single hidden layer and is trained for 10,000 training cycles over a 500 sample input signal having the same characteristics as described above. (Adapted from K. S. Narendra and K. Parthasarathy, 1990, Identification and Control of Dynamical Systems Using Neural Networks, IEEE Transactions on Neural Networks, 1(1), 4-27, ©1990 IEEE.)

5.4.2 Backpropagation Through Time

In the previous section, a partially recurrent neural network was presented which is capable of temporal association. In general, however, a fully recurrent neural net is a more appropriate/economic alternative. Here, arbitrary interconnection patterns between units can exist. Also, individual units may be input units, output units, or both, and the desired targets are defined on a set of arbitrary units at certain predetermined times. An example of a simple two-unit fully interconnected network is shown in Figure 5.4.5(a). The network receives an input sequence x(t) at unit 1, and it is desired that the network generates the sequence d(t) as the output y2(t) of unit 2.

A network which behaves identically to the above simple recurrent net over the time steps t = 1, 2, 3, 4 is shown in Figure 5.4.5(b). Here, all units in the recurrent network are duplicated T times, so that a separate unit in the unfolded network holds the state yi(t) of the equivalent recurrent network at time t. This amounts to unfolding the recurrent network in time (Minsky and Papert, 1969) to arrive at a feedforward layered network; the number of resulting layers is equal to the unfolding time interval T. Note that the connections wij from unit j to unit i in the unfolded network are identical for all layers. The resulting unfolded network simplifies the training process of encoding the x(t) to d(t) sequence association since now backprop learning is applicable. Adapting backprop to training unfolded recurrent neural nets results in the so-called backpropagation through time learning method (Rumelhart et al., 1986b).

However, we should note a couple of things here. First, errors at the output of hidden units, and not just the output errors, must be propagated backward from the layer in which they originate; thus, targets may be specified for hidden units, output units, or both. Second, it is important to realize the constraint that all copies of each weight wij must remain identical across duplicated layers (backprop normally produces different increments for each particular weight copy). A simple solution is to add together the individual weight changes for all copies of a particular weight wij and then change all such copies by the total amount. Once trained, a copy of the weights from any layer of the unfolded net is copied into the recurrent network which, in turn, is used for the temporal association task.

There exist relatively few applications of this technique in the literature (e.g., Rumelhart et al., 1986b; Nowlan, 1988; Nguyen and Widrow, 1989). One reason is its inefficiency in handling long sequences: the idea is effective when T is small, and T limits the maximum length of sequences that can be generated. Another reason is that other learning methods are able to solve the problem without the need for unfolding. These methods are treated next. But first, we describe one interesting application of backpropagation through time: the truck backer-upper problem.

Figure 5.4.5. (a) A simple recurrent network. (b) A feedforward network generated by unfolding in time the recurrent net in (a). The two networks are equivalent over the four time steps t = 1, 2, 3, 4.
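The weight-sharing bookkeeping just described can be sketched on a two-unit net like that of Figure 5.4.5, here in a discrete-time form y(t + 1) = tanh(Wy(t) + x(t)) chosen for brevity; the input and target sequences are invented.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two-unit fully recurrent net, unfolded over T = 4 time steps.
# Unit 1 receives the input sequence; unit 2 should produce the targets.
T = 4
x_seq = np.array([0.5, -0.3, 0.8, 0.1])
d_seq = np.array([0.2, -0.1, 0.4, 0.0])
W = rng.standard_normal((2, 2)) * 0.5

def unfold(W):
    # run the unfolded T-layer feedforward copy of the net
    ys = [np.zeros(2)]
    for t in range(T):
        inp = np.array([x_seq[t], 0.0])      # input enters unit 1 only
        ys.append(np.tanh(W @ ys[-1] + inp))
    return ys

def loss(W):
    ys = unfold(W)
    return float(sum((ys[t + 1][1] - d_seq[t]) ** 2 for t in range(T)))

loss0 = loss(W)
for _ in range(500):
    ys = unfold(W)
    gW = np.zeros((2, 2))          # weight changes summed over all copies
    delta = np.zeros(2)            # dE/dy entering the current layer
    for t in range(T - 1, -1, -1):
        delta = delta + np.array([0.0, 2 * (ys[t + 1][1] - d_seq[t])])
        dz = delta * (1 - ys[t + 1] ** 2)    # back through tanh
        gW += np.outer(dz, ys[t])            # this layer's copy of W
        delta = W.T @ dz                     # error to the layer below
    W -= 0.05 * gW                 # one shared update for every copy of W
loss1 = loss(W)
print(loss0, "->", loss1)
```

Note that the per-layer gradients are accumulated into a single matrix `gW` and applied once, so all T copies of each weight stay identical, exactly the constraint discussed above.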

Consider the trailer truck system shown in Figure 5.4.6. The objective here is to design a controller which generates a steering signal s that successfully backs-up the truck so that the back of the trailer, designated by the coordinates (x, y), ends at the (0, 0) reference point with the trailer perpendicular to the dock (i.e., the trailer angle θt is zero). It is assumed that the truck backs-up at a constant speed and that only backward movements of the cab are allowed. The controller receives the observed state x = [x, y, θt, θc]T (θc is the cab angle) and produces the steering signal (angle) s. The original application assumes six state variables, including the position of the back of the cab; however, these two additional variables may be eliminated if the length of the cab and that of the trailer are given. The details of the trailer truck kinematics can be found in Miller et al. (1990a).

Figure 5.4.6. A pictorial representation of the truck backer-upper problem.

Before the controller is designed, a feedforward single hidden layer neural network is trained, using backprop, to emulate the truck and trailer kinematics. This is accomplished by training the network on a large number of backup trajectories (corresponding to random initial trailer truck position configurations), each consisting of a set of association pairs {[x(k - 1), s(k - 1)], x(k)}, where k = 1, 2, ..., T, and T represents the number of backup steps till the trailer hits the dock or leaves some predesignated borders of the parking lot (T depends on the initial state of the truck and the applied steering signal s). The steering signal was selected randomly during this training process. The general idea for training the emulator is that depicted in the block diagram of Figure 5.4.3 for the identification of nonlinear dynamical systems by a neural network; however, the tapped delay lines are not needed here because of the kinematic nature of the trailer truck system.

Next, the trained emulator network is used to train the controller. Figure 5.4.7 shows the controller/emulator system in a retrieval mode; the controller and the emulator are labeled C and E, respectively, in the figure. The controller receives the input vector x(k) and responds with a single output s(k) representing the control signal. The whole system is recurrent due to the external feedback loops (actually, the system exhibits partial recurrence since the emulator is a feedforward network and since it will be assumed that the controller has a feedforward single hidden layer architecture). The reason for training the controller with the emulator and not with the real system is justified below. Once trained, the controller is used to control the real system.

Figure 5.4.7. Controller/emulator retrieval system.

Unfolding the controller/emulator neural network T time steps results in the T-level feedforward network of Figure 5.4.8. This unfolded network has a total of 4T - 1 layers of hidden units. When initialized with state x(0), the system with the untrained, randomly initialized controller neural network evolves over T time steps until its state enters a restricted region (i.e., the trailer hits the borders). The only units with specified desired targets are the three units of the output layer at level T representing x, y, and θt; the desired target vector is the zero vector. The backpropagation through time technique can now be applied to adapt the controller weights. Once the output layer errors are computed, they are propagated back through the emulator network units and through the controller network units. Here, only the controller weights are adjusted (with equal increments for all copies of the same weight, as discussed earlier). The need to propagate the error through the plant block necessitates that a neural network-based plant emulator be used to replace the plant during training. Thousands of backups are required to train the controller. It is helpful (but not necessary) to start the learning with "easy" initial cases and then proceed to train with more difficult cases.

Figure 5.4.8. Unfolded trailer truck controller/emulator network over T time steps.

The trained controller is capable of backing the truck from any initial state, as long as it has sufficient clearance from the loading dock. Typical backup trajectories are shown in Figure 5.4.9.
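The controller-through-emulator idea can be sketched on a one-dimensional stand-in for the truck problem. Everything below is invented for illustration: a fixed differentiable "emulator" function replaces the trained emulator network, the controller is a small net driving the state to zero, and, as a crude simplification, the sketch drops the gradient path through the controller's input state (full backpropagation through time would include it).

```python
import numpy as np

rng = np.random.default_rng(4)

# One-dimensional stand-in (all of this is invented for illustration).
# A fixed, differentiable "emulator" E plays the role of the trained
# emulator network; the goal state is s = 0.
def emulator(s, a):
    return 0.9 * s + 0.5 * np.tanh(a)

def d_emulator(s, a):
    # partial derivatives dE/ds and dE/da of the emulator
    return 0.9, 0.5 * (1 - np.tanh(a) ** 2)

# Controller C: a single hidden layer net, state in, steering signal out.
H = 8
W1 = rng.standard_normal((H, 1)) * 0.5; b1 = np.zeros(H)
w2 = rng.standard_normal(H) * 0.5;      b2 = 0.0

def control(s):
    h = np.tanh(W1[:, 0] * s + b1)
    return float(h @ w2 + b2), h

K = 10                                   # number of backup steps

def rollout(s0):
    states, acts, hs = [s0], [], []
    for _ in range(K):
        a, h = control(states[-1])
        acts.append(a); hs.append(h)
        states.append(emulator(states[-1], a))
    return states, acts, hs

# Only the final state has a target (zero); the error flows back through
# the frozen emulator into the controller, whose weights alone are updated.
lr = 0.02
for _ in range(800):
    s0 = rng.uniform(-2.0, 2.0)
    states, acts, hs = rollout(s0)
    g_s = 2.0 * states[-1]               # dE/ds at level K (target is 0)
    for k in range(K - 1, -1, -1):
        dEds, dEda = d_emulator(states[k], acts[k])
        g_a = g_s * dEda                 # error through the emulator into C
        h = hs[k]                        # this level's copy of the controller
        gh = g_a * w2 * (1 - h ** 2)
        W1 -= lr * np.outer(gh, [states[k]]); b1 -= lr * gh
        w2 -= lr * g_a * h;                   b2 -= lr * g_a
        # crude simplification: keep only the emulator's direct state path
        # (full BPTT would also route g_s through C's input)
        g_s = g_s * dEds

final = abs(rollout(1.5)[0][-1])
print("final |s| starting from s0 = 1.5:", final)
```

The key feature mirrored here is that the emulator is differentiable and frozen: it carries the final-state error backward so that only the controller weights receive updates.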

Figure 5.4.9. Typical backup trajectories for the trailer truck which resulted by employing a backpropagation through time trained controller: (a) initial state; (b) trajectory; (c) final state. (Courtesy of Lee Feldkamp and Gint Puskorius of Ford Research Laboratory, Dearborn, Michigan.)

5.4.3 Recurrent Backpropagation

This section presents an extension of backprop to fully recurrent networks where the units are assumed to have continuously evolving states. The present extension, however, is still restricted to learning "static" mappings as opposed to temporal associations; the new algorithm is used to encode "spatial" input/output associations as stable equilibria of the recurrent network. This extension, usually called recurrent backpropagation, was proposed independently by Pineda (1987, 1988) and Almeida (1987, 1988). Henceforth, it will serve as the basis for other extensions of backprop to sequence association which are discussed later in this section.

Consider a recurrent network of N units with outputs yi, connections wij, and activations f(neti). A unit is an input unit if it receives an element of the input pattern xk; by definition, non-input units will be assigned an input of zero. Output units are designated as units with prespecified desired outputs dik. A unit may belong to the set of input units and the set of output units simultaneously, or it may be "hidden" in the sense that it is neither an input nor an output unit. A simple example (N = 2) of such a network is shown in Figure 5.4.5(a). After training on a set of {xk, dk} pattern pairs, the presentation of xk is supposed to drive the network's output y(t) towards the fixed attractor state dk; in other words, when presented with xk, the network will respond with dk. In the following, the pattern index k is dropped for convenience.

A biologically as well as electronically motivated choice for the state evolution of unit i is given by

(5.4.6)    dyi/dt = -yi + f(neti) + xi,    neti = Σj wij yj

where neti represents the total input activity of unit i and the term -yi simulates natural signal decay. By setting dyi/dt = 0 in Equation (5.4.6), one arrives at the equilibrium points y* of the above system, given by:

(5.4.7)    yi* = f(neti*) + xi

These equilibrium point(s) represent the steady-state response of the network. Suppose that the network has converged to an equilibrium state y* in response to an input x. The goal is to adjust the weights of the network in such a way that the state y* ultimately becomes equal to the desired response d associated with the input x. If neuron i is an output neuron, its output is compared to the desired response di, resulting in an error signal Ei. Thus, our goal is to minimize the error function

(5.4.8)    E = (1/2) Σi (Ei)^2,    Ei = di - yi*

with Ei = 0 if unit i is not an output unit. The following is a derivation of a learning rule for the system/network in Equation (5.4.6), which assumes the existence and asymptotic stability of at least one equilibrium point y*.

Note that an instantaneous error function is used so that the resulting weight update rule is incremental in nature. Using gradient descent search to update the weight wpq gives

(5.4.9)    Δwpq = -ρ (∂E/∂wpq) = ρ Σi Ei (∂yi*/∂wpq)

with ∂yi*/∂wpq given by differentiating Equation (5.4.7) to obtain

(5.4.10)    ∂yi*/∂wpq = f'(neti*) [δip yq* + Σj wij (∂yj*/∂wpq)]

where δip is the Kronecker delta (δip = 1 if i = p and zero otherwise). Another way of writing Equation (5.4.10) is

(5.4.11)    Σj Lij (∂yj*/∂wpq) = δip f'(neti*) yq*

where

(5.4.12)    Lij = δij - f'(neti*) wij

Now, one may solve for ∂yi*/∂wpq by inverting the set of linear equations represented by Equation (5.4.11) to get

(5.4.13)    ∂yi*/∂wpq = (L^-1)ip f'(netp*) yq*

where (L^-1)ip is the ipth element of the inverse matrix L^-1. Hence, substituting Equation (5.4.13) in Equation (5.4.9) gives the desired learning rule:

(5.4.14)    Δwpq = ρ f'(netp*) yq* Σi Ei (L^-1)ip

When the recurrent network is fully connected, the matrix L is N × N and its inversion requires O(N^3) operations using standard matrix inversion methods. Pineda and Almeida independently showed that a more economical local implementation, utilizing a modified recurrent neural network of the same size as the original network, is possible. This implementation has O(N^2) computational complexity and is usually called recurrent backpropagation.

To see this, consider the summation term in Equation (5.4.14) and define it as zp*:

(5.4.15)    zp* = Σi Ei (L^-1)ip

Then, renaming the index p as j, we may write Equation (5.4.15) as

(5.4.16)    zj* = Σi Ei (L^-1)ij

or, substituting for L from Equation (5.4.12) and undoing the matrix inversion in Equation (5.4.15), we are led to the set of linear equations for the zj* given by

(5.4.17)    zi* - Σj f'(netj*) wji zj* = Ei

This equation can be solved using an analog network of units zi with the dynamics

(5.4.18)    dzi/dt = -zi + Σj f'(netj*) wji zj + Ei

Note that Equation (5.4.17) is satisfied by the equilibria of Equation (5.4.18). Thus, a solution for z* is possible if it is an attractor of the dynamics in Equation (5.4.18); it can be shown (see Problem 5.5) that z* is an attractor of Equation (5.4.18) if y* is an attractor of Equation (5.4.6). The similarity between Equations (5.4.18) and (5.4.6) suggests that a recurrent network realization for computing z* should be possible. In fact, such a network may be arrived at by starting with the original network, replacing the coupling weight wij from unit j to unit i by the weight f'(neti*) wij from unit i to unit j, setting all inputs to zero, and feeding the error Ei as input to the ith output unit (of the original network). The resulting network is called the error-propagation network or the adjoint of the original net. Figure 5.4.10 shows the error-propagation network for the simple recurrent net given in Figure 5.4.5(a), assuming linear activations for all units.
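A numerical sketch of this procedure, relaxing Equation (5.4.6) to y*, relaxing the adjoint dynamics of Equation (5.4.18) to z*, and applying the resulting weight update, might look as follows; the network size, relaxation step sizes, learning rate, and the single training pair are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)

# Small fully recurrent net (N = 3): unit 0 is the input unit, unit 2
# the output unit.  The size, rates, and training pair are illustrative.
N = 3
W = rng.standard_normal((N, N)) * 0.2
f = np.tanh
fp = lambda v: 1 - np.tanh(v) ** 2

x = np.array([0.8, 0.0, 0.0])     # input pattern (non-input units get 0)
d2 = 0.5                          # desired output of unit 2

def settle_y(W, steps=300, dt=0.1):
    # relax Equation (5.4.6) to its equilibrium y*
    y = np.zeros(N)
    for _ in range(steps):
        y += dt * (-y + f(W @ y) + x)
    return y

def settle_z(W, y, E, steps=300, dt=0.1):
    # relax the adjoint dynamics of Equation (5.4.18) to z*
    net = W @ y
    z = np.zeros(N)
    for _ in range(steps):
        z += dt * (-z + (fp(net) * z) @ W + E)
    return z

rho = 0.1
errs = []
for _ in range(200):
    y = settle_y(W)
    E = np.zeros(N); E[2] = d2 - y[2]       # error only at the output unit
    errs.append(abs(E[2]))
    z = settle_z(W, y, E)
    W += rho * np.outer(z * fp(W @ y), y)   # weight update from z*, y*
print("output error:", errs[0], "->", errs[-1])
```

Note that both relaxations reuse the same machinery: the adjoint pass runs the transposed, gain-scaled network with the errors injected as inputs, which is exactly the "error-propagation network" construction described above.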

Figure 5.4.10. Error-propagation (adjoint) network for the simple recurrent net in Figure 5.4.5(a).

We may now give a brief outline of the recurrent backpropagation learning procedure. An input pattern xk is presented to the recurrent net and a steady-state solution y* is computed by iteratively solving Equation (5.4.6). The steady-state outputs of the net are compared with the target dk to find the output errors Ei. Then, the zi*'s are computed by iteratively solving Equation (5.4.18). The weights are finally adjusted using Equation (5.4.14) or its equivalent form

(5.4.19)    Δwpq = ρ zp* f'(netp*) yq*

Next, a new input pattern is presented to the network and the above procedure is repeated, and so on.

The above analysis assumed that finite equilibria y* exist and are stable. However, it has been shown (Simard et al., 1989) that for any recurrent neural network architecture there always exist divergent trajectories for Equation (5.4.6). In practice, though, if the initial weights are chosen to be small enough, the network almost always converges to a finite stable equilibrium y*. It should be noted that recurrent backpropagation reduces to incremental backprop for the special case of a net with no feedback.

One potential application of recurrent backpropagation networks is as associative memories (for definition and details of associative memories, refer to Chapter 7). This is because these networks build attractors dk which correspond to input/output association patterns. Thus, if a noisy and/or incomplete version of a trained pattern xk is presented as input, it potentially causes the network to eventually converge to dk. These pattern completion/error correction features are superior to those of feedforward networks (Almeida, 1987). Other applications of recurrent backpropagation nets can be found in Qian and Sejnowski (1989) and Barhen et al. (1989).

5.4.4 Time-Dependent Recurrent Backpropagation

The recurrent backpropagation method just discussed can be extended to recurrent networks that produce time-dependent trajectories. One such extension is the time-dependent recurrent backpropagation method of Pearlmutter (1989a, b) [see also Werbos (1988) and Sato (1990)]. It can be thought of as an extension of recurrent backpropagation to dynamic sequences. In Pearlmutter's method, learning is performed as a gradient descent in the weights of a continuous recurrent network to minimize an error function E of the temporal trajectory of the states. The following is a brief outline of this algorithm.

Here, we start with a recurrent net with units yi having the dynamics

(5.4.20)    τi (dyi/dt) = -yi + f(neti) + xi(t)

Note that the inputs xi(t) are continuous functions of time. Similarly, each output unit yl has a desired target signal dl(t) that is also a continuous function of time. Consider minimizing a criterion E(y) which is some function of the trajectory y(t) for t between 0 and t1. Since the objective here is to teach the lth output unit to produce the trajectory dl(t) upon the presentation of x(t), an appropriate criterion (error) functional is

(5.4.21)    E = (1/2) ∫[0,t1] Σl [yl(t) - dl(t)]^2 dt

which measures the deviation of yl from the function dl. Now, the partial derivatives of E with respect to the weights may be computed as:

(5.4.22)    ∂E/∂wij = (1/τi) ∫[0,t1] zi(t) f'(neti) yj(t) dt

where zi(t) is the solution of the dynamical system given by

(5.4.23)    dzi/dt = (1/τi) zi + Ei(t) - Σj (1/τj) f'(netj) wji zj

with the boundary condition zi(t1) = 0. Here, Ei(t) is given by di(t) - yi(t) if unit i is an output unit, and zero otherwise. One may also simultaneously minimize E in the time-constant space by gradient descent, utilizing

(5.4.24)

Due to its memory requirements and continuous-time nature, time-dependent recurrent backpropagation is more appropriate as an off-line training method. Equations (5.4.22) and (5.4.24) may be derived by using a finite difference approximation as in Pearlmutter (1988); they may also be obtained using the calculus of variations and Lagrange multipliers as in optimal control theory (Bryson and Denham, 1962). Using numerical integration (e.g., first order finite difference approximations), one first solves Equation (5.4.20) for t in [0, t1], then sets the boundary condition zi(t1) = 0 and integrates the system in Equation (5.4.23) backward from t1 to 0. Having determined yi(t) and zi(t), we may proceed with computing the weight and time-constant changes from Equations (5.4.22) and (5.4.24), respectively.

Some applications of this technique include learning limit cycles in two-dimensional space (Pearlmutter, 1989a), like the one shown in Figure 5.4.11. The state space trajectories in Figures 5.4.11(b) and (c) are produced by a network of four hidden units, two output units, and no input units, after 1,500 and 12,000 learning cycles, respectively; the desired trajectory [d1(t) versus d2(t)] is the circle in Figure 5.4.11(a). The trajectories in Figures 5.4.12(b) and (c) are generated by a network with 10 hidden units and two output units; the desired trajectory is the figure "eight" shown in Figure 5.4.12(a). Fang and Sejnowski (1990) reported improved learning speed and convergence of the above algorithm as the result of allowing independent learning rates for individual weights in the network; for example, they report a better formed figure "eight" compared to the one in Figure 5.4.12(c), after only 2,000 cycles. This method has also been shown to work well in time series prediction (Logar et al., 1993).

Figure 5.4.11. Learning performance of time-dependent recurrent backpropagation: (a) desired trajectory d1(t) versus d2(t); (b) generated state space trajectory y1(t) versus y2(t) after 1,500 cycles; and (c) y1(t) versus y2(t) after 12,000 cycles. (From B. A. Pearlmutter, 1989a, with permission of the MIT Press.)

Figure 5.4.12. Learning the figure "eight" by a time-dependent recurrent backpropagation net: (a) desired state space trajectory; (b) generated trajectory after 3,182 cycles; and (c) generated trajectory after 20,000 cycles. (From B. A. Pearlmutter, 1989a, with permission of the MIT Press.)

Finally, an important property of the continuous-time recurrent net described by Equation (5.4.20) should be noted. It has been shown (Funahashi and Nakamura, 1993) that the output of a sufficiently large continuous-time recurrent net with hidden units can approximate any continuous state space trajectory to any desired degree of accuracy. This means that recurrent neural nets are universal approximators of dynamical systems. Note, however, that this says nothing about the existence of a learning procedure which will guarantee that any continuous trajectory is learned successfully. What it implies is that the failure to learn a given continuous trajectory by a sufficiently large recurrent net would be attributed to the learning algorithm used.
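The forward-then-backward off-line procedure can be sketched with first order (Euler) integration. The network size, time constants, target trajectory, and the cosine input drive below are all invented for illustration; the adjoint system and gradient accumulation follow the scheme just described.

```python
import numpy as np

rng = np.random.default_rng(6)

# Continuous-time recurrent net with 3 units; unit 2 is the output and
# should trace d(t) = 0.5 sin(t).  The cosine drive into unit 0 and all
# sizes/rates are invented for illustration.
N, t1, dt = 3, 6.0, 0.01
steps = int(t1 / dt)
tau = np.ones(N)
W = rng.standard_normal((N, N)) * 0.3
f = np.tanh
fp = lambda v: 1 - np.tanh(v) ** 2
tgrid = np.arange(steps) * dt
d = 0.5 * np.sin(tgrid)                               # desired trajectory
x = np.zeros((steps, N)); x[:, 0] = np.cos(tgrid)     # input trajectory

def forward(W):
    Y = np.zeros((steps, N)); y = np.zeros(N)
    for k in range(steps):
        Y[k] = y
        y = y + dt * (-y + f(W @ y) + x[k]) / tau     # forward dynamics
    return Y

def traj_error(W):
    return float(np.sum((forward(W)[:, 2] - d) ** 2) * dt)

err0 = traj_error(W)
rho = 0.1
for _ in range(150):
    Y = forward(W)
    nets = Y @ W.T
    E = np.zeros((steps, N)); E[:, 2] = d - Y[:, 2]   # error, output unit only
    z = np.zeros(N)                    # boundary condition z(t1) = 0
    G = np.zeros((N, N))               # accumulates the gradient integral
    for k in range(steps - 1, -1, -1): # integrate the adjoint backward
        dz = z / tau + E[k] - ((fp(nets[k]) / tau * z) @ W)
        z = z - dt * dz
        G += np.outer(z * fp(nets[k]) / tau, Y[k]) * dt
    W -= rho * G                       # gradient descent on the functional
err1 = traj_error(W)
print(err0, "->", err1)
```

The memory cost noted above is visible here: the full forward trajectory Y must be stored before the backward sweep can run, which is why the method is best used off-line.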

5.4.5 Real-Time Recurrent Learning

Another method that allows sequences to be associated is the real-time recurrent learning (RTRL) method proposed by Williams and Zipser (1989a, b). This method allows recurrent networks to learn tasks that require retention of information over time periods having either fixed or indefinite length. RTRL assumes recurrent nets with discrete-time states that evolve according to

(5.4.25)    yi(t + 1) = f(neti(t)),    neti(t) = Σj wij yj(t) + xi(t)

A desired target trajectory d(t) is associated with each input trajectory x(t). As before, the quadratic error measure is used:

(5.4.26)    Etotal = Σt E(t) = (1/2) Σt Σk [Ek(t)]^2

where Ek(t) = dk(t) - yk(t) if a target dk(t) is specified for unit k at time t, and Ek(t) = 0 otherwise. Thus, gradient descent on Etotal gives

(5.4.27)    Δwij = Σt Δwij(t)

with

(5.4.28)    Δwij(t) = -ρ (∂E(t)/∂wij) = ρ Σk Ek(t) (∂yk(t)/∂wij)

The partial derivative ∂yk(t)/∂wij in Equation (5.4.28) can now be computed from Equation (5.4.25) as

(5.4.29)    ∂yk(t + 1)/∂wij = f'(netk(t)) [δki yj(t) + Σl wkl (∂yl(t)/∂wij)]

Since Equation (5.4.29) relates the derivatives at time t + 1 to those at time t, we can iterate it forward (starting from some initial value for the derivatives, e.g., zero) and compute Δwij(t) at any desired time, while using Equation (5.4.25) to iteratively update the states at each iteration. Each cycle of this algorithm requires time proportional to N^4, where N is the number of units in a fully interconnected net. Instead of accumulating the increments and using Equation (5.4.27) to update the weights at the end of a sequence, it was found (Williams and Zipser, 1989a) that updating the weights after each time step according to Equation (5.4.28) works well as long as the learning rate ρ is kept sufficiently small. This avoids the need for allocating memory proportional to the maximum sequence length and leads to simple on-line implementations, thus the name real-time recurrent learning.

The power of this method was demonstrated through a series of simulations (Williams and Zipser, 1989a, b), which include training a fully recurrent net to perform a task by observing only the action of a Turing machine performing the same task. In some of the simulations, it was found that learning speed (and sometimes convergence) improved by setting the states of units yl(t) with known targets to their target values, but only after computing El(t) and the derivatives in Equation (5.4.29). This heuristic is known as teacher forcing; it helps keep the network closer to the desired trajectory. The reader may refer to Robinson and Fallside (1988), Rohwer (1990), and Sun et al. (1992) for other methods for learning sequences in recurrent networks.
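An RTRL sketch, carrying the sensitivities pk,ij = ∂yk/∂wij forward in time, updating the weights at every step, and applying teacher forcing to the output unit, might look as follows; the network size, signals, and learning rate are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)

# RTRL sketch: 3 fully interconnected units; unit 0 receives an external
# input signal, unit 2 must track a delayed, scaled copy of it.
N = 3
W = rng.standard_normal((N, N)) * 0.2
f = np.tanh
fp = lambda v: 1 - np.tanh(v) ** 2
rho = 0.05

y = np.zeros(N)
P = np.zeros((N, N, N))   # P[k, i, j] = dy_k / dw_ij, initialized to zero

errs = []
for t in range(5000):
    xin = np.zeros(N)
    xin[0] = np.sin(0.2 * t)              # input trajectory
    d2 = 0.5 * np.sin(0.2 * (t - 1))      # target for unit 2
    net = W @ y + xin
    y_new = f(net)                        # state update
    # sensitivity recursion carried forward in time
    WP = np.einsum('kl,lij->kij', W, P)   # sum_l w_kl dy_l/dw_ij
    P_new = np.empty_like(P)
    for k in range(N):
        P_new[k] = fp(net[k]) * WP[k]
        P_new[k, k, :] += fp(net[k]) * y  # the delta_ki y_j term
    e = d2 - y_new[2]
    W += rho * e * P_new[2]               # per-step ("real-time") update
    errs.append(abs(e))
    y = y_new
    y[2] = d2                 # teacher forcing: clamp the output unit ...
    P = P_new
    P[2] = 0.0                # ... and reset its sensitivities, since the
                              # clamped state no longer depends on the weights

early, late = float(np.mean(errs[:200])), float(np.mean(errs[-200:]))
print(early, "->", late)
```

The O(N^4) cost quoted above is visible in the sensitivity array P, which holds N^3 derivatives and is multiplied by W at every time step.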
5. Time-delay neural networks. and data compression. it was found (Williams and Zipser. it was found that learning speed (and sometimes convergence) improved by setting the states of units yl(t) with known targets to their target values. it helps keep the network closer to the desired trajectory. thus the name real-time recurrent learning. The power of this method was demonstrated through a series of simulations (Williams and Zipser. and enhance generalization.1. a gradient descent-based learning procedure for minimizing the sum of squared error criterion function in a feedforward layered network of sigmoidal units. and the addition of regularization terms to the error function being minimized. e. but only after computing El(t) and the derivatives in Equation (5. 1989b). while using Equation (5.28) works well as long as the learning rate ρ is kept sufficiently small. autonomous learning parameters adjustments. Instead of using Equation (5. Direct training of fully recurrent networks is also possible. where N is the number of units in a fully interconnected net. Finally. Each cycle of this algorithm requires time proportional to N4. A number of significant real-world applications are presented where backprop is used to train feedforward networks for realizing complex mappings between noisy sensory data and the corresponding desired classifications/actions. In some of the simulations.29).11) and (5.1.
Problems
5. (1992) for other methods for learning sequences in recurrent networks.4. 1989a) that updating the weights after each time step according to Equation (5.29) relates the derivatives at time t to those at time t− we can iterate it forward 1.5 Summary
We started this chapter by deriving backprop.25) to iteratively update states at each iteration.g. These applications include converting a human's hand movement to speech. This avoids the need for allocating memory proportional to the maximum sequence length and leads to simple on-line implementations.. and Sun et al.
.12). a method of on-line temporal association of discrete-time sequences (real-time recurrent learning) is discussed. a 12 unit recurrent net learned to detect whether a string of arbitrary length comprised of left and right parentheses consists entirely of sets of balanced parentheses. for further exploration into recurrent neural networks and their applications. handwritten digit recognition. In one particular simulation.

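The RTRL scheme discussed above can be sketched compactly in modern code. The following illustration is not from the book: it maintains a sensitivity table p (the derivatives of the unit states with respect to the weights), forward-propagates it in time, and applies the on-line weight update after each step. The function name, the unipolar sigmoid activation, and the learning rate are illustrative assumptions.

```python
import math

def rtrl_step(W, y, x_ext, targets, p, rho=0.1):
    """One step of real-time recurrent learning for a fully recurrent
    net of N sigmoidal units (an illustrative sketch, not the book's code).

    W:       N x N recurrent weight matrix (modified in place)
    y:       current state vector y(t)
    x_ext:   external input to each unit at this step
    targets: list of (unit index, desired value) pairs for this step
    p:       sensitivity table p[k][i][j] ~ dy_k/dw_ij, carried forward in time
    """
    N = len(y)
    net = [sum(W[k][l] * y[l] for l in range(N)) + x_ext[k] for k in range(N)]
    y_new = [1.0 / (1.0 + math.exp(-n)) for n in net]
    fprime = [yk * (1.0 - yk) for yk in y_new]
    # Forward-propagate the sensitivities; this is the O(N^4) step.
    p_new = [[[fprime[k] * (sum(W[k][l] * p[l][i][j] for l in range(N))
                            + (y[j] if k == i else 0.0))
               for j in range(N)] for i in range(N)] for k in range(N)]
    # Gradient-descent weight update using only the current-time error.
    for k, d in targets:
        e = d - y_new[k]
        for i in range(N):
            for j in range(N):
                W[i][j] += rho * e * p_new[k][i][j]
    return y_new, p_new
```

Because the sensitivities are carried forward rather than stored for a whole sequence, no memory proportional to the sequence length is needed, at the cost of the O(N^4) work per step noted above.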
Problems

5.1.1 Derive Equations (5.1.11) and (5.1.12).

5.1.2 Derive the backprop learning rule for the first hidden layer (the layer directly connected to the input signal x) in a three layer (two hidden layer) feedforward network. Assume that the first hidden layer has K units with weights wki and differentiable activations fh1(netk), the second hidden layer has J units with weights wjk and differentiable activations fh2(netj), and the output layer has L units with weights wlj and differentiable activations fo(netl).

5.1.3 Consider the neural network in Figure 5.1.1 with full additional connections between the input vector x and the output layer units. Let the weights of these additional connections be designated as wli (the connection weight between the lth output unit and the ith input signal, where xi is the ith component of the input vector x). Derive a learning rule for these additional weights based on gradient descent minimization of the instantaneous SSE criterion function.

5.1.4 Derive the batch backprop rule for the network in Figure 5.1.1.

5.1.5 Use the incremental backprop procedure described in Section 5.1 to train a two-layer network with 12 hidden units and a single output unit to learn to distinguish between the class regions in Figure P5.1.5. Follow a similar training strategy to the one employed in Example 5.1.1. Generate and plot the separating surfaces learned by the various units in the network. Can you identify the function realized by the output unit?

5.1.6 Derive and implement numerically the global descent backprop learning algorithm for a single hidden layer feedforward network, starting from the global descent equation given in the text. Solve the 4-bit parity problem using incremental backprop, batch backprop, and global descent backprop. Generate a learning curve for each algorithm, and use the same initial weights and learning rates for all learning algorithms (use the parameter value 2 and k = 0.001 for global descent, and experiment with different directions of the perturbation vector w).

5.1.7 Consider the two layer feedforward net in Figure 5.1.1. Assume that we replace the hidden layer weights wji by nonlinear weights of the form given in the text, where rji in R is a parameter associated with hidden unit j and xi is the ith component of the input vector x. It has been shown empirically (Narayan, 1993) that this network is capable of faster and more accurate training when the weights and the rji exponents are adapted, as compared to the same network with fixed rji = 0. Derive a learning rule for rji based on incremental gradient descent minimization of the instantaneous SSE criterion used in Section 5.1. Are there any restrictions on the values of the inputs xi? What would be a reasonable initial value for the rji exponents?

Figure P5.1.5. Two-class classification problem.
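Problems 5.1.5 and 5.1.6 call for incremental backprop experiments. As a reference point, here is a self-contained incremental backprop sketch for the n-bit parity task; it uses plain gradient descent only (the global descent method of Problem 5.1.6 is not implemented), and all names, initial-weight ranges, and parameter values are illustrative assumptions rather than the book's settings.

```python
import math, random

def train_parity(n=2, hidden=4, epochs=3000, rho=0.5, restarts=5):
    """Incremental backprop on n-bit parity with one sigmoidal hidden layer.
    Returns the best number of correctly classified patterns over several
    random restarts (restarts guard against poor local minima)."""
    patterns = [([float(b) for b in format(i, "0%db" % n)],
                 float(format(i, "b").count("1") % 2)) for i in range(2 ** n)]
    sig = lambda v: 1.0 / (1.0 + math.exp(-v))
    best = 0
    for seed in range(restarts):
        rng = random.Random(seed)
        W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n + 1)] for _ in range(hidden)]
        W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)]
        for _ in range(epochs):
            for x, d in patterns:
                xb = x + [1.0]                       # input plus bias
                z = [sig(sum(w * a for w, a in zip(row, xb))) for row in W1]
                zb = z + [1.0]
                y = sig(sum(w * a for w, a in zip(W2, zb)))
                dy = (d - y) * y * (1.0 - y)         # output-unit delta
                dz = [dy * W2[j] * z[j] * (1.0 - z[j]) for j in range(hidden)]
                for j in range(hidden + 1):
                    W2[j] += rho * dy * zb[j]
                for j in range(hidden):
                    for i in range(n + 1):
                        W1[j][i] += rho * dz[j] * xb[i]
        correct = 0
        for x, d in patterns:
            z = [sig(sum(w * a for w, a in zip(row, x + [1.0]))) for row in W1]
            y = sig(sum(w * a for w, a in zip(W2, z + [1.0])))
            correct += int((y > 0.5) == (d > 0.5))
        best = max(best, correct)
    return best
```

With the defaults it learns 2-bit parity (XOR); the 4-bit parity of Problem 5.1.6 typically needs more hidden units and epochs, and is exactly the kind of task for which the global descent variant is motivated.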
5.1.8 Consider the Feigenbaum (1978) chaotic time series generated by the nonlinear iterated (discrete-time) map given in the text. Plot the time series x(t) for t in [0, 20], starting from the initial value x(0) specified in the text. Construct (by inspection) an optimal net of the type considered in Problem 5.1.7 which will perfectly model this iterated map (assume zero biases and linear activation functions with unity slope for all units in the network). Now, vary all exponents and weights by +1 percent and -1 percent, respectively. Compare the time series predicted by this varied network to x(t) over the range t in [0, 20].

5.2.1 Consider a unit with n weights wi uniformly randomly distributed in the range given in the text, and assume that the components xi of the input vector x are randomly and uniformly distributed in the interval [0, 1]. Show that the random variable defined in the text has a zero mean and unity standard deviation.

5.2.2 Explain qualitatively the characteristics of the approximate Newton's rule of Equation (5.2.6).

5.2.3 Complete the missing steps in the derivation of Equation (5.2.9).

5.2.4 Derive the activation function slope update rules of Equations (5.2.10) and (5.2.11).

5.2.5 Derive the incremental backprop learning rule starting from the entropy criterion function in Equation (5.2.16). Assume the net to be identical to the one discussed in Section 5.1, with logistic sigmoid activations for all units, and derive backprop based on this criterion/error function.

5.2.6 Derive the incremental backprop learning rule starting from the Minkowski-r criterion function in Equation (5.2.19).

5.2.7 Comment on the qualitative characteristics of the Minkowski-r criterion function for negative r.

5.2.8 Derive Equation (5.2.22).

5.2.9 Derive the partial derivatives of R in Equations (5.2.25) through (5.2.28) for the soft weight-sharing regularization term in Equation (5.2.24), assuming fixed values for the "responsibilities" rj(wi). Use the appropriate partial derivatives to solve analytically for the optimal mixture parameters.

5.2.10 Give a qualitative explanation for the effect of adapting the Gaussian mixture parameters on learning in a feedforward neural net.

5.2.11 Consider the criterion function with entropy regularization (Kamimura, 1993) given in the text, where the regularization term involves the normalized outputs of the hidden units and a positive regularization coefficient. What are the effects of the entropy regularization term on the hidden layer activity pattern of the trained net?

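Problems 5.2.6 and 5.2.7 concern the Minkowski-r criterion. Assuming the usual definition E = (1/r)|d - y|^r (the book's Equation (5.2.19) is not reproduced in this copy), the error-derivative factor that replaces (d - y) in the delta rule can be written as:

```python
def minkowski_r_delta(d, y, r):
    """Delta-rule error factor -dE/dy for the Minkowski-r criterion
    E = (1/r) * |d - y|**r. For r = 2 this reduces to the usual
    SSE term (d - y); r = 1 yields a pure sign function."""
    e = d - y
    return (abs(e) ** (r - 1)) * (1.0 if e >= 0 else -1.0)
```

Values of r below 2 de-emphasize large errors (useful with outliers), while r above 2 emphasizes them; for negative r (Problem 5.2.7), note what happens to this factor as the error approaches zero.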
5.2.12 The optimum steepest descent method employs a learning step defined as the smallest positive root of the equation given in the text. Show that this optimal learning step is approximately given by the expression of Tsypkin (1971).

5.2.13 Repeat the exact same simulation in Figure 5.2.3 but with a 40 hidden unit feedforward net. During training, use the noise-free training samples as indicated by the small circles in Figure 5.2.3, and use the same weight initialization and learning parameters. By comparing the number of degrees of freedom of this net to the size of the training set, what would your intuitive conclusions be about the net's approximation behavior? Does the result of your simulation agree with your intuitive conclusions? Explain. Plot the output of this net versus x and compare it to the actual function being approximated. Also, compare the output of this net to the one in Figure 5.2.5 (dashed line), and give the reason(s) for the difference (if any) in performance of the two nets. How would these results be impacted if a noisy data set is used?

5.2.14 Repeat the simulations in Figure 5.2.5 using incremental backprop with cross-validation-based stopping of training. The following training and validation sets are to be used in this problem. The training set is the one plotted in Figure 5.2.5; the validation set has the same noise statistics as for the training set. Plot the validation and training RMS errors on a log-log scale for the first 10,000 cycles. Test the resulting "optimally trained" net on 200 points x generated uniformly over the training interval, plot the output of this net versus x, and compare it to the actual function being approximated. Discuss the differences.

Training Set: [table of input/output pairs]
Validation Set: [table of input/output pairs]

5.2.15 Repeat Problem 5.2.14 using, as your training set, all the available data (i.e., both training and validation data). Here, cross-validation cannot be used to stop training, since we have no independent (nontraining) data to validate with. One way to help avoid overtraining in this case would be to stop at the training cycle that led to the optimal net in Problem 5.2.14. Does the resulting net generalize better than the one in Problem 5.2.14? Explain.

5.2.16 Consider the simple neural net in Figure P5.2.16. Assume the hidden unit has the activation function shown and that the output unit has a linear activation with unit slope. Show that there exists a set of real-valued weights {w1, w2, v1, v2} which approximates the discontinuous function given in the text, for all x, to any degree of accuracy. Here, a, b, and c are real-valued.

Figure P5.2.16. A neural network for approximating the function.

5.4.1 Consider the time series generated by the Glass-Mackey (Mackey and Glass, 1977) discrete-time equation given in the text. Plot the time series x(t) for t in [0, 1000] with τ = 17. When solving the above nonlinear difference delay equation, an initial condition specified by an initial function defined over a strip of width τ is required. Experiment with several different initial functions.

5.4.2 Use incremental backprop with sufficiently small learning rates to train a feedforward network to predict the Glass-Mackey time series of Problem 5.4.1 (assume τ = 17). Assume training pairs of the form given in the text, and use a collection of 500 training pairs corresponding to different values of t generated randomly from the time series. Note that the output of the net at time t + 1 must serve as the new input to the net for predicting the time series at t + 2, and so on. Assume 50 hidden units with hyperbolic tangent activation function (set λ = 1) and use a linear activation function for the output unit. Plot the training RMS error versus the number of training cycles. Plot the signal predicted (recursively) by the trained network and compare it to the actual series for t = 0, 6, 12, 18, ..., 1200. [For an interesting collection of time series and their prediction, the reader is referred to the edited volume by Weigend and Gershenfeld (1994).]

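The series of Problem 5.4.1 can be generated in a few lines. The sketch below uses the commonly quoted discrete-time Mackey-Glass form with parameters a = 0.2 and b = 0.1 and a constant initial function over the strip of width τ; the book's exact equation and constants may differ.

```python
def mackey_glass(T, tau=17, a=0.2, b=0.1, x0=1.2):
    """Generate the discrete-time Mackey-Glass series
        x(t+1) = x(t) + a*x(t-tau)/(1 + x(t-tau)**10) - b*x(t)
    with the constant initial strip x(t) = x0 for t <= 0.
    Returns x(0), x(1), ..., x(T)."""
    x = [x0] * (tau + 1)            # initial function over a strip of width tau
    for t in range((tau), tau + T):
        x.append(x[t] + a * x[t - tau] / (1.0 + x[t - tau] ** 10) - b * x[t])
    return x[tau:]
```

Trying other initial strips (e.g., a different constant x0) illustrates the sensitivity asked about in the problem.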
5.4.3 Employ the series-parallel identification scheme of Section 5.4.1 (refer to Figure 5.4.3) to identify the nonlinear discrete-time plant (Narendra and Parthasarathy, 1990) given by the equation in the text. Use a feedforward neural network having 20 hyperbolic tangent activation (set λ = 1) units in its hidden layer, feeding into a linear output unit. Assume the outputs of the delay lines (inputs to the neural network in Figure 5.4.3) to be x(t) and u(t). Use incremental backprop, with sufficiently small learning rates, to train the network. Also, assume uniform random inputs in the interval [-2, +2] during training. Plot the output of the plant as well as the recursively generated output of the identification model for the test input given in the text. Repeat with a two hidden layer net having 30 units in its first hidden layer and 15 units in its second hidden layer (use the learning equation derived in Problem 5.1.2 to train the weights of the first hidden layer).

5.4.4 Derive Equations (5.4.10) and (5.4.13).

5.4.5 Show that if the state y* is a locally asymptotically stable equilibrium of the dynamics in Equation (5.4.6), then the state z* satisfying Equation (5.4.17) is a locally asymptotically stable equilibrium of the dynamics in Equation (5.4.18). (Hint: Start by showing that linearizing the dynamical equations about their respective equilibria gives the corresponding pair of linear systems in the small perturbations added to y* and z*, respectively.)

5.4.6 Derive Equations (5.4.22) and (5.4.24). (See Pearlmutter (1988) for help.)

5.4.7 Employ time-dependent recurrent backpropagation learning to generate the trajectories shown in Figures 5.4.11(a) and 5.4.12(a).

5.4.8 Show that the RTRL method applied to a fully recurrent network of N units has O(N4) computational complexity for each learning iteration.

6. Adaptive Multilayer Neural Networks II

6.0 Introduction

The previous chapter concentrated on multilayer architectures with sigmoidal type units. The present chapter introduces several additional adaptive multilayer networks, both static (feedforward) and dynamic, and their associated training procedures. All networks discussed in this chapter differ in one or more significant ways from those in the previous chapter. The majority of the networks considered here employ processing units which are not necessarily sigmoidal. Some of these networks may be used as function interpolators/approximators, while others are best suited for classification tasks. A common feature of these networks is their fast training as compared to the backprop networks of the previous chapter; the mechanisms leading to such increased training speed are emphasized. Throughout this chapter, fundamental similarities and differences among the various networks are stressed. In addition, significant extensions of the above networks are pointed out and the effects of these extensions on performance are discussed.

One group of networks employs units with localized receptive fields, where units receiving direct input from input signals (patterns) can only "see" a part of the input pattern. Examples of such networks are the radial basis function network and the cerebellar model articulation controller. A second group of networks employs resource allocation: these networks are capable of allocating units as needed during training. This feature enables the network size to be determined dynamically and eliminates the need for guessing the proper network size. This resource allocating scheme is also shown to be the primary reason for efficient training. Examples of networks in this group are hyperspherical classifiers and the cascade-correlation network. The above two groups of networks mainly employ supervised learning. The third and last group of adaptive multilayer networks treated in this chapter has the capability of unsupervised learning or clustering; here, two specific clustering nets are discussed: the ART1 network and the autoassociative clustering network.

6.1 Radial Basis Function (RBF) Networks

In this section, an artificial neural network model motivated by the "locally-tuned" response of biological neurons is described. The model is commonly referred to as the radial basis function (RBF) network. Neurons with locally-tuned response characteristics can be found in many parts of biological nervous systems; these nerve cells have response characteristics which are "selective" for some finite range of the input signal space. The cochlear stereocilia cells, for example, have a locally-tuned response to frequency which is a consequence of their biophysical properties. The present model is also motivated by earlier work on radial basis functions (Medgassy, 1961) which are utilized for interpolation (Micchelli, 1986; Powell, 1987), probability density estimation (Parzen, 1962; Duda and Hart, 1973; Specht, 1990), and approximations of smooth multivariate functions (Poggio and Girosi, 1989). RBF networks were independently proposed by Broomhead and Lowe (1988), Lee and Kil (1988), Niranjan and Fallside (1988), Moody and Darken (1989a, 1989b), and Poggio and Girosi (1990b). Similar schemes were also suggested by Hanson and Burr (1987), Lapedes and Farber (1987), and Casdagli (1989). The most important feature that distinguishes the RBF network from earlier radial basis function-based models is its adaptive nature, which generally allows it to utilize a relatively smaller number of locally-tuned units (RBF's). The following is a description of the basic RBF network architecture and its associated training algorithm, as well as some variations.

The RBF network has a feedforward structure consisting of a single hidden layer of J locally-tuned units which are fully interconnected to an output layer of L linear units, as shown in Figure 6.1.1. All hidden units simultaneously receive the n-dimensional real-valued input vector x. Notice the absence of hidden layer weights in Figure 6.1.1: the hidden unit outputs are not calculated using the weighted-sum/sigmoidal activation mechanism of the previous chapter. Rather, each hidden unit output zj is obtained by calculating the "closeness" of the input x to an n-dimensional parameter vector μj associated with the jth hidden unit. In the RBF net, the response characteristics of the jth hidden unit are given by

zj = K(||x - μj|| / σj)    (6.1.1)

where K is a strictly positive radially-symmetric function (kernel) with a unique maximum at its "center" μj and which drops off rapidly to zero away from the center. Here, σj is the "width" of the receptive field in the input space for unit j; this implies that zj has an appreciable value only when the "distance" ||x - μj|| is smaller than the width σj. Given an input vector x, the output of the RBF network is the L-dimensional activity vector y whose lth component is given by

yl = Σ wlj zj,  summed over j = 1, 2, ..., J    (6.1.2)

It is interesting to note here that for L = 1 the mapping in Equation (6.1.2) is similar in form to that employed by a PTG, as in Equation (1.1); however, a choice is made here to use radially symmetric kernels as "hidden units" as opposed to monomials.

Figure 6.1.1. A radial basis function neural network consisting of a single hidden layer of locally-tuned units which is fully interconnected to an output layer of linear units. For clarity, only hidden to output layer connections for the lth output unit are shown.

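The forward pass defined by Equations (6.1.1) and (6.1.2) is compact enough to state directly in code. The sketch below is not from the book; it assumes a Gaussian kernel K and linear output units, and the function name is illustrative.

```python
import math

def rbf_forward(x, centers, widths, W):
    """Forward pass of an RBF network (cf. Equations 6.1.1 and 6.1.2).

    x:       input vector (length n)
    centers: list of J center vectors mu_j
    widths:  list of J receptive-field widths sigma_j
    W:       L rows of output weights w_l (each of length J)
    Returns the hidden activity vector z and the output vector y.
    """
    # Hidden layer: z_j = K(||x - mu_j|| / sigma_j), Gaussian kernel assumed.
    z = []
    for mu, sigma in zip(centers, widths):
        d2 = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
        z.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    # Output layer: y_l = sum_j w_lj * z_j (linear units, no bias).
    y = [sum(wlj * zj for wlj, zj in zip(w_l, z)) for w_l in W]
    return z, y
```

Note that an input sitting exactly on a center drives that unit's output to the kernel maximum, while units whose centers are farther away than a few widths contribute almost nothing, which is the locality property exploited by the training schemes below.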
RBF networks are best suited for approximating continuous or piecewise continuous real-valued mappings where the input dimension n is sufficiently small; these approximation problems include classification problems as a special case. In fact, it can be shown that RBF networks are universal approximators (Poggio and Girosi, 1989; Hartman et al., 1990; Baldi, 1991; Park and Sandberg, 1991, 1993), like feedforward neural networks with a single hidden layer of sigmoidal units. The degree of accuracy can be controlled by three parameters: the number of basis functions used, their location, and their width.

A special but commonly used RBF network assumes a Gaussian basis function for the hidden units:

zj = exp( - ||x - μj||2 / (2σj2) )    (6.1.3)

where σj and μj are the standard deviation and mean of the jth unit receptive field, respectively, and the norm is the Euclidean norm. Another possible choice for the basis function is the logistic function of the form

zj = 1 / (1 + exp( ||x - μj||2/σj2 - θj ))    (6.1.4)

where θj is an adjustable bias. If we think of μj as the parameter (weight) vector associated with the jth hidden unit, then it is easy to see that an RBF network can be obtained from a single hidden layer neural network with unipolar sigmoid-type units and linear output units (like the one in Figure 5.1.1) by simply replacing the jth hidden unit weighted-sum netj = xTμj by the negative of the normalized Euclidean distance -||x - μj||2/σj2; in this case, no bias is needed. In other words, the only difference between an RBF network and a feedforward neural network with a single hidden layer of sigmoidal units is the similarity computation performed by the hidden units.

Next, we turn our attention to the training of RBF networks. Consider a training set of m labeled pairs {xi, di} which represent associations of a given mapping or samples of a continuous multivariate function. Also, consider the SSE criterion function as an error function E that we desire to minimize over the given training set. We would like to develop a training method that minimizes E by adaptively updating the free parameters of the RBF network. These parameters are the receptive field centers (means μj of the hidden layer Gaussian units), the receptive field widths (standard deviations σj), and the output layer weights (wlj). Because of the differentiable nature of the RBF network's transfer characteristics, one of the first training methods that comes to mind is a fully supervised gradient descent method over E (Moody and Darken, 1989a; Poggio and Girosi, 1989). Here, μj, σj, and wlj are updated in the direction of the negative gradient of E, where the associated learning rates are small positive constants. This method, although capable of matching or exceeding the performance of backprop trained networks, still gives training times comparable to those of sigmoidal-type networks (Wettschereck and Dietterich, 1992).
One reason for the slow convergence of the above supervised gradient descent trained RBF network is its inefficient use of the locally-tuned representation of the hidden layer units. Generally speaking, the supervised learning method is not guaranteed to utilize the computational advantages of locality; in particular, it places no restrictions on maintaining small values for the widths σj. When the hidden unit receptive fields are narrow, only a small fraction of the total number of units in the network will be activated for a given input x (the activated units are the ones with centers very close to the input vector in the input space), and thus only those units which were activated need be updated for each input presentation. One way to rectify this problem is to only use gradient descent-based learning for the basis function centers and to use a method that maintains small values for the σj's. Examples of learning methods which take advantage of the locality property of the hidden units are presented below.

A training strategy that decouples learning at the hidden layer from that at the output layer is possible for RBF networks due to the local receptive field nature of the hidden units. Several schemes have been suggested to find proper receptive field centers and widths without propagating the output error back through the network. One method places the centers of the receptive fields according to some coarse lattice defined over the input space (Broomhead and Lowe, 1988). Assuming a uniform lattice with k divisions along each dimension of an n-dimensional input space, this lattice would require kn basis functions to cover the input space; this exponential growth renders the approach impractical for a high dimensional space. An alternative approach is to center k receptive fields on a set of k randomly chosen training samples. Here, unless we have prior knowledge about the location of prototype input vectors and/or the regions of the input space containing meaningful data, a large number of receptive fields would be required to adequately represent the distribution of the input vectors in a high dimensional space.

Moody and Darken (1989a) employed unsupervised learning of the receptive field centers μj in which a relatively small number of RBF's are used. The idea here is to populate dense regions of the input space with receptive fields. Here, the k-means clustering algorithm (MacQueen, 1967; Anderberg, 1973) is used to locate a set of k RBF centers which represents a local minimum of the SSE between the training set vectors x and the nearest of the k receptive field centers μj (this SSE criterion function is the clustering criterion given in Chapter 4). In the basic k-means algorithm, the k RBF's are initially assigned centers μj, j = 1, 2, ..., k, which are set equal to k randomly selected training vectors. The remaining training vectors are assigned to the class j of the closest center μj. Next, the centers are recomputed as the average of the training vectors in their class. This two step process is invoked until all centers stop changing. An incremental version of this batch mode process may also be used which requires no storage of past training vectors or cluster membership information. Here, at each time step, a random training vector x is selected and the center μj of the nearest (in a Euclidean distance sense) receptive field is updated according to

Δμj = ρ (x - μj)    (6.1.5)

where ρ is a small positive constant. Equation (6.1.5) is the simple competitive rule which we have analyzed in Chapter 4. Alternatively, we may use learning vector quantization (LVQ) or one of its variants (see Chapter 3) to effectively locate the k RBF centers (Vogt, 1993). The adaptive strategy also helps reduce sampling error, since it allows the centers to be determined by a large number of training samples. Generally speaking, there is no formal method for specifying the required number k of hidden units in an RBF network; cross-validation is normally used to decide on k. Thus, the adaptive centers learn to represent only the parts of input space which are richly represented by clusters of data, and this strategy has been shown to be very effective in terms of training speed. As for the output layer weights, once the hidden units are synthesized, these weights can be easily computed using the delta rule; the weight wlj determines the amount of contribution of the jth basis function to the lth output of the RBF net, and one may view this computation as finding the proper normalization coefficients of the basis functions.
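The incremental center update of Equation (6.1.5) can be sketched as follows; the initialization on k random samples, the learning rate, and the number of sweeps are illustrative assumptions, not the book's settings.

```python
import random

def train_centers(samples, k, rho=0.05, sweeps=50, seed=0):
    """Incremental k-means style center placement (cf. Equation 6.1.5).

    Repeatedly picks a random training vector and moves the nearest
    center toward it: delta mu_j = rho * (x - mu_j).
    """
    rng = random.Random(seed)
    centers = [list(s) for s in rng.sample(samples, k)]  # init on k random samples
    for _ in range(sweeps * len(samples)):
        x = rng.choice(samples)
        # Nearest center in the Euclidean-distance sense wins the update.
        j = min(range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centers[i])))
        centers[j] = [m + rho * (a - m) for m, a in zip(centers[j], x)]
    return centers
```

Run on data with well-separated clusters, the centers settle near the cluster means (up to a small jitter proportional to the learning rate), which is the "populate dense regions" behavior described above.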
Once the receptive field centers are found using one of the above methods, their widths can be determined by one of several heuristics in order to get smooth interpolation. For example, the width for unit j may be set to a multiple of the distance between μj and the center μi of its nearest neighboring unit; the multiplier is usually taken between 1.0 and 1.5, and its actual value for a particular training set may be found by cross-validation. In order to preserve the local response characteristics of the hidden units, one should choose a relatively small (positive) value for this width parameter. For classification tasks, one may also make use of the category label of the nearest training vector: if that category label is different from that represented by the current RBF unit, it would be advisable to use a smaller width which narrows the bell-shaped receptive field of the current unit. This leads to a sharpening of the class domains and allows for better approximation. Other heuristics based on local computations may be used which yield individually-tuned widths σj. Theoretically speaking, RBF networks with the same width in each hidden kernel unit retain the capability of universal approximation (Park and Sandberg, 1991).

We have already noted that, once the hidden layer parameters are obtained, the output layer weights wlj can be adaptively computed using the delta rule

Δwlj = ρw (dl - yl) zj    (6.1.6)

Equation (6.1.6) drives the output layer weights to minimize the SSE criterion function; the term fo'(netl) appearing in the general delta rule is dropped here because the output units are linear.
1989a) suggest that a "good" estimate for the global width parameter is the average width which represents a global average over all Euclidean distances between the center of each unit i and that of its nearest neighbor j.6) once the hidden layer parameters are obtained. If that category label is different from that represented by the current RBF unit.6) drives the output layer weights to minimize the SSE criterion function [recall Equation (5.1. there is no need to search for the centers j. one can just set j = xj for j = 1. one may formulate the problem of computing the weights as a set of simultaneous linear equations and employ the generalized-inverse method [recall Equation (3. consider a single output RBF net. .42)] to obtain the minimum SSE solution. Furthermore. Here.8)
with the parameters j and
assumed to have been computed using the earlier described methods. .7)
where Z = [z1 z2 . zm] is a J × m matrix. and denote by w = [w1 w2 .1. m J)
w* = Z†d = (ZZT)−1Zd (6. Empirical results (Moody and Darken. This suggests that we may simply use a single global fixed value for all j's in the network.
For "strict" interpolation problems.39) through (3..Once the receptive field centers are found using one of the above methods..e.1..

This requires Z to be nonsingular.2).2.
.2. and 1. w* can be computed as
Although in theory Equation (6.1.10) always assures a solution to the strict interpolation problem, in practice the direct computation of w* can become ill-conditioned, due to the possibility of the interpolation matrix Z^T of Equation (6.1.9) being nearly singular. Alternatively, one may resort to Equation (6.1.6) for an adaptive computation of w*.

Receptive field properties play an important role in the quality of an RBF network's approximation capability. Approximation error, due to error in the "fit" of the RBF network to the target function f, occurs when the receptive fields (e.g., Gaussians) are either too broad and/or too widely spaced relative to the fine spatial structure of f. In other words, when the receptive field density is not high enough, the high-frequency fine structure in the function being approximated is lost. (According to Nyquist's sampling criterion, the highest frequency which may be recovered from a sampled signal is one half the sampling frequency.) The high-frequency fine structure of f can also be "blurred" when the receptive fields are excessively wide. Both of these factors act to locally limit the high-frequency content of the approximating network. Therefore, it is important that receptive field densities and widths be chosen to match the frequency transfer characteristics imposed by the function f (Mel and Omohundro, 1991). Hence, a relatively large number of RBF's must be used if the training data represent high-frequency-content mappings (functions) and if low approximation error is desired. These observations can be generalized to the case of RBF network approximation of multivariate functions, even for moderately high dimensional input spaces.

Example 6.1.1: The following example illustrates the application of the RBF net for approximating a continuous function f: R → R. Consider a single-input/single-output RBF network for approximating the function g(x) (refer to the solid-line plot in Figure 6.1.2) from the fifteen noise-free samples (x_j, g(x_j)), j = 1, 2, ..., 15. We employ the method of strict interpolation for designing the RBF net: 15 Gaussian hidden units are used (all having the same width parameter σ), with the jth Gaussian unit having its center μ_j equal to x_j. The design is completed by computing the weight vector w of the output linear unit using Equation (6.1.10). Note that the appropriate width parameter σ still needs to be found; the choice of this parameter affects the interpolation quality of the RBF net. Three designs are generated, corresponding to the values σ = 0.5, σ = 1.0, and σ = 8.0. We then tested these networks with two hundred inputs x uniformly sampled in the interval [−2, 12]; the resulting net outputs are shown in Figure 6.1.2. The value σ = 1.0 is close to the average distance among all 15 sample points, and it resulted in better interpolation of g(x) compared to σ = 0.5. It is interesting to note the excessive overfit by the RBF net for the relatively high width σ = 8.0 (compare to the polynomial-based strict interpolation of the same data in Chapter Five). Indeed, by employing a Taylor series expansion, it can be shown (Hoskins et al., 1993) that when the width parameter is large, the net's output approaches that of a polynomial function; that is, the RBF net exhibits polynomial behavior, with an order successively decreasing as the RBF widths increase. As expected, these results also show poor extrapolation capabilities by the RBF net, regardless of the value of σ (check the net output in Figure 6.1.2 for x > 10, where g(x) ≈ 0.2). Finally, by comparing the above results to those of Chapter Five, one can see that more accurate interpolation is possible with sigmoidal-hidden-unit nets; this is mainly attributed to the ability of feedforward multilayer sigmoidal unit nets to approximate the first derivative of g(x).
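The strict-interpolation design of Example 6.1.1 can be sketched in pure Python; the helper names and the toy sine target below are illustrative stand-ins, not taken from the text:

```python
import math

def gaussian(x, center, sigma):
    # Response of one Gaussian hidden unit
    return math.exp(-((x - center) ** 2) / (2 * sigma ** 2))

def solve(A, b):
    # Gaussian elimination with partial pivoting for a small square system
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

def design_rbf(xs, gs, sigma):
    # Strict interpolation: one Gaussian centered at each sample point,
    # output weights found by solving the linear system Z w = g
    Z = [[gaussian(x, c, sigma) for c in xs] for x in xs]
    return solve(Z, gs)

def rbf_output(w, xs, sigma, x):
    return sum(wj * gaussian(x, c, sigma) for wj, c in zip(w, xs))

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
gs = [math.sin(x) for x in xs]          # toy target samples
w = design_rbf(xs, gs, sigma=1.0)
errors = [abs(rbf_output(w, xs, 1.0, x) - g) for x, g in zip(xs, gs)]
print(max(errors))                      # near machine precision at the samples
```

Because each Gaussian is centered on a sample point, the solved weights force the net output through all training samples exactly; what varies with σ is the behavior between and beyond the samples, as the example illustrates.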

Figure 6.1.2. RBF net approximation of the function g(x) (shown as a solid line), based on strict interpolation using the 15 samples shown (small circles). The RBF net employs 15 Gaussian hidden units, and its output is shown for three hidden unit widths: σ = 0.5 (dotted line), σ = 1.0 (dashed line), and σ = 8.0 (dotted-dashed line).

6.1.1 RBF Networks Versus Backprop Networks

RBF networks have been applied with success to function approximation (Broomhead and Lowe, 1988; Casdagli, 1989; Moody and Darken, 1989a, 1989b) and classification (Niranjan and Fallside, 1988; Lee and Kil, 1988; Nowlan, 1990; Lee, 1991; Vogt, 1993; Wettschereck and Dietterich, 1992). RBF networks which employ clustering for locating hidden unit receptive field centers can achieve performance comparable to that of backprop networks (backprop-trained feedforward networks with sigmoidal hidden units), while requiring orders of magnitude less training time. However, the RBF network with self-organized receptive fields needs more data and more hidden units to achieve precision similar to that of the backprop network. On difficult approximation/prediction tasks (e.g., predicting the Glass-Mackey chaotic time series of Problem 5.1 at T time steps, T > 50, in the future), the RBF network typically requires ten times or more data to achieve the same accuracy as a backprop network. The accuracy of RBF networks may be further improved if supervised learning of receptive field centers is used (Wettschereck and Dietterich, 1992), but then the speed advantage over backprop networks is compromised. In the following, qualitative arguments are given for these simulation-based observations on the performance of RBF and backprop networks.

Some of the reasons for the training speed advantage of RBF networks have been presented earlier in this section. Another important reason is the hybrid two-stage training scheme employed, which decouples the learning task for the hidden and output layers and thus eliminates the need for slow back error propagation. Furthermore, because of the local nature of the receptive fields, only a small fraction of the hidden units in an RBF network responds to any particular input vector. This allows the use of efficient self-organization (clustering) algorithms for adapting such units in a training mode that does not involve the network's output units. By contrast, all units in a backprop network must be evaluated and their weights updated for every input vector.

Basically, the backprop network performs a global fit to the training data, whereas the RBF network performs a local fit. This results in greater generalization by the backprop network from each training example, and it also utilizes the network's free parameters more efficiently, which leads to a smaller number of hidden units. This is primarily due to the ability of feedforward nets with sigmoidal hidden units to approximate a function and its derivatives (see Chapter Five). On the other hand, the local nature of the hidden unit receptive fields in RBF nets prevents them from being able to "see" beyond the training data; regions of the input space which are far from the training vectors are usually mapped to low values by the localized receptive fields of the hidden units. This makes the RBF net a poor extrapolator. The sigmoidal hidden units in a backprop network, however, can have high output even in regions far away from those populated by training data. Hence, the backprop network is a better candidate net when extrapolation is desired.

When used as a classifier, the RBF net can lead to low "false-positive" classification rates, since the receptive field representation is well localized; this property is due to the same reason that makes RBF nets poor extrapolators. For difficult classification tasks, RBF networks or their modified versions (see Section 6.1.2), employing sufficient training data and hidden units, can lead to better classification rates (Wettschereck and Dietterich, 1992) and smaller "false-positive" classification errors (Lee, 1991) compared to backprop networks. The global nature of the sigmoidal hidden units, on the other hand, can cause the backprop network/classifier to indicate high-confidence classifications for meaningless inputs.

False-positive classifications may be reduced in backprop networks by employing the "training with rubbish" strategy discussed at the end of Section 5.3. However, this strategy generally requires an excessively large training set, due to the large number of possible "rubbish" pattern combinations.

Which network is better to use for which tasks? The backprop network is better to use when training data is expensive (or hard to generate) and/or when retrieval speed, assuming a serial machine implementation, is critical; the smaller backprop network size requires less storage and leads to faster retrievals compared to RBF networks. On the other hand, if the data is cheap and plentiful, and if on-line training is required (e.g., the case of adaptive signal processing or adaptive control, where data is acquired at a high rate and cannot be saved), then the RBF network is superior.

6.1.2 RBF Network Variations

In their work on RBF networks, Moody and Darken (1989a) suggested the use of normalized hidden unit activities according to

    z̄_j(x) = z_j(x) / Σ_{i=1}^{J} z_i(x)    (6.1.11)

based on empirical evidence of improved approximation properties, the motivation being that a superposition of basis functions that can represent the unity function (f(x) = 1) "exactly" would also suppress spurious structure when fitting a nontrivial function. The use of Equation (6.1.11) leads to a form of "smoothness" regularization. Equation (6.1.11) implies that for all inputs x, the unweighted sum of all (normalized) hidden unit activities results in the unity function. In other words, the RBF network realizes a "partition of unity," which is a desired mathematical property in function decomposition/approximation (Werntges, 1993).

Another justification for the normalization of hidden unit outputs may be given based on statistical arguments. If one interprets z_j in Equation (6.1.1) as the probability P_j(x^k) of observing x^k under Gaussian distribution j,

    P_j(x^k) = a exp(−‖x^k − μ_j‖² / (2σ²))    (6.1.12)

(where a is a normalization constant and σ_j = σ for all j), and if one also assumes that all Gaussians are selected with equal probability, then the probability of Gaussian j having generated x^k, given that we have observed x^k, is

    P(j|x^k) = P_j(x^k) / Σ_{i=1}^{J} P_i(x^k)    (6.1.13)

Therefore, the normalization in Equation (6.1.11) has a statistical significance: it represents the conditional probability of unit j generating x^k.

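A minimal sketch of the normalization in Equation (6.1.11) (function names are illustrative, and a shared width is assumed):

```python
import math

def rbf_activities(x, centers, sigma):
    # Unnormalized Gaussian hidden unit outputs for a scalar input
    return [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in centers]

def normalized_activities(x, centers, sigma):
    # Moody-Darken normalization: activities sum to one for every input,
    # so the hidden layer realizes a partition of unity
    z = rbf_activities(x, centers, sigma)
    s = sum(z)
    return [zj / s for zj in z]

centers = [0.0, 1.0, 2.0, 3.0]
zbar = normalized_activities(1.7, centers, sigma=0.8)
print(sum(zbar))   # 1.0 for any input x
```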
Another variation of RBF networks involves so-called "soft" competition among the Gaussian units for locating the centers μ_j (Nowlan, 1990). Clustering of the μ_j's according to the incremental k-means algorithm is equivalent to a "hard" competitive winner-take-all operation where, upon the presentation of input x^k, the RBF unit with the highest output z_j updates its mean μ_j according to an iterative version of

    μ_j = (1/N_j) Σ_{x^k ∈ S_j} x^k    (6.1.14)

where S_j is the set of exemplars closest to Gaussian j, and N_j is the number of vectors contained in this set. This in effect realizes an iterative version of the "approximate" maximum likelihood estimate of μ_j. In the "soft" competitive model, rather than using the approximation in Equation (6.1.14), the "exact" maximum likelihood estimate for μ_j is used; it is given by (Nowlan, 1990)

    μ_j = [Σ_k P(j|x^k) x^k] / [Σ_k P(j|x^k)]    (6.1.15)

where P(j|x^k) is given by Equation (6.1.13). Here, all hidden unit centers, rather than only the mean of the winner, are updated for each input according to an iterative version of Equation (6.1.15). One drawback of this "soft" clustering method is its computational requirements, in that all μ_j's are updated for each input. However, the high performance of RBF networks employing "soft" competition may justify this added training computational cost.

As an example, consider the classical vowel recognition task of Peterson and Barney (1952). Here, the data is obtained by spectrographic analysis and consists of the first and second formant frequencies of 10 vowels contained in words spoken by a total of 67 men, women, and children. The spoken words consisted of 10 monosyllabic words, each beginning with the letter "h" and ending with "d," and differing only in the vowel. The words used to obtain the data were heed, hid, head, had, hud, hod, hawed, hood, who'd, and heard. This vowel data is randomly split into two sets, resulting in 338 training examples and 333 test examples. A plot of the test examples is shown in Figure 6.1.3. An RBF network employing 100 Gaussian hidden units and soft competition for locating the Gaussian means is capable of 87.1 percent correct classification on the 333-example test set after being trained with the 338 training examples (Nowlan, 1990). This performance exceeds the 82.0 percent, 80.0 percent, and 82.2 percent recognition rates reported for a k-nearest neighbors classifier (Huang and Lippmann, 1988), a backprop network (Huang and Lippmann, 1988), and a 100-unit k-means-trained RBF network (Moody and Darken, 1989b), respectively (the decision boundaries shown in Figure 6.1.3 are those generated by the backprop network). A related general framework for designing optimal RBF classifiers can be found in Fakhr (1993).

Figure 6.1.3. A plot of the test samples for the 10-vowel problem of Peterson and Barney (1952). The lines are class boundaries generated by a two-layer feedforward net trained with backprop on the training samples. (Adapted from W. Y. Huang and R. P. Lippmann, 1988, with permission of the American Institute of Physics.)

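The "soft" competitive center update of Equation (6.1.15) can be sketched for one-dimensional inputs as follows (a batch version under the stated equal-prior, shared-width assumptions; names and data are illustrative):

```python
import math

def responsibilities(x, centers, sigma):
    # P(j|x): posterior that Gaussian j generated x, with equal priors
    # and a shared width sigma (cf. Eq. 6.1.13)
    likes = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in centers]
    s = sum(likes)
    return [p / s for p in likes]

def soft_kmeans(samples, centers, sigma, epochs=50):
    # "Soft" competition: every center moves toward every sample,
    # weighted by its responsibility (batch form of Eq. 6.1.15)
    for _ in range(epochs):
        num = [0.0] * len(centers)
        den = [0.0] * len(centers)
        for x in samples:
            for j, p in enumerate(responsibilities(x, centers, sigma)):
                num[j] += p * x
                den[j] += p
        centers = [n / d for n, d in zip(num, den)]
    return centers

# two well-separated clusters on the real line
samples = [0.0, 0.2, -0.1, 5.0, 5.1, 4.9]
centers = soft_kmeans(samples, centers=[1.0, 4.0], sigma=0.5)
print(sorted(centers))   # the centers settle near the two cluster means
```

Unlike hard winner-take-all k-means, every center is pulled by every sample here, which is exactly the computational overhead (and the robustness) discussed above.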
We conclude this section by considering a network of "semilocal activation" hidden units (Hartman and Keeler, 1991a). This network has been found to retain training speeds comparable to those of RBF networks, with the advantages of requiring a smaller number of units to cover high-dimensional input spaces and of producing high approximation accuracy. An RBF unit responds to a localized region of the input space. A sigmoid unit, on the contrary, responds to a semi-infinite region by partitioning the input space with a "sigmoidal" hypersurface. RBF's have greater flexibility in discriminating finite regions of the input space, but this comes at the expense of a great increase in the number of required units. To overcome this tradeoff, "Gaussian-bar" units with the response depicted in Figure 6.1.4(c) may be used to replace the RBF's, as explained next. For comparison purposes, Figure 6.1.4(a) shows the response of a two-input Gaussian RBF, and Figure 6.1.4(b) shows that of a two-input sigmoid unit.

Analytically, we write the Gaussian RBF as a product,

    z_j(x) = Π_i exp[−(x_i − μ_ji)² / (2σ_ji²)]    (6.1.16)

where i indexes the input dimension. On the other hand, the output of the jth Gaussian-bar unit is given by

    z_j(x) = Σ_i w_ji exp[−(x_i − μ_ji)² / (2σ_ji²)]    (6.1.17)

where w_ji is a positive parameter signifying the ith weight of the jth hidden unit. According to Equation (6.1.17), a Gaussian-bar unit responds if any of its component Gaussians is activated (assuming the scaling factors w_ji are nonzero), while the Gaussian RBF of Equation (6.1.16) requires all component Gaussians to be activated. Thus a Gaussian-bar unit is more like an "ORing" device, and a pure Gaussian is more like an "ANDing" device.

The output units in a Gaussian-bar network can be linear or Gaussian-bar. Because of their semilocal receptive fields, the centers μ_j of the hidden units cannot be determined effectively using competitive learning as in RBF networks. Therefore, supervised gradient descent-based learning is normally used to update all network parameters. Note that a Gaussian-bar network has significantly more free parameters to adjust compared to a Gaussian RBF network of the same size (number of units), and since the parameter update equations are nonlinear in their parameters, one might suspect that the training speed of such a network is compromised. However, simulations involving difficult function prediction tasks have shown that training Gaussian-bar networks is significantly faster than training sigmoid networks, and slower than, but of the same order as, training RBF networks. One possible explanation for the training speed of Gaussian-bar networks could be their built-in automatic dynamic reduction of the network architecture (Hartman and Keeler, 1991b), as explained next. Semilocal activation networks are particularly advantageous when the training set has irrelevant inputs.

(a) (b) (c)
Figure 6.1.4. Response characteristics of two-input (a) Gaussian, (b) sigmoid, and (c) Gaussian-bar units.
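The "ANDing" versus "ORing" contrast between Equations (6.1.16) and (6.1.17) can be checked directly (parameter values below are illustrative):

```python
import math

def gaussian_rbf(x, mu, sigma):
    # Product form (cf. Eq. 6.1.16): ALL per-dimension Gaussians must be active
    out = 1.0
    for xi, mi, si in zip(x, mu, sigma):
        out *= math.exp(-((xi - mi) ** 2) / (2 * si ** 2))
    return out

def gaussian_bar(x, mu, sigma, w):
    # Sum form (cf. Eq. 6.1.17): ANY active per-dimension Gaussian suffices
    return sum(wi * math.exp(-((xi - mi) ** 2) / (2 * si ** 2))
               for xi, mi, si, wi in zip(x, mu, sigma, w))

mu, sigma, w = [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]
near_bar = [0.0, 8.0]                        # close to the center in dimension 0 only
print(gaussian_rbf(near_bar, mu, sigma))     # ≈ 0: the RBF "ANDs" its dimensions
print(gaussian_bar(near_bar, mu, sigma, w))  # ≈ 1: the bar unit "ORs" them
```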

A Gaussian-bar unit can effectively "prune" input dimension i by one of the following mechanisms: w_ji becoming zero, μ_ji moving away from the data, or σ_ji shrinking to a very small value. These mechanisms can occur completely independently for each input dimension. On the other hand, moving any one of the μ_ji's away from the data, or shrinking any one of the σ_ji's to zero, deactivates a Gaussian RBF unit completely. Sigmoid units may also be pruned (according to the techniques of Chapter Five), but such pruning is limited to synaptic weights. Therefore, Gaussian-bar networks have greater pruning flexibility than sigmoid or Gaussian RBF networks. Since pruning may lead to very small σ_ji's which, in turn, create a spike response at μ_ji, it is desirable to move such μ_ji to a location far away from the data; in this way one may avoid the danger of these spikes on generalization. Training time could also be reduced by monitoring pruned units and excluding them from the calculations, and one may reduce storage requirements and increase retrieval speed by postprocessing trained networks to remove the pruned units of the network.

Many other versions of RBF networks can be found in the literature (see Moody, 1989; Saha and Keeler, 1990; Jones et al., 1990; Bishop, 1991; Kadirkamanathan et al., 1991; Platt, 1991; Mel and Omohundro, 1991; Lay and Hwang, 1992; Musavi et al., 1992; Wettschereck and Dietterich, 1992; Vogt, 1993). Roy and Govil (1993) presented a method based on linear programming models which simultaneously adds RBF units and trains the RBF network in polynomial time for classification tasks. This training method is described in detail in Section 6.3.1, in connection with a hyperspherical classifier net similar to the RBF network.

6.2 Cerebellar Model Articulation Controller (CMAC)

Another neural network model which utilizes hidden units with localized receptive fields, and which allows for efficient supervised training, is the cerebellar model articulation controller (CMAC). This network was developed by Albus (1971) as a model of the cerebellum and was later applied to the control of robot manipulators (Albus, 1975, 1981). A similar model for the cerebellum was also independently developed by Marr (1969). There exist many variants of and extensions to Albus's CMAC; in this section, the CMAC version reported by Miller et al. (1990c) is described.

The CMAC consists of two mappings (processing stages). The first is a nonlinear transformation which maps the network input x ∈ R^n into a higher-dimensional binary vector z ∈ {0, 1}^J, with J >> n. This mapping is realized as a cascade of three layers: a layer of input sensor units feeds its binary outputs to a layer of logical AND units which, in turn, is sparsely interconnected to a layer of logical OR units. The output of the OR layer is the vector z; it is a sparse vector in which at most c of its components are nonzero (c is called a generalization parameter and is user specified). The second mapping generates the CMAC output y ∈ R^L through the linear matrix-vector product Wz, where W is an L × J matrix of modifiable real-valued weights. Figure 6.2.1 shows a schematic diagram of the CMAC, with the first processing stage (mapping) shown inside the dashed rectangle. The specific interconnection patterns between adjacent layers of the CMAC are considered next.

The CMAC has a built-in capability of local generalization: two similar (in terms of Euclidean distance) inputs are mapped by the first mapping stage to similar binary vectors z, while dissimilar inputs map into dissimilar vectors.
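The two CMAC mappings can be sketched for a one-dimensional input as follows; the class and parameter names are hypothetical, and a hash function stands in for the random AND-to-OR interconnections:

```python
class CMAC:
    """Minimal 1-D CMAC sketch: c overlapping tilings play the role of the
    AND layer, hashing into a small table plays the role of the sparse
    random connections to the OR layer, and the output layer is linear."""

    def __init__(self, c=4, table_size=512, lr=0.1):
        self.c, self.J, self.lr = c, table_size, lr
        self.w = [0.0] * table_size          # adaptive output-layer weights

    def active_cells(self, x):
        # Each tiling is offset by 1/c of the quantization step, so
        # nearby inputs share most active cells (local generalization)
        return [hash((t, int(x + t / self.c))) % self.J for t in range(self.c)]

    def output(self, x):
        # z is sparse: only c cells are active, so the product Wz is a
        # sum of c weights
        return sum(self.w[j] for j in self.active_cells(x))

    def train(self, x, target):
        # LMS rule applied only to the c active weights
        err = target - self.output(x)
        for j in self.active_cells(x):
            self.w[j] += self.lr * err / self.c

net = CMAC()
for _ in range(200):
    for x, y in [(1.0, 0.5), (5.0, -0.3), (9.0, 0.8)]:
        net.train(x, y)
print(net.output(1.0), net.output(5.0), net.output(9.0))  # near the targets
```

Because only c weights are touched per example, each training step is O(c) regardless of the table size — the source of the CMAC's training speed noted below.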

Figure 6.2.1. Schematic illustration of a CMAC for a two-dimensional input.

Each component of the input vector x is fed to a series of sensor units with overlapping receptive fields; each sensor corresponds to one distinct input variable. A sensor unit produces a '1' if the input falls within its receptive field and a '0' otherwise. The width of the receptive field of each sensor controls input generalization, while the offset of the adjacent fields controls input quantization. The ratio of receptive field width to receptive field offset defines the generalization parameter c. In addition to supplying the generalization parameter c, the user must also specify a discretization of the input space.

The binary outputs of the sensor units are fed into the layer of logical AND units. Each AND unit receives an input from a group of n sensors, one for each distinct input variable, and thus the unit's input receptive field is the interior of a hypercube in the input hyperspace (the interior of a square in the two-dimensional input space of Figure 6.2.1). The AND units are divided into c subsets. The receptive fields of the sensor units connected to each of the subsets are organized so as to span the input space without overlap, and each input vector excites one AND unit from each subset, for a total of c excited units for any input. There exist many ways of organizing the receptive fields of the individual subsets which produce this excitation pattern (e.g., Miller et al., 1990c; Parks and Militzer, 1991; Lane et al., 1992). Miller et al. employ an organization scheme similar to Albus's original scheme, where each of the subsets of AND units is identical in its receptive field organization, but each subset is offset relative to the others along hyperdiagonals in the input hyperspace; adjacent subsets are offset by the quantization level of each input.

The number of AND units (also called state-space detectors) resulting from the above organization can be very large for many practical problems. For example, a system with 10 inputs, each quantized into 100 different levels, would have 100^10 = 10^20 vectors (points) in its input space, and would require a correspondingly large number of AND units! However, most practical problems do not involve the whole input space; most of the possible input vectors would never be encountered. Therefore, one can significantly reduce the size of the adaptive output layer, and hence reduce storage requirements and training time, by transforming the binary vector generated by the AND layer into a lower-dimensional vector z (however, the dimension of z is still much larger than n). This is accomplished in the CMAC by randomly connecting the AND unit outputs to a smaller set of OR units. Since exactly c AND units are excited by any input, at most c OR units will be excited by any input. This leads to a highly sparse vector z, as shown in Figure 6.2.1.

The final output of the CMAC is generated by multiplying the vector z by the weight matrix W of the output layer. The lth row of this weight matrix (corresponding to the lth output unit) is adaptively and independently adjusted (e.g., using the LMS learning rule) in order to approximate a given function f_l(x) implemented by the lth output unit. The high degree of sparsity of the vector z typically leads to fast learning. We may also use the CMAC as a classifier, by adding nonlinear activations, such as sigmoid or threshold activation functions, to the output units and employing the delta rule or the perceptron rule (or the adaptive Ho-Kashyap rule).

Because of its intrinsic local generalization property, the CMAC does not have universal approximation capabilities like those of multilayer feedforward nets of sigmoid units or of RBF nets. The CMAC can most successfully approximate functions that are slowly varying; it will fail to approximate functions that oscillate rapidly or which are highly nonlinear (Cotter and Guillerm, 1992; Brown et al., 1993). Additional details on the learning behavior of the CMAC can be found in Wong and Sideris (1992). One appealing feature of the CMAC is its efficient realization in software, in terms of both training time and real-time operation. The CMAC also has practical hardware realizations using logic cell arrays in VLSI technology (Miller et al., 1990d). Examples of applications of the CMAC include real-time robotics (Miller et al., 1990b), pattern recognition (Glanz and Miller, 1987), and signal processing (Glanz and Miller, 1989).

6.2.1 CMAC Relation to Rosenblatt's Perceptron and Other Models

One of the earliest adaptive artificial neural network models is Rosenblatt's perceptron (Rosenblatt, 1961). In its most basic form, this model consists of a hidden layer of a large number of units, each of which computes a random Boolean function, connected to an output layer of one or more LTGs, as illustrated in Figure 6.2.2. (Historically speaking, the term perceptron was originally coined for the architecture of Figure 6.2.2 or its variants; in the current literature, however, the term perceptron is usually used to refer to the single LTG unit of Chapter Three.) Here, the input pattern is typically assumed to be a two-dimensional binary image formed on a "retina," and each hidden unit is restricted to receive a small number of inputs relative to the total number of inputs in a given input pattern; that is, each hidden unit "sees" only a small piece of the binary input image. The idea here is that some of the many random hidden units might get "lucky" and detect "critical features" of the input image, thus allowing the output LTG(s), after being trained with the perceptron rule, to perfectly classify the input image. Ultimately, the hidden random Boolean units are intended to map a given nonlinearly separable training set of patterns onto vectors z ∈ {0, 1}^J of a high-dimensional feature space such that the training set becomes linearly separable.

Note that these ideas are similar to those discussed in relation to polynomial threshold gates (PTGs) in Chapter One. One basic difference is that a binary-input PTG employs AND gates as its hidden units, as opposed to the potentially more powerful random Boolean units employed in Rosenblatt's perceptron. Another important difference is that a PTG allows one of its hidden AND units to "see" (cover) the whole input pattern, with the other AND units covering substantial portions of the input.

Rosenblatt's perceptron also has common features with the CMAC model discussed above. If we take a second look at the CMAC architecture in Figure 6.2.1, we see that the first mapping (represented by the circuitry inside the dashed rectangle) is realized by a Boolean AND-OR network. Here, the jth OR unit, together with the set of AND units feeding into it, can be thought of as generating a random Boolean function, and may be compared to a random Boolean unit in Figure 6.2.2. The Boolean functions z_j in the CMAC acquire their random nature from the random interconnectivity pattern assumed between the two layers of AND and OR units. Both models restrict the amount of input information seen by each hidden unit. However, because of the localized nature of the CMAC's hidden unit receptive fields, and due to the sparsity of connections between its AND and OR layers, the class of Boolean functions realized by the CMAC's z_j's does not have the richness nor the diversity of the uniformly random Boolean functions in Rosenblatt's perceptron. One may also note the more structured receptive field organization in the CMAC compared to the perceptron. This difference is due to the different nature of the intended applications of each model: the CMAC is primarily used as a continuous, smooth function approximator, whereas Rosenblatt's perceptron was originally intended as a pattern classifier. These two models also have a minor difference in that the CMAC normally uses linear output units with LMS training, while the perceptron uses LTGs trained with the perceptron rule.

Rosenblatt's perceptron has serious limitations. It has been shown (Minsky and Papert, 1969) that this particular perceptron model cannot determine whether or not all the parts of its input image (geometric figure) are connected to one another, nor can it determine whether or not the number of 'on' pixels in a finite input image is odd. The latter task is equivalent to the parity problem, and can only be solved by Rosenblatt's perceptron if at least one hidden unit is allowed to have its receptive field span the entire input image; but this requirement renders the model impractical. These limitations are due to the localized nature of the hidden unit receptive fields defined on the input image; there is very little chance for the hidden units to randomly become detectors of the required "critical features" unless we start with an exponentially large number of hidden units. Thus the non-universality of the CMAC is also shared by the perceptron of Rosenblatt. The limitations of Rosenblatt's model can be relaxed by allowing every hidden unit to see all inputs, employing a nonlinear Boolean transformation via the hidden units, and allowing for adaptive computation of the output layer weights. However, retaining uniformly random Boolean hidden units becomes impractical when the dimension n of the input is large, since there would be 2^(2^n) possible Boolean functions for each hidden random Boolean unit to choose from. Later in this section, we will also see that Rosenblatt's perceptron does not have the intrinsic local generalization feature of the CMAC; in other words, it does not allow for good local generalization. Yet another weakness of Rosenblatt's perceptron is that it is not "robustness-preserving." To see this, consider two similar unipolar binary input patterns

(vectors) x' and x″, which are mapped by the hidden layer of random Boolean units into the J-dimensional binary vectors z' and z″, respectively. As a similarity measure, we use the normalized Hamming distance D_x between the two input patterns,

    D_x(x', x″) = (1/n) Σ_{i=1}^{n} |x'_i − x″_i|    (6.2.1)

If x' is similar to x″, then D_x(x', x″) is much smaller than 1. However, because of the uniformly random nature of the hidden Boolean units, the output of any hidden unit z_j is one (or zero) with a probability of 0.5, regardless of the input; that is, the activation patterns z at the output of the hidden layer are completely uncorrelated. Therefore, the normalized Hamming distance D_z between the two z vectors corresponding to any two distinct input vectors is approximately equal to 0.5, no matter how similar the inputs are: small differences between input patterns are not reflected as small differences between the corresponding hidden layer activations.

Gallant and Smith (1987) and, independently, Hassoun (1988) have proposed a practical classifier model which is inspired by Rosenblatt's perceptron, but which solves some of the problems associated with Rosenblatt's model. This model, referred to here as the Gallant-Smith-Hassoun (GSH) model, is also similar to a version of an early model studied by Gamba (1961) and referred to as the Gamba perceptron (Minsky and Papert, 1969). The main distinguishing features of the GSH model are that every hidden unit sees the whole input pattern x and that the hidden units are random LTGs. The hidden LTGs assume fixed integer weights and biases (thresholds) generated randomly in some range [−a, +a]. For the trainable output units, the GSH model uses Ho-Kashyap learning in Hassoun's version, and the pocket algorithm [a modified version of the perceptron learning rule that converges for nonlinearly separable problems; for details, see Gallant (1993)] in the Gallant and Smith version. The use of random LTGs, as opposed to random Boolean units, as hidden units has the advantage of a "rich" distributed representation of the critical features in the hidden activation vector z. This distributed representation, coupled with the ability of the hidden LTGs to see the full input pattern, makes the GSH model a universal approximator of binary mappings.

Now, an important question here is: how many hidden random LTGs are required to realize an arbitrary n-input binary mapping of m points/vectors? This question can be easily answered for the case where the hidden LTGs are allowed to have arbitrary (not random) parameters (weights and thresholds). In this case, recalling Problem 2.2 and Theorem 1.4, we find that m hidden LTGs are sufficient to realize any binary function of m points in {0, 1}^n. Note that for the worst-case scenario of a complex, completely specified n-input Boolean function, where m = 2^n, the number of hidden LTGs scales exponentially in n. Now, if we instead assume hidden LTGs with random parameters, we might intuitively expect that the required number of hidden units in the GSH model for approximating an arbitrary binary function of m points will be much greater than m. Empirical results show that this intuitive answer is not correct! Simulations with the 7-bit parity function, random functions, and other completely as well as partially specified Boolean functions reveal that the required number of random LTGs is between m and a small multiple of m (Gallant and Smith, 1987).

This result on the size of the hidden layer in the GSH model may be partially explained in terms of the mapping properties of the random LTG layer. Consider two n-dimensional binary input vectors x' and x″ which are mapped by the random LTG layer of the GSH model into the J-dimensional binary vectors z' and z″, respectively. Let us assume that n is large, and that the weights and thresholds of the LTGs are generated according to zero-mean normal distributions. Using a result of Amari (1974), we may relate the normalized Hamming distances D_z and D_x according to

    D_z = (1/π) cos⁻¹(1 − A D_x)    (6.2.2)

where A is a parameter determined by the distributions of the random weights and thresholds.
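The qualitative behavior predicted by Equation (6.2.2) can be checked numerically with a randomly generated LTG layer (a sketch; the sizes, range, and seeds below are illustrative):

```python
import random

def random_ltg_layer(n, J, a=5, seed=0):
    # J random LTGs with integer weights and thresholds drawn from [-a, a]
    rng = random.Random(seed)
    units = [([rng.randint(-a, a) for _ in range(n)], rng.randint(-a, a))
             for _ in range(J)]
    def phi(x):
        return [1 if sum(wi * xi for wi, xi in zip(w, x)) > t else 0
                for w, t in units]
    return phi

def hamming(u, v):
    # normalized Hamming distance between two binary vectors
    return sum(ui != vi for ui, vi in zip(u, v)) / len(u)

n, J = 64, 512
phi = random_ltg_layer(n, J)
rng = random.Random(1)
x1 = [rng.randint(0, 1) for _ in range(n)]
x2 = x1[:]; x2[0] ^= 1                 # flip one bit: a similar input
x3 = [1 - b for b in x1]               # complement: a very different input
print(hamming(phi(x1), phi(x2)))       # small: similarity is preserved
print(hamming(phi(x1), phi(x3)))       # large: dissimilarity is preserved
```

A random Boolean hidden layer, by contrast, would give a distance near 0.5 in both cases — the robustness-preserving behavior is specific to the LTG layer.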

(6. In addition to the above desirable features. Figure 6. Also. in the worst case. the restricted dynamic range integer parameters of the hidden LTGs allow for their simple realization in VLSI technology. We also note that the range [− +a] of the hidden LTGs integer parameters can be made as small as [− +1]. The k-nearest neighbors classifier assigns to any new input the class most heavily represented among its knearest neighbors.2. exponential time. Duda and Hart. On the other hand. Therefore. 1.2 also implies that very different inputs map into very different outputs when .
Figure 6. 1989). Equation 6. parity) or highly nonlinear functions. however. In the context of pattern classification. no transformation or abstraction of the examples in the training set is required and one can immediately proceed to use this machine for classification regardless of the size of the training
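The mapping behavior of the random LTG layer is easy to observe in simulation. The sketch below is illustrative only (the dimensions n and J, the weight range a, and the random seed are arbitrary choices, not values from the text): it builds a layer of fixed random integer-weight LTGs and compares normalized Hamming distances before and after the mapping.

```python
import numpy as np

rng = np.random.default_rng(1)
n, J, a = 16, 500, 5          # input size, number of random LTGs, weight range

W = rng.integers(-a, a + 1, size=(J, n))   # fixed random integer weights
theta = rng.integers(-a, a + 1, size=J)    # fixed random integer thresholds

def ltg_layer(x):
    """Map a binary input x in {0,1}^n to z in {0,1}^J using random LTGs."""
    return (W @ x - theta >= 0).astype(int)

def d_norm(u, v):
    """Normalized Hamming distance: fraction of differing components."""
    return float(np.mean(u != v))

x1 = rng.integers(0, 2, size=n)
x2 = x1.copy()
x2[:2] ^= 1                   # flip two bits, so Dx = 2/16

Dx = d_norm(x1, x2)
Dz = d_norm(ltg_layer(x1), ltg_layer(x2))
# identical inputs always map to identical z; similar inputs to similar z
```

Repeating the experiment over many input pairs and distances traces out the Dz-versus-Dx curve discussed above.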
6.3 Unit-Allocating Adaptive Networks

In Chapter Four we saw that training a multilayer feedforward neural network with fixed resources requires, in the worst case, exponential time. On the other hand, if we are willing to use unit-allocating nets, which are capable of allocating new units as more patterns are learned, polynomial time training is possible (Baum, 1989).

In the context of pattern classification, an extreme example of a unit-allocating machine with O(1) learning complexity is the classical k-nearest neighbors classifier (Fix and Hodges, 1951; Duda and Hart, 1973). This classifier allocates a new unit for every learned example in a training set, and there are no computations involved in adjusting the "parameters" of allocated units: each newly allocated unit stores exactly the current example (vector) presented, which means that the learning complexity is O(1). Thus, the training time of this classifier does not scale with the number of examples; no transformation or abstraction of the examples in the training set is required, and one can immediately proceed to use this machine for classification regardless of the size of the training set. To classify a new input, the k-nearest neighbors classifier assigns to it the class most heavily represented among its k nearest neighbors in the stored training set.
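The O(1) learning behavior of this classifier is easy to make concrete. The following minimal sketch is illustrative (the toy data and the squared-Euclidean metric are arbitrary choices): "training" merely stores each example, and all computation is deferred to classification time.

```python
from collections import Counter

def train(memory, x, label):
    """O(1) 'learning': allocate a unit that stores the example verbatim."""
    memory.append((x, label))

def classify(memory, x, k=3):
    """Assign the class most heavily represented among the k nearest neighbors."""
    dist = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v))
    nearest = sorted(memory, key=lambda m: dist(m[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

memory = []
for x, y in [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((5, 6), "B")]:
    train(memory, x, y)
```

Note that classify must scan the entire stored set for every query, which is exactly what makes such machines costly to use on-line.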

Furthermore, it has been shown (Cover and Hart, 1967) that in the limit of an infinite training set, this simple classifier has a probability of classification error less than twice the minimum achievable error probability of the optimal Bayes classifier, for any integer value of k >= 1. Unfortunately, the convergence to this asymptotic performance can be arbitrarily slow, the performance of the nearest neighbor classifier deteriorates for training sets of small size, and the classification error rate need not even decrease monotonically with the size of the training set (Duda and Hart, 1973). Also, k-nearest neighbors classifiers are impractical as on-line classifiers due to the large number of computations required in classifying a new input.

Practical trainable networks should therefore have a number of desirable attributes. The most significant of these attributes are fast learning speed, accurate learning, and a compact representation of the training data. The reason for desiring the first two attributes is obvious. In the context of pattern classification, the formation of a compact representation is important for two reasons: good generalization (fewer free parameters lead to less overfitting) and feasibility of hardware (VLSI) realization, since silicon surface area is at a premium. Even when utilizing a large training set, we are thus forced to use far fewer than one unit for every training sample; that is, we must create and load an abstraction of the training data. This obviously leads to a higher learning complexity than O(1). In the following, we consider three practical unit-allocating networks which are capable of forming compact representations of data easily and rapidly. Two of the networks considered are classifier networks; the third network is capable of classification as well as approximation of continuous functions.

6.3.1 Hyperspherical Classifiers

Pattern classification in n-dimensional space consists of partitioning the space into category (class) regions with decision boundaries and assigning an unknown point in this space to the class in whose region it falls. The typical geometrical shapes of the decision boundaries for classical pattern classifiers are hyperplanes and hypersurfaces. A hyperspherical classifier, on the other hand, is based upon the storage of examples represented as points in a metric space (e.g., Euclidean space). The metric defined on this space is a measure of the distance between an unknown input pattern and a known category. Each stored point has associated with it a finite "radius" that defines the point's region of influence; the interior of the resulting hypersphere represents the decision region associated with the center point's category. This region of influence makes a hyperspherical classifier typically more conservative in terms of storage than the nearest neighbor classifier. In addition, the finite radii of the regions of influence can make a hyperspherical classifier abstain from classifying patterns from unknown categories (such patterns are typically represented as points in the input space which are far away from any underlying class region). This latter feature enhances the classifier's ability to reject "rubbish." Hyperspherical classifiers were introduced by Cooper (1962, 1966), Batchelor and Wilkins (1968), and Batchelor (1969) [see Batchelor (1974) for a summary of early work]. In this section, we discuss two unit-allocating adaptive networks/classifiers which employ hyperspherical boundary forms.

Restricted Coulomb Energy (RCE) Classifier

The following is a description of a specific network realization of a hyperspherical classifier proposed by Reilly et al. (1982) and Reilly and Cooper (1990). This model is named the "restricted Coulomb energy" (RCE) network. The name is derived from the form of the "potential function" governing the mapping characteristics, which has been interpreted (Scofield et al., 1988) as a restricted form of a "high-dimensional Coulomb potential" between a positive test charge and negative charges placed at various sites.

The architecture of the RCE network is shown in Figure 6.3.1. It contains two layers: a hidden layer and an output layer. The hidden layer is fully interconnected to all components of an input pattern (vector) x in Rn. The output layer consists of L units, each corresponding to a pattern category. The output layer is sparsely connected to the hidden layer: each hidden unit projects its output to one and only one output unit. The network assigns an input pattern to category l if the output cell yl is activated in response to the input. The decision of the network is "unambiguous" if one and only one output unit is active upon the presentation of an input; otherwise, the decision is "ambiguous."

Figure 6.3.1. RCE network architecture.

The jth hidden unit in the RCE net is associated with a hyperspherical region of the input space which defines the unit's region of influence. The location of this region is defined by the center cj and its size is determined by the radius rj. The transfer characteristics of the jth hidden unit are given by

zj = f[rj - D(cj, x)]     (6.3.1)

where cj in Rn is a parameter vector called the "center," rj in R is a threshold or "radius," and D is some predefined distance metric between the vectors cj and x (e.g., Euclidean distance between real-valued vectors or Hamming distance between binary-valued vectors). Here, f is the threshold activation function given by

f(u) = 1 if u >= 0, and f(u) = 0 otherwise     (6.3.2)

According to Equation (6.3.1), the hidden units define a collection of hyperspheres in the space of input patterns, and any input pattern falling within the influence region of a hidden unit will cause this unit to fire. Some of these hyperspheres may overlap; when a pattern falls within the regions of influence of several hidden units, they will all "fire" and switch on the output units they are connected to. Accordingly, the transfer function of a unit in the output layer is the logical OR function.

Training the RCE net involves two mechanisms: unit commitment and modification of hidden unit radii. Units may be committed to the hidden and output layers; when committed, units are interconnected so that they do not violate the RCE interconnectivity pattern described above. Initially, the network starts with no units. An arbitrary sample pattern x1 is selected from the training set, and one hidden unit and one output unit are allocated. The allocated hidden unit center c1 is set equal to x1, and its radius r1 is set equal to a user-defined parameter rmax (rmax is the maximum size of the region of influence ever assigned to a hidden unit). This unit is made fully interconnected to the input pattern and projects its output z1 to the allocated output unit (OR gate). This output unit represents the category of the input x1.
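A minimal sketch of these transfer characteristics, assuming the Euclidean metric (function names are illustrative):

```python
import numpy as np

def rce_hidden(x, center, radius):
    """z = f(r - D(center, x)): fire iff x lies inside the region of influence."""
    return 1 if radius - np.linalg.norm(np.asarray(x, float) - center) >= 0 else 0

def rce_output(hidden_outputs):
    """An output unit is a logical OR over the hidden units connected to it."""
    return int(any(hidden_outputs))
```

A unit with center (0, 0) and radius 1, for example, fires for (0.5, 0) but not for (2, 0).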

Next, we choose a second arbitrary example x2 and feed it to the current network. Here, one of three scenarios emerges. First, if x2 causes the output unit to fire and if x2 belongs to the category represented by this unit, then do nothing and continue training with a new input. The general version of this scenario occurs when the network has multiple hidden and output units representing various categories: the correct output unit may fire along with one or more other output units. This indicates that the regions of influence of hidden units representing various categories overlap and that the present input pattern lies inside the overlap region. In this case, proceed by reducing the threshold values (radii) of all active hidden units that are associated with categories other than the correct one until they become inactive.

The second scenario involves the case when the input x2 happens to belong to the same category as x1 but does not cause the output unit to fire. Here, allocate a new hidden unit with center c2 = x2 and connect the output z2 of this unit to the output unit. In general, if the input pattern causes no output units (including the one representing the category of the input) to fire, then allocate a new hidden unit centered at the current input vector/pattern and assign it a radius r = min(rmax, dmin), where dmin is the distance from this new center to the nearest center of a hidden unit representing any category different from that of the current input pattern. Note that this setting of r may cause one or more output units representing the wrong category to fire. This should not be a problem, since the shrinking of the regions of influence described under the first scenario will ultimately rectify the situation.

Finally, the third scenario represents the case of an input with a new category that is not represented by the network. Here, a hidden unit centered at this input is allocated and its radius is set as in the second scenario. Also, a new output unit representing the new category is added, which receives its input from the newly allocated hidden unit. Again, if existing hidden units representing the wrong category become active under this scenario, then their radii are shrunk as described earlier under the first scenario.

The training phase continues (by cycling through the training set or by updating in response to a stream of examples) until no new units are allocated and the sizes of the regions of influence of all hidden units converge. The RCE net is capable of developing proper separating boundaries for nonlinearly separable problems. In principle, an arbitrary degree of accuracy in the separating boundaries can be achieved if no restriction is placed on the size of the training set. The reader is referred to Figure 6.3.2 for a schematic representation of the separating boundaries realized by the regions of influence for a nonlinearly separable two-class problem in two-dimensional pattern space. The RCE net can also handle the case where a single category is contained in several disjoint regions. Dynamic category learning is also possible with the RCE network: new classes of patterns can be introduced at arbitrary points in training without the need to retrain the network on all of its previously learned data.

In its present form, however, the RCE network is not suitable for handling overlapping class regions. For such regions, the learning algorithm will tend to drive the radii of the hidden units' regions of influence to zero. It also leads to allocating a large number of hidden units, approximately equal to the number of training examples coming from the regions of overlap.

Figure 6.3.2. Schematic diagram for an RCE classifier in two-dimensional space solving a nonlinearly separable problem.
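The commitment and radius-shrinking mechanisms described above can be sketched in a few lines. This is an illustrative simplification (the presentation order, shrink factor, and fixed epoch count are arbitrary choices, not details from the text):

```python
import numpy as np

def train_rce(samples, r_max, epochs=5):
    """Sketch of RCE training: commit units; shrink wrong-category radii."""
    units = []                                    # entries: [center, radius, category]
    for _ in range(epochs):
        for x, cat in samples:
            x = np.asarray(x, float)
            active = [u for u in units if np.linalg.norm(x - u[0]) <= u[1]]
            for u in active:                      # shrink wrong-category units
                if u[2] != cat:
                    u[1] = 0.999 * np.linalg.norm(x - u[0])
            if not any(u[2] == cat for u in active):
                # commit a new hidden unit with r = min(r_max, d_min)
                rivals = [np.linalg.norm(x - u[0]) for u in units if u[2] != cat]
                units.append([x.copy(), min([r_max] + rivals), cat])
    return units

def classify_rce(units, x):
    """Unambiguous decision only when exactly one category fires."""
    fired = {u[2] for u in units
             if np.linalg.norm(np.asarray(x, float) - u[0]) <= u[1]}
    return fired.pop() if len(fired) == 1 else None   # None: ambiguous or rubbish
```

Points far from every region of influence yield None, illustrating the "rubbish"-rejecting behavior of hyperspherical classifiers.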

Several variations of the RCE network are possible. For example, one might employ mechanisms that allow the centers of the hyperspheres to drift to more optimal locations in the input space. A second variation would be to allow the hyperspheres to grow, as needed. These two mechanisms have been considered for more general hyperspherical classifiers than RCE classifiers (Batchelor, 1974). Modifications to the RCE network for handling overlapping class regions can be found in Reilly and Cooper (1990). Empirical examinations of RCE classifiers appear in Lee and Lippmann (1990) and Hudak (1992).

Polynomial-Time Trained Hyperspherical Classifier

In the following, we describe a classifier network with an architecture identical to that of the RCE network (see Figure 6.3.1), but which employs a training algorithm that is shown to construct and train the classifier network in polynomial time. This polynomial time classifier (PTC) uses clustering and linear programming models to incrementally generate the hidden layer. Our description of the PTC net is based on the work of Roy and Mukhopadhyay (1991), Mukhopadhyay et al. (1993), and Roy et al. (1993).

In the PTC, the region of influence of each hidden unit need not be restricted to a hypersphere. Here, a quadratic region of influence is assumed, which defines the transfer characteristics of the jth hidden unit according to

zj = f[gj(x)]     (6.3.3)

where

gj(x) = w0 + SUM_{i=1..n} wi xi + SUM_{i=1..n} SUM_{k=i..n} wik xi xk     (6.3.4)

f is the threshold activation function of Equation (6.3.2), and w0, wi, and wik are modifiable real-valued parameters to be learned. A quadratic region of influence contains the hypersphere as a special case, but it also allows for the realization of hyperellipsoids and hyperboloids; this enables the PTC to form more accurate boundary regions than the RCE classifier. With the transfer characteristics as in Equation (6.3.3), one may view each hidden unit as a quadratic threshold gate (QTG), introduced in Chapter One. Other regions of influence may be used in the PTC as long as they are represented by functions which are linear in the parameters to be learned; a higher-order polynomial in the inputs, for example, is also an acceptable function, since it remains linear in its coefficients.

The learning algorithm determines the parameters of the hidden units in such a way that only sample patterns of a designated class are "covered" by a hidden unit representing this class. The idea here is to allow a hidden unit to cover (represent) as many of the sample patterns within a given class as is feasibly possible (without including sample patterns from any other class), thereby minimizing the total number of hidden units needed to cover that class. Initially, the learning algorithm attempts to use a single hidden unit to cover a whole class region. If this fails, the sample patterns in that class are split into two or more clusters using a clustering procedure (e.g., the k-means procedure), and attempts are made to adapt separate hidden units to cover each of these clusters. If that fails, or if only some clusters are covered, then the uncovered clusters are further split to be separately covered, until covers are provided for each ultimate cluster. The algorithm thus attempts to minimize the number of hidden units required to accurately solve a given classification problem.
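Since the hypersphere is a special case of the quadratic region of influence, a small sketch can make the "linear in the parameters" point concrete (names and values are illustrative):

```python
import numpy as np

def qtg_net_input(x, w0, w, W):
    """Quadratic region of influence; note it is linear in w0, w, and W."""
    x = np.asarray(x, float)
    return w0 + w @ x + x @ W @ x

def qtg_output(x, w0, w, W):
    return 1 if qtg_net_input(x, w0, w, W) >= 0 else 0

# Hypersphere of radius 1 about the origin as a special case:
# w0 = 1, w = 0, W = -I gives net input 1 - ||x||^2, which is >= 0 iff ||x|| <= 1.
w0, w, W = 1.0, np.zeros(2), -np.eye(2)
```

Choosing W with mixed-sign eigenvalues instead yields hyperboloidal regions, and unequal negative eigenvalues yield hyperellipsoids.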

The parameters of each hidden unit are computed by solving a linear programming problem [for an accessible description of linear programming, the reader is referred to Chapter Five in Duda and Hart (1973)]. Linear programming is appropriate here because the regions of influence are defined by quadratics which are linear functions of their parameters. The linear program is used to adjust the location and boundaries of the region of influence of a hidden unit representing a given class cluster such that sample patterns from this cluster cause the net input to the hidden unit [the argument of f in Equation (6.3.3)] to be at least slightly positive, and those from all other classes cause it to be at least slightly negative. Formally put, the linear program set up for adjusting the parameters of the jth hidden unit requires gj(x) >= dj for every sample pattern x in the designated cluster and gj(x) <= -dj for every sample pattern of all other classes, where dj is a small positive constant (say, 0.001). The positive margin dj ensures a finite separation between the classes and prevents the formation of common boundaries. Unit j becomes a permanent fixed unit of the PTC net if and only if the solution to the above problem is feasible. An alternative to linear programming is to use the Ho-Kashyap algorithm described in Chapter Three, which guarantees class separation (with finite separation between the classes) if a solution exists or, otherwise, gives an indication of nonlinear separability.

Note that all single-point clusters produce feasible solutions, which implies that, in the worst case, training terminates with a PTC having perfect recall on the training set (note, however, that such a design will have poor generalization!). As in the RCE net, all hidden units in the PTC whose respective regions of influence cover clusters of the same class have their outputs connected to a unique logical OR unit (or an LTG realizing the OR function) in the output layer.

The polynomial complexity of the above training algorithm is shown next. Consider the worst-case scenario (from a computational point of view) in which a separate cover (hidden unit) is required for each training pattern. For each class label c = 1, 2, ..., L, let mc be the number of pattern vectors (for a total of m = m1 + m2 + ... + mL patterns) to be covered. In this case, all of the linear programs set up for a class will be infeasible until the class is broken up into mc single-point clusters. Let us further assume the worst-case scenario in which the mc pattern vectors are broken up into one extra cluster at each clustering stage. Using simple counting arguments, we can readily show that for successful training the total number of linear programs solved (feasible and infeasible combined) grows only quadratically with the number of training patterns, and clustering is performed on the order of m times. Now, since each linear program can be solved in polynomial time (Khachian, 1979; Karmarkar, 1984) and each clustering operation to obtain a specified number of clusters can also be performed in polynomial time (Hartigan, 1975; Everitt, 1980), it follows that the above learning algorithm is of polynomial complexity.

What remains to be seen is the generalization capability of PTC nets. As for most artificial neural nets, only empirical studies of generalization are available. One such study (Roy et al., 1993) reported classification performance of a PTC net comparable to that of the k-nearest neighbors classifier and of backprop nets on relatively small sample classification tasks. In general, a PTC net allocates a much smaller number of hidden units compared to the RCE net when trained on the same data. Roy et al. (1993) also gave an extension to the above training method which enhances the PTC performance for data with outlier patterns. However, the PTC training procedure requires simultaneous access to all training examples, which makes the PTC net inapplicable for on-line implementations.
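The covering step can be sketched without an actual LP solver. In the sketch below (illustrative only; the PTC itself uses linear programming), a margin-perceptron iteration searches for parameters, in which the quadratic region of influence is linear, that place one cluster's patterns at net input >= eps and all other patterns at <= -eps. Unlike a linear program, this iteration cannot certify infeasibility; it simply gives up after a fixed number of sweeps.

```python
import numpy as np

def features(x):
    """w0, wi, and wii coordinates of the quadratic form (cross terms are
    omitted in this sketch); the net input is linear in these parameters."""
    x = np.asarray(x, float)
    return np.concatenate(([1.0], x, x * x))

def find_cover(inside, outside, eps=0.001, max_sweeps=10_000):
    """Return parameters p with p @ features(x) >= eps on `inside` and
    <= -eps on `outside`, or None if no solution is found."""
    cons = [(+1.0, features(x)) for x in inside] + \
           [(-1.0, features(x)) for x in outside]
    p = np.zeros(len(cons[0][1]))
    for _ in range(max_sweeps):
        ok = True
        for s, phi in cons:
            if s * (p @ phi) < eps:     # constraint violated: correct toward it
                p = p + s * phi
                ok = False
        if ok:
            return p
    return None
```

A cluster near the origin is easily covered against distant points, while an XOR-style arrangement admits no cover of this restricted quadratic form, mirroring the feasible/infeasible split that drives further clustering in the PTC.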

6.3.2 Cascade-Correlation Network

The Cascade-Correlation Network (CCN) proposed by Fahlman and Lebiere (1990) is yet another example of a unit-allocating architecture. This network is suited for classification tasks or for the approximation of continuous functions. The CCN was developed in an attempt to solve the so-called "moving target problem," which is blamed for the slowness of backprop learning. Because all of the weights in a backprop net are changing at once, each hidden unit sees a constantly changing environment; therefore, instead of moving quickly to assume useful roles in the overall problem solution, the hidden units engage in a complex "dance" with much wasted motion (Fahlman and Lebiere, 1990). The CCN has a significant learning speed advantage over backprop nets, since its units are trained individually without requiring back propagation of error signals.

The CCN differs from all the networks considered so far in two major ways: (1) it builds a deep net of cascaded units (as opposed to a net with a wide hidden layer), and (2) it can allocate more than one type of hidden unit. The hidden units can be sigmoid units, Gaussian units, etc., or a mixture of such units; for example, sigmoidal units and Gaussian units may coexist in the same network. The original CCN uses sigmoid units as hidden units. An important requirement on a candidate hidden unit is that it have a differentiable transfer function. The output units can also take various forms, but typically a sigmoid (or hyperbolic tangent) activation unit is employed for classification tasks; linear output units are employed when the CCN is used for approximating mappings with real-valued outputs (e.g., function approximation/interpolation). The number of output units is dictated by the application at hand and by the way the designer chooses to encode the outputs.

The CCN architecture after allocating three hidden units is illustrated in Figure 6.3.3. The number of each hidden unit represents the order in which it has been allocated.

Figure 6.3.3. The cascade-correlation network architecture with three hidden units.
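The cascade wiring can be made concrete with a small forward-pass sketch, assuming tanh hidden units and linear output units (one of the possible choices mentioned above; all names are illustrative):

```python
import numpy as np

def cascade_forward(x, hidden_weights, output_weights):
    """Forward pass through a cascade: hidden unit i sees the bias, the
    external inputs, and the outputs of all previously allocated hidden
    units, so its fan-in grows by one with each allocated unit."""
    acts = list(np.append(1.0, x))             # bias + external inputs
    for w in hidden_weights:                   # one weight vector per hidden unit
        acts.append(np.tanh(np.dot(w, acts)))  # each unit is its own layer
    return [np.dot(v, acts) for v in output_weights]   # linear outputs (sketch)
```

With two external inputs, the first hidden unit needs 3 weights, the second needs 4, and an output unit connected to everything needs 5, which illustrates how the fan-in grows with depth.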

The learning algorithm consists of two phases: output unit training and hidden unit training. Initially, the network has no hidden units; every input (including a bias x0 = 1) is connected to every output unit by a connection with an adjustable weight. In the first phase, the output unit weights are trained (e.g., using the delta rule) to minimize the usual sum of squared error (SSE) measure at the output layer. Note that if this first training phase converges with a very small SSE and no hidden units, then training is stopped (this convergence would be an indication that the training set is linearly separable, assuming we are dealing with a classification task). After the training reaches an asymptote, if the SSE remains above a predefined threshold, a new hidden unit is inserted whose weights are determined by the hidden unit training phase described below.

In the hidden unit training phase, the residual error (the difference between the desired and actual output) is recorded for each output unit l on each training pattern xk, k = 1, 2, ..., m. Next, a pool of randomly initialized candidate hidden units (typically four to eight units) is trained in parallel. Each candidate hidden unit receives trainable input connections from all of the network's external inputs and from all pre-existing hidden units; the outputs of these candidate units are not yet connected to any output units in the network. During this training phase, the weights of each candidate unit are adjusted, independently of the other candidates in the pool, in the direction of maximizing a performance measure E. Here, E is chosen as the sum, over all output units, of the magnitude of the covariance between the candidate unit's output zk and the residual output error observed at unit l. Formally, the criterion E is defined as

E = SUM_l | SUM_k (zk - z_bar)(elk - el_bar) |     (6.3.6)

where elk is the residual error observed at output unit l for pattern xk, and z_bar and el_bar are average values taken over all patterns xk. The maximization of E by each candidate unit is done by incrementing the weight vector w of this unit by an amount proportional to the gradient of E with respect to w, i.e., by performing steepest gradient ascent on E. Note that z_bar is recomputed every time w is incremented, by averaging the unit's outputs due to all training patterns. The motivation here is that, by maximizing covariance, a unit becomes tuned to the features in the input pattern which have not been captured by the existing net.

At the completion of candidate training, the candidate unit which achieves the highest covariance score E, i.e., the one which best optimizes the performance measure, is selected for permanent placement in the net; it is added to the network by fanning out its output to all output layer units through adjustable weights. At this point the weights of the newly allocated unit are frozen; in fact, once allocated, a hidden unit never changes its weights. The unit then becomes a permanent feature detector in the network, available for producing outputs or for creating other, more complex feature detectors. This multiple candidate training strategy minimizes the chance that a useless unit will be permanently allocated because an individual candidate has gotten stuck at a poor set of weights during training. Note also that each added hidden unit defines a new one-unit layer, which may lead to a very deep network with high fan-in for the hidden and output units.

Next, the network repeats the output layer training phase using the delta learning rule, and the residual output errors are recomputed. The two training phases continue to alternate until the output SSE is sufficiently small, which is almost always possible. Fahlman and Lebiere (1990) have empirically estimated the learning time in epochs to be roughly J log(J), where J is the number of hidden units ultimately needed to solve the given task; unfortunately, a precise value of J is almost impossible to determine a priori.
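The covariance criterion E described above translates directly into code. A minimal illustrative helper (the array layout, with patterns indexed along the last axis, is an assumption of this sketch):

```python
import numpy as np

def candidate_score(z, residuals):
    """E = sum over output units l of | sum over patterns k of
    (z_k - mean(z)) * (e_lk - mean(e_l)) |: the summed covariance magnitude
    between a candidate's output and the residual output errors."""
    z = np.asarray(z, float)
    E = np.atleast_2d(np.asarray(residuals, float))   # shape: (outputs, patterns)
    return sum(abs(np.sum((z - z.mean()) * (e - e.mean()))) for e in E)
```

A candidate whose output tracks the residual error pattern scores high, while one uncorrelated with it scores zero, which is exactly why maximizing E tunes new units to the error left by the existing net.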

The CCN has been empirically shown to be capable of learning some hard classification-type tasks 10 to 100 times faster than backprop, including difficult tasks on which backprop nets have been found to get stuck in local minima. One of these difficult tasks is the n-bit parity problem. Another is the two spiral problem shown in Figure 6.3.4(a). The two spiral problem was proposed by Alex Wieland as a benchmark which is extremely hard for a standard backprop net to solve. The task requires a network with two real-valued inputs and a single output to learn a mapping that distinguishes between points on two intertwined spirals; the output should be +1 for the first spiral and -1 for the other spiral. The associated training set consists of 194 point coordinates (x1, x2), half of which come from each spiral.

Figure 6.3.4(b) shows a solution to this problem generated by a trained CCN (Fahlman and Lebiere, 1990). The CCN typically requires 12 to 19 sigmoidal hidden units and an average of 1,700 training cycles when a pool of 8 candidate hidden units is used during training. The backprop net requires about 20,000 training cycles, and about 8,000 cycles if Fahlman's version of backprop (see Section 5.3) is used. Thus, in terms of training cycles, the CCN outperforms standard backprop by a factor of 10 while building a network of about the same complexity. In terms of actual computation on a serial machine, the CCN shows a 50-fold speedup over standard backprop on the two spiral task; this is due to the lower number of computations constituting a single CCN training cycle compared to that of standard backprop.

For comparison purposes, we show in Figure 6.3.4(c) a solution generated by a backprop network employing three hidden layers of five units each and with "shortcut" connections between layers (Lang and Witbrock, 1989); here, each unit receives incoming connections from every unit in every earlier layer, not just from the immediately preceding layer. Note that the solution generated by the CCN is qualitatively better than that generated by backprop. (The reader might have already noticed a difference in the spiral directions between Figures 6.3.4(b) and (c); this is simply because the simulation in (c) used training spirals which have the opposite direction to those shown in Figure 6.3.4(a).)

(a) (b) (c)
Figure 6.3.4. (a) Training samples for the two spiral problem. (b) Solution found by a cascade-correlation network. (c) Solution found by a 3-hidden layer backprop network which employs "short-cut" connections. [(a) and (b) are from S. E. Fahlman and C. Lebiere, 1990, with permission of Morgan Kaufmann; (c) is from K. J. Lang and M. J. Witbrock, 1989.]

On the other hand, when used for function approximation, the CCN suffers from overfitting: simulations with the Mackey-Glass time series prediction task show poor generalization performance for the CCN (Crowder, 1991), and in this case backprop outperforms the CCN. Another undesirable feature of the CCN is its inefficient hardware implementation: the deep layered architecture leads to a delay in response proportional to the number of layers, and the high fan-in of the hidden units imposes additional implementation constraints for VLSI technology, since high device fan-in leads to increased device capacitance and thus slower devices. Finally, it should be noted that the CCN requires that all training patterns be available for computing the averages after each training cycle, which makes the CCN inappropriate for on-line implementations.

6.4 Clustering Networks

The task of pattern clustering is to automatically group unlabeled input vectors into several categories (clusters), so that each input is assigned a label corresponding to a unique cluster. Vectors in the same cluster are similar, which usually means that they are "close" to each other in the input space. The clustering process is normally driven by a similarity measure, and there are various ways of representing clusters. A simple clustering net which employs competitive learning has already been covered in Section 3.4.1. In this section, two additional clustering neural networks are described which have more interesting features than the simple competitive net of Chapter Three. Either network is capable of automatic discovery (estimation) of the underlying number of clusters. Networks with incremental clustering capability can handle an infinite stream of input data; they do not require large memory to store training data because their cluster prototype units contain an implicit representation of all the inputs previously encountered. They also solve the "stability-plasticity dilemma"; namely, they let the network adapt yet prevent current inputs from destroying past training.

The first network we describe uses a simple representation in which each cluster is represented by the weight vector of a prototype unit (this is also the prototype representation scheme adopted by the simple competitive net). This network is further characterized by its ability to allocate clusters incrementally, as needed, in response to novel input vectors. The second clustering network described in this section has a more complex architecture than the first net. It employs a distributed representation, as opposed to a single prototype unit cluster representation, and is non-incremental in terms of cluster formation. However, the highly nonlinear multiple layer architecture of this clustering net enables it to perform well on very difficult clustering tasks. Another interesting property of this net is that it does not require an explicit user-defined similarity measure! The network develops its own internal measure of similarity as part of the training phase.

6.4.1 Adaptive Resonance Theory (ART) Networks

Adaptive resonance architectures are artificial neural networks that are capable of stable categorization of an arbitrary sequence of unlabeled input patterns in real time. These architectures are capable of continuous training with non-stationary inputs. ART networks are biologically motivated and were developed as possible models of cognitive phenomena in humans and animals. The basic principles of the underlying theory of these networks, known as adaptive resonance theory (ART), were introduced by Grossberg (1976). A class of ART architectures, called ART1, is characterized by a system of ordinary differential equations (Carpenter and Grossberg, 1987a), with associated theorems proving its self-stabilization property and the convergence of its adaptive weights. ART1 embodies certain simplifying assumptions which allow its behavior to be described in terms of a discrete-time clustering algorithm. A number of interpretations/simplifications of the ART1 net have been reported in the literature (e.g., Lippmann, 1987; Pao, 1989; Moore, 1989). In the following, we adopt Moore's abstraction of the clustering algorithm from the ART1 architecture and discuss this algorithm in conjunction with a simplified architecture.

The basic architecture of the ART1 net consists of a layer of linear units representing prototype vectors, whose outputs are acted upon by a winner-take-all network (described in Section 3.4.1). This architecture is identical to that of the simple competitive net in Figure 3.4.1 with one major difference: the linear prototype units are allocated dynamically, as needed, in response to novel input vectors. Once a prototype unit is allocated, appropriate lateral-inhibitory and self-excitatory connections are introduced so that the allocated unit may compete with pre-existing prototype units. Alternatively, one may assume a pre-wired architecture as in Figure 3.4.1 with a large number of inactive (zero weight) units; here, a unit becomes active if the training algorithm decides to assign it as a cluster prototype unit.

The general idea behind ART1 training is as follows. Every training iteration consists of taking a training example xk and examining existing prototypes (weight vectors wj) that are sufficiently similar to xk. If a prototype wi is found to "match" xk (according to a "similarity" test based on a preset matching threshold), example xk is added to wi's cluster and wi is modified to make it better match xk. If no prototype matches xk, then xk becomes the prototype for a new cluster. The details of the ART1 clustering procedure are considered next.

The input vector (pattern) x in ART1 is restricted to binary values, x ∈ {0, 1}^n. Each learned cluster, say cluster j, is represented by the weight vector wj ∈ {0, 1}^n of the jth prototype unit. Every time an input vector x is presented to the ART1 net, each existing prototype unit computes a normalized output (the motivation behind this normalization is discussed later)

yj = ||wj ∧ x||² / ||wj||²    (6.4.1)

and feeds it to the winner-take-all net for determining the winner unit. Here, "∧" stands for the logical AND operation applied component-wise to the corresponding components of wj and x, and ||x||² and ||wj||² are equal to the number of 1's in x and wj, respectively. Note that yj is the ratio of the overlap between prototype wj and x to the size of wj. The winner-take-all net computes a "winner" unit i; the weight vector wi of the winner unit now represents a potential prototype for the input vector, subject to further verification. The verification comes in the form of passing two tests. In order to pass the first test, the input x must be "close enough" to the winner prototype wi. This test is passed if

||wi ∧ x||² / ||wi||² ≥ ||x||² / n    (6.4.2)

The second test is a match verification test between wi and x: wi is declared to "match" x if a significant fraction (determined by ρ) of the 1's in x appear in wi. This test is passed if

||wi ∧ x||² / ||x||² ≥ ρ    (6.4.3)

where 0 < ρ < 1 is a user-defined "vigilance" parameter. Passing this test guarantees that a sufficient fraction of the wi and x bits are matched. If the above two tests are passed by the winner unit i for a given input x (here, the network is said to be in resonance), then x joins cluster i and this unit's weight vector wi is updated according to

wi = wi ∧ x    (6.4.4)

According to Equation (6.4.4), a prototype wi can only have fewer and fewer 1's as training progresses; i.e., in the updating step the prototype vectors always move to vectors with fewer 1's. Note also that since wi is updated according to Equation (6.4.4), the vector wi conserves its binary nature.

A second scenario corresponds to unit i passing the first test but not the second one. Here, the ith unit is deactivated (its output is clamped to zero until a new input arrives) and the tests are repeated with the unit with the next highest normalized output. Note that the learning dynamics in this second scenario constitute a search through the prototype vectors, looking at the closest, next closest, etc., by the maximum criterion of Equation (6.4.1). This search is continued until a prototype vector is found that satisfies the matching criterion in Equation (6.4.3). The two criteria are different: the first criterion measures the fraction of the bits in wi that are also in x, whereas the second criterion measures the fraction of the bits in x that are also in wi. So going further away by the first measure may actually bring us closer by the second. If this scenario persists even after all existing prototype units are exhausted, then a new unit representing a new cluster j is allocated and its weight vector wj is initialized according to

wj = x    (6.4.5)

In a third scenario, unit i does not pass the first test. Here, wi is declared "too far" from the input x, and a new unit representing a new cluster j is allocated, with its weight vector wj initialized as in Equation (6.4.5). Hence, wj is initially a binary vector upon being allocated; this is true for any unit in the ART1 net, since all units undergo the initialization in Equation (6.4.5).

The normalization factor ||wj||² in Equation (6.4.1) is used as a "tie-breaker": it favors smaller-magnitude prototype vectors over vectors which are supersets of them (i.e., have 1's in the same places) when an input matches them equally well. This mechanism of favoring small prototype vectors helps keep the prototype vectors apart. It also helps compensate for the fact that, in the updating step of Equation (6.4.4), a prototype can only lose 1's. Note that Equation (6.4.3) causes more differentiation among input vectors of smaller magnitude; this feature of ART1 is referred to as "automatic scaling," "self-scaling," or "noise-insensitivity."

Note that it is possible for a training example to join a cluster but eventually to leave that cluster because other training examples have joined it. Regardless of the setting of ρ, however, the ART1 network is stable for a finite training set; i.e., the final clusters will not change if additional training is performed with one or more patterns drawn from the original training set. It should also be noted that the search described above only occurs before stability is reached for a given training set; after that, each prototype vector is matched on the first attempt and no search is needed. A key feature of the ART1 network is its continuous learning ability. This feature, coupled with the above stability result, allows the ART1 net to follow nonstationary input distributions.

The vigilance parameter ρ in Equation (6.4.3) controls the granularity of the clusters generated by the ART1 net. Small values allow for large deviations from cluster centers and hence lead to a small set of clusters; on the other hand, a higher vigilance leads to a larger number of tight clusters. The clustering behavior of the ART1 network is illustrated for a set of random binary vectors. Here, the task of ART1 is to cluster twenty-four uniformly distributed random vectors x ∈ {0, 1}^16. Simulation results are shown in Figure 6.4.1 (a) and (b) for ρ = 0.5 and ρ = 0.7, respectively (here, the vectors are shown as 4 × 4 patterns of "on" and "off" pixels for ease of visualization). The resulting prototype vectors are also shown. Note the effect of the vigilance parameter setting on the granularity of the generated clusters.

The family of ART networks also includes more complex models, such as ART2 (Carpenter and Grossberg, 1987b) and ART3 (Carpenter and Grossberg, 1990). These ART models are capable of clustering binary and analog input patterns. However, they are inefficient from a computational point of view. A simplified model of ART2, ART2-A, has been proposed which is two to three orders of magnitude faster than ART2 (Carpenter et al., 1991a). Also, a supervised real-time learning ART model called ARTMAP has been proposed (Carpenter et al., 1991b).
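Putting the three scenarios together, the discrete-time ART1 procedure above can be sketched in a few lines of Python. This is only an illustration, not the book's code: the helper names and the tiny four-pattern data set are ours, and the first test is implemented as the "close enough" comparison of Equation (6.4.2) as reconstructed above.

```python
# Minimal sketch of the simplified (discrete-time) ART1 clustering algorithm.
# Binary vectors are plain Python lists of 0/1; inputs are assumed to
# contain at least one 1, so prototype sizes never reach zero.

def overlap(a, b):
    """||a AND b||^2 = number of positions where both bits are 1."""
    return sum(ai & bi for ai, bi in zip(a, b))

def ones(a):
    """||a||^2 = number of 1's in a binary vector."""
    return sum(a)

def art1_present(prototypes, x, rho):
    """Present one input x, update prototypes in place, return its cluster index."""
    n = len(x)
    active = set(range(len(prototypes)))
    while active:
        # Winner-take-all over the normalized outputs, Eq. (6.4.1)
        i = max(active, key=lambda j: overlap(prototypes[j], x) / ones(prototypes[j]))
        w = prototypes[i]
        # First test, Eq. (6.4.2): is x "close enough" to the winner?
        if overlap(w, x) / ones(w) < ones(x) / n:
            break                      # third scenario: allocate a new cluster
        # Second (vigilance) test, Eq. (6.4.3)
        if overlap(w, x) / ones(x) >= rho:
            # Resonance: component-wise AND update, Eq. (6.4.4)
            prototypes[i] = [wi & xi for wi, xi in zip(w, x)]
            return i
        active.remove(i)               # second scenario: deactivate and search on
    # All tests failed everywhere: allocate a new prototype unit, Eq. (6.4.5)
    prototypes.append(list(x))
    return len(prototypes) - 1

# Tiny example: two natural groups emerge at this vigilance level.
data = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
protos = []
labels = [art1_present(protos, x, rho=0.5) for x in data]
```

Raising `rho` toward 1 in this sketch forces tighter matches and therefore more allocated prototypes, mirroring the granularity effect of the vigilance parameter discussed above.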
Figure 6.4.1. Random binary pattern clustering employing the ART1 net. Different vigilance values cause different numbers of categories (clusters) to form: (a) ρ = 0.5 and (b) ρ = 0.7. For each case, the top row shows prototype vectors extracted by the ART1 network.

ART networks are sensitive to the presentation order of the input patterns; i.e., they may yield different clusterings of the same data when the presentation order of patterns is varied (with all other parameters kept fixed). Similar effects are also present in incremental versions of classical clustering methods such as k-means clustering (i.e., k-means is also sensitive to the initial choice of cluster centers). A characterization of the clustering behavior of ART2 was given by Burke (1991), who draws an analogy between ART-based clustering and k-means-based clustering.

An example of ART2 clustering is shown in Figure 6.4.2 (Carpenter and Grossberg, 1987b). Here, the problem is to cluster a set of fifty 25-dimensional real-valued input signals (patterns). The results are shown for two different vigilance levels. It is left to the reader to check (subjectively) the consistency of the formed clusters.

Figure 6.4.2. ART2 clustering of analog signals for two vigilance levels: the vigilance value in (a) is smaller than that in (b). (From G. Carpenter and S. Grossberg, 1987b, Applied Optics, 26(23), pp. 4919-4930. ©1987 Optical Society of America.)

6.4.2 Autoassociative Clustering Network

Other data clustering networks can be derived from "concept forming" cognitive models (Anderson, 1983; Knapp and Anderson, 1984; Anderson and Murphy, 1986). A "concept" describes the situation where a number of different objects are categorized together by some rule or similarity relationship; for example, a person is able to recognize that physically different objects are really "the same" (e.g., a person's concept of "tree").
A simple concept forming model consists of two basic interrelated components. First, it consists of a prototype forming component, which is responsible for generating category prototypes via an autoassociative learning mechanism. The second component is a retrieval mechanism, where a prototype becomes an attractor in a dynamical system; here, the prototype and its surrounding basin of attraction represent an individual concept. Artificial neural networks based on the above concept forming model have been proposed for data clustering of noisy, superimposed patterns (e.g., Anderson et al., 1990; Hassoun and Spitzer, 1990; Spitzer et al., 1990; Hassoun et al., 1992).

The basic idea is as follows. First, a feedforward (single or multiple layer) net is trained in an autoassociative mode (recall the discussion on autoassociative nets in Chapter 5). The training is supervised in the sense that each input pattern to the net serves as its own target; however, these targets are not the "correct answers" in the general sense of supervised training. The strategy here is to force the network to develop internal representations during training so as to better reconstruct the noisy inputs. In the prototype extraction phase, the trained feedforward net is transformed into a dynamical system by using the output layer outputs as inputs to the net, thus forming an external feedback loop. Hopefully, when initialized with one of the input patterns, the dynamical system will evolve and eventually converge to the "closest" attractor state. With proper stabilization, these attractors may be identified with the prototypes derived from the training phase.

Next, we describe the above ideas in the context of a simple single layer network (Hassoun and Spitzer, 1990), and then present a more general autoassociative clustering architecture. Consider an unlabeled training set of vectors x̃ ∈ Rn representing distorted and/or noisy versions of a set {xk, k = 1, 2, ..., m} of m unknown prototype vectors. Assume a single layer net of n linear units, each having an associated weight vector wi ∈ Rn, i = 1, 2, ..., n, and let us denote by W the n × n matrix whose ith row is the weight vector wi. This simple network outputs the vector y ∈ Rn upon the presentation of an input x̃, where

y = W x̃    (6.4.6)

During training, we update the connection matrix W in response to the presentation of a sample x̃ according to the matrix form of the Widrow-Hoff (LMS) rule

W(k+1) = W(k) + η [x̃(k) − W(k) x̃(k)] x̃(k)ᵀ    (6.4.7)

where η > 0 is a learning rate, which realizes a gradient descent minimization of the criterion function

J(W) = ½ E{ ||x̃ − W x̃||² }    (6.4.8)

Therefore, the trained net will be represented by a connection matrix W which approximates the mapping

Wx = x    (6.4.9)

for all m prototype vectors x. In other words, the autoassociative training phase attempts to estimate the unknown prototypes and encode them as eigenvectors of W with eigenvalues of 1. This approximation results from minimizing J in Equation (6.4.8) with the noisy patterns x̃, and it requires that the clusters of noisy samples associated with each prototype vector are sufficiently uncorrelated. In addition, it requires that the network be incapable of memorizing the training autoassociations and that the number of underlying prototypes m to be learned be much smaller than n.

A simple method for extracting these eigenvectors (prototypes) is to use the "power" method of eigenvector extraction. This is an iterative method which can be used to pick out the eigenvector with the largest-magnitude eigenvalue of a matrix A by repeatedly passing an initially random vector c0 through the matrix, according to (also, refer to footnote 6 in Chapter 3)

c(k+1) = A c(k),   k = 0, 1, 2, ...
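The linear scheme can be illustrated end to end with a small numerical sketch: train W with the autoassociative LMS rule of Equation (6.4.7), then retrieve a stored prototype by iterating the external feedback loop. Everything here (dimensions, noise level, learning rate, iteration counts, and the made-up prototypes) is an arbitrary toy choice of ours, not the book's experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 16, 2
# Two made-up bipolar prototypes, unknown to the learner
protos = rng.choice([-1.0, 1.0], size=(m, n))

# Autoassociative LMS training, Eq. (6.4.7): each noisy sample is its own target
W = np.zeros((n, n))
eta = 0.01
for _ in range(2000):
    x = protos[rng.integers(m)] + 0.2 * rng.standard_normal(n)  # noisy sample
    W += eta * np.outer(x - W @ x, x)                           # Widrow-Hoff update

# Retrieval by external feedback (power iterations): y <- W y
y = protos[0] + 0.4 * rng.standard_normal(n)   # key: a distorted prototype 0
for _ in range(50):
    y = W @ y
    y = y / np.linalg.norm(y)                  # normalize to keep the state bounded

# The final state should align with prototype 0 far better than with prototype 1
c0 = abs(y @ protos[0]) / np.linalg.norm(protos[0])
c1 = abs(y @ protos[1]) / np.linalg.norm(protos[1])
```

The noise directions of W receive small eigenvalues and are attenuated by the repeated multiplication, while the prototype direction closest to the key (eigenvalue near 1) survives, which is exactly the selectivity discussed in the text.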

After a number of iterations, the eigenvector with the largest-magnitude eigenvalue will dominate. The above eigenvector extraction method can be readily implemented by adding external feedback to our simple net, with A represented by W and c by y (or x̃). Once initialized with one of the noisy vectors x̃, the resulting dynamical system evolves its state (the n-dimensional output vector y) in such a way that this state moves towards the prototype vector x "most similar" to x̃. This is possible because the prototype vectors x approximate the dominating eigenvectors of W (with eigenvalues close to 1). Note that the remaining eigenvectors of W arise from learning uncorrelated noise and tend to have small eigenvalues compared to 1. The ability of the autoassociative dynamical net to selectively extract a learned prototype/eigenvector from a "similar" input vector, as opposed to always extracting the most dominant eigenvector, is due to the fact that all learned prototypes have comparable eigenvalues close to unity; i.e., the extracted prototype is the one that is "closest" to the initial state (input vector) of the net.

The stability of the dynamic autoassociative net is an important design issue. Stability is determined by the network weights and the network architecture. Care must be taken in matching the learning algorithm for prototype encoding in the feedforward net to the dynamic architecture which is ultimately used for prototype extraction. One would like to design an autoassociative clustering net which minimizes an associated energy or Liapunov function; then, starting from any initial state, the system's state always evolves along a trajectory for which the energy function is monotonically decreasing (the reader may refer to Section 7.2 for further details on Liapunov functions and stability). Anderson et al. (1990) reported a stable autoassociative clustering net based on the brain-state-in-a-box (BSB) concept forming model (Anderson et al., 1977).

A serious limitation of the autoassociative linear net we have just described is that it does not allow the user to control the granularity of the formed clusters. This network will tend to merge different clusters that are "close" to each other in the input space, due to the lack of cluster competition mechanisms (recall the similarity and vigilance tests employed in the ART1 net for controlling cluster granularity). When two different clusters are merged by the linear net, they become represented by a distorted prototype which is a linear combination of the two correct (but unknown) prototypes. Introducing nonlinearity into the autoassociative net architecture can help the net escape this limitation by allowing control of the granularity of the clusters. This feature is discussed in connection with the dynamic nonlinear multilayer autoassociative network (Spitzer et al., 1990; Wang, 1991; Hassoun et al., 1992) considered next.

Consider a two hidden layer feedforward net with an output layer of linear units. Each hidden unit in the first layer receives inputs from all components of an n-dimensional input vector and an additional bias input (held fixed at 1). Similarly, each unit in the second hidden layer receives inputs from all units in the first hidden layer plus a bias input of 1. Finally, each linear output unit receives inputs from all second hidden layer units plus a bias. This layered network functions as an autoassociative net; the n output layer units serve to reconstruct (decode) the n-dimensional vector presented at the input. We wish the network to discover a limited number of representations (prototypes) of a set of noisy input vectors. An essential feature of the network's architecture is therefore a restrictive "bottleneck" in the hidden layers; i.e., the number of units in each hidden layer (especially the first hidden layer) is small compared to n. The effect of this bottleneck is to restrict the degrees of freedom of the network and constrain it to discovering a limited set of unique prototypes which describes (clusters) the training set; the network does not have sufficient capacity to memorize the training set.

All hidden layer units employ the hyperbolic tangent activation function

f(net) = tanh(λ net)    (6.4.10)

where λ controls the slope of the activation. The activation slopes of the second hidden layer units (the layer between the first hidden and output layers) are fixed (typically set to 1). On the other hand, the activation slopes for the units in the first hidden layer are made monotonically increasing during training, as explained below.
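A minimal sketch of this bottleneck architecture's forward pass may help fix the structure. The sizes, random initialization, and names below are our own illustrative choices; only the wiring (bias inputs, tanh hidden layers with a slope parameter on the first layer, linear outputs) follows the description above.

```python
import numpy as np

# Bottleneck autoassociative net: n inputs -> h1 tanh units -> h2 tanh units
# -> n linear outputs, with a bias input of 1 feeding every layer.
n, h1, h2 = 50, 8, 8                           # hidden widths small compared to n
rng = np.random.default_rng(1)
W1 = 0.1 * rng.standard_normal((h1, n + 1))    # first hidden layer (+ bias weight)
W2 = 0.1 * rng.standard_normal((h2, h1 + 1))   # second hidden layer (+ bias weight)
W3 = 0.1 * rng.standard_normal((n, h2 + 1))    # linear output layer (+ bias weight)

def forward(x, lam):
    """One pass through the net; lam is the first layer's slope parameter.
    Returns (reconstruction, first-hidden-layer activity)."""
    a1 = np.tanh(lam * (W1 @ np.append(x, 1.0)))  # dynamic-slope layer, Eq. (6.4.10)
    a2 = np.tanh(W2 @ np.append(a1, 1.0))         # fixed slope (lambda = 1)
    y = W3 @ np.append(a2, 1.0)                   # linear reconstruction
    return y, a1

x = rng.standard_normal(n)
y, a1 = forward(x, lam=200.0)   # at a large slope, a1 is effectively bipolar binary
```

With a large slope the first hidden layer behaves like a bank of sign units, so at most 2^8 distinct intermediate codes exist here; that hard limit on the number of internal representations is the bottleneck doing its clustering work.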

The learning algorithm employed is essentially the incremental backprop algorithm of Chapter 5, with simple heuristic modifications. These modifications include a dynamic slope for the first hidden layer activations that saturates during learning, and exponentially damped learning rate coefficients. These heuristics significantly enhance the network's tendency to discover the best prototypical representations of the training data.

In order to enhance separation of clusters and promote grouping of similar input vectors, the slope λ of the activation functions of the first hidden layer units is made dynamic; it increases monotonically during learning according to

λ(k) = γ^k    (6.4.11)

where γ is a predefined constant greater than but close to 1, and k is the learning step index. The larger the value of γ, the faster is the saturation of the activations, which, in turn, increases the sensitivity of the first hidden layer representations to differences among the input vectors. This increased sensitivity increases the number of unique representations at this layer, thus leading the rest of the network to reconstruct an equal number of prototypes. As these activations saturate, the nonlinearity gradually (over a period of many cycles through the whole training set) becomes the sgn (sign) function, and the outputs of the first hidden layer become functionally restricted to bipolar binary values. This gradually forces "similar" inputs to activate a unique distributed representation at this layer. Hence, the slope saturation parameter γ controls cluster granularity and may be viewed as a vigilance parameter similar to that in the ART nets.

The other modification to backprop is the use of exponentially damped learning rate coefficients. The tendency is for the network to best remember the most recently presented training data. A decaying learning coefficient helps counteract this tendency and balance the sensitivity of learning for all patterns: learning initially proceeds rapidly, but then the declining rate of learning produces a deemphasis of the most recently learned input vectors, which reduces "forgetting" effects and allows the repeating patterns to be learned evenly. The learning rate coefficient used is therefore dynamically adjusted according to

η(k) = δ^k    (6.4.12)

where δ is a predefined constant less than but close to unity.

As a result of learning, the first hidden layer in this net discovers a set of bipolar binary distributed representations that characterize the various input vectors; i.e., a limited number of representations for "features" of the input vectors become available. The second hidden and the output layers perform a nonlinear mapping which decodes these representations into reduced-noise versions of the input vectors. Therefore, the network generates two sets of prototypes (concepts): abstract binary-valued concepts and explicit real-valued concepts. The network also supplies the user with parallel implementations of the two mappings from one concept representation to the other: the first hidden layer maps the input vectors into their corresponding abstract concepts, while the second hidden layer and the output layer implement the inverse of this mapping.

In the prototype extraction phase, a dynamic net is generated by feeding the above trained net's output back to the input. A pass is now made over the training set, but this time no learning occurs. The primary objective of this pass is to classify the vectors in the training data and extract the prototype discovered by the network for each cluster. As each vector is presented, an output is generated and is fed back to the input of the network; this process is repeated iteratively until convergence. When the network settles into a final state, the outputs of the first hidden layer converge to a bipolar binary state. This binary state gives an intermediate distributed representation (activity pattern) of the particular cluster containing the present input. The intermediate activity pattern is mapped by the rest of the net into a real-valued activity pattern at the output layer. This output can be taken as a "nominal" representation of the underlying cluster "center," i.e., a prototype of the cluster containing the current input.

Autoassociative multilayer nets with a hidden layer bottleneck have been studied by Bourlard and Kamp (1988), Baldi and Hornik (1989), Funahashi (1990), Kramer (1991), Oja (1991), Usui et al. (1991), and Hoshino et al., among others. In these studies, such nets have been found to implement principal component analysis (PCA) and nonlinear PCA when one hidden layer and three or more hidden layers are used, respectively. The mapping characteristics of the highly nonlinear first hidden layer described above may also be thought to emerge from a kind of nonlinear PCA mechanism, where unit activities are influenced by high-order statistics of the training set.
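The growth and decay schedules of Equations (6.4.11) and (6.4.12) are easy to sanity-check numerically. Using the settings quoted in the simulations of this section (γ = 1.0003, 50 training patterns per cycle, slope saturating at 200, δ = 0.9999), the slope reaches saturation after roughly 350 cycles:

```python
import math

# Numerical check of Eqs. (6.4.11) and (6.4.12) with the settings
# quoted in the text: gamma = 1.0003, delta = 0.9999, 50 patterns
# per training cycle, slope saturating at 200.
gamma, delta = 1.0003, 0.9999
patterns_per_cycle = 50

# Smallest step index k with lambda(k) = gamma**k >= 200
k = math.ceil(math.log(200.0) / math.log(gamma))
cycles = k / patterns_per_cycle        # comes out near 350 cycles

# Learning rate left over the same period, eta(k) = delta**k
eta_ratio = delta ** k                 # fraction of the initial rate remaining
```

The computed cycle count agrees with the roughly 350 cycles reported for γ = 1.0003 in the simulations; repeating the calculation with γ = 1.0005 similarly gives about 210 cycles.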

We conclude this section by presenting two simulation results which illustrate the capability of the above autoassociative clustering net. In the first simulation, the net is used to cluster the 50 analog patterns of Figure 6.4.2. The network used had eight units in each of the first and second hidden layers. A typical setting for δ in Equation (6.4.12) is 0.9999. Two experiments were performed, with γ = 1.0003 and γ = 1.0005. The patterns are repetitively presented to the net until the slopes in the first hidden layer rise to 200, according to Equation (6.4.11) (γ = 1.0003 required about 350 training cycles to saturate at 200, while γ = 1.0005 required about 210 cycles). The clustering results are shown in Figure 6.4.3: parts (a) and (b) show the learned prototypes (top row) and their associated cluster members (in associated columns) for γ = 1.0003 and 1.0005, respectively. Note how the level of γ controls cluster granularity. It is left to the reader to compare the clustering results in Figure 6.4.3 to those due to the ART2 net shown in Figure 6.4.2.

Figure 6.4.3. Clusters of patterns formed by the multilayer autoassociative clustering net: (a) γ = 1.0003 and (b) γ = 1.0005. The top row shows the learned prototype for each cluster.

It is interesting to note that the choice of network size and parameters γ and δ was not optimized for this particular clustering problem. In fact, these parameters were also found to be appropriate for clustering motor unit potentials (MUPs) in the electromyogram (EMG) signal (Wang, 1991). In the second simulation, the autoassociative clustering net (with comparable size and parameters as in the first simulation) is used to decompose a 10 second recording of an EMG signal obtained from a patient with a diagnosis of myopathy (a form of muscle disease). The objective here is to extract the prototype motor unit potentials (MUPs) comprising this signal, and to associate each noisy MUP in the signal with its correct prototype. A segment of this signal (about 0.8 sec.) is shown in the top window in Figure 6.4.4. A preprocessing algorithm detected 784 candidate MUPs (each MUP pattern is uniformly sampled and is represented by a 50-dimensional vector centered at the highest peak in the MUP signal) in the 10 second EMG recording shown. These detected noisy MUPs were used to train the autoassociative clustering net. A total of 24 unique prototype MUPs are identified by the network, and a significant number of noisy input MUPs converged to these prototypes. The figure shows the prototypes associated with the 11 most significant clusters in small windows; examples of MUPs from each cluster are shown to the right of the discovered cluster prototypes. This result is superior to those of existing automated EMG decomposition techniques. For a detailed description and analysis of this problem, the reader may consult Wang (1991), Hassoun et al. (1994a), and Hassoun et al. (1994b).

Proving the stability of this dynamic multilayer autoassociative clustering net is currently an open problem. The highly nonlinear nature of this dynamical system makes the analysis difficult; it would be difficult to find an appropriate Liapunov function, if one exists, to prove stability. However, empirical evidence suggests a high degree of stability when the system is initialized with one of the training vectors and/or a new vector which is "similar" to any one of the training vectors.

Figure 6.4.4. Decomposition of an EMG signal by the autoassociative clustering net. One segment of an EMG (about 0.8 sec.) is shown in the top window. The extracted MUP waveform prototypes are shown in small windows. Examples of classified waveforms are shown to the right.

6.5 Summary

This chapter started by introducing the radial basis function (RBF) network as a two layer feedforward net employing hidden units with locally-tuned response characteristics. This network model is motivated by biological nervous systems, as well as by early results on the statistical and approximation properties of radial basis functions. RBF networks employ a computationally efficient training method that decouples learning at the hidden layer from that at the output layer. This method uses a simple clustering (competitive learning) algorithm to locate hidden unit receptive field centers, and uses the LMS or delta rule to adjust the weights of the output units. These networks have prediction/approximation capabilities comparable to those of backprop networks, but train faster by orders of magnitude. Another advantage of RBF nets is their lower "false-positive" classification error rates. However, RBF nets require more training data and more hidden units than backprop nets, by a factor of at least one order of magnitude, for achieving the same level of accuracy. The most natural application of RBF networks is in approximating smooth, continuous multivariate functions of few variables. Two major variations to the RBF network were given which lead to improved accuracy (though at the cost of reduced training speed). The first variation involved replacing the k-means-clustering-based training method for locating hidden unit centers by a "soft competition" clustering method. The second variation substitutes semilocal activation units for the local activation hidden units, and employs gradient descent-based learning for adjusting all unit parameters.

The CMAC is another example of a localized receptive field net considered in this chapter. The CMAC was originally developed as a model of the cerebellum. This network model shares several of the features of the RBF net, such as fast training and the need for a large number of localized receptive field hidden units for accurate approximation. One distinguishing feature of the CMAC, though, is its built-in capability of local generalization. It also has features (and limitations) in common with classical perceptrons (e.g., Rosenblatt's perceptron). The CMAC has been successfully applied in the control of robot manipulators.

Three unit-allocating adaptive multilayer feedforward networks were also described in this chapter. The motivation behind unit-allocating nets is twofold: (1) the elimination of the guesswork involved in determining the appropriate number of hidden units (network size) for a given task, and (2) training speed. The first two networks belong to the class of hyperspherical classifiers. They employ hidden units with adaptive localized receptive fields, and they have sparse interconnections between the hidden units and the output units. These networks are easily capable of forming arbitrarily complex decision boundaries, with rapid training times; for one of these networks (the PTC net), the training time was shown to be of polynomial complexity. The third unit-allocating network (the cascade-correlation net) differs from all previous networks in its ability to build a deep net of cascaded units and in its ability to utilize more than one type of hidden units coexisting in the same network.

Finally, two examples of dynamic multilayer clustering networks are discussed: the ART1 net and the autoassociative clustering net. These networks are intended for tasks involving data clustering and prototype generation. The ART1 net is characterized by its on-line capability of clustering binary patterns, its stability, and its ability to follow nonstationary input distributions. ART networks are biologically motivated and were developed as possible models of cognitive phenomena in humans and animals. Generalizations of the ART1 network allow the extension of these desirable characteristics to the clustering of analog patterns. The second clustering net is motivated by "concept forming" cognitive models. It is based on two interrelated mechanisms: prototype formation and prototype extraction. A slightly modified backprop training method is employed in a customized autoassociative net of sigmoid units in an attempt to estimate and encode cluster prototypes in such a way that they become attractors of a dynamical system. This dynamical system is formed by taking the trained feedforward net and feeding its output back to its input. Results of simulations involving data clustering with these nets are given; in particular, results of motor unit potential (MUP) prototype extraction and noisy/distorted MUP categorization (clustering) for a real EMG signal are presented. It is hoped that the different network models presented in this chapter, the way their development has been influenced by biological, cognitive, and/or statistical models, and the motivations for developing them give the reader an appreciation of the diversity and richness of these networks.

7. Associative Neural Memories

7.0 Introduction

This chapter is concerned with associative learning and retrieval of information (vector patterns) in neural-like networks. These networks are usually referred to as associative neural memories (or associative memories), and they represent one of the most extensively analyzed classes of artificial neural networks. The chapter starts by presenting some simple networks which are capable of functioning as associative memories and derives the necessary conditions for perfect storage and retrieval of a given set of memories. Then, this simple associative memory is extended into a recurrent autoassociative memory by employing feedback. These two basic associative memories will help define some terminology and serve as a building ground for some additional associative memory models presented in Section 7.4. The chapter continues by presenting additional associative memory models, with particular attention given to dynamic associative memories (DAMs). DAMs are a class of recurrent artificial neural networks which utilize a learning/recording algorithm to store vector patterns (usually binary patterns) as stable memory states. These memories are treated as nonlinear dynamical systems where information retrieval is realized as an evolution of the system's state in a high-dimensional state space. The retrieval of these stored "memories" is accomplished by first initializing the DAM with a noisy or partial input pattern (key) and then allowing the DAM to perform a collective relaxation search to arrive at the stored memory which is best associated with the input pattern. The characteristics of high-performance DAMs are defined, and the capacity and retrieval dynamics of various DAMs are analyzed. Finally, the application of a DAM to the solution of combinatorial optimization problems is described.

7.1 Basic Associative Neural Memory Models

Several associative neural memory models have been proposed over the last two decades [e.g., Amari, 1972a; Anderson, 1972; Kohonen, 1972 and 1974; Kohonen and Ruohonen, 1973; Nakano, 1972; Hopfield, 1982; Kosko, 1987; Okajima et al., 1987; Chiueh and Goodman, 1988; Kanerva, 1988; Baird, 1990. For an accessible reference on various associative neural memory models the reader is referred to the edited volume by Hassoun (1993)]. These memory models can be classified in various ways depending on their architecture (static versus recurrent), their retrieval mode (synchronous versus asynchronous), the nature of the stored associations (autoassociative versus heteroassociative), the complexity and capability of the memory storage/recording algorithm, etc. In this section, a simple static synchronous associative memory is presented along with appropriate memory storage recipes. Various associative memory architectures are then presented, with emphasis on dynamic (recurrent) associative memory architectures.

7.1.1 Simple Associative Memories and Their Associated Recording Recipes

One of the earliest associative memory models is the correlation memory (Anderson, 1972; Kohonen, 1972; Nakano, 1972). This correlation memory consists of a single layer of L non-interacting linear units, with the lth unit having a weight vector wl in Rn. It associates real-valued input column vectors xk in Rn with corresponding real-valued output column vectors yk in RL according to the transfer equation

yk = Wxk    (7.1.1)

where {xk, yk}, k = 1, 2, ..., m, is a collection of desired associations, and W is an L × n interconnection matrix whose lth row is the transpose of wl. A block diagram of the simple associative memory expressed in Equation (7.1.1) is shown in Figure 7.1. Note that this associative memory is characterized by linear matrix-vector multiplication retrievals; hence, it is referred to as a linear associative memory (LAM). If yk = xk for all k, then this memory is called autoassociative; otherwise it is labeled heteroassociative, since yk is different (in encoding and/or dimensionality) from xk.

Figure 7.1. A block diagram of a simple linear heteroassociative memory.

Correlation Recording Recipe

The correlation memory is a LAM which employs a simple recording/storage recipe for loading the m associations {xk, yk} into memory. This recording recipe is responsible for synthesizing W and is given by

W = Σ(k = 1 to m) yk (xk)T    (7.1.2)

In other words, the interconnection matrix W is simply the correlation matrix of the m association pairs. Another way of expressing Equation (7.1.2) is

W = YXT    (7.1.3)

where Y = [y1 y2 ... ym] and X = [x1 x2 ... xm]. Note that for the autoassociative case, where the set of association pairs {xk, xk} is to be stored, one may still employ Equation (7.1.2) or (7.1.3) with yk replaced by xk.

One appealing feature of correlation memories is the ease of storing new associations or deleting old ones. For example, if after recording the m associations {x1, y1} through {xm, ym} it is desired to record one additional association {xm+1, ym+1}, then one simply updates the current W by adding to it the matrix ym+1(xm+1)T. Similarly, an already recorded association {xi, yi} may be "erased" by simply subtracting yi(xi)T from W.

What are the requirements on the {xk, yk} associations which will guarantee the successful retrieval of all recorded vectors (memories) yk from their associated "perfect key" xk? Substituting Equation (7.1.2) in Equation (7.1.1) and assuming that the key xh is one of the xk vectors, we get an expression for the retrieved pattern as

ŷ = Wxh = ||xh||2 yh + Σ(k ≠ h) [(xk)T xh] yk    (7.1.4)

The first term on the right-hand side of Equation (7.1.4) is proportional to the desired memory yh, with a proportionality constant equal to the square of the norm of the key vector xh. The second term on the right-hand side of Equation (7.1.4) represents the "cross-talk" between the key xh and the remaining (m − 1) patterns xk. This term can be reduced to zero if the xk vectors are orthogonal, independent of the encoding of the yk (note, though, how the yk affect the cross-talk term if the xk's are not orthogonal). Hence, a sufficient condition for the retrieved memory to be the desired perfect recollection is to have orthonormal vectors xk. The price paid for simple correlation recording, however, is that to guarantee successful retrievals we must place stringent conditions on the set {xk, k = 1, 2, ..., m} of input vectors. If nonlinear units replace the linear ones in the correlation LAM, perfect recall of binary associations is, in general, possible even when the vectors xk are only pseudo-orthogonal, as is seen next.

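The correlation recording recipe and the cross-talk behavior of Equation (7.1.4) are easy to verify numerically. The following is a minimal NumPy sketch (our own illustration, not from the text; all names and sizes are arbitrary): with orthonormal keys the cross-talk term vanishes and recall is exact, while correlated keys corrupt the recollections.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L, m = 8, 4, 3  # key dimension, output dimension, number of stored pairs

# Orthonormal keys: take m columns of a random orthogonal matrix.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
X = Q[:, :m]                     # keys x^k stored as columns (orthonormal)
Y = rng.standard_normal((L, m))  # arbitrary real-valued recollections y^k

# Correlation recording recipe, Eq. (7.1.3): W = Y X^T
W = Y @ X.T

# Retrieval, Eq. (7.1.1): with orthonormal keys the cross-talk term of
# Eq. (7.1.4) is zero and every y^k is recalled perfectly.
Y_hat = W @ X
print(np.allclose(Y_hat, Y))     # True

# With non-orthogonal keys the cross-talk term corrupts the recall.
X2 = rng.standard_normal((n, m))
W2 = Y @ X2.T
print(np.allclose(W2 @ X2, Y))   # False
```

The single update/erase property is visible in the same notation: adding y(x)T to W stores one more pair without touching the others.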
A Simple Nonlinear Associative Memory Model

The assumption of binary-valued associations xk in {−1, +1}n and yk in {−1, +1}L, together with the presence of a clipping nonlinearity F operating componentwise on the vector Wx (i.e., each unit now employs a sgn or sign activation function) according to

y = F[Wx]    (7.1.5)

relaxes some of the constraints imposed by correlation recording of a LAM. Here, W needs to be synthesized with the requirement that only the signs of the corresponding components of yk and Wxk agree. This can be seen in the following analysis. Consider the normalized correlation recording recipe given by

W = (1/n) Σ(k = 1 to m) yk (xk)T    (7.1.6)

which automatically normalizes the xk vectors (note that the square of the norm of an n-dimensional bipolar binary vector is n). Now, if one of the recorded key patterns xh is presented as input, then the following expression for the retrieved memory pattern can be written:

ŷ = F[yh + Δh]    (7.1.7)

where Δh = (1/n) Σ(k ≠ h) [(xk)T xh] yk represents the cross-talk term. For the ith component of ŷ, it can be seen that the condition for perfect recall is given by the requirements

Δih > −1 when yih = +1

and

Δih < +1 when yih = −1

for i = 1, 2, ..., L. These requirements are less restrictive than the orthonormality requirement on the xk's in a LAM; hence, perfect recall of binary {xk, yk} associations is, in general, possible even when the vectors xk are only pseudo-orthogonal. Indeed, for large n with n >> m, it can be shown that a random set of m key vectors xk becomes mutually orthogonal with probability approaching unity (Kanter and Sompolinsky, 1987). Hence, the loading of the m associations {xk, yk}, with arbitrary yk, is assured using the normalized correlation recording recipe of Equation (7.1.6).

Uesaka and Ozeki (1972) and later Amari (1977a, 1990) [see also Amari and Yanai (1993)] analyzed the error correction capability of the above nonlinear correlation associative memory when the memory is loaded with m associations {xk, yk} whose bipolar binary vectors have independent, random, and uniformly distributed components. Here, Din is the normalized Hamming distance between a perfect key vector xk and a noisy version of xk, and Dout is the normalized Hamming distance between yk and the output of the associative memory due to that noisy input. The error rates Din and Dout may also be viewed as the probabilities of error of an arbitrary bit in the input and output vectors, respectively; Din may also be related to the "overlap" between xk and its noisy version. Based on this analysis, the relation between the output and input error rates, in the limit of large n, is given by

(7.1.8)

together with the approximation

(7.1.9)

(for insight into the derivation of Equations (7.1.8) and (7.1.9), the reader is referred to the analysis in Section 7.2.1). Equation (7.1.8) is plotted in Figure 7.2 for several values of the pattern ratio r = m/n. Note how the ability of the correlation memory to retrieve stored memories from noisy inputs is reduced as the pattern ratio r approaches and exceeds the value 0.15. For low loading levels (r < 0.15), the error correction capability of the memory improves, and for r << 1 the memory can correct up to 50 percent error in the input patterns with a probability approaching 1.

Figure 7.2. Output versus input error rates for a correlation memory with clipping (sgn) nonlinearity for various values of the pattern ratio r = m/n. Dotted lines correspond to the approximation in Equation (7.1.9).

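The behavior at low loading levels is simple to reproduce. The sketch below (our own illustration; the sizes n = 512 and m = 8 give a pattern ratio r of about 0.016, well below 0.15) records random bipolar memories with the normalized correlation recipe of Equation (7.1.6) and shows one-pass error correction through the clipping nonlinearity of Equation (7.1.5).

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 512, 8                          # pattern ratio r = m/n ~ 0.016 << 0.15
X = rng.choice([-1, 1], size=(n, m))   # random bipolar memories (autoassociative)

# Normalized correlation recording, Eq. (7.1.6): W = (1/n) sum_k x^k (x^k)^T
W = (X @ X.T) / n

# One-pass retrieval through the clipping nonlinearity, Eq. (7.1.5)
recall = np.sign(W @ X)
print(np.array_equal(recall, X))       # stored keys are recalled exactly

# Error correction: flip 10% of the bits of one key (D_in = 0.1)
x = X[:, 0].copy()
flip = rng.choice(n, size=n // 10, replace=False)
x[flip] *= -1
print(np.array_equal(np.sign(W @ x), X[:, 0]))  # the noise is removed in one pass
```

At this loading the cross-talk term has standard deviation of roughly sqrt(m/n), so the signal term dominates and both checks succeed with probability essentially 1; pushing m/n toward 0.15 makes the second check begin to fail, as Figure 7.2 predicts.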
Optimal Linear Associative Memory (OLAM)

The correlation recording recipe does not make optimal use of the LAM interconnection weights. A more optimal recording technique can be derived which guarantees perfect retrieval of stored memories yk from inputs xk as long as the set {xk, k = 1, 2, ..., m} is linearly independent (as opposed to the more restrictive requirement of orthogonality needed by the correlation-recorded LAM). This recording technique leads to the optimal linear associative memory (OLAM) (Kohonen and Ruohonen, 1973) and is considered next.

For perfect storage of m associations {xk, yk}, a LAM's interconnection matrix W must satisfy the matrix equation given by

Y = WX    (7.1.10)

with Y and X as defined earlier in this section. Equation (7.1.10) can always be solved exactly if all m vectors xk (columns of X) are linearly independent, which implies that m must be smaller than or equal to n. For the case m = n, the matrix X is square and a unique solution for W in Equation (7.1.10) exists, giving

W = YX−1    (7.1.11)

which requires that the matrix inverse X−1 exists; i.e., that the set {xk} is linearly independent. Now, returning to Equation (7.1.10) with the assumption m < n and linearly independent xk, it can be seen that an exact solution W* need not be unique; that is, we are free to choose any of the W* solutions satisfying Equation (7.1.10). In particular, the minimum Euclidean norm solution (Rao and Mitra, 1971; see Problem 7.6 for further details), given by

W* = Y(XTX)−1XT    (7.1.12)

is desirable since it leads to the best error-tolerant (optimal) LAM (Kohonen, 1984). Equation (7.1.12) will be referred to as the "projection" recording recipe, since the matrix-vector product (XTX)−1XTxk transforms the kth stored vector xk into the kth column of the m × m identity matrix. Hence, this solution guarantees the perfect recall of any yk upon the presentation of its associated key xk. Also, if the set {xk} is orthonormal, then XTX = I and Equation (7.1.12) reduces to the correlation recording recipe of Equation (7.1.3). An iterative version of the projection recording recipe in Equation (7.1.12) exists (Kohonen, 1984), based on Greville's theorem (Albert, 1972), which leads to the exact weight matrix W after exactly one presentation of each of the m vectors xk. This method is convenient since a new association can be learned (or an old association can be deleted) in a single update step without involving other earlier-learned memories. Other adaptive versions of Equation (7.1.12) can be found in Hassoun (1993).

OLAM Error Correction Capabilities

The error correcting capabilities of OLAMs [with the projection recording in Equation (7.1.12)] have been analyzed by Kohonen (1984) and Casasent and Telfer (1987), among others, for the case of real-valued associations. The following is a brief account of the key points of such analysis. Here, assume xk in Rn and yk in RL, where xk is the key of one of the m stored associations {xk, yk}, and let the input be a noisy key equal to xk plus a zero-mean noise vector. Denote the variances of the input and output noise by σ2in and σ2out, respectively; the expectation here is taken over the noise, and the output noise is measured on the components of the retrieved vector (note that, since the LAM retrieval operation is linear, zero-mean input noise produces zero-mean output noise).

For an autoassociative OLAM (yk = xk for all k) with a linearly independent set {xk}, the error correction measure σ2out/σ2in (Kohonen, 1984) is given by

σ2out/σ2in = m/n    (7.1.13)

Thus, for linearly independent key vectors (requiring m ≤ n), the OLAM always reduces the input noise (or, in the worst case when m = n, the input noise is not amplified). Note that the smaller m is relative to n, the better the noise suppression capability of the OLAM.

For the heteroassociative case (yk ≠ xk), it is assumed that yk and xk are not correlated and that the recollection vectors yk have equal energy E(y); also, both expectation operators are taken over the entire ensemble of possible recollection and key vectors, not over k. It can be shown (Casasent and Telfer, 1987) that the OLAM error correction is given by

(7.1.14)

where yij is the ijth element of matrix Y, the first expected value operator is taken over all elements of Y, and Tr( ) is the trace operator (which simply sums the diagonal elements of the argument matrix). Poor performance is to be expected when the matrix XTX is nearly singular [which leads to a large value for Tr(XTX)−1]. Equation (7.1.14) shows that the error correction in a heteroassociative OLAM depends on the choice not only of the key vectors xk but also of the recollection vectors. The reader should be warned, though, that this error correction measure is not suitable for heteroassociative OLAMs, because it is not normalized against variation in key/recollection vector energies; one can artificially reduce the value of this measure by merely reducing the energy of the recollection vectors yk (i.e., reducing E(y)). The reader is referred to Problem 7.11 for a more appropriate error correction measure for heteroassociative LAMs.

Error correction characteristics of nonlinear associative memories whose transfer characteristics are described by Equation (7.1.5), employing projection recording with uniformly distributed random bipolar binary key/recollection vectors, have also been analyzed. The reader is referred to the theoretical analysis by Amari (1977a) and the empirical analysis by Stiles and Denq (1987) for details.

Strategies for Improving Memory Recording

The encoding of the xk's, their number relative to their dimension, and the recording recipe employed highly affect the performance of an associative memory. Assuming that an associative memory architecture and a suitable recording recipe are identified, how can one improve associative retrieval and memory loading capacity? There are various strategies for enhancing associative memory performance. In the following, two example strategies are presented.

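As a concrete illustration of the projection (OLAM) recipe of Equation (7.1.12), the short NumPy sketch below (our own example; names and sizes are arbitrary) stores linearly independent but deliberately correlated keys. Projection recording recalls them perfectly, while correlation recording fails on the same data.

```python
import numpy as np

rng = np.random.default_rng(2)
n, L, m = 16, 5, 6
# Linearly independent but deliberately non-orthogonal keys:
# a common offset makes the columns of X strongly correlated.
X = rng.standard_normal((n, m)) + 0.8
Y = rng.standard_normal((L, m))

# Projection (OLAM) recording, Eq. (7.1.12): W = Y (X^T X)^{-1} X^T,
# computed here via the Moore-Penrose pseudoinverse.
W_olam = Y @ np.linalg.pinv(X)
print(np.allclose(W_olam @ X, Y))       # True: perfect recall

# Plain correlation recording, Eq. (7.1.3), fails since X^T X != I.
W_corr = Y @ X.T
print(np.allclose(W_corr @ X, Y))       # False
```

For full-column-rank X, `np.linalg.pinv(X)` equals (XTX)−1XT, so the code is a direct transcription of the recipe.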
One strategy involves the use of a multiple training method (Wang et al., 1990), which emphasizes unsuccessfully stored associations by reintroducing them to the weight matrix W through multiple recording passes until all associations are recorded. This strategy is potentially useful when correlation recording is employed. Mathematically speaking, it is assumed that there exists a weighted correlation matrix which can store all desired association pairs; in this case, the interconnection matrix is equivalent to a weighted correlation matrix with different weights on different association pairs. Alternatively, one may employ a recording technique which synthesizes W such that the association error is minimized over all m association pairs. Here, the desired solution is the one which minimizes the SSE criterion function J(W) given by

J(W) = Σ(k = 1 to m) Σ(i = 1 to L) (yik − ŷik)2    (7.1.15)

where yik and ŷik are the ith components of the desired memory yk and of the estimated one, respectively. Now, by setting the gradient of J(W) to zero and solving for W, one arrives at the following memory storage recipe:

W = YX†    (7.1.16)

where X† is the pseudo-inverse of matrix X (Penrose, 1955). This solution assumes that the inverse of the matrix XXT exists.

Another way to enhance the error correction capability of an associative memory is to augment the set of association pairs {xk, yk} to be stored with a collection of associations of the form {x̃k, yk}, where x̃k represents a noisy version of xk. For instance, several noisy versions of the key vector for each desired association pair may be added. This strategy arises naturally in training pattern classifiers and is useful in enhancing the robustness of associative retrieval. The strategy of adding specialized associations increases the number of associations to be stored and may result in m > n; thus, it is not well suited for correlation LAMs. User-defined specialized associations may also be utilized to improve associative memory performance; the addition of specialized association pairs may be employed when specific associations must be introduced (or eliminated). One possibility of employing this strategy is when a "default memory" is required. This memory is retrieved when highly corrupted noisy input key vectors are input to the associative memory, thus preventing these undesirable inputs from causing the associative memory to retrieve the "wrong" memories. For example, for associations encoded such that sparse input vectors si have low information content, augmenting the original set of associations with associations of the form {si, 0} during recording leads to the creation of a default "no-decision" memory 0.

7.1.2 Dynamic Associative Memories (DAMs)

Associative memory performance can be improved by utilizing more powerful architectures than the simple ones considered above. As an example, consider the autoassociative version of the single-layer associative memory employing units with the sign activation function, whose transfer characteristics are given by Equation (7.1.5). Assume that this memory is capable of associative retrieval of a set of m bipolar binary memories {xk}. Upon the presentation of a key x̃k which is a noisy version of one of the stored memory vectors xk, the associative memory retrieves (in a single pass) an output y which is closer to the stored memory xk than x̃k; that is, only a fraction of the noise (error) in the input vector is corrected in the first pass (presentation). Intuitively, we may then proceed by taking the output y and feeding it back as an input to the associative memory, hoping that a second pass would eliminate more of the input noise. This process could continue with more passes until all errors are eliminated and a final output y equal to xk is reached. The retrieval procedure just described amounts to constructing a recurrent associative memory with the synchronous (parallel) dynamics given by

x(t + 1) = F[Wx(t)]    (7.1.17)

where t = 0, 1, 2, 3, ..., and x(0) is the initial state of the dynamical system, which is set equal to the noisy key x̃k. Note that a necessary requirement for such convergent dynamics is system stability. In this case, we should synthesize W (which, in this simple case, is the set of all free parameters wij of the dynamical system) so that, starting from any initial state x(0), the dynamical associative memory converges to the "closest" memory state xk; that is, the set of memories {xk} must correspond to stable states (attractors) of the dynamical system in Equation (7.1.17), as will be seen later. In the following, various variations of the above dynamical associative memory are presented and their stability is analyzed.

Continuous-Time Continuous-State Model

Consider the nonlinear active electronic circuit shown in Figure 7.3. This circuit consists of resistors, capacitors, ideal current sources, and identical nonlinear amplifiers, and is known as the Hopfield net; it can be thought of as a single-layer neural net of continuous nonlinear units with feedback. The ith unit in this circuit is shown in Figure 7.4. Each amplifier provides an output voltage xi given by f(ui), where ui is the input voltage and f is a differentiable, monotonically increasing nonlinear activation function, such as tanh(ui). Each amplifier is also assumed to provide an inverting terminal for producing the output −xi. The resistor Rij connects the output voltage xj (or −xj) of the jth amplifier to the input of the ith amplifier; hence, the weights set by the conductances 1/Rij play the role of interconnection weights. For proper associative retrieval, positive as well as "negative" resistors are required; connecting a resistor Rij to −xj helps avoid the complication of actually realizing negative resistive elements in the circuit. The current Ii represents an external input signal (or bias) to amplifier i. The R and C are positive quantities and are assumed equal for all n amplifiers. The dynamical equations describing the evolution of the ith state xi, i = 1, 2, ..., n, in the Hopfield net can be derived by applying Kirchhoff's current law to the input node of the ith amplifier as

C (dui/dt) = Σ(j) wij xj − ui/R + Ii    (7.1.18)

which can also be written as

C (dui/dt) = Σ(j) wij f(uj) − ui/R + Ii    (7.1.19)

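A forward-Euler discretization makes the continuous-time retrieval easy to watch. The sketch below is our own illustration (the gain, step size, and network sizes are arbitrary choices, bias currents are set to zero, and f(u) = tanh(beta*u) plays the role of the high-gain amplifier): started from a corrupted memory, the state of Equation (7.1.19) relaxes onto the stored bipolar pattern.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 100, 3
X = rng.choice([-1, 1], size=(n, m))       # bipolar memories
W = (X @ X.T) / n                          # correlation recording
np.fill_diagonal(W, 0.0)

beta, R, C, dt = 10.0, 1.0, 1.0, 0.01      # high amplifier gain
f = lambda u: np.tanh(beta * u)            # amplifier transfer function

# Noisy key: flip 10 bits of the first memory
x0 = X[:, 0].astype(float)
x0[rng.choice(n, 10, replace=False)] *= -1
u = 0.1 * x0                               # initial amplifier input voltages

# Forward-Euler integration of Eq. (7.1.19) with zero bias:
# C du/dt = W f(u) - u/R
for _ in range(2000):
    u += (dt / C) * (W @ f(u) - u / R)

overlap = (np.sign(f(u)) == X[:, 0]).mean()
print(overlap)
```

At this low loading (m/n = 0.03) the fixed point reached has the sign pattern of the stored memory, so the printed overlap is 1.0 or very close to it.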
The above Hopfield net can be considered as a special case of a more general dynamical network developed and studied by Cohen and Grossberg (1983), whose ith state dynamics take a more general form. In terms of the circuit elements, the interconnection weights are given by wij = 1/Rij (or wij = −1/Rij if the inverting output of unit j is connected to unit i).

Figure 7.3. Circuit diagram for an electronic dynamic associative memory.

Figure 7.4. Circuit diagram for the ith unit of the associative memory in Figure 7.3.

The overall dynamics of the Hopfield net can be described in compact matrix form as

C (du/dt) = −Λu + WF(u) + I    (7.1.21)

where C = CI (I is the identity matrix), Λ = diag(1/R, 1/R, ..., 1/R), I = [I1 I2 ... In]T, x = F(u) = [f(u1) f(u2) ... f(un)]T is the net's output state, and W is the interconnection matrix of the wij defined above. The equilibria of the dynamics in Equation (7.1.21) are determined by setting du/dt = 0, giving (with R taken as unity, without loss of generality)

u = Wx + I = WF(u) + I    (7.1.22)

A sufficient condition for the Hopfield net to be stable is that the interconnection matrix W be symmetric (Hopfield, 1984). Furthermore, Hopfield showed that the stable states of the network are the local minima of the bounded computational energy function (Liapunov function)

E(x) = −(1/2) xTWx − ITx + (1/R) Σ(j) ∫(0 to xj) f−1(s) ds    (7.1.23)

where f−1 is the inverse of the activation function xj = f(uj). Note that the value of the right-most term in Equation (7.1.23) depends on the specific shape of the nonlinear activation function f. For high gain approaching infinity, the amplifiers in the Hopfield net become threshold elements; i.e., f(uj) approaches the sign function. In this case, the computational energy function becomes approximately the quadratic function

E(x) = −(1/2) xTWx − ITx    (7.1.24)

It has been shown (Hopfield, 1984) that the only stable states of the high-gain, continuous-time, continuous-state system in Equation (7.1.21) are the corners of the hypercube; that is, the local minima of Equation (7.1.24) are states x* in {−1, +1}n. For large but finite amplifier gains, the third term in Equation (7.1.23) begins to contribute. The sigmoidal nature of f(u) leads to a large positive contribution near the hypercube boundaries, but a negligible contribution far from the boundaries. This causes a slight drift of the stable states toward the interior of the hypercube.

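The claim that recorded memories sit at local minima of the high-gain energy can be checked directly. The following sketch (our own illustration; zero bias is assumed, so Equation (7.1.24) reduces to E(x) = −(1/2)xTWx) verifies that flipping any single bit of a correlation-recorded memory raises the energy, i.e., each memory is a local minimum over the hypercube corners.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 256, 5
X = rng.choice([-1, 1], size=(n, m))       # stored bipolar memories
W = (X @ X.T) / n
np.fill_diagonal(W, 0.0)

E = lambda x: -0.5 * x @ W @ x             # Eq. (7.1.24) with zero bias

# Every single-bit flip away from a stored memory should raise the energy.
all_local_minima = True
for k in range(m):
    x = X[:, k].astype(float)
    base = E(x)
    for i in range(n):
        x[i] *= -1                         # flip bit i
        if E(x) <= base:
            all_local_minima = False
        x[i] *= -1                         # flip it back
print(all_local_minima)
```

The flip in bit i changes the energy by 2 xi (Wx)i, which at a stored memory is dominated by the positive signal term at this low loading, so the check prints True.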
Another way of looking at the Hopfield net is as a gradient system which searches for local minima of the energy function E(x) defined in Equation (7.1.23). To see this, simply take the gradient of E with respect to the state x and compare it to Equation (7.1.21); then, by equating terms, we have the following gradient system:

C (du/dt) = −∇E(x)    (7.1.25)

We first note that the equilibria of the system described by Equation (7.1.25) correspond to local minima (or maxima, or points of inflection) of E(x), since du/dt = 0 means ∇E(x) = 0. For each isolated local minimum x*, there exists a neighborhood over which the candidate function V(x) = E(x) − E(x*) has continuous first partial derivatives and is strictly positive except at x*, where V(x*) = 0. Additionally,

dV/dt = dE/dt = −C Σ(j) (duj/dt)2 (dxj/duj)    (7.1.26)

is always negative [since dxj/duj is always positive because of the monotonically nondecreasing nature of the relation xj = f(uj)] or zero at x*. Hence V is a Liapunov function, and x* is asymptotically stable. The gradient system in Equation (7.1.25) thus converges asymptotically to an equilibrium state which is a local minimum or a saddle point of the energy E (Hirsch and Smale, 1974); fortunately, the unavoidable noise in practical applications prevents the system from staying at the saddle points, and convergence to a local minimum is achieved.

The operation of the Hopfield net as an autoassociative memory is straightforward. Given a set of memories {xk}, the interconnection matrix W is encoded such that the xk's become local minima of the Hopfield net's energy function E(x). Then, when the net is initialized with a noisy key x̃k, its output state evolves along the negative gradient of E(x) until it reaches the closest local minimum which, hopefully, is one of the fundamental memories xk. The synthesis of W can be done according to the correlation recording recipe of Equation (7.1.3) or the more optimal recipe in Equation (7.1.12). These recording recipes lead to symmetrical W's (since autoassociative operation is assumed, i.e., yk = xk for all k), which guarantees the stability of retrievals. When used as a DAM, the Hopfield net is operated with very high activation function gains and with binary-valued stored memories. Note that the external bias may be eliminated in such DAMs. The elimination of bias, the symmetric W, and the use of high-gain amplifiers in such DAMs lead to the truncated energy function

E(x) = −(1/2) xTWx    (7.1.27)

Additional properties of these DAMs are explored in Problems 7.13 through 7.15. In general, however, E(x) will have additional local minima other than the desired ones encoded in W; these additional undesirable stable states are referred to as spurious memories.

Discrete-Time Continuous-State Model

An alternative model for retrieving the stable states (attractors) can be derived by employing the relaxation method (also known as the fixed-point method) for iteratively solving Equation (7.1.22) (Cichocki and Unbehauen, 1993). Here, an initial guess x(0) for an attractor state is used as the initial search point in the relaxation search. Starting from Equation (7.1.22) with Λ = I (without loss of generality) and recalling that x = F(u), we may write the relaxation equation

u(k + 1) = WF[u(k)] + I    (7.1.28)

or, by solving for x(k + 1),

x(k + 1) = F[Wx(k) + I],  k = 0, 1, 2, ...    (7.1.29)

Equation (7.1.29) describes the dynamics of a discrete-time continuous-state synchronously updated DAM. Note that Equation (7.1.29) is identical to Equation (7.1.17), which was intuitively derived, except that the unit activations in the above relaxation model are of sigmoid type, as opposed to the threshold type (sgn) assumed in Equation (7.1.17). Also, when the unit activations are piecewise linear, Equation (7.1.29) leads to a special case of the BSB model, discussed in Section 7.4. The parallel update nature of this DAM is appealing since it leads to faster convergence (in software simulations) and easier hardware implementations as compared to the continuous-time Hopfield model. However, the stability results obtained earlier for the continuous-time DAM do not necessarily hold for the discrete-time versions. Moreover, most current implementations of continuous-time neural networks are done using computer simulations, which are necessarily discrete-time implementations. So it is important to have a rigorous discrete-time analysis of the stability of the dynamics in Equation (7.1.29). Marcus and Westervelt (1989) showed that the function

E(k) = −(1/2) x(k)TWx(k) − ITx(k) + Σ(j) G[xj(k)]    (7.1.30)

where

G(x) = ∫(0 to x) f−1(s) ds    (7.1.31)

is a Liapunov function for the DAM in Equation (7.1.29) when W is symmetric and the activation gain β [e.g., f(x) = tanh(βx)] satisfies the condition

βλmin > −1    (7.1.32)

Here, λmin is the smallest eigenvalue of the interconnection matrix W. If W has no negative eigenvalues, then, since β > 0, Equation (7.1.32) is satisfied by any value of β. On the other hand, if W has one or more negative eigenvalues, then λmin < 0, and Equation (7.1.32) places an upper limit on the gain β for stability. To prove that E in Equation (7.1.30) is a Liapunov function when Equation (7.1.32) is satisfied, consider the change in E between two discrete time steps:

ΔE(k) = E(k + 1) − E(k)    (7.1.33)

Using the update equation for xi from Equation (7.1.29) and the symmetry property of W, Equation (7.1.33) becomes

(7.1.34)

The last term in Equation (7.1.34) is related to Δxi(k) = xi(k + 1) − xi(k) by an inequality (Marcus and Westervelt, 1989)

(7.1.35)

where δij = 1 for i = j and δij = 0 otherwise. Combining Equations (7.1.34) and (7.1.35) leads to

(7.1.36)

where Δx(k) = x(k + 1) − x(k). The requirement that the matrix appearing in Equation (7.1.36) be positive definite is satisfied by the inequality of Equation (7.1.32). Now, if this matrix is positive definite, then ΔE(k) ≤ 0 (equality holds only when Δx(k) = 0, which implies that the network has reached an attractor). This result, combined with the fact that E(k) is bounded, shows that the function in Equation (7.1.30) is a Liapunov function for the DAM in Equation (7.1.29), and thus the DAM is stable. If, on the other hand, the inequality of Equation (7.1.32) is violated, then it can be shown that the DAM can develop period-2 limit cycles (Marcus and Westervelt, 1989).

Discrete-Time Discrete-State Model

Starting with the dynamical system in Equation (7.1.29) and replacing the continuous activation function by the sign function, one arrives at the discrete-time discrete-state parallel (synchronous) updated DAM model, where all states xi(k), i = 1, 2, ..., n, are updated simultaneously according to

xi(k + 1) = sgn[Σ(j) wij xj(k)]    (7.1.37)

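The gain condition of Equation (7.1.32), and the period-2 behavior when it is violated, can be seen in a tiny example. The two-unit weight matrix below is our own construction: its smallest eigenvalue is −1, so parallel updates with f(x) = tanh(βx) are predicted to be stable only for β < 1.

```python
import numpy as np

# Symmetric two-unit net with eigenvalues +1 and -1 (lambda_min = -1),
# so Eq. (7.1.32) predicts parallel-update stability only for beta < 1.
W = np.array([[0.0, -1.0],
              [-1.0, 0.0]])

def iterate(beta, steps=200):
    x = np.array([0.5, 0.5])          # start along the eigenvector of -1
    for _ in range(steps):
        x_prev, x = x, np.tanh(beta * (W @ x))
    return x_prev, x

x_prev, x = iterate(0.5)              # beta < 1: contracts to a fixed point
print(np.linalg.norm(x) < 1e-6)       # True

x_prev, x = iterate(3.0)              # beta > 1: a period-2 limit cycle
print(np.allclose(x, -x_prev))        # True: x(k+1) = -x(k)
```

For β = 0.5 the state shrinks toward the origin at every double step; for β = 3 the state settles onto the two-cycle between (a, a) and (−a, −a) with a = tanh(3a), exactly the oscillation Marcus and Westervelt describe.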
Another version of this DAM is one which operates in a serial (asynchronous) mode. It assumes the same dynamics as Equation (7.1.37) for the ith unit, but only one unit updates its state at a given time. The unit which updates its state is chosen randomly and independently of the times of firing of the remaining (n − 1) units in the DAM. This asynchronously updated discrete-state DAM is commonly known as the discrete Hopfield net, which was originally proposed and analyzed by Hopfield (1982). In its original form, this net was proposed as an associative memory which employed the correlation recording recipe for memory storage, with wii = 0. It can be shown (see Problem 7.1.17) that the discrete Hopfield net with a symmetric interconnection matrix (wij = wji) and with nonnegative diagonal elements (wii ≥ 0) is stable, with the same Liapunov function as that of a continuous-time Hopfield net in the limit of high amplifier gain; i.e., it has the Liapunov function in Equation (7.1.24). Hopfield (1984) showed that both nets (discrete and continuous nets with the above assumptions) have identical energy maxima and minima. This implies that there is a one-to-one correspondence between the memories of the two models. Also, since the two models may be viewed as minimizing the same energy function E, one would expect the macroscopic behavior of the two models to be very similar; i.e., both models will perform similar memory retrievals.

7.2 DAM Capacity and Retrieval Dynamics

In this section, the capacity and retrieval characteristics of the autoassociative DAMs introduced in the previous section are analyzed. Correlation-recorded DAMs are considered first, followed by projection-recorded DAMs.

7.2.1 Correlation DAMs

DAM capacity is a measure of the ability of a DAM to store a set of m unbiased random binary patterns (that is, the vector components are independent random variables taking values +1 or −1 with probability 1/2) and at the same time be capable of associative recall (error correction). A commonly used capacity measure, known as "relative" capacity, is an upper bound on the pattern ratio α = m/n such that the fundamental memories or their "approximate" versions are attractors (stable equilibria). It has been shown (Amari, 1977a; Hopfield, 1982; Amit et al., 1985; McEliece et al., 1987; Amari and Maginu, 1988; Newman, 1988) that if most of the memories in a correlation-recorded discrete Hopfield DAM are to be remembered approximately (i.e., nonperfect retrieval is allowed), then α must not exceed 0.15. This value is the relative capacity of the DAM. This capacity measure, though, does not assume any error correction behavior; i.e., it does not require that the fundamental memories xk be attractors with associated basins of attraction.

Another result on the capacity of this DAM, for the case of error-free memory recall by one-pass parallel convergence, is given (in probability) by the "absolute" capacity (Weisbuch and Fogelman-Soulié, 1985; McEliece et al., 1987), expressed as the limit

(7.2.1)

Equation (7.2.1) indicates that the absolute capacity, expressed as a pattern ratio m/n, approaches zero as n approaches infinity! Thus, with a probability approaching 1, the correlation-recorded discrete Hopfield net is an inefficient DAM model.

The absolute capacity result in Equation (7.2.1) is derived below. Assuming yk = xk (autoassociative case) in Equation (7.1.7), with wii = 0, then by direct recall from the initial input x(0) = xh, with xh ∈ {xk}, the ith bit of the retrieved state x(1) is given by

xi(1) = sgn[ xih + Δi(0) ]  (7.2.2)

where Δi(0) = (1/n) Σ(k ≠ h) Σ(j ≠ i) xik xjk xjh is the cross-talk term.
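The signal-plus-cross-talk decomposition behind Equation (7.2.2) can be checked numerically by splitting the pre-activation Wxh into its two parts. A small NumPy sketch (sizes and seed are our illustrative choices; note that with a zero diagonal the exact finite-n signal coefficient is (n − 1)/n rather than 1):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, h = 48, 5, 0                          # recall memory h from a clean key
X = rng.choice([-1.0, 1.0], size=(m, n))

W = (X.T @ X) / n                           # correlation recording
np.fill_diagonal(W, 0.0)                    # w_ii = 0

xh = X[h]
u = W @ xh                                  # pre-activation of one-pass recall

signal = (n - 1) / n * xh                   # signal term (zero diagonal)
cross = np.zeros(n)                         # cross-talk from other memories
for k in range(m):
    if k == h:
        continue
    overlap = X[k] @ xh
    cross += (X[k] * overlap - xh) / n      # the j != i restriction removed

print(np.allclose(u, signal + cross))       # True: u = signal + cross-talk
```

The identity is exact; retrieval errors can occur only when a cross-talk component overwhelms the corresponding signal component.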

3.Consider the quantity Ci(0) = − i(0)i(0). The effects of a nonzero diagonal on DAM capacity are treated in Section 7. Hence.1.2. by integrating N(0. a correlation-recorded discrete Hopfield DAM must have its pattern ration.2. and if the ith bit xi(1) in Equation (7. using the fact that for a random variable x distributed according to N(0. the Ci(0) term is approximately distributed according to a normal distribution N(. for large m and n.2. . this inequality may be approximated as Perror < .7). and equivalently Ci(0). (1 − Perror)mn > 0. if Ci(0) is positive and larger than 1. ) from 1 to .99. then the condition 3 < 1 or must be satisfied. in one-pass. Employing the binomial expansion.005 which gives .e.5) By noting that ln mn < ln n2 (since n > m). defines the radius of attraction of a fundamental memory. On the other hand.8) is set to zero. 1985.4) Now.2.2.2. if all memories xk are required to be equilibria of the DAM with a probability close to 1. Thus. Ci(0) approaches N(0. 1970). leads us to Equation (7.4) and eliminating all constants and the ln factor (since they are dominated by higher-order terms) results in the bound (7.3) can be solved for the requirement Perror 0. i.. 2). ) asymptotically as m and n become large (Mosteller et al. say 0.99. 2) = N(0. Rather. Note that Equation (7. is times the sum of (n − 1)(m − 1) independent.7) in order that error-free one-pass retrieval of a fundamental memory (say xk) from random key patterns lying inside the Hamming hypersphere (centered at xk) of radius n ( < ) is achieved with probability approaching 1.3) where is the error function.2. Next. we may compute the probability that xi(1) is in error. Therefore.. On the other hand. According to this capacity measure. Now. ). The error probability for the ith bit is then given by Equation (7. we first note that .0014.6) which represents the absolute capacity of a zero-diagonal correlation-recorded DAM. assume that the stored memories are random. 
and thus it has a binomial distribution with zero mean and variance .2. by virtue of the Central Limit Theorem. This approximation can then be used to write the inequality Perror < as (7.1. if Ci(0) is negative then xi(0) and i(0) have the same sign and x the cross-talk term i(0) does no harm. Similarly. Therefore. by using a similar derivation to the one leading to Equation (7.2. one can readily show that the ith retrieved bit xi(1) is in error if and only if Ci(0) > 1 − 2.6). more useful. independently for each k and i. 1987). giving (7. To see this. m < 0. a random input x(0) is assumed that has an overlap with one of the stored memories.3) is a special case of Equation (7. In other words. This leads to which. satisfy (7. then the onepass retrieved bit xi(1) is in error. say xh.3) can be approximated using the asymptotic (x ) expansion .8) where Din in Equation (7. Equation (7.5) becomes (7.0014. Perror = Prob(Ci(0) > 1). with equal probability for and for . taking the natural logarithm (ln) of both sides of the inequality in Equation (7. McEliece et al. Equation (7.3) with the lower limit on the integral replaced by 1 − 2. is the largest normalized Hamming distance from a fundamental memory within which almost all of the initial states reach this fundamental memory.2. Now.4. Here. DAM capacity measure gives a bound on in terms of error correction and memory size (Weisbuch and Fogelman-Soulié. Another. a more stringent requirement on in Equation (7. Here.2. Noting that this stringent error correction requirement necessitates small values.2.2.2) with one difference that the input x(0) is not one of the stored memories. .2. The inequality in Equation (7.2.2) is required to be retrievable with error probability less than 0. then an upper bound on can be derived by requiring that all bits of all memories xk be retrievable with less than a 1 percent error.7) can be derived by starting with an equation similar to Equation (7.2..111n is required for Perror 0. 
and uniformly distributed random bipolar binary numbers.
.
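The Gaussian approximation behind Equation (7.2.3) can be tested by Monte Carlo simulation. The sketch below (parameter values are our illustrative choices) measures the one-pass bit error rate over many randomly recorded DAMs and compares it with the normal tail probability ½ erfc(1/√(2α)), where α = m/n:

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(3)
n, m, trials = 100, 25, 20                  # pattern ratio alpha = 0.25
alpha = m / n

errors, total = 0, 0
for _ in range(trials):
    X = rng.choice([-1.0, 1.0], size=(m, n))
    W = (X.T @ X) / n
    np.fill_diagonal(W, 0.0)
    X1 = np.where(X @ W >= 0, 1.0, -1.0)    # one-pass recall of every memory
    errors += int(np.sum(X1 != X))
    total += m * n

p_mc = errors / total                       # measured bit error rate
p_th = 0.5 * erfc(1.0 / sqrt(2.0 * alpha))  # Gaussian tail, sigma^2 = alpha
print(round(p_mc, 3), round(p_th, 3))       # both near 0.02
```

At α = 0.25 both numbers come out close to 2 percent, illustrating why pattern ratios near or above 0.15 rapidly degrade exact recall.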

The capacity analysis leading to Equations (7.2.6) and (7.2.7) assumed a single parallel retrieval iteration; i.e., starting from x(0) and retrieving x(1). Komlós and Paturi (1988) showed that if Equation (7.2.6) is satisfied, then the DAM is capable of error correction when multiple-pass retrievals are considered, independent of the retrieval mode (serial or parallel). In other words, they showed that each of the fundamental memories is an attractor with a basin of attraction surrounding it. They also showed that once initialized inside one of these basins of attraction, the state converges (in probability) to the basin's attractor in order ln(ln n) parallel steps. Burshtein (1993) took these results a step further by showing that the radius of the basin of attraction of each fundamental memory is ρn for any ρ < 1/2. He also showed that a relatively small number of parallel iterations, asymptotically independent of n, is required to recover a fundamental memory even when ρ is very close to 1/2 (e.g., for ρ = 0.499, at most 20 iterations are required).

The analysis leading to Equations (7.2.6) and (7.2.7) is not valid for the second or higher DAM iterations. Here, the analysis is more complicated due to the fact that x(k) becomes correlated with the stored memory vectors, and hence the statistical properties of the noise term Δi(k) in Equation (7.2.2) are more difficult to determine (in fact, such correlations depend on the whole history x(k − T), T = 0, 1, 2, ...), so the one-pass analysis cannot be applied starting from an arbitrary x(k). Amari and Maginu (1988) (see also Amari and Yanai, 1993) analyzed the transient dynamical behavior of memory recall under the assumption of a normally distributed Δi(k) [or Ci(k)] with mean zero and variance σ2(k). This variance was calculated by taking the direct correlations, up to two steps, between the bits of the stored memories and those in x(k). Under these assumptions, the relative capacity was found to be equal to 0.15. This theoretical value is in good agreement with early simulations reported by Hopfield (1982) and with the theoretical value of 0.14 reported by Amit et al. (1985) using a method known as the replica method.

When initialized with a key input x(0) lying outside the basins of attraction of fundamental memories, the discrete Hopfield DAM converges to one of an exponential (in n) number of spurious memories; these memories are linear combinations of fundamental memories (Amit et al., 1985; Komlos and Paturi, 1988). Thus, a large number of undesirable spurious states, which compete with fundamental memories for basins of attraction "volumes," are intrinsic to correlation-recorded discrete Hopfield DAMs.

Next, the capacity and retrieval characteristics of two analog (continuous-state) correlation-recorded DAMs (one continuous-time and the other discrete-time parallel updated), based on models introduced in the previous section, are analyzed. The first analog DAM considered here will be referred to as the continuous-time DAM in the remainder of this section. It is obtained from Equation (7.1.21) by setting the capacitance and scaling matrices to the identity I and the bias to 0. Its interconnection matrix W is defined by the autocorrelation version of Equation (7.1.6) with zero diagonal. The second analog DAM is obtained from Equation (7.1.29) with zero bias. It employs the normalized correlation recording recipe for W, with zero diagonal, as for the continuous-time DAM. This latter analog DAM will be referred to as the discrete-time DAM in this section. Both DAMs employ the hyperbolic tangent activation function with gain β. The dynamics of these two DAMs have been studied in terms of the gain β and the pattern ratio α for unbiased random bipolar stored memories (Amit et al., 1985; Gardner, 1988; Kühn et al., 1991; Marcus et al., 1990; Shiino and Fukai, 1990; Waugh et al., 1990, 1993).

Figure 7.2.1 shows two analytically derived phase diagrams, valid in the limit of large n (Marcus et al., 1990), for the continuous-time and discrete-time DAMs. These diagrams indicate the type of attractors as a function of the activation function gain β and the pattern ratio α. For the continuous-time DAM, the diagram (Figure 7.2.1a) shows three regions labeled origin, spin glass, and recall. In the origin region, the DAM has the single attractor state x = 0. In the spin glass region (named so because of the similarity to the dynamical behavior of simple models of magnetic material (spin glasses) in statistical physics), the only attractors are spurious states; i.e., the desired memories are no longer attractors. In the recall region, the DAM is capable of either exact or approximate associative retrieval of stored memories. Here, a set of 2m attractors exists, each having a large overlap (inner product) with a stored pattern or its inverse (the stability of the inverse of a stored memory is explored in Problem 7.1.16). This region also contains attractors corresponding to spurious states that have negligible overlaps with the stored memories (or their inverses).
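The multiple-pass error correction described above can be demonstrated directly: operate well below the absolute capacity, flip a fraction ρ of a stored memory's bits, and iterate. A sketch (sizes, seed, and ρ are our arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 256, 5                       # alpha = m/n ~ 0.02, well below capacity
X = rng.choice([-1.0, 1.0], size=(m, n))
W = (X.T @ X) / n
np.fill_diagonal(W, 0.0)

rho = 0.2                           # flip 20 percent of the key's bits
x = X[0].copy()
flip = rng.choice(n, size=int(rho * n), replace=False)
x[flip] *= -1.0

for _ in range(10):                 # multiple-pass parallel retrieval
    x = np.where(W @ x >= 0, 1.0, -1.0)

print(np.array_equal(x, X[0]))      # True: the corrupted key is corrected
```

In practice most of the correction happens in the first one or two parallel passes, consistent with the fast flows reported in the literature cited above.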

(a)

(b)

Figure 7.2.1. Phase diagrams for correlation-recorded analog DAMs with a hyperbolic tangent activation function for (a) continuous-time and (b) parallel discrete-time updating. (Adapted from C. M. Marcus et al., 1990, with permission of the American Physical Society.)

The boundary separating the recall and spin glass regions determines the relative capacity of the DAM. This boundary has been computed by methods that combine the Liapunov function approach with the statistical mechanics of disordered systems. In the limit of high gain (β > 10), this boundary asymptotes at the relative capacity of the correlation-recorded discrete Hopfield DAM analyzed earlier in this section. This result supports the arguments presented at the end of Section 7.1.2 on the equivalence of the macroscopic dynamics of the discrete Hopfield net and the continuous-time, high-gain Hopfield net.

The phase diagram for the discrete-time DAM is identical to the one for the continuous-time DAM except for the presence of a fourth region, marked oscillation, in Figure 7.2.1b. In this region, the stability condition in Equation (7.1.32) is violated (note that the zero-diagonal autocorrelation weight matrix can have negative eigenvalues). Associative retrieval and spurious states may still exist in this region (especially if α is small), but the DAM can also become trapped in period-2 limit cycles (oscillations). The boundary between the spin glass and origin regions in Figures 7.2.1(a) and (b) is given by the expression

(7.2.8)

This expression can be derived by performing a local stability analysis about the equilibrium point x* = 0 that defines the origin region. This method is explored in Problems 7.2.2 and 7.2.3 for the continuous-time and discrete-time DAMs, respectively.

The associative retrieval capability of the above analog DAMs can vary considerably even within the recall region. It has been shown analytically and empirically that the basin of attraction of fundamental memories increases substantially as the activation function gain decreases with fixed pattern ratio α. The reason behind the improved DAM performance as gain decreases is that the Liapunov function becomes smoother, so that shallow local minima are eliminated. Waugh et al. (1991, 1993) showed that the number of local minima (and thus spurious states) in the Liapunov function of the above DAMs increases exponentially as exp(ng), where g is monotonically increasing in β. Therefore, even a small decrease in β can lead to a substantial reduction in the number of local minima, especially the shallow ones, which correspond to spurious memories. Since the fundamental memories tend to lie in wide, deep basins, essentially all of the local minima eliminated correspond to spurious memories. This phenomenon is termed deterministic annealing, and it is reminiscent of what happens as temperature increases in simulated annealing (the reader is referred to the next chapter for a discussion of annealing methods in the context of neural networks). On the other hand, error correction capabilities cease to exist at activation gains close to or smaller than 1, even as α approaches zero.
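While the expression for the origin boundary is not reproduced here, the mechanism behind it is a linearization about x* = 0: the origin attracts as long as the gain times the largest eigenvalue of W stays below 1. A sketch assuming the parallel discrete-time dynamics x(k+1) = tanh(βWx(k)) (sizes and seed are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 300, 30                      # alpha = 0.1
X = rng.choice([-1.0, 1.0], size=(m, n))
W = (X.T @ X) / n
np.fill_diagonal(W, 0.0)

lam_max = np.linalg.eigvalsh(W)[-1]         # largest eigenvalue of symmetric W

def settles_to_origin(beta, steps=200):
    x = 1e-3 * rng.standard_normal(n)       # small perturbation of x = 0
    for _ in range(steps):
        x = np.tanh(beta * (W @ x))
    return bool(np.linalg.norm(x) < 1e-6)

# Below the linear-stability threshold the origin attracts; above it, it does not
print(settles_to_origin(0.5 / lam_max), settles_to_origin(2.0 / lam_max))
```

Sweeping β and α in this manner traces out the origin boundary of the phase diagram numerically.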

7.2.2 Projection DAMs

The capacity and performance of autoassociative correlation-recorded DAMs can be greatly improved if projection recording is used to store the desired memory vectors (recall Equation (7.1.7), with Y = X). In particular, any set of memories can be memorized without errors as long as they are linearly independent (note that linear independence restricts m to be less than or equal to n). Projection DAMs are well suited for memorizing unbiased random vectors xk ∈ {−1, +1}n, since it can be shown that the probability that m (m < n) such vectors are linearly independent approaches 1 in the limit of large n (Komlós, 1967).

The relation between the radius of attraction ρ of fundamental memories and the pattern ratio α is a desirable measure of DAM retrieval/error correction characteristics. For correlation-recorded binary DAMs, such a relation has been derived analytically for single-pass retrieval and is given by Equation (7.2.7). On the other hand, deriving similar relations for multiple-pass retrievals and/or more complex recording recipes (such as projection recording) is a much more difficult problem. In such cases, numerical simulations with large n values (typically equal to several hundred) are a viable tool (e.g., see Kanter and Sompolinsky, 1987; Sompolinsky, 1988).

In the following, the retrieval properties of projection-recorded discrete-time DAMs are analyzed, assuming the usual unbiased random bipolar binary memory vectors. Here, the two versions of discrete-state DAMs, serially updated and parallel updated, and the parallel updated continuous-state DAM are discussed. For the remainder of this section, these three DAMs will be referred to as the serial binary, parallel binary, and parallel analog projection DAMs, respectively.

Figure 7.2.2, reported by Kanter and Sompolinsky (1987), shows plots of the radius of attraction ρ of fundamental memories versus the pattern ratio α, generated by numerical simulation for serial binary and parallel binary projection DAMs (multiple-pass retrieval is assumed). There are two pairs of plots, labeled a and b, in this figure. The pair labeled a corresponds to the case of a zero-diagonal projection weight matrix W, whereas pair b corresponds to the case of a projection matrix W with preserved diagonal. The solid and dashed lines in Figure 7.2.2 represent serial and parallel retrievals, respectively. According to these simulations, forcing the self-coupling terms wii in the diagonal of W to zero has a drastic effect on the size of the basins of attraction. A common feature of these discrete projection DAMs is the monotonic decrease of ρ from 0.5 to 0 as α increases from 0 to 1. Note that the error correction capability of fundamental memories ceases as α approaches and then exceeds 0.5 for both serial and parallel DAMs in the nonzero-diagonal case (Problem 7.2.8 explores this phenomenon). The corresponding DAMs with zero-diagonal projection matrix continue to have substantial error correction capabilities even after α exceeds 0.5, but ultimately lose these capabilities as α approaches 1. These results are similar to the theoretical ones reported for parallel updated correlation-recorded DAMs (Burshtein, 1993). Empirical results show that, inside the basin of attraction of stable states, the flows to the states are fast, with a maximum of 10 to 20 parallel iterations (corresponding to starting at the edge of the basin).

Figure 7.2.2. Measurements of ρ as a function of α by computer simulation for projection-recorded binary DAMs. Lines tagged a refer to a zero-diagonal projection matrix W; lines tagged b refer to a standard projection matrix. Solid lines refer to serial update with the specific order of updating described in the text; dashed lines refer to parallel update. The typical size of the statistical fluctuations is indicated. The lines are guides to the eye. (Adapted from I. Kanter and H. Sompolinsky, 1987, with permission of the American Physical Society.)
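Projection recording can be sketched in a few lines of NumPy. For memories stored as the rows of X, the matrix W = XT(XXT)−1X projects any state onto the span of the memories, so each linearly independent memory is an exact fixed point (sizes and seed are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 64, 20
X = rng.choice([-1.0, 1.0], size=(m, n))    # rows are the stored memories

# Projection recording: W projects onto the span of the (independent) memories
W = X.T @ np.linalg.inv(X @ X.T) @ X

print(np.allclose(W @ X.T, X.T))            # True: W x^k = x^k exactly
```

Because W x^k = x^k holds exactly (no cross-talk), sgn(Wx^k) = x^k as well, which is the error-free storage property claimed above for linearly independent memories.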

For the above parallel updated zero-diagonal projection DAM, simulations show that in almost all cases where the retrieval did not result in a fundamental memory, it resulted in a limit cycle of period two as opposed to a spurious memory. For the preserved-diagonal projection DAM, simulations with finite n and α show that no oscillations exist (Youssef and Hassoun, 1989). (The result of Problem 7.2.10, taken in the limit, can serve as an analytical proof of the nonexistence of oscillations for the case of large n.) The zero-diagonal serial projection DAM has the best performance, depicted by the solid line a in Figure 7.2.2. In this case, an approximate linear relation between ρ and α can be deduced from this figure as

(7.2.9)

Here, the serial update strategy employs an updating order such that the initial updates are more likely to reduce the Hamming distance (i.e., increase the overlap) between the DAM's state and the closest fundamental memory rather than increase it. In the simulations, the initial state x(0) tested had its first (1 − ρ)n bits identical to one of the stored vectors, say x1, while the remaining ρn bits were chosen randomly; thus, the latter region represents the bits where errors are more likely to occur. The serial update strategy used above allowed the units corresponding to the initially random bits to update their states before the ones having a correct match with x1. However, in practical applications, this update strategy may not be applicable (unless we have partial input keys that match a segment of one of the stored memories, such as a partial image), and hence a standard serial update strategy (e.g., updating the n states in some random or unbiased deterministic order) may be employed. Such standard serial updating leads to reduced error correction behavior compared to the particular serial update employed in the above simulations.
The performance, though, would still be better than that of a parallel updated DAM. Spurious memories do exist in the above projection DAMs. These spurious states are mixtures of the fundamental memories (just as in the correlation-recorded discrete Hopfield DAM) at very small α values. Above α ≈ 0.1, mixture states disappear; instead, most of the spurious states have very little overlap with individual fundamental memories. Lastly, consider the parallel analog projection DAM with zero-diagonal interconnection matrix W. This DAM has the phase diagram shown in Figure 7.2.3, showing origin, recall, and oscillation phases, but no spin glass phase (Marcus et al., 1990). The absence of the spin glass phase does not imply that this DAM has no spurious memories; just as for the correlation-recorded discrete Hopfield DAM, there are many spurious memories within the recall and oscillation regions which have small overlap with fundamental memories, especially for large α. However, there is no region where only spurious memories exist. Note also that in the oscillation region, all fundamental memories exist as stable equilibrium states with basins of attraction defined around each of them. The radius of each basin decreases (monotonically) as α increases, until ultimately all such memories lose their basins of attraction.

Figure 7.2.3. Phase diagram for the parallel updated analog projection DAM (with zero-diagonal W). (Adapted from C. M. Marcus et al., 1990, with permission of the American Physical Society.)

According to Equation (7.1.32), oscillations are possible in the dynamics of the present analog DAM when βλmin < −1, where λmin is the minimal eigenvalue of the interconnection matrix W. It can be shown (Kanter and Sompolinsky, 1987) that a zero-diagonal projection matrix which stores m unbiased random memory vectors xk ∈ {−1, +1}n has the extremal eigenvalues λmax = 1 − α and λmin = −α. Therefore, the oscillation region in the phase diagram of Figure 7.2.3 is defined by βα > 1. Also, by following an analysis similar to the one outlined in Problem 7.2.3, it can be shown that the origin point loses its stability when β(1 − α) > 1 for α ≤ 0.5 and when βα > 1 for α > 0.5. With these expressions, it can be easily seen that oscillation-free associative retrieval is possible up to α = 0.5 if the gain β is equal to 2. Adding a positive diagonal element d to W shifts the extremal eigenvalues λmin and λmax to d − α and 1 + d − α, respectively, and thus increases the value of α for oscillation-free associative retrieval of fundamental memories to a maximum value of α = 0.5 + d, which exists at β = 2. The recall regions for several values of d are shown in Figure 7.2.4.
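The extremal eigenvalues quoted above (λmax = 1 − α and λmin = −α for the zero-diagonal projection matrix) hold asymptotically, so for finite n the agreement is only approximate. A quick numerical check (sizes and seed are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 400, 100                             # alpha = 0.25
X = rng.choice([-1.0, 1.0], size=(m, n))

P = X.T @ np.linalg.inv(X @ X.T) @ X        # projection recording matrix
W = P - np.diag(np.diag(P))                 # force the diagonal to zero

eig = np.linalg.eigvalsh(W)                 # ascending eigenvalues
alpha = m / n
print(round(eig[0], 2), round(eig[-1], 2))  # near -alpha and 1 - alpha
```

Since the diagonal entries of P concentrate around α for random memories, zeroing them shifts the projection spectrum {0, 1} to approximately {−α, 1 − α}, which is the design intuition behind the oscillation boundary βα = 1.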

Here, one should note that the increase in the size of the recall region does not necessarily imply increased error correction. On the contrary, a large diagonal term greatly reduces the size of the basins of attraction of fundamental memories as was seen earlier for the binary projection DAM. The reader is referred to Section 7.4.3 for further exploration into the effects of the diagonal term on DAM performance.

Figure 7.2.4. The recall region of a parallel updated analog projection DAM for various values of the diagonal element d. (Adapted from C. M. Marcus et al., 1990, with permission of the American Physical Society.)

**7.3 Characteristics of High-Performance DAMs**

Based on the above analysis and comparison of DAM retrieval performance, a set of desirable performance characteristics can be identified. Figures 7.3.1 (a) and (b) present a conceptual diagram of the state space for high- and low-performance DAMs, respectively (Hassoun, 1993).

Figure 7.3.1. A conceptual diagram comparing the state space of (a) high-performance and (b) low-performance autoassociative DAMs.

The high-performance DAM in Figure 7.3.1(a) has large basins of attraction around all fundamental memories. It has a relatively small number of spurious memories, and each spurious memory has a very small basin of attraction. This DAM is stable in the sense that it exhibits no oscillations. The shaded background in the figure represents the region of state space for which the DAM converges to a unique ground state (e.g., the zero state). This ground state acts as a default "no decision" attractor to which unfamiliar or highly corrupted initial states converge. A low-performance DAM has one or more of the characteristics depicted conceptually in Figure 7.3.1(b). It is characterized by its inability to store all desired memories as fixed points; those memories which are stored successfully end up having small basins of attraction. The number of spurious memories is very high for such a DAM, and they have relatively large basins of attraction. This low-performance DAM may also exhibit oscillations. Here, an initial state close to one of the stored memories has a significant chance of converging to a spurious memory or to a limit cycle.

To summarize, high-performance DAMs must have the following characteristics (Hassoun and Youssef, 1989): (1) High capacity. (2) Tolerance to noisy and partial inputs; this implies that fundamental memories have large basins of attraction. (3) The existence of only relatively few spurious memories and few or no limit cycles, with negligible basins of attraction. (4) Provision for a "no decision" default memory/state; inputs with very low "signal-to-noise" ratios are mapped (with high probability) to this default memory. (5) Fast memory retrievals.

This list of high-performance DAM characteristics can act as performance criteria for comparing various DAM architectures and/or DAM recording recipes. The capacity and performance of DAMs can be improved by employing optimal recording recipes (such as the projection recipe) and/or proper state updating schemes (such as serial updating), as was seen in Section 7.2. Yet one may also improve the capacity and performance of DAMs by modifying their basic architecture or components. Such improved DAMs and other common DAM models are presented in the next section.

**7.4 Other DAM Models**

As compared to the above models, a number of more sophisticated DAMs have been proposed in the literature. Some of these DAMs are improved variations of the ones discussed above. Others, though, are substantially different models with interesting behavior. The following is a sample of such DAMs [for a larger sample of DAM models and a thorough analysis, the reader is referred to Hassoun (1993)].

7.4.1 Brain-State-in-a-Box (BSB) DAM

The "brain-state-in-a-box" (BSB) model (Anderson et al., 1977) is one of the earliest DAM models. It is a discrete-time continuous-state parallel updated DAM whose dynamics are given by

x(k + 1) = F[λx(k) + αWx(k) + Ī]  (7.4.1)

where the input key is presented as the initial state x(0) of the DAM. Here, λx(k), with 0 ≤ λ ≤ 1, is a decay term of the state x(k), and α is a positive constant which represents a feedback gain. The vector Ī = [I1 I2 ... In]T represents a scaled external input (bias) to the system, which persists for all time k. Some particular choices for Ī are Ī = 0 (i.e., no external bias) or a persistently applied version of the key input. The operation F(·) is a piecewise linear operator which maps the ith component of its argument vector according to

F(ui) = +1 for ui > 1, F(ui) = ui for −1 ≤ ui ≤ 1, and F(ui) = −1 for ui < −1  (7.4.2)

The BSB model gets its name from the fact that the state of the system is continuous and constrained to lie in the hypercube [−1, +1]n. When operated as a DAM, the BSB model typically employs an interconnection matrix W given by the correlation recording recipe to store a set of m n-dimensional bipolar binary vectors as attractors (located at corners of the hypercube [−1, +1]n). Here, one normally sets Ī = 0 and assumes the input to the DAM (i.e., x(0)) to be a noisy vector which may be anywhere in the hypercube [−1, +1]n. The performance of this DAM with random stored vectors, large n, and m << n has been studied through numerical simulations by Anderson (1993). These simulations particularly address the effects of the model parameters λ and α on memory retrieval.
The stability of the BSB model in Equation (7.4.1) with symmetric W, Ī = 0, and λ = 1 has been analyzed by several researchers, including Golden (1986), Greenberg (1988), Hui and Żak (1992), and Anderson (1993). In this case, the model reduces to

x(k + 1) = F[x(k) + αWx(k)]  (7.4.3)

Golden (1986, 1993) analyzed the dynamics of the system in Equation (7.4.3) and found that it behaves as a gradient descent system that minimizes the energy

E(x) = −(1/2) xTWx  (7.4.4)

He also proved that the dynamics in Equation (7.4.3) always converge to a local minimum of E(x) if W is symmetric and λmin ≥ 0 (i.e., W is positive semidefinite) or α < 2/|λmin|, where λmin is the smallest eigenvalue of W. Under these conditions, the stable equilibria of this model are restricted to the surface and/or vertices of the hypercube. It is interesting to note here that when this BSB DAM employs correlation recording (with the diagonal of W preserved), it always converges to a minimum of E(x) because of the positive semidefinite symmetric nature of the autocorrelation matrix. The following example illustrates the dynamics for a two-state zero-diagonal correlation-recorded BSB DAM.

Example 7.4.1: Consider the problem of designing a simple BSB DAM which is capable of storing the memory vector x = [+1 −1]T. One possible way of recording this DAM with x is to employ the normalized correlation recording recipe of Equation (7.1.6). This recording results in the symmetric weight matrix

W = |  0    −0.5 |
    | −0.5   0   |

after forcing the diagonal elements to zero. This matrix has the two eigenvalues λmin = −0.5 and λmax = 0.5. The energy function for this DAM is given by Equation (7.4.4) and is plotted in Figure 7.4.1. The figure shows two minima of equal energy at the state [+1 −1]T and its complement state [−1 +1]T, and two maxima of equal energy at [+1 +1]T and [−1 −1]T. Simulations using the BSB dynamics of Equation (7.4.3) are shown in Figure 7.4.2 for a number of initial states x(0). Using α = 1 and α = 0.3 resulted in convergence to one of the two minima of E(x), as depicted in Figures 7.4.2(a) and 7.4.2(b), respectively. The basins of attraction of these stable states are equal in size and are separated by the line x2 = x1. Note that the values of α used here satisfy the condition α < 2/|λmin| = 4. The effects of violating this condition on the stability of the DAM are shown in Figure 7.4.3, where α was set equal to 5. The figure depicts a limit cycle, or oscillation, between the two states of maximum energy, [−1 −1]T and [+1 +1]T. This limit cycle was generated by starting from x(0) = [0.9 0.7]T. Starting from x(0) = [0.9 0.6]T leads to convergence to the desired state [+1 −1]T, as depicted by the lower state trajectory in Figure 7.4.3. It is interesting to note how this state was reached by bouncing back and forth off the boundaries of the state space.
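Example 7.4.1 can be reproduced with a few lines of NumPy implementing the dynamics of Equation (7.4.3); the iteration count is an arbitrary choice of ours, while the weight matrix and initial states are those of the example:

```python
import numpy as np

# Zero-diagonal autocorrelation matrix of Example 7.4.1, storing x = [+1, -1]^T
W = np.array([[0.0, -0.5],
              [-0.5, 0.0]])

def bsb(x0, a, steps=100):
    """Iterate x(k+1) = F(x(k) + a*W x(k)); F clips each component to [-1, 1]."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = np.clip(x + a * (W @ x), -1.0, 1.0)
    return x

# a = 1 satisfies a < 2/|lambda_min| = 4: converges to the stored memory
print(bsb([0.9, 0.7], a=1.0))            # [ 1. -1.]

# a = 5 violates the condition: period-2 limit cycle [+1,+1] <-> [-1,-1]
x = bsb([0.9, 0.7], a=5.0)
print(x, np.clip(x + 5.0 * (W @ x), -1.0, 1.0))

# From x(0) = [0.9, 0.6] the a = 5 dynamics still reach [+1, -1]
print(bsb([0.9, 0.6], a=5.0))            # [ 1. -1.]
```

Tracing the a = 5 trajectory from [0.9, 0.6] step by step shows the state repeatedly hitting the box boundary before settling, matching the bouncing behavior noted in the example.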

Figure 7.4.1. A plot of the energy function E(x) for the BSB DAM of Example 7.4.1. There are two minima with energy E = −0.5 at the states [+1 −1]T and [−1 +1]T, and two maxima with energy E = 0.5 at [+1 +1]T and [−1 −1]T.

(a)

(b) Figure 7.4.2. State space trajectories of a two-state BSB DAM, which employs a zero-diagonal autocorrelation weight matrix to store the memory vector x = [+1 −1]T. The resulting weight matrix is symmetric with λmin = −0.5 and λmax = 0.5. (a) β = 1, and (b) β = 0.3. Circles indicate state transitions. The lines are used as guides to the eye.

Figure 7.4.3. State space trajectories of the BSB DAM of Figure 7.4.2, but with β = 5. The limit cycle (top trajectory) was obtained by starting from x(0) = [0.9 0.7]T. The converging dynamics (bottom trajectory) was obtained by starting from x(0) = [0.9 0.6]T. Greenberg (1988) showed the following interesting BSB DAM property: all vertices of a BSB DAM are attractors (asymptotically stable equilibria) if

wii ≥ Σj≠i |wij| ,   i = 1, 2, ..., n   (7.4.5)

Equation (7.4.5) defines what is referred to as a "strongly" row diagonally dominant matrix W. As an example, it is noted that the BSB DAM with W = I has all of its vertices as attractors. For associative memories, though, it is not desired to have all 2^n vertices of the hypercube as attractors. Therefore, a row diagonally dominant weight matrix is to be avoided (recall that the interconnection matrix in a DAM is usually treated by forcing its diagonal to zero). A more general result concerning the stability of the vertices of the BSB model in Equation (7.4.1) was reported by Hui and Żak (1992, 1993). They showed that if a similar dominance condition (one involving the bias inputs) holds for i = 1, 2, ..., n, then all vertices of the bipolar hypercube are asymptotically stable equilibrium points. Here, W need not be symmetric and Ii is an arbitrary constant bias input to the ith unit. Hui and Żak also showed that, if W is symmetric, a hypercube vertex x* which satisfies the condition (7.4.6) is a stable equilibrium. Here, [ ]i signifies the ith component of the argument vector. Equation (7.4.6) is particularly useful in characterizing the capacity of a zero-diagonal correlation-recorded BSB DAM where m unbiased and independent random vectors xk in {−1, +1}^n are stored. Let xh be one of these vectors and substitute it in Equation (7.4.6). Assuming the DAM is receiving no bias, the inequality in Equation (7.4.6) becomes (7.4.7) or (7.4.8)
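Greenberg's dominance condition and its consequence for W = I can be illustrated with a short check. The helper names are hypothetical, the clipped update is the same assumed BSB form used earlier, and only the weaker fixed-point property is tested here (asymptotic stability is the stronger property Greenberg actually proved).

```python
import numpy as np

def strongly_row_dominant(W):
    # Condition (7.4.5): each diagonal entry dominates the summed
    # magnitudes of the off-diagonal entries in its row.
    d = np.abs(np.diag(W))
    off = np.sum(np.abs(W), axis=1) - d
    return bool(np.all(d >= off))

def vertex_is_fixed(W, x, beta=1.0):
    # A vertex of [-1,+1]^n is an equilibrium of the assumed clipped
    # update x <- clip(x + beta*W*x) iff the update returns the vertex.
    return np.allclose(np.clip(x + beta * W @ x, -1, 1), x)

# With W = I the condition holds and every vertex is a fixed point,
# which is exactly why diagonally dominant W is avoided in DAMs.
n = 4
W = np.eye(n)
vertices = [np.array([1.0 if (v >> i) & 1 else -1.0 for i in range(n)])
            for v in range(2 ** n)]
print(strongly_row_dominant(W), all(vertex_is_fixed(W, x) for x in vertices))
```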

since the remaining factors are positive. Thus, the vertex xh is an attractor if (7.4.9) or, equivalently, (7.4.10) where the term inside the parentheses is the cross-talk term. In Section 7.2.1, it was determined that the probability of the n inequalities in Equation (7.4.10) being correct for all m memories approaches 1, in the limit of large n, if m grows no faster than the absolute capacity derived there. Hence, it is concluded that the absolute capacity of the BSB DAM for storing random bipolar binary vectors is identical to that of the discrete Hopfield DAM when correlation recording is used with zero self-coupling (i.e., wii = 0 for all i). In fact, the present capacity result is stronger than the absolute capacity result of Section 7.2.1; when m is smaller than this capacity, the condition of Equation (7.4.6) is satisfied and, therefore, all xk vectors are stable equilibria.

7.4.2 Non-Monotonic Activations DAM

As indicated in Section 7.3, one way of improving DAM performance for a given recording recipe is to appropriately design the DAM components. Here, the idea is to design the DAM retrieval process so that the DAM dynamics exploit certain known features of the synthesized interconnection matrix W. This section presents correlation-recorded DAMs whose performance is significantly enhanced as a result of modifying the activation functions of their units from the typical sgn- or sigmoid-type activation to more sophisticated non-monotonic activations. Two DAMs are considered: a discrete-time discrete-state parallel-updated DAM, and a continuous-time continuous-state DAM.

Discrete Model

First, consider the zero-diagonal correlation-recorded discrete Hopfield DAM discussed earlier in this chapter. The retrieval dynamics of this DAM exhibit some strange behavior. When initialized with a vector x(0) that has an overlap p with one of the stored random memories, say x1, the DAM state x(k) initially evolves towards x1 but does not always converge to, or stay close to, x1, as shown in Figure 7.4.4.
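The conclusion that all stored vectors are equilibria at low loading can be sanity-checked numerically (this is a finite-size check, not the asymptotic proof; the particular n, m, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 512, 5                      # m is far below n/(2 ln n), about 41 here
X = rng.choice([-1.0, 1.0], size=(n, m))

W = (X @ X.T) / n                  # normalized correlation recording
np.fill_diagonal(W, 0.0)           # zero self-coupling (w_ii = 0)

# Each stored vector should be an equilibrium, sgn(W x^h) = x^h:
# the cross-talk term never overcomes the near-unity signal term.
stable = all(np.array_equal(np.sign(W @ X[:, h]), X[:, h]) for h in range(m))
print(stable)                      # -> True
```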
It has been shown (Amari and Maginu, 1988) that the overlap p(k), when started with p(0) less than some critical value pc, initially increases but soon starts to decrease and ultimately stabilizes at a value less than 1. In this case, the DAM converges to a spurious memory, as depicted schematically by the trajectory on the right in Figure 7.4.4. The value of pc increases (from zero) monotonically with the pattern ratio, and increases sharply from about 0.5 to 1 as the pattern ratio becomes larger than 0.15, the DAM's relative capacity (note that pc can also be written as 1 − 2ρ, where ρ is the radius of attraction of a fundamental memory, as in Section 7.2.1). This peculiar phenomenon can be explained by first noting the effects of the overlaps p(k) and qh(k), h ≠ 1, on the ith unit weighted sum ui(k) given by (7.4.11) or, when written in terms of p(k) and qh(k), (7.4.12). Note the effects of the overlap terms qh(k) on the value of ui(k): the higher the overlaps with memories other than x1, the larger the value of the cross-talk term [the summation term in Equation (7.4.12)], which, in turn, drives |ui(k)| to large values. Morita (1993) showed, using simulations, that both the sum of squares of the overlaps with all stored memories except x1, defined as (7.4.13), and p2(k) initially increase with k. Then, one of two scenarios might occur. In the first scenario, s(k) begins to decrease and p2(k) continues to increase until it reaches 1; i.e., x(k) stabilizes at x1. In the second scenario, s(k) continues to increase and may attain values larger than 1 while p2(k) decreases.

Figure 7.4.4. Schematic representation of converging trajectories in a correlation-recorded discrete Hopfield DAM. When the distance (overlap) between x(0) and x1 is larger (smaller) than some critical value, the DAM converges to a spurious memory (right-hand-side trajectory). Otherwise, the DAM retrieves the fundamental memory x1. (From Neural Networks, 6, M. Morita, Associative Memory With Nonmonotone Dynamics, pp. 115-126, Copyright 1993, with kind permission from Pergamon Press Ltd., Headington Hill Hall, Oxford OX3 0BW, UK.)

The above phenomenon suggests a method for improving DAM performance: modify the dynamics of the Hopfield DAM such that the state is forced to move in a direction that reduces s(k), but not p2(k). One such method is to reduce the influence of units with large |ui| values, since such units are the actual cause of the increase in s(k). The influence of a unit i with a large weighted sum, say |ui| > θ > 0, can be reduced by reversing the sign of xi. This method can be implemented using the "partial reverse" dynamics (Morita, 1993) given by (7.4.14), where λ > 0 and F and G are activation functions which operate componentwise on their vector arguments. Here, F is the sgn activation function and G is defined by (7.4.15), where u is a component of the vector u = Wx(k). The values of the parameters λ and θ must be determined with care (one empirically suggested value is 2.7). These parameters are chosen so that the number of units which satisfy |ui| > θ is small when x(k) is close to any of the stored memories, provided the pattern ratio is not too large. It should be noted that Equation (7.4.14) does not always converge to a stable equilibrium. Numerical simulations show that a DAM employing this partial reverse method has several advantages over the same DAM with pure sgn activations.
These advantages include a smaller critical overlap pc (i.e., wider basins of attraction for fundamental memories), faster convergence, a lower rate of convergence to spurious memories, and error correction capability at relatively large pattern ratios.

Continuous Model

Consider the continuous-time DAM in Equation (7.1.21) with C set to the identity and with no external bias; namely, (7.4.16), where W is the usual zero-diagonal normalized autocorrelation matrix. Here, the partial reverse method described above cannot be applied directly, due to the continuous DAM dynamics. One can still capture the essential elements of this method, though, by designing the activation function such that it reduces the influence of unit i if |ui| is very large. This can be achieved by employing the non-monotone activation function shown in Figure 7.4.5, with the analytical form of Equation (7.4.17) (Morita et al., 1990a, b), whose parameters are positive constants with typical values of 50, 15, 1, and 0.5, respectively. This non-monotone activation function operates to keep the variance of |ui| from growing too large, and hence implements an effect similar to the one implemented by the partial reverse method. Empirical results show that this DAM has an absolute capacity that is proportional to n, with substantial error correction capabilities. Also, this DAM almost never converges to spurious memories when retrieval of a fundamental memory is not successful; instead, the DAM state continues to wander (chaotically) without reaching any equilibrium (Morita, 1993).
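Since Equation (7.4.17) is not reproduced in this text, the sketch below implements one published form of Morita's non-monotone activation; the assignment of the quoted constants 50, 15, 1, and 0.5 to the four parameters is an assumption, as is the numerically stable rewriting of the second factor.

```python
import numpy as np

# Assumed assignment of the typical constants quoted in the text
C, CP, H, KAPPA = 50.0, 15.0, 1.0, -0.5

def nonmonotone(u):
    u = np.asarray(u, dtype=float)
    f1 = np.tanh(C * u / 2.0)                 # saturating part, close to sgn(u)
    z = np.clip(CP * (np.abs(u) - H), -60.0, 60.0)
    s = 1.0 / (1.0 + np.exp(z))               # near 1 for |u| < H, near 0 for |u| > H
    f2 = KAPPA + (1.0 - KAPPA) * s            # folds the response down for large |u|
    return f1 * f2

# Monotone near the origin, but the response reverses once |u| exceeds H:
print(nonmonotone([0.2, 2.0, -2.0]))
```

The reversal for large |u| is what suppresses units with excessive weighted sums, mimicking the partial reverse method in continuous time.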

Figure 7.4.5. Non-monotonic activation function generated from Equation (7.4.17), with the typical parameter values given in the text.

Figure 7.4.6 gives simulation-based capacity curves (Yoshizawa et al., 1993a) depicting plots of the critical overlap pc (i.e., 1 − 2ρ, where ρ is the radius of attraction of a fundamental memory) versus pattern ratio for three DAMs with the dynamics in Equation (7.4.16). Two of the DAMs (represented by curves A and B) employ a zero-diagonal autocorrelation interconnection matrix, with a sigmoidal activation function (curve A) and with the non-monotone activation function of Equation (7.4.17) (curve B). The third DAM (curve C) employs projection recording with preserved W diagonal and a sigmoidal activation function. As expected, the DAM with sigmoid activation (curve A) loses its associative retrieval capabilities and the ability to retain the stored memories as fixed points (designated by the dashed portion of curve A) as the pattern ratio approaches 0.15. On the other hand, and with the same interconnection matrix, the non-monotone activation DAM exhibits good associative retrieval for a wide range of pattern ratios, even beyond 0.5; for example, Figure 7.4.6 (curve B) predicts a basin of attraction radius ρ = 0.22 at a pattern ratio of 0.5, which means that proper retrieval is possible from initial states having 22 percent or fewer random errors with respect to any one of the stored memories. It is interesting to note that this performance exceeds even that of the projection-recorded DAM with sigmoid activation function, represented by curve C. Note, though, that the performance of the zero-diagonal projection-recorded discrete DAM with serial update of states (refer to Section 7.2.2 and Figure 7.2.2) exceeds that of the non-monotone activations correlation-recorded DAM. Still, the demonstrated retrieval capabilities of the non-monotone activations DAM are impressive. The non-monotone dynamics can thus be viewed as extracting and using intrinsic information from the autocorrelation matrix which the "sigmoid dynamics" is not capable of utilizing.
For a theoretical treatment of the capacity and stability of the non-monotone activations DAM, the reader is referred to Yoshizawa et al. (1993a, b). Nishimori and Opris (1993) reported a discrete-time discrete-state version of this model in which a non-monotonic activation function of the type shown in Figure 7.4.5 is used. They showed that a substantial maximum capacity is achievable for a proper choice of the activation function's parameters, and gave a complete characterization of this model's capacity versus these parameters.

Figure 7.4.6. Simulation-generated capacity/error-correction curves for the continuous-time DAM of Equation (7.4.16). Curves A and B represent the cases of a zero-diagonal correlation-recorded DAM with sigmoidal activation function and with non-monotone activation function, respectively. Curve C is for a projection-recorded (preserved diagonal) DAM with sigmoidal activation function, which is given for comparison purposes. (From Neural Networks, 6, S. Yoshizawa et al., Capacity of Associative Memory Using a Nonmonotonic Neuron Model, pp. 167-176, with kind permission from Pergamon Press Ltd., Headington Hill Hall, Oxford OX3 0BW, UK.)

7.4.3 Hysteretic Activations DAM

Associative recall of DAMs can also be improved by introducing hysteresis into the units' activation function. This is described next in the context of a discrete Hopfield DAM. Here, the interconnection matrix W is the normalized zero-diagonal autocorrelation matrix, and the ith DAM state is updated according to (7.4.18), where the activation function fi is given by (7.4.19). A plot of this activation function is given in Figure 7.4.7, which shows a hysteretic property controlled by a parameter η > 0; the following discussion assumes the same hysteretic parameter η for all units i = 1, 2, ..., n. The hysteresis term ηxi(k) in Equation (7.4.19) favors a unit staying in its current state xi(k). Qualitatively speaking, the larger the value of η, the higher the tendency of unit i to retain its current state.

Figure 7.4.7. Transfer characteristics of a unit with hysteretic activation function.

The motivation behind the hysteretic property is that it causes units with the proper response to preserve their current state longer, thus increasing the chance for the DAM to correct its "wrong" states and ultimately converge to the closest fundamental memory. In the previous section, it was noted that when the state of a DAM is not far from a fundamental memory, degradation of the associative retrieval process is caused by units moving from the right states to the wrong ones; hysteresis tends to prevent such transitions. But, simultaneously, there are units moving from the wrong states to the right ones, and hysteresis tends to prevent these transitions as well. For the proper choice of η, though, the former effect outweighs the latter and associative retrieval can be enhanced.

Yanai and Sawada (1990) showed that the absolute capacity of a zero-diagonal correlation-recorded DAM with hysteretic units is given by the limit (7.4.20). This result can be derived by first noting that the ith bit transition for the above DAM may be described by Equation (7.2.2) with xi(0) replaced by (1 + η)xi(0), and then following a derivation similar to that of Section 7.2.1. Comparing Equation (7.4.20) with the capacity results of Section 7.2.1, we find that hysteresis leads to a substantial increase in the number of memorizable vectors as compared to using no hysteresis at all. Yanai and Sawada also showed that the relative capacity increases with η (e.g., from 0.15 at η = 0 to about 0.25 for a suitable η).

Hysteresis can arise from allowing non-zero diagonal elements wii. To see this, consider a discrete DAM with a normalized autocorrelation matrix whose diagonal is preserved; this matrix has diagonal elements wii = m/n. The update rule for the ith unit is then (7.4.21). Comparing Equation (7.4.21) to Equation (7.4.19) of a hysteretic activation DAM reveals that the two DAMs are mathematically equivalent if η is set equal to m/n (it is interesting to note that this is, approximately, the empirically optimal value for the hysteretic parameter). Empirical results suggest that a value of η slightly higher than m/n leads to the largest basin of attraction size around fundamental memories. This implies that the basins of attraction of fundamental memories increase when hysteresis is employed; therefore, it is concluded that preserving the original diagonal in a correlation-recorded DAM is advantageous in terms of the quality of associative retrieval. On the other hand, retaining the original diagonal leads to relatively large values for the self-connections if the pattern ratio is large, which greatly reduces the basin of attraction size for fundamental memories, as shown empirically in Section 7.2. All this suggests that the best approach is to give all self-connections a small positive value. The advantages of small positive self-connections have also been demonstrated for projection-recorded discrete DAMs: Krauth et al. (1988) showed that using a small positive diagonal element with the projection-recorded discrete DAM increases the radius of attraction of fundamental memories if the DAM is substantially loaded (pattern ratio approximately greater than 0.5). For the projection DAM, they found numerically that using a diagonal term of about 0.075 instead of zero increases the basins of attraction of fundamental memories by about 50 percent.

7.4.4 Exponential Capacity DAM

Up to this point, the best DAM considered has a capacity proportional to n, the number of units in the DAM. However, very high capacity DAMs (e.g., with m exponential in n) can be realized if one is willing to consider more complex memory architectures than the ones considered so far. One such DAM, described next, was proposed by Chiueh and Goodman (1988, 1991). This DAM can store up to c^n (c > 1) random vectors xh in {−1, +1}^n, with substantial error correction abilities.

Consider the architecture in Figure 7.4.8, which describes a two-layer dynamic autoassociative memory. The matrix X is the matrix whose columns are the desired memory vectors xh, h = 1, 2, ..., m. This DAM may update its state x(k) in either serial or parallel mode. The parallel updated dynamics are given in vector form as (7.4.22) and in component form as (7.4.23), where g is the scalar version of the operator G. The output nonlinearity F implements sgn activations, which gives the DAM a discrete-state nature; the dynamics are nonlinear due to the nonlinear operations G and F. Here, g is normally assumed to be a continuous monotone non-decreasing function over [−n, +n]. The recording of a new memory vector is simply done by augmenting the matrices XT and X with the transpose of the new vector and with the vector itself, respectively (this corresponds to allocating two new n-input units, one in each layer, in the network of Figure 7.4.8).

Figure 7.4.8. Architecture of a very high capacity discrete DAM.

The choice of the first layer activation function g plays a critical role in determining the capacity and dynamics of this DAM. To see this, assume first a simple linear activation function g(u) = u. This assumption reduces the dynamics in Equation (7.4.22) to (7.4.24), which is simply the dynamics of the correlation-recorded discrete DAM. On the other hand, if one chooses g(u) = u^q, where q is an integer greater than 1, a higher-order DAM results, with polynomial capacity m proportional to n^q for large n (Psaltis and Park, 1986). The choice g(u) = a^u, a > 1, results in an "exponential" DAM with capacity m = c^n and with error correction capability; here, c is a function of a and of the radius of attraction ρ of the fundamental memories, as depicted in Figure 7.4.9. Exponential capacity is also possible with other proper choices of g, including g(u) = (n − u)^−a, a > 1 (Dembo and Zeitouni, 1988) and related forms (Sayeh and Han, 1987).

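The two-layer exponential correlation memory is compact to sketch. Below, the parallel update is assumed to read x(k+1) = sgn(X g(XTx(k))) with g(u) = a^u; the sizes, the base a = 2, and the noise level are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, a = 128, 16, 2.0
X = rng.choice([-1.0, 1.0], size=(n, m))   # columns are the stored memories

def exp_dam_step(x):
    u = X.T @ x             # first layer: correlation of x with every memory
    g = a ** u              # exponential emphasis of the best match, g(u) = a^u
    return np.sign(X @ g)   # second layer plus the sgn output nonlinearity

# One-step retrieval from a cue with 10 of 128 bits flipped:
cue = X[:, 0].copy()
cue[:10] *= -1.0
print(np.array_equal(exp_dam_step(cue), X[:, 0]))   # -> True
```

Because the correct memory's correlation exceeds the others by a wide margin, the exponential weighting makes its column dominate the second-layer sum, so a single parallel step already restores the stored vector.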
Figure 7.4.9. Relation between the base constant c and the fundamental memories' basin of attraction radius ρ, for various values of the nonlinearity parameter a, for an exponential DAM. (Adapted from T. D. Chiueh and R. M. Goodman, Recurrent Correlation Associative Memories, IEEE Transactions on Neural Networks, 2(2), pp. 275-284, ©1991 IEEE.)

Chiueh and Goodman (1991) showed that a sufficient condition for the dynamics in Equation (7.4.23) to be stable in both serial and parallel update modes is that the activation function g be continuous and monotone non-decreasing over [−n, +n]. This condition is easily shown to hold for all choices of g indicated above. In the limit of large a, the exponential DAM is capable of achieving the ultimate capacity of a binary state DAM, namely m = 2^n; however, the DAM has no error correction abilities at such loading levels. For relatively small a, one may come up with an approximate linear relation between c and ρ. For example, Figure 7.4.9 gives c of about 1.4 for a suitable choice of a, so that on the order of 1.4^n random memories can be stored if it is desired that all such memories have basins of attraction of useful size.

7.4.5 Sequence Generator DAM

Autoassociative DAMs can be synthesized to act as sequence generators. In fact, no architectural changes are necessary for a basic DAM to behave as a sequence generator, and a simple correlation recording recipe may still be utilized for storing sequences. Here, a simple sequence generator DAM (also called a temporal associative memory) is described whose dynamics are the parallel-mode dynamics of the basic discrete DAM with Ii = 0, i = 1, 2, ..., n. Consider a sequence Si of mi distinct patterns; the length of this sequence is mi, and the sequence is a cycle if its last pattern leads back to its first, with mi > 2. An autoassociative DAM can store the sequence Si when the DAM's interconnection matrix W is defined by (7.4.25). Note that the first term on the right-hand side of Equation (7.4.25) represents the normalized correlation recording of the heteroassociations between each pattern in the sequence and its successor, whereas the second term is an autocorrelation that attempts to terminate the recollection process at the last pattern of sequence Si.

Equation (7.4.25) can be extended for storing s distinct sequences Si, i = 1, 2, ..., s, by summing it over i; here, the subscript i on mi refers to the ith sequence. Hence, this sequence generator DAM is capable of the simultaneous storage of sequences with different lengths and of cycles with different periods. Similarly, a cycle can be stored using Equation (7.4.25) with the autocorrelation term removed, and autoassociations can be stored by treating them as sequences of length 1 (in which case the first term in Equation (7.4.25) vanishes). As one might determine intuitively, the capacity of the sequence generator DAM is similar to that of an autoassociative DAM if unbiased independent random patterns are assumed in all stored sequences (here, one-pass retrieval is assumed); this implies that the effective number of stored patterns must be very small compared to n for proper associative retrieval. However, the asymmetric nature of W in Equation (7.4.25) will generally lead to spurious cycles (oscillations) of period two or higher, and associative retrieval may suffer if the loading of the DAM exceeds its capacity.

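The sequence-storage recipe can be sketched directly from its verbal description. The form W = (1/n)[Σ x^(j+1)(x^j)T + x^(m)(x^m)T] is an assumed reading of recipe (7.4.25), and the sizes and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
n, mlen = 256, 4
S = [rng.choice([-1.0, 1.0], size=n) for _ in range(mlen)]  # x1 -> x2 -> x3 -> x4

# Assumed reading of (7.4.25): chain correlations plus a final
# autocorrelation that terminates recall at the last pattern.
W = sum(np.outer(S[j + 1], S[j]) for j in range(mlen - 1))
W = (W + np.outer(S[-1], S[-1])) / n

x = S[0]
for _ in range(mlen - 1):
    x = np.sign(W @ x)        # each parallel update steps through the sequence
print(np.array_equal(x, S[-1]))          # -> True

print(np.array_equal(np.sign(W @ x), S[-1]))   # stays at the last pattern -> True
```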
7.4.6 Heteroassociative DAM

A heteroassociative DAM (HDAM) is shown in the block diagram of Figure 7.4.10 (Okajima et al., 1987). It consists of two processing paths which form a closed loop. The first processing path computes a vector y from an input x in {−1, +1}^n according to the parallel update rule (7.4.26) or its serial (asynchronous) version, where one and only one unit updates its state at a given time. Similarly, the second processing path computes a vector x according to (7.4.27) or its serial version. Here, F is usually the sgn activation operator. The HDAM can be operated in either parallel or serial retrieval modes. In the parallel mode, the HDAM starts from an initial state x(0) [y(0)] and then updates the state y (x) according to Equation (7.4.26) [Equation (7.4.27)]. This process is iterated until convergence; i.e., until state x (or, equivalently, y) ceases to change. In the serial mode, only one randomly chosen component of the state x or y is updated at a given time.

Figure 7.4.10. Block diagram of a Heteroassociative DAM.

Various methods have been proposed for storing a set of heteroassociations {xk, yk}, k = 1, 2, ..., m, in the HDAM. In most of these methods, the interconnection matrices W1 and W2 are computed independently by requiring that all one-pass associations xk to yk and yk to xk, respectively, are perfectly stored. Here, it is assumed that the set of associations to be stored forms a one-to-one mapping; otherwise, perfect storage becomes impossible. Examples of such HDAM recording methods include the use of projection recording (Hassoun, 1989a, b) and Householder transformation-based recording (Leung and Cheung, 1991). These methods require the linear independence of the vectors xk (also yk), for which a capacity of m = min(n, L) is achievable. One drawback of these techniques, though, is that they do not guarantee the stability of the HDAM; because of the feedback employed, convergence to spurious cycles is possible. Empirical results show (Hassoun, 1989b) that parallel updated projection-recorded HDAMs exhibit significant oscillatory behavior only at memory loading levels close to the HDAM capacity.

Kosko (1987, 1988) proposed a heteroassociative memory with the architecture of the HDAM, but with the restriction that the backward path matrix be the transpose of the forward path matrix. This memory is known as a bidirectional associative memory (BAM). The interesting feature of a BAM is that it is stable for any choice of the real-valued interconnection matrix W and for both serial and parallel retrieval modes. This can be shown by starting from the BAM's bounded Liapunov (energy) function (7.4.28) and showing that each serial or parallel state update decreases E. One can also prove BAM stability by noting that a BAM can be converted to a discrete autoassociative DAM (discrete Hopfield DAM) with state vector x' = [xT yT]T and interconnection matrix W' given by (7.4.29). Now, since W' is a symmetric zero-diagonal matrix, the autoassociative DAM is stable if serial update is assumed, as was discussed in Section 7.1; therefore, the serially updated BAM is stable. One may also use this equivalence property to show the stability of the parallel updated BAM (note that a parallel updated BAM is not equivalent to the nonstable parallel updated discrete Hopfield DAM, because either state x or y, but not both, is updated in parallel at each step). From the above, it can be concluded that the BAM always converges to a local minimum of its energy function defined in Equation (7.4.28). It can be shown (Wang et al., 1991) that these local minima include all those that correspond to associations {xk, yk} which are successfully loaded into the BAM (i.e., associations which are equilibria of the BAM dynamics).

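A correlation-recorded BAM with the transpose restriction can be sketched in a few lines; the 1/n normalization, the alternation order, and the sweep count are assumptions of this sketch, not the book's exact Equations (7.4.26)-(7.4.27).

```python
import numpy as np

rng = np.random.default_rng(3)
n, L, m = 256, 256, 4
X = rng.choice([-1.0, 1.0], size=(n, m))
Y = rng.choice([-1.0, 1.0], size=(L, m))

W1 = (Y @ X.T) / n          # forward path (x -> y), correlation recording
W2 = W1.T                   # the BAM restriction: backward path = transpose

def bam_recall(x, sweeps=3):
    # Alternate full parallel updates of y and x until they settle
    for _ in range(sweeps):
        y = np.sign(W1 @ x)
        x = np.sign(W2 @ y)
    return x, y

# Bidirectional retrieval from a noisy x-side cue (12 bits flipped):
x0 = X[:, 0].copy()
x0[:12] *= -1.0
x, y = bam_recall(x0)
print(np.array_equal(x, X[:, 0]), np.array_equal(y, Y[:, 0]))   # -> True True
```

With only a few random pairs stored, the cross-talk terms are tiny relative to the unity signal, so the pair (x1, y1) is recovered as a BAM equilibrium, consistent with the stability discussion above.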
The simplest storage recipe for storing the associations as BAM equilibrium points is the correlation recording recipe of Equations (7.4.30) and (7.4.31). This recipe automatically guarantees the BAM requirement that the forward path and backward path interconnection matrices W1 and W2 be the transpose of each other. However, some serious drawbacks of using the correlation recording recipe are low capacity and poor associative retrievals; when m random associations are stored in a correlation-recorded BAM, the condition m << min(n, L) must be satisfied if good associative performance is desired (Hassoun, 1988, 1993). Heuristics for improving the performance of correlation-recorded BAMs can be found in Wang et al. (1990).

Before leaving this section, it should be noted that the above models of associative memories are by no means exclusive. A number of other interesting models have been reported in the literature (Simpson, 1990); interested readers may also find the volume edited by Hassoun (1993) useful in this regard. Some of these models are particularly interesting because of their connections to biological memories (e.g., the models of Alkon et al. and of Kanerva).
i = 1. which employs a zero-diagonal autocorrelation weight matrix to store the memory vector .
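The clipped-linear BSB iteration can be sketched numerically. The sketch below is a minimal illustration (not code from the text) using the two-unit weight matrix of Example 7.4.1 and a step size α = 0.2 that satisfies the stability condition; the iteration count is an arbitrary choice.

```python
import numpy as np

# Zero-diagonal correlation matrix storing x1 = [+1, -1]^T (Example 7.4.1)
W = np.array([[0.0, -0.5],
              [-0.5, 0.0]])
alpha = 0.2                      # satisfies alpha < 2/|lambda_min| = 4

def bsb_step(x, W, alpha):
    """One BSB update: linear feedback followed by clipping to [-1, +1]."""
    return np.clip(x + alpha * W @ x, -1.0, 1.0)

x = np.array([0.9, 0.6])         # an initial state from Example 7.4.1
for _ in range(200):             # iterate until the state saturates
    x = bsb_step(x, W, alpha)

print(x)                         # converges to the stored memory [+1, -1]
```

Because W is symmetric, each update descends on E(x) = −(1/2) xT W x, so the final state sits at the energy minimum E = −0.5.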

Equation (7.4.6) is particularly useful in characterizing the capacity of a zero-diagonal correlation-recorded BSB DAM in which m unbiased and independent random vectors xk in {−1, +1}n are stored. Let xh be one of these vectors. Assuming the DAM is receiving no bias (I = 0), the vertex xh is an attractor if

xih [Wxh]i > 0,   i = 1, 2, ..., n    (7.4.7)

Here, []i signifies the ith component of the argument vector. Substituting the zero-diagonal correlation recording recipe for W in Equation (7.4.7), this inequality becomes Equation (7.4.8) or, equivalently, when written in terms of the overlaps defined below, Equations (7.4.9) through (7.4.11), in which the term inside the parentheses is the cross-talk term. In Section 7.1.2, it was determined that the probability of n inequalities of this form being satisfied for all m memories approaches 1, in the limit of large n, when m satisfies the absolute capacity bound derived there. Hence, it is concluded that the absolute capacity of the BSB DAM for storing random bipolar binary vectors is identical to that of the discrete Hopfield DAM when correlation recording is used with zero self-coupling (i.e., wii = 0 for all i). In fact, the present capacity result is stronger than the absolute capacity result of Section 7.1.2, since here all xk vectors are stable equilibria.

7.4.2 Non-Monotonic Activations DAM

As indicated earlier, one way of improving DAM performance for a given recording recipe is by appropriately designing the DAM components. Here, the idea is to design the DAM retrieval process so that the DAM dynamics exploit certain known features of the synthesized interconnection matrix W. This section presents correlation-recorded DAMs whose performance is significantly enhanced as a result of modifying the activation functions of their units from the typical sgn- or sigmoid-type activation to more sophisticated non-monotonic activations. Two DAMs are considered: a discrete-time discrete-state parallel-updated DAM, and a continuous-time continuous-state DAM.

Discrete Model

First, consider the zero-diagonal correlation-recorded discrete Hopfield DAM discussed earlier in this chapter. The retrieval dynamics of this DAM show some strange dynamical behavior. When initialized with a vector x(0) that has an overlap p with one of the stored random memories, say x1, the DAM state x(k) initially evolves towards x1 but does not always converge to, or stay close to, x1, as shown in Figure 7.4.5. It has been shown (Amari and Maginu, 1988) that the overlap p(k) = (1/n)(x1)T x(k), when started with p(0) less than some critical value pc, initially increases but soon starts to decrease and ultimately stabilizes at a value less than 1. In this case, the DAM converges to a spurious memory, as depicted schematically by the trajectory on the right in Figure 7.4.5. The value of pc increases (from zero) monotonically with the pattern ratio, and increases sharply from about 0.5 to 1 as the pattern ratio becomes larger than 0.15, the DAM's relative capacity (note that pc can also be written as pc = 1 − 2ρ, where ρ is the radius of attraction of a fundamental memory, as in Section 7.1).

This peculiar phenomenon can be explained by first noting the effects of the overlaps p(k) and qh(k) = (1/n)(xh)T x(k), h ≠ 1, on the ith unit weighted sum ui(k), given by Equation (7.4.12). The higher the overlaps with memories other than x1, the larger the value of the cross-talk term [the summation term in Equation (7.4.12)], which, in turn, drives |ui(k)| to large values. Morita (1993) showed, using simulations, that both the sum of the squares of the overlaps with all stored memories except x1, defined as

s(k) = Σ(h ≠ 1) [qh(k)]2    (7.4.13)

and p2(k) initially increase with k. Then, one of two scenarios might occur. In the first scenario, s(k) begins to decrease and p2(k) continues to increase until it reaches 1; here, x(k) stabilizes at x1 and the DAM retrieves the fundamental memory x1. In the second scenario, s(k) continues to increase and may attain values larger than 1 while p2(k) decreases; here, the DAM converges to a spurious memory.

The above phenomenon suggests a method for improving DAM performance by modifying the dynamics of the Hopfield DAM such that the state is forced to move in a direction that reduces s(k). One such method is to reduce the influence of units with large weighted sums, say |ui| > θ > 0; such units actually cause the increase in s(k). The influence of a unit i with large |ui| can be reduced by reversing the sign of xi. This method can be implemented using the "partial reverse" dynamics (Morita, 1993) given by

x(k + 1) = F[Wx(k) − λ G(Wx(k))]    (7.4.14)

where λ > 0, and F and G are activation functions which operate componentwise on their vector arguments. Here, F is the sgn activation function and G is defined by

g(u) = sgn(u) if |u| > θ, and 0 otherwise    (7.4.15)

where u is a component of the vector u = Wx(k). The values of the parameters λ and θ must be determined with care; they are chosen so that the number of units which satisfy |ui| > θ is small when x(k) is close to any of the stored memories, provided the pattern ratio is not too large. Empirically, λ = 2 and θ = 0.7 may be chosen. It should be noted that Equation (7.4.14) does not always converge to a stable equilibrium. Still, numerical simulations show that a DAM employing this partial reverse method has several advantages over the same DAM with pure sgn activations and with the same interconnection matrix; namely, a smaller critical overlap pc (i.e., wider basins of attraction for fundamental memories), faster convergence, a lower rate of convergence to spurious memories, and error correction capability at higher pattern ratios.

Figure 7.4.5. Schematic representation of converging trajectories in a correlation-recorded discrete Hopfield DAM. When the distance (overlap) between x(0) and x1 is smaller (larger) than some critical value, the DAM retrieves the fundamental memory x1; otherwise, the DAM converges to a spurious memory (right-hand-side trajectory). (From M. Morita, 1993, Associative Memory With Nonmonotone Dynamics, Neural Networks, 6, pp. 115-126. Copyright 1993, with kind permission from Pergamon Press Ltd, Headington Hill Hall, Oxford OX3 0BW, UK.)
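One plausible reading of the partial reverse update of Equations (7.4.14) and (7.4.15) can be sketched as follows: the state is driven by Wx(k) minus a correction that opposes units whose weighted sums exceed the threshold θ. The toy weight matrix, memory count, λ = 2, and θ = 0.7 below are illustrative assumptions, not a reproduction of Morita's experiments.

```python
import numpy as np

def G(u, theta):
    """Equation (7.4.15): sgn(u) where |u| > theta, zero elsewhere."""
    return np.sign(u) * (np.abs(u) > theta)

def partial_reverse_step(x, W, lam=2.0, theta=0.7):
    """One parallel update of the assumed partial reverse dynamics (7.4.14)."""
    u = W @ x
    return np.sign(u - lam * G(u, theta))

# Toy zero-diagonal correlation matrix storing two random bipolar memories
rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(2, 16))      # memories as rows (assumption)
W = (X.T @ X) / 16.0
np.fill_diagonal(W, 0.0)

x_next = partial_reverse_step(X[0].copy(), W)  # one update from a stored memory
```

Note how G leaves small-|u| units untouched: only units whose weighted sum exceeds θ receive the reversing correction.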

Continuous Model

Consider the continuous-time DAM discussed earlier in this chapter, with C = I; in this case the dynamics reduce to Equation (7.4.16), where W is the usual zero-diagonal normalized autocorrelation matrix. Here, the partial reverse method described above cannot be applied directly, due to the continuous DAM dynamics. One can still capture the essential elements of this method, though, by designing the activation function such that it reduces the influence of unit i if |ui| is very large. This can be achieved by employing a non-monotone activation function with the analytical form of Equation (7.4.17) (Morita et al., 1990a, b), whose parameters are positive constants with typical values of 50, 15, 1, and 0.5, respectively. This non-monotone activation function operates to keep |ui| from growing too large, and hence implements an effect similar to the one implemented by the partial reverse method. The non-monotone dynamics can thus be viewed as extracting and using intrinsic information from the autocorrelation matrix which the "sigmoid dynamics" is not capable of utilizing.

Non-monotonic activation function generated from Equation (7.4.17).

Figure 7.4.6 gives simulation-based capacity curves (Yoshizawa et al., 1993a) depicting plots of the critical overlap pc (i.e., 1 − 2ρ, where ρ is the radius of attraction of a fundamental memory) versus pattern ratio for three DAMs with the dynamics in Equation (7.4.16). Two of the DAMs (represented by curves A and B) employ a zero-diagonal autocorrelation interconnection matrix, with a sigmoidal activation function and with the non-monotone activation function of Equation (7.4.17), respectively. The third DAM (curve C) employs projection recording with preserved W diagonal and a sigmoidal activation function, and is given for comparison purposes. As expected, the DAM with sigmoid activation (curve A) loses its associative retrieval capabilities, and the ability to retain the stored memories as fixed points (designated by the dashed portion of curve A), as the pattern ratio approaches 0.15. On the other hand, the non-monotone activation DAM exhibits good associative retrieval for a wide range of pattern ratios, even when the pattern ratio exceeds 0.15. For example, Figure 7.4.6 (curve B) predicts a basin of attraction radius of 0.22 at a pattern ratio of 0.15; this means that proper retrieval is possible from initial states having 22 percent or fewer random errors with respect to any one of the stored memories. Also, this DAM almost never converges to spurious memories when retrieval of a fundamental memory is not successful; instead, the DAM state continues to wander (chaotically) without reaching any equilibrium (Morita, 1993). It is interesting to note that this performance exceeds that of a projection-recorded DAM with sigmoid activation function. Note, though, that the zero-diagonal projection-recorded discrete DAM with serial update of states (refer to Section 7.2) has a capacity which exceeds that of the non-monotone activations correlation-recorded DAM. Still, the demonstrated retrieval capabilities of the non-monotone activations DAM are impressive. For a theoretical treatment of the capacity and stability of the non-monotone activations DAM, the reader is referred to Yoshizawa et al. (1993a, b).

Nishimori and Opris (1993) reported a discrete-time discrete-state version of this model employing the non-monotonic activation function of Equation (7.4.17). They characterized the maximum capacity achievable with a proper choice of the activation parameters, and gave a complete characterization of this model's capacity versus those parameters. Empirical results show that this DAM has an absolute capacity that is proportional to n, with substantial error correction capabilities.

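The shape of a non-monotone activation can be sketched with the published Morita-type form: a steep sigmoid multiplied by a factor that turns negative once |u| exceeds h. The constants c = 50, c' = 15, and h = 0.5 follow the typical values quoted above; the exact algebraic form, and the choice κ = −1, are assumptions and may differ in detail from the book's Equation (7.4.17).

```python
import numpy as np

def nonmonotone(u, c=50.0, cp=15.0, kappa=-1.0, h=0.5):
    """Morita-type non-monotone activation (assumed form): sigmoidal for
    small |u|, driven toward -sgn(u) once |u| grows past h."""
    u = np.asarray(u, dtype=float)
    sigmoid_part = np.tanh(c * u / 2.0)          # = (1 - e^{-cu})/(1 + e^{-cu})
    t = np.clip(cp * (np.abs(u) - h), -500.0, 500.0)   # avoid overflow in exp
    damping_part = (1.0 + kappa * np.exp(t)) / (1.0 + np.exp(t))
    return sigmoid_part * damping_part
```

For small |u| the damping factor is near 1 and the unit behaves like a steep sigmoid; for |u| well beyond h the factor approaches κ = −1, so the output reverses sign, which is the influence-reducing effect described above.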
Figure 7.4.6. Simulation-generated capacity/error-correction curves for the continuous-time DAM of Equation (7.4.16). Curves A and B represent the cases of a zero-diagonal correlation-recorded DAM with sigmoidal activation function and with non-monotone activation function, respectively. Curve C is for a projection-recorded (preserved diagonal) DAM with sigmoidal activation function, which is given for comparison purposes. (From S. Yoshizawa et al., 1993, Capacity of Associative Memory Using a Nonmonotonic Neuron Model, Neural Networks, 6, pp. 167-176. Copyright 1993, with kind permission from Pergamon Press Ltd, Headington Hill Hall, Oxford OX3 0BW, UK.)

7.4.3 Hysteretic Activations DAM

Associative recall of DAMs can also be improved by introducing hysteresis into the units' activation function. The motivation behind the hysteretic property is that it causes units with the proper response to preserve their current state longer, thus increasing the chance for the DAM to correct its "wrong" states and ultimately converge to the closest fundamental memory. In the previous section, it was noted that when the state of a DAM is not far from a fundamental memory, there are units moving from the wrong states to the right ones and, simultaneously, units moving from the right states to the wrong ones; degradation of the associative retrieval process is caused by the latter transitions, and hysteresis tends to prevent them. Hysteresis also tends to prevent the desirable wrong-to-right transitions; qualitatively speaking, however, for a proper choice of the hysteresis parameter the former effect outweighs the latter, and associative retrieval can be enhanced. This phenomenon is described next in the context of a discrete Hopfield DAM.

Here, the interconnection matrix W is the normalized zero-diagonal autocorrelation matrix, with the ith DAM state updated according to Equation (7.4.18), where the activation function fi is given by Equation (7.4.19). A plot of this activation function is given in Figure 7.4.7, which shows a hysteretic property controlled by the parameter δ > 0. The hysteresis term δxi(k) in Equation (7.4.19) favors a unit to stay in its current state xi(k); the larger the value of δ, the higher the tendency of unit i to retain its current state. The following discussion assumes a hysteretic parameter δi = δ for all units i = 1, 2, ..., n.

Figure 7.4.7. Transfer characteristics of a unit with hysteretic activation function.

Yanai and Sawada (1990) showed that the absolute capacity of a zero-diagonal correlation-recorded DAM with hysteretic units is given by the limit in Equation (7.4.20). This result can be derived by first noting that the ith bit transition for the above DAM may be described by the corresponding expression of Section 7.1 with xi(0) replaced by (1 + δ)xi(0), and then following a derivation similar to the absolute capacity derivation given there. Comparing Equation (7.4.20) with the capacity in the absence of hysteresis, we find that hysteresis leads to a substantial increase in the number of memorizable vectors as compared to using no hysteresis at all. Yanai and Sawada also showed that the relative capacity increases with δ (e.g., the relative capacity increases from about 0.15 at δ = 0 to about 0.25 for larger δ). This implies that the basins of attraction of fundamental memories increase when hysteresis is employed.
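The hysteretic update of Equations (7.4.18) and (7.4.19) amounts to adding δxi(k) to the unit's weighted sum before taking the sign, so a unit keeps its current state unless the evidence against it exceeds δ. A minimal sketch (the weight values and δ below are illustrative assumptions):

```python
import numpy as np

def hysteretic_step(x, W, delta=0.2):
    """Parallel hysteretic update: sgn(Wx + delta * x).
    The delta*x term biases each unit toward its current state."""
    u = W @ x + delta * x
    return np.where(u >= 0.0, 1.0, -1.0)

# Each unit receives weak evidence *against* its current state:
x = np.array([1.0, -1.0])
W = np.array([[0.0, 0.1],     # weighted sum on unit 0: 0.1 * (-1) = -0.1
              [0.1, 0.0]])    # weighted sum on unit 1: 0.1 * (+1) = +0.1

print(hysteretic_step(x, W, delta=0.2))   # states retained (hysteresis wins)
print(hysteretic_step(x, W, delta=0.0))   # without hysteresis, both units flip
```

With δ = 0.2 the contrary evidence of magnitude 0.1 is not enough to flip either unit; with δ = 0 both units change state.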
Hysteresis can arise from allowing a non-zero diagonal element wii. To see this, consider a discrete DAM with a normalized autocorrelation matrix whose original diagonal is preserved; note that this matrix has diagonal elements wii = m/n. In this case, the update rule for the ith unit is given by Equation (7.4.21). Comparing Equation (7.4.21) to Equation (7.4.19) of a hysteretic activation DAM reveals that the two DAMs are mathematically equivalent if δ is set equal to wii (it is interesting to note that this value is, approximately, the empirically optimal value for the hysteretic parameter δ). Therefore, it is concluded that preserving the original diagonal in a correlation-recorded DAM is advantageous in terms of the quality of associative retrieval. However, retaining the original diagonal leads to relatively large values for the self-connections if the pattern ratio is large, which greatly reduces the basin of attraction size for fundamental memories, as shown empirically earlier in this chapter. All this suggests that the best approach is to give all self-connections a small positive value ε << 1; empirical results suggest that such a small positive self-connection leads to the largest basin of attraction size around fundamental memories. The advantages of small positive self-connections have also been demonstrated for projection-recorded discrete DAMs. Krauth et al. (1988) demonstrated that using a small positive diagonal element with the projection-recorded discrete DAM increases the radius of attraction of fundamental memories if the DAM is substantially loaded (i.e., if the pattern ratio is approximately greater than 0.5). For example, they found numerically that using a diagonal term of about 0.075 instead of zero increases the basins of attraction of fundamental memories by about 50 percent.

7.4.4 Exponential Capacity DAM

Up to this point, the best DAM considered has a capacity proportional to n, the number of units in the DAM. However, very high capacity DAMs (e.g., with m exponential in n) can be realized if one is willing to consider more complex memory architectures than the ones considered so far. An exponential capacity DAM was proposed by Chiueh and Goodman (1988, 1991). Consider the architecture in Figure 7.4.8, which describes a two-layer dynamic autoassociative memory. The matrix X is the matrix whose columns are the desired memory vectors xh in {−1, +1}n, h = 1, 2, ..., m. The parallel-updated dynamics are given in vector form by Equation (7.4.22), and in component form by Equation (7.4.23), where g is the scalar version of the operator G. The dynamics are nonlinear due to the nonlinear operations G and F. The output nonlinearity F implements sgn activations, which gives the DAM a discrete-state nature. This DAM may update its state x(k) in either serial or parallel mode. The recording of a new memory vector xh is simply done by augmenting the matrices XT and X with (xh)T and xh, respectively (this corresponds to allocating two new n-input units, one in each layer, in the network of Figure 7.4.8). The choice of the first-layer activation function g plays a critical role in determining the capacity and dynamics of this DAM; g is normally assumed to be a continuous monotone non-decreasing function over [−n, +n]. To see this, assume first a simple linear activation function g(u) = u. This assumption reduces the dynamics in Equation (7.4.23) to those of the correlation-recorded discrete DAM.
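The two-layer recall step of Equation (7.4.22) can be sketched with an exponential first-layer nonlinearity g(u) = a^u, in the spirit of Chiueh and Goodman's exponential DAM. The stored patterns below are rows of a small Hadamard matrix (an illustrative choice that makes them exactly orthogonal), and a = 2 is an arbitrary base.

```python
import numpy as np

# Build an 8x8 Hadamard matrix; its rows are mutually orthogonal +-1 vectors.
H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
H8 = np.kron(np.kron(H2, H2), H2)
X = H8[1:4]                       # store m = 3 patterns of dimension n = 8 (rows)

def rcam_step(x, X, a=2.0):
    """x(k+1) = F[X^T g(X x(k))] with g(u) = a**u and F = sgn.
    (X holds memories as rows, so X x gives the m inner products.)"""
    weights = a ** (X @ x)        # one exponential weight per stored pattern
    u = X.T @ weights             # second layer: weighted sum of the patterns
    return np.where(u >= 0.0, 1.0, -1.0)

probe = X[0].copy()
probe[0] = -probe[0]              # corrupt one bit of the first memory
recalled = rcam_step(probe, X)
print(np.array_equal(recalled, X[0]))   # one step corrects the error
```

The exponential g makes the weight of the closest memory (inner product 6 here) dominate all others, which is the mechanism behind the large capacity discussed below.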

As one might determine intuitively, nonlinear choices of g can do much better. The choice g(u) = au, a > 1, results in an "exponential" DAM with capacity m = cn and with substantial error correction abilities (Chiueh and Goodman, 1988). Here, c is a function of a and of the radius of attraction ρ of the fundamental memories, as depicted in Figure 7.4.9. For relatively small a, one may come up with an approximate linear relation between c and ρ; for example, Figure 7.4.9 gives c of approximately 1.2 − 0.4ρ for a = 2, so that an exponential DAM with this nonlinearity can store up to (1.2 − 0.4ρ)n random memories if it is desired that all such memories have basins of attraction of radius ρ. In fact, in the limit of large a, the exponential DAM is capable of achieving the ultimate capacity of a binary-state DAM, namely m = 2n (here, one-pass retrieval is assumed); this DAM, though, has no error correction abilities at such loading levels.

Figure 7.4.8. Architecture of a very high capacity discrete DAM. (Adapted from T. D. Chiueh and R. M. Goodman, Recurrent Correlation Associative Memories, IEEE Transactions on Neural Networks, 2(2), pp. 275-284, ©1991 IEEE.)

Figure 7.4.9. Relation between the base constant c and the fundamental memories' basin of attraction radius ρ, for various values of the nonlinearity parameter a.

Chiueh and Goodman (1991) showed that a sufficient condition for the dynamics in Equation (7.4.23) to be stable in both serial and parallel update modes is that the activation function g(u) be continuous and monotone non-decreasing over [−n, +n]. This condition can easily be shown to hold for the choices of g indicated above. Exponential capacity is also possible with other proper choices of g; such choices include g(u) = (n − u)−a, a > 1 (Dembo and Zeitouni, 1988; Sayeh and Han, 1987). On the other hand, if one chooses g(u) = (n + u)q, where q is an integer greater than 1, a higher-order DAM results, with polynomial capacity m proportional to nq for large n (Psaltis and Park, 1986).

7.4.5 Sequence Generator DAM

Autoassociative DAMs can be synthesized to act as sequence generators; no architectural changes are necessary for a basic DAM to behave as a sequence generator. Here, a simple sequence generator DAM (also called a temporal associative memory) is described, whose dynamics are those of the discrete DAM operating in a parallel mode with Ii = 0, i = 1, 2, ..., n. Consider a sequence Si of mi distinct patterns xi1, xi2, ..., ximi, with each pattern in {−1, +1}n. Here, the subscript i on xij and mi refers to the ith sequence; the length of this sequence is mi. This sequence is a cycle if xi1 = ximi with mi > 2. An autoassociative DAM can store the sequence Si when the DAM's interconnection matrix W is defined by

W = (1/n) [ Σ(j = 1 to mi − 1) xij+1 (xij)T + ximi (ximi)T ]    (7.4.25)

Note that the first term on the right-hand side of Equation (7.4.25) represents the normalized correlation recording recipe of the heteroassociations xij to xij+1, whereas the second term is an autocorrelation that attempts to terminate the recollection process at the last pattern of sequence Si. This DAM is also capable of storing autoassociations, by treating them as sequences of length 1 and using Equation (7.4.25); here, the first term in Equation (7.4.25) vanishes. Similarly, a cycle (i.e., a sequence Si with xi1 = ximi and mi > 2) can be stored using Equation (7.4.25) with the autocorrelation term removed.

Equation (7.4.25) can be extended for storing s distinct sequences Si by summing it over i, i = 1, 2, ..., s. Thus, a simple correlation recording recipe may be utilized for storing sequences, and this sequence generator DAM is capable of the simultaneous storage of sequences with different lengths and of cycles with different periods. Generally speaking, the capacity of the sequence generator DAM is similar to that of an autoassociative DAM if unbiased independent random vectors/patterns are assumed in all stored sequences; overloading the DAM, however, will generally lead to spurious cycles (oscillations) of period two or higher.

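The sequence recording recipe of Equation (7.4.25) can be sketched as follows. The patterns are again rows of a Hadamard matrix, chosen so that the cross-talk vanishes exactly; with generic random patterns the transitions would hold only approximately.

```python
import numpy as np

H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
H8 = np.kron(np.kron(H2, H2), H2)
seq = [H8[1], H8[2], H8[3]]       # a sequence x1 -> x2 -> x3 of length 3
n = 8

# Equation (7.4.25): heteroassociative terms plus a terminating autocorrelation.
W = sum(np.outer(seq[j + 1], seq[j]) for j in range(len(seq) - 1)) / n
W += np.outer(seq[-1], seq[-1]) / n

def step(x, W):
    """Parallel discrete DAM update with zero bias."""
    return np.where(W @ x >= 0.0, 1.0, -1.0)

x = seq[0]
x = step(x, W)                    # advances to x2
x = step(x, W)                    # advances to x3
x = step(x, W)                    # stays at x3 (the terminating term)
print(np.array_equal(x, seq[2]))
```

Each update replays the next pattern of the sequence, and the autocorrelation term makes the final pattern a fixed point.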
7.4.6 Heteroassociative DAM

A heteroassociative DAM (HDAM) is shown in the block diagram of Figure 7.4.10 (Okajima et al., 1988). It consists of two processing paths which form a closed loop. The first processing path computes a vector y in {−1, +1}L from an input x in {−1, +1}n according to the parallel update rule of Equation (7.4.26), or its serial (asynchronous) version in which one and only one unit updates its state at a given time. Here, F is usually the sgn activation operator. Similarly, the second processing path computes a vector x from y according to Equation (7.4.27). The HDAM can be operated in either parallel or serial retrieval modes. In the parallel mode, the HDAM starts from an initial state x(0) [y(0)], computes its state y (x) according to Equation (7.4.26) [Equation (7.4.27)], and then updates state x (y) according to Equation (7.4.27) [Equation (7.4.26)]. This process is iterated until convergence, i.e., until state x (or, equivalently, y) ceases to change. In the serial update mode, only one randomly chosen component of the state x or y is updated at a given time.

Various methods have been proposed for storing a set of heteroassociations {xk, yk}, k = 1, 2, ..., m, in the HDAM. In most of these methods, the interconnection matrices W1 and W2 are computed independently, by requiring that all one-pass associations xk to yk and yk to xk, respectively, are perfectly stored. Here, it is assumed that the set of associations to be stored forms a one-to-one mapping. Examples of such HDAM recording methods include the use of projection recording (Hassoun, 1989a, b) and Householder transformation-based recording (Leung and Cheung, 1991). These methods require the linear independence of the vectors xk (and also yk), for which a capacity of m = min(n, L) is achievable; otherwise, perfect storage becomes impossible. One drawback of these techniques, though, is that they do not guarantee the stability of the HDAM; because of the feedback employed, convergence to spurious cycles is possible. Empirical results show (Hassoun, 1989b) that parallel-updated projection-recorded HDAMs exhibit significant oscillatory behavior only at memory loading levels close to the HDAM capacity. Also, associative retrieval may suffer if the loading of the DAM exceeds its capacity. This implies that, for proper associative retrieval, the effective number of stored vectors must be very small compared to n.

Figure 7.4.10. Block diagram of a heteroassociative DAM.

Kosko (1987, 1988) proposed a heteroassociative memory with the architecture of the HDAM, but with the restriction W2 = W1T. This memory is known as a bidirectional associative memory (BAM). The interesting feature of a BAM is that it is stable for any choice of the real-valued interconnection matrix W = W1 and for both serial and parallel retrieval modes. This can be shown by starting from the BAM's bounded Liapunov (energy) function

E(x, y) = −yT W x    (7.4.28)

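A minimal BAM retrieval loop under the restriction W2 = W1T, with correlation recording, can be sketched as follows; the orthogonal Hadamard-row keys and outputs are an illustrative assumption that makes one-pass recall exact.

```python
import numpy as np

H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
H8 = np.kron(np.kron(H2, H2), H2)
Xk = H8[1:4]                      # three key patterns x^k (n = 8), as rows
Yk = H8[4:7]                      # three output patterns y^k (L = 8), as rows

# Correlation recording: W = sum_k y^k (x^k)^T; the backward path uses W^T.
W = Yk.T @ Xk

sgn = lambda u: np.where(u >= 0.0, 1.0, -1.0)

def bam_recall(x, W, n_iter=10):
    """Alternate forward (y = sgn(W x)) and backward (x = sgn(W^T y)) passes."""
    for _ in range(n_iter):
        y = sgn(W @ x)
        x = sgn(W.T @ y)
    return x, y

x_final, y_final = bam_recall(Xk[0].copy(), W)
print(np.array_equal(y_final, Yk[0]))
```

Because the forward and backward matrices are transposes of each other, each pass can only lower the energy E(x, y) = −yT W x, so the loop settles into a stable (x, y) pair.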
and showing that each serial or parallel state update decreases E (see Problem 7.18). One can also prove BAM stability by noting that a BAM can be converted to a discrete autoassociative DAM (discrete Hopfield DAM) with state vector x' = [xT yT]T and interconnection matrix W' given by Equation (7.4.29). Now, since W' is a symmetric zero-diagonal matrix, and since the autoassociative DAM is stable under serial update, as was discussed earlier in this chapter, the serially updated BAM is stable. One may also use this equivalence property to show the stability of the parallel-updated BAM (note that a parallel-updated BAM is not equivalent to the non-stable parallel-updated discrete Hopfield DAM; this is because either state x or state y, but not both, is updated in parallel at each step). From the above, it can be concluded that the BAM always converges to a local minimum of its energy function defined in Equation (7.4.28).

The most simple storage recipe for storing the associations {xk, yk} as BAM equilibrium points is the correlation recording recipe. This recipe guarantees the BAM requirement that the forward-path and backward-path interconnection matrices W1 and W2 be the transpose of each other. It can be shown (Wang et al., 1991) that the local minima of the BAM energy function include all those that correspond to associations {xk, yk} which are successfully loaded into the BAM (i.e., associations which are equilibria of the BAM dynamics). However, serious drawbacks of using the correlation recording recipe are low capacity and poor associative retrieval; when m random associations are stored in a correlation-recorded BAM, the condition m << min(n, L) must be satisfied if good associative performance is desired. Heuristics for improving the performance of correlation-recorded BAMs can be found in Wang et al. (1990).

Before leaving this section, it should be noted that the above models of associative memories are by no means exclusive. A number of other interesting models have been reported in the literature; interested readers may find the volume edited by Hassoun (1993) useful in this regard. Some of these models are particularly interesting because of their connections to biological memories (e.g., Kanerva, 1988 and 1993; Alkon et al.; Simpson, 1990).

7.6 Summary

This chapter introduces a variety of associative neural memories and characterizes their capacity and error correction capabilities. In particular, attention is given to recurrent associative nets with dynamic recollection of stored information.

The most simple associative memory is the linear associative memory (LAM), with correlation recording of real-valued memory patterns. Perfect storage in the LAM requires associations whose key patterns (input patterns) are orthonormal. If the stored associations are binary patterns, and if a clipping nonlinearity is used at the output of the LAM, then the orthonormal requirement on the key patterns may be relaxed to a pseudo-orthogonality requirement; in this case, the associative memory is nonlinear. Furthermore, one only needs linearly independent key patterns if the projection recording technique is used. This results in an optimal linear associative memory (OLAM), which has noise suppression capabilities.

Methods for improving the performance of LAMs, such as multiple training and adding specialized associations to the training set, are also discussed.

The remainder of the chapter deals with DAMs (mainly single-layer autoassociative DAMs), which have recurrent architectures. Among the DAM models discussed are the continuous-time continuous-state model (the analog Hopfield net), the discrete-time continuous-state model, and the discrete-time discrete-state model (Hopfield's discrete net). The stability of these DAMs is shown by defining appropriate Liapunov (energy) functions, and their stability, capacity, and associative retrieval properties are characterized. Some of these models employ simple correlation recording for memory storage. A serious shortcoming of the correlation-recorded versions of these DAMs is their inefficient memory storage capacity, especially when error correction is required. Another disadvantage of these DAMs is the presence of too many spurious attractors (or false memories), whose number grows exponentially in the size (number of units) of the DAM. Here, the presence of self-coupling (diagonal weights) is generally found to have a negative effect on DAM performance; substantial improvements in capacity and error correction capability are achieved when self-coupling is eliminated.

Improved capacity and error correction can be achieved in DAMs which employ projection recording. Several projection DAMs are discussed which differ in their state update dynamics and/or the nature of their state: continuous versus discrete. It is found that these DAMs are capable of storing a number of memories which can approach the number of units in the DAM, and that they also have good error correction capabilities.

In addition to the above DAMs, the following models are discussed: the brain-state-in-a-box (BSB) model, the non-monotonic activations model, the hysteretic activations model, the exponential capacity model, the sequence generator model, and the heteroassociative model. Some of these models still employ simple correlation recording for memory storage, yet the retrieval dynamics employed result in substantial improvement in DAM performance; this is indeed the case when non-monotonic or hysteretic activations are used. A generalization of the basic correlation DAM into a model with higher nonlinearities allows for storage of an exponential (in memory size) number of associations with "good" error correction. It is also shown how temporal associations (sequences) and heteroassociations can be handled by a simple variation of the recording recipe and an intuitive architectural extension, respectively. The chapter concludes by showing how a single-layer continuous-time continuous-state DAM can be viewed as a gradient net and applied to search for solutions to combinatorial optimization problems.

8. GLOBAL SEARCH METHODS FOR NEURAL NETWORKS

In Chapter 4, and in subsequent chapters, learning in neural networks was viewed as a search mechanism for a minimum of a multidimensional criterion (error) function, and gradient-based search methods were utilized for discovering locally optimal weight configurations for single and multiple unit nets. Also, in Section 7.5, gradient search was employed for descending on a computational energy function to reach locally minimum points/states which may represent solutions to combinatorial optimization problems. This chapter discusses search methods which are capable of finding global optima of multi-modal multidimensional functions. In particular, it discusses search methods which are compatible with neural network learning and retrieval. These methods are expected to lead to "optimal" or "near optimal" weight configurations by allowing the network to escape local minima during training. These methods can also be used to modify the gradient-type dynamics of recurrent neural nets (e.g., Hopfield's energy minimizing net), so that the network is able to escape "poor" attractors.

First, a general discussion on the difference between local and global search is presented, and a stochastic gradient descent algorithm is introduced which extends local gradient search to global search. This is followed by a general discussion of stochastic simulated annealing search for locating globally optimal solutions. Next, simulated annealing is discussed in the context of stochastic neural nets for improved retrieval and training. A mean-field approximation of simulated annealing for networks with deterministic units is then presented, which offers a substantial speedup in convergence compared to stochastic simulated annealing. The chapter also reviews the fundamentals of genetic algorithms and their application in the training of multilayer neural nets. Finally, an improved hybrid genetic algorithm/gradient search method for feedforward neural net training is presented, along with simulations and comparisons to backprop.

8.1 Local Versus Global Search

Consider the optimization problem of finding the extreme point(s) of a real-valued multidimensional scalar function (objective function) of the form y: Rⁿ → R, where the search space Ω is a compact subset of Rⁿ. An extreme point is a point x* in Ω such that y(x*) takes on its maximum (or minimum) value. Multiple extreme points may exist. In the following discussion it will be assumed, without loss of generality, that by optimization we mean minimization; thus, an extreme point of y is the "global" minimum of y. In addition to the global minimum (minima), the function y may also admit local minima. A point x* is a local minimum of y if y(x*) < y(x) for all x ≠ x* such that ||x* − x|| ≤ ε, for some ε > 0. Figure 8.1.1 illustrates the concepts of global and local minima for the univariate scalar function y(x) = x sin(1/x) for x ∈ [0.05, 0.5].

There exist several ways of determining the minima of a given function. Analytical techniques exploit Fermat's stationarity principle, which states that the gradient (or derivative, in the univariate case) of y with respect to x is zero at all minima (and maxima). Assuming y is differentiable, one can find these minima (and maxima) by solving the set of equations (possibly nonlinear) of the form ∇y(x) = 0. For such differentiable multivariate functions, a solution x* is a minimum of y if the Hessian matrix H(x*), evaluated at x*, is positive definite (i.e., xᵀH(x*)x > 0 for all x ≠ 0), or if y''(x*) > 0 in the univariate case. The global minimum may then be identified by direct evaluation of y(x*) at each of these minima. This method is theoretically sound as long as the function y is twice differentiable. In practical situations, though, the above approach is inefficient due to the computational overhead involved in its implementation on a digital computer.

Other, more efficient optimization techniques do exist, such as gradient descent. One example is the simple (steepest) gradient descent algorithm introduced in Chapter 3, which formed the basis for most of the learning rules discussed so far in this book. Given an initial "guess" x(0) (e.g., a random point in Ω), gradient descent can be expressed according to the recursive rule

x(t + 1) = x(t) − ρ ∇y(x(t))        (8.1.1)

where ρ is a small positive constant and t is a positive integer representing the iteration number. Thus, the recursive rule in Equation (8.1.1) implements a search strategy whereby a sequence of vectors

x(0), x(1), x(2), ..., x(t), ...

is generated such that

y(x(0)) ≥ y(x(1)) ≥ y(x(2)) ≥ ...

and hence at each iteration we move closer to an optimal solution. Although computationally efficient (its convergence properties and some extensions were considered in Chapter 5), gradient search methods may lead only to local minima of y which happen to be "close" to the initial search point x(0). As an example, consider the one-dimensional function y(x) = x sin(1/x) shown in Figure 8.1.1. It is clear from this figure that, given any initial search point x(0) ∈ [0.05, 0.12], the gradient descent algorithm will always converge to one of the two local minima shown. In this case, the region of the search space containing the global solution will never be explored. Hence, for local search algorithms, the quality (optimality) of the final solution is highly dependent upon the selection of the initial search point.

Figure 8.1.1. Local and global minima for the function y(x) = x sin(1/x), x ∈ [0.05, 0.5].

Global minimization (optimization) requires a global search strategy: a strategy that cannot be easily fooled by local minima. In Sections 8.2, 8.4, and 8.5 we will discuss three commonly used global search strategies. As for the remainder of this section, we shall discuss extensions which help transform a simple gradient search into a global search. For a survey of global optimization methods, the reader is referred to the book by Törn and Žilinskas (1989).
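To make this trapping behavior concrete, the following minimal sketch applies the steepest descent rule of Equation (8.1.1) to y(x) = x sin(1/x). The step size rho and the iteration count are illustrative choices, not values from the text.

```python
import math

def y(x):                     # objective function of Figure 8.1.1
    return x * math.sin(1.0 / x)

def dy(x):                    # y'(x) = sin(1/x) - cos(1/x)/x
    return math.sin(1.0 / x) - math.cos(1.0 / x) / x

def gradient_descent(x0, rho=1e-5, iters=50_000):
    x = x0
    for _ in range(iters):
        x -= rho * dy(x)      # x(t+1) = x(t) - rho * y'(x(t)),  Eq. (8.1.1)
    return x

x_a = gradient_descent(0.07)  # start in [0.05, 0.12]: settles in a shallow local minimum
x_b = gradient_descent(0.20)  # start in the basin of the global minimum near 0.223
```

Starting anywhere in [0.05, 0.12], the iterate settles at one of the two shallow local minima (near x = 0.058 or x = 0.091) and never explores the basin of the global minimum near x = 0.223.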

8.1.1 A Gradient Descent/Ascent Search Strategy

Using intuition, and motivated by the saying "There is a valley behind every mountain," we may employ gradient descent/ascent in a simple search strategy which allows the discovery of global minima when the search space is one-dimensional. Assuming a univariate objective function y(x), x ∈ [a, b] with a < b, start with x(0) = a and use gradient descent search to reach the first local minimum x₁*, and save the value y(x₁*). Next, proceed by ascending the function y [using Equation (8.1.1) with the negative sign replaced by a positive sign] starting from the initial point x₁* + Δx, where Δx is a sufficiently small positive constant. Continue ascending until a local maximum is reached. At this point, switch back to gradient descent, starting from the current point (maximum) perturbed by Δx, until convergence to the second local minimum x₂*, and save the value y(x₂*). This full search cycle is repeated until the search reaches the point x = b. Thus, all minima of y over x ∈ [a, b] have been obtained, and the global minimum is the saved point with the smallest value of y. This strategy will always lead to the global minimum when y is a function of a single variable.

One may also use this strategy for multidimensional functions, though the search is then not guaranteed to find the global minimum. This is because when the search space has two or more dimensions, the perturbation Δx is now a vector, and it is unclear how to set the direction of Δx so that the search point visits all existing minima. In fact, the above gradient descent/ascent strategy may get caught in a repeating cycle, where the same local minimum and maximum solutions are repeatedly found. The reader is encouraged to stare at Figure 8.1.2, which depicts a plot of a two-variable function, to see how (starting from a local minimum point) the choice of Δx affects the sequence of visited peaks and valleys. For problems of moderate to high dimensions, an example of a search strategy which would be useful for finding some "good" (sufficiently deep) local minima may be given as follows:

1. Start the initial search point at a corner of the search region Ω.
2. Randomly vary the direction of the perturbation Δx after each local minimum (maximum) is found, such that Δx points in a random direction away from the current minimum (maximum).

Figure 8.1.2. A plot of a two-variable function showing multiple minima, maxima, and saddle points.

A similar global search strategy is one which replaces the gradient ascent step in the above descent/ascent search strategy by a "tunneling step." Here, tunneling is used to move the search point away from the current minimum to another point in its vicinity, such that the new point is (hopefully) on a surface of the search landscape which leads into a deeper minimum. This idea is similar to the one embodied in the global descent search strategy discussed in Chapter 5.
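A minimal one-dimensional realization of the descent/ascent cycle described above is sketched below for y(x) = x sin(1/x) on [a, b] = [0.05, 0.5]. The step size, perturbation Δx, tolerance, and iteration cap are all illustrative assumptions.

```python
import math

def y(x):
    return x * math.sin(1.0 / x)

def dy(x):                                   # y'(x) = sin(1/x) - cos(1/x)/x
    return math.sin(1.0 / x) - math.cos(1.0 / x) / x

def descent_ascent(a=0.05, b=0.5, rho=1e-5, dx=1e-3, tol=1e-6,
                   max_iter=1_000_000):
    """Sweep from a to b, alternating gradient descent and ascent,
    and record every local minimum found along the way."""
    minima = []
    x, descending = a, True
    for _ in range(max_iter):
        if x >= b:
            break
        g = dy(x)
        if abs(g) < tol:                     # reached a stationary point
            if descending:
                minima.append(x)             # save the local minimum
            descending = not descending      # switch descent <-> ascent
            x += dx                          # small push past the point
        else:
            x += -rho * g if descending else rho * g
    return min(minima, key=y), minima        # deepest saved minimum wins

best, found = descent_ascent()   # found is approximately [0.058, 0.091, 0.223]
```

The sweep visits the three minima of Figure 8.1.1 in order and returns the deepest one, near x = 0.223.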

8.1.2 Stochastic Gradient Search: Global Search via Diffusion

The above search methods are deterministic in nature. Stochastic (non-deterministic) ideas may also be incorporated in gradient search-based optimization, thus leading to stochastic gradient algorithms. These methods utilize noise to perturb the function y(x) being minimized in order to avoid being trapped in "bad" local minima. In order for the stochastic method to converge to a "good" solution, though, this noise should be introduced appropriately and then subsequently removed. In fact, for problems of moderate to high dimensions, the only feasible methods for global optimization are stochastic in nature (Schoen, 1991).

In its most basic form, the stochastic gradient algorithm performs gradient descent search on a perturbed form ỹ(x, N) of the function y(x), where the perturbation is additive in nature:

ỹ(x, N) = y(x) + c(t) Nᵀx        (8.1.2)

Here, N = [N₁, N₂, ..., Nₙ]ᵀ is a vector of independent noise sources, and c(t) is a parameter which controls the magnitude of the noise. To achieve the gradual reduction in noise mentioned above, c(t) must be selected in such a way that it approaches zero as t tends to infinity. A simple choice for c(t) is given by (Cichocki and Unbehauen, 1993)

c(t) = β e^(−αt)        (8.1.3)

with β > 0 and α ≥ 0. Substituting the perturbed function ỹ(x, N) in the gradient descent rule in Equation (8.1.1) gives the stochastic gradient rule

x(t + 1) = x(t) − ρ [∇y(x(t)) + c(t)N(t)]        (8.1.4)

Unlike the gradient search rule in Equation (8.1.1), the stochastic rule in Equation (8.1.4) may allow the search to escape local minima. Note that for zero-mean, statistically independent noise, the present search method will follow, on average, the gradient of y. The stochastic gradient rule in Equation (8.1.4) may be applied to any function y as long as the gradient information can be determined or estimated; during the search process the perturbations to y are gradually removed, so that the effective function being minimized becomes exactly y prior to reaching the final solution.

The probability that stochastic gradient search leads to the global minimum solution critically depends on the functional form of the noise amplitude schedule c(t). In the exponential schedule in Equation (8.1.3), the coefficient β controls the amplitude of the noise and α determines the rate of damping. A sufficiently large β should be used in order for the search to explore a large range of the search space. For large α, the stochastic effects decay very rapidly and prematurely reduce the search to the simple deterministic gradient search, thus increasing the probability of reaching a suboptimal local solution.

The stochastic gradient algorithm in Equation (8.1.4) is inspired by the dynamics of the diffusion process in physical phenomena, such as atomic migration in crystals or chemical reactions. The dynamics of the diffusion process across potential barriers involve a combination of gradient descent and random motion according to the Smoluchowski-Kramers equation (Aluffi-Pentini et al., 1985), which may be written in the form

dx/dt = −∇E(x) + (2kT/m)^(1/2) n(t)        (8.1.5)

where E(x) is the potential energy, k is the Boltzmann constant, T is the absolute temperature, m is the reduced mass, and n(t) is a white noise process. The discrete-time version of Equation (8.1.5) with annealed temperature leads to the stochastic gradient rule in Equation (8.1.4), which is thus essentially a simulation of an annealed diffusion process. Geman and Hwang (1986) [see also Chiang et al. (1987)] developed a method for global optimization based on Equation (8.1.5), with the temperature T made inversely proportional to the logarithm of positive time t, for almost guaranteed convergence to the global minimum. Convergence analysis of a slightly modified version of Equation (8.1.4) can be found in Gelfand and Mitter (1991).
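The rule in Equation (8.1.4), with the exponential schedule of Equation (8.1.3), can be sketched for y(x) = x sin(1/x) as follows. The step size rho and the restriction of the search to 0 < x ≤ 1 are illustrative assumptions; β = 200 and α = 0.02 are representative schedule values.

```python
import math
import random

def y(x):
    return x * math.sin(1.0 / x)

def dy(x):                                    # y'(x) = sin(1/x) - cos(1/x)/x
    return math.sin(1.0 / x) - math.cos(1.0 / x) / x

def stochastic_gradient(x0=0.07, beta=200.0, alpha=0.02, rho=1e-4,
                        iters=50_000, seed=0):
    rng = random.Random(seed)
    x = x0
    for t in range(iters):
        c = beta * math.exp(-alpha * t)       # decaying noise amplitude (8.1.3)
        n = rng.gauss(0.0, 1.0)               # N(t): zero-mean, unit-variance noise
        x -= rho * (dy(x) + c * n)            # stochastic gradient rule (8.1.4)
        x = min(max(x, 1e-3), 1.0)            # keep x away from the pole at x = 0
    return x
```

Early on, the large noise amplitude lets the search point jump between basins; once c(t) has decayed, the rule reduces to plain gradient descent and settles into whatever minimum it then occupies (a deep minimum is likely, but not guaranteed).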

Example 8.1.1: Let us use the stochastic gradient rule in Equation (8.1.4) to search for minima of the function y(x) = x sin(1/x). Differentiating y(x) and substituting in Equation (8.1.4) leads to the search rule

x(t + 1) = x(t) − ρ [sin(1/x(t)) − (1/x(t)) cos(1/x(t)) + c(t)N(t)]

where c(t) = β exp(−αt) and N(t) is normally distributed random noise with zero mean and unity variance. This function is even; thus, for each minimum x* > 0 there exists a symmetric (with respect to the vertical axis) minimum at −x*. The function has three minima in the region x ∈ [0.05, 0.5], which are located approximately at 0.058, 0.091, and 0.223, and it has an infinite number of local minima which become very dense in the region closest to x = 0. The global minima of y are approximately at 0.223 and −0.223.

Initial simulations are performed to test for proper setting of the parameters of c(t). Figure 8.1.3 shows two trajectories of the search point x for β = 50 (dashed line) and β = 200 (solid line), with α = 0.02 and x(0) = 0.07; the same noise sequence N(t) is used in computing these trajectories. The search with β = 50 converged to the local minimum x* ≈ 0.091, while the search with β = 200 converged to the global minimum x* ≈ 0.223. In most simulations with x(0) close to zero, searches with large β lead to "deeper" minima compared to searches with small β (β < 100). For example, for β < 100 and x(0) = 0.07, the local minima at 0.058 and 0.091 are reached (some simulations also led to the minima at −0.058 and −0.091). Values of β > 100 allowed the search to converge to "deep" minima of y(x); for β > 200, the search has a very good chance of converging to the global minimum at 0.223 or its neighboring minimum at −0.223. Small values of α are desirable, since they allow the stochastic search to explore a sufficiently large number of points on the search surface. On the other hand, very small values of α lead to a very slow convergence process. Thus, the coefficient α needs to be chosen such that we strike a balance between the desire for fast convergence and the need to ensure an optimal (or near optimal) solution.

Figure 8.1.3. Search trajectories generated by the stochastic gradient search method of Equation (8.1.4) for the function y(x) = x sin(1/x), converging to the local minimum x* ≈ 0.091 for a noise gain coefficient β = 50 (dashed line) and to the global minimum x* ≈ 0.223 for β = 200 (solid line).

The stochastic gradient rule in Equation (8.1.4) can be applied to all of the gradient descent-based learning rules for both single and multilayer feedforward neural nets. This type of weight update is sometimes referred to as Langevin-type learning (Heskes and Kappen, 1993b). For example, Langevin-type backprop is easily formulated by adding the decaying noise terms cl(t)N(t) and cj(t)N(t) to the right-hand side of the batch versions of the backprop weight update equations of Chapter 5 for the output and hidden layers, respectively. The subscripts l and j imply the possibility of using different noise magnitude schedules for the output and hidden layers and/or units. The training of multilayer nets according to Langevin-type backprop can be more computationally effective than deterministic gradient descent (i.e., batch backprop); this is because a fast suboptimal schedule for c(t) can be used and still lead (on the average) to better solutions than gradient descent. It should be noted here that the incremental update version of backprop may also be viewed as a stochastic gradient rule. Here, due to the nature of the minimization of an "instantaneous" error and the random presentation order of the training vectors, the stochasticity is intrinsic to the gradient itself, as opposed to being artificially introduced, as in Langevin-type learning. In fact, stochastic gradient search has a better chance of escaping local minima and areas of shallow gradient, which may allow it to converge faster (Hoptroff and Hall, 1989).

The success of global search methods in locating a globally optimal solution (say, a global minimum) of a given function y(x) over x ∈ Ω hinges on a balance between an exploration process, a guidance process, and a convergence-inducing process. The exploration process gives the search a mechanism for sampling a sufficiently diverse set of points x in Ω; this exploration process is usually stochastic in nature. The guidance process is an explicit or implicit process which evaluates the relative "quality" of search points (e.g., two consecutive search points) and biases the exploration process to move towards regions of high-quality solutions in Ω. Finally, the convergence-inducing process assures the ultimate convergence of the search to a fixed solution x*. The dynamic interaction among the above three processes is thus responsible for giving the search process its global optimizing character.

As an exercise, one might consider identifying the above three processes in the global optimization method of stochastic gradient descent presented in the previous section. There, the exploration process is realized by the noise term in Equation (8.1.4), and the convergence-inducing process is realized effectively by the noise amplitude schedule c(t) and by the gradient term ∇y(x). However, the search guidance process is not readily identifiable; this method lacks an effective guidance process, and the only guidance available is the local guidance due to the gradient term. Note that gradient-based guidance is only effective when the function y(x) being minimized has its global minimum (or a near optimal minimum) located at the bottom of a wide "valley" relative to other shallow local minima. A good example of such a function is y(x) = x sin(1/x).

Another stochastic method for global optimization is (stochastic) simulated annealing (Kirkpatrick et al., 1983; Kirkpatrick, 1984; Aarts and Korst, 1989). This method does not use gradient information explicitly and thus is applicable to a wider range of functions (specifically, functions whose gradients are expensive to compute or functions which are not differentiable) compared to the stochastic gradient method.

8.2 Simulated Annealing-Based Global Search

Simulated annealing is analogous to the physical behavior of annealing a molten metal (Metropolis et al., 1953; Kirkpatrick et al., 1983). Above its melting temperature, a metal enters a phase where its atoms (particles) are positioned at random according to statistical mechanics. As with all physical systems, the particles of the molten metal seek minimum energy configurations (states) if allowed to cool. A minimum energy configuration means a highly ordered state, such as a defect-free crystal lattice; the defect-free crystal state corresponds to the global minimum energy configuration. In order to achieve the defect-free crystal, the metal is annealed: first the metal is heated above its melting point, and then it is cooled slowly until the metal solidifies into a "perfect" crystalline structure. Slow cooling (as opposed to quenching) is necessary to prevent dislocations of atoms and other crystal lattice disruptions.

Next, a brief presentation of related statistical mechanics concepts is given, which will help us understand the underlying principles of the simulated annealing global optimization method. Statistical mechanics is the central discipline of condensed matter physics; it deals with the behavior of systems with many degrees of freedom in thermal equilibrium at a finite temperature [for a concise presentation of the basic ideas of statistical mechanics, see Schrödinger (1946)]. The starting point of statistical mechanics is an energy function E(x) which measures the thermal energy of a physical system in a given state x, where x belongs to a set of possible states Ω.

A fundamental result from physics is that at thermal equilibrium each of the possible states x occurs with probability

P(x) = (1/Z) e^(−E(x)/kT)        (8.2.1)

where k is Boltzmann's constant, T is the absolute temperature, and the denominator Z is a normalizing constant which restricts P(x) between zero and one. Equation (8.2.1) is known as the Boltzmann-Gibbs distribution. If the system's absolute temperature T is not zero (i.e., T > 0), then the state x will vary in time, causing E to fluctuate. Being a physical system, though, the system will evolve its state in an average direction corresponding to that of decreasing energy E. This continues until no further decrease in the average of E is possible, which indicates that the system has reached thermal equilibrium. Note that transitions from low to high energy states are possible, except when T = 0.

Now define a set of transition probabilities W(x → x') from a state x into a state x'. What is the condition on W so that the system may reach and then remain in thermal equilibrium? A sufficient condition for maintaining equilibrium is that the average number of transitions from x to x' and from x' to x be equal:

P(x) W(x → x') = P(x') W(x' → x)        (8.2.2)

or, by dividing by W(x' → x) and using Equation (8.2.1),

W(x → x') / W(x' → x) = e^(−ΔE/kT)        (8.2.3)

where ΔE = E(x') − E(x). In simulating physical systems, a common choice is the Metropolis state transition probability (Metropolis et al., 1953), where

W(x → x') = e^(−ΔE/kT) if ΔE > 0, and W(x → x') = 1 if ΔE ≤ 0        (8.2.4)

This choice has the advantage of making more transitions to lower energy states than other admissible choices satisfying Equation (8.2.3), and therefore reaches equilibrium more rapidly.

Simulated annealing optimization for finding global minima of a function y(x) borrows from the above theory. Here, a function y of discrete or continuous variables can be thought of as the energy function E in the above analysis. Simulated annealing introduces artificial thermal noise which is gradually decreased in time; this noise is controlled by a new parameter T which replaces the constant kT in Equation (8.2.4). Noise allows occasional hill-climbing interspersed with descents. The idea is to apply uniform random perturbations Δx to the search point x and then determine the resulting change Δy = y(x + Δx) − y(x). If the value of y is reduced (i.e., Δy < 0), the new search point x' = x + Δx is adopted; on the other hand, if the perturbation leads to an increase in y (i.e., Δy > 0), the new search point x' may or may not be adopted. Two operations are thus involved in simulated annealing: a thermostatic operation which schedules decreases in the "temperature," and a stochastic relaxation operation which iteratively finds the equilibrium solution at the new temperature, using the final state of the system at the previous temperature as a starting point.
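The detailed balance condition (8.2.2) and the Metropolis rule (8.2.4) can be checked numerically on a toy two-state system; the energies and the value of kT below are arbitrary illustrative choices.

```python
import math

def metropolis(dE, kT):
    """Metropolis transition probability (8.2.4) for an energy change dE."""
    return 1.0 if dE <= 0 else math.exp(-dE / kT)

kT = 0.7
E = {"x": 1.0, "xp": 2.5}                       # energies of two toy states
Z = sum(math.exp(-e / kT) for e in E.values())  # normalizing constant
P = {s: math.exp(-E[s] / kT) / Z for s in E}    # Boltzmann-Gibbs (8.2.1)

flow_up = P["x"] * metropolis(E["xp"] - E["x"], kT)   # x  -> x' transitions
flow_dn = P["xp"] * metropolis(E["x"] - E["xp"], kT)  # x' -> x  transitions
assert abs(flow_up - flow_dn) < 1e-12           # equilibrium is maintained
```

The average flow of transitions in each direction is equal, so a system driven by the Metropolis rule remains in the equilibrium distribution (8.2.1).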

The determination of whether to accept x' or not is stochastic. Generally speaking, for large values of T the probability of an uphill move in y is large, and as T decreases fewer uphill moves are allowed; for small T the probability of an uphill move is low, so there is little danger of jumping out of a local minimum and falling into a worse one. This leads to an effective guidance of the search, since the uphill moves are done in a controlled fashion. The following is a step-by-step statement of a general-purpose simulated annealing optimization algorithm for finding the global minimum of a multivariate function y(x):

1. Initialize x to an arbitrary point in Ω. Initialize T at a sufficiently large value and choose a "cooling" schedule for T.
2. Compute x' = x + Δx, where Δx is a small uniform random perturbation.
3. Compute Δy = y(x') − y(x).
4. Use the Metropolis transition rule for deciding whether to accept x' as the new search point (or else remain at x). That is, if Δy < 0, the search point becomes x'; otherwise, accept x' as the new point with a transition probability W(x → x') = e^(−Δy/T). For this purpose, select a uniform random number a between zero and one: if W(x → x') > a, then x' becomes the new search point, otherwise the search point remains at x.
5. Repeat steps 2 through 4 until the system reaches an equilibrium. Equilibrium is reached when the number of accepted transitions becomes insignificant, which happens when the search point is at or very close to a local minimum. In practice, steps 2 through 4 may simply be repeated for a fixed prespecified number of cycles.
6. Update T according to the annealing schedule chosen in step 1 and repeat steps 2 through 5. Stop when T reaches zero.

The effectiveness of simulated annealing in locating global minima and its speed of convergence critically depend on the choice of the cooling schedule for T. Generally speaking, if the cooling schedule is too fast, a premature convergence to a local minimum might occur, and if it is too slow, the algorithm will require an excessive amount of computation time to converge. It has been found theoretically (Geman and Geman, 1984) that T must be reduced very slowly, in proportion to the inverse log of time (processing cycle),

T(t) = T₀ / log(1 + t)        (8.2.5)

to guarantee that the simulated annealing search converges almost always to the global minimum. Unfortunately, such a schedule is extremely slow. In practice, a suboptimal solution is sometimes sufficient, and faster cooling schedules may be employed; for example, one may even try a schedule of the form T(t) = γ T(t − 1), where 0.85 ≤ γ ≤ 0.98, which reduces the temperature exponentially fast. The problem of accelerating simulated annealing search has received increased attention, and a number of methods have been proposed to accelerate the search (e.g., Szu, 1986; Salamon et al., 1988). Because of its generality as a global optimization method, simulated annealing has been applied to many optimization problems. Example applications can be found in Geman and Geman (1984), Sontag and Sussmann (1985), Ligthart et al. (1986), Johnson et al. (1989, 1991), Romeo (1989), and Rutenbar (1989).
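The six steps above can be sketched as follows for y(x) = x sin(1/x) on [0.05, 0.5], using the exponential schedule T(t) = γT(t − 1). The values of T₀, γ, the perturbation width, and the cycle counts are illustrative choices.

```python
import math
import random

def y(x):
    return x * math.sin(1.0 / x)

def anneal(a=0.05, b=0.5, T0=1.0, gamma=0.9, cycles=200, moves_per_T=200,
           seed=0):
    rng = random.Random(seed)
    x = rng.uniform(a, b)                      # step 1: arbitrary start, T = T0
    T = T0
    for _ in range(cycles):                    # step 6: cooling loop
        for _ in range(moves_per_T):           # step 5: fixed number of cycles per T
            xp = x + rng.uniform(-0.02, 0.02)  # step 2: uniform random perturbation
            xp = min(max(xp, a), b)            # keep the point inside [a, b]
            dy = y(xp) - y(x)                  # step 3
            # step 4: Metropolis acceptance rule
            if dy < 0 or rng.random() < math.exp(-dy / T):
                x = xp
        T *= gamma                             # exponential schedule T <- gamma*T
    return x

x_star = anneal()   # often ends near the global minimum x ~ 0.223 (not guaranteed)
```

With a slow enough schedule, the controlled uphill moves at high T let the point leave the shallow minima near 0.058 and 0.091, while the final low-temperature cycles lock it into a deep minimum.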

By now, the reader might be wondering about the applicability of simulated annealing to neural networks. Simulated annealing may be easily applied to the training of deterministic multilayer feedforward nets: one can simply interpret the error function E(w) [e.g., the SSE function or the "batch" version of the entropy function of Chapter 5] as the multivariate scalar function y(x) in the above algorithm. However, because of its intrinsically slow search speed, simulated annealing should only be considered, for the purpose of global retrieval and/or optimal learning, if a global or near global solution is desired and if one suspects that E is a complex multimodal function.

Simulated annealing is also relevant to recurrent nets used for optimization. In combinatorial optimization problems, one is interested in finding a solution vector x ∈ {−1, +1}ⁿ (or {0, 1}ⁿ) which best minimizes an objective function y(x). When y(x) is quadratic, the continuous Hopfield net may be used to search for local minimizers of y(x); this is achieved by first mapping y(x) onto the quadratic energy function of the Hopfield net and then using the net as a gradient descent algorithm to minimize E(x), thus minimizing y(x) (see Section 7.5 for details). Since simulated annealing is a global optimization method, one might be tempted to consider its use to enhance the convergence to optimal minima of gradient-type nets such as the Hopfield net used for combinatorial optimization. Rather than using the suboptimal Hopfield net approach, the desired global minimum of y(x) may be obtained by a direct application of the simulated annealing method of Section 8.2. On the other hand, the promise of fast analog hardware implementations of the Hopfield net leads us to take another look at the network optimization approach, but with modifications which improve convergence to global solutions. As it turns out, simulated annealing can be naturally mapped onto recurrent neural networks with stochastic units, as described next.

8.3 Simulated Annealing for Stochastic Neural Networks

In a stochastic neural network, the units have non-deterministic activation functions, as discussed in Chapter 3. Here, units behave stochastically, with the output (assumed to be bipolar binary) of the ith unit taking the value xi = +1 with probability f(neti) and the value xi = −1 with probability 1 − f(neti), where neti is the weighted sum (net input) of unit i and f is the sigmoid

f(neti) = P(xi = +1) = 1 / (1 + e^(−2β neti))        (8.3.1)

The parameter β in Equation (8.3.1) controls the steepness of the sigmoid f(net) at net = 0; we may think of β as a reciprocal pseudo-temperature, β = 1/T. When the "temperature" T approaches zero, the sigmoid becomes a step function and the stochastic unit becomes deterministic. There are several possible choices of f which could have been made in Equation (8.3.1), but the choice of the sigmoid function is motivated by statistical mechanics [Glauber, 1963; Little, 1974; see also Amari (1971)], where the units behave stochastically exactly like the spins in an Ising model of a magnetic material in statistical physics [for more details on the connections between stochastic units and the Ising model, the reader is referred to Hinton and Sejnowski (1983), Peretto (1984), Amit (1989), and Hertz et al. (1991)]. Equation (8.3.1) may also be derived based on observations relating to the stochastic nature of the post-synaptic potential of biological neurons (Shaw and Vasudevan, 1974). Here, the neuron may be approximated as a linear threshold gate (LTG) with zero threshold; this sharp threshold is then "softened" in a stochastic way, with the net input (post-synaptic potential) being a Gaussian random variable, thus making the unit stochastic (this derivation is explored in one of the problems at the end of this chapter). As is shown in the next section, stochastic neural nets with units described by Equation (8.3.1) and with controllable temperature T form a natural substrate for the implementation of simulated annealing-based optimization.
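The stochastic unit of Equation (8.3.1) can be sketched as follows; the form f(net) = 1/(1 + exp(−2β net)), with β = 1/T, is the one assumed here.

```python
import math
import random

def f(net, beta):
    """Sigmoid of Equation (8.3.1): probability that the unit outputs +1."""
    return 1.0 / (1.0 + math.exp(-2.0 * beta * net))

def stochastic_unit(net, beta, rng):
    """Sample a bipolar output x in {-1, +1} for a given net input."""
    return 1 if rng.random() < f(net, beta) else -1

rng = random.Random(0)
# As beta -> infinity (T -> 0) the unit approaches the deterministic
# threshold unit sgn(net): for positive net, nearly every sample is +1.
samples = [stochastic_unit(0.5, beta=100.0, rng=rng) for _ in range(20)]
```

At net = 0 the unit is maximally random (outputs +1 with probability 1/2 at any temperature), while large β makes the sigmoid effectively a step function.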

8.3.1 Global Convergence in a Stochastic Recurrent Neural Net: The Boltzmann Machine

The following is an optimization method based on an efficient way of incorporating simulated annealing search in a discrete Hopfield net; this method, though, is only suitable for quadratic functions. Consider the discrete-state discrete-time recurrent net (discrete Hopfield model) of Chapter 7, with ith unit dynamics repeated here for convenience:

xi(t + 1) = sgn(neti(t)) = sgn( Σj wij xj(t) )        (8.3.2)

This deterministic network has a quadratic Liapunov function (energy function) if its weight matrix is symmetric with positive (or zero) diagonal elements and if serial state update is used; this net will always converge to one of the local minima of its energy function E(x), given by

E(x) = −(1/2) xᵀWx        (8.3.3)

whose minima are the stable states x* characterized by xi* = sgn(neti*).

Now, consider a stochastic version of the above net, referred to as the stochastic Hopfield net, where we replace the deterministic threshold units by stochastic units according to Equation (8.3.1). By employing Equation (8.3.1) and assuming "thermal" equilibrium, one may find the transition probability from xi to −xi (i.e., the probability to flip unit i from +1 to −1, or vice versa) as

P(xi → −xi) = 1 / (1 + e^(ΔEi/T))        (8.3.4)

Equation (8.3.4) is obtained by using Equation (8.3.1) together with the energy function in Equation (8.3.3), which gives the energy change due to the flip as ΔEi = 2 xi neti. Note that if T = 0, the net reduces to the stable deterministic Hopfield net: the probability of a transition which increases the energy E(x) becomes zero, and that of a transition which decreases E(x) becomes one. For T > 0, a transition which increases E(x) is allowed, but with a probability which is smaller than that of a transition which decreases E(x). The transition probabilities in Equation (8.3.4) give a complete description of the stochastic sequence of states in the stochastic Hopfield net. This last observation, coupled with the requirement that one and only one stochastic unit (chosen randomly and uniformly) is allowed to flip its state at a given time, guarantees that "thermal" equilibrium will be reached for any T > 0: the stochastic net will reach an equilibrium state, where the average value of E is a constant, when T is held fixed for a sufficiently long period of time. It may also be concluded that the serially updated stochastic Hopfield net with the stochastic dynamics in Equation (8.3.1) is stable (in an average sense) for T ≥ 0, as long as its interconnection matrix is symmetric with positive diagonal elements.

The above stochastic net is usually referred to as the Boltzmann machine because, at equilibrium, the probability of the states of the net is given by the Boltzmann-Gibbs distribution of Equation (8.2.1). If a slowly decreasing temperature T is used, with an initially large value, the stochastic Hopfield net becomes equivalent to a simulated annealing algorithm: when initialized at a random binary state x(0), the net will perform a stochastic global search seeking the global minimum of E(x). At the beginning of the computation, a higher temperature should be used so that it is easier for the states to escape from local minima; as the computation proceeds, the temperature is gradually decreased according to a pre-specified cooling schedule. Finally, as the temperature approaches zero, the state, now placed (hopefully) near the global minimum, will converge to this minimum.

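The serial annealing search described above may be sketched as follows. This is an illustrative sketch, not the author's code: the geometric cooling schedule, the parameter values, and the function names are assumptions, while the flip probability follows Equation (8.3.4) (the sketch assumes a symmetric weight matrix with zero diagonal):

```python
import numpy as np

def energy(W, x):
    """Quadratic energy E(x) = -1/2 x'Wx of Equation (8.3.3)."""
    return -0.5 * float(x @ W @ x)

def stochastic_hopfield_anneal(W, T0=10.0, alpha=0.95, sweeps=200, rng=None):
    """Serial stochastic Hopfield search: flip one randomly chosen unit at a
    time with probability 1/(1 + exp(dE/T)), cooling T geometrically."""
    rng = np.random.default_rng(rng)
    n = W.shape[0]
    x = rng.choice([-1, 1], size=n)          # random initial binary state x(0)
    T = T0
    for _ in range(sweeps):
        for i in rng.permutation(n):         # serial update, random order
            dE = 2.0 * x[i] * (W[i] @ x)     # energy change if unit i flips
            z = min(dE / T, 700.0)           # clip to avoid overflow in exp
            if rng.random() < 1.0 / (1.0 + np.exp(z)):
                x[i] = -x[i]
        T *= alpha                           # pre-specified cooling schedule
    return x
```

At the end of the schedule, the returned state is a local (and, hopefully, the global) minimum of E(x).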
whose minima are the stable states x* characterized by xi* = sgn(neti*), i = 1, 2, ..., n.

8.3.2 Learning in Boltzmann Machines

In the following, statistical mechanics ideas are extended to learning in stochastic recurrent networks, or "Boltzmann learning" (Hinton and Sejnowski, 1983, 1986; Ackley et al., 1985). These networks consist of n arbitrarily interconnected stochastic units, where the state xi of the ith unit is +1 or −1 with probability f(neti xi) as in Equation (8.3.1). The net activity at unit i is given by neti = Σj wij xj, where thresholds (biases) are omitted for convenience and self-feedback is not allowed (wii = 0). The units are interconnected in an arbitrary way, but whatever the interconnection pattern, connections between any two units (if they exist) must be symmetric: wij = wji. This leads to an energy function for this net given by

(8.3.6)    E(x) = −(1/2) Σi Σj wij xi xj

The units are divided into visible and hidden units, as shown in Figure 8.3.1. The visible units may (but need not) be further divided into input units and output units. The hidden units have no direct inputs from, nor do they supply direct outputs to, the outside world. The presence of hidden units has the advantage of theoretically allowing the net to represent (learn) arbitrary mappings/associations; we may thus view this net as an extension of the stochastic Hopfield net to include hidden units.

Figure 8.3.1. A stochastic net with visible and hidden units. The visible units may be further divided into input and output units as illustrated.

Because of the symmetric connections and, hence, the existence of an energy function, and by employing Equation (8.3.1), we find that the states of the present stochastic net are governed at "thermal" equilibrium by the Boltzmann-Gibbs distribution. According to the discussion in Section 8.2, Equation (8.3.5) can be easily derived, where x and x' are two states in {−1, +1}^n which differ in only one bit; it is the transition probability of Equation (8.3.4), adapted for the present network/energy function:

(8.3.5)    P(x → x') = 1 / (1 + exp([E(x') − E(x)]/T))

Before proceeding any further, note that, thus far, we have designed a stable stochastic net (Boltzmann machine) which is capable of reaching the global minimum (or a good suboptimal minimum) of its energy function by slowly decreasing the pseudo-temperature T, starting from a sufficiently high temperature.

Next, the derivation of a Boltzmann learning rule is presented. This rule is derived by minimizing a measure of the difference between the probability of finding the states of the visible units in the freely running net and a set of desired probabilities for these states; that is, the weights wij are adjusted to give the states of the visible units a particular desired probability distribution. We denote the states of the visible units by the activity pattern α and those of the hidden units by β. The vector x still represents the state of the whole network, x = (α, β), and it belongs to the set of 2^(N+K) = 2^n possible network states, where N and K are the numbers of visible and hidden units, respectively. Thus, the set A of all visible patterns has a total of 2^N members (state configurations), and the set G of all hidden patterns has 2^K members. Similarly, E(x) = E(α, β) is the energy of the network when the visible units are in state α and the hidden units are jointly in state β, and P(x) = P(α, β) denotes the corresponding joint probability. The probability P(α) of finding the visible units in state α, irrespective of β, is then obtained as

(8.3.7)    P(α) = Σβ P(α, β)

At "thermal" equilibrium,

(8.3.8)    P(α, β) = (1/Z) exp(−E(α, β)/T)

The term Z is the normalizing denominator in Equation (8.3.8) and should be interpreted as Σx exp(−E(x)/T), where the sum is taken over Ω = {−1, +1}^n, the set of all possible states. Equation (8.3.8) thus gives the actual probability P(α) of finding the visible units in state α in the freely running network at "thermal" equilibrium. This probability is determined by the weights wij. Now, assume that we are given a set of desired probabilities R(α) for the visible states, independent of the wij's. In Boltzmann learning, the objective is to bring the distribution P(α) as close as possible to R(α) by adjusting the wij's. A suitable measure of the difference between P(α) and R(α) is the relative entropy H (see Section 5):

(8.3.9)    H = Σα R(α) ln [ R(α) / P(α) ]

which is positive or zero (H is zero only if R(α) = P(α) for all α). Therefore, we may arrive at a learning equation by performing gradient descent on H:

(8.3.10)    Δwij = −η ∂H/∂wij

where η is a small positive constant. Using Equations (8.3.6) through (8.3.8) and recalling that wij = wji, the required gradient may be computed as follows.

Differentiating P(α) with respect to wij gives

(8.3.11)    ∂P(α)/∂wij = (1/T) [ Σβ P(α, β) xi xj − P(α) Σx P(x) xi xj ]

where xi (xj) should be interpreted as the state of unit i (j), given that the visible units are in state α and the hidden units are jointly in state β. Substituting Equation (8.3.11) into the derivative of Equation (8.3.9), where the ratio P(α, β)/P(α) was replaced by the term P(β|α) according to Bayes' rule, and noting that the quantity Σx P(x) xi xj is the average <xi xj> in the freely running net, we find

(8.3.12)    ∂H/∂wij = −(1/T) ( <xi xj>^clamped − <xi xj>^free )

Thus, substituting Equation (8.3.12) in (8.3.10) leads to the Boltzmann learning rule:

(8.3.13)    Δwij = (η/T) ( <xi xj>^clamped − <xi xj>^free )

where

(8.3.14)    <xi xj>^clamped = Σα R(α) Σβ P(β|α) xi xj

is used, and <xi xj>^free = Σx P(x) xi xj. Here, P(β|α) should be interpreted as the probability of finding the hidden units in state β, given that the network is operating in its clamped condition, i.e., with the visible units clamped in state α. Equation (8.3.14) thus represents the value of <xi xj> when the visible units are clamped in state α, averaged over the α's according to their probabilities R(α).

The first term in the right-hand side of Equation (8.3.13) is essentially a Hebbian learning term, computed with the visible units clamped, while the second term corresponds to anti-Hebbian learning with the system free running. Note that learning converges when the free unit-unit correlations are equal to the clamped ones.

It is very important that the correlations in the Boltzmann learning rule be computed when the system is in thermal equilibrium at temperature T > 0, since the derivation of this rule hinges on the Boltzmann-Gibbs distribution of Equation (8.3.8). At equilibrium, the state x fluctuates, and we measure the correlations <xi xj> by taking a time average of xi xj. This must be done twice: once with the visible units clamped in each of their states α for which R(α) is non-zero, and once with the α's unclamped. Thermal equilibrium must be reached for each of these computations.

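The two-phase estimation of the clamped and free correlations can be sketched as below. This is a simplified illustration, not the author's procedure: it assumes equally probable training patterns R(α), a single fixed temperature rather than a full cooling schedule, and hypothetical function names:

```python
import numpy as np

def sample_correlations(W, T, clamp, n_sweeps, rng):
    """Estimate <xi xj> at temperature T by time-averaging Gibbs sweeps.
    `clamp` maps unit index -> fixed (+1/-1) value for clamped visible units."""
    n = W.shape[0]
    x = rng.choice([-1, 1], size=n)
    for i, v in clamp.items():
        x[i] = v
    corr = np.zeros((n, n))
    for _ in range(n_sweeps):
        for i in rng.permutation(n):
            if int(i) in clamp:
                continue                      # clamped units never flip
            dE = 2.0 * x[i] * (W[i] @ x)      # energy change for flipping unit i
            z = min(dE / T, 700.0)            # clip to avoid overflow in exp
            if rng.random() < 1.0 / (1.0 + np.exp(z)):
                x[i] = -x[i]
        corr += np.outer(x, x)
    return corr / n_sweeps

def boltzmann_step(W, visible, patterns, T=1.0, eta=0.05, n_sweeps=50, rng=None):
    """One step of Equation (8.3.13): Hebbian (clamped) minus anti-Hebbian (free)."""
    rng = np.random.default_rng(rng)
    clamped = np.mean([sample_correlations(W, T, dict(zip(visible, p)), n_sweeps, rng)
                       for p in patterns], axis=0)
    free = sample_correlations(W, T, {}, n_sweeps, rng)
    dW = (eta / T) * (clamped - free)
    np.fill_diagonal(dW, 0.0)                 # keep wii = 0 (no self-feedback)
    return W + dW
```

Note that one weight update requires a separate equilibrium computation for every clamped pattern plus one free-running computation, which is the source of the heavy computational cost discussed next.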
As the reader may have already suspected by examining Equations (8.3.13) and (8.3.14), Boltzmann learning is very computationally intensive. Usually, one starts with a high temperature and chooses a cooling schedule. In the computation of <xi xj>^clamped, the visible states are clamped to randomly drawn patterns from the training set according to a given probability distribution R(α). For each such training pattern, the net follows the stochastic dynamics in Equation (8.3.1) or, equivalently, Equation (8.3.4): many units are sampled and updated at each of a series of temperatures, with the temperature lowered slowly according to the preselected schedule until equilibrium is reached. For the computation of <xi xj>^free, the network seeks equilibrium following the same annealing schedule, but with the visible units unclamped. This simulated annealing search must therefore be repeated with clamped visible units in each desired pattern and with unclamped visible units; thermal equilibrium must be reached for each of these computations. The weights are updated only after enough training patterns are taken, and this whole process is repeated many times to achieve convergence to a good set of weights wij.

The learning rule described above is compatible with pattern completion, in which a trained net is expected to fill in missing bits of a partial pattern when such a pattern is clamped on the visible nodes. This is reminiscent of retrieval in the dynamic associative memories of Chapter 7. During retrieval, the weights derived in the training phase are held fixed, the visible units are clamped at the corresponding known bits of a noisy/partial input pattern, and simulated annealing-based global retrieval is used as discussed in Section 8.2: starting from a high temperature, the net follows the stochastic dynamics in Equation (8.3.1), and the temperature is gradually lowered according to an appropriate schedule until the dynamics become deterministic at T = 0 and convergence to the "closest" pattern (global solution) is (hopefully) achieved. Note that the presence of hidden units allows for high memory capacity.

We may also extend Boltzmann learning to handle the association of input/output pairs of patterns, as in supervised learning in multilayer perceptron nets. Here, we need to distinguish between two types of visible units: input and output. Let us represent the states of the input units, the output units, and the hidden units by α, γ, and β, respectively. Then we want the network to learn the associations α → γ. We may now pose the problem as follows: for each α, we want to adjust the wij's such that the conditional probability distribution P(γ|α) is as close as possible to a desired distribution R(γ|α). Assuming that the α's occur with probabilities p(α), a suitable error measure is

(8.3.15)    H = Σα p(α) Σγ R(γ|α) ln [ R(γ|α) / P(γ|α) ]

which leads to the Boltzmann learning rule (Hopfield, 1987)

(8.3.16)    Δwij = (η/T) ( <xi xj>^(α,γ clamped) − <xi xj>^(α clamped) )

In Equation (8.3.16), both the inputs and outputs are clamped in the Hebbian term, while only the input states are clamped in the anti-Hebbian (unlearning) term, with averages over the inputs taken in both cases.

Theoretically, Boltzmann machines with learning may outperform gradient-based learning such as backprop. However, the demanding computational overhead associated with these machines would usually render them impractical in software simulations. Specialized electronic (Alspector and Allen, 1987) and optoelectronic (Farhat, 1987; Ticknor and Barrett, 1987) hardware has been developed for the Boltzmann machine, but such hardware implementations are still experimental in nature. Examples of applications of learning Boltzmann machines can be found in Ackley et al. (1985), Sejnowski et al. (1986), Parks (1987), Kohonen et al. (1988), and Lippmann (1989). Variations and related networks can be found in Derthick (1984), Smolensky (1986), van Hemmen et al. (1990), Galland and Hinton (1991), and Apolloni and De Falco (1991).

8.4 Mean-Field Annealing and Deterministic Boltzmann Machines

Mean-field annealing (Soukoulis et al., 1983; Bilbro et al., 1989) is a deterministic approximation to simulated annealing which is significantly more computationally efficient (faster) than simulated annealing (Bilbro et al., 1992). Instead of directly simulating the stochastic transitions in simulated annealing, the mean (or average) behavior of these transitions is used to characterize a given stochastic system. Because computations using the mean transitions attain equilibrium faster than those using the corresponding stochastic transitions, mean-field annealing relaxes to a solution at each temperature much faster than does stochastic simulated annealing, which leads to a significant decrease in computational effort. The idea of using a deterministic mean-valued approximation for a system of stochastic equations to simplify the analysis has been adopted at various instances in this book (e.g., in Section 4). Generally speaking, such approximations are adequate in high-dimensional systems of many interacting units (states), where each state is a function of all or a large number of other states, allowing the central limit theorem to be used (see Problem 7). In this section, we restrict our discussion of mean-field annealing to the Boltzmann machine, which was introduced in the previous section.

8.4.1 Mean-Field Retrieval

Consider a stochastic Hopfield net with the stochastic dynamics given in Equation (8.3.1). The evolution of the stochastic state xi of unit i depends on neti, which involves variables xj that themselves fluctuate between −1 and +1. Let us transform the set of n stochastic equations in xi into n deterministic equations governing the means <xi> of the stochastic variables. If we focus on a single variable xi and compute its average by assuming no fluctuations of the other xj's (this allows us to replace neti by its average <neti>), we get

(8.4.1)    <xi> = tanh( <neti> / T ),  with  <neti> = Σj wij <xj>

The system is now deterministic and is approximated by the n mean-field equations represented by Equation (8.4.1). It is important to point out that Equation (8.4.1) is meaningful only when the network is at thermal equilibrium. Luckily, as was discussed in Section 8.2, the stochastic Hopfield net is guaranteed to reach thermal equilibrium, which means that all the quantities <xi> converge (become time-independent); the stochastic Hopfield net then fluctuates about the constant average values in Equation (8.4.1).

The mean state <x> is thus one of the local minima of the quadratic energy function E(x) at temperature T. To see this, we recall the dynamics of the ith unit in a continuous-state electronic Hopfield net (Chapter 7), but with Ii = 0 for convenience, namely

(8.4.2)    dui/dt = −ui + Σj wij xj,  with  xj = f(uj)

The equilibria of Equation (8.4.2) are given by setting dui/dt to zero, giving

(8.4.3)    ui = Σj wij xj = neti

Assuming the common choice x = f(u) = tanh(u) in Equation (8.4.3) gives

(8.4.4)    xi = tanh(neti)

which becomes identical in form to Equation (8.4.1) with T = 1; more generally, an amplifier gain in x = tanh(u/T) plays the role of the inverse temperature. In other words, Equation (8.4.1) has the same form as the equation governing the equilibrium points of the Hopfield net (Bilbro et al., 1989). The location of the minimum may then be computed by solving the set of n nonlinear mean-field equations. An alternative way is to solve for <x> by gradient descent on E(x) from an initial random state; this is exactly what the deterministic continuous Hopfield net with hyperbolic tangent activations does.

Since annealing a stochastic net increases the probability that the state will converge to a global minimum, we may try to reach this minimum by annealing the approximate mean-field system, that is, by solving Equation (8.4.1) at a sequence of gradually decreasing temperatures. This approach is known as mean-field annealing. In addition to being faster than simulated annealing, mean-field annealing has proved to lead to better solutions in several optimization problems (Van den Bout and Miller, 1988, 1989; Cortes and Hertz, 1989; Bilbro and Snyder, 1989). Mean-field annealing can be realized very efficiently in electronic (analog) nets like the one in Figure (7.3), where dynamic amplifier gains allow for a natural implementation of continuous cooling schedules (Lee and Sheu, 1993); this is referred to as "hardware annealing." The electronic continuous-state (deterministic) Hopfield net must employ high-gain amplifiers in order to achieve a binary-valued solution as is normally generated by the original stochastic net. However, starting with a deterministic net having large gain may lead to a poor local minimum, as is the case with a stochastic net whose "temperature" T is quenched. Strictly speaking, the deterministic Boltzmann machine is applicable only to problems involving quadratic cost functions; however, the principles of mean-field annealing may still be applied to more general cost functions, with substantial savings in computing time obtained by annealing a steady-state average system as opposed to a stochastic one.

8.4.2 Mean-Field Learning

The excessive number of calculations required by Boltzmann machine learning may be circumvented by extending the above mean-field method to adapting the wij's (Peterson and Anderson, 1987). As required by Boltzmann learning, the correlations <xi xj> should be computed at thermal equilibrium. This means that we must use the approximation terms si sj, where the si's (sj's) are solutions to the n nonlinear equations represented by the average equation

(8.4.5)    si = tanh( (1/T) Σj wij sj )

Equation (8.4.5) applies for free units (hidden units and unclamped visible units). For a clamped visible unit i, si is set to ±1 (the value that the unit's output is supposed to be clamped at). One may employ an iterative method to solve for the unclamped states si according to

(8.4.6)    si(k + 1) = tanh( (1/T) Σj wij sj(k) )

combined with annealing (gradual decreasing of T). Peterson and Anderson (1987) reported that this mean-field learning is 10 to 30 times faster than simulated annealing on some test problems, with somewhat better results.
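The fixed-point iteration of Equation (8.4.6), combined with cooling, can be sketched as follows. This is an illustrative sketch under assumed choices (geometric schedule, small random initialization, all units free, hypothetical function name), corresponding to mean-field retrieval without clamping:

```python
import numpy as np

def mean_field_anneal(W, T0=3.0, alpha=0.9, n_temps=30, n_iters=50, rng=None):
    """Deterministic (mean-field) annealing: iterate s_i = tanh((1/T) sum_j w_ij s_j)
    toward a fixed point at each temperature, then lower T and repeat."""
    rng = np.random.default_rng(rng)
    n = W.shape[0]
    s = 0.01 * rng.standard_normal(n)   # small random initial mean state
    T = T0
    for _ in range(n_temps):
        for _ in range(n_iters):
            s = np.tanh((W @ s) / T)    # synchronous fixed-point iteration
        T *= alpha                      # cooling schedule (assumed geometric)
    return np.sign(s)                   # binary readout at low temperature
```

Because each temperature step is a small number of deterministic iterations rather than a long stochastic simulation, this relaxes far faster than the stochastic search of Section 8.2.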

8.5 Genetic Algorithms in Neural Network Optimization

Genetic algorithms are global optimization algorithms based on the mechanics of natural selection and natural genetics. They employ a structured yet randomized parallel multipoint search strategy which is biased towards reinforcing search points of "high fitness," that is, points at which the function being minimized has relatively low values. Genetic algorithms are similar to simulated annealing (Davis, 1987) in that they employ random (probabilistic) search strategies. However, one of the apparent distinguishing features of genetic algorithms is their effective implementation of parallel multipoint search; the most noticeable difference between the standard genetic algorithm and the methods of optimization discussed earlier is that at each stage (iteration) of the computation, genetic algorithms maintain a collection of samples from the search space rather than a single point. This collection of samples is called a population of strings. This section presents the fundamentals of genetic algorithms and shows how they may be used for neural network training.

8.5.1 Fundamentals of Genetic Algorithms

The genetic algorithm (GA), as originally formulated by Holland (1975), was intended to be used as a modeling device for organic evolution. Later, De Jong (1975) demonstrated that the GA may also be used to solve optimization problems and that globally optimal results may be produced. Although there has been a lot of work done on modifications and improvements to the method, this section will present the standard genetic algorithm; the analysis will follow the presentation given in Goldberg (1983).

In its simplest form, the standard genetic algorithm is a method of stochastic optimization for discrete programming problems of the form:

(8.5.1)    Maximize f(s),  subject to s ∈ Ω = {0, 1}^n

In this case, f : Ω → R is called the fitness function, and the n-dimensional binary vectors in Ω are called strings. To start the genetic search, an initial population of, say, M binary strings, S(0) = {s1, s2, ..., sM}, each with n bits, is created. Usually, this initial population is created randomly, because it is not known a priori where the globally optimal strings in Ω are likely to be found; if such information is given, though, it may be used to bias the initial population towards the most promising regions of Ω. From this initial population, subsequent populations S(1), S(2), ..., S(t), ... will be computed by employing the three genetic operators of selection, crossover, and mutation.

The standard genetic algorithm uses a roulette wheel method for selection, which is a stochastic version of the survival-of-the-fittest mechanism. In this method of selection, candidate strings from the current generation S(t) are selected to survive to the next generation S(t + 1) by designing a roulette wheel where each string in the population is represented on the wheel in proportion to its fitness value: strings with high fitness are given a large share of the wheel, while those strings with low fitness are given a relatively small portion of the roulette wheel.

For each string si in the population, the fitness may be evaluated: f(si). The appropriate share of the roulette wheel to allot the ith string is obtained by dividing the fitness of the ith string by the sum of the fitnesses of the entire population: f(si) / Σj f(sj). Selections are then made by spinning the roulette wheel M times and accepting as candidates those strings which are indicated at the completion of the spin.

Example 8.5.1: As an example, suppose M = 5, and consider the following initial population of strings: S(0) = {(10110), (11000), (11110), (01001), (00110)}. Figure 8.5.1 shows a listing of the population with associated fitness values, and the corresponding roulette wheel. The integers shown on the roulette wheel correspond to string labels. In this case, the roulette wheel is spun five times.

Figure 8.5.1. (a) A listing of the 5-string population and the associated fitness values. (b) Corresponding roulette wheel for string selection.

The strings which are chosen by this method of selection, though, are only candidate strings for the next population. Before actually being copied into the new population, these strings must undergo crossover and mutation.

Pairs of the M (assume M even) candidate strings which have survived selection are next chosen for crossover, which is a recombination mechanism. Pairs of strings are selected randomly from S(t), without replacement; this is repeated until S(t) is empty. A random integer k, called the crossing site, is chosen from {1, 2, ..., n − 1}, and then the bits from the two chosen strings are swapped after the kth bit with a probability Pc (the probability that the crossover operator is applied). Figure 8.5.2 illustrates a crossover for two 6-bit strings; in this case, the crossing site is k = 4, so the bits from the two strings are swapped after the fourth bit.

Figure 8.5.2. An example of a crossover for two 6-bit strings. (a) Two strings are selected for crossover. (b) A crossover site is selected at random; in this case, k = 4. (c) The two strings are swapped after the kth bit.

Finally, mutation is applied to the candidate strings. The mutation operator is a stochastic bit-wise complementation, applied with uniform probability Pm: for each single bit in the population, the value of the bit is flipped from 0 to 1 or from 1 to 0 with probability Pm.
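The three operators just described may be sketched in Python as follows. This is an illustrative sketch: the function names, the string representation, and the default Pc and Pm values are assumptions, not taken from the text:

```python
import random

def roulette_select(pop, fitness, rng):
    """Spin the roulette wheel once: each string owns a share of the
    wheel proportional to its fitness."""
    total = sum(fitness)
    r = rng.random() * total
    acc = 0.0
    for s, f in zip(pop, fitness):
        acc += f
        if r <= acc:
            return s
    return pop[-1]

def crossover(a, b, pc, rng):
    """With probability pc, swap the tails of a and b after a random crossing site."""
    if rng.random() < pc:
        k = rng.randint(1, len(a) - 1)   # crossing site in {1, ..., n-1}
        return a[:k] + b[k:], b[:k] + a[k:]
    return a, b

def mutate(s, pm, rng):
    """Flip each bit independently with probability pm."""
    return ''.join(b if rng.random() >= pm else '10'[int(b)] for b in s)

def next_generation(pop, fitness_fn, pc=0.8, pm=0.01, rng=None):
    """One full cycle of the standard GA: selection, crossover, mutation."""
    rng = rng or random.Random()
    fitness = [fitness_fn(s) for s in pop]
    candidates = [roulette_select(pop, fitness, rng) for _ in range(len(pop))]
    nxt = []
    for i in range(0, len(candidates) - 1, 2):   # pair candidates (M assumed even)
        a, b = crossover(candidates[i], candidates[i + 1], pc, rng)
        nxt += [mutate(a, pm, rng), mutate(b, pm, rng)]
    return nxt
```

Repeatedly calling `next_generation` with a positive fitness function implements the generational loop described in the text.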

The easiest way to determine which bits, if any, to flip is to choose a uniform random number r ∈ [0, 1] for each bit in the string; if r ≤ Pm, then the bit is flipped, otherwise no action is taken. As an example, suppose Pm = 0.1 and the string s = 11100 is to undergo mutation. If the random numbers (0.91, 0.43, 0.03, 0.67, 0.29) were generated, then only the third bit is flipped, and the resulting mutated string is 11000.

After mutation, the candidate strings are copied into the new population of strings S(t + 1), and the whole process is repeated: calculating the fitness of each string, using the roulette wheel method of selection, and applying the operators of crossover and mutation.

Although the next section presents an analytical analysis of the action of the genetic operators, some qualitative comments may be helpful first. Applying only selection to a population of strings results in a sort of MAX operation; that is, only the above-average strings will tend to survive successive populations. It has been shown (Suter and Kabrisky, 1992) that a MAX operation may be implemented by successive application of a theorem by Hardy et al. (1952), which states that for a set of non-negative real numbers Q ⊂ R,

(8.5.2)    min(Q) < ave(Q) < max(Q)

(where it is assumed that not all of the elements of Q are equal). So the maximum value may be obtained by simply averaging the elements of Q and excluding all elements which are below average; the remaining subset may be averaged, and the process repeated until only the maximum value survives. The roulette wheel method of selection may be thought of as a "soft" stochastic version of this MAX operation, in which strings with above-average fitness tend to survive successive roulette wheel spins, while lowly fit strings survive with only a small probability. The reason that this stochastic version is used, rather than deterministically always choosing the best strings to survive, gets at the crux of the underlying theory and assumptions of genetic search: the theory is based on the notion that even strings with very low fitness may contain some useful partial information for the search. For this reason, these lowly fit strings are not altogether discarded during the search.

Of the three genetic operators, the crossover operator is the most crucial in obtaining global results. Crossover is responsible for mixing the partial information contained in the strings of the population; in the next section, it will be conjectured that this type of mixing leads to the formation of optimal strings. The creation of these new strings is usually required because of the vast difference between the number of strings in the population, M, and the total number of strings in the entire search space, 2^n. Typically, M is chosen to be orders of magnitude smaller than 2^n, so by selecting and recombining (crossing) the M strings in a given population, only a fraction of the total search space is explored. Based on empirical evidence, it has been found (De Jong, 1975; Grefenstette, 1986; Schaffer et al., 1989) that reasonable values for the probability of crossover are 0.6 ≤ Pc ≤ 0.99.

Unlike the previous two operators, which are used to fully exploit and possibly improve the structure of the best strings in the current population, the purpose of the mutation operator is to diversify the search and introduce new strings into the population in order to fully explore the search space. Mutation forces diversity in the population and allows more of the search space to be sampled, thus allowing the search to overcome local minima. Mutation, although necessary, cuts with a double-edged sword: applying it too frequently will destroy the highly fit strings in the population, which may slow and impede convergence to a solution. Hence, mutation is usually applied with a small probability. Empirically, it has been found (De Jong, 1975; Grefenstette, 1986; Schaffer et al., 1989) that reasonable values for the probability of mutation are likewise small.

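The successive-averaging MAX operation attributed above to Suter and Kabrisky can be sketched directly (a minimal sketch; the function name is an assumption):

```python
def max_by_averaging(values):
    """Locate max(Q) by repeatedly discarding below-average elements,
    using min(Q) < ave(Q) < max(Q) for non-equal, non-negative sets."""
    q = list(values)
    while len(set(q)) > 1:               # stop once all survivors are equal
        avg = sum(q) / len(q)
        q = [v for v in q if v >= avg]   # keep only the above-average elements
    return q[0]
```

Each pass removes at least the minimum while always retaining the maximum, so the loop terminates with only the maximum value surviving.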
Bäck (1993) presented a theoretical analysis in which he showed that Pm = 1/n is the best choice when the fitness function f is unimodal. However, for a multimodal fitness function, Bäck shows that a dynamic mutation rate may overcome local minima, whereas a fixed mutation rate may not. The dynamic mutation rate may be implemented by following a schedule where Pm is slowly decreased towards 1/n from an initial value Pm(0), such that 1 > Pm(0) >> 1/n. This is analogous to decreasing the temperature in simulated annealing. Indeed, Davis and Principe (1993) showed that the asymptotic convergence of a GA with a suitable mutation probability schedule can be faster than that of simulated annealing.

The Fundamental Theorem of Genetic Algorithms

We have just described the mechanisms of the standard genetic algorithm. The question here is: Why does the standard GA work? Surprisingly, although the literature abounds with applications which demonstrate the effectiveness of GA search, the underlying theory is far less understood. The theoretical basis which has been established thus far (Goldberg, 1983; Thierens and Goldberg, 1993) will be described next. Later, we will demonstrate by example that the GA actually works, in the sense that global solutions to multimodal functions may be obtained.

To analyze the convergence properties of the GA, it is necessary to investigate the effect that selection, crossover, and mutation have on a typical population of strings. To do so, it is useful to define the notion of a schema (plural: schemata). A schema H is a structured subset of the search space Ω, where the structure is provided by string similarities at certain fixed positions of all the strings in H; the string positions which are not fixed in a given schema are usually denoted by ∗. For example, the schema H = ∗11∗0 is the collection of all 5-bit binary strings which contain a 1 in the second and third string positions and a 0 in the last position:

H = {(01100), (01110), (11100), (11110)}

In total, there are 3^n different schemata possible (all combinations of the symbols {0, 1, ∗}), while there are only 2^n different strings in Ω; thus, it is clear that a given string in Ω will belong to several different schemata. More precisely, each string in Ω will belong to 2^n different schemata.

The following notation will be useful. The order of a schema H is the number of fixed positions over the strings in H, and the defining length of a schema is the distance between the first and last fixed positions of the schema. The order and defining length are denoted by o(H) and δ(H), respectively. For example, for the schema ∗11∗∗0, o(H) = 3, because there are 3 fixed positions in H, and δ(H) = 6 − 2 = 4, because 2 and 6 are the indices of the first and last fixed string positions in H.

Now consider S(t), the population of strings at time t, and the collection of schemata which contain one or more of the strings in this population. For each such schema H, denote by m(H, t) the number of strings in the population at time t which are also in H. We want to study the long-term behavior of m(H, t) for those schemata H which contain highly fit strings. Using the roulette wheel method of selection outlined in the previous section, the expected number of strings in H ∩ S(t + 1), given the quantity m(H, t), is easily seen to be:

(8.5.3)    m(H, t + 1) = m(H, t) · f(H)/f̄

Theorem. (Goldberg, 1983) By using the selection, crossover, and mutation operators of the standard genetic algorithm, short, low-order, above-average schemata receive exponentially increasing trials in subsequent populations.

Here, the order o(H) of a schema H is its number of fixed (non-∗) positions, and the defining length δ(H) is the distance between its first and last fixed positions.

Proof. Consider first the effect of selection on a schema H having m(H, t) representative strings in the population at time t. Since strings are selected for reproduction in proportion to their fitness, the expected number of representatives of H at time t + 1 is

m(H, t + 1) = m(H, t) f(H)/f_avg    (8.5.3)

where f(H) is the average fitness of the strings in the population which belong to H, and f_avg is the average fitness of all the strings in the population at time t. Assuming the ratio f(H)/f_avg remains relatively constant, the above equation has the form of a linear difference equation x(t + 1) = a x(t), t = 0, 1, 2, ..., whose solution is well known and given by x(t) = a^t x(0), which explodes if a > 1 and decays if a < 1. Hence, we see that the number of strings in the population represented by H is expected to grow exponentially if f(H) > f_avg, and to decay exponentially if f(H) < f_avg.

Now consider the effect of crossover on a schema H. If a string belonging to H is selected for crossover, one of two possibilities may occur: (1) the crossover preserves the structure of H, in which case the schema is said to have survived crossover, or (2) the crossover destroys the structure, in which case the resulting crossed string will not belong to H at time t + 1. It is easy to see by example which schemata are likely to survive crossover. Consider the two six-bit schemata

A = 1 ∗ ∗ ∗ 1 ∗
B = ∗ 0 1 ∗ ∗ ∗

Claim: Schema A will not survive crossover if the cross site k is 1, 2, 3, or 4. To see this, just take a representative example from A, say 100011. Making the reasonable assumption that the mating string is not identical to this string at precisely the fixed string positions of A (i.e., the first and fifth positions), then upon crossover with cross site k = 1, say, the fixed 1 at the fifth string position will be lost, and the resulting string will not belong to A; the same happens for k = 2, 3, or 4. Conversely, schema B may be crossed at sites 1, 3, 4, or 5 and still preserve its structure, because for these sites the 01 fixed positions will lie on the same side of the crossing site and will be copied intact into the resulting string. The only crossing site which will destroy the structure of schema B is k = 2. By noticing the difference in defining length for these two schemata, δ(A) = 4 and δ(B) = 1, the following conclusion may be made: a schema survives crossover when the cross site is chosen outside its defining length. Since the cross site is chosen uniformly among the n − 1 sites of a string of length n, a schema H is destroyed with probability δ(H)/(n − 1). But since the crossover operator is only applied with probability Pc, the following quantity is a lower bound for the crossover survival probability:

1 − Pc δ(H)/(n − 1)

Finally, the mutation operator destroys a schema only if it is applied to one of the fixed positions in the schema. Since each of the o(H) fixed positions independently survives mutation with probability 1 − Pm, the probability that a schema H will survive mutation is given by (1 − Pm)^o(H). For small values of Pm, the binomial theorem may be employed to obtain the approximation (1 − Pm)^o(H) ≈ 1 − o(H)Pm.

Since the operations of selection, crossover, and mutation are applied independently, the probability that a schema H will survive to the next generation may be obtained by a simple multiplication of the survival probabilities derived above:

m(H, t + 1) ≥ m(H, t) [f(H)/f_avg] [1 − Pc δ(H)/(n − 1)] [1 − o(H)Pm]    (8.5.4)

By expanding the product and neglecting the cross-product term, the desired result is obtained:

m(H, t + 1) ≥ m(H, t) [f(H)/f_avg] [1 − Pc δ(H)/(n − 1) − o(H)Pm]    (8.5.5)

By comparing with Equation (8.5.3), we see that the exponential growth of above-average schemata survives as long as the attenuation term Pc δ(H)/(n − 1) + o(H)Pm remains small; that is, short, low-order, above-average schemata receive exponentially increasing trials in subsequent populations, which proves the theorem.
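The quantities in the theorem are simple to compute. The sketch below (an illustration of ours, not from the text) evaluates the growth factor of Equation (8.5.5) for the two example schemata A and B, assuming illustrative values Pc = 0.9 and Pm = 0.01 and a schema fitness 20 percent above the population average; both schemata have order 2, but their defining lengths differ.

```python
def growth_factor(f_H, f_avg, delta, o, n, Pc, Pm):
    """Growth factor of Eq. (8.5.5):
    m(H, t+1) >= m(H, t) * (f_H/f_avg) * (1 - Pc*delta/(n-1) - o*Pm)."""
    return (f_H / f_avg) * (1 - Pc * delta / (n - 1) - o * Pm)

n, Pc, Pm = 6, 0.9, 0.01                           # six-bit strings, assumed rates
a = growth_factor(1.2, 1.0, 4, 2, n, Pc, Pm)       # schema A: delta(A) = 4
b = growth_factor(1.2, 1.0, 1, 2, n, Pc, Pm)       # schema B: delta(B) = 1
print(a, b)   # the short schema B keeps almost all of its selection advantage
```

Despite identical fitness and order, the long schema A is heavily penalized by crossover disruption while B is barely affected, which is the content of the theorem.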

The short, low-order, above-average schemata are called building blocks, and the Fundamental Theorem indicates that building blocks are expected to dominate the population. Is this good or bad in terms of the original goal of function optimization? The above theorem does not answer this question. Rather, the connection between the Fundamental Theorem and the observed optimizing properties of the genetic algorithm is provided by the following conjecture:

The Building Block Hypothesis. The globally optimal strings may be partitioned into substrings which are given by the bits of the fixed positions of building blocks. Stated another way, the hypothesis is that the partial information contained in each of the building blocks may be combined to obtain globally optimal strings, because some arrangement of the building blocks is likely to produce a globally optimal string.

If this hypothesis is correct, then the Fundamental Theorem implies that the GA is doing the right thing in allocating an exponentially increasing number of trials to the building blocks. Unfortunately, although the building block hypothesis seems reasonable enough, it does not always hold true. Cases where the hypothesis fails can be constructed, in which the globally optimal strings are surrounded (in a Hamming distance sense) by the worst strings. Such problems are called GA-deceptive because, by following the building block hypothesis, the GA is led away from the globally optimal solutions rather than towards them. It is believed (Goldberg, 1989), though, that such cases are of the "needle in the haystack" type. Current trends in GA research (Kuo and Hwang, 1993; Qi and Palmieri, 1993) include modifying the standard genetic operators in order to enable the GA to solve such "needle in the haystack" problems, and hence shrink the class of GA-deceptive problems.

The above analysis is based entirely on the schemata in the population rather than on the actual strings; the GA, though, processes strings--not schemata. This type of duality is called implicit parallelism by Holland (1975): a larger amount of information is obtained and processed at each generation than would appear by simply looking at the processing of the M strings, the additional information coming from the number of schemata that the GA is processing per generation. The next question is: How many schemata are actually processed per generation by the GA? Clearly, in every population of M strings of length n, there are between 2^n and 2^n M schemata present (if all the strings in the population are the same, there are exactly 2^n schemata; if all the strings are different, there may be at most 2^n M schemata). Because the selection, crossover, and mutation operations tend to favor certain schemata, not all of the schemata in the population will be processed by the GA; Holland (1975) estimated that O(M^3) schemata per generation are actually processed in a useful manner (see also Goldberg, 1989). Hence, implicit parallelism implies that by processing M strings, the GA actually processes O(M^3) schemata for free!

To apply the standard genetic algorithm to an arbitrary optimization problem of the form

minimize y(x), with x ranging over a given search space    (8.5.6)

it is necessary to establish the following:

1. A correspondence (an invertible mapping) between the search space and some space of binary strings.
2. An appropriate fitness function f(s), such that the maximizers of f correspond to the minimizers of y.
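The claim that a population of M strings of length n represents between 2^n and 2^n M schemata can be checked directly for small n. The following sketch (ours, not from the text) enumerates the schemata matched by each string and counts the distinct ones over a toy population.

```python
from itertools import product

def schemata_of(s):
    """The 2^n schemata matched by string s: keep each bit or replace it by '*'."""
    return {''.join(t) for t in product(*[(c, '*') for c in s])}

def schemata_in(pop):
    """Number of distinct schemata represented in a population."""
    return len(set().union(*map(schemata_of, pop)))

pop_same = ['010101'] * 4
pop_diff = ['010101', '101010', '111111', '000000']
print(schemata_in(pop_same))   # 64 = 2^6: identical strings give the lower bound
print(schemata_in(pop_diff))   # more schemata, but fewer than the 4 * 64 upper bound
```

The upper bound 2^n M is never reached exactly here because some schemata (e.g., the all-∗ template) are shared by every string.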

Example 8.5.2: As an example of the solution process, consider the optimization problem

minimize y(x), x in [0.05, 0.5]    (8.5.7)

This is the same function considered earlier in Example 8.1.1. The function has two local minima (one of them near x = 0.3), as well as a global minimum at x* ≈ 0.223. The standard genetic algorithm will be used to obtain the global minimum of this function.

The first thing to notice is that the search space here is real-valued: the interval [0.05, 0.5]. As mentioned above, some transformation is needed in order to encode/decode the real-valued search space into some space of binary strings; that is, an invertible decoding mapping D is required. In this example, a binary search space consisting of six-bit strings, {0, 1}^6, was used, with the decoding transformation given by

D(s) = 0.05 + d(s)(0.5 − 0.05)/2^6    (8.5.8)

where d(s) is the ordinary decimal representation of the 6-bit binary string s. For example, the decimal representation of 000011 is 3, so D(000011) = 0.071; the left end-point is mapped as D(000000) = 0.05, and D(111111) = 0.493. This situation is shown schematically in Figure 8.5.1.

Figure 8.5.1. A schematic representation of the process of matching an optimization problem with the genetic algorithm framework.
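The decoding transformation (8.5.8) is a one-liner. The sketch below is illustrative (the function names are ours); note that the divisor 2^6 reproduces the decoded values listed in the simulation tables of this example.

```python
def decode(s, lo=0.05, hi=0.5, bits=6):
    """Eq. (8.5.8): D(s) = lo + d(s) * (hi - lo) / 2**bits."""
    return lo + int(s, 2) * (hi - lo) / 2 ** bits   # int(s, 2) is d(s)

def fitness(s, y):
    """Eq. (8.5.9): f(s) = 1 - y(D(s)), turning minimization into maximization."""
    return 1.0 - y(decode(s))

print(round(decode('000011'), 3))   # 0.071  (d(s) = 3)
print(round(decode('011001'), 3))   # 0.226
```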

To establish an appropriate fitness function for this problem, recall that the problem is to minimize y(x), but the GA maximizes the fitness function f(s); so some sort of inverting transformation is required here. In this example, the following fitness function is used:

f(s) = 1 − y(z)    (8.5.9)

where z = D(s).

Before applying the standard genetic algorithm, values for M, Pc, and Pm must be chosen. The values for these parameters are usually determined empirically by running some trial simulations. As a starting point, one may first try to choose Pc and Pm in the ranges 0.6 ≤ Pc ≤ 0.99 and 0.001 ≤ Pm ≤ 0.01, respectively. As for the value of M, empirical results suggest that n ≤ M ≤ 2n is a good choice for strings of length n (Alander, 1992; see also Reeves, 1993). Besides the above parameters, some stopping or convergence criterion is required. Although there are several different convergence criteria which may be used, the criterion used here is to stop when the population is sufficiently dominated by a single string; in this case, convergence is declared when a single string comprises 80 percent or more of the population. Two simulations will be described below.

In the first simulation, an initial population of strings was generated uniformly over the search space, and the following parameters were used: M = 10, Pc = 0.6, and Pm = 0.01. The genetic operators of selection, crossover, and mutation were applied until the convergence criterion was met, and the results of this simulation are shown in Table 8.5.1, which lists the population for the generations at times t = 0, 1, 2, and 3, together with the decoded real-value for each string. Notice how the population comes to be dominated by highly fit strings. After the fourth iteration (t = 4), the population is dominated by the string s* = 011001, and the population has converged. The string s* is decoded to the value x* = 0.226, which is close to the globally optimal solution of x* ≈ 0.223. Note that better accuracy may be obtained by using a more accurate encoding of the real-valued search space, i.e., by using a GA search space with strings of higher dimension.

Table 8.5.1. A listing of the first four populations in the first simulation for Example 8.5.2. The number in parentheses beside each string shows the multiplicity of the string in the total population of 10 strings.

t = 0: Population S(t) = {s1, ..., s10}: 000010 (1), 000110 (1), 001010 (1), 010001 (1), 011001 (1), 100000 (1), 100111 (1), 101110 (1), 110101 (1), 111100 (1); decoded values x = D(s): 0.064, 0.092, 0.120, 0.170, 0.226, 0.275, 0.324, 0.373, 0.423, 0.472.
t = 1: 001010 (1), 010001 (3), 010101 (1), 011001 (3), 101010 (1), 110101 (1); decoded values: 0.120, 0.170, 0.198, 0.226, 0.345, 0.423.
t = 2: 010001 (3), 011001 (4), 101001 (1), 110101 (1), 111001 (1); decoded values: 0.170, 0.226, 0.338, 0.423, 0.451.
t = 3: 010001 (4), 011001 (6); decoded values: 0.170, 0.226.

The second simulation of this problem demonstrates a case where mutation is necessary to obtain the global solution. In this simulation, the initial population of strings was concentrated at the left end-point of the search space; that is, the initial population was created with all strings near x = 0.05. Such an initial population is dominated by the schema 0000∗∗, which is not a desirable building block because the fitness of this schema is relatively small. Applying selection and crossover alone won't help, because no new schemata would be generated; it is mutation that helps the genetic algorithm branch out to explore the entire space. In fact, if mutation were not used, then the global solution could never be found by the GA. The following parameters were used here: M = 10, Pc = 0.9, and Pm = 0.05; the increased mutation and crossover rates were used to encourage diversity in the population. The results of the simulation are shown in Table 8.5.2. This time, the GA took 44 iterations to converge to the solution s* = 011000, with corresponding real-value x* = 0.219.

Table 8.5.2. A listing of the population at various stages of the computation for the second simulation of Example 8.5.2. The numbers in parentheses show the multiplicity of the string in the total population of 10 strings.

t = 0: Population P(t) = {s1, ..., s10}: 000000 (3), 000001 (1), 000010 (3), 000011 (3); decoded values x = D(s): 0.050, 0.057, 0.064, 0.071.
t = 5: 000010 (2), 000110 (1), 001000 (1), 001011 (1), 001110 (1), 010010 (1), 011010 (2), 100010 (1); decoded values: 0.064, 0.092, 0.106, 0.127, 0.148, 0.177, 0.233, 0.289.
t = 30: 000010 (1), 010000 (2), 010010 (4), 011010 (2), 111000 (1); decoded values: 0.064, 0.163, 0.177, 0.233, 0.444.
t = 44: 010100 (1), 011000 (9); decoded values: 0.191, 0.219.
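All of the ingredients of these simulations -- fitness-proportionate (roulette-wheel) selection, single-point crossover, bitwise mutation, and the 80-percent dominance stopping criterion -- can be combined into a compact program. The sketch below is ours, not the book's; in particular, since the y(x) of Example 8.1.1 is not reproduced here, a stand-in objective with the same global minimizer x = 0.223 is assumed.

```python
import random

def run_ga(y, M=10, bits=6, Pc=0.9, Pm=0.01, max_gen=300, seed=1):
    """Standard GA minimizing y on [0.05, 0.5] with fitness f(s) = 1 - y(D(s))."""
    rng = random.Random(seed)
    decode = lambda s: 0.05 + int(s, 2) * 0.45 / 2 ** bits   # Eq. (8.5.8)
    fitness = lambda s: max(1.0 - y(decode(s)), 1e-9)        # keep weights positive
    pop = [''.join(rng.choice('01') for _ in range(bits)) for _ in range(M)]
    for gen in range(max_gen):
        dominant = max(set(pop), key=pop.count)
        if pop.count(dominant) >= 0.8 * M:     # 80% dominance: converged
            break
        # roulette-wheel selection of M mates
        mates = rng.choices(pop, weights=[fitness(s) for s in pop], k=M)
        nxt = []
        for a, b in zip(mates[0::2], mates[1::2]):
            if rng.random() < Pc:              # single-point crossover
                k = rng.randrange(1, bits)
                a, b = a[:k] + b[k:], b[:k] + a[k:]
            nxt.extend((a, b))
        # mutation: flip each bit independently with probability Pm
        flip = lambda c: ('1' if c == '0' else '0') if rng.random() < Pm else c
        pop = [''.join(flip(c) for c in s) for s in nxt]
    best = max(pop, key=fitness)
    return best, decode(best), gen

# Stand-in objective (an assumption -- the book's y(x) is not reproduced here).
s_best, x_best, gens = run_ga(lambda x: 10 * (x - 0.223) ** 2)
print(s_best, round(x_best, 3), gens)
```

With a higher mutation rate (e.g., Pm = 0.05) the same routine mimics the second simulation, maintaining the diversity needed to escape a poorly seeded initial population.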

Although the previous example used a one-dimensional objective function, multidimensional objective functions y: R^n → R may also be mapped onto the genetic algorithm framework by simply extending the length of the binary strings to represent each component of the points x = (x1, ..., xn) in the search space. Here, each string s consists of n substrings, s = (s1, s2, ..., sn), where si is the binary encoding of the ith component of x. A decoding transformation may then be applied to each substring separately: D(s) = (D1(s1), ..., Dn(sn)). Although not necessary, decoding each component separately might be desirable in certain applications. For example, suppose x = (x1, x2) with search space [0, 1] × [0, 100]. To obtain the same level of accuracy for the two variables, more bits would have to be allotted to the substring representing the second component of x, because it has a much larger range of values than the first component; hence, in this case, the decoding transformations D1(s1) and D2(s2) would be different.

The crossover operator may also be slightly modified to exploit the structure of the substring decomposition of s. Instead of choosing a single crossing site over the entire string s = (s1, s2, ..., sn), crossing sites may be chosen for each of the substrings, so that crossover occurs locally at each substring. This type of crossover is known as multiple-point crossover.

As for specifying the search space for the GA, the mapping between the original search space and the GA space can vary from a simple real-to-binary encoding to more elaborate encoding schemes. Similarly, an arbitrary optimization problem with objective function y(x) can be mapped onto a GA as long as one can find an appropriate fitness function which is consistent with the optimization task; for example, one may simply use the function y itself as the fitness function if y(x) is to be maximized, or the fitness function f = −y may be used if y(x) is to be minimized. Both of these requirements are possible to satisfy in many optimization problems. Empirical evidence suggests that different choices/combinations of fitness functions and encoding schemes can have a significant effect on the GA's convergence time and solution quality (Bäck, 1993). Unfortunately, theoretical results on the specification of the space to be explored by a GA are lacking (De Jong and Spears, 1993). A large number of other variations of and modifications to the standard GA have been reported in the literature; for examples, the reader is referred to Chapter 5 in Goldberg (1989) and to the proceedings of the International Conference on Genetic Algorithms (1989-1993).

8.5.2 Application of Genetic Algorithms to Neural Networks

There are various ways of using GA-based optimization in neural networks. The most obvious way is to use a GA to search the weight space of a neural network with a predefined architecture (Caudell and Dolan, 1989; Montana and Davis, 1989; Miller et al., 1989; Whitley and Hanson, 1989). Here, the complete set of network weights is coded as a binary string si with associated fitness f(si) = −E(D(si)), where D(si) is a decoding transformation. In supervised learning one may readily identify the fitness function as −E, where E = E(w) may be the sum of squared error criterion as in Equation (5.13) or the entropy criterion of Equation (5.16), depending on the nature of the target vector d.

The use of GA-based learning methods may be justified for learning tasks which require neural nets with hidden units, since the GA is capable of global search and is not easily fooled by local minima. Also, since the fitness function need not be differentiable, GAs are useful in nets consisting of units with non-differentiable activation functions (e.g., LTGs) applied to nonlinearly separable classification tasks, nonlinear function approximation, etc. GAs are also able to deal with learning in generally interconnected networks, including recurrent nets.
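Multiple-point (substring-wise) crossover can be sketched as follows; the helper name and the two-substring example are ours. An independent cross site is drawn inside each substring, so each component of x is recombined locally.

```python
import random

def multipoint_crossover(s1, s2, lengths, rng=random):
    """Cross two chromosomes made of concatenated substrings, choosing an
    independent cross site inside each substring (multiple-point crossover)."""
    out1, out2, pos = '', '', 0
    for n in lengths:                     # lengths of the substrings s = (s1, ..., sn)
        a, b = s1[pos:pos + n], s2[pos:pos + n]
        k = rng.randrange(1, n)           # local cross site for this substring
        out1 += a[:k] + b[k:]
        out2 += b[:k] + a[k:]
        pos += n
    return out1, out2

rng = random.Random(0)
c1, c2 = multipoint_crossover('000000' + '1111111111',
                              '111111' + '0000000000', [6, 10], rng)
print(c1, c2)   # each child mixes parent material inside both substrings
```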

An example of a simple two-input, two-unit feedforward net is shown in Figure 8.5.4. In this example, each weight is coded as a 3-bit signed-binary substring, where the left-most bit encodes the sign of the weight (e.g., 110 represents −2 and 011 represents +3).

Figure 8.5.4. A simple two-layer feed-forward net used to illustrate weight coding in a GA.

Thus, we may generate an appropriate GA-compatible representation for the net in Figure 8.5.4 as a contiguous sequence of substrings s = (101010001110011), which corresponds to the real-valued weight string (w11, w12, w13, w21, w22) = (−1, 2, 1, −2, 3). Starting with a random population of such strings (a population of random nets), successive generations are constructed using the GA to evolve new strings out of old ones. Strings whose fitness values are above average (more specifically, strings which meet the criteria of the Fundamental Theorem of GAs) tend to survive, and the population converges, with high probability, to the "fittest" string. This string represents the optimal weight configuration for the learning task at hand, for the predefined net architecture and predefined admissible weight values. It is interesting to note here how the crossover operation may be interpreted as a swapping mechanism, where parts (individual units, groups of units, and/or sets of weights) of fit networks are interchanged in the hope of producing a network with even higher fitness.

The GA evolves weights based on a fitness measure of the whole network (a global performance measure), and the question of what caused any particular network state to occur is considered only insofar as the resulting state is desirable (Wieland, 1991). This inherent strength of GAs is in some ways also a weakness. With gradient descent learning it is generally necessary to correlate causes and effects in the network, so that units and weights that cause the desired output are strengthened; by ignoring gradient (or, more generally, cause-and-effect) information when it does exist, the GA can become slow and inefficient. There are also large costs in speed and storage for working with a whole population of networks, which can make standard GAs impractical for evolving optimal designs for large networks. On the other hand, recurrent networks pose special problems for gradient descent learning techniques (refer to Section 5.4) that are not shared by GAs: with recurrent networks, the cause of a state may have occurred arbitrarily far in the past.

Thus, when gradient information exists, one can use such information to speed up the GA search, particularly if it is readily available. This leads to hybrid GA/gradient search, where a gradient descent step may be included as one of the genetic operators (Montana and Davis, 1989); a specific example of such a method is presented in the next section. A more general view of the advantages of the marriage of GA and gradient descent can be seen based on the relationship between evolution and learning. Belew et al. (1990) have demonstrated the complementary nature of evolution and learning: the presence of learning facilitates the process of evolution [see also Smith (1987); Hinton and Nowlan (1987); Keeling and Stork (1991)]. In the context of our discussion, genetic algorithms can be used to provide a model of evolution, and supervised learning (or some other learning paradigm, e.g., reinforcement or unsupervised learning) may be used to provide simple but powerful learning mechanisms. The presence of learning makes evolution much easier: all evolution has to do is to evolve (find) an appropriate initial state of a system, from which learning can do the rest (much like teaching a child who already has an "evolved" potential for learning!). These ideas motivate the use of hybrid learning methods which employ GA and gradient-based searches.

Another interesting application of GAs is to evolve learning mechanisms (rules) for neural networks; in other words, evolution is recruited to discover a process of learning. Chalmers (1991) reported an interesting experiment in which a GA with proper string representation, applied to a population of single-layer neural nets, evolved the LMS learning rule [Equation (3.35)] as an optimal learning rule. Potential applications of GAs in the context of neural networks also include evolving appropriate network structures and learning parameters (Harp et al., 1989; Nolfi et al., 1990) which optimize one or more network performance measures. These measures may include fast response (requires minimizing network size), VLSI hardware implementation compatibility (requires minimizing connectivity), and real-time learning (requires optimizing the learning rate).
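The 3-bit signed-binary weight coding can be decoded mechanically. The sketch below (our helper, not the book's) follows the convention stated above -- the left-most bit of each substring is the sign, with 1 meaning negative -- and recovers the weights of the example chromosome.

```python
def decode_weights(s, bits=3):
    """Decode a chromosome of sign-magnitude weight substrings:
    left-most bit is the sign (1 = negative), remaining bits the magnitude."""
    weights = []
    for i in range(0, len(s), bits):
        sub = s[i:i + bits]
        mag = int(sub[1:], 2)
        weights.append(-mag if sub[0] == '1' else mag)
    return weights

print(decode_weights('110'), decode_weights('011'))   # [-2] [3]
print(decode_weights('101010001110011'))              # [-1, 2, 1, -2, 3]
```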

8.6 Genetic Algorithm Assisted Supervised Learning

In the previous section, a method for training multilayer neural nets was described which uses a GA to search for optimal weight configurations. Here, an alternative supervised learning method is described which performs a global search for optimal targets for the hidden units, based on a hybrid GA/gradient descent search strategy; it is suited to arbitrarily interconnected feedforward neural nets. A GA is used to "evolve" proper hidden targets, and a gradient descent search (LMS or delta rule) is used to learn optimal network interconnection weights (Song, 1992; Hassoun and Song, 1993 a, b).

In supervised learning, a desired target vector (pattern) is specified for each input vector in a given training set; consider the training set {xk, dk}, k = 1, 2, ..., m. A linearly separable set of training input/target pairs can be efficiently learned in a single-layer net using the gradient-based LMS or delta learning rules [see Equations (3.35) and (3.53)]. However, a general complex training set may not be linearly separable, which necessitates the use of a hidden layer; more sophisticated learning rules must then be used, which are usually far less efficient (slower) than the LMS or delta rules and which, as in the case of backprop, may not always lead to satisfactory solutions. Ideally, one would like to "decouple" the training of a multiple-layer network into the training of two (or more) single-layer networks. In Chapter 2, the universal approximation capabilities of single-hidden-layer feedforward nets were established for a wide variety of hidden unit activation functions, including the threshold activation function. This implies that an arbitrary nonlinearly separable mapping can always be decomposed into two linearly separable mappings, realized as the cascade of two single-layer neural nets, as long as the first (hidden) layer has a sufficient number of nonlinear units (e.g., sigmoids or LTGs).

Consider the fully interconnected feedforward single-hidden-layer net of Figure 8.6.1, with LTG hidden units; the output units can be linear units, sigmoid units, or LTGs, depending on the nature of the target vector d.

Figure 8.6.1. A two-layer fully interconnected feedforward neural network.

Now, suppose we are given a set of hidden target column vectors {h1, h2, ..., hm}, hk in {0, 1}^J (or {−1, +1}^J), such that the mappings xk → hk and hk → dk, k = 1, 2, ..., m, are linearly separable. Then efficient learning of the weights can proceed independently for the hidden and output layers: a gradient descent-based rule (e.g., delta rule, LMS, perceptron rule, or Ho-Kashyap rule) may be employed to independently and quickly learn the optimal weights of both layers. These vectors hk will be called hidden targets, because they can be used as target vectors to train the first layer. This approach would be useful if it could be guaranteed that the mapping from the inputs to the hidden targets, and that from the hidden targets to the desired output targets, are linearly separable. Unfortunately, we do not know a priori a proper set {hk} of hidden targets which solves the problem. In fact, backprop may be thought of as employing a dynamic version of the above method, in which the hidden targets are estimated by gradient descent (Chapter 5); because of its gradient descent nature, however, backprop's estimate of the hidden targets does not guarantee finding an optimal hidden target configuration. The following is a more efficient method for training multilayer feedforward neural nets which utilizes the above hidden-target-based supervised learning strategy.

8.6.1 Hybrid GA/Gradient Descent Method for Feedforward Multilayer Net Training

The basics of the hybrid GA/gradient descent (GA/GD) learning method for a multilayer feedforward net are described next, in the context of a net having a single hidden layer of J units (Figure 8.6.1). The GA/GD method consists of two parts: genetic search in hidden target space, and gradient-based weight updates at each layer. The GA search explores the space of possible hidden targets {hk} (the hidden target space) and converges to a global solution which renders the mappings x → h and h → d linearly separable; gradient descent then learns the weights appropriate for those targets.

Since the hidden targets are binary-valued, a natural coding of a set of hidden targets is the string s = (s1 s2 ... sm), where sk is a string formed from the bits of the vector hk. Equivalently, we may represent each search "point" as a J × m binary array (matrix) H = [h1 h2 ... hm]; this representation is particularly useful in the crossover operation described below. Initially, a population of M random binary arrays {Hj}, j = 1, 2, ..., M, is generated as the initial population of search "points." Each array Hj has an associated network, labeled j, whose architecture is that of Figure 8.6.1, with all M nets initialized with the same set of random weights.

During one cycle of GA/GD learning, LMS is used to adapt the weights of the hidden layer in each of the M networks subject to the training set {xk, hk}, k = 1, 2, ..., m; here, the threshold activation is removed during weight adaptation and the hidden units are treated as linear units. (Alternatively, an adaptive version of the Ho-Kashyap algorithm or the perceptron rule may be employed directly on the hidden LTGs.) Similarly, the weights of the output layer units are adapted, independently of the first layer, subject to the training set {hk, dk}. After the weights are updated, each network is tested by performing feedforward computations, and its fitness is computed; in these computations, the outputs of the hidden units (as opposed to the hidden targets) serve as the inputs to the output layer. The fitness of the jth search point (array Hj) is determined by the output SSE of network j:

Ej = Σk Σl (dlk − ylk)^2,  k = 1, ..., m;  l = 1, ..., L    (8.6.1)

where dlk is the lth component of dk, and ylk is the output of the lth output unit in network j due to the input xk. Here, any one of a number of fitness functions may be used; examples are f(Hj) = −Ej, or even f(Hj) = 1/(Ej + ε), where ε is a very small positive number. Though different fitness functions may lead to different performance, further and more extensive testing is still required in this area.

Next, the GA operators of reproduction, crossover, and mutation are applied to evolve the next generation of hidden target sets {Hj}. In reproduction, the hidden target sets Hj with the highest fitness are duplicated and entered into a temporary pool for crossover. A pair {Hi, Hj} is then selected randomly, without replacement, from this temporary pool, and crossover is applied with a probability Pc (Pc is set close to 1). If a training pair {xk, dk} is poorly learned by network i, i.e., if the output error due to this pair is substantially larger than the average output error of network i on the whole training set, then the corresponding column hk of Hi is replaced by the kth column of Hj (and similarly with the roles of i and j interchanged). Hence, crossover can affect multiple pairs of columns in the hidden target arrays. These reproduction and crossover operations differ from those employed by the standard GA; they are motivated by empirical results (Hassoun and Song, 1993b). On the other hand, the standard mutation operation is used here: each bit of the arrays Hj is flipped with a probability Pm (typically, Pm = 0.01 is used).

The above is a description of a single cycle of the GA/GD learning method. This cycle is repeated until the population {Hj} converges to a dominant representation, or until at least one network is generated which has an output SSE less than a prespecified value. Note that the weights of all M networks are reinitialized to small random values at the beginning of each cycle (one set of random weights may be used for all networks and for all cycles).

One of the main motivations behind applying the GA to the hidden target space, as opposed to the weight space, is the possibility of the existence of a more dense set of solutions for a given problem: there may be many more optimal hidden target sets {H*} which produce zero SSE than there are optimal weight configurations in weight space. This hypothesis was validated on a number of simple problems designed so that the weight space and the hidden target space had the same dimension; the GA/GD method showed better overall performance on these benchmarks. Note also that the above GA/GD method involves a GA search space of dimension mJ, whereas a GA search in weight space involves a search space of dimension [(n+1)J + (J+1)L]b, where n is the input dimension, L the number of output units, and b the number of binary bits used to encode each weight. Finally, Hassoun and Song (1993b) reported several variations of the above method, including the use of sigmoidal hidden units, the use of different fitness functions, and the use of the outputs of the hidden layer (instead of the hidden targets) as the input pattern to the output layer during the training phase.
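The nonstandard crossover of the GA/GD method -- swapping poorly learned hidden-target columns between two parents -- is the distinctive part of the algorithm. The sketch below (ours) isolates that operator, together with the standard mutation; the 1.5x threshold for "substantially larger than the average output error" and the toy data are our assumptions, since the text gives no numeric threshold.

```python
import random

def gagd_crossover(Hi, Hj, err_i, err_j, thresh=1.5):
    """Column-swap crossover on two hidden-target arrays, stored as lists of
    per-pattern columns h_k. err_i, err_j are the per-pattern output errors
    of the associated networks; a column is replaced by the mate's column
    when its pattern's error is well above that network's average error."""
    Hi, Hj = list(Hi), list(Hj)
    avg_i = sum(err_i) / len(err_i)
    avg_j = sum(err_j) / len(err_j)
    for k in range(len(Hi)):
        if err_i[k] > thresh * avg_i:   # pair k poorly learned by network i
            Hi[k] = Hj[k]
        if err_j[k] > thresh * avg_j:   # pair k poorly learned by network j
            Hj[k] = Hi[k]
    return Hi, Hj

def mutate(H, Pm=0.01, rng=random):
    """Standard mutation: flip each hidden-target bit with probability Pm."""
    return [tuple(b ^ 1 if rng.random() < Pm else b for b in col) for col in H]

Hi = [(0, 0), (0, 1), (1, 0), (1, 1)]   # J = 2 hidden units, m = 4 patterns
Hj = [(1, 1), (1, 0), (0, 1), (0, 0)]
ci, cj = gagd_crossover(Hi, Hj, [0.1, 0.9, 0.1, 0.1], [0.1, 0.1, 0.1, 0.1])
print(ci)   # [(0, 0), (1, 0), (1, 0), (1, 1)]: only pattern 2's column swapped
```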

8.6.2 Simulations

The GA/GD method is tested on the 4-bit parity binary mapping and on a continuous mapping that arises from a nonlinear system identification problem. The 4-bit parity (refer to the K-map in Figure 2.5) is chosen because it is known to pose a difficult problem, due to multiple local minima, for neural networks using gradient descent-based learning. The nonlinear system identification problem is chosen to test the ability of the GA/GD method with binary hidden targets to approximate continuous nonlinear functions.

The 4-bit parity is a binary mapping from 4-dimensional binary-valued input vectors to one binary-valued (desired) output; the desired output is taken as +1 if the number of +1 bits in the input vector is odd, and −1 otherwise. In both simulations, a two-layer feedforward net was used with sigmoidal hidden units employing the hyperbolic tangent activation. For the binary mapping problem, a single sigmoidal unit is used for the output layer and bipolar output targets are assumed; the system identification problem used one linear output unit. The networks used in the following simulations employ four hidden units, bipolar training data, and bipolar hidden targets. The delta rule with a learning rate of 0.1 is used to learn the weights at both the hidden and output layers; only ten delta rule learning steps are allowed for each layer per full GA/GD training cycle.

The GA/GD method is tested with population sizes of 8, 32, and 64 strings. For each population size, fifty trials are performed (each trial re-randomizes all initial weights and hidden target sets) and learning-cycle statistics (mean value, standard deviation, maximum, and minimum) are computed. Simulation results are reported in Table 8.6.1 for the GA/GD method and three other methods: (1) a method similar to GA/GD but with the GA process replaced by one in which the search is reinitialized with random hidden targets and random weights at the onset of every learning cycle, referred to as the random hidden target/gradient descent (RH/GD) method (it should be noted here that sufficient iterations of the delta rule are allowed in each cycle in order to rule out non-convergence); (2) incremental backprop (BP); and (3) standard GA learning in weight space (SGA).
As with the simulated annealing global search method in weight space, the GA/GD method may not compete with backprop in computational speed. Since one would normally choose a population size M proportional to the dimension of the binary search space in GA applications, we may conclude that the GA/GD method has a speed advantage over standard GA search in weight space when the following condition is satisfied:

mJ < [(n + 1)J + (J + 1)L]b        (8.6.2)

where b is the number of binary bits chosen to encode each weight in the network (see Problem 8.5.6) and it is assumed that n >> L. Equation (8.6.2) implies that the GA/GD method is preferable over GA-based weight search in neural network learning tasks when the size of the training set, m, is small compared to the product of the dimension of the training patterns and the bit accuracy, nb. Unfortunately, many practical problems (such as pattern recognition, system identification, and function approximation problems) lead to training sets characterized by m >> n, which makes the GA/GD method less advantageous in terms of computational speed. However, one may alleviate this problem (e.g., in pattern recognition applications) by partial preprocessing of the training set with a fast clustering method, which would substantially reduce the size of the training set (refer to Section 6.1 for details) and thus allow the GA/GD method to regain its speed advantage. On the other hand, if optimal solutions are at a premium, the GA/GD method is an effective alternative to backprop in learning tasks which involve complex multimodal criterion (error) functions.
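As a quick arithmetic check of this condition, the following hypothetical helper (not from the text) compares the two binary search-space sizes for a setup like the 4-bit parity simulation, with n = 4 inputs, J = 4 hidden units, L = 1 output, m = 16 training patterns, and b = 8 bits per weight:

```python
def gagd_advantage(n, J, L, m, b):
    """Compare the GA/GD hidden-target search dimension (m*J bits) with the
    GA weight-space dimension [(n+1)*J + (J+1)*L]*b bits of Equation (8.6.2)."""
    hidden_target_bits = m * J
    weight_bits = ((n + 1) * J + (J + 1) * L) * b
    return hidden_target_bits, weight_bits, hidden_target_bits < weight_bits

# 4-bit parity: n = 4, J = 4, L = 1, m = 16 patterns, b = 8 bits per weight.
print(gagd_advantage(4, 4, 1, 16, 8))   # (64, 200, True)
```

Here mJ = 64 is well below [(n + 1)J + (J + 1)L]b = 200, so the hidden-target search space is the smaller one; with m = 100 (as in the system identification simulation) the inequality would reverse for the same small network.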

Table 8.6.1
Simulation results for the 4-bit parity using the GA/gradient descent (GA/GD) method, the random hidden target/gradient descent (RH/GD) method, backprop (BP), and the standard GA (SGA) in weight space method. For GA/GD, learning-cycle statistics are listed for population sizes of 8, 32, and 64.

Method (pop. size)   Mean   Std. dev.   Max    Min
GA/GD (8)            437    530         1882   14
GA/GD (32)           159    195         871    8
GA/GD (64)           105    335         2401   5
RH/GD                Does not converge within 1 million trials
BP                   96 out of 100 runs do not converge within 1 million backprop cycles; the remaining 4 runs converge in an average of 3644 cycles
SGA                  Does not converge within 1000 generations with population sizes of 64 and 132

In all simulations, the GA/GD method led to successful convergence in an average of a few hundred learning cycles or less. The RH/GD method could not find a single solution within 10^6 trials. As for backprop, only four out of 100 runs resulted in a solution, with the remaining 96 runs reaching a high error plateau and/or a local minimum. The SGA method, with population sizes of 64 and 132, neither converged nor was able to find a solution within 1000 generations; this was the case for codings of 8 and 16 binary bits for each weight. These results show clearly the difficulty of the task and verify the effectiveness of the GA/GD method.

The second simulation involves the identification of a nonlinear plant described by the discrete-time dynamics of Equation (5.4). A feedforward neural network with 20 hidden units is used and is trained with the GA/GD method on a training set generated according to a random input signal, in a similar fashion as described in Section 5.4. The size of the training set used here is m = 100. Figure 8.6.2 shows a typical simulation result after 20 learning cycles with a population of 50 hidden target sets; the test input signal in this case is the one given in Equation (5.5). This result compares favorably to those in Figure 5.4 (a) and (b) for a two-hidden layer feedforward net and a single hidden layer feedforward net, respectively, which required on the order of 10^5 to 10^6 training iterations of incremental backprop.

Figure 8.6.2
Nonlinear system identification results (dotted line) employing a single-hidden layer feedforward neural net trained with the GA/GD method. The exact dynamics are given by the solid line.

We conclude this chapter by pointing out that genetic algorithms may also be used as the evolutionary mechanism in the context of more general learning systems. As an example, Holland (1986; Holland and Reitman, 1978) introduced the classifier system, an adaptive parallel rule-based system that learns syntactically simple string rules (called classifiers) to guide its performance in an arbitrary environment. The classifier system develops a sequence (or sequences) of actions or decisions so that a particular objective is achieved in a dynamically evolving environment; one may think of the classifier system as a controller whose objective is to regulate or control the state of a dynamical system. Here, the fitness of a particular classifier (rule) is a function of how well an individual classifier complements others in the population. The heart of the classifier system is a reinforcement-type learning mechanism assisted with GA exploration (see Goldberg (1989) for an accessible reference on classifier systems and their applications). Recently, a framework has been developed for synthesizing multilayer feedforward neural net controllers for robust nonlinear control from binary string rules generated by a classifier-like system (Abu Zitar, 1993; Abu Zitar and Hassoun, 1993a, b).
8.7 Summary

This chapter discusses probabilistic global search methods which are suited for neural network optimization. Global search methods, as opposed to deterministic gradient-based search methods, must be used in optimization problems where reaching the global minimum (or maximum) is at a premium. However, the price one pays for using global search methods is increased computational and/or storage requirements as compared to those of local search. The intrinsic slowness of global search methods is mainly due to the slow but crucial exploration mechanisms they employ. It is argued that, in order to make global search methods more speed efficient, local gradient information (if available) could be used advantageously. This observation motivates the hybrid GA/gradient descent method for feedforward multilayer net training introduced at the end of the chapter.

Two major probabilistic global search methods are covered in this chapter. The first method is stochastic simulated annealing, which is motivated by statistical mechanics; the second is genetic algorithms, which are motivated by the mechanics of natural selection and natural genetics.

The exploration mechanism in simulated annealing is governed by the Boltzmann-Gibbs probability distribution, and its convergence is determined by a "cooling" schedule of slowly decreasing "temperature." This method is especially appealing since it can be naturally implemented by a stochastic neural network known as the Boltzmann machine. The Boltzmann machine is a stochastic version of Hopfield's energy-minimizing net which is capable of almost guaranteed convergence to the global minimum of an arbitrary bounded quadratic energy function. Simulated annealing is also applied to optimal weight learning in generally interconnected multilayer Boltzmann machines, thus extending the applicability of the Boltzmann machine from combinatorial optimization to optimal supervised learning of complex binary mappings. Unfortunately, these desirable features of Boltzmann machines come with slow learning and/or retrieval.

Mean-field annealing is a deterministic approximation (based on mean-field theory) to stochastic simulated annealing, where the mean behavior of the stochastic state transitions is used to characterize the Boltzmann machine. Mean-field annealing is applied in the context of retrieval dynamics and weight learning in a Boltzmann machine. This approximation is found to preserve the optimal characteristics of the Boltzmann machine but with a one to two orders of magnitude speed advantage. It is interesting to see that applying mean-field theory to a single-layer Boltzmann machine leads to the deterministic continuous Hopfield net.

Genetic algorithms (GAs) are introduced as another method for optimal neural network design. They employ a parallel multipoint probabilistic search strategy which is biased towards reinforcing search points of high fitness. The most distinguishing feature of GAs is their flexibility and applicability to a wide range of optimization problems. In the domain of neural networks, GAs are useful as global search methods for synthesizing the weights of generally interconnected networks, optimal network architectures and learning parameters, and optimal learning rules. In the context of GA optimization, it is also possible to think of the GA as an evolutionary mechanism which could be accelerated by simple learning processes. Finally, some global search methods may also be mapped onto recurrent neural networks such that the retrieval dynamics of these networks escape local minima and evolve towards a global minimum.
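The simulated annealing procedure summarized above can be sketched in a few lines. This is a minimal illustration under stated assumptions (a geometric cooling schedule and a Gaussian neighborhood move); the function and parameter names are hypothetical, not from the text.

```python
import math, random

def simulated_annealing(energy, neighbor, x0, T0=10.0, cooling=0.95, steps=2000, seed=0):
    """Minimize energy(x): accept uphill moves with Boltzmann-Gibbs
    probability exp(-dE/T) under a slowly decreasing temperature."""
    rng = random.Random(seed)
    x, E = x0, energy(x0)
    best, best_E = x, E
    T = T0
    for _ in range(steps):
        x_new = neighbor(x, rng)
        dE = energy(x_new) - E
        # Boltzmann-Gibbs acceptance of energy-increasing transitions
        if dE <= 0 or rng.random() < math.exp(-dE / T):
            x, E = x_new, E + dE
            if E < best_E:
                best, best_E = x, E
        T *= cooling   # "cooling" schedule of slowly decreasing temperature
    return best, best_E

# Multimodal test function: local minima near the integers, global minimum at x = 0.
f = lambda x: x * x + 4.0 * (1.0 - math.cos(2.0 * math.pi * x))
step = lambda x, rng: x + rng.gauss(0.0, 0.5)
x_min, E_min = simulated_annealing(f, step, x0=4.0)
```

Starting from x0 = 4, gradient descent would be trapped in one of the local minima, while the annealed search escapes them while the temperature is still high.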

Problems

8.1.1 Plot and find analytically the global minima of the following functions:
a. y(x) = x sin(…), |x| < 20
b. y(x) = (x^2 + 2x) cos(x), |x| < 5
c. y(x) = x^6 − 15x^4 + 27x^2 + 250, |x| < 3
d. y(x) = 10 sin^2(x) + 0.2(x + 3)^2, x ∈ [0, 5]

8.1.2 Plot and count the number of minima, maxima, and saddle points for the following functions:
a. … , |x1| ≤ 0.25 and |x2| ≤ 1
b. … , |x1| < 1.5 and |x2| < 1.5
c. … , |x1| < 2 and |x2| < 2

8.1.3 Employ the gradient descent search rule given by … to find a minimum of the function y(x) = x sin(…), starting from the initial search point: a. x(0) = 0.5; b. x(0) = 0.8; c. x(0) = 2.

8.1.4 Repeat Problem 8.1.3 for the function in Problem 8.1.1(d), and experiment with 20 different values of the learning rate.

8.1.5 Employ the gradient descent/ascent global search strategy described in Section 8.1 to find the global minima of the functions in Problem 8.1.1(a)-(d). Plot y[x(t)] versus t for various values of k.

8.1.6 "Global descent" is a global search method which was discussed in Section 5.1.1. Implement the global descent method to search for the global minimum of the function y(x) given in Problem 8.1.1(c), with x(0) = −… , and experiment with different values of the repeller parameter. What range of values of the repeller parameter is likely to lead to the global minimum of y(x)?

8.1.7 For the uni-variate case y = y(x), the global minimum can be reached by gradient descent on the noisy function … , where N(t) is a "noise signal" and c(t) is a parameter which controls the magnitude of the noise. Assume that N(t) is a normally distributed sequence with a mean of zero and a variance of one, and that c(t) = 150 exp(−t). Use the resulting "stochastic" gradient rule to find the global minimum of y(x) in Problem 8.1.1(c), starting from x(0) = −… . Plot x(t) versus t (t = 0, 1, 2, …).

8.2.1 Use the simulated annealing algorithm described in Section 8.2, with the temperature schedule in Equation (8.2.5), to find the global minima of the functions in Problem 8.1.1(a)-(c) and Problem 8.1.2(a) and (b). Assume … = 10^−4. Make intelligent choices for the variance of the random perturbation Δx, taking into account the domain of the function being optimized; also, make use of the plots of these functions to estimate the largest possible change in y due to Δx, and use this information to guide your estimate of T0.

8.2.2 Show that, according to Equation (8.2.4), the probability of a transition (bit-flip) that increases the energy E is always less than 0.5. (Hint: employ the thermal equilibrium condition of Equation (8.2.1), and assume that all units have equal probability of being selected for updating and that only one unit updates its state at a given time.) Next, show that the above probability can be roughly approximated by … . (Hint: compare the series expansion of … to that of tanh(x).)

8.3.1 Consider the simple model of a biological neuron with output x given by x(net) = sgn(net), where net is the post-synaptic potential. Experimental observations (Katz, 1966) show that this post-synaptic potential is normally distributed. Show that the probability that the neuron fires (i.e., the probability that its output is equal to one) is given by … , where … is the mean potential. The distribution width is determined by the parameters of the noise sources associated with the synaptic junctions. Note how the pseudo-temperature now has the physical interpretation of being proportional to the fluctuation of the post-synaptic potential of a real neuron.

8.3.2 Derive Equation (8.3.12) starting from Equation (8.3.8).

8.3.3 Show that the relative-entropy H in Equation (8.3.9) is positive or zero.

8.3.4 Derive Equation (8.3.15).

8.3.5 Derive Equation (8.3.16) by performing a gradient descent on H in Equation (8.3.9).

8.5.1 Employ Hardy's theorem (see Equation (8.5.2) and the associated discussion) in a simple iterative procedure to find the largest number in the set Q = {1, 2, …}.

8.5.2 Consider the ten strings in population S(0) in Table 8.5.1 and the two schemata H1 = ∗11∗∗∗ and H2 = ∗01∗∗0. Which schemata are matched by which strings in the population S(0)? What are the order and the defining length of each of H1 and H2?

8.5.3 Use Equation (8.5.3) and Equation (8.5.4) to compute a lower bound for the expected number of schemata of the form ∗11∗∗∗ in the generation at t = 1. Assume the initial population S(0) as in Table 8.5.1, and let Pc = 0.85 and Pm = 0.02. Next, compare these bounds to the actual number of schemata of the form ∗11∗∗∗ in population S(1) in Table 8.5.3.

8.5.4 Repeat Problem 8.5.3 with the schema ∗01∗∗0.

8.5.5 Find the global minimum of the functions in Problem 8.1.1(a)-(c) and Problem 8.1.2(a) and (b) using the standard genetic algorithm of Section 8.5. Use binary strings of dimension n = 8, Pc = 0.8, and Pm = 0.01 as in the first simulation of Example 8.5.1, and a uniformly distributed initial population of 10 strings. Compare your results to those in Problem 8.2.1.

8.5.6 Consider a two-layer feedforward net where n, J, and L are the input vector dimension, the number of hidden units, and the number of output units, respectively. Assume that a binary coding of the weights is used, where each weight is represented by a b-bit substring, for the purpose of representing the network as a GA string s of contiguous weight substrings. Show that the dimension of s is equal to [(n + 1)J + (J + 1)L]b.

8.5.7 Use the standard genetic algorithm to find integer weights in the range [−15, +15] for the neural network in Figure P8.5.7 such that the network solves the XOR problem. Assume a signed-binary coding of 5 bits (a sign bit plus four magnitude bits) for each weight. The total number of correct responses may be used as the fitness function. Also, assume Pc = 0.85 and Pm = 0.01, and experiment with population sizes of M = 10, 20, 30, 40, and 50.

Figure P8.5.7
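For the schema bookkeeping in Problems 8.5.2-8.5.4 (order, defining length, and matching), a small helper can serve as a check. This is an illustrative sketch, not part of the text:

```python
def schema_order(h):
    """Order o(H): the number of fixed (non-'*') positions in the schema."""
    return sum(c != '*' for c in h)

def defining_length(h):
    """Defining length d(H): distance between the outermost fixed positions."""
    fixed = [i for i, c in enumerate(h) if c != '*']
    return fixed[-1] - fixed[0] if fixed else 0

def matches(h, s):
    """True if binary string s is an instance of schema h."""
    return all(hc in ('*', sc) for hc, sc in zip(h, s))

# The two schemata from Problem 8.5.2:
H1, H2 = '*11***', '*01**0'
print(schema_order(H1), defining_length(H1))   # prints: 2 1
print(schema_order(H2), defining_length(H2))   # prints: 3 4
```

For example, the string 011010 is an instance of H1 but not of H2, since H2 fixes position 1 to 0.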

Review of Engineering Mathematical Tools

These are MS Word files (v6.0 or higher) and are linked to Matlab (verified for Matlab 4). If you don't have Matlab, each document simply opens as a Word document. If you are on a UNIX machine, the documents can be downloaded and transferred to a PC.

Table of Contents
• Calculus. Contents: I. Functions: Graphs, Extreme Points and Roots; II. Differentiation; III. Integration; IV. Series
• Complex Numbers. Contents: I. What are Complex Numbers and why study them?; II. Complex Algebra; III. Polar Form; IV. Exponential Form; V. Trigonometric & Hyperbolic Functions and Logarithms
• Linear Algebra. Contents: I. Vectors & Matrices; II. Matrix Algebra; III. Matrix Determinant and Inverse; IV. Solution of Linear Systems of Equations; V. Simple Programs: Creating and Plotting Arrays
• Differential Equations. Contents: I. Solutions of Ordinary Differential Equations; II. Solutions of Simple Nonlinear Differential Equations; III. Numerical Solutions
• Taylor Series
• Miscellaneous Commands

Matlab Math Review Modules

These are MS Word files (v6.0) and are linked to Matlab. If you don't have Matlab, each opens as a Word document.

• Shifted and Scaled Activation Functions (Chapter 5)
• Hyperbolic Tangent & Logistic Activation Functions and their derivatives (Chapter 5)
• Two-dimensional Dynamical Systems (Chapter 7)

Shifted and Scaled Activation Functions

This Matlab program plots a shifted and scaled hyperbolic tangent activation function and its derivative; alpha is the scaling factor and theta is the shifting factor. The hyperbolic tangent activation function is a bipolar function. As we increase beta, the curves become sharper.

clear
close all
beta = 1; alpha = 2; theta = 10;
net = linspace(theta-5, theta+5, 100);
f = alpha*tanh(beta*net - theta);
for j = 1:size(net,2)
    fprime(j) = alpha*beta*(1 - (f(j)/alpha)^2);   % chain rule for the scaled tanh
end
plot(net, f)
hold on
plot(net, fprime, 'b:')
grid on
title('Hyperbolic tangent activation f and its derivative fprime')
xlabel('net')

[Figure: hyperbolic tangent activation f (red) and its derivative fprime (blue, dotted) versus net, for net between 5 and 15.]

The logistic activation function is a unipolar function, i.e., its value ranges from 0 to a positive value. The following Matlab program plots the logistic activation function and its derivative.

theta2 = 10; beta2 = 2; alpha2 = 1;
net2 = linspace(-10, 10, 200);
for j = 1:size(net2,2)
    f2(j) = alpha2/(1 + exp(-(beta2*net2(1,j) - theta2)));
    f2prime(j) = beta2*f2(j)*(1 - f2(j)/alpha2);   % derivative of the scaled logistic
end
close all
plot(net2, f2)
hold on
plot(net2, f2prime, 'b:')
grid on
axis([-10, 10, -alpha2, alpha2])
title('logistic function and its derivative')
xlabel('net')

[Figure: logistic function f2 (red) and its derivative f2prime (blue, dotted) versus net.]
2nd EXAMPLE

This Matlab program plots the hyperbolic tangent activation function and its derivative.

clear
close all
beta = 1;
net = linspace(-4, 4, 100);
f = tanh(beta*net);
for j = 1:size(net,2)
    fprime(j) = beta*(1 - f(j)*f(j));
end
plot(net, f)
hold on
plot(net, fprime, 'r:')
grid on
axis([-4, 4, -1.5, 1.5])
title('Hyperbolic tangent activation f and its derivative fprime')
xlabel('net')

[Figure: hyperbolic tangent activation f and its derivative fprime, for net between -4 and 4.]

The following Matlab program plots the logistic activation function and its derivative.

close all
beta = 4;
net = linspace(-3.5, 3.5, 100);
for j = 1:size(net,2)
    f2(j) = 1/(1 + exp(-beta*net(1,j)));
    f2prime(j) = beta*f2(j)*(1 - f2(j));
end
plot(net, f2)
hold on
plot(net, f2prime, 'r:')
grid on
axis([-4, 4, -1.5, 1.5])
title('logistic function and its derivative')
xlabel('net')

[Figure: logistic function and its derivative, for net between -4 and 4.]
Two-dimensional Dynamical Systems

Problem 7.7, page 415. Given the system

    x1' = x2                       (1)
    x2' = -2*x1 - 3*x2             (2)

According to Euler's method,

    x' = [x(t + Δt) - x(t)] / Δt   (3)

Substituting (3) in (1) and (2), we can find the values of x1(t + Δt) and x2(t + Δt) as functions of x1(t) and x2(t). To start, we assume some initial conditions, i.e., the values of x1 and x2 at time t = 0. The following Matlab program evaluates x1 and x2 from the initial state through 3000 iterations, for 3 different initial states, and plots the graph of x1 vs. x2.

clear
close all
x11(1) = 1;    x21(1) = 1;     % initial state (1, 1)
x12(1) = 0.5;  x22(1) = 1;     % initial state (0.5, 1)
x13(1) = 0.75; x23(1) = 1;     % initial state (0.75, 1)
delta = 0.01;
for j = 2:3000
    x11(j) = x21(j-1)*delta + x11(j-1);
    x21(j) = x21(j-1) + delta*(-2*x11(j-1) - 3*x21(j-1));
    x12(j) = x22(j-1)*delta + x12(j-1);
    x22(j) = x22(j-1) + delta*(-2*x12(j-1) - 3*x22(j-1));
    x13(j) = x23(j-1)*delta + x13(j-1);
    x23(j) = x23(j-1) + delta*(-2*x13(j-1) - 3*x23(j-1));
end
plot(x11, x21, 'r')
hold on
plot(x12, x22, 'g')
plot(x13, x23, 'y')
xlabel('x1')
ylabel('x2')

[Figure: trajectories in the (x1, x2) plane. Red: initial state (1, 1); green: initial state (0.5, 1); yellow: initial state (0.75, 1).]

Problem 7.4 (b), page 415. Given the system

    x1' = x2 - x1*(x1^2 + x2^2)
    x2' = -x1 - x2*(x1^2 + x2^2)

Euler's method gives

    x1(t + Δt) = x2*Δt - Δt*x1*(x1^2 + x2^2) + x1(t)
    x2(t + Δt) = -x1*Δt - Δt*x2*(x1^2 + x2^2) + x2(t)

clear
close all
x11(1) = 1;    x21(1) = 1;     % initial state (1, 1)
x12(1) = 0.5;  x22(1) = 1;     % initial state (0.5, 1)
x13(1) = 0;    x23(1) = 1;     % initial state (0, 1)
delta = 0.01;
for j = 2:5000
    r1 = x11(j-1)^2 + x21(j-1)^2;
    x11(j) = x21(j-1)*delta - delta*x11(j-1)*r1 + x11(j-1);
    x21(j) = -x11(j-1)*delta - delta*x21(j-1)*r1 + x21(j-1);
    r2 = x12(j-1)^2 + x22(j-1)^2;
    x12(j) = x22(j-1)*delta - delta*x12(j-1)*r2 + x12(j-1);
    x22(j) = -x12(j-1)*delta - delta*x22(j-1)*r2 + x22(j-1);
    r3 = x13(j-1)^2 + x23(j-1)^2;
    x13(j) = x23(j-1)*delta - delta*x13(j-1)*r3 + x13(j-1);
    x23(j) = -x13(j-1)*delta - delta*x23(j-1)*r3 + x23(j-1);
end
plot(x11, x21, 'r')
hold on
plot(x12, x22, 'g')
plot(x13, x23, 'b')
xlabel('x1')
ylabel('x2')

[Figure: trajectories spiral toward the origin. Red: initial state (1, 1); green: initial state (0.5, 1); blue: initial state (0, 1).]

Problem 7.5 (c), page 415. Given the system

    x1' = x2
    x2' = -x2 - (2 + sin t)*x1

clear
close all
x11(1) = 1;    x21(1) = 1;     % initial state (1, 1)
x12(1) = 0.5;  x22(1) = 1;     % initial state (0.5, 1)
x13(1) = 0.75; x23(1) = 1;     % initial state (0.75, 1)
delta = 0.001;
for j = 2:8000
    t = (j-1)*delta;           % current time for the sin(t) term
    x11(j) = x21(j-1)*delta + x11(j-1);
    x21(j) = -x21(j-1)*delta - delta*(2 + sin(t))*x11(j-1) + x21(j-1);
    x12(j) = x22(j-1)*delta + x12(j-1);
    x22(j) = -x22(j-1)*delta - delta*(2 + sin(t))*x12(j-1) + x22(j-1);
    x13(j) = x23(j-1)*delta + x13(j-1);
    x23(j) = -x23(j-1)*delta - delta*(2 + sin(t))*x13(j-1) + x23(j-1);
end
plot(x11, x21, 'r')
hold on
plot(x12, x22, 'g')
plot(x13, x23, 'y')
xlabel('x1')
ylabel('x2')

[Figure: trajectories in the (x1, x2) plane. Red: initial state (1, 1); green: initial state (0.5, 1); yellow: initial state (0.75, 1).]

Problem 7.6 (d), page 415. Damped pendulum system:

    x1' = x2
    x2' = -(g/l)*sin(x1) - f(x2)

where f should be continuous and x2*f(x2) >= 0.

clear
close all
x11(1) = 1;   x21(1) = 1;      % initial state (1, 1)
x12(1) = 0.5; x22(1) = 1;      % initial state (0.5, 1)
g = 9.8; l = 1;
delta = 0.01;
for j = 2:6000
    f1 = x21(j-1);             % f(x2) = x2 satisfies x2*f(x2) >= 0
    f2 = x22(j-1);
    x11(j) = x21(j-1)*delta + x11(j-1);
    x21(j) = -(g/l)*sin(x11(j-1))*delta - delta*f1 + x21(j-1);
    x12(j) = x22(j-1)*delta + x12(j-1);
    x22(j) = -(g/l)*sin(x12(j-1))*delta - delta*f2 + x22(j-1);
end
plot(x11, x21, 'b')
hold on
plot(x12, x22, 'm')
xlabel('x1')
ylabel('x2')

[Figure: damped pendulum trajectories. Blue: initial state (1, 1); magenta: initial state (0.5, 1).]

Java Demonstrations of Neural Net Concepts

The Computation and Neural Networks Laboratory (CNNL) has developed Java demonstrations of neural net concepts.

Neural Networks Java Applets
• 2-Dimensional Linear Dynamical Systems (CNNL) (check below for a more general applet)
• Perceptron Learning Rule (CNNL)
• Principal Component Extraction via Various Hebbian-Type Rules (CNNL)
• Clustering via Simple Competitive Learning (CNNL)
• SOM: Self-Organizing Maps (supports various methods and data distributions; algorithms were developed in the CNNL)
• Support Vector Machine (SVM) (by Lucent Technologies); here is a note on Dr. Hassoun's PhD dissertation, which included polynomial-type "SVMs"
• Backprop-Trained Multilayer Perceptron for Function Approximation (CNNL) (demonstrates generalization effects of early stopping of training)
• Neural Nets for Control: The Ball Balancing Problem (CNNL)
• Backprop-Trained Multilayer Perceptron Handwriting Recognizer (by Bob Mitchell)
• Image Compression Using Backprop (CNNL)
• Generalizations of the Hamming Associative Memory (CNNL)
• Associative Memory Analysis Demo (applet written and maintained by David Clark)
• The Extreme Point Machine (EPM)

Other Sites with Relevant Java Demos
• Collection of Applets for Neural Networks and Artificial Life
• Genetic Algorithm (GA) Demo (by Marshall C. Ramsey)
• Applets authored by H. Loos & B. Fritzke
• Traveling Salesman Problem (TSP) via SOM (authored by O. Schlueter)
• Trailer Truck Backup Demo: Fuzzy Rule-Based Approach (by Christopher Britton)

Useful Java Applets for Engineers!
• Mathtools.net: The Technical Solutions Portal
• On-Line Graphic Calculator and Solver

• Fourier Series Construction of Waves
• Solution of a 2-D system of autonomous differential equations

Major Java Applet Collections (with search capabilities)
• Digital Cat's Java Resource Center
• Developer.com Java Resource Center
• JARS.com: Rated Java Applets
Neural Network-Related Sites Around the World
• World Wide Search for Neural Networks Related Sites
• Computation and Neural Systems Related Groups in USA
• On-Line Neural Network Courses (but first take a look at Professor Hassoun's course: ECE 512)

Data Repositories
• Pattern Recognition Related Archives (NIST, etc.)
• Face Recognition Data (& Research Groups)
• Link to UCI Machine Learning Repository
• Data for Benchmarking of Learning Algorithms (maintained by David Rosen)
• Data: Spirals, Parity, NetTalk, Vowel, Sonar, etc. (from CMU)
• DELVE: Data for Evaluating Learning in Valid Experiments (plus learning methods/software) (by C. Rasmussen and G. Hinton)

(1975).. 256(3). Hassoun. (1993). (1989). "Neural Networks for Computing?" in Neural Networks for Computing." BYTE.. Dissertation. Blackwell. S.. Albert. Editor. Almeida. Ackley. Editors. Michigan. A. B.. D. "A Learning Algorithm for Boltzmann Machines." in Proceedings of the Fifth International Conference on Genetic Algorithms (UrbanaChampaign 1993). S.
. M. Albus. "A Model of the Brain for Robot Control. Abu-Mostafa. Y. T. "Backpropagation in Perceptrons with Feedback. Morgan Kaufmann. "Neurocontrollers Trained with Rules Extracted by a Genetic Assisted Reinforcement Learning System.References Aart. B. IEEE Computer Society Press. J. E. Part 2: A Neurological Model. "A New Approach to Manipulator Control: The Cerebellar Model Articulation Controller (CMAC). Albus. Y." IEEE Transactions on Neural Networks. D. Editor. H. (1992). IEEE. D. J. Springer-Verlag." in Associative Neural Memories: Theory and Implementation. 31-49. 550-557. 1-6. A. J. Abu-Mostafa.. New York. 54-95. Detroit. K. Editor. M. Department of Electrical and Computer Engineering. (1990). Brains. Transactions of the ASME. Butler. M. "A Theory of Cerebellar Functions. D. 151. L. Springer-Verlag. 88-95. R. New York. (1981). (1985)." in IEEE First International Conference on Neural Networks (San Diego 1987). 609-618. 220-227.D. (1993). S. Hinton. and Robotics. L. Abu Zitar. S. "Optical Neural Computers. Simulated Annealing and Boltzmann Machines." Scientific American. J. (1971). J. H. "A Learning Rule for Asynchronous Perceptrons with Feedback in a Combinatorial Environment. H. "On Optimal Population Size of Genetic Algorithms." Journal of Dynamic Systems Measurement and Control. von der Malsburg. Touretzky. J. Academic Press. T. "Regulator Control via Genetic Search Assisted Reinforcement. 147-169. (1972). P. S. 199-208. II. Editor. Alander. A. R. (1986b). 10. J. and Hassoun. Albus. Caudill and C. S. "Generalization and Scaling in Reinforcement Learning. "Complexity of Random Problems. Y. New York. E. 97. G. S. 65-70. Almeida. Peterborough. M. vol. D." Proceedings of CompEuro 92 (The Hague. Regression and the Moore-Penrose Pseudoinverse. (1988). Forrest. Wayne State University. Albus. H. San Mateo. Abu-Mostafa. (1993a). Abu-Mostafa. (1986a). (1993b). (1987). A. (1987). S. A. Vogl. San Mateo. 9. American Institute of Physics." 
in Neural Computers (Neuss 1987). and Korst. 115-131. S. Y. Eckmiller and C. Editor. Abu Zitar. Behavior. S. and Sejnowski. Oxford University Press. New York." in Advances in Neural Information Processing II (Denver 1989)." in Complexity in Information Theory. M. Ackley. Denker. Abu Zitar. Alkon. H. Ph. BYTE/McGraw-Hill." Cognitive Science. and Psaltis." Mathematical Biosciences. New York. Editors. Wiley. J. T. and Hassoun. T. (1979). R. Berlin. "Biological Plausibility of Artificial Neural Networks: Learning by Non-Hebbian Synapses. Netherlands 1992). and Werness. Machine Learning with Rule Extraction by Genetic Assisted Reinforcement (REGAR): Application to Nonlinear Control. Berlin. L. R. Morgan Kaufmann. 254-262. to appear in 1994. and Littman. S. NY. 25-61. S. New York.

Amari. S.. 201-215.-I. Geometrical Theory of Information. "Statistical Neurodynamics of Associative Memory. 63-73. and Cybernetics." IEEE Transactions on Systems. H. C-21." IEEE Proc.-I. N.-I. Cambridge.-I. (1993). P. "Characteristics of Random Nets of Analog Neuron-Like Elements. SMC-2(5)." Neural Computation. 1(1). Amari. 77-87. S. SMC-13.. S. K.. 175-185. Amari. Editor. 605-618.-I. Amari. In Japanese. "Four Types of Learning Curves. B. F. Losleben. MIT. "Characteristics of Randomly Connected Threshold-Element Networks and Network Systems. Fujita. 140-153. Amari. and Zirilli.-I. (1972b). (1977a). Parisi." Neural Computation. S. D. 5(1).. B.. and Maginu. S. (1967). and Murata." in Advanced Research in VLSI: Proceedings of the 1987 Stanford Conference. Man. "Global Optimization and Stochastic Differential Equations. 42." Journal of Optimization Theory and Applications. S. (1968). "Topographic Organization of Nerve Fields.Alspector. Cambridge University Press." Neural Networks. S. Syst. "Statistical Neurodynamics of Various Types of Associative Nets." IEEE Trans.-I. 313-349. 299-307. "Dynamics of Pattern Formation in Lateral-Inhibition Type Neural Fields. J. "Characteristics of Sparsely Encoded Associative Memory.-I. "A Method of Statistical Neurodynamics. Amit. Biology. Hassoun. Cambridge.-I. "Mathematical Foundations of Neurocomputing. S. Oxford University Press. "Statistical Theory of Learning Curves Under Entropic Loss Criterion. 59(1). 27. 26." IEEE Trans. 47(1)." Kybernetik. and Allen. of Math. "Field Theory of Self-Organizing Neural Nets. Amari. "A Neuromorphic VLSI Learning System." Biological Cybernetics. and Shinomoto. Kyoritsu-Shuppan. (1971). (1987). New York. Amari. (1989).

Goos and J. and Stiefel. A. G. Cambridge (1986). M. (1978). M. EC-14. S. Hinton. The University of Michigan Press. (1986). and Linear Algebra. Amsterdam. Heskes. vol. Hinton. "Neural Networks and Physical Systems with Emergent Collective Computational Abilities.-C. J. Elec. E. T. H. L. and Kappen. Michigan. 495502. 1-13. Comp. (1984). Editors. 2. (1975). E." in Machine Learning: An Artificial Intelligence Approach. Rumelhart. Hestenes. Cambridge. National Academy Sciences. IEEE. Editor. J. Nat. 1-12. "Neurons with Graded Response have Collective Computational Properties like Those of Two-State Neurons.Heskes. G. J. R." in PARLE: Parallel Architectures and Languages. Holland. (1987a). McClelland. Hinton. E. 1. (1952). Hayes-Roth. vol. Academic Press. 49. 2(5). Ho. 683-688. MIT Press. CarnegieMellon University. "Methods of Conjugate Gradients for Solving Linear Systems. and Smale. "Connectionist Learning Procedures. G. Computer Science Department." IEEE Trans. 2445-2558. J. San Mateo. New York. (1989). "Learning Translation Invariant Recognition in a Massively Parallel Network. (1965). III. D. Y. G. Res. S. Hillsdale. J. "Learning and Relearning in Boltzmann Machines. "Cognitive Systems Based on Adaptive Algorithms. New York. and T. Ann Arbor. vol. and Nowlan. B. Reprinted as a second edition (1992). (1993a). "Learning Distributed Representations of Concepts. Mitchell. "How Learning can Guide Evolution. Erlbaum. H. Dynamical Systems. Hartmanis. (1987). E. Hopfield. H. L." in IEEE International Conference on Neural Networks (San Francisco 1993). (1986)." in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Washington 1983). Editors. R. 199-233. 448-453. M. (1987b). 81. Bur. (1983)." Technical Report CMU-CS-87-115. 593-623. USA. Michalski. J. Differential Equations. USA. E. New York. Berlin. "On-Line Learning Processes in Artificial Neural Networks. "Convergent Activation Dynamics in Continuous Time Networks. S. G. G. and Reitman. Hinton. 
M. V. MIT Press. J. Standards. National Academy Sciences. Adaptation in Natural and Artificial Systems." in Mathematical Approaches to Neural Networks." Proc. 1219-1223." Complex Systems." J." Proc. "Error Potentials for Self-Organization. T. J. R. Holland. IEEE. and Sejnowski. (1982). J. T. Hirsch. PA. J. "Escaping Brittleness: The Possibilities of General-Purpose Learning Algorithms applied to Parallel Rule Based Systems. I. Carbonell. T. E. W. Editors." in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. G. Academic Press." in Pattern Directed Inference Systems. and Kashyap. E. (1993)b. (1986). and Sejnowski. Taylor. D. Hinton.. 79. Europe Lecture Notes in Computer Science. Hirsch." in Proceedings of the 8th Annual Conference of the Cognitive Science Society (Amherst 1986). E. Hopfield. G. 313-329." Neural Networks. Hinton. B. New York. and the PDP Research Group. 3088-3092. J. "An Algorithm for Linear Inequalities and its Applications. and Kappen. Elsevier Science Publishers B. Springer-Verlag.. J. Morgan Kaufmann. Pittsburgh. J. 409-436. M. Holland. Waterman and F. "Optimal Perceptual Inference. (1974). J.
. 331-349.

Hopfield, J. J. (1987). "Learning Algorithms and Probability Distributions in Feed-Forward and Feed-Back Networks," Proceedings of the National Academy of Sciences, USA, 84, 8429-8433. Hopfield, J. J. (1990). "The Effectiveness of Analogue 'Neural Network' Hardware," Network: Computation in Neural Systems, 1(1), 27-40. Hopfield, J. J. and Tank, (1985). "Neural Computation of Decisions in Optimization Problems," Biological Cybernetics, 52, 141-152. Hopfield, J. J., Feinstein, D. I., and Palmer, R. G. (1993). " "Unlearning" Has a Stabilizing Effect in Collective Memories," Nature, 304, 158-159. Hoptroff, R. G. and Hall, T. J. (1989). "Learning by Diffusion for Multilayer Perceptron," Electronic Letters, 25(8), 531-533. Hornbeck, R. W. (1975). Numerical Methods. Quantum, New York. Hornik, K. (1991). "Approximation Capabilities of Multilayer Feedforward Networks," Neural Networks, 4(2), 251-257. Hornik, K. (1993). "Some New Results on Neural Network Approximation," Neural Networks, 6(8), 10691072. Hornik, K., Stinchcombe, M., and White, H. (1989). "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, 2(5), 359-366. Hornik, K., Stinchcombe, M., and White, H. (1990). "Universal Approximation of an Unknown Mapping and its Derivatives Using Multilayer Feedforward Networks," Neural Networks, 3(5), 551-560. Horowitz, L. L. and Senne, K. D. (1981). "Performance Advantage of Complex LMS for Controlling Narrow-Band Adaptive Arrays," IEEE Trans. Circuits Systems, CAS-28, 562-576. Hoshino, T., Yonekura, T., Matsumoto, T., and Toriwaki, J. (1990). "Studies of PCA Realized by 3-Layer Neural Networks Realizing Identity Mapping (in Japanese)," PRU90-54, 7-14. Hoskins, J. C., Lee, P., and Chakravarthy, S. V. (1993). "Polynomial Modeling Behavior in Radial Basis Function Networks," in Proc. World Congress on Neural Networks (Portland 1993), vol. IV, 693-699. LEA, Hillsdale. Householder, A. S. (1964). The Theory of Matrices in Numerical Analysis. 
Blaisdel, New York. (Reprinted, 1975, by Dover, New York.) Hu, S. T. (1965). Threshold Logic. University of California Press, Berkeley, CA. Huang, W. Y. and Lippmann, R. P. (1988). "Neural Nets and Traditional Classifiers," in Neural Information Processing Systems (Denver 1987), D. Z. Anderson, Editor. American Institute of Physics, New York, 387-396. Huang, Y. and Schultheiss, P. M. (1963). "Block Quantization of Correlated Gaussian Random Variables," IEEE Trans. Commun. Syst., CS-11, 289-296. Huber, P. J. (1981). Robust Statistics. Wiley, New York. Hudak, M. J. (1992). "RCE Classifiers: Theory and Practice," Cybernetics and Systems: An International Journal, 23, 483-515.

Hueter, G. J. (1988). "Solution of the Traveling Salesman Problem with an Adaptive Ring," in IEEE International Conference on Neural Networks (San Diego 1988), vol. I, 85-92. IEEE, New York. Hui, S. and ak, S. H. (1992). "Dynamical Analysis of the Brain-State-in-a-Box Neural Models," IEEE Transactions on Neural Networks, 3, 86-89. Hui, S., Lillo, W. E. and ak, S. H. (1993). "Dynamics and Stability Analysis of the Brain-State-in-a-Box (BSB) Neural Models," in Associative Neural Memories: Theory and Implementation, M. H. Hassoun, Editor. Oxford Univ. Press, NY. Hush, D. R., Salas, J. M., and Horne, B. (1991). "Error Surfaces for Multi-layer Perceptrons," in International Joint Conference on Neural Networks (Seattle 1991), vol. I, 759-764, IEEE, New York. Irie, B. and Miyake, S. (1988). "Capabilities of Three-Layer Perceptrons," IEEE Int. Conf. Neural Networks, vol. I, 641-648. Ito, Y. (1991). "Representation of Functions by Superpositions of Step or Sigmoid Function and Their Applications to Neural Network Theory," Neural Networks, 4(3), 385-394. Jacobs, R. A. (1988). "Increased Rates of Convergence Through Learning Rate Adaptation," Neural Networks, 1(4), 295-307. Johnson, D. S., Aragon, C. R., McGeoch, L. A., and Schevon, C. (1989). "Optimization by Simulated Annealing: An Experimental Evaluation; Part I, Graph Partitioning," Operations Research, 37(6), 865-892. Johnson, D. S., Aragon, C. R., McGeoch, L. A., and Schevon, C. (1990a). "Optimization by Simulated Annealing: An Experimental Evaluation; Part II, Graph Coloring and Number Partitioning," Operations Research, 39(3), 378-406. Johnson, R. A. and Wichern, D. W. (1988). Applied Multivariate Statistical Analysis (2nd edition), Prentice-Hall, Englewood Cliffs, NJ. Jones, R. D., Lee, Y. C., Barnes, C. W., Flake, G. W., Lee, K., Lewis, P. S., and Qian, S. (1990). "Function Approximation and Time Series Prediction with Neural Networks," in Proc. Intl. Joint Conference on Neural Networks (San Diego 1990), vol. 
I, 649-666. IEEE Press, New York. Judd, J. S. (1987). "Learning in Networks is Hard," in IEEE First International Conference on Neural Networks (San Diego 1987), M. Caudill and C. Butler, Editors, vol. II, 685-692. IEEE, New York. Judd, J. S. (1990). Neural Network Design and the Complexity of Learning. MIT Press, Cambridge. Kadirkamanathan, V., Niranjan, M., and Fallside, F. (1991). "Sequential Adaptation of Radial Basis Function Neural Networks and its Application to Time-Series Prediction," in Advances in Neural Information Processing Systems 3 (Denver 1990) R. P. Lippmann, J. E. Moody, and D. S. Touretzky, Editors, 721-727. Morgan Kaufmann, San Mateo. Kamimura, R. (1993). "Minimum Entropy Method for the Improvement of Selectivity and Interpretability," in Proc. World Congress on Neural Networks (Portland 1993), vol. III, 512-519. LEA, Hillsdale. Kanerva, P. (1988). Sparse Distributed Memory. Bradford/MIT Press, Cambridge, MA. Kanerva, P. (1993). "Sparse Distributed Memory and Other Models," in Associative Neural Memories: Theory and Implementation, M. H. Hassoun, Editor, 50-76. Oxford University Press, New York.

Kanter, I. and Sompolinsky, H. (1987). "Associative Recall of Memory Without Errors," Phys. Rev. A., 35, 380-392. Karhunen, J. (1994). "Optimization Criteria and Nonlinear PCA Neural Networks," IEEE International Conference on Neural Networks, (Orlando 1994), vol. XXX, XXXpage numbersXXX, IEEE Press. Karhunen, K. (1947). "Uber lineare methoden in der Wahrscheinlichkeitsrechnung," Annales Academiae Scientiarium Fennicae, A, 37(1), 3-79, (translated by RAND Corp., Santa Monica, CA, Rep. T-131, 1960). Karmarkar, N. (1984). "A New Polynomial Time Algorithm for Linear Programming," Combinatorica, 1, 373-395. Karnaugh, M. (1953). "A Map Method for Synthesis of Combinatorial Logic Circuits," Transactions AIEE, Comm and Electronics, 72, Part I, 593-599. Kashyap, R. L. (1966). "Synthesis of Switching Functions by Threshold Elements," IEEE Trans. Elec. Comp., EC-15(4), 619-628. Kaszerman, P. (1963). "A Nonlinear Summation Threshold Device," IEEE Trans. Elec. Comp., EC-12, 914-915. Katz, B. (1966). Nerve, Muscle and Synapse. McGraw-Hill, New York. Keeler, J. and Rumelhart, D. E. (1992). "A Self-Organizing Integrated Segmentation and Recognition Neural Network," in Advances in Neural Information Processing Systems 4 (Denver 1991), J. E. Moody, S. J. Hanson, and R. P. Lippmann, Editors, 496-503. Morgan Kaufmann, San Mateo. Keeler, J. D., Rumelhart, D. E., and Leow, W.-K. (1991). "Integrated Segmentation and Recognition of Handprinted Numerals," in Advances in Neural Information Processing Systems 3 (Denver 1990), R. P. Lippmann, J. E. Moody, and D. S. Touretzky, Editors, 557-563. Morgan Kaufmann, San Mateo. Keesing, R. and Stork, D. G. (1991). "Evolution and Learning in Neural Networks: The Number and Distribution of Learning Trials Affect the Rate of Evolution," in Advances in Neural Information Processing Systems 3 (Denver 1990), R. P. Lippmann, J. E. Moody, and D. S. Touretzky, Editors, 804-810. Morgan Kaufmann, San Mateo. Kelly, H. J. (1962). 
"Methods of Gradients," in Optimization Techniques with Applications to Aerospace Systems, C. Leitmann, Editor. Academic Press, New York. Khachian, L. G. (1979). "A Polynomial Algorithm in Linear Programming," Soviet Mathematika Doklady, 20, 191-194. Kirkpatrick, S. (1984). "Optimization by Simulated Annealing: Quantitative Studies," J. Statist. Physics, 34, 975-986. Kirkpatrick, S., Gilatt, C. D., and Vecchi, M. P. (1983). "Optimization by Simulated Annealing," Science, 220, 671-680. Kishimoto, K. and Amari, S. (1979). "Existence and Stability of Local Excitations in Homogeneous Neural Fields," J. Math. Biology, 7, 303-318. Knapp, A. G. and Anderson, J. A. (1984). "A Theory of Categorization Based on Distributed Memory Storage," Journal of Experimental Psychology: Learning, Memory, and Cognition, 9, 610-622. Kohavi, Z. (1978). Switching and Finite Automata. McGraw-Hill, NY.

Kohonen, T. (1972). "Correlation Matrix Memories," IEEE Trans. Computers, C-21, 353-359. Kohonen, T. (1974). "An Adaptive Associative Memory Principle," IEEE Trans. Computers, C-23, 444445. Kohonen, T. (1982a). "Self-Organized Formation of Topologically Correct Feature Maps," Biological Cybernetics, 43, 59-69. Kohonen, T. (1982b). "Analysis of Simple Self-Organizing Process," Biological Cybernetics, 44, 135-140. Kohonen, T. (1984). Self-Organization and Associative Memory. Springer-Verlag, Berlin. Kohonen, T. (1988). "The 'Neural' Phonetic Typewriter," IEEE Computer Magazine, March 1988, 11-22. Kohonen, T. (1989). Self-Organization and Associative Memory (3rd ed.). Springer-Verlag, Berlin. Kohonen, T. (1990). "Improved Versions of Learning Vector Quantization," in Proceedings of the International Joint Conference on Neural Networks (San Diego 1990), vol. I, 545-550. IEEE, New York. Kohonen, T. (1991). "Self-Organizing Maps: Optimization Approaches," in Artificial Neural Networks, T. Kohonen, K. Makisara, O. Simula, and J. Kanga, Editors, 981-990. North-Holland, Amsterdam. Kohonen, T. (1993a). "Things You Haven't Heard About the Self-Organizing Map," IEEE International Conference on Neural Networks (San Francisco 1993), vol. III, 1147-1156. IEEE, New York. Kohonen, T. (1993b). "Physiological Interpretation of the Self-Organizing Map Algorithm," Neural Networks, 6(7), 895-905. Kohonen, T. and Ruohonen, M. (1973). "Representation of Associated Data by Matrix Operators," IEEE Trans. Computers, C-22, 701-702. Kohonen, T., Barna, G., and Chrisley, R. (1988). "Statistical Pattern Recognition with Neural Networks: Benchmarking Studies," in IEEE International Conference on Neural Networks (San Diego 1988), vol. I, 61-68. IEEE, New York. Kolen, J. F. and Pollack, J. B. (1991). "Back Propagation is Sensitive to Initial Conditions," in Advances in Neural Information Processing Systems 3 (Denver 1990). R. P. Lippmann, J. E. Moody, and D. S. Touretzky, Editors, 860-867. 
Morgan Kaufmann, San Mateo. Kolmogorov, A. N. (1957). "On the Representation of Continuous Functions of Several Variables by Superposition of Continuous Functions of one Variable and Addition," Doklady Akademii. Nauk USSR, 114, 679-681. Komlós, J. (1967). On the Determinant of (0,1) Matricies. Studia Scientarium Mathematicarum Hungarica, 2, 7-21. Komlós, J. and Paturi, R. (1988). "Convergence Results in an Associative Memory Model," Neural Networks, 3(2), 239-250. Kosko, B. (1987). "Adaptive Bidirectional Associative Memories," Applied Optics, 26, 4947-4960. Kosko, B. (1988). "Bidirectional Associative Memories," IEEE Trans. Sys. Man Cybern., SMC-18, 49-60. Kosko, B. (1992). Neural Networks and Fuzzy Systems: A Dynamical Systems Approach to Machine Intelligence. Prentice-Hall, Englewood.

Kramer, A. H. and Sangiovanni-Vincentelli, A. (1989). "Efficient Parallel Learning Algorithms for Neural Networks," in Advances in Neural Information Processing Systems 1 (Denver 1988) D. S. Touretzky, Editor, 40-48. Morgan Kaufmann, San Mateo. Kramer, M. (1991). "Nonlinear Principal Component Analysis Using Autoassociative Neural Networks," AICHE Journal, 37, 233-243. Krauth, W., Mézard, M., and Nadal, J.-P. (1988). "Basins of Attraction in a Perceptron Like Neural Network," Complex Systems, 2, 387-408. Krekelberg, B. and Kok, J. N. (1993). "A Lateral Inhibition Neural Network that Emulates a Winner-TakesAll Algorithm," in Proc. of the European Symposium on Artificial Neural Networks (Brussels 1993). M. Verleysen, Editor, 9-14. D facto, Brussels, Belgium. Krishnan, T. (1966). "On the Threshold Order of Boolean Functions," IEEE Trans. Elec. Comp., EC-15, 369-372. Krogh, A. and Hertz, J. A. (1992). "A Simple Weight Decay Can Improve Generalization," in Advances in Neural Information Processing Systems 4 (Denver 1991), J. E. Moody, S. J. Hanson, and R. P. Lippmann, Editors, 950-957. Morgan Kaufmann, San Mateo. Kruschke, J. K. and Movellan, J. R. (1991). "Benefits of Gain: Speeded Learning and Minimal Hidden Layers in Back-Propagation Networks," IEEE Transactions on System, Man, and Cybernetics, SMC-21(1), 273-280. Kuczewski, R. M., Myers, M. H., and Crawford, W. J. (1987). "Exploration of Backward Error Propagation as a Self-Organizational Structure," IEEE International Conference on Neural Networks (San Diego 1987), M. Caudill and C. Butler, Editors, vol. II, 89-95. IEEE, New York. Kufudaki, O. and Horejs, J. (1990). "PAB: Parameters Adapting Backpropagation," Neural Network World, 1, 267-274. Kühn, R., Bös, S., and van Hemmen, J. L. (1991). "Statistical Mechanics for Networks of Graded Response Neurons," Phy. Rev. A, 43, 2084-2087. Kullback, S. (1959). Information Theory and Statistics. Wiley, New York. Kung, S. Y. (1993). Digital Neural Networks. 
PTR Prentice-Hall, Englewood Cliffs, New Jersey. Kuo, T. and Hwang, S. (1993). "A Genetic Algorithm with Disruptive Selection," Proceedings of the Fifth International Conference on Genetic Algorithms (Urbana-Champaign 1993), S. Forrest, Editor, 65-69. Morgan Kaufmann, San Mateo.

K rková, V. (1992). "Kolmogorov's Theorem and Multilayer Neural Networks," Neural Networks, 5(3), 501-506. Kushner, H. J. (1977). "Convergence of Recursive Adaptive and Identification Procedures Via Weak Convergence Theory," IEEE Trans. Automatic Control, AC-22(6), 921-930. Kushner, H. J. and Clark, D. (1978). Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer, New York. Lane, S. H., Handelman, D. A., and Gelfand, J. J. (1992). "Theory and Development of Higher Order CMAC Neural Networks," IEEE Control Systems Magazine, April 1992, 23-30.

Lang, K. J. and Witbrock, M. J. (1989). "Learning to Tell Two Spirals Apart," Proceedings of the 1988 Connectionists Models Summer Schools (Pittsburgh 1988), D. Touretzky, G. Hinton, and T. Sejnowski, Editors, 52-59. Morgan Kaufmann, San Mateo. Lapedes, A. S. and Farber, R. (1987). "Nonlinear Signal Processing Using Neural Networks: Prediction and System Modeling," Technical Report, Los Alamos National Laboratory, Los Alamos, New Mexico. Lapedes, A. and Farber, R. (1988). "How Neural Networks Works," in Neural Information Processing Systems (Denver 1987), D. Z. Anderson, Editor, 442-456. American Institute of Physics, New York. Lapidus, L. E., Shapiro, E., Shapiro, S., and Stillman, R. E. (1961). "Optimization of Process Performance," AICHE Journal, 7, 288-294. Lawler, E. L. and Wood, D. E. (1966). "Branch-and-bound methods: A Survey," Operations Research, 14(4). 699-719. Lay, S.-R. and Hwang, J.-N. (1993). "Robust Construction of Radial Basis Function Networks for Classification," in Proceedings of the IEEE International Conference on Neural Networks (San Francisco 1993), vol. III, 1859-1864. IEEE, New York. Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1989). "Backpropagation Applied to Handwritten Zip Code Recognition," Neural Computation, 1(4), 541-551. Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1990). "Handwritten Digit Recognition with a Backpropagation Network," in Advances in Neural Information Processing Systems 2 (Denver 1989), D. S. Touretzky, Editor, 396-404. Morgan Kaufmann, San Mateo. Le Cun, Y., Kanter, I., and Solla, S. A. (1991a). "Second Order Properties of Error Surfaces: Learning Time and Generalization," in Advances in Neural Information Processing Systems 3 (Denver 1990), R. P. Lippmann, J. E. Moody, and D. S. Touretzky, Editors, 918-924. Morgan Kaufmann, San Mateo. Le Cun, Y., Kanter, I., and Solla, S. A. (1991b). 
"Eigenvalues of Covariance Matrices: Application to Neural Network Learning," Phys. Rev. Lett., 66, 2396-2399. Le Cun, Y., Simard, P. Y., and Pearlmutter, B. (1993). "Automatic Learning Rate Maximization by OnLine Estimation of the Hessian's Eigenvectors," in Advances in Neural Information Processing Systems 5 (Denver 1992), S. J. Hanson, J. D. Cowan, and C. L. Giles, Editors, 156-163. Morgan Kaufmann, San Mateo. Lee, B. W. and Shen, B. J. (1991). "Hardware Annealing in Electronic Neural Networks," IEEE Transactions on Circuits and Systems, 38, 134-137. Lee, B. W. and Sheu, B. J. (1993). "Parallel Hardware Annealing for Optimal Solutions on Electronic Neural Networks," IEEE Transactions on Neural Networks, 4(4), 588-599. Lee, S. and Kil, R. (1988). "Multilayer Feedforward Potential Function Networks," in Proceedings of the IEEE Second International Conference on Neural Networks (San Diego 1988), vol. I, 161-171. IEEE, New York. Lee, Y. (1991). "Handwritten Digit Recognition Using k-Nearest Neighbor, Radial-Basis Functions, and Backpropagation Neural Networks," Neural Computation, 3(3), 440-449. Lee, Y. and Lippmann, R. P. (1990). "Practical Characteristics of Neural Networks and Conventional Pattern Classifiers on Artificial and Speech Problems," in Advances in Neural Information Processing Systems 2 (Denver 1989), D. S. Touretzky, Editor, 168-177. Morgan Kaufmann, San Mateo.

Lee, Y., Oh, S.-H., and Kim, M. W. (1991). "The Effect of Initial Weights on Premature Saturation in Back-Propagation Learning," in International Joint Conference on Neural Networks (Seattle 1991), vol. I, 765-770. IEEE, New York. von Lehman, Paek, E. G., Liao, P. F., Marrakchi, A., and Patel, J. S. (1988). "Factors Influencing Learning by Back-Propagation," in IEEE International Conference on Neural Networks (San Diego 1988), vol. I, 335-341. IEEE, New York. Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. (1993). "Multilayer Feedforward Networks with a Nonpolynomial Activation Function Can Approximate Any Function," Neural Networks, 6(6), 861-867. Leung, C. S. and Cheung, K. F. (1991). "Householder Encoding for Discrete Bidirectional Associative Memory," in Proc. Int. Conference on Neural Networks (Singapore 1991), 237-241. Levin, A. V. and Narendra, K. S. (1992). "Control of Nonlinear Dynamical Systems Using Neural Networks, Part II: Observability and Identification," Technical Report 9116, Center for Systems Science, Yale Univ., New Haven, CT. Lewis II, P. M. and Coates, C. L. (1967). Threshold Logic. John Wiley, New York, NY. Light, W. A. (1992a). "Ridge Functions, Sigmoidal Functions and Neural Networks," in Approximation Theory VII, E. W. Cheney, C. K. Chui, and L. L. Schumaker, Editors, 163-206. Academic Press, Boston. Light, W. A. (1992b). "Some Aspects of Radial Basis Function Approximation," in Approximation Theory, Spline Functions, and Applications, S. P. Singh, Editor, NATO ASI Series, 256, 163-190. Klawer Academic Publishers, Boston, MA. Ligthart, M. M., Aarts, E. H. L., and Beenker, F. P. M. (1986). "Design-for-Testability of PLA's Using Statistical Cooling," in Proc. ACM/IEEE 23rd Design Automation Conference (Las Vegas 1986), 339-345. Lin, J.-N. and Unbehauen, R. (1993). "On the Realization of a Kolmogorov Network," Neural Computation, 5(1), 21-31. Linde, Y., Buzo, A., and Gray, R. M. (1980). 
"An Algorithm for Vector Quantizer Design," IEEE Trans. on Communications, COM-28, 84-95. Linsker, R. (1986). "From Basic Network Principles to Neural Architecture," Proceedings of the National Academy of Sciences, USA, 83, 7508-7512, 8390-8394, 8779-8783. Linsker, R. (1988). "Self-Organization in a Perceptual Network," Computer, March 1988, 105-117. Lippmann, R. P. (1987). "An Introduction to Computing with Neural Nets," IEEE Magazine on Accoustics, Signal, and Speech Processing (April), 4, 4-22. Lippmann, R. P. (1989). "Review of Neural Networks for Speech Recognition," Neural Computation, 1(1), 1-38. Little, W. A. (1974). "The Existence of Persistent States in the Brain," Math Biosci., 19, 101-120. Ljung, L. (1977). "Analysis of Recursive Stochastic Algorithms," IEEE Trans. on Automatic Control, AC22(4), 551-575. Ljung, L. (1978). "Strong Convergence of Stochastic Approximation Algorithm," Annals of Statistics, 6(3), 680-696.

Lo, Z.-P., Yu, Y., and Bavarian, B. (1993). "Analysis of the Convergence Properties of Topology Preserving Neural Networks," IEEE Transactions on Neural Networks, 4(2), 207-220. Loève, M. (1963). Probability Theory, 3rd edition, Van Nostrand, New York. Logar, A. M., Corwin, E. M., and Oldham, W. J. B. (1993). "A Comparison of Recurrent Neural Network Learning Algorithms," in Proceedings of the IEEE International Conference on Neural Networks (San Francisco 1993), vol. II, 1129-1134. IEEE, New York. Luenberger, D. G. (1969). Optimization by Vector Space Methods. John Wiley, New York, NY. Macchi, O. and Eweda, E. (1983). "Second-Order Convergence Analysis of Stochastic Adaptive Linear Filtering," IEEE Trans. Automatic Control, AC-28(1), 76-85. Mackey, D. J. C. and Glass, L. (1977). "Oscillation and Chaos in Physiological Control Systems," Science, 197, 287-289. MacQueen, J. (1967). "Some Methods for Classification and Analysis of Multivariate Observations," in Proceedings of the Fifth Berkeley Symposium on Mathematics, Statistics, and Probability, L. M. LeCam and J. Neyman, Editors, 281-297. University of California Press, Berkeley. Magnus, J. R. and Neudecker, H. (1988). Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley, Chichester. Makram-Ebeid, S., Sirat, J.-A., and Viala, J.-R. (1989). "A Rationalized Back-Propagation Learning Algorithm," in International Joint Conference on Neural Networks (Washington 1989), vol. II, 373-380. IEEE, New York. von der Malsberg, C. (1973). "Self-Organizing of Orientation Sensitive Cells in the Striate Cortex," Kybernetick, 14, 85-100. Mano, M. M. (1979). Digital Logic and Computer Design, Prentice-Hall, Englewood Cliffs, NJ. Mao, J. and Jain, A. K. (1993). "Regularization Techniques in Artificial Neural Networks," in Proc. World Congress on Neural Networks (Portland 1993), vol. IV, 75-79. LEA, Hillsdale. Marchand, M., Golea, M., and Rujan, P. (1990). 
"A Convergence Theorem for Sequential Learning in TwoLayer Perceptrons," Europhysics Letters, 11, 487-492. Marcus, C. M. and Westervelt, R. M. (1989). "Dynamics of Iterated-Map Neural Networks," Physical Review A, 40(1), 501-504. Marcus, C. M., Waugh, F. R., and Westervelt, R. M. (1990). "Associative Memory in an Analog IteratedMap Neural Network," Physical Review A, 41(6), 3355-3364. Marr, D. (1969). "A Theory of Cerebellar Cortex," J. Physiol. (London), 202, 437-470. Martin, G. L. (1990). "Integrating Segmentation and Recognition Stages for Overlapping Handprinted Characters," MCC Tech. Rep. ACT-NN-320-90, Austin, Texas. Martin, G. L. (1993). "Centered-Object Integrated Segmentation and Recognition of Overlapping Handprinted Characters," Neural Networks, 5(3), 419-429. Martin, G. L., and Pittman, J. A. (1991). "Recognizing Hand-Printed Letters and Digits Using Backpropagation Learning," Neural Computation, 3(2), 258-267.

. C. Rodemich. Miller. M. F.-P. F. MIT Press. 20-28. "Real-Time Dynamic Control of an Industrial Manipulator Using a Neural-Network-Based Learning Controller. MA. "Effects of Adaptation Parameters on Convergence Time and Tolerance for Adaptive Threshold Elements. Hewes. McCulloch. Rep. CA.. Schaffer.. L. (1953). IEEE. Sutton. Mel." IEEE Trans. II. Morgan Kaufmann. Miller. 2. E. J. (1961). and Omohundro. W. Megiddo. R." J. Lippmann. W. and Papert. Budapest. P. T. (1991). Elec. S. (1969). G. Morgan Kaufmann. Haines. S. S. Micchelli. 11-22. IT-33. D. W. and Pitts. E. "Backpropagation Error Surfaces Can Have Local Minima." in Advances in Neural Information Processing Systems 3 (Denver 1990). M. 461-482. Robotics Automation. Editors. Glanz. M. J. W. 21(6). 627.. (1987). and Teller. R. (1989). B. and Nadal. N. N. Biafore. "Learning in Feedforward Layered Networks: The Tiling Algorithm. S. EC-13.. Cambridge." Journal of Physics A." Bulletin of Mathematical Biophysics. U.. (1990c). E..Mays. 1561-1657. E.." in Proceedings of the Third International Conference on Genetic Algorithms (Arlington 1989).. J. Theory. Moody. K. A. (1991). "A Logical Calculus of Ideas Immanent in Nervous Activity." IEEE Trans. C. E. W. R. San Mateo. A. 115-133. 2191-2204.. P.IS.. Miller. G." Report No. Perceptrons: An Introduction to Computational Geometry. Info. McInerny. G. T. F. "How Receptive Field Parameters Affect Neural Learning. S. Hungarian Academy of Sciences. Minsky. Medgassy. Miller. S. Mead. ECE. 1087-1092. (1964).. G. "The Capacity of the Hopfield Associative Memory. L. Mézard. C. Chemical Physics. Metropolis. T. H. J. IBM Almaden Research Center. 10(2). IEEE. "Design and Implementation of a High Speed CMAC Neural Network Using Programmable CMOS Logic Cell Arrays. San Jose. Posner.. RJ 5252. and Hedg. 5. J. McEliece. 22. (1943). B. Glanz. Comp. and Kraft. (1986). "Designing Neural Networks Using Genetic Algorithms.01. C. 465-468. and Venkatesh. Cambridge. 
"CMAC: An Associative Neural Network Alternative to Backpropagation. "Interpolation of Scattered Data: Distance and Conditionally Positive Definite Functions. Decomposition of Superposition of Distributed Functions." Constructive Approximation.90. "Equation of State Calculations by Fast Computing Machines. (1989).. S. J. 757-763. S. Teller. J. 6. Box.. vol. Touretzky. Editors (1990a). H. R." Proc. M. R. A. M." in International Joint Conference on Neural Networks (Washington 1989). R. San Mateo.. "On the Complexity of Polyhedral Separability. H.
." Tech. (1986). 1-9. P.. P... and Werbos. W." IEEE Trans. Neural Networks for Control. MIT Press. (1989). Todd. and Kraft. (1990b). And D... T. P. L. and Hecht-Nielsen. University of New Hampshire. New York. (1990d). and Whitney. "Neuromorphic Electronic Systems. 78(10). A. Rosenbluth. C." Aerospace and Defense Science. 379-384. Editor. Miller..

Møller, M. F. (1990). "A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning," Technical Report PB-339, Computer Science Department, University of Aarhus, Aarhus, Denmark.
Montana, D. J. and Davis, L. (1989). "Training Feedforward Networks Using Genetic Algorithms," in Eleventh International Joint Conference on Artificial Intelligence (Detroit 1989). Morgan Kaufmann, San Mateo, CA, 762-767.
Moody, J. (1989). "Fast Learning in Multi-Resolution Hierarchies," in Advances in Neural Information Processing Systems 1 (Denver 1988), D. S. Touretzky, Editor. Morgan Kaufmann, San Mateo, CA, 29-39.
Moody, J. and Darken, C. (1989a). "Learning with Localized Receptive Fields," in Proceedings of the 1988 Connectionist Models Summer School (Pittsburgh 1988), D. Touretzky, G. Hinton, and T. Sejnowski, Editors. Morgan Kaufmann, San Mateo, CA, 133-143.
Moody, J. and Darken, C. (1989b). "Fast Learning in Networks of Locally-Tuned Processing Units," Neural Computation, 1(2), 281-294.
Moody, J. and Yarvin, N. (1992). "Networks with Learned Unit Response Functions," in Advances in Neural Information Processing Systems 4 (Denver 1991), J. Moody, S. Hanson, and R. Lippmann, Editors. Morgan Kaufmann, San Mateo, CA, 1048-1055.
Moore, B. (1989). "ART1 and Pattern Clustering," in Proceedings of the 1988 Connectionist Models Summer School (Pittsburgh 1988), D. Touretzky, G. Hinton, and T. Sejnowski, Editors. Morgan Kaufmann, San Mateo, CA, 174-185.
Morgan, N. and Bourlard, H. (1990). "Generalization and Parameter Estimation in Feedforward Nets: Some Experiments," in Advances in Neural Information Processing Systems 2 (Denver 1989), D. S. Touretzky, Editor. Morgan Kaufmann, San Mateo, CA, 630-637.
Morita, M. (1993). "Associative Memory with Nonmonotone Dynamics," Neural Networks, 6(1), 115-126.
Morita, M., Yoshizawa, S., and Nakano, K. (1990a). "Analysis and Improvement of the Dynamics of Autocorrelation Associative Memory," Trans. Institute of Electronics, Information and Communication Engineers (Japan), J73-D-II(2), 232-242.
Morita, M., Yoshizawa, S., and Nakano, K. (1990b). "Memory of Correlated Patterns by Associative Neural Networks with Improved Dynamics," in Proc. INNC '90 (Paris 1990), 868-871.
Mosteller, F., Rourke, R. E. K., and Thomas Jr., G. B. (1970). Probability with Statistical Applications, 2nd edition. Addison-Wesley, Reading, MA.
Mosteller, F. and Tukey, J. W. (1980). Robust Estimation Procedures. Addison-Wesley, Reading, MA.
Mukhopadhyay, S., Roy, A., Kim, L. S., and Govil, S. (1993). "A Polynomial Time Algorithm for Generating Neural Networks for Pattern Classification: Its Stability Properties and Some Test Results," Neural Computation, 5(2), 317-330.
Muroga, S. (1959). "The Principle of Majority Decision Logical Elements and the Complexity of their Circuits," in Proc. Int. Conf. on Information Processing (Paris 1959), 400-407.
Muroga, S. (1965). "Lower Bounds of the Number of Threshold Functions and a Maximum Weight," IEEE Trans. Elect. Comp., EC-14(2), 136-148.
Muroga, S. (1971). Threshold Logic and its Applications. John Wiley Interscience, New York.

Musavi, M. T., Ahmed, W., Chan, K. H., Faris, K. B., and Hummels, D. M. (1992). "On the Training of Radial Basis Function Classifiers," Neural Networks, 5(4), 595-603.
Nakano, K. (1972). "Associatron: A Model of Associative Memory," IEEE Trans. Systems, Man, and Cybernetics, SMC-2, 380-388.
Narayan, S. (1993). "ExpoNet: A Generalization of the Multi-Layer Perceptron Model," in Proc. World Congress on Neural Networks (Portland 1993), vol. III, 494-497.
Narendra, K. S. and Parthasarathy, K. (1990). "Identification and Control of Dynamical Systems Using Neural Networks," IEEE Trans. Neural Networks, 1(1), 4-27.
Narendra, K. S. and Parthasarathy, K. (1992). "A Comparative Study of Two Neural Network Architectures for the Identification and Control of Nonlinear Dynamical Systems," Technical Report, Center for Systems Science, Yale University, New Haven, CT.
Nerrand, O., Roussel-Ragot, P., Personnaz, L., Dreyfus, G., and Marcos, S. (1993). "Neural Networks and Nonlinear Adaptive Filtering: Unifying Concepts and New Algorithms," Neural Computation, 5(2), 165-199.
Newman, C. M. (1988). "Memory Capacity in Neural Network Models: Rigorous Lower Bounds," Neural Networks, 1(3), 223-239.
Nguyen, D. and Widrow, B. (1989). "The Truck Backer-Upper: An Example of Self-Learning in Neural Networks," in Proceedings of the International Joint Conference on Neural Networks (Washington, DC 1989), vol. II, 357-362.
Nilsson, N. J. (1965). Learning Machines. McGraw-Hill, New York. Reissued as The Mathematical Foundations of Learning Machines, Morgan Kaufmann, San Mateo, CA, 1990.
Niranjan, M. and Fallside, F. (1988). "Neural Networks and Radial Basis Functions in Classifying Static Speech Patterns," Technical Report CUED/F-INFENG/TR22, Engineering Department, Cambridge University, Cambridge, England.
Nishimori, H. and Opris, I. (1993). "Retrieval Process of an Associative Memory with a General Input-Output Function," Neural Networks, 6(8), 1061-1067.
Nolfi, S., Elman, J. L., and Parisi, D. (1990). "Learning and Evolution in Neural Networks," CRL Technical Report 9019, University of California, San Diego.
Novikoff, A. (1962). "On Convergence Proofs of Perceptrons," in Proc. Symp. on Math. Theory of Automata (Polytechnic Institute of Brooklyn, Brooklyn, NY 1962), 615-622.
Nowlan, S. J. (1988). "Gain Variation in Recurrent Error Propagation Networks," Complex Systems, 2, 305-320.
Nowlan, S. J. (1990). "Maximum Likelihood Competitive Learning," in Advances in Neural Information Processing Systems 2 (Denver 1989), D. S. Touretzky, Editor. Morgan Kaufmann, San Mateo, CA, 574-582.
Nowlan, S. J. and Hinton, G. E. (1992a). "Adaptive Soft Weight Tying using Gaussian Mixtures," in Advances in Neural Information Processing Systems 4 (Denver 1991), J. Moody, S. Hanson, and R. Lippmann, Editors. Morgan Kaufmann, San Mateo, CA, 993-1000.

Nowlan, S. J. and Hinton, G. E. (1992b). "Simplifying Neural Networks by Soft Weight-Sharing," Neural Computation, 4(4), 473-493.
Oja, E. (1982). "A Simplified Neuron Model As a Principal Component Analyzer," Journal of Mathematical Biology, 15, 267-273.
Oja, E. (1983). Subspace Methods of Pattern Recognition. Research Studies Press and John Wiley, Letchworth, England.
Oja, E. (1989). "Neural Networks, Principal Components, and Subspaces," International Journal of Neural Systems, 1(1), 61-68.
Oja, E. (1991). "Data Compression, Feature Extraction, and Autoassociation in Feedforward Neural Networks," in Artificial Neural Networks: Proceedings of the 1991 International Conference on Artificial Neural Networks (Espoo 1991), T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, Editors. Elsevier Science Publishers B.V., Amsterdam, 737-745.
Oja, E. and Karhunen, J. (1985). "On Stochastic Approximation of the Eigenvectors and Eigenvalues of the Expectation of a Random Matrix," Journal of Mathematical Analysis and Applications, 106, 69-84.
Okajima, K., Tanaka, S., and Fujiwara, S. (1987). "A Heteroassociative Memory Network with Feedback Connection," in Proc. IEEE First International Conference on Neural Networks (San Diego 1987), M. Caudill and C. Butler, Editors, vol. II. IEEE, New York, 711-718.
Paek, E. G. and Psaltis, D. (1987). "Optical Associative Memory Using Fourier Transform Holograms," Optical Engineering, 26, 428-433.
Pao, Y.-H. (1989). Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, MA.
Papadimitriou, C. H. and Steiglitz, K. (1982). Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, Englewood Cliffs, NJ.
Park, J. and Sandberg, I. W. (1991). "Universal Approximation Using Radial-Basis-Function Networks," Neural Computation, 3(2), 246-257.
Park, J. and Sandberg, I. W. (1993). "Approximation and Radial-Basis-Function Networks," Neural Computation, 5(2), 305-316.
Parker, D. B. (1985). "Learning Logic," Technical Report TR-47, Center for Computational Research in Economics and Management Science, Massachusetts Institute of Technology, Cambridge, MA.
Parker, D. B. (1987). "Optimal Algorithms for Adaptive Networks: Second Order Backprop, Second Order Direct Propagation, and Second Order Hebbian Learning," in IEEE First International Conference on Neural Networks (San Diego 1987), M. Caudill and C. Butler, Editors, vol. II. IEEE, New York, 593-600.
Parks, P. C. and Militzer, J. (1991). "Improved Allocation of Weights for Associative Memory Storage in Learning Control Systems," in Proceedings of the 1st IFAC Symposium on Design Methods of Control Systems (Zurich 1991), 777-782.
Parzen, E. (1962). "On Estimation of a Probability Density Function and Mode," Ann. Math. Statist., 33, 1065-1076.

Pearlmutter, B. A. (1988). "Learning State Space Trajectories in Recurrent Neural Networks," Technical Report CMU-CS-88-191, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
Pearlmutter, B. A. (1989a). "Learning State Space Trajectories in Recurrent Neural Networks," Neural Computation, 1(2), 263-269.
Pearlmutter, B. A. (1989b). "Learning State Space Trajectories in Recurrent Neural Networks," in International Joint Conference on Neural Networks (Washington 1989), vol. II. IEEE, New York, 365-372.
Penrose, R. (1955). "A Generalized Inverse for Matrices," Proc. Cambridge Philosophical Society, 51, 406-413.
Peretto, P. (1984). "Collective Properties of Neural Networks: A Statistical Physics Approach," Biological Cybernetics, 50, 51-62.
Personnaz, L., Guyon, I., and Dreyfus, G. (1986). "Collective Computational Properties of Neural Networks: New Learning Mechanisms," Physical Review A, 34(5), 4217-4227.
Peterson, C. and Anderson, J. R. (1987). "A Mean Field Theory Learning Algorithm for Neural Networks," Complex Systems, 1, 995-1019.
Peterson, G. E. and Barney, H. L. (1952). "Control Methods Used in a Study of the Vowels," Journal of the Acoustical Society of America, 24(2), 175-184.
Pflug, G. Ch. (1990). "Non-Asymptotic Confidence Bounds for Stochastic Approximation Algorithms with Constant Step Size," Monatshefte für Mathematik, 110, 297-314.
Pineda, F. J. (1987). "Generalization of Back-Propagation to Recurrent Neural Networks," Physical Review Letters, 59, 2229-2232.
Pineda, F. J. (1988). "Dynamics and Architectures for Neural Computation," Journal of Complexity, 4, 216-245.
Platt, J. (1991). "A Resource-Allocating Network for Function Interpolation," Neural Computation, 3(2), 213-225.
Plaut, D. C., Nowlan, S. J., and Hinton, G. E. (1986). "Experiments on Learning by Back Propagation," Technical Report CMU-CS-86-126, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
Poggio, T. and Girosi, F. (1989). "A Theory of Networks for Approximation and Learning," A.I. Memo 1140, Massachusetts Institute of Technology, Cambridge, MA.
Poggio, T. and Girosi, F. (1990a). "Networks for Approximation and Learning," Proceedings of the IEEE, 78(9), 1481-1497.
Poggio, T. and Girosi, F. (1990b). "Regularization Algorithms for Learning that are Equivalent to Multilayer Networks," Science, 247, 978-982.
Polak, E. and Ribière, G. (1969). "Note sur la Convergence de Méthodes de Directions Conjuguées," Revue Française d'Informatique et de Recherche Opérationnelle, 3, 35-43.
Polyak, B. T. (1987). Introduction to Optimization. Optimization Software, Inc., New York.
Polyak, B. T. (1990). "New Method of Stochastic Approximation Type," Automat. Remote Control, 51, 937-946.

Pomerleau, D. A. (1991). "Efficient Training of Artificial Neural Networks for Autonomous Navigation," Neural Computation, 3(1), 88-97.
Pomerleau, D. A. (1993). Neural Network Perception for Mobile Robot Guidance. Kluwer, Boston, MA.
Powell, M. J. D. (1987). "Radial Basis Functions for Multivariate Interpolation: A Review," in Algorithms for the Approximation of Functions and Data, J. Mason and M. Cox, Editors. Clarendon Press, Oxford.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. (1986). Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, New York.
Psaltis, D. and Park, C. H. (1986). "Nonlinear Discriminant Functions and Associative Memories," in Neural Networks for Computing (Snowbird 1986), J. S. Denker, Editor. American Institute of Physics, New York, 370-375.
Qi, X. and Palmieri, F. (1993). "The Diversification Role of Crossover in the Genetic Algorithms," in Proceedings of the Fifth International Conference on Genetic Algorithms (Urbana-Champaign 1993), S. Forrest, Editor. Morgan Kaufmann, San Mateo, CA, 132-137.
Qian, N. and Sejnowski, T. J. (1989). "Learning to Solve Random-Dot Stereograms of Dense Transparent Surfaces with Recurrent Back-Propagation," in Proceedings of the 1988 Connectionist Models Summer School (Pittsburgh 1988), D. Touretzky, G. Hinton, and T. Sejnowski, Editors. Morgan Kaufmann, San Mateo, CA, 435-443.
Rao, C. R. and Mitra, S. K. (1971). Generalized Inverse of Matrices and its Applications. John Wiley, New York.
Reed, R. (1993). "Pruning Algorithms - A Survey," IEEE Trans. Neural Networks, 4(5), 740-747.
Reeves, C. R. (1993). "Using Genetic Algorithms with Small Populations," in Proceedings of the Fifth International Conference on Genetic Algorithms (Urbana-Champaign 1993), S. Forrest, Editor. Morgan Kaufmann, San Mateo, CA, 92-99.
Reilly, D. L., Cooper, L. N., and Elbaum, C. (1982). "A Neural Model for Category Learning," Biological Cybernetics, 45, 35-41.
Reilly, D. L. and Cooper, L. N. (1990). "An Overview of Neural Networks: Early Models to Real World Systems," in An Introduction to Neural and Electronic Networks, S. F. Zornetzer, J. L. Davis, and C. Lau, Editors. Academic Press, San Diego, CA.
Rezgui, A. and Tepedelenlioglu, N. (1990). "The Effect of the Slope of the Activation Function on the Backpropagation Algorithm," in Proceedings of the International Joint Conference on Neural Networks (Washington, DC 1990), vol. I. IEEE, New York, 707-710.
Ricotti, L. P., Ragazzini, S., and Martinelli, G. (1988). "Learning of Word Stress in a Sub-Optimal Second Order Back-Propagation Neural Network," in IEEE International Conference on Neural Networks (San Diego 1988), vol. I. IEEE, New York, 355-361.
Ridgway III, W. C. (1962). "An Adaptive Logic System with Generalizing Properties," Technical Report 1556-1, Stanford Electronics Labs, Stanford University, Stanford, CA.
Riedel, H. and Schild, D. (1992). "The Dynamics of Hebbian Synapses can be Stabilized by a Nonlinear Decay Term," Neural Networks, 5(3), 459-463.
Ritter, H. and Schulten, K. (1986). "On the Stationary State of Kohonen's Self-Organizing Sensory Mapping," Biological Cybernetics, 54, 99-106.

Ritter, H. and Schulten, K. (1988a). "Convergence Properties of Kohonen's Topology Conserving Maps: Fluctuations, Stability, and Dimension Selection," Biological Cybernetics, 60, 59-71.
Ritter, H. and Schulten, K. (1988b). "Kohonen's Self-Organizing Maps: Exploring Their Computational Capabilities," in IEEE International Conference on Neural Networks (San Diego 1988), vol. I. IEEE, New York, 109-116.
Robinson, A. J. (1989). Dynamic Error Propagation Networks. Ph.D. Thesis, Engineering Department, Cambridge University, Cambridge, England.
Robinson, A. J. and Fallside, F. (1988a). "Static and Dynamic Error Propagation Networks with Application to Speech Coding," in Neural Information Processing Systems (Denver 1987), D. Z. Anderson, Editor. American Institute of Physics, New York, 632-641.
Robinson, A. J., Niranjan, M., and Fallside, F. (1988b). "Generalizing the Nodes of the Error Propagation Network," Technical Report CUED/F-INFENG/TR.25, Engineering Department, Cambridge University, Cambridge, England.
Rohwer, R. (1990). "The 'Moving Targets' Training Algorithm," in Advances in Neural Information Processing Systems 2 (Denver 1989), D. S. Touretzky, Editor. Morgan Kaufmann, San Mateo, CA, 558-565.
Romeo, F. I. (1989). Simulated Annealing: Theory and Applications to Layout Problems. Memorandum UCB/ERL-M89/29, University of California at Berkeley, Berkeley, CA.
Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, DC.
Roy, A. and Mukhopadhyay, S. (1991). "Pattern Classification Using Linear Programming," ORSA Journal on Computing, 3(1), 66-80.
Roy, A., Kim, L. S., and Mukhopadhyay, S. (1993a). "A Polynomial Time Algorithm for the Construction and Training of a Class of Multilayer Perceptrons," Neural Networks, 6(4), 535-545.
Roy, A. and Govil, S. (1993b). "Generating Radial Basis Function Net in Polynomial Time for Classification," in Proc. World Congress on Neural Networks (Portland 1993), vol. III, 536-539.
Rozonoer, L. I. (1969). "Random Logic Nets," Automat. Telemekh., 5, 137-147.
Rubner, J. and Tavan, P. (1989). "A Self-Organizing Network for Principal-Component Analysis," Europhysics Letters, 10, 693-698.
Rudin, W. (1964). Principles of Mathematical Analysis. McGraw-Hill, New York.
Rumelhart, D. E. (1989). "Learning and Generalization in Multilayer Networks," presentation given at the NATO Advanced Research Workshop on Neuro Computing, Architecture, and Applications (Les Arcs, France 1989).
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1. MIT Press, Cambridge, MA.
Rumelhart, D. E. and Zipser, D. (1985). "Feature Discovery by Competitive Learning," Cognitive Science, 9, 75-112.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Editors. MIT Press, Cambridge, MA.
Rutenbar, R. A. (1989). "Simulated Annealing Algorithms: An Overview," IEEE Circuits and Devices Magazine, 5(1), 19-26.
Saha, A. and Keeler, J. D. (1990). "Algorithms for Better Representation and Faster Learning in Radial Basis Function Networks," in Advances in Neural Information Processing Systems 2 (Denver 1989), D. S. Touretzky, Editor. Morgan Kaufmann, San Mateo, CA, 482-489.
Salamon, P., Nulton, J. D., Ruppeiner, G., Petersen, J., and Liao, L. (1988). "Simulated Annealing with Constant Thermodynamic Speed," Computer Physics Communications, 49, 423-428.
Sanger, T. D. (1989). "Optimal Unsupervised Learning in a Single Layer Linear Feedforward Neural Network," Neural Networks, 2(6), 459-473.
Sato, M. (1990). "A Real Time Learning Algorithm for Recurrent Analog Neural Networks," Biological Cybernetics, 62, 237-241.
Sayeh, M. R. and Han, J. Y. (1987). "Pattern Recognition Using a Neural Network," in Proc. SPIE Intelligent Robots and Computer Vision, vol. 848.
Schaffer, J. D., Caruana, R. A., Eshelman, L. J., and Das, R. (1989). "A Study of Control Parameters Affecting Online Performance of Genetic Algorithms for Function Optimization," in Proceedings of the Third International Conference on Genetic Algorithms (Arlington 1989), J. D. Schaffer, Editor. Morgan Kaufmann, San Mateo, CA, 51-60.
Schoen, F. (1991). "Stochastic Techniques for Global Optimization: A Survey of Recent Advances," Journal of Global Optimization, 1, 207-228.
Schultz, D. G. and Gibson, J. E. (1962). "The Variable Gradient Method for Generating Liapunov Functions," Trans. of the AIEE, 81(II), 203-210.
Schumaker, L. L. (1981). Spline Functions: Basic Theory. Wiley, New York.
Schwartz, D. B., Samalam, V. K., Solla, S. A., and Denker, J. S. (1990). "Exhaustive Learning," Neural Computation, 2(3), 374-385.
Scofield, C. L., Reilly, D. L., Elbaum, C., and Cooper, L. N. (1988). "Pattern Class Degeneracy in an Unrestricted Storage Density Memory," in Neural Information Processing Systems (Denver 1987), D. Z. Anderson, Editor. American Institute of Physics, New York, 674-682.
Sejnowski, T. J., Kienker, P. K., and Hinton, G. E. (1986). "Learning Symmetry Groups with Hidden Units: Beyond the Perceptron," Physica, 22D, 260-275.
Sejnowski, T. J. and Rosenberg, C. R. (1987). "Parallel Networks that Learn to Pronounce English Text," Complex Systems, 1, 145-168.
Shannon, C. E. (1938). "A Symbolic Analysis of Relay and Switching Circuits," Trans. of the AIEE, 57, 713-723.
Shaw, G. L. and Vasudevan, R. (1974). "Persistent States of Neural Networks and the Nature of Synaptic Transmissions," Math. Biosci., 21, 207-218.
Sheng, C. L. (1969). Threshold Logic. Academic Press, New York.

Shiino, M. and Fukai, T. (1990). "Replica-Symmetric Theory of the Nonlinear Analogue Neural Networks," Journal of Physics A, 23, L1009-L1017.
Schrödinger, E. (1946). Statistical Thermodynamics. Cambridge University Press, London.
Sietsma, J. and Dow, R. J. F. (1988). "Neural Net Pruning - Why and How," in IEEE International Conference on Neural Networks (San Diego 1988), vol. I. IEEE, New York, 325-333.
Silva, F. M. and Almeida, L. B. (1990). "Acceleration Techniques for the Backpropagation Algorithm," in Neural Networks (EURASIP Workshop 1990), L. B. Almeida and C. J. Wellekens, Editors, Lecture Notes in Computer Science. Springer-Verlag, Berlin, 110-119.
Simard, P. Y., Ottaway, M. B., and Ballard, D. H. (1988). "Analysis of Recurrent Backpropagation," Technical Report 253, Department of Computer Science, University of Rochester, Rochester, NY.
Simard, P. Y., Ottaway, M. B., and Ballard, D. H. (1989). "Analysis of Recurrent Backpropagation," in Proceedings of the 1988 Connectionist Models Summer School (Pittsburgh 1988), D. Touretzky, G. Hinton, and T. Sejnowski, Editors. Morgan Kaufmann, San Mateo, CA, 103-112.
Simeone, B., Editor (1989). Combinatorial Optimization. Springer-Verlag, Berlin.
Simpson, P. K. (1990). "Higher-Ordered and Intraconnected Bidirectional Associative Memory," IEEE Trans. Systems, Man, and Cybernetics, 20(3), 637-653.
Sklansky, J. and Wassel, G. N. (1981). Pattern Classifiers and Trainable Machines. Springer-Verlag, New York.
van der Smagt, P. P. (1994). "Minimisation Methods for Training Feedforward Neural Networks," Neural Networks, 7(1), 1-11.
Maynard Smith, J. (1987). "When Learning Guides Evolution," Nature, 329, 761-762.
Smolensky, P. (1986). "Information Processing in Dynamical Systems: Foundations of Harmony Theory," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Editors. MIT Press, Cambridge, MA.
Snapp, R. R., Psaltis, D., and Venkatesh, S. S. (1991). "Asymptotic Slowing Down of the Nearest-Neighbor Classifier," in Advances in Neural Information Processing Systems 3 (Denver 1990), R. P. Lippmann, J. E. Moody, and D. S. Touretzky, Editors. Morgan Kaufmann, San Mateo, CA, 932-938.
Solla, S. A., Levin, E., and Fleisher, M. (1988). "Accelerated Learning in Layered Neural Networks," Complex Systems, 2, 625-639.
Song, J. (1992). "Hybrid Genetic/Gradient Learning in Multi-Layer Artificial Neural Networks," Ph.D. Dissertation, Department of Electrical and Computer Engineering, Wayne State University, Detroit, Michigan.
Sontag, E. D. and Sussmann, H. J. (1985). "Image Restoration and Segmentation Using the Annealing Algorithm," in Proc. 24th Conference on Decision and Control (Ft. Lauderdale 1985). IEEE, New York, 768-773.
Soukoulis, C. M., Levin, K., and Grest, G. S. (1983). "Irreversibility and Metastability in Spin-Glasses. I. Ising Model," Physical Review B, 28, 1495-1509.
Specht, D. F. (1990). "Probabilistic Neural Networks," Neural Networks, 3(1), 109-118.

Sperduti, A. and Starita, A. (1993). "Speed Up Learning and Networks Optimization with Extended Back Propagation," Neural Networks, 6(3), 365-383.
Spitzer, A. R., Hassoun, M. H., Wang, C., and Bearden, F. (1990). "Signal Decomposition and Diagnostic Classification of the Electromyogram Using a Novel Neural Network Technique," in Proc. XIVth Ann. Symposium on Computer Applications in Medical Care (Washington, DC 1990). IEEE Computer Society Press, Los Alamitos, CA, 552-556.
Sprecher, D. A. (1993). "A Universal Mapping for Kolmogorov's Superposition Theorem," Neural Networks, 6(8), 1089-1094.
Stent, G. S. (1973). "A Physiological Mechanism for Hebb's Postulate of Learning," Proceedings of the National Academy of Sciences (USA), 70, 997-1001.
Stiles, G. S. and Denq, D.-L. (1987). "A Quantitative Comparison of Three Discrete Distributed Associative Memory Models," IEEE Trans. Computers, C-36, 257-263.
Stinchcombe, M. and White, H. (1989). "Universal Approximations Using Feedforward Networks with Non-Sigmoid Hidden Layer Activation Functions," in Proc. International Joint Conference on Neural Networks (Washington 1989), vol. I. IEEE, New York, 613-617.
Stone, M. (1978). "Cross-Validation: A Review," Math. Operationsforsch. Statistik, 9, 127-140.
Sudjianto, A. and Hassoun, M. H. (1994). "Nonlinear Hebbian Rule: A Statistical Interpretation," in Proc. IEEE International Conference on Neural Networks (Orlando 1994). IEEE, New York.
Sun, G.-Z., Chen, H.-H., and Lee, Y.-C. (1992). "Green's Function Method for Fast On-Line Learning Algorithm of Recurrent Neural Networks," in Advances in Neural Information Processing Systems 4 (Denver 1991), J. Moody, S. Hanson, and R. Lippmann, Editors. Morgan Kaufmann, San Mateo, CA, 317-324.
Sun, X. and Cheney, E. W. (1992). "The Fundamentals of Sets of Ridge Functions," Aequationes Math., 44, 226-235.
Suter, B. W. and Kabrisky, M. (1992). "On a Magnitude Preserving Iterative MAXnet Algorithm," Neural Computation, 4(2), 224-233.
Sutton, R. S. (1986). "Two Problems with Backpropagation and Other Steepest-Descent Learning Procedures for Networks," in Proceedings of the 8th Annual Conference of the Cognitive Science Society (Amherst 1986). Lawrence Erlbaum, Hillsdale, NJ, 823-831.
Sutton, R. S., Editor (1992). Special Issue on Reinforcement Learning, Machine Learning, 8, 1-395.
Sutton, R. S., Barto, A. G., and Williams, R. J. (1991). "Reinforcement Learning is Direct Adaptive Optimal Control," in Proc. of the American Control Conference (Boston 1991), 2143-2146.
Szu, H. (1986). "Fast Simulated Annealing," in Neural Networks for Computing (Snowbird 1986), J. S. Denker, Editor. American Institute of Physics, New York, 420-425.
Takefuji, Y. and Lee, K. C. (1991). "Artificial Neural Network for Four-Coloring Map Problems and K-Colorability Problem," IEEE Transactions on Circuits and Systems, 38, 326-333.

Takens, F. (1981). "Detecting Strange Attractors in Turbulence," in Dynamical Systems and Turbulence, D. Rand and L. Young, Editors, Lecture Notes in Mathematics, vol. 898 (Warwick 1980). Springer-Verlag, Berlin, 366-381.
Takeuchi, A. and Amari, S.-I. (1979). "Formation of Topographic Maps and Columnar Microstructures," Biological Cybernetics, 35, 63-72.
Tank, D. W. and Hopfield, J. J. (1986). "Simple 'Neural' Optimization Networks: An A/D Converter, Signal Decision Circuit, and a Linear Programming Circuit," IEEE Transactions on Circuits and Systems, 33, 533-541.
Tank, D. W. and Hopfield, J. J. (1987). "Concentrating Information in Time: Analog Neural Networks with Applications to Speech Recognition Problems," in IEEE First International Conference on Neural Networks (San Diego 1987), M. Caudill and C. Butler, Editors, vol. IV. IEEE, New York, 455-468.
Tattersall, G. D., Linford, P. W., and Linggard, R. (1990). "Neural Arrays for Speech Recognition," in Speech and Language Processing, C. Wheddon and R. Linggard, Editors. Chapman and Hall, London, 245-290.
Tawel, R. (1989). "Does the Neuron 'Learn' Like the Synapse?" in Advances in Neural Information Processing Systems 1 (Denver 1988), D. S. Touretzky, Editor. Morgan Kaufmann, San Mateo, CA, 169-176.
Taylor, J. G. and Coombes, S. (1993). "Learning Higher Order Correlations," Neural Networks, 6(3), 423-427.
Tesauro, G. and Janssens, B. (1988). "Scaling Relationships in Back-Propagation Learning," Complex Systems, 2, 39-44.
Thierens, D. and Goldberg, D. (1993). "Mixing in Genetic Algorithms," in Proceedings of the Fifth International Conference on Genetic Algorithms (Urbana-Champaign 1993), S. Forrest, Editor. Morgan Kaufmann, San Mateo, CA, 38-45.
Thorndike, E. L. (1911). Animal Intelligence. Hafner, Darien, CT.
Ticknor, A. J. and Barrett, H. (1987). "Optical Implementations of Boltzmann Machines," Optical Engineering, 26, 16-21.
Tishby, N., Levin, E., and Solla, S. A. (1989). "Consistent Inference of Probabilities in Layered Networks: Predictions and Generalization," in International Joint Conference on Neural Networks (Washington 1989), vol. II. IEEE, New York, 403-410.
Tolat, V. V. (1990). "An Analysis of Kohonen's Self-Organizing Maps Using a System of Energy Functions," Biological Cybernetics, 64, 155-164.
Tollenaere, T. (1990). "SuperSAB: Fast Adaptive Back Propagation with Good Scaling Properties," Neural Networks, 3(5), 561-573.
Tompkins, C. (1956). "Methods of Steepest Descent," in Modern Mathematics for the Engineer, E. Beckenbach, Editor. McGraw-Hill, New York.
Törn, A. and Žilinskas, A. (1989). Global Optimization. Springer-Verlag, Berlin.
Tsypkin, Ya. Z. (1971). Adaptation and Learning in Automatic Systems, translated by Z. J. Nikolic. Academic Press, New York. (First published in Russian under the title Adaptatsia i obuchenie v avtomaticheskikh sistemakh, Nauka, Moscow, 1968.)

Turing, A. M. (1952). "The Chemical Basis of Morphogenesis," Philosophical Transactions of the Royal Society, Series B, 237, 5-72.
Uesaka, G. and Ozeki, K. (1972). "Some Properties of Associative Type Memories," Journal of the Institute of Electrical and Communication Engineers of Japan, 55-D, 323-330.
Usui, S., Nakauchi, S., and Nakano, M. (1991). "Internal Color Representation Acquired by a Five-Layer Network," in Artificial Neural Networks: Proceedings of the 1991 International Conference on Artificial Neural Networks (Espoo 1991), T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, Editors. Elsevier Science Publishers B.V., Amsterdam, 867-872.
Vapnik, V. N. and Chervonenkis, A. Ya. (1971). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities," Theory of Probability and its Applications, 16(2), 264-280.
Veitch, E. W. (1952). "A Chart Method for Simplifying Truth Functions," in Proc. of the ACM, 127-133.
de Villiers, J. and Barnard, E. (1993). "Backpropagation Neural Nets with One and Two Hidden Layers," IEEE Transactions on Neural Networks, 4(1), 136-141.
Vogl, T. P., Mangis, J. K., Rigler, A. K., Zink, W. T., and Alkon, D. L. (1988). "Accelerating the Convergence of the Back-propagation Method," Biological Cybernetics, 59, 257-263.
Vogt, M. (1993). "Combination of Radial Basis Function Neural Networks with Optimized Learning Vector Quantization," in Proceedings of the IEEE International Conference on Neural Networks (San Francisco 1993). IEEE, New York, 1841-1846.
Waibel, A. (1989). "Modular Construction of Time-Delay Neural Networks for Speech Recognition," Neural Computation, 1(1), 39-46.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. (1989). "Phoneme Recognition Using Time-Delay Neural Networks," IEEE Transactions on Acoustics, Speech, and Signal Processing, 37, 328-339.
Wang, C. (1993). A Robust System for Automated Decomposition of the Electromyogram Utilizing a Neural Network Architecture. Ph.D. Dissertation, Department of Electrical and Computer Engineering, Wayne State University, Detroit, Michigan.
Wang, Y.-F., Cruz Jr., J. B., and Mulligan Jr., J. H. (1990). "Two Coding Strategies for Bidirectional Associative Memory," IEEE Transactions on Neural Networks, 1(1), 81-92.
Wang, Y.-F., Cruz Jr., J. B., and Mulligan Jr., J. H. (1991). "Guaranteed Recall of All Training Pairs for Bidirectional Associative Memory," IEEE Transactions on Neural Networks, 2(6), 559-567.
Wasan, M. T. (1969). Stochastic Approximation. Cambridge University Press, New York.
Watta, P. B. (1994). "A Coupled Gradient Network Approach for Static and Temporal Mixed Integer Optimization," Ph.D. Dissertation, Department of Electrical and Computer Engineering, Wayne State University, Detroit, Michigan.
Waugh, F. R., Marcus, C. M., and Westervelt, R. M. (1991). "Reducing Neuron Gain to Eliminate Fixed-Point Attractors in an Analog Associative Memory," Physical Review A, 43, 3131-3142.
Waugh, F. R., Marcus, C. M., and Westervelt, R. M. (1993). "Nonlinear Dynamics of Analog Associative Memories," in Associative Neural Memories: Theory and Implementation, M. H. Hassoun, Editor. Oxford University Press, New York, 197-211.

Weigend. Rumelhart. and Barnard. B." in Proceedings of the IEEE International Conference on Neural Networks (San Francisco 1993). 339-356. "Optimizing Neural Networks Using Faster. of the NTAO Advanced Research Workshop on Comparative Time Series Analysis (Santa Fe 1992). and Backpropagation. E. (1989). (1990). (1962). 1. B. J.. L-623-L-630.. (1985). "Partitions of Unity Improve Neural Function Approximators. (1958). MA. Editors. "Reliable. S. and Hoff Jr. Werbos. and R. N." in Proc. Touretzky. Werntges. "Results of the Time Series Prediction Competition at the Santa Fe Institute. Editor. Weigend. Morgan Kaufmann. IEEE. Wettschereck. "Adaptive Switching Circuits.1963. Time Series Prediction: Forecasting the Future and Understanding the Past. B. F. (1988). A. 78-123. 425-464. 3(6). (1974). "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. IEEE. J. S. J. B. Editors. L.Wegstein. Addison-Wesley. on Neural Networks (San Diego 1982). (1993). 1415-1442. J. F. (1991). E. M. and Angell. and Lehr. IEEE 1st Int. 96-104." IRE Western Electric Show and Convention Record." Journal De Physique Lett. E. Wessels. H. D. New York. B. Moody. (1992). Part 4. and Fogelman-Soulié. A." ACM Commun. 899-905. (1960). A. Hanson. G. Conf. Trainable Networks for Computing and Control. vol. Lippmann.. 914-918. 21-25. and Huberman. H. W. San Mateo. "Avoiding False Local Minima by Proper Initialization of Connections." in Proceedings of the Third International Conference on Genetic Algorithms (Arlington 1989). "Scaling Laws for the Attractors of Hopfield Networks. "Accelerating Convergence in Iterative Processes. P. More Accurate Genetic Search. "Improving the Performance of Radial Basis Function Networks by Learning Center Locations. H." Aerospace Eng. 9-13. Proc. (1992). 875-882. P. Widrow. Werbos.
. N." Proc." Ph. (San Francisco 1975)." Neural Networks. 1786-1793. S. San Mateo. A. "Learning in Artificial Neural Networks: A Statistical Perspective. 9th Asilomar Conf. T. S. "30 Years of Adaptive Neural Networks: Perceptron. Morgan Kaufmann." IEEE Transactions on Neural Networks.. 1133-1140. Weisbuch. (1975). S. T. 391-396. and Dietterich. P.. Editors (1994). A. Dissertation. Lippmann. North Hollywood.. D.. Cambridge. 143-158. Circuits Syst. J. P. (1989)." Proceedings of the IEEE International Conference on Neural Networks (San Francisco 1993). Weigend." Neural Networks. A. 1(6). II. D. "Generalization by Weight-Elimination with Application to Forecasting. 21 (September issue)." in Advances in Neural Information Processing Systems 4 (Denver 1991). "An Adaptive Recursive Digital Filter. M. I. and D. and Gershenfeld. Moody. CA." in Advances in Neural Information Processing Systems 3 (Denver 1990). Harvard University. A. Widrow. vol. "Generalization of Backpropagation with Application to Gas Market Model. R. (1987). Proc. and Gershenfeld. and Hanson. E. Schaffer. A. Widrow. J. IEEE. Western Periodicals. 78(9). (1993). Whitley.D. D. Reading MA. J. Morgan Kaufmann. Widrow. E. 1. San Mateo. vol. White." Plenary Speech. III. Committee on Applied Mathematics. S. Comput. A. 46(14). Madaline. B. "ADALINE and MADALINE . New York. White.

D. "A Class of Gradient Estimating Algorithms for Reinforcement Learning in Neural Networks. IEEE. 6(5). 64(8). D. "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Willshaw. R. vol. 3(1). 115-121." IEEE Trans." Proc. O. "Geometric Analysis of Neural Network Capabilities. UK. NY. (1987). N. Larimore. "Experimental Analysis of the Real-Time Recurrent Learning Algorithm. Elec. M. Wittner.. B. 8. "Evolving Controls for Unstable Systems. and Johnson Jr.. 561-564.. Xu. (1976). New York. (1962). J. 87-111.. N. A. Oxford. Oxford University Press. R. Doubleday. "Stationary and Nonstationary Learning Characteristics of the LMS Adaptive Filter. (1976). R. NJ.Widrow. A. and Zipser." Neural Computation. on System. "Theories of Unsupervised Learning: PCA and its Nonlinear Extensions. J. (1956)." in Connectionist Models: Proceedings of the 1990 Summer School (Pittsburgh 1990). L. (1994). R. "Strategies for Teaching Layered Networks Classification Tasks. IEEE Press. "Punish/Reward: Learning with a Critic in Adaptive Threshold Systems. D. Morgan Kaufmann. "Least Mean Square Error Reconstruction Principle for Self-Organizing Neural-Nets. (1989b). 1151-1162. Butler. A. D. Wong. S. (1985). 91102. S. XXX. Z. The Algebraic Eigenvalue Problem. C.. Dissertation. Anderson. "How Patterned Neural Connections can be set up by Self-Organization.. and Stearns. Caudill and C. III. 627-648. S. Elman. Widrow. IEEE." First IEEE Int.. vol. O. Dept. Princeton University. New York. R. 270-280. K. IEEE. B. J. and Leighton. P. and G. (1988). (1965). Y. McCool. Gupta. Englewood Cliffs. D. Wiener. M. L. on Neural Networks (San Diego 1987). Widrow. R. Wieland." in IEEE First International Conference on Neural Networks (San Diego 1987). L." IEEE International Conference on Neural Networks. Computers. Editors. B 194. Williams. Wilkinson. XXXpage numbers XXX. SMC-3. G. "Bounds on Threshold Gate Realizability. 385-392. R. II. and Maitra. Editor. J. Williams. Editors. J. 
Prentice-Hall. (1989a). and Denker.-F.
. Threshold Logic. Winder. J. J. C. American Institute of Physics. and Cybernetics. D. "Learning Convergence in the Cerebellar Model Articulation Controller. on Neural Networks. E. 229-256. (1993). (1992). and Zipser. 601-608. San Mateo. Touretzky. Williams. of Mathematics. and Sideris. J. Wieland. 850-859. "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. R. 1. 455-465. I Am a Mathematician. D. J. (1992). Winder. M. Xu. 1(2). Man. 431-445." IEEE Trans." in Neural Information Processing Systems (Denver 1987)." Machine Learning. and von der Malsburg. Hinton. H. Adaptive Signal Processing. (1987). B. Williams. Conf. S." IEEE Trans. New York. (Orlando 1994)." Connection Science. (1963). Ph." Proceedings of the Royal Society of London. (1991). (1973)." Neural Networks. S. B. EC-12(5). vol.

and Amari.-K. Bergstresser. C. and Yu. (1989). K. 11571161. M. M. H. Yuille. N. W. Y. A. Oxford University Press." in IEEE International Conference on Neural Networks (San Francisco 1993)..-I. M. Yoshizawa. "Backpropagation with Homotopy. A." Proc. New York. R. Editor.. Capacity. X.-I. and Hassoun. and Peterson. Yu. W. and Amari. (1993). M. S.Yanai. W. H. H.. and Cohen. 61. Zhang. "Quadrature and the Development of Orientation Selective Cortical Cells by Hebb Rules. "A Desktop Neural Network for Dermatology Diagnosis. 167-176.. "Dynamics and Formation of Self-Organizing Maps. L.." Neural Computation. H. S.. D. L. L. (1993b). Youssef. (1989). Zak. Morita." Neural Computation." Biological Cybernetics. 43-52. 1053. and Miller. M. 3(2)." Neural Networks." Neural Networks. 183-194.. Editor.. (1991). "Capacity of Associative Memory Using a Nonmonotonic Neuron Model. and Sawada. Kammen. (1989)." Neural Networks. L. J. M. Liu. 54-66." Journal of Neural Network Computing. (1993). 223-228. Yoshizawa. IEEE. D. vol. 239-248. Optical Pattern Recognition. S.
. III. Brobst. (1989). "Analysis of Dynamics and Capacity of Associative Memory Using a Nonmonotonic Neuron Model. New York. SPIE. "Associative Memory Network Composed of Neurons with Hysteretic Property. O. 52-59. 3(1). Morita. Yoon. Loh. (1993a). 5(3). Y. R. 6(2). "Dynamic Autoassociative Neural Memory Performance vs. "A New Acceleration Technique for the Backpropagation Algorithm. Summer. Hassoun. Yang. S. M.. "Terminal Attractors in Neural Networks. (1990)." in Associative Neural Memories: Theory and Implementation. S. P. 363-366. 258-274. 2(4)..

20-22. 7 AHK see Ho-Kashyap learning rules AI. 220. 398. 46. 51 algorithmic complexity. 67 m-LMS rule. 25 f-surface. 77. 332 ADALINE. 24. 369 schedule. 26 analog optical implementations. 49 sigmoidal. 67-69. 387-388 logistic function. Hassoun MIT Press
A
a-LMS rule. 323-328 adjoint net. 198. 3 gate/unit. 182 ambiguous response. 51 ALVINN. 302 ambiguity. 20-21. 428
. 433 activation slope. 21. 413-414 activation function. 270 admissible pattern. 25 f-space. 244-246 AND. 20. 201. 332 hysteretic. 220 hyperbolic tangent. 198. 20 dichotomy. 53 analog VLSI technology. 383 activity pattern. 25 f-mapping. 66. 53 annealing deterministic. 201 nondeterministic. 220. 385 nonsigmoidal. 428 nonmonotonic. 72 f--general positon. 66 adaptive resonance theory (ART) networks. 20 f-separable. 76 sign (sgn). 79. 36. 77. 219 homotopy. 85 adaptive linear combiner element. 347. 24-25 A analog-to-digital (A/D) convertor.Subject Index
Subject Index

144 architecture adaptive resonance. 348. 350-351 oscillation. 381. 353 dynamic (DAM). 375 noise-suppression. 374 ground state. 374 linear (LAM). 367. 372. 385. 345 absolute capacity. 381. 346 hetero. 375 high-performance. 254 unit-allocating. 366. 390 basin of attraction. 197. 221 function. 319 fully recurrent. 36 time-delay. 382 variations on. 387. 38. 43. 378. 374-375. 352. 388 simple.see also simulated annealing anti-Hebbian learning. 1 ART. 332 CCN. 100 approximation capability. 348. 346 relative capacity. 38. 347. 373 performance characteristics. 346 pairs. 346 spurious memories. 363. 40. 365. 388 cross-talk. 318 artificial intelligence. 259 multilayer feedforward. 144 theory. 254 threshold-OR net. 1 artificial neuron. 250. 363. 323-328 association auto. 48. 323 AND-OR network. 374 performance criteria. 261 partially recurrent. 389 fundamental memories. 383 default memory. 51 artificial neuarl network. 184. 374-375 recording/storage recipe. 259 randomly interconnected feedforward. 364. 275-394 see also DAM
. 351 optimal linear (OLAM). 353-374 error correction. 346 low-performance. 365. 345. 373 oscillation-free. 359. 363. 363. 346 associative memory. 271. 370-371. 374-375. 366-367. 385. 35-36 bottleneck.40 recurrent. 366.

150 automatic scaling. 148 attractor state. 219-221 applications. 358. 203. 147. 310
. 172. 375 autoassociative net. 230-234 derivation. 202-203. 287 Gaussian. 218 stochastic. 167. 224. 328 autocorrelation matrix. 318 backpropagation. 221. 203-205. 325 average entropic error. 71. 211-213 local minima. 271-274 variations on. 202 convergence speed. 234. 211 criterion (error) functions. generalization phenomenon. 210-211 weight decay. 424 learning rate. 203. 287 batch mode/update.associative reward-penalty. 328 basis function. 259-262 time-dependent recurrent. 290 Bayes decision theory. 421 momentum. 199-202. 366. 183 B backprop. 268. 358 asymptotically stable points. 211-226. 248 autoassociative clustering net. 202 through time. 455 see also backpropagation backprop net. 186 average generalization ability. 90. 199. 112 Bayes classifier. 176 average prediction error.199-201 example. 199-202. 265-271 second-order. 81. 208 batch mode. 225 basin of attration. 201. 265. 183 average learning equation. 354. 198 recurrent. 229 incremental. 270. 271. 202. 89 asymptotic stability. 63. 234-253 basin of attraction. 97-98. 271 Langevin-type. 199-202 activation function. 91. 213-218 network. 213. 230-234 weight initialization.

446 Butz's rule. 49 Bernstein polynomials. 250 brain-state-in-a-box (BSB) model. 364. 441 boarder aberration effects. 363-365 linear threshold gate (LTG).Bayes' rule. 445 bit-wise complementation. 364 theorem. 422. 42. 311 classifier system. 318-322 center of mass. 375-381 see also DAM building block hypothesis. 50. 272 calibration. 188 threshold. 431-432 Boltzmann learning. 376 bidirectional associative memory (BAM). 380 Hopfield network. 22 random. 318-322 CCN. 107 capacity. 58. 436 cerebeller model articulation controller (CMAC). 41 polynomial threshold gate (PTG). 426 Boltzmann-Gibbs distribution. 331. 431 Boolean functions. 35. 431-432 Boltzmann constant. 304-309 cerebellum. 416 Chebyshev polynomial. 306. 197. XOR bottleneck. 19. 5 nonthreshold. 364 expansion. 426. 17. XNOR. 308. 3-4. 49 bias. 301-304 relation to Rosenblatt's perceptron. 29. 116 Boltzmann machine. 461
. 304 see also AND. 49 classifier. 446 building blocks. 393 binary representation. 21 see also associative memory capacity cascade-correlation net (CCN). 105 central limit theorem. 354. 301 chaos hypothesis. 63 C calculus of variations. 107 process. 398 binomial distribution. 434 bell-shaped function.

110 combinatorial complexity. 346 eigenvalues. 102. 76. 290 stochastic analysis. 69 convergence phase. 217-218 connections lateral. 318 space. 180. 328 conditional probability density. 346 cost function. 320 critic. 395. 85 conjugate gradient method. 357 concept forming cognitive model. 106. 306 codebook. 51 computational complexity. 167 cluster granuality. 125. 103. 171. 429 compact representation. 322 CMAC. 51 time. 461 cluster membership matrix. 52. 100. 167. 143. 301-304. 120. 312 covariance. 326 network. 91. 88 critical features. 311 competitive learning. 427-428 correlation learning rule. 51 Kolmogorov. 107. 51 learning. 269 computational energy. 168 deterministic analysis. 334 clusters. 51 algorithmic. 106.classifiers. 167 complexity. 173 lateral-inhibitory. 70 cooling schedule. 91 eigenvectors. 148 correlation matrix. 323 constraints satisfaction term. 306-308
. 397 see also criterion function cost term. 424 convex. 187. 288 clustering. 328 behavior. 395 combinatorial optimization. 116 convergence-inducing process. 331. 310 polynomial. 168. 91 correlation memory. 396 Coulomb potential. 264 convergence in the mean. 323 self-excitatory. 396 controller.

62 region. 384. 290 criterion function. 16. 71 crossing site. 109 dead-zone. 168. 68. 83. 221 Cybenko's theorem. 127-133. 230 Gordon et al. 332. 363. 386 curve fitting. 393 BSB. 86 sum of squared error (SSE). 389-391 heteroassociative. 21. 392-394 hysteretic activations. 231 perceptron. 70. 86. 68. 71 Minkowski-r. 199 Durbin and Willshaw. 24 deep net. 145. 79. 87. 381 projection. 311 surface. 58. 375-381 correlation. 59 D DAM. 47 training cycle. 226-230. 231 see cost function criterion functional. 369-374 sequence generator. well-formed. 63. 65. 187. 141 entropic. 195. 27. 137 Kohonen's feature map. 441 crossover see genetic operators cross-validation. 86. 311 hyperplane. 155. 271 critical overlap. 82. 288. 91 cross-correlation vector. 230. 220. 230-234 backprop. 318 degrees of freedom. 171 mean-square error (MSE). 394-399 exponential capacity. 386-388 nonmonotonic activations.. 143. 353-374 bidirectional (BAM). 62 DEC-talk. 59 training pass. 381 combinatorial optimization. 186. 236 decision boundary. 391-392 see also associative memory data compression.cross-correlations. 353 travelling salesman problem. 425
. 89. 29. 63.

173 desired associations. 334-337 encoder. 25. 199. 199. 361. 323. 362. 80 discrete-time states. 188 reduction. 232 Gaussian. 393. 274 distributed representation. 25 machine. 15. 346 desired response. 308-309. 250. 147 error-backpropagation network see backpropagation error function. 433 environment nonstationary. 334 EMG. Laplace. 268 deterministic annealing. erf(x). 247 energy function. 153 ergodic. 376. 185 dimensionality expansion. 231 non-Gaussian. 432 truncated. 186 entropy. 12 dynamic associative memory see DAM dynamic mutation rate. 268 see also criterion function error function. 120 electromyogram (EMG) signal. 143. 291. 70. 231 . 421 dichotomy. 332 E eigenvalue. 443 dynamic slope. 455. 120 direct Ho-Kashyap (DHK) algorithm. 364
. 459 density-preserving feature. 52 diffusion process. 16-18. 357. 331 elastic net. 88. 333 distribution Cauchy. 369 deterministic unit. 359 entropic loss function. 289.tapped-delay lines. 426. 376 eigenvector extraction. 231 don't care states. 97. 87 device physics. 254 delta learning rule. 153 stationary. 420. 43 linear. 92.

257 filtering. 189 extreme points. 86. 154 nonlinear. 418 filter FIR. 233 function approximation. 157 effects. 288 exclusive NOR see XNOR exclusive OR. 296 fundamental theorem of genetic algorithms. 219.error rate. 457 multimodal. 5 see XOR expected value. 424 exponential decay laws. 254 IIR. 319 function counting theorem. linear. 17 function decomposition. 27-29 expected number. 171. 146. 211. 28-29 F false-positive classification error. 232 estimated target. 311. flat spot. 69 exploration process. 46. 288. 443 see also genetic algorithms fixed point method. 220. 295 extreme inequalities. 299. 179 free parameters. 360 fixed points. 77 forgetting term. 77. 440 fitness function. 241 Feigenbaum time-series. 348 error suppressor function. 439. 294-295 feature map(s). 201 Euclidean norm. 294. 295. 173 extrapolation. 231 elimination. 257 finite difference approximation. 58. 272 fitness. 443 G GA see genetic algorithm
. 354 effective. 278 Fermat's stationarity principle.

357 Gamba perceptron. 440. 445 mutation. 446 gain. 452 GA-deceptive problems. 180. 206. 418-419. 243. 307 parameter. 310 guidance process. 442 crossover. 418-419 gradient-descent/ascent startegy. 70. 439-447 example. 351 GSH model. 376. 226. 148. 202. 291 see also pseudo-inverse genetic algorithm (GA). 440. 180. 419 optimization. 443. 70. 233 error. 217 Greville's theorem. 200. 421 fit. 308 Gaussian-bar unit see units Gaussian distribution see distribution Gaussian unit see unit general learning equation. 180-187 worst case. 301-302 theoretical framework. 26-27. 180 enforcing. 424 H Hamming Distance. 419 Glove-Talk. 300.GA-assisted supervised learning. 24. 206. 417 minimization. 288. 371
. 440 global descent search. 395 minimum. 64. 182 average. 186. 425 search strategy. 295 ability. 445 reproduction. 307. 419. 394-399 gradient system. 440. 187. 442. 16-17. 447-452 genetic operators. 454 GA-based learning methods. 358. 226 local. 180. 236-240 gradient descent search. 295 minimal solution. 396-397 Gram-Schmidt orthogonalization. 221. 145 general position. 184-185 generalized inverse. 145. 420 gradient net. 302. 41 generalization. 185.

360 discrete. 455-456. 268
. 198. 408 normalized. 19 continuous. 307. 100 Hermitian matrix. 197-198. 79. 148-149. 124 hidden layer. 161 Hebb rule. 376. 311 hysteresis. 101. 91 Hessian matrix. 202 incremental update. 333 higher-order unit. 218. 317 Ho-Kashyap learning rules. 25. 202 input sequence. 78-82 Hopfield model/net. 438 Hassoun's rule. 59. 200 hidden targets. 453 hyperbolic tangent activation function see activation function hyperboloid. 311 hyperspheres.hypersphere. 98. 286 weights. 64. 357. 365 net. 296-297 hardware annealing. 103 Ho-Kashyap algorithm. 430-431. 447 incremental gradient descent. 354. 386-387 term. 16-17. 240 hard competition. 286. 453-454 hybrid learning algorithms. 101. 66. 91 normalized. 174 instantaneous error function. 292 image compression. 396-397. 396 hyperellipsoid. 418 hexagonal array. 437 hybrid GA/gradient search. 387 I ill-conditioned. 365 handwritten digits. 429 capacity. 458 hidden units. 349. 302. 212. 5. 396. 88. 316 hyperplane. 64. 315 hyperspherical classifiers. 70. 455 hidden-target space. 261 input stimuli. 316 hypercube. 311 hypersurfaces. 431 higher-order statistics. 247-252 implicit parallelism. 215. 362 stochastic.

292 quality. 120. 36 K-map technique. 98 see also principal component analysis kernel. 318 Karhunen-Lo‚ve transform. 424 Laplace-like distribution. 287 unit. 290 key input pattern. 290 incremental algorithm. 167. 310. 223.instantaneous SSE criterion function. 149 joint distribution. 290
. 125 J Jacobian matrix. 296 k-nearest neighbor classifier. 172-173 Kirchoff's current law. 429 isolated-word recognition. 173 distribution. 105. 328 Boltzmann. 350 interpolation. 354 Kohonen's feature map. 145. 101 lateral inhibition. 6. 292 Ising model. 57 autoassociative. 171 Kolmogorov's theorem. 97 learning associative. 272 LAM see associative memory Langevin-type learning. 85 K Karnaugh map (K-map). 346. 103 lateral weights. 85 lateral connections. 357 weights. 330. 290 function. 199 interconnection matrix. 204. 439 competitive. 46-47 Kronecker delta. 37 k-means clusterting. 431. 291 matrix. 173 leading eigenvector. 77. 268 L Lagrange multiplier. 347-348 kinks.

91. 78-82. 434. 183-184. 289 temporal. 58. 57. 110. 76. 92. 72 Levenberg-Marquardt optimization.436 Ho-Kashyap. 95-97 LMS. 304 pseudo-Hebbian. 304. 92. 57 see also learning rule learning curve. 291. 358 first method. 173 dynamic. 67-69. 161 Hebbian. 206-209 Hassoun. 148 covariance. 87-89. 232 signal. 330. 208 search-then-converge. 213 optimal. 433-434 Butz. 331. 457 Langevin-type. 454 Mays. 218 hybrid GA/gradient descent. 105 correlation. 88. 253 unsupervised. 158 learning vector quantization. 304. 57. 66.. 88. 111 least-mean-square (LMS) solution. 173 adative. 436 associative reward penalty. 59. 295 parameter. 99 Widrow-Hoff. 57 anti-Hebbian. 199. 212 learning rule. 63 competitive. 178 Sanger.hybrid. 218 Liapunov asymptotic stability theorem. 153 reinforcement. 453. 151. 150. 308 Linsker. 199-202. 76 delta. 150. 455 Boltzmann. 334. 106 on-line. 66 Oja. 155-156 perceptron. 330 Yuille et al. 186 learning rate/coefficient. 434. 154-155. 89 backprop. 424 leaky. 430
. 146 supervised. 165 robustness. 459 global descent. 357-361. 101. 71. 149 function. 176.

378 linear array. 2. 286 representation. 67-69. 74 see learning rules local encoding. 417 local property. 157 locality property.global asymptotic stability theorem . 289 locally tuned units. 41. 304. 330 batch. 95-97 LMS rule. 109. 371. 77. 79-80
. 288 also see activation function lossless amplifiers. 350 linearly separable mappings. 294. 43 LTG. 120 linear associative memory. 68. 367. 81 vector. 358 limit cycle. 455 Linsker's rule. 397 lower bounds. 99. 68. 51 excitation. 62. 84 matrix differential calculus. 420 local minimum. 322 Manhattan norm. 304 network. 285 log-likelihood function. 286. 2. 362. 111 M Mackey-Glass time-series. 288 response. 295 local maximum. 113. 80 linear unit. 304. 203. 301. 429 linear separability. 176-177 local fit. 50. 346 linearly dependent. 61. 281. 35 see linear threshold gate LTG-realizable. 85 logic sell array. 150. 304 logistic function. 359 second (direct) method. 5. 5 LVQ. 156 margin. 157 linear programming. 188. 64. 74 incremental. 346 linear matched filter. 316-317 linear threshold gate (LTG).

171. 186 Mays's rule. 83. 71. 230 see criterion functions Minkowski-r weight update rule. 438-439 theroy. 144. 426-427 minimal disturbance principle. 173. 150.MAX operation. 297 estimator. 231. 350 minimum energy configuration. 436 mean-valued approximation. 436. 213-214 motor unit. 3 misclassified. 246 mean-field annealing. 337 mutation see genetic operators N NAND. 6 natural selection. 439 multivatiate function. 21-24 minimum Euclidean norm solution. 231. 222 memory see associative memory memory vectors. 347 Metropolis algorithm. 318 multilayer feedforward networks see architecture multiple point crossover. 439 neighborhood function. 436 memorization. 291 Minkowski-r criterion function. 442 max selector. 407 maximum likelihood. 66 minimal PTG realization. potential.232 minterm. 425 minimum MSE solution. 72 minimum SSE solution. 438 learning. 420 MUP. 334-337 moving-target problem. 85. 84 estimate. 437-438 mean transitions. 113. 456 multiple recording passes. 451. 65 momentum. 352 multipoint search strategy. 66 McCulloch-Pitts unit see linear threshold gate medical diagnosis expert net. 76. 177-180
. 61.

215. 143. 295 optical interconnections. 212. 6 nonlinear PCA. 302 ordering phase. 439 OR. 173-174 on-line classifier. 80 nonlinearly separable function. 242 nonlinear activation function. 66. 3 gate/unit. 53 optimal learning step. 332 NP-complete. 215-216 approximate. 173 potential. 419 see also criterion function off-line training. 279 optimization. 36. 306 nonstationary process. 64. 174 neural network architecture see architecture neural net emulator. 102 nonlinear dynamical system. 264 Newton's method. 292 O objective function. 63. 92. 347 orthonormal set. 78 training set.NETtalk. 157 OLAM see associative memory on-center off-surround. 12 mapping. 154 nonstationary. 188 problems. 326 nonthreshold function. 155-156 multuple-unit rule. 63. 272 Oja's rule 1-unit rule. 84 nonuniversality. 395. 12 NOR. 234-236 neural field. 252 nonlinear separability. 116 orthonormal vectors. 322 on-line training. 5. input distribution. 101. 15
. 99 Oja's unit. 417. normal distribution. 417. 311 on-line implementations. 307 Nyquist's sampling criterion. 264 nonlinear repreasentations. 5.

372 recall phase. 225 perceptron criterion function. 372 spin-glass phase. 367.outlier data (points). 231. 386 overtraining. 65. 35-37. 330 prediction error. 383 partial reverse method. 350 phase diagram. 301 potential energy. 316-318 population. 51 PCA net. 224 approximation. 230 P parity function. 222. 61-62 perfect recall. 163. 8. 310 polynomial complexity. 291 training time. 144. 435 pattern ratio. 287. 418 positive semidefinite. 52 power method. 318. 252 penalty term. 384. 395 polynomial threshold gate (PTG). 256. 58-60. 63. 372 phonemes. 168. 428
. 347. 16. 264 partial reverse dynamics. 145. 15. 92. 322 overlap. 296 pattern completion. 376 phonotopic map. 306 ploynomial-time classifier (PTC). 217 polynomial. 189. 370 pattern recognition. 264 Polack Ribi‚re rule. 86 perceptron learning rule. 150 postprocessing. 149. 349. 440 positive definite. 458 partial recurrence. 235 piece-wise linear operator. 85 prediction set. 177 power dissipation. 361. 385 partition of unity. 99. 230 premature convergence. 304 convergence proof. 367 origin phase. 308. 124 plant identification. 147. 311. 312 field. 422 function. 382 critical. 70. 317 overfitting. 372 oscillation phase. 294. 371. 212.

98 principal eigenvector. 287 radius of attraction. 316-318 PTG see polynomial threshold gate Q QTG see quadratic threshold gate quadratic form. 315 real-time learning. 252 analysis (PCA). 211 preprocessing. 285. 294 RCE. 347 PTC net. 102 principal component(s). 301 pseudo-inverse. 51 Rayleigh quotient. 274-275 recall region. 328 phase. 295. 98. 333 prototype unit. 92. 322. 101. 26-27 prototype extraction. 7. 250 principal directions. 11. 316 quadratic unit. 79. 286-294 radially symmetric function.premature saturation. 253 subspace. 155. 353 pseudo-orthogonal. 301
. 70. 244 real-time recurrent learning (RTRL) method. 357 quadratic threshold gate (QTG). 361 quadratic function. 101. 288 localized. 102 quantization. 324 pruning. 250 quickporp. 390 random motion. 225. 312-315 RCE classifier. 252 probability of ambiguous response. 287. 367 receptive field. 164 principal manifolds. 214 R radial basis functions. 97. 292 centers. 155 RBF net. 285-287 radial basis function (RBF) network. 11 layer. 422 random problems. 332 nonlinear.

359. 311. 365 parallel. 345 multiple-pass. 179 RMS error. 366 representation layer. 313. 113 retrieval dynamics. 340 reinforcement learning. 366. 309 root-mean-square (RMS) error. 350. 163 smoothness. 302 semilocal. 225. 85 robustness. 299 width. 307. 440 reconstruction vector. 304 roulette wheel method. 440. 202 robust decision surface. 382 mode. 325 see also adaptive resonance theory resting potential. 316 regression function. 265-271 recurrent net. 312-315 retinotopic map. 250 reproduction see genetic operators resonance. 370. 346 correlation. 232. 85. 359 repeller. 288 recombination mechanism. 371 one-pass. 274 see also architecture region of influence. 296 term. 365 properties. 369 Riccati differential equation. 230. 87 reinforcement signal. 150. 145. 88. 386 recurrent backpropagation. 376 normalized correlation. 152. 346. 57. 379
. 271 . 433 see also criterion function relaxation method. 62 robustness preserving feature. 73 regularization effects. 348 projection. 442 row diagonal dominant matrix. 109 recording recipe. 90. 78 robust regression. 165 relative entropy error measure. 207 replica method. 174 restricted Coulomb energy (RCE) network. 202 Rosenblatt's perceptron.overlapping.

439-452 global. 391 recognition. 14 sequence generation. 106 sensor units. 21 nonlinear. 206-209. 48. 431 second-order search method. 321 sigmoid function. 168 relationship. 17 separating hyperplane. 173 semilocal activation. 388 self-coupling terms. 371 self-scaling. separating capacity. 268. 452 self-connections. 100 Sanger's rule. 328
. 325 self-stabilization property. 375. 18 separating surface. 302 scaling see computational complexity schema (schemata). direction. 358 Sanger's PCA net. 254-256 series-parallel identification model. 299 sensory mapping. 257 shortcut connections. 402 similarity measure. 101 self-organization. 419 stochastic. 99. 144 see also activation functions sign function see activation functions signal-to-noise ratio. 444 defining length. 444 search. 323 sensitized units. 171 neural field. 288. 113. 254-255 reproduction. 443. 447 order. 163 search space. 64. 419 gradient ascent. 112 self-organizing feature map. 215 secong-order statistics. 419 gradient descent. 217 genetic.S saddle point.

367 spin glass region. 427.simulated annealing. 42 stochastic approximation. 87. 298 soft weight sharing. 58 Smoluchowski-Kramers equation. 125 see also self-organizing feature map soft competition. 428. 422 global search. 425-426 algorithm. 431
. 165. 425 learning equation. 216 . 265 specialized associations. 147 differential equation. 73 algorithm. 125. 431 gradient descent. 421. 179 dynamics. 302 trajectories. 255 spin glass. 431 single-unit training. 352 speech processing. 120 sparse binary vectors. 296 SOFM. 109. 265 statistical mechanics. 439 process. 436 unit. 323 stable categorization. 148 force. 148 network. 147. 375 detectors. 438 optimization. 124 speech recognition. 409 spatial associations. 425. 216 optimal. 393 spurious memory see associative memory SSE see criterion function stability-plasticity dilemma. 64 best-step. 147 transitions. 422 smoothness regularization. 279 Sterling's approximation. 113. 263 static mapping. 429 steepest gardient descent method. 120 space-filling curve. 323 state space. 71. 367 spurious cycle. 60 somatosensory map. 296. 233 solution vector. 185. 71-72. 273 state variable.

326 time-delay neural network. 255. 90 synaptic signal post-. 211 synapse. 455 also see learning rules sum-of-products. 1. 207 thermal energy. 3-4. 434. 259. 254-259 teacher forcing. 109 temporal association. 3 functions. 425 templates. 5 threshold gates. 254-259 time-series prediction. 88. 273
. 391 temporal associative memory. 262-265. 437 threshold function.Stone-Weierstrass theorem. 275 temperature. 37 tie breaker factor. 68 superimposed patterns. 57. 90 pre-. 176. 152.275 terminal repeller. 425. 35 see also Boolean functions symmetry-breaking mechanism. 391 temporal learning. 328 survival of the fittest. 422. 426 thermal equilibrium. 7 threshold logic. 345 strict interpolation. 291 method. 254 . 2 polynomial. 439 strongly mixing process. 90 synchronous dynamics see dynamics T Taken's theorem. 256 tapped delay lines. 3 sum of squared error. 179 supervised learning. 48 storage capacity see capacity storage recipes. 2 linear. 35 theory. 253. 293 string. 440 switching algebra. 52 synaptic efficaces. 8 quadratic.

112 see also learning rule update nous. 151 self-organization. 198. 304. 318 unit elimination. 59. 308 of dynamical systems. 36 gate. 50 universal logic. 221. 15. 3 TSP see travelling salesman problem tunneling. 369 serially. 287. 120 truck backer-upper problem. 6 unsupervised learning. 261 in time. 230 transition probabilities. 299-300. 48. 299-300. 388 linear. 318 Gaussain-bar. 206. 208. 273 universal classifier. 265 unit Gaussian. 346 quadratic. 351 training error. 90 competitive. 360 parallel. 99. 185. 221 unity function. 426 travelling salesman problem (TSP). 299-300 hysteretic. 57. 43 V
. 113. 166 Hebbian. 421 Turing machine. 90. 112 topological ordering. 144. 369 upper bound. 227 training with rubbish. 296 universal approximation. 68. 275 twists. 102 sigmoidal. 310. 321 U unfolding. 286.topographic map. 50. 263 truth table. 290. 318 unit allocating net. 172-173 two-spiral problem. 105. 295 training set. 116 trace. 454 universal approximator.

221 . 5 Y Yuille et al. 110 tessellation. 92. 330 Wiener weight vector. 66. 428 Widrow-Hoff rule. 2 weight-sharing interconnections. 158
. 382. 304 analog. 325-328. 333 visible units. 352 vector quantization. 103. 240 weighted sum. 14 XOR. 297 see also competitive learning winner unit. 53 Voronoi cells. 227. 226 validation set. 104. 187. 72 winner-take-all. 32 VC dimension. 163. 60. 431 VLSI. 324 X XNOR. 103 competition. 104. 113. 185 vector energy. 104. 288. 230 Vandermonde's determinant. 210 sharing. 187. 2-3. 2 decay. 225 decay term. 298 W Weierstrass's approximation theorem. 111 vowel recognition. 15 weight. 157 elimination. 221 initialization. 458 update rule. 110 quantizer. 241 space. 408 operation. rule.validation error. 109-110 Veitch diagram. 6 vigilance parameter. 233 . 157. 323-324. 64 vector. 170 network.

Z ZIP code recognition. 240-243
.

Errata Sheet for M. H. Hassoun, Fundamentals of Artificial Neural Networks (MIT Press, 1995)