30 Years of Adaptive Neural Networks: Perceptron, Madaline, and Backpropagation

BERNARD WIDROW, FELLOW, IEEE, AND MICHAEL A. LEHR

Abstract—Fundamental developments in feedforward artificial neural networks from the past thirty years are reviewed. The central theme of this paper is a description of the history, origination, operating characteristics, and basic theory of several supervised neural network training algorithms, including the Perceptron rule, the LMS algorithm, three Madaline rules, and the backpropagation technique. These methods were developed independently, but with the perspective of history they can all be related to each other. The concept underlying these algorithms is the "minimal disturbance principle": new information is injected into a network in a manner that disturbs stored information to the smallest extent possible.

This year marks the 30th anniversary of the Perceptron rule and the LMS algorithm, two early rules for training adaptive elements. Both algorithms were first published in 1960. In the years following these discoveries, many new techniques have been developed in the field of neural networks, and the discipline is growing rapidly. One early development was Steinbuch's Learning Matrix [1], a pattern recognition machine based on linear discriminant functions. At the same time, Widrow and his students devised Madaline Rule I (MRI), the earliest popular learning rule for neural networks with multiple adaptive elements [2]. Other early work included the "mode-seeking" technique of Stark, Okajima, and Whipple [3]. This was probably the first example of competitive learning in the literature, though it could be argued that earlier work by Rosenblatt on "spontaneous learning" [4], [5] deserves this distinction. Further pioneering work on competitive learning and self-organization was performed in the 1970s by von der Malsburg [6] and Grossberg [7]. Fukushima explored related ideas with his biologically inspired Cognitron and Neocognitron models [8], [9].

[Footnote: Manuscript received September 12, 1989; revised April 13, 1990. This work was sponsored by the SDIO Innovative Science and Technology Office and managed by ONR, by the Dept. of the Army Belvoir RD&E Center, by a grant from the Lockheed Missiles and Space Co., by NASA, and by the Rome Air Development Center. The authors are with the Information Systems Laboratory, Department of Electrical Engineering, Stanford University, Stanford, CA, USA. IEEE Log Number 9038824.]

Widrow devised a reinforcement learning algorithm called "punish/reward" or "bootstrapping" [10], [11] in the mid-1960s. This can be used to solve problems when uncertainty about the error signal causes supervised training methods to be impractical. A related reinforcement learning approach was later explored in a classic paper by Barto, Sutton, and Anderson on the "credit assignment" problem [12]. Barto et al.'s technique is also somewhat reminiscent of Albus's adaptive CMAC, a distributed table-look-up system based on models of human memory [13], [14].

In the 1970s Grossberg developed his Adaptive Resonance Theory (ART), a number of novel hypotheses about the underlying principles governing biological neural systems [15]. These ideas served as the basis for later work by Carpenter and Grossberg involving three classes of ART architectures: ART 1 [16], ART 2 [17], and ART 3 [18]. These are self-organizing neural implementations of pattern clustering algorithms.
Other important theory on self-organizing systems was pioneered by Kohonen with his work on feature maps [19], [20].

In the early 1980s, Hopfield and others introduced outer product rules as well as equivalent approaches based on the early work of Hebb [21] for training a class of recurrent (signal feedback) networks now called Hopfield models [22], [23]. More recently, Kosko extended some of the ideas of Hopfield and Grossberg to develop his adaptive Bidirectional Associative Memory (BAM) [24], a network model employing differential as well as Hebbian and competitive learning laws. Other significant models from the past decade include probabilistic ones such as Hinton, Sejnowski, and Ackley's Boltzmann Machine [25], [26] which, to oversimplify, is a Hopfield model that settles into solutions by a simulated annealing process governed by Boltzmann statistics. The Boltzmann Machine is trained by a clever two-phase Hebbian-based technique.

While these developments were taking place, adaptive systems research at Stanford traveled an independent path. After devising their Madaline I rule, Widrow and his students developed uses for the Adaline and Madaline. Early applications included, among others, speech and pattern recognition [27], weather forecasting [28], and adaptive controls [29]. Work then switched to adaptive filtering and adaptive signal processing [30] after attempts to develop learning rules for networks with multiple adaptive layers were unsuccessful. Adaptive signal processing proved to be a fruitful avenue for research, with applications involving adaptive antennas [31], adaptive inverse controls [32], adaptive noise cancelling [33], and seismic signal processing [30]. Outstanding work by Lucky and others at Bell Laboratories led to major commercial applications of adaptive filters and the LMS algorithm to adaptive equalization in high-speed modems [34], [35] and to adaptive echo cancellers for long-distance telephone and satellite circuits [36]. After 20 years of research in adaptive signal processing, the work in Widrow's laboratory has once again returned to neural networks.

The first major extension of the feedforward neural network beyond Madaline I took place in 1971 when Werbos developed a backpropagation training algorithm which, in 1974, he first published in his doctoral dissertation [37]. Unfortunately, Werbos's work remained almost unknown in the scientific community. In 1982, Parker rediscovered the technique [39] and in 1985 published a report on it at M.I.T. [40]. Not long after Parker published his findings, Rumelhart, Hinton, and Williams [41], [42] also rediscovered the technique and, largely as a result of the clear framework within which they presented their ideas, they finally succeeded in making it widely known.

The elements used by Rumelhart et al. in the backpropagation network differ from those used in the earlier Madaline architectures. The adaptive elements in the original Madaline structure used hard-limiting quantizers (signums), while the elements in the backpropagation network use only differentiable nonlinearities, or "sigmoid" functions. In digital implementations, the hard-limiting quantizer is more easily computed than any of the differentiable nonlinearities used in backpropagation networks. In 1987, Widrow, Winter, and Baxter looked back at the original Madaline I algorithm with the goal of developing a new technique that could adapt multiple layers of adaptive elements using the simpler hard-limiting quantizers. The result was Madaline Rule II (MRII) [83].
David Andes of the U.S. Naval Weapons Center, China Lake, CA, modified Madaline II in 1988 by replacing the hard-limiting quantizers in the Adalines with sigmoid functions, thereby inventing Madaline Rule III (MRIII). Widrow and his students were first to recognize that this rule is mathematically equivalent to backpropagation.

The outline above gives only a partial view of the discipline, and many landmark discoveries have not been mentioned. Needless to say, the field of neural networks is quickly becoming a vast one, and in one short survey we could not hope to cover the entire subject in any detail. Consequently, many significant developments, including some of those mentioned above, are not discussed in this paper. The algorithms described are limited primarily to those developed in our laboratory at Stanford, and to related techniques developed elsewhere, the most important of which is the backpropagation algorithm. Section II explores fundamental concepts, Section III discusses adaptation and the minimal disturbance principle, Sections IV and V cover error-correction rules, Sections VI and VII delve into steepest-descent rules, and Section VIII provides a summary.

[Footnote: We should note, however, that in the field of variational calculus the idea of error backpropagation through nonlinear systems existed centuries before Werbos first thought to apply this concept to neural networks. In the past 25 years, these methods have been used widely in the field of optimal control, as discussed by Le Cun [38].]

[Footnote: The term "sigmoid" is usually used in reference to monotonically increasing "S-shaped" functions, such as the hyperbolic tangent. In this paper, however, we generally use the term to denote any smooth nonlinear function at the output of a linear adaptive element. In other papers, these nonlinearities go by a variety of names, such as "squashing functions," "activation functions," "transfer characteristics," or "threshold functions."]

Information about the neural network paradigms not discussed in this paper can be obtained from a number of other sources, such as the concise survey by Lippmann [44] and the collection of classics by Anderson and Rosenfeld [45]. Much of the early work in the field from the 1960s is carefully reviewed in Nilsson's monograph [46]. A good view of some of the more recent results is presented in Rumelhart and McClelland's popular three-volume set [47]. A paper by Moore [48] presents a clear discussion about ART and some of Grossberg's terminology. Another resource is the DARPA Study report [49], which gives a very comprehensive and readable "snapshot" of the field in 1988.
II. FUNDAMENTAL CONCEPTS

Today we can build computers and other machines that perform a variety of well-defined tasks with celerity and reliability unmatched by humans. No human can invert matrices or solve systems of differential equations at speeds rivaling modern workstations. Nonetheless, many problems remain to be solved to our satisfaction by any man-made machine, but are easily disentangled by the perceptual or cognitive powers of humans, and often lower mammals, or even fish and insects. No computer vision system can rival the human ability to recognize visual images formed by objects of all shapes and orientations under a wide range of conditions. Humans effortlessly recognize objects in diverse environments and lighting conditions, even when obscured by dirt or occluded by other objects. Likewise, the performance of current speech-recognition technology pales when compared to the performance of the human adult who easily recognizes words spoken by different people, at different rates, pitches, and volumes, even in the presence of distortion or background noise.

The problems solved more effectively by the brain than by the digital computer typically have two characteristics: they are generally ill defined, and they usually require an enormous amount of processing. Recognizing the character of an object from its image on television, for instance, involves resolving ambiguities associated with distortion and lighting. It also involves filling in information about a three-dimensional scene which is missing from the two-dimensional image on the screen. An infinite number of three-dimensional scenes can be projected into a two-dimensional image. Nonetheless, the brain deals well with this ambiguity and, using learned cues, usually has little difficulty correctly determining the role played by the missing dimension.

As anyone who has performed even simple filtering operations on images is aware, processing high-resolution images requires a great deal of computation. Our brains accomplish this by utilizing massive parallelism, with millions and even billions of neurons in parts of the brain working together to solve complicated problems. Because solid-state operational amplifiers and logic gates can compute many orders of magnitude faster than current estimates of the computational speed of neurons in the brain, we may soon be able to build relatively inexpensive machines with the ability to process as much information as the human brain. This enormous processing power will do little to help us solve problems, however, unless we can utilize it effectively. For instance, coordinating many thousands of processors, which must efficiently cooperate to solve a problem, is not a simple task. If each processor must be programmed separately, and if all contingencies associated with various ambiguities must be designed into the software, even a relatively simple problem can quickly become unmanageable. The slow progress over the past 25 years or so in machine vision and other areas of artificial intelligence is testament to the difficulties associated with solving ambiguous and computationally intensive problems on von Neumann computers and related architectures.

Thus, there is some reason to consider attacking certain problems by designing naturally parallel computers, which process information and learn by principles borrowed from the nervous systems of biological creatures. This does not necessarily mean we should attempt to copy the brain part for part. Although the bird served to inspire development of the airplane, birds do not have propellers, and airplanes do not operate by flapping feathered wings. The primary parallel between biological nervous systems and artificial neural networks is that each typically consists of a large number of simple elements that learn and are able to collectively solve complicated and ambiguous problems.

Today, most artificial neural network research and application is accomplished by simulating networks on serial computers. Speed limitations keep such networks relatively small, but even with small networks some surprisingly difficult problems have been tackled.
Networks with fewer than 150 neural elements have been used successfully in vehicular control simulations [50], speech generation [51], [52], and undersea mine detection [49]. Small networks have also been used successfully in airport explosive detection [53], expert systems [54], [55], and scores of other applications. Furthermore, efforts to develop parallel neural network hardware are meeting with some success, and such hardware should be available in the future for attacking more difficult problems, such as speech recognition [56], [57].

Whether implemented in parallel hardware or simulated on a computer, all neural networks consist of a collection of simple elements that work together to solve problems. A basic building block of nearly all artificial neural networks, and most other adaptive systems, is the adaptive linear combiner.

A. The Adaptive Linear Combiner

The adaptive linear combiner is diagrammed in Fig. 1. Its output is a linear combination of its inputs. In a digital implementation, this element receives at time k an input signal vector or input pattern vector X_k = [x_{0k}, x_{1k}, ..., x_{nk}]^T and a desired response d_k, a special input used to effect learning. The components of the input vector are weighted by a set of coefficients, the weight vector W_k = [w_{0k}, w_{1k}, ..., w_{nk}]^T. The sum of the weighted inputs is then computed, producing a linear output, the inner product s_k = X_k^T W_k. The components of X_k may be either continuous analog values or binary values. The weights are essentially continuously variable, and can take on negative as well as positive values.

Fig. 1. Adaptive linear combiner.

During the training process, input patterns and corresponding desired responses are presented to the linear combiner. An adaptation algorithm automatically adjusts the weights so that the output responses to the input patterns will be as close as possible to their respective desired responses. In signal processing applications, the most popular method for adapting the weights is the simple LMS (least mean square) algorithm [58], [59], often called the Widrow-Hoff delta rule [42]. This algorithm minimizes the sum of squares of the linear errors over the training set. The linear error ε_k is defined to be the difference between the desired response d_k and the linear output s_k during presentation k. Having this error signal is necessary for adapting the weights. When the adaptive linear combiner is embedded in a multi-element neural network, however, an error signal is often not directly available for each individual linear combiner, and more complicated procedures must be devised for adapting the weight vectors. These procedures are the main focus of this paper.

B. A Linear Classifier—The Single Threshold Element

The basic building block used in many neural networks is the "adaptive linear element," or Adaline [58] (Fig. 2). This is an adaptive threshold logic element. It consists of an adaptive linear combiner cascaded with a hard-limiting quantizer, which is used to produce a binary ±1 output, y_k = sgn(s_k). The bias weight w_{0k}, which is connected to a constant input x_0 = +1, effectively controls the threshold level of the quantizer.

Fig. 2. Adaptive linear element (Adaline).

In single-element neural networks, an adaptive algorithm (such as the LMS algorithm or the Perceptron rule) is often used to adjust the weights of the Adaline so that it responds correctly to as many patterns as possible in a training set that has binary desired responses. Once the weights are adjusted, the responses of the trained element can be tested by applying various input patterns. If the Adaline responds correctly with high probability to input patterns that were not included in the training set, it is said that generalization has taken place. Learning and generalization are among the most useful attributes of Adalines and neural networks.
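To make the preceding description concrete, the following minimal Python sketch implements an adaptive linear combiner with a signum quantizer and trains it with an LMS-style (Widrow-Hoff delta rule) update, W ← W + 2µεX, on the linear error ε = d − s. The training set, learning rate, and variable names are illustrative assumptions, not material from the original paper.

```python
import numpy as np

def train_adaline(patterns, desired, mu=0.05, epochs=50):
    """Train an Adaline (adaptive linear combiner + signum quantizer)
    with an LMS (Widrow-Hoff delta rule) update on the linear error."""
    weights = np.zeros(patterns.shape[1] + 1)       # weights[0] is the bias weight (input x0 = +1)
    for _ in range(epochs):
        for x, d in zip(patterns, desired):
            x_aug = np.concatenate(([1.0], x))      # prepend the constant bias input
            s = np.dot(weights, x_aug)              # linear output s_k
            err = d - s                             # linear error used by LMS
            weights += 2.0 * mu * err * x_aug       # delta-rule weight update
    return weights

def adaline_output(weights, x):
    """Binary +/-1 response of the trained element."""
    return np.sign(np.dot(weights, np.concatenate(([1.0], x))))

# Illustrative linearly separable training set: logical AND on +/-1 inputs.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
d = np.array([-1, -1, -1, 1], dtype=float)
W = train_adaline(X, d)
print([adaline_output(W, x) for x in X])            # expected signs: -1, -1, -1, +1
```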
[Footnote: In the neural network literature, such elements are often referred to as "adaptive neurons." However, in a conversation between David Hubel of Harvard Medical School and Bernard Widrow, Dr. Hubel pointed out that the Adaline differs from the biological neuron in that it contains not only the neural cell body, but also the input synapses and a mechanism for training them.]

Linear Separability: With n binary inputs and one binary output, a single Adaline of the type shown in Fig. 2 is capable of implementing certain logic functions. There are 2^n possible input patterns. A general logic implementation would be capable of classifying each pattern as either +1 or -1, in accord with the desired response. Thus, there are 2^(2^n) possible logic functions connecting n inputs to a single binary output. A single Adaline is capable of realizing only a small subset of these functions, known as the linearly separable logic functions or threshold logic functions [60]. These are the set of logic functions that can be obtained with all possible weight variations.

Figure 3 shows a two-input Adaline element. Figure 4 represents all possible binary inputs to this element with four large dots in pattern vector space. In this space, the components of the input pattern vector lie along the coordinate axes. The Adaline separates input patterns into two categories, depending on the values of the weights. A critical thresholding condition occurs when the linear output s equals zero:

$s = w_0 + x_1 w_1 + x_2 w_2 = 0$   (1)

therefore

$x_2 = -\frac{w_1}{w_2}\,x_1 - \frac{w_0}{w_2}.$   (2)

Figure 4 graphs this linear relation, which comprises a separating line having slope and intercept given by

$\text{slope} = -\frac{w_1}{w_2}, \qquad \text{intercept} = -\frac{w_0}{w_2}.$   (3)

Fig. 3. Two-input Adaline.

Fig. 4. Separating line in pattern space.

The three weights determine the slope, the intercept, and the side of the separating line that corresponds to a positive output. The opposite side of the separating line corresponds to a negative output. For Adalines with four weights, the separating boundary is a plane; with more than four weights, the boundary is a hyperplane. Note that if the bias weight is zero, the separating hyperplane will be homogeneous—it will pass through the origin in pattern space.

As sketched in Fig. 4, the binary input patterns are classified into the two categories shown (Eq. (4)). This is an example of a linearly separable function. An example of a function which is not linearly separable is the two-input exclusive NOR function:

$(+1, +1) \to +1, \quad (-1, -1) \to +1, \quad (+1, -1) \to -1, \quad (-1, +1) \to -1.$   (5)

No single straight line exists that can achieve this separation of the input patterns; thus, without preprocessing, no single Adaline can implement the exclusive NOR function.

With two inputs, a single Adaline can realize 14 of the 16 possible logic functions. With many inputs, however, only a small fraction of all possible logic functions is realizable, that is, linearly separable. Combinations of elements or networks of elements can be used to realize functions that are not linearly separable.
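The separating-line picture above can be checked numerically. The short sketch below is an illustrative aid, not code from the paper: it tests linear separability of a set of ±1 targets by posing the margin inequalities as a linear program, confirms that the exclusive NOR mapping of Eq. (5) is infeasible, and prints the slope and intercept of Eq. (3) for a separable assignment (the particular assignment is an arbitrary choice, not necessarily that of the paper's Fig. 4).

```python
import numpy as np
from scipy.optimize import linprog

def separable(patterns, targets):
    """Check linear separability of +/-1 targets by testing feasibility of
    t_i * (w0 + w . x_i) >= 1 for all i, via a linear program."""
    X_aug = np.hstack([np.ones((len(patterns), 1)), patterns])   # prepend bias input x0 = +1
    A_ub = -targets[:, None] * X_aug                             # -t_i * (x_i . w) <= -1
    b_ub = -np.ones(len(patterns))
    res = linprog(c=np.zeros(X_aug.shape[1]), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * X_aug.shape[1])
    return res.success, (res.x if res.success else None)

patterns = np.array([[+1, +1], [-1, -1], [+1, -1], [-1, +1]], dtype=float)

# Exclusive NOR targets of Eq. (5): +1 when the two inputs agree.
ok, w = separable(patterns, np.array([+1, +1, -1, -1], dtype=float))
print("XNOR separable?", ok)                                     # expected: False

# A separable assignment: +1 for three of the patterns, -1 for the last one.
ok, w = separable(patterns, np.array([+1, +1, +1, -1], dtype=float))
print("separable?", ok, "slope =", -w[1] / w[2], "intercept =", -w[0] / w[2])
```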
Capacity of Linear Classifiers: The number of training patterns or stimuli that an Adaline can learn to correctly classify is an important issue. Each pattern and desired output combination represents an inequality constraint on the weights. It is possible to have inconsistencies in sets of simultaneous inequalities just as with simultaneous equalities. When the inequalities (that is, the patterns) are determined at random, the number that can be stored is a matter of probability. For random patterns in general position, the probability that an Adaline with N_w weights can separate N_p patterns is

$P(N_p, N_w) = \begin{cases} 1, & N_p \le N_w \\[1mm] \dfrac{2}{2^{N_p}} \displaystyle\sum_{i=0}^{N_w - 1} \binom{N_p - 1}{i}, & N_p > N_w. \end{cases}$   (6)

In Fig. 5 this formula was used to plot a set of analytical curves, which show the probability that a set of N_p random patterns can be trained into an Adaline as a function of the ratio N_p/N_w. Notice from these curves that, as the number of weights increases, the statistical pattern capacity of the Adaline, C_s = 2N_w, becomes an accurate estimate of the number of responses it can learn.

Fig. 5. Probability that an Adaline can separate a training pattern set as a function of the ratio N_p/N_w.

Another fact that can be observed from Fig. 5 is that a problem is guaranteed to have a solution if the number of patterns is equal to (or less than) half the statistical pattern capacity, that is, if the number of patterns is equal to the number of weights. We will refer to this as the deterministic pattern capacity C_d of the Adaline. An Adaline can learn any two-category pattern classification task involving no more patterns than that represented by its deterministic capacity, C_d = N_w.

[Footnote: Underlying theory for this result was discovered independently by a number of researchers including, among others, Winder [63], Cameron [64], and Joseph [65].]

Both the statistical and deterministic capacity results depend upon a mild condition on the positions of the input patterns: the patterns must be in general position with respect to the Adaline. If the input patterns to an Adaline are continuous valued and smoothly distributed (that is, pattern positions are generated by a distribution function containing no impulses), general position is assured. The general position assumption is often invalid if the pattern vectors are binary. Nonetheless, even when the points are not in general position, the capacity results represent useful upper bounds.

[Footnote: Patterns are in general position with respect to an Adaline with no threshold weight if no set of N_w or more input points in the N_w-dimensional pattern space lie on a homogeneous hyperplane. For the more common case involving an Adaline with a threshold weight, general position means that no set of N_w or more patterns in the (N_w - 1)-dimensional pattern space lie on a hyperplane not constrained to pass through the origin [66].]

The capacity results apply to randomly selected training patterns. In most problems of interest, the patterns in the training set are not random but exhibit some statistical regularities. These regularities are what make generalization possible. The number of patterns that an Adaline can learn in a practical problem often far exceeds its statistical capacity because the Adaline is able to generalize within the training set, and learns many of the training patterns before they are even presented.
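The capacity curves of Fig. 5 are easy to reproduce. The sketch below is an illustrative aid, not part of the original paper: it evaluates Eq. (6) and shows how the probability of separability collapses around N_p = 2N_w as the number of weights grows.

```python
from math import comb

def prob_separable(n_patterns, n_weights):
    """Probability that n_patterns random patterns in general position
    can be separated by an Adaline with n_weights weights (Eq. (6))."""
    if n_patterns <= n_weights:
        return 1.0
    total = sum(comb(n_patterns - 1, i) for i in range(n_weights))
    return total / 2 ** (n_patterns - 1)

# At the statistical capacity N_p = 2*N_w the probability equals 1/2,
# and the transition sharpens as N_w grows.
for n_w in (5, 20, 100):
    row = [round(prob_separable(int(r * n_w), n_w), 3) for r in (1.0, 1.5, 2.0, 2.5, 3.0)]
    print(f"N_w = {n_w:3d}:  P at N_p/N_w = 1.0, 1.5, 2.0, 2.5, 3.0 ->", row)
```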
Nonlinear Classifiers

The linear classifier is limited in its capacity, and of course is limited to only linearly separable forms of pattern discrimination. More sophisticated classifiers with higher capacities are nonlinear. Two types of nonlinear classifiers are described here. The first is a fixed preprocessing network connected to a single adaptive element, and the other is the multi-element feedforward neural network.

Polynomial Discriminant Functions: Nonlinear functions of the inputs applied to the single Adaline can yield nonlinear decision boundaries. Useful nonlinearities include the polynomial functions. Consider the system illustrated in Fig. 6, which contains only linear and quadratic input functions. The critical thresholding condition for this system is

$s = w_0 + x_1 w_1 + x_1^2 w_{11} + x_1 x_2 w_{12} + x_2^2 w_{22} + x_2 w_2 = 0.$   (7)

Fig. 6. Adaline with inputs mapped through nonlinearities.

With proper choice of the weights, the separating boundary in pattern space can be established as shown, for example, in Fig. 7. This represents a solution for the exclusive NOR function of Eq. (5). Of course, all of the linearly separable functions are also realizable. The use of such nonlinearities can be generalized for more than two inputs and for higher-degree polynomial functions of the inputs. Some of the first work in this area was done by Specht [66]-[68] at Stanford in the 1960s when he successfully applied polynomial discriminants to the classification and analysis of electrocardiographic signals. Work on this topic has also been done by Barron and Barron [69]-[71] and by Ivakhnenko [72] in the Soviet Union.

Fig. 7. Elliptical separating boundary for realizing a function which is not linearly separable.

The polynomial approach offers great simplicity and beauty. Through it one can realize a wide variety of adaptive nonlinear discriminant functions by adapting only a single Adaline element. Several methods have been developed for training the polynomial discriminant function. Specht developed a very efficient noniterative (that is, single pass through the training set) training procedure, the polynomial discriminant method (PDM), which allows the polynomial discriminant function to implement a nonparametric classifier based on the Bayes decision rule. Other methods for training the system include iterative error-correction rules such as the Perceptron and α-LMS rules, and iterative gradient-descent procedures such as the µ-LMS and SER (also called RLS) algorithms [30]. Gradient descent with a single adaptive element is typically much faster than with a layered neural network. Furthermore, as we shall see, when the single Adaline is trained by a gradient-descent procedure, it will converge to a unique global solution.

After the polynomial discriminant function has been trained by a gradient-descent procedure, the weights of the Adaline will represent an approximation to the coefficients in a multidimensional Taylor series expansion of the desired response function. Likewise, if appropriate trigonometric terms are used in place of the polynomial preprocessor, the Adaline's weight solution will approximate the terms in the (truncated) multidimensional Fourier series decomposition of a periodic version of the desired response function. The choice of preprocessing functions determines how well a network will generalize for patterns outside the training set. Determining "good" functions remains a focus of current research [73], [74]. Experience seems to indicate that unless the nonlinearities are chosen with care to suit the problem at hand, better generalization can often be obtained from networks with more than one adaptive layer. In fact, one can view multilayer networks as single-layer networks with trainable preprocessors which are essentially self-optimizing.
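As a concrete illustration of the fixed-preprocessor approach of Fig. 6, the toy sketch below (an illustrative assumption, not code from the paper) expands the two inputs into the linear and quadratic terms of Eq. (7) and trains a single Adaline on them with the normalized error-correction (α-LMS) update mentioned above, W ← W + αεX/|X|². The resulting element realizes the exclusive NOR function of Eq. (5), which no single Adaline can realize on the raw inputs.

```python
import numpy as np

def quad_features(x1, x2):
    """Fixed polynomial preprocessor of Fig. 6: bias, linear, and quadratic terms."""
    return np.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

def train_alpha_lms(features, desired, alpha=0.5, epochs=40):
    """Single Adaline trained with the normalized (alpha-LMS) update
    W <- W + alpha * (d - s) * X / |X|^2."""
    w = np.zeros(features.shape[1])
    for _ in range(epochs):
        for x, d in zip(features, desired):
            err = d - np.dot(w, x)                  # linear error
            w += alpha * err * x / np.dot(x, x)     # normalized LMS step
    return w

# Exclusive NOR of Eq. (5): +1 when the two +/-1 inputs agree.
patterns = [(+1, +1), (-1, -1), (+1, -1), (-1, +1)]
desired = np.array([+1, +1, -1, -1], dtype=float)
F = np.array([quad_features(x1, x2) for x1, x2 in patterns])

w = train_alpha_lms(F, desired)
print(np.sign(F @ w))        # expected signs: +1, +1, -1, -1
```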
Madaline I

One of the earliest trainable layered neural networks with multiple adaptive elements was the Madaline I structure of Widrow [2] and Hoff [75]. Mathematical analyses of Madaline I were developed in the Ph.D. theses of Ridgway [76], Hoff [75], and Glanz [77]. In the early 1960s, a 1000-weight Madaline I was built out of hardware [78] and used in pattern recognition research. The weights in this machine were memistors, electrically variable resistors developed by Widrow and Hoff which are adjusted by electroplating a resistive link [79].

Madaline I was configured in the following way. Retinal inputs were connected to a layer of adaptive Adaline elements, the outputs of which were connected to a fixed logic device that generated the system output. Methods for adapting such systems were developed at that time. An example of this kind of network is shown in Fig. 8. Two Adalines are connected to an AND logic device to provide an output.

Fig. 8. Two-Adaline form of Madaline.

With weights suitably chosen, the separating boundary in pattern space for the system of Fig. 8 would be as shown in Fig. 9. This separating boundary implements the exclusive NOR function of Eq. (5).

Fig. 9. Separating lines for the Madaline of Fig. 8.

Madalines were constructed with many more inputs, with many more Adaline elements in the first layer, and with various fixed logic devices such as AND, OR, and majority-vote-taker elements in the second layer. Those three functions (Fig. 10) are all threshold logic functions. The given weight values will implement these three functions, but the weight choices are not unique.

Fig. 10. Fixed-weight Adaline implementations of AND, OR, and MAJ logic functions.
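Since the figures are not reproduced here, the sketch below illustrates weights of the kind Figs. 8-10 refer to. The specific values are an assumption made for illustration, not necessarily those shown in the figures: fixed-weight signum Adalines realizing AND, OR, and a three-input majority vote, followed by a two-Adaline/AND Madaline of the Fig. 8 type whose first-layer weights yield the exclusive NOR boundary of Fig. 9.

```python
import numpy as np

def adaline(weights, inputs):
    """Fixed-weight Adaline: signum of bias weight plus weighted +/-1 inputs."""
    return 1 if weights[0] + np.dot(weights[1:], inputs) > 0 else -1

# One possible fixed-weight realization of the threshold logic functions of Fig. 10.
AND_W = np.array([-1.0, 1.0, 1.0])        # +1 only if both inputs are +1
OR_W  = np.array([+1.0, 1.0, 1.0])        # +1 if either input is +1
MAJ_W = np.array([ 0.0, 1.0, 1.0, 1.0])   # +1 if at least two of three inputs are +1
print("MAJ(+1, -1, +1) =", adaline(MAJ_W, [+1, -1, +1]))
print("OR(-1, -1)      =", adaline(OR_W, [-1, -1]))

# Two-Adaline Madaline of Fig. 8: first layer feeds a fixed AND element.
# Each first-layer Adaline vetoes one of the two "inputs disagree" patterns.
H1_W = np.array([1.0, -1.0, 1.0])         # outputs -1 only for (x1, x2) = (+1, -1)
H2_W = np.array([1.0, 1.0, -1.0])         # outputs -1 only for (x1, x2) = (-1, +1)

def madaline_xnor(x1, x2):
    h = [adaline(H1_W, [x1, x2]), adaline(H2_W, [x1, x2])]
    return adaline(AND_W, h)              # fixed AND output element

for pattern in [(+1, +1), (-1, -1), (+1, -1), (-1, +1)]:
    print(pattern, "->", madaline_xnor(*pattern))   # +1 exactly when the inputs agree
```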
Feedforward Networks

The Madalines of the 1960s had adaptive first layers and fixed threshold functions in the second (output) layers [76], [46]. The feedforward neural networks of today often have many layers, and usually all layers are adaptive. The backpropagation networks of Rumelhart et al. [47] are perhaps the best-known examples of multilayer networks. A fully connected three-layer feedforward adaptive network is illustrated in Fig. 11. In a fully connected layered network, each Adaline receives inputs from every output in the preceding layer.

Fig. 11. Three-layer adaptive neural network.

During training, the response of each output element in the network is compared with a corresponding desired response. Error signals associated with the output elements are readily computed, so adaptation of the output layer is straightforward. The fundamental difficulty associated with adapting a layered network lies in obtaining "error signals" for hidden-layer Adalines, that is, for Adalines in layers other than the output layer. The backpropagation and Madaline III algorithms contain methods for establishing these error signals.

[Footnote: In Rumelhart et al.'s terminology, this would be called a four-layer network, following Rosenblatt's convention of counting layers of signals, including the input layer. For our purposes, we find it more useful to count only layers of computing elements. We do not count as a layer the set of input terminal points.]

There is no reason why a feedforward network must have the layered structure of Fig. 11. In Werbos's development of the backpropagation algorithm [37], in fact, the Adalines are ordered and each receives signals directly from each input component and from the output of each preceding Adaline. Many other variations of the feedforward network are possible. An interesting area of current research involves a generalized backpropagation method which can be used to train "high-order" or "sigma-pi" networks that incorporate a polynomial preprocessor for each Adaline [47], [80].

One characteristic that is often desired in pattern recognition problems is invariance of the network output to changes in the position and size of the input pattern or image. Various techniques have been used to achieve translation, rotation, scale, and time invariance. One method involves including in the training set several examples of each exemplar transformed in size, angle, and position, but with a desired response that depends only on the original exemplar [78]. Other research has dealt with various Fourier-transform preprocessors [81], [82], as well as other preprocessors [83]. Giles and Maxwell have developed a clever averaging approach which removes unwanted dependencies from the polynomial terms in high-order threshold logic units (polynomial discriminant functions) [74] and high-order neural networks [80]. Other approaches have considered Zernike moments [84], graph matching [85], spatially repeated feature detectors [9], and time-averaged outputs [86].

Capacity of Nonlinear Classifiers

An important consideration that should be addressed when comparing various network topologies concerns the amount of information they can store.

[Footnote: We should emphasize that the information referred to here corresponds to the maximum number of binary input/output mappings a network can achieve with properly adjusted weights, not the number of bits of information that can be stored directly into the network's weights.]

Of the nonlinear classifiers mentioned above, the pattern capacity of the Adaline driven by a fixed preprocessor composed of smooth nonlinearities is the simplest to determine. If the inputs to the system are smoothly distributed in position, the outputs of the preprocessing network will be in general position with respect to the Adaline. Thus the inputs to the Adaline will satisfy the condition required in Cover's Adaline capacity theory. Accordingly, the deterministic and statistical pattern capacities of the system are essentially equal to those of the Adaline.

The capacities of Madaline I structures, which utilize both the majority element and the OR element, were experimentally estimated by Koford in the early 1960s. Although the logic functions that can be realized with these output elements are quite different, both types of elements yield essentially the same statistical storage capacity. The average number of patterns that a Madaline I network can learn to classify was found to be equal to the capacity per Adaline multiplied by the number of Adalines in the structure. The statistical capacity C_s is therefore approximately equal to twice the number of adaptive weights. Although the Madaline and the Adaline have roughly the same capacity per adaptive weight, without preprocessing the Adaline can separate only linearly separable sets, while the Madaline has no such limitation.

A great deal of theoretical and experimental work has been directed toward determining the capacity of both Adalines and Hopfield networks [87]-[90]. Somewhat less theoretical work has been focused on the pattern capacity of multilayer feedforward networks, though some knowledge exists about the capacity of two-layer networks. Such results are of particular interest because the two-layer network is surprisingly powerful. With a sufficient number of hidden elements, a signum network with two layers can implement any Boolean function.

[Footnote: This can be seen by noting that any Boolean function can be written in sum-of-products form, and that such an expression can be realized with a two-layer network by using the first-layer Adalines to implement AND gates, while using the second-layer Adalines to implement OR gates.]

Equally impressive is the power of the two-layer sigmoid network. Given a sufficient number of hidden Adaline elements, such networks can implement any continuous input-output mapping to arbitrary accuracy [92]-[94]. Although two-layer networks are quite powerful, it is likely that some problems can be solved more efficiently by networks with more than two layers.
Non-finite-order predicate mappings (such as the connectedness problem [95]) can often be computed by small networks using signal feedback [96].

In the mid-1960s, Cover studied the capacity of a feedforward signum network with an arbitrary number of layers and a single output element [61], [97]. He determined a lower bound on the minimum number of weights N_w needed to enable such a network to realize any Boolean function defined over an arbitrary set of N_p patterns in general position. Recently, Baum extended Cover's result to multi-output networks, and also used a construction argument to find corresponding upper bounds for the special case of the two-layer signum network [98]. Consider a two-layer fully connected feedforward network of signum Adalines that has N_x input components (excluding the bias inputs) and N_y output components. If this network is required to learn to map any set containing N_p patterns that are in general position to any set of binary desired response vectors (with N_y components), it follows from Baum's results that the minimum requisite number of weights N_w can be bounded by

$\frac{N_y N_p}{1 + \log_2 N_p} \;\le\; N_w \;\le\; N_y \left\lceil \frac{N_p}{N_x} \right\rceil (N_x + N_y + 1) + N_y.$   (8)

[Footnote: Actually, the network can be an arbitrary feedforward structure and need not be layered.]

[Footnote: The upper bound used here is Baum's loose bound on the minimum number of hidden nodes, N_y ⌈N_p/N_x⌉.]

From Eq. (8) it can be shown that for a two-layer feedforward network with several times as many inputs and hidden elements as outputs (say, at least five times as many), the deterministic pattern capacity is bounded below by something slightly smaller than N_w/N_y. It also follows from Eq. (8) that the pattern capacity of any feedforward network with a large ratio of weights to outputs (that is, N_w/N_y at least several thousand) can be bounded above by a number somewhat larger than (N_w/N_y) log_2(N_w/N_y). Thus, the deterministic pattern capacity C_d of a two-layer network can be bounded by

$\frac{N_w}{N_y} - K_1 \;\le\; C_d \;\le\; \frac{N_w}{N_y}\log_2\!\left(\frac{N_w}{N_y}\right) + K_2$   (9)

where K_1 and K_2 are positive numbers which are small terms if the network is large with few outputs relative to the number of inputs and hidden elements.

It is easy to show that Eq. (8) also bounds the number of weights needed to ensure that N_p patterns can be learned with probability 1/2, except in this case the lower bound on N_w becomes N_y(N_p - 1)/(1 + log_2 N_p). It follows that Eq. (9) also serves to bound the statistical capacity C_s of a two-layer signum network.

It is interesting to note that the capacity bounds (9) encompass the deterministic capacity for the single-layer network comprising a bank of N_y Adalines. In this case each Adaline would have N_w/N_y weights, so the system would have a deterministic pattern capacity of N_w/N_y. As N_w/N_y becomes large, the statistical capacity also approaches N_w/N_y (for N_y finite). Until further theory on feedforward network capacity is developed, it seems reasonable to use the capacity results from the single-layer network to estimate that of multilayer networks.

Little is known about the number of binary patterns that layered sigmoid networks can learn to classify correctly. The pattern capacity of sigmoid networks cannot be smaller than that of signum networks of equal size, however, because as the weights of a sigmoid network grow toward infinity, it becomes equivalent to a signum network with a weight vector in the same direction. Insight relating to the capabilities and operating principles of sigmoid networks can be winnowed from the literature [99]-[101].
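The two-layer bounds of Eq. (9) are easy to evaluate. The sketch below is illustrative only: the network sizes are arbitrary choices, and the small correction terms K_1 and K_2 are ignored, which is the large-network, few-outputs assumption stated above. It prints the approximate deterministic capacity range N_w/N_y to (N_w/N_y) log2(N_w/N_y).

```python
from math import log2

def two_layer_weight_count(n_in, n_hidden, n_out):
    """Number of weights in a fully connected two-layer net of Adalines,
    counting one bias weight per element."""
    return n_hidden * (n_in + 1) + n_out * (n_hidden + 1)

def capacity_bounds(n_weights, n_out):
    """Approximate deterministic pattern capacity range of Eq. (9),
    ignoring the small correction terms K1 and K2."""
    ratio = n_weights / n_out
    return ratio, ratio * log2(ratio)

for n_in, n_hidden, n_out in [(100, 20, 1), (200, 50, 5), (300, 100, 10)]:
    n_w = two_layer_weight_count(n_in, n_hidden, n_out)
    lo, hi = capacity_bounds(n_w, n_out)
    print(f"N_x={n_in}, hidden={n_hidden}, N_y={n_out}: "
          f"N_w={n_w}, capacity roughly between {lo:.0f} and {hi:.0f} patterns")
```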
A network's capacity is of little utility unless it is accompanied by useful generalization to patterns not presented during training. In fact, if generalization is not needed, we can simply store the associations in a look-up table and will have little need for a neural network. The relationship between generalization and pattern capacity represents a fundamental trade-off in neural network applications: the Adaline's inability to realize all functions is in a sense a strength rather than the fatal flaw envisioned by some critics of neural networks [95], because it helps limit the capacity of the device and thereby improves its ability to generalize.

For good generalization, the training set should contain a number of patterns at least several times larger than the network's capacity (that is, N_p >> N_w/N_y). This can be understood intuitively by noting that if the number of degrees of freedom in a network (that is, N_w) is larger than the number of constraints associated with the desired response function (that is, N_y N_p), the training procedure will be unable to completely constrain the weights in the network. Apparently, this allows effects of initial weight conditions to interfere with learned information and degrade the trained network's ability to generalize. A detailed analysis of the generalization performance of signum networks as a function of training set size is described in [102].

A Nonlinear Classifier Application

Neural networks have been used successfully in a wide range of applications. To gain some insight about how neural networks are trained and what they can be used to compute, it is instructive to consider Sejnowski and Rosenberg's 1986 NETtalk demonstration [51], [52]. With the exception of work on the traveling salesman problem with Hopfield networks [103], this was the first neural network application since the 1960s to draw widespread attention. NETtalk is a two-layer feedforward sigmoid network with 80 Adalines in the first layer and 26 Adalines in the second layer. The network is trained to convert text into phonetically correct speech, a task well suited to neural implementation. The pronunciation of most words follows general rules based upon spelling and word context, but there are many exceptions and special cases. Rather than programming a system to respond properly to each case, the network can learn the general rules and special cases by example.

One of the more remarkable characteristics of NETtalk is that it learns to pronounce words in stages suggestive of the learning process in children. When the output of NETtalk is connected to a voice synthesizer, the system makes babbling noises during the early stages of the training process. As the network learns, it next conquers the general rules and, like a child, tends to make a lot of errors by using these rules even when they are not appropriate. As the training continues, however, the network eventually abstracts the exceptions and special cases and is able to produce intelligible speech with few errors.

The operation of NETtalk is surprisingly simple. Its input is a vector of seven characters (including spaces) from a transcript of text, and its output is phonetic information corresponding to the pronunciation of the center (fourth) character in the seven-character input field. The other six characters provide context, which helps to determine the desired phoneme. To read text, the seven-character window is scanned across a document in computer memory and the network generates a sequence of phonetic symbols that can be used to control a speech synthesizer. Each of the seven characters at the network's input is a 29-component binary vector, with each component representing a different alphabetic character or punctuation mark. A one is placed in the component associated with the represented character; all other components are set to zero.

[Footnote: The input representation often has a considerable impact on the success of a network. In NETtalk the inputs are sparsely coded in 29 components. One might consider instead choosing a 5-bit binary representation of the 7-bit ASCII code. It should be clear, however, that in this case the sparse representation helps simplify the network's job of interpreting input characters as 29 distinct symbols. Usually the appropriate input encoding is not difficult to decide. When intuition fails, however, one sometimes must experiment with different encodings to find one that works well.]
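The sliding-window input coding just described is easy to sketch. The code below is an illustrative reconstruction, not Sejnowski and Rosenberg's software: it assumes a particular 29-symbol alphabet (26 letters plus three punctuation placeholders) and simply shows how a seven-character window becomes a 7 x 29 = 203-component sparse binary input vector.

```python
import numpy as np

# Hypothetical 29-symbol alphabet: 26 letters plus space, comma, and period.
ALPHABET = list("abcdefghijklmnopqrstuvwxyz") + [" ", ",", "."]
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def encode_window(window):
    """Encode a 7-character window as a sparse 7*29 binary vector:
    one 29-component one-hot code per character."""
    assert len(window) == 7
    vec = np.zeros(7 * len(ALPHABET))
    for pos, ch in enumerate(window.lower()):
        vec[pos * len(ALPHABET) + INDEX[ch]] = 1.0
    return vec

def scan_text(text):
    """Slide the 7-character window across the text; the network's target
    would be the phoneme of the center (fourth) character of each window."""
    padded = "   " + text + "   "               # pad so every character can be centered
    for i in range(len(text)):
        window = padded[i:i + 7]
        yield window, encode_window(window)

for window, x in list(scan_text("we read aloud."))[:3]:
    print(repr(window), "->", int(x.sum()), "ones out of", x.size, "components")
```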
The system's 26 outputs correspond to 23 articulatory features and 3 additional features which encode stress and syllable boundaries. When training the network, the desired response vector has zeros in all components except those which correspond to the phonetic features associated with the center character in the input field. In one experiment, Sejnowski and Rosenberg had the system scan a 1024-word transcript of phonetically transcribed continuous speech. With the presentation of each seven-character window, the system's weights were trained by the backpropagation algorithm in response to the network's output error. After roughly 50 presentations of the entire training set, the network was able to produce accurate speech from data the network had not been exposed to during training.

Backpropagation is not the only technique that might be used to train NETtalk. In other experiments, the slower Boltzmann learning method was used and, in fact, Madaline Rule III could be used as well. Likewise, if the sigmoid network were replaced by a similar signum network, Madaline Rule II would also work, although more first-layer Adalines would likely be needed for comparable performance.

The remainder of this paper develops and compares various adaptive algorithms for training Adalines and artificial neural networks to solve classification problems such as NETtalk. These same algorithms can be used to train networks for other problems such as those involving nonlinear control [50], system identification [50], [104], signal processing [30], or decision making [55].

III. ADAPTATION—THE MINIMAL DISTURBANCE PRINCIPLE

The iterative algorithms described in this paper are all designed in accord with a single underlying principle. These techniques—the two LMS algorithms, Mays's rules, and the Perceptron procedure for training a single Adaline, the MRI rule for training the simple Madaline, as well as the MRII, MRIII, and backpropagation techniques for training multilayer Madalines—all rely upon the principle of minimal disturbance: Adapt to reduce the output error for the current training pattern, with minimal disturbance to responses already learned. Unless this principle is practiced, it is difficult to simultaneously store the required pattern responses.
pattern responses, The minimal disturbance principle is intuitive Iwas the motivating idea that led to the discovery of the LS algorithm and the Madaline rules Infact, te LMS algorithm hac existed for several months asan error-educ- tion rule before it was discovered thatthe algorithm uses an instantaneous gradient to follow the path of steepest ), adap: tation follows a normalized variant ofthe fixedsinerement erceptron rue with a/|X,/ used in place of). the linear ‘outputfallswithin thedeadzone, whether ornottheoutput response ys correct, the weights are adapted by the nor ‘malized variant of the Perceptron ruleas though the output response y had been incorrect. The weight update rule for -Mays's increment adaptation algorithm can be waitten mathematically a8 hse Wis «9 Me lab y in where fs the quantize error of, (7- With the dead vane = 0, May's increment adaptation algorithen reduces tos normalized version of the Percep- "This results because the length ofthe weight vector decreases synch adi at doe no ase he neu ‘hange sign and atiume s magnitude greater than that before ‘cipion niu hr ae excep or noua is Siuaton occursonlyaepiltheweightvectorismuchlongerthan the weight increment veer “Stheterms fixed increment and“absolue correction” are de to Nilsson ts} Rosenblatt retered fo methode of these ype respectively, a8 quantized and nonquntized Teaming rules "The increment adaptation rue was proposed by others before ays though fom a derent perspec 171 tron rule (18), Mays proved that ithe training patterns are lineatly separable, increment adaptation will sways con. verge and separate the patternsin a finite number of steps, He also showed that use of the dead zone reduces sensi tivity to weight errors. f the training sets not linearly sep arable, Mays’s increment adaptation rule typically pe forms much better than the Perceptron rule because 4 sufficiently large dead zone tends to cause the weight vec. tortoadaptaway from erowhenany reasonably good sol tion exists. n such cases, the weight vector may sometimes appear to meander rather aimlessly, but it will typically remain ina region associated with relatively low average The increment adaptation rule changes the weights with ‘increments that generally are not proportional tothe lineat errors. The other Mays tule, modified relaxation is closer toi in its use of the linear error, (refer to Fig. 2) The Sesired sponse and the quantizer outputlevelsarebinary Tfthequantizeroutputy.iswrongorifthelinear output Sls within the dead zone +, adaptation follows LMS toreduce the linear error. the quantizer output yi cor. ‘ect and the linear output alls outside the dead zone, the \weights are not adapted. The weight update rule for this algorithm can be writen as, ™ = Oand |s| = eo) A otherwise a Mase M+ a Where is the quantizer error of Eg. (17) IW the dead zone y s et to =, this algorithm reduces to the a-LMS algorithm (10). Mays showed that, for dead zone 0 <7 < Tand learning rate 0 < a = 2, thi algorithm will converge and separate any lineatly separable input set in 4 finite numberof steps. 1 the training set isnt linearly ‘separable, this algorithm performs much like Mays's neve iment adaptation rule ‘Mays’ two algorithms achieve similar pattern separation results, The choice of a does not afect stabil although itdoes affectconvergencetime. Thetwo rules die intheit ‘convergence properties butthereisnocansensuson which isthe better algorithm. 
Algorithms like these can be quite useful, and we believe that there are many more to be invented and analyzed.

The α-LMS algorithm, the Perceptron procedure, and Mays's algorithms can all be used for adapting the single Adaline element, or they can be incorporated into procedures for adapting networks of such elements. Multilayer network adaptation procedures that use some of these algorithms are discussed in the following.

V. ERROR-CORRECTION RULES—MULTI-ELEMENT NETWORKS

The algorithms discussed next are the Widrow-Hoff Madaline rule from the early 1960s, now called Madaline Rule I (MRI), and Madaline Rule II (MRII), developed by Widrow and Winter in 1987.

A. Madaline Rule I

The MRI rule allows the adaptation of a first layer of hard-limited (signum) Adaline elements whose outputs provide inputs to a second layer, consisting of a single fixed threshold-logic element which may be, for example, the OR gate, AND gate, or majority-vote-taker discussed previously. The weights of the Adalines are initially set to small random values.

Fig. 15. A five-Adaline example of the Madaline I architecture.

Figure 15 shows a Madaline I architecture with five fully connected first-layer Adalines. The second layer is a majority element (MAJ). Because the second-layer logic element is fixed and known, it is possible to determine which first-layer Adalines can be adapted to correct an output error. The Adalines in the first layer assist each other in solving problems by automatic load sharing.

One procedure for training the network in Fig. 15 follows. A pattern is presented, and if the output response of the majority element matches the desired response, no adaptation takes place. However, if, for instance, the desired response is +1 and three of the five Adalines read -1 for a given input pattern, one of the latter three must be adapted to the +1 state. The element that is adapted by MRI is the one whose linear output is closest to zero—the one whose analog response is closest to the desired response. If more of the Adalines were originally in the -1 state, enough of them are adapted to the +1 state to make the majority decision equal +1. The elements adapted are those whose linear outputs are closest to zero. A similar procedure is followed when the desired response is -1. When adapting a given element, the weight vector can be moved in the LMS direction far enough to reverse the Adaline's output (absolute correction, or "fast" learning), or it can be adapted by the small increment determined by the α-LMS algorithm (statistical, or "slow" learning). The one desired response d_k is used for all Adalines that are adapted. The procedure can also be modified to allow one of Mays's rules to be used. In that event, for the case we have considered (majority output element), adaptations take place if at least half of the Adalines either have outputs differing from the desired response or have analog outputs which are in the dead zone. By setting the dead zone of Mays's increment adaptation rule to zero, the weights can also be adapted by Rosenblatt's Perceptron rule.

Differences in initial conditions and the results of subsequent adaptation cause the various elements to take "responsibility" for certain parts of the training problem. The basic principle of load sharing is summarized thus: Assign responsibility to the Adaline or Adalines that can most easily assume it.
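A minimal simulation of the MRI procedure just described is sketched below. It is an illustrative reconstruction, not the original software: the five-Adaline/majority network, the training data, and the use of the α-LMS "slow" update are all assumptions. When the majority output is wrong, the wrong-voting Adalines with the smallest linear-output magnitudes are nudged toward the desired response.

```python
import numpy as np

rng = np.random.default_rng(0)

def mri_train(patterns, desired, n_adalines=5, alpha=0.5, epochs=300):
    """Madaline Rule I: first layer of signum Adalines feeding a fixed
    majority-vote element. Adapt only when the majority output is wrong,
    and then only the wrong Adalines whose linear outputs are closest to zero."""
    W = rng.normal(scale=0.1, size=(n_adalines, patterns.shape[1] + 1))  # small random weights
    for _ in range(epochs):
        for x, d in zip(patterns, desired):
            x_aug = np.concatenate(([1.0], x))
            s = W @ x_aug                               # linear outputs of the layer
            y = np.where(s > 0, 1.0, -1.0)
            if np.sign(y.sum()) == d:                   # majority already correct
                continue
            need = (n_adalines + 1) // 2 - int(np.sum(y == d))   # how many votes to flip
            wrong = np.where(y != d)[0]
            for i in sorted(wrong, key=lambda i: abs(s[i]))[:need]:
                err = d - s[i]                          # desired response given to this Adaline
                W[i] += alpha * err * x_aug / np.dot(x_aug, x_aug)   # alpha-LMS step
    return W

def mri_output(W, x):
    x_aug = np.concatenate(([1.0], x))
    y = np.where(W @ x_aug > 0, 1.0, -1.0)
    return np.sign(y.sum())

# Toy problem not solvable by one line: +1 inside the band |x1| < 1, else -1.
X = np.array([[0.2, 0.5], [-0.4, -0.8], [0.7, 1.5], [1.8, 0.3],
              [-1.6, 0.9], [2.2, -1.1], [-2.0, -0.4], [0.1, -1.9]])
d = np.where(np.abs(X[:, 0]) < 1.0, 1.0, -1.0)
W = mri_train(X, d)
errors = sum(mri_output(W, x) != t for x, t in zip(X, d))
print("training-set response errors after MRI adaptation:", errors)
```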
In Fig. 15, the "job assigner," a purely mechanized process, assigns responsibility during training by transferring the appropriate adapt commands and desired response signals to the selected Adalines. The job assigner utilizes linear-output information. Load sharing is important, since it results in the various adaptive elements developing individual weight vectors. If all the weight vectors were the same, there would be no point in having more than one element in the first layer.

When training the Madaline, the pattern presentation sequence should be random. Experimenting with this, Ridgway [76] found that cyclic presentation of the patterns could lead to cycles of adaptation. These cycles would cause the weights of the entire Madaline to cycle, preventing convergence.

The adaptive system of Fig. 15 was suggested by common sense, and was found to work well in simulations. Ridgway found that the probability that a given Adaline will be adapted in response to an input pattern is greatest if that element had taken such responsibility during the previous adapt cycle when the pattern was most recently presented. The division of responsibility stabilizes at the same time that the responses of individual elements stabilize to their share of the load. When the training problem is not perfectly separable by this system, the adaptation process tends to minimize error probability, although it is possible for the algorithm to "hang up" on local optima.

The Madaline structure of Fig. 15 has two layers—the first layer consists of adaptive logic elements, the second of fixed logic. A variety of fixed-logic devices could be used for the second layer. A variety of MRI adaptation rules were devised by Hoff [75] that can be used with all possible fixed-logic output elements. An easily described training procedure results when the output element is an OR gate. During training, if the desired output for a given input pattern is +1, only the one Adaline whose linear output is closest to zero would be adapted, if any adaptation is needed—in other words, if all Adalines give -1 outputs. If the desired output is -1, all elements must give -1 outputs, and any giving +1 outputs must be adapted.

The MRI rule obeys the "minimal disturbance principle" in the following sense. No more Adaline elements are adapted than necessary to correct the output decision and any dead-zone constraint. The elements whose linear outputs are nearest to zero are adapted because they require the smallest weight changes to reverse their output responses. Furthermore, whenever an Adaline is adapted, the weights are changed in the direction of its input vector, providing the requisite error correction with minimal weight change.

B. Madaline Rule II

The MRI rule was recently extended to allow the adaptation of multilayer binary networks by Winter and Widrow with the introduction of Madaline Rule II (MRII) [43], [83], [109]. A typical two-layer MRII network is shown in Fig. 16. The weights in both layers are adaptive.

Fig. 16. Typical two-layer Madaline II architecture.

Training with the MRII rule is similar to training with the MRI algorithm. The weights are initially set to small random values. Training patterns are presented in a random sequence. If the network produces an error during a training presentation, we begin by adapting first-layer Adalines.
By the minimal disturbance principle, we select the first-layer Adaline with the smallest linear output magnitude and perform a "trial adaptation" by inverting its binary output. This can be done without adaptation by adding a perturbation Δs of suitable amplitude and polarity to the Adaline's sum (refer to Fig. 16). If the output Hamming error is reduced by this bit inversion, that is, if the number of output errors is reduced, the perturbation Δs is removed and the weights of the selected Adaline element are changed by α-LMS in a direction collinear with the corresponding input vector—the direction that reinforces the bit reversal with minimal disturbance to the weights. Conversely, if the trial adaptation does not improve the network response, no weight adaptation is performed.

After finishing with the first element, we perturb and update other Adalines in the first layer which have "sufficiently small" linear-output magnitudes. Further error reductions can be achieved, if desired, by reversing pairs, triples, and so on, up to some predetermined limit. After exhausting possibilities with the first layer, we move on to the next layer and proceed in a like manner. When the final layer is reached, each of the output elements is adapted by α-LMS. At this point, a new training pattern is selected at random and the procedure is repeated. The goal is to reduce Hamming error with each presentation, thereby hopefully minimizing the average Hamming error over the training set. Like MRI, the procedure can be modified so that adaptations follow an absolute correction rule or one of Mays's rules rather than α-LMS. Like MRI, MRII can "hang up" on local optima.
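The trial-adaptation loop at the heart of MRII can be sketched compactly. The code below is an illustrative reconstruction under simplifying assumptions (a single hidden layer, every hidden Adaline tried one at a time in order of |linear output| rather than only "sufficiently small" ones, no pair or triple reversals, and α-LMS for both layers); it is not the authors' implementation, and the toy task and constants are assumed.

```python
import numpy as np

rng = np.random.default_rng(1)

def forward(W1, W2, x_aug, flip=None):
    """Two-layer signum network; optionally invert one hidden unit's output
    (the 'trial adaptation' perturbation of MRII)."""
    h = np.where(W1 @ x_aug > 0, 1.0, -1.0)
    if flip is not None:
        h = h.copy()
        h[flip] *= -1.0
    return h, np.where(W2 @ np.concatenate(([1.0], h)) > 0, 1.0, -1.0)

def mrii_step(W1, W2, x, d, alpha=0.3):
    x_aug = np.concatenate(([1.0], x))
    h, y = forward(W1, W2, x_aug)
    base_err = np.sum(y != d)                      # output Hamming error
    if base_err > 0:
        for i in np.argsort(np.abs(W1 @ x_aug)):   # smallest linear outputs first
            _, y_trial = forward(W1, W2, x_aug, flip=i)
            if np.sum(y_trial != d) < base_err:    # bit reversal helps: commit it
                target = -1.0 if W1[i] @ x_aug > 0 else 1.0
                W1[i] += alpha * (target - W1[i] @ x_aug) * x_aug / np.dot(x_aug, x_aug)
                h, y = forward(W1, W2, x_aug)
                base_err = np.sum(y != d)
    h_aug = np.concatenate(([1.0], h))             # output layer sees its errors directly
    for j in range(W2.shape[0]):
        W2[j] += alpha * (d[j] - W2[j] @ h_aug) * h_aug / np.dot(h_aug, h_aug)

# Toy problem: two binary outputs, XNOR and OR of two +/-1 inputs.
X = np.array([[+1, +1], [+1, -1], [-1, +1], [-1, -1]], dtype=float)
D = np.array([[+1, +1], [-1, +1], [-1, +1], [+1, -1]], dtype=float)
W1 = rng.normal(scale=0.2, size=(4, 3))            # 4 hidden Adalines
W2 = rng.normal(scale=0.2, size=(2, 5))
for _ in range(400):
    k = rng.integers(len(X))
    mrii_step(W1, W2, X[k], D[k])
outs = np.array([forward(W1, W2, np.concatenate(([1.0], x)))[1] for x in X])
print("training-set Hamming errors:", int(np.sum(outs != D)))
```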
VI. Steepest-Descent Rules: Single Threshold Element

Thus far, we have described a variety of adaptation rules that act to reduce error with the presentation of each training pattern. Often the objective of adaptation is to reduce error averaged in some way over the training set. The most common error function is mean-square error (MSE), although in some situations other error criteria may be more appropriate [109]-[111]. The most popular approaches to MSE reduction in both single-element and multi-element networks are based upon the method of steepest descent. More sophisticated gradient approaches such as quasi-Newton [30], [112]-[114] and conjugate gradient [114], [115] techniques often have better convergence properties, but the conditions under which the additional complexity is warranted are not generally known. The discussion that follows is restricted to minimization of MSE by the method of steepest descent [116], [117]. More sophisticated learning procedures usually require many of the same computations used in the basic steepest-descent procedure.

Adaptation of a network by steepest descent starts with an arbitrary initial value W_0 for the system's weight vector. The gradient of the MSE function is measured and the weight vector is altered in the direction corresponding to the negative of the measured gradient. This procedure is repeated, causing the MSE to be successively reduced on average and causing the weight vector to approach a locally optimal value. The method of steepest descent can be described by the relation

    W_{k+1} = W_k + μ(−∇_k)

where μ is a parameter that controls stability and rate of convergence, and ∇_k is the value of the gradient at the point on the MSE surface corresponding to W = W_k.

To begin, we derive rules for steepest-descent minimization of the MSE associated with a single Adaline element. These rules are then generalized to apply to full-blown neural networks. Like error-correction rules, the most practical and efficient steepest-descent rules typically work with one pattern at a time. They minimize mean-square error, approximately, averaged over the entire set of training patterns.

A. Linear Rules

Steepest-descent rules for the single threshold element are said to be linear if weight changes are proportional to the linear error, the difference between the desired response d_k and the linear output of the element s_k.

Mean-Square-Error Surface of the Linear Combiner: In this section we demonstrate that the MSE surface of the linear combiner of Fig. 1 is a quadratic function of the weights, and thus easily traversed by gradient descent.

Let the input pattern X_k and the associated desired response d_k be drawn from a statistically stationary population. During adaptation, the weight vector varies so that even with stationary inputs, the output s_k and error ε_k will generally be nonstationary. Care must be taken in defining the MSE since it is time-varying. The only possibility is an ensemble average, defined below.

At the kth iteration, let the weight vector be W_k. Squaring and expanding Eq. (1) yields

    ε_k² = (d_k − X_k^T W_k)²                                        (22)
        = d_k² − 2 d_k X_k^T W_k + W_k^T X_k X_k^T W_k.              (23)

Now assume an ensemble of identical adaptive linear combiners, each having the same weight vector W_k at the kth iteration. Let each combiner have individual inputs X_k and d_k derived from stationary ergodic ensembles. Each combiner will produce an individual error ε_k represented by Eq. (23). Averaging Eq. (23) over the ensemble yields

    E[ε_k²] = E[d_k²] − 2 E[d_k X_k^T] W_k + W_k^T E[X_k X_k^T] W_k.   (24)

Defining the vector P as the cross correlation between the desired response (a scalar) and the X-vector then yields

    P^T = E[d_k X_k^T] = E[d_k x_0k, d_k x_1k, ..., d_k x_nk].         (25)

(We assume here that X_k includes a bias component x_0k = +1.) The input correlation matrix R is defined in terms of the ensemble average

    R = E[X_k X_k^T].                                                  (26)

This matrix is real, symmetric, and positive definite or, in rare cases, positive semidefinite. The MSE ξ_k can thus be expressed as

    ξ_k = E[ε_k²] = E[d_k²] − 2 P^T W_k + W_k^T R W_k.                 (27)

Note that the MSE is a quadratic function of the weights. It is a convex hyperparaboloidal surface, a function that never goes negative. Figure 17 shows a typical MSE surface for a linear combiner with two weights.

Fig. 17. Typical mean-square-error surface of a linear combiner.

The position of a point on the grid in this figure represents the values of the Adaline's two weights. The height of the surface at each point represents the MSE over the training set when the Adaline's weights are fixed at the values associated with that grid point. Adjusting the weights involves descending along this surface toward the unique minimum point ("the bottom of the bowl") by the method of steepest descent.

The gradient ∇_k of the MSE function with W = W_k is obtained by differentiating Eq. (27):

    ∇_k = ∂ξ_k/∂W_k = −2P + 2RW_k.                                     (28)

This is a linear function of the weights. The optimal weight vector W*, generally called the Wiener weight vector, is obtained from Eq. (28) by setting the gradient to zero:

    W* = R^{-1} P.                                                     (29)

This is a matrix form of the Wiener-Hopf equation [118]-[120]. In the next section we examine μ-LMS, an algorithm which enables us to obtain an accurate estimate of W* without first computing R^{-1} and P.

The μ-LMS Algorithm: The μ-LMS algorithm works by performing approximate steepest descent on the MSE surface in weight space. Because this surface is a quadratic function of the weights, it is convex and has a unique (global) minimum.
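Because Eq. (27) involves only second-order statistics, the quadratic bowl and its Wiener minimum are easy to reproduce numerically. The sketch below estimates R and P of Eqs. (25) and (26) from sample data and evaluates Eqs. (27) and (29); the data dimensions and the synthetic target weights are assumptions made for the illustration.

    import numpy as np

    rng = np.random.default_rng(1)

    # Synthetic stationary training data: X includes a bias component x0 = +1.
    n_patterns, n_weights = 500, 3
    X = np.hstack([np.ones((n_patterns, 1)),
                   rng.normal(size=(n_patterns, n_weights - 1))])
    w_true = np.array([0.5, -1.0, 2.0])                   # assumed "plant" weights
    d = X @ w_true + 0.1 * rng.normal(size=n_patterns)    # desired responses with noise

    # Sample estimates of the second-order statistics in Eqs. (25) and (26).
    R = (X.T @ X) / n_patterns          # input correlation matrix, E[X X^T]
    P = (X.T @ d) / n_patterns          # cross correlation, E[d X]
    Ed2 = np.mean(d ** 2)               # E[d^2]

    def mse(W):
        """Quadratic MSE surface of Eq. (27): xi = E[d^2] - 2 P^T W + W^T R W."""
        return Ed2 - 2.0 * P @ W + W @ R @ W

    # Wiener solution of Eq. (29): the unique minimum of the convex bowl.
    W_star = np.linalg.solve(R, P)

    print("Wiener weights:", W_star)
    print("MSE at the Wiener solution:", mse(W_star))
    print("MSE at the zero vector:", mse(np.zeros(n_weights)))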
An instantaneous gradient based upon the square of the instantaneous linear error is

    ∇̂_k = ∂ε_k²/∂W_k.                                                 (30)

μ-LMS works by using this crude gradient estimate in place of the true gradient ∇_k. Making this replacement in the steepest-descent relation yields

    W_{k+1} = W_k + μ(−∇̂_k) = W_k − μ ∂ε_k²/∂W_k.                     (31)

The instantaneous gradient is used because it is readily available from a single data sample. The true gradient is generally difficult to obtain; computing it would involve averaging the instantaneous gradients associated with all patterns in the training set. This is usually impractical and almost always inefficient. Performing the differentiation in Eq. (31) and replacing the linear error by its definition gives

    W_{k+1} = W_k − 2μ ε_k ∂ε_k/∂W_k = W_k − 2μ ε_k ∂(d_k − X_k^T W_k)/∂W_k.

Noting that d_k and X_k are independent of W_k yields

    W_{k+1} = W_k + 2μ ε_k X_k.

This is the μ-LMS algorithm. The learning constant μ determines stability and convergence rate. For input patterns independent over time, convergence of the mean and variance of the weight vector is ensured [30] for most practical purposes if

    0 < μ < 1/tr[R]

where tr[R], the trace of the input correlation matrix, equals the sum of the mean squares of the inputs.

Backpropagation for the Sigmoid Adaline: When the Adaline's linear output is passed through a sigmoid nonlinearity, the corresponding steepest-descent rule is based on the sigmoid error ε̃_k = d_k − sgm(s_k). The instantaneous gradient of ε̃_k² with respect to the weights is

    ∇̂_k = ∂ε̃_k²/∂W_k = −2 ε̃_k sgm′(s_k) X_k.

Using this gradient estimate with the method of steepest descent provides a means for minimizing the mean-square error even after the summed signal s_k goes through the nonlinear sigmoid. The algorithm is

    W_{k+1} = W_k + μ(−∇̂_k) = W_k + 2μ ε̃_k sgm′(s_k) X_k.            (54)

Algorithm (54) is the backpropagation algorithm for the sigmoid Adaline element. The backpropagation name makes more sense when the algorithm is utilized in a layered network, which will be studied below. Implementation of algorithm (54) is illustrated in Fig. 19.

Fig. 19. Implementation of backpropagation for the sigmoid Adaline element.

If the sigmoid is chosen to be the hyperbolic tangent function, then the derivative sgm′(s_k) is given by

    sgm′(s_k) = ∂ tanh(s_k)/∂s_k = 1 − tanh²(s_k) = 1 − y_k².          (55)

Accordingly, Eq. (54) becomes

    W_{k+1} = W_k + 2μ ε̃_k (1 − y_k²) X_k.                            (56)

Madaline Rule III for the Sigmoid Adaline: The implementation of algorithm (54) (Fig. 19) requires accurate realization of the sigmoid function and its derivative function. These functions may not be realized accurately when implemented with analog hardware. Indeed, in an analog network, each Adaline will have its own individual nonlinearities. Difficulties in adaptation have been encountered in practice with the backpropagation algorithm because of imperfections in the nonlinear functions.

To circumvent these problems, a new algorithm has been devised by David Andes for adapting networks of sigmoid Adalines. This is the Madaline Rule III (MRIII) algorithm. The idea of MRIII for a sigmoid Adaline is illustrated in Fig. 20. The derivative of the sigmoid function is not used here. Instead, a small perturbation signal Δs is added to the sum s_k, and the effect of this perturbation upon the output y_k and error ε̃_k is noted.

Fig. 20. Implementation of the MRIII algorithm for the sigmoid Adaline element.

An instantaneous estimated gradient can be obtained as follows:

    ∇̂_k = ∂ε̃_k²/∂W_k = (∂ε̃_k²/∂s_k)(∂s_k/∂W_k) = (∂ε̃_k²/∂s_k) X_k.

Since Δs is small, the derivative may be replaced by a measured ratio of increments, giving two equivalent forms of the weight update:

    W_{k+1} = W_k − μ (Δε̃_k²/Δs) X_k                                  (60)
    W_{k+1} = W_k − 2μ ε̃_k (Δε̃_k/Δs) X_k.                            (61)

For small perturbations these two forms are essentially identical. Neither one requires a priori knowledge of the sigmoid's derivative, and both are robust with respect to natural variations, biases, and drift in the analog hardware. Which form to use is a matter of implementational convenience. The algorithm of Eq. (60) is illustrated in Fig. 20.
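To make the relationship between Eq. (56) and the perturbation form concrete, the sketch below trains a single tanh Adaline on the same data in two ways: once with the analytic-derivative update of Eq. (56) and once with the measured ratio of Eq. (60). The data, the learning constant, and the perturbation size are assumptions chosen for the example; the near-zero difference between the two weight trajectories is the point being illustrated.

    import numpy as np

    rng = np.random.default_rng(2)

    # One sigmoid (tanh) Adaline with a bias weight; data and constants are assumed.
    n_in, mu, delta_s = 3, 0.05, 1e-3
    X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, n_in))])
    d = np.tanh(X @ np.array([0.2, 1.0, -0.7, 0.4]))      # attainable targets

    def backprop_step(W, x, d_k):
        """Eq. (56): analytic-derivative update for the tanh Adaline."""
        y = np.tanh(W @ x)
        return W + 2 * mu * (d_k - y) * (1 - y ** 2) * x

    def mriii_step(W, x, d_k):
        """Eq. (60): update from a measured perturbation of the linear sum."""
        s = W @ x
        e2 = (d_k - np.tanh(s)) ** 2                  # unperturbed squared error
        e2_pert = (d_k - np.tanh(s + delta_s)) ** 2   # squared error with s + delta_s
        grad_est = (e2_pert - e2) / delta_s           # Delta(err^2) / Delta s
        return W - mu * grad_est * x

    W_bp = np.zeros(n_in + 1)
    W_m3 = np.zeros(n_in + 1)
    for x, d_k in zip(X, d):
        W_bp = backprop_step(W_bp, x, d_k)
        W_m3 = mriii_step(W_m3, x, d_k)

    print("max weight difference between the two rules:", np.max(np.abs(W_bp - W_m3)))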
Regarding algorithm (61), some changes can be made to establish a point of interest. Note that, since the desired response is fixed during a presentation, a change in the sum produces Δε̃_k = −Δy_k, so Eq. (61) may be written

    W_{k+1} = W_k + 2μ ε̃_k (Δy_k/Δs) X_k.

Since Δs is small, the ratio of increments may be replaced by a ratio of differentials, finally giving

    W_{k+1} ≈ W_k + 2μ ε̃_k (∂y_k/∂s_k) X_k = W_k + 2μ ε̃_k sgm′(s_k) X_k.

This is identical to the backpropagation algorithm (54) for the sigmoid Adaline. Thus, backpropagation and MRIII are mathematically equivalent if the perturbation Δs is small, but MRIII is robust, even with analog implementations.

MSE Surfaces of the Adaline: Fig. 21 shows a linear combiner connected to both sigmoid and signum devices. Three errors, ε, ε̃, and ε̂, are designated in this figure. They are

    linear error:  ε = d − s
    sigmoid error: ε̃ = d − sgm(s)
    signum error:  ε̂ = d − sgn(sgm(s)) = d − sgn(s).

Fig. 21. The linear, sigmoid, and signum errors of the Adaline.

To demonstrate the nature of the square-error surfaces associated with these three types of error, a simple experiment with a two-input Adaline was performed. The Adaline was driven by a typical set of input patterns and their associated binary {+1, -1} desired responses. The sigmoid function used was the hyperbolic tangent. The weights could have been adapted to minimize the mean-square error of ε, ε̃, or ε̂. The MSE surfaces of E[ε²], E[ε̃²], and E[ε̂²], plotted as functions of the two weight values, are shown in Figs. 22, 23, and 24, respectively.

Fig. 22. Example MSE surface of the linear error.
Fig. 23. Example MSE surface of the sigmoid error.
Fig. 24. Example MSE surface of the signum error.

Although the above experiment is not all-encompassing, we can infer from it that minimizing the mean square of the linear error is easy, and that minimizing the mean square of the sigmoid error is more difficult, but typically much easier than minimizing the mean square of the signum error. Only the linear error is guaranteed to have an MSE surface with a unique global minimum (assuming an invertible R-matrix). The other MSE surfaces can have local optima [122], [123].

In nonlinear neural networks, gradient methods generally work better with sigmoid rather than signum nonlinearities. Smooth nonlinearities are required by the MRIII and backpropagation techniques. Moreover, sigmoid networks are capable of forming internal representations that are more complex than simple binary codes, and thus these networks can often form decision regions that are more sophisticated than those associated with similar signum networks. In fact, if a noiseless, infinite-precision sigmoid Adaline could be constructed, it would be able to convey an infinite amount of information at each time step. This is in contrast to the maximum Shannon information capacity of one bit associated with each binary element.

The signum does have some advantages over the sigmoid in that it is easier to implement in hardware and much simpler to compute on a digital computer. Furthermore, the outputs of signums are binary signals which can be efficiently manipulated by digital computers. In a signum network with binary inputs, for instance, the output of each linear combiner can be computed without performing weight multiplications. This involves simply adding together the values of all weights connected to +1 inputs and subtracting from this the values of all weights connected to -1 inputs.
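The multiplication-free evaluation just mentioned is essentially a one-liner. The sketch below compares it with the ordinary dot product for a signum Adaline with binary inputs; the vector size is an arbitrary assumption.

    import numpy as np

    rng = np.random.default_rng(3)

    w = rng.normal(size=8)                          # Adaline weights
    x = np.where(rng.random(8) > 0.5, 1.0, -1.0)    # binary {+1, -1} input pattern

    # Linear output via weight multiplications.
    s_mult = w @ x

    # Same result using only additions and subtractions:
    # add the weights on +1 inputs, subtract the weights on -1 inputs.
    s_addsub = w[x > 0].sum() - w[x < 0].sum()

    print(s_mult, s_addsub)                         # identical up to rounding
    print("output decision:", 1 if s_addsub >= 0 else -1)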
Sometimes a signum is used in an Adaline to produce decisive output decisions. The error probability is then proportional to the mean square of the output error ε̂. To minimize this error probability approximately, one can easily minimize E[ε²] instead of directly minimizing E[ε̂²]. However, with only a little more computation one could minimize E[ε̃²] and typically come much closer to the objective of minimizing E[ε̂²]. The sigmoid can therefore be used in training the weights even when the signum is used to form the Adaline output, as in Fig. 21.

VII. Steepest-Descent Rules: Multi-Element Networks

We now study rules for steepest-descent minimization of the MSE associated with entire networks of sigmoid Adaline elements. Like their single-element counterparts, the most practical and efficient steepest-descent rules for multi-element networks typically work with one pattern presentation at a time. We will describe two steepest-descent rules for multi-element sigmoid networks, backpropagation and Madaline Rule III.

Fig. 25. Example two-layer backpropagation network architecture.

A. Backpropagation for Networks

The publication of the backpropagation technique by Rumelhart et al. [42] has unquestionably been the most influential development in the field of neural networks during the past decade. In retrospect, the technique seems simple. Nonetheless, largely because early neural network research dealt almost exclusively with hard-limiting nonlinearities, the idea never occurred to neural network researchers throughout the 1960s.

The basic concepts of backpropagation are easily grasped. Unfortunately, these simple ideas are often obscured by relatively intricate notation, so formal derivations of the backpropagation rule are often tedious. We present an informal derivation of the algorithm and illustrate how it works for the simple network shown in Fig. 25.

The backpropagation technique is a nontrivial generalization of the single sigmoid Adaline case of Section VI. When applied to multi-element networks, the backpropagation technique adjusts the weights in the direction opposite the instantaneous error gradient:

    W_{k+1} = W_k + μ(−∇̂_k),  where  ∇̂_k = ∂ε_k²/∂W_k.

Now, however, W_k is a long m-component vector of all the weights in the entire network. The instantaneous sum squared error ε_k² is the sum of the squares of the errors at each of the N_y outputs of the network. Thus

    ε_k² = Σ_{j=1}^{N_y} ε_{jk}².

In the network example shown in Fig. 25, the sum square error is given by

    ε² = (d_1 − y_1)² + (d_2 − y_2)²,

where we now suppress the time index k for convenience.

In its simplest form, backpropagation training begins by presenting an input pattern vector X to the network, sweeping forward through the system to generate an output response vector Y, and computing the errors at each output. The next step involves sweeping the effects of the errors backward through the network to associate a "square-error derivative" δ with each Adaline, computing a gradient from each δ, and finally updating the weights of each Adaline based upon the corresponding gradient. A new pattern is then presented and the process is repeated. The initial weight values are normally set to small random numbers. The algorithm will not work properly with multilayer networks if the initial weights are either zero or poorly chosen nonzero values. (Recently, Nguyen has discovered that a more sophisticated choice of initial weight values in hidden layers can result in reduced problems with local optima and dramatic increases in network training speed [100]. Experimental evidence suggests that it is advisable to choose the initial weights of each hidden layer in a quasi-random manner which ensures that, at each position in a layer's input space, the outputs of all but a few of its Adalines will be saturated, while ensuring that each Adaline in the layer is unsaturated in some region of its input space. When this method is used, the weights in the output layer are set to small random values.)

We can get some idea about what is involved in the calculations associated with the backpropagation algorithm by examining the network of Fig. 25. Each of the five large circles represents a linear combiner, as well as some associated signal paths for error backpropagation and the corresponding adaptive machinery for updating the weights. This detail is shown in Fig. 26.
The solid lines in these diagrams represent forward signal paths through the network, and the dotted lines represent the separate backward paths that are used in association with calculations of the square-error derivatives. From Fig. 25 we see that the calculations associated with the backward sweep are of a complexity roughly equal to that represented by the forward pass through the network. The backward sweep requires the same number of function calculations as the forward sweep, but no weight multiplications in the first layer.

Fig. 26. Detail of a linear combiner and associated circuitry in the backpropagation network.

As stated earlier, after a pattern has been presented to the network and the response error of each output has been calculated, the next step of the backpropagation algorithm involves finding the instantaneous square-error derivative δ associated with each summing junction in the network. The square-error derivative associated with the jth Adaline in layer l is defined as

    δ_j^(l) = −(1/2) ∂ε²/∂s_j^(l).

Each of these derivatives in essence tells us how sensitive the sum square output error of the network is to changes in the linear output of the associated Adaline element.

The instantaneous square-error derivatives are first computed for each element in the output layer. The calculation is simple. As an example, below we derive the required expression for δ_1^(2), the derivative associated with the top Adaline element in the output layer of Fig. 25. We begin with the definition of δ_1^(2):

    δ_1^(2) = −(1/2) ∂ε²/∂s_1^(2).

Expanding the squared-error term yields

    δ_1^(2) = −(1/2) ∂/∂s_1^(2) [ (d_1 − sgm(s_1^(2)))² + (d_2 − sgm(s_2^(2)))² ].

(In Fig. 25, all notation follows the convention that superscripts within parentheses indicate the layer number of the associated Adaline or input node, while subscripts identify the associated Adaline within a layer.)

We note that the second term is zero. Accordingly,

    δ_1^(2) = −(1/2) ∂(d_1 − sgm(s_1^(2)))²/∂s_1^(2).

Observing that d_1 and s_1^(2) are independent yields

    δ_1^(2) = (d_1 − sgm(s_1^(2))) sgm′(s_1^(2)).                      (76)

We denote the error d_1 − sgm(s_1^(2)) by ε_1^(2). Therefore,

    δ_1^(2) = ε_1^(2) sgm′(s_1^(2)).                                   (77)

Note that this corresponds to the computation of δ_1^(2) as illustrated in Fig. 25. The value of δ associated with the other output element in the figure can be expressed in an analogous fashion. Thus each square-error derivative δ in the output layer is computed by multiplying the output error associated with that element by the derivative of the associated sigmoidal nonlinearity. Note from Eq. (55) that if the sigmoid function is the hyperbolic tangent, Eq. (77) becomes simply

    δ_1^(2) = ε_1^(2) (1 − y_1²).                                      (78)

Developing expressions for the square-error derivatives associated with hidden layers is not much more difficult (refer to Fig. 25). We need an expression for δ_1^(1), the square-error derivative associated with the top element in the first layer of Fig. 25. The derivative δ_1^(1) is defined by

    δ_1^(1) = −(1/2) ∂ε²/∂s_1^(1).

Expanding this by the chain rule, noting that ε² is determined entirely by the values of s_1^(2) and s_2^(2), yields

    δ_1^(1) = −(1/2) [ (∂ε²/∂s_1^(2)) (∂s_1^(2)/∂s_1^(1)) + (∂ε²/∂s_2^(2)) (∂s_2^(2)/∂s_1^(1)) ].
Using the definitions of δ_1^(2) and δ_2^(2), and then substituting expanded versions of the Adaline linear outputs s_1^(2) and s_2^(2), gives

    δ_1^(1) = δ_1^(2) ∂/∂s_1^(1) [ Σ_j w_j1^(2) sgm(s_j^(1)) ] + δ_2^(2) ∂/∂s_1^(1) [ Σ_j w_j2^(2) sgm(s_j^(1)) ].

Noting that ∂ sgm(s_j^(1))/∂s_1^(1) = 0 for j ≠ 1 leaves

    δ_1^(1) = δ_1^(2) w_11^(2) sgm′(s_1^(1)) + δ_2^(2) w_12^(2) sgm′(s_1^(1))
            = (δ_1^(2) w_11^(2) + δ_2^(2) w_12^(2)) sgm′(s_1^(1)).     (84)

Now we make the following definition:

    ε_1^(1) = δ_1^(2) w_11^(2) + δ_2^(2) w_12^(2).                     (85)

Accordingly,

    δ_1^(1) = ε_1^(1) sgm′(s_1^(1)).                                   (86)

Referring to Fig. 25, we can trace through the circuit to verify that δ_1^(1) is computed in accord with Eqs. (85) and (86). The easiest way to find values of δ for all the Adaline elements in the network is to follow the schematic diagram of Fig. 25.

Thus, the procedure for finding δ^(l), the square-error derivative associated with a given Adaline in hidden layer l, involves multiplying each derivative δ^(l+1) associated with each element in the layer immediately downstream from the given Adaline by the weight that connects it to the given Adaline. These weighted square-error derivatives are then added together, producing an error term ε^(l), which, in turn, is multiplied by sgm′(s^(l)), the derivative of the given Adaline's sigmoid function at its current operating point. If a network has more than two layers, this process of backpropagating the instantaneous square-error derivatives from one layer to the immediately preceding layer is successively repeated until a square-error derivative δ is computed for each Adaline in the network. This is easily shown at each layer by repeating the chain-rule argument associated with Eq. (84).

We now have a general method for finding a derivative δ for each Adaline element in the network. The next step is to use these δ's to obtain the corresponding gradients. Consider an Adaline somewhere in the network which, during presentation k, has a weight vector W_k, an input vector X_k, and a linear output s_k = W_k^T X_k. The instantaneous gradient for this Adaline element is

    ∇̂_k = ∂ε_k²/∂W_k.

This can be written as

    ∇̂_k = (∂ε_k²/∂s_k)(∂s_k/∂W_k).

Note that W_k and X_k are independent, so

    ∂s_k/∂W_k = ∂(W_k^T X_k)/∂W_k = X_k.

Therefore,

    ∇̂_k = (∂ε_k²/∂s_k) X_k.                                           (91)

For this element,

    ∂ε_k²/∂s_k = −2δ_k.                                                (92)

Accordingly,

    ∇̂_k = −2δ_k X_k.                                                  (93)

Updating the weights of the Adaline element using the method of steepest descent with the instantaneous gradient is a process represented by

    W_{k+1} = W_k + μ(−∇̂_k) = W_k + 2μ δ_k X_k.                       (94)

Thus, after backpropagating all square-error derivatives, we complete a backpropagation iteration by adding to each weight vector the corresponding input vector scaled by the associated square-error derivative. Eq. (94), together with the means for finding δ_k, comprises the general weight update rule of the backpropagation algorithm.

There is a great similarity between Eq. (94) and the μ-LMS algorithm, but one should view this similarity with caution. The quantity δ_k, defined as a squared-error derivative, might appear to play the same role in backpropagation as that played by the error in the μ-LMS algorithm. However, δ_k is not an error. Adaptation of the given Adaline is effected to reduce the squared output error ε_k², not δ_k, of the given Adaline or of any other Adaline in the network. The objective is not to reduce the δ_k's of the network, but to reduce ε_k² at the network output.
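A complete iteration of Eqs. (77), (85), (86), and (94) for a small two-layer tanh network can be written in a few lines. The sketch below is an illustration only: the layer widths, learning constant, and the XOR-like training task are assumptions, and the weights are stored as matrices so that the sums in Eq. (85) become matrix products.

    import numpy as np

    rng = np.random.default_rng(4)

    # Assumed two-layer architecture: n_in inputs (plus bias), n_hid hidden tanh
    # Adalines (plus bias into the output layer), n_out output tanh Adalines.
    n_in, n_hid, n_out, mu = 2, 3, 1, 0.1
    W1 = rng.normal(scale=0.3, size=(n_hid, n_in + 1))
    W2 = rng.normal(scale=0.3, size=(n_out, n_hid + 1))

    def backprop_iteration(x, d):
        """One presentation: forward sweep, delta computation, weight update (Eq. 94)."""
        global W1, W2
        # Forward sweep.
        x1 = np.append(x, 1.0)            # first-layer input with bias
        a1 = np.tanh(W1 @ x1)
        x2 = np.append(a1, 1.0)           # second-layer input with bias
        y = np.tanh(W2 @ x2)
        # Output-layer square-error derivatives, Eqs. (77)-(78): delta = err * (1 - y^2).
        err_out = d - y
        delta2 = err_out * (1.0 - y ** 2)
        # Hidden-layer derivatives, Eqs. (85)-(86): backpropagate through W2 (bias column excluded).
        err_hid = W2[:, :n_hid].T @ delta2
        delta1 = err_hid * (1.0 - a1 ** 2)
        # Weight updates, Eq. (94): W <- W + 2 mu delta X for every Adaline.
        W2 += 2 * mu * np.outer(delta2, x2)
        W1 += 2 * mu * np.outer(delta1, x1)
        return float(err_out @ err_out)   # instantaneous sum squared error

    # Example: an XOR-like association with +/-0.9 targets.
    X = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
    D = np.array([[-0.9], [0.9], [0.9], [-0.9]])
    for epoch in range(2000):
        sse = sum(backprop_iteration(x, d) for x, d in zip(X, D))
    print("final sum squared error over the 4 patterns:", sse)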
It is interesting to examine the weight updates that backpropagation imposes on the Adaline elements in the output layer. Substituting Eq. (77) into Eq. (94) reveals that the Adaline which provides output y_1 in Fig. 25 is updated by the rule

    W_{k+1} = W_k + 2μ ε_1^(2) sgm′(s_1^(2)) X_k.                      (95)

This rule turns out to be identical to the single-Adaline version (54) of the backpropagation rule. This is not surprising, since the output Adaline is provided with both input signals and desired responses, so its training circumstance is the same as that experienced by an Adaline trained in isolation.

There are many variants of the backpropagation algorithm. Sometimes the size of μ is reduced during training to diminish the effects of gradient noise in the weights. Another extension is the momentum technique [42], which involves including in the weight change vector ΔW_k of each Adaline a term proportional to the corresponding weight change from the previous iteration. That is, Eq. (94) is replaced by a pair of equations:

    ΔW_k = 2μ(1 − η) δ_k X_k + η ΔW_{k−1}                              (96)
    W_{k+1} = W_k + ΔW_k                                               (97)

where the momentum constant 0 ≤ η < 1 is in practice usually set to something around 0.8 or 0.9.

The momentum technique low-pass filters the weight updates and thereby tends to resist erratic weight changes caused either by gradient noise or by high spatial frequencies in the MSE surface. The factor (1 − η) in Eq. (96) is included to give the filter a DC gain of unity so that the learning rate μ does not need to be stepped down as the momentum constant η is increased. A momentum term can also be added to the update equations of other algorithms discussed in this paper. A detailed analysis of stability issues associated with momentum updating for the μ-LMS algorithm, for instance, has been described by Shynk and Roy [124].

In our experience, the momentum technique used alone is usually of little value. We have found, however, that it is often useful to apply the technique in situations that require relatively "clean" gradient estimates, that is, estimates with little gradient noise. One case is a normalized weight update equation which makes the network's weight vector move the same Euclidean distance with each iteration. This can be accomplished by replacing Eqs. (96) and (97) with

    ΔW_k = η ΔW_{k−1} + (1 − η) δ_k X_k                                (98)
    W_{k+1} = W_k + μ ΔW_k / ||ΔW_k||                                  (99)

where again 0 ≤ η < 1. The weight updates determined by Eqs. (98) and (99) can help a network find a solution when a relatively flat local region in the MSE surface is encountered: the weights move by the same amount whether the surface is flat or inclined. The rule is reminiscent of α-LMS because the gradient term in the weight update equation is normalized by a time-varying factor. The weight update rule could be further modified by including terms from both techniques associated with Eqs. (96) through (99). Other methods for speeding up backpropagation training include Fahlman's popular quickprop method [125], as well as the delta-bar-delta approach reported in an excellent paper by Jacobs [126].

(Jacobs's paper, like many other papers in the literature, assumes for analysis that the true gradients rather than the instantaneous gradients are used to update the weights, that is, that the weights are changed periodically, only after all training patterns are presented. This eliminates gradient noise but can slow down training enormously if the training set is large. The delta-bar-delta procedure in Jacobs's paper involves monitoring changes of the true gradients in response to weight changes. It should be possible to avoid the expense of computing the true gradients explicitly in this case by instead monitoring changes in the outputs of, say, two momentum filters with different time constants.)
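The low-pass filtering performed by Eqs. (96) and (97) is easy to see in isolation. The sketch below applies the momentum update to a stream of noisy gradient terms for a single weight vector; the gradient stream and the constants are assumptions made for the illustration.

    import numpy as np

    rng = np.random.default_rng(5)

    mu, eta = 0.05, 0.9               # learning constant and momentum constant
    W = np.zeros(2)
    dW = np.zeros(2)                  # previous weight change, Delta W_{k-1}

    # Pretend delta_k * X_k is a noisy estimate of a fixed underlying gradient term.
    true_term = np.array([1.0, -0.5])
    for k in range(200):
        delta_x = true_term + rng.normal(scale=2.0, size=2)   # noisy delta_k X_k
        dW = 2 * mu * (1 - eta) * delta_x + eta * dW          # Eq. (96)
        W = W + dW                                            # Eq. (97)

    # With unity DC gain, W drifts by about 2*mu*true_term per step on average,
    # while the step-to-step jitter is strongly attenuated.
    print("final W:", W, " underlying gradient term:", true_term)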
One of the most promising new areas of neural network research involves backpropagation variants for training various recurrent (signal feedback) networks. Recently, backpropagation rules have been derived for training recurrent networks to learn static associations [127], [128]. More interesting is the on-line technique of Williams and Zipser [129], which allows a wide class of recurrent networks to learn dynamic associations and trajectories. A more general and computationally viable variant of this technique has been advanced by Narendra and Parthasarathy [104]. These on-line methods are generalizations of a well-known steepest-descent algorithm for training linear IIR filters [30], [130].

An equivalent technique that is usually far less computationally intensive but better suited for off-line computation [37], [42], [131], called "backpropagation through time," has been used by Nguyen and Widrow [50] to enable a neural network to learn, without a teacher, how to back up a computer-simulated trailer truck to a loading dock (Fig. 27). This is a highly nonlinear steering task and it is not yet known how to design a controller to perform it. Nevertheless, with just six inputs providing information about the current position of the truck, a two-layer neural network with only 26 Adalines was able to learn of its own accord to solve this problem. Once trained, the network could successfully back up the truck from any initial position and orientation in front of the loading dock.

Fig. 27. Example truck backup sequence.

B. Madaline Rule III for Networks

It is difficult to build neural networks with analog hardware that can be trained effectively by the popular backpropagation technique. Attempts to overcome this difficulty have led to the development of the MRIII algorithm. A commercial analog neurocomputing chip based primarily on this algorithm has already been devised [132]. The method described in this section is a generalization of the single-Adaline MRIII technique (60). The multi-element generalization of the other single-element MRIII rule (61) is described in [133].

The MRIII algorithm can be readily described by referring to Fig. 28. Although this figure shows a simple two-layer feedforward architecture, the procedure to be developed will work for neural networks with any number of Adaline elements in any feedforward structure. In [133], we discuss variants of the basic MRIII approach that allow steepest-descent training to be applied to more general network topologies, even those with signal feedback.

Fig. 28. Example two-layer Madaline III architecture.

Assume that an input pattern X and its associated desired output responses d_1 and d_2 are presented to the network of Fig. 28. At this point, we measure the sum squared output response error ε² = (d_1 − y_1)² + (d_2 − y_2)² = ε_1² + ε_2². We then add a small quantity Δs to a selected Adaline in the network, providing a perturbation to the element's linear sum. This perturbation propagates through the network and causes a change in the sum of the squares of the errors, Δ(ε²) = Δ(ε_1² + ε_2²). An easily measured ratio is

    Δ(ε²)/Δs = Δ(ε_1² + ε_2²)/Δs ≈ ∂ε²/∂s.                             (100)

Below we use this to obtain the instantaneous gradient of ε_k² with respect to the weight vector of the selected Adaline. For the kth presentation, the instantaneous gradient is

    ∇̂_k = ∂ε_k²/∂W_k = (∂ε_k²/∂s_k)(∂s_k/∂W_k) = (∂ε_k²/∂s_k) X_k.    (101)

Replacing the derivative with a ratio of differences yields

    ∇̂_k ≈ (Δε_k²/Δs) X_k.                                             (102)
The idea of obtaining a derivative by perturbing the linear output of the selected Adaline element is the same as that expressed for the single element in Section VI, except that here the error is obtained from the output of a multi-element network rather than from the output of a single element.

The gradient (102) can be used to optimize the weight vector in accord with the method of steepest descent:

    W_{k+1} = W_k + μ(−∇̂_k) = W_k − μ (Δε_k²/Δs) X_k.                 (103)

Maintaining the same input pattern, one could either perturb all the elements in the network in sequence, adapting after each gradient calculation, or else the derivatives could be computed and stored to allow all Adalines to be adapted at once. These two MRIII approaches both involve the same weight update equation (103), and if μ is small, both lead to equivalent solutions. With large μ, experience indicates that adapting one element at a time results in convergence after fewer iterations, especially in large networks. Storing the gradients, however, has the advantage that after the initial unperturbed error is measured during a given training presentation, each gradient estimate requires only the perturbed error measurement. If adaptations take place after each error measurement, both perturbed and unperturbed errors must be measured for each gradient calculation, because each weight update changes the associated unperturbed error.

C. Comparison of MRIII with MRII

MRIII was derived from MRII by replacing the signum nonlinearities with sigmoids. The similarity of these algorithms becomes evident when comparing Fig. 28, representing MRIII, with Fig. 16, representing MRII.

The MRII network is highly discontinuous and nonlinear, so using an instantaneous gradient to adjust the weights is not possible. In fact, from the MSE surface for the signum Adaline presented in Section VI, it is clear that even gradient descent techniques that use the true gradient could run into severe problems with local minima. The idea of adding a perturbation to the linear sum of a selected Adaline element is workable, however. If the Hamming error has been reduced by the perturbation, the Adaline is adapted to reverse its output decision. This weight change is in the LMS direction, along its X-vector. If adapting the Adaline would not reduce network output error, it is not adapted. This is in accord with the minimal disturbance principle. The Adalines selected for possible adaptation are those whose analog sums are closest to zero, that is, the Adalines that can be adapted to give opposite responses with the smallest weight changes. It is useful to note that with binary ±1 desired responses, the Hamming error is equal to 1/4 the sum square error. Minimizing the output Hamming error is therefore equivalent to minimizing the output sum square error.

The MRIII algorithm works in a similar manner. All the Adalines in the MRIII network are adapted, but those whose analog sums are closest to zero will usually be adapted most strongly, because the sigmoid has its maximum slope at zero, contributing to high gradient values. As with MRII, the objective is to change the weights for the given input presentation so as to reduce the sum square error at the network output. In accord with the minimal disturbance principle, the weight vectors of the Adaline elements are adapted in the LMS direction, along their X-vectors, and are adapted in proportion to their capabilities for reducing the sum square error (the square of the Euclidean error) at the output.
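For a concrete picture of the store-then-adapt form of MRIII, the sketch below perturbs the sum of each Adaline of a small two-layer tanh network in turn, records Δ(ε²)/Δs, and then applies Eq. (103) to every element at once. The network sizes, constants, and training task are assumptions made to keep the example short and comparable with the earlier backpropagation sketch.

    import numpy as np

    rng = np.random.default_rng(6)

    # Assumed small two-layer tanh network; sizes and constants chosen for the sketch.
    n_in, n_hid, n_out, mu, ds = 2, 3, 1, 0.1, 1e-3
    W1 = rng.normal(scale=0.3, size=(n_hid, n_in + 1))
    W2 = rng.normal(scale=0.3, size=(n_out, n_hid + 1))

    def sse(x, d, s1_pert=None, s2_pert=None):
        """Sum squared output error, optionally with a perturbation added to one sum."""
        x1 = np.append(x, 1.0)
        s1 = W1 @ x1
        if s1_pert is not None:
            s1 = s1 + s1_pert
        x2 = np.append(np.tanh(s1), 1.0)
        s2 = W2 @ x2
        if s2_pert is not None:
            s2 = s2 + s2_pert
        err = d - np.tanh(s2)
        return float(err @ err), x1, x2

    def mriii_iteration(x, d):
        """Measure Delta(err^2)/Delta s for every Adaline, then adapt all at once (Eq. 103)."""
        global W1, W2
        e0, x1, x2 = sse(x, d)                    # unperturbed error, measured once
        g1 = np.zeros(n_hid)
        for j in range(n_hid):                    # perturb each first-layer sum in turn
            p = np.zeros(n_hid); p[j] = ds
            g1[j] = (sse(x, d, s1_pert=p)[0] - e0) / ds
        g2 = np.zeros(n_out)
        for j in range(n_out):                    # perturb each output-layer sum in turn
            p = np.zeros(n_out); p[j] = ds
            g2[j] = (sse(x, d, s2_pert=p)[0] - e0) / ds
        W1 -= mu * np.outer(g1, x1)               # Eq. (103) for every element
        W2 -= mu * np.outer(g2, x2)
        return e0

    # Example: the same XOR-like task used for the backpropagation sketch.
    X = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
    D = np.array([[-0.9], [0.9], [0.9], [-0.9]])
    for epoch in range(2000):
        total = sum(mriii_iteration(x, d) for x, d in zip(X, D))
    print("sum squared error after training:", total)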
D. Comparison of MRIII with Backpropagation

In Section VI, we argued that for the sigmoid Adaline element, the MRIII algorithm (61) is essentially equivalent to the backpropagation algorithm (54). The same argument can be extended to the network of Adaline elements, demonstrating that if Δs is small and adaptation is applied to all elements in the network at once, then MRIII is essentially equivalent to backpropagation. That is, to the extent that the sample derivative Δε_k²/Δs from Eq. (103) is equal to the analytical derivative ∂ε_k²/∂s_k from Eq. (91), the two rules follow identical instantaneous gradients and thus perform identical weight updates.

The backpropagation algorithm requires fewer operations than MRIII to calculate gradients, since it is able to take advantage of a priori knowledge of the sigmoid nonlinearities and their derivative functions. Conversely, the MRIII algorithm uses no prior knowledge about the characteristics of the sigmoid functions. Rather, it acquires instantaneous gradients from perturbation measurements. Using MRIII, tolerances on the sigmoid implementations can be greatly relaxed compared to acceptable tolerances for successful backpropagation.

Steepest-descent training of multilayer networks implemented by computer simulation, or by precise parallel digital hardware, is usually best carried out by backpropagation. During each training presentation, the backpropagation method requires only one forward computation through the network followed by one backward computation in order to adapt all the weights of an entire network. To accomplish the same effect with the form of MRIII that updates all weights at once, one measures the unperturbed error followed by a number of perturbed error measurements equal to the number of elements in the network. This could require a lot of computation.

If a network is to be implemented in analog hardware, however, experience has shown that MRIII offers strong advantages over backpropagation. Comparison of Fig. 25 with Fig. 28 demonstrates the relative simplicity of MRIII. All the apparatus for backward propagation of error-related signals is eliminated, and the weights do not need to carry signals in both directions (see Fig. 26). MRIII is a much simpler algorithm to build and to understand, and in principle it produces the same instantaneous gradient as the backpropagation algorithm. The momentum technique and most other common variants of the backpropagation algorithm can be applied to MRIII training.

E. MSE Surfaces of Neural Networks

In Section VI, "typical" mean-square-error surfaces of sigmoid and signum Adalines were shown, indicating that sigmoid Adalines are much more conducive to gradient approaches than signum Adalines. The same phenomena result when Adalines are incorporated into multi-element networks. The MSE surfaces of MRII networks are reasonably chaotic and will not be explored here. In this section we examine only MSE surfaces from a typical backpropagation training problem with a sigmoidal neural network.

In a network with more than two weights, the MSE surface is high-dimensional and difficult to visualize. It is possible, however, to look at slices of this surface by plotting the MSE surface created by varying two of the weights while holding all others constant.
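Slices of this kind can be generated directly from the definition of the MSE: evaluate the network's average squared error over the training set on a grid of values for the two chosen weights, holding the rest fixed. The sketch below does this for the small tanh network used in the earlier examples; the choice of which two weights to vary and the grid range are arbitrary assumptions.

    import numpy as np

    rng = np.random.default_rng(7)

    # Assumed two-layer tanh network and training set (as in the earlier sketches).
    n_in, n_hid, n_out = 2, 3, 1
    W1 = rng.normal(scale=0.3, size=(n_hid, n_in + 1))
    W2 = rng.normal(scale=0.3, size=(n_out, n_hid + 1))
    X = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
    D = np.array([[-0.9], [0.9], [0.9], [-0.9]])

    def mse(W1_, W2_):
        """Mean-square error of the network over the training set."""
        total = 0.0
        for x, d in zip(X, D):
            a1 = np.tanh(W1_ @ np.append(x, 1.0))
            y = np.tanh(W2_ @ np.append(a1, 1.0))
            total += float((d - y) @ (d - y))
        return total / len(X)

    # Vary two first-layer weights over a grid while holding all other weights constant.
    grid = np.linspace(-4.0, 4.0, 41)
    surface = np.zeros((len(grid), len(grid)))
    for i, w_a in enumerate(grid):
        for j, w_b in enumerate(grid):
            W1_slice = W1.copy()
            W1_slice[0, 0], W1_slice[1, 0] = w_a, w_b   # the two weights being varied
            surface[i, j] = mse(W1_slice, W2)

    # 'surface' now holds one slice of the MSE surface; it can be contour-plotted or
    # inspected directly, e.g. for the location of its minimum over the grid.
    imin = np.unravel_index(np.argmin(surface), surface.shape)
    print("lowest MSE on the slice:", surface[imin],
          "at weights", grid[imin[0]], grid[imin[1]])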
The surfaces plotted in Figs. 29 and 30 show two such slices of the MSE surface from a typical learning problem involving, respectively, an untrained sigmoidal network and a trained one. The first surface resulted from varying two first-layer weights of an untrained network. The second surface resulted from varying the same two weights after the network was fully trained. The two surfaces are similar, but the second one has a deeper minimum which was carved out by the backpropagation learning process. Figs. 31 and 32 resulted from varying a different set of two weights in the same network. Fig. 31 is the result of varying a first-layer weight and a third-layer weight in the untrained network, whereas Fig. 32 is the surface that resulted from varying the same two weights after the network was trained.

Fig. 29. Example MSE surface of an untrained sigmoidal network as a function of two first-layer weights.
Fig. 30. Example MSE surface of a trained sigmoidal network as a function of two first-layer weights.
Fig. 31. Example MSE surface of an untrained sigmoidal network as a function of a first-layer weight and a third-layer weight.
Fig. 32. Example MSE surface of a trained sigmoidal network as a function of a first-layer weight and a third-layer weight.

By studying many plots, it becomes clear that backpropagation and MRIII will be subject to convergence on local optima. The same is true for MRII. The most common remedy for this is the sporadic addition of noise to the weights or gradients. Some of the "simulated annealing" methods [47] do this. Another method involves retraining the network several times using different random initial weight values until a satisfactory solution is found.

Solutions found by people in everyday life are usually not optimal, but many of them are useful. If a local optimum yields satisfactory performance, often there is simply no need to search for a better solution.

VIII. Summary

This year is the 30th anniversary of the publication of the Perceptron rule by Rosenblatt and the LMS algorithm by Widrow and Hoff. It has also been 16 years since Werbos first published the backpropagation algorithm. These learning rules and several others have been studied and compared. Although they differ significantly from each other, they all belong to the same "family."

A distinction was drawn between error-correction rules and steepest-descent rules. The former includes the Perceptron rule, Mays's rules, the α-LMS algorithm, the original Madaline I rule of 1962, and the Madaline II rule. The latter includes the μ-LMS algorithm, the Madaline III rule, and the backpropagation algorithm.
