
Neural Networks

David Kriesel

dkriesel.com

Download location: http://www.dkriesel.com/en/science/neural_networks

NEW – for the programmers: Scalable and efficient NN framework, written in JAVA: http://www.dkriesel.com/en/tech/snipe


In remembrance of Dr. Peter Kemp, Notary (ret.), Bonn, Germany.

D. Kriesel – A Brief Introduction to Neural Networks (ZETA2-EN)


A small preface

"Originally, this work has been prepared in the framework of a seminar of the University of Bonn in Germany, but it has been and will be extended (after being presented and published online under www.dkriesel.com on 5/27/2005) – first and foremost, to provide a comprehensive overview of the subject of neural networks and, second, just to acquire more and more knowledge about LaTeX. And who knows – maybe one day this summary will become a real preface!"

Abstract of this work, end of 2005

The above abstract has not yet become a preface, but at least a little preface, ever since the extended text (then 40 pages long) has turned out to be a download hit.

Ambition and intention of this manuscript

The entire text is written and laid out more effectively and with more illustrations than before. I did all the illustrations myself, most of them directly in LaTeX by using XYpic. They reflect what I would have liked to see when becoming acquainted with the subject: Text and illustrations should be memorable and easy to understand, to offer as many people as possible access to the field of neural networks.

Nevertheless, the mathematically and formally skilled readers will be able to understand the definitions without reading the running text, while the opposite holds for readers only interested in the subject matter; everything is explained in both colloquial and formal language. Please let me know if you find out that I have violated this principle.

The sections of this text are mostly independent from each other

The document itself is divided into different parts, which are again divided into chapters. Although the chapters contain cross-references, they are also individually accessible to readers with little previous knowledge. There are larger and smaller chapters: While the larger chapters should provide profound insight into a paradigm of neural networks (e.g. the classic neural network structure: the perceptron and its learning procedures), the smaller chapters give a short overview – but this is also explained in the introduction of each chapter. In addition to all the definitions and explanations I have included some excursuses to provide interesting information not directly related to the subject.

Unfortunately, I was not able to find free German sources that are multi-faceted in respect of content (concerning the paradigms of neural networks) and, nevertheless, written in coherent style. The aim of this work is (even if it could not be fulfilled at first go) to close this gap bit by bit and to provide easy access to the subject.

Want to learn not only by reading, but also by coding? Use SNIPE!

SNIPE[1] is a well-documented JAVA library that implements a framework for neural networks in a speedy, feature-rich and usable way. It is available at no cost for non-commercial purposes. It was originally designed for high-performance simulations with lots and lots of neural networks (even large ones) being trained simultaneously. Recently, I decided to give it away as a professional reference implementation that covers network aspects handled within this work, while at the same time being faster and more efficient than lots of other implementations due to the original high-performance simulation design goal. Those of you who are up for learning by doing and/or have to use a fast and stable neural networks implementation for some reasons should definitely have a look at Snipe.

However, the aspects covered by Snipe are not entirely congruent with those covered by this manuscript. Some kinds of neural networks are not supported by Snipe, while for other kinds of neural networks, Snipe may have lots and lots more capabilities than may ever be covered in the manuscript in the form of practical hints. Anyway, in my experience almost all of the implementation requirements of my readers are covered well. On the Snipe download page, look for the section "Getting started with Snipe" – you will find an easy step-by-step guide concerning Snipe and its documentation, as well as some examples.

Shaded Snipe-paragraphs like this one are scattered among large parts of the manuscript, providing information on how to implement their context in Snipe. The Snipe-paragraphs assume the reader has had a close look at the "Getting started with Snipe" section. Often, class names are used; as Snipe consists of only a few different packages, I omitted the package names within the qualified class names for the sake of readability. This also implies that those who do not want to use Snipe just have to skip the shaded Snipe-paragraphs!

[1] Scalable and Generalized Neural Information Processing Engine, downloadable at http://www.dkriesel.com/tech/snipe, online JavaDoc at http://snipe.dkriesel.com

It's easy to print this manuscript

This text is completely illustrated in color, but it can also be printed as is in monochrome: The colors of figures, tables and text are well-chosen so that, in addition to an appealing design, the colors are still easy to distinguish when printed in monochrome.

There are many tools directly integrated into the text

Different aids are directly integrated in the document to make reading more flexible: Anyone who (like me) prefers reading words on paper rather than on screen can also enjoy some features.

Speaking headlines throughout the text, short ones in the table of contents

The whole manuscript is now pervaded by speaking headlines, which are not just title-like ("Reinforcement Learning") but centralize the information given in the associated section to a single sentence. In the named instance, an appropriate headline would be "Reinforcement learning methods provide feedback to the network, whether it behaves good or bad". However, such long headlines would bloat the table of contents in an unacceptable way. So I used short titles like the first one in the table of contents, and speaking ones, like the latter, throughout the text.

Marginal notes are a navigational aid

The entire document contains marginal notes in colloquial language (see the example in the margin), allowing you to "scan" the document quickly to find a certain passage in the text (including the titles). New mathematical symbols are marked by specific marginal notes for easy finding (see the example for x in the margin) – hypertext on paper :-)

Different types of chapters are marked

Different types of chapters are directly marked within the table of contents. Chapters that are marked as "fundamental" are definitely ones to read, because almost all subsequent chapters heavily depend on them. Other chapters additionally depend on information given in other (preceding) chapters, which then is marked in the table of contents, too.

There are several kinds of indexing

This document contains different types of indexing: If you have found a word in the index and opened the corresponding page, you can easily find it by searching for highlighted text – all indexed words are highlighted like this. Mathematical symbols appearing in several chapters of this document (e.g. Ω for an output neuron; I tried to maintain a consistent nomenclature for regularly recurring elements) are separately indexed under "Mathematical Symbols", so they can easily be assigned to the corresponding term. Names of persons written in small caps are indexed in the category "Persons" and ordered by the last names.

Terms of use and license

Beginning with the epsilon edition, the text is licensed under the Creative Commons Attribution-No Derivative Works 3.0 Unported License[2], except for some little portions of the work licensed under more liberal licenses as mentioned (mainly some figures from Wikimedia Commons). A quick license summary:

1. You are free to redistribute this document (even though it is a much better idea to just distribute the URL of my homepage, for it always contains the most recent version of the text).

2. You may not modify, transform, or build upon the document except for personal use.

3. You must maintain the author's attribution of the document at all times.

4. You may not use the attribution to imply that the author endorses you or your document use.

For I'm no lawyer, the above bullet-point summary is just informational: If there is any conflict in interpretation between the summary and the actual license, the actual license always takes precedence. Note that this license does not extend to the source files used to produce the document. Those are still mine.

How to cite this manuscript

There's no official publisher, so you need to be careful with your citation. Please find more information in English and German language on my homepage, respectively the subpage concerning the manuscript[3].

[2] http://creativecommons.org/licenses/by-nd/3.0/
[3] http://www.dkriesel.com/en/science/neural_networks

Acknowledgement

Now I would like to express my gratitude to all the people who contributed, in whatever manner, to the success of this work, since a work like this needs many helpers. First of all, I want to thank the proofreaders of this text, who helped me and my readers very much. In alphabetical order: Wolfgang Apolinarski, Kathrin Gräve, Paul Imhoff, Thomas Kühn, Christian Schulz and Tobias Wilken.

Additionally, I want to thank the readers Dietmar Berger, Igor Buchmüller, Marie Christ, Julia Damaschek, Jochen Döll, Maximilian Ernestus, Hardy Falk, Anne Feldmeier, Sascha Fink, Andreas Friedmann, Jan Gassen, Markus Gerhards, Sebastian Hirsch, Andreas Hochrath, Nico Höft, Tim Hussein, Thomas Ihme, Boris Jentsch, Thilo Keller, Mario Krenn, Christoph Kunze, Mirko Kunze, Maikel Linke, Malte Lohmeyer, Adam Maciak, Gideon Maillette de Buij Wenniger, Benjamin Meier, David Möller, Andreas Müller, Joachim Nock, Rainer Penninger, Daniel Plohmann, Lena Reichel, Daniel Rosenthal, Alexander Schier, Matthias Siegmund, Mathias Tirtasana, Oliver Tischler, Maximilian Voit, Igor Wall, Achim Weber, Frank Weinreis, Philipp Woock and many others for their feedback, suggestions and remarks.

Additionally, I'd like to thank Sebastian Merzbach, who examined this work in a very conscientious way, finding inconsistencies and errors. In particular, he cleared lots and lots of language clumsiness from the English version.

Especially, I would like to thank Beate Kuhl for translating the entire text from German to English, and for her questions which made me think of changing the phrasing of some paragraphs.

I would particularly like to thank Prof. Rolf Eckmiller and Dr. Nils Goerke as well as the entire Division of Neuroinformatics, Department of Computer Science of the University of Bonn – they all made sure that I always learned (and also had to learn) something new about neural networks and related subjects. Especially Dr. Goerke has always been willing to respond to any questions I was not able to answer myself during the writing process. Conversations with Prof. Eckmiller made me step back from the whiteboard to get a better overall view on what I was doing and what I should do next.

Globally, and not only in the context of this work, I want to thank my parents who never get tired to buy me specialized and therefore expensive books and who have always supported me in my studies.

For many "remarks" and the very special and cordial atmosphere ;-) I want to thank Andreas Huber and Tobias Treutler. Since our first semester it has rarely been boring with you!

Now I would like to think back to my school days and cordially thank some teachers who (in my opinion) had imparted some scientific knowledge to me – although my class participation had not always been wholehearted: Mr. Wilfried Hartmann, Mr. Hubert Peters and Mr. Frank Nökel.

Furthermore I would like to thank the whole team at the notary's office of Dr. Kemp and Dr. Kolb in Bonn, where I have always felt to be in good hands and who have helped me to keep my printing costs low – in particular Christiane Flamme and Dr. Kemp!

Thanks go also to the Wikimedia Commons, where I took some (few) images and altered them to suit this text.

Last but not least I want to thank two people who made outstanding contributions to this work and who occupy, so to speak, a place of honor: my girlfriend Verena Thomas, who found many mathematical and logical errors in my text and discussed them with me, although she has lots of other things to do, and Christiane Schultze, who carefully reviewed the text for spelling mistakes and inconsistencies.

David Kriesel

Contents

A small preface

I From biology to formalization – motivation, philosophy, history and realization of neural models

1 Introduction, motivation and history
  1.1 Why neural networks?
    1.1.1 The 100-step rule
    1.1.2 Simple application examples
  1.2 History of neural networks
    1.2.1 The beginning
    1.2.2 Golden age
    1.2.3 Long silence and slow reconstruction
    1.2.4 Renaissance
  Exercises

2 Biological neural networks
  2.1 The vertebrate nervous system
    2.1.1 Peripheral and central nervous system
    2.1.2 Cerebrum
    2.1.3 Cerebellum
    2.1.4 Diencephalon
    2.1.5 Brainstem
  2.2 The neuron
    2.2.1 Components
    2.2.2 Electrochemical processes in the neuron
  2.3 Receptor cells
    2.3.1 Various types
    2.3.2 Information processing within the nervous system
    2.3.3 Light sensing organs
  2.4 The amount of neurons in living organisms
  2.5 Technical neurons as caricature of biology
  Exercises

3 Components of artificial neural networks (fundamental)
  3.1 The concept of time in neural networks
  3.2 Components of neural networks
    3.2.1 Connections
    3.2.2 Propagation function and network input
    3.2.3 Activation
    3.2.4 Threshold value
    3.2.5 Activation function
    3.2.6 Common activation functions
    3.2.7 Output function
    3.2.8 Learning strategy
  3.3 Network topologies
    3.3.1 Feedforward
    3.3.2 Recurrent networks
    3.3.3 Completely linked networks
  3.4 The bias neuron
  3.5 Representing neurons
  3.6 Orders of activation
    3.6.1 Synchronous activation
    3.6.2 Asynchronous activation
  3.7 Input and output of data
  Exercises

4 Fundamentals on learning and training samples (fundamental)
  4.1 Paradigms of learning
    4.1.1 Unsupervised learning
    4.1.2 Reinforcement learning
    4.1.3 Supervised learning
    4.1.4 Offline or online learning?
    4.1.5 Questions in advance
  4.2 Training patterns and teaching input
  4.3 Using training samples
    4.3.1 Division of the training set
    4.3.2 Order of pattern representation
  4.4 Learning curve and error measurement
    4.4.1 When do we stop learning?
  4.5 Gradient optimization procedures
    4.5.1 Problems of gradient procedures
  4.6 Exemplary problems
    4.6.1 Boolean functions
    4.6.2 The parity function
    4.6.3 The 2-spiral problem
    4.6.4 The checkerboard problem
    4.6.5 The identity function
    4.6.6 Other exemplary problems
  4.7 Hebbian rule
    4.7.1 Original rule
    4.7.2 Generalized form
  Exercises

II Supervised learning network paradigms

5 The perceptron, backpropagation and its variants
  5.1 The singlelayer perceptron
    5.1.1 Perceptron learning algorithm and convergence theorem
    5.1.2 Delta rule
  5.2 Linear separability
  5.3 The multilayer perceptron
  5.4 Backpropagation of error
    5.4.1 Derivation
    5.4.2 Boiling backpropagation down to the delta rule
    5.4.3 Selecting a learning rate
  5.5 Resilient backpropagation
    5.5.1 Adaption of weights
    5.5.2 Dynamic learning rate adjustment
    5.5.3 Rprop in practice
  5.6 Further variations and extensions to backpropagation
    5.6.1 Momentum term
    5.6.2 Flat spot elimination
    5.6.3 Second order backpropagation
    5.6.4 Weight decay
    5.6.5 Pruning and Optimal Brain Damage
  5.7 Initial configuration of a multilayer perceptron
    5.7.1 Number of layers
    5.7.2 The number of neurons
    5.7.3 Selecting an activation function
    5.7.4 Initializing weights
  5.8 The 8-3-8 encoding problem and related problems
  Exercises

6 Radial basis functions
  6.1 Components and structure
  6.2 Information processing of an RBF network
    6.2.1 Information processing in RBF neurons
    6.2.2 Analytical thoughts prior to the training
  6.3 Training of RBF networks
    6.3.1 Centers and widths of RBF neurons
  6.4 Growing RBF networks
    6.4.1 Adding neurons
    6.4.2 Limiting the number of neurons
    6.4.3 Deleting neurons
  6.5 Comparing RBF networks and multilayer perceptrons
  Exercises

7 Recurrent perceptron-like networks (depends on chapter 5)
  7.1 Jordan networks
  7.2 Elman networks
  7.3 Training recurrent networks
    7.3.1 Unfolding in time
    7.3.2 Teacher forcing
    7.3.3 Recurrent backpropagation
    7.3.4 Training with evolution

8 Hopfield networks
  8.1 Inspired by magnetism
  8.2 Structure and functionality
    8.2.1 Input and output of a Hopfield network
    8.2.2 Significance of weights
    8.2.3 Change in the state of neurons
  8.3 Generating the weight matrix
  8.4 Autoassociation and traditional application
  8.5 Heteroassociation and analogies to neural data storage
    8.5.1 Generating the heteroassociative matrix
    8.5.2 Stabilizing the heteroassociations
    8.5.3 Biological motivation of heterassociation
  8.6 Continuous Hopfield networks
  Exercises

9 Learning vector quantization
  9.1 About quantization
  9.2 Purpose of LVQ
  9.3 Using codebook vectors
  9.4 Adjusting codebook vectors
    9.4.1 The procedure of learning
  9.5 Connection to neural networks
  Exercises

III Unsupervised learning network paradigms

10 Self-organizing feature maps
  10.1 Structure
  10.2 Functionality and output interpretation
  10.3 Training
    10.3.1 The topology function
    10.3.2 Monotonically decreasing learning rate and neighborhood
  10.4 Examples
    10.4.1 Topological defects
  10.5 Adjustment of resolution and position-dependent learning rate
  10.6 Application
    10.6.1 Interaction with RBF networks
  10.7 Variations
    10.7.1 Neural gas
    10.7.2 Multi-SOMs
    10.7.3 Multi-neural gas
    10.7.4 Growing neural gas
  Exercises

11 Adaptive resonance theory
  11.1 Task and structure of an ART network
    11.1.1 Resonance
  11.2 Learning process
    11.2.1 Pattern input and top-down learning
    11.2.2 Resonance and bottom-up learning
    11.2.3 Adding an output neuron
  11.3 Extensions

IV Excursi, appendices and registers

A Excursus: Cluster analysis and regional and online learnable fields
  A.1 k-means clustering
  A.2 k-nearest neighboring
  A.3 ε-nearest neighboring
  A.4 The silhouette coefficient
  A.5 Regional and online learnable fields
    A.5.1 Structure of a ROLF
    A.5.2 Training a ROLF
    A.5.3 Evaluating a ROLF
    A.5.4 Comparison with popular clustering methods
    A.5.5 Initializing radii, learning rates and multiplier
    A.5.6 Application examples
  Exercises

B Excursus: neural networks used for prediction
  B.1 About time series
  B.2 One-step-ahead prediction
  B.3 Two-step-ahead prediction
    B.3.1 Recursive two-step-ahead prediction
    B.3.2 Direct two-step-ahead prediction
  B.4 Additional optimization approaches for prediction
    B.4.1 Changing temporal parameters
    B.4.2 Heterogeneous prediction
  B.5 Remarks on the prediction of share prices

C Excursus: reinforcement learning
  C.1 System structure
    C.1.1 The gridworld
    C.1.2 Agent and environment
    C.1.3 States, situations and actions
    C.1.4 Reward and return
    C.1.5 The policy
  C.2 Learning process
    C.2.1 Rewarding strategies
    C.2.2 The state-value function
    C.2.3 Monte Carlo method
    C.2.4 Temporal difference learning
    C.2.5 The action-value function
    C.2.6 Q learning
  C.3 Example applications
    C.3.1 TD gammon
    C.3.2 The car in the pit
    C.3.3 The pole balancer
  C.4 Reinforcement learning in connection with neural networks
  Exercises

Bibliography
List of Figures
Index
Part I

From biology to formalization – motivation, philosophy, history and realization of neural models
Chapter 1
Introduction, motivation and history

How to teach a computer? You can either write a fixed program – or you can enable the computer to learn on its own. Living beings do not have any programmer writing a program for developing their skills, which then only has to be executed. They learn by themselves – without the previous knowledge from external impressions – and thus can solve problems better than any computer today. What qualities are needed to achieve such a behavior for devices like computers? Can such cognition be adapted from biology? History, development, decline and resurgence of a wide approach to solve problems.

1.1 Why neural networks?

There are problem categories that cannot be formulated as an algorithm: problems that depend on many subtle factors, for example the purchase price of a real estate, which our brain can (approximately) calculate. Without an algorithm a computer cannot do the same. Therefore the question to be asked is: How do we learn to explore such problems?

Exactly – we learn, a capability computers obviously do not have. Humans have a brain that can learn. Computers have some processing units and memory. They allow the computer to perform the most complex numerical calculations in a very short time, but they are not adaptive.

If we compare computer and brain, we will note that, theoretically, the computer should be more powerful than our brain: it comprises 10^9 transistors with a switching time of 10^-9 seconds. The brain contains 10^11 neurons, but these only have a switching time of about 10^-3 seconds. (Of course, this comparison is – for obvious reasons – controversially discussed by biologists and computer scientists, since response time and quantity do not tell anything about quality and performance of the processing units, and neurons and transistors cannot be compared directly. Nevertheless, the comparison serves its purpose and indicates the advantage of parallelism by means of processing time.)

The largest part of the brain is working continuously, while the largest part of the computer is only passive data storage. Thus, the brain is parallel and therefore performing close to its theoretical maximum, from which the computer is orders of magnitude away (Table 1.1). Additionally, a computer is static – the brain as a biological neural network can reorganize itself during its "lifespan" and therefore is able to learn, to compensate errors and so forth.

Within this text I want to outline how we can use the said characteristics of our brain for a computer system. So the study of artificial neural networks is motivated by their similarity to successfully working biological systems, which – in comparison to the overall system – consist of very simple but numerous nerve cells that work massively in parallel and (which is probably one of the most significant aspects) have the capability to learn.

There is no need to explicitly program a neural network. For instance, it can learn from training samples or by means of encouragement – with a carrot and a stick, so to speak (reinforcement learning).

One result from this learning procedure is the capability of neural networks to generalize and associate data: after successful training a neural network can find reasonable solutions for similar problems of the same class that were not explicitly trained. This in turn results in a high degree of fault tolerance against noisy input data.

Fault tolerance is closely related to biological neural networks, in which this characteristic is very distinct: as previously mentioned, a human has about 10^11 neurons that continuously reorganize themselves or are reorganized by external influences (about 10^5 neurons can be destroyed while in a drunken stupor, and some types of food or environmental influences can also destroy brain cells). Nevertheless, our cognitive abilities are not significantly affected. Thus, the brain is tolerant against internal errors – and also against external errors, for we can often read a really "dreadful scrawl" although the individual letters are nearly impossible to read.

Our modern technology, however, is not automatically fault-tolerant. I have never heard that someone forgot to install the hard disk controller into a computer and therefore the graphics card automatically took over its tasks, i.e. removed conductors and developed communication, so that the system as a whole was affected by the missing component, but not completely destroyed.

Table 1.1: The (flawed) comparison between brain and computer at a glance. Inspired by: [Zel94]

                                 Brain                  Computer
  No. of processing units        ≈ 10^11                ≈ 10^9
  Type of processing units       neurons                transistors
  Type of calculation            massively parallel     usually serial
  Data storage                   associative            address-based
  Switching time                 ≈ 10^-3 s              ≈ 10^-9 s
  Possible switching operations  ≈ 10^13 per s          ≈ 10^18 per s
  Actual switching operations    ≈ 10^12 per s          ≈ 10^10 per s

A disadvantage of this distributed fault-tolerant storage is certainly the fact that we cannot realize at first sight what a neural network knows and performs or where its faults lie. Usually, it is easier to perform such analyses for conventional algorithms. Most often we can only transfer knowledge into our neural network by means of a learning procedure, which can cause several errors and is not always easy to manage.

Fault tolerance of data, on the other hand, is already more sophisticated in state-of-the-art technology: let us compare a record and a CD. If there is a scratch on a record, the audio information on this spot will be completely lost (you will hear a pop) and then the music goes on. On a CD the audio data are distributedly stored: a scratch causes a blurry sound in its vicinity, but the data stream remains largely unaffected. The listener won't notice anything.

So let us summarize the main characteristics we try to adapt from biology:

- self-organization and learning capability,
- generalization capability and
- fault tolerance.

What types of neural networks particularly develop what kinds of abilities, and can be used for what problem classes, will be discussed in the course of this work. In the introductory chapter I want to clarify the following: "The neural network" does not exist. There are different paradigms for neural networks, how they are trained and where they are used. My goal is to introduce some of these paradigms and supplement some remarks for practical application.

1.1.1 The 100-step rule

Experiments showed that a human can recognize the picture of a familiar object or person in ≈ 0.1 seconds, which corresponds to a neuron switching time of ≈ 10^-3 seconds in ≈ 100 discrete time steps of parallel processing. A computer following the von Neumann architecture, however, can do practically nothing in 100 time steps of sequential processing, which are 100 assembler steps or cycle steps.

We have already mentioned that our brain works massively in parallel, in contrast to the functioning of a computer, i.e. every component is active at any time. If we want to state an argument for massive parallel processing, then the 100-step rule can be cited.

Now we want to look at a simple application example for a neural network.

1.1.2 Simple application examples

Let us assume that we have a small robot as shown in fig. 1.1. This robot has eight distance sensors from which it extracts input data: three sensors are placed on the front right, three on the front left, and two on the back. Each sensor provides a real numeric value at any time, that means we are always receiving an input I ∈ R^8.

Figure 1.1: A small robot with eight sensors and two motors. The arrow indicates the driving direction.

Despite its two motors (which will be needed later) the robot in our simple example is not capable to do much: it shall only drive on, but stop when it might collide with an obstacle. Thus, our output is binary: H = 0 for "Everything is okay, drive on" and H = 1 for "Stop" (the output is called H for "halt signal"). Therefore we need a mapping f : R^8 → B^1 that applies the input signals to a robot activity.

1.1.2.1 The classical way

There are two ways of realizing this mapping. On the one hand, there is the classical way: we sit down and think for a while, and finally the result is a circuit or a small computer program which realizes the mapping (this is easily possible, since the example is very simple). After that we refer to the technical reference of the sensors, study their characteristic curve in order to learn the values for the different obstacle distances, and embed these values into the aforementioned set of rules. Such procedures are applied in the classic artificial intelligence, and if you know the exact rules of a mapping algorithm, you are always well advised to follow this scheme.
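The classical way can be made concrete with a small sketch. The eight sensors and the halt-signal convention are from the text; the threshold value 0.2 and the assumption that smaller readings mean closer obstacles are purely illustrative and not taken from the book.

```python
# Hedged sketch of the "classical way": a hand-written rule realizing the
# mapping f: R^8 -> B^1 from the text. The threshold 0.2 and the convention
# "smaller value = obstacle closer" are illustrative assumptions.

def halt_signal(sensors, threshold=0.2):
    """Returns 1 ("stop") if any front sensor reports an obstacle closer
    than the threshold, else 0 ("drive on")."""
    assert len(sensors) == 8
    front = sensors[:6]  # three front-right and three front-left sensors
    return 1 if any(s < threshold for s in front) else 0

print(halt_signal([0.9] * 8))                                  # free path -> 0
print(halt_signal([0.1, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]))   # obstacle -> 1
```

Note how every rule (which sensors matter, at which distance) had to be written down by hand here – exactly the effort the learning approach below avoids.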

1.1.2.2 The way of learning

On the other hand, more interesting and more successful for many mappings and problems that are hard to comprehend straightaway is the way of learning: we show different possible situations to the robot (fig. 1.2 on page 8), and the robot shall learn on its own what to do in the course of its robot life. In this example the robot shall simply learn when to stop.

We first treat the neural network as a kind of black box. This means we do not know its structure but just regard its behavior in practice: the black box receives eight real sensor values and maps these values to a binary output value.

The situations, in form of simply measured sensor values (e.g. placing the robot in front of an obstacle, see illustration), which we show to the robot and for which we specify whether to drive on or to stop, are called training samples. Thus, a training sample consists of an exemplary input and a corresponding desired output. Now the question is how to transfer this knowledge, the information, into the neural network.

The samples can be taught to a neural network by using a simple learning procedure (a learning procedure is a simple algorithm or a mathematical formula). If we have done everything right and chosen good samples, the neural network will generalize from these samples and find a universal rule when it has to stop.

Our example can be optionally expanded: for the purpose of direction control it would be possible to control the motors of our robot separately, with the sensor layout being the same. (There is a robot called Khepera with more or less similar characteristics: it is round-shaped, approx. 7 cm in diameter, has two motors with wheels and various sensors. For more information I recommend to refer to the internet.) In this case we are looking for a mapping f : R^8 → R^2, which gradually controls the two motors by means of the sensor inputs and thus cannot only, for example, stop the robot but also lets it avoid obstacles. Here it is more difficult to analytically derive the rules, and de facto a neural network would be more appropriate.

Our goal is not to learn the samples by heart, but to realize the principle behind them: ideally, the robot should apply the neural network in any situation and be able to avoid obstacles. In particular, the robot should query the network continuously and repeatedly while driving in order to continuously avoid obstacles. The result is a constant cycle: the robot queries the network, which makes it drive in one direction, which changes the sensor values; again the robot queries the network and changes its position, the sensor values are changed once again, and so on. It is obvious that this system can also be adapted to dynamic, i.e. changing, environments (e.g. the moving obstacles in our example).
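The way of learning can be sketched with a minimal perceptron-style learning rule. The idea of training samples (sensor input, desired halt signal) is from the text; the concrete samples, the learning rate and the number of epochs are invented for illustration, and the text has not yet committed to any particular learning procedure.

```python
# Hedged sketch of "the way of learning": training samples are shown
# repeatedly, and a simple perceptron-style rule adjusts the weights.
# Samples, learning rate and epoch count are illustrative assumptions.

def predict(weights, bias, x):
    """Black-box output: 1 ("stop") if the weighted sum is positive."""
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s > 0 else 0

def train(samples, epochs=50, eta=0.1):
    """Adjust weights towards the desired outputs of the samples."""
    weights, bias = [0.0] * 8, 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)
            weights = [w + eta * error * xi for w, xi in zip(weights, x)]
            bias += eta * error
    return weights, bias

# Two toy situations: free path -> drive on (0), near obstacle -> stop (1).
samples = [
    ([0.9] * 8, 0),
    ([0.1, 0.1, 0.2, 0.9, 0.9, 0.9, 0.9, 0.9], 1),
]
weights, bias = train(samples)
print([predict(weights, bias, x) for x, t in samples])  # -> [0, 1]
```

The point of the sketch is the division of labor: we only supply situations and desired outputs; the rule for when to stop emerges in the weights.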

Figure 1.2: The robot is positioned in a landscape that provides sensor values for different situations. We add the desired output values H and so receive our learning samples. The directions in which the sensors are oriented are exemplarily applied to two robots.

1.2 A brief history of neural networks

The field of neural networks has, like any other field of science, a long history of development with many ups and downs, as we will see soon. To continue the style of my work I will not represent this history in text form but more compactly in form of a timeline. Citations and bibliographical references are added mainly for those topics that will not be further discussed in this text. Citations for keywords that will be explained later are mentioned in the corresponding chapters.

The history of neural networks begins in the early 1940s and thus nearly simultaneously with the history of programmable electronic computers. The youth of this field of research, as with the field of computer science itself, can be easily recognized due to the fact that many of the cited persons are still with us.

Figure 1.4: Some institutions of the field of neural networks. From left to right: John von Neumann, Donald O. Hebb, Marvin Minsky, Bernard Widrow, Seymour Papert, Teuvo Kohonen, John Hopfield.

1.2.1 The beginning

As soon as 1943 Warren McCulloch and Walter Pitts introduced models of neurological networks, recreated threshold switches based on neurons and showed that even simple networks of this kind are able to calculate nearly any logic or arithmetic function [MP43]. Furthermore, the first computer precursors ("electronic brains") were developed, among others supported by Konrad Zuse, who was tired of calculating ballistic trajectories by hand.

1947: Walter Pitts and Warren McCulloch indicated a practical field of application (which was not mentioned in their work from 1943), namely the recognition of spatial patterns by neural networks [PM47].

1949: Donald O. Hebb formulated the classical Hebbian rule [Heb49], which represents in its more generalized form the basis of nearly all neural learning procedures. The rule implies that the connection between two neurons is strengthened when both neurons are active at the same time, and that this change in strength is proportional to the product of the two activities. Hebb could postulate this rule, but due to the absence of neurological research he was not able to verify it.

1950: The neuropsychologist Karl Lashley defended the thesis that brain information storage is realized as a distributed system. His thesis was based on experiments on rats, where only the extent but not the location of the destroyed nerve tissue influences the rats' performance to find their way out of a labyrinth.

1.2.2 Golden age

1951: For his dissertation Marvin Minsky developed the neurocomputer Snark, which was already capable of adjusting its weights automatically (we will learn soon what weights are). But it has never been practically implemented, since it is capable of busily calculating, but nobody really knows what it calculates.

1956: Well-known scientists and ambitious students met at the Dartmouth Summer Research Project and discussed, to put it crudely, how to simulate a brain. Differences between top-down and bottom-up research developed: while the early supporters of artificial intelligence wanted to simulate capabilities by means of software, supporters of neural networks wanted to achieve system behavior by imitating the smallest parts of the system – the neurons.
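The Hebbian rule from the 1949 entry can be written compactly. The following is a sketch in my own notation, which the text has not introduced: η is a learning rate, a_i and a_j are the activities of the two connected neurons, and w_ij is the strength of their connection.

```latex
% Hebbian learning (1949 entry above): the change of the connection
% strength between two neurons is proportional to the product of their
% activities. Notation (eta, a_i, a_j, w_ij) is my own, not the book's.
\Delta w_{ij} = \eta \, a_i \, a_j
```

Both neurons active at the same time means both activities are large, so the product – and therefore the strengthening of the connection – is large, exactly as the timeline entry describes.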

1957–1958: At the MIT, Frank Rosenblatt, Charles Wightman and their coworkers developed the first successful neurocomputer, the Mark I perceptron, which was capable of recognizing simple numerics by means of a 20 × 20 pixel image sensor and electromechanically worked with 512 motor-driven potentiometers – each potentiometer representing one variable weight.

1959: Frank Rosenblatt described different versions of the perceptron, formulated and verified his perceptron convergence theorem. He described neuron layers mimicking the retina, threshold switches, and a learning rule adjusting the connecting weights.

1960: Bernard Widrow and Marcian E. Hoff introduced the ADALINE (ADAptive LInear NEuron) [WH60], a fast and precise adaptive learning system being the first widely commercially used neural network: it could be found in nearly every analog telephone for real-time adaptive echo filtering and was trained by means of the Widrow-Hoff rule or delta rule. At that time Hoff, later co-founder of Intel Corporation, was a PhD student of Widrow, who himself is known as the inventor of modern microprocessors. One advantage the delta rule had over the original perceptron learning algorithm was its adaptivity: if the difference between the actual output and the correct solution was large, the connecting weights also changed in larger steps – the smaller the steps, the closer the target was. Disadvantage: misapplication led to infinitesimally small steps close to the target. In the following stagnation, and out of fear of scientific unpopularity of the neural networks, ADALINE was renamed adaptive linear element – which was undone again later on.

1961: Karl Steinbuch introduced technical realizations of associative memory, which can be seen as predecessors of today's neural associative memories [Ste61]. Additionally, he described concepts for neural techniques and analyzed their possibilities and limits.

1965: In his book Learning Machines, Nils Nilsson gave an overview of the progress and works of this period of neural network research. It was assumed that the basic principles of self-learning and therefore, generally speaking, "intelligent" systems had already been discovered. Today this assumption seems to be an exorbitant overestimation, but at that time it provided for high popularity and sufficient research funds.

1969: Marvin Minsky and Seymour Papert published a precise mathematical analysis of the perceptron [MP69] to show that the perceptron model was not capable of representing many important problems (keywords: XOR problem and linear separability), and so put an end to overestimation, popularity and research funds. The implication that more powerful models would show exactly the same problems, and the forecast that the entire field would be a research dead end, resulted in a nearly complete decline in research funds for the next 15 years – no matter how incorrect these forecasts were from today's point of view.

1.2.3 Long silence and slow reconstruction

The research funds were, as previously mentioned, extremely short. Everywhere research went on, but there were neither conferences nor other events and therefore only few publications. This isolation of individual researchers provided for many independently developed neural network paradigms: they researched, but there was no discourse among them. In spite of the poor appreciation the field received, the basic theories for the still continuing renaissance were laid at that time:

1972: Teuvo Kohonen introduced a model of the linear associator, a model of an associative memory [Koh72]. In the same year, such a model was presented independently and from a neurophysiologist's point of view by James A. Anderson [And72].

1973: Christoph von der Malsburg used a neuron model that was non-linear and biologically more motivated [vdM73].

1974: For his dissertation in Harvard Paul Werbos developed a learning procedure called backpropagation of error [Wer74], but it was not until one decade later that this procedure reached today's importance.

1976–1980 and thereafter: Stephen Grossberg presented many papers (for instance [Gro76]) in which numerous neural models are analyzed mathematically. Furthermore, he dedicated himself to the problem of keeping a neural network capable of learning without destroying already learned associations. Under cooperation of Gail Carpenter this led to models of adaptive resonance theory (ART).

1982: Teuvo Kohonen described the self-organizing feature maps (SOM) [Koh82, Koh98] – also known as Kohonen maps. He was looking for the mechanisms involving self-organization in the brain (he knew that the information about the creation of a being is stored in the genome, which has, however, not enough memory for a structure like the brain; as a consequence, the brain has to organize and create itself for the most part).

John Hopfield also invented the so-called Hopfield networks [Hop82], which are inspired by the laws of magnetism in physics. They were not widely used in technical applications, but the field of neural networks slowly regained importance.

1983: Fukushima, Miyake and Ito introduced the neural model of the Neocognitron, which could recognize handwritten characters [FMI83] and was an extension of the Cognitron network already developed in 1975.

1.2.4 Renaissance

Through the influence of John Hopfield, who had personally convinced many researchers of the importance of the field, and the wide publication of backpropagation by Rumelhart, Hinton and Williams, the field of neural networks slowly showed signs of upswing. At the same time a certain kind of fatigue spread in the field of artificial intelligence, caused by a series of failures and unfulfilled hopes.

1985: John Hopfield published an article describing a way of finding acceptable solutions for the Travelling Salesman problem by using Hopfield nets.

1986: The backpropagation of error learning procedure as a generalization of the delta rule was separately developed and widely published by the Parallel Distributed Processing Group [RHW86a]: non-linearly-separable problems could be solved by multilayer perceptrons, and Marvin Minsky's negative evaluations were disproven at a single blow.

From this time on, the development of the field of research has almost been explosive. It can no longer be itemized, but some of its results will be seen in the following.

Exercises

Exercise 1. Give one example for each of the following topics:
- a book on neural networks or neuroinformatics,
- a collaborative group of a university working with neural networks,
- a software tool realizing neural networks ("simulator"),
- a company using neural networks, and
- a product or service being realized by means of neural networks.

Exercise 2. Show at least four applications of technical neural networks: two from the field of pattern recognition and two from the field of function approximation.

Exercise 3. Briefly characterize the four development phases of neural networks and give expressive examples for each phase.

Chapter 2
Biological neural networks

How do biological systems solve problems? How does a system of neurons work? How can we understand its functionality? What are diﬀerent quantities of neurons able to do? Where in the nervous system does information processing occur? A short biological overview of the complexity of simple elements of neural information processing followed by some thoughts about their simpliﬁcation in order to technically adapt them.

Before we begin to describe the technical side of neural networks, it would be useful to brieﬂy discuss the biology of neural networks and the cognition of living organisms – the reader may skip the following chapter without missing any technical information. On the other hand I recommend to read the said excursus if you want to learn something about the underlying neurophysiology and see that our small approaches, the technical neural networks, are only caricatures of nature – and how powerful their natural counterparts must be when our small approaches are already that eﬀective. Now we want to take a brief look at the nervous system of vertebrates: We will start with a very rough granularity and then proceed with the brain and up to the neural level. For further reading I want to recommend the books [CR00, KSJ00], which helped me a lot during this chapter.

2.1 The vertebrate nervous system

The entire information processing system, i.e. the vertebrate nervous system, consists of the central nervous system and the peripheral nervous system, which is only a ﬁrst and simple subdivision. In reality, such a rigid subdivision does not make sense, but here it is helpful to outline the information processing in a body.

2.1.1 Peripheral and central nervous system

The peripheral nervous system (PNS) comprises the nerves that are situated outside of the brain or the spinal cord. These nerves form a branched and very dense network throughout the whole body. The peripheral nervous system includes, for example, the spinal nerves which pass out of the spinal cord (two within the level of each vertebra of the spine) and supply extremities, neck and trunk, but also the cranial nerves directly leading to the brain.

The central nervous system (CNS), however, is the "main-frame" within the vertebrate. It is the place where information received by the sense organs is stored and managed. Furthermore, it controls the inner processes in the body and, last but not least, coordinates the motor functions of the organism. The vertebrate central nervous system consists of the brain and the spinal cord (Fig. 2.1). However, we want to focus on the brain, which can – for the purpose of simplification – be divided into four areas (Fig. 2.2 on the next page) to be discussed here.


2.1.2 The cerebrum is responsible for abstract thinking processes

The cerebrum (telencephalon) is one of the areas of the brain that changed most during evolution. Along an axis, running from the lateral face to the back of the head, this area is divided into two hemispheres, which are organized in a folded structure. These cerebral hemispheres are connected by one strong nerve cord ("bar") and several small ones. A large number of neurons are located in the cerebral cortex (cortex), which is approx. 2-4 cm thick and divided into different cortical fields, each having a specific task to fulfill.

Figure 2.1: Illustration of the central nervous system with spinal cord and brain.

Primary cortical fields are responsible for processing qualitative information, such as the management of different perceptions (e.g. the visual cortex is responsible for the management of vision). Association cortical fields, however, perform more abstract association and thinking processes; they also contain our memory.

Figure 2.2: Illustration of the brain. The colored areas of the brain are discussed in the text. The more we turn from abstract information processing to direct reflexive processing, the darker the areas of the brain are colored.

2.1.3 The cerebellum controls and coordinates motor functions

The cerebellum is located below the cerebrum, therefore it is closer to the spinal cord. Accordingly, it serves less abstract functions with higher priority: here, large parts of motor coordination are performed, i.e. balance and movements are controlled and errors are continually corrected. For this purpose, the cerebellum has direct sensory information about muscle lengths as well as acoustic and visual information. Furthermore, it also receives messages about more abstract motor signals coming from the cerebrum.

In the human brain the cerebellum is considerably smaller than the cerebrum, but this is rather an exception: in many vertebrates this ratio is less pronounced. If we take a look at vertebrate evolution, we will notice that the cerebellum is not "too small" but the cerebrum is "too large" (at least, it is the most highly developed structure in the vertebrate brain). The two remaining brain areas should also be briefly discussed: the diencephalon and the brainstem.

2.1.4 The diencephalon controls fundamental physiological processes

The interbrain (diencephalon) includes parts of which only the thalamus will be briefly discussed: this part of the diencephalon mediates between sensory and motor signals and the cerebrum. Particularly, the thalamus decides which part of the information is transferred to the cerebrum, so that especially less important sensory perceptions can be suppressed at short notice to avoid overloads. Another part of the diencephalon is the hypothalamus, which controls a number of processes within the body. The diencephalon is also heavily involved in the human circadian rhythm ("internal clock") and the sensation of pain.

Chapter 2 Biological neural networks

dkriesel.com

is also heavily involved in the human cir- All parts of the nervous system have one cadian rhythm ("internal clock") and the thing in common: information processing. This is accomplished by huge accumulasensation of pain. tions of billions of very similar cells, whose structure is very simple but which communicate continuously. Large groups of 2.1.5 The brainstem connects the brain with the spinal cord and these cells send coordinated signals and thus reach the enormous information procontrols reﬂexes. cessing capacity we are familiar with from our brain. We will now leave the level of In comparison with the diencephalon the brain areas and continue with the cellular brainstem or the (truncus cerebri ) relevel of the body - the level of neurons. spectively is phylogenetically much older. Roughly speaking, it is the "extended spinal cord" and thus the connection between brain and spinal cord. The brain- 2.2 Neurons are information stem can also be divided into diﬀerent arprocessing cells eas, some of which will be exemplarily introduced in this chapter. The functions will be discussed from abstract functions Before specifying the functions and protowards more fundamental ones. One im- cesses within a neuron, we will give a portant component is the pons (=bridge), rough description of neuron functions: A a kind of transit station for many nerve sig- neuron is nothing more than a switch with nals from brain to body and vice versa. information input and output. The switch will be activated if there are enough stimIf the pons is damaged (e.g. by a cere- uli of other neurons hitting the informabral infarct), then the result could be the tion input. Then, at the information outlocked-in syndrome – a condition in put, a pulse is sent to, for example, other which a patient is "walled-in" within his neurons. own body. He is conscious and aware with no loss of cognitive function, but cannot move or communicate by any means. 
2.2.1 Components of a neuron Only his senses of sight, hearing, smell and taste are generally working perfectly normal. Locked-in patients may often be able Now we want to take a look at the comto communicate with others by blinking or ponents of a neuron (Fig. 2.3 on the facing page). In doing so, we will follow the moving their eyes. way the electrical information takes within Furthermore, the brainstem is responsible the neuron. The dendrites of a neuron for many fundamental reﬂexes, such as the receive the information by special connections, the synapses. blinking reﬂex or coughing.


D. Kriesel – A Brief Introduction to Neural Networks (ZETA2-EN)

2.2.1.1 Synapses weight the individual parts of information

Incoming signals from other neurons or cells are transferred to a neuron by special connections, the synapses. Such connections can usually be found at the dendrites of a neuron, sometimes also directly at the soma. We distinguish between electrical and chemical synapses.

The electrical synapse is the simpler variant. An electrical signal received by the synapse, i.e. coming from the presynaptic side, is directly transferred to the postsynaptic nucleus of the cell. Thus, there is a direct, strong, unadjustable connection between the signal transmitter and the signal receiver, which is, for example, relevant to shortening reactions that must be "hard coded" within a living organism.

The chemical synapse is the more distinctive variant. Here, the electrical coupling of source and target does not take place; the coupling is interrupted by the synaptic cleft, which electrically separates the presynaptic side from the postsynaptic one. You might think that the information nevertheless has to flow, so we will discuss how this happens: it is not an electrical but a chemical process. On the presynaptic side of the synaptic cleft the electrical signal is converted into a chemical signal by chemical cues released there, the so-called neurotransmitters. These neurotransmitters cross the synaptic cleft and transfer the information into the nucleus of the cell (this is a very simplified explanation, but later on we will see how this exactly works), where it is reconverted into electrical information. The neurotransmitters are degraded very fast, so that it is possible to release very precise information pulses here, too.

Figure 2.3: Illustration of a biological neuron with the components discussed in this text.

In spite of its more complex functioning, the chemical synapse has – compared with the electrical synapse – considerable advantages:

One-way connection: A chemical synapse is a one-way connection. Since there is no direct electrical connection between the pre- and postsynaptic areas, electrical pulses in the postsynaptic area cannot flash over to the presynaptic area.

Adjustability: There is a large number of different neurotransmitters that can also be released in various quantities in a synaptic cleft. There are neurotransmitters that stimulate the postsynaptic cell nucleus and others that slow down such stimulation. Some synapses transfer a strongly stimulating signal, some only weakly stimulating ones. The adjustability varies a lot, and one of the central points in the examination of the learning ability of the brain is that the synapses are variable, too: over time they can form a stronger or weaker connection.

2.2.1.2 Dendrites collect all parts of information

Dendrites branch like trees from the cell nucleus of the neuron (which is called soma) and receive electrical signals from many different sources, which are then transferred into the nucleus of the cell. The branching dendrites are also called the dendrite tree.

2.2.1.3 In the soma the weighted information is accumulated

After the cell nucleus (soma) has received plenty of activating (= stimulating) and inhibiting (= diminishing) signals via synapses or dendrites, the soma accumulates these signals. As soon as the accumulated signal exceeds a certain value (called the threshold value), the cell nucleus of the neuron activates an electrical pulse which is then transmitted to the neurons connected to the current one.

2.2.1.4 The axon transfers outgoing pulses

The pulse is transferred to other neurons by means of the axon. The axon is a long, slender extension of the soma. In an extreme case, an axon can stretch up to one meter (e.g. within the spinal cord). The axon is electrically isolated in order to achieve a better conduction of the electrical signal (we will return to this point later on), and it leads to dendrites, which transfer the information to, for example, other neurons. So now we are back at the beginning of our description of the neuron elements. An axon can, however, also transfer information to other kinds of cells in order to control them.
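The interplay just described, synapses weighting the incoming pulses and the soma accumulating them until a threshold is exceeded, can be sketched as follows. Positive weights stand for stimulating synapses, negative weights for inhibiting ones; all numbers are invented for illustration:

```python
def soma_accumulate(inputs, weights, threshold=1.0):
    """Accumulate synaptically weighted inputs in the soma.
    Positive weights model stimulating synapses, negative weights
    inhibiting ones. Returns the accumulated signal and whether
    it exceeds the threshold value (i.e. whether a pulse fires)."""
    accumulated = sum(x * w for x, w in zip(inputs, weights))
    return accumulated, accumulated >= threshold

# Two stimulating synapses and one inhibiting synapse:
total, fires = soma_accumulate([1.0, 1.0, 1.0], [0.8, 0.6, -0.3])
# total = 0.8 + 0.6 - 0.3 = 1.1, which exceeds the threshold of 1.0
```

Making these weights variable over time is, as the text notes, exactly what learning will later mean in the technical setting.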

2.2.2 Electrochemical processes in the neuron and its components

After having pursued the path of an electrical signal from the dendrites via the synapses to the nucleus of the cell and from there via the axon into other dendrites, we now want to take a small step from biology towards technology. In doing so, a simplified introduction to the electrochemical information processing will be provided.

2.2.2.1 Neurons maintain electrical membrane potential

One fundamental aspect is the fact that, compared to their environment, neurons show a difference in electrical charge, a potential. At the membrane (= envelope) of the neuron, the charge on the inside differs from the charge on the outside. This difference in charge is a central concept that is important to understand the processes within the neuron. The difference is called the membrane potential. The membrane potential, i.e. the difference in charge, is created by several kinds of charged atoms (ions), whose concentration varies within and outside of the neuron. If we penetrate the membrane from the inside outwards, we will find certain kinds of ions more often or less often than on the inside. This descent or ascent of concentration is called a concentration gradient.

Let us first take a look at the membrane potential in the resting state of the neuron, i.e. we assume that no electrical signals are received from the outside. In this case, the membrane potential is −70 mV. Since we have learned that this potential depends on the concentration gradients of various ions, there is of course the central question of how to maintain these concentration gradients: normally, diffusion predominates, and therefore each ion is eager to decrease concentration gradients and to spread out evenly. If this happens, the membrane potential will move towards 0 mV, so finally there would be no membrane potential anymore. Thus, the neuron actively maintains its membrane potential to be able to process information. How does this work? The secret is the membrane itself, which is permeable to some ions but not to others. To maintain the potential, various mechanisms are in progress at the same time:

Concentration gradient: As described above, the ions try to be as uniformly distributed as possible. If the concentration of an ion is higher on the inside of the neuron than on the outside, it will try to diffuse to the outside, and vice versa. The positively charged ion K+ (potassium) occurs very frequently within the neuron but less frequently outside of it, and therefore it slowly diffuses out through the neuron's membrane. But another group of negative ions, collectively called A−, remains within the neuron since the membrane is not permeable to them. Thus, the inside of the neuron becomes negatively charged.

Negative A ions remain, positive K ions disappear, and so the inside of the cell becomes more negative. The result is another gradient.

Electrical gradient: The electrical gradient acts contrary to the concentration gradient. The intracellular charge is now very negative, and therefore it attracts positive ions: K+ wants to get back into the cell.

If these two gradients were now left alone, they would eventually balance out, reach a steady state, and a membrane potential of −85 mV would develop. But we want to achieve a resting membrane potential of −70 mV, thus there seem to exist some disturbances which prevent this. The reason is another important ion, Na+ (sodium), for which the membrane is not very permeable but which nevertheless slowly pours through the membrane into the cell. The sodium is driven into the cell all the more: on the one hand, there is less sodium within the neuron than outside it. On the other hand, sodium is positively charged but the interior of the cell has negative charge, which is a second reason for the sodium wanting to get into the cell.

Due to the slow diffusion of sodium into the cell, the intracellular sodium concentration increases. But at the same time the inside of the cell becomes less negative, so that K+ pours in more slowly (we can see that this is a complex mechanism where everything is influenced by everything). The sodium shifts the intracellular equilibrium from negative to less negative. Even with these two ions, a standstill with all gradients balanced out could still be achieved eventually. Now the last piece of the puzzle gets into the game: a "pump" (or rather, a transport protein fuelled by ATP) actively transports ions against the direction they actually want to take!

Sodium is actively pumped out of the cell, although it tries to get into the cell along the concentration gradient and the electrical gradient.

Potassium, however, diffuses strongly out of the cell, but is actively pumped back into it.

For this reason the pump is also called the sodium-potassium pump. The pump maintains the concentration gradient for the sodium as well as for the potassium, so that some sort of steady-state equilibrium is created and finally the resting potential is −70 mV, as observed. All in all, the membrane potential is maintained by the fact that the membrane is impermeable to some ions while other ions are actively pumped against the concentration and electrical gradients. Now that we know that each neuron has a membrane potential, we want to observe how a neuron receives and transmits signals.

2.2.2.2 The neuron is activated by changes in the membrane potential

Above we have learned that sodium and potassium can diffuse through the membrane – sodium slowly, potassium faster.
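The numbers quoted here (−85 mV for potassium alone, −70 mV at rest) can be made plausible with the Nernst equation, a standard electrochemistry formula that the text does not derive: it gives the potential at which diffusion and electrical attraction balance for a single ion species. The concentration values below are typical textbook figures, not taken from this text:

```python
import math

def nernst_mV(c_out, c_in, z=1, T=310.0):
    """Equilibrium (Nernst) potential of one ion species in millivolts:
    E = (R*T)/(z*F) * ln(c_out/c_in), with the gas constant R, the
    Faraday constant F, charge number z, and body temperature T = 310 K."""
    R, F = 8.314, 96485.0
    return 1000.0 * (R * T) / (z * F) * math.log(c_out / c_in)

# Typical textbook concentrations in mmol/l (an assumption):
E_K = nernst_mV(c_out=5.0, c_in=140.0)    # potassium: roughly -89 mV
E_Na = nernst_mV(c_out=145.0, c_in=12.0)  # sodium: roughly +67 mV
```

Potassium alone would indeed pull the membrane towards roughly −85 to −90 mV, while the slow sodium leak pulls towards a strongly positive value; the observed −70 mV is the compromise maintained by the sodium-potassium pump.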

They move through channels within the membrane, the sodium and potassium channels. In addition to these permanently open channels, which are responsible for diffusion and balanced by the sodium-potassium pump, there also exist channels that are not always open but which only open "if required". Since the opening of such channels changes the concentration of ions within and outside of the membrane, it also changes the membrane potential.

These controllable channels are opened as soon as the accumulated received stimulus exceeds a certain threshold. Stimuli can be received from other neurons or have other causes. There exist, for example, specialized forms of neurons, the sensory cells, for which an incidence of light could be such a stimulus: if the incoming amount of light exceeds the threshold, the neuron is activated and an electrical signal, an action potential, is initiated. The said threshold (the threshold potential) lies at about −55 mV. As soon as the received stimuli reach this value, the neuron is activated. Now we want to take a closer look at the different stages of the action potential (Fig. 2.4 on the next page):

Resting state: Only the permanently open sodium and potassium channels are permeable. The membrane potential is at −70 mV and actively kept there by the neuron.

Stimulus up to the threshold: A stimulus opens channels so that sodium can pour in. The intracellular charge becomes more positive. As soon as the membrane potential exceeds the threshold of −55 mV, the action potential is initiated by the opening of many sodium channels.

Depolarization: Sodium is pouring in. Remember: sodium wants to pour into the cell because there is a lower intracellular than extracellular concentration of sodium. Additionally, the cell is dominated by a negative environment which attracts the positive sodium ions. This massive influx of sodium drastically increases the membrane potential – up to approx. +30 mV – which is the electrical pulse, i.e. the action potential.

Repolarization: Now the sodium channels are closed and the potassium channels are opened. The positively charged ions want to leave the positive interior of the cell. Additionally, the intracellular concentration is much higher than the extracellular one, which increases the efflux of ions even more. The interior of the cell is once again more negatively charged than the exterior.

Hyperpolarization: Sodium as well as potassium channels are closed again. At first the membrane potential is slightly more negative than the resting potential. This is due to the fact that the potassium channels close more slowly, so that (positively charged) potassium keeps effusing because of its lower extracellular concentration.

Figure 2.4: Initiation of the action potential over time.

After a refractory period of 1–2 ms the resting state is re-established, so that the neuron can react to newly applied stimuli with an action potential. In simple terms, the refractory period is a mandatory break a neuron has to take in order to regenerate. The shorter this break is, the more often a neuron can fire per time. Then the resulting pulse is transmitted by the axon.

2.2.2.3 In the axon a pulse is conducted in a saltatory way

We have already learned that the axon is used to transmit the action potential across long distances (remember: you will find an illustration of a neuron including an axon in Fig. 2.3 on page 17). The axon is a long, slender extension of the soma. In vertebrates it is normally coated by a myelin sheath that consists of Schwann cells (in the PNS) or oligodendrocytes (in the CNS)¹, which insulate the axon very well from electrical activity. At a distance of 0.1–2 mm there are gaps between these cells, the so-called nodes of Ranvier. These gaps appear where one insulating cell ends and the next one begins. It is obvious that at such a node the axon is less insulated.

¹ Schwann cells as well as oligodendrocytes are varieties of the glial cells. There are about 50 times more glial cells than neurons: they surround the neurons (glia = glue), insulate them from each other, provide energy, etc.

Now you may assume that these less insulated nodes are a disadvantage of the axon – however, they are not. At the nodes, mass can be transferred between the intracellular and extracellular area, a transfer that is impossible at those parts of the axon which are situated between two nodes (internodes) and therefore insulated by the myelin sheath. This mass transfer permits the generation of signals similar to the generation of the action potential within the soma.

The action potential is transferred as follows: it does not continuously travel along the axon but jumps from node to node. Thus, a series of depolarizations travels along the nodes of Ranvier. One action potential initiates the next one, and mostly even several nodes are active at the same time here. The pulse "jumping" from node to node gives this kind of pulse conduction its name: saltatory conduction.

Obviously, the pulse will move faster if its jumps are larger. Axons with large internodes (2 mm) achieve a signal dispersion of approx. 180 meters per second. However, the internodes cannot grow indefinitely, since the action potential to be transferred would fade too much before it reaches the next node. So the nodes have a task, too: to constantly amplify the signal. The cells receiving the action potential are attached to the end of the axon – often connected by dendrites and synapses. As already indicated above, action potentials are not only generated by information received by the dendrites from other neurons.
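The stages described above (rest at −70 mV, threshold at −55 mV, spike up to +30 mV, hyperpolarization, refractory break) can be caricatured with a leaky integrate-and-fire model. This is not the biophysical model of the text, just a numerical toy whose constants were chosen to mimic the quoted voltages:

```python
def simulate_lif(current, dt=0.1, v_rest=-70.0, v_thresh=-55.0,
                 v_spike=30.0, v_hyper=-75.0, tau=10.0, refractory=2.0):
    """Leaky integrate-and-fire caricature of the action potential:
    the potential decays towards v_rest, is driven up by the input
    current, jumps to v_spike when it crosses the threshold, then
    drops below rest (hyperpolarization) and pauses for a
    refractory period in which stimuli are ignored."""
    v, wait, trace = v_rest, 0.0, []
    for i_t in current:
        if wait > 0:                      # refractory break: regenerate
            wait -= dt
            v = v_hyper
        else:
            v += dt * ((v_rest - v) / tau + i_t)
            if v >= v_thresh:             # threshold reached: fire
                trace.append(v_spike)
                v, wait = v_hyper, refractory
                continue
        trace.append(v)
    return trace

trace = simulate_lif([20.0] * 100)  # constant suprathreshold stimulus
```

With a constant stimulus the model fires periodically; the refractory period is what limits how often it can fire per time, exactly as the text argues.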

2.3 Receptor cells are modified neurons

Action potentials can also be generated by sensory information an organism receives from its environment through its sensory cells. Specialized receptor cells are able to perceive specific stimulus energies such as light, temperature and sound, or the existence of certain molecules (as, for example, in the sense of smell). This works because these sensory cells are actually modified neurons. They do not receive electrical signals via dendrites; instead, the presence of the stimulus that is specific for the receptor cell ensures that the ion channels open and an action potential is developed. This process of transforming stimulus energy into changes in the membrane potential is called sensory transduction. Usually, the stimulus energy itself is too weak to directly cause nerve signals. Therefore, the signals are amplified either during transduction or by means of the stimulus-conducting apparatus. The resulting action potential can be processed by other neurons and is then transmitted into the thalamus, which is, as we have already learned, a gateway to the cerebral cortex and can therefore reject sensory impressions according to current relevance and thus prevent an abundance of information from having to be managed.

2.3.1 There are different receptor cells for various types of perceptions

Primary receptors transmit their pulses directly to the nervous system. A good example for this is the sense of pain. Here, the stimulus intensity is proportional to the amplitude of the action potential. Technically, this is an amplitude modulation.

Secondary receptors, however, continuously transmit pulses. These pulses control the amount of the related neurotransmitter, which is responsible for transferring the stimulus. The stimulus in turn controls the frequency of the action potential of the receiving neuron. This is a frequency modulation, an encoding of the stimulus, which allows the increase and decrease of a stimulus to be perceived more precisely.

There can be individual receptor cells or cells forming complex sensory organs (e.g. eyes or ears). They can receive stimuli within the body (by means of the interoceptors) as well as stimuli outside of the body (by means of the exteroceptors). After having outlined how information is received from the environment, it will be interesting to look at how it is processed.
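The frequency modulation performed by secondary receptors can be sketched numerically: a stronger stimulus produces more spikes per second, never taller spikes. The mapping and all constants below are invented for illustration; they are not taken from the text:

```python
def rate_code(intensity, max_rate=100.0, threshold=0.1):
    """Toy frequency modulation: encode stimulus intensity as a firing
    rate (spikes per second), saturating at max_rate. Below the
    threshold the receptor stays silent. All constants are invented."""
    if intensity <= threshold:
        return 0.0
    return min(max_rate, max_rate * (intensity - threshold))

# A stronger stimulus yields a higher firing rate, up to saturation:
assert rate_code(0.05) == 0.0
assert rate_code(0.6) > rate_code(0.3) > 0.0
```

Contrast this with the amplitude modulation of primary receptors, where it is the size of the action potential, not its frequency, that tracks the stimulus intensity.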

2.3.2 Information is processed on every level of the nervous system

There is no reason to believe that all received information is transmitted to the brain and processed there, and that the brain then ensures that it is "output" in the form of motor pulses (the only thing an organism can actually do within its environment is to move). On closer examination, the information processing is entirely decentralized. In order to illustrate this principle, we want to take a look at some examples, which lead us again from the abstract to the fundamental in our hierarchy of information processing.

It is certain that information is processed in the cerebrum, which is the most developed natural information processing structure. The filtering of information with respect to current relevance, as executed by the midbrain, is a very important method of information processing, too. The midbrain and the thalamus, which serves – as we have already learned – as a gateway to the cerebral cortex, are situated much lower in the hierarchy. But even the thalamus does not receive stimuli from the outside that have not already been preprocessed. Now let us continue with the lowest level, the sensory cells.

On the lowest level, the receptor cells, information is not only received and transferred but directly processed. One of the main aspects of this is to prevent the transmission of "continuous stimuli" to the central nervous system, by means of sensory adaptation: due to continuous stimulation, many receptor cells automatically become insensitive to stimuli. Thus, receptor cells are not a direct mapping of specific stimulus energy onto action potentials but depend on the past. Other sensors change their sensitivity according to the situation: there are taste receptors which respond more or less strongly to the same stimulus depending on the nutritional condition of the organism.

Even before a stimulus reaches the receptor cells, information processing can already be performed by a preceding signal-carrying apparatus, for example in the form of amplification: the external and the internal ear have a specific shape to amplify the sound. This also allows – in association with the sensory cells of the sense of hearing – the sensory stimulus to increase only logarithmically with the intensity of the heard signal. On closer examination, this logarithmic measurement is an advantage, since the sound pressure of the signals for which the ear is constructed can vary over a wide exponential range. Firstly, an overload is prevented, and secondly, the fact that the intensity measurement of intensive signals will be less precise does not matter much: if a jet fighter is starting next to you, small changes in the noise level can be ignored.
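The advantage of a logarithmic intensity measurement can be illustrated numerically: equal ratios of sound pressure map to equal steps of sensation, so a millionfold pressure range collapses into a narrow perceptual scale (this is essentially the decibel idea; the reference value below is arbitrary):

```python
import math

def perceived_loudness(pressure, p_ref=1.0):
    """Logarithmic response, as in the ear: equal *ratios* of sound
    pressure give equal *steps* of sensation (a decibel-like scale).
    p_ref is an arbitrary reference pressure."""
    return 20.0 * math.log10(pressure / p_ref)

# Pressures spanning six orders of magnitude, 1 .. 10^6:
levels = [perceived_loudness(10 ** k) for k in range(7)]
# The whole range collapses into equal 20-unit steps, so a small
# change on top of a jet-fighter-level signal barely registers.
```

Doubling the pressure always adds the same fixed amount to the perceived level, no matter whether the starting point is a whisper or a jet engine.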

2.3.3 An outline of common light sensing organs

For many organisms it turned out to be extremely useful to be able to perceive electromagnetic radiation in certain regions of the spectrum. Consequently, sensory organs have been developed which can detect such electromagnetic radiation. The wavelength range of the radiation perceivable by the human eye is called the visible range or simply light, and the different wavelengths of this electromagnetic radiation are perceived by the human eye as different colors. The visible range of the electromagnetic radiation is different for each organism: some organisms cannot see the colors (= wavelength ranges) we can see, while others can even perceive additional wavelength ranges (e.g. in the UV range). Before we begin with the human being – in order to get a broader knowledge of the sense of sight – we briefly want to look at two organs of sight which, from an evolutionary point of view, have existed much longer than the human one.

2.3.3.1 Compound eyes and pinhole eyes only provide high temporal or high spatial resolution

Let us first take a look at the so-called compound eye (Fig. 2.5 on the next page), which is common, for example, in insects and crustaceans. The compound eye consists of a great number of small individual eyes. If we look at the compound eye from the outside, the individual eyes are clearly visible and arranged in a hexagonal pattern. Since the individual eyes can be distinguished, it is obvious that the number of pixels, i.e. the spatial resolution, of compound eyes must be very low, and thus the image is blurred. But compound eyes have advantages, too, especially for fast-flying insects: certain compound eyes process more than 300 images per second (to the human eye, movies with 25 images per second appear as a fluent motion). Each individual eye has its own nerve fiber which is connected to the insect brain.

Pinhole eyes are, for example, found in octopus species and work – as you can guess – similarly to a pinhole camera. A pinhole eye has a very small opening for light entry, which projects a sharp image onto the sensory cells behind it. Thus, the spatial resolution is much higher than in the compound eye. But due to the very small opening for light entry the resulting image is less bright, and the temporal resolution is low.

For the third light sensing organ described below, the single lens eye, we will also discuss the information processing in the eye.

2.3.3.2 Single lens eyes combine the advantages of the other two eye types, but they are more complex

The light sensing organ common in vertebrates is the single lens eye. It produces a sharp, high-resolution image of the environment at high or variable light intensity, and thus combines the advantages of the other two eye types.

Figure 2.5: Compound eye of a robber fly.

Like in the pinhole eye, the light enters through an opening (the pupil) and is projected onto a layer of sensory cells in the eye, the retina. But in contrast to the pinhole eye, the size of the pupil can be adapted to the lighting conditions (by means of the iris muscle, which expands or contracts the pupil). These differences in pupil dilation require the eye to actively focus the image; therefore, the single lens eye contains an additional adjustable lens.

2.3.3.3 The retina does not only receive information but is also responsible for information processing

The light signals falling on the eye are received by the retina and directly preprocessed by several layers of information-processing cells. We want to briefly discuss the different steps of this information processing, and in doing so we follow the way of the information carried by the light:

Photoreceptors receive the light signal and cause action potentials (there are different receptors for different color components and light intensities). These receptors are the real light-receiving part of the retina, and they are sensitive to such an extent that only one single photon falling on the retina can cause an action potential.

Then several photoreceptors transmit their signals to one single bipolar cell. This means that here the information has already been summarized.

Finally, the now transformed light signal travels from several bipolar cells² into ganglion cells. Various bipolar cells can transmit their information to one ganglion cell. The higher the number of photoreceptors that affect the ganglion cell, the larger the field of perception, the receptive field, which covers the ganglions – and the less sharp is the image in the area of this ganglion cell.

² There are different kinds of bipolar cells, but to discuss all of them would go too far here.
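The convergence of many photoreceptors onto fewer bipolar and ganglion cells, and the lateral interplay of the horizontal cells discussed next, can both be caricatured in a few lines. This is a toy sketch; all signal values and the inhibition strength are invented for illustration:

```python
def converge(signals, group):
    """Pool neighbouring receptor signals into one cell each, as several
    photoreceptors converge onto a bipolar cell and several bipolar
    cells onto a ganglion cell. A larger group means a larger
    receptive field -- and a blurrier image."""
    return [sum(signals[i:i + group]) / group
            for i in range(0, len(signals), group)]

def lateral_inhibition(signals, strength=0.5):
    """Horizontal-cell caricature: each output is the cell's own signal
    minus a fraction of its neighbours' average, which enhances the
    contrast at outlines (edges)."""
    out = []
    for i, s in enumerate(signals):
        left = signals[i - 1] if i > 0 else s
        right = signals[i + 1] if i < len(signals) - 1 else s
        out.append(s - strength * (left + right) / 2)
    return out

photoreceptors = [0.0, 0.0, 1.0, 1.0, 0.9, 1.0, 0.1, 0.0]
bipolar = converge(photoreceptors, 2)      # summarized: [0.0, 1.0, 0.95, 0.05]
edge = lateral_inhibition([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
# edge == [0.5, 0.5, 0.75, -0.25, 0.0, 0.0]: the cells on either side
# of the brightness step stand out, sharpening the outline.
```

Convergence trades sharpness for a larger receptive field, while lateral inhibition recovers the clear perception of outlines and bright points within the retina itself.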

So far we have learned about the information processing in the retina only as a top-down structure. Now we want to take a look at the horizontal and amacrine cells. These cells are not connected from the front backwards but laterally. They allow the light signals to influence one another laterally directly during the information processing in the retina – a much more powerful method of information processing than compressing and blurring. When the horizontal cells are excited by a photoreceptor, they are able to excite other nearby photoreceptors and at the same time inhibit more distant bipolar cells and receptors. This ensures the clear perception of outlines and bright points. Amacrine cells can further intensify certain stimuli by distributing information from bipolar cells to several ganglion cells or by inhibiting ganglions.

These first steps of transmitting visual information to the brain show that information is processed from the first moment it is received and, moreover, is processed in parallel within millions of information-processing cells. The system's power and resistance to errors is based upon this massive division of work.

2.3.4 The amount of neurons in living organisms at different stages of development

An overview of different organisms and their neural capacity (in large part from [RD05]):

302 neurons are required by the nervous system of a nematode worm, which serves as a popular model organism in biology. Nematodes live in the soil and feed on bacteria.

10^4 neurons make an ant (to simplify matters we neglect the fact that some ant species can also have more or less efficient nervous systems). Due to the use of different attractants and odors, ants are able to engage in complex social behavior and form huge states with millions of individuals. If you regard such an ant state as an individual, it has a cognitive capacity similar to that of a chimpanzee or even a human.

With 10^5 neurons the nervous system of a fly can be constructed. A fly can evade an object in real time in three-dimensional space, it can land upon the ceiling upside down, and it has a considerable sensory system because of compound eyes, vibrissae, nerves at the end of its legs and much more. Thus, a fly has considerable differential and integral calculus in high dimensions implemented "in hardware". We all know that a fly is not easy to catch. Of course, the bodily functions are also controlled by neurons, but these should be ignored here.

With 0.8 · 10^6 neurons we have enough cerebral matter to create a honeybee. Honeybees build colonies and have amazing capabilities in the field of aerial reconnaissance and navigation.

4 · 10^6 neurons result in a mouse, and here the world of vertebrates already begins.

1.5 · 10^7 neurons are sufficient for a rat, an animal which is reputed to be extremely intelligent and which often participates in a variety of intelligence tests representative for the animal world. Rats have an extraordinary sense of smell and orientation, and they also show social behavior. By the way, the brain of a frog can be positioned within the same dimension. The frog has a complex build with many functions, it can swim and has evolved complex behavior. A frog can continuously target a fly by means of its eyes while jumping in three-dimensional space and catch it with its tongue with considerable probability.

5 · 10^7 neurons make a bat. The bat can navigate in total darkness through a room, exact up to several centimeters, by only using its sense of hearing. It uses acoustic signals to localize self-camouflaging insects (e.g. some moths have a certain wing structure that reflects fewer sound waves, so that the echo is small) and also catches its prey while flying.

1.6 · 10^8 neurons are required by the brain of a dog, companion of man for ages. Now take a look at another popular companion of man: 3 · 10^8 neurons can be found in a cat, which is about twice as many as in a dog. We know that cats are very elegant, patient carnivores that can show a variety of behaviors. By the way, an octopus can be positioned within the same magnitude. Only very few people know that, for example, in labyrinth orientation the octopus is vastly superior to the rat.

With 6 · 10^9 neurons you already get a chimpanzee, one of the animals being very similar to the human.

10^11 neurons make a human. Usually, the human has considerable cognitive capabilities, is able to speak, to abstract, to remember and to use tools as well as the knowledge of other humans to develop advanced technologies and manifold social structures.

With 2 · 10^11 neurons there are nervous systems having more neurons than the human nervous system. Here we should mention elephants and certain whale species.

Recent research results suggest that the processes in nervous systems might be vastly more powerful than people thought until not long ago: Michaeva et al. describe a separate, synapse-integrated way of information processing [MBW+10]. Posterity will show if they are right.

2.5 Transition to technical neurons: neural networks are a caricature of biology

How do we change from biological neural networks to the technical ones? Through radical simplification. I now want to briefly summarize the conclusions relevant for the technical part.

We have learned that the biological neurons are linked to each other in a weighted way and, when stimulated, they electrically transmit their signal via the axon. From the axon the signals are not directly transferred to the succeeding neurons, but they first have to cross the synaptic cleft, where the signal is changed again by variable chemical processes. In the receiving neuron the various inputs that have been post-processed in the synaptic cleft are summarized or accumulated to one single pulse. Depending on how the neuron is stimulated by the cumulated input, the neuron itself emits a pulse or not – thus, the output is non-linear and not proportional to the cumulated input. Our brief summary corresponds exactly with the few elements of biological neural networks we want to take over into the technical approximation:

Vectorial input: The input of technical neurons consists of many components, therefore it is a vector. In nature a neuron receives pulses of 10^3 to 10^4 other neurons on average.

Scalar output: The output of a neuron is a scalar, which means that the neuron only consists of one component. Several scalar outputs in turn form the vectorial input of another neuron. This particularly means that somewhere in the neuron the various input components have to be summarized in such a way that only one component remains.

Synapses change input: In technical neural networks the inputs are preprocessed, too. They are multiplied by a number (the weight) – they are weighted. The set of such weights represents the information storage of a neural network – in both the biological original and the technical adaptation.

Accumulating the inputs: In biology, the inputs are summarized to a pulse according to the chemical change, i.e. they are accumulated – on the technical side this is often realized by the weighted sum, which we will get to know later on. This means that after accumulation we continue with only one value, a scalar, instead of a vector.

Non-linear characteristic: The input of our technical neurons is also not proportional to the output.

Adjustable weights: The weights weighting the inputs are variable, similar to the chemical processes at the synaptic cleft. This adds a great dynamic to the network, because a large part of the "knowledge" of a neural network is saved in the weights and in the form and power of the chemical processes in a synaptic cleft.

So our current, only casually formulated and very simple neuron model receives a vectorial input x, with components x_i. These are multiplied by the appropriate weights w_i and accumulated: Σ_i w_i x_i. The aforementioned term is called weighted sum. Then the nonlinear mapping f defines the scalar output y:

y = f( Σ_i w_i x_i ).

After this transition we now want to specify our neuron model more precisely and add some odds and ends. Afterwards we will take a look at how the weights can be adjusted.

Exercises

Exercise 4. It is estimated that a human brain consists of approx. 10^11 nerve cells, each of which has about 10^3 to 10^4 synapses. For this exercise we assume 10^3 synapses per neuron. Let us further assume that a single synapse could save 4 bits of information. Naïvely calculated: How much storage capacity does the brain have? Note: The information which neuron is connected to which other neuron is also important.
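The simple neuron model y = f(Σ_i w_i x_i) can be sketched directly. The following fragment is only an illustration of the formula, not code from the book or from Snipe; all class and method names are made up, and tanh is chosen here as one possible nonlinear mapping f:

```java
// Minimal sketch of the casual neuron model y = f(sum_i w_i * x_i).
// Names are illustrative; tanh stands in for the nonlinear mapping f.
public class SimpleNeuron {

    // Weighted sum of the input vector x under the weights w.
    static double weightedSum(double[] x, double[] w) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += w[i] * x[i];
        }
        return sum;
    }

    // Scalar output y = f(weighted sum).
    static double output(double[] x, double[] w) {
        return Math.tanh(weightedSum(x, w));
    }
}
```

Note how the vectorial input x is first collapsed into a single scalar before the nonlinearity is applied, exactly as demanded by the "scalar output" and "accumulating the inputs" points above.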

Chapter 3

Components of artificial neural networks (fundamental)

Formal definitions and colloquial explanations of the components that realize the technical adaptations of biological neural networks. Initial descriptions of how to combine these components into a neural network.

This chapter contains the formal definitions for most of the neural network components used later in the text. After this chapter you will be able to read the individual chapters of this work without having to know the preceding ones (although this would be useful).

3.1 The concept of time in neural networks

In some definitions of this text we use the term time or the number of cycles of the neural network, respectively. Time is divided into discrete time steps:

Definition 3.1 (The concept of time). The current time (present time) is referred to as (t), the next time step as (t + 1), the preceding one as (t − 1). All other time steps are referred to analogously. If in the following chapters several mathematical variables (e.g. net_j or o_i) refer to a certain point in time, the notation will be, for example, net_j(t − 1) or o_i(t).

From a biological point of view this is, of course, not very plausible (in the human brain a neuron does not wait for another one), but it significantly simplifies the implementation.

3.2 Components of neural networks

A technical neural network consists of simple processing units, the neurons, and directed, weighted connections between those neurons. Here, the strength of a connection (or the connecting weight) between two neurons i and j is referred to as w_i,j.

3.2.1 Connections carry information that is processed by neurons

Data are transferred between neurons via connections, with the connecting weight being either excitatory or inhibitory. The definition of connections has already been included in the definition of the neural network:

Definition 3.2 (Neural network). A neural network is a sorted triple (N, V, w) with two sets N, V and a function w, where N is the set of neurons and V a set {(i, j) | i, j ∈ N} whose elements are called connections between neuron i and neuron j. The function w : V → R defines the weights, where w((i, j)), the weight of the connection between neuron i and neuron j, is shortened to w_i,j. Depending on the point of view it is either undefined or 0 for connections that do not exist in the network. (Note: In some of the cited literature i and j could be interchanged in w_i,j; the published literature is not consistent here. But in this text I try to use the notation I found more frequently and in the more significant citations.)

So the weights can be implemented in a square weight matrix W, with the row number of the matrix indicating where the connection begins and the column number of the matrix indicating which neuron is the target. Indeed, in this case the numeric 0 marks a non-existing connection. This matrix representation is also called Hinton diagram. (Note that in some of the cited literature axes and rows could be interchanged as well; here again, a consistent standard does not exist.)

SNIPE: Connections can be set using the method NeuralNetwork.setSynapse.

SNIPE: In Snipe, an instance of the class NeuralNetworkDescriptor is created in the first place. The descriptor object roughly outlines a class of neural networks, e.g. it defines the number of neuron layers in a neural network. In a second step, the descriptor object is used to instantiate an arbitrary number of NeuralNetwork objects. To get started with Snipe programming, the documentations of exactly these two classes are – in that order – the right thing to read. The presented layout involving a descriptor and dependent neural networks is very reasonable from the implementation point of view, because it makes it possible to create and maintain general parameters of even very large sets of similar (but not necessarily equal) networks.

The neurons and connections comprise the following components and variables (I'm following the path of the data within a neuron, which is according to fig. 3.1 on the facing page in top-down direction):

3.2.2 The propagation function converts vector inputs to scalar network inputs
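As a small illustration of the weight matrix just described (a sketch with made-up names, not the Snipe API), a square matrix W can hold w_i,j at row i, column j, with 0 marking a non-existing connection:

```java
// Sketch of the square weight matrix W: entry w[i][j] holds the weight
// w_{i,j} of the connection from neuron i to neuron j, and the numeric 0
// marks a non-existing connection. Illustrative names only.
public class WeightMatrix {
    final double[][] w;

    WeightMatrix(int neurons) {
        w = new double[neurons][neurons]; // Java initializes to 0: no connections
    }

    // Set the weight of the directed connection i -> j.
    void setSynapse(int from, int to, double weight) {
        w[from][to] = weight;
    }

    boolean connected(int from, int to) {
        return w[from][to] != 0.0;
    }
}
```

Note the representational quirk mentioned in the text: with this convention, an existing connection whose weight happens to be exactly 0 cannot be distinguished from a non-existing one.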

For a neuron j the propagation function receives the outputs o_i1, . . . , o_in of other neurons i_1, i_2, . . . , i_n (which are connected to j), and transforms them in consideration of the connecting weights w_i,j into the network input net_j that can be further processed by the activation function. Thus, the network input is the result of the propagation function.

Definition 3.3 (Propagation function and network input). Let I = {i_1, i_2, . . . , i_n} be the set of neurons such that ∀z ∈ {1, . . . , n} : ∃w_iz,j. Then the network input of j, called net_j, is calculated by the propagation function f_prop as follows:

net_j = f_prop(o_i1, . . . , o_in, w_i1,j, . . . , w_in,j)    (3.1)

Here the weighted sum is very popular: the multiplication of the output of each neuron i by w_i,j, and the summation of the results:

net_j = Σ_{i∈I} (o_i · w_i,j)    (3.2)

SNIPE: The propagation function in Snipe was implemented using the weighted sum.

Figure 3.1: Data processing of a neuron, in top-down direction: data input of other neurons; propagation function (often weighted sum, transforms outputs of other neurons to net input); network input; activation function (transforms net input and sometimes old activation to new activation); activation; output function (often identity function, transforms activation to output for other neurons); data output to other neurons.

3.2.3 The activation is the "switching status" of a neuron

Based on the model of nature, every neuron is, to a certain extent, at all times active, excited, or whatever you will call it.

The reactions of the neurons to the input values depend on this activation state. The activation state indicates the extent of a neuron's activity and is often shortly referred to as activation. Its formal definition is included in the following definition of the activation function. But generally, it can be defined as follows:

Definition 3.4 (Activation state / activation in general). Let j be a neuron. The activation state a_j, in short activation, is explicitly assigned to j, indicates the extent of the neuron's activity and results from the activation function.

3.2.4 Neurons get activated if the network input exceeds their threshold value

Near the threshold value, the activation function of a neuron reacts particularly sensitively. From the biological point of view the threshold value represents the threshold at which a neuron starts firing. The threshold value is also mostly included in the definition of the activation function, but generally the definition is the following:

Definition 3.5 (Threshold value in general). Let j be a neuron. The threshold value Θ_j is uniquely assigned to j and marks the position of the maximum gradient value of the activation function.

3.2.5 The activation function determines the activation of a neuron dependent on network input and threshold value

At a certain time – as we have already learned – the activation a_j of a neuron j depends on the previous activation state of the neuron and the external input. (The previous activation is not always relevant for the current one – we will see examples for both variants.)

Definition 3.6 (Activation function and activation). Let j be a neuron. The activation function transforms the network input net_j, as well as the previous activation state a_j(t − 1), into a new activation state a_j(t), with the threshold value Θ playing an important role:

a_j(t) = f_act(net_j(t), a_j(t − 1), Θ_j)    (3.3)

The activation function is also called transfer function. Unlike the other variables within the neural network (particularly unlike the ones defined so far), the activation function is often defined globally for all neurons, or at least for a set of neurons, and only the threshold values are different for each neuron. We should also keep in mind that the threshold values can be changed, for example by a learning procedure. So it can in particular become necessary to relate the threshold value to the time and to write, for instance, Θ_j as Θ_j(t) (but for reasons of clarity, I omitted this here).

SNIPE: It is possible to get and set activation states of neurons by using the methods getActivation or setActivation in the class NeuralNetwork.

3.2.6 Common activation functions

The simplest activation function is the binary threshold function (fig. 3.2 on the next page), which can only take on two values (also referred to as Heaviside function). If the input is above a certain threshold, the function changes from one value to another, but otherwise remains constant. This implies that the function is not differentiable at the threshold and for the rest the derivative is 0. Due to this fact, backpropagation learning, for example, is impossible (as we will see later). Also very popular is the Fermi function or logistic function (fig. 3.2)

f(x) = 1 / (1 + e^(−x)),    (3.4)

which maps to the range of values of (0, 1), and the hyperbolic tangent (fig. 3.2), which maps to (−1, 1). Both functions are differentiable. The Fermi function can be expanded by a temperature parameter T into the form

f(x) = 1 / (1 + e^(−x/T)).    (3.5)

The smaller this parameter, the more it compresses the function on the x axis. Thus, one can arbitrarily approximate the Heaviside function. Incidentally, there exist activation functions which are not explicitly defined but depend on the input according to a random distribution (stochastic activation function).

An alternative to the hyperbolic tangent that is really worth mentioning was suggested by Anguita et al. [APZ93], who had been tired of the slowness of the workstations back in 1993. Thinking about how to make neural network propagations faster, they quickly identified the approximation of the e-function used in the hyperbolic tangent as one of the causes of slowness. Consequently, they "engineered" an approximation to the hyperbolic tangent, just using two parabola pieces and two half-lines. At the price of delivering a slightly smaller range of values than the hyperbolic tangent ([−0.96016, 0.96016] instead of [−1, 1]), it can, dependent on what CPU one uses, be calculated 200 times faster, because it just needs two multiplications and one addition. What's more, it has some other advantages that will be mentioned later.

SNIPE: The activation functions introduced here are implemented within the classes Fermi and TangensHyperbolicus, both of which are located in the package neuronbehavior. The fast hyperbolic tangent approximation is located within the class TangensHyperbolicusAnguita.

SNIPE: In Snipe, activation functions are generalized to neuron behaviors. Such behaviors can represent just normal activation functions, or even incorporate internal states and dynamics. Corresponding parts of Snipe can be found in the package neuronbehavior, which also contains some of the activation functions introduced in the next section. The interface NeuronBehavior allows for implementation of custom behaviors; objects that inherit from this interface can be passed to a NeuralNetworkDescriptor instance. It is possible to define individual behaviors per neuron layer.
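The three standard activation functions just discussed are easily sketched. The following fragment is illustrative only (not the Snipe classes); it includes the temperature parameter T of the extended Fermi function:

```java
// Sketches of common activation functions: Heaviside, Fermi with
// temperature parameter T, and hyperbolic tangent. Illustrative names only.
public class ActivationFunctions {

    // Binary threshold (Heaviside) function: 0 below the threshold, 1 above.
    static double heaviside(double x) {
        return x >= 0 ? 1.0 : 0.0;
    }

    // Fermi (logistic) function with temperature T: 1 / (1 + exp(-x / T)).
    // T = 1 gives the plain Fermi function; smaller T makes it steeper.
    static double fermi(double x, double T) {
        return 1.0 / (1.0 + Math.exp(-x / T));
    }

    // Hyperbolic tangent, mapping to (-1, 1).
    static double tanh(double x) {
        return Math.tanh(x);
    }
}
```

Evaluating fermi at a fixed positive x for shrinking T shows the approximation of the Heaviside function mentioned in the text: the output moves ever closer to 1.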

3.2.7 An output function may be used to process the activation once again

The output function of a neuron j calculates the values which are transferred to the other neurons connected to j. More formally:

Definition 3.7 (Output function). Let j be a neuron. The output function

f_out(a_j) = o_j    (3.6)

calculates the output value o_j of the neuron j from its activation state a_j. (Other definitions of output functions may be useful if the range of values of the activation function is not sufficient.)

Generally, the output function is defined globally, too. Often this function is the identity, i.e. the activation a_j is directly output:

f_out(a_j) = a_j, so o_j = a_j    (3.7)

Unless explicitly specified differently, we will use the identity as output function within this text.

3.2.8 Learning strategies adjust a network to fit our needs

Since we will address this subject later in detail, and at first want to get to know the principles of neural network structures, I will only provide a brief and general definition here:

Figure 3.2: Various popular activation functions, from top to bottom: Heaviside or binary threshold function, Fermi function, hyperbolic tangent. The Fermi function was expanded by a temperature parameter; the original Fermi function is represented by dark colors, and the temperature parameters of the modified Fermi functions are, ordered ascending by steepness, 1/2, 1/5, 1/10 and 1/25.

Definition 3.8 (General learning rule). The learning strategy is an algorithm that can be used to change and thereby train the neural network, so that the network produces a desired output for a given input.

3.3 Network topologies

After we have become acquainted with the composition of the elements of a neural network, I want to give an overview of the usual topologies (= designs) of neural networks, i.e. of how to construct networks consisting of these elements. Every topology described in this text is illustrated by a map and its Hinton diagram, so that the reader can immediately see the characteristics and apply them to other networks. In the Hinton diagram the dotted weights are represented by light grey fields, the solid ones by dark grey fields. The input and output arrows, which were added for reasons of clarity, cannot be found in the Hinton diagram. In order to clarify that the connections are between the line neurons and the column neurons, I have inserted the small arrow in the upper-left cell.

SNIPE: Snipe is designed for the realization of arbitrary network topologies. In this respect, Snipe defines different kinds of synapses depending on their source and their target. Any kind of synapse can separately be allowed or forbidden for a set of networks using the setAllowed methods in a NeuralNetworkDescriptor instance.

3.3.1 Feedforward networks consist of layers and connections towards each following layer

In this text, feedforward networks (fig. 3.3 on the following page) are the networks we will first explore (even if we will use different topologies later). The neuron layers of a feedforward network are clearly separated: one input layer, one output layer and one or more processing layers which are invisible from the outside (also called hidden layers). Connections are only permitted to neurons of the following layer.

Definition 3.9 (Feedforward network). In a feedforward network the neurons are grouped in the following layers: one input layer, n hidden processing layers (invisible from the outside, which is why the neurons are also referred to as hidden neurons) and one output layer. In a feedforward network each neuron in one layer has only directed connections to the neurons of the next layer (towards the output layer). In fig. 3.3 on the following page the connections permitted for a feedforward network are represented by solid lines. We will often be confronted with feedforward networks in which every neuron i is connected to all neurons of the next layer (these layers are called completely linked). To prevent naming conflicts, the output neurons are often referred to as Ω.
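One forward pass through a completely linked feedforward network can be sketched as follows. This is illustrative code with made-up names, not the Snipe API; it uses the weighted sum as propagation function, tanh as activation function and the identity as output function, as in the definitions above:

```java
// Sketch of one forward pass through a completely linked feedforward
// network: each layer's output is the activation of a weighted sum of the
// previous layer's outputs. Illustrative, not the Snipe API.
public class FeedforwardPass {

    // weights[l][i][j]: weight from neuron i of layer l to neuron j of layer l+1.
    static double[] propagate(double[] input, double[][][] weights) {
        double[] out = input;
        for (double[][] layer : weights) {
            double[] next = new double[layer[0].length];
            for (int j = 0; j < next.length; j++) {
                double net = 0.0;                 // network input net_j
                for (int i = 0; i < out.length; i++) {
                    net += out[i] * layer[i][j];  // weighted sum (3.2)
                }
                next[j] = Math.tanh(net);         // activation function
            }
            out = next;                           // identity output function
        }
        return out;
    }
}
```

The "completely linked" property shows up in the data layout: every entry of each per-layer matrix is used, with no connections skipping or re-entering layers.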

Figure 3.3: A feedforward network with three layers: two input neurons, three hidden neurons and two output neurons. Characteristic for the Hinton diagram of completely linked feedforward networks is the formation of blocks above the diagonal.

3.3.1.1 Shortcut connections skip layers

Some feedforward networks permit the so-called shortcut connections (fig. 3.4 on the next page): connections that skip one or more levels. These connections may only be directed towards the output layer, too.

Definition 3.10 (Feedforward network with shortcut connections). Similar to the feedforward network, but the connections may not only be directed towards the next layer but also towards any other subsequent layer.

3.3.2 Recurrent networks have influence on themselves

Recurrence is defined as the process of a neuron influencing itself by any means or by any connection. Recurrent networks do not always have explicitly defined input or output neurons. Therefore, in the figures I omitted all markings that concern this matter and only numbered the neurons.

3.3.2.1 Direct recurrences start and end at the same neuron

Some networks allow for neurons to be connected to themselves, which is called direct recurrence (or sometimes self-recurrence) (fig. 3.5 on the facing page). As a result, neurons inhibit and therefore strengthen themselves in order to reach their activation limits.

Figure 3.4: A feedforward network with shortcut connections, which are represented by solid lines. On the right side of the feedforward blocks, new connections have been added to the Hinton diagram.

Figure 3.5: A network similar to a feedforward network with directly recurrent neurons. The direct recurrences are represented by solid lines and exactly correspond to the diagonal in the Hinton diagram matrix.

Definition 3.11 (Direct recurrence). Now we expand the feedforward network by connecting a neuron j to itself, with the weights of these connections being referred to as w_j,j. In other words: the diagonal of the weight matrix W may be different from 0.

3.3.2.2 Indirect recurrences can influence their starting neuron only by making detours

If connections towards the input layer are allowed, they will be called indirect recurrences. Then a neuron j can use indirect forward connections to influence itself, for example by influencing the neurons of the next layer and the neurons of this next layer influencing j (fig. 3.6).

Definition 3.12 (Indirect recurrence). Again our network is based on a feedforward network, now with additional connections between neurons and their preceding layer being allowed. Therefore, below the diagonal of W may be different from 0.

Figure 3.6: A network similar to a feedforward network with indirectly recurrent neurons. The indirect recurrences are represented by solid lines. As we can see, connections to the preceding layers can exist here, too. The fields that are symmetric to the feedforward blocks in the Hinton diagram are now occupied.

3.3.2.3 Lateral recurrences connect neurons within one layer

Connections between neurons within one layer are called lateral recurrences (fig. 3.7 on the facing page). Here, each neuron often inhibits the other neurons of the layer and strengthens itself. As a result only the strongest neuron becomes active (winner-takes-all scheme).

Definition 3.13 (Lateral recurrence). A laterally recurrent network permits connections within one layer.

3.3.3 Completely linked networks allow any possible connection

Completely linked networks permit connections between all neurons, except for direct recurrences.

Furthermore, the connections must be symmetric (fig. 3.8 on the next page). Thus, the matrix W may be unequal to 0 everywhere, except along its diagonal. In this case, every neuron is always allowed to be connected to every other neuron – but as a result every neuron can become an input neuron. Therefore, direct recurrences normally cannot be applied here, and clearly defined layers no longer exist. A popular example are the self-organizing maps, which will be introduced in chapter 10.

Definition 3.14 (Complete interconnection). In this case, the matrix W may be unequal to 0 everywhere, except along its diagonal.

Figure 3.7: A network similar to a feedforward network with laterally recurrent neurons. The lateral recurrences are represented by solid lines; recurrences only exist within the layer. In the Hinton diagram, filled squares are concentrated around the diagonal in the height of the feedforward blocks, but the diagonal is left uncovered.

3.4 The bias neuron is a technical trick to consider threshold values as connection weights

By now we know that in many network paradigms neurons have a threshold value that indicates when a neuron becomes active. Thus, the threshold value is an activation function parameter of a neuron. From the biological point of view this sounds most plausible, but it is complicated to access the activation function at runtime in order to train the threshold value. But threshold values Θ_j1, . . . , Θ_jn for neurons j_1, j_2, . . . , j_n can also be realized as connecting weights of a continuously firing neuron, the bias neuron.

.jn with −Θj1 . generating connections between the said bias neuron and the neurons j1 . Kriesel – A Brief Introduction to Neural Networks (ZETA2-EN) . j i 3b 4 j j d5 b d b j j b j bb bb jjjj bb bjj jjjb bb bb j j j j b0 j b0 Ðj Ð u jjj 89:. It is used to represent neuron biases as connection weights. i. 3. These new connections get the weights −Θj1 . . rences. . . . . By inserting a bias neuron whose output value is always 1. it is now included in the propagation funcFigure 3. A bias neuron is a neuron whose output value is always 1 and which is represented by ?>=< 89:.j1 .15. . . . . j2 . .e. . i. . Now the threshold values are implemented as connection weights (ﬁg. Then the threshold value of the neurons j1 . .8: A completely linked network with tion. j2 . jn be neurons with threshold values Θj1 . −Θjn . jn is set to 0. = Θjn = 0 and 1 2 3 4 5 6 7 1 2 3 4 5 6 7 bias neuron replaces thresh. i o 1 2 jjjj bb bb d y b d y b j j j j b bb j j b j j bbb jjjjj bbb b0 b0 jj Ðj Ð G A ?>=< u o jjj G o ?>=< 89:. which enables any weighttraining algorithm to train the biases at the same time.e. . jn and weighting these connections wBIAS.Chapter 3 Components of artiﬁcial neural networks (fundamental) dkriesel. . . In other words: Instead of including the threshold value in the activation function. . j2 . . ?>=< RS 89:. wBIAS. . mally: Let j1 . In the Hinton diagram only the diagonal it is part of the network input. 89:. . ?>=< G A ?>=< 89:. . .9 on page 46) and can directly be trained together with the connection weights. . .com is always 1 is integrated in the network and connected to the neurons j1 . we can set Θj1 = . Deﬁnition 3. . . GS ?>=< 89:. Θjn . value with weights 44 D. . jn . More foris left blank. which considerably facilitates the learning process. they get the negative threshold values. Or even shorter: The threshold value symmetric connections and without direct recuris subtracted from the network input. . −Θjn . . . . . . 6 o 7 @ABC GFED BIAS . j2 .

In this way we receive an equivalent neural network whose threshold values are realized by connection weights. Undoubtedly, the advantage of the bias neuron is the fact that it is much easier to implement in the network. One disadvantage is that the representation of the network already becomes quite ugly with only a few neurons, let alone with a great number of them. By the way, a bias neuron is often referred to as on neuron. From now on, the bias neuron is omitted for clarity in the following illustrations, but we know that it exists and that the threshold values can simply be treated as weights because of it.

SNIPE: In Snipe, a bias neuron was implemented instead of neuron-individual biases. It is used to represent neuron biases as connection weights, which enables any weight-training algorithm to train the biases at the same time. The neuron index of the bias neuron is 0.

3.5 Representing neurons

We have already seen that we can either write its name or its threshold value into a neuron. Another useful representation, which we will use several times in the following, is to illustrate neurons according to their type of data processing. See fig. 3.10 for some examples without further explanation – the different types of neurons are explained as soon as we need them.

Figure 3.10: Different types of neurons that will appear in the following text.

3.6 Take care of the order in which neuron activations are calculated

For a neural network it is very important in which order the individual neurons receive and process the input and output the results. Here, we distinguish two model classes:

3.6.1 Synchronous activation

All neurons change their values synchronously, i.e. they simultaneously calculate network inputs, activation and output, and pass them on. Synchronous activation corresponds closest to its biological counterpart, but it is – if to be implemented in hardware – only useful on certain parallel computers and especially not for feedforward networks. This order of activation is the most generic and can be used with networks of arbitrary topology.

6. BIAS − Θ 1 0 e e e e −Θe −Θ3 2 e e e2 Ð B 0 89:. one without bias neuron on the left. some of which I want to introduce in the following: easier to implement 3. 3.Apparently. SNIPE: When implementing in software.16 (Synchronous activation). and after that calculating all activations. not at all. ai and oi are updated. this order of activation is not ues simultaneously but at diﬀerent points always useful.com GFED @ABC Θ1 f ff || ff || ff | | ff | | ~| 2 @ABC GFED @ABC GFED Θ2 Θ3 GFED @ABC G ?>=< 89:. because Snipe has to be able to realize arbitrary network topologies. activation by means of the activation function and output by means of the output function.2 Asynchronous activation Here. All neurons of a network calculate network inputs at the same time by means of the propagation function.17 (Random order of activation). After that the activation cycle is complete. For n neurons a cycle is the n-fold execution of this step.Chapter 3 Components of artiﬁcial neural networks (fundamental) dkriesel. biologically plausible Deﬁnition 3. I omitted the weights of the already existing connections (represented by dotted lines on the right side). one could model this very general activation order by every time step calculating and caching every single network input. With random order of activation a neuron i is randomly chosen and its neti . ?>=< 89:. the connecting weights at the connections. however. of time. 46 D. ?>=< 0 0 Figure 3.1 Random order Deﬁnition 3. the neurons do not change their val. This is exactly how it is done in Snipe. some neurons are repeatedly updated during one cycle. The neuron threshold values can be found in the neurons. and others. Kriesel – A Brief Introduction to Neural Networks (ZETA2-EN) . Obviously. For this.6. Furthermore.2. there exist diﬀerent orders. one with bias neuron on the right.9: Two equivalent neural networks.

3.6.2.2 Random permutation

With random permutation each neuron is chosen exactly once, but in random order, during one cycle.

Definition 3.18 (Random permutation). Initially, a permutation of the neurons is calculated randomly, which defines the order of activation. Then the neurons are successively processed in this order.

This order of activation is used rarely as well, because firstly, the order is generally useless and, secondly, it is very time-consuming to compute a new permutation for every cycle. A Hopfield network (chapter 8) is a topology nominally having a random or a randomly permuted order of activation. But note that in practice, for the previously mentioned reasons, a fixed order of activation is preferred.

For all orders, either the previous neuron activations at time t or, if already existing, the neuron activations at time t + 1, for which we are currently calculating the activations, can be taken as a starting point.

3.6.2.3 Topological order

Definition 3.19 (Topological activation). With topological order of activation, the neurons are updated during one cycle and according to a fixed order. The order is defined by the network topology.

This procedure can only be considered for non-cyclic, i.e. non-recurrent, networks, since otherwise there is no order of activation. In feedforward networks (for which the procedure is very reasonable) the input neurons would be updated first, then the inner neurons and finally the output neurons. This may save us a lot of time: given a synchronous activation order, a feedforward network with n layers of neurons would need n full propagation cycles in order to enable input data to have influence on the output of the network. Given the topological activation order, we just need one single propagation. However, not every network topology allows for finding a special activation order that enables saving time.

3.6.3 Fixed orders of activation during implementation

Obviously, fixed orders of activation can be defined as well, and they are often very useful. The order during implementation is defined by the network topology.

SNIPE: Those who want to use Snipe for implementing feedforward networks may save some calculation time by using the feature fastprop (mentioned within the documentation of the class NeuralNetworkDescriptor). In the standard mode, all net inputs are calculated first, followed by all activations. In the fastprop mode, for every neuron, the activation is calculated right after the net input. The neuron values are calculated in ascending neuron index order, and since the neuron numbers are ascending from input to output layer, this provides us with the perfect topological activation order for feedforward networks. Once fastprop is enabled, data propagation is carried out in this slightly different way.
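For a feedforward network, the topological order collapses the n propagation cycles into a single layer-by-layer pass. A minimal sketch, assuming layer-wise weight matrices and a Fermi activation (the names and values are mine, not Snipe's):

```python
import math

def fermi(x):
    return 1.0 / (1.0 + math.exp(-x))

# weights[k][i][j] connects neuron j of layer k to neuron i of
# layer k+1. Updating layer by layer (input -> hidden -> output)
# is exactly the topological activation order: one single pass.
def propagate(weights, inputs):
    act = list(inputs)
    for layer in weights:
        act = [fermi(sum(wij * a for wij, a in zip(row, act)))
               for row in layer]
    return act

weights = [
    [[0.4, -0.6], [0.3, 0.8]],   # input -> hidden (2 -> 2)
    [[1.0, -1.0]],               # hidden -> output (2 -> 1)
]
print(propagate(weights, [1.0, 0.0]))
```

A cyclic (recurrent) topology offers no such layer ordering, which is why the procedure is restricted to non-recurrent networks.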

When implementing, for instance, feedforward networks, it is very popular to determine the order of activation once according to the topology and to use this order without further verification at runtime. But this is not necessarily useful for networks that are capable of changing their topology.

3.7 Communication with the outside world: input and output of data in and from neural networks

Finally, let us take a look at the fact that, of course, many types of neural networks permit the input of data. Then these data are processed and can produce output. Let us, for example, regard the feedforward network shown in fig. 3.3 on page 40: it has two input neurons and two output neurons, which means that it also has two numerical inputs x1, x2 and outputs y1, y2. As a simplification, we summarize the input and output components for n input or m output neurons within the vectors x = (x1, x2, ..., xn) and y = (y1, y2, ..., ym).

Definition 3.20 (Input vector). A network with n input neurons needs n inputs x1, x2, ..., xn. They are considered as input vector x = (x1, x2, ..., xn). Data is put into a neural network by using the components of the input vector as network inputs of the input neurons. Thus, the input dimension is referred to as n.

Definition 3.21 (Output vector). A network with m output neurons provides m outputs y1, y2, ..., ym. They are regarded as output vector y = (y1, y2, ..., ym). Data is output by a neural network by the output neurons adopting the components of the output vector in their output values. Thus, the output dimension is referred to as m.

SNIPE: In order to propagate data through a NeuralNetwork instance, the propagate method is used. It receives the input vector as an array of doubles and returns the output vector in the same way.

Now we have defined and closely examined the basic components of neural networks – without having seen a network in action. But first we will continue with theoretical explanations and generally describe how a neural network could learn.

Exercises

Exercise 5. Would it be useful (from your point of view) to insert one bias neuron in each layer of a layer-based network, such as a feedforward network? Discuss this in relation to the representation and implementation of the network. Would the result of the network change?

Exercise 6. Show for the Fermi function f(x) as well as for the hyperbolic tangent tanh(x) that their derivatives can be expressed by the respective functions themselves, so that the two statements

f'(x) = f(x) · (1 − f(x)) and

tanh'(x) = 1 − tanh²(x)

are true.
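The two identities of exercise 6 can at least be checked numerically before proving them. A small sketch comparing the claimed derivative forms against central finite differences (the helper names are mine):

```python
import math

def fermi(x):
    return 1.0 / (1.0 + math.exp(-x))

def num_diff(f, x, h=1e-6):
    # central finite difference as a numerical derivative
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    # f'(x) = f(x) * (1 - f(x))
    assert abs(num_diff(fermi, x) - fermi(x) * (1 - fermi(x))) < 1e-6
    # tanh'(x) = 1 - tanh(x)^2
    assert abs(num_diff(math.tanh, x) - (1 - math.tanh(x) ** 2)) < 1e-6

print("both derivative identities hold numerically")
```

That the derivatives can be expressed through the function values themselves is precisely what makes these activation functions cheap to use in backpropagation later on.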


Chapter 4

Fundamentals on learning and training samples

Approaches and thoughts of how to teach machines. Should neural networks be corrected? Should they only be encouraged? Or should they even learn without any help? Thoughts about what we want to change during the learning procedure and how we will change it, about the measurement of errors and about when we have learned enough.

As written above, the most interesting characteristic of neural networks is their capability to familiarize with problems by means of training and, after sufficient training, to be able to solve unknown problems of the same class. This approach is referred to as generalization. Before introducing specific learning procedures, I want to propose some basic principles about the learning procedure in this chapter.

4.1 There are different paradigms of learning

Learning is a comprehensive term. A learning system changes itself in order to adapt to, e.g., environmental changes. A neural network could learn from many things but, of course, there will always be the question of how to implement it. In principle, a neural network changes when its components are changing, as we have learned above. Theoretically, a neural network could learn by

1. developing new connections,
2. deleting existing connections,
3. changing connecting weights,
4. changing the threshold values of neurons,
5. varying one or more of the three neuron functions (remember: activation function, propagation function and output function),
6. developing new neurons, or
7. deleting existing neurons (and thereby existing connections).

As mentioned above, we assume the change in weight to be the most common procedure, since we accept that a large part of these learning possibilities can already be covered by changes in weight alone: we can develop further connections by setting a non-existing connection (with the value 0 in the connection matrix) to a value different from 0, and the deletion of connections can be realized by additionally taking care that a connection is no longer trained when it is set to 0. As for the modification of threshold values, I refer to the possibility of implementing them as weights (section 3.4). Thus, we perform any of the first four learning paradigms by just training synaptic weights.

The change of neuron functions is difficult to implement, not very intuitive and not exactly biologically motivated. Therefore it is not very popular and I will omit this topic here. The possibilities to develop or delete neurons do not only provide well-adjusted weights during the training of a neural network, but also optimize the network topology. Thus, they attract a growing interest and are often realized by using evolutionary procedures. But they are not the subject matter of this text (however, it is planned to extend the text towards those aspects of training).

Thus, we let our neural network learn by modifying the connecting weights according to rules that can be formulated as algorithms. Therefore a learning procedure is always an algorithm that can easily be implemented by means of a programming language.

SNIPE: Methods of the class NeuralNetwork allow for changes in connection weights, and for addition and removal of both connections and neurons. Methods in NeuralNetworkDescriptor enable the change of neuron behaviors, respectively activation functions, per layer.

Later in the text I will assume that the definition of the term desired output which is worth learning is known (and I will define formally what a training pattern is) and that we have a training set of learning samples. Let a training set be defined as follows:

Definition 4.1 (Training set). A training set (named P) is a set of training patterns, which we use to train our neural network.

I will now introduce the three essential paradigms of learning by presenting the differences between their respective training sets.

4.1.1 Unsupervised learning provides input patterns to the network, but no learning aides

Unsupervised learning is the biologically most plausible method, but is not suitable for all problems. Only the input patterns are given; the network tries by itself to detect similarities and to generate pattern classes.

Definition 4.2 (Unsupervised learning). The training set only consists of input patterns; the network tries to identify similar patterns and to classify them into similar categories.

Here I want to refer again to the popular example of Kohonen's self-organising maps (chapter 10).

4.1.2 Reinforcement learning methods provide feedback to the network, whether it behaves well or badly

In reinforcement learning, the network receives a logical or a real value after completion of a sequence, which defines whether the result is right or wrong and, possibly, how right or wrong it was. Intuitively it is clear that this procedure should be more effective than unsupervised learning, since the network receives specific criteria for problem-solving.

Definition 4.3 (Reinforcement learning). The training set consists of input patterns; after completion of a sequence, a value is returned to the network indicating whether the result was right or wrong and, possibly, how right or wrong it was.

4.1.3 Supervised learning methods provide training patterns together with appropriate desired outputs

In supervised learning, the training set consists of input patterns with correct results, so that a precise error vector¹ can be returned to the network. Thus, for each training sample that is fed into the network, the output can, for instance, directly be compared with the correct solution, and the network weights can be changed according to their difference. The objective is to change the weights to the effect that the network cannot only associate input and output patterns independently after the training, but can provide plausible results to unknown, similar input patterns, i.e. it generalises. This learning procedure is not always biologically plausible, but it is extremely effective and therefore very practicable.

Definition 4.4 (Supervised learning). The training set consists of input patterns as well as their correct results in the form of the precise activation of all output neurons.

Such learning procedures correspond to the following steps:

1. entering the input pattern (activation of input neurons),
2. forward propagation of the input by the network, generation of the output,
3. comparing the output with the desired output (teaching input), which provides the error vector (difference vector),
4. corrections of the network are calculated based on the error vector,
5. corrections are applied.

¹ The term error vector will be defined in section 4.2, where the mathematical formalisation of learning is discussed.
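These steps can be sketched for the simplest possible case. The following toy uses one linear neuron and a delta-style correction proportional to the difference term — the learning rate, the linear neuron and the sample values are assumptions for illustration, not the general procedures of later chapters:

```python
# Supervised learning cycle for one linear neuron (sketch):
# forward propagation, comparison with the teaching input t,
# correction of the weights according to the difference.

def forward(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_step(w, x, t, eta=0.1):
    y = forward(w, x)                      # forward propagation
    delta = t - y                          # difference to teaching input
    return [wi + eta * delta * xi for wi, xi in zip(w, x)]

# teach y = x1 + x2 from four samples
samples = [([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0),
           ([1.0, 1.0], 2.0), ([0.5, 0.5], 1.0)]
w = [0.0, 0.0]
for _ in range(200):                       # repeated presentation
    for x, t in samples:
        w = train_step(w, x, t)
print(w)   # approaches [1.0, 1.0]
```

Note how the correction uses only the difference between output and teaching input — exactly the error vector idea formalised below.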

4.1.4 Offline or online learning?

It must be noted that learning can be offline (a set of training samples is presented, then the weights are changed; the total error is calculated by means of an error function operation or simply accumulated, see also section 4.4) or online (after every sample presented, the weights are changed). Both procedures have advantages and disadvantages, which will be discussed in the learning procedures section if necessary. Offline training procedures are also called batch training procedures, since a batch of results is corrected all at once. Such a training section of a whole batch of training samples, including the related changes in weight values, is called epoch.

Definition 4.5 (Offline learning). Several training patterns are entered into the network at once, the errors are accumulated and the network learns for all patterns at the same time.

Definition 4.6 (Online learning). The network learns directly from the errors of each training sample.

4.1.5 Questions you should answer before learning

The application of such schemes certainly requires preliminary thoughts about some questions, which I want to introduce now as a check list and, if possible, answer in the course of this text:

Where does the learning input come from and in what form?
How must the weights be modified to allow fast and reliable learning?
How can the success of a learning process be measured in an objective way?
Is it possible to determine the "best" learning procedure?
Is it possible to predict if a learning procedure terminates, i.e. whether it will reach an optimal state after a finite time or whether it, for example, will oscillate between different states?
How can the learned patterns be stored in the network?
Is it possible to avoid that newly learned patterns destroy previously learned associations (the so-called stability/plasticity dilemma)?

We will see that all these questions cannot be answered generally, but that they have to be discussed for each learning procedure and each network topology individually.

4.2 Training patterns and teaching input

Before we get to know our first learning rule, we need to introduce the teaching input. In the case of supervised learning, we assume a training set consisting of training patterns and the corresponding correct output values we want to see at the output neurons after the training. While the network has not finished training, i.e. as long as it is generating wrong outputs, these desired output values are referred to as teaching input.
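The difference between the two modes is just a different update schedule for the same correction. A sketch on a one-weight toy problem (the samples, the learning rate and the quadratic-style correction are assumptions chosen so both variants converge):

```python
# Learn a single weight w so that w*x matches t, comparing
# online updates (after every sample) with offline/batch updates
# (corrections accumulated over the whole epoch, applied once).

samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # pairs (x, t), t = 2x
eta = 0.05

def online_epoch(w):
    for x, t in samples:
        w += eta * (t - w * x) * x       # change weight per sample
    return w

def offline_epoch(w):
    total = sum((t - w * x) * x for x, t in samples)
    return w + eta * total               # one batch change per epoch

w_on, w_off = 0.0, 0.0
for _ in range(100):                     # 100 epochs each
    w_on = online_epoch(w_on)
    w_off = offline_epoch(w_off)
print(w_on, w_off)   # both approach 2.0
```

One pass of `offline_epoch` over all samples with a single weight change is exactly one epoch in the batch sense defined above.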

Training patterns are often simply called patterns, which is why they are referred to as p. In the literature as well as in this text, they are called synonymously patterns, training samples etc.

Definition 4.7 (Training patterns). A training pattern is an input vector p with the components p1, p2, ..., pn whose desired output is known. By entering the training pattern into the network, we receive an output that can be compared with the teaching input, which is the desired output. The set of training patterns is called P. It contains a finite number of ordered pairs (p, t) of training patterns with corresponding desired output.

Definition 4.8 (Teaching input). Let j be an output neuron. The teaching input tj is the desired and correct value j should output after the input of a certain training pattern. Analogously to the vector p, the teaching inputs t1, t2, ..., tn of the output neurons can also be combined into a vector t. t always refers to a specific training pattern p and is, as already mentioned, contained in the set P of the training patterns.

Definition 4.9 (Error vector). For several output neurons Ω1, Ω2, ..., Ωn, the difference between output vector and teaching input under a training input p,

Ep = (t1 − y1, t2 − y2, ..., tn − yn),

is referred to as error vector, sometimes also called difference vector. Depending on whether you are learning offline or online, the difference vector refers to a specific training pattern or to the error of a set of training patterns which is normalized in a certain way.

Now I want to briefly summarize the vectors we have defined so far. There is the input vector x, which can be entered into the neural network; depending on the type of network being used, the neural network will output an output vector y. Basically, a training sample p is nothing more than an input vector. We only use it for training purposes because we know the corresponding desired output t, which is nothing more than the desired output vector for the training sample. The error vector Ep is the difference between the teaching input t and the actual output y.

SNIPE: Classes that are relevant for training data are located in the package training. The class TrainingSampleLesson allows for storage of training patterns and teaching inputs, as well as for simple preprocessing of the training data.

One advice concerning notation: we referred to the output values of a neuron i as oi, so the output of an output neuron Ω is called oΩ. Thus, these network outputs are only neuron outputs, too, but they are outputs of output neurons. In this respect, yΩ = oΩ is true. So, what x and y are for the general network operation, p and t are for the network training — and during training we try to bring y as close to t as possible.

4.3 Using training samples

We have seen how we can learn in principle and which steps are required to do so. Now we should take a look at the selection of training data and the learning curve. After successful learning, it is particularly interesting whether the network has only memorized — i.e. whether it can use our training samples to quite exactly produce the right output but provides wrong answers for all other problems of the same class.

Suppose that we want the network to train a mapping R² → B¹ and therefore use the training samples from fig. 4.1: then there could be a chance that, finally, the network will exactly mark the colored areas around the training samples with the output 1 (fig. 4.1, top), and otherwise will output 0. Certainly, it has sufficient storage capacity to concentrate on the six training samples with the output 1.

[Figure 4.1: Visualization of training results of the same training set on networks with a capacity being too high (top), correct (middle) or too low (bottom).]

On the one hand, an oversized network with too much free storage capacity tends to memorize the samples (fig. 4.1, top). On the other hand, a network could have insufficient capacity (fig. 4.1, bottom) — this rough representation of the input data does not correspond to the good generalization performance we desire. Thus, we have to find the balance (fig. 4.1, middle).

4.3.1 It is useful to divide the set of training samples

An often proposed solution for these problems is to divide the training set into one training set really used to train, and one verification set to test our training progress — provided that there are enough training samples. The usual division relations are, for instance, 70% for training data and 30% for verification data (randomly chosen). We can finish the training when the network provides good results on the training data as well as on the verification data.

But note: if the verification data provide poor results, do not modify the network structure until these data provide good results — otherwise you run the risk of tailoring the network to the verification data. This implies that these data are included in the training, even if they are not used explicitly for the training. The solution is a third set of validation data used only for validation after a supposedly successful training.

By training fewer patterns, we obviously withhold information from the network and risk worsening the learning performance. But this text is not about 100% exact reproduction of given samples, but about successful generalization and approximation of a whole function — for which it can definitely be useful to train less information into the network.

SNIPE: The method splitLesson within the class TrainingSampleLesson allows for splitting a TrainingSampleLesson with respect to a given ratio.

4.3.2 Order of pattern representation

You can find different strategies to choose the order of pattern presentation: if patterns are presented in random sequence, there is no guarantee that the patterns are learned equally well (however, this is the standard method). Always the same sequence of patterns, on the other hand, provokes that the patterns will be memorized when using recurrent networks (later, we will learn more about this type of networks). A random permutation would solve both problems, but it is — as already mentioned — very time-consuming to calculate such a permutation.

SNIPE: The method shuffleSamples located in the class TrainingSampleLesson permutes a lesson.
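The proposed 70/30 division can be sketched plainly. The helper below is my own stand-in, not the splitLesson method of the TrainingSampleLesson class:

```python
import random

def split_lesson(patterns, ratio=0.7, seed=42):
    # shuffle a copy, then divide with respect to the given ratio
    rng = random.Random(seed)
    shuffled = patterns[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]

patterns = list(range(10))        # stand-ins for training patterns
train, verify = split_lesson(patterns)
print(len(train), len(verify))    # 7 3
```

Shuffling before cutting is what makes the 30% verification share a random choice rather than, say, the last patterns of the set.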

4.4 Learning curve and error measurement

The learning curve indicates the progress of the error, which can be determined in various ways. The motivation to create a learning curve is that such a curve can indicate whether the network is progressing or not. For this, the error should be normalized, i.e. represent a distance measure between the correct and the current output of the network. For example, we can take the same pattern-specific, squared error with a prefactor, which we are also going to use to derive the backpropagation of error (let Ω be output neurons and O the set of output neurons):

Errp = 1/2 · Σ_{Ω∈O} (tΩ − yΩ)²   (4.1)

Definition 4.10 (Specific error). The specific error Errp is based on a single training sample, which means it is generated online.

Additionally, the root mean square (abbreviated: RMS) and the Euclidean distance are often used.

Definition 4.11 (Euclidean distance). The Euclidean distance (generalization of the theorem of Pythagoras) is useful for lower dimensions, where we can still visualize its usefulness. The Euclidean distance between two vectors t and y is defined as

Errp = sqrt( Σ_{Ω∈O} (tΩ − yΩ)² )   (4.2)

Definition 4.12 (Root mean square). The root mean square is commonly used since it considers extreme outliers to a greater extent. The root mean square of two vectors t and y is defined as

Errp = sqrt( ( Σ_{Ω∈O} (tΩ − yΩ)² ) / |O| )   (4.3)

As for offline learning, the total error in the course of one training epoch is interesting and useful, too:

Err = Σ_{p∈P} Errp   (4.4)

Definition 4.13 (Total error). The total error Err is based on all training samples, which means it is generated offline. Analogously, we can generate a total RMS and a total Euclidean distance in the course of a whole epoch. Of course, it is also possible to use other types of error measurement. To get used to further error measurement methods, I suggest having a look into the technical report of Prechelt [Pre94]. In this report, both error measurement methods and sample problems are discussed (this is why there will be a similar suggestion during the discussion of exemplary problems).

SNIPE: There are several static methods representing different methods of error measurement implemented in the class ErrorMeasurement.

Depending on our method of error measurement, our learning curve certainly changes, too.
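The specific error, the Euclidean distance and the root mean square translate directly into code. A sketch (function names are mine), with t and y ranging over the output neurons:

```python
import math

# Error measures between teaching input t and output y over the
# output neurons, following equations (4.1) to (4.3).

def specific_error(t, y):
    # Err_p = 1/2 * sum (t_o - y_o)^2   (used later for backpropagation)
    return 0.5 * sum((ti - yi) ** 2 for ti, yi in zip(t, y))

def euclidean(t, y):
    return math.sqrt(sum((ti - yi) ** 2 for ti, yi in zip(t, y)))

def rms(t, y):
    return math.sqrt(sum((ti - yi) ** 2 for ti, yi in zip(t, y)) / len(t))

t, y = [1.0, 0.0], [0.8, 0.4]
print(specific_error(t, y), euclidean(t, y), rms(t, y))
```

The total error of equation (4.4) is then just the sum of `specific_error` over all training samples of an epoch.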

For example, the representation of the learning curve can be illustrated by means of a logarithmic scale (fig. 4.2 on the following page). A perfect learning curve looks like a negative exponential function, that means it is proportional to e^(−t) (fig. 4.2, top). Thus, with the said scaling combination, a descending line in the logarithmic representation implies an exponential descent of the error (fig. 4.2, second diagram from the bottom).

With the network doing a good job, the problems being not too difficult and the logarithmic representation of Err, you can see — metaphorically speaking — a descending line that often forms "spikes" at the bottom. Here we reach the limit of the 64-bit resolution of our computer, and our network has actually learned the optimum of what it is capable of learning. Typical learning curves can show a few flat areas as well, i.e. they can show some steps, which is no sign of a malfunctioning learning process. As we can also see in fig. 4.2, a well-suited representation can make any slightly decreasing learning curve look good — so just be cautious when reading the literature.

4.4.1 When do we stop learning?

Now, the big question is: when do we stop learning? Generally, the training is stopped when the user in front of the learning computer "thinks" the error was small enough. Indeed, there is no easy answer, and thus I can once again only give you something to think about, which depends on a more objective view on the comparison of several learning curves.

Confidence in the results, for example, is boosted when the network always reaches nearly the same final error rate for different random initializations — so repeated initialization and training will provide a more objective result. On the other hand, it can be possible that a curve descending fast in the beginning can, after a longer time of learning, be overtaken by another curve: this can indicate that either the learning rate of the worse curve was too high, or the worse curve itself simply got stuck in a local minimum but was the first to find it. Remember: larger error values are worse than the small ones.

But, in any case, note: many people only generate a learning curve in respect of the training data (and then they are surprised that only a few things will work) — but for reasons of objectivity and clarity, it should not be forgotten to plot the verification data on a second learning curve, which generally provides values that are slightly worse and with stronger oscillation. But with good generalization, the curve can decrease, too.

When the network eventually begins to memorize the samples, the shape of the learning curve can provide an indication: if the learning curve of the verification samples is suddenly and rapidly rising while the learning curve of the training samples is continuously falling, this could indicate memorizing and a generalization getting poorer and poorer. At this point it could be decided whether the network has already learned well enough at the next point of the two curves, and maybe the final point of learning is to be applied here (this procedure is called early stopping).
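Early stopping can be sketched over a recorded verification learning curve. The error values and the patience parameter below are made up for illustration:

```python
def early_stopping_epoch(verification_errors, patience=2):
    # return the epoch with the best (lowest) verification error,
    # stopping the scan once the error has failed to improve for
    # `patience` consecutive epochs
    best_epoch, best, worse = 0, float("inf"), 0
    for epoch, err in enumerate(verification_errors):
        if err < best:
            best_epoch, best, worse = epoch, err, 0
        else:
            worse += 1
            if worse >= patience:
                break
    return best_epoch

curve = [1.0, 0.6, 0.4, 0.35, 0.37, 0.42, 0.5]  # verification errors
print(early_stopping_epoch(curve))  # 3: the minimum before the rise
```

The patience threshold is one way to tolerate the small oscillations that verification curves typically show, instead of stopping at the first uptick.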

[Figure 4.2: All four illustrations show the same (idealized, because very smooth) learning curve — the error Err plotted against the training epochs in the four combinations of linear and logarithmic scaling. Note the alternating logarithmic and linear scalings! Also note the small "inaccurate spikes" visible in the sharp bend of the curve in the first and second diagram from bottom.]

Once again I want to remind you that all those criteria only act as indicators and not as if-then conclusions.

4.5 Gradient optimization procedures

In order to establish the mathematical basis for some of the following learning procedures, I want to explain briefly what is meant by gradient descent: the backpropagation of error learning procedure, for example, involves this mathematical basis and thus inherits the advantages and disadvantages of the gradient descent. Gradient descent procedures are generally used where we want to maximize or minimize n-dimensional functions. For clarity, the illustration (fig. 4.3 on the next page) shows only two dimensions, but principally there is no limit to the number of dimensions.

The gradient is a vector g that is defined for any differentiable point of a function, that points from this point exactly towards the steepest ascent, and that indicates the degree of this ascent by means of its norm |g|. Thus, the gradient is a generalization of the derivative for multidimensional functions. Accordingly, the negative gradient −g exactly points towards the steepest descent. The gradient operator ∇ is referred to as nabla operator; the overall notation of the gradient g of the point (x, y) of a two-dimensional function f is g(x, y) = ∇f(x, y).

Definition 4.14 (Gradient). Let g be a gradient. Then g is a vector with n components that is defined for any point of a (differentiable) n-dimensional function f(x1, x2, ..., xn). The gradient operator notation is defined as

g(x1, x2, ..., xn) = ∇f(x1, x2, ..., xn).

g directs from any point of f towards the steepest ascent from this point, with |g| corresponding to the degree of this ascent.

Gradient descent means going downhill in small steps from any starting point of our function against the gradient g, i.e. towards −g (which means, vividly speaking, the direction into which a ball would roll from the starting point), with the size of the steps being proportional to |g| (the steeper the descent, the longer the steps). Therefore, we move slowly on a flat plateau, and on a steep slope we run downhill rapidly. If we came into a valley, we would — depending on the size of our steps — jump over it, or we would return into the valley across the opposite hillside in order to come closer and closer to the deepest point of the valley by walking back and forth, similar to a ball moving within a round bowl.
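Gradient descent as just described — steps against g, with length proportional to |g| — can be sketched on a two-dimensional bowl. The function, the step factor eta and the analytic gradient are assumptions chosen for the example:

```python
# Gradient descent on f(x, y) = x^2 + 2*y^2, whose minimum is (0, 0).
# Each step goes against the gradient g, with a length proportional
# to |g| (here: a fixed factor eta times g).

def grad(x, y):
    return (2 * x, 4 * y)                   # analytic gradient of f

def descend(x, y, eta=0.1, steps=100):
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - eta * gx, y - eta * gy   # move against g
    return x, y

x, y = descend(4.0, -3.0)
print(x, y)   # converges towards the minimum (0, 0)
```

Because the step length shrinks with |g|, the iterates slow down as they approach the flat bottom of the bowl — exactly the behaviour described above.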

Definition 4.15 (Gradient descent). Let f be an n-dimensional function and s = (s1, s2, ..., sn) the given starting point. Gradient descent means going from f(s) against the direction of g, i.e. towards −g, with steps of the size of |g|, towards smaller and smaller values of f.

[Figure 4.3: Visualization of the gradient descent on a two-dimensional error function. On the left, the area is shown in 3D, with the steepest descent towards the lowest point; on the right, the steps over the contour lines are shown in 2D. Here it is obvious how a movement is made in the opposite direction of g towards the minimum of the function, continuously slowing down proportionally to |g|. Source: http://webster.fhs-hagenberg.ac.at/staff/sdreisei/Teaching/WS2001-2002/PatternClassification/graddescent.pdf]

Gradient descent procedures are not an errorless optimization method at all (as we will see in the following sections) — however, they still work well on many problems, which makes them an optimization paradigm that is frequently used. Anyway, let us have a look at their potential disadvantages so we can keep them in mind a bit.

4.5.1 Gradient procedures incorporate several problems

As already implied in section 4.5, the gradient descent (and therefore the backpropagation) is promising but not foolproof. One problem is that the result does not always reveal if an error has occurred.

4.5.1.1 Often, gradient descents converge against suboptimal minima

Every gradient descent procedure can, for example, get stuck within a local minimum (part a of fig. 4.4 on the facing page).

Figure 4.4: Possible errors during a gradient descent: a) detecting bad minima, b) quasi-standstill with small gradient, c) oscillation in canyons, d) leaving good minima.

This problem increases proportionally to the size of the error surface, and there is no universal solution. In reality, one cannot know whether the optimal minimum has been reached; a training is considered successful if an acceptable minimum is found.

4.5.1.2 Flat plateaus on the error surface may cause training slowness

When passing a flat plateau, the gradient also becomes negligibly small because there is hardly a descent (part b of fig. 4.4), which requires many further steps. A hypothetically possible gradient of 0 would completely stop the descent.

4.5.1.3 Even if good minima are reached, they may be left afterwards

On the other hand, the gradient is very large at a steep slope, so that large steps can be made and a good minimum can possibly be missed (part d of fig. 4.4).

4.5.1.4 Steep canyons in the error surface may cause oscillations

A sudden alternation from one very strong negative gradient to a very strong positive one can even result in oscillation (part c of fig. 4.4). In nature, such an error does not occur very often, so that we can think about the possibilities b and d.
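The oscillation of part c can be reproduced even on a one-dimensional toy function; the step widths below are illustrative assumptions of mine, not values from the text:

```python
def descend(df, x, eta, steps):
    """Plain gradient descent on a 1-D function with derivative df."""
    for _ in range(steps):
        x = x - eta * df(x)
    return x

# f(x) = x^2 with f'(x) = 2x. A small step width converges towards the
# minimum at 0; a step width that is too large overshoots the "canyon"
# walls on every step and the iterate oscillates with growing amplitude.
df = lambda x: 2 * x
converged = descend(df, 1.0, eta=0.1, steps=50)
oscillating = descend(df, 1.0, eta=1.1, steps=50)
```

With eta = 0.1 the value is multiplied by 0.8 per step and shrinks towards 0; with eta = 1.1 it is multiplied by −1.2 per step, alternating in sign and diverging.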

4.6 Exemplary problems allow for testing self-coded learning strategies

We looked at learning from the formal point of view – not much yet, but a little. Now it is time to look at a few exemplary problems you can later use to test implemented networks and learning rules.

4.6.1 Boolean functions

A popular example is the one that did not work in the nineteen-sixties: the XOR function (B2 → B1). We need a hidden neuron layer, which we have discussed in detail; trivially, we need at least two neurons in the inner layer. Let the activation function in all layers (except in the input layer, of course) be the hyperbolic tangent. Thus, we expect the outputs 1.0 or −1.0, depending on whether the function XOR outputs 1 or 0 – and exactly here is where the first beginner's mistake occurs. For outputs close to 1 or −1, i.e. close to the limits of the hyperbolic tangent (or, in the case of the Fermi function, 0 or 1), we need very large network inputs. The only chance to reach these network inputs are large weights, which have to be learned: the learning effort rapidly increases. Therefore it is wiser to enter the teaching inputs 0.9 or −0.9 as desired outputs, or to be satisfied when the network outputs those values instead of 1 and −1.

Another favourite example for singlelayer perceptrons are the Boolean functions AND and OR.

4.6.2 The parity function

The parity function maps a set of bits to 1 or 0, depending on whether an even number of input bits is set to 1 or not. Basically, this is the function Bn → B1. It is characterized by easy learnability up to approx. n = 3 (shown in table 4.1), but the learning effort rapidly increases from n = 4. The reader may create a score table for the 2-bit parity function. What is conspicuous?

Table 4.1: Illustration of the parity function with three inputs.

i1  i2  i3  |  Ω
 0   0   0  |  1
 0   0   1  |  0
 0   1   0  |  0
 0   1   1  |  1
 1   0   0  |  0
 1   0   1  |  1
 1   1   0  |  1
 1   1   1  |  0

4.6.3 The 2-spiral problem

As a training sample for a function, let us take two spirals coiled into each other (fig. 4.5 on the facing page), the function certainly representing a mapping R2 → B1. The learning process is largely extended.
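The parity training samples are easy to generate programmatically; a small sketch (the function names are mine):

```python
from itertools import product

def parity(bits):
    """Parity as in table 4.1: 1 if an even number of bits is set, else 0."""
    return 1 if sum(bits) % 2 == 0 else 0

def parity_samples(n):
    """All training samples of the n-bit parity problem, B^n -> B^1."""
    return [(bits, parity(bits)) for bits in product((0, 1), repeat=n)]
```

For n = 2 the score table comes out as 1, 0, 0, 1 – the negation of XOR. That is what is conspicuous about the 2-bit parity function.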

Figure 4.5: Illustration of the training samples of the 2-spiral problem.
Figure 4.6: Illustration of training samples for the checkerboard problem.

One of the spirals is assigned to the output value 1, the other spiral to 0. Here, memorizing does not help: the network has to understand the mapping itself. This example can be solved by means of an MLP, too.

4.6.4 The checkerboard problem

We again create a two-dimensional function of the form R2 → B1 and specify checkered training samples (fig. 4.6), with one colored field representing 1 and all the rest of them representing 0. The difficulty increases proportionally to the size of the function: while a 3×3 field is easy to learn, larger fields are more difficult (here we eventually use methods that are more suitable for this kind of problem than the MLP). The 2-spiral problem is very similar to the checkerboard problem, only that, mathematically speaking, the first problem is using polar coordinates instead of Cartesian coordinates.

4.6.5 The identity function

I just want to introduce as an example one last trivial case: the identity. By using linear activation functions, the identity mapping from R1 to R1 (of course only within the parameters of the used activation function) is no problem for the network. But we put some obstacles in its way by using our sigmoid functions, so that it would be difficult for the network to learn the identity.
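The 2-spiral training samples can be generated in polar coordinates and then converted to Cartesian ones. The following sketch uses spiral parameters of my own choosing, since the text does not specify them:

```python
import math

def two_spirals(points_per_spiral=97):
    """Training samples for the 2-spiral problem, R^2 -> B^1. The second
    spiral is the first one rotated by 180 degrees; one spiral is assigned
    the output value 1, the other spiral 0."""
    samples = []
    for i in range(points_per_spiral):
        r = i / points_per_spiral        # radius grows along the spiral
        phi = 3.0 * math.pi * r          # angle grows with the radius
        x, y = r * math.cos(phi), r * math.sin(phi)
        samples.append(((x, y), 1))
        samples.append(((-x, -y), 0))
    return samples
```

Because the samples are naturally described by the polar pair (r, phi), a network working on the Cartesian coordinates has to discover that structure itself, which is exactly what makes the task hard.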

4.6.6 There are lots of other exemplary problems

For lots and lots of further exemplary problems, I want to recommend the technical report written by Prechelt [Pre94], which has also been named in the sections about error measurement procedures.

4.7 The Hebbian learning rule is the basis for most other learning rules

In 1949, Donald O. Hebb formulated the Hebbian rule [Heb49], which is the basis for most of the more complicated learning rules we will discuss in this text. We distinguish between the original form and the more general form, which is a kind of principle for other learning rules. Now it is time to have a look at our first mathematical learning rule.

4.7.1 Original rule

Definition 4.16 (Hebbian rule). "If neuron j receives an input from neuron i and if both neurons are strongly active at the same time, then increase the weight wi,j (i.e. the strength of the connection between i and j)." Mathematically speaking, the rule is:

∆wi,j ∼ η oi aj    (4.5)

with ∆wi,j being the change in weight from i to j, which is proportional to the following factors:

- the output oi of the predecessor neuron i,
- the activation aj of the successor neuron j,
- a constant η, i.e. the learning rate, which will be discussed in more detail later.

The changes in weight ∆wi,j are simply added to the weight wi,j.

Why am I speaking twice about activation, but in the formula I am using oi and aj, i.e. the output of neuron i and the activation of neuron j? Remember that the identity is often used as output function, and therefore ai and oi of a neuron are often the same. Besides, Hebb postulated his rule long before the specification of technical neurons.

Considering that this learning rule was preferred in binary activations, it is clear that with the possible activations (1, 0) the weights will either increase or remain constant. Sooner or later they would go ad infinitum, since they can only be corrected "upwards" when an error occurs. This can be compensated by using the activations (−1, 1) instead². Then the weights are decreased when the activation of the predecessor neuron dissents from the one of the successor neuron; otherwise they are increased. Just try it, for the fun of it.

² But that is no longer the "original version" of the Hebbian rule.
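The original rule is a one-liner in code. A minimal sketch (the names and the learning rate value are mine):

```python
def hebbian_update(w, o_i, a_j, eta=0.1):
    """Original Hebbian rule: the change eta * o_i * a_j is simply
    added onto the weight w_ij."""
    return w + eta * o_i * a_j

# With binary activations (1, 0), the weight only grows or stays constant:
w = 0.0
for o_i, a_j in [(1, 1), (1, 0), (1, 1)]:
    w = hebbian_update(w, o_i, a_j)

# With activations (-1, 1), a dissenting pair decreases the weight instead:
w_dissent = hebbian_update(0.0, -1, 1)
```

Running the first loop forever with co-active neurons shows the "ad infinitum" effect: nothing in the rule ever pushes the weight back down.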

4.7.2 Generalized form

Most of the learning rules discussed before are a specialization of the mathematically more general form [MR86] of the Hebbian rule.

Definition 4.17 (Hebbian rule, more general). The generalized form of the Hebbian rule only specifies the proportionality of the change in weight to the product of two undefined functions, but with defined input values:

∆wi,j = η · h(oi, wi,j) · g(aj, tj)    (4.6)

Thus, the product of the functions g(aj, tj) and h(oi, wi,j), as well as the constant learning rate η, results in the change in weight. Here, h receives the output of the predecessor cell oi as well as the weight from predecessor to successor wi,j, while g expects the actual and desired activation of the successor, aj and tj (here t stands for the aforementioned teaching input). As already mentioned, g and h are not specified in this general definition; we will now return to the path of specialization we discussed before equation 4.6. After we have had a short picture of what a learning rule could look like, and of our thoughts about learning itself, we will be introduced to our first network paradigm including the learning procedure.

Exercises

Exercise 7. Calculate the average value µ and the standard deviation σ for the following data points:

p1 = (2, …)
p2 = (3, …)
p3 = (4, …)
p4 = (6, …)
p5 = (0, …)
p6 = (0, …)


Part II: Supervised learning network paradigms


Chapter 5: The perceptron, backpropagation and its variants

A classic among the neural networks. If we talk about a neural network, then in the majority of cases we speak about a perceptron or a variation of it. Perceptrons are multilayer networks without recurrence and with fixed input and output layers. Description of a perceptron, its limits, and extensions that should avoid the limitations. Derivation of learning procedures and discussion of their problems.

As already mentioned in the history of neural networks, the perceptron was described by Frank Rosenblatt in 1958 [Ros58]. Initially, Rosenblatt defined the already discussed weighted sum and a non-linear activation function as components of the perceptron.

There is no established definition for a perceptron, but most of the time the term is used to describe a feedforward network with shortcut connections. This network has a layer of scanner neurons (retina) with statically weighted connections to the following layer, which is called the input layer. All neurons subordinate to the retina are pattern detectors. The weights of these retina connections are fixed, but the weights of all other layers are allowed to be changed. Here we initially use a binary perceptron, with every output neuron having exactly two possible output values (e.g. {0, 1} or {−1, 1}). Thus, a binary threshold function is used as activation function, depending on the threshold value Θ of the output neuron.

In a way, the binary activation function represents an IF query which can also be negated by means of negative weights. The perceptron can thus be used to accomplish true logical information processing. Whether this method is reasonable is another matter; of course, this is not the easiest way to achieve Boolean logic. I just want to illustrate that perceptrons can be used as simple logical components and that, theoretically speaking, any Boolean function can be realized by means of perceptrons being connected in series or interconnected in a sophisticated way. For the XOR function, for instance, we will see that this is not possible without connecting them serially.

Figure 5.1: Architecture of a perceptron with one layer of variable connections, in different views. The solid-drawn weight layer in the two lower illustrations can be trained.
Top: Example of scanning information in the eye.
Middle: Sketch of the same example with the fixed-weight layer indicated, using the defined functional-description designs for neurons.
Bottom: Without the fixed-weight layer indicated; the name of each neuron corresponds to our convention. The fixed-weight layer will no longer be taken into account in the course of this work.

Before providing the definition of the perceptron, I want to define some types of neurons used in this chapter.

Definition 5.1 (Input neuron). An input neuron is an identity neuron: it exactly forwards the information received, i.e. it represents the identity function. Therefore, the input neuron is depicted by a symbol containing the identity sign.

Definition 5.2 (Information processing neuron). Information processing neurons somehow process the input information, i.e. do not represent the identity function. A binary neuron sums up all inputs by using the weighted sum as propagation function, which we want to illustrate by the sign Σ. Then the activation function of the neuron is the binary threshold function, which can be illustrated by L|H. This leads us to the complete depiction of information processing neurons, namely a symbol combining Σ and L|H. Other neurons that use the weighted sum as propagation function but the activation functions hyperbolic tangent or Fermi function, or a separately defined activation function fact, are represented analogously, with Tanh, Fermi or fact written in place of the threshold symbol. These neurons are also referred to as Fermi neurons or Tanh neurons.

Now that we know the components of a perceptron, we should be able to define it.

Definition 5.3 (Perceptron). The perceptron (fig. 5.1 on the facing page) is¹ a feedforward network containing a retina that is used only for data acquisition and which has fixed-weighted connections with the first neuron layer (input layer). The fixed-weight layer is followed by at least one trainable weight layer. One neuron layer is completely linked with the following layer. The first layer of the perceptron consists of the input neurons defined above. A feedforward network often contains shortcuts, which does not exactly correspond to the original description and therefore is not included in the definition.

We can see that the retina is not included in the lower part of fig. 5.1. As a matter of fact, the first neuron layer is often understood (simplified and sufficient for this method) as the input layer, because this layer only forwards the input values. The retina itself and the static weights behind it are no longer mentioned or displayed, since they do not process information in any case. So, the depiction of a perceptron starts with the input neurons.

¹ It may confuse some readers that I claim that there is no definition of a perceptron but then define the perceptron in the following section. I therefore suggest keeping my definition in the back of your mind and just taking it for granted in the course of this work.
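Such an information processing neuron, with the weighted sum as propagation function and the binary threshold function as activation function, can be sketched in a few lines. The threshold value 1.5 is the one used for the AND perceptron later in this chapter; the function name is mine:

```python
def neuron_output(inputs, weights, theta):
    """Weighted sum as propagation function, binary threshold function
    with threshold value theta as activation function."""
    net = sum(o * w for o, w in zip(inputs, weights))
    return 1 if net >= theta else 0

# A single binary neuron realizing AND: both weights 1, threshold 1.5.
and_outputs = [neuron_output([a, b], [1, 1], 1.5)
               for a in (0, 1) for b in (0, 1)]
```

Lowering the threshold to 0.5 turns the same neuron into an OR, illustrating the "IF query" reading of the binary activation function.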

Certainly, the existence of several output neurons Ω1, Ω2, ..., Ωn does not considerably change the concept of the perceptron (fig. 5.3): a perceptron with several output neurons can also be regarded as several different perceptrons with the same input.

Figure 5.2: A singlelayer perceptron with two input neurons and one output neuron. The network returns the output by means of the arrow leaving the network. The trainable layer of weights is situated in the center (labeled). As a reminder, the bias neuron is again included here. Although the weight wBIAS,Ω is a normal weight and is also treated like one, I have represented it by a dotted line, which significantly increases the clarity of larger networks. In future, the bias neuron will no longer be included.

Figure 5.3: A singlelayer perceptron with several output neurons Ω1, Ω2, Ω3.

SNIPE: The methods setSettingsTopologyFeedForward and the variation -WithShortcuts in a NeuralNetworkDescriptor instance apply settings to a descriptor which are appropriate for feedforward networks or feedforward networks with shortcuts, respectively. The respective kinds of connections are allowed, all others are not, and fastprop is activated.

5.1 The singlelayer perceptron provides only one trainable weight layer

Here, connections with trainable weights go from the input layer to an output neuron Ω, which returns the information whether the pattern entered at the input neurons was recognized or not. Thus, a singlelayer perceptron (abbreviated SLP) has only one level of trainable weights (fig. 5.1 on page 72).

Definition 5.4 (Singlelayer perceptron). A singlelayer perceptron (SLP) is a perceptron having only one layer of variable weights and one layer of output neurons Ω. The technical view of an SLP is shown in fig. 5.2.

The Boolean functions AND and OR shown in fig. 5.4 are trivial examples that can easily be composed.

Figure 5.4: Two singlelayer perceptrons for Boolean functions. Where available, the threshold values are written into the neurons. The upper singlelayer perceptron realizes an AND, the lower one realizes an OR. The activation function of the information processing neuron is the binary threshold function.

Now we want to know how to train a singlelayer perceptron. We will therefore at first take a look at the perceptron learning algorithm, and then we will look at the delta rule.

5.1.1 Perceptron learning algorithm and convergence theorem

The original perceptron learning algorithm with binary neuron activation function is described in alg. 1. It has been proven that the algorithm converges in finite time; so in finite time, the perceptron can learn anything it can represent (perceptron convergence theorem, [Ros62]). But please do not get your hopes up too soon! What the perceptron is capable of representing will be explored later. During the exploration of linear separability of problems, we will cover the fact that at least the singlelayer perceptron unfortunately cannot represent a lot of problems.

5.1.2 The delta rule as a gradient based learning strategy for SLPs

In the following, we deviate from our binary threshold value as activation function, because at least for backpropagation of error we need a differentiable or even a semi-linear activation function. For the now following delta rule (which, like backpropagation, is derived in [MR86]) it is not always necessary, but useful. This fact, however, will also be pointed out in the appropriate part of this work. Compared with the aforementioned perceptron learning algorithm, the delta rule has the advantage of being suitable for non-binary activation functions and, when being far away from the learning target, of speeding up the learning, as you will see.

1: while ∃p ∈ P and error too large do
2:   Input p into the network, calculate output y  {P set of training patterns}
3:   for all output neurons Ω do
4:     if yΩ = tΩ then
5:       Output is okay, no correction of weights
6:     else
7:       if yΩ = 0 then
8:         for all input neurons i do
9:           wi,Ω := wi,Ω + oi  {...increase weight towards Ω by oi}
10:        end for
11:      end if
12:      if yΩ = 1 then
13:        for all input neurons i do
14:          wi,Ω := wi,Ω − oi  {...decrease weight towards Ω by oi}
15:        end for
16:      end if
17:    end if
18:  end for
19: end while

Algorithm 1: Perceptron learning algorithm. The perceptron learning algorithm reduces the weights to output neurons that return 1 instead of 0, and, in the inverse case, increases the weights.
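Algorithm 1 translates directly into runnable code. The sketch below handles the threshold via an extra bias weight fed by a constant input of 1 (my own convenience, not part of the pseudocode):

```python
def train_perceptron(patterns, n_inputs, max_epochs=100):
    """Perceptron learning algorithm (alg. 1): while some pattern is
    misclassified, add the input vector to the weights when the output
    is 0 instead of 1, and subtract it when the output is 1 instead of 0."""
    w = [0.0] * (n_inputs + 1)                 # last weight plays the bias
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in patterns:
            o = list(x) + [1.0]                # inputs plus constant bias input
            y = 1 if sum(oi * wi for oi, wi in zip(o, w)) >= 0 else 0
            if y == 0 and t == 1:              # increase weights by o_i
                w = [wi + oi for wi, oi in zip(w, o)]
                mistakes += 1
            elif y == 1 and t == 0:            # decrease weights by o_i
                w = [wi - oi for wi, oi in zip(w, o)]
                mistakes += 1
        if mistakes == 0:                      # converged in finite time
            break
    return w

# AND is linearly separable, so the algorithm converges:
and_patterns = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w_and = train_perceptron(and_patterns, n_inputs=2)
```

On the AND patterns this stops after a handful of epochs, in line with the convergence theorem; on XOR it would simply exhaust max_epochs.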

Suppose that we have a singlelayer perceptron with randomly set weights which we want to teach a function by means of training samples. The set of these training samples is called P. It contains, as already defined, the pairs (p, t) of the training samples p and the associated teaching input t. I also want to remind you that x is the input vector and y is the output vector of a neural network. Output neurons are referred to as Ω1, Ω2, ..., Ω|O|. Furthermore, let O be the set of output neurons and I be the set of input neurons. Another naming convention shall be that, for example, for an output o and a teaching input t, an additional index p may be set in order to indicate that these values are pattern-specific. Sometimes this will considerably enhance clarity.

Now our learning target will certainly be that for all training samples the output y of the network is approximately the desired output t, i.e. formally it is true that

∀p : y ≈ t  or  ∀p : Ep ≈ 0.

Here, the error vector Ep represents the difference (t − y) under a certain training sample p. This means we first have to understand the total error Err as a function of the weights: the total error increases or decreases depending on how we change the weights.

Definition 5.5 (Error function). The error function

Err : W → R

regards the set² of weights W as a vector and maps the values onto the normalized output error (normalized because otherwise not all errors can be mapped onto one single e ∈ R to perform a gradient descent). It is obvious that a specific error function Errp(W) can analogously be generated for a single pattern p.

² Following the tradition of the literature, I previously defined W as a weight matrix. I am aware of this conflict but it should not bother us here.

As already shown in section 4.5, gradient descent procedures calculate the gradient of an arbitrary but finite-dimensional function (here: of the error function Err(W)) and move down against the direction of the gradient until a minimum is reached. So we try to decrease or to minimize the error by simply tweaking the weights; thus one receives information about how to change the weights.

The change in all weights is referred to as ∆W and is obtained by calculating the gradient ∇Err(W) of the error function Err(W):

∆W ∼ −∇Err(W).    (5.1)

Due to this relation there is a proportionality constant η for which equality holds (η will soon get another meaning and a real practical use beyond the mere meaning of a proportionality constant; I just ask the reader to be patient for a while):

∆W = −η∇Err(W).    (5.2)

To simplify further analysis, we now rewrite the gradient of the error function according to all weights as a usual partial derivative according to a single weight wi,Ω (the only variable weights exist between the hidden and the output layer Ω). Thus, we tweak every single weight and observe how the error function changes, i.e. we derive the error function according to a weight wi,Ω and obtain the value ∆wi,Ω of how to change this weight:

∆wi,Ω = −η ∂Err(W) / ∂wi,Ω.    (5.3)

Now the following question arises: how is our error function defined exactly? It is not good if many results are far away from the desired ones; it is similarly bad if many results are close to the desired ones but there exists an extremely far outlying result. The squared distance between the output vector y and the teaching input t appears adequate to our needs. It provides the error Errp that is specific for a training sample p over the output of all output neurons Ω:

Errp(W) = (1/2) Σ_{Ω∈O} (tp,Ω − yp,Ω)².    (5.4)

Thus, we calculate the squared difference of the components of the vectors t and y, given the pattern p, and sum up these squares.

Figure 5.5: Exemplary error surface of a neural network with two trainable connections w1 and w2. Generally, neural networks have more than two connections, but this would have made the illustration too complex. And most of the time the error surface is too craggy, which complicates the search for the minimum.
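Equation 5.4 is straightforward to implement; a small sketch (the function name is mine):

```python
def err_p(t, y):
    """Specific error Err_p(W) = 1/2 * sum over Omega of
    (t_Omega - y_Omega)^2, i.e. half the squared distance between
    teaching input vector t and output vector y."""
    return 0.5 * sum((t_o - y_o) ** 2 for t_o, y_o in zip(t, y))

e = err_p([1.0, 0.0], [0.5, 0.5])   # 0.5 * (0.25 + 0.25) = 0.25
```

Note that one large outlier contributes quadratically, which is exactly the property motivated in the text.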
Kriesel – A Brief Introduction to Neural Networks (ZETA2-EN) . i. given the pattern p. And most of the time the error surface is too craggy.3) weights is referred to as ∆W ) by calculating the gradient ∇Err(W ) of the error function Err(W ): ∆W ∼ −∇Err(W ).Ω − yp.

since we do not need it for minimization. (5. This derivative corresponds to the sum of the derivatives of all speciﬁc errors Errp according to this weight (since the total error Err(W ) Once again I want to think about the question of how a neural network processes data. as this formula looks very similar to the Euclidean distance. 2 p∈P Ω∈O sum over all Ω ∂ Err(W ) ∂wi.9) = fact (oi1 · wi1 .5) ∆wi. the path of the neuron outputs oi1 and oi2 .Ω ). initially is the propagation function (here weighted sum).Ω + oi2 · wi2 . If we ignore the output function.8) (5. Kriesel – A Brief Introduction to Neural Networks (ZETA2-EN) 79 . this output results from many nested functions: oΩ = fact (netΩ ) (5.Ω p∈P (5. Now we want to continue deriving the delta rule for linear activation functions. and so on. Similarly. Basically.Ω − yp. Because the root function decreases with its argument.10) It is clear that we could break down the output into the single input neurons (this is unnecessary here.4 on the preceding page suddenly came from and why there is no root in the equation. the result of the function is sent through another one.com 5.Ω )2 . We have already discussed that we tweak the individual weights wi.7) (5. As we can see. ∂wi.Ω ∂ Errp (W ) = −η . which the neurons i1 and i2 entered into a neuron Ω. This is just done so that it cancels with a 2 in the course of our calculation. from which the network input is going to be received.1 The singlelayer perceptron fore the deﬁnition of the error function results from the sum of the speciﬁc erErr(W ): rors): Err(W ) = p∈ P sum over all p Errp (W ) (5.Ω = −η 1 = (tp. since they do not D. we can simply omit it for reasons of calculation and implementation eﬀorts.Ω a bit and see how the error Err(W ) is changing – which corresponds to the derivative of the error function Err(W ) according to the very same weight wi. 
This is then sent through the activation function of the neuron Ω so that we receive the output of this neuron which is at the same time a component of the output vector y : netΩ → fact = fact (netΩ ) = oΩ = yΩ .6) The observant reader will certainly wonder 1 where the factor 2 in equation 5.dkriesel. it does not matter if the term to be minimized is divided by 2: Therefore I am al1 lowed to multiply by 2 .Ω . Both facts result from simple pragmatics: Our intention is to minimize the error. the data is only transferred through a function.

We now want to continue deriving the delta rule for linear activation functions, i.e. we want to calculate the derivative ∂Errp(W)/∂wi,Ω that appears in equation 5.6. Due to the nested functions seen above, we can apply the chain rule to factorize it:

∂Errp(W)/∂wi,Ω = ∂Errp(W)/∂op,Ω · ∂op,Ω/∂wi,Ω.    (5.11)

Let us take a look at the first multiplicative factor of the above equation 5.11, which represents the derivative of the specific error Errp(W) according to the output, i.e. the change of the error Errp with an output op,Ω. The examination of Errp (equation 5.4) clearly shows that this change is exactly the difference between teaching input and output, (tp,Ω − op,Ω) (remember: since Ω is an output neuron, op,Ω = yp,Ω, and thus we can replace one by the other):

∂Errp(W)/∂op,Ω = −(tp,Ω − op,Ω).    (5.12)

The closer the output is to the teaching input, the smaller is the specific error. This difference is also called δp,Ω (which is the reason for the name delta rule), and therefore:

∂Errp(W)/∂op,Ω = −δp,Ω.    (5.13)

The second multiplicative factor of equation 5.11 and of the following one is the derivative of the output, specific to the pattern p, of the neuron Ω according to the weight wi,Ω. So how does op,Ω change when the weight from i to Ω is changed? Due to the requirement at the beginning of the derivation, we only have a linear activation function fact; therefore we can just as well look at the change of the network input when wi,Ω is changed:

∂op,Ω/∂wi,Ω = ∂ (Σ_{i∈I} op,i wi,Ω) / ∂wi,Ω.    (5.14)

The resulting derivative can now be simplified: the function Σ_{i∈I} (op,i wi,Ω) to be derived consists of many summands, and only the summand op,i wi,Ω contains the variable wi,Ω, according to which we derive. Thus:

∂ (Σ_{i∈I} op,i wi,Ω) / ∂wi,Ω = op,i.    (5.15)

We insert the first and the second multiplicative factor into equation 5.11 and obtain

∂Errp(W)/∂wi,Ω = −op,i · δp,Ω,    (5.16)

which results in our modification rule for a weight wi,Ω:

∆wi,Ω = η · Σ_{p∈P} op,i · δp,Ω.    (5.17)

However, from the very beginning the derivation has been intended as an offline rule, by means of the question of how to add the errors of all patterns and how to learn them after all patterns have been represented. Although this approach is mathematically correct, the implementation is far more time-consuming and, as we will see later in this chapter, partially needs a lot of computational effort during training.

In the case of the delta rule, the change of all weights to an output neuron Ω is proportional to the difference between the current activation or output, aΩ or oΩ, and the corresponding teaching input tΩ. We want to refer to this factor as δΩ, which is also referred to as "Delta". Apparently the delta rule only applies to SLPs, since the formula is always related to the teaching input, and there is no teaching input for the inner processing layers of neurons.

Summing the changes over all training patterns needs a lot of computational effort during training. The "online-learning version" of the delta rule simply omits this summation, and learning is realized immediately after the presentation of each pattern. This also simplifies the notation, which is no longer necessarily related to a pattern p:

    ∆wi,Ω = η · oi · δΩ      (5.18)

This version of the delta rule shall be used for the following definition:

Definition 5.6 (Delta rule). If we determine, analogously to the aforementioned derivation, that the function h of the Hebbian theory (equation 4.6 on page 67) only provides the output oi of the predecessor neuron i, and that the function g is the difference between the desired activation tΩ and the actual activation aΩ, we receive the delta rule, also known as Widrow-Hoff rule:

    ∆wi,Ω = η · oi · (tΩ − aΩ) = η oi δΩ      (5.19)

If we use the desired output (instead of the activation) as teaching input (which is relevant if the output function of the output neurons does not represent an identity), we obtain

    ∆wi,Ω = η · oi · (tΩ − oΩ) = η oi δΩ,      (5.20)

and δΩ then corresponds to the difference between tΩ and oΩ.

5.2 A SLP is only capable of representing linearly separable data

Let f be the XOR function, which expects two binary inputs and generates a binary output (for the precise definition see table 5.1).

Table 5.1: Definition of the logical XOR. The input values are shown on the left, the output values on the right.

    In. 1   In. 2   Output
    0       0       0
    0       1       1
    1       0       1
    1       1       0

Let us try to represent the XOR function by means of an SLP with two input neurons i1, i2 and one output neuron Ω (fig. 5.6 on the following page).
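As an aside, the online delta rule just defined can be sketched in a few lines of code. The following is my own illustration, not taken from the book or from Snipe; it trains a single-layer threshold perceptron on the linearly separable OR function (XOR, as shown next, would not converge), with the bias realized as an always-on third input.

```python
# Hypothetical sketch: online delta rule  Δw_{i,Ω} = η · o_i · (t_Ω − o_Ω)
# for a single threshold neuron. All names and constants here are my own
# choices for illustration, not the book's.

def train_delta_rule(patterns, eta=0.1, epochs=100):
    w = [0.0, 0.0, 0.0]            # two input weights plus a bias weight
    for _ in range(epochs):
        for (x1, x2), t in patterns:
            # binary threshold activation, threshold folded into the bias
            o = 1 if x1 * w[0] + x2 * w[1] + w[2] >= 0 else 0
            delta = t - o          # δ_Ω = t_Ω − o_Ω
            w[0] += eta * x1 * delta
            w[1] += eta * x2 * delta
            w[2] += eta * 1 * delta   # bias input is constantly 1
    return w

or_patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_delta_rule(or_patterns)
outputs = [1 if x1 * w[0] + x2 * w[1] + w[2] >= 0 else 0
           for (x1, x2), _ in or_patterns]
print(outputs)  # [0, 1, 1, 1] -- the OR function has been learned
```

Since OR is linearly separable, the updates die out once a separating line is found; the same loop run on the XOR patterns would oscillate forever.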

Figure 5.6: Sketch of a singlelayer perceptron that shall represent the XOR function.

Here we use the weighted sum as propagation function, a binary activation function with the threshold value Θ and the identity as output function. Depending on i1 and i2, Ω has to output the value 1 if the following holds:

    netΩ = oi1 wi1,Ω + oi2 wi2,Ω ≥ ΘΩ      (5.21)

We assume a positive weight wi1,Ω; the inequality 5.21 is then equivalent to

    oi1 ≥ (1 / wi1,Ω) · (ΘΩ − oi2 wi2,Ω)      (5.22)

With a constant threshold value ΘΩ, the right part of inequality 5.22 is a straight line through a coordinate system defined by the possible outputs oi1 and oi2 of the input neurons i1 and i2 (fig. 5.7). For a positive wi1,Ω the output neuron Ω fires for input combinations lying above the generated straight line; for a negative wi1,Ω it would fire for all input combinations lying below the straight line.

Figure 5.7: Linear separation of n = 2 inputs of the input neurons i1 and i2 by a 1-dimensional straight line. A and B show the corners belonging to the sets of the XOR function that are to be separated.

Note that only the four corners of the unit square are possible inputs, because the XOR function only knows binary inputs. In order to solve the XOR problem, we would have to turn and move the straight line so that the input set A = {(0, 0), (1, 1)} is separated from the input set B = {(0, 1), (1, 0)}; this is obviously impossible.

Generally, the input parameters of n many input neurons can be represented in an n-dimensional cube, which is separated by an SLP through an (n − 1)-dimensional hyperplane (fig. 5.8). Only sets that can be separated by such a hyperplane, i.e. which are linearly separable, can be classified by an SLP.

Figure 5.8: Linear separation of n = 3 inputs from input neurons i1, i2 and i3 by a 2-dimensional plane.

Unfortunately, it seems that the percentage of the linearly separable problems rapidly decreases with increasing n (see table 5.2), which limits the functionality of the SLP. Additionally, tests for linear separability are difficult. Thus, for more difficult tasks with more inputs we need something more powerful than the SLP. The XOR problem itself is one of these tasks, since a perceptron that is supposed to represent the XOR function already needs a hidden layer (fig. 5.9 on the next page).

Table 5.2: Number of functions concerning n binary inputs, and number and proportion of the functions thereof which can be linearly separated. In accordance with [Zel94, Wid89, Was89].

    n    number of binary functions    lin. separable ones    share
    1    4                             4                      100%
    2    16                            14                     87.5%
    3    256                           104                    40.6%
    4    65,536                        1,772                  2.7%
    5    4.3 · 10^9                    94,572                 0.002%
    6    1.8 · 10^19                   5,028,134              ≈ 0%
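The impossibility for XOR can also be made tangible by brute force. The following sketch is my own illustration (not from the book): it searches a coarse grid of weights and thresholds for a single threshold neuron o = 1 iff w1·x1 + w2·x2 ≥ Θ. A solution is found for OR and AND, but none for XOR, in line with table 5.2.

```python
# Hypothetical demonstration: grid search over (w1, w2, Θ) for one threshold
# neuron on the four corners of the unit square. The grid and targets are
# my own choices; the negative result for XOR holds for all real weights.

from itertools import product

def separable(targets, grid=None):
    """Return True if some (w1, w2, theta) on the grid realizes the mapping."""
    grid = grid or [x / 2 for x in range(-8, 9)]   # -4.0 ... 4.0, step 0.5
    corners = [(0, 0), (0, 1), (1, 0), (1, 1)]
    for w1, w2, theta in product(grid, repeat=3):
        if all((1 if w1 * x1 + w2 * x2 >= theta else 0) == t
               for (x1, x2), t in zip(corners, targets)):
            return True
    return False

print(separable([0, 1, 1, 1]))  # OR  -> True
print(separable([0, 0, 0, 1]))  # AND -> True
print(separable([0, 1, 1, 0]))  # XOR -> False
```

The grid search is only a demonstration, but the argument behind the XOR failure is exact: firing on (0,1) and (1,0) forces w1 ≥ Θ and w2 ≥ Θ with Θ > 0, so w1 + w2 ≥ 2Θ > Θ, and the neuron would also fire on (1,1).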

5.3 A multilayer perceptron contains more trainable weight layers

A perceptron with two or more trainable weight layers (called multilayer perceptron or MLP) is more powerful than an SLP. As we know, a singlelayer perceptron can divide the input space by means of a hyperplane (in a two-dimensional input space by means of a straight line). A two-stage perceptron (two trainable weight layers, three neuron layers) can classify convex polygons by further processing these straight lines, e.g. in the form "recognize patterns lying above straight line 1, below straight line 2 and below straight line 3". Thus, metaphorically speaking, we took an SLP with several output neurons and "attached" another SLP (upper part of fig. 5.10 on the next page). Another trainable weight layer proceeds analogously, now with the convex polygons: those can be added, subtracted or somehow processed with other operations (lower part of fig. 5.10).

A multilayer perceptron represents an universal function approximator, which is proven by the Theorem of Cybenko [Cyb89]: it can be mathematically proven that even a multilayer perceptron with one layer of hidden neurons can arbitrarily precisely approximate functions with only finitely many discontinuities, as well as their first derivatives. Unfortunately, this proof is not constructive, and therefore it is left to us to find the correct number of neurons and weights.

Definition 5.7 (Multilayer perceptron). Perceptrons with more than one layer of variably weighted connections are referred to as multilayer perceptrons (MLP). An n-layer or n-stage perceptron has thereby exactly n variable weight layers and n + 1 neuron layers (the retina is disregarded here), with neuron layer 1 being the input layer.

Figure 5.9: Neural network realizing the XOR function with one layer of hidden neurons. Threshold values (as far as they are existing) are located within the neurons.

In the following we want to use a widespread abbreviated form for different multilayer perceptrons: we denote a two-stage perceptron with 5 neurons in the input layer, 3 neurons in the hidden layer and 4 neurons in the output layer as a 5-3-4-MLP.
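To make this concrete, here is a minimal sketch (my own illustration, with hand-picked weights, not the exact network of fig. 5.9) of a two-stage perceptron that realizes XOR with one hidden layer of two threshold neurons: h1 detects "at least one input on" (OR), h2 detects "both inputs on" (AND), and the output neuron fires for "h1 and not h2".

```python
# Hypothetical 2-2-1 threshold MLP for XOR. Weights and thresholds are
# hand-picked for illustration; many other choices work as well.

def step(net, theta):
    return 1 if net >= theta else 0

def xor_mlp(x1, x2):
    h1 = step(x1 + x2, 1.0)          # OR neuron
    h2 = step(x1 + x2, 2.0)          # AND neuron
    return step(h1 - 2.0 * h2, 1.0)  # fires iff h1 = 1 and h2 = 0

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_mlp(x1, x2))   # prints the XOR truth table 0, 1, 1, 0
```

The hidden layer turns the single separating line of the SLP into two lines whose combination carves out exactly the corners (0,1) and (1,0).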

Figure 5.10: We know that an SLP represents a straight line. With 2 trainable weight layers, several straight lines can be combined to form convex polygons (above). By using 3 trainable weight layers, several polygons can be formed into arbitrary sets (below).

Since three-stage perceptrons can classify sets of any form by combining and separating arbitrarily many convex polygons, another step will not be advantageous with respect to function representations. You can find a summary of which perceptrons can classify which types of sets in table 5.3.

Table 5.3: Representation of which perceptron can classify which types of sets, with n being the number of trainable weight layers.

    n    classifiable sets
    1    hyperplane
    2    convex polygon
    3    any set
    4    any set as well

Be cautious when reading the literature: there are many different definitions of what is counted as a layer. Some sources count the neuron layers, some count the weight layers. Some sources include the retina, some exclude (for some reason) the output neuron layer. In this work, I chose the definition that provides, in my opinion, the most information about the learning capabilities, and I will use it consistently. Remember: an n-stage perceptron has exactly n trainable weight layers.

5.4 Backpropagation of error generalizes the delta rule to allow for MLP training

We now want to face the challenge of training perceptrons with more than one weight layer. Next, I want to derive and explain the backpropagation of error learning rule (abbreviated: backpropagation, backprop or BP), which can be used to train multistage perceptrons with semi-linear (3) activation functions. Binary threshold functions and other non-differentiable functions are no longer supported, but that doesn't matter: we have seen that the Fermi function or the hyperbolic tangent can arbitrarily approximate the binary threshold function by means of a temperature parameter T. Once again I want to point out that this procedure had previously been published by Paul Werbos in [Wer74], but had considerably less readers than in [MR86]. To a large extent I will follow the derivation according to [Zel94] and [MR86].

Backpropagation is a gradient descent procedure (including all strengths and weaknesses of the gradient descent) with the error function Err(W) receiving all n weights as arguments (fig. 5.5 on page 78) and assigning them to the output error, i.e. being n-dimensional. On Err(W) a point of small error or even a point of the smallest error is sought by means of the gradient descent. Thus, in analogy to the delta rule, backpropagation trains the weights of the neural network. And it is exactly the delta rule or its variable δi for a neuron i which is expanded from one trainable weight layer to several ones by backpropagation.

(3) Semilinear functions are monotonous and differentiable, but generally they are not linear.

5.4.1 The derivation is similar to the one of the delta rule, but with a generalized δ

As already indicated, I will not discuss this derivation in great detail; the principle, however, is similar to that of the delta rule (the differences lie, as already mentioned, in the generalized δ). Since this is a generalization of the delta rule, we use the same formula framework as with the delta rule (equation 5.20 on page 81), and we have to generalize the variable δ for every neuron.

Let us define in advance that the network input of the individual neurons i results from the weighted sum. Furthermore, as already mentioned, let op,i, netp,i etc. be defined as the already familiar oi, neti etc. under the input pattern p we used for the training. Let the output function be the identity again, thus oi = fact(netp,i) holds for any neuron i.

First of all: where is the neuron for which we want to calculate δ? It is obvious to select an arbitrary inner neuron h having a set K of predecessor neurons k as well as a set L of successor neurons l, which are also inner neurons (see fig. 5.11). It is thereby irrelevant whether the predecessor neurons are already the input neurons.

Figure 5.11: Illustration of the position of our neuron h within the neural network. It is lying in layer H; the preceding layer is K, the subsequent layer is L.

We initially derive the error function Err according to a weight wk,h. Due to the nested functions we can apply the chain rule:

    ∂Err(wk,h)/∂wk,h = ∂Err/∂neth · ∂neth/∂wk,h      (5.23)

The first factor of equation 5.23 is −δh, which we will deal with later in this text.

Now we perform the same derivation as for the delta rule, and split functions by means of the chain rule. The numerator of the second factor of equation 5.23 includes the network input netp,h, i.e. the weighted sum, so that we can immediately derive it. Again, all summands of the sum drop out apart from the summand containing wk,h. This summand is referred to as wk,h · ok. If we calculate the derivative, the result is the output of neuron k:

    ∂neth/∂wk,h = ∂(Σ_{k∈K} wk,h · ok)/∂wk,h = ok      (5.24)

As promised, we will now discuss the −δh of equation 5.23, which is split up again according to the chain rule:

    δh = −∂Err/∂neth = −∂Err/∂oh · ∂oh/∂neth      (5.25)

The derivation of the output according to the network input (the second factor in equation 5.25) clearly equals the derivation of the activation function according to the network input:

    ∂oh/∂neth = ∂fact(neth)/∂neth = f'act(neth)      (5.26)

Consider this an important passage! We now analogously derive the first factor in equation 5.25. For this we have to point out that the derivative of the error function according to the output of an inner neuron layer depends on the vector of all network inputs of the next following layer:

    −∂Err/∂oh = −∂Err(netl1, ..., netl|L|)/∂oh      (5.27)

According to the definition of the multidimensional chain rule, we immediately obtain:

    −∂Err/∂oh = Σ_{l∈L} ( −∂Err/∂netl · ∂netl/∂oh )      (5.28)

The sum in equation 5.28 contains two factors, which are added over the subsequent layer L. We now simply calculate the second of these factors:

    ∂netl/∂oh = ∂(Σ_{h∈H} wh,l · oh)/∂oh = wh,l      (5.29)

The same applies for the first factor, according to the definition of our δ:

    −∂Err/∂netl = δl      (5.30)

Now we insert:

    −∂Err/∂oh = Σ_{l∈L} δl · wh,l      (5.31)

The reader might already have noticed that some intermediate results were shown in frames: exactly those intermediate results are a factor in the change in weight of wk,h. You can find a graphic version of the δ generalization including all splittings in fig. 5.12 on the facing page.

Figure 5.12: Graphical representation of the equations (by equal signs) and chain rule splittings (by arrows) in the framework of the backpropagation derivation. The leaves of the tree reflect the final results from the generalization of δ, which are framed in the derivation.

If the aforementioned equations are combined with the highlighted intermediate results, the outcome is the wanted change in weight ∆wk,h, the generalization of the delta rule, called backpropagation of error:

    ∆wk,h = η ok δh, with
    δh = f'act(neth) · (th − yh)              (h outside)      (5.36)
    δh = f'act(neth) · Σ_{l∈L} (δl wh,l)      (h inside)       (5.37)

In contrast to the delta rule, δ is treated differently depending on whether h is an output or an inner (i.e. hidden) neuron:

1. If h is an output neuron, then, according to our training pattern p,

       δp,h = f'act(netp,h) · (tp,h − yp,h)      (5.38)

   Thus, under our training pattern p, the weight wk,h from k to h is changed proportionally according to

   - the learning rate η,
   - the output op,k of the predecessor neuron k,
   - the gradient of the activation function at the position of the network input of the successor neuron, f'act(netp,h), and
   - the difference between teaching input tp,h and output yp,h of the successor neuron h.

   In this case, backpropagation is working on two neuron layers, the output layer with the successor neuron h and the preceding layer with the predecessor neuron k. The case of h being an output neuron has already been discussed during the derivation of the delta rule.

2. If h is an inner, hidden neuron, then

       δp,h = f'act(netp,h) · Σ_{l∈L} (δp,l · wh,l)      (5.39)

   Here I want to explicitly mention that backpropagation is now working on three layers: the neuron k is the predecessor of the connection to be changed with the weight wk,h, the neuron h is the successor of the connection to be changed, and the neurons l are lying in the layer following the successor neuron. Thus, under our training pattern p, the weight wk,h from k to h is changed proportionally according to

   - the learning rate η,
   - the output of the predecessor neuron op,k,
   - the gradient of the activation function at the position of the network input of the successor neuron, f'act(netp,h), as well as
   - the weighted sum of the changes in weight to all neurons following h, Σ_{l∈L} (δp,l · wh,l).

   This of course only holds for h being an inner neuron, since otherwise there would not be a subsequent layer L.
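The two cases above can be transcribed directly into code. The following is a compact sketch of my own (not taken from the book or from Snipe): a 2-2-1 MLP with Fermi activation, the output δ using f'act(net)·(t − y), the hidden δ summing the weighted output δ, and a numerical gradient check confirming the analytic result. Network shape and weights are made up for illustration.

```python
# Hypothetical sketch of the backpropagation deltas for a tiny 2-2-1 MLP.
import math

def fermi(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w, x):
    net_h = [w["ih"][0][j] * x[0] + w["ih"][1][j] * x[1] for j in range(2)]
    o_h = [fermi(n) for n in net_h]
    net_o = o_h[0] * w["ho"][0] + o_h[1] * w["ho"][1]
    return net_h, o_h, net_o, fermi(net_o)

def backprop_deltas(w, x, t):
    net_h, o_h, net_o, y = forward(w, x)
    # delta for the output neuron: f'act(net) * (t - y)
    d_out = fermi(net_o) * (1 - fermi(net_o)) * (t - y)
    # deltas for hidden neurons: f'act(net) * sum of weighted output deltas
    d_hid = [fermi(net_h[j]) * (1 - fermi(net_h[j])) * d_out * w["ho"][j]
             for j in range(2)]
    return d_out, d_hid, o_h

w = {"ih": [[0.4, -0.3], [0.2, 0.6]], "ho": [0.7, -0.5]}
x, t = (1.0, 0.0), 1.0
d_out, d_hid, o_h = backprop_deltas(w, x, t)

# Numerical check: with Err = 0.5 (t - y)^2, dErr/dw_ho[0] must equal
# -o_h[0] * d_out, i.e. the negative of the backprop change direction.
eps = 1e-6
w_plus = {"ih": w["ih"], "ho": [w["ho"][0] + eps, w["ho"][1]]}
err = lambda wt: 0.5 * (t - forward(wt, x)[3]) ** 2
num_grad = (err(w_plus) - err(w)) / eps
print(abs(-o_h[0] * d_out - num_grad) < 1e-5)  # True
```

The weight update itself would then be ∆w = η · (output of predecessor) · δ, exactly as in the formulas above.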

Definition 5.8 (Backpropagation). If we summarize formulas 5.38 on the preceding page and 5.39, we receive the following final formula for backpropagation (the identifiers p are omitted for reasons of clarity):

    ∆wk,h = η ok δh, with
    δh = f'act(neth) · (th − yh)              (h outside)
    δh = f'act(neth) · Σ_{l∈L} (δl wh,l)      (h inside)      (5.40)

SNIPE: An online variant of backpropagation is implemented in the method trainBackpropagationOfError within the class NeuralNetwork.

It is obvious that backpropagation initially processes the last weight layer directly by means of the teaching input, and then works backwards from layer to layer while considering each preceding change in weights. Thus, the teaching input leaves traces in all weight layers. Here I describe the first (delta rule) and the second part of backpropagation (generalized delta rule on more layers) in one go, which may meet the requirements of the matter, but not of the research: decades of development time and work lie between the first and the second part. Like many groundbreaking inventions, it was not until its development that it was recognized how plausible this invention was. The first part is obvious; the recursive second part you will soon see in the framework of a mathematical gimmick.

5.4.2 Heading back: Boiling backpropagation down to the delta rule

As explained above, the delta rule is a special case of backpropagation for one-stage perceptrons and linear activation functions. I want to briefly explain this circumstance and develop the delta rule out of backpropagation, in order to augment the understanding of both rules. We have seen that backpropagation is defined by

    ∆wk,h = η ok δh, with
    δh = f'act(neth) · (th − yh)              (h outside)
    δh = f'act(neth) · Σ_{l∈L} (δl wh,l)      (h inside)      (5.41)

Since we only use it for one-stage perceptrons, the second part of backpropagation (light-colored) is omitted without substitution. The result is:

    ∆wk,h = η ok δh, with δh = f'act(neth) · (th − oh)      (5.42)

Furthermore, we only want to use linear activation functions, so that f'act (light-colored) is constant. As is generally known, constants can be combined, and therefore we directly merge the constant derivative f'act and (being constant for at least one learning cycle) the learning rate η (also light-colored) in η. The result is:

    ∆wk,h = η ok δh = η ok · (th − oh)      (5.43)

This exactly corresponds to the delta rule definition.
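This reduction can be verified numerically. The following is my own check (not from the book): for a one-stage perceptron with identity activation, the backpropagation update η·ok·f'act(net)·(t − y) collapses to the delta rule update η·ok·(t − o), since f'act ≡ 1 for the identity. All values are arbitrary illustration numbers.

```python
# Hypothetical check that the generalized rule (h outside) equals the
# delta rule when the activation function is the identity.

def backprop_update_output(eta, o_k, net_h, t, y, fact_prime):
    return eta * o_k * fact_prime(net_h) * (t - y)   # generalized rule, h outside

def delta_rule_update(eta, o_k, t, o):
    return eta * o_k * (t - o)                       # delta rule

eta, o_k = 0.05, 0.8
net_h = 0.3                       # with identity activation, output o = net
t = 1.0
identity_prime = lambda net: 1.0  # derivative of the identity is constantly 1
bp = backprop_update_output(eta, o_k, net_h, t, net_h, identity_prime)
dr = delta_rule_update(eta, o_k, t, net_h)
print(bp == dr)  # True
```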

5.4.3 The selection of the learning rate has heavy influence on the learning process

In the meantime we have often seen that the change in weight is, in any case, proportional to the learning rate η. Thus, the selection of η is crucial for the behaviour of backpropagation and for learning procedures in general.

Definition 5.9 (Learning rate). Speed and accuracy of a learning procedure can always be controlled by, and are always proportional to, a learning rate which is written as η.

If the value of the chosen η is too large, the jumps on the error surface are also too large and, for example, narrow valleys could simply be jumped over. Additionally, the movements across the error surface would be very uncontrolled. Thus, a small η is the desired input, which, however, can cost a huge, often unacceptable amount of time. Experience shows that good learning rate values are in the range of

    0.01 ≤ η ≤ 0.9.

The selection of η significantly depends on the problem, the network and the training data, so that it is barely possible to give practical advice. But for instance it is popular to start with a relatively large η, e.g. 0.9, and to slowly decrease it down to 0.1. For simpler problems, η can often be kept constant.

5.4.3.1 Variation of the learning rate over time

During training, another stylistic device can be a variable learning rate: in the beginning, a large learning rate leads to good results, but later it results in inaccurate learning; a smaller learning rate is more time-consuming, but the result is more precise. Thus, during the learning process the learning rate needs to be decreased by one order of magnitude once or repeatedly.

A common error (which also seems to be a very neat solution at first glance) is to continually decrease the learning rate. Here it quickly happens that the descent of the learning rate is larger than the ascent of a hill of the error function we are climbing. The result is that we simply get stuck at this ascent. Solution: rather reduce the learning rate stepwise as mentioned above.

5.4.3.2 Different layers, different learning rates

The farther we move away from the output layer during the learning process, the slower backpropagation is learning. Thus, it is a good idea to select a larger learning rate for the weight layers close to the input layer than for the weight layers close to the output layer.
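The stepwise decrease recommended above can be sketched in a few lines. This is my own illustration (not from the book); the concrete epochs and rates in the schedule are arbitrary example values in the spirit of "reduce by one order of magnitude once or repeatedly".

```python
# Hypothetical step schedule: keep eta constant within a phase and drop it
# by an order of magnitude at fixed epochs, instead of decaying continually.

def stepwise_eta(epoch, schedule=((0, 0.9), (100, 0.09), (200, 0.009))):
    """Return the learning rate for a given epoch under a step schedule."""
    eta = schedule[0][1]
    for start, value in schedule:
        if epoch >= start:
            eta = value
    return eta

print(stepwise_eta(50), stepwise_eta(150), stepwise_eta(500))
# 0.9 0.09 0.009
```

A per-layer variant, as suggested in 5.4.3.2, would simply keep one such schedule per weight layer, with larger base values near the input layer.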

5.5 Resilient backpropagation is an extension to backpropagation of error

We have just raised two backpropagation-specific properties that can occasionally be a problem (in addition to those which are already caused by gradient descent itself): on the one hand, users of backpropagation can choose a bad learning rate; on the other hand, the further the weights are from the output layer, the slower backpropagation learns. For this reason, Martin Riedmiller et al. enhanced backpropagation and called their version resilient backpropagation (short: Rprop) [RB93, Rie94]. I want to compare backpropagation and Rprop, without explicitly declaring one version superior to the other. Before actually dealing with formulas, let us informally compare the two primary ideas behind Rprop (and their consequences) to the already familiar backpropagation.

Learning rates: Backpropagation uses by default a learning rate η, which is selected by the user, applies to the entire network, and remains static until it is manually changed. We have already explored the disadvantages of this approach. Rprop pursues a completely different approach: there is no global learning rate. First, each weight wi,j has its own learning rate ηi,j, and second, these learning rates are not chosen by the user, but are automatically set by Rprop itself. Third, the weight changes are not static but are adapted for each time step of Rprop. To account for the temporal change, we have to correctly call it ηi,j(t). This not only enables more focused learning; the problem of an increasingly slowed down learning throughout the layers is also solved in an elegant way.

Weight change: When using backpropagation, weights are changed proportionally to the gradient of the error function. At first glance, this is really intuitive. However, we thereby incorporate every jagged feature of the error surface into the weight changes, and it is at least questionable whether this is always useful. Here, Rprop takes other ways as well: the amount of weight change ∆wi,j simply directly corresponds to the automatically adjusted learning rate ηi,j. Thus the change in weight is not proportional to the gradient; it is only influenced by the sign of the gradient. Until now we still do not know how exactly the ηi,j are adapted at run time, but let me anticipate that the resulting process looks considerably less rugged than an error function.

Now how exactly are these ideas being implemented?

5.5.1 Weight changes are not proportional to the gradient

Let us first consider the change in weight. We have already noticed that the weight-specific learning rates directly serve as absolute values for the changes of the respective weights. There remains the question of where the sign comes from; this is a point at which the gradient comes into play. As with the derivation of backpropagation, we derive the error function Err(W) by the individual weights wi,j and obtain gradients ∂Err(W)/∂wi,j. Now, the big difference: rather than multiplicatively incorporating the absolute value of the gradient into the weight change, we consider only the sign of the gradient. The gradient hence no longer determines the strength, but only the direction of the weight change.

If the sign of the gradient ∂Err(W)/∂wi,j is positive, we must decrease the weight wi,j: the weight is reduced by ηi,j. If the sign of the gradient is negative, the weight needs to be increased: ηi,j is added to it. If the gradient is exactly 0, nothing happens at all.

Let us now create a formula from this colloquial description. The corresponding terms are affixed with a (t) to show that everything happens at the same time step. This might decrease clarity at first glance, but is nevertheless important, because we will soon look at another formula that operates on different time steps. We shorten the gradient to g = ∂Err(W)/∂wi,j.

Definition 5.10 (Weight change in Rprop).

    ∆wi,j(t) =  −ηi,j(t),  if g(t) > 0
                +ηi,j(t),  if g(t) < 0
                0          otherwise      (5.44)

We now know how the weights are changed; there remains the question how the learning rates are adjusted.

5.5.2 Many dynamically adjusted learning rates instead of one static

To adjust the learning rate ηi,j, we again have to consider the associated gradients g of two time steps: the gradient that has just passed, (t − 1), and the current one, (t). Again, only the sign of the gradient matters, and we now must ask ourselves: what can happen to the sign over two time steps? It can stay the same, and it can flip.

If the sign changes from g(t − 1) to g(t), we have skipped a local minimum in the gradient: the last update was too large, and ηi,j(t) has to be reduced as compared to the previous ηi,j(t − 1). One can say that the search needs to be more accurate. In mathematical terms, we obtain a new ηi,j(t) by multiplying the old ηi,j(t − 1) with a constant η↓, which is between 1 and 0. In this case we know that in the last time step (t − 1) something went wrong, hence we additionally reset the weight update for the weight wi,j at time step (t) to 0, so that it is not applied at all (not shown in the following formula).

If the sign remains the same, one can perform a (careful!) increase of ηi,j to get past shallow areas of the error function. Here we obtain our new ηi,j(t) by multiplying the old ηi,j(t − 1) with a constant η↑, which is greater than 1.

Definition 5.11 (Adaptation of learning rates in Rprop).

    ηi,j(t) =  η↑ · ηi,j(t − 1),  if g(t − 1) · g(t) > 0
               η↓ · ηi,j(t − 1),  if g(t − 1) · g(t) < 0
               ηi,j(t − 1)        otherwise      (5.45)

Caution: this also implies that Rprop is exclusively designed for offline learning. If the gradients do not have a certain continuity, the learning process slows down to the lowest rates (and remains there). When learning online, one changes, loosely speaking, the error function with each new epoch, since it is based on only one training pattern. This may often be well applicable in backpropagation, where it is very often even faster than the offline version and therefore used frequently, although it lacks a clear mathematical motivation.

5.5.3 We are still missing a few details to use Rprop in practice

A few minor issues remain unanswered, namely:

1. How large are η↑ and η↓, i.e. how much are learning rates reinforced or weakened?
2. How to choose ηi,j(0), i.e. how are the weight-specific learning rates initialized? (4)
3. What are the upper and lower bounds ηmin and ηmax for ηi,j?

We now answer these questions with a quick motivation. The initial value for the learning rates should be somewhere in the order of the initialization of the weights; ηi,j(0) = 0.1 has proven to be a good choice. The authors of the Rprop paper explain in an obvious way that this value, as long as it is positive and without an exorbitantly high absolute value, does not need to be dealt with very critically, as it will be quickly overridden by the automatic adaptation anyway.

Equally uncritical is ηmax, for which they recommend, without further mathematical justification, a value of 50, which is used throughout most of the literature. One can set this parameter to lower values in order to allow only very cautious updates. Small update steps should be allowed in any case, so we set ηmin = 10^−6.

(4) Protip: since the ηi,j can be changed only by multiplication, 0 would be a rather suboptimal initialization :-)
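Equations 5.44 and 5.45 can be transcribed almost literally. The following is a sketch of my own (not Snipe's implementation); the constants η↑ = 1.2 and η↓ = 0.5 are the values commonly used in the literature and are treated here as given defaults. The toy problem at the end minimizes Err(w) = w², whose gradient is 2w.

```python
# Hypothetical one-weight Rprop step: the weight moves by ±eta depending
# only on the gradient's sign (eq. 5.44); eta is scaled by eta_up or
# eta_down depending on whether the sign was kept or flipped (eq. 5.45).

def rprop_step(w, eta, g_prev, g, eta_up=1.2, eta_down=0.5,
               eta_min=1e-6, eta_max=50.0):
    """One Rprop update for a single weight; returns (w, eta, g) for reuse."""
    if g_prev * g > 0:                  # same sign: accelerate carefully
        eta = min(eta * eta_up, eta_max)
    elif g_prev * g < 0:                # sign flip: a minimum was skipped
        eta = max(eta * eta_down, eta_min)
        g = 0.0                         # suppress the weight update this step
    if g > 0:
        w -= eta                        # positive gradient: decrease weight
    elif g < 0:
        w += eta                        # negative gradient: increase weight
    return w, eta, g

w, eta, g_prev = 3.0, 0.1, 0.0
for _ in range(30):
    g = 2 * w                           # gradient of Err(w) = w^2
    w, eta, g_prev = rprop_step(w, eta, g_prev, g)
print(abs(w) < 0.5)  # True: w oscillates into the minimum at 0
```

Note how the step size grows while the sign stays constant and is halved on every sign flip, which is exactly the "less rugged" behaviour anticipated above.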

Now we have left only the parameters η↑ and η↓. Let us start with η↓: if this value is used, we have skipped a minimum, and we do not know where exactly it lies on the skipped track. Analogously to the procedure of binary search, where the target object is often skipped as well, we assume it was in the middle of the skipped track. So we need to halve the learning rate, which is why the canonical choice η↓ = 0.5 is selected. If the value of η↑ is used, learning rates shall be increased with caution: here we cannot generalize the principle of binary search and simply use the value 2.0, otherwise the learning rate update will end up consisting almost exclusively of changes in direction. Without further mathematical justification, a value of η↑ = 1.2 has proven to be promising; slight changes of this value have not significantly affected the rate of convergence, which allowed for setting this value as a constant as well.

With advancing computational capabilities of computers, one can observe a more and more widespread distribution of networks that consist of a big number of layers, i.e. deep networks. For such networks it is crucial to prefer Rprop over the original backpropagation, because backprop, as already indicated, learns very slowly at weights which are far from the output layer. For problems with a smaller number of layers, I would recommend, independent of the particular problem, testing the more widespread backpropagation (with both offline and online learning) and the less common Rprop equivalently.

SNIPE: In Snipe, resilient backpropagation is supported via the method trainResilientBackpropagation of the class NeuralNetwork. Furthermore, you can also use an additional improvement to resilient propagation, which is, however, not dealt with in this work. There are getters and setters for the different parameters of Rprop.

5.6 Backpropagation has often been extended and altered besides Rprop

Backpropagation has often been extended. Many of these extensions can simply be implemented as optional features of backpropagation in order to have a larger scope for testing. In the following I want to briefly describe some of them.

5.6.1 Adding momentum to learning

Let us assume to descend a steep slope on skis: what prevents us from immediately stopping at the edge of the slope to the plateau? Exactly, our momentum. With backpropagation, the momentum term [RHW86b] is responsible for the fact that a kind of moment of inertia (momentum) is added to every step size (fig. 5.13 on the next page), by always adding a fraction of the previous change to every new change in weight:

    (∆p wi,j)now = η op,i δp,j + α · (∆p wi,j)previous
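The effect of such a momentum term is easy to see in isolation. The following sketch is my own illustration (not from the book), written in the gradient-descent form Δw(t) = −η·g(t) + α·Δw(t−1), where −η·g stands for the η·o·δ of the formula above; η and α are arbitrary example values.

```python
# Hypothetical momentum demo: consecutive steps in the same direction
# accumulate speed, so a constant gradient (a plateau with slope) produces
# steps approaching the limit -eta / (1 - alpha) instead of -eta.

def momentum_updates(grads, eta=0.1, alpha=0.9):
    """Apply delta(t) = -eta*g(t) + alpha*delta(t-1) for a list of gradients."""
    delta_prev, deltas = 0.0, []
    for g in grads:
        delta = -eta * g + alpha * delta_prev
        deltas.append(delta)
        delta_prev = delta
    return deltas

steps = momentum_updates([1.0] * 20)   # constant gradient of 1
print(round(steps[0], 3), round(steps[-1], 3))
# first step is -0.1; later steps approach the limit -eta/(1-alpha) = -1.0
```

This is exactly the acceleration on plateaus described next; on a craggy surface, alternating gradient signs would partially cancel inside the same recursion and damp the oscillation.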

which is continued successively. As already defined by the concept of time, the current cycle is referred to as (t) and the previous cycle is identified by (t − 1); this notation is only used for a better understanding. And now we come to the formal definition of the momentum term:

Definition 5.12 (Momentum term). The variation of backpropagation by means of the momentum term is defined as follows:

∆wi,j(t) = η oi δj + α · ∆wi,j(t − 1)   (5.46)

Of course, the effect of inertia can be varied via the prefactor α; common values are between 0.6 and 0.9. Additionally, the momentum enables the positive effect that our skier swings back and forth several times in a minimum and finally lands in the minimum: we accelerate on plateaus (avoiding quasi-standstill on plateaus) and slow down on craggy surfaces (preventing oscillations). Despite its nice one-dimensional appearance, however, the otherwise very rare error of leaving good minima unfortunately occurs more frequently because of the momentum term – which means that this is again no optimal solution (but we are by now accustomed to this condition).

Figure 5.13: We want to execute the gradient descent like a skier crossing a slope, who would hardly stop immediately at the edge to the plateau.

5.6.2 Flat spot elimination prevents neurons from getting stuck

It must be pointed out that with the hyperbolic tangent as well as with the Fermi function the derivative outside of the close proximity of Θ is nearly 0. This results in the fact that it becomes very difficult to move neurons away from the limits of the activation (flat spots), which could extremely extend the learning time. This problem can be dealt with by modifying the derivative, for example by adding a constant (e.g. 0.1), which is called flat spot elimination or – more colloquially – fudging.

It is an interesting observation that success has also been achieved by using derivatives defined as constants [Fah88]. A nice example making use of this effect is the fast hyperbolic tangent approximation by Anguita et al., introduced in section 3.2.6 on page 37: in the outer regions of its (well approximated and accelerated) derivative it makes use of a small constant.
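The momentum update of equation 5.46 amounts to keeping the previous weight change around and adding a fraction α of it to the new one. A minimal sketch (names are mine):

```python
# Sketch of the momentum-term update of equation 5.46:
# delta_w(t) = eta * o_i * delta_j + alpha * delta_w(t - 1)
def momentum_step(eta, o_i, delta_j, alpha, prev_delta_w):
    return eta * o_i * delta_j + alpha * prev_delta_w

# A few consecutive steps with a constant gradient: the effective step size
# grows towards eta*o*delta / (1 - alpha), like a rolling skier gaining speed.
dw = 0.0
for _ in range(50):
    dw = momentum_step(eta=0.1, o_i=1.0, delta_j=1.0, alpha=0.9, prev_delta_w=dw)
print(dw)  # approaches 0.1 / (1 - 0.9) = 1.0
```

The geometric-series limit η/(1 − α) makes precise why α close to 1 can overshoot good minima.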

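Flat spot elimination itself is a one-line change; here it is sketched for the hyperbolic tangent (the constant 0.1 is the example value from the text, the helper name is mine):

```python
import math

def tanh_derivative(x, fudge=0.1):
    """Derivative of tanh plus a small constant ("fudging"), so that
    neurons saturated far from 0 still receive a usable learning signal."""
    return (1.0 - math.tanh(x) ** 2) + fudge

# Deep in the saturated region the true derivative is almost 0,
# but the fudged derivative never falls below `fudge`:
print(tanh_derivative(10.0))
```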
5.6.3 The second derivative can be used, too

According to David Parker [Par87], second order backpropagation also uses the second gradient, i.e. the second multi-dimensional derivative of the error function, to obtain more precise estimates of the correct ∆wi,j. Even higher derivatives only rarely improve the estimations. In general, we use further derivatives (i.e. Hessian matrices, since the functions are multidimensional) for such higher order methods. As expected, the procedures reduce the number of learning epochs, allowing easier and more controlled learning, but at the same time they significantly increase the computational effort of the individual epochs: less training cycles are needed, but those require much more computational effort, so in the end these procedures often need more learning time than backpropagation.

The quickpropagation learning procedure [Fah88] uses the second derivative of the error propagation and locally understands the error function to be a parabola. We analytically determine the vertex (i.e. the lowest point) of the said parabola and directly jump to this point. Thus, this learning procedure is a second-order procedure. Of course, this does not work with error surfaces that cannot locally be approximated by a parabola (and certainly it is not always possible to directly say whether this is the case).

5.6.4 Weight decay: Punishment of large weights

The weight decay according to Paul Werbos [Wer88] is a modification that extends the error by a term punishing large weights. So the error under weight decay, ErrWD, does not only increase proportionally to the actual error but also proportionally to the square of the weights. As a result, the network keeps the weights small during learning:

ErrWD = Err + β · 1/2 · Σw∈W w²   (5.47)

This approach is inspired by nature, where synaptic weights cannot become infinitely strong as well. Additionally, due to these small weights, the error function often shows weaker fluctuations, allowing easier and more controlled learning.

The prefactor 1/2 again resulted from simple pragmatics. The factor β controls the strength of punishment: values from 0.001 to 0.02 are often used here.

5.6.5 Cutting networks down: Pruning and Optimal Brain Damage

If we have executed the weight decay long enough and notice that for a neuron in the input layer all successor weights are 0 or close to 0, we can remove the neuron.
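Equation 5.47 in code form (a trivial sketch, with a β from the recommended range):

```python
def weight_decay_error(err, weights, beta=0.01):
    """Eq. 5.47: punish large weights by adding beta * 1/2 * sum(w^2)."""
    return err + beta * 0.5 * sum(w * w for w in weights)

print(weight_decay_error(0.2, [0.5, -1.5, 2.0]))
```

Because the penalty grows with the square of each weight, the gradient of this term pulls every weight towards 0 with a force proportional to the weight itself.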

Hence we lose this neuron and some weights, and thereby reduce the possibility that the network will memorize. This procedure is called pruning.

Such a method to detect and delete unnecessary weights and neurons is referred to as optimal brain damage [lCDS90]. I only want to describe it briefly: the mean error per output neuron is composed of two competing terms. While one term, as usual, considers the difference between output and teaching input, the other one tries to "press" a weight towards 0. If a weight is strongly needed to minimize the error, the first term will win; if this is not the case, the second term will win. Neurons which only have zero weights can be pruned again in the end.

There are many other variations of backprop, and whole books have been written only about this subject; but since my aim is to offer an overview of neural networks, I just want to mention the variations above as a motivation to read on. For some of these extensions it is obvious that they can not only be applied to feedforward networks with backpropagation learning procedures.

5.7 Getting started – Initial configuration of a multilayer perceptron

After having discussed the backpropagation of error learning procedure and knowing how to train an existing network, it would be useful to consider how to implement such a network.

5.7.1 Number of layers: Two or three may often do the job, but more are also used

Let us begin with the trivial circumstance that a network should have one layer of input neurons and one layer of output neurons, which results in at least two layers. Additionally, we need – as we have already learned during the examination of linear separability – at least one hidden layer of neurons, if our problem is not linearly separable (which is, as we have seen, very likely).

It is possible, in principle, to mathematically prove that this MLP with one hidden neuron layer is already capable of approximating arbitrary functions with any accuracy (note that we have not indicated the number of neurons in the hidden layer). But it is necessary not only to discuss the representability of a problem by means of a perceptron but also its learnability: representability means that a perceptron can, in principle, realize a mapping, while learnability means that we are also able to teach it. It is of course impossible to fully communicate this experience in the framework of this work; to obtain at least some of this knowledge, I advise you to deal with some of the exemplary problems from this work.

In this respect, experience shows that two hidden neuron layers (or three trainable weight layers) can be very useful to solve a problem, since many problems can be represented by a hidden layer but are very difficult to learn.

One should keep in mind that any additional layer generates additional sub-minima of the error function in which we can get stuck. All these things considered, a promising way is to try it with one hidden layer at first and, if that fails, to retry with two layers. Only if that fails should one consider more layers. However, given the increasing calculation power of current computers, deep networks with a lot of layers are also used with success.

5.7.2 The number of neurons has to be tested

The number of neurons (apart from input and output layer, where the number of input and output neurons is already defined by the problem statement) principally corresponds to the number of free parameters of the problem to be represented.

Since we have already discussed the network capacity with respect to memorizing or a too imprecise problem representation, it is clear that our goal is to have as few free parameters as possible but as many as necessary.

But we also know that there is no standard solution for the question of how many neurons should be used. Thus, the most useful approach is to initially train with only a few neurons and to repeatedly train new networks with more neurons until the result significantly improves and, particularly, the generalization performance is not affected (bottom-up approach).

5.7.3 Selecting an activation function

Another very important parameter for the way of information processing of a neural network is the selection of an activation function. The activation function for input neurons is fixed to the identity function, since they do not process information.

The first question to be asked is whether we actually want to use the same activation function in the hidden layer and in the output layer – no one prevents us from choosing different functions. Generally, however, the activation function is the same for all hidden neurons as well as for all output neurons respectively.

For tasks of function approximation it has been found reasonable to use the hyperbolic tangent (left part of fig. 5.14 on page 102) as activation function of the hidden neurons, while a linear activation function is used in the output. The latter is absolutely necessary so that we do not generate a limited output interval. Contrary to the input layer, which uses linear activation functions as well, the output layer still processes information.
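As a tiny illustration of this standard choice (hyperbolic tangent in the hidden layer, identity in the output layer), here is a minimal forward pass of a hypothetical 2-3-1 MLP; all weight values are made up for the example:

```python
import math

def forward(x, w_hidden, w_out):
    """Forward pass: tanh activation in the hidden layer,
    identity (linear) activation in the output layer."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    return sum(w * h for w, h in zip(w_out, hidden))

w_hidden = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]  # 3 hidden neurons, 2 inputs
w_out = [1.0, -2.0, 0.5]                           # 1 output neuron
print(forward([1.0, 2.0], w_hidden, w_out))
```

Each hidden activation is confined to (−1, 1), but the linear output neuron can produce values outside that interval, which is exactly the point of the unlimited output interval mentioned above.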

However, linear activation functions in the output can also cause huge learning steps and jumping over good minima in the error surface. This can be avoided by setting the learning rate to very small values in the output layer.

An unlimited output interval is not essential for pattern recognition tasks (generally, pattern recognition is understood as a special case of function approximation with a few discrete output possibilities). For such tasks the Fermi function (right part of fig. 5.14 on the following page) can also be used in the output; with the hyperbolic tangent the output interval will be a bit larger. Here a lot of freedom is given for selecting an activation function. However, the disadvantage of sigmoid functions is the fact that they hardly learn something for values far from their threshold value: with the Fermi function, just like with the hyperbolic tangent, it is difficult to learn something far from the threshold value (where the result of the derivative is close to 0), unless the network is modified.

5.7.4 Weights should be initialized with small, randomly chosen values

The initialization of weights is not as trivial as one might think. If they are simply initialized with 0, there will be no change in weights at all; if they are all initialized by the same value, they will all change equally during training. The simple solution of this problem is called symmetry breaking, which is the initialization of weights with small random values. The range of random values could be the interval [−0.5, 0.5], not including 0 or values very close to 0. This random initialization has a nice side effect: chances are that the average of network inputs is close to 0, a value that hits (in most activation functions) the region of the greatest derivative, allowing for strong learning impulses right from the start of learning.

SNIPE: In Snipe, weights are initialized randomly (if a synapse initialization is wanted). The maximum absolute weight value of a synapse initialized at random can be set in a NeuralNetworkDescriptor using the method setSynapseInitialRange.

5.8 The 8-3-8 encoding problem and related problems

The 8-3-8 encoding problem is a classic among the multilayer perceptron test training problems. In our MLP we have an input layer with eight neurons i1, i2, . . . , i8, an output layer with eight neurons Ω1, Ω2, . . . , Ω8 and one hidden layer with three neurons. Thus, this network represents a function B⁸ → B⁸. Now the training task is that an input of a value 1 into the neuron ij should lead to an output of a value 1 from the neuron Ωj (only one neuron should be activated at a time), which results in 8 training samples.
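The symmetry-breaking initialization of section 5.7.4 can be sketched in a few lines (the interval and the exclusion of near-zero values follow the text; the rejection threshold 0.05 is my own example value):

```python
import random

def init_weight(low=-0.5, high=0.5, dead_zone=0.05):
    """Draw a small random weight in [low, high], rejecting values
    too close to 0 (symmetry breaking needs distinct, nonzero weights)."""
    while True:
        w = random.uniform(low, high)
        if abs(w) >= dead_zone:
            return w

weights = [init_weight() for _ in range(10)]
print(weights)
```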

During the analysis of the trained network we will see that the network with the 3 hidden neurons represents some kind of binary encoding and that the above mapping is possible (assumed training time: ≈ 10⁴ epochs). Thus, our network is a machine in which the input is first encoded and afterwards decoded again.

But is it possible to improve the efficiency of this procedure? Could there be, for example, an 8-2-8- or an 8-1-8-encoding network? Yes, even that is possible, since the network does not depend on binary encodings: thus, an 8-2-8 network is sufficient for our problem. But the encoding of the network is far more difficult to understand (fig. 5.15 on the next page) and the training of the networks requires a lot more time. Analogously, we can train, for example, a 1024-10-1024 or a 1024-9-1024 encoding problem.

An 8-1-8 network, however, does not work, since the possibility that the output of one neuron is compensated by another one is essential, and if there is only one hidden neuron, there is certainly no compensatory neuron.

SNIPE: The static method getEncoderSampleLesson in the class TrainingSampleLesson allows for creating simple training sample lessons of arbitrary dimensionality for encoder problems like the above.

Figure 5.14: As a reminder, the illustration of the hyperbolic tangent (left) and the Fermi function (right). The Fermi function was expanded by a temperature parameter; the original Fermi function is thereby represented by dark colors. The temperature parameters of the modified Fermi functions are, ordered ascending by steepness, 1, 1/2, 1/10 and 1/25.

Exercises

Exercise 8. Fig. 5.4 on page 75 shows a small network for the boolean functions AND and OR. Write tables with all computational parameters of neural networks (e.g. network input, activation etc.). Perform the calculations for the four possible inputs of the networks and write down the values of these variables for each input. Do the same for the XOR network (fig. 5.9 on page 84).
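The training samples of such encoder problems are just the one-hot vectors mapped to themselves (this is what Snipe's getEncoderSampleLesson generates for arbitrary dimensionality; the Python below is only my own illustration of the idea):

```python
def encoder_samples(n):
    """Build the n training samples of an n-k-n encoder problem:
    the j-th input and its teaching input are both the j-th one-hot vector."""
    samples = []
    for j in range(n):
        one_hot = [1 if k == j else 0 for k in range(n)]
        samples.append((one_hot, one_hot))  # input equals teaching input
    return samples

for inp, teach in encoder_samples(8):
    print(inp, "->", teach)
```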

Exercise 9. List all boolean functions B³ → B¹ that are linearly separable and characterize them exactly. List those that are not linearly separable and characterize them exactly, too.

Exercise 10. A one-stage perceptron with two input neurons, bias neuron and binary threshold function as activation function divides the two-dimensional space into two regions by means of a straight line g. Analytically calculate a set of weight values for such a perceptron so that the following set P of the 6 patterns of the form (p1, p2, tΩ) with ε ≪ 1 is correctly classified:

P = {(0, 0, −1), (2, −1, 1), (7 + ε, 3 − ε, 1), (7 − ε, 3 + ε, −1), (0, −2 − ε, 1), (0 − ε, −2, −1)}

Exercise 11. A simple 2-1 network shall be trained with one single pattern by means of backpropagation of error and η = 0.1. Verify whether the error

Err = Errp = 1/2 (t − y)²

converges and, if so, at what value. What does the error curve look like? Let the pattern (p, t) be defined by p = (p1, p2) = (0.3, 0.7) and tΩ = 0.4. Randomly initialize the weights in the interval [−1, 1].

Figure 5.15: Illustration of the functionality of the 8-2-8 network encoding. The marked points represent the vectors of the inner neuron activations associated with the samples. As you can see, it is possible to find inner activation formations so that each point can be separated from the rest of the points by a straight line. The illustration shows an exemplary separation of one point.

Exercise 12. Let a 2-2-1 MLP with bias neuron be given and let the pattern be defined by p = (p1, p2, tΩ) = (2, 0, 0.1). For all weights with the target Ω the initial value of the weights should be 1; for all other weights the initial value should be 0.5. Calculate in a comprehensible way one vector ∆W of all changes in weight by means of the backpropagation of error procedure with η = 1. What is conspicuous about the changes?

Chapter 6

Radial basis functions

RBF networks approximate functions by stretching and compressing Gaussian bells and then summing them spatially shifted. Description of their functions and their learning process. Comparison with multilayer perceptrons.

According to Poggio and Girosi [PG89], radial basis function networks (RBF networks) are a paradigm of neural networks which was developed considerably later than that of perceptrons. Like perceptrons, the RBF networks are built in layers, but in this case they have exactly three layers, i.e. only one single layer of hidden neurons.

Like perceptrons, the networks have a feedforward structure and their layers are completely linked. Here, the input layer again does not participate in information processing. The RBF networks are – like MLPs – universal function approximators.

Despite all things in common: what is the difference between RBF networks and perceptrons? The difference lies in the information processing itself and in the computational rules within the neurons outside of the input layer. So, in a moment we will define a so far unknown type of neurons.

6.1 Components and structure of an RBF network

Initially, we want to discuss colloquially and then define some concepts concerning RBF networks.

Output neurons: In an RBF network the output neurons only contain the identity as activation function and one weighted sum as propagation function. Thus, they do little more than adding all input values and returning the sum.

Hidden neurons are also called RBF neurons (as well as the layer in which they are located is referred to as RBF layer). As propagation function, each hidden neuron calculates a norm that represents the distance between the input to the network and the so-called position of the neuron (center). This is inserted into a radial activation function which calculates and outputs the activation of the neuron.

Definition 6.1 (RBF input neuron). Definition and representation are identical to definition 5.1 of the input neuron on page 73. The input is linear again.

Definition 6.2 (Center of an RBF neuron). The center ch of an RBF neuron h is the point in the input space where the RBF neuron is located. In general, the closer the input vector is to the center vector of an RBF neuron, the higher is its activation.

Definition 6.3 (RBF neuron). The so-called RBF neurons h have a propagation function fprop that determines the distance between the center ch of a neuron and the input vector y. This distance represents the network input. Then the network input is sent through a radial basis function fact which returns the activation or the output of the neuron. RBF neurons are represented by the symbol showing the distance ||c, x|| above a Gaussian bell.

Definition 6.4 (RBF output neuron). RBF output neurons Ω use the weighted sum as propagation function fprop and the identity as activation function. They are represented by the symbol Σ.

Definition 6.5 (RBF network). An RBF network has exactly three layers in the following order: the input layer consisting of input neurons, the hidden layer (also called RBF layer) consisting of RBF neurons and the output layer consisting of RBF output neurons. Each layer is completely linked with the following one; shortcuts do not exist (fig. 6.1 on the next page) – it is a feedforward topology. The connections between input layer and RBF layer are unweighted, i.e. they only transmit the input, while the connections between RBF layer and output layer are weighted. The original definition of an RBF network only referred to an output neuron, but – in analogy to the perceptrons – it is apparent that such a definition can be generalized. A bias neuron is not used in RBF networks. The set of input neurons shall be represented by I, the set of hidden neurons by H and the set of output neurons by O.

The inner neurons are called radial basis neurons because from their definition it follows directly that all input vectors with the same distance from the center of a neuron also produce the same output value.

6.2 Information processing of an RBF network

Now the question is what can be realized by such a network and what is its purpose. Let us go over the RBF network from top to bottom: an RBF network receives the input by means of the unweighted connections. Then the input vector is sent through a norm so that the result is a scalar. This scalar (which, by the way, can only be positive due to the norm) is processed by a radial basis function, for example by a Gaussian bell.

Figure 6.1: An exemplary RBF network with two input neurons, five hidden neurons and three output neurons. The connections to the hidden neurons are not weighted; they only transmit the input. Right of the illustration you can find the names of the neurons, which coincide with the names of the MLP neurons: input neurons are called i1, . . . , i|I|, hidden neurons are called h1, . . . , h|H| and output neurons are called Ω1, . . . , Ω|O|. The associated sets are referred to as I, H and O.

6.2.1 Information processing in RBF neurons

RBF neurons process information by using norms and radial basis functions: input → distance → Gaussian bell → sum → output.

At first, let us take as an example a simple 1-4-1 RBF network. It is apparent that we will receive a one-dimensional output which can be represented as a function (fig. 6.4 on the facing page). Suppose furthermore that we have a second, a third and a fourth RBF neuron and therefore four differently located centers: the network includes the centers c1, c2, . . . , c4 of the four inner neurons h1, h2, . . . , h4, and therefore it has Gaussian bells which are finally added within the output neuron Ω. The network also possesses four values σ1, σ2, . . . , σ4 which influence the width of the Gaussian bells. Each of these neurons measures another distance from the input to its own center and de facto provides different values, even if the Gaussian bell is the same. On the contrary, the height of a Gaussian bell is influenced by the subsequent weights, since the individual output values of the bells are multiplied by those weights.

Since these values are finally simply accumulated in the output layer, one can easily see that any surface can be shaped by dragging, compressing and removing Gaussian bells and subsequently accumulating them. Here, the parameters for the superposition of the Gaussian bells are in the weights of the connections between the RBF layer and the output layer.

Furthermore, the network architecture offers the possibility to freely define or train the height and width of the Gaussian bells – due to which the network paradigm becomes even more versatile. We will get to know methods and approaches for this later.

Figure 6.2: Let ch be the center of an RBF neuron h. Then the activation function fact h is radially symmetric around ch.
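This superposition can be written down directly. Below is a sketch using the centers 0, 1, 3, 4 and widths 0.4, 1, 0.2, 0.8 from figure 6.4; the output weights are made up here, since in the network they would come from training:

```python
import math

def rbf_1d(x, weights, centers, sigmas):
    """Output of a 1-4-1 RBF network: weighted sum of Gaussian bells."""
    return sum(w * math.exp(-((x - c) ** 2) / (2 * s ** 2))
               for w, c, s in zip(weights, centers, sigmas))

centers = [0, 1, 3, 4]           # c_1 .. c_4 as in fig. 6.4
sigmas  = [0.4, 1, 0.2, 0.8]     # sigma_1 .. sigma_4 as in fig. 6.4
weights = [1.0, -0.5, 2.0, 0.8]  # made-up output weights

print(rbf_1d(0.0, weights, centers, sigmas))
```

Negative weights flip a bell downwards, which is how dents as well as bumps can be shaped.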

Figure 6.3: Two individual one- or two-dimensional Gaussian bells. In both cases σ = 0.4 holds and the centers of the Gaussian bells lie in the coordinate origin. The distance r to the center (0, 0) is simply calculated according to the Pythagorean theorem: r = √(x² + y²).

Figure 6.4: Four different Gaussian bells in one-dimensional space generated by means of RBF neurons are added by an output neuron of the RBF network. The Gaussian bells have different heights, widths and positions. Their centers c1, c2, . . . , c4 are located at 0, 1, 3, 4, the widths σ1, σ2, . . . , σ4 at 0.4, 1, 0.2, 0.8. You can see a two-dimensional example in fig. 6.5 on the following page.

Figure 6.5: Four different Gaussian bells in two-dimensional space generated by means of RBF neurons are added by an output neuron of the RBF network. Once again r = √(x² + y²) applies for the distance. The heights w, widths σ and centers c = (x, y) are: w1 = 1, σ1 = 0.4, c1 = (0.5, 0.5); w2 = −1, σ2 = 0.6, c2 = (1.15, −1.15); w3 = 1.5, σ3 = 0.5, c3 = (−0.5, −1); w4 = 0.8, σ4 = 1.4, c4 = (−2, 0).

Since we use a norm to calculate the distance between the input vector and the center of a neuron h, we have different choices: often the Euclidean norm is chosen to calculate the distance:

rh = ||x − ch||   (6.1)
   = √( Σi∈I (xi − ch,i)² )   (6.2)

Remember: the input vector was referred to as x. Here, the index i runs through the input neurons and thereby through the input vector components and the neuron center components. As we can see, the Euclidean distance generates the squared differences of all vector components, adds them and extracts the root of the sum. In two-dimensional space this corresponds to the Pythagorean theorem. From the definition of a norm it directly follows that the distance can only be positive, which is why, strictly speaking, we only use the positive part of the activation function. By the way, activation functions other than the Gaussian bell are possible: normally, functions that are monotonically decreasing over the interval [0, ∞] are chosen.

Now that we know the distance rh between the input vector x and the center ch of the RBF neuron h, this distance has to be passed through the activation function. Here we use, as already mentioned, a Gaussian bell:

fact(rh) = e^(−rh² / (2σh²))   (6.3)

It is obvious that both the center ch and the width σh can be seen as part of the activation function fact, and hence the activation functions should not all be referred to as fact simultaneously. One solution would be to number the activation functions like fact1, fact2, . . . , fact|H| with H being the set of hidden neurons. But as a result the explanation would be very confusing. So I simply use the name fact for all activation functions and regard σ and c as variables that are defined for individual neurons but not directly included in the activation function.

The reader will certainly notice that in the literature the Gaussian bell is often normalized by a multiplicative factor. We can, however, avoid this factor because we are multiplying anyway with the subsequent weights, and consecutive multiplications, first by a normalization factor and then by the connections' weights, would only yield different factors there. We do not need this factor (especially because for our purpose the integral of the Gaussian bell must not always be 1) and therefore simply leave it out.

The output yΩ of an RBF output neuron Ω results from combining the functions of an RBF neuron to

yΩ = Σh∈H wh,Ω · fact(||x − ch||).   (6.4)

6.2.2 Some analytical thoughts prior to the training

Suppose that, similar to the multilayer perceptron, we have a set P that contains |P|
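Equations 6.1 to 6.4 combine into a very short forward pass. A minimal sketch (Euclidean norm, unnormalized Gaussian bell, names of my own choosing):

```python
import math

def rbf_activation(x, center, sigma):
    """Eqs. 6.1/6.2 + 6.3: Euclidean distance to the center,
    passed through an (unnormalized) Gaussian bell."""
    r = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, center)))
    return math.exp(-r ** 2 / (2 * sigma ** 2))

def rbf_output(x, weights, centers, sigmas):
    """Eq. 6.4: weighted sum over all hidden RBF neurons."""
    return sum(w * rbf_activation(x, c, s)
               for w, c, s in zip(weights, centers, sigmas))

# The activation is maximal (1, since the bell is unnormalized)
# exactly at the center:
print(rbf_activation([0.5, 0.5], (0.5, 0.5), 0.4))
```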

i. We are looking for the weights wh.Ω · fact (||p − ch ||) .com T =M ·G ⇔ ⇔ ⇔ where M M −1 (6.2. our problem can be seen as a system of equations since the only thing we want to change at the moment are the weights. . . (6. .7) M −1 · T = E · G −1 i. it performs a precise interpolation. . .e.1 Weights can simply be computed as solution of a system of equations Thus. −1 · M · G (6.e. Of course.Chapter 6 Radial basis functions training samples (p. This means. t). This demands a distinction of cases concerning the number of training samples |P | and the number of RBF neurons |H |: |P | = |H |: If the number of RBF neurons equals the number of patterns. the centers c1 . σk . that the network exactly meets the |P | existing nodes after having calculated the weights. Then we obtain |P | functions of the form yΩ = h∈H dkriesel. with this eﬀort we are aiming at letting the output y for all training patterns p converge to the corresponding teaching input t. Mathematically speaking.Ω with |H | weights for one output neuron Ω. M is the |P | × |H | matrix of the outputs of all |H | RBF neurons to |P | samples (remember: |P | = |H |. To calculate such an equation we certainly do not need an RBF network.6) (6. Exact interpolation must not be mistaken for the memorizing ability mentioned with the MLPs: First.9) wh. one function for each training sample. the matrix is squared and we can therefore attempt to invert it). . i.2. . ck and the training samples p including the teaching input t are given. T is the vector of the teaching inputs for all training samples. Thus.5) ·T =M · T = G. Now let us assume that the widths σ1 . we are not talking about the training of RBF T M 6. we have |P | equations. G is the vector of the desired weights and E is a unit matrix with the same size as G. we can simply calculate the weights: In the case of |P | = |H | there is exactly one RBF neuron available per training sample. and therefore we can proceed to the next case.e.8) (6. |P | = |H |. 
Kriesel – A Brief Introduction to Neural Networks (ZETA2-EN) . the equation can be reduced to a matrix multiplication G E simply calculate weights 112 D. c2 . . σ2 .
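The case |P| = |H| from equations 6.5 to 6.8 can be tried out numerically. The following sketch (plain Python; the function names and the Gaussian width σ = 0.8 are my own choices, not from the text) builds the square matrix M for one-dimensional inputs, solves M · G = T by Gaussian elimination, and checks that the resulting network hits every training sample exactly:

```python
import math

def rbf_matrix(patterns, centers, sigma):
    """M[p][h] = fact(||p - c_h||) with a Gaussian bell as activation."""
    return [[math.exp(-((p - c) ** 2) / (2 * sigma ** 2)) for c in centers]
            for p in patterns]

def solve(M, T):
    """Solve M.G = T by Gaussian elimination with partial pivoting."""
    n = len(M)
    A = [row[:] + [t] for row, t in zip(M, T)]   # augmented matrix (M | T)
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for k in range(col, n + 1):
                A[r][k] -= f * A[col][k]
    G = [0.0] * n
    for r in range(n - 1, -1, -1):
        G[r] = (A[r][n] - sum(A[r][k] * G[k] for k in range(r + 1, n))) / A[r][r]
    return G

# |P| = |H|: one RBF neuron per training sample -> precise interpolation.
patterns = [0.0, 1.0, 2.0, 3.0]               # one-dimensional inputs p
teach    = [0.0, 1.0, 0.0, -1.0]              # teaching inputs t
centers  = patterns[:]                         # centers c_h on the samples
M = rbf_matrix(patterns, centers, sigma=0.8)
G = solve(M, teach)                            # the weights w_{h,Omega}

# The network output now meets every node exactly:
for p, t in zip(patterns, teach):
    y = sum(w * math.exp(-((p - c) ** 2) / (2 * 0.8 ** 2))
            for w, c in zip(G, centers))
    assert abs(y - t) < 1e-9
```

The assertion at the end is exactly the "precise interpolation" property: with as many neurons as samples, the solved weights reproduce every teaching input.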

|P| < |H|: The system of equations is under-determined, i.e. there are more RBF neurons than training samples: |P| < |H|. Certainly, this case normally does not occur very often. In this case, there is a huge variety of solutions which we do not need in such detail; we can select one set of weights out of many obviously possible ones.

|P| > |H|: But most interesting for further discussion is the case if there are significantly more training samples than RBF neurons, that means |P| > |H|. Thus, we again want to use the generalization capability of the neural network. If we have more training samples than RBF neurons, we cannot assume that every training sample is exactly hit. So, if we cannot exactly hit the points and therefore cannot just interpolate as in the aforementioned ideal case with |P| = |H|, we must try to find a function that approximates our training set P as closely as possible: As with the MLP we try to reduce the sum of the squared error to a minimum.

How do we continue the calculation in the case of |P| > |H|? As above, we have to find the solution G of a matrix multiplication T = M · G. The problem is that this time we cannot invert the |P| × |H| matrix M because it is not a square matrix (here, |P| ≠ |H| is true). Here, to solve the system of equations, we have to use the Moore-Penrose pseudo inverse M⁺, which is defined by

    M⁺ = (Mᵀ · M)⁻¹ · Mᵀ.    (6.11)

Although the Moore-Penrose pseudo inverse is not the inverse of a matrix, it can be used similarly in this case¹. We get equations that are very similar to those in the case of |P| = |H|:

    T = M · G    (6.12)
    ⇔ M⁺ · T = M⁺ · M · G    (6.13)
    ⇔ M⁺ · T = E · G    (6.14)
    ⇔ M⁺ · T = G    (6.15)

Another reason for the use of the Moore-Penrose pseudo inverse is the fact that it minimizes the squared error (which is our goal): The estimate of the vector G in equation 6.15 corresponds to the Gauss-Markov model known from statistics, which is used to minimize the squared error. In the aforementioned equations 6.11 and the following ones, please do not mistake the T in Mᵀ (the transpose of the matrix M) for the T of the vector of all teaching inputs.

¹ Particularly, M⁺ = M⁻¹ is true if M is invertible. I do not want to go into detail of the reasons for these circumstances and applications of M⁺ – they can easily be found in literature on linear algebra.
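A minimal numerical sketch of equations 6.11 to 6.15 (all helper names are mine; a Gaussian activation with σ = 1 is assumed): for |P| > |H| the weights obtained via the pseudo inverse no longer hit the samples, but they do minimize the summed squared error, which shows up as the normal equations Mᵀ · (M·G − T) ≈ 0:

```python
import math

def fact(r, sigma=1.0):
    """Gaussian bell used as RBF activation."""
    return math.exp(-r * r / (2 * sigma * sigma))

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inv2(A):
    """Closed-form inverse of a 2x2 matrix (enough for |H| = 2 here)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# |P| > |H|: 5 training samples, only 2 RBF neurons -> no exact interpolation.
patterns = [0.0, 0.5, 1.0, 1.5, 2.0]
teach    = [[0.1], [0.7], [1.0], [0.6], [0.2]]       # column vector T
centers  = [0.5, 1.5]

M = [[fact(abs(p - c)) for c in centers] for p in patterns]   # |P| x |H|
Mt = transpose(M)
M_plus = matmul(inv2(matmul(Mt, M)), Mt)   # M+ = (M^T M)^-1 M^T, eq. 6.11
G = matmul(M_plus, teach)                  # weights, eq. 6.15

# G minimizes the squared error; the residual M.G - T is generally nonzero.
residual = [row[0] - t[0] for row, t in zip(matmul(M, G), teach)]
print([round(r, 3) for r in residual])
```

Note that real implementations would not form (MᵀM)⁻¹ explicitly; as the text warns, such computations accumulate numerical error, and library routines (e.g. an SVD-based pseudoinverse) are preferable.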

6.2.2.2 The generalization on several outputs is trivial and not quite computationally expensive

We have found a mathematically exact way to directly calculate the weights. What will happen if there are several output neurons, i.e. |O| > 1, with O being, as usual, the set of the output neurons Ω? In this case, as we have already indicated, not much changes: The additional output neurons have their own set of weights while we do not change the σ and c of the RBF layer. Thus, in an RBF network it is easy for given σ and c to realize a lot of output neurons since we only have to calculate the individual vector of weights

    GΩ = M⁺ · TΩ    (6.16)

for every new output neuron Ω, whereas the matrix M⁺, which generally requires a lot of computational effort, always stays the same. So it is quite inexpensive – at least concerning the computational complexity – to add more output neurons.

6.2.2.3 Computational effort and accuracy

For realistic problems it normally applies that there are considerably more training samples than RBF neurons, i.e. |P| ≫ |H|: You can, without any difficulty, use 10⁶ training samples, if you like. Theoretically, we could find the terms for the mathematically correct solution on the blackboard (after a very long time), but such calculations often seem to be imprecise and very time-consuming (matrix inversions require a lot of computational effort).

Furthermore, our Moore-Penrose pseudoinverse is, in spite of numeric stability, no guarantee that the output vector corresponds to the teaching vector, because such extensive computations can be prone to many inaccuracies, even though the calculation is mathematically correct: Our computers can only provide us with (nonetheless very good) approximations of the pseudo-inverse matrices. This means that we also get only approximations of the correct weights (maybe with a lot of accumulated numerical errors) and therefore only an approximation (maybe very rough or even unrecognizable) of the desired output.

If we have enough computing power to analytically determine a weight vector, we should use it nevertheless only as an initial value for our learning process, which leads us to the real training methods – but otherwise it would be boring, wouldn't it?

6.2.3 Combinations of equation system and gradient strategies are useful for training

Analogous to the MLP we perform a gradient descent to find the suitable weights by means of the already well known delta rule. Here, backpropagation is unnecessary since we only have to train one single weight layer – which requires less computing time. We know that the delta rule is

    ∆wh,Ω = η · δΩ · oh,    (6.17)

in which we now insert as follows:

    ∆wh,Ω = η · (tΩ − yΩ) · fact(||p − ch||).    (6.18)

Here again I explicitly want to mention that it is very popular to divide the training into two phases by analytically computing a set of weights and then refining it by training with the delta rule.

There is still the question whether to learn offline or online. Here, the answer is similar to the answer for the multilayer perceptron: Initially, one often trains online (faster movement across the error surface). Then, after having approximated the solution, the errors are once again accumulated and, for a more precise approximation, one trains offline in a third learning phase. However, similar to the MLPs, you can be successful by using many methods.

6.3 Training of RBF networks

As already indicated, in an RBF network not only the weights between the hidden and the output layer can be optimized. So let us now take a look at the possibility to vary σ and c.

6.3.1 It is not always trivial to determine centers and widths of RBF neurons

It is obvious that the approximation accuracy of RBF networks can be increased by adapting the widths and positions of the Gaussian bells in the input space to the problem that needs to be approximated. There are several methods to deal with the centers c and the widths σ of the Gaussian bells:

Fixed selection: The centers and widths can be selected in a fixed manner and regardless of the training samples – this is what we have assumed until now.

Conditional, fixed selection: Again centers and widths are selected fixedly, but we have previous knowledge about the functions to be approximated and comply with it.

Adaptive to the learning process: This is definitely the most elegant variant, but certainly the most challenging one, too. A realization of this approach will not be discussed in this chapter, but it can be found in connection with another network topology (section 10.6.1).
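The delta-rule phase of the two-phase training can be sketched as follows (a toy one-dimensional network with fixed centers and σ = 1; all names, the learning rate and the epoch count are my own choices, and the weights here simply start at zero instead of at an analytically computed initial value):

```python
import math

def fact(r):
    """Gaussian bell with sigma = 1."""
    return math.exp(-r * r / 2)

def output(w, centers, p):
    """y_Omega = sum_h w_{h,Omega} * fact(||p - c_h||)"""
    return sum(wh * fact(abs(p - c)) for wh, c in zip(w, centers))

def train_delta(w, centers, samples, eta=0.5, epochs=2000):
    """Online delta rule, eq. 6.18: dw = eta * (t - y) * fact(||p - c_h||)."""
    for _ in range(epochs):
        for p, t in samples:
            y = output(w, centers, p)
            for h, c in enumerate(centers):
                w[h] += eta * (t - y) * fact(abs(p - c))
    return w

centers = [0.0, 1.0, 2.0]                       # fixed centers c
samples = [(0.0, 0.3), (1.0, 1.0), (2.0, 0.1)]  # pairs (p, t)
w = train_delta([0.0, 0.0, 0.0], centers, samples)
for p, t in samples:
    assert abs(output(w, centers, p) - t) < 1e-2
```

Since only the single weight layer is trained, no backpropagation through hidden layers is needed, exactly as the text states.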

6.3.1.1 Fixed selection

In any case, the goal is to cover the input space as evenly as possible. Here, widths of 2/3 of the distance between the centers can be selected so that the Gaussian bells overlap by approx. "one third"² (fig. 6.6). The closer the bells are set, the more precise but the more time-consuming the whole thing becomes. This may seem to be very inelegant, but in the field of function approximation we cannot avoid even coverage: Here it is useless if the function to be approximated is precisely represented at some positions but at other positions the return value is only 0. However, the high input dimension requires a great many RBF neurons, which increases the computational effort exponentially with the dimension – and is responsible for the fact that six- to ten-dimensional problems in RBF networks are already called "high-dimensional" (an MLP, for example, does not cause any problems here).

² It is apparent that a Gaussian bell is mathematically infinitely wide; therefore I ask the reader to excuse this sloppy formulation.

Figure 6.6: Example for an even coverage of a two-dimensional input space by applying radial basis functions.

6.3.1.2 Conditional, fixed selection

Suppose that our training samples are not evenly distributed across the input space. It then seems obvious to arrange the centers and sigmas of the RBF neurons by means of the pattern distribution. So the training patterns can be analyzed by statistical techniques such as a cluster analysis, and so it can be determined whether there are statistical factors according to which we should distribute the centers and sigmas (fig. 6.7).

A more trivial alternative would be to set |H| centers on positions randomly selected from the set of patterns. So this method would allow for every training pattern p to be directly in the center of a neuron (fig. 6.8). This is not yet very elegant, but a good solution when time is an issue. Generally, for this method the widths are fixedly selected.
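The fixed-selection scheme can be sketched as follows (the function grid_centers and all parameters are my own choices): centers are placed on an even grid and the widths are set to 2/3 of the grid spacing; the loop illustrates how the neuron count explodes with the input dimension:

```python
import itertools

def grid_centers(low, high, per_axis, dim):
    """Evenly cover a dim-dimensional cube with RBF centers; the width sigma
    is set to 2/3 of the grid spacing so that neighboring bells overlap."""
    spacing = (high - low) / (per_axis - 1)
    axis = [low + i * spacing for i in range(per_axis)]
    centers = list(itertools.product(axis, repeat=dim))
    sigma = 2.0 / 3.0 * spacing
    return centers, sigma

for d in (1, 2, 6):
    centers, sigma = grid_centers(0.0, 1.0, 5, d)
    print(d, len(centers))   # 5, 25, 15625 neurons: exponential in d
```

With only 5 centers per axis, a six-dimensional input space already needs 15625 neurons, which is the "high-dimensional" problem mentioned above.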

If we have reason to believe that the set of training samples is clustered, we can use clustering methods to determine them. There are different methods to determine clusters in an arbitrarily dimensional set of points; we will be introduced to some of them in excursus A. One neural clustering method are the so-called ROLFs (section A.5), and self-organizing maps are also useful in connection with determining the position of RBF neurons (section 10.6.1). Using ROLFs, one can also receive indicators for useful radii of the RBF neurons. Learning vector quantisation (chapter 9) has also provided good results. All these methods have nothing to do with the RBF networks themselves but are only used to generate some previous knowledge. Therefore we will not discuss them in this chapter but independently in the indicated chapters.

Figure 6.7: Example of an uneven coverage of a two-dimensional input space, of which we have previous knowledge, by applying radial basis functions.

Figure 6.8: Example of an uneven coverage of a two-dimensional input space by applying radial basis functions. The widths were fixedly selected, the centers of the neurons were randomly distributed throughout the training patterns. This distribution can certainly lead to slightly unrepresentative results, which can be seen at the single data point down to the left.

Another approach is to use the approved methods: We could slightly move the positions of the centers and observe how our error function Err is changing – a gradient descent, as already known from the MLPs. In a similar manner we could look how the error depends on the values σ. Analogous to the derivation of backpropagation we derive

    ∂Err(σh, ch)/∂σh  and  ∂Err(σh, ch)/∂ch.

Since the derivation of these terms corresponds to the derivation of backpropagation, we do not want to discuss it here.

But experience shows that no convincing results are obtained by regarding how the error behaves depending on the centers and sigmas. Even if mathematics claims that such methods are promising, the gradient descent, as we already know, leads to problems with very craggy error surfaces – and RBF networks generate very craggy error surfaces: If we change a c or a σ, we will considerably change the appearance of the error function.

6.4 Growing RBF networks automatically adjust the neuron density

In growing RBF networks, the number |H| of RBF neurons is not constant. A certain number |H| of neurons as well as their centers ch and widths σh are previously selected (e.g. by means of a clustering method) and then extended or reduced. In the following text, only simple mechanisms are sketched. For more information, I refer to [Fri94].

6.4.1 Neurons are added to places with large error values

After generating this initial configuration, the vector of the weights G is analytically calculated. Then all specific errors Errp concerning the set P of the training samples are calculated and the maximum specific error

    maxP(Errp)

is sought.

The extension of the network is simple: We replace this maximum error with a new RBF neuron. Of course, we have to exercise care in doing this: If the σ are small, the neurons will only influence each other if the distance between them is short. But if the σ are large, the already existing neurons are considerably influenced by the new neuron because of the overlapping of the Gaussian bells.

So it is obvious that we will adjust the already existing RBF neurons when adding the new neuron. To put it simply, this adjustment is made by moving the centers c of the other neurons away from the new neuron and reducing their width σ a bit. Then the current output vector y of the network is compared to the teaching input t and the weight vector G is improved by means of training. Subsequently, a new neuron can be inserted if necessary. This method is particularly suited for function approximation.
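The growth step described above can be sketched as follows (a rough one-dimensional toy; the names, the shrink/push amounts and the crude local weight fit are my own simplifications – the text's analytic recomputation of G is replaced here by a single residual fit for the new neuron):

```python
import math

def fact(r, sigma):
    return math.exp(-r * r / (2 * sigma * sigma))

def output(p, centers, sigmas, weights):
    return sum(w * fact(abs(p - c), s) for c, s, w in zip(centers, sigmas, weights))

def grow_step(samples, centers, sigmas, weights, shrink=0.9, push=0.1):
    """One growth step: put a new RBF neuron on the sample with the largest
    specific error, then move the neighbors away a bit and narrow their bells."""
    errors = [(t - output(p, centers, sigmas, weights)) ** 2 for p, t in samples]
    worst = max(range(len(samples)), key=errors.__getitem__)
    p, t = samples[worst]
    for i, c in enumerate(centers):           # adjust the existing neurons
        centers[i] = c + push * (1 if c >= p else -1)
        sigmas[i] *= shrink
    centers.append(p)                          # the new neuron sits on the error
    sigmas.append(1.0)
    weights.append(t - output(p, centers, sigmas, weights))  # crude local fit
    return centers, sigmas, weights

samples = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0)]
centers, sigmas, weights = [0.0, 2.0], [1.0, 1.0], [0.0, 0.0]
centers, sigmas, weights = grow_step(samples, centers, sigmas, weights)
print(len(centers))   # 3 - a neuron was added at p = 1.0
```

A real implementation would retrain (or analytically recompute) all weights after every insertion and stop at a neuron limit, as the following subsections discuss.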

6.4.2 Limiting the number of neurons

Here it is mandatory to see that the network will not grow ad infinitum, which can happen very fast. Thus, it is very useful to previously define a maximum number of neurons |H|max. Which leads to the question whether it is possible to continue learning when this limit |H|max is reached.

6.4.3 Less important neurons are deleted

The answer is: this would not stop learning. We only have to look for the "most unimportant" neuron and delete it. A neuron is, for example, unimportant for the network if there is another neuron that has a similar function: It often occurs that two Gaussian bells exactly overlap, and at such a position, for instance, one single neuron with a higher Gaussian bell would be appropriate. But to develop automated procedures in order to find less relevant neurons is highly problem-dependent, and we want to leave this to the programmer.

6.5 Comparing RBF networks and multilayer perceptrons

With RBF networks and multilayer perceptrons we have already become acquainted with, and extensively discussed, two network paradigms for similar problems. Therefore we want to compare these two paradigms and look at their advantages and disadvantages with respect to different aspects.

Input dimension: We must be careful with RBF networks in high-dimensional functional spaces, since the network could very quickly require huge memory storage and computational effort. Here, a multilayer perceptron would cause less problems because its number of neurons does not grow exponentially with the input dimension.

Center selection: However, selecting the centers c for RBF networks is (despite the introduced approaches) still a major problem. Please use any previous knowledge you have when applying them. Such problems do not occur with the MLP.

Output dimension: The advantage of RBF networks is that the training is not much influenced when the output dimension of the network is high. For an MLP, a learning procedure such as backpropagation thereby will be very time-consuming.

Extrapolation: Advantage as well as disadvantage of RBF networks is the lack

of extrapolation capability: An RBF network returns the result 0 far away from the centers of the RBF layer. On the one hand it does not extrapolate – unlike the MLP it cannot be used for extrapolation (whereby we could never know if the extrapolated values of the MLP are reasonable, but experience shows that MLPs are suitable for that matter). On the other hand, unlike the MLP, the network is capable to use this 0 to tell us "I don't know", which could be an advantage.

Lesion tolerance: For the output of an MLP, it is not so important if a weight or a neuron is missing. It will only worsen a little in total. If a weight or a neuron is missing in an RBF network, then large parts of the output remain practically uninfluenced. But one part of the output is heavily affected because a Gaussian bell is directly missing. Thus, we can choose between a strong local error for lesion and a weak but global error.

Spread: Here the MLP is "advantaged" since RBF networks are used considerably less often – which is not always understood by professionals (at least as far as low-dimensional input spaces are concerned). The MLPs seem to have a considerably longer tradition and they are working too good to take the effort to read some pages of this work about RBF networks :-).

Exercises

Exercise 13. An |I|-|H|-|O| RBF network with fixed widths and centers of the neurons should approximate a target function u. For this, |P| training samples of the form (p, t) of the function u are given. Let |P| > |H| be true. The weights should be analytically determined by means of the Moore-Penrose pseudo inverse. Indicate the running time behavior regarding |P| and |O| as precisely as possible.

Note: There are methods for matrix multiplications and matrix inversions that are more efficient than the canonical methods. For better estimations, I recommend to look for such methods (and their complexity). In addition to your complexity calculations, please indicate the used methods together with their complexity.

Chapter 7

Recurrent perceptron-like networks (depends on chapter 5)

Some thoughts about networks with internal states.

Generally, recurrent networks are networks that are capable of influencing themselves by means of recurrences, e.g. by including the network output in the following computation steps. There are many types of recurrent networks of nearly arbitrary form, and nearly all of them are referred to as recurrent neural networks; for the few paradigms introduced here I use the name recurrent multilayer perceptrons.

Apparently, such a recurrent network is capable to compute more than the ordinary MLP: If the recurrent weights are set to 0, the recurrent network will be reduced to an ordinary MLP. Additionally, the recurrence generates different network-internal states, so that different inputs can produce different outputs in the context of the network state.

Recurrent networks in themselves have a great dynamic that is mathematically difficult to conceive and has to be discussed extensively. The aim of this chapter is only to briefly discuss how recurrences can be structured and how network-internal states can be generated. Thus, I will briefly introduce two paradigms of recurrent networks and afterwards roughly outline their training.

With a recurrent network an input x that is constant over time may lead to different results: On the one hand, the network could converge, i.e. it could transform itself into a fixed state and at some time return a fixed output value y. On the other hand, it could never converge, or at least not until a long time later, so that it can no longer be recognized, and as a consequence, y constantly changes.

If the network does not converge, it is, for example, possible to check if periodicals or attractors (fig. 7.1) are returned. Here we can expect the complete variety of dynamical systems; that is the reason why I particularly want to refer to the literature concerning dynamical systems.

Figure 7.1: The Roessler attractor.

Further discussions could reveal what will happen if the input of recurrent networks is changed.

In this chapter the related paradigms of recurrent networks according to Jordan and Elman will be introduced.

7.1 Jordan networks

A Jordan network [Jor86] is a multilayer perceptron with a set K of so-called context neurons k1, k2, ..., k|K|. There is one context neuron per output neuron (fig. 7.2), and there are weighted connections between each output neuron and one context neuron. In principle, a context neuron just memorizes an output until it can be processed in the next time step. The stored values are returned to the actual network by means of complete links between the context neurons and the input layer.

In the original definition of a Jordan network the context neurons are also recurrent to themselves via a connecting weight λ. But most applications omit this recurrence, since the Jordan network is already very dynamic and difficult to analyze even without these additional recurrences.

Definition 7.1 (Context neuron). A context neuron k receives the output value of another neuron i at a time t and then reenters it into the network at a time (t + 1).

Definition 7.2 (Jordan network). A Jordan network is a multilayer perceptron with one context neuron per output neuron. The set of context neurons is called K. The context neurons are completely linked toward the input layer of the network.

Figure 7.2: Illustration of a Jordan network. The network output is buffered in the context neurons and with the next time step it is entered into the network together with the new input.
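Definitions 7.1 and 7.2 can be illustrated with a minimal sketch (linear activation, no hidden layer; all names and numbers are my own choices): the context neuron buffers the output of time t and feeds it back at time t + 1, so that a constant input produces a changing output:

```python
def step(x, context, W_in, W_ctx):
    """One time step of a minimal Jordan network (no hidden layer, identity
    activation): each output neuron sees the input and the buffered context."""
    y = [sum(wi * xi for wi, xi in zip(row_in, x)) +
         sum(wc * c for wc, c in zip(row_ctx, context))
         for row_in, row_ctx in zip(W_in, W_ctx)]
    return y, y[:]          # the new context is a copy of the output (one k per output)

W_in, W_ctx = [[1.0]], [[0.5]]   # one input, one output/context neuron
context = [0.0]                   # the context neuron needs an initial value
outputs = []
for t in range(4):
    y, context = step([1.0], context, W_in, W_ctx)
    outputs.append(y[0])
print(outputs)   # [1.0, 1.5, 1.75, 1.875] - constant input, changing output
```

Here the output sequence converges toward a fixed value, which is the "network could converge" case discussed at the beginning of the chapter.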

7.2 Elman networks

The Elman networks (a variation of the Jordan networks) [Elm90] have context neurons, too, but one layer of context neurons per information processing neuron layer (fig. 7.3). Thus, the outputs of each hidden neuron or output neuron are led into the associated context layer (again exactly one context neuron per neuron) and from there they are reentered into the complete neuron layer during the next time step (i.e. again a complete link on the way back). So the complete information processing part¹ of the MLP exists a second time as a "context version" – which once again considerably increases dynamics and state variety.

Compared with Jordan networks, the Elman networks often have the advantage to act more purposefully, since every layer can access its own context.

Definition 7.3 (Elman network). An Elman network is an MLP with one context neuron per information processing neuron. The set of context neurons is called K. This means that there exists one context layer per information processing neuron layer with exactly the same number of context neurons. Every neuron has a weighted connection to exactly one context neuron, while the context layer is completely linked towards its original layer.

¹ Remember: The input layer does not process information.



Figure 7.3: Illustration of an Elman network. The entire information processing part of the network exists, in a way, twice. The output of each neuron (except for the output of the input neurons) is buﬀered and reentered into the associated layer. For the reason of clarity I named the context neurons on the basis of their models in the actual network, but it is not mandatory to do so.

7.3 Training recurrent networks

In order to explain the training as comprehensibly as possible, we have to agree on some simplifications that do not affect the learning principle itself.

Now it is interesting to take a look at the training of recurrent networks since, for instance, ordinary backpropagation of error cannot work on recurrent networks. Once again, the style of the following part is rather informal, which means that I will not use any formal definitions.

So for the training let us assume that in the beginning the context neurons are initiated with an input, since otherwise they would have an undefined input (this is no simplification but reality). Furthermore, we use a Jordan network without a hidden neuron layer for our training attempts, so that the output neurons can directly provide input. This approach is a strong simplification because generally more complicated networks are used. But this does not change the learning principle.


7.3.1 Unfolding in time

Remember our actual learning procedure for MLPs, the backpropagation of error, which backpropagates the delta values. So, in case of recurrent networks, the delta values would backpropagate cyclically through the network again and again, which makes the training more difficult. On the one hand we cannot know which of the many generated delta values for a weight should be selected for training, i.e. which values are useful. On the other hand we cannot definitely know when learning should be stopped. The advantage of recurrent networks are great state dynamics within the network; the disadvantage of recurrent networks is that these dynamics are also granted to the training and therefore make it difficult.

One learning approach would be the attempt to unfold the temporal states of the network (fig. 7.4): Recursions are deleted by putting a similar network above the context neurons, i.e. the context neurons are, as a manner of speaking, the output neurons of the attached network. More generally spoken, we have to backtrack the recurrences and place "earlier" instances of neurons in the network – thus creating a larger, but forward-oriented network without recurrences. This enables training a recurrent network with any training strategy developed for non-recurrent ones. Here, the input is entered as teaching input into every "copy" of the input neurons. This can be done for a discrete number of time steps. These training paradigms are called unfolding in time [MP69]. After the unfolding, a training by means of backpropagation of error is possible.

But obviously, for one weight wi,j several changing values ∆wi,j are received, which can be treated differently: accumulation, averaging etc. A simple accumulation could possibly result in enormous changes per weight if all changes have the same sign. Hence, also the average is not to be underestimated. We could also introduce a discounting factor, which weakens the influence of the ∆wi,j of the past.

Unfolding in time is particularly useful if we receive the impression that the closer past is more important for the network than the one being further away. The reason for this is that backpropagation has only little influence in the layers farther away from the output (remember: the farther we are from the output layer, the smaller the influence of backpropagation).

Disadvantages: the training of such an unfolded network will take a long time, since a large number of layers could possibly be produced. A problem that is no longer negligible is the limited computational accuracy of ordinary computers, which is exhausted very fast because of so many nested computations (the farther we are from the output layer, the smaller the influence of backpropagation, so that this limit is reached).
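The unfolding idea can be checked on the smallest possible example (one linear neuron with one recurrent weight; all names are mine): running the recurrent step for n time steps gives exactly the same result as the unfolded, purely feedforward expression over n "copies":

```python
def recurrent(xs, w, c, y0=0.0):
    """Run the recurrent step y_t = w*x_t + c*y_{t-1} through the inputs."""
    y = y0
    for x in xs:
        y = w * x + c * y
    return y

def unfolded(xs, w, c, y0=0.0):
    """The same computation after unfolding: every time step has become one
    feedforward 'copy', y_n = c^n*y0 + sum_k c^(n-1-k)*w*x_k - no recurrence."""
    n = len(xs)
    return c ** n * y0 + sum(c ** (n - 1 - k) * w * x for k, x in enumerate(xs))

xs = [1.0, 0.5, -0.25, 2.0]
assert abs(recurrent(xs, 0.8, 0.5) - unfolded(xs, 0.8, 0.5)) < 1e-12
```

The factor c^(n−1−k) on old inputs also makes the discounting remark above concrete: contributions from the distant past are weakened geometrically when |c| < 1.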



Figure 7.4: Illustration of the unfolding in time with a small exemplary recurrent MLP. Top: The recurrent MLP. Bottom: The unfolded network. For reasons of clarity, I only added names to the lowest part of the unfolded network. Dotted arrows leading into the network mark the inputs. Dotted arrows leading out of the network mark the outputs. Each "network copy" represents a time step of the network with the most recent time step being at the bottom.


Furthermore, with several levels of context neurons this procedure could produce very large networks to be trained.

7.3.2 Teacher forcing

Other procedures are the equivalent teacher forcing and open loop learning. They detach the recurrence during the learning process: We simply pretend that the recurrence does not exist and apply the teaching input to the context neurons during the training. So, backpropagation becomes possible, too. Disadvantage: with Elman networks, a teaching input for non-output-neurons is not given.

7.3.3 Recurrent backpropagation

Another popular procedure without limited time horizon is the recurrent backpropagation using methods of differential calculus to solve the problem [Pin87].

7.3.4 Training with evolution

Due to the already long lasting training time, evolutionary algorithms have proved to be of value, especially with recurrent networks. One reason for this is that they are not only unrestricted with respect to recurrences, but they also have other advantages when the mutation mechanisms are chosen suitably: So, for example, neurons and weights can be adjusted and the network topology can be optimized (of course the result of learning is not necessarily a Jordan or Elman network). With ordinary MLPs, however, evolutionary strategies are less popular since they certainly need a lot more time than a directed learning procedure such as backpropagation.
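The teacher forcing procedure described above can be illustrated with the same kind of minimal linear Jordan-style sketch (names and numbers are mine): in the free-running network the context neuron buffers the network's own output, while under teacher forcing it is overwritten with the teaching input, detaching the recurrence:

```python
def step(x, context, w_in=1.0, w_ctx=0.5):
    """A single linear output neuron with one context neuron."""
    return w_in * x + w_ctx * context

inputs = [1.0, 1.0, 1.0]
teach  = [0.5, 1.0, 0.0]          # teaching inputs for the output neuron

context_free, context_forced = 0.0, 0.0
free, forced = [], []
for x, t in zip(inputs, teach):
    y = step(x, context_free)      # free-running: context buffers own output
    free.append(y)
    context_free = y
    y = step(x, context_forced)    # teacher forcing: the context is fed the
    forced.append(y)               # teaching input instead
    context_forced = t
print(free, forced)   # [1.0, 1.5, 1.75] [1.0, 1.25, 1.5]
```

Because the forced context no longer depends on earlier network outputs, each time step can be trained like an ordinary feedforward pass.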


Chapter 8 Hopfield networks

In a magnetic field, each particle applies a force to any other particle so that all particles adjust their movements in the energetically most favorable way. This natural mechanism is copied to adjust noisy inputs in order to match their real models.

Another supervised learning example from the wide range of neural networks was developed by John Hopfield: the so-called Hopfield networks [Hop82]. Hopfield and his physically motivated networks have contributed a lot to the renaissance of neural networks.

8.1 Hopfield networks are inspired by particles in a magnetic field

The idea for the Hopfield networks originated from the behavior of particles in a magnetic field: every particle "communicates" (by means of magnetic forces) with every other particle (completely linked), with each particle trying to reach an energetically favorable state (i.e. a minimum of the energy function); as for the neurons, this state is known as activation. Thus, all particles or neurons rotate and thereby encourage each other to continue this rotation. As a manner of speaking, our neural network is a cloud of particles. Based on the fact that the particles automatically detect the minima of the energy function, Hopfield had the idea to use the "spin" of the particles to process information: why not let the particles search for minima on arbitrary functions? Even if we only use two of those spins, i.e. a binary activation, we will recognize that the resulting Hopfield network shows considerable dynamics.

Briefly speaking, a Hopfield network consists of a set K of completely linked neurons with binary activation (since we only use two spins), in which all neurons influence each other symmetrically.

Figure 8.1: Illustration of an exemplary Hopfield network. The arrows ↑ and ↓ mark the binary "spin". Due to the completely linked neurons the layers cannot be separated, which means that a Hopfield network simply consists of a set of neurons.

Definition 8.1 (Hopfield network). A Hopfield network consists of a set K of completely linked neurons without direct recurrences, i.e. with the weights being symmetric between the individual neurons and without any neuron being directly connected to itself (fig. 8.1). The activation function of the neurons is the binary threshold function with outputs ∈ {1, −1}.

Definition 8.2 (State of a Hopfield network). The state of the network consists of the activation states of all neurons. Thus, the state of the network can be understood as a binary string z ∈ {−1, 1}^|K|.

8.2 Structure and functionality

Now let us take a closer look at the contents of the weight matrix and the rules for the state change of the neurons. The complete link provides a full square matrix of weights between the neurons; the meaning of the weights will be discussed in the following. Furthermore, we will soon recognize according to which rules the neurons are spinning, i.e. changing their state. What may seem unusual at first is the fact that we do not know any input, output or hidden neurons: in a Hopfield network, input and output are network states.

8.2.1 Input and output of a Hopfield network are represented by neuron states

We have learned that a network, i.e. a set of |K| particles, that is in a state automatically looks for a minimum. An input pattern of a Hopfield network is exactly such a state: a binary string x ∈ {−1, 1}^|K| that initializes the neurons. Then the network looks for the minimum to be taken on its energy surface (which we have previously defined by the input of training samples).

But when do we know that the minimum has been found? This is simple, too: when the network stops. It can be proven that a Hopfield network with a symmetric weight matrix that has zeros on its diagonal always converges [CG88], i.e. at some point it will stand still. Then the output is a binary string y ∈ {−1, 1}^|K|, namely the state string of the network that has found a minimum.

Definition 8.3 (Input and output of a Hopfield network). The input of a Hopfield network is a binary string x ∈ {−1, 1}^|K| that initializes the state of the network. After the convergence of the network, the output is the binary string y ∈ {−1, 1}^|K| generated from the new network state.

8.2.2 Significance of weights

We have already said that the neurons change their states, i.e. their direction, from −1 to 1 or vice versa. These spins occur dependent on the current states of the other neurons and the associated weights. Thus, the weights are capable to control the complete change of the network. If w_{i,j} is positive, it will try to force the two neurons i and j to become equal – the larger the weight, the harder the network will try: if the neuron i is in state 1 and the neuron j is in state −1, a high positive weight will advise the two neurons that it is energetically more favorable to be equal. If w_{i,j} is negative, its behavior will be analogous, only that i and j are urged to be different; a neuron i in state −1 would then try to urge a neuron j into state 1. Zero weights lead to the two involved neurons not influencing each other. The weights as a whole apparently take the way from the current state of the network towards the next minimum of the energy function. We now want to discuss how the neurons follow this way.

8.2.3 A neuron changes its state according to the influence of the other neurons

Once a network has been trained and initialized with some starting state, the change of state x_k of the individual neurons k occurs in each time step according to the scheme

    x_k(t) = f_act( Σ_{j∈K} w_{j,k} · x_j(t−1) ),    (8.1)

where the function f_act generally is the binary threshold function (fig. 8.2 on the next page) with threshold 0. Colloquially speaking: a neuron k calculates the sum of the w_{j,k} · x_j(t−1), which indicates how strong and into which direction the neuron k is forced by the other neurons j. Thus, the new state of the network (time t) results from the state of the network at the previous time t−1. Depending on the sign of this sum, the neuron takes state 1 or −1.

Another difference between Hopfield networks and other already known network topologies is the asynchronous update: a neuron k is randomly chosen every time, which then recalculates its activation.

Thus, the new activation of previously changed neurons immediately influences the network, i.e. one time step indicates the change of a single neuron. Regardless of the aforementioned random selection of the neuron, a Hopfield network is often much easier to implement: the neurons are simply processed one after the other and their activations are recalculated until no more changes occur.

Definition 8.4 (Change in the state of a Hopfield network). The change of state of the neurons occurs asynchronously, with the neuron to be updated being randomly chosen and the new state being generated by means of the rule

    x_k(t) = f_act( Σ_{j∈K} w_{j,k} · x_j(t−1) ).

Figure 8.2: Illustration of the binary threshold (Heaviside) function.

8.3 The weight matrix is generated directly out of the training patterns

The aim is to generate minima on the mentioned energy surface, so that at an input the network can converge to them. As with many other network paradigms, we use a set P of training patterns p ∈ {1, −1}^|K|, representing the minima of our energy surface. Unlike many other network paradigms, we do not look for the minima of an unknown error function but define minima on such a function. The purpose is that the network shall automatically take the closest minimum when the input is presented. For now this seems unusual, but we will understand the whole purpose later.

Roughly speaking, the training of a Hopfield network is done by training each training pattern exactly once using the rule described in the following (single shot learning), where p_i and p_j are the states of the neurons i and j under p ∈ P:

    w_{i,j} = Σ_{p∈P} p_i · p_j    (8.2)

This results in the weight matrix W. Colloquially speaking: we initialize the network by means of a training pattern and then process the weights w_{i,j} one after another.

For each of these weights we verify: are the neurons i, j in the same state, or do the states vary? In the first case we add 1 to the weight, in the second case we add −1. This we repeat for each training pattern p ∈ P. Thus, the values of the weights w_{i,j} are high when i and j corresponded with many training patterns. Colloquially speaking, this high value tells the neurons: "Often, it is energetically favorable to hold the same state". The same applies to negative weights. Due to this training we can store a certain fixed number of patterns p in the weight matrix. If more patterns are entered, already stored information will be destroyed.

Definition 8.5 (Learning rule for Hopfield networks). The individual elements of the weight matrix W are defined by a single processing of the learningrule

    w_{i,j} = Σ_{p∈P} p_i · p_j,

where the diagonal of the matrix is covered with zeros. Unfortunately, the number of maximum storable and reconstructible patterns p is limited to |P|_MAX ≈ 0.139 · |K|, which in turn only applies to orthogonal patterns. This was shown by precise (and time-consuming) mathematical analyses, which we do not want to specify here. Thus, no more than about 0.139 · |K| training samples can be trained and at the same time maintain their function.

8.4 Autoassociation and traditional application

Now we know the functionality of Hopfield networks, but nothing about their practical use. Hopfield networks, like those mentioned above, are called autoassociators. An autoassociator a exactly shows the aforementioned behavior: firstly, when a known pattern p is entered, exactly this known pattern is returned. Thus,

    a(p) = p,

with a being the associative mapping. Secondly, and that is the practical use, this also works with inputs that are close to a pattern:

    a(p + ε) = p.    (8.3)

The autoassociator is, in any case, in a stable state, namely in the state p: the network restores damaged inputs.

The primary fields of application of Hopfield networks are pattern recognition and pattern completion, such as the zip code recognition on letters in the eighties. If the set of patterns P consists of, for example, letters or other characters in the form of pixels, the network will be able to correctly recognize deformed or noisy letters with high probability (fig. 8.3 on the following page).
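The single shot learning rule and the autoassociative recall a(p + ε) = p can be combined into a small runnable sketch (my own minimal illustration; the pattern and the noise are made up):

```python
import numpy as np

def train_hopfield(patterns):
    """Single shot learning: w_ij = sum over patterns of p_i * p_j,
    with the diagonal covered with zeros (no direct recurrences)."""
    K = len(patterns[0])
    W = np.zeros((K, K))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)
    return W

def recall(W, x, steps=200, seed=0):
    """Asynchronous update: a randomly chosen neuron k recomputes its
    state from the weighted sum of the other neuron states."""
    rng = np.random.default_rng(seed)
    x = x.copy()
    for _ in range(steps):
        k = rng.integers(len(x))
        x[k] = 1 if W[:, k] @ x >= 0 else -1   # binary threshold; ties -> +1 (a choice)
    return x
```

Storing a single pattern and flipping one of its bits, the network converges back to the stored minimum; keep in mind that only about 0.139 · |K| patterns can be stored this way.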

But soon the Hopfield networks were replaced by other systems in most of their fields of application, for example by OCR systems in the field of letter recognition. Today Hopfield networks are virtually no longer used; they have not become established in practice.

Figure 8.3: Illustration of the convergence of an exemplary Hopfield network. The upper illustration shows the training samples, the lower shows the convergence of a heavily noisy 3 to the corresponding training sample. Each of the pictures has 10 × 12 = 120 binary pixels; in the Hopfield network each pixel corresponds to one neuron.

8.5 Heteroassociation and analogies to neural data storage

So far we have been introduced to Hopfield networks that converge from an arbitrary input into the closest minimum of a static energy surface. Another variant is a dynamic energy surface: here, the appearance of the energy surface depends on the current state, and we receive a heteroassociator instead of an autoassociator. For a heteroassociator

    a(p + ε) = p

is no longer true, but rather

    h(p + ε) = q,

which means that a pattern is mapped onto another one; h is the heteroassociative mapping. Such heteroassociations are achieved by means of an asymmetric weight matrix V.

8.5.1 Generating the heteroassociative matrix

We generate the matrix V by means of elements v very similar to the autoassociative matrix, with p being (per transition) the training sample before the transition and q being the training sample to be generated from p:

    v_{i,j} = p_i · q_j    (8.4)

The diagonal of the matrix is again filled with zeros, and several transitions can be introduced into the matrix by a simple addition, whereby the said capacity limitation exists here, too.

Definition 8.6 (Learning rule for the heteroassociative matrix). For two training samples p being predecessor and q being successor of a heteroassociative transition, the weights of the heteroassociative matrix V result from the learning rule

    v_{i,j} = Σ_{p,q∈P, p≠q} p_i · q_j,

with several heteroassociations being introduced into the network by a simple addition.

Heteroassociations connected in series of the form

    p → q → r → s → ... → z → p

can provoke a fast cycle of states h(p + ε) = q, h(q + ε) = r, h(r + ε) = s, ..., h(z + ε) = p, whereby a single pattern is never completely accepted: before a pattern is entirely completed, the heteroassociation already tries to generate the successor of this pattern. Additionally, the network would never stop, since after having reached the last state z it would proceed to the first state p again – the network is unstable while changing states.

8.5.2 Stabilizing the heteroassociations

We have already mentioned the problem that the patterns are not completely generated, but that the next pattern is already beginning before the generation of the previous pattern is finished. This problem can be avoided by influencing the network not only by means of the heteroassociative matrix V but also by the already known autoassociative matrix W. Additionally, the neuron adaptation rule is changed so that competing terms are generated: one term autoassociating an existing pattern and one term trying to convert the very same pattern into its successor. The neuron states are, as always, adapted during operation. The associative rule provokes that the network stabilizes a pattern, remains there for a while, goes on to the next pattern, and so on.
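Rule (8.4) for the heteroassociative matrix can be sketched as follows (my own illustration; for simplicity the transition is applied synchronously in one shot rather than asynchronously):

```python
import numpy as np

def hetero_matrix(transitions):
    """v_ij = p_i * q_j per transition (p, q); several transitions are
    introduced by simple addition, diagonal again filled with zeros."""
    K = len(transitions[0][0])
    V = np.zeros((K, K))
    for p, q in transitions:
        V += np.outer(p, q)
    np.fill_diagonal(V, 0)
    return V

p = np.array([1, 1, -1, -1])
q = np.array([-1, 1, -1, 1])
V = hetero_matrix([(p, q)])
# starting from p, a threshold step on V produces the successor q
nxt = np.where(V.T @ p >= 0, 1, -1)
```

The asymmetry of V is exactly what pushes the state forward instead of merely stabilizing it.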

The stabilization is achieved by the state-change rule

    x_i(t + 1) = f_act( Σ_{j∈K} w_{i,j} · x_j(t) + Σ_{k∈K} v_{i,k} · x_k(t − ∆t) ),    (8.5)

where the first sum represents the autoassociation and the second the (delayed) heteroassociation. Here, the value ∆t causes, descriptively speaking, the influence of the matrix V to be delayed, since it only refers to a network state ∆t versions behind. If ∆t is set to, for example, twenty steps, then the asymmetric weight matrix will realize any change in the network only twenty steps later, so that it initially works with the autoassociative matrix (since it still perceives the predecessor pattern of the current one), and only after that it will work against it. The result is a continuous change in states, during which the individual states are stable for a short while; as we know, the network is stable for symmetric weight matrices with zeros on the diagonal.

8.5.3 Biological motivation of heteroassociation

From a biological point of view the transition of stable states into other stable states is highly motivated: at least in the beginning of the nineties it was assumed that the Hopfield model would achieve an approximation of the state dynamics in the brain, which realizes much by means of state chains. When I ask you, dear reader, to recite the alphabet, you generally will manage this better than (please try it immediately) answering the following question:

    Which letter in the alphabet follows the letter P?

Another example is the phenomenon that one cannot remember a situation, but the place at which one memorized it the last time is perfectly known. If one returns to this place, the forgotten situation often comes back to mind.

8.6 Continuous Hopfield networks

So far, we have only discussed Hopfield networks with binary activations. But Hopfield also described a version of his networks with continuous activations [Hop84], which we want to cover here at least briefly: continuous Hopfield networks. Here, the activation is no longer calculated by the binary threshold function but by the Fermi function with temperature parameters (fig. 8.4 on the next page).

Hopfield also stated that continuous Hopfield networks can be applied to find acceptable solutions for the NP-hard traveling salesman problem [HT85]. According to some verification trials [Zel94], however, this statement can no longer be upheld; today there are faster algorithms for handling this problem, and therefore the Hopfield network is no longer used here.
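The Fermi function with a temperature parameter T (fig. 8.4) can be written as follows (a standard formulation; the parameterization is my own choice):

```python
import numpy as np

def fermi(x, T=1.0):
    """Logistic (Fermi) function: larger temperatures T flatten the
    curve toward the constant 1/2, smaller T approach the binary
    threshold function."""
    return 1.0 / (1.0 + np.exp(-x / T))
```

For T → 0 the continuous network therefore degenerates into the binary one discussed before.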

Compute the weights wi. −1.4: The already known Fermi function with diﬀerent temperature parameter variations. 1. 1). −1. Exercises Exercise 14.com 8. −1). Kriesel – A Brief Introduction to Neural Networks (ZETA2-EN) 137 .6 f(x) 0. (1. 1. −1. D. (−1.8 0.2 0 −4 −2 0 x 2 4 Figure 8.4 0. −1.j for a Hopﬁeld network using the training set P ={(−1.dkriesel. −1. −1. 1)}. Indicate the storage requirements for a Hopﬁeld network with |K | = 1000 neurons when the weights wi.6 Continuous Hopﬁeld networks Fermi Function with Temperature Parameter 1 0. 1.j shall be stored as integers. −1. −1. Is it possible to limit the value range of the weights in order to save storage space? Exercise 15. −1.


Chapter 9 Learning vector quantization

Learning vector quantization is a learning procedure with the aim to represent the vector training sets divided into predefined classes as well as possible by using a few representative vectors. If this has been managed, vectors which were unknown until then could easily be assigned to one of these classes.

Slowly, part II of this text is nearing its end – and therefore I want to write a last chapter for this part that will be a smooth transition into the next one: a chapter about the learning vector quantization (abbreviated LVQ) [Koh89] described by Teuvo Kohonen, which can be characterized as being related to the self-organizing feature maps. These SOMs are described in the next chapter, which already belongs to part III of this text, since SOMs learn unsupervised. Thus, after the exploration of LVQ I want to bid farewell to supervised learning. Previously, I want to announce that there are different variations of LVQ, which will be mentioned but not exactly represented; the goal of this chapter is rather to analyze the underlying principle.

9.1 About quantization

In order to explore the learning vector quantization we should at first get a clearer picture of what quantization (which can also be referred to as discretization) is. Everybody knows the sequence of discrete numbers N = {1, 2, 3, ...}, which contains the natural numbers. Discrete means that this sequence consists of separated elements that are not interconnected. The elements of our example are exactly such numbers, because the natural numbers do not include, for example, numbers between 1 and 2. On the other hand, the sequence of real numbers R is continuous: it does not matter how close two selected numbers are, there will always be a number between them.
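As a tiny code illustration (my own, not from the text), the contrast between the continuous reals and the discrete naturals becomes a quantization as soon as we map each real number to a discrete representative:

```python
import math

def quantize(x):
    """Assign a real number to a discrete representative by deleting
    its decimal places: every x in [2, 3) is represented by 2."""
    return math.floor(x)
```

Every interval [n, n + 1) of the continuous space is thus collapsed onto the single discrete element n.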

Quantization means that a continuous space is divided into discrete sections: by deleting, for example, all decimal places of the real number 2.71828, it could be assigned to the natural number 2. Here it is obvious that any other number having a 2 in front of the comma would also be assigned to the natural number 2; thus, 2 would be some kind of representative for all real numbers within the interval [2, 3).

It must be noted that a sequence can be irregularly quantized, too: for instance, the timeline for a week could be quantized into working days and weekend.

A special case of quantization is digitization: in case of digitization we always talk about regular quantization of a continuous space into a number system with respect to a certain basis. If we enter, for example, some numbers into the computer, these numbers will be digitized into the binary system (basis 2).

Definition 9.1 (Quantization). Separation of a continuous space into discrete sections.

Definition 9.2 (Digitization). Regular quantization of a continuous space into a number system with respect to a certain basis.

9.2 LVQ divides the input space into separate areas

Now it is almost possible to describe by means of its name what LVQ should enable us to do: a set of representatives should be used to divide an input space into classes that reflect the input space as well as possible (fig. 9.1 on the facing page). Thus, each element of the input space should be assigned to a vector as a representative, i.e. to a class, where the set of these representatives should represent the entire input space as precisely as possible. Such a vector is called a codebook vector. A codebook vector is the representative of exactly those input space vectors lying closest to it, which divides the input space into the said discrete areas.

It is to be emphasized that we have to know in advance how many classes we have and which training sample belongs to which class. Furthermore, it is important that the classes must not be disjoint, which means they may overlap.

Such separation of data into classes is interesting for many problems for which it is useful to explore only some characteristic representatives instead of the possibly huge set of all vectors – be it because it is less time-consuming or because it is sufficiently precise.

9.3 Using codebook vectors: the nearest one is the winner

The use of a prepared set of codebook vectors is very simple: for an input vector y the class association is easily decided by considering which codebook vector is the closest – the codebook vectors build a voronoi diagram out of the set.
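The "nearest codebook vector wins" decision is a one-liner (my own sketch; Euclidean distance assumed):

```python
import numpy as np

def classify(y, codebook, classes):
    """Return the class of the codebook vector closest to input y."""
    dists = [np.linalg.norm(y - c) for c in codebook]
    return classes[int(np.argmin(dists))]
```

Each codebook vector thereby represents exactly the voronoi cell of input vectors lying closest to it.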

Since each codebook vector is clearly associated to a class, each input vector is associated to a class, too.

Figure 9.1: Examples for quantization of a two-dimensional input space. The lines represent the class limits, the × mark the codebook vectors.

9.4 Adjusting codebook vectors

As we have already indicated, the LVQ is a supervised learning procedure. Thus, we have a teaching input that tells the learning procedure whether the classification of the input pattern is right or wrong: in other words, we have to know in advance the number of classes to be represented or the number of codebook vectors.

Roughly speaking, it is the aim of the learning procedure that training samples are used to cause a previously defined number of randomly initialized codebook vectors to reflect the training data as precisely as possible.

9.4.1 The procedure of learning

Learning works according to a simple scheme. We have (since learning is supervised) a set P of |P| training samples. Additionally, we already know that classes are predefined, too, i.e. we also have a set of classes C. A codebook vector is clearly assigned to each class; thus, we can say that the set of classes |C| contains many codebook vectors C1, C2, ..., C|C|.

This leads to the structure of the training samples: they are of the form (p, c) and therefore contain the training input vector p and its class affiliation c.
Training sample: A training sample p of our training set P is selected and presented. the training sample to a class or a codebook vector. |C |} Ci (t + 1) = Ci (t) + ∆Ci . C|C | and our input p. . their codebook vectors there – and that’s it. . C2 . Distance measurement: We measure the distance ||p − C || between all codebook vectors C1 . . For the class aﬃliation ∆Ci = η (t) · h(p. Assignment is correct: The winner vector is the codebook vector of the class that includes p.e. Ci ) · (p − Ci ) (9. (9. I only want to brieﬂy discuss the steps of the fundamental LVQ learning procedure: Initialization: We place our set of codebook vectors on random positions in the input space. Kriesel – A Brief Introduction to Neural Networks (ZETA2-EN) . With good reason: From here on. Therefore it moves away from p. the one with into diﬀerent nuances. which means that it clearly assigns which we now want to break down.2) holds. But the function h(p." But we will see soon that our learning procedure can do a lot more. 142 D. Ci ) is the core of the rule: It implements a distinction of cases. . we could say about learning: ing rate allowing us to diﬀerentiate "Why a learning procedure? We calculate between large learning steps and ﬁne the average of all class members and place tuning. the function provides positive values and the codebook vector moves towards p.com therefore contain the training input vector Learning process: The learning process takes place according to the rule p and its class aﬃliation c. . . Ci ∈ C be deﬁned (called LVQ1. The last factor (p − Ci ) is obviously the direction toward which the codebook vector is moved. . i. We have already seen that the ﬁrst factor η (t) is a time-dependent learnIntuitively. LVQ2. Assignment is wrong: The winner vector does not represent the class that includes p. LVQ3. .Chapter 9 Learning vector quantization dkriesel. Important! 
We can see that our deﬁnition of the funcWinner: The closest codebook vector tion h was not precise enough.1) c ∈ {1. In this case. dependent of how exactly h and the learning rate should min ||p − Ci ||. 2. the LVQ is divided wins.

The differences are, for instance, in the strength of the codebook vector movements. They are not all based on the same principle described here, and as announced I don't want to discuss them any further; therefore I don't give any formal definition regarding the aforementioned learning rule and LVQ.

9.5 Connection to neural networks

Until now, in spite of the learning process, the question has remained what LVQ has to do with neural networks. The codebook vectors can be understood as neurons with a fixed position within the input space, similar to RBF networks. Additionally, in nature it often occurs that in a group one neuron may fire (a winner neuron, here: a codebook vector) and, in return, inhibits all other neurons.

I decided to place this brief chapter about learning vector quantization here so that this approach can be continued in the following chapter about self-organizing maps: we will classify further inputs by means of neurons distributed throughout the input space, only that this time we do not know which input belongs to which class. Now let us take a look at the unsupervised learning networks!

Exercises

Exercise 16. Indicate a quantization which equally distributes all vectors H ∈ H in the five-dimensional unit cube H into one of 1024 classes.


Part III Unsupervised learning network paradigms


Chapter 10 Self-organizing feature maps

A paradigm of unsupervised learning neural networks, which maps an input space by its fixed topology and thus independently looks for similarities. Function, learning procedure, variations and neural gas.

If you take a look at the concepts of biological neural networks mentioned in the introduction, one question will arise: how does our brain store and recall the impressions it receives every day? Let me point out that the brain does not have any training samples and therefore no "desired output". And while already considering this subject we realize that there is no output in this sense at all. Our brain responds to external input by changes in state. These are, so to speak, its output.

Based on this principle and exploring the question of how biological neural networks organize themselves, Teuvo Kohonen developed in the eighties his self-organizing feature maps [Koh82, Koh98], shortly referred to as self-organizing maps or SOMs: a paradigm of neural networks where the output is the state of the network, and which learns completely unsupervised, i.e. without a teacher.

Unlike the other network paradigms we have already got to know, for SOMs it is unnecessary to ask what the neurons calculate; we only ask which neuron is active at the moment. Biologically, this is very well motivated: if in biology the neurons are connected to certain muscles, it will be less interesting to know how strongly a certain muscle is contracted, but which muscle is activated. In other words: we are not interested in the exact output of the neuron but in knowing which neuron provides output. Thus, SOMs are considerably more related to biology than, for example, the feedforward networks, which are increasingly used for calculations.

10.1 Structure of a self-organizing map

Typically, SOMs have – like our brain – the task to map a high-dimensional input (N dimensions) onto areas in a low-dimensional grid of cells (G dimensions), to draw a map of the high-dimensional space, so to speak.

At first, these facts may seem a bit confusing, and it is recommended to briefly reflect on them. There are two spaces in which SOMs work:

the N-dimensional input space, and

the G-dimensional grid on which the neurons lie and which indicates the neighborhood relationships between the neurons and therefore the network topology.

In a one-dimensional grid, the neurons could, for instance, be lined up like pearls on a string; every neuron would have exactly two neighbors (except for the two end neurons). A two-dimensional grid could be a square array of neurons (fig. 10.1). Another possible arrangement in two-dimensional space would be some kind of honeycomb shape. Irregular topologies are possible as well, but not very common. Topologies with more dimensions and considerably more neighborhood relationships would also be possible, but due to their lack of visualization capability they are not employed very often.

Even if N = G is true, the two spaces are not equal and have to be distinguished; in this special case they merely have the same dimension.

Figure 10.1: Example topologies of a self-organizing map. Above, a one-dimensional topology; below, a two-dimensional one.

Definition 10.1 (SOM neuron). Similar to the neurons in an RBF network, a SOM neuron k has a position c_k (a center) in the input space.

Definition 10.2 (Self-organizing map). A self-organizing map is a set K of SOM neurons. If an input vector is entered, exactly that neuron k ∈ K is activated which is closest to the input pattern in the input space. The dimension of the input space is referred to as N.

Definition 10.3 (Topology). The neurons are interconnected by neighborhood relationships, called the topology. The training of a SOM is highly influenced by the topology. It is defined by the topology function h(i, k, t), where i is the winner neuron, k the neuron to be adapted (which will be discussed later) and t the timestep. The dimension of the topology is referred to as G.

10.2 SOMs always activate the neuron with the least distance to an input pattern

Like many other neural networks, the SOM has to be trained before it can be used. But let us first regard the functionality of a completely trained self-organizing map, since there are many analogies to the training. Functionality consists of the following steps:

Input of an arbitrary value p of the input space R^N.

Calculation of the distance between every neuron k and p by means of a norm, i.e. calculation of ||p − c_k||.

One neuron becomes active, namely the neuron i with the shortest calculated distance to the input. All other neurons remain inactive. This paradigm of activity is also called the winner-takes-all scheme. The output we expect due to the input of a SOM shows which neuron becomes active.

In much of the literature, the description of SOMs is more formal: often an input layer is described that is completely linked towards a SOM layer. The input layer (N neurons) forwards all inputs to the SOM layer, which is laterally linked in itself so that a winner neuron can be established which inhibits the other neurons. I think that this explanation of a SOM is not very descriptive, and I have therefore tried to provide a clearer description of the network structure.

Now the question is which neuron is activated by which input – and the answer is given by the network itself during training.
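The distance calculation and the winner-takes-all step described above can be sketched as follows (a minimal sketch in Python with NumPy; the function name and the concrete centers are my own, not taken from the text):

```python
import numpy as np

def winner(centers, p):
    """Winner-takes-all: index of the neuron whose center c_k has the
    smallest distance ||p - c_k|| to the input pattern p."""
    return int(np.argmin(np.linalg.norm(centers - p, axis=1)))

# Three neuron centers in a two-dimensional input space (N = 2):
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(winner(centers, np.array([0.9, 0.1])))  # prints 1: neuron 1 is closest
```

All other neurons simply remain inactive; the only information the map returns is this index.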

10.3 Training

[Training makes the SOM topology cover the input space]

The training of a SOM is nearly as straightforward as the functionality described above. Basically, it is structured into five steps, which partially correspond to those of the functionality:

Initialization: The network starts with random neuron centers c_k ∈ R^N from the input space.

Creating an input pattern: A stimulus, i.e. a point p, is selected from the input space R^N and entered into the network.

Distance measurement: The distance ||p − c_k|| is determined for every neuron k in the network.

Winner takes all: The winner neuron i is determined, i.e. the neuron with the smallest distance to p. If several neurons share the smallest distance, one of them can be selected at will.

Adapting the centers: The neuron centers are moved within the input space according to the learning rule given below: the winner neuron and its neighbors, which are defined by the topology function, adapt their centers towards the pattern.

Formally, the winner neuron i is the neuron which fulfills the condition

||p − c_i|| ≤ ||p − c_k||   ∀ k ≠ i.   (10.1)

The centers are adapted according to the rule²

Δc_k = η(t) · h(i, k, t) · (p − c_k),   (10.2)

where the values Δc_k are simply added to the existing centers:

c_k(t + 1) = c_k(t) + Δc_k(t).

The last factor shows that the change in position of a neuron k is proportional to its distance to the input pattern p and, as usual, to a time-dependent learning rate η(t).

Definition 10.4 (SOM learning rule). A SOM is trained by presenting an input pattern and determining the associated winner neuron. The winner neuron and its neighbor neurons, which are defined by the topology function, then adapt their centers according to the rule Δc_k = η(t) · h(i, k, t) · (p − c_k).

² Note: In many sources this rule is written η h (p − c_k), which wrongly leads the reader to believe that h is a constant. This problem can easily be avoided by not omitting the multiplication dots.

10.3.1 The topology function defines how a learning neuron influences its neighbors

The above-mentioned network topology exerts its influence by means of the function h(i, k, t), which is not defined on the input space but on the grid: it represents the neighborhood relationships between the neurons. It can be time-dependent (which it often is), which explains the parameter t. The parameter k is the index running through all neurons, and the parameter i is the index of the winner neuron.

In principle, the function shall take a large value if k is a neighbor of the winner neuron or the winner neuron itself, and small values otherwise. A more precise definition: the topology function must be unimodal, i.e. it must have exactly one maximum. This maximum must lie at the winner neuron i, whose distance to itself certainly is 0. Additionally, the time-dependence enables us, for example, to reduce the neighborhood in the course of time.
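One complete training step following this rule could be sketched as follows (a sketch under my own assumptions: a Gaussian neighborhood on the grid and NumPy arrays for the centers; none of the names come from the text):

```python
import numpy as np

def som_step(centers, grid, p, eta, sigma):
    """One SOM training step: determine the winner i, then apply
    Delta c_k = eta * h(i, k) * (p - c_k) to every center."""
    i = int(np.argmin(np.linalg.norm(centers - p, axis=1)))  # winner neuron
    g_dist = np.linalg.norm(grid - grid[i], axis=1)          # distances on the grid, not in the input space
    h = np.exp(-g_dist**2 / (2.0 * sigma**2))                # Gaussian topology function
    return centers + eta * h[:, None] * (p - centers)        # c_k(t+1) = c_k(t) + Delta c_k

# Three neurons on a one-dimensional grid (G = 1), centers in a 2D input space (N = 2):
grid = np.array([[0.0], [1.0], [2.0]])
centers = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
updated = som_step(centers, grid, np.array([1.0, 1.0]), eta=0.5, sigma=1.0)
```

Note that h is evaluated on grid positions, so a neuron far away in the input space but adjacent on the grid is still pulled along.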

In order to output large values for the neighbors of i and small values for non-neighbors, the function h needs some notion of distance on the grid, because it has to know how far apart i and k are on the grid. There are different methods to calculate this distance. On a two-dimensional grid we could apply, for instance, the Euclidean distance (lower part of fig. 10.2), while on a one-dimensional grid we could simply use the number of connections between the neurons i and k (upper part of the same figure). In the upper case we simply count the discrete path length between i and k; in the lower case the Euclidean distance is determined (in two-dimensional space equivalent to the Pythagorean theorem). To simplify matters, I assumed a fixed grid edge length of 1 in both cases.

Figure 10.2: Example distances of a one-dimensional SOM topology (above) and a two-dimensional SOM topology (below) between two neurons i and k.

Definition 10.5 (Topology function). The topology function h(i, k, t) describes the neighborhood relationships in the topology. It can be any unimodal function that reaches its maximum when i = k. Time-dependence is optional, but often used.

A common choice is the already familiar Gaussian bell (see fig. 10.3 on page 153). It is unimodal with its maximum at 0, and its width can be changed by means of its parameter σ, which can be used to realize a neighborhood that shrinks in the course of time: we simply make σ time-dependent and monotonically decreasing. Our topology function could then look like this:

h(i, k, t) = exp( − ||g_i − g_k||² / (2 · σ(t)²) ),   (10.3)

where g_i and g_k represent the neuron positions on the grid, not the neuron positions in the input space, which would be referred to as c_i and c_k.
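Equation (10.3) with a shrinking σ(t) can be sketched like this (the exponential decay schedule σ(t) = σ₀ · exp(−t/τ) and its parameters are my own choice for illustration; the text only requires σ(t) to decrease monotonically):

```python
import numpy as np

def topology_fn(g_i, g_k, t, sigma0=10.0, tau=1000.0):
    """Gaussian topology function h(i, k, t) on grid positions g_i, g_k,
    with a monotonically decreasing width sigma(t) = sigma0 * exp(-t / tau)."""
    sigma_t = sigma0 * np.exp(-t / tau)
    d = np.linalg.norm(np.asarray(g_i) - np.asarray(g_k))  # distance on the grid
    return float(np.exp(-d**2 / (2.0 * sigma_t**2)))
```

For i = k the function always returns its maximum 1; for any fixed grid distance the value shrinks as t grows, i.e. the effective neighborhood narrows over time.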

10.3.2 Learning rate and neighborhood size can decrease monotonically over time

To prevent the later training phases from forcefully pulling the entire map towards every new pattern, SOMs often work with temporally monotonically decreasing learning rates and neighborhood sizes. At first, let us talk about the learning rate: typical target values of the learning rate are about two orders of magnitude smaller than the initial value, e.g.

0.01 < η < 0.6

could be reasonable. But this size must also depend on the network topology and the size of the neighborhood. It must be noted that

h · η ≤ 1

must always hold, since otherwise the neurons would constantly overshoot the current training sample; the map could even diverge, i.e. virtually explode.

As we have already seen, a decreasing neighborhood size can be realized, for example, by means of a time-dependent, monotonically decreasing σ with the Gaussian bell being used in the topology function. The advantage of a decreasing neighborhood size is that in the beginning a moving neuron "pulls along" many neurons in its vicinity, so the randomly initialized network can unfold fast and properly. At the end of the learning process, only a few neurons are influenced at a time, which stiffens the network as a whole but enables good "fine tuning" of the individual neurons.

Other functions that can be used instead of the Gaussian function are, for instance, the cone function, the cylinder function or the Mexican hat function (fig. 10.3 on the facing page). The Mexican hat function offers a particular biological motivation: due to its negative values it rejects some neurons close to the winner neuron, a behavior that has already been observed in nature. This can cause sharply separated map areas – and that is exactly why the Mexican hat function was suggested by Teuvo Kohonen himself. This sharpening characteristic, however, is not necessary for the functionality of the map.

But enough of theory – let us take a look at a SOM in action!
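The Mexican hat function mentioned above can be written, for instance, in the Ricker-wavelet form (this concrete formula is my assumption; the text only describes the shape – positive around the winner, negative side lobes further out):

```python
import numpy as np

def mexican_hat(r, sigma=1.0):
    """Mexican hat (Ricker wavelet) over grid distance r: positive near
    r = 0 (excitation of the winner and its closest neighbors), negative
    a bit further out (inhibition), approaching 0 for large r."""
    q = (r / sigma) ** 2
    return (1.0 - q) * np.exp(-q / 2.0)
```

At r = 0 the function is 1, it crosses zero at r = σ, and for r somewhat larger than σ it becomes negative – exactly the inhibitory ring responsible for the sharply separated map areas.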

Figure 10.3: Gaussian bell, cone function, cylinder function and the Mexican hat function suggested by Kohonen as examples for topology functions of a SOM.

Figure 10.4: Illustration of the two-dimensional input space (left) and the one-dimensional topology space (right) of a self-organizing map. Neuron 3 is the winner neuron since it is closest to p. In the topology, the neurons 2 and 4 are the neighbors of 3. The arrows mark the movement of the winner neuron and its neighbors towards the training sample p. To illustrate the one-dimensional topology of the network, it is plotted into the input space by the dotted line.

10.4 Examples for the functionality of SOMs

Let us begin with a simple, mentally comprehensible example. In this example, we use a two-dimensional input space, i.e. N = 2, and a one-dimensional grid structure, i.e. G = 1. Furthermore, our example SOM shall consist of 7 neurons and the learning rate shall be η = 0.5.

The neighborhood function is also kept simple so that we will be able to mentally follow the network:

h(i, k, t) = 1 if k is a direct neighbor of i,
             1 if k = i,
             0 otherwise.   (10.4)

Now let us take a look at the above-mentioned network with random initialization of the centers (fig. 10.4 on the preceding page) and enter a training sample p. Obviously, in our example the input pattern is closest to neuron 3, i.e. this is the winning neuron. We recall the learning rule for SOMs,

Δc_k = η(t) · h(i, k, t) · (p − c_k),

and process the three factors from the back:

Learning direction: Remember that the neuron centers c_k are vectors in the input space, as is the pattern p. Thus, the factor (p − c_k) indicates the vector from the neuron k to the pattern p. This is now multiplied by different scalars:

Topology function: Our topology function h indicates that only the winner neuron and its two closest neighbors (here: 2 and 4) are allowed to learn by returning 0 for all other neurons. A time-dependence is not specified. Thus, our vector (p − c_k) is multiplied by either 1 or 0.

Learning rate: The learning rate indicates, as always, the strength of learning. As already mentioned, η = 0.5.

All in all, the result is that the winner neuron and its neighbors (here: 2, 3 and 4) approximate the pattern p half the way (in the figure marked by arrows).

Although the center of neuron 7 – seen from the input space – is considerably closer to the input pattern p than neuron 2, neuron 2 is learning and neuron 7 is not. I want to emphasize again that the network topology specifies which neuron is allowed to learn, not its position in the input space. This is exactly the mechanism by which a topology can significantly cover an input space without having to be related to it in any way.

After the adaptation of the neurons 2, 3 and 4, the next pattern is applied, and so on. Another example of how such a one-dimensional SOM can develop in a two-dimensional input space with uniformly distributed input patterns in the course of time can be seen in figure 10.5 on page 157.
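The "half the way" effect of η = 0.5 together with the 0/1 neighborhood can be checked numerically (the chain below has only 4 neurons and made-up centers for illustration; it is not the 7-neuron configuration of figure 10.4):

```python
import numpy as np

eta = 0.5
# A one-dimensional chain of four neurons with hypothetical centers in a 2D input space:
centers = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0], [6.0, 0.0]])
p = np.array([2.0, 2.0])
i = int(np.argmin(np.linalg.norm(centers - p, axis=1)))  # winner: neuron 1
for k in range(len(centers)):
    h = 1.0 if abs(k - i) <= 1 else 0.0                  # only winner and direct neighbors learn
    centers[k] = centers[k] + eta * h * (p - centers[k])
print(centers[1])  # moved from (2, 0) half the way towards p = (2, 2), i.e. prints [2. 1.]
```

Neuron 3 keeps its center (6, 0) untouched: it is not a grid neighbor of the winner, no matter how the input-space distances look.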

End states of one- and two-dimensional SOMs with differently shaped input spaces can be seen in figure 10.6 on page 158. As we can see, not every input space can be neatly covered by every network topology. There are so-called exposed neurons – neurons which are located in an area where no input pattern has ever occurred. A one-dimensional topology generally produces fewer exposed neurons than a two-dimensional one: for instance, during training on circularly arranged input patterns it is nearly impossible with a two-dimensional square topology to avoid exposed neurons in the center of the circle – they are pulled in every direction during the training so that they finally remain in the center. This does not make the one-dimensional topology an optimal topology, however, since it can only find less complex neighborhood relationships than a multi-dimensional one.

10.4.1 Topological defects are failures in SOM unfolding

During the unfolding of a SOM it can happen that a topological defect (fig. 10.7) occurs, i.e. the SOM does not unfold correctly. A topological defect is best described by the word "knotting".

Figure 10.7: A topological defect in a two-dimensional SOM.

A remedy for topological defects could be to increase the initial values for the neighborhood size, because the more complex the topology is (or the more neighbors each neuron has, respectively, since a three-dimensional or a honeycombed two-dimensional topology could also be generated), the more difficult it is for a randomly initialized map to unfold.

10.5 It is possible to adjust the resolution of certain areas in a SOM

We have seen that a SOM is trained by entering input patterns of the input space R^N one after another, again and again, so that the SOM will be aligned with these patterns and map them. It could happen that we want a certain subset U of the input space to be mapped more precisely than the rest. This problem can easily be solved by means of SOMs: during the training, disproportionally many input patterns of the area U are presented to the SOM. If the number of training patterns of U ⊂ R^N presented to the SOM exceeds the number of those patterns of the remaining R^N \ U, then more neurons will group there while the remaining neurons are distributed more sparsely on R^N \ U (fig. 10.8 on the next page).

As you can see in the illustration, the edge of the SOM could be deformed. This can be compensated by assigning to the edge of the input space a slightly higher probability of being hit by training patterns (an often applied approach for reaching every corner with the SOMs). Also, a higher learning rate is often used for edge and corner neurons, since they are only pulled into the center by the topology. This also results in a significantly improved corner coverage.
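Presenting disproportionally many patterns from U can be sketched as a biased sampling routine (the region U = [0, 0.25]² and the probability 0.8 are made-up values for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pattern():
    """Draw a training pattern from the unit square; with probability 0.8
    it comes from the subregion U = [0, 0.25]^2, so U receives
    disproportionally many patterns and hence attracts more neurons."""
    if rng.random() < 0.8:
        return rng.uniform(0.0, 0.25, size=2)  # pattern from U
    return rng.uniform(0.0, 1.0, size=2)       # pattern from the whole input space
```

Feeding such samples into the usual training loop is all that is needed; the SOM itself is left completely unchanged.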

Figure 10.5: Behavior of a SOM with one-dimensional topology (G = 1) after the input of 100, 300, 500, 5000, 50000, 70000 and 80000 randomly distributed input patterns p ∈ R². During the training, η decreased from 1.0 to 0.2 and the σ parameter of the Gaussian decreased from 10.0 to 0.1.

Figure 10.6: End states of one-dimensional (left column) and two-dimensional (right column) SOMs on different input spaces. 200 neurons were used for the one-dimensional topology, 10 × 10 neurons for the two-dimensional topology, and 80,000 input patterns for all maps.

10.6 Application of SOMs

Regarding their capability for biologically inspired associative data storage, there are many fields of application for self-organizing maps and their variations. For example, the different phonemes of the Finnish language have successfully been mapped onto a SOM with a two-dimensional discrete grid topology, and neighborhoods between phonemes were found (a SOM does nothing else than finding neighborhood relationships).

Teuvo Kohonen himself made the effort to map many papers mentioning his SOMs in their keywords. In this large input space the individual papers occupy individual positions, depending on the occurrence of keywords. Kohonen then created a SOM with G = 2 and used it to map this high-dimensional "paper space". Thus, it is possible to enter any paper into the completely trained SOM and look which neuron in the SOM is activated. One will likely discover that the papers neighbored in the topology are interesting, too. This type of brain-like context-based search also works with many other input spaces.

It is to be noted that the system itself defines what is neighbored, i.e. similar, within the topology – and that is why it is so interesting. This example shows that the position c of the neurons in the input space is not significant; it is rather interesting to see which neuron is activated when an unknown input pattern is entered.

Figure 10.8: Training of a SOM with G = 2 on a two-dimensional input space. The two SOMs were trained by means of 80,000 training samples and decreasing η (1 → 0.2) as well as decreasing σ (5 → 0.5). On the left side, the chance of becoming a training pattern was equal for each coordinate of the input space. On the right side, for the central circle in the input space, this chance is more than ten times larger than for the remaining input space (visible in the larger pattern density in the background). In this circle the neurons are obviously more crowded while the remaining area is covered less densely, but in both cases the neurons are still evenly distributed.

Next, we can look at which of the previous inputs this neuron was also activated for – and we will immediately discover a group of very similar inputs. The further apart two inputs lie within the topology, the less they have in common. Virtually, the topology generates a map of the input characteristics – reduced to descriptively few dimensions in relation to the input dimension. The topology of a SOM is therefore often two-dimensional so that it can be easily visualized, while the input space can be very high-dimensional.

10.6.1 SOMs can be used to determine centers for RBF neurons

We have already been introduced to the paradigm of the RBF network in chapter 6. SOMs arrange themselves exactly towards the positions of the incoming inputs; as a result they are used, for example, to select the centers of an RBF network. For this purpose, many neural network simulators offer an additional so-called SOM layer in connection with the simulation of RBF networks.

As we have seen, it is possible to control which areas of the input space should be covered with higher resolution – or, in connection with RBF networks, on which areas of our function the RBF network should work with more neurons, i.e. work more exactly. As a further useful feature of the combination of RBF networks with SOMs, one can use the topology obtained through the SOM: during the final training of an RBF neuron it can be used to influence neighboring RBF neurons in different ways.

10.7 Variations of SOMs

There are different variations of SOMs for different representation tasks:

10.7.1 A neural gas is a SOM without a static topology

The neural gas is a variation of the self-organizing maps of Thomas Martinetz [MBS93], which was developed from the difficulty of mapping complex input information that partially occurs only in subspaces of the input space, or even jumps between subspaces (fig. 10.9).

The idea of a neural gas is, roughly speaking, to realize a SOM without a grid structure. Since it is derived from the SOM, the learning steps are very similar to the SOM learning steps, but they include an additional intermediate step:

random initialization of c_k ∈ R^n

selection and presentation of a pattern of the input space p ∈ R^n

neuron distance measurement

identification of the winner neuron i

intermediate step: generation of a list L of neurons sorted in ascending order by their distance to the winner neuron

changing the centers by means of the known rule, but with the slightly modified topology function h_L(i, k, t).

Figure 10.9: A figure occupying different subspaces of the actual input space at different positions can hardly be covered by a SOM.

The function h_L(i, k, t), which is slightly modified compared with the original function h(i, k, t), now regards the first elements of the list L as the neighborhood of the winner neuron i; the first neuron in the list L is the neuron closest to the winner neuron. The direct result is that – similar to the free-floating molecules in a gas – the neighborhood relationships between the neurons can change anytime, and the number of neighbors is almost arbitrary, too. The distance within the neighborhood is now represented by the distance within the input space.

The bulk of neurons can become as stiffened as a SOM by means of a constantly decreasing neighborhood size. Unlike a SOM, a neural gas does not have a fixed dimension; it can take the dimension that is locally needed at the moment, which can be very advantageous. A disadvantage could be that there is no fixed grid forcing the input space to become regularly covered, so holes can occur in the cover or neurons can become isolated.
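One rank-based adaptation step of a neural gas could look like this (a sketch; the decay h_L = exp(−rank/λ) over list positions is one common choice of mine, not prescribed by the text):

```python
import numpy as np

def neural_gas_step(centers, p, eta=0.5, lam=2.0):
    """One neural gas step: rank all neurons by their input-space distance
    to p (the sorted list L) and let the adaptation strength decay with
    the rank instead of with a fixed grid distance."""
    d = np.linalg.norm(centers - p, axis=1)
    ranks = np.argsort(np.argsort(d))        # rank 0 = winner neuron, rank 1 = next closest, ...
    h = np.exp(-ranks / lam)                 # neighborhood over list positions
    return centers + eta * h[:, None] * (p - centers)
```

Because the ranks are recomputed for every pattern, the neighborhood reorganizes itself with each learning cycle, exactly the "gas-like" behavior described above.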

In spite of all practical hints, it is, as always, the user's responsibility not to understand this text as a catalog of easy answers, but to explore all advantages and disadvantages himself.

Unlike in a SOM, the neighborhood of a neural gas must initially refer to all neurons, since otherwise some outliers of the random initialization may never reach the remaining group. Forgetting this is a popular error in the implementation of a neural gas. With a neural gas it is possible to learn the kind of complex input shown in fig. 10.9 on the preceding page, since we are not bound to a fixed-dimensional grid. But some computational effort is necessary for the permanent sorting of the list (here, it could be effective to store the list in an ordered data structure right from the start).

Definition 10.6 (Neural gas). A neural gas differs from a SOM by a completely dynamic neighborhood. With every learning cycle it is decided anew which neurons are the neighborhood neurons of the winner neuron. Generally, the criterion for this decision is the distance between the neurons and the winner neuron in the input space.

10.7.2 A multi-SOM consists of several separate SOMs

In order to present another variant of the SOMs, I want to formulate an extended problem: what do we do with input patterns of which we know that they are confined to different (maybe disjoint) areas? Here, the idea is to use not only one SOM but several: a multi-self-organizing map, shortly referred to as M-SOM [GKE01b, GKE01a, GS06]. It is not necessary for the SOMs to have the same topology or size; an M-SOM is just a combination of M SOMs.

The learning process is analogous to that of the SOMs; however, only the neurons belonging to the winner SOM of each training step are adapted. Thus, it is easy to represent two disjoint clusters of data by means of two SOMs, even if one of the clusters is not represented in every dimension of the input space R^N. Actually, the individual SOMs exactly reflect these clusters.

Definition 10.7 (Multi-SOM). A multi-SOM is nothing more than the simultaneous use of M SOMs.

10.7.3 A multi-neural gas consists of several separate neural gases

Analogous to the multi-SOM, we also have a set of M neural gases: a multi-neural gas [GS06, SG06]. This construct behaves analogously to the neural gas and the M-SOM: again, only the neurons of the winner gas are adapted.

Definition 10.8 (Multi-neural gas). A multi-neural gas is nothing more than the simultaneous use of M neural gases.
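Selecting the winner SOM or winner gas can be sketched as follows (a hypothetical helper of mine; each gas is represented simply by the matrix of its neuron centers):

```python
import numpy as np

def winner_gas(gases, p):
    """Multi-SOM / multi-neural gas: only the structure containing the
    overall winner neuron is adapted.  Since every neuron belongs to
    exactly one gas, the cluster assignment is known by construction."""
    return min(range(len(gases)),
               key=lambda m: float(np.linalg.norm(gases[m] - p, axis=1).min()))

gases = [np.array([[0.0, 0.0], [1.0, 0.0]]),   # gas 0 covers one cluster
         np.array([[5.0, 5.0], [6.0, 5.0]])]   # gas 1 covers a disjoint cluster
print(winner_gas(gases, np.array([5.5, 5.0])))  # prints 1: the pattern belongs to gas 1
```

After this selection, only the neurons of the returned gas would be adapted with the usual learning rule.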

The reader may wonder what the advantage of a multi-neural gas is, since an individual neural gas is already capable of dividing into clusters and working on complex input patterns with changing dimensions. Basically, this is correct, but a multi-neural gas has two serious advantages over a simple neural gas:

1. With several gases, we can directly tell which neuron belongs to which gas. This is particularly important for clustering tasks, for which multi-neural gases have recently been used. Simple neural gases can also find and cover clusters, but there we cannot recognize which neuron belongs to which cluster.

2. A lot of computational effort is saved when large original gases are divided into several smaller ones, since (as already mentioned) the sorting of the list L can use a lot of computational effort, while the sorting of several smaller lists L1, L2, ..., LM is less time-consuming – even if these lists in total contain the same number of neurons. As a result, we only obtain local sortings instead of a global one, but in most cases these local sortings are sufficient.

Now we can choose between two extreme cases of multi-neural gases: one extreme case is the ordinary neural gas, M = 1, i.e. we use only one single neural gas. Interestingly enough, the other extreme case (very large M, a few or only one neuron per gas) behaves analogously to K-means clustering (for more information on clustering procedures see excursus A).

10.7.4 Growing neural gases can add neurons to themselves

A growing neural gas is a variation of the aforementioned neural gas to which more and more neurons are added according to certain rules. Thus, it is an attempt to work against the isolation of neurons and the formation of larger holes in the cover. Here, this subject shall only be mentioned but not discussed. Building a growing SOM is more difficult, because new neurons have to be integrated into the neighborhood.

Exercises

Exercise 17. A regular, two-dimensional grid shall cover a two-dimensional surface as "well" as possible.

1. Which grid structure would suit best for this purpose?

2. Which criteria did you use for "well" and "best"? The very imprecise formulation of this exercise is intentional.

Chapter 11

Adaptive resonance theory

An ART network in its original form shall classify binary input vectors, i.e. assign them to a 1-out-of-n output. Simultaneously, patterns so far unclassified shall be recognized and assigned to a new class.

As in the other smaller chapters, we want to try to figure out the basic idea of the adaptive resonance theory (abbreviated: ART) without discussing its theory profoundly. In several sections we have already mentioned that it is difficult to use neural networks for learning new information in addition to, but without destroying, already existing information. This circumstance is called the stability/plasticity dilemma.

In 1987, Stephen Grossberg and Gail Carpenter published the first version of their ART network [Gro76] in order to alleviate this problem. Its aim is (initially binary) pattern recognition, or more precisely the categorization of patterns into classes, following the idea of unsupervised learning; additionally, however, an ART network shall be capable of finding new classes. It was followed by a whole family of ART improvements (which we want to discuss briefly, too).

11.1 Task and structure of an ART network

An ART network comprises exactly two layers: the input layer I and the recognition layer O, with the input layer being completely linked towards the recognition layer. This complete link induces a top-down weight matrix W that contains the weight values of the connections between each neuron in the input layer and each neuron in the recognition layer (fig. 11.1).

Simple binary patterns are entered into the input layer and transferred to the recognition layer, while the recognition layer shall return a 1-out-of-|O| encoding, i.e. it should follow the winner-takes-all scheme. To realize this 1-out-of-|O| encoding, for instance, the principle of lateral inhibition can be used – or, in an implementation, one can simply search for the most activated neuron (for practical reasons, a simple IF query suits this task best).

Figure 11.1: Simplified illustration of the ART network structure. Top: the input layer; bottom: the recognition layer. In this illustration, the lateral inhibition of the recognition layer and the control neurons are omitted.

11.1.1 Resonance takes place by activities being tossed and turned

There also exists a bottom-up weight matrix V, which propagates the activities within the recognition layer back into the input layer. Every activity within the input layer causes an activity within the recognition layer, while in turn every activity in the recognition layer causes an activity within the input layer. Now it is obvious that these activities are bounced back and forth again and again, a fact that leads us to resonance.

In addition to the two mentioned layers, an ART network also contains a few neurons that exercise control functions such as signal enhancement. We do not want to discuss this part of the theory further, since here only the basic principle of the ART network should become explicit. I have mentioned it only to explain that, in spite of the recurrences, the ART network will achieve a stable state after an input.

11.2 The learning process of an ART network is divided into top-down and bottom-up learning

The trick of adaptive resonance theory is not only the configuration of the ART network but also the two-piece learning procedure of the theory: on the one hand we train the top-down matrix W, and on the other hand we train the bottom-up matrix V, i.e. the backward weights.

11.2.1 Pattern input and top-down learning

When a pattern is entered into the network, it causes – as already mentioned – an activation at the output neurons, and the strongest neuron wins. Then the weights of the matrix W going towards the winner output neuron are changed such that the output of the strongest neuron Ω is still enhanced, i.e. the class affiliation of the input vector to the class of the output neuron Ω becomes enhanced.

11.2.2 Resonance and bottom-up learning

The training of the backward weights of the matrix V is a bit tricky: only the weights of the respective winner neuron are trained towards the input layer, and our current input pattern is used as teaching input. Thus, the network is trained to enhance input vectors.

11.2.3 Adding an output neuron

Of course, it could happen that the neurons are nearly equally activated or that several neurons are activated, i.e. that the network is indecisive. In this case, the mechanisms of the control neurons activate a signal that adds a new output neuron. Then the current pattern is assigned to this output neuron and the weight sets of the new neuron are trained as usual.

Thus, the advantage of this system is not only to divide inputs into classes and to find new classes: it can also tell us, after the activation of an output neuron, what a typical representative of a class looks like – which is a significant feature.

Often, however, the system can only moderately distinguish the patterns. The question is when a new neuron is permitted to become active and when it should learn. In an ART network there are different additional control neurons which answer this question according to different mathematical rules and which are responsible for intercepting special cases. At the same time, one of the largest objections to an ART network is the fact that it uses a special distinction of cases, similar to an IF query, that has been forced into the mechanism of a neural network.
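The two-piece learning step just described can be sketched in a few lines. This is a deliberately simplified illustration, not Carpenter and Grossberg's original ART equations: it omits the control neurons and the vigilance test, assumes plain weighted-sum activations, and all names (W, V, eta, art_learning_step) are our own.

```python
# Simplified sketch of the two-piece ART learning step: find the winner
# output neuron (the 1-out-of-|O| encoding, here done with a plain max),
# then train the top-down weights W towards the winner and the winner's
# backward weights V towards the input pattern (the teaching input).

def art_learning_step(W, V, p, eta=0.5):
    """W[j][i]: weight input i -> output j; V[j][i]: output j -> input i."""
    # activation of each output neuron: weighted sum over the input pattern
    acts = [sum(w_ji * p_i for w_ji, p_i in zip(W[j], p)) for j in range(len(W))]
    winner = max(range(len(W)), key=lambda j: acts[j])
    # top-down learning: enhance the winner's response to this pattern
    W[winner] = [w + eta * (p_i - w) for w, p_i in zip(W[winner], p)]
    # bottom-up learning: the current pattern is the teaching input for
    # the winner's backward weights towards the input layer
    V[winner] = [v + eta * (p_i - v) for v, p_i in zip(V[winner], p)]
    return winner

W = [[0.9, 0.1], [0.1, 0.9]]   # two output classes, two input neurons
V = [[0.9, 0.1], [0.1, 0.9]]
winner = art_learning_step(W, V, [1.0, 0.0])
```

After the step, the winner's row in V is what the network would report as the "typical representative" of the class.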

11.3 Extensions

As already mentioned above, the ART networks have often been extended.

ART-2 [CG87] is an extension to continuous inputs and additionally offers (in an extension called ART-2A) improvements of the learning speed, which results in additional control neurons and layers.

ART-3 [CG90] improves the learning ability of ART-2 by adapting additional biological processes such as the chemical processes within the synapses¹.

Apart from the described ones there exist many other extensions.

Figure 11.2: Simplified illustration of the two-piece training of an ART network: the trained weights are represented by solid lines. Let us assume that a pattern has been entered into the network and that the numbers mark the outputs. Top: we can see that Ω2 is the winner neuron. Middle: so the weights are trained towards the winner neuron and (bottom) the weights of the winner neuron are trained towards the input layer.

¹ Because of the frequent extensions of the adaptive resonance theory wagging tongues already call them "ART-n networks".

Part IV

Excursi, appendices and registers


Appendix A

Excursus: Cluster analysis and regional and online learnable fields

In Grimm's dictionary the extinct German word "Kluster" is described by "was dicht und dick zusammensitzet (a thick and dense group of sth.)". In static cluster analysis, the formation of groups within point clouds is explored: some procedures are introduced, their advantages and disadvantages are compared, and an adaptive clustering method based on neural networks is discussed. A regional and online learnable field models from a point cloud, possibly with a lot of points, a comparatively small set of neurons that is representative for the point cloud.

As already mentioned, many problems can be traced back to problems in cluster analysis. Therefore, it is necessary to research procedures that examine whether groups (so-called clusters) exist within point clouds. Since cluster analysis procedures need a notion of distance between two points, a metric must be defined on the space where these points are situated.

We briefly want to specify what a metric is. Colloquially speaking, a metric is a tool for determining distances between points in any space. Here, the distances have to be symmetric, and the distance between two points may only be 0 if the two points are equal. Additionally, the triangle inequality must apply.

Definition A.1 (Metric). A relation dist(x_1, x_2) defined for two objects x_1, x_2 is referred to as metric if each of the following criteria applies:

1. dist(x_1, x_2) = 0 if and only if x_1 = x_2,

2. dist(x_1, x_2) = dist(x_2, x_1), i.e. symmetry,

3. dist(x_1, x_3) ≤ dist(x_1, x_2) + dist(x_2, x_3), i.e. the triangle inequality holds.

Metrics are provided by, for example, the squared distance and the Euclidean distance, which have already been introduced. Based on such metrics we can define a clustering procedure that uses a metric as distance measure. Now we want to introduce and briefly discuss different clustering procedures.

A.1 k-means clustering allocates data to a predefined number of clusters

k-means clustering according to J. MacQueen [Mac67] is an algorithm that is often used because of its low computation and storage complexity and which is regarded as "inexpensive and good". The operation sequence of the k-means clustering algorithm is the following:

1. Provide data to be examined.

2. Define k, which is the number of cluster centers.

3. Select k random vectors for the cluster centers (also referred to as codebook vectors¹).

4. Assign each data point to the nearest codebook vector.

5. Compute cluster centers for all clusters.

6. Set codebook vectors to the new cluster centers.

7. Continue with 4 until the assignments are no longer changed.

Step 2 already shows one of the great questions of the k-means algorithm: the number k of cluster centers has to be determined in advance, since this cannot be done by the algorithm. The problem is that it is not necessarily known in advance how k can best be determined. Another problem is that the procedure can become quite unstable if the codebook vectors are badly initialized. But since this initialization is random, it is often useful to restart the procedure, which does not require much computational effort. If you are fully aware of those weaknesses, you will receive quite good results.

However, complex structures such as "clusters in clusters" cannot be recognized: if k is high, the outer ring of the construction in the following illustration will be recognized as many single clusters; if k is low, the ring together with the small inner clusters will be recognized as one cluster.

For an illustration see the upper right part of fig. A.1 on page 174.

¹ The name codebook vector was created because the often used name cluster vector was too unclear.
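The seven steps above can be sketched in a few lines of plain Python. The function name, the restart-free loop and the toy data are our own; a fixed random seed replaces the "restart if badly initialized" advice from the text.

```python
import random

# Minimal sketch of the k-means loop (steps 1-7): random codebook vectors,
# assignment to the nearest codebook vector (squared Euclidean distance
# suffices for comparisons), recomputation of the cluster centers, and
# stopping once the assignments no longer change.

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 3: random codebook vectors
    assignment = None
    for _ in range(max_iters):
        # step 4: assign each data point to the nearest codebook vector
        new_assignment = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_assignment == assignment:     # step 7: nothing changed -> stop
            break
        assignment = new_assignment
        # steps 5-6: compute cluster centers, move codebook vectors there
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                      # a badly placed center may go empty
                centers[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, assignment

points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
centers, assignment = kmeans(points, k=2)
# the two well-separated pairs end up in different clusters
```

Note that k must still be supplied by hand, exactly the weakness discussed above.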

A.2 k-nearest neighboring looks for the k nearest neighbors of each data point

The k-nearest neighboring procedure [CH67] connects each data point to its k closest neighbors, which often results in a division into groups. Then such a group builds a cluster. The advantage is that the number of clusters occurs all by itself. The disadvantage is that a large storage and computational effort is required to find the nearest neighbors (the distances between all data points must be computed and stored).

There are some special cases in which the procedure combines data points belonging to different clusters: clusters consisting of only one single data point are basically connected to another cluster, which is not always intentional. Furthermore, it is not mandatory that the links between the points are symmetric. Another advantage, though, is that the procedure adaptively responds to the distances in and between the clusters.

For an illustration see the lower left part of fig. A.1.

A.3 ε-nearest neighboring looks for neighbors within the radius ε for each data point

Another approach of neighboring: here, the neighborhood detection does not use a fixed number k of neighbors but a radius ε, which is the reason for the name epsilon-nearest neighboring. Points are neighbors if they are at most ε apart from each other. Here, too, the storage and computational effort is obviously very high, which is a disadvantage.

But note that there are some special cases: two separate clusters can easily be connected due to the unfavorable situation of a single data point (see the two small clusters in the upper right of the illustration). This can also happen with k-nearest neighboring, but there it would be more difficult since the number of neighbors per point is limited.

But this procedure allows a recognition of rings and therefore of "clusters in clusters", which is a clear advantage. A further advantage is the symmetric nature of the neighborhood relationships, and the combination of minimal clusters due to a fixed number of neighbors is avoided. On the other hand, it is necessary to skillfully initialize ε in order to be successful, i.e. smaller than half the smallest distance between two clusters; with variable cluster and point distances within clusters this can possibly be a problem.

For an illustration see the lower right part of fig. A.1.
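The ε-neighboring idea reduces to a connected-components search over the "at most ε apart" graph, which is symmetric by construction. The sketch below is our own minimal reading; the function name and toy data are assumptions.

```python
# Sketch of ε-nearest neighboring: two points are neighbors if they are
# at most ε apart; a cluster is a connected component of the resulting
# symmetric neighborhood graph, found here by a simple flood fill.

def epsilon_clusters(points, eps):
    n = len(points)
    def close(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j])) <= eps ** 2
    labels = [None] * n
    current = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        stack = [start]                 # flood-fill one connected component
        labels[start] = current
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] is None and close(i, j):
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# a chain of close points forms one cluster even though its endpoints are
# further apart than ε; a distant pair forms a second cluster
points = [(0, 0), (0.5, 0), (1.0, 0), (5, 5), (5.5, 5)]
labels = epsilon_clusters(points, eps=0.6)
```

Choosing eps larger than the gap between the two groups would merge them, which is exactly the initialization problem discussed above.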

Figure A.1: Top left: our set of points; we will use this set to explore the different clustering methods. Top right: k-means clustering; using this procedure we chose k = 6. As we can see, the procedure is not capable of recognizing "clusters in clusters" (bottom left of the illustration). Bottom left: k-nearest neighboring. If k is selected too high (higher than the number of points in the smallest cluster), this will result in the cluster combinations shown in the upper right of the illustration. Long "lines" of points are a problem, too: they would be recognized as many small clusters (if k is sufficiently large). Bottom right: ε-nearest neighboring. This procedure will cause difficulties if ε is selected larger than the minimum distance between two clusters (see upper left of the illustration), which will then be combined.

A.4 The silhouette coefficient determines how accurate a given clustering is

As we can see above, there is no easy answer for clustering problems: each procedure described has very specific disadvantages. In this respect it is useful to have a criterion to decide how good our cluster division is. This possibility is offered by the silhouette coefficient according to [Kau90]. This coefficient measures how well the clusters are delimited from each other and indicates if points may be assigned to the wrong clusters.

Let P be a point cloud and p a point in P. Let c ⊆ P be a cluster within the point cloud and let p be part of this cluster, i.e. p ∈ c. The set of clusters is called C. In summary,

p ∈ c ⊆ P

applies. To calculate the silhouette coefficient, we initially need the average distance between point p and all its cluster neighbors. This variable is referred to as a(p) and defined as follows:

a(p) = (1 / (|c| − 1)) · Σ_{q ∈ c, q ≠ p} dist(p, q)   (A.1)

Furthermore, let b(p) be the average distance between our point p and all points of the next cluster (g represents all clusters except for c):

b(p) = min_{g ∈ C, g ≠ c} (1 / |g|) · Σ_{q ∈ g} dist(p, q)   (A.2)

The point p is classified well if the distance to the center of its own cluster is minimal and the distance to the centers of the other clusters is maximal. In this case, the following term provides a value close to 1:

s(p) = (b(p) − a(p)) / max{a(p), b(p)}   (A.3)

Apparently, the whole term s(p) can only be within the interval [−1, 1]. A value close to −1 indicates a bad classification of p.

The silhouette coefficient S(P) results from the average of all values s(p):

S(P) = (1 / |P|) · Σ_{p ∈ P} s(p)   (A.4)

As above, the total quality of the cluster division is expressed by the interval [−1, 1]: the clustering quality is measurable.
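Equations (A.1) to (A.4) translate directly into code. The sketch below is our own transcription; the helper names and the two toy cluster divisions are assumptions, and dist is the Euclidean distance.

```python
# Direct transcription of (A.1)-(A.4): a(p) is the mean distance to the
# own cluster, b(p) the mean distance to the nearest other cluster, and
# S(P) averages s(p) over all points of the cloud.

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def silhouette(clusters):
    """clusters: list of lists of points, i.e. the cluster division C of P."""
    scores = []
    for ci, c in enumerate(clusters):
        for p in c:
            if len(c) < 2:
                continue  # a(p) is undefined for singleton clusters
            a = sum(dist(p, q) for q in c if q != p) / (len(c) - 1)    # (A.1)
            b = min(                                                    # (A.2)
                sum(dist(p, q) for q in g) / len(g)
                for gi, g in enumerate(clusters) if gi != ci
            )
            scores.append((b - a) / max(a, b))                          # (A.3)
    return sum(scores) / len(scores)                                    # (A.4)

good = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]     # well-delimited division
bad = [[(0, 0), (10, 10)], [(0, 1), (10, 11)]]      # points in wrong clusters
# silhouette(good) lies close to 1, silhouette(bad) is negative
```

This makes the coefficient usable as a plug-in quality check for any of the procedures above.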

A.5 Regional and online learnable fields are a neural clustering strategy

Now that different clustering strategies with different characteristics have been presented (lots of further material is presented in [DHS01]), as well as a measure to indicate the quality of an existing arrangement of given data into clusters, I want to introduce a clustering method based on an unsupervised learning neural network [SGE05], which was published in 2005: the regional and online learnable fields, shortly referred to as ROLFs. Like all the other methods, this one may not be perfect, but it eliminates large standard weaknesses of the known clustering methods.

A.5.1 ROLFs try to cover data with neurons

Roughly speaking, the regional and online learnable fields are a set K of neurons which try to cover a set of points as well as possible by means of their distribution in the input space. For this, neurons are added, moved or changed in their size during training if necessary. The parameters of the individual neurons will be discussed later.

Definition A.2 (Regional and online learnable field). A regional and online learnable field (abbreviated ROLF or ROLF network) is a set K of neurons that are trained to cover a certain set in the input space as well as possible.

A.5.1.1 ROLF neurons feature a position and a radius in the input space

Here, a ROLF neuron k ∈ K has two parameters: similar to the RBF networks, it has a center c_k, i.e. a position in the input space. But it has yet another parameter: the radius σ, which defines the radius of the perceptive surface surrounding the neuron². A neuron covers the part of the input space that is situated within this radius. c_k and σ_k are locally defined for each neuron; the radius of the perceptive surface is specified by r = ρ · σ (fig. A.2), with the multiplier ρ being globally defined and previously specified for all neurons. Intuitively, the reader will wonder what this multiplier is used for; its significance will be discussed later.

Furthermore, the following has to be observed: it is not necessary for the perceptive surfaces of the different neurons to be of the same size. This particularly means that the neurons are capable of covering surfaces of different sizes.

Definition A.3 (ROLF neuron). The parameters of a ROLF neuron k are a center c_k and a radius σ_k.

Definition A.4 (Perceptive surface). The perceptive surface of a ROLF neuron k consists of all points within the radius ρ · σ in the input space.

Figure A.2: Structure of a ROLF neuron.

² I write "defines" and not "is" because the actual radius is specified by σ · ρ.

A.5.2 A ROLF learns unsupervised by presenting training samples online

Like many other paradigms of neural networks, our ROLF network learns by receiving many training samples p of a training set P. The learning is unsupervised. For each training sample p entered into the network, two cases can occur:

1. There is one accepting neuron k for p, or

2. there is no accepting neuron at all.

If in the first case several neurons are suitable, then there will be exactly one accepting neuron insofar as the closest neuron is the accepting one. If there are several closest neurons, one can be chosen randomly.

Definition A.5 (Accepting neuron). The criterion for a ROLF neuron k to be an accepting neuron of a point p is that the point p must be located within the perceptive surface of k. If p is located in the perceptive surfaces of several neurons, then the closest neuron will be the accepting one. If there are several closest neurons, one can be chosen randomly.

A.5.2.1 Both positions and radii are adapted throughout learning

Let us assume that we entered a training sample p into the network and that there is an accepting neuron k. Then the radius moves towards ||p − c_k|| (i.e. towards the distance between p and c_k) and the center c_k moves towards p. For this, let us define the two learning rates η_σ and η_c for radii and centers:

c_k(t + 1) = c_k(t) + η_c · (p − c_k(t))   (A.5)

σ_k(t + 1) = σ_k(t) + η_σ · (||p − c_k(t)|| − σ_k(t))   (A.6)

Note that here σ_k is a scalar while c_k is a vector in the input space.

Definition A.6 (Adapting a ROLF neuron). A neuron k accepted by a point p is adapted according to the rules (A.5) and (A.6).

A.5.2.2 The radius multiplier allows the neurons not only to shrink

Now we can understand the function of the multiplier ρ: due to this multiplier, the perceptive surface of a neuron includes more than only the points surrounding the neuron within the radius σ. This means that, due to the aforementioned learning rule, σ cannot only decrease but also increase.

Definition A.7 (Radius multiplier). The radius multiplier ρ > 1 is globally defined and expands the perceptive surface of a neuron k to a multiple of σ_k. So it is ensured that the radius σ_k cannot only decrease but also increase, i.e. the neurons can grow.
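The acceptance test and the update rules (A.5) and (A.6) can be written down directly. The sketch below is our own; the function names and the concrete values of ρ, η_c and η_σ are illustrative assumptions (the values stay within the ranges recommended later in this appendix).

```python
# Equations (A.5) and (A.6) as code: a neuron accepts p if p lies within
# its perceptive surface of radius rho * sigma; the accepting neuron's
# center then moves towards p and its radius towards ||p - c_k||.

def norm(v):
    return sum(x * x for x in v) ** 0.5

def adapt(c, sigma, p, eta_c, eta_sigma):
    d = norm([a - b for a, b in zip(p, c)])                      # ||p - c_k||
    c_new = [ck + eta_c * (pk - ck) for ck, pk in zip(c, p)]     # (A.5)
    sigma_new = sigma + eta_sigma * (d - sigma)                  # (A.6)
    return c_new, sigma_new

rho = 3.0
c, sigma = [0.0, 0.0], 0.5
p = [1.0, 0.0]
accepted = norm([a - b for a, b in zip(p, c)]) <= rho * sigma
if accepted:
    c, sigma = adapt(c, sigma, p, eta_c=0.5, eta_sigma=0.1)
# ||p - c|| = 1 <= rho * sigma = 1.5, so the neuron accepts: the center
# moves halfway towards p and the radius grows towards the distance 1.
```

Note that sigma grows here even though the sample was accepted, which is exactly the effect the multiplier ρ makes possible.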

A.5.2.3 As required, new neurons are generated

So far we have only discussed the case in the ROLF training that there is an accepting neuron for the training sample p. This suggests to discuss the approach for the case that there is no accepting neuron.

In this case a new accepting neuron k is generated for our training sample. The result is, of course, that c_k and σ_k have to be initialized. The initialization of c_k can be understood intuitively: the center of the new neuron is simply set on the training sample, i.e. c_k = p. We generate a new neuron because there is no neuron close to p – for logical reasons, we place the neuron exactly on p.

But how to set a σ when a new neuron is generated? For this purpose there exist different options:

Init-σ: We always select a predefined static σ.

Minimum σ: We take a look at the σ of each neuron and select the minimum.

Maximum σ: We take a look at the σ of each neuron and select the maximum.

Mean σ: We select the mean σ of all neurons.

Currently, the mean-σ variant is the favorite one, although the learning procedure also works with the other ones. In the minimum-σ variant the neurons tend to cover less of the surface; in the maximum-σ variant they tend to cover more of the surface.

Definition A.8 (Generating a ROLF neuron). If a new ROLF neuron k is generated by entering a training sample p, then c_k is initialized with p and σ_k according to one of the aforementioned strategies (init-σ, minimum-σ, maximum-σ, mean-σ).

The training is complete when, after repeated randomly permuted pattern presentation, no new neuron has been generated in an epoch and the positions of the neurons barely change.

A.5.3 Evaluating a ROLF

The result of the training algorithm is that the training set is gradually covered well and precisely by the ROLF neurons, and that a high concentration of points on a spot of the input space does not automatically generate more neurons. Thus, a possibly very large point cloud is reduced to very few representatives (based on the input set).

Then it is very easy to define the number of clusters: two neurons are (according to the definition of the ROLF) connected when their perceptive surfaces overlap.
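The whole training loop, acceptance, adaptation and neuron generation with the mean-σ strategy, can be sketched as follows. This is our own minimal reading of the procedure, not the reference implementation from [SGE05]; all names and parameter values are assumptions.

```python
# Sketch of ROLF training: for each sample, adapt the closest accepting
# neuron via (A.5)/(A.6); if no neuron accepts, generate a new neuron on
# the sample, with sigma chosen by the mean-sigma strategy.

def norm(v):
    return sum(x * x for x in v) ** 0.5

def train_rolf(patterns, rho=3.0, init_sigma=1.0, eta_c=0.5, eta_sigma=0.1):
    neurons = []  # each neuron: {"c": center, "sigma": radius}
    for p in patterns:
        def d(n):
            return norm([a - b for a, b in zip(p, n["c"])])
        accepting = [n for n in neurons if d(n) <= rho * n["sigma"]]
        if accepting:
            k = min(accepting, key=d)        # the closest accepting neuron adapts
            dist_pk = d(k)
            k["c"] = [c + eta_c * (x - c) for c, x in zip(k["c"], p)]   # (A.5)
            k["sigma"] += eta_sigma * (dist_pk - k["sigma"])            # (A.6)
        else:
            # no accepting neuron: generate one on p (mean-sigma strategy)
            sigma = (sum(n["sigma"] for n in neurons) / len(neurons)
                     if neurons else init_sigma)
            neurons.append({"c": list(p), "sigma": sigma})
    return neurons

# two dense spots far apart are reduced to two representatives
data = [(0, 0), (0.2, 0), (0, 0.2), (50, 50), (50.2, 50), (50, 50.2)]
neurons = train_rolf(data)
```

A full implementation would additionally repeat the (permuted) pattern presentation over several epochs until no new neuron is generated, as described above.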

A cluster is a group of connected neurons (i.e. some kind of nearest neighboring is executed with the variable perceptive surfaces) or a group of points of the input space covered by these neurons (fig. A.3). Of course, the complete ROLF network can also be evaluated by means of other clustering methods, i.e. the neurons can be searched for clusters. Particularly with clustering methods whose storage effort grows quadratically with |P|, the storage effort can be reduced dramatically, since generally there are considerably fewer ROLF neurons than original data points.

A.5.4 Comparison with popular clustering methods

It is obvious that storing the neurons rather than storing the input points takes the biggest part of the storage effort of the ROLFs. This is a great advantage for huge point clouds with a lot of points. Since it is unnecessary to store the entire point cloud, our ROLF, as a neural clustering method, has the capability to learn online, which is definitely a great advantage. Furthermore, it can (similar to ε-nearest neighboring or k-nearest neighboring) distinguish clusters from enclosed clusters – but due to the online presentation of the data without a quadratically growing storage effort, which is by far the greatest disadvantage of the two neighboring methods.

Figure A.3: The clustering process. Top: the input set; middle: the input space covered by ROLF neurons; bottom: the input space only covered by the neurons (representatives).

Additionally, it is unnecessary to previously know the number of clusters, and the issue of the size of the individual clusters proportional to their distance from each other is addressed by using variable perceptive surfaces.

A.5.5 Initializing radii, learning rates and multiplier is not trivial

Certainly, the disadvantages of the ROLF shall not be concealed: it is not always easy to select the appropriate initial values for σ and ρ. The previous knowledge about the data set can, so to say, be included in ρ and the initial value of σ of the ROLF: fine-grained data clusters should use a small ρ and a small σ initial value. But the smaller the ρ, the smaller the chance that the neurons will grow if necessary. Here again, there is no easy answer, just like for the learning rates η_c and η_σ.

For ρ, multipliers in the lower single-digit range such as 2 or 3 are very popular. η_c and η_σ successfully work with values of about 0.005 to 0.1; variations during run-time are also imaginable for this type of network. Initial values for σ generally depend on the cluster and data distribution (i.e. they often have to be tested). But compared to wrong initializations – at least with the mean-σ strategy – they are relatively robust after some training time, which is also not always the case for the two mentioned neighboring methods.

As a whole, the ROLF is on a par with the other clustering methods and is particularly very interesting for systems with low storage capacity or huge data sets. The ROLF compares favorably with k-means clustering as well: firstly, it is unnecessary to previously know the number of clusters, and secondly, k-means clustering recognizes clusters enclosed by other clusters as separate clusters.

A.5.6 Application examples

A first application example could be finding color clusters in RGB images. Another field of application directly described in the ROLF publication is the recognition of words transferred into a 720-dimensional feature space; thus, we can see that ROLFs are relatively robust against higher dimensions. Further applications can be found in the field of analysis of attacks on network systems and their classification.

Exercises

Exercise 18. Determine at least four adaptation steps for one single ROLF neuron k if the four patterns stated below are presented one after another in the indicated order. Let the initial values for the ROLF neuron be c_k = (0.1, 0.1) and σ_k = 1. Furthermore, let η_c = 0.5 and η_σ = 0. Let ρ = 3.

P = {(0.1, 0.1); (0.9, 0.1); (0.1, 0.9); (0.9, 0.9)}.

Appendix B

Excursus: neural networks used for prediction

Discussion of an application of neural networks: a look ahead into the future of time series.

After discussing the different paradigms of neural networks, it is now useful to take a look at an application of neural networks which is brought up often and (as we will see) is also used for fraud: the application of time series prediction. This excursus is structured into the description of time series and estimations about the requirements that are actually needed to predict the values of a time series. Finally, I will say something about the range of software which is supposed to predict share prices or other economic characteristics by means of neural networks or other procedures.

This chapter should not be a detailed description but rather indicate some approaches for time series prediction. In this respect, I will again try to avoid formal definitions.

B.1 About time series

A time series is a series of values discretized in time. For example, daily measured temperature values or other meteorological data of a specific site could be represented by a time series. Share price values also represent a time series. Often the measurement of time series is timely equidistant, and in many time series the future development of their values is very interesting, e.g. the daily weather forecast.

Time series can also be values of an actually continuous function read in a certain distance of time Δt (fig. B.1 on the next page).

If we want to predict a time series, we will look for a neural network that maps the previous series values to future developments of the time series, i.e. if we know longer sections of the time series, we will have enough training samples.

Of course, these are not examples for the future to be predicted, but it is tried to generalize and to extrapolate the past by means of these samples.

But before we begin to predict a time series, we have to answer some questions about the time series we are dealing with and ensure that it fulfills some requirements:

1. Do we have any evidence which suggests that future values depend in any way on the past values of the time series? Does the past of a time series include information about its future?

2. Do we have enough past values of the time series that can be used as training patterns?

3. In case of a prediction of a continuous function: what must a useful Δt look like?

Now these questions shall be explored in detail.

How much information about the future is included in the past values of a time series? This is the most important question to be answered for any time series that should be mapped into the future. If the future values of a time series, for instance, do not depend on the past values, then a time series prediction based on them will be impossible. In this chapter, we assume systems whose future values can be deduced from their states – the deterministic systems.

Figure B.1: A function x that depends on the time is sampled at discrete time steps (time discretized); this means that the result is a time series. The sampled values are entered into a neural network (in this example an SLP) which shall learn to predict the future values of the time series.

This leads us to the question of what a system state is.

A system state completely describes a system for a certain point of time. The future of a deterministic system would be clearly defined by means of the complete description of its current state.

The problem in the real world is that such a state concept includes all things that influence our system by any means. In case of our weather forecast for a specific site, we could definitely determine the temperature, the atmospheric pressure and the cloud density as the meteorological state of the place at a time t. But the whole state would include significantly more information: here, the worldwide phenomena that control the weather would be interesting as well as small local phenomena such as the cooling system of the local power plant.

So we shall note that the system state is desirable for prediction but not always possible to obtain. Often only fragments of the current states can be acquired, e.g. for a weather forecast these fragments are the said weather data. However, we can partially overcome these weaknesses by using not only one single state (the last one) for the prediction, but by using several past states. From this we want to derive our first prediction system. The idea of a state space with predictable states is called state space forecasting.

B.2 One-step-ahead prediction

The first attempt to predict the next future value of a time series out of past values is called one-step-ahead prediction (fig. B.2 on the following page). The aim of the predictor is to realize a function

f(x_{t−n+1}, ..., x_{t−1}, x_t) = x̃_{t+1},   (B.1)

which receives exactly n past values in order to predict the future value. Predicted values shall be headed by a tilde (e.g. x̃) to distinguish them from the actual future values. Note that the prediction becomes easier the more past values of the time series we can include.

The most intuitive and simplest approach would be to find a linear combination

x̃_{i+1} = a_0 x_i + a_1 x_{i−1} + ... + a_j x_{i−j}   (B.2)

that approximately fulfills our conditions. Such a construction is called digital filter.

Here we use the fact that time series usually have a lot of past values, so that we can set up a series of equations:

    x_t     = a_0 x_{t−1}   + . . . + a_j x_{t−1−(n−1)}
    x_{t−1} = a_0 x_{t−2}   + . . . + a_j x_{t−2−(n−1)}
    . . .
    x_{t−n} = a_0 x_{t−n−1} + . . . + a_j x_{t−n−(n−1)}

In fact, n equations could be found for n unknown coefficients and solved (if possible). Or another, better approach: We could use m > n equations for n unknowns in such a way that the sum of the mean squared errors of the already known predictions is minimized. This is called moving average procedure.

But this linear structure corresponds to a singlelayer perceptron with a linear activation function which has been trained by means of data from the past (the experimental setup would comply with fig. B.1 on page 182). In fact, the training by means of the delta rule provides results very close to the analytical solution.

Even if this approach often provides satisfying results, we have seen that many problems cannot be solved by using a singlelayer perceptron. Such considerations lead to a non-linear approach: The multilayer perceptron and non-linear activation functions provide a universal non-linear function approximator, i.e. we can use an n-|H|-1-MLP for n inputs out of the past. An RBF network could also be used; but remember that here the number n has to remain low, since in RBF networks high input dimensions are very complex to realize. So if we want to include many past values, a multilayer perceptron will require considerably less computational effort. I want to remark that a multilayer perceptron with only linear activation functions can be reduced to a singlelayer perceptron – additional layers with linear activation function are useless.

Thus, the prediction becomes easier the more past values of the time series are available. Concerning the choice of a useful ∆t, I would like to ask the reader to read up on the Nyquist-Shannon sampling theorem.

Figure B.2: Representation of the one-step-ahead prediction. It is tried to calculate the future value from a series of past values. The predicting element (in this case a neural network) is referred to as predictor.
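The linear ansatz and the least-squares idea above can be sketched directly in code. The following is a minimal illustration, not the book's implementation; the function names and the window size are made up:

```python
import numpy as np

def fit_linear_predictor(series, n):
    """Fit coefficients a_0..a_{n-1} of a digital filter so that
    x_t ~ a_0*x_{t-1} + ... + a_{n-1}*x_{t-n}, using m > n equations
    and minimizing the summed squared error (least squares)."""
    X = np.array([series[i - n:i][::-1] for i in range(n, len(series))])
    y = np.array(series[n:])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def one_step_ahead(series, coeffs):
    """One-step-ahead prediction: linear combination of the last n values."""
    n = len(coeffs)
    recent = np.array(series[-n:][::-1])  # most recent value first
    return float(coeffs @ recent)
```

On a purely linear series such as 0, 1, 2, …, the fitted filter reproduces the trend exactly; for real data the least-squares solution only minimizes the error over the known past.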

B.3 Two-step-ahead prediction

What approaches can we use to see farther into the future?

B.3.1 Recursive two-step-ahead prediction

In order to extend the prediction to, for instance, two time steps into the future, we could perform two one-step-ahead predictions in a row (fig. B.3 on the following page), i.e. a recursive two-step-ahead prediction. Unfortunately, the value determined by means of a one-step-ahead prediction is generally imprecise, so that errors can be built up; and the more predictions are performed in a row, the more imprecise becomes the result.

B.3.2 Direct two-step-ahead prediction

We have already guessed that there exists a better approach: Just like the system can be trained to predict the next value, we can certainly train it to predict the next but one value. This means we directly train, for example, a neural network to look two time steps ahead into the future, which is referred to as direct two-step-ahead prediction (fig. B.4 on the following page). Obviously, the direct two-step-ahead prediction is technically identical to the one-step-ahead prediction; the only difference is the training.

B.4 Additional optimization approaches for prediction

The possibility to predict values lying far away in the future is not only important because we try to look farther ahead into the future. There can also be periodic time series where other approaches are hardly possible: If a lecture begins at 9 a.m. every Thursday, it is not very useful to know how many people sat in the lecture room on Monday in order to predict the number of lecture participants. The same applies, for example, to periodically occurring commuter jams.

B.4.1 Changing temporal parameters

Thus, it can be useful to intentionally leave gaps in the future values as well as in the past values of the time series, i.e. to introduce the parameter ∆t, which indicates which past value is used for prediction. Technically speaking, we still use a one-step-ahead prediction, only that we extend the input space or train the system to predict values lying farther away.

It is also possible to combine different ∆t: In case of the traffic jam prediction for a Monday, the values of the last few days could be used as data input in addition to the values of the previous Mondays, i.e. we use the last values of several periods – in this case, of a weekly and a daily period. We could also include an annual period in the form of the beginning of the holidays (for sure, everyone of us has already spent a lot of time on the highway because he forgot the beginning of the holidays).
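The recursive variant can be sketched as follows, with a generic one-step predictor passed in as a function (an illustrative sketch, not from the text); it makes visible how the second prediction is built on the first, already uncertain one:

```python
def recursive_two_step(predict_next, past, n):
    """Recursive two-step-ahead prediction: perform two one-step-ahead
    predictions in a row, feeding the first prediction back as input.

    predict_next: function mapping the last n values to the estimate x~_{t+1}.
    """
    window = list(past[-n:])
    x1 = predict_next(window)    # x~_{t+1}
    window = window[1:] + [x1]   # slide the window; x~_{t+1} is now an "observation"
    x2 = predict_next(window)    # x~_{t+2} inherits the error of x~_{t+1}
    return x1, x2
```

A direct two-step-ahead predictor would instead be trained to output x̃_{t+2} from the untouched window – technically the same network, only with a different training target.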

Figure B.3: Representation of the two-step-ahead prediction. Attempt to predict the second future value out of a past value series by means of a second predictor and the involvement of an already predicted value.

Figure B.4: Representation of the direct two-step-ahead prediction. Here, the second time step is predicted directly; the first one is omitted. Technically, it does not differ from a one-step-ahead prediction.



B.4.2 Heterogeneous prediction

Another prediction approach would be to predict the future values of a single time series out of several time series, if it is assumed that the additional time series is related to the future of the first one (heterogeneous one-step-ahead prediction, fig. B.5 on the following page).

If we want to predict two outputs of two related time series, it is certainly possible to perform two parallel one-step-ahead predictions (analytically this is done very often because otherwise the equations would become very confusing); or, in case of the neural networks, an additional output neuron is attached and the knowledge of both time series is used for both outputs (fig. B.6 on the next page).

You will find more and more general material on time series in [WG94].

There are chartists, i.e. people who look at many diagrams and decide, by means of a lot of background knowledge and decade-long experience, whether the equities should be bought or not (and often they are very successful).

Apart from the share prices, it is very interesting to predict the exchange rates of currencies: If we exchange 100 euros into dollars, the dollars into pounds and the pounds back into euros, it could be possible that we finally receive 110 euros. But once found out, we would do this more often and thus we would change the exchange rates into a state in which such an increasing circulation would no longer be possible (otherwise we could produce money by generating, so to speak, a financial perpetual motion machine).
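For the heterogeneous one-step-ahead prediction, only the input assembly changes: the windows of both series are concatenated into one input vector. A sketch of how the training pairs can be built (function name and shapes are illustrative, not from the text):

```python
import numpy as np

def heterogeneous_inputs(x_series, y_series, n):
    """Build training pairs for heterogeneous one-step-ahead prediction:
    the last n values of BOTH series form the input; the next value of
    x_series is the target (cf. fig. B.5)."""
    inputs, targets = [], []
    for t in range(n, len(x_series)):
        inputs.append(list(x_series[t - n:t]) + list(y_series[t - n:t]))
        targets.append(x_series[t])
    return np.array(inputs), np.array(targets)
```

Predicting both series at once (fig. B.6) would simply add the next value of y_series as a second target component.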
At the stock exchange, successful stock and currency brokers raise or lower their thumbs – and thereby indicate whether, in their opinion, a share price or an exchange rate will increase or decrease. Mathematically speaking, they indicate the first bit (the sign) of the first derivative of the exchange rate. In that way, excellent world-class brokers obtain success rates of about 70%.


B.5 Remarks on the prediction of share prices

Many people observe the changes of a share price in the past and try to conclude the future from those values in order to benefit from this knowledge. Share prices are discontinuous and therefore principally difficult functions. Furthermore, the functions are only available at discrete values – often, for example, in a daily rhythm (including the maximum and minimum values per day, if we are lucky), with the daily variations certainly being eliminated. But this makes the whole thing even more difficult.

Figure B.5: Representation of the heterogeneous one-step-ahead prediction. Prediction of a time series under consideration of a second one.

Figure B.6: Heterogeneous one-step-ahead prediction of two time series at the same time.

Again and again, software appears which uses scientific key words such as "neural networks" to purport that it is capable of predicting where share prices are going. Do not buy such software! In addition to the aforementioned scientific objections there is one simple reason for this: If these tools work – why should the manufacturer sell them? Normally, useful economic knowledge is kept secret. If we knew a way to definitely gain wealth by means of shares, we would earn our millions by using this knowledge instead of selling it for 30 euros, wouldn't we?

In Great Britain, the heterogeneous one-step-ahead prediction was successfully used to increase the accuracy of such predictions to 76%: In addition to the time series of the values, indicators such as the oil price in Rotterdam or the US national debt were included. This is just an example to show the magnitude of the accuracy of stock-exchange evaluations, since we are still talking only about the first bit of the first derivative! We still do not know how strong the expected increase or decrease will be, and also whether the effort will pay off: Probably one wrong prediction could nullify the profit of one hundred correct predictions.

How can neural networks be used to predict share prices? Intuitively, we assume that future share prices are a function of the previous share values. But this assumption is wrong: Share prices are no function of their past values, but a function of their assumed future value. We do not buy shares because their values have increased during the last days, but because we believe that they will further increase tomorrow. If, as a consequence, many people buy a share, they will boost the price. Therefore their assumption was right – a self-fulfilling prophecy has been generated, a phenomenon long known in economics. The same applies the other way around: We sell shares because we believe that tomorrow the prices will decrease. This will beat down the prices the next day, and generally even more the day after that.



Appendix C  Excursus: reinforcement learning

What if there were no training samples, but it would nevertheless be possible to evaluate how well we have learned to solve a problem? Let us examine a learning paradigm that is situated between supervised and unsupervised learning.

I now want to introduce a more exotic approach of learning – just to leave the usual paths. We know learning procedures in which the network is exactly told what to do, i.e. we provide exemplary output values. We also know learning procedures like those of the self-organizing maps, into which only input values are entered. Now we want to explore something in between: the learning paradigm of reinforcement learning – reinforcement learning according to Sutton and Barto [SB98].

Reinforcement learning in itself is no neural network but only one of the three learning paradigms already mentioned in chapter 4. In some sources it is counted among the supervised learning procedures, since a feedback is given; due to its very rudimentary feedback, however, it is reasonable to separate it from the supervised learning procedures – apart from the fact that there are no training samples at all.

The term reinforcement learning comes from cognitive science and psychology, and it describes the learning system of carrot and stick, which occurs everywhere in nature, i.e. learning by means of good or bad experience, reward and punishment. But there is no learning aid that exactly explains what we have to do: We only receive a total result for a process (Did we win the game of chess or not? And how sure was this victory?), but no results for the individual intermediate steps.

For example, if we ride our bike with worn tires at a speed of exactly 21.5 km/h through a turn over some sand with a grain size of 0.1 mm, on the average, then nobody could tell us exactly which handlebar angle we have to adjust or, even worse, how strongly the great number of muscle parts in our arms or legs have to contract for this. Depending on whether we reach the end of the curve unharmed or not, we soon have to face the learning experience, a feedback or a reward, be it good or bad. Thus, the reward is very simple – but on the other hand it is considerably easier to obtain. While it is generally known that procedures such as backpropagation cannot work in the human brain itself, reinforcement learning is usually considered as being biologically more motivated.

If we now have tested different velocities and turning angles often enough and received some rewards, we will get a feel for what works and what does not. The aim of reinforcement learning is to maintain exactly this feeling.

Another example for the quasi-impossibility to achieve a sort of cost or utility function is a tennis player who tries to maximize his athletic success in the long term by means of complex movements and ballistic trajectories in the three-dimensional space, including the wind direction, the importance of the tournament, private factors and many more.

To get straight to the point: Since we receive only little feedback, reinforcement learning often means trial and error – and therefore it is very slow.

Broadly speaking, reinforcement learning represents the mutual interaction between an agent and an environmental system (fig. C.2). The agent shall solve some problem; it could, for instance, be an autonomous robot that shall avoid obstacles. The agent performs some actions within the environment and in return receives a feedback from the environment, which in the following is called reward. This cycle of action and reward is characteristic for reinforcement learning: The agent influences the system, the system provides a reward and then changes.

The reward is a real or discrete scalar which describes, as mentioned above, how well we achieve our aim, but it does not give any guidance how we can achieve it. The aim is always to make the sum of rewards as high as possible in the long term.

C.1 The gridworld

As a learning example for reinforcement learning I would like to use the so-called gridworld. We will see that its structure is very simple and easy to figure out, and therefore reinforcement learning is actually not necessary to learn it. However, it is very suitable for representing the approach of reinforcement learning. Now let us exemplarily define the individual components of the reinforcement system by means of the gridworld; later, each of these components will be examined more exactly.

C.1.1 System structure

Now we want to briefly discuss the different sizes and components of the system. We will define them more precisely in the following sections.

Environment: The gridworld (fig. C.1 on the facing page) is a simple, discrete world in two dimensions, which in the following we want to use as environmental system.

Figure C.1: A graphical representation of our gridworld. Dark-colored cells are obstacles and therefore inaccessible. The exit is located on the right side of the light-colored field. The symbol × marks the starting position of our agent. In the upper part of the figure the door is open, in the lower part it is closed.

Figure C.2: The agent performs some actions within the environment and in return receives a reward.

Agent: As an agent we use a simple robot situated in our gridworld.

State space: As we can see, our gridworld has 5 × 7 fields with 6 fields being inaccessible. Therefore, our agent can occupy 29 positions in the gridworld. These positions are regarded as states for the agent.

Action space: The actions are still missing. We simply define that the robot can move one field up or down, to the right or to the left (as long as there is no obstacle or the edge of our gridworld in the way).

Task: Our agent's task is to leave the gridworld. The exit is located on the right of the light-colored field.

Non-determinism: The two obstacles can be connected by a "door". When the door is closed (lower part of the illustration), the corresponding field is inaccessible. The position of the door cannot change during a cycle, but only between the cycles.

We now have created a small world that will accompany us through the following learning strategies and illustrate them.

C.1.2 Agent and environment

Our aim is that the agent learns what happens by means of the reward.
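The gridworld's mechanics can be captured in a few lines. The 5 × 7 size is from the text, but the obstacle and door coordinates below are invented for illustration, since the exact map is only given as a figure:

```python
# A minimal stand-in for the gridworld of fig. C.1. The obstacle layout and
# door position are hypothetical; the mechanics follow the text: four actions,
# and moves into obstacles, the closed door or the border leave the agent
# where it is.
WIDTH, HEIGHT = 7, 5
OBSTACLES = {(3, 1), (3, 3)}   # two obstacle cells ...
DOOR = (3, 2)                  # ... that can be connected by a "door"
ACTIONS = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}

def step(pos, action, door_open):
    """Apply one action; illegal moves leave the position unchanged."""
    dx, dy = ACTIONS[action]
    nx, ny = pos[0] + dx, pos[1] + dy
    blocked = OBSTACLES | (set() if door_open else {DOOR})
    if not (0 <= nx < WIDTH and 0 <= ny < HEIGHT) or (nx, ny) in blocked:
        return pos
    return (nx, ny)
```

Note how the door state changes the transition function between cycles – this is exactly the non-determinism mentioned above.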

Thus, the agent is trained over, of and by means of a dynamic system, the environment, in order to reach an aim. But what does learning mean in this context?

The agent shall learn a mapping of situations to actions (called policy), i.e. it shall learn what to do in which situation to achieve a certain (given) aim. The aim is simply shown to the agent by giving an award for the achievement.

Such an award must not be mistaken for the reward – on the agent's way to the solution it may sometimes be useful to receive a smaller award or a punishment, when in return the long-term result is maximum (similar to the situation when an investor just sits out the downturn of the share price, or to a pawn sacrifice in a chess game). So, if the agent is heading into the right direction towards the target, it receives a positive reward, and if not, it receives no reward at all or even a negative reward (punishment). The award is, so to speak, the final sum of all rewards – which is also called return.

After having colloquially named all the basic components, we want to discuss more precisely which components can be used to make up our abstract reinforcement learning system.

Definition C.1 (Agent). In reinforcement learning the agent can be formally described as a mapping of the situation space S into the action space A(s_t):

    Agent: S → A(s_t)   (C.1)

The meaning of situations s_t will be defined later; here it should only indicate that the action space depends on the current situation.

In the gridworld: In the gridworld, the agent is a simple robot that should find the exit of the gridworld.

Definition C.2 (Environment). The environment represents a stochastic mapping of an action A in the current situation s_t to a reward r_t and a new situation s_{t+1}:

    Environment: S × A → P(S × r_t)   (C.2)

In the gridworld: The environment is the gridworld itself, which is a discrete gridworld.

C.1.3 States, situations and actions

As already mentioned, an agent can be in different states: In case of the gridworld, for example, it can be in different positions (here we get a two-dimensional state vector). For an agent it is not always possible to realize all information about its current state, so that we have to introduce the term situation. A situation is a state from the agent's point of view, i.e. only a more or less precise approximation of a state.

Therefore, situations generally do not allow to clearly "predict" successor situations – even with a completely deterministic system this may not be applicable. If we knew all states and the transitions between them exactly (thus, the complete system), it would be possible to plan optimally and also easy to find an optimal policy (methods are provided, for example, by dynamic programming).

Definition C.3 (State). Within its environment the agent is in a state. States contain any information about the agent within the environmental system. Thus, it is theoretically possible to clearly predict a successor state to a performed action within a deterministic system out of this godlike state knowledge.

Definition C.4 (Situation). Situations s_t (here at time t) of a situation space S are the agent's limited, approximate knowledge about its state. This approximation (about which the agent cannot even know how good it is) makes clear predictions impossible.

Definition C.5 (Action). Actions a_t can be performed by the agent (whereupon it could be possible that, depending on the situation, another action space A(S) exists). They cause state transitions and therefore a new situation from the agent's point of view.

In the gridworld: States are positions where the agent can be situated; simply said, in the gridworld the situations equal the states. Possible actions would be to move towards north, south, east or west.

C.1.4 Reward and return

Situation and action can be vectorial, while the reward is always a scalar (in an extreme case even only a binary value), since the aim of reinforcement learning is to get along with little feedback. A complex vectorial reward would equal a real teaching input. By the way, the cost function should be minimized, which would not be possible with a vectorial reward, since we do not have any intuitive order relations in multi-dimensional space, i.e. we do not directly know what is better or worse.

The agent cannot determine by itself whether the current situation is good or bad: This is exactly the reason why it receives the said reward from the environment.

Definition C.6 (Reward). A reward r_t is a scalar, real or discrete (sometimes even only binary) reward or punishment which the environmental system returns to the agent as reaction to an action.

As in real life it is our aim to receive an award that is as high as possible, i.e. to maximize the sum of the expected rewards r, called return R, in the long term. For finitely many time steps¹ the rewards can simply be added:

    R_t = r_{t+1} + r_{t+2} + . . .   (C.3)
        = Σ_{x=1}^{∞} r_{t+x}   (C.4)

Certainly, the return is only estimated here (if we knew all rewards and therefore the return completely, it would no longer be necessary to learn).

¹ In practice, only finitely many time steps will be possible, even though the formulas are stated with an infinite sum in the first place.

However, not every problem has an explicit target and therefore a finite sum (e.g. our agent can be a robot having the task to drive around again and again and to avoid obstacles).

C.1.4.1 Dealing with long periods of time

In order not to receive a diverging sum in case of an infinite series of reward estimations, a weakening factor 0 < γ < 1 is used, which weakens the influence of future rewards. This is not only useful if there exists no target, but also if the target is very far away:

    R_t = r_{t+1} + γ¹ r_{t+2} + γ² r_{t+3} + . . .   (C.5)
        = Σ_{x=1}^{∞} γ^{x−1} r_{t+x}   (C.6)

The farther the reward is away, the smaller is the influence it has on the agent's decisions.

Another possibility to handle the return sum would be a limited time horizon τ, so that only τ many following rewards r_{t+1}, . . . , r_{t+τ} are regarded:

    R_t = r_{t+1} + . . . + γ^{τ−1} r_{t+τ}   (C.7)
        = Σ_{x=1}^{τ} γ^{x−1} r_{t+x}   (C.8)

Thus, we divide the timeline into episodes. Usually, one of the two methods is used to limit the sum, if not both methods together.

As in daily living, we try to approximate our current situation to a desired state. Since it is not mandatory that only the next expected reward but the expected total sum decides what the agent will do, it is also possible to perform actions that, on short notice, result in a negative reward (e.g. the pawn sacrifice in a chess game) but will pay off later.

Definition C.7 (Return). The return R_t is the accumulation of all rewards received from time t on.

C.1.5 The policy

After having considered and formalized some system components of reinforcement learning, the actual aim is still to be discussed: During reinforcement learning the agent learns a policy

    Π : S → P(A).   (C.9)

Thus, it continuously adjusts a mapping of the situations to the probabilities P(A) with which any action A is performed in any situation S. A policy can be defined as a strategy to select actions that would maximize the reward in the long term.

In the gridworld: In the gridworld, the policy is the strategy according to which the agent tries to exit the gridworld.

Definition C.8 (Policy). The policy Π is a mapping of situations to probabilities to perform every action out of the action space. So it can be formalized as Π : S → P(A).
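Equations (C.5)–(C.8) translate directly into code; a small sketch (the function name is made up):

```python
def discounted_return(rewards, gamma, tau=None):
    """Return R_t = sum over x of gamma^(x-1) * r_(t+x), cf. eqs. (C.6)/(C.8).

    rewards: the sequence r_(t+1), r_(t+2), ...
    tau: optional limited time horizon; None uses all given rewards.
    """
    if tau is not None:
        rewards = rewards[:tau]
    # enumerate starts at 0, so gamma**0 weights r_(t+1), matching gamma^(x-1)
    return sum(gamma ** x * r for x, r in enumerate(rewards))
```

With gamma = 1 this degenerates to the plain sum of eq. (C.3); with 0 < gamma < 1 the weight of a reward decays the farther away it lies in the future.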
Since it is not mandatory that only the next expected reward but the expected toC.8) Deﬁnition C.g. which weakens the in.1. result in a negative reward However.7) agent tries to exit the gridworld. . is the accumulation of all received rewards As in daily living we try to approximate until time t. one of the two methagent as reaction to an action.

Basically, we distinguish between two policy paradigms: An open-loop policy represents an open control chain and creates, out of an initial situation s_0, a sequence of actions a_0, a_1, . . . with a_i ≠ a_i(s_i) for i > 0. Thus, in the beginning the agent develops a plan and consecutively executes it to the end without considering the intermediate situations (therefore the actions after a_0 do not depend on the situations). So an open-loop policy is a sequence of actions without interim feedback: A sequence of actions is generated out of a starting situation. If the system is known well and truly, such an open-loop policy can be used successfully and lead to useful results. But to know, for example, the chess game well and truly, it would be necessary to try every possible move, which would be very time-consuming. Thus, for such problems we have to find an alternative to the open-loop policy, which incorporates the current situations into the action plan:

A closed-loop policy is a closed loop, so to speak a function

    Π : s_i → a_i with a_i = a_i(s_i).

Here, the environment influences our action, or respectively the agent responds to the input of the environment, as already illustrated in fig. C.2. A closed-loop policy is, in a manner of speaking, a reactive plan to map current situations to actions to be performed.

In the gridworld: In the gridworld, a closed-loop policy would be responsive to the current position and choose the direction according to it. In particular, when an obstacle appears dynamically, such a policy is the better choice. An open-loop policy would provide a precise direction sequence towards the exit, such as the way from the given starting position in abbreviations of the directions: EEEEN.

C.1.5.1 Exploitation vs. exploration

When selecting the actions to be performed, again two basic strategies can be examined. As in real life, during reinforcement learning often the question arises whether the existing knowledge is only willfully exploited or new ways are also explored. Initially, we want to discuss the two extremes:

A greedy policy always chooses the way of the highest reward that can be determined in advance, i.e. the way of the highest known reward. This policy represents the exploitation approach and is very promising when the used system is already known.

In contrast to the exploitation approach, it is the aim of the exploration approach to explore a system as detailed as possible, so that also such paths leading to the target can be found which may not seem very promising at first glance but are in fact very successful.
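The two extremes are commonly mixed in practice; an ε-greedy selection rule is one standard way to do so (the decay schedule below is an illustrative choice, not from the text):

```python
import random

def epsilon_greedy(action_values, epsilon, rng=random):
    """With probability epsilon explore (uniform random action),
    otherwise exploit (action with the highest estimated value)."""
    if rng.random() < epsilon:
        return rng.choice(sorted(action_values))      # exploration
    return max(action_values, key=action_values.get)  # exploitation (greedy)

def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=1000):
    """Linearly decay the exploration probability: mostly exploration
    in the beginning, mostly exploitation at the end."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```

With epsilon = 0 this reduces to the pure greedy policy; with epsilon = 1 it is pure exploration.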

Let us assume that we are looking for the way to a restaurant: A safe policy would be to always take the way we already know, no matter how unoptimal and long it may be, and not to try to explore better ways. Another approach would be to explore shorter ways every now and then, even at the risk of taking a long time and being unsuccessful, so that we finally have to take the original way and arrive too late at the restaurant.

In the gridworld: For finding the way in the gridworld, the restaurant example applies equally.

In reality, often a combination of both methods is applied: In the beginning of the learning process, exploration is performed with a higher probability, while at the end more existing knowledge is exploited. Here, a static probability distribution is also possible and often applied.

C.2 Learning process

Let us again take a look at daily life. Actions can lead us from one situation into different subsituations, and from each subsituation into further sub-subsituations. In a sense, we get a situation tree, where links between the nodes must be considered (often there are several ways to reach a situation – so the tree could more accurately be referred to as a situation graph). The leaves of such a tree are the end situations of the system. The exploration approach would search the tree as thoroughly as possible and become acquainted with all leaves; the exploitation approach would unerringly go to the best known leaf. Analogous to the situation tree, we can also create an action tree; here, the rewards for the actions are within the nodes. Now we have to adapt from daily life how we learn exactly.

C.2.1 Rewarding strategies

Interesting and very important is the question of what a reward is awarded for and what kind of reward it is, since the design of the reward significantly controls system behavior. As we have seen above, there generally are (again as in daily life) various actions that can be performed in any situation. There are different strategies to evaluate the selected situations and to learn which series of actions would lead to the target. First of all, this principle should be explained in the following. We now want to indicate some extreme cases as design examples for the reward:

A rewarding similar to the rewarding in a chess game is referred to as pure delayed reward: We only receive the reward at the end of – and not during – the game, and the interim steps do not allow an estimation of our situation.

Generally, we can show that especially small tasks can be solved better by means of negative rewards, while positive, more differentiated rewards are useful for large, complex tasks.

Pure negative reward: Here,

    rt = −1  ∀t < τ.   (C.10)

The agent receives punishment for anything it does, even if it does nothing. As a result it is the most inexpensive method for the agent to reach the target fast: this system finds the most rapid way to reach the target because this way is automatically the most favorable one in respect of the reward. For our gridworld we want to apply the pure negative reward strategy: the robot shall find the exit as fast as possible.

Pure delayed reward: Here,

    rt = 0  ∀t < τ,   (C.11)

as well as rτ = 1 if we win and rτ = −1 if we lose.   (C.12)

With this rewarding strategy a reward is, so to speak, only returned by the leaves of the situation tree; the interim steps do not allow an estimation of our situation.

Avoidance strategy: Harmful situations are avoided. Here,

    rt ∈ {0, −1},   (C.13)

so most situations do not receive any reward and only a few of them receive a negative reward. The agent will avoid getting too close to such negative situations.

Warning: Rewarding strategies can have unexpected consequences. A robot that is told "have it your own way but if you touch an obstacle you will be punished" will simply stand still. If standing still is also punished, it will drive in small circles. Reconsidering this, we will understand that this behavior optimally fulfills the return of the robot, but unfortunately was not intended to do so.
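The three strategies can be written out as reward functions. A hypothetical sketch, where `at_exit` marks the terminal field, `won` the outcome, and `near_obstacle` a harmful situation (these parameter names are illustrative assumptions):

```python
def pure_negative_reward(at_exit):
    """r_t = -1 for every step before reaching the target (C.10):
    every step costs, so the fastest way to the exit is the cheapest."""
    return 0 if at_exit else -1

def pure_delayed_reward(at_exit, won):
    """r_t = 0 for all interim steps (C.11); only the final situation,
    a leaf of the situation tree, returns +1 or -1 (C.12)."""
    if not at_exit:
        return 0
    return 1 if won else -1

def avoidance_reward(near_obstacle):
    """r_t in {0, -1} (C.13): most situations receive no reward at all,
    only a few harmful ones are punished."""
    return -1 if near_obstacle else 0
```

Note how the warning above applies directly to `avoidance_reward`: a robot punished only for touching obstacles maximizes its return by standing still.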

C.2.2 The state-value function

Unlike our agent, we have a godlike view of our gridworld, so that we can swiftly determine which robot starting position can provide which optimal return. In figure C.3 these optimal returns are applied per field. We can see that it would be more practical for the robot to be capable of evaluating the current and future situations. So let us take a look at another system component of reinforcement learning: the state-value function V(s), which with regard to a policy Π is often called VΠ(s). The table of optimal returns for our gridworld exactly represents such a function per situation (= position), with the difference being that here the function is unknown and has to be learned.

Figure C.3: Representation of each optimal return per field in our gridworld by means of pure negative reward awarding, at the top with an open and at the bottom with a closed door.

The state-value function VΠ(s) has the task of determining the value of situations under a policy, i.e. to answer the agent's question of whether a situation s is good or bad, or how good or bad it is. For this purpose it returns the expectation of the return under the situation:

    VΠ(s) = EΠ{Rt | s = st}.   (C.14)

Here, EΠ denotes the set of the expected returns under Π and the current situation st. Whether a situation is bad often depends on the general behavior Π of the agent: a situation being bad under a policy that is searching risks and checking out limits would be, for instance, if an agent on a bicycle turns a corner and the front wheel begins to slide out; due to its daredevil policy the agent would not brake in this situation. With a risk-aware policy the same situation would look much better, and thus it would be evaluated higher by a good state-value function.

Definition C.9 (State-value function). VΠ(s) simply returns the value the current situation s has for the agent under policy Π. Abstractly speaking, according to the above definitions, the value of the state-value function corresponds to the return Rt (the expected value) of a situation st:

    VΠ(s) = EΠ{Rt | s = st}.

The optimal state-value function is called V∗Π(s).

Unfortunately, unlike us, our robot does not have a godlike view of its environment: it does not have a table with optimal returns like the one shown above to orient itself by. The aim of reinforcement learning is that the robot generates its state-value function bit by bit on the basis of the returns of many trials, and approximates the optimal state-value function V∗ (if there is one).

In this context I want to introduce two terms closely related to the cycle between state-value function and policy:

C.2.2.1 Policy evaluation

Policy evaluation is the approach to try a policy a few times, to provide many rewards that way, and to gradually accumulate a state-value function by means of these rewards; i.e. it is tried to evaluate how good a policy is in individual situations.
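The interplay of evaluating a policy and then acting greedily on the result can be sketched on a tiny deterministic corridor world (the world itself, with states 0..3, target 3 and pure negative reward, is an illustrative assumption):

```python
# Tiny deterministic corridor: states 0..3, target state 3,
# pure negative reward (-1 per step), as in the gridworld example.
ACTIONS = (-1, +1)
TARGET = 3

def step(s, a):
    return min(max(s + a, 0), TARGET)

def evaluate(policy, horizon=20):
    """Policy evaluation: try the policy from every state and
    record the return actually obtained."""
    V = {}
    for s0 in range(TARGET + 1):
        s, ret = s0, 0.0
        for _ in range(horizon):
            if s == TARGET:
                break
            s = step(s, policy[s])
            ret -= 1.0
        V[s0] = ret
    return V

def improve(V):
    """Policy improvement: turn the policy into a better one by
    acting greedily with respect to the current state-value function."""
    return {s: max(ACTIONS, key=lambda a: -1.0 + V[step(s, a)])
            for s in range(TARGET)}
```

Alternating `evaluate` and `improve` a few times drives both the policy and the state-value function towards their optima, which is exactly the cycle discussed in the text.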

C.2.2.2 Policy improvement

Policy improvement means to improve a policy itself, i.e. to turn it into a new and better one. In order to improve the policy, we have to aim at the return finally having a larger value than before. The changed state-value function provides information about the system with which we again improve our policy; these two values lift each other, which ideally leads to an optimal policy Π∗ and an optimal state-value function V∗ (fig. C.4). This cycle sounds simple, but is very time-consuming. The principle of reinforcement learning is to realize this interaction.

Figure C.4: The cycle of reinforcement learning.

C.2.3 Monte Carlo method

The easiest approach to accumulate a state-value function is mere trial and error. At first, we select a randomly behaving policy which does not consider the accumulated state-value function for its random decisions. It can be proved that at some point we will find the exit of our gridworld by chance. Inspired by random-based games of chance, this approach is called the Monte Carlo method.

If we additionally assume a pure negative reward, it is obvious that we can receive an optimum value of −6 for our starting field in the state-value function. Depending on the random way the random policy takes, values other (smaller) than −6 can occur for the starting field. Intuitively, we would want to memorize only the better value for one state (i.e. one field). But here caution is advised: in this way, the learning procedure would work only with deterministic systems. Our door, which can be open or closed during a cycle, would produce oscillations for all fields whose shortest way to the target is influenced by it.

With the Monte Carlo method we prefer to use the learning rule(2)

    V(st)new = V(st)old + α(Rt − V(st)old),

in which the update of the state-value function is obviously influenced by both the old state value and the received return (α is the learning rate). This is exactly as if we kept trying until we have found a shorter way to the restaurant and have walked it successfully.

(2) The learning rule is, among others, derived by means of the Bellman equation, but this derivation is not discussed in this chapter.
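The Monte Carlo learning rule above is a one-liner; a minimal sketch, including the "memorize only the better value" variant that the text warns is only safe for deterministic systems (function names are illustrative assumptions):

```python
def mc_update(v_old, observed_return, alpha=0.5):
    """Monte Carlo learning rule:
    V(s_t)_new = V(s_t)_old + alpha * (R_t - V(s_t)_old)."""
    return v_old + alpha * (observed_return - v_old)

def mc_update_keep_better(v_old, observed_return, alpha=0.5):
    """Variant that memorizes only the better value for a state; as noted
    in the text, this works only for deterministic systems (a door that
    opens and closes would make the values oscillate)."""
    return max(v_old, mc_update(v_old, observed_return, alpha))
```

With α = 0.5 a stored value of −10 and an observed return of −6 give a new value of −8: the agent keeps some memory, and new findings change the situation value only a little bit.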

Due to the fact that, in the course of time, many different ways are walked given a random policy, a very expressive state-value function is obtained. Thus, the agent gets some kind of memory; new findings always change the situation value just a little bit. An exemplary learning step is shown in figure C.5.

Figure C.5: Application of the Monte Carlo learning rule with a learning rate of α = 0.5. Top: two exemplary ways the agent randomly selects are applied (one with an open and one with a closed door). Bottom: the result of the learning rule for the value of the initial state considering both ways.

In this example, the computation of the state value was applied for only one single state (our initial state). It should be obvious that it is possible (and often done) to train the values for the states visited in between (in case of the gridworld, our ways to the target) at the same time. The result of such a calculation related to our example is illustrated in figure C.6.

The Monte Carlo method seems to be suboptimal, and usually it is significantly slower than the following methods of reinforcement learning. But this method is the only one for which it can be mathematically proved that it works, and therefore it is very useful for theoretical considerations.

Definition C.10 (Monte Carlo learning). Actions are randomly performed regardless of the state-value function, and in the long term an expressive state-value function is accumulated by means of the following learning rule:

    V(st)new = V(st)old + α(Rt − V(st)old).

C.2.4 Temporal difference learning

Most of the learning is the result of experiences, e.g. walking or riding a bicycle without getting injured (or not).

Even mental skills like mathematical problem solving benefit a lot from experience and simple trial and error. Just as we learn from experience to react to different situations in different ways, temporal difference learning (abbreviated: TD learning) does the same by training VΠ(s), i.e. the agent learns to estimate which situations are worth a lot and which are not. Thus, we initialize our policy with arbitrary values; we try, learn and improve the policy due to experience (fig. C.7). In contrast to the Monte Carlo method, we want to do this in a more directed manner.

Figure C.7: We try different actions within the environment and as a result we learn and improve the policy.

Unlike the Monte Carlo method, TD learning looks ahead by regarding the following situation st+1. The learning formula for the state-value function VΠ(st) is

    V(st)new = V(st) + α(rt+1 + γV(st+1) − V(st)),

where the term α(rt+1 + γV(st+1) − V(st)) is the change of the previous value. We can see that the change in value of the current situation st, which is proportional to the learning rate α, is influenced by

- the received reward rt+1,
- the previous return of the following situation V(st+1), weighted with a factor γ,
- the previous value of the situation V(st).

Definition C.11 (Temporal difference learning). Unlike the Monte Carlo method, TD learning looks ahead by regarding the following situation st+1. Thus, the learning rule is given by

    V(st)new = V(st) + α(rt+1 + γV(st+1) − V(st)).

Figure C.6: Extension of the learning example in fig. C.5, in which the returns for intermediate states are also used to accumulate the state-value function. Here, the low value on the door field can be seen very well: if this state is possible, it must be very positive; if the door is closed, this state is impossible.
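The TD learning rule translates directly into code; the two example fields and their values below are illustrative assumptions:

```python
def td_update(V, s_t, r_next, s_next, alpha=0.5, gamma=1.0):
    """Temporal difference update:
    V(s_t)_new = V(s_t) + alpha * (r_{t+1} + gamma * V(s_{t+1}) - V(s_t)).
    Unlike Monte Carlo, only the immediately following situation is used,
    not the full return of the finished episode."""
    V[s_t] = V[s_t] + alpha * (r_next + gamma * V[s_next] - V[s_t])
    return V

# One pure-negative-reward step from field "A" to field "B":
V = {"A": 0.0, "B": -2.0}
td_update(V, "A", -1.0, "B")
```

After this single step, the value of "A" already moves towards the value suggested by its successor, without waiting for the episode to end.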

C.2.5 The action-value function

Analogous to the state-value function VΠ(s), the action-value function QΠ(st, a) is another system component of reinforcement learning: it evaluates a certain action a under a certain situation s and the policy Π. In the gridworld, the action-value function tells us how good it is to move from a certain field into a certain direction (fig. C.8).

Figure C.8: Exemplary values of an action-value function for the position ×. Moving right, one remains on the fastest way towards the target; moving up is still a quite fast way; moving down is not a good way at all (provided that the door is open for all cases).

Definition C.12 (Action-value function). Like the state-value function, the action-value function QΠ(st, a) evaluates certain actions on the basis of certain situations under a policy. The optimal action-value function is called Q∗Π(s, a).

As shown in figure C.9, actions are performed until a target situation (here referred to as sτ) is achieved (if there exists a target situation; otherwise the actions are simply performed again and again). Usually, the greedy strategy is applied when selecting actions, since it can be assumed that the best known action is selected, and we do not mind always getting into the best known next situation. But we must not disregard that reinforcement learning is generally quite slow: the system has to find out itself what is good.

C.2.6 Q learning

This implies QΠ(s, a) as learning formula for the action-value function; analogously to TD learning, its application is called Q learning:

    Q(st, a)new = Q(st, a) + α(rt+1 + γ max_a′ Q(st+1, a′) − Q(st, a)).

Again we break down the change of the current action value (proportional to the learning rate α) under the current situation into

- the received reward rt+1,
- the maximum action value over the following actions, weighted with γ (here, the greedy strategy is applied, since it can be assumed that the best known action is selected),
- the previous value of the action under our situation st, known as Q(st, a) (remember that this is also weighted by means of α).

The action-value function learns considerably faster than the state-value function.
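The Q learning rule can be made concrete on a miniature version of the gridworld. The following sketch is an illustrative assumption (a 1×4 corridor with the exit at the right end, pure negative reward, ε-greedy action selection and a fixed seed), not code from the original text:

```python
import random

def q_learning(n_states=4, episodes=200, alpha=0.5, gamma=1.0, eps=0.2, seed=1):
    """Q learning on a corridor of fields 0..n_states-1 with the exit at
    the right end: Q(s,a) += alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    rng = random.Random(seed)
    target = n_states - 1
    Q = {(s, a): 0.0 for s in range(n_states) for a in (-1, 1)}
    for _ in range(episodes):
        s = 0
        while s != target:
            if rng.random() < eps:                       # explore
                a = rng.choice((-1, 1))
            else:                                        # exploit (greedy)
                a = max((-1, 1), key=lambda x: Q[(s, x)])
            s_next = min(max(s + a, 0), target)
            r = 0.0 if s_next == target else -1.0        # pure negative reward
            best_next = 0.0 if s_next == target else max(Q[(s_next, -1)], Q[(s_next, 1)])
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```

After training, acting greedily on Q walks straight to the exit, even though the policy was initialized arbitrarily.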

But the advantage of Q learning is that Π can be initialized arbitrarily, and by means of Q learning the result is always Q∗.

Definition C.13 (Q learning). Q learning trains the action-value function by means of the learning rule

    Q(st, a)new = Q(st, a) + α(rt+1 + γ max_a′ Q(st+1, a′) − Q(st, a))   (C.15)

and thus finds Q∗ in any case.

Attention should be paid to the numbering: rewards are numbered beginning with 1, while actions and situations are numbered beginning with 0 (this has simply been adopted as a convention).

Figure C.9: Actions are performed until the desired target situation sτ is achieved. Situations s0, s1, ..., sτ are connected by actions a0, a1, ..., aτ−1; the rewards r1, r2, ..., rτ run against the direction of the actions.

C.3 Example applications

C.3.1 TD gammon

TD gammon is a very successful backgammon game based on TD learning, invented by Gerald Tesauro. The situation here is the current configuration of the board. Anyone who has ever played backgammon knows that the situation space is huge (approx. 10^20 situations), so the state-value functions cannot be computed explicitly (particularly in the late eighties, when TD gammon was introduced). The selected rewarding strategy was the pure delayed reward, i.e. the system receives the reward not before the end of the game, and at the same time the reward is the return. Then the system was allowed to practice itself (initially against a backgammon program, then against an entity of itself). The result was that it achieved the highest ranking in a computer-backgammon league and strikingly disproved the theory that a computer program is not capable of mastering a task better than its programmer.

C.3.2 The car in the pit

Let us take a look at a car parking on a one-dimensional road at the bottom of a deep pit, without being able to get over the slope on both sides straight away by means of its engine power in order to leave the pit.

The intuitive solution we think of immediately is to move backwards, to gain momentum at the opposite slope, and to oscillate in this way several times to dash out of the pit. Trivially, the executable actions here are the possibilities to drive forwards and backwards; the actions of a reinforcement learning system would be "full throttle forward", "full reverse" and "doing nothing". One soon realizes that our problem cannot be solved by means of mere forward-directed engine power. Here, "everything costs" would be a good choice for awarding the reward, so that the system learns fast how to leave the pit and slowly builds up the movement.

C.3.3 The pole balancer

The pole balancer was developed by Barto, Sutton and Anderson. Let be given a situation including a vehicle that is capable of moving either to the right at full throttle or to the left at full throttle (bang bang control). Only these two actions can be performed; standing still is impossible. On the top of this car an upright pole is hinged, which could tip over to both sides. The pole is built in such a way that it always tips over to one side, so it never stands still (let us assume that the pole is rounded at the lower end). The angle of the pole relative to the vertical line is referred to as α. Furthermore, the vehicle always has a fixed position x on our one-dimensional world and a velocity ẋ. Our one-dimensional world is limited, i.e. there are maximum and minimum values x can adopt.

The aim of our system is to learn to steer the car in such a way that it can balance the pole, i.e. to prevent the pole from tipping over. This is achieved best by an avoidance strategy: as long as the pole is balanced, the reward is 0; if the pole tips over, the reward is −1.

Interestingly, the system is soon capable of keeping the pole balanced by tilting it sufficiently fast and with small movements. At this, the system mostly stays in the center of the space, since this is farthest from the walls, which it understands as negative (if it touches the wall, the pole will tip over). The policy can no longer be stored as a table, since the state space is hard to discretize; as policy, a function has to be generated.

C.3.3.1 Swinging up an inverted pendulum

More difficult for the system is the following initial situation: the pole initially hangs down, has to be swung up over the vehicle, and finally has to be stabilized. In the literature this task is called swing up an inverted pendulum.

C.4 Reinforcement learning in connection with neural networks

Finally, the reader may ask why a text on "neural networks" includes a chapter about reinforcement learning. The answer is very simple: we have already been introduced to supervised and unsupervised learning procedures. Although we do not always have an omniscient teacher who makes supervised learning possible, this does not mean that we do not receive any feedback at all. There is often something in between, some kind of criticism or school mark. Problems like this can be solved by means of reinforcement learning.

But not every problem is as easily solved as our gridworld: in our backgammon example we have approx. 10^20 situations, and the situation tree has a large branching factor, let alone other games. Here, the tables used in the gridworld can no longer be realized as state- and action-value functions; thus, we have to find approximators for these functions. And which learning approximators for these reinforcement learning components come immediately into our mind? Exactly: neural networks.

Exercises

Exercise 19. Indicate several "classical" problems of informatics which could be solved efficiently by means of reinforcement learning. Please give reasons for your answers.

Exercise 20. Describe the function of the two components ASE and ACE as they have been proposed by Barto, Sutton and Anderson to control the pole balancer. Bibliography: [BSA83].

Exercise 21. A robot control system shall be persuaded by means of reinforcement learning to find a strategy in order to exit a maze as fast as possible. Assume that the robot is capable of avoiding obstacles and at any time knows its position (x, y) and orientation φ. What could an appropriate state-value function look like? How would you generate an appropriate reward?
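When the value tables are replaced by approximators, the TD error simply becomes the learning signal for the model's parameters. A minimal linear sketch (a stand-in for a neural network; the feature encoding and all parameter values are illustrative assumptions):

```python
def td_linear_update(w, phi_s, r, phi_s_next, alpha=0.1, gamma=0.9):
    """Semi-gradient TD(0) on a linear value approximator V(s) = w . phi(s):
    w <- w + alpha * (r + gamma * V(s') - V(s)) * phi(s).
    With one-hot features phi this reduces to the tabular TD rule."""
    v_s = sum(wi * xi for wi, xi in zip(w, phi_s))
    v_next = sum(wi * xi for wi, xi in zip(w, phi_s_next))
    delta = r + gamma * v_next - v_s           # the TD error
    return [wi + alpha * delta * xi for wi, xi in zip(w, phi_s)]
```

A neural network plays the same role as the weight vector here: the TD error is backpropagated instead of being added to a table entry, which is essentially what TD gammon did.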


Bibliography

[And72] James A. Anderson. A simple neural network generating an interactive memory. Mathematical Biosciences, 14:197–220, 1972.

[APZ93] D. Anguita, G. Parodi, and R. Zunino. Speed improvement of the backpropagation on current-generation workstations. In WCNN'93, Portland: World Congress on Neural Networks, July 11-15, 1993, Oregon Convention Center, Portland, Oregon, volume 1. Lawrence Erlbaum, 1993.

[BSA83] A. Barto, R. Sutton, and C. Anderson. Neuron-like adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13(5):834–846, September 1983.

[CG83] M. Cohen and S. Grossberg. Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, 13(5):815–826, 1983.

[CG87] G. A. Carpenter and S. Grossberg. ART2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26:4919–4930, 1987.

[CG88] G. A. Carpenter and S. Grossberg. The ART of adaptive pattern recognition by a self-organizing neural network. Computer Society Press Technology Series Neural Networks, pages 70–81, 1988.

[CG90] G. A. Carpenter and S. Grossberg. ART 3: Hierarchical search using chemical transmitters in self-organising pattern recognition architectures. Neural Networks, 3(2):129–152, 1990.

[CH67] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.

[CR00] N. A. Campbell and J. B. Reece. Biologie. Spektrum Akademischer Verlag, 2000.

[Cyb89] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314, 1989.

[DHS01] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern classification. Wiley, New York, 2001.

[Elm90] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.

[Fah88] S. E. Fahlman. An empirical study of learning speed in back-propagation networks. Technical Report CMU-CS-88-162, CMU, 1988.

[FMI83] K. Fukushima, S. Miyake, and T. Ito. Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics, 13(5):826–834, September/October 1983.

[Fri94] B. Fritzke. Fast learning with incremental RBF networks. Neural Processing Letters, 1(1):2–5, 1994.

[GKE01a] N. Goerke, F. Kintzler, and R. Eckmiller. Self organized classification of chaotic domains from a nonlinear attractor. In Neural Networks, 2001. Proceedings. IJCNN'01, International Joint Conference on, volume 3, 2001.

[GKE01b] N. Goerke, F. Kintzler, and R. Eckmiller. Self organized partitioning of chaotic attractors for control. Lecture Notes in Computer Science, pages 851–856, 2001.

[Gro76] S. Grossberg. Adaptive pattern classification and universal recoding, I: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23:121–134, 1976.

[GS06] Nils Goerke and Alexandra Scherbart. Classification using multi-soms and multi-neural gas. In IJCNN, pages 3895–3902, 2006.

[Heb49] Donald O. Hebb. The Organization of Behavior: A Neuropsychological Theory. Wiley, New York, 1949.

[Hop82] John J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79:2554–2558, 1982.

[Hop84] J. J. Hopfield. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Science, 81(10):3088–3092, 1984.

[HT85] J. J. Hopfield and D. W. Tank. Neural computation of decisions in optimization problems. Biological Cybernetics, 52(3):141–152, 1985.

[Jor86] M. I. Jordan. Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Conference of the Cognitive Science Society, pages 531–546. Erlbaum, 1986.

[Kau90] L. Kaufman. Finding groups in data: an introduction to cluster analysis. Wiley, New York, 1990.

[Koh72] T. Kohonen. Correlation matrix memories. IEEE Transactions on Computers, C-21:353–359, 1972.

[Koh82] Teuvo Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59–69, 1982.

[Koh89] Teuvo Kohonen. Self-Organization and Associative Memory. Springer-Verlag, Berlin, third edition, 1989.

[Koh98] Teuvo Kohonen. The self-organizing map. Neurocomputing, 21(1-3):1–6, 1998.

[KSJ00] E. R. Kandel, J. H. Schwartz, and T. M. Jessell. Principles of neural science. Appleton & Lange, 2000.

[lCDS90] Y. le Cun, J. S. Denker, and S. A. Solla. Optimal brain damage. In D. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 598–605. Morgan Kaufmann, 1990.

[Mac67] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematics, Statistics and Probability, pages 281–296, 1967.

[MBS93] Thomas M. Martinetz, Stanislav G. Berkovich, and Klaus J. Schulten. 'Neural-gas' network for vector quantization and its application to time-series prediction. IEEE Trans. on Neural Networks, 4(4):558–569, 1993.

[MBW+10] K. D. Micheva, B. Busse, N. C. Weiler, N. O'Rourke, and S. J. Smith. Single-synapse analysis of a diverse synapse population: proteomic imaging methods and markers. Neuron, 68(4):639–653, 2010.

[MP43] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biology, 5(4):115–133, 1943.

[MP69] M. Minsky and S. Papert. Perceptrons. MIT Press, Cambridge Mass., 1969.

[MR86] J. L. McClelland and D. E. Rumelhart. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 2. MIT Press, Cambridge, 1986.

[Par87] David R. Parker. Optimal algorithms for adaptive networks: Second order back propagation, second order direct propagation, and second order Hebbian learning. In Maureen Caudill and Charles Butler, editors, IEEE First International Conference on Neural Networks (ICNN'87), volume II, pages II–593–II–600, San Diego, CA, June 1987. IEEE.

[PG89] T. Poggio and F. Girosi. A theory of networks for approximation and learning. Technical report, MIT, 1989.

[Pin87] F. Pineda. Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59:2229–2232, 1987.

[PM47] W. Pitts and W. S. McCulloch. How we know universals: the perception of auditory and visual forms. Bulletin of Mathematical Biology, 9(3):127–147, 1947.

[Pre94] L. Prechelt. Proben1: A set of neural network benchmark problems and benchmarking rules. Technical Report 21/94, University of Karlsruhe, 1994.

[RB93] M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: The Rprop algorithm. In Neural Networks, IEEE International Conference on, pages 586–591. IEEE, 1993.

[RD05] G. Roth and U. Dicke. Evolution of the brain and intelligence. Trends in Cognitive Sciences, 9(5):250–257, 2005.

[RHW86a] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, October 1986.

[RHW86b] David E. Rumelhart, Geoffrey E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, and the PDP research group, editors, Parallel distributed processing: Explorations in the microstructure of cognition, Volume 1: Foundations. MIT Press, Cambridge, MA, 1986.

[Rie94] M. Riedmiller. Rprop – description and implementation details. Technical report, University of Karlsruhe, 1994.

[Ros58] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408, 1958.

[Ros62] F. Rosenblatt. Principles of Neurodynamics. Spartan, New York, 1962.

[SB98] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[SG06] A. Scherbart and N. Goerke. Unsupervised system for discovering patterns in time-series, pages 343–353. 2006.

[SGE05] Rolf Schatten, Nils Goerke, and Rolf Eckmiller. Regional and online learnable fields. In Sameer Singh, Maneesha Singh, Chidanand Apté, and Petra Perner, editors, ICAPR (2), volume 3687 of Lecture Notes in Computer Science, pages 74–83. Springer, 2005.

[Ste61] K. Steinbuch. Die Lernmatrix. Kybernetik (Biological Cybernetics), 1:36–45, 1961.

[vdM73] C. von der Malsburg. Self-organizing of orientation sensitive cells in striate cortex. Kybernetik, 14:85–100, 1973.

[Was89] P. D. Wasserman. Neural Computing Theory and Practice. New York: Van Nostrand Reinhold, 1989.

[Wer74] P. J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.

[Wer88] P. J. Werbos. Backpropagation: Past and future. In Proceedings ICNN-88, 1988.

[WG94] A. S. Weigend and N. A. Gershenfeld. Time series prediction. Addison-Wesley, 1994.

[WH60] B. Widrow and M. E. Hoff. Adaptive switching circuits. In Proceedings WESCON, pages 96–104, 1960.

[Wid89] R. Widner. Single-stage logic. AIEE Fall General Meeting, 1960. Reprinted in [Was89].

[Zel94] Andreas Zell. Simulation Neuronaler Netze. Addison-Wesley, 1994. In German.


List of Figures

1.1 Robot with 8 sensors and 2 motors
1.2 Black box with eight inputs and two outputs
1.3 Learning samples for the example robot
1.4 Institutions of the field of neural networks
2.1 Central nervous system
2.2 Brain
2.3 Biological neuron
2.4 Action potential
2.5 Compound eye
3.1 Data processing of a neuron
3.2 Various popular activation functions
3.3 Feedforward network
3.4 Feedforward network with shortcuts
3.5 Directly recurrent network
3.6 Indirectly recurrent network
3.7 Laterally recurrent network
3.8 Completely linked network
3.9 Example network with and without bias neuron
3.10 Examples for different types of neurons
4.1 Training samples and network capacities
4.2 Learning curve with different scalings
4.3 Gradient descent, 2D visualization
4.4 Possible errors during a gradient descent
4.5 The 2-spiral problem
4.6 Checkerboard problem
5.1 The perceptron in three different views
5.2 Singlelayer perceptron
5.3 Singlelayer perceptron with several output neurons
5.4 AND and OR singlelayer perceptron
5.5 Error surface of a network with 2 connections
5.6 Sketch of a XOR-SLP
5.7 Two-dimensional linear separation
5.8 Three-dimensional linear separation
5.9 The XOR network
5.10 Multilayer perceptrons and output sets
5.11 Position of an inner neuron for derivation of backpropagation
5.12 Illustration of the backpropagation derivation
5.13 Momentum term
5.14 Fermi function and hyperbolic tangent
5.15 Functionality of 8-2-8 encoding
6.1 RBF network
6.2 Distance function in the RBF network
6.3 Individual Gaussian bells in one- and two-dimensional space
6.4 Accumulating Gaussian bells in one-dimensional space
6.5 Accumulating Gaussian bells in two-dimensional space
6.6 Even coverage of an input space with radial basis functions
6.7 Uneven coverage of an input space with radial basis functions
6.8 Random, uneven coverage of an input space with radial basis functions
7.1 Roessler attractor
7.2 Jordan network
7.3 Elman network
7.4 Unfolding in time
8.1 Hopfield network
8.2 Binary threshold function
8.3 Convergence of a Hopfield network
8.4 Fermi function
9.1 Examples for quantization
10.1 Example topologies of a SOM
10.2 Example distances of SOM topologies
10.3 SOM topology functions
10.4 First example of a SOM
10.5 Training a SOM with one-dimensional topology
10.6 SOMs with one- and two-dimensional topologies and different input spaces
10.7 Topological defect of a SOM
10.8 Resolution optimization of a SOM to certain areas
10.9 Shape to be classified by neural gas
11.1 Neural network reading time series
11.2 One-step-ahead prediction
11.3 Two-step-ahead prediction
11.4 Direct two-step-ahead prediction
11.5 Heterogeneous one-step-ahead prediction
11.6 Heterogeneous one-step-ahead prediction with two outputs
A.1 Comparing cluster analysis methods
A.2 ROLF neuron
A.3 Clustering by means of a ROLF
B.1 Structure of an ART network
B.2 Learning process of an ART network
C.1 Gridworld
C.2 Reinforcement learning
C.3 Gridworld with optimal returns
C.4 Reinforcement learning cycle
C.5 The Monte Carlo method
C.6 Extended Monte Carlo method
C.7 Improving the policy
C.8 Action-value function
C.9 Reinforcement learning timeline

.

Index

*

100-step rule

A

action
action potential
action space
action-value function
activation
activation function
  selection of
ADALINE . . . see adaptive linear neuron
adaptive linear element . . . see adaptive linear neuron
adaptive linear neuron
adaptive resonance theory
agent
algorithm
amacrine cell
approximation
ART . . . see adaptive resonance theory
ART-2
ART-2A
ART-3
artificial intelligence
associative data storage
ATP
attractor
autoassociator
axon

B

backpropagation
  second order
backpropagation of error
  recurrent
bar
basis
bias neuron
binary threshold function
bipolar cell
black box
brain
brainstem

C

capability to learn
center
  of an RBF neuron
    distance to the
  of a ROLF neuron
  of a SOM neuron
central nervous system
cerebellum
cerebral cortex
cerebrum
change in weight
cluster analysis
clusters
CNS . . . see central nervous system
codebook vector
complete linkage
compound eye
concentration gradient
cone function
connection
context-based search
cortex . . . see cerebral cortex
  visual
cortical field
  association
  primary
cylinder function

D

Dartmouth Summer Research Project
deep networks
Delta
delta rule
dendrite
  tree
depolarization
diencephalon . . . see interbrain
difference vector . . . see error vector
digital filter
digitization
discretization . . . see quantization
distance
  Euclidean
  squared
dynamical system

E

early stopping
electronic brain
Elman network
environment
episode
epoch
epsilon-nearest neighboring
error
  specific
  total
error function
  specific
error vector
evolutionary algorithms
exploitation approach
exploration approach
exteroceptor

F

fastprop
fault tolerance
feedforward
Fermi function
flat spot elimination
fudging . . . see flat spot elimination
function approximation
function approximator
  universal


G

ganglion cell . . . 27
Gauss-Markov model . . . 111
Gaussian bell . . . 149
generalization . . . 4, 49
glial cell . . . 23
gradient . . . 59
gradient descent . . . 59
  problems . . . 60
grid . . . 146
gridworld . . . 192

H

Heaviside function . . . see binary threshold function
Hebbian rule . . . 64
  generalized form . . . 65
heteroassociator . . . 132
Hinton diagram . . . 34
history of development . . . 8
Hopfield networks . . . 127
  continuous . . . 134
horizontal cell . . . 28
hyperbolic tangent . . . 37
hyperpolarization . . . 21
hypothalamus . . . 15

I

individual eye . . . see ommatidium
input dimension . . . 48
input patterns . . . 50
input vector . . . 48
interbrain . . . 15
internodes . . . 23
interoceptor . . . 24
interpolation
  precise . . . 110
ion . . . 19
iris . . . 27

J

Jordan network . . . 120

K

k-means clustering . . . 172
k-nearest neighboring . . . 172

L

layer
  hidden . . . 39
  input . . . 39
  output . . . 39
learnability . . . 97
learning


  batch . . . see learning, offline
  offline . . . 52
  online . . . 52
  reinforcement . . . 51
  supervised . . . 51
  unsupervised . . . 50
learning rate . . . 89
  variable . . . 90
learning strategy . . . 39
learning vector quantization . . . 137
lens . . . 27
linear separability . . . 81
linear associator . . . 11
locked-in syndrome . . . 16
logistic function . . . see Fermi function
  temperature parameter . . . 37
LVQ . . . see learning vector quantization
LVQ1 . . . 140
LVQ2 . . . 140
LVQ3 . . . 140

M

M-SOM . . . see self-organizing map, multi
Mark I perceptron . . . 10
Mathematical Symbols
  (t) . . . see time concept
  A(S) . . . see action space
  E_p . . . see error vector
  G . . . see topology
  N . . . see self-organizing map, input dimension
  P . . . see training set
  Q*_Π(s, a) . . . see action-value function, optimal
  Q_Π(s, a) . . . see action-value function
  R_t . . . see return
  S . . . see situation space
  T . . . see temperature parameter
  V*_Π(s) . . . see state-value function, optimal
  V_Π(s) . . . see state-value function
  W . . . see weight matrix
  Δw_i,j . . . see change in weight
  Π . . . see policy
  Θ . . . see threshold value
  α . . . see momentum
  β . . . see weight decay
  δ . . . see Delta
  η . . . see learning rate
  η↑ . . . see Rprop
  η↓ . . . see Rprop
  η_max . . . see Rprop
  η_min . . . see Rprop
  η_i,j . . . see Rprop
  ∇ . . . see nabla operator
  ρ . . . see radius multiplier
  Err . . . see error, total
  Err(W) . . . see error function
  Err_p . . . see error, specific
  Err_p(W) . . . see error function, specific
  Err_WD . . . see weight decay
  a_t . . . see action
  c . . . see center of an RBF neuron, see neuron, self-organizing map, center
  f_act . . . see activation function
  f_out . . . see output function
  m . . . see output dimension
  n . . . see input dimension
  p . . . see training pattern
  r_h . . . see center of an RBF neuron, distance to the
  r_t . . . see reward
  s_t . . . see situation
  t . . . see teaching input
  w_i,j . . . see weight
  x . . . see input vector
  y . . . see output vector
membrane . . . 19
  potential . . . 19
memorized . . . 54
metric . . . 171
Mexican hat function . . . 150
MLP . . . see perceptron, multilayer
momentum . . . 94
momentum term . . . 94
Monte Carlo method . . . 201
Moore-Penrose pseudo inverse . . . 110
moving average procedure . . . 184
myelin sheath . . . 23

N

nabla operator . . . 59
Neocognitron . . . 12
nervous system . . . 13
network input . . . 35
neural gas . . . 159
  growing . . . 162
  multi- . . . 161
neural network . . . 34
  recurrent . . . 119
neuron . . . 33
  accepting . . . 177
  binary . . . 71
  context . . . 120
  Fermi . . . 71
  identity . . . 71
  information processing . . . 71
  input . . . 71
  RBF . . . 104
    output . . . 104
  ROLF . . . 176
  self-organizing map . . . 146
  tanh . . . 71
  winner . . . 148
neuron layers . . . see layer
neurotransmitters . . . 17
nodes of Ranvier . . . 23

O

oligodendrocytes . . . 23
OLVQ . . . 141
on-neuron . . . see bias neuron
one-step-ahead prediction . . . 183
  heterogeneous . . . 187
open loop learning . . . 125
optimal brain damage . . . 96
order of activation . . . 45
  asynchronous
    fixed order . . . 47
    random order . . . 46
    randomly permuted order . . . 46
    topological order . . . 47
  synchronous . . . 46
output dimension . . . 48
output function . . . 38
output vector . . . 48

P

parallelism . . . 5
pattern . . . see training pattern
pattern recognition . . . 98, 131
perceptron . . . 71
  multilayer . . . 82
    recurrent . . . 119


  singlelayer . . . 72
perceptron convergence theorem . . . 73
perceptron learning algorithm . . . 73
period . . . 119
peripheral nervous system . . . 13
Persons
  Anderson . . . 206 f.
  Anderson, James A. . . . 11
  Anguita . . . 37
  Barto . . . 191, 206 f.
  Carpenter, Gail . . . 11, 165
  Elman . . . 120
  Fukushima . . . 12
  Girosi . . . 103
  Grossberg, Stephen . . . 11, 165
  Hebb, Donald O. . . . 9, 64
  Hinton . . . 12
  Hoff, Marcian E. . . . 10
  Hopfield, John . . . 11 f., 127
  Ito . . . 12
  Jordan . . . 120
  Kohonen, Teuvo . . . 11, 137, 145, 157
  Lashley, Karl . . . 9
  MacQueen, J. . . . 172
  Martinetz, Thomas . . . 159
  McCulloch, Warren . . . 8 f.
  Minsky, Marvin . . . 9 f.
  Miyake . . . 12
  Nilsson, Nils . . . 10
  Papert, Seymour . . . 10
  Parker, David . . . 95
  Pitts, Walter . . . 8 f.
  Poggio . . . 103
  Pythagoras . . . 56
  Riedmiller, Martin . . . 90
  Rosenblatt, Frank . . . 10, 69
  Rumelhart . . . 12
  Steinbuch, Karl . . . 10
  Sutton . . . 191, 206 f.
  Tesauro, Gerald . . . 205
  von der Malsburg, Christoph . . . 11
  Werbos, Paul . . . 11, 84, 96
  Widrow, Bernard . . . 10
  Wightman, Charles . . . 10
  Williams . . . 12
  Zuse, Konrad . . . 9
pinhole eye . . . 26
PNS . . . see peripheral nervous system
pole balancer . . . 206
policy . . . 196
  closed loop . . . 197
  evaluation . . . 200
  greedy . . . 197
  improvement . . . 200
  open loop . . . 197
pons . . . 16
propagation function . . . 35
pruning . . . 96
pupil . . . 27

Q

Q learning . . . 204
quantization . . . 137
quickpropagation . . . 95

R

RBF network . . . 104
  growing . . . 115
receptive field . . . 27
receptor cell . . . 24
  photo- . . . 27
  primary . . . 24
  secondary . . . 24


recurrence
  direct
  indirect
  lateral
refractory period
regional and online learnable fields
reinforcement learning
repolarization
representability
resilient backpropagation
resonance
retina
return
reward
  avoidance strategy
  pure delayed
  pure negative
RMS . . . see root mean square
ROLFs . . . see regional and online learnable fields
root mean square
Rprop

S

saltatory conductor
Schwann cell
self-fulfilling prophecy
self-organizing feature maps
self-organizing map
  multi-
sensory adaptation
sensory transduction
shortcut connections
silhouette coefficient
single lens eye
single shot learning
situation
situation space
situation tree
SLP . . . see perceptron, singlelayer
Snark
SNIPE
sodium-potassium pump
soma
SOM . . . see self-organizing map
spin
spinal cord
stability / plasticity dilemma
state
state space forecasting
state-value function
stimulus
stimulus-conducting apparatus
surface, perceptive
swing up an inverted pendulum
symmetry breaking
synapse
  chemical
  electrical
synapses
synaptic cleft

T

target
TD gammon
TD learning . . . see temporal difference learning
teacher forcing
teaching input
telencephalon . . . see cerebrum
temporal difference learning
thalamus
threshold potential
threshold value
time concept
time horizon
time series
time series prediction
topological defect
topology
topology function
training pattern
  set of
training set
transfer function . . . see activation function
truncus cerebri . . . see brainstem
two-step-ahead prediction
  direct

U

unfolding in time

V

voronoi diagram

W

weight
weight matrix
  bottom-up
  top-down
weight vector
weighted sum
Widrow-Hoff rule . . . see delta rule
winner-takes-all scheme
