Neural Networks
David Kriesel
dkriesel.com
Download location:
http://www.dkriesel.com/en/science/neural_networks
The above abstract has not yet become a preface, but at least a little preface, ever since the extended text (then 40 pages long) has turned out to be a download hit.

Ambition and intention of this manuscript

The entire text is written and laid out more effectively and with more illustrations than before. I did all the illustrations myself, most of them directly in LaTeX by using XYpic. They reflect what I would have liked to see when becoming acquainted with the subject: Text and illustrations should be memorable and easy to understand to offer as many people as possible access to the field of neural networks.

Nevertheless, the mathematically and formally skilled readers will be able to understand the definitions without reading the running text, while the opposite holds for readers only interested in the subject matter; everything is explained in both colloquial and formal language. Please let me know if you find out that I have violated this principle.

The sections of this text are mostly independent from each other

The document itself is divided into different parts, which are again divided into chapters. Although the chapters contain cross-references, they are also individually accessible to readers with little previous knowledge. There are larger and smaller chapters: While the larger chapters should provide profound insight into a paradigm of neural networks (e.g. the classic neural network structure: the perceptron and its learning procedures), the smaller chapters give a short overview – but this is also ex-
…for highlighted text – all indexed words are highlighted like this.

Mathematical symbols appearing in several chapters of this document (e.g. Ω for an output neuron; I tried to maintain a consistent nomenclature for regularly recurring elements) are separately indexed under "Mathematical Symbols", so they can easily be assigned to the corresponding term.

Names of persons written in small caps are indexed in the category "Persons" and ordered by the last names.

1. You are free to redistribute this document (even though it is a much better idea to just distribute the URL of my homepage, for it always contains the most recent version of the text).

2. You may not modify, transform, or build upon the document except for personal use.

3. You must maintain the author's attribution of the document at all times.

4. You may not use the attribution to imply that the author endorses you or your document use.

Since I'm no lawyer, the above bullet-point summary is just informational: if there is any conflict in interpretation between the summary and the actual license, the actual license always takes precedence. Note that this license does not extend to the source files used to produce the document. Those are still mine.

2 http://creativecommons.org/licenses/by-nd/3.0/
3 http://www.dkriesel.com/en/science/neural_networks

Now I would like to express my gratitude to all the people who contributed, in whatever manner, to the success of this work, since a work like this needs many helpers. First of all, I want to thank the proofreaders of this text, who helped me and my readers very much. In alphabetical order: Wolfgang Apolinarski, Kathrin Gräve, Paul Imhoff, Thomas Kühn, Christoph Kunze, Malte Lohmeyer, Joachim Nock, Daniel Plohmann, Daniel Rosenthal, Christian Schulz and Tobias Wilken.

Additionally, I want to thank the readers Dietmar Berger, Igor Buchmüller, Marie Christ, Julia Damaschek, Jochen Döll, Maximilian Ernestus, Hardy Falk, Anne Feldmeier, Sascha Fink, Andreas Friedmann, Jan Gassen, Markus Gerhards, Sebastian Hirsch, Andreas Hochrath, Nico Höft, Thomas Ihme, Boris Jentsch, Tim Hussein, Thilo Keller, Mario Krenn, Mirko Kunze, Maikel Linke, Adam Maciak, Benjamin Meier, David Möller, Andreas Müller, Rainer Penninger, Lena Reichel, Alexander Schier, Matthias Siegmund, Mathias Tirtasana, Oliver Tischler, Maximilian Voit, Igor Wall, Achim Weber, Frank Weinreis, Gideon Maillette de Buij Wenniger, Philipp Woock and many others for their feedback, suggestions and remarks.

Additionally, I'd like to thank Sebastian Merzbach, who examined this work in a very conscientious way, finding inconsistencies and errors. In particular, he cleared lots and lots of language clumsiness from the English version.

Especially, I would like to thank Beate Kuhl for translating the entire text from German to English, and for her questions which made me think of changing the phrasing of some paragraphs.

I would particularly like to thank Prof. Rolf Eckmiller and Dr. Nils Goerke as well as the entire Division of Neuroinformatics, Department of Computer Science of the University of Bonn – they all made sure that I always learned (and also had to learn) something new about neural networks and related subjects. Especially Dr. Goerke has always been willing to respond to any questions I was not able to answer myself during the writing process. Conversations with Prof. Eckmiller made me step back from the whiteboard to get a better overall view on what I was doing and what I should do next.

Globally, and not only in the context of this work, I want to thank my parents, who never got tired of buying me specialized and therefore expensive books and who have always supported me in my studies.

For many "remarks" and the very special and cordial atmosphere ;-) I want to thank Andreas Huber and Tobias Treutler. Since our first semester it has rarely been boring with you!

Now I would like to think back to my school days and cordially thank some teachers who (in my opinion) had imparted some scientific knowledge to me – although my class participation had not always been wholehearted: Mr. Wilfried Hartmann, Mr. Hubert Peters and Mr. Frank Nökel.

Furthermore I would like to thank the whole team at the notary's office of Dr. Kemp and Dr. Kolb in Bonn, where I have always felt to be in good hands and who have helped me to keep my printing costs low – in particular Christiane Flamme and Dr. Kemp!
David Kriesel
Contents
A Excursus: Cluster analysis and regional and online learnable fields
  A.1 k-means clustering
  A.2 k-nearest neighboring
  A.3 ε-nearest neighboring
  A.4 The silhouette coefficient
  A.5 Regional and online learnable fields
    A.5.1 Structure of a ROLF
    A.5.2 Training a ROLF
    A.5.3 Evaluating a ROLF
    A.5.4 Comparison with popular clustering methods
    A.5.5 Initializing radii, learning rates and multiplier
    A.5.6 Application examples
  Exercises

Bibliography
Index
Chapter 1
Introduction, motivation and history
How to teach a computer? You can either write a fixed program – or you can
enable the computer to learn on its own. Living beings do not have any
programmer writing a program for developing their skills, which then only has
to be executed. They learn by themselves – without previous knowledge from
external impressions – and thus can solve problems better than any computer
today. What qualities are needed to achieve such behavior for devices like
computers? Can such cognition be adapted from biology? History, development,
decline and resurgence of a wide approach to solving problems.
                                 Brain                 Computer
No. of processing units          ≈ 10^11               ≈ 10^9
Type of processing units         neurons               transistors
Type of calculation              massively parallel    usually serial
Data storage                     associative           address-based
Switching time                   ≈ 10^-3 s             ≈ 10^-9 s
Possible switching operations    ≈ 10^13 /s            ≈ 10^18 /s
Actual switching operations      ≈ 10^12 /s            ≈ 10^10 /s

Table 1.1: The (flawed) comparison between brain and computer at a glance. Inspired by: [Zel94]
…mum, from which the computer is orders of magnitude away (Table 1.1). Additionally, a computer is static – the brain as a biological neural network can reorganize itself during its "lifespan" and therefore is able to learn, to compensate errors and so forth.

Within this text I want to outline how we can use the said characteristics of our brain for a computer system.

So the study of artificial neural networks is motivated by their similarity to successfully working biological systems, which – in comparison to the overall system – consist of very simple but numerous nerve cells that work massively in parallel and (which is probably one of the most significant aspects) have the capability to learn.

There is no need to explicitly program a neural network. For instance, it can learn from training samples or by means of encouragement – with a carrot and a stick, so to speak (reinforcement learning).

One result from this learning procedure is the capability of neural networks to generalize and associate data: After successful training a neural network can find reasonable solutions for similar problems of the same class that were not explicitly trained. This in turn results in a high degree of fault tolerance against noisy input data.

Fault tolerance is closely related to biological neural networks, in which this characteristic is very distinct: As previously mentioned, a human has about 10^11 neurons that continuously reorganize themselves or are reorganized by external influences (about 10^5 neurons can be destroyed while in a drunken stupor, some types of food or environmental influences can also destroy brain cells). Nevertheless, our cognitive abilities are not significantly affected. Thus, the brain is tolerant against internal errors – and also against external errors, for we can often read a really "dreadful scrawl" although the individual letters are nearly impossible to read.

Our modern technology, however, is not automatically fault-tolerant. I have never heard that someone forgot to install the
hard disk controller into a computer and therefore the graphics card automatically took over its tasks, i.e. removed conductors and developed communication, so that the system as a whole was affected by the missing component, but not completely destroyed.

A disadvantage of this distributed fault-tolerant storage is certainly the fact that we cannot realize at first sight what a neural network knows and performs or where its faults lie. Usually, it is easier to perform such analyses for conventional algorithms. Most often we can only transfer knowledge into our neural network by means of a learning procedure, which can cause several errors and is not always easy to manage.

Fault tolerance of data, on the other hand, is already more sophisticated in state-of-the-art technology: Let us compare a record and a CD. If there is a scratch on a record, the audio information on this spot will be completely lost (you will hear a pop) and then the music goes on. On a CD the audio data are distributedly stored: A scratch causes a blurry sound in its vicinity, but the data stream remains largely unaffected. The listener won't notice anything.

So let us summarize the main characteristics we try to adapt from biology:

. Self-organization and learning capability,
. Generalization capability and
. Fault tolerance.

What types of neural networks particularly develop what kinds of abilities and can be used for what problem classes will be discussed in the course of this work.

In the introductory chapter I want to clarify the following: "The neural network" does not exist. There are different paradigms for neural networks, how they are trained and where they are used. My goal is to introduce some of these paradigms and supplement some remarks for practical application.

We have already mentioned that our brain works massively in parallel, in contrast to the functioning of a computer, i.e. every component is active at any time. If we want to state an argument for massive parallel processing, then the 100-step rule can be cited.

1.1.1 The 100-step rule

Experiments showed that a human can recognize the picture of a familiar object or person in ≈ 0.1 seconds, which corresponds to a neuron switching time of ≈ 10^-3 seconds in ≈ 100 discrete time steps of parallel processing.

A computer following the von Neumann architecture, however, can do practically nothing in 100 time steps of sequential processing, which are 100 assembler steps or cycle steps.

Now we want to look at a simple application example for a neural network.
neural network as a kind of black box (fig. 1.3). This means we do not know its structure but just regard its behavior in practice.

f : R^8 → B^1

Figure 1.3: Initially, we regard the robot control as a black box whose inner life is unknown. The black box receives eight real sensor values and maps these values to a binary output value.

The situations in form of simply measured sensor values (e.g. placing the robot in front of an obstacle, see illustration), which we show to the robot and for which we specify whether to drive on or to stop, are called training samples. Thus, a training sample consists of an exemplary input and a corresponding desired output. Now the question is how to transfer this knowledge, the information, into the neural network.

Figure 1.2: The robot is positioned in a landscape that provides sensor values for different situations. We add the desired output values H and so receive our learning samples. The directions in which the sensors are oriented are exemplarily applied to two robots.

The samples can be taught to a neural network by using a simple learning procedure (a learning procedure is a simple algorithm or a mathematical formula). If we have done everything right and chosen good samples, the neural network will generalize from these samples and find a universal rule when it has to stop.

We could instead consider the mapping

f : R^8 → R^2,

which gradually controls the two motors by means of the sensor inputs and thus can not only, for example, stop the robot but also lets it avoid obstacles. Here it is more difficult to analytically derive the rules, and de facto a neural network would be more appropriate.

Our goal is not to learn the samples by heart, but to realize the principle behind them: Ideally, the robot should apply the neural network in any situation and be able to avoid obstacles. In particular, the robot should query the network continuously and repeatedly while driving in order to continuously avoid obstacles. The result is a constant cycle: The robot queries the network. As a consequence, it will drive in one direction, which changes the sensor values. Again the robot queries the network and changes its position, the sensor values are changed once again, and so on. It is obvious that this system can also be adapted to dynamic, i.e. changing, environments (e.g. the moving obstacles in our example).

2 There is a robot called Khepera with more or less similar characteristics. It is round-shaped, approx. 7 cm in diameter, has two motors with wheels and various sensors. For more information I recommend to refer to the internet.
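The training samples and the constant query cycle described above can be sketched in a few lines of Python. Everything here is invented for illustration – the sensor readings, the simple stop rule standing in for a trained network, and all function names; it is not code from this book.

```python
# Illustrative sketch of training samples and the robot's query cycle.
# Sensor values, the stand-in "network" and all names are invented.

# A training sample pairs an exemplary input (eight real sensor values)
# with the corresponding desired output (stop or drive on).
training_samples = [
    ([0.9, 0.8, 0.7, 0.1, 0.1, 0.1, 0.1, 0.1], "stop"),   # obstacle ahead
    ([0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1], "drive"),  # free space
]

def query_network(sensors):
    """Stand-in for a trained network f: R^8 -> B^1: stop as soon as
    any sensor reports a nearby obstacle (reading above 0.5)."""
    return "stop" if max(sensors) > 0.5 else "drive"

def control_cycle(read_sensors, steps):
    """The constant cycle: query the network, act, the sensor values
    change, query again – here compressed into a simple loop."""
    actions = []
    for _ in range(steps):
        actions.append(query_network(read_sensors()))
    return actions
```

A real network would replace `query_network` with a learned mapping; the cycle around it stays the same.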
Figure 1.4: Some pioneers of the field of neural networks. From left to right: John von Neumann, Donald O. Hebb, Marvin Minsky, Bernard Widrow, Seymour Papert, Teuvo Kohonen, John Hopfield, "in the order of appearance" as far as possible.
John Hopfield also invented the so-called Hopfield networks [Hop82], which are inspired by the laws of magnetism in physics. They were not widely used in technical applications, but the field of neural networks slowly regained importance.

1983: Fukushima, Miyake and Ito introduced the neural model of the Neocognitron, which could recognize handwritten characters [FMI83] and was an extension of the Cognitron network already developed in 1975.

1.2.4 Renaissance

Through the influence of John Hopfield, who had personally convinced many researchers of the importance of the field, and the wide publication of backpropagation by Rumelhart, Hinton and Williams, the field of neural networks slowly showed signs of upswing.

1985: John Hopfield published an article describing a way of finding acceptable solutions for the Travelling Salesman problem by using Hopfield nets.

1986: The backpropagation of error learning procedure as a generalization of the delta rule was separately developed and widely published by the Parallel Distributed Processing Group [RHW86a]: Non-linearly-separable problems could be solved by multilayer perceptrons, and Marvin Minsky's negative evaluations were disproven at a single blow. At the same time a certain kind of fatigue spread in the field of artificial intelligence, caused by a series of failures and unfulfilled hopes.

From this time on, the development of the field of research has almost been explosive. It can no longer be itemized, but some of its results will be seen in the following.

Exercises

Exercise 1. Give one example for each of the following topics:

. A book on neural networks or neuroinformatics,
. A collaborative group of a university working with neural networks,
. A software tool realizing neural networks ("simulator"),
. A company using neural networks, and
. A product or service being realized by means of neural networks.

Exercise 2. Show at least four applications of technical neural networks: two from the field of pattern recognition and two from the field of function approximation.

Exercise 3. Briefly characterize the four development phases of neural networks and give expressive examples for each phase.
Chapter 2
Biological neural networks
…is also heavily involved in the human circadian rhythm ("internal clock") and the sensation of pain.

2.1.5 The brainstem connects the brain with the spinal cord and controls reflexes

In comparison with the diencephalon, the brainstem or truncus cerebri, respectively, is phylogenetically much older. Roughly speaking, it is the "extended spinal cord" and thus the connection between brain and spinal cord. The brainstem can also be divided into different areas, some of which will be exemplarily introduced in this chapter. The functions will be discussed from abstract functions towards more fundamental ones. One important component is the pons (= bridge), a kind of transit station for many nerve signals from brain to body and vice versa.

If the pons is damaged (e.g. by a cerebral infarct), then the result could be the locked-in syndrome – a condition in which a patient is "walled-in" within his own body. He is conscious and aware with no loss of cognitive function, but cannot move or communicate by any means. Only his senses of sight, hearing, smell and taste generally work perfectly normally. Locked-in patients may often be able to communicate with others by blinking or moving their eyes.

Furthermore, the brainstem is responsible for many fundamental reflexes, such as the blinking reflex or coughing.

All parts of the nervous system have one thing in common: information processing. This is accomplished by huge accumulations of billions of very similar cells, whose structure is very simple but which communicate continuously. Large groups of these cells send coordinated signals and thus reach the enormous information processing capacity we are familiar with from our brain. We will now leave the level of brain areas and continue with the cellular level of the body – the level of neurons.

2.2 Neurons are information processing cells

Before specifying the functions and processes within a neuron, we will give a rough description of neuron functions: A neuron is nothing more than a switch with information input and output. The switch will be activated if there are enough stimuli of other neurons hitting the information input. Then, at the information output, a pulse is sent to, for example, other neurons.

2.2.1 Components of a neuron

Now we want to take a look at the components of a neuron (Fig. 2.3 on the facing page). In doing so, we will follow the way the electrical information takes within the neuron. The dendrites of a neuron receive the information by special connections, the synapses.

Figure 2.3: Illustration of a biological neuron with the components discussed in this text.
2.2.1.1 Synapses weight the individual parts of information

Incoming signals from other neurons or cells are transferred to a neuron by special connections, the synapses. Such connections can usually be found at the dendrites of a neuron, sometimes also directly at the soma. We distinguish between electrical and chemical synapses.

The electrical synapse is the simpler variant. An electrical signal received by the synapse, i.e. coming from the presynaptic side, is directly transferred to the postsynaptic nucleus of the cell. Thus, there is a direct, strong, unadjustable connection between the signal transmitter and the signal receiver, which is, for example, relevant for shortening reactions that must be "hard coded" within a living organism.

The chemical synapse is the more distinctive variant. Here, the electrical coupling of source and target does not take place; the coupling is interrupted by the synaptic cleft. This cleft electrically separates the presynaptic side from the postsynaptic one. You might think that, nevertheless, the information has to flow, so we will discuss how this happens: It is not an electrical, but a chemical process. On the presynaptic side of the synaptic cleft the electrical signal is converted into a chemical signal, a process induced by chemical cues released there (the so-called neurotransmitters). These neurotransmitters cross the synaptic cleft and transfer the information into the nucleus of the cell (this is a very simple explanation, but later on we will see how this exactly works), where it is reconverted into electrical information. The neurotransmitters are degraded very fast, so that it is possible to release very precise information pulses here, too.

In spite of the more complex functioning, the chemical synapse has – compared with the electrical synapse – utmost advantages:

One-way connection: A chemical synapse is a one-way connection. Due to the fact that there is no direct electrical connection between the pre- and postsynaptic area, electrical pulses in the postsynaptic area cannot flash over to the presynaptic area.

Adjustability: There is a large number of different neurotransmitters that can also be released in various quantities in a synaptic cleft. There are neurotransmitters that stimulate the postsynaptic cell nucleus, and others that slow down such stimulation. Some synapses transfer a strongly stimulating signal, some only weakly stimulating ones. The adjustability varies a lot, and one of the central points in the examination of the learning ability of the brain is that here the synapses are variable, too. That is, over time they can form a stronger or weaker connection.

2.2.1.2 Dendrites collect all parts of information

Dendrites branch like trees from the cell nucleus of the neuron (which is called soma) and receive electrical signals from many different sources, which are then transferred into the nucleus of the cell. The amount of branching dendrites is also called dendrite tree.

2.2.1.3 In the soma the weighted information is accumulated

After the cell nucleus (soma) has received plenty of activating (= stimulating) and inhibiting (= diminishing) signals by synapses or dendrites, the soma accumulates these signals. As soon as the accumulated signal exceeds a certain value (called threshold value), the cell nucleus of the neuron activates an electrical pulse which then is transmitted to the neurons connected to the current one.

2.2.1.4 The axon transfers outgoing pulses

The pulse is transferred to other neurons by means of the axon. The axon is a long, slender extension of the soma. In an extreme case, an axon can stretch up to one meter (e.g. within the spinal cord). The axon is electrically isolated in order to achieve a better conduction of the electrical signal (we will return to this point later on) and it leads to dendrites, which transfer the information to, for example, other neurons. So now we are back at the beginning of our description of the neuron elements. An axon can, however, transfer information to other kinds of cells in order to control them.
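The soma's behavior – accumulate weighted incoming signals and fire once a threshold is exceeded – is exactly what the simplest artificial neurons imitate. A minimal sketch; the inputs, weights and threshold below are invented for illustration:

```python
def artificial_neuron(inputs, weights, threshold):
    """Fire (return 1) if the weighted sum of incoming signals,
    accumulated as in the soma, exceeds the threshold; else stay silent."""
    # Accumulation: stimulating synapses get positive weights,
    # inhibiting synapses negative weights.
    activation = sum(x * w for x, w in zip(inputs, weights))
    return 1 if activation > threshold else 0

# Two stimulating inputs outweigh one inhibiting input here:
fired = artificial_neuron([1, 1, 1], [0.6, 0.6, -0.3], threshold=0.5)
```

The synaptic adjustability described above corresponds to changing the entries of `weights` over time – which is precisely what the learning procedures in later chapters do.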
Negative A- ions remain, positive K+ ions disappear, and so the inside of the cell becomes more negative. The result is another gradient.

Electrical Gradient: The electrical gradient acts contrary to the concentration gradient. The intracellular charge is now very strong, therefore it attracts positive ions: K+ wants to get back into the cell.

If these two gradients were now left alone, they would eventually balance out, reach a steady state, and a membrane potential of −85 mV would develop. But we want to achieve a resting membrane potential of −70 mV, thus there seem to exist some disturbances which prevent this. Furthermore, there is another important ion, Na+ (sodium), for which the membrane is not very permeable but which, however, slowly pours through the membrane into the cell. As a result, the sodium is driven into the cell all the more: On the one hand, there is less sodium within the neuron than outside the neuron. On the other hand, sodium is positively charged but the interior of the cell has negative charge, which is a second reason for the sodium wanting to get into the cell.

Due to the low diffusion of sodium into the cell the intracellular sodium concentration increases. But at the same time the inside of the cell becomes less negative, so that K+ pours in more slowly (we can see that this is a complex mechanism where everything is influenced by everything). The sodium shifts the intracellular equilibrium from negative to less negative, compared with its environment. But even with these two ions a standstill with all gradients being balanced out could still be achieved. Now the last piece of the puzzle gets into the game: a "pump" (more precisely, a transport protein powered by ATP) actively transports ions against the direction they actually want to take!

Sodium is actively pumped out of the cell, although it tries to get into the cell along the concentration gradient and the electrical gradient.

Potassium, however, diffuses strongly out of the cell, but is actively pumped back into it.

For this reason the pump is also called sodium-potassium pump. The pump maintains the concentration gradient for the sodium as well as for the potassium, so that some sort of steady state equilibrium is created and finally the resting potential is −70 mV as observed. All in all the membrane potential is maintained by the fact that the membrane is impermeable to some ions and other ions are actively pumped against the concentration and electrical gradients. Now that we know that each neuron has a membrane potential we want to observe how a neuron receives and transmits signals.

2.2.2.2 The neuron is activated by changes in the membrane potential

Above we have learned that sodium and potassium can diffuse through the membrane – sodium slowly, potassium faster.
They move through channels within the Stimulus up to the threshold: A stimu-
membrane, the sodium and potassium lus opens channels so that sodium
channels. In addition to these per- can pour in. The intracellular charge
manently open channels responsible for becomes more positive. As soon as
diffusion and balanced by the sodium- the membrane potential exceeds the
potassium pump, there also exist channels threshold of −55 mV, the action po-
that are not always open but which only tential is initiated by the opening of
response "if required". Since the opening many sodium channels.
of these channels changes the concentra-
tion of ions within and outside of the mem- Depolarization: Sodium is pouring in. Re-
brane, it also changes the membrane po- member: Sodium wants to pour into
tential. the cell because there is a lower in-
tracellular than extracellular concen-
These controllable channels are opened as tration of sodium. Additionally, the
soon as the accumulated received stimulus cell is dominated by a negative en-
exceeds a certain threshold. For example, vironment which attracts the posi-
stimuli can be received from other neurons tive sodium ions. This massive in-
or have other causes. There exist, for ex- flux of sodium drastically increases
ample, specialized forms of neurons, the the membrane potential - up to ap-
sensory cells, for which a light incidence prox. +30 mV - which is the electrical
could be such a stimulus. If the incom- pulse, i.e., the action potential.
ing amount of light exceeds the threshold,
controllable channels are opened. Repolarization: Now the sodium channels
are closed and the potassium channels
The said threshold (the threshold poten- are opened. The positively charged
tial) lies at about −55 mV. As soon as the ions want to leave the positive inte-
received stimuli reach this value, the neu- rior of the cell. Additionally, the intra-
ron is activated and an electrical signal, an action potential, is initiated. Then this signal is transmitted to the cells connected to the observed neuron, i.e. the cells "listen" to the neuron. Now we want to take a closer look at the different stages of the action potential (Fig. 2.4 on the next page):

Resting state: Only the permanently open sodium and potassium channels are permeable. The membrane potential is at −70 mV and actively kept there by the neuron.

cellular concentration is much higher than the extracellular one, which increases the efflux of ions even more. The interior of the cell is once again more negatively charged than the exterior.

Hyperpolarization: Sodium as well as potassium channels are closed again. At first the membrane potential is slightly more negative than the resting potential. This is due to the fact that the potassium channels close more slowly. As a result, (positively charged) potassium effuses because of its lower extracellular concentration. After a refractory period of 1−2 ms the resting state is re-established so that the neuron can react to newly applied stimuli with an action potential. In simple terms, the refractory period is a mandatory break a neuron has to take in order to regenerate. The shorter this break is, the more often a neuron can fire per time.

Then the resulting pulse is transmitted by the axon.

2.2.2.3 In the axon a pulse is conducted in a saltatory way

We have already learned that the axon is used to transmit the action potential across long distances (remember: You will find an illustration of a neuron including an axon in Fig. 2.3 on page 17). The axon is a long, slender extension of the soma. In vertebrates it is normally coated by a myelin sheath that consists of Schwann cells (in the PNS) or oligodendrocytes (in the CNS)¹, which insulate the axon very well from electrical activity. At a distance of 0.1−2 mm there are gaps between these cells, the so-called nodes of Ranvier. The said gaps appear where one insulating cell ends and the next one begins. It is obvious that at such a node the axon is less insulated.

¹ Schwann cells as well as oligodendrocytes are varieties of the glial cells. There are about 50 times more glial cells than neurons: They surround the neurons (glia = glue), insulate them from each other, provide energy, etc.

Now you may assume that these less insulated nodes are a disadvantage of the axon - however, they are not. At the nodes, mass can be transferred between the intracellular and extracellular area, a transfer that is impossible at those parts of the axon which are situated between two nodes (internodes) and therefore insulated by the myelin sheath. This mass transfer permits the generation of signals similar to the generation of the action potential within the soma. The action potential is transferred as follows: It does not continuously travel along the axon but jumps from node to node. Thus, a series of depolarizations travels along the nodes of Ranvier. One action potential initiates the next one, and mostly even several nodes are active at the same time here. The pulse "jumping" from node to node is responsible for the name of this pulse conductor: saltatory conductor.

Obviously, the pulse will move faster if its jumps are larger. Axons with large internodes (2 mm) achieve a signal dispersion of approx. 180 meters per second. However, the internodes cannot grow indefinitely, since the action potential to be transferred would fade too much until it reaches the next node. So the nodes have a task, too: to constantly amplify the signal. The cells receiving the action potential are attached to the end of the axon – often connected by dendrites and synapses. As already indicated above, the action potentials are not only generated by information received by the dendrites from other neurons.
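The figures quoted above invite a quick back-of-the-envelope check. The following sketch is purely illustrative; the numbers (1–2 ms refractory period, 2 mm internodes, approx. 180 m/s) come from the text, the calculation is mine:

```python
# Rough consequences of the figures quoted above (illustrative only):
# a refractory period of 1-2 ms bounds the firing rate, and a 2 mm
# internode at 180 m/s gives the time for one saltatory "jump".
refractory_s = 2e-3                      # upper end of the 1-2 ms range
max_firing_rate_hz = 1.0 / refractory_s  # at most one spike per refractory period

internode_m = 2e-3                       # 2 mm internode
conduction_m_per_s = 180.0               # approx. 180 m/s signal dispersion
jump_time_s = internode_m / conduction_m_per_s
```

With these numbers a neuron can fire at most about 500 times per second, and one node-to-node jump takes on the order of 11 microseconds.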
changes in the noise level can be ignored.

Just to get a feeling for sensory organs and information processing in the organism, we will briefly describe "usual" light sensing organs, i.e. organs often found in nature. For the third light sensing organ described below, the single lens eye, we will discuss the information processing in the eye.

2.3.3.1 Compound eyes and pinhole eyes only provide high temporal or spatial resolution

Let us first take a look at the so-called compound eye (Fig. 2.5 on the next page), which is, for example, common in insects and crustaceans. The compound eye consists of a great number of small, individual eyes. If we look at the com-
sharp is the image in the area of this ganglion cell. So the information is already reduced directly in the retina and the overall image is, for example, blurred in the peripheral field of vision. So far, we have learned about the information processing in the retina only as a top-down structure. Now we want to take a look at the horizontal and amacrine cells. These cells are not connected from the front backwards but laterally. They allow the light signals to influence themselves laterally directly during the information processing in the retina – a much more powerful method of information processing than compressing and blurring. When the horizontal cells are excited by a photoreceptor, they are able to excite other nearby photoreceptors and at the same time inhibit more distant bipolar cells and receptors. This ensures the clear perception of outlines and bright points. Amacrine cells can further intensify certain stimuli by distributing information from bipolar cells to several ganglion cells or by inhibiting ganglions.

These first steps of transmitting visual information to the brain show that information is processed from the first moment the information is received and, on the other hand, is processed in parallel within millions of information-processing cells. The system's power and resistance to errors is based upon this massive division of work. Of course, the bodily functions are also controlled by neurons, but these should be ignored here.

2.4 The amount of neurons in living organisms at different stages of development

An overview of different organisms and their neural capacity (in large part from [RD05]):

302 neurons are required by the nervous system of a nematode worm, which serves as a popular model organism in biology. Nematodes live in the soil and feed on bacteria.

10^4 neurons make an ant (to simplify matters we neglect the fact that some ant species also can have more or less efficient nervous systems). Due to the use of different attractants and odors, ants are able to engage in complex social behavior and form huge states with millions of individuals. If you regard such an ant state as an individual, it has a cognitive capacity similar to a chimpanzee or even a human.

With 10^5 neurons the nervous system of a fly can be constructed. A fly can evade an object in real-time in three-dimensional space, it can land upon the ceiling upside down, has a considerable sensory system because of compound eyes, vibrissae, nerves at the end of its legs and much more. Thus, a fly has considerable differential and integral calculus in high dimensions implemented "in hardware". We all know that a fly is not easy to catch.

With 0.8 · 10^6 neurons we have enough cerebral matter to create a honeybee. Honeybees build colonies and have amazing capabilities in the field of aerial reconnaissance and navigation.

4 · 10^6 neurons result in a mouse, and here the world of vertebrates already begins.

1.5 · 10^7 neurons are sufficient for a rat, an animal which is reputed to be extremely intelligent and is often used to participate in a variety of intelligence tests representative for the animal world. Rats have an extraordinary sense of smell and orientation, and they also show social behavior. The brain of a frog can be positioned within the same dimension. The frog has a complex build with many functions, it can swim and has evolved complex behavior. A frog can continuously target the said fly by means of its eyes while jumping in three-dimensional space and catch it with its tongue with consid-

1.6 · 10^8 neurons are required by the brain of a dog, companion of man for ages. Now take a look at another popular companion of man:

3 · 10^8 neurons can be found in a cat, which is about twice as much as in a dog. We know that cats are very elegant, patient carnivores that can show a variety of behaviors. By the way, an octopus can be positioned within the same magnitude. Only very few people know that, for example, in labyrinth orientation the octopus is vastly superior to the rat.

For 6 · 10^9 neurons you already get a chimpanzee, one of the animals being very similar to the human.

10^11 neurons make a human. Usually, the human has considerable cognitive capabilities, is able to speak, to abstract, to remember and to use tools as well as the knowledge of other humans to develop advanced technologies and manifold social structures.

With 2 · 10^11 neurons there are nervous
Vectorial input: The input of technical neurons consists of many components,

Adjustable weights: The weights weighting the inputs are variable, similar to

Exercises
This chapter contains the formal definitions for most of the neural network components used later in the text. After this chapter you will be able to read the individual chapters of this work without having to know the preceding ones (although this would be useful).

3.1 The concept of time in neural networks

Where a term refers to a certain point in time, the notation will be, for example, netj(t − 1) or oi(t). From a biological point of view this is, of course, not very plausible (in the human brain a neuron does not wait for another one), but it significantly simplifies the implementation.
Chapter 3 Components of artificial neural networks (fundamental)
tween two neurons i and j is referred to as wi,j.¹

Definition 3.2 (Neural network). A neural network is a sorted triple (N, V, w) with two sets N, V and a function w, where N is the set of neurons and V a set {(i, j)|i, j ∈ N} whose elements are called connections between neuron i and neuron j. The function w : V → R defines the weights, where w((i, j)), the weight of the connection between neuron i and neuron j, is shortened to wi,j. Depending on the point of view it is either undefined or 0 for connections that do not exist in the network.

¹ Note: In some of the cited literature i and j could be interchanged in wi,j. Here, a consistent standard does not exist. But in this text I try to use the notation I found more frequently and in the more significant citations.

SNIPE: In Snipe, an instance of the class NeuralNetworkDescriptor is created in the first place. The descriptor object roughly outlines a class of neural networks, e.g. it defines the number of neuron layers in a neural network. In a second step, the descriptor object is used to instantiate an arbitrary number of NeuralNetwork objects. To get started with Snipe programming, the documentations of exactly these two classes are – in that order – the right thing to read. The presented layout involving descriptor and dependent neural networks is very reasonable from the implementation point of view, because it enables us to create and maintain general parameters of even very large sets of similar (but not necessarily equal) networks.

So the weights can be implemented in a square weight matrix W or, optionally, in a weight vector W, with the row number of the matrix indicating where the connection begins, and the column number of the matrix indicating which neuron is the target. Indeed, in this case the numeric 0 marks a non-existing connection. This matrix representation is also called Hinton diagram.²

² Note that, here again, in some of the cited literature axes and rows could be interchanged. The published literature is not consistent here, as well.

The neurons and connections comprise the following components and variables (I'm following the path of the data within a neuron, which is according to fig. 3.1 on the facing page in top-down direction):

3.2.1 Connections carry information that is processed by neurons

Data are transferred between neurons via connections, with the connecting weight being either excitatory or inhibitory. The definition of connections has already been included in the definition of the neural network.

SNIPE: Connection weights can be set using the method NeuralNetwork.setSynapse.

3.2.2 The propagation function converts vector inputs to scalar network inputs

Looking at a neuron j, we will usually find a lot of neurons with a connection to j, i.e. which transfer their output to j.
3.2 Components of neural networks

Figure 3.1 (labels): activation function (generates the new activation from the net input and the old activation); activation; output function (generates the output from the activation, is often the identity); output to other neurons.

For a neuron j the propagation function receives the outputs oi1, . . . , oin of other neurons i1, i2, . . . , in (which are connected to j) and transforms them, in consideration of the connecting weights wi,j, into the network input netj.
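A common concrete choice for such a propagation function is the weighted sum, which turns the vector of the predecessors' outputs into the scalar network input. This is a sketch under that assumption, not a definition taken from the text:

```python
# Weighted-sum propagation (one common choice): net_j from the outputs
# o_i1..o_in of the predecessor neurons and the weights w_i1,j..w_in,j.
def propagate(outputs, weights):
    return sum(o * wt for o, wt in zip(outputs, weights))  # net_j
```

For example, `propagate([1.0, 0.5], [2.0, -1.0])` yields a network input of 1.5.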
reactions of the neurons to the input values depend on this activation state. The activation state indicates the extent of a neuron's activation and is often shortly referred to as activation. Its formal definition is included in the following definition of the activation function. But generally, it can be defined as follows:

Definition 3.4 (Activation state / activation in general). Let j be a neuron. The activation state aj, in short activation, is explicitly assigned to j, indicates the extent of the neuron's activity and results from the activation function.

SNIPE: It is possible to get and set activation states of neurons by using the methods getActivation or setActivation in the class NeuralNetwork.

3.2.4 Neurons get activated if the network input exceeds their threshold value

Near the threshold value, the activation function of a neuron reacts particularly sensitively. From the biological point of view the threshold value represents the threshold at which a neuron starts firing. The threshold value is also mostly included in the definition of the activation function, but generally the definition is the following:

Definition 3.5 (Threshold value in general). Let j be a neuron. The threshold value Θj is uniquely assigned to j and marks the position of the maximum gradient value of the activation function.

3.2.5 The activation function determines the activation of a neuron dependent on network input and threshold value

At a certain time – as we have already learned – the activation aj of a neuron j depends on the previous³ activation state of the neuron and the external input.

Definition 3.6 (Activation function and Activation). Let j be a neuron. The activation function is defined as

aj(t) = fact(netj(t), aj(t − 1), Θj).   (3.3)

It transforms the network input netj, as well as the previous activation state aj(t − 1), into a new activation state aj(t), with the threshold value Θ playing an important role, as already mentioned.

³ The previous activation is not always relevant for the current – we will see examples for both variants.

Unlike the other variables within the neural network (particularly unlike the ones defined so far) the activation function is often defined globally for all neurons or at least for a set of neurons, and only the threshold values are different for each neuron. We should also keep in mind that the threshold values can be changed, for example by a learning procedure. So it can in particular become necessary to relate the threshold value to the time and to write, for instance, Θj as Θj(t) (but for reasons of clarity, I omitted this here). The activation function is also called transfer function.
1 / (1 + e^(−x/T)) .   (3.5)

SNIPE: In Snipe, activation functions are generalized to neuron behaviors. Such

(Figure: plots of the Fermi function with temperature parameter and the hyperbolic tangent tanh(x), both over x ∈ [−4, 4].)

The output function of a neuron j calculates the values which are transferred to the other neurons connected to j. More formally:

Definition 3.7 (Output function). Let j be a neuron. The output function

fout(aj) = oj   (3.6)

fout(aj) = aj, so oj = aj   (3.7)

Unless explicitly specified differently, we will use the identity as output function within this text.
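Equation (3.5) and the identity output function can be written down directly. A minimal sketch (names are mine):

```python
import math

# Fermi function with temperature parameter T, eq. (3.5): smaller T
# makes the function steeper around x = 0.
def fermi(x, T=1.0):
    return 1.0 / (1.0 + math.exp(-x / T))

# Identity output function, eq. (3.7): the activation is output as-is.
def f_out(a):
    return a
```

At x = 0 the Fermi function is exactly 0.5 for every temperature; decreasing T pushes the curve towards a hard threshold.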
3.2.8 Learning strategies adjust a network to fit our needs
Definition 3.10 (Feedforward network with shortcut connections). Similar to the feedforward network, but the connections may not only be directed towards the next layer but also towards any other subsequent layer.

Figure 3.4: A feedforward network with shortcut connections, which are represented by solid lines. On the right side of the feedforward blocks, new connections have been added to the Hinton diagram.

Definition 3.11 (Direct recurrence). Now we expand the feedforward network by connecting a neuron j to itself, with the weights of these connections being referred to as wj,j. In other words: the diagonal of the weight matrix W may be different from 0.
self, for example, by influencing the neurons of the next layer and the neurons of this next layer influencing j (fig. 3.6).

Definition 3.12 (Indirect recurrence). Again our network is based on a feedforward network, now with additional connections between neurons and their preceding layers being allowed. Therefore, the entries below the diagonal of W may be different from 0.
3.3.2.3 Lateral recurrences connect neurons within one layer

Connections between neurons within one layer are called lateral recurrences (fig. 3.7 on the facing page). Here, each neuron often inhibits the other neurons of the layer and strengthens itself. As a result only the strongest neuron becomes active (winner-takes-all scheme).

Figure 3.6: A network similar to a feedforward network with indirectly recurrent neurons. The indirect recurrences are represented by solid lines. As we can see, connections to the preceding layers can exist here, too. The fields that are symmetric to the feedforward blocks in the Hinton diagram are now occupied.

Definition 3.13 (Lateral recurrence). A laterally recurrent network permits connections within one layer.
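The topologies defined above differ only in which entries of the weight matrix W may be nonzero. This hypothetical helper sketches the rules, assuming neurons are numbered so that layer numbers increase from input towards output:

```python
# Which connections (i -> j) a topology permits; W[i][j] may be
# nonzero exactly for the allowed pairs.
def allowed(topology, i, j, layer):
    """layer: dict mapping neuron index to its layer number."""
    if topology == "feedforward":
        return layer[j] == layer[i] + 1                      # next layer only
    if topology == "shortcut":
        return layer[j] > layer[i]                           # any subsequent layer
    if topology == "direct_recurrence":
        return layer[j] == layer[i] + 1 or i == j            # plus the diagonal
    if topology == "lateral":
        return layer[j] == layer[i] + 1 or (layer[i] == layer[j] and i != j)
    raise ValueError(topology)
```

With neurons 0 and 1 in layer 0, neuron 2 in layer 1, and neuron 3 in layer 2, a plain feedforward network forbids the connection 0 → 3, while a shortcut network allows it.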
unequal to 0 everywhere, except along its diagonal.

Figure 3.7: A network similar to a feedforward network with laterally recurrent neurons. The lateral recurrences are represented by solid lines. Here, recurrences only exist within the layer. In the Hinton diagram, filled squares are concentrated around the diagonal in the height of the feedforward blocks, but the diagonal is left uncovered.

3.4 The bias neuron is a technical trick to consider threshold values as connection weights

By now we know that in many network paradigms neurons have a threshold value that indicates when a neuron becomes active. Thus, the threshold value is an activation function parameter of a neuron. From the biological point of view this sounds most plausible, but it is complicated to access the activation function at runtime in order to train the threshold value.
Figure 3.8: A completely linked network with symmetric connections and without direct recurrences. In the Hinton diagram only the diagonal is left blank.

But threshold values Θj1, . . . , Θjn for neurons j1, j2, . . . , jn can also be realized as the connecting weight of a continuously firing neuron: For this purpose an additional bias neuron whose output value is always 1 is integrated into the network. It is used to represent neuron biases as connection weights, which enables any weight-training algorithm to train the biases at the same time.

Then the threshold value of the neurons j1, j2, . . . , jn is set to 0. Now the threshold values are implemented as connection weights (fig. 3.9 on page 46) and can directly be trained together with the connection weights, which considerably facilitates the learning process.

In other words: Instead of including the threshold value in the activation function, it is now included in the propagation function. Or even shorter: The threshold value is subtracted from the network input, i.e. it is part of the network input. More formally:

Let j1, j2, . . . , jn be neurons with threshold values Θj1, . . . , Θjn. By inserting a bias neuron whose output value is always 1, generating connections between the said bias neuron and the neurons j1, j2, . . . , jn and weighting these connections wBIAS,j1, . . . , wBIAS,jn with −Θj1, . . . , −Θjn, we can set Θj1 = . . . = Θjn = 0 and receive an equivalent neural network whose threshold values are realized as connection weights.
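The equivalence described above is easy to verify numerically. A minimal sketch:

```python
# Bias neuron trick: subtracting Theta_j from the net input equals
# adding a connection from a bias neuron with constant output 1 and
# weight -Theta_j.
def net_with_threshold(outputs, weights, theta):
    return sum(o * wt for o, wt in zip(outputs, weights)) - theta

def net_with_bias_neuron(outputs, weights, theta):
    outputs = outputs + [1.0]      # the bias neuron always outputs 1
    weights = weights + [-theta]   # w_BIAS,j = -Theta_j
    return sum(o * wt for o, wt in zip(outputs, weights))
```

Both functions yield the same net input for any inputs, so an algorithm that only adjusts weights now trains the threshold as well.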
From now on, the bias neuron is omitted for clarity in the following illustrations, but we know that it exists and that the threshold values can simply be treated as weights because of it.

SNIPE: In Snipe, a bias neuron was implemented instead of neuron-individual biases. The neuron index of the bias neuron is 0.

3.6 Take care of the order in which neuron activations are calculated

For a neural network it is very important in which order the individual neurons receive and process the input and output the results. Here, we distinguish two model classes:
All neurons change their values synchronously, i.e. they simultaneously calculate network inputs, activation and output, and pass them on. Synchronous activation corresponds closest to its biological counterpart, but it is – if to be implemented in hardware – only useful on certain parallel computers and especially not for feedforward networks. This order of activation is the most generic and can be used with networks of arbitrary topology.

We have already seen that we can either write its name or its threshold value into a neuron. Another useful representation, which we will use several times in the following, is to illustrate neurons according to their type of data processing. See fig. 3.10 for some examples without further explanation – the different types of neurons are explained as soon as we need them.
Figure 3.9: Two equivalent neural networks, one without bias neuron on the left, one with bias neuron on the right. The neuron threshold values can be found in the neurons, the connecting weights at the connections. Furthermore, I omitted the weights of the already existing connections (represented by dotted lines on the right side).
Definition 3.16 (Synchronous activation). All neurons of a network calculate network inputs at the same time by means of the propagation function, activation by means of the activation function and output by means of the output function. After that the activation cycle is complete.

Here, the neurons do not change their values simultaneously but at different points of time. For this, there exist different orders, some of which I want to introduce in the following:

3.6.2.1 Random order

Apparently, this order of activation is not always useful.
are true.

As written above, the most interesting characteristic of neural networks is their capability to familiarize with problems by means of training and, after sufficient training, to be able to solve unknown problems of the same class. This approach is referred to as generalization. Before introducing specific learning procedures, I want to propose some basic principles about the learning procedure in this chapter.

the question of how to implement it. In principle, a neural network changes when its components are changing, as we have learned above. Theoretically, a neural network could learn by

1. developing new connections,

2. deleting existing connections,

3. changing connecting weights,
Chapter 4 Fundamentals on learning and training samples (fundamental)
As mentioned above, we assume the change in weight to be the most common procedure. Furthermore, deletion of connections can be realized by additionally taking care that a connection is no longer

Thus, we let our neural network learn by modifying the connecting weights according to rules that can be formulated as algorithms. Therefore a learning procedure is always an algorithm that can easily be
Here I want to refer again to the popular example of Kohonen's self-organising maps (chapter 10).

4.1.2 Reinforcement learning methods provide feedback to the network, whether it behaves well or badly

In reinforcement learning the network receives a logical or a real value after completion of a sequence, which defines whether the result is right or wrong. Intuitively it is clear that this procedure should be more effective than unsupervised learning since the network receives specific criteria for problem-solving.

Definition 4.3 (Reinforcement learning). The training set consists of input patterns; after completion of a sequence a value is returned to the network indicating whether the result was right or wrong and, possibly, how right or wrong it was.

according to their difference. The objective is to change the weights to the effect that the network cannot only associate input and output patterns independently after the training, but can provide plausible results to unknown, similar input patterns, i.e. it generalises.

Definition 4.4 (Supervised learning). The training set consists of input patterns with correct results so that a precise error vector can be returned to the network.

This learning procedure is not always biologically plausible, but it is extremely effective and therefore very practicable.

At first we want to look at the supervised learning procedures in general, which – in this text – correspond to the following steps:

Entering the input pattern (activation of input neurons),
to as teaching input, and that for each neuron individually. Thus, for a neuron j with the incorrect output oj, tj is the teaching input, which means it is the correct or desired output for a training pattern p.

Definition 4.7 (Training patterns). A training pattern is an input vector p with the components p1, p2, . . . , pn whose desired output is known. By entering the training pattern into the network we receive an output that can be compared with the teaching input, which is the desired output. The set of training patterns is called P. It contains a finite number of ordered pairs (p, t) of training patterns with corresponding desired output.

Training patterns are often simply called patterns, that is why they are referred to as p. In the literature as well as in this text they are called synonymously patterns, training samples etc.

Definition 4.8 (Teaching input). Let j be an output neuron. The teaching input tj is the desired and correct value j should output after the input of a certain training pattern. Analogously to the vector p the teaching inputs t1, t2, . . . , tn of the neurons can also be combined into a vector t. t always refers to a specific training pattern p and is, as already mentioned, contained in the set P of the training patterns.

SNIPE: Classes that are relevant for training data are located in the package training. The class TrainingSampleLesson allows for storage of training patterns and teaching inputs, as well as simple preprocessing of the training data.

Definition 4.9 (Error vector). For several output neurons Ω1, Ω2, . . . , Ωn the difference between output vector and teaching input under a training input p

Ep = (t1 − y1, . . . , tn − yn)^T

is referred to as error vector, sometimes also called difference vector. Depending on whether you are learning offline or online, the difference vector refers to a specific training pattern, or to the error of a set of training patterns which is normalized in a certain way.

Now I want to briefly summarize the vectors we have defined so far. There is the input vector x, which can be entered into the neural network. Depending on the type of network being used, the neural network will output an output vector y. Basically, the training sample p is nothing more than an input vector. We only use it for training purposes because we know the corresponding teaching input t, which is nothing more than the desired output vector to the training sample. The error vector Ep is the difference between the teaching input t and the actual output y.
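The componentwise difference of Definition 4.9 is one line of code (sketch):

```python
# Error vector E_p = t - y, componentwise, for one training pattern.
def error_vector(t, y):
    return [ti - yi for ti, yi in zip(t, y)]
```

For example, `error_vector([1.0, 0.0], [0.25, 0.5])` gives `[0.75, -0.5]`.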
samples with the output 1. This implies an oversized network with too much free storage capacity.

On the other hand a network could have insufficient capacity (fig. 4.1, bottom) – this rough presentation of input data does not correspond to the good generalization performance we desire. Thus, we have to find the balance (fig. 4.1, middle).

4.3.1 It is useful to divide the set of training samples

An often proposed solution for these problems is to divide the training set into

- one training set really used to train,
- and one verification set to test our progress

– provided that there are enough training samples. The usual division relations are, for instance, 70% for training data and 30% for verification data (randomly chosen). We can finish the training when the network provides good results on the training data as well as on the verification data.

SNIPE: The method splitLesson within the class TrainingSampleLesson allows for splitting a TrainingSampleLesson with respect to a given ratio.

But note: If the verification data provide poor results, do not modify the network structure until these data provide good results – otherwise you run the risk of tailoring the network to the verification data. This means that these data are included in the training, even if they are not used explicitly for the training. The solution is a third set of validation data used only for validation after a supposedly successful training.

By training fewer patterns, we obviously withhold information from the network and risk worsening the learning performance. But this text is not about 100% exact reproduction of given samples but about successful generalization and approximation of a whole function – for which it can definitely be useful to train less information into the network.

4.3.2 Order of pattern representation

You can find different strategies to choose the order of pattern presentation: If patterns are presented in random sequence, there is no guarantee that the patterns are learned equally well (however, this is the standard method). Always the same sequence of patterns, on the other hand, provokes that the patterns will be memorized when using recurrent networks (later, we will learn more about this type of networks). A random permutation would solve both problems, but it is – as already mentioned – very time-consuming to calculate such a permutation.

SNIPE: The method shuffleSamples located in the class TrainingSampleLesson permutes a lesson.
4.4 Learning curve and error measurement

The learning curve indicates the progress of the error, which can be determined in various ways. The motivation to create a learning curve is that such a curve can indicate whether the network is progressing or not. For this, the error should be normalized, i.e. represent a distance measure between the correct and the current output of the network. For example, we can take the same pattern-specific squared error with a prefactor, which we are also going to use to derive the backpropagation of error (let Ω range over the output neurons and O be the set of output neurons):

Errp = 1/2 · Σ_{Ω∈O} (tΩ − yΩ)²   (4.1)

Definition 4.10 (Specific error). The specific error Errp is based on a single training sample, which means it is generated online.

Additionally, the root mean square (abbreviated: RMS) and the Euclidean distance are often used.

The Euclidean distance (a generalization of the theorem of Pythagoras) is useful for lower dimensions where we can still visualize its usefulness.

Definition 4.11 (Euclidean distance). The Euclidean distance between two vectors t and y is defined as

Errp = sqrt( Σ_{Ω∈O} (tΩ − yΩ)² ).   (4.2)

Generally, the root mean square is commonly used since it considers extreme outliers to a greater extent.

Definition 4.12 (Root mean square). The root mean square of two vectors t and y is defined as

Errp = sqrt( (Σ_{Ω∈O} (tΩ − yΩ)²) / |O| ).   (4.3)

As for offline learning, the total error in the course of one training epoch is interesting and useful, too:

Err = Σ_{p∈P} Errp   (4.4)

Definition 4.13 (Total error). The total error Err is based on all training samples, which means it is generated offline.

Analogously, we can generate a total RMS and a total Euclidean distance in the course of a whole epoch. Of course, it is also possible to use other types of error measurement. To get used to further error measurement methods, I suggest having a look into the technical report of Prechelt [Pre94]. In this report, both error measurement methods and sample problems are discussed (this is why there will be a similar suggestion during the discussion of exemplary problems).

SNIPE: There are several static methods representing different methods of error measurement implemented in the class ErrorMeasurement.

Depending on our method of error measurement, our learning curve certainly changes, too.
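The error measures just defined can be computed directly. The following is a minimal Python sketch; the vectors t and y are made-up stand-ins for a teaching input and a network output:

```python
import math

# Assumed toy values: teaching input t and network output y over
# a set O of three output neurons.
t = [1.0, 0.0, 1.0]
y = [0.8, 0.1, 0.9]

def specific_error(t, y):
    """Eq. (4.1): squared error with prefactor 1/2."""
    return 0.5 * sum((ti - yi) ** 2 for ti, yi in zip(t, y))

def euclidean_distance(t, y):
    """Eq. (4.2): Euclidean distance between t and y."""
    return math.sqrt(sum((ti - yi) ** 2 for ti, yi in zip(t, y)))

def root_mean_square(t, y):
    """Eq. (4.3): RMS of the component-wise differences."""
    return math.sqrt(sum((ti - yi) ** 2 for ti, yi in zip(t, y)) / len(t))

def total_error(samples):
    """Eq. (4.4): sum of the specific errors over all training samples."""
    return sum(specific_error(t, y) for t, y in samples)
```

Note that the RMS differs from the Euclidean distance only by the normalization with |O| inside the square root.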
A perfect learning curve looks like a negative exponential function, that means it is proportional to e^(−t) (fig. 4.2 on the following page). Thus, the learning curve can usefully be represented on a logarithmic scale (fig. 4.2, second diagram from the bottom) – with the said scaling combination, a descending line implies an exponential descent of the error.

With the network doing a good job, the problems being not too difficult and the logarithmic representation of Err, you can see – metaphorically speaking – a descending line that often forms "spikes" at the bottom. Here we reach the limit of the 64-bit resolution of our computer, and our network has actually learned the optimum of what it is capable of learning.

Typical learning curves can show a few flat areas as well, i.e. they can show some steps, which is no sign of a malfunctioning learning process. As we can also see in fig. 4.2, a well-suited representation can make any slightly decreasing learning curve look good – so just be cautious when reading the literature.

4.4.1 When do we stop learning?

Now, the big question is: when do we stop learning? Generally, the training is stopped when the user in front of the learning computer "thinks" the error is small enough. Indeed, there is no easy answer, and thus I can once again only give you something to think about, which, however, depends on a more objective view on the comparison of several learning curves.

Confidence in the results, for example, is boosted when the network always reaches nearly the same final error rate for different random initializations – so repeated initialization and training will provide a more objective result.

On the other hand, it can be possible that a curve descending fast in the beginning can, after a longer time of learning, be overtaken by another curve: this can indicate that either the learning rate of the worse curve was too high or the worse curve itself simply got stuck in a local minimum, but was the first to find it. Remember: larger error values are worse than small ones.

But, in any case, note: many people only generate a learning curve in respect of the training data (and then they are surprised that only a few things will work) – but for reasons of objectivity and clarity it should not be forgotten to plot the verification data on a second learning curve, which generally provides values that are slightly worse and with stronger oscillation. But with good generalization the curve can decrease, too.

When the network eventually begins to memorize the samples, the shape of the learning curve can provide an indication: if the learning curve of the verification samples is suddenly and rapidly rising while the learning curve of the
[Four plots of the error ("Fehler") against the training epoch ("Epoche"), shown with linear and logarithmic axis scalings over 1000 epochs.]

Figure 4.2: All four illustrations show the same (idealized, because very smooth) learning curve. Note the alternating logarithmic and linear scalings! Also note the small "inaccurate spikes" visible in the sharp bend of the curve in the first and second diagram from the bottom.
training data is continuously falling, this could indicate memorizing and a generalization getting poorer and poorer. At this point it could be decided whether the network has already learned well enough at the next point of the two curves, and maybe the final point of learning is to be applied here (this procedure is called early stopping).

Once again I want to remind you that these are all acting as indicators; do not draw if-then conclusions from them.

4.5 Gradient optimization procedures

In order to establish the mathematical basis for some of the following learning procedures, I want to explain briefly what is meant by gradient descent: the backpropagation of error learning procedure, for example, involves this mathematical basis and thus inherits the advantages and disadvantages of the gradient descent.

Gradient descent procedures are generally used where we want to maximize or minimize n-dimensional functions. For clarity, the illustration (fig. 4.3 on the next page) shows only two dimensions, but principally there is no limit to the number of dimensions.

The gradient is a vector g that is defined for any differentiable point of a function, that points from this point exactly towards the steepest ascent and indicates the degree of this ascent by means of its norm |g|. Thus, the gradient is a generalization of the derivative for multi-dimensional functions. Accordingly, the negative gradient −g points exactly towards the steepest descent. The gradient operator ∇ is referred to as the nabla operator; the overall notation of the gradient g of the point (x, y) of a two-dimensional function f is g(x, y) = ∇f(x, y).

Definition 4.14 (Gradient). Let g be a gradient. Then g is a vector with n components that is defined for any point of a (differentiable) n-dimensional function f(x1, x2, ..., xn). The gradient operator notation is defined as

g(x1, x2, ..., xn) = ∇f(x1, x2, ..., xn).

g directs from any point of f towards the steepest ascent from this point, with |g| corresponding to the degree of this ascent.

Gradient descent means going downhill in small steps from any starting point of our function against the gradient g, i.e. in the direction of −g (which means, vividly speaking, the direction into which a ball would roll from the starting point), with the size of the steps being proportional to |g| (the steeper the descent, the longer the steps). Therefore, we move slowly on a flat plateau, and on a steep slope we run downhill rapidly. If we came into a valley, we would – depending on the size of our steps – jump over it, or we would return into the valley across the opposite hillside in order to come closer and closer to the deepest point of the valley by walking back and forth, similar to a ball moving within a round bowl.
we will see in the following sections) – however, they still work well on many problems, which makes them a frequently used optimization paradigm. Anyway, let us have a look at their potential disadvantages, so we can keep them in mind a bit.

4.5.1.1 Often, gradient descents converge against suboptimal minima

Every gradient descent procedure can, for example, get stuck within a local minimum (part a of fig. 4.4 on the facing page).
Figure 4.4: Possible errors during a gradient descent: a) Detecting bad minima, b) Quasi-standstill
with small gradient, c) Oscillation in canyons, d) Leaving good minima.
A popular example is the one that did not work in the nineteen-sixties: the XOR function (B² → B¹). We need a hidden neuron layer, which we have discussed in detail. Thus, we need at least two neurons in the inner layer. Let the activation function in all layers (except in the input layer, of course) be the hyperbolic tangent. Trivially, we now expect the outputs 1.0 or −1.0, depending on whether the function XOR outputs 1 or 0 – and exactly here is where the first beginner's mistake occurs.

For outputs close to 1 or −1, i.e. close to the limits of the hyperbolic tangent (or, in case of the Fermi function, 0 or 1), we need very large network inputs. The only chance to reach these network inputs are large weights, which have to be learned: the learning process is largely extended. Therefore it is wiser to enter the teaching inputs 0.9 or −0.9 as desired outputs, or to be satisfied when the network outputs those values instead of 1 and −1.

Another favourite example for singlelayer perceptrons are the boolean functions AND and OR.

4.6.2 The parity function

The parity function maps a set of bits to 1 or 0, depending on whether an even number of input bits is set to 1 or not. Basically, this is the function Bⁿ → B¹. It is characterized by easy learnability up to approx. n = 3 (shown in table 4.1), but the learning effort rapidly increases from n = 4. The reader may create a score table for the 2-bit parity function. What is conspicuous?

4.6.3 The 2-spiral problem

As a training sample for a function let us take two spirals coiled into each other (fig. 4.5 on the facing page), with the function certainly representing a mapping R² → B¹. One of the spirals is assigned to the output value 1, the other spiral to 0. Here, memorizing does not help; the network has to understand the mapping itself. This example can be solved by means of an MLP, too.

Figure 4.5: Illustration of the training samples of the 2-spiral problem

Figure 4.6: Illustration of training samples for the checkerboard problem

4.6.4 The checkerboard problem

[…] suitable for this kind of problems than the MLP). The 2-spiral problem is very similar to the checkerboard problem, only that, mathematically speaking, the first problem is using polar coordinates instead of Cartesian coordinates. I just want to introduce as an example one last trivial case: the identity.
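The parity function described above can be written compactly; the sketch below generates the score table for any n (the convention "even number of set bits → 1" follows the text):

```python
from itertools import product

def parity(bits):
    # 1 iff an even number of input bits is set to 1
    return 1 if sum(bits) % 2 == 0 else 0

def parity_table(n):
    """Score table of the n-bit parity function B^n -> B^1."""
    return {bits: parity(bits) for bits in product((0, 1), repeat=n)}
```

For n = 2, comparing the generated table with the XOR function is instructive with regard to the exercise above.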
Most of the learning rules discussed before are a specialization of the mathematically more general form [MR86] of the Hebbian rule.

Definition 4.17 (Hebbian rule, more general). The generalized form of the Hebbian rule only specifies the proportionality of the change in weight to the product of two undefined functions g(aj, tj) and h(oi, wi,j), but with defined input values:

∆wi,j = η · h(oi, wi,j) · g(aj, tj)   (4.6)

Exercise 7. Calculate the average value µ and the standard deviation σ for the following data points:

p1 = (2, 2, 2)
p2 = (3, 3, 3)
p3 = (4, 4, 4)
p4 = (6, 0, 0)
p5 = (0, 6, 0)
p6 = (0, 0, 6)
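A sketch of the generalized Hebbian rule of eq. 4.6: the rule itself leaves h and g undefined, so the concrete choices below are illustrative assumptions only, not taken from the text.

```python
def hebb_general(eta, o_i, w_ij, a_j, t_j, h, g):
    # Delta w_ij = eta * h(o_i, w_ij) * g(a_j, t_j)
    return eta * h(o_i, w_ij) * g(a_j, t_j)

# Assumed example choices: h(o_i, w) = o_i and g(a_j, t_j) = t_j - a_j
# yield a delta-rule-like special case.
delta = hebb_general(0.5, 1.0, 0.0, 0.2, 1.0,
                     h=lambda o, w: o, g=lambda a, t: t - a)
```

Different choices of h and g recover different concrete learning rules, which is the sense in which eq. 4.6 is the "more general form".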
Chapter 5
The perceptron, backpropagation and its
variants
A classic among the neural networks. If we talk about a neural network, then in the majority of cases we speak about a perceptron or a variation of it.
Perceptrons are multilayer networks without recurrence and with fixed input and output layers. Description of a perceptron, its limits, and extensions that are intended to overcome those limits. Derivation of learning procedures and discussion of their problems.
As already mentioned in the history of neural networks, the perceptron was described by Frank Rosenblatt in 1958 [Ros58]. Initially, Rosenblatt defined the already discussed weighted sum and a non-linear activation function as components of the perceptron.

There is no established definition for a perceptron, but most of the time the term is used to describe a feedforward network with shortcut connections. This network has a layer of scanner neurons (retina) with statically weighted connections to the following layer, which is called the input layer (fig. 5.1 on the next page); but the weights of all other layers are allowed to be changed. All neurons subordinate to the retina are pattern detectors. Here we initially use a binary perceptron, with every output neuron having exactly two possible output values (e.g. {0, 1} or {−1, 1}). Thus, a binary threshold function is used as activation function, depending on the threshold value Θ of the output neuron.

In a way, the binary activation function represents an IF query which can also be negated by means of negative weights. The perceptron can thus be used to accomplish true logical information processing.

Whether this method is reasonable is another matter – of course, this is not the easiest way to achieve Boolean logic. I just want to illustrate that perceptrons can be used as simple logical components and that, theoretically speaking, any Boolean function can be realized by means of perceptrons being connected in series or interconnected in a sophisticated way. But
[Diagram: three views of a perceptron; circles denote neurons, Σ the weighted-sum propagation function and L|H the binary threshold activation function; i1, …, i5 denote input neurons and Ω the output neuron.]

Figure 5.1: Architecture of a perceptron with one layer of variable connections, in different views. The solid-drawn weight layer in the two illustrations at the bottom can be trained.
Top: example of scanning information in the eye.
Middle: sketch of the same example with the indicated fixed-weight layer, using the defined functional descriptions for neurons.
Bottom: without the indicated fixed-weight layer, with the name of each neuron corresponding to our convention. The fixed-weight layer will no longer be taken into account in the course of this work.
we will see that this is not possible without connecting them serially. Before providing the definition of the perceptron, I want to define some types of neurons used in this chapter.

Definition 5.1 (Input neuron). An input neuron is an identity neuron: it exactly forwards the information received. Thus, it represents the identity function. The input neuron is therefore represented by a plain circle symbol.

Definition 5.2 (Information processing neuron). Information processing neurons somehow process the input information, i.e. do not represent the identity function. A binary neuron sums up all inputs by using the weighted sum as propagation function, which we want to illustrate by the sign Σ. The activation function of the neuron is then the binary threshold function, which can be illustrated by L|H. This leads us to the complete depiction of information processing neurons: a circle containing Σ and L|H. Neurons that use the weighted sum as propagation function but the hyperbolic tangent or the Fermi function (or a separately defined activation function f_act) as activation function are depicted analogously; these neurons are also referred to as Tanh neurons or Fermi neurons.

Now that we know the components of a perceptron, we should be able to define it.

Definition 5.3 (Perceptron). The perceptron (fig. 5.1 on the facing page) is¹ a feedforward network containing a retina that is used only for data acquisition and which has fixed-weighted connections to the first neuron layer (input layer). The fixed-weight layer is followed by at least one trainable weight layer. One neuron layer is completely linked with the following layer. The first layer of the perceptron consists of the input neurons defined above.

A feedforward network often contains shortcuts, which does not exactly correspond to the original description and is therefore not included in the definition. We can see that the retina is not included in the lower part of fig. 5.1. As a matter of fact, the first neuron layer is often understood (simplified, and sufficient for this method) as the input layer, because this layer only forwards the input values. The retina itself and the static weights behind it are no longer considered.

¹ It may confuse some readers that I claim that there is no definition of a perceptron but then define the perceptron in the following section. I therefore suggest keeping my definition in the back of your mind and just taking it for granted in the course of this work.
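The "IF query" view of the binary threshold neuron described above can be sketched directly. The weights and threshold values below are hand-picked assumptions for illustration, not taken from the text:

```python
def threshold_neuron(inputs, weights, theta):
    # binary threshold activation on the weighted sum
    net = sum(o * w for o, w in zip(inputs, weights))
    return 1 if net >= theta else 0

def AND(a, b):
    return threshold_neuron([a, b], [1, 1], 2)   # fires only if both inputs fire

def OR(a, b):
    return threshold_neuron([a, b], [1, 1], 1)   # fires if at least one input fires

def NOT(a):
    return threshold_neuron([a], [-1], 0)        # negation via a negative weight
```

Connecting such neurons in series yields further Boolean functions, e.g. NOT(AND(a, b)) realizes NAND – the "connected in series" idea mentioned above.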
@ABC
GFED @ABC
GFED @ABC
GFED
and the variation -WithShortcuts in
a NeuralNetworkDescriptor-Instance
apply settings to a descriptor, which
BIAS i1 i2
are appropriate for feedforward networks
wBIAS,Ωwi1 ,Ω wi2 ,Ω
89:;
?>=<
or feedforward networks with shortcuts.
The respective kinds of connections are
allowed, all others are not, and fastprop is Ω
activated.
5.1 The singlelayer Figure 5.2: A singlelayer perceptron with two in-
perceptron provides only put neurons and one output neuron. The net-
work returns the output by means of the ar-
one trainable weight layer row leaving the network. The trainable layer of
weights is situated in the center (labeled). As a
reminder, the bias neuron is again included here.
Here, connections with trainable weights Although the weight wBIAS,Ω is a normal weight
go from the input layer to an output and also treated like this, I have represented it
neuron Ω, which returns the information by a dotted line – which significantly increases
the clarity of larger networks. In future, the bias
1 trainable
layer whether the pattern entered at the input
neurons was recognized or not. Thus, a neuron will no longer be included.
singlelayer perception (abbreviated SLP)
has only one level of trainable weights
(fig. 5.1 on page 72).
A singlelayer perceptron (SLP) is a perceptron having only one layer of variable weights and one layer of output neurons Ω. The technical view of an SLP is shown in fig. 5.2.

[Diagram: a singlelayer perceptron with input neurons i1, …, i5 completely linked to output neurons Ω1, Ω2, Ω3.]
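A forward pass of the SLP of fig. 5.2 (input neurons i1, i2 plus the bias neuron, one binary threshold output neuron Ω) can be sketched as follows; the concrete weights are assumptions chosen so that the SLP recognizes the OR pattern:

```python
def slp_output(i1, i2, w1=1.0, w2=1.0, w_bias=-0.5):
    # the bias neuron constantly outputs 1; its weight w_BIAS,Omega
    # plays the role of the (negated) threshold value
    net = i1 * w1 + i2 * w2 + 1.0 * w_bias
    return 1 if net >= 0 else 0
```

Changing only the three weights changes which pattern the SLP recognizes – the whole behaviour of the network sits in its single trainable weight layer.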
the learning target, to automatically learn faster.

Suppose that we have a singlelayer perceptron with randomly set weights which we want to teach a function by means of training samples. The set of these training samples is called P. It contains, as already defined, the pairs (p, t) of the training samples p and the associated teaching input t. I also want to remind you that

. x is the input vector and
. y is the output vector of a neural network.

[…] the output of the network is approximately the desired output t, i.e. formally it is true that

∀p : y ≈ t or ∀p : Errp ≈ 0.

This means we first have to understand the total error Err as a function of the weights: the total error increases or decreases depending on how we change the weights.

Definition 5.5 (Error function). The error function

Err : W → R

∆W = −η∇Err(W).   (5.2)

…fore the definition of the error function Err(W), which results from the sum of the specific errors:
= −δp,Ω · ∂op,Ω / ∂wi,Ω   (5.13)

The second multiplicative factor of equation 5.11 and of the following one is the derivative of the output specific to the pattern p of the neuron Ω with respect to the weight wi,Ω. So how does op,Ω change when the weight from i to Ω is changed?

∆wi,Ω = η · oi · δΩ   (5.18)

However: From the very beginning, the derivation has been intended as an offline rule, by means of the question of how to add up the errors of all patterns and how to learn them after all patterns have been presented. Although this approach is mathematically correct, the implementation is far more time-consuming and, as we will see later in this chapter, partially
[Diagram: the input neurons i1 and i2 feed the output neuron Ω via the weights wi1,Ω and wi2,Ω – can this network represent XOR?]

Figure 5.7: Linear separation of n = 2 inputs of the input neurons i1 and i2 by a 1-dimensional straight line. A and B show the corners belonging to the sets of the XOR function that are to be separated.

netΩ = oi1 · wi1,Ω + oi2 · wi2,Ω ≥ ΘΩ   (5.21)

We assume a positive weight wi1,Ω; inequality 5.21 is then equivalent to

oi1 ≥ (1 / wi1,Ω) · (ΘΩ − oi2 · wi2,Ω)   (5.22)
[Diagram (fig. 5.9): the two input neurons are connected to a hidden neuron with threshold 1.5 and to the output neuron with threshold 0.5; the connection from the hidden neuron to the output neuron carries the weight −2.]

Figure 5.9: Neural network realizing the XOR function. Threshold values (as far as they exist) are located within the neurons.

5.3 A multilayer perceptron contains more trainable weight layers

A perceptron with two or more trainable weight layers (called a multilayer perceptron or MLP) is more powerful than an SLP. As we know, a singlelayer perceptron can divide the input space by means of a hyperplane (in a two-dimensional input space by means of a straight line). A two-stage perceptron (two trainable weight layers, three neuron layers) can classify convex polygons by further processing these straight lines, e.g. in the form "recognize patterns lying above straight line 1, below straight line 2 and below straight line 3". Thus, we – metaphorically speaking – took an SLP with several output neurons and "attached" another SLP (upper part of fig. 5.10 on the facing page). A multilayer perceptron represents a universal function approximator, which is proven by the theorem of Cybenko [Cyb89].

Another trainable weight layer proceeds analogously, now with the convex polygons: those can be added, subtracted or processed with other operations (lower part of fig. 5.10 on the next page).

Generally, it can be mathematically proven that even a multilayer perceptron with one layer of hidden neurons can arbitrarily precisely approximate functions with only finitely many discontinuities, as well as their first derivatives. Unfortunately, this proof is not constructive, and therefore it is left to us to find the correct number of neurons and weights.

In the following we want to use a widespread abbreviated form for different multilayer perceptrons: we denote a two-stage perceptron with 5 neurons in the input layer, 3 neurons in the hidden layer and 4 neurons in the output layer as a 5-3-4-MLP.

Definition 5.7 (Multilayer perceptron). Perceptrons with more than one layer of variably weighted connections are referred to as multilayer perceptrons (MLP). An n-layer or n-stage perceptron has thereby exactly n variable weight layers and n + 1 neuron layers (the retina is disregarded here), with neuron layer 1 being the input layer.

Since three-stage perceptrons can classify sets of any form by combining and sepa…
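The XOR network of fig. 5.9 can be verified directly with binary threshold neurons. The thresholds 1.5 (hidden neuron) and 0.5 (output neuron) and the inner weight −2 appear in the figure; the remaining connection weights are assumed here to be 1:

```python
def step(net, theta):
    # binary threshold activation function
    return 1 if net >= theta else 0

def xor_net(i1, i2):
    h = step(i1 + i2, 1.5)             # hidden neuron fires only for (1, 1)
    return step(i1 + i2 - 2 * h, 0.5)  # output neuron, shortcut inputs plus
                                       # the weight -2 from the hidden neuron
```

Note that this construction uses shortcut connections from the inputs to the output neuron; it is one concrete solution, not the only possible one.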
[Diagram (fig. 5.10): top, a two-stage perceptron with input neurons i1, i2, hidden neurons h1–h3 and output neuron Ω; bottom, a three-stage perceptron with additional hidden neurons h1–h6 and h7, h8 before the output neuron Ω.]

Figure 5.10: We know that an SLP represents a straight line. With 2 trainable weight layers, several straight lines can be combined to form convex polygons (above). By using 3 trainable weight layers several polygons can be formed into arbitrary sets (below).
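The n-stage notation can be made concrete with a small forward-pass sketch of a 5-3-4-MLP: three neuron layers, two trainable weight layers. The random weights and the tanh activation of the processing layers are assumptions for illustration:

```python
import math
import random

random.seed(0)

def layer(inputs, weights):
    # one completely linked weight layer followed by tanh activation
    return [math.tanh(sum(o * w for o, w in zip(inputs, row)))
            for row in weights]

def mlp_5_3_4(x):
    # weight layer 1: 5 inputs -> 3 hidden; weight layer 2: 3 -> 4 outputs
    w1 = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(3)]
    w2 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
    return layer(layer(x, w1), w2)

y = mlp_5_3_4([0.1, 0.2, 0.3, 0.4, 0.5])
```

The two nested `layer` calls correspond exactly to the two trainable weight layers of a two-stage perceptron.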
The derivative of the output with respect to the network input (the second factor in equation 5.27) clearly equals the derivative of the activation function with respect to the network input:

∂oh / ∂neth = f'act(neth)

The same applies for the first factor according to the definition of our δ:

−∂Err / ∂netl = δl   (5.34)

[Tree diagram (fig. 5.12): δh = −∂Err/∂neth splits via the chain rule into ∂oh/∂neth = f'act(neth) and −∂Err/∂oh = −Σ_{l∈L} (∂Err/∂netl)(∂netl/∂oh) = Σ_{l∈L} δl · wh,l, since ∂(Σ_{h∈H} wh,l · oh)/∂oh = wh,l.]

Figure 5.12: Graphical representation of the equations (by equal signs) and chain rule splittings (by arrows) in the framework of the backpropagation derivation. The leaves of the tree reflect the final results from the generalization of δ, which are framed in the derivation.
It is obvious that backpropagation initially processes the last weight layer directly by means of the teaching input and then works backwards from layer to layer while considering each preceding change in weights. Thus, the teaching input leaves traces in all weight layers. Here I describe the first part (the delta rule) and the second part of backpropagation (the delta rule generalized to more layers) in one go, which may meet the requirements of the matter but not of the research. The first part is obvious, which you will soon see in the framework of a mathematical gimmick. Decades of development time and work lie between the first and the second, recursive part. Like many groundbreaking inventions, it was not until its development that it was recognized how plausible this invention was.

Since we only use it for one-stage perceptrons, the second part of backpropagation (light-colored) is omitted without substitution. The result is:

∆wk,h = η ok δh  with  δh = f'act(neth) · (th − oh)   (5.42)

Furthermore, we only want to use linear activation functions, so that f'act (light-colored) is constant. As is generally known, constants can be combined, and therefore we directly merge the constant derivative f'act and the learning rate η (also light-colored, and constant for at least one learning cycle) into η. Thus, the result is:

∆wk,h = η ok δh = η ok · (th − oh)   (5.43)

This exactly corresponds to the delta rule definition.
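The delta rule of eq. 5.43 can be tried out on a toy problem. The target function t = 2·x and the single linear neuron below are assumptions for illustration:

```python
def train_delta_rule(samples, eta=0.1, epochs=200):
    # one linear neuron with a single weight w, trained online
    w = 0.0
    for _ in range(epochs):
        for x, t in samples:
            o = w * x                # linear activation function
            w += eta * x * (t - o)   # delta rule: eta * o_k * (t_h - o_h)
    return w

w = train_delta_rule([(1.0, 2.0), (2.0, 4.0), (0.5, 1.0)])
```

Since the teaching inputs were generated by t = 2·x, the learned weight converges towards 2.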
5.4.3 The selection of the learning rate has heavy influence on the learning process

In the meantime we have often seen that the change in weight is, in any case, proportional to the learning rate η. Thus, the selection of η is crucial for the behaviour of backpropagation and for learning procedures in general.

Definition 5.9 (Learning rate). Speed and accuracy of a learning procedure can always be controlled by, and are always proportional to, a learning rate which is written as η.

If the value of the chosen η is too large, the jumps on the error surface are also too large and, for example, narrow valleys could simply be jumped over. Additionally, the movements across the error surface would be very uncontrolled. Thus, a small η is desired, which, however, can cost a huge, often unacceptable amount of time. Experience shows that good learning rate values are in the range of

0.01 ≤ η ≤ 0.9.

The selection of η significantly depends on the problem, the network and the training data, so that it is barely possible to give practical advice. But for instance it is popular to start with a relatively large η, e.g. 0.9, and to slowly decrease it down to 0.1. For simpler problems η can often be kept constant.

5.4.3.1 Variation of the learning rate over time

During training, another stylistic device can be a variable learning rate: in the beginning, a large learning rate leads to good results, but later it results in inaccurate learning. A smaller learning rate is more time-consuming, but the result is more precise. Thus, during the learning process the learning rate needs to be decreased by one order of magnitude once or repeatedly.

A common error (which also seems to be a very neat solution at first glance) is to continually decrease the learning rate. Here it quickly happens that the descent of the learning rate is larger than the ascent of a hill of the error function we are climbing. The result is that we simply get stuck at this ascent. Solution: rather reduce the learning rate stepwise as mentioned above.

5.4.3.2 Different layers – different learning rates

The farther we move away from the output layer during the learning process, the slower backpropagation is learning. Thus, it is a good idea to select a larger learning rate for the weight layers close to the input layer than for the weight layers close to the output layer.
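The stepwise reduction recommended above can be sketched as a small schedule function; the epoch boundaries and the lower bound are assumptions for illustration:

```python
def step_decay(epoch, eta_start=0.9, drop_every=100, factor=0.1, eta_min=0.001):
    # reduce eta by one order of magnitude every `drop_every` epochs,
    # but never below eta_min (avoiding a continual decrease to zero)
    eta = eta_start * factor ** (epoch // drop_every)
    return max(eta, eta_min)
```

In contrast to a continually decreasing rate, this schedule keeps η constant between the discrete drops, so the descent of the learning rate cannot outrun the learning progress within a phase.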
hence we additionally reset the weight update for the weight wi,j at time step (t) to 0, so that it is not applied at all (not shown in the following formula).

However, if the sign remains the same, one can perform a (careful!) increase of ηi,j to get past shallow areas of the error function. Here we obtain our new ηi,j(t) by multiplying the old ηi,j(t − 1) with a constant η↑ which is greater than 1.

Definition 5.11 (Adaptation of learning rates in Rprop).

ηi,j(t) = η↑ · ηi,j(t − 1)  if g(t − 1) · g(t) > 0,
ηi,j(t) = η↓ · ηi,j(t − 1)  if g(t − 1) · g(t) < 0,
ηi,j(t) = ηi,j(t − 1)       otherwise.   (5.45)

Caution: this also implies that Rprop is exclusively designed for offline learning. If the gradients do not have a certain continuity, the learning process slows down to the lowest rates (and remains there). When learning online, one changes – loosely speaking – the error function with each new epoch, since it is based on only one training pattern. This may often be well applicable in backpropagation, and it is very often even faster than the offline version, which is why it is used there frequently. It lacks, however, a clear mathematical motivation, and that is exactly what we need here.

5.5.3 We are still missing a few details to use Rprop in practice

A few minor issues remain unanswered, namely

1. How large are η↑ and η↓ (i.e. how much are learning rates reinforced or weakened)?

2. How is ηi,j(0) chosen (i.e. how are the weight-specific learning rates initialized)?⁴

3. How are the upper and lower bounds ηmax and ηmin for ηi,j set?

We now answer these questions with a quick motivation. The initial value for the learning rates should be somewhere in the order of the initialization of the weights. ηi,j(0) = 0.1 has proven to be a good choice. The authors of the Rprop paper explain in an obvious way that this value – as long as it is positive and without an exorbitantly high absolute value – does not need to be dealt with very critically, as it will be quickly overridden by the automatic adaptation anyway.

Equally uncritical is ηmax, for which they recommend, without further mathematical justification, a value of 50, which is used throughout most of the literature. One can set this parameter to lower values in order to allow only very cautious updates. Small update steps should be allowed in any case, so we set ηmin = 10⁻⁶.

⁴ Pro tip: since the ηi,j can be changed only by multiplication, 0 would be a rather suboptimal initialization :-)
Now we have left only the parameters η ↑ SNIPE: In Snipe resilient backpropa-
and η ↓ . Let us start with η ↓ : If this value gation is supported via the method
is used, we have skipped a minimum, from trainResilientBackpropagation of the
which we do not know where exactly it lies class NeuralNetwork. Furthermore, you
can also use an additional improvement
on the skipped track. Analogous to the to resilient propagation, which is, however,
procedure of binary search, where the tar- not dealt with in this work. There are get-
get object is often skipped as well, we as- ters and setters for the different parameters
sume it was in the middle of the skipped of Rprop.
track. So we need to halve the learning
rate, which is why the canonical choice
η ↓ = 0.5 is being selected. If the value
of η ↑ is used, learning rates shall be in-
5.6 Backpropagation has
creased with caution. Here we cannot gen- often been extended and
eralize the principle of binary search and altered besides Rprop
simply use the value 2.0, otherwise the
learning rate update will end up consist-
ing almost exclusively of changes in direc- Backpropagation has often been extended.
tion. Independent of the particular prob- Many of these extensions can simply be im-
lems, a value of η ↑ = 1.2 has proven to plemented as optional features of backpro-
be promising. Slight changes of this value pagation in order to have a larger scope for
have not significantly affected the rate of testing. In the following I want to briefly
convergence. This fact allowed for setting describe some of them.
this value as a constant as well.
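To make the preceding parameter discussion concrete, here is a minimal sketch of the Rprop step-size adaptation in Python (not the Snipe implementation; all names and the single-weight framing are illustrative): the rate grows by η↑ = 1.2 while the gradient keeps its sign, is halved via η↓ = 0.5 when the sign flips, and is clamped to [ηmin, ηmax].

```python
# Illustrative Rprop step for one weight; eta_up/eta_down and the bounds
# follow the values discussed in the text.
ETA_UP, ETA_DOWN = 1.2, 0.5
ETA_MAX, ETA_MIN = 50.0, 1e-6

def rprop_step(eta, grad, prev_grad, weight):
    """Adapt one learning rate and update one weight for one time step."""
    if grad * prev_grad > 0:        # gradient kept its sign: grow cautiously
        eta = min(eta * ETA_UP, ETA_MAX)
    elif grad * prev_grad < 0:      # sign flipped, minimum skipped: halve
        eta = max(eta * ETA_DOWN, ETA_MIN)
    # Rprop uses only the sign of the gradient, not its magnitude:
    if grad > 0:
        weight -= eta
    elif grad < 0:
        weight += eta
    return eta, weight
```

For example, a weight whose gradient keeps its sign gets a step 1.2 times larger on the next epoch, while a sign change halves the step.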
∆wi,j(t) = η oi δj + α · ∆wi,j(t − 1)   (5.46)

We accelerate on plateaus (avoiding quasi-standstill on plateaus) and slow down on craggy surfaces (preventing oscillations). Moreover, the effect of inertia can be varied via the prefactor α; common values are between 0.6 and 0.9. Additionally, the momentum enables the positive effect that our skier swings back and forth several times in a minimum, and finally lands in the minimum. Despite its nice one-dimensional appearance, the otherwise very rare error of leaving good minima unfortunately occurs more frequently because of the momentum term – which means that this is again no optimal solution (but we are by now accustomed to this condition).

Figure 5.13: We want to execute the gradient descent like a skier crossing a slope, who would hardly stop immediately at the edge to the plateau.

5.6.2 Flat spot elimination prevents neurons from getting stuck

It must be pointed out that with the hyperbolic tangent as well as with the Fermi function, the derivative outside of the close proximity of Θ is nearly 0. This results in the fact that it becomes very difficult to move neurons away from the limits of the activation (flat spots), which could extremely extend the learning time. This problem can be dealt with by modifying the derivative, for example by adding a constant (e.g. 0.1), which is called flat spot elimination or – more colloquially – fudging.

It is an interesting observation that success has also been achieved by using derivatives defined as constants [Fah88]. A nice example making use of this effect is the fast hyperbolic tangent approximation by Anguita et al. introduced in section 3.2.6 on page 37. In the outer regions of its (as [...]
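The two modifications discussed above can be sketched in a few lines, assuming a single weight with a precomputed gradient term; the function names and constants below are illustrative only. momentum_step implements equation (5.46), and tanh_derivative_fudged lifts the derivative by a constant, as in flat spot elimination.

```python
import math

ALPHA = 0.9        # momentum prefactor; common values are 0.6 .. 0.9
FLAT_SPOT = 0.1    # constant added to the derivative ("fudging")

def momentum_step(eta, o_i, delta_j, prev_dw):
    """Weight change according to eq. (5.46): gradient term plus inertia."""
    return eta * o_i * delta_j + ALPHA * prev_dw

def tanh_derivative_fudged(x):
    """tanh'(x) = 1 - tanh(x)^2, lifted away from 0 in the outer regions
    so that neurons in flat spots still receive a learning impulse."""
    return 1.0 - math.tanh(x) ** 2 + FLAT_SPOT
```

Far from the threshold value the true derivative is nearly 0, but the fudged one never drops below 0.1, so the weight updates never vanish entirely.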
hence losing this neuron and some weights, and thereby reduce the possibility that the network will memorize. This procedure is called pruning.

Such a method to detect and delete unnecessary weights and neurons is referred to as optimal brain damage [lCDS90]. I only want to describe it briefly: the mean error per output neuron is composed of two competing terms. While one term, as usual, considers the difference between output and teaching input, the other one tries to "press" a weight towards 0. If a weight is strongly needed to minimize the error, the first term will win. If this is not the case, the second term will win. Neurons which only have zero weights can be pruned in the end.

There are many other variations of backprop, and there are whole books only about this subject, but since my aim is to offer an overview of neural networks, I just want to mention the variations above as a motivation to read on.

For some of these extensions it is obvious that they can be applied not only to feedforward networks with backpropagation learning procedures.

We have gotten to know backpropagation and the feedforward topology – now we have to learn how to build a neural network. It is of course impossible to fully communicate this experience in the framework of this work. To obtain at least some of this knowledge, I now advise you to deal with some of the exemplary problems from 4.6.

5.7 Getting started – Initial configuration of a multilayer perceptron

After having discussed the backpropagation of error learning procedure and knowing how to train an existing network, it would be useful to consider how to implement such a network.

5.7.1 Number of layers: Two or three may often do the job, but more are also used

Let us begin with the trivial circumstance that a network should have one layer of input neurons and one layer of output neurons, which results in at least two layers.

Additionally, we need – as we have already learned during the examination of linear separability – at least one hidden layer of neurons if our problem is not linearly separable (which is, as we have seen, very likely).

It is possible, as already mentioned, to mathematically prove that an MLP with one hidden neuron layer is already capable of approximating arbitrary functions with any accuracy (note: we have not indicated the number of neurons in the hidden layer; we only mentioned the hypothetical possibility) – but it is necessary not only to discuss the representability of a problem by means of a perceptron but also its learnability. Representability means that a perceptron can, in principle, realize
a mapping – but learnability means that we are also able to teach it.

In this respect, experience shows that two hidden neuron layers (or three trainable weight layers) can be very useful to solve a problem, since many problems can be represented by a hidden layer but are very difficult to learn.

One should keep in mind that any additional layer generates additional sub-minima of the error function in which we can get stuck. All these things considered, a promising way is to try it with one hidden layer at first and, if that fails, retry with two layers. Only if that fails should one consider more layers. However, given the increasing calculation power of current computers, deep networks with a lot of layers are also used with success.

5.7.2 The number of neurons has to be tested

The number of neurons (apart from the input and output layer, where the numbers of input and output neurons are already defined by the problem statement) principally corresponds to the number of free parameters of the problem to be represented.

Since we have already discussed the network capacity with respect to memorizing or a too imprecise problem representation, it is clear that our goal is to have as few free parameters as possible but as many as necessary.

But we also know that there is no standard solution for the question of how many neurons should be used. Thus, the most useful approach is to initially train with only a few neurons and to repeatedly train new networks with more neurons until the result significantly improves and, particularly, the generalization performance is not affected (bottom-up approach).

5.7.3 Selecting an activation function

Another very important parameter for the way a neural network processes information is the selection of an activation function. The activation function for input neurons is fixed to the identity function, since they do not process information.

The first question to be asked is whether we actually want to use the same activation function in the hidden layer and in the output layer – no one prevents us from choosing different functions. Generally, the activation function is the same for all hidden neurons, and likewise for all output neurons respectively.

For tasks of function approximation it has been found reasonable to use the hyperbolic tangent (left part of fig. 5.14 on page 102) as the activation function of the hidden neurons, while a linear activation function is used in the output. The latter is absolutely necessary so that we do not generate a limited output interval. Contrary to the input layer, which uses linear activation functions as well, the output layer still processes information, because it has
threshold values. However, linear activation functions in the output can also cause huge learning steps and jumping over good minima in the error surface. This can be avoided by setting the learning rate to very small values in the output layer.

An unlimited output interval is not essential for pattern recognition tasks6. If the hyperbolic tangent is used in any case, the output interval will be a bit larger. Unlike with the hyperbolic tangent, with the Fermi function (right part of fig. 5.14 on the following page) it is difficult to learn something far from the threshold value (where its result is close to 0). However, a lot of freedom is given here for selecting an activation function. But generally, the disadvantage of sigmoid functions is the fact that they hardly learn anything for values far from their threshold value, unless the network is modified.

[...] range of random values could be the interval [−0.5; 0.5], not including 0 or values very close to 0. This random initialization has a nice side effect: chances are that the average of network inputs is close to 0, a value that hits (in most activation functions) the region of the greatest derivative, allowing for strong learning impulses right from the start of learning.

SNIPE: In Snipe, weights are initialized randomly (if a synapse initialization is wanted). The maximum absolute weight value of a synapse initialized at random can be set in a NeuralNetworkDescriptor using the method setSynapseInitialRange.

5.8 The 8-3-8 encoding problem and related problems
Figure 5.14: As a reminder, the illustration of the hyperbolic tangent (left) and the Fermi function (right). The Fermi function was expanded by a temperature parameter; the original Fermi function is represented by dark colors, and the temperature parameters of the modified Fermi functions are, ordered ascending by steepness, 1/2, 1/5, 1/10 and 1/25.
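The combination recommended above for function approximation – hyperbolic tangent in the hidden layer, identity in the output layer – can be sketched as a small forward pass. This is a toy single-input network with invented weights, not Snipe code:

```python
import math

def forward(x, w_hidden, w_out):
    """One input neuron, one tanh hidden layer, one linear output neuron."""
    hidden = [math.tanh(w * x) for w in w_hidden]
    # Identity activation in the output: the output interval is unlimited.
    return sum(w * h for w, h in zip(w_out, hidden))
```

Because the output activation is the identity, the output is an arbitrary weighted sum of the bounded hidden activations, so no output interval limit is imposed.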
hidden neurons represents some kind of binary encoding and that the above mapping is possible (assumed training time: ≈ 10⁴ epochs). Thus, our network is a machine in which the input is first encoded and afterwards decoded again.

Analogously, we can train a 1024-10-1024 encoding problem. But is it possible to improve the efficiency of this procedure? Could there be, for example, a 1024-9-1024- or an 8-2-8-encoding network?

Yes, even that is possible, since the network does not depend on binary encodings: thus, an 8-2-8 network is sufficient for our problem. But the encoding of the network is far more difficult to understand (fig. 5.15 on the next page), and the training of the networks requires a lot more time.

SNIPE: The static method getEncoderSampleLesson in the class TrainingSampleLesson allows for creating simple training sample lessons of arbitrary dimensionality for encoder problems like the above.

An 8-1-8 network, however, does not work, since the possibility that the output of one neuron is compensated by another one is essential, and if there is only one hidden neuron, there is certainly no compensatory neuron.

Exercises

Exercise 8. Fig. 5.4 on page 75 shows a small network for the boolean functions AND and OR. Write tables with all computational parameters of neural networks (e.g. network input, activation etc.). Perform the calculations for the four possible inputs of the networks and write down the values of these variables for each input. Do the same for the XOR network (fig. 5.9 on page 84).

Exercise 9.

1. List all boolean functions B³ → B¹ that are linearly separable and characterize them exactly.

2. List those that are not linearly separable and characterize them exactly, too.

P = {(0, 0, −1);
(2, −1, 1);
(7 + ε, 3 − ε, 1);
(7 − ε, 3 + ε, −1);
(0, −2 − ε, 1);
(0 − ε, −2, −1)}
Chapter 6 Radial basis functions
function which calculates and outputs the activation of the neuron.

Definition 6.1 (RBF input neuron). Definition and representation are identical to definition 5.1 on page 73 of the input neuron.

Definition 6.2 (Center of an RBF neuron). The center ch of an RBF neuron h is the point in the input space where the RBF neuron is located. In general, the closer the input vector is to the center vector of an RBF neuron, the higher is its activation.

Definition 6.3 (RBF neuron). The so-called RBF neurons h have a propagation function fprop that determines the distance between the center ch of a neuron and the input vector y. This distance represents the network input. Then the network input is sent through a radial basis function fact which returns the activation or the output of the neuron. RBF neurons are represented by a circle symbol labeled ||c,x|| and Gauß.

[...] RBF output neurons. Each layer is completely linked with the following one; shortcuts do not exist (fig. 6.1 on the next page) – it is a feedforward topology. The connections between input layer and RBF layer are unweighted, i.e. they only transmit the input. The connections between RBF layer and output layer are weighted. The original definition of an RBF network only referred to an output neuron, but – in analogy to the perceptrons – it is apparent that such a definition can be generalized. A bias neuron is not used in RBF networks. The set of input neurons shall be represented by I, the set of hidden neurons by H and the set of output neurons by O.

Therefore, the inner neurons are called radial basis neurons, because from their definition it follows directly that all input vectors with the same distance from the center of a neuron also produce the same output value (fig. 6.2 on page 108).
Figure 6.1: An exemplary RBF network with two input neurons, five hidden neurons and three output neurons. The connections to the hidden neurons are not weighted; they only transmit the input. Right of the illustration you can find the names of the neurons, which coincide with the names of the MLP neurons: input neurons are called i, hidden neurons are called h and output neurons are called Ω. The associated sets are referred to as I, H and O.
basically, in relation to the whole input space, Gaussian bells are added here.

Suppose that we have a second, a third and a fourth RBF neuron and therefore four differently located centers. Each of these neurons now measures another distance from the input to its own center and de facto provides different values, even if the Gaussian bell is the same. Since these values are finally simply accumulated in the output layer, one can easily see that any surface can be shaped by dragging [...]

[...] (fig. 6.4 on the facing page). Additionally, the network includes the centers c1, c2, ..., c4 of the four inner neurons h1, h2, ..., h4, and therefore it has Gaussian bells which are finally added within the output neuron Ω. The network also possesses four values σ1, σ2, ..., σ4 which influence the width of the Gaussian bells. On the contrary, the height of the Gaussian bell is influenced by the subsequent weights, since the individual output values of the bells are multiplied by those weights.
Figure 6.3: Two individual one- and two-dimensional Gaussian bells. In both cases σ = 0.4 holds, and the centers of the Gaussian bells lie at the coordinate origin. The distance r to the center (0, 0) is simply calculated according to the Pythagorean theorem: r = √(x² + y²).
Figure 6.4: Four different Gaussian bells in one-dimensional space generated by means of RBF neurons are added by an output neuron of the RBF network. The Gaussian bells have different heights, widths and positions. Their centers c1, c2, ..., c4 are located at 0, 1, 3, 4, and the widths σ1, σ2, ..., σ4 are 0.4, 1, 0.2, 0.8. You can see a two-dimensional example in fig. 6.5 on the following page.
Figure 6.5: Four different Gaussian bells in two-dimensional space generated by means of RBF neurons are added by an output neuron of the RBF network. Once again r = √(x² + y²) applies for the distance. The heights w, widths σ and centers c = (x, y) are: w1 = 1, σ1 = 0.4, c1 = (0.5, 0.5); w2 = −1, σ2 = 0.6, c2 = (1.15, −1.15); w3 = 1.5, σ3 = 0.2, c3 = (−0.5, −1); w4 = 0.8, σ4 = 1.4, c4 = (−2, 0).
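The accumulation of weighted Gaussian bells shown in figs. 6.4 and 6.5 can be sketched directly; this is an illustrative helper, not part of any library, with r the Euclidean distance to each center:

```python
import math

def rbf_output(x, centers, sigmas, weights):
    """Sum of weighted Gaussian bells: each hidden neuron contributes
    w * exp(-r^2 / (2 sigma^2)) with r = ||x - c||."""
    y = 0.0
    for c, sigma, w in zip(centers, sigmas, weights):
        r = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))
        y += w * math.exp(-r ** 2 / (2 * sigma ** 2))
    return y
```

At a center itself the bell contributes exactly its weight (r = 0, so the exponential is 1), which is why the subsequent weights control the heights of the bells.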
Since we use a norm to calculate the distance between the input vector and the center of a neuron h, we have different choices. Often the Euclidean norm is chosen to calculate the distance:

rh = ||x − ch||   (6.1)
   = √( Σi∈I (xi − ch,i)² )   (6.2)

Remember: the input vector was referred to as x. Here, the index i runs through the input neurons and thereby through the input vector components and the neuron center components. As we can see, the Euclidean distance generates the squared differences of all vector components, adds them and extracts the root of the sum. In two-dimensional space this corresponds to the Pythagorean theorem. From the definition of a norm it directly follows that the distance can only be positive. Strictly speaking, we hence only use the positive part of the activation function. By the way, activation functions other than the Gaussian bell are possible. Normally, functions that are monotonically decreasing over the interval [0; ∞] are chosen.

Now that we know the distance rh between the input vector x and the center ch of the RBF neuron h, this distance has to be passed through the activation function. Here we use, as already mentioned, a Gaussian bell:

fact(rh) = e^(−rh² / (2σh²))   (6.3)

[...] activation function fact, and hence the activation functions should not all be referred to as fact simultaneously. One solution would be to number the activation functions, like fact1, fact2, ..., fact|H|, with H being the set of hidden neurons. But as a result the explanation would be very confusing. So I simply use the name fact for all activation functions and regard σ and c as variables that are defined for individual neurons but not directly included in the activation function.

The reader will certainly notice that in the literature the Gaussian bell is often normalized by a multiplicative factor. We can, however, avoid this factor because we are multiplying anyway with the subsequent weights, and consecutive multiplications, first by a normalization factor and then by the connections' weights, would only yield different factors there. We do not need this factor (especially because, for our purpose, the integral of the Gaussian bell need not always be 1) and therefore simply leave it out.

6.2.2 Some analytical thoughts prior to the training

The output yΩ of an RBF output neuron Ω results from combining the functions of an RBF neuron to

yΩ = Σh∈H wh,Ω · fact(||x − ch||) .   (6.4)

Suppose that, similar to the multilayer perceptron, we have a set P that contains |P| [...]

It is obvious that both the center ch and the width σh can be seen as part of the
weight layer – which requires less computing time.

We know that the delta rule is

∆wh,Ω = η · δΩ · oh ,   (6.17)

in which we now insert as follows:

∆wh,Ω = η · (tΩ − yΩ) · fact(||p − ch||)   (6.18)

6.3.1 It is not always trivial to determine centers and widths of RBF neurons

It is obvious that the approximation accuracy of RBF networks can be increased by adapting the widths and positions of the Gaussian bells in the input space to the problem that needs to be approximated. There are several methods to deal with the centers c and the widths σ of the Gaussian bells:
[...] centers can be selected so that the Gaussian bells overlap by approx. "one third"2 (fig. 6.6). The closer the bells are set, the more precise, but also the more time-consuming, the whole thing becomes. This may seem to be very inelegant, but in the field of function approximation we cannot avoid even coverage. Here it is useless if the function to be approximated is precisely represented at some positions but at other positions the return value is only 0. However, a high input dimension requires a great many RBF neurons, which increases the computational effort exponentially with the dimension – and is [...]

2 It is apparent that a Gaussian bell is mathematically infinitely wide; therefore I ask the reader to excuse this sloppy formulation.

A more trivial alternative would be to set |H| centers on positions randomly selected from the set of patterns. So this method would allow for every training pattern p to be directly in the center of a neuron (fig. 6.8 on the next page). This is not yet very elegant, but a good solution when time is an issue. Generally, for this method the widths are selected as fixed.

If we have reason to believe that the set of training samples is clustered, we can use clustering methods to determine them. There are different methods to determine clusters in an arbitrarily dimensional set of points. We will be introduced to some of them in excursus A. One neural clustering method is given by the so-called ROLFs (section A.5), and self-organizing maps are [...]

In a similar manner we could look at how the error depends on the values σ. Analogous to the derivation of backpropagation we derive [...]

In the following text, only simple mechanisms are sketched. For more information, I refer to [Fri94].
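The training of the output weights by the delta rule, equations (6.17) and (6.18), can be sketched for a single weight as follows (illustrative names only; p is one training pattern, t and y are the teaching input and the current output):

```python
import math

def delta_rule_step(w, eta, t, y, p, center, sigma):
    """One delta-rule update of an RBF output weight, cf. eq. (6.18):
    delta_w = eta * (t - y) * f_act(||p - c_h||)."""
    r = math.sqrt(sum((pi - ci) ** 2 for pi, ci in zip(p, center)))
    activation = math.exp(-r ** 2 / (2 * sigma ** 2))
    return w + eta * (t - y) * activation
```

Note that only the output weights are adapted here; the centers and widths stay fixed, matching the simple training scheme described above.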
RBF networks generate very craggy error surfaces because, if we considerably change a c or a σ, we will significantly change the appearance of the error function.

6.4 Growing RBF networks automatically adjust the neuron density

In growing RBF networks, the number |H| of RBF neurons is not constant. A certain number |H| of neurons, as well as their centers ch and widths σh, are previously selected (e.g. by means of a clustering method) and then extended or reduced.

[...] the neurons will only influence each other if the distance between them is short. But if the σ are large, the already existing neurons are considerably influenced by the new neuron because of the overlapping of the Gaussian bells.

So it is obvious that we will adjust the already existing RBF neurons when adding the new neuron. To put it simply, this adjustment is made by moving the centers c of the other neurons away from the new neuron and reducing their width σ a bit. Then the current output vector y of the network is compared to the teaching input t and the weight vector G is improved by means of training. Subsequently, a new neuron can be inserted if necessary. This method is particularly suited for function approximations.

[...] two paradigms and look at their advantages and disadvantages.
Chapter 7 Recurrent perceptron-like networks (depends on chapter 5)
Figure 7.2: Illustration of a Jordan network. The network output is buffered in the context neurons and with the next time step it is entered into the network together with the new input.
with one context neuron per output neuron. The set of context neurons is called K. The context neurons are completely linked toward the input layer of the network.

7.2 Elman networks

The Elman networks (a variation of the Jordan networks) [Elm90] have context neurons, too, but one layer of context neurons per information-processing neuron layer (fig. 7.3 on the following page). Thus, the outputs of each hidden neuron or output neuron are led into the associated context layer (again exactly one context neuron per neuron), and from there they are reentered into the complete neuron layer during the next time step (i.e. again a complete link on the way back). So the complete information processing part of the MLP (remember: the input layer does not process information) exists a second time as a "context version" – which once again considerably increases dynamics and state variety.

Compared with Jordan networks, the Elman networks often have the advantage of acting more purposefully, since every layer can access its own context.

Definition 7.3 (Elman network). An Elman network is an MLP with one context neuron per information-processing neuron. The set of context neurons is called K. This means that there exists one context layer per information-processing
Figure 7.3: Illustration of an Elman network. The entire information processing part of the network exists, in a way, twice. The output of each neuron (except for the output of the input neurons) is buffered and reentered into the associated layer. For reasons of clarity I named the context neurons on the basis of their models in the actual network, but it is not mandatory to do so.
neuron layer with exactly the same number of context neurons. Every neuron has a weighted connection to exactly one context neuron, while the context layer is completely linked towards its original layer.

Now it is interesting to take a look at the training of recurrent networks since, for instance, ordinary backpropagation of error cannot work on recurrent networks. Once again, the style of the following part is rather informal, which means that I will not use any formal definitions.

7.3 Training recurrent networks

In order to explain the training as comprehensibly as possible, we have to agree on some simplifications that do not affect the learning principle itself.

So for the training let us assume that in the beginning the context neurons are initiated with an input, since otherwise they would have an undefined input (this is no simplification but reality).

Furthermore, we use a Jordan network without a hidden neuron layer for our training attempts, so that the output neurons can directly provide input. This approach is a strong simplification, because generally more complicated networks are used. But this does not change the learning principle.

[...] but forward-oriented network without recurrences. This enables training a recurrent network with any training strategy developed for non-recurrent ones. Here, the input is entered as teaching input into every "copy" of the input neurons. This [...]
Figure 7.4: Illustration of the unfolding in time with a small exemplary recurrent MLP. Top: the recurrent MLP. Bottom: the unfolded network. For reasons of clarity, I only added names to the lowest part of the unfolded network. Dotted arrows leading into the network mark the inputs; dotted arrows leading out of the network mark the outputs. Each "network copy" represents a time step of the network, with the most recent time step being at the bottom.
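The unfolding in time illustrated in fig. 7.4 can be sketched generically: a recurrent step function is applied once per time step, i.e. once per "network copy", with the state handed from copy to copy. This is an illustrative helper, not code from the text:

```python
def unfold(step, inputs, state):
    """Run a recurrent step function over a sequence, i.e. the unfolded
    network: each element of `inputs` is fed into its own 'copy' of the
    input neurons, and the state is passed from copy to copy."""
    outputs = []
    for x in inputs:
        state = step(x, state)
        outputs.append(state)
    return outputs
```

With the recurrence removed in this way, any training strategy for non-recurrent networks can in principle be applied to the unfolded pass.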
nested computations (the farther we are from the output layer, the smaller the influence of backpropagation, so that this limit is reached). Furthermore, with several levels of context neurons this procedure could produce very large networks to be trained.

7.3.2 Teacher forcing

[...] are chosen suitably: so, for example, neurons and weights can be adjusted and the network topology can be optimized (of course the result of learning is not necessarily a Jordan or Elman network). With ordinary MLPs, however, evolutionary strategies are less popular, since they certainly need a lot more time than a directed learning procedure such as backpropagation.
Chapter 8 Hopfield networks

Another supervised learning example from the wide range of neural networks was developed by John Hopfield: the so-called Hopfield networks [Hop82]. Hopfield and his physically motivated networks have contributed a lot to the renaissance of neural networks.

8.1 Hopfield networks are inspired by particles in a magnetic field

The idea for the Hopfield networks originated from the behavior of particles in a magnetic field: every particle "communicates" (by means of magnetic forces) with every other particle (completely linked), with each particle trying to reach an energetically favorable state (i.e. a minimum of the energy function). As for the neurons, this state is known as activation. Thus, all particles or neurons rotate and thereby encourage each other to continue this rotation. As a manner of speaking, our neural network is a cloud of particles.

Based on the fact that the particles automatically detect the minima of the energy function, Hopfield had the idea to use the "spin" of the particles to process information: why not let the particles search for minima on arbitrary functions? Even if we only use two of those spins, i.e. a binary activation, we will recognize that the developed Hopfield network shows considerable dynamics.

8.2 In a Hopfield network, all neurons influence each other symmetrically

Briefly speaking, a Hopfield network consists of a set K of completely linked neurons with binary activation (since we only
use two spins), with the weights being symmetric between the individual neurons and without any neuron being directly connected to itself (fig. 8.1). Thus, the state of |K| neurons with two possible states ∈ {−1, 1} can be described by a string x ∈ {−1, 1}^|K|.

The complete link provides a full square matrix of weights between the neurons. The meaning of the weights will be discussed in the following. Furthermore, we will soon recognize according to which rules the neurons are spinning, i.e. changing their states.

Additionally, the complete link leads to the fact that we do not know any input, output or hidden neurons. Thus, we have to think about how we can input something into the |K| neurons.

We have learned that a network, i.e. a set of |K| particles, that is in a state is automatically looking for a minimum. An input pattern of a Hopfield network is exactly such a state: a binary string x ∈ {−1, 1}^|K| that initializes the neurons. Then the network is looking for the minimum to be taken (which we have previously defined by the input of training samples) on its energy surface.

But when do we know that the minimum has been found? This is simple, too: when the network stops. It can be proven that a Hopfield network always converges [CG88], i.e. at some point it will stand still. Then the output is a binary string y ∈ {−1, 1}^|K|, namely the state string of the network that has found a minimum.
Definition 8.3 (Input and output of a Hopfield network). The input of a Hopfield network is a binary string x ∈ {−1, 1}^|K| that initializes the state of the network. After the convergence of the network, the output is the binary string y ∈ {−1, 1}^|K| generated from the new network state.

Now let us take a closer look at the contents of the weight matrix and the rules for the state change of the neurons.

8.2.2 Significance of weights

We have already said that the neurons change their states, i.e. their direction, from −1 to 1 or vice versa. These spins occur dependent on the current states of the other neurons and the associated weights. Zero weights lead to the two involved neurons not influencing each other. The weights as a whole apparently take the way from the current state of the network towards the next minimum of the energy function. We now want to discuss how the neurons follow this way.

8.2.3 A neuron changes its state according to the influence of the other neurons

Once a network has been trained and initialized with some starting state, the change of state x_k of the individual neurons k occurs according to the scheme
other neurons and the associated weights. xk (t) = fact wj,k · xj (t − 1) (8.1)
X
−0.5
generated directly out of
−1 the training patterns
−4 −2 0 2 4
x
wi,j =
(8.2)
X
pi · pj
xk (t) = fact wj,k · xj (t − 1) .
X
p∈P
j∈J
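The training rule (8.2) and the state-change rule (8.1) can be sketched in a few lines. This is a minimal illustration assuming NumPy; the function names and the asynchronous update order are my own choices, not prescribed by the text:

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian training (8.2): w_ij = sum over patterns p of p_i * p_j."""
    n = len(patterns[0])
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)  # no neuron is directly connected to itself
    return W

def recall(W, x, max_steps=100):
    """Apply (8.1) neuron by neuron until the state stops changing.

    f_act is taken as a hard threshold mapping to {-1, 1}; ties (input 0)
    are mapped to +1 here, which is one of several possible conventions.
    """
    x = x.copy()
    for _ in range(max_steps):
        changed = False
        for k in range(len(x)):
            s = W[:, k] @ x          # sum_j w_jk * x_j
            new = 1 if s >= 0 else -1
            if new != x[k]:
                x[k] = new
                changed = True
        if not changed:              # network has converged: a minimum was found
            break
    return x
```

Storing a single pattern and presenting a slightly distorted version of it makes the network fall back into the stored state, which is exactly the minimum-seeking behavior described above.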
This results in the weight matrix W. Colloquially speaking: We initialize the network by means of a training pattern and then process the weights w_{i,j} one after another. For each of these weights we verify: Are the neurons i, j in the same state or do the states vary? In the first case we add 1 to the weight, in the second case we add −1. This we repeat for each training pattern p ∈ P. Finally, the values of the weights w_{i,j} are high when i and j corresponded in many training patterns. Colloquially speaking, this high value tells the neurons: "Often, it is energetically favorable to hold the same state". The same applies to negative weights.

Now we know how the weights influence the changes in the states of the neurons and force the entire network towards a minimum. Due to this training we can store a certain fixed number of patterns p in the weight matrix. At an input x the network will converge to the stored pattern that is closest to the input p. The capacity is limited, however: only about 0.139 · |K| training samples can be trained and at the same time maintain their function.

Now we know the functionality of Hopfield networks but nothing about their practical use.

8.4 Autoassociation and traditional application

Hopfield networks, like those mentioned above, are called autoassociators. An autoassociator a exactly shows the aforementioned behavior: When a known pattern p is entered, exactly this known pattern is returned. Thus, even for an input p + ε that only resembles p, the stored pattern p is returned.
Definition 8.6 (Learning rule for the heteroassociative matrix). For two training samples p being predecessor and q being successor of a heteroassociative transition, the weights of the heteroassociative matrix V result from the learning rule

v_{i,j} = Σ_{p,q∈P, p≠q} p_i · q_j .

Heteroassociations connected in series of the form

h(p + ε) = q
h(q + ε) = r
h(r + ε) = s
  ⋮
h(z + ε) = p

can provoke that the network cycles through the stored patterns: it enters a pattern, stays
there for a while, goes on to the next pattern, and so on. One could, for example, ask such a network: Which letter in the alphabet follows the letter P?
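The heteroassociative learning rule of Definition 8.6 can be sketched as follows. This assumes NumPy; `hetero_matrix` and `hetero_recall` are hypothetical names of my own, and the recall step simply applies the state-change rule with V in place of W:

```python
import numpy as np

def hetero_matrix(transitions):
    """Definition 8.6: v_ij = sum of p_i * q_j over (predecessor p, successor q) pairs."""
    n = len(transitions[0][0])
    V = np.zeros((n, n))
    for p, q in transitions:
        V += np.outer(p, q)
    return V

def hetero_recall(V, x):
    """Presenting a stored predecessor yields its successor (hard threshold to {-1, 1})."""
    return np.where(V.T @ x >= 0, 1, -1)
```

With a single stored transition (p, q), entering p returns q, i.e. the matrix encodes "what comes next" rather than "what this is".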
Exercises
Chapter 9
Learning vector quantization

First of all, I want to mention that there are different variations of LVQ, which will be named but not described in detail. The goal of this chapter is rather to analyze the underlying principle.

A discrete space consists of elements that are clearly separated from each other, i.e. not connected. The natural numbers are exactly such elements, because the natural numbers do not include, for example, numbers between 1 and 2. On the other hand, the set of real numbers R, for instance, is continuous: It does not matter how close two selected numbers are, there will always be a number between them.
Quantization means that a continuous space is divided into discrete sections: By deleting, for example, all decimal places of the real number 2.71828, it could be assigned to the natural number 2. Here it is obvious that any other number having a 2 in front of the comma would also be assigned to the natural number 2, i.e. 2 would be some kind of representative for all real numbers within the interval [2; 3).

It must be noted that a space can be irregularly quantized, too: For instance, the timeline for a week could be quantized into working days and weekend.

A special case of quantization is digitization: In case of digitization we always talk about regular quantization of a continuous space into a number system with respect to a certain basis. If we enter, for example, some numbers into the computer, these numbers will be digitized into the binary system (basis 2).

Definition 9.1 (Quantization). Separation of a continuous space into discrete sections.

In learning vector quantization, a set of codebook vectors is used to divide the input space into classes that reflect the input space as well as possible (fig. 9.1 on the facing page). Thus, each element of the input space should be assigned to a vector as a representative, i.e. to a class, where the set of these representatives should represent the entire input space as precisely as possible. Such a vector is called a codebook vector. A codebook vector is the representative of exactly those input space vectors lying closest to it, which divides the input space into the said discrete areas.

It is to be emphasized that we have to know in advance how many classes we have and which training sample belongs to which class. Furthermore, it is important that the classes need not be disjoint, which means they may overlap.

Such separation of data into classes is interesting for many problems for which it is useful to explore only some characteristic representatives instead of the possibly huge set of all vectors – be it because it is less time-consuming or because it is sufficiently precise.
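The regular quantization described above can be written down in one line. A minimal sketch; the function name and the `step` parameter are illustrative choices of mine:

```python
import math

def quantize(value, step=1.0):
    """Regular quantization: every value in [n*step, (n+1)*step) gets representative n*step."""
    return math.floor(value / step) * step

# 2.71828 is assigned the representative 2.0, like every number in [2, 3).
```

A finer grid simply means a smaller `step`, e.g. `quantize(2.71828, 0.5)` falls into the section starting at 2.5.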
Figure 9.1: Examples for quantization of a two-dimensional input space. The lines represent the class limits, the × mark the codebook vectors.

The set of codebook vectors generates a Voronoi diagram out of the input space. Since each codebook vector can clearly be associated to a class, each input vector is associated to a class, too. Learning procedures are used to cause a previously defined number of randomly initialized codebook vectors to reflect the training data as precisely as possible.
therefore contain the training input vector p and its class affiliation c. For the class affiliation,

c ∈ {1, 2, . . . , |C|}    (9.1)

holds, which means that it clearly assigns the training sample to a class or a codebook vector.

Intuitively, we could say about learning: "Why a learning procedure? We calculate the average of all class members and place their codebook vectors there – and that's it." But we will soon see that our learning procedure can do a lot more.

I only want to briefly discuss the steps of the fundamental LVQ learning procedure:

Initialization: We place our set of codebook vectors on random positions in the input space.

Training sample: A training sample p of our training set P is selected and presented.

Distance measurement: We measure the distance ||p − C|| between all codebook vectors C1, C2, . . . , C|C| and our input p.

Winner: The closest codebook vector wins, i.e. the one with

min_{Ci ∈ C} ||p − Ci||.

Learning process: The learning process takes place according to the rule

∆Ci = η(t) · h(p, Ci) · (p − Ci)
Ci(t + 1) = Ci(t) + ∆Ci,    (9.2)

which we now want to break down.

- We have already seen that the first factor η(t) is a time-dependent learning rate allowing us to differentiate between large learning steps and fine tuning.

- The last factor (p − Ci) is obviously the direction toward which the codebook vector is moved.

- But the function h(p, Ci) is the core of the rule: It implements a distinction of cases.

Assignment is correct: The winner vector is the codebook vector of the class that includes p. In this case, the function provides positive values and the codebook vector moves towards p.

Assignment is wrong: The winner vector does not represent the class that includes p. Therefore it moves away from p.

We can see that our definition of the function h was not precise enough. With good reason: From here on, the LVQ is divided into different nuances, dependent on how exactly h and the learning rate should be defined (called LVQ1, LVQ2, LVQ3, …).
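The steps above combine into a single learning iteration. A minimal sketch assuming NumPy; here h is reduced to the simple ±1 case distinction described above (roughly in the spirit of LVQ1), which the text deliberately leaves open:

```python
import numpy as np

def lvq_step(codebooks, classes, p, p_class, eta):
    """One LVQ learning step: move the winner towards p if its class is
    correct, away from p otherwise.

    codebooks: array of codebook vectors C_i; classes: their class labels.
    """
    dists = np.linalg.norm(codebooks - p, axis=1)   # distance measurement
    i = int(np.argmin(dists))                       # winner: min ||p - C_i||
    h = 1.0 if classes[i] == p_class else -1.0      # case distinction via h
    codebooks[i] += eta * h * (p - codebooks[i])    # Delta C_i = eta * h * (p - C_i)
    return i
```

Running this over many randomly drawn training samples, with η(t) decreasing over time, makes the codebook vectors settle into their classes.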
Chapter 10
Self-organizing feature maps
A paradigm of unsupervised learning neural networks, which maps an input space by its fixed topology and thus independently looks for similarities.
Function, learning procedure, variations and neural gas.
sense at all, too. Our brain responds to external input by changes in state. These are, so to speak, its output.

Based on this principle and exploring the question of how biological neural networks organize themselves, Teuvo Kohonen developed in the Eighties his self-organizing feature maps [Koh82, Koh98], shortly referred to as self-organizing maps or SOMs – a paradigm of neural networks where the output is the state of the network, which learns completely unsupervised, i.e. without a teacher.

With a SOM, it is less important what the neurons output than which neuron becomes activated. In other words: We are not interested in the exact output of the neuron but in knowing which neuron provides output. Thus, SOMs are considerably more related to biology than, for example, the feedforward networks, which are increasingly used for calculations.

10.1 Structure of a self-organizing map

Typically, SOMs have – like our brain – the task to map a high-dimensional input (N dimensions) onto areas in a low-dimensional grid of cells (G dimensions), to draw a map of the high-dimensional space, so to speak. To generate this map, the SOM simply obtains arbitrarily many points of the input space. During the input of the points, the SOM will try to cover as well as possible the positions on which the points appear by its neurons. This particularly means that every neuron can be assigned to a certain position in the input space.
is closest to the input pattern in the input space. The dimension of the input space is referred to as N.

Definition 10.3 (Topology). The neurons are interconnected by neighborhood relationships. These neighborhood relationships are called topology. The training of a SOM is highly influenced by the topology. It is defined by the topology function h(i, k, t), where i is the winner neuron (we will learn soon what a winner neuron is), k the neuron to be adapted (which will be discussed later) and t the timestep. The dimension of the topology is referred to as G.

10.2 SOMs always activate the neuron with the least distance to an input pattern

Like many other neural networks, the SOM has to be trained before it can be used. But let us regard the very simple functionality of a complete self-organizing map before training, since there are many analogies to the training. Functionality consists of the following steps:

Input of an arbitrary value p of the input space R^N.

Calculation of the distance between every neuron k and p by means of a norm, i.e. calculation of ||p − c_k||.

One neuron becomes active, namely such neuron i with the shortest calculated distance to the input. All other neurons remain inactive. This paradigm of activity is also called the winner-takes-all scheme. The output we expect due to the input of a SOM shows which neuron becomes active.

In many literature citations, the description of SOMs is more formal: Often an input layer is described that is completely linked towards an SOM layer. Then the input layer (N neurons) forwards all inputs to the SOM layer. The SOM layer is laterally linked in itself so that a winner neuron can be established and inhibit the other neurons. I think that this explanation of a SOM is not very descriptive and therefore I tried to provide a clearer description of the network structure.

Now the question is which neuron is activated by which input – and the answer is given by the network itself during training.

10.3 Training

[Training makes the SOM topology cover the input space] The training of a SOM is nearly as straightforward as the functionality described above. Basically, it is structured into five steps, which partially correspond to those of the functionality.

Initialization: The network starts with random neuron centers c_k ∈ R^N from the input space.

Creating an input pattern: A stimulus, i.e. a point p, is selected from the
input space R^N. Now this stimulus is entered into the network.

Distance measurement: Then the distance ||p − c_k|| is determined for every neuron k in the network.

Winner takes all: The winner neuron i is determined, which has the smallest distance to p, i.e. which fulfills the condition

||p − c_i|| ≤ ||p − c_k|| ∀ k ≠ i.

You can see that from several winner neurons one can be selected at will.

Adapting the centers: The neuron centers are moved within the input space according to the rule²

∆c_k = η(t) · h(i, k, t) · (p − c_k),

where the values ∆c_k are simply added to the existing centers. The last factor shows that the change in position of the neurons k is proportional to the distance to the input pattern p and, as usual, to a time-dependent learning rate η(t). The above-mentioned network topology exerts its influence by means of the function h(i, k, t), which will be discussed in the following.

Definition 10.4 (SOM learning rule). A SOM is trained by presenting an input pattern and determining the associated winner neuron. The winner neuron and its neighbor neurons, which are defined by the topology function, then adapt their centers according to the rule

∆c_k = η(t) · h(i, k, t) · (p − c_k),    (10.1)
c_k(t + 1) = c_k(t) + ∆c_k(t).    (10.2)

10.3.1 The topology function defines how a learning neuron influences its neighbors

The topology function h is not defined on the input space but on the grid, and represents the neighborhood relationships between the neurons, i.e. the topology of the network. It can be time-dependent (which it often is) – which explains the parameter t. The parameter k is the index running through all neurons, and the parameter i is the index of the winner neuron.

In principle, the function shall take a large value if k is the neighbor of the winner neuron or even the winner neuron itself, and small values if not. More precisely: The topology function must be unimodal, i.e. it must have exactly one maximum. This maximum must be next to the winner neuron i, for which the distance to itself certainly is 0. Additionally, the time-dependence enables us, for example, to reduce the neighborhood in the course of time.

² Note: In many sources this rule is written ηh(p − c_k), which wrongly leads the reader to believe that h is a constant. This problem can easily be solved by not omitting the multiplication dots ·.
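Equations (10.1) and (10.2), together with a Gaussian topology function, combine into a single training step. A sketch assuming NumPy, with grid positions g and input-space centers c kept as separate arrays; the function names are mine:

```python
import numpy as np

def gauss_h(grid, i, k, sigma):
    """Gaussian-bell topology function over the grid distance ||g_i - g_k||."""
    d = np.linalg.norm(grid[i] - grid[k])
    return np.exp(-d**2 / (2 * sigma**2))

def som_step(centers, grid, p, eta, sigma):
    """One SOM training step: find the winner, then move the winner and its
    grid neighbors towards p according to Delta c_k = eta * h(i,k,t) * (p - c_k)."""
    i = int(np.argmin(np.linalg.norm(centers - p, axis=1)))  # winner-takes-all
    for k in range(len(centers)):
        centers[k] += eta * gauss_h(grid, i, k, sigma) * (p - centers[k])
    return i
```

Note that the winner is determined by distances in the input space, while h is evaluated on grid distances – exactly the separation the text insists on.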
/.-,
()*+ /.-,
()*+ /.-,
()*+ /.-,
()*+ /.-,
()*+
part of fig. 10.2) or on a one-dimensional
grid we could simply use the number of the
connections between the neurons i and k
/.-,
()*+ /.-,
()*+ /.-,
()*+ 89:;
?>=< /.-,
()*+
(upper part of the same figure).
qqq8 kO
Definition 10.5 (Topology function). qqq
2.23
The topology function h(i, k, t) describes qqq
/.-,
()*+ 89:;
?>=< /()*+o
/ .-,o //()*+
.-, /.-,
()*+
qq
the neighborhood relationships in the qx q
i o
topology. It can be any unimodal func-
/.-,
()*+ /.-,
()*+ /.-,
()*+ /.-,
()*+ /.-,
()*+
tion that reaches its maximum when i = k
gilt. Time-dependence is optional, but of-
ten used.
Figure 10.2: Example distances of a one-dimensional SOM topology (above) and a two-dimensional SOM topology (below) between two neurons i and k. In the lower case the Euclidean distance is determined (in two-dimensional space equivalent to the Pythagorean theorem). In the upper case we simply count the discrete path length between i and k. To simplify matters I required a fixed grid edge length of 1 in both cases.

10.3.1.1 Introduction of common distance and topology functions
A common distance function would be, for example, the already known Gaussian bell (see fig. 10.3 on page 153). It is unimodal with a maximum close to 0. Additionally, its width can be changed by applying its parameter σ, which can be used to realize the neighborhood being reduced in the course of time: We simply relate the time-dependence to the σ and the result is a monotonically decreasing σ(t). Then our topology function could look like this:

h(i, k, t) = exp( −||g_i − g_k||² / (2 · σ(t)²) ),    (10.3)

where g_i and g_k represent the neuron positions on the grid, not the neuron positions in the input space, which would be referred to as c_i and c_k.

Other functions that can be used instead of the Gaussian function are, for instance, the cone function, the cylinder function or the Mexican hat function (fig. 10.3 on the facing page). Here, the Mexican hat function offers a particular biological motivation: Due to its negative values it rejects some neurons close to the winner neuron, a behavior that has already been observed in nature. This can cause sharply separated map areas – and that is exactly why the Mexican hat function has been suggested by Teuvo Kohonen himself. But this adjustment characteristic is not necessary for the functionality of the map; it could even be possible that the map would diverge, i.e. it could virtually explode.

10.3.2 Learning rates and neighborhoods can decrease monotonically over time

To avoid that the later training phases forcefully pull the entire map towards a new pattern, SOMs often work with temporally monotonically decreasing learning rates and neighborhood sizes. At first, let us talk about the learning rate: Typical target values of a learning rate are two orders of magnitude smaller than the initial value, e.g.

0.01 < η < 0.6

could be true. But this size must also depend on the network topology or the size of the neighborhood.

As we have already seen, a decreasing neighborhood size can be realized, for example, by means of a time-dependent, monotonically decreasing σ with the Gaussian bell being used in the topology function.

The advantage of a decreasing neighborhood size is that in the beginning a moving neuron "pulls along" many neurons in its vicinity, i.e. the randomly initialized network can unfold fast and properly in the beginning. At the end of the learning process, only a few neurons are influenced at the same time, which stiffens the network as a whole but enables a good "fine tuning" of the individual neurons.

It must be noted that

h · η ≤ 1

must always be true, since otherwise the neurons would constantly miss the current training sample.

But enough of theory – let us take a look at a SOM in action!
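A monotonically decreasing η(t) or σ(t) can be realized, for example, by exponential interpolation between a start and a target value. A sketch; the particular decay shape is my choice, while the sample values 1.0 → 0.1 for η and 10.0 → 0.2 for σ are the ones later reported for fig. 10.5:

```python
def decaying(start, end, t, t_max):
    """Exponentially interpolate from start (at t = 0) to end (at t = t_max)."""
    return start * (end / start) ** (t / t_max)

# eta(t):   decaying(1.0, 0.1, t, t_max)
# sigma(t): decaying(10.0, 0.2, t, t_max)
```

Early steps thus use a large learning rate and a wide neighborhood, later steps only fine-tune individual neurons.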
Figure 10.3: Gaussian bell, cone function, cylinder function and the Mexican hat function suggested by Kohonen as examples for topology functions of a SOM.
Figure 10.4: Illustration of the two-dimensional input space (left) and the one-dimensional topology space (right) of a self-organizing map. Neuron 3 is the winner neuron since it is closest to p. In the topology, the neurons 2 and 4 are the neighbors of 3. To illustrate the one-dimensional topology of the network, it is plotted into the input space by the dotted line. The arrows mark the movement of the winner neuron and its neighbors towards the training sample p.
Now let us take a look at the above-mentioned network with random initialization of the centers (fig. 10.4 on the preceding page) and enter a training sample p. Obviously, in our example the input pattern is closest to neuron 3, i.e. this is the winning neuron. We remember the learning rule for SOMs

∆c_k = η(t) · h(i, k, t) · (p − c_k)

and process the three factors from the back:

Learning direction: Remember that the neuron centers c_k are vectors in the input space, as well as the pattern p.

Although the center of neuron 7 – seen from the input space – is considerably closer to the input pattern p than neuron 2, neuron 2 is learning and neuron 7 is not. I want to remind the reader that the network topology specifies which neuron is allowed to learn, and not its position in the input space. This is exactly the mechanism by which a topology can significantly cover an input space without having to be related to it in any way.

After the adaptation of the neurons 2, 3 and 4, the next pattern is applied, and so on. Another example of how such a one-dimensional SOM can develop in a two-dimensional input space with uniformly distributed input patterns in the course of
time is shown in fig. 10.5.

A remedy for topological defects could be to increase the initial values for the neighborhood size.

Figure 10.5: Behavior of a SOM with one-dimensional topology (G = 1) after the input of 0, 100, 300, 500, 5000, 50000, 70000 and 80000 randomly distributed input patterns p ∈ R². During the training, η decreased from 1.0 to 0.1 and the σ parameter of the Gauss function decreased from 10.0 to 0.2.

Figure 10.6: End states of one-dimensional (left column) and two-dimensional (right column) SOMs on different input spaces. 200 neurons were used for the one-dimensional topology, 10 × 10 neurons for the two-dimensional topology and 80,000 input patterns for all maps.

We have seen that a SOM is trained by entering input patterns of the input space
R^N one after another, again and again, so that the SOM will be aligned with these patterns and map them. It could happen that we want a certain subset U of the input space to be mapped more precisely than the other ones.

This problem can easily be solved by means of SOMs: During the training, disproportionally many input patterns of the area U are presented to the SOM. If the number of training patterns of U ⊂ R^N presented to the SOM exceeds the number of those patterns of the remaining R^N \ U, then more neurons will group there while the remaining neurons are sparsely distributed on R^N \ U (fig. 10.8 on the next page).

As you can see in the illustration, the edge of the SOM could be deformed. This can be compensated by assigning to the edge of the input space a slightly higher probability of being hit by training patterns (an often applied approach for reaching every corner with the SOMs). Also, a higher learning rate is often used for edge and corner neurons, since they are only pulled into the center by the topology. This also results in a significantly improved corner coverage.

10.6 Application of SOMs

Regarding the biologically inspired associative data storage, there are many fields of application for self-organizing maps and their variations.

For example, the different phonemes of the Finnish language have successfully been mapped onto a SOM with a two-dimensional discrete grid topology, and thereby neighborhoods have been found (a SOM does nothing else than finding neighborhood relationships). So one tries once more to break down a high-dimensional space into a low-dimensional space (the topology), looks if some structures have been developed – et voilà: clearly defined areas for the individual phonemes are formed.

Teuvo Kohonen himself made the effort to search many papers mentioning his SOMs in their keywords. In this large input space the individual papers now occupy individual positions, depending on the occurrence of keywords. Then Kohonen created a SOM with G = 2 and used it to map the high-dimensional "paper space" developed by him.

Thus, it is possible to enter any paper into the completely trained SOM and look which neuron in the SOM is activated. It will be likely to discover that the neighbored papers in the topology are interesting, too. This type of brain-like context-based search also works with many other input spaces.

It is to be noted that the system itself defines what is neighbored, i.e. similar, within the topology – and that is why it is so interesting.

This example shows that the position c of the neurons in the input space is not significant. It is rather interesting to see which
Figure 10.8: Training of a SOM with G = 2 on a two-dimensional input space. On the left side, the chance to become a training pattern was equal for each coordinate of the input space. On the right side, for the central circle in the input space, this chance is more than ten times larger than for the remaining input space (visible in the larger pattern density in the background). In this circle the neurons are obviously more crowded and the remaining area is covered less densely, but in both cases the neurons are still evenly distributed. The two SOMs were trained by means of 80,000 training samples and decreasing η (1 → 0.2) as well as decreasing σ (5 → 0.5).

Figure 10.9: A figure filling different subspaces of the actual input space at different positions can therefore hardly be filled by a SOM.
In spite of all practical hints, it is as always the user's responsibility not to understand this text as a catalog for easy answers but to explore all advantages and disadvantages himself.

Unlike a SOM, the neighborhood of a neural gas must initially refer to all neurons, since otherwise some outliers of the random initialization may never reach the remaining group. To forget this is a popular error during the implementation of a neural gas.

With a neural gas it is possible to learn a kind of complex input such as in fig. 10.9 on the preceding page, since we are not bound to a fixed-dimensional grid. But some computational effort could be necessary for the permanent sorting of the list (here, it could be effective to store the list in an ordered data structure right from the start).

Definition 10.6 (Neural gas). A neural gas differs from a SOM by a completely dynamic neighborhood function. With every learning cycle it is decided anew which neurons are the neighborhood neurons of the winner neuron. Generally, the criterion for this decision is the distance between the neurons and the winner neuron in the input space.

10.7.2 A multi-SOM consists of several separate SOMs

In order to present another variant of the SOMs, I want to formulate an extended problem: What do we do with input patterns from which we know that they are confined in different (maybe disjoint) areas?

Here, the idea is to use not only one SOM but several ones: a multi-self-organizing map, shortly referred to as M-SOM [GKE01b, GKE01a, GS06]. It is unnecessary that the SOMs have the same topology or size; an M-SOM is just a combination of M SOMs.

The learning process is analogous to that of the SOMs. However, only the neurons belonging to the winner SOM of each training step are adapted. Thus, it is easy to represent two disjoint clusters of data by means of two SOMs, even if one of the clusters is not represented in every dimension of the input space R^N. Actually, the individual SOMs exactly reflect these clusters.

Definition 10.7 (Multi-SOM). A multi-SOM is nothing more than the simultaneous use of M SOMs.

10.7.3 A multi-neural gas consists of several separate neural gases

Analogous to the multi-SOM, we also have a set of M neural gases: a multi-neural gas [GS06, SG06]. This construct behaves analogously to the neural gas and the M-SOM: again, only the neurons of the winner gas are adapted.

The reader certainly wonders what advantage there is in using a multi-neural gas. An advantage arises
when large original gases are divided into several smaller ones, since (as already mentioned) the sorting of the list L could use a lot of computational effort while the sorting of several smaller lists L1, L2, . . . , LM is less time-consuming – even if these lists in total contain the same number of neurons.

As a result we will only obtain local instead of global sortings, but in most cases these local sortings are sufficient.

Now we can choose between two extreme cases of multi-neural gases: One extreme case is the ordinary neural gas, M = 1, i.e. we only use one single neural gas. Interestingly enough, the other extreme case (very large M, a few or only one neuron per gas) behaves analogously to K-means clustering (for more information on clustering procedures see excursus A).

To build a growing SOM is more difficult, because new neurons have to be integrated in the neighborhood.

Exercises

Exercise 17. A regular, two-dimensional grid shall cover a two-dimensional surface as "well" as possible.

1. Which grid structure would suit best for this purpose?

2. Which criteria did you use for "well" and "best"?

The very imprecise formulation of this exercise is intentional.
Chapter 11
Adaptive resonance theory

As in the other smaller chapters, we want to try to figure out the basic idea of the adaptive resonance theory (abbreviated: ART) without discussing its theory profoundly.

In several sections we have already mentioned that it is difficult to use neural networks for the learning of new information in addition to, but without destroying, the already existing information. This circumstance is called the stability/plasticity dilemma.

In 1987, Stephen Grossberg and Gail Carpenter published the first version of their ART network [Gro76] in order to alleviate this problem. This was followed by a whole family of ART improvements (which we want to discuss briefly, too).

It is the idea of unsupervised learning, whose aim is the (initially binary) pattern recognition, or more precisely the categorization of patterns into classes. But additionally, an ART network shall be capable of finding new classes.

11.1 Task and structure of an ART network

An ART network comprises exactly two layers: the input layer I and the recognition layer O, with the input layer being completely linked towards the recognition layer. This complete link induces a top-down weight matrix W that contains the weight values of the connections between each neuron in the input layer and each neuron in the recognition layer (fig. 11.1 on the following page).

Simple binary patterns are entered into the input layer and transferred to the recognition layer, while the recognition layer shall return a 1-out-of-|O| encoding, i.e. it should follow the winner-takes-all
Figure 11.1: Simplified illustration of the ART network structure. Top: the input layer, bottom:
the recognition layer. In this illustration the lateral inhibition of the recognition layer and the control
neurons are omitted.
scheme. For instance, to realize this 1-out-of-|O| encoding, the principle of lateral inhibition can be used – or, in the implementation, the most activated neuron can simply be chosen. Every activity within the input layer causes an activity within the recognition layer, while in turn every activity within the recognition layer causes an activity within the input layer.
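The winner-takes-all recognition step can be sketched in a few lines of Python. This only illustrates the 1-out-of-|O| encoding, not the complete ART mechanics (control neurons, resonance and the top-down comparison are omitted, just as in fig. 11.1); the pattern and the weight values are made up for the example.

```python
import numpy as np

def recognize(pattern, W):
    """Winner-takes-all recognition: return a 1-out-of-|O| vector."""
    activity = W @ pattern          # net input of every recognition neuron
    out = np.zeros(W.shape[0])
    out[np.argmax(activity)] = 1.0  # lateral inhibition: only the winner stays active
    return out

p = np.array([1.0, 0.0, 1.0, 1.0])      # a binary input pattern
W = np.array([[1.0, 1.0, 0.0, 0.0],     # one weight row per recognition neuron
              [0.0, 0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0, 1.0]])
out = recognize(p, W)                   # exactly one component is 1
```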
Appendix A
Excursus: Cluster analysis and regional and online learnable fields

In Grimm's dictionary the extinct German word "Kluster" is described by "was dicht und dick zusammensitzet (a thick and dense group of sth.)". In static cluster analysis, the formation of groups within point clouds is explored. Introduction of some procedures, comparison of their advantages and disadvantages. Discussion of an adaptive clustering method based on neural networks. A regional and online learnable field models a point cloud, possibly containing very many points, by a comparatively small set of neurons that are representative of the point cloud.
fine a clustering procedure that uses a metric as distance measure.

Now we want to introduce and briefly discuss different clustering procedures.

A.1 k-means clustering allocates data to a predefined number of clusters

k-means clustering according to J. MacQueen [Mac67] is an algorithm that is often used because of its low computation and storage complexity and which is regarded as "inexpensive and good". The operation sequence of the k-means clustering algorithm is the following:

1. Provide data to be examined.

2. Define k, which is the number of cluster centers.

3. Select k random vectors for the cluster centers (also referred to as codebook vectors).

4. Assign each data point to the nearest codebook vector.

5. Compute the cluster centers for all clusters.

6. Set the codebook vectors to the new cluster centers.

7. Continue with 4 until the assignments are no longer changed.

Step 2 already shows one of the great questions of the k-means algorithm: The number k of the cluster centers has to be determined in advance. This cannot be done by the algorithm. The problem is that it is not necessarily known in advance how k can be determined best. Another problem is that the procedure can become quite unstable if the codebook vectors are badly initialized. But since the initialization is random, it is often useful to restart the procedure; this does not require much computational effort. If you are fully aware of these weaknesses, you will obtain quite good results.

However, complex structures such as "clusters in clusters" cannot be recognized. If k is high, the outer ring of the construction in the following illustration will be recognized as many single clusters. If k is low, the ring together with the small inner clusters will be recognized as one cluster.

For an illustration see the upper right part of fig. A.1.
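The seven steps above translate into a short loop; a minimal Python sketch, in which the sample data, the seed and the fixed iteration cap are illustrative choices.

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Steps 1-7 above: random codebook vectors, then alternate between
    assigning points to their nearest codebook vector and recomputing centers."""
    rng = np.random.default_rng(seed)
    codebook = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # step 4: assign every data point to its nearest codebook vector
        dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # steps 5-6: move each codebook vector to its cluster center
        new_codebook = np.array([data[assignment == c].mean(axis=0)
                                 if np.any(assignment == c) else codebook[c]
                                 for c in range(k)])
        if np.allclose(new_codebook, codebook):
            break  # step 7: centers (and hence assignments) no longer change
        codebook = new_codebook
    return codebook, assignment

data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
codebook, assignment = kmeans(data, 2)
```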
builds a cluster. The advantage is that the number of clusters arises all by itself. The disadvantage is that a large storage and computational effort is required to find the next neighbor (the distances between all data points must be computed and stored).

There are some special cases in which the procedure combines data points belonging to different clusters if k is too high (see the two small clusters in the upper right of the illustration). Clusters consisting of only one single data point are basically connected to another cluster, which is not always intentional. Furthermore, it is not mandatory that the links between the points are symmetric.

But this procedure allows a recognition of rings and therefore of "clusters in clusters", which is a clear advantage. Another advantage is that the procedure adaptively responds to the distances in and between the clusters.

For an illustration see the lower left part of fig. A.1.

In ε-nearest neighboring, points are neighbors if they are at most ε apart from each other, which is the reason for the name. Here, the storage and computational effort is obviously very high, which is a disadvantage.

But note that there are some special cases: Two separate clusters can easily be connected due to the unfavorable position of a single data point. This can also happen with k-nearest neighboring, but it would be more difficult since in this case the number of neighbors per point is limited.

An advantage is the symmetric nature of the neighborhood relationships. Another advantage is that the combination of minimal clusters due to a fixed number of neighbors is avoided.

On the other hand, it is necessary to skillfully initialize ε in order to be successful, i.e. smaller than half the smallest distance between two clusters. With variable cluster and point distances within clusters this can possibly be a problem.

For an illustration see the lower right part of fig. A.1.
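The ε-criterion and the resulting clusters – connected groups of mutually neighboring points – can be sketched as follows; the flood-fill over the symmetric neighborhood relation is one possible implementation, and the sample points are made up.

```python
import numpy as np

def epsilon_clusters(points, eps):
    """ε-nearest neighboring: points are neighbors if at most eps apart;
    clusters are the connected components of this symmetric relation."""
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    adj = d <= eps                       # symmetric neighborhood matrix
    labels = -np.ones(n, dtype=int)      # -1 means "not yet assigned"
    cluster = 0
    for start in range(n):
        if labels[start] >= 0:
            continue
        labels[start] = cluster          # flood-fill one connected component
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(adj[i]):
                if labels[j] < 0:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    return labels

pts = np.array([[0.0, 0.0], [0.0, 0.5], [5.0, 5.0], [5.0, 5.4]])
labels = epsilon_clusters(pts, 1.0)      # two clusters of two points each
```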
Figure A.1: Top left: our set of points. We will use this set to explore the different clustering methods. Top right: k-means clustering. Using this procedure we chose k = 6. As we can see, the procedure is not capable of recognizing "clusters in clusters" (bottom left of the illustration). Long "lines" of points are a problem, too: They would be recognized as many small clusters (if k is sufficiently large). Bottom left: k-nearest neighboring. If k is selected too high (higher than the number of points in the smallest cluster), this will result in the cluster combinations shown in the upper right of the illustration. Bottom right: ε-nearest neighboring. This procedure will cause difficulties if ε is selected larger than the minimum distance between two clusters (see upper left of the illustration), which will then be combined.
a criterion to decide how good our cluster division is. This possibility is offered by the silhouette coefficient according to [Kau90]. This coefficient measures how well the clusters are delimited from each other and indicates if points may be assigned to the wrong clusters.

b(p) = \min_{g \in C, g \neq c} \frac{1}{|g|} \sum_{q \in g} \mathrm{dist}(p, q) \qquad (A.2)

The point p is classified well if the distance to the center of its own cluster is minimal and the distance to the centers of the other clusters is maximal. In this case, the following term provides a value close to 1:

s(p) = \frac{b(p) - a(p)}{\max\{a(p), b(p)\}} \qquad (A.3)

Apparently, the whole term s(p) can only be within the interval [−1; 1]. A value close to −1 indicates a bad classification of p.

The silhouette coefficient S(P) results from the average of all values s(p):

S(P) = \frac{1}{|P|} \sum_{p \in P} s(p)

A.5 Regional and online learnable fields are a neural clustering strategy

The paradigm of neural networks which I want to introduce now is the regional and online learnable fields, shortly referred to as ROLFs.
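The per-point values s(p) and their average S(P) can be computed directly from the formulas above; a small sketch, assuming that a(p) denotes the mean distance of p to the other points of its own cluster, and skipping singleton clusters for which s(p) is not defined.

```python
import numpy as np

def silhouette(points, labels):
    """Average silhouette value S(P) over all points with the formulas above."""
    labels = np.asarray(labels)
    s_values = []
    for i, p in enumerate(points):
        same = labels == labels[i]
        same[i] = False
        if not same.any():
            continue  # singleton cluster: s(p) is left out
        # a(p): mean distance to the other points of p's own cluster (assumed)
        a = np.mean(np.linalg.norm(points[same] - p, axis=1))
        # b(p): smallest mean distance to any other cluster, eq. (A.2)
        b = min(np.mean(np.linalg.norm(points[labels == g] - p, axis=1))
                for g in set(labels.tolist()) - {labels[i]})
        s_values.append((b - a) / max(a, b))  # eq. (A.3)
    return float(np.mean(s_values))

pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
score = silhouette(pts, [0, 0, 1, 1])  # well-separated clusters: close to 1
```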
k consists of all points within the radius ρ·σ in the input space.

A.5.2 A ROLF learns unsupervised by presenting training samples online

Like many other paradigms of neural networks, our ROLF network learns by receiving many training samples p of a training set P. The learning is unsupervised. For each training sample p entered into the network, two cases can occur:

1. There is one accepting neuron k for p, or

2. there is no accepting neuron at all.

If in the first case several neurons are suitable, then there will be exactly one accepting neuron insofar as the closest neuron is the accepting one. For the accepting neuron k, c_k and σ_k are adapted.

Definition A.5 (Accepting neuron). The criterion for a ROLF neuron k to be an accepting neuron of a point p is that the point p must be located within the perceptive surface of k. If p is located in the perceptive surfaces of several neurons, then the closest neuron will be the accepting one. If there are several closest neurons, one can be chosen randomly.

A.5.2.1 Both positions and radii are adapted throughout learning

Let us assume that we entered a training sample p into the network and that there is an accepting neuron k. Then the radius moves towards \|p − c_k\| (i.e. towards the distance between p and c_k) and the center c_k towards p. Additionally, let us define the two learning rates η_σ and η_c for radii and centers:

c_k(t+1) = c_k(t) + \eta_c (p - c_k(t))
\sigma_k(t+1) = \sigma_k(t) + \eta_\sigma (\|p - c_k(t)\| - \sigma_k(t))

Note that here σ_k is a scalar while c_k is a vector in the input space.

Definition A.6 (Adapting a ROLF neuron). A neuron k accepted by a point p is adapted according to the following rules:

c_k(t+1) = c_k(t) + \eta_c (p - c_k(t)) \qquad (A.5)
\sigma_k(t+1) = \sigma_k(t) + \eta_\sigma (\|p - c_k(t)\| - \sigma_k(t)) \qquad (A.6)

A.5.2.2 The radius multiplier allows neurons to be able not only to shrink

Now we can understand the function of the multiplier ρ: Due to this multiplier the perceptive surface of a neuron includes more than only the points surrounding the neuron within the radius σ. This means that, due to the aforementioned learning rule, σ cannot only decrease but also increase.

Definition A.7 (Radius multiplier). The radius multiplier ρ > 1 is globally defined and expands the perceptive surface of a neuron k to a multiple of σ_k. So it is ensured that the radius σ_k cannot only decrease but also increase.
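Equations (A.5) and (A.6) translate directly into code; the learning-rate values below are illustrative, and note that the radius update deliberately uses the old center c_k(t).

```python
import numpy as np

ETA_C, ETA_S = 0.1, 0.05  # example learning rates for centers and radii

def adapt(c_k, sigma_k, p):
    """Adaptation rules (A.5)/(A.6) for an accepting ROLF neuron k."""
    # sigma uses the distance to the OLD center, so compute it first
    sigma_new = sigma_k + ETA_S * (np.linalg.norm(p - c_k) - sigma_k)
    c_new = c_k + ETA_C * (p - c_k)
    return c_new, sigma_new

c_new, sigma_new = adapt(np.array([0.0, 0.0]), 1.0, np.array([2.0, 0.0]))
```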
Generally, the radius multiplier is set to values in the lower one-digit range, such as 2 or 3.

So far we have only discussed the case in the ROLF training that there is an accepting neuron for the training sample p.

A.5.2.3 As required, new neurons are generated

This suggests discussing the approach for the case that there is no accepting neuron. In this case a new accepting neuron k is generated for our training sample. The result is, of course, that c_k and σ_k have to be initialized.

The initialization of c_k can be understood intuitively: The center of the new neuron is simply set on the training sample, i.e. c_k = p. We generate a new neuron because there is no neuron close to p – for logical reasons, we place the neuron exactly on p.

But how to set a σ when a new neuron is generated? For this purpose there exist different options:

Init-σ: We always select a predefined static σ.

Minimum σ: We take a look at the σ of each neuron and select the minimum.

Maximum σ: We take a look at the σ of each neuron and select the maximum.

Mean σ: We select the mean σ of all neurons.

Currently, the mean-σ variant is the favorite one, although the learning procedure also works with the other ones. In the minimum-σ variant the neurons tend to cover less of the surface, in the maximum-σ variant they tend to cover more of the surface.

Definition A.8 (Generating a ROLF neuron). If a new ROLF neuron k is generated by entering a training sample p, then c_k is initialized with p and σ_k according to one of the aforementioned strategies (init-σ, minimum-σ, maximum-σ, mean-σ).

The training is complete when, after repeated randomly permuted pattern presentation, no new neuron has been generated in an epoch and the positions of the neurons barely change.

A.5.3 Evaluating a ROLF

The result of the training algorithm is that the training set is gradually covered well and precisely by the ROLF neurons and that a high concentration of points on a spot of the input space does not automatically generate more neurons. Thus, a possibly very large point cloud is reduced to very few representatives (based on the input set).

Then it is very easy to define the number of clusters: Two neurons are (according to the definition of the ROLF) connected when their perceptive surfaces overlap.
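Putting accepting, adapting and generating together gives a compact training loop; a sketch using the mean-σ strategy, in which ρ, the learning rates, the initial σ and the fixed epoch count are illustrative choices.

```python
import numpy as np

RHO, ETA_C, ETA_S, INIT_SIGMA = 2.0, 0.1, 0.05, 1.0  # illustrative settings

def train_rolf(P, epochs=20, seed=0):
    """Unsupervised ROLF training sketch with the mean-sigma strategy."""
    rng = np.random.default_rng(seed)
    centers, sigmas = [], []
    for _ in range(epochs):
        for p in rng.permutation(P):           # randomly permuted presentation
            if centers:
                d = [np.linalg.norm(p - c) for c in centers]
                k = int(np.argmin(d))
                if d[k] <= RHO * sigmas[k]:    # p lies in the perceptive surface
                    # eqs. (A.5)/(A.6): radius first (uses the old center)
                    sigmas[k] += ETA_S * (d[k] - sigmas[k])
                    centers[k] = centers[k] + ETA_C * (p - centers[k])
                    continue
            # no accepting neuron: generate one on p (definition A.8)
            centers.append(p.copy())
            sigmas.append(float(np.mean(sigmas)) if sigmas else INIT_SIGMA)
    return np.array(centers), np.array(sigmas)

# two tight, far-apart blobs should be reduced to two representative neurons
P = np.array([[0.0, 0.0], [0.05, 0.0], [0.0, 0.05], [0.05, 0.05],
              [10.0, 10.0], [10.05, 10.0], [10.0, 10.05], [10.05, 10.05]])
centers, sigmas = train_rolf(P)
```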
Additionally, the issue of the size of the individual clusters proportional to their distance from each other is addressed by using variable perceptive surfaces – which is also not always the case for the two mentioned methods.

The ROLF compares favorably with k-means clustering as well: Firstly, it is unnecessary to know the number of clusters in advance and, secondly, k-means clustering recognizes clusters enclosed by other clusters as separate clusters.

A.5.5 Initializing radii, learning rates and multiplier is not trivial

Certainly, the disadvantages of the ROLF shall not be concealed: It is not always easy to select the appropriate initial values for σ and ρ. Previous knowledge about the data set can, so to say, be included in ρ and the initial value of σ of the ROLF: Fine-grained data clusters should use a small ρ and a small initial value of σ. But the smaller the ρ, the smaller the chance that the neurons will grow if necessary. Here again, there is no easy answer, just like for the learning rates η_c and η_σ.

For ρ, multipliers in the lower single-digit range such as 2 or 3 are very popular. η_c and η_σ successfully work with values of about 0.005 to 0.1; variations during run-time are also imaginable for this type of network. Initial values for σ generally depend on the cluster and data distribution (i.e. they often have to be tested). But compared to wrong initializations – at least with the mean-σ strategy – they are relatively robust after some training time.

As a whole, the ROLF is on a par with the other clustering methods and is particularly interesting for systems with low storage capacity or huge data sets.

A.5.6 Application examples

A first application example could be finding color clusters in RGB images. Another field of application directly described in the ROLF publication is the recognition of words transferred into a 720-dimensional feature space. Thus, we can see that ROLFs are relatively robust against higher dimensions. Further applications can be found in the field of analysis of attacks on network systems and their classification.

Exercises

Exercise 18. Determine at least four adaptation steps for one single ROLF neuron k if the four patterns stated below are presented one after another in the indicated order. Let the initial values for the ROLF neuron be c_k = (0.1, 0.1) and σ_k = 1. Furthermore, let η_c = 0.5 and η_σ = 0. Let ρ = 3.

P = {(0.1, 0.1); (0.9, 0.1); (0.1, 0.9); (0.9, 0.9)}.
Appendix B
Excursus: neural networks used for prediction
Figure B.2: Representation of the one-step-ahead prediction. The attempt is made to calculate the future value from a series of past values. The predicting element (in this case a neural network) is referred to as predictor.
usually have a lot of past values, so that we can set up a series of equations¹:

x_t = a_0 x_{t-1} + \ldots + a_j x_{t-1-(n-1)}
x_{t-1} = a_0 x_{t-2} + \ldots + a_j x_{t-2-(n-1)}
\vdots \qquad (B.3)
x_{t-n} = a_0 x_{t-n-1} + \ldots + a_j x_{t-n-1-(n-1)}

Thus, we could find n equations for n unknown coefficients and solve them (if possible). Or, another and better approach: we could use m > n equations for n unknowns in such a way that the sum of the mean squared errors of the already known predictions is minimized. This is called the moving average procedure.

But this linear structure corresponds to a singlelayer perceptron with a linear activation function which has been trained by means of data from the past (the experimental setup would comply with fig. B.1). In fact, training by means of the delta rule provides results very close to the analytical solution.

Even if this approach often provides satisfying results, we have seen that many problems cannot be solved by using a singlelayer perceptron. Additional layers with a linear activation function are useless as well, since a multilayer perceptron with only linear activation functions can be reduced to a singlelayer perceptron. Such considerations lead to a non-linear approach.

The multilayer perceptron and non-linear activation functions provide a universal non-linear function approximator, i.e. we can use an n-|H|-1-MLP for n inputs out of the past. An RBF network could also be used. But remember that here the number n has to remain low, since in RBF networks high input dimensions are very complex to realize. So if we want to include many past values, a multilayer perceptron will require considerably less computational effort.

¹ Without going into detail, I want to remark that the prediction becomes easier the more past values of the time series are available. I would like to ask the reader to read up on the Nyquist-Shannon sampling theorem.
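The "use m > n equations and minimize the squared error" idea corresponds to an ordinary least-squares fit of the coefficients; a sketch using a synthetic series that exactly obeys a known linear rule, so the fit must recover the coefficients.

```python
import numpy as np

def fit_linear_predictor(series, n):
    """Least-squares fit of a_0..a_{n-1} in
    x_t = a_0 x_{t-1} + ... + a_{n-1} x_{t-n}, using all m > n equations."""
    X = np.array([series[t - n:t][::-1] for t in range(n, len(series))])
    y = series[n:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a

# a time series that exactly obeys x_t = 0.5 x_{t-1} + 0.3 x_{t-2}
x = [1.0, 2.0]
for _ in range(13):
    x.append(0.5 * x[-1] + 0.3 * x[-2])
x = np.array(x)
coeffs = fit_linear_predictor(x, 2)
```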
What approaches can we use to see farther into the future?

B.3.1 Recursive two-step-ahead prediction

In order to extend the prediction to, for instance, two time steps into the future, we could perform two one-step-ahead predictions in a row (fig. B.3 on the following page), i.e. a recursive two-step-ahead prediction. Unfortunately, the value determined by means of a one-step-ahead prediction is generally imprecise, so that errors can build up, and the more predictions are performed in a row, the more imprecise the result becomes.

B.3.2 Direct two-step-ahead prediction

We have already guessed that there exists a better approach: Just like the system can be trained to predict the next value, we can certainly train it to predict the next but one value. This means we directly train, for example, a neural network to look two time steps ahead into the future, which is referred to as direct two-step-ahead prediction (fig. B.4 on the next page). Obviously, the direct two-step-ahead prediction is technically identical to the one-step-ahead prediction. The only difference is the training.

The possibility to predict values far away in the future is not only important because we try to look farther ahead into the future. There can also be periodic time series where other approaches are hardly possible: If a lecture begins at 9 a.m. every Thursday, it is not very useful to know how many people sat in the lecture room on Monday in order to predict the number of lecture participants. The same applies, for example, to periodically occurring commuter jams.

B.4.1 Changing temporal parameters

Thus, it can be useful to intentionally leave gaps in the future values as well as in the past values of the time series, i.e. to introduce the parameter ∆t, which indicates which past value is used for prediction. Technically speaking, we still use a one-step-ahead prediction, only that we extend the input space or train the system to predict values lying farther away.

It is also possible to combine different ∆t: In the case of the traffic jam prediction for a Monday, the values of the last few days could be used as data input in addition to the values of the previous Mondays. Thus, we use the last values of several periods, in this case the values of a weekly and a daily period. We could also include an annual period in the form of the beginning of the holidays (for sure, everyone of us has
Figure B.3: Representation of the two-step-ahead prediction. Attempt to predict the second future value out of a past value series by means of a second predictor and the involvement of an already predicted value.

Figure B.4: Representation of the direct two-step-ahead prediction. Here, the second time step is predicted directly, the first one is omitted. Technically, it does not differ from a one-step-ahead prediction.
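The recursive variant of fig. B.3 can be sketched by simply feeding the first prediction back in as an input; the linear predictor and its coefficients here are only stand-ins for the trained network.

```python
import numpy as np

def one_step(history, a):
    """One-step-ahead predictor: weighted sum of the last len(a) values."""
    n = len(a)
    return float(np.dot(a, history[-n:][::-1]))

def recursive_two_step(history, a):
    """Recursive two-step-ahead prediction: reuse the (possibly imprecise)
    prediction for t+1 as an input when predicting t+2."""
    first = one_step(history, a)                    # predict x_{t+1}
    return one_step(np.append(history, first), a)   # predict x_{t+2}

hist = np.array([1.0, 2.0, 1.3, 1.25])   # obeys x_t = 0.5 x_{t-1} + 0.3 x_{t-2}
a = np.array([0.5, 0.3])
pred2 = recursive_two_step(hist, a)
```

A direct two-step predictor would instead be trained on pairs (past window, value two steps ahead) and applied in one shot; technically it is the same network, only the training differs.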
already spent a lot of time on the highway because he forgot the beginning of the holidays).

B.4.2 Heterogeneous prediction

Another prediction approach would be to predict the future values of a single time series out of several time series, if it is assumed that the additional time series is related to the future of the first one (heterogeneous one-step-ahead prediction, fig. B.5 on the following page).

If we want to predict two outputs of two related time series, it is certainly possible to perform two parallel one-step-ahead predictions (analytically this is done very often because otherwise the equations would become very confusing); or, in the case of neural networks, an additional output neuron is attached and the knowledge of both time series is used for both outputs (fig. B.6 on the next page).

You'll find more and more general material on time series in [WG94].

B.5 Remarks on the prediction of share prices

Many people observe the changes of a share price in the past and try to conclude the future from those values in order to benefit from this knowledge. Share prices are discontinuous and therefore principally difficult functions. Furthermore, the functions can only be used for discrete values – often, for example, in a daily rhythm (including the maximum and minimum values per day, if we are lucky), with the daily variations certainly being eliminated. But this makes the whole thing even more difficult.

There are chartists, i.e. people who look at many diagrams and decide by means of a lot of background knowledge and decade-long experience whether the equities should be bought or not (and often they are very successful).

Apart from the share prices it is very interesting to predict the exchange rates of currencies: If we exchange 100 Euros into Dollars, the Dollars into Pounds and the Pounds back into Euros, it could be possible that we finally receive 110 Euros. But once this is found out, we would do it more often and thus we would change the exchange rates into a state in which such an increasing circulation would no longer be possible (otherwise we could produce money by generating, so to speak, a financial perpetual motion machine).

At the stock exchange, successful stock and currency brokers raise or lower their thumbs – and thereby indicate whether, in their opinion, a share price or an exchange rate will increase or decrease. Mathematically speaking, they indicate the first bit (sign) of the first derivative of the exchange rate. In that way excellent world-class brokers obtain success rates of about 70%.

In Great Britain, the heterogeneous one-step-ahead prediction was successfully
Figure B.6: Heterogeneous one-step-ahead prediction of two time series at the same time.
used to increase the accuracy of such predictions to 76%: In addition to the time series of the values, indicators such as the oil price in Rotterdam or the US national debt were included.

This is just an example to show the magnitude of the accuracy of stock-exchange evaluations, since we are still talking only about the first bit of the first derivative! We still do not know how strong the expected increase or decrease will be, and also whether the effort will pay off: Probably one wrong prediction could nullify the profit of one hundred correct predictions.

Again and again some software appears which uses scientific key words such as "neural networks" to purport that it is capable of predicting where share prices are going. Do not buy such software! In addition to the aforementioned scientific exclusions, there is one simple reason for this: If these tools work – why should the manufacturer sell them? Normally, useful economic knowledge is kept secret. If we knew a way to definitely gain wealth by means of shares, we would earn our millions by using this knowledge instead of selling it for 30 euros, wouldn't we?
Appendix C
Excursus: reinforcement learning

I now want to introduce a more exotic approach of learning – just to leave the usual paths. We know learning procedures in which the network is told exactly what to do, i.e. we provide exemplary output values. We also know learning procedures like those of the self-organizing maps, into which only input values are entered. Now we want to explore something in between: the learning paradigm of reinforcement learning – reinforcement learning according to Sutton and Barto [SB98].

Reinforcement learning in itself is no neural network but only one of the three learning paradigms already mentioned in chapter 4. In some sources it is counted among the supervised learning procedures since a feedback is given. Due to its very rudimentary feedback, however, it is reasonable to separate it from the supervised learning procedures – apart from the fact that there are no training samples at all.

While it is generally known that procedures such as backpropagation cannot work in the human brain itself, reinforcement learning is usually considered as being biologically more motivated.

The term reinforcement learning comes from cognitive science and psychology and it describes the learning system of carrot and stick, which occurs everywhere in nature, i.e. learning by means of good or bad experience, reward and punishment. But there is no learning aid that exactly explains what we have to do: We only receive a total result for a process (Did we win the game of chess or not? And how sure was this victory?), but no results for the individual intermediate steps.

For example, if we ride our bike with worn tires and at a speed of exactly 21.5 km/h through a turn over some sand with a grain size of 0.1 mm on average, then nobody could tell us exactly which handlebar angle we have to adjust or, even worse, how strongly the great number of muscle parts in our arms or legs have to contract for this. Depending on whether we reach the end of the curve unharmed or not, we soon have to face the learning experience, a feedback or a reward, be it good or bad. Thus, the reward is very simple – but on the other hand it is considerably easier to obtain. If we now have tested different velocities and turning angles often enough and received some rewards, we will get a feel for what works and what does not. The aim of reinforcement learning is to maintain exactly this feeling.

Another example for the quasi-impossibility to achieve a sort of cost or utility function is a tennis player who tries to maximize his athletic success in the long term by means of complex movements and ballistic trajectories in three-dimensional space, including the wind direction, the importance of the tournament, private factors and many more.

To get straight to the point: Since we receive only little feedback, reinforcement learning often means trial and error – and therefore it is very slow.

C.1 System structure

Reinforcement learning is an interaction between an agent and an environmental system (fig. C.2).

The agent shall solve some problem. It could, for instance, be an autonomous robot that shall avoid obstacles. The agent performs some actions within the environment and in return receives a feedback from the environment, which in the following is called reward. This cycle of action and reward is characteristic for reinforcement learning. The agent influences the system, the system provides a reward and then changes.

The reward is a real or discrete scalar which describes, as mentioned above, how well we achieve our aim, but it does not give any guidance on how we can achieve it. The aim is always to make the sum of rewards as high as possible in the long term.

C.1.1 The gridworld

As a learning example for reinforcement learning I would like to use the so-called gridworld. We will see that its structure is very simple and easy to figure out, and therefore reinforcement learning is actually not necessary. However, it is very suitable for representing the approach of reinforcement learning. Now let us describe this simple, exemplary world.
C.1.2 Agent and environment

Figure C.2: The agent performs some actions within the environment and in return receives a reward.
Our aim is that the agent learns what happens by means of the reward.
In the gridworld: In the gridworld, the agent is a simple robot that should find the exit of the gridworld. The environment is the gridworld itself, which is a discrete gridworld.

Definition C.1 (Agent). In reinforcement learning, the agent can be formally described as a mapping of the situation space into the action space.

Therefore, situations generally do not allow us to clearly "predict" successor situations – even with a completely deterministic system this may not be applicable. If we knew all states and the transitions between them exactly (thus, the complete system), it would be possible to plan optimally and it would also be easy to find an optimal policy (methods are provided, for example, by dynamic programming).

Now we know that reinforcement learning is an interaction between the agent and the system, including actions a_t and situations s_t. The agent cannot determine by itself whether the current situation is good or bad: This is exactly the reason why it receives the said reward from the environment.

In the gridworld: States are positions where the agent can be situated. Simply said, the situations equal the states in the gridworld. Possible actions would be to move towards north, south, east or west.

Situation and action can be vectorial; the reward is always a scalar (in an extreme case even only a binary value), since the aim of reinforcement learning is to get along with little feedback. A complex vectorial reward would equal a real teaching input.

Definition C.4 (Situation). Situations s_t (here at time t) of a situation space S are the agent's limited, approximate knowledge about its state. This approximation (about which the agent cannot even know how good it is) makes clear predictions impossible.

Definition C.5 (Action). Actions a_t can be performed by the agent (whereupon it could be possible that, depending on the situation, another action space A(S) exists). They cause state transitions and therefore a new situation from the agent's point of view.

C.1.4 Reward and return

As in real life, it is our aim to receive a reward that is as high as possible, i.e. to maximize the sum of the expected rewards r, called return R, in the long term. For finitely many time steps, the rewards can simply be added up.
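The action-reward cycle just formalized can be sketched for the gridworld; the grid size, the reward values and the fixed action sequence are invented for the example, and situations here simply equal states, as noted above.

```python
# A minimal action-reward cycle in a toy 4x4 gridworld (illustrative values).
# The agent starts at (0, 0); the exit is at (3, 3); each step costs -1,
# reaching the exit yields +10.
MOVES = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

def step(state, action, size=4):
    """Environment: apply an action, return the successor state and a reward."""
    r, c = state
    dr, dc = MOVES[action]
    nr = min(max(r + dr, 0), size - 1)   # walls: the agent cannot leave the grid
    nc = min(max(c + dc, 0), size - 1)
    reward = 10.0 if (nr, nc) == (size - 1, size - 1) else -1.0
    return (nr, nc), reward

# one episode with a fixed action sequence
state, total = (0, 0), 0.0
for action in ["south", "south", "south", "east", "east", "east"]:
    state, reward = step(state, action)
    total += reward
```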
the environmental system returns to the Thus, we divide the timeline into
agent as reaction to an action. episodes. Usually, one of the two meth-
ods is used to limit the sum, if not both
Definition C.7 (Return). The return Rt methods together.
is the accumulation of all received rewards
Rt I As in daily living we try to approximate
until time t.
our current situation to a desired state.
Since it is not mandatory that only the next expected reward but the expected total sum decides what the agent will do, it is also possible to perform actions that result in a negative reward on short notice (e.g. the pawn sacrifice in a chess game) but pay off later.

C.1.4.1 Dealing with long periods of time

However, not every problem has an explicit target and therefore a finite sum (e.g. our agent can be a robot whose task is to drive around again and again while avoiding obstacles). In order not to receive a diverging sum in the case of an infinite series of reward estimations, a weakening factor 0 < γ < 1 is used, which weakens the influence of future rewards. This is not only useful if there exists no target but also if the target is very far away:

R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + . . . = ∑_{x=1}^{∞} γ^{x−1} r_{t+x}

The farther away the reward is, the smaller the influence it has on the agent's decisions.

Another possibility to handle the return sum would be a limited time horizon τ, so that only the τ following rewards r_{t+1}, . . . , r_{t+τ} are regarded:

R_t = r_{t+1} + . . . + γ^{τ−1} r_{t+τ}   (C.7)
    = ∑_{x=1}^{τ} γ^{x−1} r_{t+x}   (C.8)

C.1.5 The policy

After having considered and formalized some system components of reinforcement learning, the actual aim is still to be discussed: the agent continuously adjusts a mapping of the situations to the probabilities P(A) with which any action A is performed in any situation S. A policy can be defined as a strategy to select actions that maximize the reward in the long term.

In the gridworld: The policy is the strategy according to which the agent tries to exit the gridworld.

Definition C.8 (Policy). The policy Π is a mapping of situations to probabilities of performing every action out of the action space. It can be formalized as

Π : S → P(A).   (C.9)

Basically, we distinguish between two policy paradigms: An open-loop policy represents an open control chain and creates, out of an initial situation s_0, a sequence of actions a_0, a_1, . . . with a_i ≠ a_i(s_i); i > 0. Thus, in the beginning the agent develops a plan and consecutively executes it to the end without considering the intermediate situations (therefore a_i ≠ a_i(s_i): actions after a_0 do not depend on the situations).

In the gridworld: An open-loop policy would provide a precise route towards the exit, such as the way from the given starting position written in abbreviations of the directions: EEEEN.

So an open-loop policy is a sequence of actions without interim feedback: a sequence of actions is generated out of a starting situation. If the system is known well and truly, such an open-loop policy can be used successfully and lead to useful results. But to know, for example, the game of chess well and truly, it would be necessary to try every possible move, which would be very time-consuming. Thus, for such problems we have to find an alternative to the open-loop policy which incorporates the current situations into the action plan:

A closed-loop policy is a closed loop, in a manner of speaking: a function

Π : s_i → a_i with a_i = a_i(s_i).

Here, the environment influences our action, or the agent responds to the input of the environment, respectively, as already illustrated in fig. C.2. A closed-loop policy is, so to speak, a reactive plan to map current situations to actions to be performed.

In the gridworld: A closed-loop policy would be responsive to the current position and choose the direction accordingly. In particular, when an obstacle appears dynamically, such a policy is the better choice.

When selecting the actions to be performed, again two basic strategies can be examined.

C.1.5.1 Exploitation vs. exploration
As in real life, during reinforcement learning the question often arises whether the existing knowledge is only willfully exploited or new ways are also explored. Initially, we want to discuss the two extremes:

A greedy policy always chooses the way of the highest reward that can be determined in advance, i.e. the way of the highest known reward. This policy represents the exploitation approach and is very promising when the used system is already known.

In contrast to the exploitation approach, the aim of the exploration approach is to explore a system as thoroughly as possible, so that even paths leading to the target can be found which may not seem very promising at first glance but are in fact very successful.

Let us assume that we are looking for the way to a restaurant: a safe policy would be to always take the way we already know, no matter how suboptimal and long it may be, and not to try to explore better ways. Another approach would be to explore shorter ways every now and then, even at the risk of taking a long time and being unsuccessful, and therefore finally having to take the original way and arriving too late at the restaurant.

The leaves of such a situation tree are the end situations of the system. The exploration approach would search the tree as thoroughly as possible and become acquainted with all leaves. The exploitation approach would unerringly go to the best known leaf.

Analogous to the situation tree, we can also create an action tree. Here, the rewards for the actions are within the nodes. Now we have to adapt from daily life how exactly we learn.
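In practice, a common compromise between the two extremes is the standard ε-greedy rule: with a small probability ε an action is explored at random, otherwise the best known action is exploited. This rule is not introduced in the text above; the sketch below uses illustrative gridworld actions and value estimates.

```python
# Minimal sketch of epsilon-greedy action selection: mediate between
# exploitation (greedy choice of the best known action) and exploration
# (occasional random choice).
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """q_values: dict mapping action -> currently estimated reward."""
    if rng.random() < epsilon:                 # explore: pick any action
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)     # exploit: best known action

actions = {"N": 0.0, "E": 1.0, "S": -1.0, "W": 0.5}
print(epsilon_greedy(actions, epsilon=0.0))    # always exploits: "E"
```

With ε = 0 this degenerates to the greedy policy, with ε = 1 to pure exploration; values in between trade the two approaches off against each other.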
…an estimation of our situation. If we win, then…

This system finds the most rapid way to reach the target, because this way is automatically the most favorable one with respect to the reward. The agent receives punishment for anything it does – even if it does nothing. As a result, it is the least expensive method for the agent to reach the target fast.

Another strategy is the avoidance strategy: harmful situations are avoided. Here,

r_t ∈ {0, −1},   (C.12)

Most situations do not receive any reward; only a few of them receive a negative one. The agent will avoid getting too close to such negative situations.

Warning: Rewarding strategies can have unexpected consequences. A robot that is told "have it your own way, but if you touch an obstacle you will be punished" will simply stand still. If standing still is also punished, it will drive in small circles. Reconsidering this, we will understand that this behavior optimally fulfills the return of the robot but unfortunately was not intended to do so.

Thus, we can see that it would be more practical for the robot to be capable of evaluating the current and future situations.

C.2.2 The state-value function

So let us take a look at another system component of reinforcement learning: the state-value function V(s), which with regard to a policy Π is often called V_Π(s), because whether a situation is bad often depends on the general behavior Π of the agent. A situation being bad under a policy that is searching risks and checking out limits…

Unlike our agent, we have a godlike view of our gridworld, so that we can swiftly determine which robot starting position can provide which optimal return. In figure C.3 on the next page these optimal returns are entered per field.

In the gridworld: The state-value function for our gridworld exactly represents such a function per situation (= position), with the difference that here the function is unknown and has to be learned.

Figure C.7: We try different actions within the environment and as a result we learn and improve the policy.

…
– the previous return, weighted with a factor γ, of the following situation V(s_{t+1}),
– the previous value of the situation V(s_t).
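For illustration, the temporal difference update of the state-value function combines exactly the quantities listed above – a received reward, the γ-weighted value of the following situation V(s_{t+1}), and the previous value V(s_t). The following is a minimal Python sketch with illustrative situations, not taken from the original text.

```python
# Minimal sketch of a temporal difference (TD) update for state values:
# V(s_t)_new = V(s_t) + alpha * (r_{t+1} + gamma * V(s_{t+1}) - V(s_t))

def td_update(V, s_t, s_next, reward, alpha=0.1, gamma=0.9):
    """Shift V(s_t) a little towards reward + gamma * V(s_next)."""
    V[s_t] = V[s_t] + alpha * (reward + gamma * V[s_next] - V[s_t])

V = {"A": 0.0, "B": 1.0}                       # illustrative situations
td_update(V, "A", "B", reward=0.0, alpha=0.5, gamma=1.0)
print(V["A"])  # 0.0 + 0.5 * (0.0 + 1.0 - 0.0) = 0.5
```

The learning rate α controls how strongly a single experienced transition changes the stored value of a situation.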
C.2.6 Q learning

This implies Q_Π(s, a) as learning formula for the action-value function and, analogously to TD learning, its application is called Q learning:

Q(s_t, a)_new = Q(s_t, a) + α(r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a))

[Figure C.9 here shows a chain of situations s_0, s_1, . . . , s_τ: action a_i leads from situation s_i to s_{i+1} (direction of actions), while the rewards r_1, r_2, . . . , r_τ flow back (direction of reward).]

Figure C.9: Actions are performed until the desired target situation is achieved. Attention should be paid to the numbering: rewards are numbered beginning with 1, actions and situations beginning with 0 (this has simply been adopted as a convention).
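The Q learning rule Q(s_t, a)_new = Q(s_t, a) + α(r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a)) can be sketched directly in code. The gridworld states and actions below are only illustrative values, not taken from the original text.

```python
# Minimal sketch of one Q learning update on a table of action values.

def q_update(Q, s_t, a, s_next, reward, actions, alpha=0.1, gamma=0.9):
    """Q(s_t,a) += alpha * (reward + gamma * max_b Q(s_next,b) - Q(s_t,a))."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s_t, a)] += alpha * (reward + gamma * best_next - Q[(s_t, a)])

actions = ["N", "E", "S", "W"]                     # gridworld directions
Q = {((x, y), a): 0.0 for x in range(2) for y in range(2) for a in actions}

q_update(Q, (0, 0), "E", (1, 0), reward=1.0, actions=actions,
         alpha=0.5, gamma=0.9)
print(Q[((0, 0), "E")])  # 0.5 * (1.0 + 0.9 * 0.0 - 0.0) = 0.5
```

Note the max over the actions of the following situation: unlike plain TD learning, the update always assumes the best known follow-up action, which is why the result converges towards Q∗ regardless of the policy actually followed.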
…learning is: Π can be initialized arbitrarily, and by means of Q learning the result is always Q∗.

Definition C.13 (Q learning). Q learning trains the action-value function by means of the learning rule

Q(s_t, a)_new = Q(s_t, a) + α(r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a))   (C.15)

and thus finds Q∗ in any case.

C.3 Example applications

…played backgammon knows that the situation space is huge (approx. 10^20 situations). As a result, the state-value functions cannot be computed explicitly (particularly in the late eighties, when TD gammon was introduced). The selected rewarding strategy was the pure delayed reward, i.e. the system receives its reward only at the end of the game, and this reward is at the same time the return. The system was then allowed to practice (initially against a backgammon program, then against an instance of itself). The result was that it achieved the highest ranking in a computer backgammon league and strikingly disproved the theory that a computer program is not capable of mastering a task better than its programmer.
…the pit. Trivially, the executable actions here are the possibilities to drive forwards and backwards. The intuitive solution we immediately think of is to move backwards, to gain momentum at the opposite slope and to oscillate in this way several times in order to dash out of the pit.

The actions of a reinforcement learning system would be "full throttle forward", "full reverse" and "doing nothing".

Here, "everything costs" would be a good choice for awarding the reward, so that the system quickly learns how to leave the pit and realizes that our problem cannot be solved by mere forward-directed engine power. So the system will slowly build up the movement.

The policy can no longer be stored as a table, since the state space is hard to discretize. A function has to be generated as policy.

C.3.3 The pole balancer

The angle of the pole relative to the vertical line is referred to as α. Furthermore, the vehicle always has a fixed position x in our one-dimensional world and a velocity ẋ. Our one-dimensional world is limited, i.e. there are maximum and minimum values x can adopt.

The aim of our system is to learn to steer the car in such a way that it can balance the pole and prevent it from tipping over. This is best achieved by an avoidance strategy: as long as the pole is balanced, the reward is 0. If the pole tips over, the reward is −1.

Interestingly, the system is soon capable of keeping the pole balanced by tilting it sufficiently fast and with small movements. At this point the system mostly stays in the center of the space, since this is farthest away from the walls, which it understands as negative (if it touches the wall, the pole will tip over).
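The avoidance strategy for the pole balancer – reward 0 as long as the pole is balanced, −1 once it tips over – can be sketched as a tiny reward function. The threshold alpha_max below is a hypothetical value for "tipped over" chosen only for illustration.

```python
# Minimal sketch of the avoidance-strategy reward for the pole balancer.

def pole_reward(alpha, alpha_max=0.7):
    """alpha: current angle of the pole relative to the vertical line (radians).
    Returns 0 while the pole counts as balanced, -1 once it has tipped over."""
    return 0 if abs(alpha) < alpha_max else -1

print(pole_reward(0.1))   # 0  (balanced)
print(pole_reward(1.2))   # -1 (tipped over)
```

Since almost every time step yields reward 0, the return is dominated by the rare −1 events, which is exactly what makes the agent avoid the harmful situations.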
Exercises