
Neural Network Technology

Basic Concepts for Neural Networks

Contents

Real Neurons

Neural Network Structure

Neural Network Operation

Neural Network Learning

Appendix: Specific Formulae and Algorithms

Note: This document is an excerpt from the Neuralyst™ User's Guide, Chapter 3.

Real Neurons

Let's start by taking a look at a biological neuron. Figure 1 shows such a neuron.

Figure 1. A Biological Neuron

A neuron operates by receiving signals from other neurons through connections,
called synapses. The combination of these signals, in excess of a
certain threshold or activation level, will result in the neuron firing, that is sending a
signal on to other neurons connected to it. Some signals act as excitations and
others as inhibitions to a neuron firing. What we call thinking is believed to be the
collective effect of the presence or absence of firings in the pattern of synaptic
connections between neurons.

This sounds very simplistic until we recognize that there are approximately one
hundred billion (100,000,000,000) neurons each connected to as many as one
thousand (1,000) others in the human brain. The massive number of neurons and
the complexity of their interconnections result in a "thinking machine", your brain.

Each neuron has a body, called the soma. The soma is much like the body of any
other cell. It contains the cell nucleus, various bio-chemical factories and other
components that support ongoing activity.

Surrounding the soma are dendrites. The dendrites are receptors for signals
generated by other neurons. These signals may be excitatory or inhibitory. All
signals present at the dendrites of a neuron are combined and the result will
determine whether or not that neuron will fire.

If a neuron fires, an electrical impulse is generated. This impulse starts at the base,
called the hillock, of a long cellular extension, called the axon, and proceeds down
the axon to its ends.

The end of the axon is actually split into multiple ends, called the boutons. The
boutons are connected to the dendrites of other neurons and the resulting
interconnections are the previously discussed synapses. (Actually, the boutons do
not touch the dendrites; there is a small gap between them.) If a neuron has fired,
the electrical impulse that has been generated stimulates the boutons and results in
electrochemical activity which transmits the signal across the synapses to the
receiving dendrites.

At rest, the neuron maintains an electrical potential of about 40-60 millivolts. When
a neuron fires, an electrical impulse is created which is the result of a change in
potential to about 90-100 millivolts. This impulse travels at between 0.5 and 100 meters
per second and lasts for about 1 millisecond. Once a neuron fires, it must rest for
several milliseconds before it can fire again. In some circumstances, the repetition
rate may be as fast as 100 times per second, equivalent to 10 milliseconds per
firing.

Compare this to a very fast electronic computer whose signals travel at about
200,000,000 meters per second (speed of light in a wire is 2/3 of that in free air),
whose impulses last for 10 nanoseconds and may repeat such an impulse
immediately in each succeeding 10 nanoseconds continuously. Electronic computers
have at least a 2,000,000 times advantage in signal transmission speed and
1,000,000 times advantage in signal repetition rate.
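
The two ratios quoted above can be checked from the figures already given, taking the fastest neural signal speed (100 meters per second) and the shortest neural repetition interval (10 milliseconds) as the basis for comparison:

```latex
\frac{200{,}000{,}000\ \text{m/s}}{100\ \text{m/s}} = 2{,}000{,}000
\qquad\qquad
\frac{10\ \text{ms}}{10\ \text{ns}} = \frac{10^{-2}\ \text{s}}{10^{-8}\ \text{s}} = 1{,}000{,}000
```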

It is clear that if signal speed or rate were the sole criteria for processing
performance, electronic computers would win hands down. What the human brain
lacks in these, it makes up in numbers of elements and interconnection complexity
between those elements. This difference in structure manifests itself in at least one
important way; the human brain is not as quick as an electronic computer at
arithmetic, but it is many times faster and hugely more capable at recognition of
patterns and perception of relationships.

The human brain differs in another, extremely important, respect beyond speed; it
is capable of "self-programming" or adaptation in response to changing external
stimuli. In other words, it can learn. The brain has developed ways for neurons to
change their response to new stimulus patterns so that similar events may affect
future responses. In particular, the sensitivity to new patterns seems more
extensive in proportion to their importance to survival or if they are reinforced by
repetition.

Neural Network Structure

Neural networks are models of biological neural structures. The starting point for
most neural networks is a model neuron, as in Figure 2. This neuron consists of
multiple inputs and a single output. Each input is modified by a weight, which
multiplies with the input value. The neuron will combine these weighted inputs and,
with reference to a threshold value and activation function, use these to determine
its output. This behavior follows closely our understanding of how real neurons
work.

Figure 2. A Model Neuron

While there is a fair understanding of how an individual neuron works, there is still a
great deal of research and mostly conjecture regarding the way neurons organize
themselves and the mechanisms used by arrays of neurons to adapt their behavior
to external stimuli. There are a large number of experimental neural network
structures currently in use reflecting this state of continuing research.

In our case, we will only describe the structure, mathematics and behavior of that
structure known as the backpropagation network. This is the most prevalent and
generalized neural network currently in use. If the reader is interested in finding out
more about neural networks or other networks, please refer to the material listed in
the bibliography.

To build a backpropagation network, proceed in the following fashion. First, take a
number of neurons and array them to form a layer. A layer has all its inputs
connected to either a preceding layer or the inputs from the external world, but not
both within the same layer. A layer has all its outputs connected to either a
succeeding layer or the outputs to the external world, but not both within the same
layer.

Next, multiple layers are arrayed one succeeding the other so that there is an
input layer, multiple intermediate layers and finally an output layer, as in Figure 3.
Intermediate layers, that is those that have no inputs or outputs to the external
world, are called hidden layers. Backpropagation neural networks are usually fully
connected. This means that each neuron is connected to every output from the
preceding layer (or to one input from the external world if the neuron is in the
first layer) and, correspondingly, each neuron has its output connected to every
neuron in the succeeding layer.

Figure 3. Backpropagation Network

Generally, the input layer is considered a distributor of the signals from the external
world. Hidden layers are considered to be categorizers or feature detectors of such
signals. The output layer is considered a collector of the features detected and
producer of the response. While this view of the neural network may be helpful in
conceptualizing the functions of the layers, you should not take this model too
literally as the functions described may not be so specific or localized.

With this picture of how a neural network is constructed, we can now proceed to
describe the operation of the network in a meaningful fashion.

Neural Network Operation

The output of each neuron is a function of its inputs. In particular, the output of
the jth neuron in any layer is described by two sets of equations:

$$U_j = \sum_i X_i \, w_{ij} \qquad \text{[Eqn 1]}$$

and

$$Y_j = F_{th}(U_j + t_j) \qquad \text{[Eqn 2]}$$

For every neuron, j, in a layer, each of the i inputs, Xi, to that layer is multiplied by a
previously established weight, wij. These are all summed together, resulting in the
internal value of this operation, Uj. This value is then biased by a previously
established threshold value, tj, and sent through an activation function, Fth. This
activation function is usually the sigmoid function, which has an input to output
mapping as shown in Figure 4. The resulting output, Yj, is an input to the next layer
or it is a response of the neural network if it is the last layer. Neuralyst allows other
threshold functions to be used in place of the sigmoid described here.

Figure 4. Sigmoid Function

In essence, Equation 1 implements the combination operation of the neuron and
Equation 2 implements the firing of the neuron.
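
As a rough illustration of Equations 1 and 2 as they are described in the text, the following Java sketch computes the outputs of one fully connected layer. It assumes that the threshold value is added to the weighted sum before the sigmoid is applied, and that weights are stored as w[i][j] (input i, neuron j). The class and method names are illustrative only and are not part of Neuralyst.

```java
// Sketch of one layer's forward computation (all names are hypothetical).
class LayerForward {
    // Sigmoid activation: maps any real input into the interval (0, 1).
    static double sigmoid(double u) {
        return 1.0 / (1.0 + Math.exp(-u));
    }

    // x[i]    : inputs to the layer
    // w[i][j] : weight applied to input i by neuron j
    // t[j]    : threshold of neuron j (assumed here to be added to the sum)
    static double[] forward(double[] x, double[][] w, double[] t) {
        double[] y = new double[t.length];
        for (int j = 0; j < t.length; j++) {
            double u = 0.0;                 // Equation 1: weighted sum U_j
            for (int i = 0; i < x.length; i++) {
                u += x[i] * w[i][j];
            }
            y[j] = sigmoid(u + t[j]);       // Equation 2: bias, then activate
        }
        return y;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 0.5};
        double[][] w = {{0.2, -0.4}, {0.6, 0.8}};
        double[] t = {-0.5, 1.0};
        double[] y = forward(x, w, t);
        System.out.println(y[0] + " " + y[1]);
    }
}
```

Feeding the returned array into another call to `forward` with the next layer's weights and thresholds would chain layers together, which is all that computing a network response requires under these assumptions.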

From these equations, a predetermined set of weights, a predetermined set of
threshold values and a description of the network structure (that is the number of
layers and the number of neurons in each layer), it is possible to compute the
response of the neural network to any set of inputs. And this is just how Neuralyst
goes about producing the response. But how does it learn?

Neural Network Learning


Learning in a neural network is called training. Like training in athletics, training in a
neural network requires a coach, someone that describes to the neural network
what it should have produced as a response. From the difference between the
desired response and the actual response, the error is determined and a portion of
it is propagated backward through the network. At each neuron in the network the
error is used to adjust the weights and threshold values of the neuron, so that the
next time, the error in the network response will be less for the same inputs.

Figure 5. Neuron Weight Adjustment

This corrective procedure is called backpropagation (hence the name of the neural
network) and it is applied continuously and repetitively for each set of inputs and
corresponding set of outputs produced in response to the inputs. This procedure
continues so long as the individual or total errors in the responses exceed a
specified level or until there are no measurable errors. At this point, the neural
network has learned the training material and you can stop the training process and
use the neural network to produce responses to new input data.

[There is some heavier going in the next few paragraphs. Skip ahead if you don't
need to understand all the details of neural network learning.]

Backpropagation starts at the output layer with the following equations:

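$$w_{ij} = w'_{ij} + LR \cdot e_j \cdot X_i \qquad \text{[Eqn 3]}$$

and

$$e_j = Y_j \, (1 - Y_j) \, (d_j - Y_j) \qquad \text{[Eqn 4]}$$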

For the ith input of the jth neuron in the output layer, the weight wij is adjusted by
adding to the previous weight value, w'ij, a term determined by the product of
a learning rate, LR, an error term, ej, and the value of the ith input, Xi. The error
term, ej, for the jth neuron is determined by the product of the actual output, Yj, its
complement, 1 - Yj, and the difference between the desired output, dj, and the
actual output.

Once the error terms are computed and weights are adjusted for the output layer,
the values are recorded and the next layer back is adjusted. The same weight
adjustment process, determined by Equation 3, is followed, but the error term is
generated by a slightly modified version of Equation 4. This modification is:

$$e_j = Y_j \, (1 - Y_j) \sum_k e_k \, w'_{jk} \qquad \text{[Eqn 5]}$$

In this version, the difference between the desired output and the actual output is
replaced by the sum of the error terms for each neuron, k, in the layer immediately
succeeding the layer being processed (remember, we are going backwards through
the layers so these terms have already been computed) times the respective pre-
adjustment weights.
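
The three calculations described in words above can be sketched in Java as follows. This is only an illustration of the text's description of Equations 3, 4 and 5; the class name, method names and argument layout are assumptions made for the example and do not come from Neuralyst.

```java
// Sketch of the backpropagation quantities described in the text
// (all names are hypothetical).
class BackpropStep {
    // Equation 4 as described: error term for an output-layer neuron,
    // given its actual output y and desired output d.
    static double outputError(double y, double d) {
        return y * (1.0 - y) * (d - y);
    }

    // Equation 5 as described: error term for a neuron in an earlier layer.
    // eNext[k] : error terms already computed for the succeeding layer
    // wOut[k]  : pre-adjustment weight from this neuron to neuron k
    static double hiddenError(double y, double[] eNext, double[] wOut) {
        double sum = 0.0;
        for (int k = 0; k < eNext.length; k++) {
            sum += eNext[k] * wOut[k];
        }
        return y * (1.0 - y) * sum;
    }

    // Equation 3 as described: previous weight plus LR * error * input.
    static double adjustWeight(double wPrev, double lr, double e, double x) {
        return wPrev + lr * e * x;
    }

    public static void main(String[] args) {
        double e = outputError(0.5, 1.0);
        double w = adjustWeight(1.0, 0.5, e, 2.0);
        System.out.println("error term = " + e + ", new weight = " + w);
    }
}
```

Note that `hiddenError` takes the weights as they were before adjustment, which is why the text says the values are recorded before moving back a layer.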

The learning rate, LR, applies a greater or lesser portion of the respective
adjustment to the old weight. If the factor is set to a large value, then the neural
network may learn more quickly, but if there is a large variability in the input set
then the network may not learn very well or at all. In real terms, setting the learning
rate to a large value is analogous to giving a child a spanking, but that is
inappropriate and counter-productive to learning if the offense is so simple as
forgetting to tie their shoelaces. Usually, it is better to set the factor to a small
value and edge it upward if learning seems slow.

In many cases, it is useful to use a revised weight adjustment process. This is
described by the equation:

[Eqn 6]

This is similar to Equation 3, with a momentum factor, M, the previous weight, w'ij,
and the next to previous weight, w''ij, included in the last term. This extra term
allows for momentum in weight adjustment. Momentum basically allows a change
to the weights to persist for a number of adjustment cycles. The magnitude of the
persistence is controlled by the momentum factor. If the momentum factor is set to
0, then the equation reduces to that of Equation 3. If the momentum factor is
increased from 0, then increasingly greater persistence of previous adjustments is
allowed in modifying the current adjustment. This can improve the learning rate in
some situations, by helping to smooth out unusual conditions in the training set.

[Okay, that's the end of the equations. You can relax again.]

As you train the network, the total error, that is the sum of the errors over all the
training sets, will become smaller and smaller. Once the network reduces the total
error to the limit set, training may stop. You may then apply the network, using the
weights and thresholds as trained.

It is a good idea to set aside some subset of all the inputs available and reserve
them for testing the trained network. By comparing the output of a trained network
on these test sets to the outputs you know to be correct, you can gain greater
confidence in the validity of the training. If you are satisfied at this point, then the
neural network is ready for running.

Usually, no backpropagation takes place in this running mode as was done in the
training mode. This is because there is often no way to be immediately certain of
the desired response. If there were, there would be no need for the processing
capabilities of the neural network! Instead, as the validity of the neural network
outputs or predictions is verified or contradicted over time, you will either be
satisfied with the existing performance or determine a need for new training. In this
case, the additional input sets collected since the last training session may be used
to extend and improve the training data.


Cheshire Engineering Corporation


120 West Olive Avenue
Monrovia, California 91016
+1 626 303 1602 Neuralyst Sales
+1 626 303 1602 Customer Service and Support
+1 626 303 1590 FAX

EMAIL to <Neuralyst@CheshireEng.com>.

Copyright 1995-2003 Cheshire Engineering Corporation. All Rights Reserved


nnbg.htm last revised October 2003 by Ross Berteig

Chapter 10. Neural Networks

"You can't process me with a normal brain." (Charlie Sheen)

We're at the end of our story. This is the last official chapter of this book
(though I envision additional supplemental material for the website and
perhaps new chapters in the future). We began with inanimate objects
living in a world of forces and gave those objects desires, autonomy, and
the ability to take action according to a system of rules. Next, we allowed
those objects to live in a population and evolve over time. Now we ask:
What is each object's decision-making process? How can it adjust its
choices by learning over time? Can a computational entity process its
environment and generate a decision?

The human brain can be described as a biological neural network: an
interconnected web of neurons transmitting elaborate patterns of
electrical signals. Dendrites receive input signals and, based on those
inputs, fire an output signal via an axon. Or something like that. How the
human brain actually works is an elaborate and complex mystery, one that
we certainly are not going to attempt to tackle in rigorous detail in this
chapter.

Figure 10.1

The good news is that developing engaging animated systems with code
does not require scientific rigor or accuracy, as we've learned throughout
this book. We can simply be inspired by the idea of brain function.

In this chapter, we'll begin with a conceptual overview of the properties
and features of neural networks and build the simplest possible example
of one (a network that consists of a single neuron). Afterwards, we'll
examine strategies for creating a "Brain" object that can be inserted into
our Vehicle class and used to determine steering. Finally, we'll also look at
techniques for visualizing and animating a network of neurons.

10.1 Artificial Neural Networks: Introduction and Application

Computer scientists have long been inspired by the human brain. In 1943,
Warren S. McCulloch, a neuroscientist, and Walter Pitts, a logician,
developed the first conceptual model of an artificial neural network. In
their paper, "A logical calculus of the ideas immanent in nervous activity,"
they describe the concept of a neuron, a single cell living in a network of
cells that receives inputs, processes those inputs, and generates an output.

Their work, and the work of many scientists and researchers that
followed, was not meant to accurately describe how the biological brain
works. Rather, an artificial neural network (which we will now simply
refer to as a neural network) was designed as a computational model
based on the brain to solve certain kinds of problems.

It's probably pretty obvious to you that there are problems that are
incredibly simple for a computer to solve, but difficult for you. Take the
square root of 964,324, for example. A quick line of code produces the
value 982, a number Processing computed in less than a millisecond.
There are, on the other hand, problems that are incredibly simple for you
or me to solve, but not so easy for a computer. Show any toddler a picture
of a kitten or puppy and they'll be able to tell you very quickly which one
is which. Say hello and shake my hand one morning and you should be
able to pick me out of a crowd of people the next day. But need a machine
to perform one of these tasks? Scientists have already spent entire careers
researching and implementing complex solutions.
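
For reference, the square root mentioned above could be computed as in this minimal plain-Java sketch. The book's own examples use Processing, where the built-in `sqrt()` function plays the same role; the class and method names here are only illustrative.

```java
// Minimal sketch: the "quick line of code" for the square root example.
class SquareRootDemo {
    static double root(double value) {
        return Math.sqrt(value);
    }

    public static void main(String[] args) {
        // 982 * 982 = 964,324, so the result is a whole number.
        System.out.println(root(964324));
    }
}
```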

The most common application of neural networks in computing today is
to perform one of these "easy-for-a-human, difficult-for-a-machine" tasks,
often referred to as pattern recognition. Applications range from optical
character recognition (turning printed or handwritten scans into digital
text) to facial recognition. We don't have the time or need to use some of
these more elaborate artificial intelligence algorithms here, but if you are
interested in researching neural networks, I'd recommend the
books Artificial Intelligence: A Modern Approach by Stuart J. Russell and
Peter Norvig and AI for Game Developers by David M. Bourg and Glenn
Seemann.

Figure 10.2

A neural network is a "connectionist" computational system. The
computational systems we write are procedural; a program starts at the
first line of code, executes it, and goes on to the next, following
instructions in a linear fashion. A true neural network does not follow a
linear path. Rather, information is processed collectively, in parallel
throughout a network of nodes (the nodes, in this case, being neurons).
Here we have yet another example of a complex system, much like the
ones we examined in Chapters 6, 7, and 8. The individual elements of the
network, the neurons, are simple. They read an input, process it, and
generate an output. A network of many neurons, however, can exhibit
incredibly rich and intelligent behaviors.

One of the key elements of a neural network is its ability to learn. A neural
network is not just a complex system, but a complex adaptive system,
meaning it can change its internal structure based on the information
flowing through it. Typically, this is achieved through the adjusting
of weights. In the diagram above, each line represents a connection
between two neurons and indicates the pathway for the flow of
information. Each connection has a weight, a number that controls the
signal between the two neurons. If the network generates a "good" output
(which we'll define later), there is no need to adjust the weights. However,
if the network generates a "poor" output (an error, so to speak), then the
system adapts, altering the weights in order to improve subsequent
results.

There are several strategies for learning, and we'll examine two of them in
this chapter.

Supervised Learning: Essentially, a strategy that involves a
teacher that is smarter than the network itself. For example, let's
take the facial recognition example. The teacher shows the network
a bunch of faces, and the teacher already knows the name associated
with each face. The network makes its guesses, then the teacher
provides the network with the answers. The network can then
compare its answers to the known "correct" ones and make
adjustments according to its errors. Our first neural network in the
next section will follow this model.

Unsupervised Learning: Required when there isn't an example
data set with known answers. Imagine searching for a hidden
pattern in a data set. An application of this is clustering, i.e. dividing
a set of elements into groups according to some unknown pattern.
We won't be looking at any examples of unsupervised learning in
this chapter, as this strategy is less relevant for our examples.

Reinforcement Learning: A strategy built on observation.
Think of a little mouse running through a maze. If it turns left, it
gets a piece of cheese; if it turns right, it receives a little shock.
(Don't worry, this is just a pretend mouse.) Presumably, the mouse
will learn over time to turn left. Its neural network makes a decision
with an outcome (turn left or right) and observes its environment
(yum or ouch). If the observation is negative, the network can adjust
its weights in order to make a different decision the next time.
Reinforcement learning is common in robotics. At time t, the robot
performs a task and observes the results. Did it crash into a wall or
fall off a table? Or is it unharmed? We'll look at reinforcement
learning in the context of our simulated steering vehicles.

This ability of a neural network to learn, to make adjustments to its
structure over time, is what makes it so useful in the field of artificial
intelligence. Here are some standard uses of neural networks in software
today.

Pattern Recognition: We've mentioned this several times
already and it's probably the most common application. Examples
are facial recognition, optical character recognition, etc.

Time Series Prediction: Neural networks can be used to make
predictions. Will the stock rise or fall tomorrow? Will it rain or be
sunny?

Signal Processing: Cochlear implants and hearing aids need to
filter out unnecessary noise and amplify the important sounds.
Neural networks can be trained to process an audio signal and filter
it appropriately.

Control: You may have read about recent research advances in
self-driving cars. Neural networks are often used to manage steering
decisions of physical vehicles (or simulated ones).

Soft Sensors: A soft sensor refers to the process of analyzing a
collection of many measurements. A thermometer can tell you the
temperature of the air, but what if you also knew the humidity,
barometric pressure, dewpoint, air quality, air density, etc.? Neural
networks can be employed to process the input data from many
individual sensors and evaluate them as a whole.

Anomaly Detection: Because neural networks are so good at
recognizing patterns, they can also be trained to generate an output
when something occurs that doesn't fit the pattern. Think of a
neural network monitoring your daily routine over a long period of
time. After learning the patterns of your behavior, it could alert you
when something is amiss.

This is by no means a comprehensive list of applications of neural
networks. But hopefully it gives you an overall sense of the features and
possibilities. The thing is, neural networks are complicated and difficult.
They involve all sorts of fancy mathematics. While this is all fascinating
(and incredibly important to scientific research), a lot of the techniques
are not very practical in the world of building interactive, animated
Processing sketches. Not to mention that in order to cover all this
material, we would need another book, or more likely, a series of books.

So instead, we'll begin our last hurrah in the nature of code with the
simplest of all neural networks, in an effort to understand how the overall
concepts are applied in code. Then we'll look at some Processing sketches
that generate visual results inspired by these concepts.

10.2 The Perceptron

Invented in 1957 by Frank Rosenblatt at the Cornell Aeronautical
Laboratory, a perceptron is the simplest neural network possible: a
computational model of a single neuron. A perceptron consists of one or
more inputs, a processor, and a single output.

Figure 10.3: The perceptron

A perceptron follows the "feed-forward" model, meaning inputs are sent
into the neuron, are processed, and result in an output. In the diagram
above, this means the network (one neuron) reads from left to right:
inputs come in, output goes out.

Let's follow each of these steps in more detail.

Step 1: Receive inputs.

Say we have a perceptron with two inputs; let's call them x1 and x2.

Input 0: x1 = 12
Input 1: x2 = 4

Step 2: Weight inputs.

Each input that is sent into the neuron must first be weighted, i.e.
multiplied by some value (often a number between -1 and 1). When
creating a perceptron, we'll typically begin by assigning random weights.
Here, let's give the inputs the following weights:

Weight 0: 0.5
Weight 1: -1

We take each input and multiply it by its weight.

Input 0 * Weight 0 ⇒ 12 * 0.5 = 6

Input 1 * Weight 1 ⇒ 4 * -1 = -4

Step 3: Sum inputs.

The weighted inputs are then summed.

Sum = 6 + -4 = 2

Step 4: Generate output.

The output of a perceptron is generated by passing that sum through an
activation function. In the case of a simple binary output, the activation
function is what tells the perceptron whether to "fire" or not. You can
envision an LED connected to the output signal: if it fires, the light goes
on; if not, it stays off.

Activation functions can get a little bit hairy. If you start reading one of
those artificial intelligence textbooks looking for more info about
activation functions, you may soon find yourself reaching for a calculus
textbook. However, with our friend the simple perceptron, we're going to
do something really easy. Let's make the activation function the sign of
the sum. In other words, if the sum is a positive number, the output is 1; if
it is negative, the output is -1.

Output = sign(sum) ⇒ sign(2) ⇒ +1

Let's review and condense these steps so we can implement them with a
code snippet.

The Perceptron Algorithm:

1. For every input, multiply that input by its weight.

2. Sum all of the weighted inputs.

3. Compute the output of the perceptron based on that sum passed
through an activation function (the sign of the sum).

Let's assume we have two arrays of numbers, the inputs and the weights.
For example:

float[] inputs = {12, 4};
float[] weights = {0.5, -1};

"For every input" implies a loop that multiplies each input by its
corresponding weight. Since we need the sum, we can add up the results
in that very loop.

// Steps 1 and 2: Add up all the weighted inputs.
float sum = 0;
for (int i = 0; i < inputs.length; i++) {
  sum += inputs[i] * weights[i];
}

Once we have the sum we can compute the output.

// Step 3: Passing the sum through an activation function
float output = activate(sum);

// The activation function
int activate(float sum) {
  // Return a 1 if positive, -1 if negative.
  if (sum > 0) return 1;
  else return -1;
}
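
For readers who want to run the three steps outside Processing, the snippets above might be gathered into a single plain-Java class along these lines. The class and method names are assumptions made for this sketch; the logic is the same weighted sum followed by a sign activation.

```java
// Sketch: the perceptron algorithm's three steps as plain Java
// (class and method names are hypothetical).
class PerceptronSteps {
    // Steps 1 and 2: multiply each input by its weight and add up the results.
    static float weightedSum(float[] inputs, float[] weights) {
        float sum = 0;
        for (int i = 0; i < inputs.length; i++) {
            sum += inputs[i] * weights[i];
        }
        return sum;
    }

    // Step 3: sign activation, 1 for a positive sum and -1 otherwise.
    static int activate(float sum) {
        return (sum > 0) ? 1 : -1;
    }

    public static void main(String[] args) {
        float[] inputs = {12f, 4f};
        float[] weights = {0.5f, -1f};
        float sum = weightedSum(inputs, weights);
        System.out.println("sum = " + sum + ", output = " + activate(sum));
    }
}
```

With the example inputs and weights the sum is 2, so the output is +1, matching the worked steps. Note that, as written, a sum of exactly 0 is treated the same as a negative sum.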

10.3 Simple Pattern Recognition Using a Perceptron

Now that we understand the computational process of a perceptron, we
can look at an example of one in action. We stated that neural networks
are often used for pattern recognition applications, such as facial
recognition. Even simple perceptrons can demonstrate the basics of
classification, as in the following example.

Figure 10.4

Consider a line in two-dimensional space. Points in that space can be
classified as living on either one side of the line or the other. While this is
a somewhat silly example (since there is clearly no need for a neural
network; we can determine on which side a point lies with some simple
algebra), it shows how a perceptron can be trained to recognize points on
one side versus another.

Let's say a perceptron has 2 inputs (the x- and y-coordinates of a point).
Using a sign activation function, the output will either be -1 or 1, i.e., the
input data is classified according to the sign of the output. In the above
diagram, we can see how each point is either below the line (-1) or above
(+1).
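
The "simple algebra" mentioned above, which gives the known answer for any point, can be sketched as follows. The particular line y = 2x + 1 is an arbitrary assumption chosen for illustration, as are the class and method names; points above the line are labeled +1 and all others -1.

```java
// Sketch: labeling points relative to a line with plain algebra
// (the line and all names are hypothetical).
class LineClassifier {
    // Example line, assumed only for this sketch: y = 2x + 1.
    static float f(float x) {
        return 2 * x + 1;
    }

    // Known answer for a point: +1 if it lies above the line, -1 otherwise.
    static int knownAnswer(float x, float y) {
        return (y > f(x)) ? 1 : -1;
    }

    public static void main(String[] args) {
        System.out.println(knownAnswer(0f, 5f));  // above the line
        System.out.println(knownAnswer(0f, 0f));  // below the line
    }
}
```

Labels produced this way are the kind of known answers a perceptron could later be trained against.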

The perceptron itself can be diagrammed as follows:

Figure 10.5

We can see how there are two inputs (x and y), a weight for each input
(weight x and weight y), as well as a processing neuron that generates the
output.

There is a pretty significant problem here, however. Let's consider the
point (0,0). What if we send this point into the perceptron as its input: x =
0 and y = 0? What will the sum of its weighted inputs be? No matter what
the weights are, the sum will always be 0! But this can't be right. After all,
the point (0,0) could certainly be above or below various lines in our
two-dimensional world.

To avoid this dilemma, our perceptron will require a third input, typically
referred to as a bias input. A bias input always has the value of 1 and is
also weighted. Here is our perceptron with the addition of the bias:

Figure 10.6

Let's go back to the point (0,0). Here are our inputs:

0 * weight for x = 0
0 * weight for y = 0
1 * weight for bias = weight for bias

The output is the sum of the above three values, 0 plus 0 plus the bias's
weight. Therefore, the bias, on its own, answers the question as to where
(0,0) is in relation to the line. If the bias's weight is positive, (0,0) is
above the line; negative, it is below. It "biases" the perceptron's
understanding of the line's position relative to (0,0).
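
A small Java sketch can make the role of the bias concrete. The weight values below are arbitrary assumptions chosen for illustration, and the class and method names are hypothetical; the point is only that for the input (0,0) the x and y weights drop out and the sign of the bias weight alone decides the output.

```java
// Sketch: for the point (0,0), only the bias weight affects the output
// (weights and names are hypothetical).
class BiasDemo {
    // Weighted sum followed by the sign activation described earlier.
    static int feedforward(float[] inputs, float[] weights) {
        float sum = 0;
        for (int i = 0; i < weights.length; i++) {
            sum += inputs[i] * weights[i];
        }
        return (sum > 0) ? 1 : -1;
    }

    public static void main(String[] args) {
        float[] origin = {0f, 0f, 1f};  // x, y, and the bias input (always 1)
        float[] positiveBias = {0.7f, -0.3f, 0.2f};
        float[] negativeBias = {0.7f, -0.3f, -0.2f};
        System.out.println(feedforward(origin, positiveBias));
        System.out.println(feedforward(origin, negativeBias));
    }
}
```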

10.4 Coding the Perceptron

We're now ready to assemble the code for a Perceptron class. The only
data the perceptron needs to track are the input weights, and we could use
an array of floats to store these.

class Perceptron {
  float[] weights;

The constructor could receive an argument indicating the number of
inputs (in this case three: x, y, and a bias) and size the array accordingly.

  Perceptron(int n) {
    weights = new float[n];
    for (int i = 0; i < weights.length; i++) {
      // The weights are picked randomly to start.
      weights[i] = random(-1, 1);
    }
  }

A perceptron needs to be able to receive inputs and generate an output.
We can package these requirements into a function called feedforward().
In this example, we'll have the perceptron receive its inputs as an array
(which should be the same length as the array of weights) and return the
output as an integer.

  int feedforward(float[] inputs) {
    float sum = 0;
    for (int i = 0; i < weights.length; i++) {
      sum += inputs[i] * weights[i];
    }
    // Result is the sign of the sum, -1 or +1. Here the perceptron is making a guess.
    // Is it on one side of the line or the other?
    return activate(sum);
  }

Presumably, we could now create a Perceptron object and ask it to make a
guess for any given point.

Figure 10.7

// Create the Perceptron.
Perceptron p = new Perceptron(3);
// The input is 3 values: x, y and bias.
float[] point = {50, -12, 1};
// The answer!
int result = p.feedforward(point);

Did the perceptron get it right? At this point, the perceptron has no better
than a 50/50 chance of arriving at the right answer. Remember, when we
created it, we gave each weight a random value. A neural network isn't
magic. It's not going to be able to guess anything correctly unless we teach
it how to!
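
For readers working outside Processing, the untrained perceptron described above might look roughly like this in plain Java. Here `java.util.Random` stands in for Processing's `random(-1, 1)`, and the seed argument is an addition made for this sketch so that runs are repeatable; neither is part of the original class, and the name `PlainPerceptron` is hypothetical.

```java
// Sketch: an untrained perceptron in plain Java (names are hypothetical).
class PlainPerceptron {
    float[] weights;

    // n is the number of inputs, including the bias input.
    PlainPerceptron(int n, long seed) {
        java.util.Random rng = new java.util.Random(seed);
        weights = new float[n];
        for (int i = 0; i < weights.length; i++) {
            // Random starting weight in the range [-1, 1).
            weights[i] = rng.nextFloat() * 2 - 1;
        }
    }

    // Weighted sum of the inputs passed through the sign activation.
    int feedforward(float[] inputs) {
        float sum = 0;
        for (int i = 0; i < weights.length; i++) {
            sum += inputs[i] * weights[i];
        }
        return activate(sum);
    }

    int activate(float sum) {
        return (sum > 0) ? 1 : -1;
    }

    public static void main(String[] args) {
        PlainPerceptron p = new PlainPerceptron(3, 42L);
        float[] point = {50f, -12f, 1f};  // x, y, and bias
        // With random weights this is only a guess, not a learned answer.
        System.out.println(p.feedforward(point));
    }
}
```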

To train a neural network to answer correctly, we're going to employ the
method of supervised learning that we described in section 10.1.

With this method, the network is provided with inputs for which there is a
known answer. This way the network can find out if it has made a correct
guess. If it's incorrect, the network can learn from its mistake and adjust
its weights. The process is as follows:

1. Provide the perceptron with inputs for which there is a known


answer.

2. Ask the perceptron to guess an answer.

3. Compute the error. (Did it get the answer right or wrong?)

4. Adjust all the weights according to the error.

5. Return to Step 1 and repeat!

Steps 1 through 4 can be packaged into a function. Before we can write the
entire function, however, we need to examine Steps 3 and 4 in more
detail. How do we define the perceptrons error? And how should we
adjust the weights according to this error?

The perceptrons error can be defined as the difference between the


desired answer and its guess.
ERROR = DESIRED OUTPUT - GUESS OUTPUT

The above formula may look familiar to you. In Chapter 6, we computed a
steering force as the difference between our desired velocity and our
current velocity.

STEERING = DESIRED VELOCITY - CURRENT VELOCITY

This was also an error calculation. The current velocity acts as a guess and
the error (the steering force) tells us how to adjust the velocity in the right
direction. In a moment, we'll see how adjusting the vehicle's velocity to
follow a target is just like adjusting the weights of a neural network to
arrive at the right answer.

In the case of the perceptron, the output has only two possible
values: +1 or -1. This means there are only three possible errors.

If the perceptron guesses the correct answer, then the guess equals the
desired output and the error is 0. If the correct answer is -1 and we've
guessed +1, then the error is -2. If the correct answer is +1 and we've
guessed -1, then the error is +2.

Desired   Guess   Error
-1        -1       0
-1        +1      -2
+1        -1      +2
+1        +1       0

The error is the determining factor in how the perceptron's weights
should be adjusted. For any given weight, what we are looking to calculate
is the change in weight, often called Δweight (or "delta weight," delta
being the Greek letter Δ).

NEW WEIGHT = WEIGHT + ΔWEIGHT

Δweight is calculated as the error multiplied by the input.

ΔWEIGHT = ERROR * INPUT

Therefore:

NEW WEIGHT = WEIGHT + ERROR * INPUT

To understand why this works, we can again return to steering. A steering
force is essentially an error in velocity. If we apply that force as our
acceleration (Δvelocity), then we adjust our velocity to move in the correct
direction. This is what we want to do with our neural network's weights.
We want to adjust them in the right direction, as defined by the error.

With steering, however, we had an additional variable that controlled the
vehicle's ability to steer: the maximum force. With a high maximum force,
the vehicle was able to accelerate and turn very quickly; with a lower
force, the vehicle would take longer to adjust its velocity. The neural
network will employ a similar strategy with a variable called the learning
constant. We'll add in the learning constant as follows:

NEW WEIGHT = WEIGHT + ERROR * INPUT * LEARNING CONSTANT

Notice that a high learning constant means the weight will change more
drastically. This may help us arrive at a solution more quickly, but with
such large changes in weight it's possible we will overshoot the optimal
weights. With a small learning constant, the weights will be adjusted
slowly, requiring more training time but allowing the network to make
very small adjustments that could improve the network's overall accuracy.
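
As a rough illustration of that trade-off, the sketch below applies the formula once with two different learning constants. It is plain Java rather than Processing, and the class name WeightUpdateDemo, the starting weight, and the two constants are all assumptions picked to keep the arithmetic easy to follow.

```java
// Worked example of NEW WEIGHT = WEIGHT + ERROR * INPUT * LEARNING CONSTANT.
// All names and numbers here are illustrative assumptions.
class WeightUpdateDemo {

  // Returns the adjusted weight for one input.
  static float updateWeight(float weight, float error, float input, float c) {
    return weight + error * input * c;
  }

  public static void main(String[] args) {
    float weight = 0.5f; // hypothetical starting weight
    float input = 50;    // the x value of the point being classified
    float error = -2;    // desired -1, guessed +1

    // A small constant nudges the weight: 0.5 + (-2 * 50 * 0.001) = 0.4
    System.out.println(updateWeight(weight, error, input, 0.001f));
    // A constant 100 times larger jumps much further: 0.5 + (-2 * 50 * 0.1) = -9.5
    System.out.println(updateWeight(weight, error, input, 0.1f));
  }
}
```

Because the inputs here are pixel coordinates that can run into the hundreds, even a modest constant produces a large step, which is one reason the examples in this chapter use small values.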

Assuming the addition of a variable c for the learning constant, we can
now write a training function for the perceptron following the above steps.

// A new variable is introduced to control the learning rate.
float c = 0.01;

// Step 1: Provide the inputs and known answer.
// These are passed in as arguments to train().
void train(float[] inputs, int desired) {
  // Step 2: Guess according to those inputs.
  int guess = feedforward(inputs);
  // Step 3: Compute the error (difference between answer and guess).
  float error = desired - guess;
  // Step 4: Adjust all the weights according to the error and learning constant.
  for (int i = 0; i < weights.length; i++) {
    weights[i] += c * error * inputs[i];
  }
}

We can now see the Perceptron class as a whole.

class Perceptron {
  // The Perceptron stores its weights and learning constant.
  float[] weights;
  float c = 0.01;

  Perceptron(int n) {
    weights = new float[n];
    // Weights start off random.
    for (int i = 0; i < weights.length; i++) {
      weights[i] = random(-1,1);
    }
  }

  // Return an output based on inputs.
  int feedforward(float[] inputs) {
    float sum = 0;
    for (int i = 0; i < weights.length; i++) {
      sum += inputs[i]*weights[i];
    }
    return activate(sum);
  }

  // Output is a +1 or -1.
  int activate(float sum) {
    if (sum > 0) return 1;
    else return -1;
  }

  // Train the network against known data.
  void train(float[] inputs, int desired) {
    int guess = feedforward(inputs);
    float error = desired - guess;
    for (int i = 0; i < weights.length; i++) {
      weights[i] += c * error * inputs[i];
    }
  }
}

To train the perceptron, we need a set of inputs with a known answer. We
could package this up in a class like so:

class Trainer {
  // A "Trainer" object stores the inputs and the correct answer.
  float[] inputs;
  int answer;

  Trainer(float x, float y, int a) {
    inputs = new float[3];
    inputs[0] = x;
    inputs[1] = y;
    // Note that the Trainer has the bias input built into its array.
    inputs[2] = 1;
    answer = a;
  }
}

Now the question becomes, how do we pick a point and know whether it is
above or below a line? Let's start with the formula for a line, where y is
calculated as a function of x:

y = f(x)

In generic terms, a line can be described as:

y = ax + b

Here's a specific example:

y = 2*x + 1

We can then write a Processing function with this in mind.

// A function to calculate y based on x along a line
float f(float x) {
  return 2*x+1;
}

So, if we make up a point:

float x = random(width);
float y = random(height);

How do we know if this point is above or below the line? The line
function f(x) gives us the y value on the line for that x position. Let's call
that yline.

// The y position on the line
float yline = f(x);

If the y value we are examining is above the line, it will be less than yline,
because the y-axis in a Processing window points down.

Figure 10.8

int answer;
if (y < yline) {
  // The answer is -1 if y is above the line.
  answer = -1;
} else {
  answer = 1;
}

We can then make a Trainer object with the inputs and the correct answer.

Trainer t = new Trainer(x, y, answer);

Assuming we had a Perceptron object ptron, we could then train it by
sending the inputs along with the known answer.

ptron.train(t.inputs,t.answer);

Now, it's important to remember that this is just a demonstration.
Remember our Shakespeare-typing monkeys? We asked our genetic
algorithm to solve for "to be or not to be," an answer we already knew.
We did this to make sure our genetic algorithm worked properly. The
same reasoning applies to this example. We don't need a perceptron to tell
us whether a point is above or below a line; we can do that with simple
math. We are using this scenario, one that we can easily solve without a
perceptron, to demonstrate the perceptron's algorithm as well as easily
confirm that it is working properly.

Let's look at how the perceptron works with an array of many training
points.

Example 10.1: The Perceptron

// The Perceptron
Perceptron ptron;
// 2,000 training points
Trainer[] training = new Trainer[2000];
int count = 0;

// The formula for a line
float f(float x) {
  return 2*x+1;
}

void setup() {
  size(640, 360);

  ptron = new Perceptron(3);

  // Make 2,000 training points.
  for (int i = 0; i < training.length; i++) {
    float x = random(-width/2,width/2);
    float y = random(-height/2,height/2);
    // Is the correct answer 1 or -1?
    int answer = 1;
    if (y < f(x)) answer = -1;
    training[i] = new Trainer(x, y, answer);
  }
}

void draw() {
  background(255);
  translate(width/2,height/2);

  // For animation, we are training one point at a time.
  ptron.train(training[count].inputs, training[count].answer);
  count = (count + 1) % training.length;

  for (int i = 0; i < count; i++) {
    stroke(0);
    int guess = ptron.feedforward(training[i].inputs);
    // Show the classification: no fill for +1, black for -1.
    if (guess > 0) noFill();
    else fill(0);
    ellipse(training[i].inputs[0], training[i].inputs[1], 8, 8);
  }
}
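
The same training loop can also be checked without a Processing window. What follows is a minimal plain-Java sketch under a few stated assumptions: the class name HeadlessPerceptron, the helper trainUntilCorrect(), and the six hand-picked points are all hypothetical, and the weights start at zero rather than at random values so that every run behaves identically.

```java
// Headless sketch of the perceptron's training loop.
// The class, the helper methods, and the data points are illustrative assumptions.
class HeadlessPerceptron {
  float[] weights;
  float c = 0.01f; // learning constant, as in the Perceptron class

  HeadlessPerceptron(int n) {
    weights = new float[n]; // all zeros (not random), so runs are repeatable
  }

  int feedforward(float[] inputs) {
    float sum = 0;
    for (int i = 0; i < weights.length; i++) {
      sum += inputs[i] * weights[i];
    }
    return (sum > 0) ? 1 : -1;
  }

  // Trains on one point and reports whether the guess was already correct.
  boolean train(float[] inputs, int desired) {
    int guess = feedforward(inputs);
    float error = desired - guess;
    for (int i = 0; i < weights.length; i++) {
      weights[i] += c * error * inputs[i];
    }
    return guess == desired;
  }

  // The line from the example.
  static float f(float x) {
    return 2 * x + 1;
  }

  // Same labeling rule as setup(): -1 if y < f(x), otherwise +1.
  static int label(float x, float y) {
    return (y < f(x)) ? -1 : 1;
  }

  // Repeats passes over the points until one full pass has no mistakes
  // (or maxPasses is reached). Returns the number of passes used.
  static int trainUntilCorrect(HeadlessPerceptron p, float[][] points, int maxPasses) {
    for (int pass = 1; pass <= maxPasses; pass++) {
      boolean allCorrect = true;
      for (float[] pt : points) {
        float[] inputs = {pt[0], pt[1], 1}; // bias input appended
        if (!p.train(inputs, label(pt[0], pt[1]))) allCorrect = false;
      }
      if (allCorrect) return pass;
    }
    return maxPasses;
  }

  static final float[][] POINTS = {
    {0, 5}, {0, -5}, {3, 0}, {-3, 0}, {2, 10}, {-2, -10}
  };

  public static void main(String[] args) {
    HeadlessPerceptron p = new HeadlessPerceptron(3);
    int passes = trainUntilCorrect(p, POINTS, 1000);
    System.out.println("passes used: " + passes);
    for (float[] pt : POINTS) {
      int guess = p.feedforward(new float[] {pt[0], pt[1], 1});
      System.out.println(pt[0] + ", " + pt[1] + " -> guess " + guess
          + ", answer " + label(pt[0], pt[1]));
    }
  }
}
```

Stopping when a full pass produces no mistakes is a design choice of this sketch; the animated example simply keeps cycling through its training points forever.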

Exercise 10.1
Instead of using the supervised learning model above, can you train the
neural network to find the right weights by using a genetic algorithm?

Exercise 10.2
Visualize the perceptron itself. Draw the inputs, the processing node, and
the output.

10.5 A Steering Perceptron


While classifying points according to their position above or below a line
was a useful demonstration of the perceptron in action, it doesn't have
much practical relevance to the other examples throughout this book. In
this section, we'll take the concepts of a perceptron (array of inputs, single
output), apply it to steering behaviors, and demonstrate reinforcement
learning along the way.

We are now going to take significant creative license with the concept of a
neural network. This will allow us to stick with the basics and avoid some
of the highly complex algorithms associated with more sophisticated
neural networks. Here we're not so concerned with following rules
outlined in artificial intelligence textbooks; we're just hoping to make
something interesting and brain-like.

Remember our good friend the Vehicle class? You know, that one for
making objects with a location, velocity, and acceleration? That could
obey Newton's laws with an applyForce() function and move around the
window according to a variety of steering rules?

What if we added one more variable to our Vehicle class?

class Vehicle {
  // Giving the vehicle a brain!
  Perceptron brain;

  PVector location;
  PVector velocity;
  PVector acceleration;
  //etc...

Here's our scenario. Let's say we have a Processing sketch with
an ArrayList of targets and a single vehicle.

Figure 10.9

Let's say that the vehicle seeks all of the targets. According to the
principles of Chapter 6, we would next write a function that calculates a
steering force towards each target, applying each force one at a time to the
object's acceleration. Assuming the targets are
an ArrayList of PVector objects, it would look something like:

void seek(ArrayList<PVector> targets) {
  for (PVector target : targets) {
    // For every target, apply a steering force towards the target.
    PVector force = seek(target);
    applyForce(force);
  }
}

In Chapter 6, we also examined how we could create more dynamic
simulations by weighting each steering force according to some rule. For
example, we could say that the farther you are from a target, the stronger
the force.

void seek(ArrayList<PVector> targets) {
  for (PVector target : targets) {
    PVector force = seek(target);
    float d = PVector.dist(target,location);
    float weight = map(d,0,width,0,5);
    // Weighting each steering force individually
    force.mult(weight);
    applyForce(force);
  }
}

But what if instead we could ask our brain (i.e. perceptron) to take in all
the forces as an input, process them according to weights of the
perceptron inputs, and generate an output steering force? What if we
could instead say:

void seek(ArrayList<PVector> targets) {
  // Make an array of inputs for our brain.
  PVector[] forces = new PVector[targets.size()];

  for (int i = 0; i < forces.length; i++) {
    // Fill the array with a steering force for each target.
    forces[i] = seek(targets.get(i));
  }

  // Ask our brain for a result and apply that as the force!
  PVector output = brain.process(forces);
  applyForce(output);
}

In other words, instead of weighting and accumulating the forces inside
our vehicle, we simply pass an array of forces to the vehicle's "brain"
object and allow the brain to weight and sum the forces for us. The output
is then applied as a steering force. This opens up a range of possibilities. A
vehicle could make decisions as to how to steer on its own, learning from
its mistakes and responding to stimuli in its environment. Let's see how
this works.

We can use the line classification perceptron as a model, with one
important difference: the inputs are not single numbers, but vectors!
Let's look at how the feedforward() function works in our vehicle's
perceptron, alongside the one from our previous example.

Vehicle: PVector inputs

PVector feedforward(PVector[] forces) {
  // Sum is a PVector.
  PVector sum = new PVector();
  for (int i = 0; i < weights.length; i++) {
    // Vector addition and multiplication
    forces[i].mult(weights[i]);
    sum.add(forces[i]);
  }
  // No activation function
  return sum;
}

Line: float inputs

int feedforward(float[] inputs) {
  // Sum is a float.
  float sum = 0;
  for (int i = 0; i < weights.length; i++) {
    // Scalar addition and multiplication
    sum += inputs[i]*weights[i];
  }
  // Activation function
  return activate(sum);
}

Note how these two functions implement nearly identical algorithms, with
two differences:

1. Summing PVectors. Instead of a series of numbers added
together, each input is a PVector and must be multiplied by the
weight and added to a sum according to the
mathematical PVector functions.

2. No activation function. In this case, we're taking the result and
applying it directly as a steering force for the vehicle, so we're not
asking for a simple boolean value that classifies it in one of two
categories. Rather, we're asking for raw output itself, the resulting
overall force.

Once the resulting steering force has been applied, it's time to give
feedback to the brain, i.e. reinforcement learning. Was the decision to
steer in that particular direction a good one or a bad one? Presumably if
some of the targets were predators (resulting in being eaten) and some of
the targets were food (resulting in greater health), the network would
adjust its weights in order to steer away from the predators and towards
the food.

Let's take a simpler example, where the vehicle simply wants to stay close
to the center of the window. We'll train the brain as follows:

PVector desired = new PVector(width/2,height/2);
PVector error = PVector.sub(desired, location);
brain.train(forces,error);

Figure 10.10

Here we are passing the brain a copy of all the inputs (which it will need
for error correction) as well as an observation about its environment:
a PVector that points from its current location to where it desires to be.
This PVector essentially serves as the error: the longer the PVector, the
worse the vehicle is performing; the shorter, the better.

The brain can then apply this "error" vector (which has two error values,
one for x and one for y) as a means for adjusting the weights, just as we
did in the line classification example.

Training the Vehicle

void train(PVector[] forces, PVector error) {
  for (int i = 0; i < weights.length; i++) {
    weights[i] += c*error.x*forces[i].x;
    weights[i] += c*error.y*forces[i].y;
  }
}

Training the Line Classifier

void train(float[] inputs, int desired) {
  int guess = feedforward(inputs);
  float error = desired - guess;
  for (int i = 0; i < weights.length; i++) {
    weights[i] += c * error * inputs[i];
  }
}

Because the vehicle observes its own error, there is no need to calculate
one; we can simply receive the error as an argument. Notice how the
change in weight is processed twice, once for the error along the x-axis
and once for the y-axis.

weights[i] += c*error.x*forces[i].x;
weights[i] += c*error.y*forces[i].y;
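
To see the numbers move, here is a small plain-Java sketch of the vector version of feedforward() and train(). It is only an approximation of the idea under stated assumptions: Vec2 is a hypothetical stand-in for PVector, VectorPerceptron is an illustrative name, and the weights and forces are hand-picked. One deliberate difference: this feedforward() builds a new sum instead of scaling the input vectors in place, so the same unscaled forces can be passed to train() afterwards.

```java
// Sketch of the vector-valued perceptron. Vec2 stands in for PVector (an assumption).
class Vec2 {
  final double x, y;
  Vec2(double x, double y) { this.x = x; this.y = y; }
}

class VectorPerceptron {
  double[] weights;
  double c; // learning constant

  VectorPerceptron(double[] initialWeights, double c) {
    this.weights = initialWeights.clone();
    this.c = c;
  }

  // Weighted sum of the force vectors; no activation function.
  // The inputs are left untouched.
  Vec2 feedforward(Vec2[] forces) {
    double sx = 0, sy = 0;
    for (int i = 0; i < weights.length; i++) {
      sx += forces[i].x * weights[i];
      sy += forces[i].y * weights[i];
    }
    return new Vec2(sx, sy);
  }

  // Each weight is adjusted twice: once for the x error, once for the y error.
  void train(Vec2[] forces, Vec2 error) {
    for (int i = 0; i < weights.length; i++) {
      weights[i] += c * error.x * forces[i].x;
      weights[i] += c * error.y * forces[i].y;
    }
  }

  public static void main(String[] args) {
    Vec2[] forces = { new Vec2(1, 0), new Vec2(0, 1) };
    VectorPerceptron brain = new VectorPerceptron(new double[] {0.5, 2.0}, 0.001);

    // 0.5 * (1,0) + 2.0 * (0,1) = (0.5, 2.0)
    Vec2 out = brain.feedforward(forces);
    System.out.println(out.x + ", " + out.y);

    // weights[0] grows by 0.001 * 10 * 1 = 0.01
    // weights[1] changes by 0.001 * (-20) * 1 = -0.02
    brain.train(forces, new Vec2(10, -20));
    System.out.println(brain.weights[0] + ", " + brain.weights[1]);
  }
}
```

A force that points the same way as the error has its weight increased, and a force that points against it has its weight reduced, which is the sense in which the vehicle "learns" which targets help it.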

We can now look at the Vehicle class and see how the steer function uses a
perceptron to control the overall steering force. The new content from this
chapter is highlighted.

Example 10.2: Perceptron steering

class Vehicle {

  // The Vehicle now has a brain.
  Perceptron brain;

  // Same old variables for physics
  PVector location;
  PVector velocity;
  PVector acceleration;
  float maxforce;
  float maxspeed;

  // The Vehicle creates a perceptron with n inputs and a learning constant.
  Vehicle(int n, float x, float y) {
    brain = new Perceptron(n,0.001);
    acceleration = new PVector(0,0);
    velocity = new PVector(0,0);
    location = new PVector(x,y);
    maxspeed = 4;
    maxforce = 0.1;
  }

  // Same old update() function
  void update() {
    velocity.add(acceleration);
    velocity.limit(maxspeed);
    location.add(velocity);
    acceleration.mult(0);
  }

  // Same old applyForce() function
  void applyForce(PVector force) {
    acceleration.add(force);
  }

  void steer(ArrayList<PVector> targets) {
    PVector[] forces = new PVector[targets.size()];

    for (int i = 0; i < forces.length; i++) {
      forces[i] = seek(targets.get(i));
    }

    // All the steering forces are inputs.
    PVector result = brain.feedforward(forces);

    // The result is applied.
    applyForce(result);

    // The brain is trained according to the distance to the center.
    PVector desired = new PVector(width/2,height/2);
    PVector error = PVector.sub(desired, location);
    brain.train(forces,error);
  }

  // Same old seek() function
  PVector seek(PVector target) {
    PVector desired = PVector.sub(target,location);
    desired.normalize();
    desired.mult(maxspeed);
    PVector steer = PVector.sub(desired,velocity);
    steer.limit(maxforce);
    return steer;
  }
}

Exercise 10.3
Visualize the weights of the network. Try mapping each target's
corresponding weight to its brightness.

Exercise 10.4
Try different rules for reinforcement learning. What if some targets are
desirable and some are undesirable?

10.6 It's a Network, Remember?

Yes, a perceptron can have multiple inputs, but it is still a lonely neuron.
The power of neural networks comes in the networking itself. Perceptrons
are, sadly, incredibly limited in their abilities. If you read an AI textbook,
it will say that a perceptron can only solve linearly
separable problems. What's a linearly separable problem? Let's take a
look at our first example, which determined whether points were on one
side of a line or the other.
Figure 10.11

On the left of Figure 10.11, we have classic linearly separable data. Graph
all of the possibilities; if you can classify the data with a straight line, then
it is linearly separable. On the right, however, is non-linearly separable
data. You can't draw a straight line to separate the black dots from the
gray ones.

One of the simplest examples of a non-linearly separable problem is XOR,
or "exclusive or." We're all familiar with AND. For A AND B to be true,
both A and B must be true. With OR, either A or B can be true
for A OR B to evaluate as true. These are both linearly separable problems.
Let's look at the solution space, a truth table.
Figure 10.12

See how you can draw a line to separate the true outputs from the false
ones?

XOR is the equivalent of OR and NOT AND: A XOR B is true when A OR B is
true and A AND B is not. In other words, A XOR B only evaluates to true
if exactly one of them is true. If both are false or both are true,
then we get false. Take a look at the following truth table.

Figure 10.13

This is not linearly separable. Try to draw a straight line to separate the
true outputs from the false ones. You can't!

So perceptrons can't even solve something as simple as XOR. But what if
we made a network out of two perceptrons? If one perceptron can
solve OR and one perceptron can solve NOT AND, then two perceptrons
combined can solve XOR.
Figure 10.14

The above diagram is known as a multi-layered perceptron, a network of
many neurons. Some are input neurons and receive the inputs, some are
part of what's called a "hidden" layer (as they are connected to neither the
inputs nor the outputs of the network directly), and then there are the
output neurons, from which we read the results.
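
As a rough sketch of how such a network can compute XOR, the plain-Java code below wires two hidden units (OR and NOT AND) into an output unit that requires both to be on. The weights are hand-picked rather than learned, the units output 1 or 0 instead of +1 or -1 so the result reads like a truth table, and the class name XorNetwork is an assumption made for this illustration.

```java
// Sketch: XOR built from three threshold units with hand-picked (not learned) weights.
// Names, weights, and the 0/1 output convention are illustrative assumptions.
class XorNetwork {

  // One perceptron-style unit over inputs a, b and a bias input of 1; outputs 1 or 0.
  static int unit(float wa, float wb, float wBias, int a, int b) {
    float sum = a * wa + b * wb + 1 * wBias;
    return (sum > 0) ? 1 : 0;
  }

  static int or(int a, int b)   { return unit( 1,  1, -0.5f, a, b); }
  static int nand(int a, int b) { return unit(-1, -1,  1.5f, a, b); } // NOT AND
  static int and(int a, int b)  { return unit( 1,  1, -1.5f, a, b); }

  // Hidden layer: OR and NOT AND. Output unit: on only when both hidden units are on.
  static int xor(int a, int b) {
    return and(or(a, b), nand(a, b));
  }

  public static void main(String[] args) {
    for (int a = 0; a <= 1; a++) {
      for (int b = 0; b <= 1; b++) {
        System.out.println(a + " XOR " + b + " = " + xor(a, b));
      }
    }
  }
}
```

Each unit on its own still draws a single straight line through the inputs; it is the combination of two lines by the output unit that carves out the XOR region.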

Training these networks is much more complicated. With the simple
perceptron, we could easily evaluate how to change the weights according
to the error. But here there are so many different connections, each in a
different layer of the network. How does one know how much each neuron
or connection contributed to the overall error of the network?

The solution to optimizing weights of a multi-layered network is known
as backpropagation. The output of the network is generated in the
same manner as a perceptron. The inputs multiplied by the weights are
summed and fed forward through the network. The difference here is that
they pass through additional layers of neurons before reaching the output.
Training the network (i.e. adjusting the weights) also involves taking the
error (desired result - guess). The error, however, must be fed backwards
through the network. The final error ultimately adjusts the weights of all
the connections.

Backpropagation is a bit beyond the scope of this book and involves a
fancier activation function (called the sigmoid function) as well as some
basic calculus. If you are interested in how backpropagation works, check
the book website (and GitHub repository) for an example that
solves XOR using a multi-layered feed forward network with
backpropagation.

Instead, here we'll focus on a code framework for building the visual
architecture of a network. We'll make Neuron objects
and Connection objects from which a Network object can be created and
animated to show the feed forward process. This will closely resemble
some of the force-directed graph examples we examined in Chapter 5
(toxiclibs).

10.7 Neural Network Diagrams


Our goal will be to create the following simple network diagram:
Figure 10.15

The primary building block for this diagram is a neuron. For the purpose
of this example, the Neuron class describes an entity with an (x,y) location.

// An incredibly simple Neuron class stores and displays the location of a single neuron.
class Neuron {
  PVector location;

  Neuron(float x, float y) {
    location = new PVector(x, y);
  }

  void display() {
    stroke(0);
    fill(0);
    ellipse(location.x, location.y, 16, 16);
  }
}

The Network class can then manage an ArrayList of neurons, as well as
have its own location (so that each neuron is drawn relative to the
network's center). This is particle systems 101. We have a single element
(a neuron) and a network (a "system" of many neurons).

// A Network is a list of neurons.
class Network {
  ArrayList<Neuron> neurons;
  PVector location;

  Network(float x, float y) {
    location = new PVector(x,y);
    neurons = new ArrayList<Neuron>();
  }

  // We can add a neuron to the network.
  void addNeuron(Neuron n) {
    neurons.add(n);
  }

  // We can draw the entire network.
  void display() {
    pushMatrix();
    translate(location.x, location.y);
    for (Neuron n : neurons) {
      n.display();
    }
    popMatrix();
  }
}

Now we can pretty easily make the diagram above.

Network network;

void setup() {
  size(640, 360);
  // Make a Network.
  network = new Network(width/2,height/2);

  // Make the Neurons.
  Neuron a = new Neuron(-200,0);
  Neuron b = new Neuron(0,100);
  Neuron c = new Neuron(0,-100);
  Neuron d = new Neuron(200,0);

  // Add the Neurons to the network.
  network.addNeuron(a);
  network.addNeuron(b);
  network.addNeuron(c);
  network.addNeuron(d);
}

void draw() {
  background(255);
  // Show the network.
  network.display();
}

The above yields:

What's missing, of course, is the connection. We can consider
a Connection object to be made up of three elements: two neurons
(from Neuron a to Neuron b) and a weight.

class Connection {
  // A connection is between two neurons.
  Neuron a;
  Neuron b;
  // A connection has a weight.
  float weight;

  Connection(Neuron from, Neuron to, float w) {
    weight = w;
    a = from;
    b = to;
  }

  // A connection is drawn as a line.
  void display() {
    stroke(0);
    strokeWeight(weight*4);
    line(a.location.x, a.location.y, b.location.x, b.location.y);
  }
}

Once we have the idea of a Connection object, we can write a function
(let's put it inside the Network class) that connects two neurons together,
the goal being that in addition to making the neurons in setup(), we can
also connect them.

void setup() {
  size(640, 360);
  network = new Network(width/2,height/2);

  Neuron a = new Neuron(-200,0);
  Neuron b = new Neuron(0,100);
  Neuron c = new Neuron(0,-100);
  Neuron d = new Neuron(200,0);

  // Making connections between the neurons
  network.connect(a,b);
  network.connect(a,c);
  network.connect(b,d);
  network.connect(c,d);

  network.addNeuron(a);
  network.addNeuron(b);
  network.addNeuron(c);
  network.addNeuron(d);
}

The Network class therefore needs a new function called connect(), which
makes a Connection object between the two specified neurons.

void connect(Neuron a, Neuron b) {
  // Connection has a random weight.
  Connection c = new Connection(a, b, random(1));

  // But what do we do with the Connection object?
}

Presumably, we might think that the Network should store an ArrayList of
connections, just like it stores an ArrayList of neurons. While useful, in
this case such an ArrayList is not necessary and is missing an important
feature that we need. Ultimately we plan to "feed forward" the neurons
through the network, so the Neuron objects themselves must know to
which neurons they are connected in the "forward" direction. In other
words, each neuron should have its own list of Connection objects.
When a connects to b, we want a to store a reference of that connection so
that it can pass its output to b when the time comes.

void connect(Neuron a, Neuron b) {
  Connection c = new Connection(a, b, random(1));
  a.addConnection(c);
}

In some cases, we also might want Neuron b to know about this
connection, but in this particular example we are only going to pass
information in one direction.

For this to work, we have to add an ArrayList of connections to
the Neuron class. Then we implement the addConnection() function that
stores the connection in that ArrayList.

class Neuron {
  PVector location;

  // The neuron stores its connections.
  ArrayList<Connection> connections;

  Neuron(float x, float y) {
    location = new PVector(x, y);
    connections = new ArrayList<Connection>();
  }

  // Adding a connection to this neuron
  void addConnection(Connection c) {
    connections.add(c);
  }

The neuron's display() function can draw the connections as well. And
finally, we have our network diagram.

Example 10.3: Neural network diagram

  void display() {
    stroke(0);
    strokeWeight(1);
    fill(0);
    ellipse(location.x, location.y, 16, 16);

    // Drawing all the connections
    for (Connection c : connections) {
      c.display();
    }
  }
}

10.8 Animating Feed Forward


An interesting problem to consider is how to visualize the flow of
information as it travels throughout a neural network. Our network is
built on the feed forward model, meaning that an input arrives at the first
neuron (drawn on the left-hand side of the window) and the output of that
neuron flows across the connections to the right until it exits as output
from the network itself.

Our first step is to add a function to the network to receive this input,
which we'll make a random number between 0 and 1.

void setup() {
  // All our old network set up code

  // A new function to send in an input
  network.feedforward(random(1));
}

The network, which manages all the neurons, can choose to which
neurons it should apply that input. In this case, we'll do something simple
and just feed a single input into the first neuron in the ArrayList, which
happens to be the left-most one.

class Network {

  // A new function to feed an input into the neuron
  void feedforward(float input) {
    Neuron start = neurons.get(0);
    start.feedforward(input);
  }

What did we do? Well, we made it necessary to add a function
called feedforward() in the Neuron class that will receive the input and
process it.

class Neuron {

  void feedforward(float input) {
    // What do we do with the input?
  }

If you recall from working with our perceptron, the standard task that the
processing unit performs is to sum up all of its inputs. So if
our Neuron class adds a variable called sum, it can simply accumulate the
inputs as they are received.

class Neuron {

  // A float, so that fractional inputs are not truncated away.
  float sum = 0;

  void feedforward(float input) {
    // Accumulate the sums.
    sum += input;
  }

The neuron can then decide whether it should "fire," or pass an output
through any of its connections to the next layer in the network. Here we
can create a really simple activation function: if the sum is greater than 1,
fire!

  void feedforward(float input) {
    sum += input;
    // Activate the neuron and fire the outputs?
    if (sum > 1) {
      fire();
      // If we've fired off our output, we can reset our sum to 0.
      sum = 0;
    }
  }
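
The accumulate-then-fire rule is easy to try in isolation. Here is a minimal plain-Java sketch, assuming a hypothetical ThresholdNeuron class whose feedforward() returns true on the call where it fires instead of calling fire(); the three input values are made up for illustration.

```java
// Sketch of the accumulate-and-fire rule; ThresholdNeuron is an illustrative name.
class ThresholdNeuron {
  float sum = 0; // a float, so fractional inputs are not truncated away

  // Adds the input; returns true on the call where the neuron fires.
  boolean feedforward(float input) {
    sum += input;
    if (sum > 1) {
      sum = 0; // reset after firing
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    ThresholdNeuron n = new ThresholdNeuron();
    System.out.println(n.feedforward(0.4f)); // false: sum is 0.4
    System.out.println(n.feedforward(0.5f)); // false: sum is about 0.9
    System.out.println(n.feedforward(0.3f)); // true: sum passes 1, then resets
  }
}
```

Note that if sum were declared as an int, each `sum += input` with an input below 1 would be truncated back to 0 and the neuron would never fire.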

Now, what do we do in the fire() function? If you recall, each neuron keeps
track of its connections to other neurons. So all we need to do is loop
through those connections and feedforward() the neuron's output. For this
simple example, we'll just take the neuron's sum variable and make it the
output.

  void fire() {
    for (Connection c : connections) {
      // The Neuron sends the sum out through all of its connections.
      c.feedforward(sum);
    }
  }

Here's where things get a little tricky. After all, our job here is not to
actually make a functioning neural network, but to animate a simulation
of one. If the neural network were just continuing its work, it would
instantly pass those inputs (multiplied by the connection's weight) along
to the connected neurons. We'd say something like:

class Connection {

  void feedforward(float val) {
    b.feedforward(val*weight);
  }

But this is not what we want. What we want to do is draw something that
we can see traveling along the connection from Neuron a to Neuron b.

Let's first think about how we might do that. We know the location
of Neuron a; it's the PVector a.location. Neuron b is located at b.location.
We need to start something moving from Neuron a by creating
another PVector that will store the path of our traveling data.

PVector sender = a.location.get();

Once we have a copy of that location, we can use any of the motion
algorithms that we've studied throughout this book to move along this
path. Here, let's pick something very simple and just interpolate
from a to b.

sender.x = lerp(sender.x, b.location.x, 0.1);
sender.y = lerp(sender.y, b.location.y, 0.1);

Along with the connection's line, we can then draw a circle at that
location:

stroke(0);
line(a.location.x, a.location.y, b.location.x, b.location.y);
fill(0);
ellipse(sender.x, sender.y, 8, 8);

This resembles the following:

Figure 10.16

OK, so that's how we might move something along the connection. But
how do we know when to do so? We start this process the moment
the Connection object receives the "feedforward" signal. We can keep
track of this process by employing a simple boolean to know whether the
connection is sending or not. Before, we had:

  void feedforward(float val) {
    b.feedforward(val*weight);
  }

Now, instead of sending the value on straight away, we'll trigger an
animation:

class Connection {

  boolean sending = false;
  PVector sender;
  float output;

  void feedforward(float val) {
    // Sending is now true.
    sending = true;
    // Start the animation at the location of Neuron A.
    sender = a.location.get();
    // Store the output for when it is actually time to feed it forward.
    output = val*weight;
  }

Notice how our Connection class now needs three new variables. We need
a boolean "sending" that starts as false and that will track whether or not
the connection is actively sending (i.e. animating). We need
a PVector "sender" for the location where we'll draw the traveling dot. And
since we aren't passing the output along this instant, we'll need to store it
in a variable that will do the job later.

The feedforward() function is called the moment the connection becomes
active. Once it's active, we'll need to call another function continuously
(each time through draw()), one that will update the location of the
traveling data.

  void update() {
    if (sending) {
      // As long as we're sending, interpolate our points.
      sender.x = lerp(sender.x, b.location.x, 0.1);
      sender.y = lerp(sender.y, b.location.y, 0.1);
    }
  }

We're missing a key element, however. We need to check if the sender has
arrived at location b, and if it has, feed forward that output to the next
neuron.

void update() {
  if (sending) {
    sender.x = lerp(sender.x, b.location.x, 0.1);
    sender.y = lerp(sender.y, b.location.y, 0.1);

    // How far are we from neuron b?
    float d = PVector.dist(sender, b.location);

    // If we're close enough (within one pixel), pass on the output. Turn off sending.
    if (d < 1) {
      b.feedforward(output);
      sending = false;
    }
  }
}
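
To see why update() tests a distance threshold rather than exact arrival, note that each call to lerp() with an amount of 0.1 covers only a tenth of the remaining gap, so the dot closes in on b geometrically without landing on it exactly. The following standalone Python sketch (not Processing) illustrates this. The helper names lerp and frames_to_arrive and the coordinates are assumptions made up for this illustration.

```python
import math

def lerp(start, stop, amt):
    """Linear interpolation, mirroring Processing's lerp(start, stop, amt)."""
    return start + (stop - start) * amt

def frames_to_arrive(a, b, amt=0.1, threshold=1.0, max_frames=10_000):
    """Count update() calls until the sender is within `threshold` of b."""
    x, y = a
    for frame in range(1, max_frames + 1):
        x = lerp(x, b[0], amt)
        y = lerp(y, b[1], amt)
        if math.hypot(b[0] - x, b[1] - y) < threshold:
            return frame
    return None  # did not arrive within max_frames

# Hypothetical neuron locations, 200 pixels apart.
frames = frames_to_arrive((0.0, 0.0), (120.0, 160.0))
print(frames)
```

Because the remaining distance shrinks by a factor of 0.9 on every frame, the number of frames grows only logarithmically with the length of the connection.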
Let's look at the Connection class all together, as well as our
new draw() function.

Example 10.4: Animating a neural network diagram

void draw() {
  background(255);
  // The Network now has a new update() method that updates all of the Connection objects.
  network.update();
  network.display();

  // We are choosing to send in an input every 30 frames.
  if (frameCount % 30 == 0) {
    network.feedforward(random(1));
  }
}

class Connection {
  // The Connection's data
  float weight;
  Neuron a;
  Neuron b;

  // Variables to track the animation
  boolean sending = false;
  PVector sender;
  float output = 0;

  Connection(Neuron from, Neuron to, float w) {
    weight = w;
    a = from;
    b = to;
  }

  // The Connection is active with data traveling from a to b.
  void feedforward(float val) {
    output = val*weight;
    sender = a.location.get();
    sending = true;
  }

  // Update the animation if it is sending.
  void update() {
    if (sending) {
      sender.x = lerp(sender.x, b.location.x, 0.1);
      sender.y = lerp(sender.y, b.location.y, 0.1);
      float d = PVector.dist(sender, b.location);
      if (d < 1) {
        b.feedforward(output);
        sending = false;
      }
    }
  }

  // Draw the connection as a line and traveling circle.
  void display() {
    stroke(0);
    strokeWeight(1+weight*4);
    line(a.location.x, a.location.y, b.location.x, b.location.y);

    if (sending) {
      fill(0);
      strokeWeight(1);
      ellipse(sender.x, sender.y, 16, 16);
    }
  }
}

Exercise 10.5
The network in the above example was manually configured by setting the
location of each neuron and its connections with hard-coded values.
Rewrite this example to generate the network's layout via an algorithm.
Can you make a circular network diagram? A random one? An example of
a multi-layered network is below.

Exercise 10.6
Rewrite the example so that each neuron keeps track of its forward and
backward connections. Can you feed inputs through the network in any
direction?
Exercise 10.7
Instead of lerp(), use moving bodies with steering forces to visualize the
flow of information in the network.

The Ecosystem Project


Step 10 Exercise:

Try incorporating the concept of a brain into your creatures.

Use reinforcement learning in the creatures' decision-making process.

Create a creature that features a visualization of its brain as part of
its design (even if the brain itself is not functional).

Can the ecosystem as a whole emulate the brain? Can elements of
the environment be neurons and the creatures act as inputs and
outputs?

The end
If you're still reading, thank you! You've reached the end of the book. But
for as much material as this book contains, we've barely scratched the
surface of the world we inhabit and of techniques for simulating it. It's my
intention for this book to live as an ongoing project, and I hope to
continue adding new tutorials and examples to the book's website as well
as expand and update the printed material. Your feedback is truly
appreciated, so please get in touch via email at (daniel@shiffman.net) or
by contributing to the GitHub repository, in keeping with the open-source
spirit of the project. Share your work. Keep in touch. Let's be two with
nature.
A Basic Introduction to
Feedforward Backpropagation Neural Networks

David Leverington
Associate Professor of Geosciences

The Feedforward Backpropagation Neural Network Algorithm

Although the long-term goal of the neural-network community remains the design of autonomous
machine intelligence, the main modern application of artificial neural networks is in the field of pattern
recognition (e.g., Joshi et al., 1997). In the sub-field of data classification, neural-network methods have
been found to be useful alternatives to statistical techniques such as those which involve regression
analysis or probability density estimation (e.g., Holmström et al., 1997). The potential utility of neural
networks in the classification of multisource satellite-imagery databases has been recognized for well
over a decade, and today neural networks are an established tool in the field of remote sensing.

The most widely applied neural network algorithm in image classification remains the feedforward
backpropagation algorithm. This web page is devoted to explaining the basic nature of this classification
routine.

1 Neural Network Basics

Neural networks are members of a family of computational architectures inspired by biological brains
(e.g., McClelland et al., 1986; Luger and Stubblefield, 1993). Such architectures are commonly called
"connectionist systems", and are composed of interconnected and interacting components called nodes
or neurons (these terms are generally considered synonyms in connectionist terminology, and are used
interchangeably here). Neural networks are characterized by a lack of explicit representation of
knowledge; there are no symbols or values that directly correspond to classes of interest. Rather,
knowledge is implicitly represented in the patterns of interactions between network components (Luger
and Stubblefield, 1993). A graphical depiction of a typical feedforward neural network is given in Figure
1. The term "feedforward" indicates that the network has links that extend in only one direction. Except
during training, there are no backward links in a feedforward network; all links proceed from input
nodes toward output nodes.
Figure 1: A typical feedforward neural network.

Individual nodes in a neural network emulate biological neurons by taking input data and performing
simple operations on the data, selectively passing the results on to other neurons (Figure 2). The output
of each node is called its "activation" (the terms "node values" and "activations" are used
interchangeably here). Weight values are associated with each vector and node in the network, and these
values constrain how input data (e.g., satellite image values) are related to output data (e.g., land-cover
classes). Weight values associated with individual nodes are also known as biases. Weight values are
determined by the iterative flow of training data through the network (i.e., weight values are established
during a training phase in which the network learns how to identify particular classes by their typical
input data characteristics). A more formal description of the foundations of multi-layer, feedforward,
backpropagation neural networks is given in Section 5.

Once trained, the neural network can be applied toward the classification of new data. Classifications are
performed by trained networks through 1) the activation of network input nodes by relevant data sources
[these data sources must directly match those used in the training of the network], 2) the forward flow of
this data through the network, and 3) the ultimate activation of the output nodes. The pattern of
activation of the network's output nodes determines the outcome of each pixel's classification. Useful
summaries of fundamental neural network principles are given by Rumelhart et al. (1986), McClelland
and Rumelhart (1988), Rich and Knight (1991), Winston (1991), Anzai (1992), Luger and Stubblefield
(1993), Gallant (1993), and Richards and Jia (2005). Parts of this web page draw on these summaries. A
brief historical account of the development of connectionist theories is given in Gallant (1993).
Figure 2: Schematic comparison between a biological neuron and an artificial neuron (after Winston,
1991; Rich and Knight, 1991). For the biological neuron, electrical signals from other neurons are
conveyed to the cell body by dendrites; resultant electrical signals are sent along the axon to be
distributed to other neurons. The operation of the artificial neuron is analogous to (though much
simpler than) the operation of the biological neuron: activations from other neurons are summed at the
neuron and passed through an activation function, after which the value is sent to other neurons.

2 McCulloch-Pitts Networks

Neural computing began with the development of the McCulloch-Pitts network in the 1940's
(McCulloch and Pitts, 1943; Luger and Stubblefield, 1993). These simple connectionist networks, shown
in Figure 3, are stand-alone decision machines that take a set of inputs, multiply these inputs by
associated weights, and output a value based on the sum of these products. Input values (also known as
input activations) are thus related to output values (output activations) by simple mathematical
operations involving weights associated with network links. McCulloch-Pitts networks are strictly
binary; they take as input and produce as output only 0's or 1's. These 0's and 1's can be thought of as
excitatory or inhibitory entities, respectively (Luger and Stubblefield, 1993). If the sum of the products
of the inputs and their respective weights is greater than or equal to 0, the output node returns a 1
(otherwise, a 0 is returned). The value of 0 is thus a threshold that must be exceeded or equalled if the
output of the system is to be 1. The above rule, which governs the manner in which an output node maps
input values to output values, is known as an activation function (meaning that this function is used to
determine the activation of the output node). McCulloch-Pitts networks can be constructed to compute
logical functions (for example, in the X AND Y case, no combination of inputs can produce a sum of
products that is greater than or equal to 0, except the combination X=Y=1). McCulloch-Pitts networks
do not learn, and thus the weight values must be determined in advance using other mathematical or
heuristic means. Nevertheless, these networks did much to inspire further research into connectionist
models during the 1950's (Luger and Stubblefield, 1993).
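
As a minimal sketch of the rule described above, the following Python snippet implements a single threshold unit and uses it to compute X AND Y. The specific weights, and the use of a constant bias input of 1 carrying a weight of -2, are assumptions chosen for illustration; they are one of many weight sets that satisfy the stated condition.

```python
def mcculloch_pitts(inputs, weights):
    """Return 1 if the sum of input/weight products is >= 0, otherwise 0."""
    total = sum(i * w for i, w in zip(inputs, weights))
    return 1 if total >= 0 else 0

# Assumed weights for X AND Y: each input has weight +1 and a constant
# bias input of 1 has weight -2, so the sum reaches 0 only when X = Y = 1.
AND_WEIGHTS = (1, 1, -2)

def logical_and(x, y):
    return mcculloch_pitts((x, y, 1), AND_WEIGHTS)

truth_table = {(x, y): logical_and(x, y) for x in (0, 1) for y in (0, 1)}
print(truth_table)
```

The weights are fixed by hand here, which reflects the point that these networks do not learn.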

Figure 3: McCulloch-Pitts networks (after Luger and Stubblefield, 1993).

3 Perceptrons

The development of a connectionist system capable of limited learning occurred in the late 1950's, when
Rosenblatt created a system known as a perceptron (see Rosenblatt, 1962; Luger and Stubblefield,
1993). Again, this system consists of binary activations (inputs and outputs) (see Figure 4). In common
with the McCulloch-Pitts neuron described above, the perceptron's binary output is determined by
summing the products of inputs and their respective weight values. In the perceptron implementation, a
variable threshold value is used (whereas in the McCulloch-Pitts network, this threshold is fixed at 0): if
the linear sum of the input/weight products is greater than a threshold value (theta), the output of the
system is 1 (otherwise, a 0 is returned). The output unit is thus said to be, like the McCulloch-Pitts output unit,
a linear threshold unit. To summarize, the perceptron classifies input values as either 1 or 0, according
to the following rule, referred to as the activation function:

(eqn 1)
PERCEPTRON OUTPUT = 1 if (sum of products of inputs and weights) > theta
(otherwise, PERCEPTRON OUTPUT = 0)

The perceptron is trained (i.e., the weights and threshold values are calculated) based on an iterative
training phase involving training data. Training data are composed of a list of input values and their
associated desired output values. In the training phase, the inputs and related outputs of the training data
are repeatedly submitted to the perceptron. The perceptron calculates an output value for each set of
input values. If the output of a particular training case is labelled 1 when it should be labelled 0, the
threshold value (theta) is increased by 1, and all weight values associated with inputs of 1 are decreased
by 1. The opposite is performed if the output of a training case is labelled 0 when it should be labelled 1.
No changes are made to the threshold value or weights if a particular training case is correctly classified.
This set of training rules is summarized as:

(eqn 2a)
If OUTPUT is correct, then no changes are made to the threshold or weights

(eqn 2b)
If OUTPUT = 1, but should be 0
then {theta = theta + 1}
and {weight_x = weight_x - 1, if input_x = 1}

(eqn 2c)
If OUTPUT = 0, but should be 1
then {theta = theta - 1}
and {weight_x = weight_x + 1, if input_x = 1}

where the subscript x refers to a particular input-node and weight pair. The effect of the above training
rules is to make it less likely that a particular error will be made in subsequent training iterations. For
example, in equation (2b), increasing the threshold value serves to make it less likely that the same sum
of products will exceed the threshold in later training iterations, and thus makes it less likely that an
output value of 1 will be produced when the same inputs are presented. Also, by modifying only those
weights that are associated with input values of 1, only those weights that could have contributed to the
error are changed (weights associated with input values of 0 are not considered to have contributed to
error). Once the network is trained, it can be used to classify new data sets whose input/output
associations are similar to those that characterize the training data set. Thus, through an iterative training
stage in which the weights and threshold gradually migrate to useful values (i.e., values that minimize or
eliminate error), the perceptron can be said to learn how to solve simple problems.
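
The training rules of equations (1) and (2a) to (2c) can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: the function names, the starting values of zero for the weights and threshold, the stopping criterion (one full pass with no errors), and the choice of logical AND as training data are all assumptions.

```python
def perceptron_output(inputs, weights, theta):
    """Eqn 1: output 1 only if the sum of products exceeds theta."""
    total = sum(i * w for i, w in zip(inputs, weights))
    return 1 if total > theta else 0

def train_perceptron(cases, n_inputs, max_epochs=100):
    """Apply Eqns 2a-2c repeatedly until a full pass produces no errors."""
    weights = [0] * n_inputs   # assumed starting weights
    theta = 0                  # assumed starting threshold
    for _ in range(max_epochs):
        errors = 0
        for inputs, target in cases:
            output = perceptron_output(inputs, weights, theta)
            if output == target:
                continue                      # Eqn 2a: no change
            errors += 1
            # Eqn 2c (target 1) raises weights and lowers theta;
            # Eqn 2b (target 0) does the opposite.
            step = 1 if target == 1 else -1
            theta -= step
            weights = [w + step if i == 1 else w
                       for w, i in zip(weights, inputs)]
        if errors == 0:
            break
    return weights, theta

# Hypothetical training data: the logical AND of two binary inputs.
AND_CASES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, theta = train_perceptron(AND_CASES, n_inputs=2)
print(weights, theta)
```

Only weights attached to inputs of 1 are touched on an error, which matches the reasoning above that inputs of 0 cannot have contributed to the mistake.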

Figure 4: An example of a perceptron. The system consists of binary activations. Weights are identified
by w's, and inputs are identified by i's. A variable threshold value (theta) is used at the output node.

4 The Delta Rule

The development of the perceptron was a large step toward the goal of creating useful connectionist
networks capable of learning complex relations between inputs and outputs. In the late 1950's, the
connectionist community understood that what was needed for the further development of connectionist
models was a mathematically-derived (and thus potentially more flexible and powerful) rule for
learning. By the early 1960's, the Delta Rule [also known as the Widrow and Hoff learning rule or the
least mean square (LMS) rule] was invented (Widrow and Hoff, 1960). This rule is similar to the
perceptron learning rule above (McClelland and Rumelhart, 1988), but is also characterized by a
mathematical utility and elegance missing in the perceptron and other early learning rules. The Delta
Rule uses the difference between target activation (i.e., target output values) and obtained activation to
drive learning. For reasons discussed below, the use of a threshold activation function (as used in both
the McCulloch-Pitts network and the perceptron) is dropped; instead, a linear sum of products is used to
calculate the activation of the output neuron (alternative activation functions can also be applied - see
Section 5.2). Thus, the activation function in this case is called a linear activation function, in which the
output node's activation is simply equal to the sum of the network's respective input/weight products.
The strengths of network connections (i.e., the values of the weights) are adjusted to reduce the
difference between target and actual output activation (i.e., error). A graphical depiction of a simple two-
layer network capable of employing the Delta Rule is given in Figure 5. Note that such a network is not
limited to having only one output node.

Figure 5: A network capable of implementing the Delta Rule. Non-binary values may be used. Weights
are identified by w's, and inputs are identified by i's. A simple linear sum of products (represented by the
symbol at top) is used as the activation function at the output node of the network shown here.

During forward propagation through a network, the output (activation) of a given node is a function of
its inputs. The inputs to a node, which are simply the products of the output of preceding nodes with
their associated weights, are summed and then passed through an activation function before being sent
out from the node. Thus, we have the following:

(Eqn 3a)  $S_j = \sum_{i} w_{ij} a_i$

and

(Eqn 3b)  $a_j = f(S_j)$

where $S_j$ is the sum of all relevant products of weights and outputs from the previous layer i,
$w_{ij}$ represents the relevant weights connecting layer i with layer j, $a_i$ represents the activations of the
nodes in the previous layer i, $a_j$ is the activation of the node at hand, and f is the activation function.
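
The two steps of Equations 3a and 3b (sum the input/weight products, then apply the activation function) can be sketched as follows. The function names and the numeric values are made up for illustration; the linear and threshold functions are the two activation functions discussed so far.

```python
def weighted_sum(prev_activations, weights):
    """Eqn 3a: the sum of products of incoming activations and weights."""
    return sum(w * a for w, a in zip(weights, prev_activations))

def linear(s):
    """Linear activation: the node's output equals its summed input."""
    return s

def threshold(s):
    """Threshold activation with the threshold fixed at 0."""
    return 1 if s >= 0 else 0

def node_activation(prev_activations, weights, f=linear):
    """Eqn 3b: pass the summed input through the activation function f."""
    return f(weighted_sum(prev_activations, weights))

a_i = [1.0, 0.5, -1.0]    # activations of the previous layer (made up)
w_ij = [0.5, 1.0, 0.25]   # weights feeding the node at hand (made up)
print(node_activation(a_i, w_ij))             # linear activation
print(node_activation(a_i, w_ij, threshold))  # threshold activation
```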

Figure 6: Schematic representation of an error function for a network containing only two weights (w1
and w2) (after Luger and Stubblefield, 1993). Any given combination of weights will be associated with
a particular error measure. The Delta Rule uses gradient descent learning to iteratively change network
weights to minimize error (i.e., to locate the global minimum in the error surface).

For any given set of input data and weights, there will be an associated magnitude of error, which is
measured by an error function (also known as a cost function) (Figure 6) (e.g., Oh, 1997; Yam and
Chow, 1997). The Delta Rule employs the error function for what is known as gradient descent learning,
which involves the modification of weights along the most direct path in weight-space to minimize
error; change applied to a given weight is proportional to the negative of the derivative of the error with
respect to that weight (McClelland and Rumelhart 1988, pp.126-130). The error function is commonly
given as the sum of the squares of the differences between all target and actual node activations for the
output layer. For a particular training pattern (i.e., training case), error is thus given by:

(Eqn 4a)  $E_p = \frac{1}{2} \sum_{n} (t_{j_n} - a_{j_n})^2$

where $E_p$ is total error over the training pattern, ½ is a value applied to simplify the function's derivative,
n represents all output nodes for a given training pattern, $t_{j_n}$ represents the target value for node n in
output layer j, and $a_{j_n}$ represents the actual activation for the same node. This particular error
measure is attractive because its derivative, whose value is needed in the employment of the Delta Rule,
is easily calculated. Error over an entire set of training patterns (i.e., over one iteration, or epoch) is
calculated by summing all Ep:

(Eqn 4b)  $E = \sum_{p} E_p$

where E is total error, and p represents all training patterns. An equivalent term for E in Equation 4b is
sum-of-squares error. A normalized version of Equation 4b is given by the mean squared error (MSE)
equation:

(Eqn 4c)  $MSE = \frac{1}{2PN} \sum_{p} \sum_{n} (t_{j_n} - a_{j_n})^2$

where P and N are the total number of training patterns and output nodes, respectively. It is the error of
Equations 4b and 4c that gradient descent attempts to minimize (in fact, this is not strictly true if weights
are changed after each input pattern is submitted to the network; see Section 4.1 below; see also
Rumelhart et al., 1986: v1, p.324; Reed and Marks, 1999: pp. 57-62). Error over a given training pattern
is commonly expressed in terms of the total sum of squares (tss) error, which is simply equal to the
sum of all squared errors over all output nodes and all training patterns. The negative of the derivative of
the error function is required in order to perform gradient descent learning. The derivative of Equation
4a (which measures error for a given pattern p), with respect to a particular weight $w_{ij_x}$, is given by
the chain rule as:

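(Eqn 5a)  $\frac{\partial E_p}{\partial w_{ij_x}} = \frac{\partial E_p}{\partial a_{j_z}} \cdot \frac{\partial a_{j_z}}{\partial w_{ij_x}}$

where $a_{j_z}$ is the activation of the node in the output layer that corresponds to the weight $w_{ij_x}$
(note: subscripts refer to particular layers of nodes or weights, and the sub-subscripts simply refer to
individual weights and nodes within these layers). It follows that

(Eqn 5b)  $\frac{\partial E_p}{\partial a_{j_z}} = -(t_{j_z} - a_{j_z})$

and

(Eqn 5c)  $\frac{\partial a_{j_z}}{\partial w_{ij_x}} = a_{i_x}$

where $a_{i_x}$ is the activation of the input node associated with the weight $w_{ij_x}$ (Equation 5c holds
for the linear activation function used here). Thus, the derivative of the error over an individual training
pattern is given by the product of the two derivatives in Equation 5a:

(Eqn 5d)  $\frac{\partial E_p}{\partial w_{ij_x}} = -(t_{j_z} - a_{j_z}) \, a_{i_x}$

Because gradient descent learning requires that any change in a particular weight be proportional to the
negative of the derivative of the error, the change in a given weight must be proportional to the negative
of Equation 5d. Replacing the difference between the target and actual activation of the relevant output
node by delta ($\delta_{j_z} = t_{j_z} - a_{j_z}$), and introducing a learning rate epsilon ($\epsilon$), Equation 5d can be
re-written in the final form of the delta rule:

(Eqn 5e)  $\Delta w_{ij_x} = \epsilon \, \delta_{j_z} \, a_{i_x}$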
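
The derivation can be checked numerically. The sketch below assumes a single linear output node and the half-sum-of-squares error described for Equation 4a, and compares the analytic derivative (the negative of target minus activation, multiplied by the input activation) against a finite-difference estimate. All names and numeric values are illustrative assumptions.

```python
def pattern_error(weights, inputs, target):
    """Per-pattern error for one linear output node: half the squared difference."""
    activation = sum(w * a for w, a in zip(weights, inputs))
    return 0.5 * (target - activation) ** 2

def analytic_gradient(weights, inputs, target):
    """Derivative of the pattern error with respect to each weight."""
    activation = sum(w * a for w, a in zip(weights, inputs))
    return [-(target - activation) * a for a in inputs]

def numeric_gradient(weights, inputs, target, h=1e-6):
    """Central finite differences, used only as an independent check."""
    grads = []
    for x in range(len(weights)):
        up = list(weights)
        up[x] += h
        down = list(weights)
        down[x] -= h
        diff = pattern_error(up, inputs, target) - pattern_error(down, inputs, target)
        grads.append(diff / (2 * h))
    return grads

weights = [0.2, -0.4, 0.1]   # made-up values
inputs = [1.0, 0.5, -2.0]
target = 1.0
analytic = analytic_gradient(weights, inputs, target)
numeric = numeric_gradient(weights, inputs, target)

# Delta Rule weight change: learning rate times the negative gradient.
epsilon = 0.25
delta_w = [-epsilon * g for g in analytic]
print(analytic, numeric, delta_w)
```

Agreement between the two gradients is a quick sanity check that the sign and the input term in the weight update are right.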

The reasoning behind the use of a linear activation function here instead of a threshold activation
function can now be justified: the threshold activation function that characterizes both the McCulloch
and Pitts network and the perceptron is not differentiable at the transition between the activations of 0
and 1 (slope = infinity), and its derivative is 0 over the remainder of the function. As such, the threshold
activation function cannot be used in gradient descent learning. In contrast, a linear activation function
(or any other function that is differentiable) allows the derivative of the error to be calculated.

Equation 5e is the Delta Rule in its simplest form (McClelland and Rumelhart, 1988). From Equation 5e
it can be seen that the change in any particular weight is equal to the product of 1) the learning rate
epsilon (ε), 2) the difference between the target and actual activation of the output node (δ), and 3) the
activation of the input node associated with the weight in question. A higher value for ε will necessarily
result in a greater magnitude of change. Because each weight update can reduce error only slightly,
many iterations are required in order to satisfactorily minimize error (Reed and Marks, 1999). An actual
example of the iterative change in neural network weight values as a function of an error surface is given
in Figures 7 and 8. Figure 7 is a three-dimensional depiction of the error surface associated with a
particular mathematical problem. Figure 8 shows the two-dimensional version of this error surface,
along with the path that weight values took during training. Note that weight values changed such that
the path defined by weight values followed the local gradient of the error surface.

Figure 7: Three-dimensional depiction of an actual error surface (Leverington, 2001).


Figure 8: Two-dimensional depiction of the error surface given in Figure 7, shown with the training
path iteratively taken by weight values during training (starting at weight values [+6,+6]) (Leverington,
2001). Note that weight values changed such that the path defined by weight values followed the local
gradient of the error surface.

4.1 Batch and On-Line Learning

Weights can be updated in two primary ways: batch training, and on-line (also called sequential or
pattern-based) training. In batch mode, the value of $\partial E_p / \partial w_{ij}$ is calculated after each pattern is submitted
to the network, and the total derivative $\partial E / \partial w_{ij}$ is calculated at the end of a given iteration by summing
the individual pattern derivatives. Only after this value is calculated are the weights updated. As long as
the learning rate epsilon (ε) is small, batch mode approximates gradient descent (Reed and Marks,
1999).

On-line mode (also called pattern-mode learning) involves updating the values of weights after each
training pattern is submitted to the network (note that the term can be misleading: on-line mode does not
involve training during the normal feedforward operation of the network; it involves off-line training just
like batch mode). As noted earlier, on-line learning does not involve true gradient descent, since the sum
of all pattern derivatives over a given iteration is never determined for a particular set of weights;
weights are instead changed slightly after each pattern, causing the pattern derivatives to be evaluated
with respect to slightly different weight values. On-line mode is not a simple approximation of the
gradient descent method, since although single-pattern derivatives as a group sum to the gradient, each
derivative has a random deviation that does not have to be small (Reed and Marks, 1999: p.59).
Although error usually decreases after most weight changes, there may be derivatives that cause the
error to increase as well. Unless learning rates are very small, the weight vector tends to jump about the
E(w) surface, mostly moving downhill, but sometimes jumping uphill; the magnitudes of the jumps are
proportional to the learning rate epsilon (Reed and Marks, 1999).

Cyclic, fixed orders of training patterns are generally avoided in on-line learning, since convergence can
be limited if weights converge to a limit cycle (Reed and Marks 1999: p.61). Also, if large numbers of
patterns are in a training dataset, an ordered presentation of the training cases to the network can cause
weights/error to move very erratically over the error surface (with any given series of an individual
class's training patterns potentially causing the network to move in weight-space in a direction that is
very different from the overall desired direction). Thus, training patterns are usually submitted at random
in on-line learning. A comparison between the learning curves produced by networks using non-random
and random submission of training data is given in Figure 9. Note that the network using non-random
submission produced an extremely erratic learning curve, compared to the relatively smooth learning
curve produced by the network using random submission. The on-line mode has an advantage over batch
mode, in that the more erratic path that the weight values travel in is more likely to bounce out of local
minima; pure gradient descent offers no chance to escape a local minimum. Further, the on-line mode is
superior to batch mode if there is a high degree of redundancy in the training data, since, when using a
large training dataset, the network will simply update weights more often in a given iteration, while a
batch-mode network will simply take longer to evaluate a given iteration (Bishop, 1995a, p.264). An
advantage of batch mode is that it can settle on a stable set of weight values, without wandering about
this set.
Figure 9: Learning curves produced by networks using non-random (fixed-order) and random
submission of training data (Leverington, 2001).
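
The difference between the two update schedules can be made concrete with a small sketch. It assumes a single linear output node trained with the Delta Rule; the function names, the three training patterns, and the learning rate are made up for illustration.

```python
import random

def output(weights, inputs):
    """Linear activation: the sum of input/weight products."""
    return sum(w * a for w, a in zip(weights, inputs))

def batch_epoch(weights, patterns, epsilon):
    """Accumulate every pattern's change at fixed weights, then update once."""
    total = [0.0] * len(weights)
    for inputs, target in patterns:
        delta = target - output(weights, inputs)
        total = [t + delta * a for t, a in zip(total, inputs)]
    return [w + epsilon * t for w, t in zip(weights, total)]

def online_epoch(weights, patterns, epsilon, rng=None):
    """Update after every pattern; shuffle the order if an rng is supplied."""
    order = list(patterns)
    if rng is not None:
        rng.shuffle(order)
    for inputs, target in order:
        delta = target - output(weights, inputs)
        weights = [w + epsilon * delta * a for w, a in zip(weights, inputs)]
    return weights

# Hypothetical training patterns: (inputs, target).
patterns = [((1.0, 0.0), 1.0), ((0.0, 1.0), 0.0), ((1.0, 1.0), 1.0)]
batch_w = batch_epoch([0.0, 0.0], patterns, epsilon=0.1)
online_w = online_epoch([0.0, 0.0], patterns, epsilon=0.1)
shuffled_w = online_epoch([0.0, 0.0], patterns, 0.1, rng=random.Random(0))
print(batch_w, online_w, shuffled_w)
```

In the on-line pass the third pattern is evaluated against weights already changed by the first, so the two schedules end the iteration at slightly different weights even though they started from the same point.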

4.2 A Simple Delta Rule Example

A simple example of the employment of the Delta Rule, based on a discussion given in McClelland and
Rumelhart (1988), is as follows: imagine that the following inputs and outputs are related as in Table
3.1. Imagine further that the goal is to train a network to correctly label each
of the four input cases in this table. This problem will require a network with four input nodes and one
output node. The four weights (one per input node) are initially set to 0, and an arbitrary learning
rate (epsilon) of 0.25 is used in this example.

During the training phase, each training case is presented to the network individually, and weights are
modified according to the Delta Rule. For example, when the first training case is presented to the
network, the sum of products equals 0. Because the desired output for this particular training case is 1,
the error equals 1-0 = 1. Using equation (5e), the changes in the four weights are respectively calculated
to be {0.25, -0.25, 0.25, -0.25}. Since the weights are initially set to {0, 0, 0, 0}, they become {0.25,
-0.25, 0.25, -0.25} after this first training case. Presentation of the second set of training inputs causes
the network to calculate a sum of products of 0, again. Thus, the changes in the four weights in this case
are calculated to be {0.25, 0.25, 0.25, 0.25}, and, once the changes are added to the previously-
determined weights, the new weight values become {0.5, 0, 0.5, 0}. After presentation of the third and
fourth training cases, the weight values become {0, -0.5, 0, 0.5} and {-0.5, 0, 0.5, 0}, respectively. At the
end of this training iteration, the total sum of squared errors = 1^2 + 1^2 + (-2)^2 + (-2)^2 = 10.

Table 3.1: Sample inputs and outputs.

After this first iteration, it is not clear that the weights are changing in a manner that will reduce network
error. In fact, with the last set of weights given above, the network would only produce a correct output
value for the last training case; the first three would be classified incorrectly. However, with repeated
presentation of the same training data to the network (i.e., with multiple iterations of training), it becomes
clear that the network's weights do indeed evolve to reduce classification error: error is eliminated
altogether by the twentieth iteration. The network has learned to classify all training cases correctly, and
is now ready to be used on new data whose relations between inputs and desired outputs generally match
those of the training data.
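
The first training iteration described above can be reproduced with a short sketch. The contents of Table 3.1 are not reproduced in this text, so the four input patterns and targets below are an assumption: they were inferred by working backward from the weight changes and errors quoted in the example, and they may differ in presentation from the original table.

```python
# Inferred (inputs, target) pairs; an assumption, see the note above.
PATTERNS = [
    ((1, -1, 1, -1), 1),
    ((1, 1, 1, 1), 1),
    ((1, 1, 1, -1), -1),
    ((1, -1, -1, 1), -1),
]

def train_epoch(weights, patterns, epsilon):
    """One on-line pass of the Delta Rule with a linear output node."""
    tss = 0.0
    history = []
    for inputs, target in patterns:
        output = sum(w * a for w, a in zip(weights, inputs))
        delta = target - output                 # target minus actual activation
        tss += delta ** 2
        weights = [w + epsilon * delta * a for w, a in zip(weights, inputs)]
        history.append(weights)                 # weights after this training case
    return weights, tss, history

weights, tss, history = train_epoch([0.0, 0.0, 0.0, 0.0], PATTERNS, epsilon=0.25)
for case, w in enumerate(history, start=1):
    print(case, w)
print("total sum of squared errors:", tss)
```

With these inferred patterns the weights after each training case and the total of 10 match the figures quoted in the text. Calling train_epoch repeatedly, feeding each result back in as the starting weights, gives the multi-iteration training the text goes on to describe.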

The Delta Rule will find a set of weights that solves a network learning problem, provided such a set of
weights exists. The required condition for this set of weights existing is that all solutions must be a linear
function of the inputs. As was presented by Minsky and Papert (1969), this condition does not hold for
many simple problems (e.g., the exclusive-OR function, in which an output of 1 must be produced when
exactly one of two inputs is 1, but an output of 0 must be produced when neither or both of the inputs are 1;
see McClelland and Rumelhart 1988, pp. 145-152). Minsky and Papert recognized that a multi-layer
network could convert an unsolvable problem to a solvable problem (note: a multi-layer network
consists of one or more intermediate layers placed between the input and output layers; this accepted
terminology can be somewhat confusing and contradictory, in that the term "layer" in "multi-layer"
refers to a row of weights, whereas the term "layer" in general neural-network usage usually means a
row of nodes; see Section 5.1 and, e.g., Vemuri, 1992, p.42). Minsky and Papert also recognized that the
use of a linear activation function (such as that used in the Delta Rule example above, where network
output is equal to the sum of the input/weight products) would not allow the benefits of having a multi-
layer network to be realized, since a multi-layer network with linear activation functions is functionally
equivalent to a simple input-output network using linear activation functions. That is, linear systems
cannot compute more in multiple layers than they can in a single layer (McClelland and Rumelhart,
1988). Based on the above considerations, the questions at the time became 1) what kind of activation
function should be used in a multi-layer network, and 2) how can intermediate layers in a multi-layer
network be taught? The original application of the Delta Rule involved only an input layer and an
output layer. It was generally believed that no general learning rule for larger, multi-layer networks,
could be formulated. As a result of this view, research on connectionist networks for applications in
artificial intelligence was dramatically reduced in the 1970's (McClelland and Rumelhart, 1988; Joshi et
al., 1997).
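
The claim that stacked linear layers compute nothing more than a single linear layer can be illustrated directly: applying two weight matrices in sequence gives the same result as applying their product once. The matrices, the input vector, and the helper names below are made up for illustration.

```python
def matvec(matrix, vector):
    """Apply a layer of linear nodes: one sum of products per row of weights."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def matmul(a, b):
    """Multiply two weight matrices (a applied after b)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

W1 = [[1, 2], [0, -1], [3, 1]]   # input layer (2 nodes) to hidden layer (3 nodes)
W2 = [[1, -1, 2]]                # hidden layer (3 nodes) to output layer (1 node)
x = [2, 5]

two_layer = matvec(W2, matvec(W1, x))       # pass through the hidden layer
single_layer = matvec(matmul(W2, W1), x)    # equivalent input-output network
print(two_layer, single_layer)
```

Since the hidden layer can always be folded into one weight matrix in this way, a non-linear activation function is needed before extra layers add any computing power.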

5 Multi-Layer Networks and Backpropagation

Eventually, despite the apprehensions of earlier workers, a powerful algorithm for apportioning error
responsibility through a multi-layer network was formulated in the form of the backpropagation
algorithm (Rumelhart et al., 1986). The backpropagation algorithm employs the Delta Rule, calculating
error at output units in a manner analogous to that used in the example of Section 4.2, while error at
neurons in the layer directly preceding the output layer is a function of the errors on all units that use its
output. The effects of error in the output node(s) are propagated backward through the network after
each training case. The essential idea of backpropagation is to combine a non-linear multi-layer
perceptron-like system capable of making decisions with the objective error function of the Delta Rule
(McClelland and Rumelhart, 1988).

5.1 Network Terminology

A multi-layer, feedforward, backpropagation neural network is composed of 1) an input layer of nodes,
2) one or more intermediate (hidden) layers of nodes, and 3) an output layer of nodes (Figure 1). The
output layer can consist of one or more nodes, depending on the problem at hand. In most classification
applications, there will either be a single output node (the value of which will identify a predicted class),
or the same number of nodes in the output layer as there are classes (under this latter scheme, the
predicted class for a given set of input data will correspond to that class associated with the output node
with the highest activation). As noted in Section 4.2, it is important to recognize that the term
"multi-layer" is often used to refer to multiple layers of weights. This contrasts with the usual meaning
of "layer", which refers to a row of nodes (Vemuri, 1992). For clarity, it is often best to describe a
particular network by its number of layers and the number of nodes in each layer (e.g., a "4-3-5"
network has an input layer with 4 nodes, a hidden layer with 3 nodes, and an output layer with 5 nodes).
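
The terminology can be made concrete with a short sketch. It assumes a "4-3-5" description is stored as a list of node counts (the variable names are hypothetical), ignores bias terms, and uses made-up output activations to show the one-node-per-class scheme.

```python
# A "4-3-5" network: 3 layers of nodes, but only 2 layers of weights.
layer_sizes = [4, 3, 5]

# One weight matrix sits between each pair of adjacent node layers.
# Shapes are (receiving nodes, emitting nodes); biases are omitted here.
weight_shapes = [(n_out, n_in)
                 for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

n_node_layers = len(layer_sizes)
n_weight_layers = len(weight_shapes)

def predicted_class(output_activations):
    """Index of the output node with the highest activation."""
    return max(range(len(output_activations)),
               key=lambda k: output_activations[k])

# Hypothetical activations of the 5 output nodes for one input pattern.
outputs = [0.10, 0.72, 0.05, 0.31, 0.64]
winner = predicted_class(outputs)

print(weight_shapes, n_node_layers, n_weight_layers, winner)
```

Counting by weights, this would be called a two-layer network; counting by nodes, a three-layer network. Stating the node counts explicitly avoids the ambiguity.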

5.2 The Sigma Function

A smooth, non-linear activation function is essential in a multi-layer network
employing gradient-descent learning. An activation function commonly used in backpropagation
networks is the sigma (or sigmoid) function:

$$a_{j_m} = \frac{1}{1 + e^{-S_{j_m}}}, \qquad S_{j_m} = \sum_i a_i \, w_{ij} \qquad \text{(Eqn 6)}$$

where $a_{j_m}$ is the activation of a particular receiving node m in layer j, $S_{j_m}$ is the sum of the
products of the activations of all relevant emitting nodes (i.e., the nodes in the preceding layer i) by
their respective weights, and $w_{ij}$ is the set of all weights between layers i and j that are associated with
vectors that feed into node m of layer j. This function maps all sums into the interval (0, 1) (Figure 10)
(an alternate version of the function maps activations into (-1, 1); e.g., Gallant 1993, pp. 222-223). If the
sum of the products is 0, the sigma function returns 0.5. As the sum gets larger the sigma function returns
values closer to 1, while the function returns values closer to 0 as the sum gets increasingly negative. The
derivative of the sigma function with respect to $S_{j_m}$ is conveniently simple, and is given by Gallant
(1993, p. 213) as:

$$\frac{\partial a_{j_m}}{\partial S_{j_m}} = a_{j_m} \left(1 - a_{j_m}\right) \qquad \text{(Eqn 7)}$$
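
The sigma function and its derivative can be sketched directly. The code below assumes the standard logistic form 1 / (1 + e^(-S)); the function names are hypothetical. It also compares the closed-form derivative a(1 - a) against a finite-difference estimate as a sanity check.

```python
import math

def sigma(s):
    """Logistic (sigma) function of the summed input s."""
    return 1.0 / (1.0 + math.exp(-s))

def sigma_prime(s):
    """Derivative of sigma with respect to s, written in terms of the activation."""
    a = sigma(s)
    return a * (1.0 - a)

def numeric_slope(s, h=1e-6):
    """Central finite-difference estimate of the derivative, for checking only."""
    return (sigma(s + h) - sigma(s - h)) / (2.0 * h)

for s in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print(s, sigma(s), sigma_prime(s), numeric_slope(s))
```

Writing the derivative in terms of the activation is convenient in practice, because the activation has already been computed during the forward pass.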

The sigma function applies to all nodes in the network, except the input nodes, whose values are
assigned input values. The sigma function superficially compares to the threshold function (which is
used in the perceptron) as shown in Figure 10. Note that the derivative of the sigma function reaches its
maximum (0.25) when the activation is 0.5, and approaches its minimum (0) as the activation approaches
0 or 1. Thus, the greatest change in weights will occur for activations near 0.5, while the least change
will occur for activations near 0 or 1.
McClelland and Rumelhart (1988) recognize that it is these features of the equation (i.e., the shape of the
function) that contribute to the stability of learning in the network; weights are changed most for units
whose values are near their midrange, and thus for those units that are not yet committed to being either
on or off.
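
A few sample values illustrate how strongly the slope favors uncommitted units. This is a small sketch with arbitrarily chosen activations; the slope is computed as a(1 - a), the form given for the derivative of the sigma function.

```python
def slope_from_activation(a):
    """Slope of the sigma function expressed through the activation a."""
    return a * (1.0 - a)

# Arbitrary sample activations, from nearly "off" to nearly "on".
activations = [0.01, 0.1, 0.5, 0.9, 0.99]
slopes = [slope_from_activation(a) for a in activations]

for a, d in zip(activations, slopes):
    print(f"activation {a:.2f} -> slope {d:.4f}")
```

Since the weight change is scaled by this slope, a unit sitting at 0.5 is adjusted roughly 25 times more strongly than one at 0.01 or 0.99 for the same error signal.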