by Eric Edelstein
MathSoft, Inc.
In this article, we will consider modeling a feedforward network (a special type of weighted
directed graph) after the way a brain operates and begin looking at algorithms that teach the
network how to learn.
One of the things that makes humans efficient is our ability to change. This manifests itself in
many ways. The first is that our brains need not be completely redesigned just to change our
lunch order when we find they're out of octopus sukiyaki. This is nontrivial. Consider the
advantages our flexibility has over, for example, a microchip. The electronic circuits may be
able to perform many different operations, but the number is finite and the abilities don't
change over time. If you have an OR gate, it must be taken apart and rebuilt if you want an
AND gate. If you want it to add numbers, you have to compile quite a few components.
A human, however, can add to his/her stockpile of abilities without adding brain cells. How
does this happen? No one knows exactly, but certain ideas in the theory of learning are
getting clearer, and some can be modeled on a computer. We are now in the age where
computers can be taught to learn new tricks. That is, a program representing a neural net can
be made to learn infinitely many different routines (one at a time). That makes it extremely
flexible, and hence, powerful. Neural nets have been created that mimic and anticipate
human behavior, run machinery in automated factories, read books aloud, make complex
financial decisions, and perform a host of other impressive tasks.
One of the most common tasks required of a neural net is the recognition of patterns and
reaction to them in some manner. This will be demonstrated in this article. The reason for
this particular emphasis is that once a neural net can find a pattern, it can start predicting.
The art of prediction is an old one. There are many and varied statistical techniques to
approximate predictions. However, neural nets have been shown to be more accurate on
some occasions. Also, unlike a standard statistical program which allows for one set of
analyses, the same neural net can learn to do different analyses on different kinds of data.
It will just need to be retrained. However, the most fundamental difference is one of action.
A neural net will not only predict, but will also act in accordance with this prediction as we
shall see later on.
Now, consider the cellular makeup of a brain: neurons. There are billions of neurons
interconnected along axons. The main body of a neuron receives stimuli and decides,
somehow, whether or not to send a signal to neighboring neurons. If it decides to send out
a signal, an electrical burst, it does so through the axons. This is how the brain makes its
own predictions and actions. Given this description, the brain can be thought of as a graph
with the main neuron body represented by a node or vertex and the edges representing
axons.
These graphs (also called networks) contain points, called nodes or vertices; the line
segments connecting these nodes are called edges. The endpoints of an edge are
its vertices. An orientation of the edges is a choice of starting and ending vertex for
the edge. Usually, we draw an arrow on an oriented edge pointing from the initial to
the final node. If each edge of the graph has an orientation, the graph is called a
directed graph (or digraph, for short).
A graph with this association and the inherent implications that brings is called a
neural network (or a neural net, for short). We shall restrict ourselves to the study of
neural nets of a certain form: we assume our neural nets are layered and
feedforward. These are weighted, directed graphs with nodes that can be broken up
into discrete vertical layers (that is, the nodes lie on vertical slices through the graph).
The orientations given to the edges are the same throughout the graph, either left to
right or right to left. In this article we will use the convention of left to right. Such a
digraph looks like:
With the edge orientation of left to right, the leftmost layer is called the input layer, the
rightmost, the output layer, and all those between, the hidden layers. Nodes are often called
units, making the leftmost ones input units, the rightmost, output units, and those in
between, hidden units.
As mentioned earlier, we are concerned not only with the choices of edges and nodes, but
also with the weighting of them. To determine what the weighting should be, let's return to
the brain. If a neuron receives a very small stimulus, it does not fire. Once, however, it does
receive a significant enough stimulus, it fires a complete burst. It follows the all-or-none
principle. The cutoff value for stimuli is called a threshold. It is the amount of stimulus for a
particular neuron below which no reaction signal will be sent.
In modeling graphs after brains, we associate to each node, v, a threshold value, $t_v$, rather
like a transistor has in a logic gate.
The axon connections may be very strong or weak. That is, the signal sent from one neuron
to another via a particular axon may be completely passed on, or it may be impeded. This
can be thought of as the strength of the connection. The degree of connection between
those two neurons will reflect how interdependent they are. This strength between them is
used to define the weights on the edges in the neural net. If the weight is close to zero on an
edge between two nodes, then we can think of these two units as having little effect on each
other. If, on the other hand, the weights are high in absolute value, then the units' effect on
each other is strong. The weight on the edge from vertex v to vertex u is denoted $w_{vu}$.
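At this point we've completed the fundamental association between a simplified brain
and a layered feedforward neural net. Let's encapsulate it in a table:

Brain                             Neural net
--------------------------------  --------------------------------
Neuron (main cell body)           Node (vertex, unit) v
Axon                              Directed edge
Firing threshold of a neuron      Threshold value $t_v$ of a node
Strength of an axon connection    Weight on the edge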
It remains only to show how signals are passed along. Let's say that we're in the middle of a
neural net at a vertex, v. It would look something like:
Where x1 through x4 represent the strengths of the impulses that have been sent to this
node, v. The effect of x1 on v will be determined by the strength of the connection $W_{1v}$.
So by defining our weights correctly, the effect of x1 on v will be the product $W_{1v} x_1$.
Taking the other incoming impulses into consideration, v sees the following impulse:

$$\sum_{i=1}^{4} x_i W_{iv}$$
The reaction at v to this impulse must be determined. First we must see if the incoming
signal passes the threshold test. To do this, subtract the threshold from the impulse and
determine if the result is positive or negative. Then, a response function of some kind,
called the activation function, will act on the impulse, provided it is above the threshold
level. We perform these two steps as one by assuming some structure on the function.
We will assume that the activation function will treat positive numbers and negative
numbers differently. That is, the function values for a positive input will correspond to the
neuron firing. The function values for negative input values will correspond to nonfiring.
With this we find the response at v to the stimulus is:
$$f\left(\sum_{i} x_i W_{iv} - t_v\right)$$
A typical activation function might be the step function

$$f(x) := (x > 0)$$

which takes the value 1 when the condition holds and 0 otherwise, plotted here over
$x := -5, -4.995, \ldots, 5$.

[Plot: f(x) against x]

This is an example of an all-or-none response. Note what it would look like when applied
in a neural net with a threshold value $t := 3$:

$$f(x) := (x - t > 0)$$

[Plot: f(x) against x]
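The two steps described above (subtract the threshold, then apply the activation function) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the sample impulses and weights are hypothetical and do not come from the worksheet.

```python
def step(x):
    """All-or-none activation: 1 if x is strictly positive, else 0."""
    return 1 if x > 0 else 0


def node_response(impulses, weights, threshold):
    """Response of one node: f(sum_i x_i * W_iv - t_v)."""
    total = sum(x * w for x, w in zip(impulses, weights))
    return step(total - threshold)


# Hypothetical impulses x1..x4 and weights W_1v..W_4v, chosen only for illustration.
impulses = [1.0, 2.0, 0.5, 4.0]
weights = [0.5, 0.25, 1.0, 0.5]

print(node_response(impulses, weights, threshold=3))  # weighted sum 3.5 is above 3
print(node_response(impulses, weights, threshold=4))  # weighted sum 3.5 is below 4
```

With these sample numbers the weighted sum is 3.5, so the node fires for a threshold of 3 and stays silent for a threshold of 4.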
To get an idea of what's going on geometrically, let's consider two impulses going to a unit
with the same threshold of 3. Let's say one edge has a weight of a half, and the other a quarter.

$$w_1 := 0.5 \qquad w_2 := 0.25$$

$$f(x_1, x_2) := (x_1 w_1 + x_2 w_2 - t > 0)$$

$$i := -5, \ldots, 10 \qquad j := -5, \ldots, 10 \qquad M_{i+5,\, j+5} := f(i, j)$$

[Surface plot of M] The z-axis describes the node's reaction output to the two stimuli x1
and x2, plotted in the xy plane.

The neural net to the left of the node v:
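For readers without the worksheet at hand, the grid of responses can be tabulated directly. The sketch below assumes the grid runs over integer stimuli from -5 to 10 (the lower bound is inferred from the index offset of 5) and prints the matrix as rows of 0's and 1's instead of drawing a surface.

```python
W1, W2 = 0.5, 0.25  # edge weights from the text
T = 3               # threshold from the text


def f(x1, x2):
    """Step response of the unit to the two stimuli x1 and x2."""
    return 1 if x1 * W1 + x2 * W2 - T > 0 else 0


# M[i + 5][j + 5] = f(i, j) for i, j = -5, ..., 10 (assumed range).
M = [[f(i, j) for j in range(-5, 11)] for i in range(-5, 11)]

for row in M:
    print("".join(str(v) for v in row))
```

The 1's fill the corner of the grid where x1/2 + x2/4 exceeds 3, so the boundary between firing and not firing is a straight line in the xy plane.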
All the way, from left side to right side:
Now that we know how a single node reacts to stimuli, we can determine the outputs of the
output units for a choice of input units. We consider a very simple neural net:
There are two input nodes, v1 and v2, three hidden units, v3, v4, and v5, and one output
node v6.
Let's assign some weights to the edges.
$$(w_{13},\ w_{14},\ w_{24},\ w_{25},\ w_{36},\ w_{46},\ w_{56}) := (1,\ 1,\ 1,\ 1,\ 1,\ -2,\ 1)$$
We must decide upon an activation function. Let's choose: $f(x) := (x > 0)$
Pick threshold values:

$$(t_3,\ t_4,\ t_5,\ t_6) := (0,\ 1.5,\ 0,\ 0.5)$$

Now pick the input values:

$$(y_1,\ y_2) := (1,\ 1)$$
For the input layer we assume the thresholds are zero and the activation function is the
identity, so that the signal put into v1 is the same as the signal coming out from v1.
$$y_3 := f(y_1 w_{13} - t_3) \qquad y_4 := f(y_1 w_{14} + y_2 w_{24} - t_4) \qquad y_5 := f(y_2 w_{25} - t_5)$$

$$y_6 := f(y_3 w_{36} + y_4 w_{46} + y_5 w_{56} - t_6)$$

The output unit for the corresponding input pattern is $y_6 = 0$.
Do you recognize this binary function? (Hint: It's one of the standard logical operations.)
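To experiment with other inputs, the same forward pass can be sketched in Python. The weight on the edge from v4 to v6 is taken to be -2 here; that sign is an assumption, chosen because it is the value consistent with the stated output of 0 for the input (1, 1). The dictionary layout and function names are likewise illustrative only.

```python
def f(x):
    """Activation function chosen in the text: 1 if x > 0, else 0."""
    return 1 if x > 0 else 0


# Edge weights; the sign of w46 is an assumption (see the lead-in).
W = {"13": 1, "14": 1, "24": 1, "25": 1, "36": 1, "46": -2, "56": 1}
# Thresholds for the hidden units v3, v4, v5 and the output unit v6.
T = {3: 0, 4: 1.5, 5: 0, 6: 0.5}


def forward(y1, y2):
    """Propagate the inputs y1, y2 through the hidden layer to the output y6."""
    y3 = f(y1 * W["13"] - T[3])
    y4 = f(y1 * W["14"] + y2 * W["24"] - T[4])
    y5 = f(y2 * W["25"] - T[5])
    return f(y3 * W["36"] + y4 * W["46"] + y5 * W["56"] - T[6])


# Evaluate the net on all four binary input patterns.
for y1 in (0, 1):
    for y2 in (0, 1):
        print(y1, y2, "->", forward(y1, y2))
```

Printing all four rows gives the full truth table of the net, which should make the answer to the question above easy to check.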
Learning in Neural Nets
Let's now consider how to change the net. Thinking of the graph as a brain, it seems clear
that as learning goes on, the vertices (neurons) aren't going to go wandering all about.
That is, as we learn, the cellular structure of the brain can't move around very much. It was
found that as we learn, the chemical structure of the brain does change in small local ways.
When we learn to do something, or not to do something else, various connections
between the neurons are either strengthened or weakened. This corresponds to a
change in the edge weights of our network. We start with the simplest type of neural net, a
two-layered, feedforward net. We will show how the weight changes take place. Since
there are only two layers, and in every feedforward net there is both an input and output
layer, there can be no hidden units.
We say that a layered graph is fully connected if every node in each layer is connected to
every node in the next layer to the right. It generally looks like:
Note that nodes in one layer aren't connected to any other nodes in the same layer. This is
always the case in layered neural nets.
There is a routine that we can carry out so that the neural net can figure out what the weights
on the edges should be to realize a certain set of fixed reactions. We feed the net specific
inputs with known desired outputs. We compare the network's output with the desired
output and change weights accordingly. This routine is then repeated until all outputs are
correct for all inputs.
Essentially, this can be thought of as a pattern recognition problem. Let's say that we have
a 2 layer neural net with two input nodes, and one output. We might want to teach the net to
produce the result v1 AND v2 for the output v3, using the following logic table:

v1   v2   v3
0    0    0
0    1    0
1    0    0
1    1    1

The net must be trained to recognize the pattern (1,1) as 1 and the other three as 0, in the
same way as you apply a name to a face.
For problems of this type it is often convenient to talk about input and output patterns. We've
already mentioned that the input can be thought of as a pattern. The output can be thought of
as one as well. Consider a big neural net with one input node, some hidden units, and 64 output
nodes arranged as an 8 by 8 square. We could train the net so that, given an input of 0, it sends
1's to the outermost units of the square, and 0's to all others. We could, in addition, teach it that
given an input of 1, it should produce outputs of 1's to the fourth and fifth columns in the square
of output units, and 0's to the others. It would look like:
The 0's have been left out for clarity. The ellipse in the middle represents the hidden units.
The square represents the output units in an 8 by 8 square. As you can see, the output now
represents a pattern in the visual sense. The output looks like the numeral for the input (well,
sort of). Note that this is completely equivalent to learning the action of a function.
As far as the computer is concerned, the neural net is a function,
$f: \mathbb{R} \to \mathbb{R}^{64}$, with the following property: $f(0)$ is the outline
pattern and $f(1)$ is the two-column pattern described above.
In this way we realize that pattern recognition and learning the action of a fixed function are the
same in principle.
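To make the idea of an output pattern concrete, the two target patterns described earlier can be written down as the values such a function should take. This sketch only builds the desired outputs, not a trained net; the name `target_pattern` and the row-by-row flattening of the square into 64 numbers are assumptions made for illustration.

```python
SIZE = 8  # the output units form an 8 by 8 square


def target_pattern(x):
    """Desired 64-unit output, listed row by row, for the input x (0 or 1)."""
    if x == 0:
        # 1's on the outermost units of the square, 0's elsewhere.
        grid = [[1 if r in (0, SIZE - 1) or c in (0, SIZE - 1) else 0
                 for c in range(SIZE)] for r in range(SIZE)]
    elif x == 1:
        # 1's in the fourth and fifth columns, 0's elsewhere.
        grid = [[1 if c in (3, 4) else 0 for c in range(SIZE)]
                for r in range(SIZE)]
    else:
        raise ValueError("only the inputs 0 and 1 have a target pattern")
    return [v for row in grid for v in row]


def show(pattern):
    """Print a 64-unit pattern as an 8 by 8 picture, leaving out the 0's."""
    for r in range(SIZE):
        row = pattern[r * SIZE:(r + 1) * SIZE]
        print("".join("1" if v else " " for v in row))


show(target_pattern(0))
print()
show(target_pattern(1))
```

Training the net then means adjusting the weights until its 64 outputs match these lists for each of the two inputs.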
With this in mind, there is a learning algorithm which teaches the two-layered feedforward
neural net to recognize patterns. It works as follows:
2-Layer Feedforward Learning for binary input and one binary output and all thresholds equal
Note: By "binary" in this section, we mean the set {-1, 1} (we use -1 instead of the usual 0).
Assume we start with the edges having random weights assigned to them. Then, given an
input pattern I (some sequence of 1's and -1's), there is an output pattern Z (a number,
though in general, not the correct one) and a corresponding desired output pattern O (also
a number). The weights going out from the v-th input unit must be changed by adding:

$$\Delta w_v = c\, O\, I_v\, [1 - (O = Z)]$$

so that

$$w_v := w_v + c\, O\, I_v\, [1 - (O = Z)]$$

Here $(O = Z)$ equals 1 when the two values agree and 0 otherwise, and c is a small
increment. We find the direction for the change from the $O\, I_v\, [1 - (O = Z)]$ part. The
step size is given by c. Note that if the net's output is the ideal desired output (i.e., it has
learned to identify that pattern or function correctly), then O = Z. In this case $\Delta w = 0$ for the net,
so no changes will take place. This follows the "if it ain't broke, don't fix it" principle of higher
computer science. Since a function usually consists of correctly identifying several patterns
(one pattern for each point in its domain), we would like to see this net learn several different
patterns concurrently. This is one of the real advantages of the neural net model. It can learn
several different things without changing its basic structure. You can have a neural net learn
the AND function, and then with a change of weights learn the OR function. No new circuitry is
needed. And in this case, the underlying graph is completely identical.
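The weight change for a single pattern can be sketched as a small Python function. The function name and argument names are hypothetical; the rule itself is the one stated above, with the comparison O = Z counted as 1 when true and 0 when false.

```python
def weight_change(c, desired, actual, inputs):
    """Change c * O * I_v * [1 - (O = Z)] for every input unit v."""
    wrong = 0 if desired == actual else 1
    return [c * desired * i_v * wrong for i_v in inputs]


# Net already correct: nothing changes.
print(weight_change(0.3, -1, -1, [1, -1, -1]))

# Net answered 1 where -1 was wanted: each weight moves against its input.
print(weight_change(0.3, -1, 1, [1, -1, 1]))
```

In the second call each weight is nudged by the step size c in the direction that pulls the weighted sum toward the desired sign.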
The more patterns we try to make the net learn, the more likely it will incorrectly remember a
previously learned pattern. Luckily, the weights won't have changed much (with c small), so
we keep training and retraining. In certain cases, it has been proven that this method must
converge to successful recognition of several patterns in a finite number of steps. This problem
is very much like the tent peg problem. It's easy to nail in one peg, but while nailing the
second peg, you've loosened the first, which then has to get rehammered.
One final improvement before continuing. Since we want to be able to change the thresholds
as the network learns, we treat them as weights for new edges. To do this we add a new
node for each different threshold in the net. When we give the net its input patterns, we make
sure the value of 1 goes to the nodes providing threshold values. The weight on an edge
connecting such a vertex to the next layer will work as a threshold.
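The equivalence behind this trick can be checked with a short sketch. It assumes the sign convention that the new edge carries the weight -t, so that adding its contribution is the same as subtracting the threshold t; the function names and the sample weights are illustrative only.

```python
def step(x):
    """All-or-none activation: 1 if x > 0, else 0."""
    return 1 if x > 0 else 0


def response_with_threshold(inputs, weights, t):
    """Original form: subtract the threshold t from the weighted sum."""
    return step(sum(i * w for i, w in zip(inputs, weights)) - t)


def response_with_threshold_unit(inputs, weights, t):
    """Same node, with t moved onto an extra edge fed by a constant input of 1."""
    extended_inputs = [1] + list(inputs)
    extended_weights = [-t] + list(weights)  # assumed convention: weight is -t
    return step(sum(i * w for i, w in zip(extended_inputs, extended_weights)))


# Hypothetical node with two inputs, unit weights, and threshold 1.5.
weights, t = [1, 1], 1.5
for a in (0, 1):
    for b in (0, 1):
        print((a, b),
              response_with_threshold([a, b], weights, t),
              response_with_threshold_unit([a, b], weights, t))
```

Both columns of output agree on every row, which is why the learning rule can adjust the threshold exactly as it adjusts any other weight.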
Let's try an example: Say we want the computer to come up with a neural net that will produce
an AND function. We start with a net that has two input nodes, one output node, and no hidden
units. This is only a guess. In general it is a difficult problem to know how many units are
needed to solve your problem, and if it's solvable by these methods at all. Let's assume that
all thresholds will be the same through the learning. In this case it is sufficient to add only one
input node (which will always get an input value of 1). The network looks like this:
With a little foresight and a hunch based on our choice of the binary system as {-1, 1}, we
choose the activation function accordingly:

$$f(x) := \frac{x}{|x| + (x = 0)}$$

[Plot: f(x) against x]

$$f(0) = 0 \qquad f(5) = 1 \qquad f(-5) = -1$$
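This activation function is a sign function that avoids dividing by zero: the comparison in the denominator contributes 1 only when x is 0. A minimal Python sketch, assuming the comparison evaluates to 1 or 0 as it does in the worksheet:

```python
def f(x):
    """x / (|x| + (x = 0)): gives -1, 0, or 1 according to the sign of x."""
    return x / (abs(x) + (1 if x == 0 else 0))


print(f(0), f(5), f(-5))
```

Any positive weighted sum is mapped to 1 and any negative one to -1, which matches the {-1, 1} coding of the patterns.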
$$k := 0, \ldots, 2$$

We start with the weights set randomly. Let's try:

$$w_0 := 1 \qquad w_1 := 0 \qquad w_2 := 2 \qquad c := 0.3$$

For this network, the output, Z, is given by:

$$Z(I_0, I_1, I_2) := f(I_0 w_0 + I_1 w_1 + I_2 w_2)$$
We begin with the first pattern (1, -1, -1). This has an ideal output of -1.

$$I := (1,\ -1,\ -1)^T \qquad O := -1$$

The actual output is: $Z_1 := Z(1, -1, -1)$, $Z_1 = -1$

The change of weights: $c\, O\, I_k\, [1 - (O = Z_1)] = (0,\ 0,\ 0)^T$

Change the weights: $w_k := w_k + c\, O\, I_k\, [1 - (O = Z_1)]$

New weights: $w_0 = 1 \qquad w_1 = 0 \qquad w_2 = 2$
The second pattern (1, 1, -1). This has an ideal output of -1.

$$I := (1,\ 1,\ -1)^T \qquad O := -1$$

The actual output is: $Z_2 := Z(1, 1, -1)$, $Z_2 = -1$

The change of weights: $c\, O\, I_k\, [1 - (O = Z_2)] = (0,\ 0,\ 0)^T$

Change the weights: $w_k := w_k + c\, O\, I_k\, [1 - (O = Z_2)]$

New weights: $w_0 = 1 \qquad w_1 = 0 \qquad w_2 = 2$
The third pattern (1, -1, 1). This has an ideal output of -1.

$$I := (1,\ -1,\ 1)^T \qquad O := -1$$

The actual output is: $Z_3 := Z(1, -1, 1)$, $Z_3 = 1$

The change of weights: $c\, O\, I_k\, [1 - (O = Z_3)] = (-0.3,\ 0.3,\ -0.3)^T$

Change the weights: $w_k := w_k + c\, O\, I_k\, [1 - (O = Z_3)]$

New weights: $w_0 = 0.7 \qquad w_1 = 0.3 \qquad w_2 = 1.7$
The fourth pattern (1, 1, 1). This has an ideal output of +1.

$$I := (1,\ 1,\ 1)^T \qquad O := 1$$

The actual output is: $Z_4 := Z(1, 1, 1)$, $Z_4 = 1$

The change of weights: $c\, O\, I_k\, [1 - (O = Z_4)] = (0,\ 0,\ 0)^T$

Change the weights: $w_k := w_k + c\, O\, I_k\, [1 - (O = Z_4)]$

New weights: $w_0 = 0.7 \qquad w_1 = 0.3 \qquad w_2 = 1.7$
At this point we've made a pass through each pattern exactly once. We repeat this
procedure several times, until the weights stabilize. To do this, change the initial
assignments of the weights to the edges (where the big red arrow is.) Then page
down to see what the new weights should be.
Eventually, you will see that the matrices of weight changes are all zero. At this point the
weights stop changing, and the output will be the correctly predicted and desired output
for each pattern. This should take six complete passes starting with w0=1, w1=0, and
w2=2.
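The repeated passes can also be carried out in a short Python loop instead of by paging through the worksheet. This is a sketch under stated assumptions: the first input of every pattern is the constant 1 feeding the threshold edge, the step size is 0.3, and the patterns are presented in the order (1, -1, -1), (1, 1, -1), (1, -1, 1), (1, 1, 1). The placement of the -1 entries in the first three patterns is inferred from the weight changes in the worked example, and the function and variable names are hypothetical.

```python
def f(x):
    """Sign-like activation: x / (|x| + (x = 0))."""
    return x / (abs(x) + (1 if x == 0 else 0))


# (input pattern, desired output) for AND on {-1, 1}; the order is an assumption.
PATTERNS = [
    ((1, -1, -1), -1),
    ((1, 1, -1), -1),
    ((1, -1, 1), -1),
    ((1, 1, 1), 1),
]


def train(weights, c=0.3, max_passes=100):
    """Repeat passes over all patterns until one pass changes no weight."""
    w = list(weights)
    for passes in range(1, max_passes + 1):
        changed = False
        for inputs, desired in PATTERNS:
            actual = f(sum(i * w_k for i, w_k in zip(inputs, w)))
            if actual != desired:
                # w_k := w_k + c * O * I_k (the bracket equals 1 here)
                w = [w_k + c * desired * i for w_k, i in zip(w, inputs)]
                changed = True
        if not changed:
            return w, passes
    raise RuntimeError("weights did not stabilize within max_passes")


final_weights, passes = train([1, 0, 2])
print(passes, [round(w_k, 1) for w_k in final_weights])
```

Starting from w0=1, w1=0, and w2=2, the loop stops on the sixth pass, the first one in which no weight changes, in agreement with the count given above.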
In Future Issues:
BIG, Multilayered neural nets,
Gradient Descent Learning, and
Back Propagation Learning.