
Early Neural Network Development History: The Age of Camelot

R.C. Eberhart and R.W. Dobbins
The Johns Hopkins University Applied Physics Laboratory

The development history of neural networks can be divided into four segments, or "Ages." We have arbitrarily set the beginning of the first Age to the time of William James, about a century ago. We call this the Age of Camelot, and it ends in 1969, with the publication of the book by Minsky and Papert on perceptrons [1]. This article reviews these early days of neural network research. Following the Age of Camelot is the Dark Age (or Depression Age), running from 1969 until 1982, when Hopfield's landmark paper on neural networks and physical systems was published [2]. The third age, the Renaissance, begins with Hopfield's paper and ends with the publication of Parallel Distributed Processing, Volumes 1 and 2, by Rumelhart and McClelland in 1986 [3,4]. The fourth age, named the Age of Neoconnectionism after the review article by Cowan and Sharp on neural nets and artificial intelligence [5], runs from 1987 until the present. Much of the material in this article is excerpted from a book edited by the authors [6]. Also presented in the book is further discussion of the other three Ages of neural network development.

The history is reviewed here somewhat differently than in most other articles on neural networks, in that the focus is on people rather than just on theory or technology. We review the contributions of a number of individuals, and relate them to how neural network tools are being implemented today. A neural network tool is an analysis tool modeled after the massively parallel structure of the human brain. This tool simulates a highly interconnected, parallel computational structure with many relatively simple individual processing elements, or neurodes.

The selection of individuals discussed in this article is somewhat arbitrary. The intent is to provide a broad sampling of people who contributed to current neural network technology, not an exhaustive list. Some well known neural networkers are mentioned only briefly, and others are omitted altogether. We discuss the selected people and their contributions roughly in chronological order.

The Age of Camelot

William James

We begin our look at neural network history in the Age of Camelot with perhaps the greatest American psychologist who ever lived, William James. James also taught, and thoroughly understood, physiology. It has been almost exactly a century since James published his "Principles of Psychology," and its condensed version, "Psychology (Briefer Course)" [7]. James was the first to publish a number of facts related to brain structure and function. He first stated, for example, some of the basic principles of correlational learning and associative memory. In stating what he called his Elementary Principle, James wrote:

"Let us then assume as the basis of all our subsequent reasoning this law: When two elementary brain processes have been active together or in immediate succession, one of them, on reoccurring, tends to propagate its excitement into the other."

This is closely related to the concepts of associative memory and correlational learning. James seemed to foretell the notion of a neuron's activity being a function of the sum of its inputs, with past correlation history contributing to the weight of interconnections, when he wrote:



"The amount of activity at any given point in the brain cortex is the sum of the tendencies of all other points to discharge into it, such tendencies being proportionate to the number of times the excitement of each other point may have accompanied that of the point in question; to the intensity of such excitements; and to the absence of any rival point functionally disconnected with the first point, into which the discharges might be diverted."

McCulloch and Pitts

More than half a century later, McCulloch and Pitts [8] published one of the most famous "neural network" papers, in which they derived theorems related to models of neuronal systems based on what was known about biological structures in the early 1940s. In coming to their conclusions, they stated five physical assumptions: The activity of the neuron is an 'all-or-none' process; a certain fixed number of synapses must be excited within the period of latent addition in order to excite a neuron at any time, and this number is independent of previous activity and position on the neuron; the only significant delay within the nervous system is synaptic delay; the activity of any inhibitory synapse absolutely prevents excitation of the neuron at that time; the structure of the net does not change with time.

The period of "latent addition" is the time during which the neuron is able to detect the values present on its inputs, the synapses. This time was described by McCulloch and Pitts as typically less than 0.25 ms. The "synaptic delay" is the time delay between sensing inputs and acting on them by transmitting an outgoing pulse, stated by McCulloch and Pitts to be on the order of 0.5 ms. The neuron described by these five assumptions is known as the "McCulloch-Pitts neuron." The theories they developed were important for a number of reasons, including the fact that any finite logical expression can be realized by networks of their neurons. They also appear to be the first authors since William James to describe a massively parallel neural model.

While the paper was very important, it was (and still is) very difficult to read. In particular, the theorem proofs presented by McCulloch and Pitts have stopped more than one engineer in their tracks! Furthermore, although the paper has proved to be an important milestone, not all of the concepts presented in it are implemented in today's neural network tools. In this article, comparisons are not made between the theories and conclusions of McCulloch and Pitts (or anyone else) and current theories of neural biology. The focus here is strictly on the implementation (or non-implementation) of their ideas in neural network tools.

One example of something not generally being implemented is their all-or-none neuron. A binary, on or off, neurode is used in neural networks such as the Boltzmann Machine, but it is not generally used in most neural networks today. Much more common is a neurode whose output value can vary continuously over some range, such as from 0 to 1, or -1 to 1. Another example involves the signal required to "excite" a neurode. First of all, since the output of a neurode generally varies continuously with the input, there is no "threshold" at which an output appears. Some neural network tools use neurodes that activate at some threshold, but this is not commonly done.

For neurodes with either continuous outputs or thresholds, there is no "fixed number of connections" (synapses) that must be excited. The net input to a neurode is generally a function of the outputs of the neurodes connected to it upstream (presynaptically), and the connection strengths to those presynaptic neurodes.
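To make the contrast concrete, here is a brief Python sketch (our own illustration, not code from [8]): a neurode's net input is taken as the weighted sum of its presynaptic outputs, and the same net input is passed either through a McCulloch-Pitts-style all-or-none threshold or through a continuous squashing function of the kind more common in today's tools. The weights and threshold value are arbitrary examples.

```python
import math

def net_input(presynaptic_outputs, weights):
    """Net input: weighted sum of upstream (presynaptic) neurode outputs."""
    return sum(x * w for x, w in zip(presynaptic_outputs, weights))

def mcculloch_pitts_output(presynaptic_outputs, weights, threshold):
    """All-or-none neurode: fires (1) only if the net input reaches the threshold."""
    return 1 if net_input(presynaptic_outputs, weights) >= threshold else 0

def continuous_output(presynaptic_outputs, weights):
    """Continuous neurode: output varies smoothly with net input (here between 0 and 1)."""
    return 1.0 / (1.0 + math.exp(-net_input(presynaptic_outputs, weights)))

# Example: three upstream neurodes, one inhibitory (negative-weight) connection.
outputs = [1.0, 0.5, 1.0]
weights = [0.8, 0.4, -0.6]
print(mcculloch_pitts_output(outputs, weights, threshold=1.0))  # 0 or 1
print(continuous_output(outputs, weights))                      # a value in (0, 1)
```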
A third example is that there is generally no delay associated with the connection (synapse) in a neural network tool. Typically, the output states (activation levels) of the neurodes are updated synchronously, one slab (or layer) at a time. Sometimes, as in Boltzmann Machines, they are updated asynchronously, with the update order determined stochastically. There is almost never, however, a delay built into a connection from one neurode to another.
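The two update disciplines just mentioned can be sketched as follows (a schematic illustration under our own simplifying assumptions — a small, fully interconnected net of continuous-output neurodes — not a faithful Boltzmann Machine):

```python
import math
import random

def act(x):
    """Continuous activation function."""
    return 1.0 / (1.0 + math.exp(-x))

def synchronous_step(state, weights):
    """All neurodes updated together from the previous activation levels."""
    n = len(state)
    return [act(sum(weights[i][j] * state[j] for j in range(n))) for i in range(n)]

def asynchronous_step(state, weights):
    """One neurode, chosen at random, is updated in place (stochastic update order)."""
    i = random.randrange(len(state))
    state[i] = act(sum(weights[i][j] * state[j] for j in range(len(state))))
    return state

weights = [[0.0, 0.5, -0.3],
           [0.5, 0.0, 0.2],
           [-0.3, 0.2, 0.0]]
state = [0.1, 0.9, 0.4]
print(synchronous_step(state, weights))
print(asynchronous_step(state[:], weights))
```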
A fourth example is that the activation of a single inhibitory connection usually does not disable or deactivate the neuron to which it is connected. Any given inhibitory connection (a connection with a negative weight) has the same absolute magnitude effect, albeit subtractive, as the additive effect of a positive connection with the same absolute weight.

Referring to the fifth assumption of McCulloch and Pitts, it is true that the structure of a neural network tool usually does not change with time, with a couple of caveats. First, it is usual to "train" neural networks, such as backpropagation and self-organizing networks, prior to their use. During the training process, the structure doesn't usually change, but the interconnecting weights do. In addition, it is not uncommon, once training is complete, for neurodes that aren't contributing significantly to be removed. This certainly can be considered a change to the structure of the network.

But wait a minute! What are we left with from McCulloch and Pitts' five assumptions? If truth be told, when referring to today's neural network tools, we are in most cases left with perhaps one: the fifth. Then why do we make so much of their 1943 paper? First of all, because they proved that networks of their neurons could represent any finite logical expression. Second, because of their use of a massively parallel architecture. And third, McCulloch and Pitts provided the stepping stones for the development of network models and learning paradigms that followed.

Just because neural network tools don't currently always reflect their work doesn't imply in any way that their work was bad. Our neural network tools don't always reflect what we currently understand about biological neural networks, either. For instance, it appears that in many cases, a neuron acts somewhat like a voltage controlled oscillator (VCO), with the output frequency a function of the input level (input voltage). The higher the input, the more pulses per second the neuron generates. Neural network tools usually work with basically steady state values of the neurode from one update to the next.



Donald Hebb

The next personality along our journey through the Age of Camelot is Donald O. Hebb. His 1949 book entitled "The Organization of Behavior" [9] was the first to define the method of updating synaptic weights that we now refer to as "Hebbian." He is also among the first to use the term "connectionism."

Hebb presented his method as a "neurophysiological postulate" in the chapter entitled "The First Stage of Perception: Growth of the Assembly." It is stated as follows:

"When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."

Hebb made four primary contributions to neural network theory. First, he stated that in a neural network, information is stored in the weight of the synapses (connections). Second, he postulated a connection weight learning rate that is proportional to the product of the activation values of the neurons. Note that his postulate assumed that the activation values are positive. Since he didn't provide for the weights to be decreased, they could theoretically go infinitely high. Others since Hebb have labeled learning that involves neurons with negative activation values as "Hebbian". This is not included in Hebb's original formulation, but is a logical extension of it.

Third, he assumed that weights are symmetric. That is, the weight of a connection from neuron A to neuron B is the same as that from B to A. While this may or may not be true in biological neural networks, it is generally applied to implementations in neural network tools for computers.

Fourth, he postulated a "cell assembly theory," which states that as learning occurs, strengths and patterns of synapse connections (weights) change, and assemblies of cells are created by these changes. Stated another way, if simultaneous activation of a group of weakly connected cells occurs repeatedly, these cells try to coalesce into a more strongly connected assembly. All four of Hebb's contributions are generally implemented in today's neural network tools, at least to some degree. We often refer to learning schemes implemented in some networks as Hebbian.
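A minimal sketch of the weight update implied by Hebb's postulate — each connection weight increased in proportion to the product of the two activation values — might look like this (our illustration, not Hebb's formulation; note that, exactly as discussed above, nothing in it ever decreases a weight, so repeated co-activation drives the weights up without bound):

```python
def hebbian_update(weights, activations, learning_rate=0.1):
    """Increase weights[i][j] in proportion to the product of the activations of
    neurodes i and j; nothing here ever decreases a weight."""
    n = len(activations)
    for i in range(n):
        for j in range(n):
            if i != j:
                weights[i][j] += learning_rate * activations[i] * activations[j]
    return weights

w = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(5):                  # repeated co-activation strengthens the link
    w = hebbian_update(w, [1.0, 0.8])
print(w)                            # weights only grow, and grow symmetrically
```

Because w[i][j] and w[j][i] receive the same increment, this sketch also reflects Hebb's assumption of symmetric weights.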
Frank Rosenblatt

In 1958, a landmark paper by Frank Rosenblatt [10] defined a neural network structure called the perceptron. The perceptron was probably the first honest-to-goodness "neural network tool" because it was simulated in detail on an IBM 704 computer at the Cornell Aeronautical Laboratory. The computer-oriented paper caught the imagination of engineers and physicists, despite the fact that its mathematical proofs, analyses, and descriptions contained tortuous twists and turns. If you can wade through the variety of systems and modes of organization in the paper, you'll see that the perceptron is capable of learning to classify certain pattern sets as similar or distinct by modifying its connections. It can therefore be described as a "learning machine."

Rosenblatt used biological vision as his network model. Input node groups consisted of random sets of cells in a region of the retina, each group being connected to a single Association Unit (AU) in the next higher layer. AUs were connected bidirectionally to Response Units (RUs) in the third (highest) layer. The perceptron's objective was to activate the correct RU for each particular input pattern class. Each RU typically had a large number of connections to AUs.

Rosenblatt devised two ways to implement the feedback from RUs to AUs. In the first, activation of an RU would tend to excite the AUs that sent the RU excitation (positive feedback). In the second, inhibitory connections existed between the RU and the complement of the set of AUs that excited it (negative feedback), therefore inhibiting activity in AUs which did not transmit to it. Rosenblatt used the second option for most of his systems. In addition, for both options, he assumed that all RUs were interconnected with inhibitory connections.

Rosenblatt used his perceptron model to address two questions. First, in what form is information stored, or remembered? Second, how does stored information influence recognition and behavior? His answers were as follows:

"...the information is contained in connections or associations rather than topographic representations...since the stored information takes the form of new connections, or transmission channels in the nervous system (or the creation of conditions which are functionally equivalent to new connections), it follows that the new stimuli will make use of these new pathways which have been created, automatically activating the appropriate response without requiring any separate process for their recognition or identification."

The primary perceptron learning mechanism is "self organizing" or "self associative" in that the response that happens to become dominant is initially random. But Rosenblatt also described systems where training, or "forced responses," occurred. This paper laid the groundwork for both supervised and unsupervised training algorithms as seen today in backpropagation and Kohonen networks, respectively. The basic structures set forth by Rosenblatt are therefore alive and well.
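As a rough schematic of the connectivity just described (our own sketch, with arbitrary sizes and random wiring, not Rosenblatt's actual model), the following sets up random retina-cell groups feeding Association Units, excitatory AU-to-RU connections, inhibitory feedback from each RU to the complement of the AUs that excite it, and mutual inhibition among the RUs:

```python
import random

N_RETINA, N_AU, N_RU = 20, 6, 2      # arbitrary example sizes

# Each AU receives a random group of retina cells.
au_inputs = [set(random.sample(range(N_RETINA), 5)) for _ in range(N_AU)]

# Each RU receives excitation from a random subset of AUs.
ru_excite = [set(random.sample(range(N_AU), 4)) for _ in range(N_RU)]

# Second feedback option: each RU inhibits the AUs that did NOT transmit to it.
ru_inhibit = [set(range(N_AU)) - excite for excite in ru_excite]

# All RUs are interconnected with inhibitory links.
ru_mutual_inhibit = [(i, j) for i in range(N_RU) for j in range(N_RU) if i != j]

print(au_inputs[0], ru_excite[0], ru_inhibit[0], ru_mutual_inhibit)
```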
Widrow and Hoff

Our last stop in the Age of Camelot is with Bernard Widrow and Marcian Hoff. In 1960, they published a paper entitled "Adaptive Switching Circuits" that, particularly from an engineering standpoint, has become one of the most important papers on neural network technology [11]. The paper is important from several perspectives. We'll briefly mention a few, and go into more detail about a few more.

Widrow and Hoff are the first engineers we've talked about in our article. Not only did they design neural network tools that they simulated on computers, they implemented their designs in hardware. And at least one of the lunchbox-sized machines they built "way back then" is still in working order! Widrow and Hoff introduced a device called an "Adaline" (fig. 1). Adaline stands for Adaptive Linear. It consists of a single neurode with an arbitrary number of input elements that can take on values of plus or minus one, and a bias element that is always plus one. Before being summed by the neurode summer, each input, including the bias, is modified by a unique weight that Widrow and Hoff call a "gain." This reflects their engineering background, since the term gain refers to the amplification factor that an electronic signal undergoes when processed by an amplifier. The term "gain" may be more descriptive of the function performed than the commonly used term weight.

[Fig. 1. Adaline, an adjustable neuron, consists of a single neurode with an arbitrary number of input elements, each of which may assume a value of +1 or -1, and a bias element that is always at +1.]

At the output of the summer is a quantizer that is at +1 if the summer output, including the bias, is greater than zero. The quantizer's output is -1 for summer outputs less than or equal to zero.

The learning algorithm of the Adaline is particularly ingenious. One of the main problems with perceptrons was the length of time it took them to learn to classify patterns correctly. The Widrow-Hoff algorithm yields learning that is faster and more accurate. The algorithm is a form of supervised learning that adjusts the weights (gains) according to the size of the error on the output of the summer. Widrow and Hoff showed that the way they adjust the weights will minimize the sum squared error over all patterns in the training set. For that reason, the Widrow-Hoff method is also known as the Least Mean Squares (LMS) algorithm. The error is the difference between what the output of the Adaline should be and the output of the summer. The sum-squared error is obtained by measuring the error for each pattern presented to the Adaline, squaring each value, and then summing all of the squared values. Minimizing the sum squared error involves an error reduction method called gradient descent, or steepest descent. Mathematically, it involves the partial derivatives of the error with respect to the weights.

But Widrow and Hoff showed that you don't have to apply the derivatives. They are proportional to the error (and its sign), and to the sign of the input. They further showed that, for n inputs, reducing the measured error of the summer by 1/n for each input will do a good job of implementing gradient descent. You adjust each weight until the error is reduced by 1/n of the total error you had to start with. For example, if there are 12 input nodes, each weight is adjusted to remove 1/12 of the total error. This method provides for weight adjustment (learning) even when the output of the classifier is correct. Consider the case where the output of the summer is 0.5, and the classifier output is 1.0. If the correct output is 1.0, there is still an error signal of 0.5 that is used to further train the weights. This is a significant improvement over the perceptron, which only adjusts weights when the classifier output is incorrect, and is one reason the learning is faster and more accurate.
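A minimal sketch of an Adaline trained this way might look like the following (our illustration of the idea, not Widrow and Hoff's hardware or code). Inputs and targets are +1 or -1, a bias element is fixed at +1, the error is measured at the summer output (before the quantizer), and each of the n gains is adjusted to remove 1/n of that error, so learning continues even when the quantized output is already correct.

```python
class Adaline:
    def __init__(self, n_inputs):
        # One gain (weight) per input plus one for the +1 bias element.
        self.gains = [0.0] * (n_inputs + 1)

    def summer(self, inputs):
        """Weighted sum of the +/-1 inputs and the +1 bias element."""
        return sum(g * x for g, x in zip(self.gains, inputs + [1.0]))

    def quantizer(self, inputs):
        """+1 if the summer output is greater than zero, otherwise -1."""
        return 1 if self.summer(inputs) > 0 else -1

    def train_pattern(self, inputs, target):
        """LMS step: split the summer error equally over the n gains."""
        error = target - self.summer(inputs)   # error measured at the summer
        n = len(self.gains)
        x = inputs + [1.0]
        for i in range(n):
            # Each gain removes 1/n of the error, with the sign of its input.
            self.gains[i] += (error / n) * (1 if x[i] >= 0 else -1)

ada = Adaline(2)
patterns = [([+1, +1], +1), ([+1, -1], +1), ([-1, +1], -1), ([-1, -1], -1)]
for _ in range(20):
    for x, t in patterns:
        ada.train_pattern(x, t)
print([ada.quantizer(x) for x, _ in patterns])   # matches the targets
```

On the four linearly separable patterns above, a few passes through the training set are enough for the quantizer output to match the targets.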



Widrow and Hoff's paper was prophetic, too. They suggested several practical implementations of their Adaline, stating: "If a computer were built of adaptive neurons, details of structure could be imparted by the designer by training (showing it examples of what he would like to do) rather than by direct designing."

An extension of the Widrow-Hoff learning algorithm is used today in backpropagation networks, and their work in hardware implementation of neural network tools heralded today's cutting edge work in VLSI by people including Carver Mead and his colleagues at Cal Tech [12].

Dr. Widrow is the earliest significant contributor to neural network hardware system development still working in the area of neural networks. He and his students also did the earliest work known to the authors in biomedical applications of neural network tools. One of his Ph.D. students, Donald F. Specht, used an extension of the Adaline, called an Adaptive Polynomial Threshold Element, to implement a vectorcardiographic diagnostic tool that used the polynomial discriminant method [13,14]. Widrow and his colleagues later did pioneering work using the LMS adaptive algorithm for analyzing adult and fetal electrocardiogram signals [15].

The Fall of Camelot

As the 1960s drew to a close, optimism was the order of the day. Many people were working in Artificial Intelligence (AI), both in the area exemplified by expert systems and in neural networks. Although many areas remained to be explored, and many problems were unsolved, the general feeling was that the sky was the limit. Little did most folks know that, for neural networks, the sky was about to fall. In 1969, Marvin Minsky and Seymour Papert dropped a bombshell on the neural network community in the form of the aforementioned book called "Perceptrons." The book, which contained an otherwise generally accurate analysis of simple perceptrons, concluded that "...our intuitive judgement [is] that the extension [to multilayer perceptrons with hidden layers] is sterile." At the least, this statement has proven to be a serious mistake. Nevertheless, nearly all funding for neural networks dried up after the book was published. It was the beginning of the Dark Age.

Conclusion

Since 1987 we have been experiencing the Age of Neoconnectionism, so named by Cowan and Sharp. The field of neural networks and the development of neural network tools for personal computers have expanded almost unbelievably in the past several years. It is no longer feasible to assemble "all there is to know" about the current state of neural networks in one volume, or one set of volumes, as the PDP Research Group attempted to do in 1986-1988 [3,4]. The list of applications has grown from one highlighting biological and psychological uses to ones as diverse as biomedical waveform classification, music composition, and prediction of the commodity futures market. And another shift is occurring that is even more important: the shift to personal computers for neural network tool implementation. Not that this is the only important trend in neural network research and development today. Significant work is also occurring in areas ranging from the prediction of protein folding using supercomputers to formulation of new network learning algorithms and neurode transfer functions. This article has provided a summary of how it all started. As we stated at the outset, the work of only a few of the neural network researchers and developers was described. Many who contributed significantly to the field were omitted. The intent was to give you a flavor of how the basis for current neural network tools evolved.

Russell C. Eberhart, a Senior Member of the IEEE, received his Ph.D. in Electrical Engineering from Kansas State University in 1972. He is a Program Manager in the [...] Programs Office of the Johns Hopkins University Applied Physics Laboratory. He is Vice President of the IEEE Neural Networks Council, and Vice Chairman of the Baltimore Chapter of the EMBS.

R.W. Dobbins received his [...] Science from the University of South Africa. Since 1987 he has been a consultant at the Johns Hopkins University Applied Physics Laboratory. Current interests include neural networks, CASE tools, and real-time programming.

The authors can be reached at the Johns Hopkins University, Applied Physics Laboratory, Johns Hopkins Rd., Laurel, MD 20707.

References

1. Minsky M, Papert S: "Perceptrons." MIT Press, Cambridge, MA, 1969.
2. Hopfield JJ: Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci., 79:2554-2558 (1982).
3. Rumelhart DE, McClelland JL: "Parallel Distributed Processing, Explorations in the Microstructure of Cognition, Vol. 1: Foundations." MIT Press, Cambridge, MA, 1986.
4. Rumelhart DE, McClelland JL: "Parallel Distributed Processing, Explorations in the Microstructure of Cognition, Vol. 2: Psychological and Biological Models." MIT Press, Cambridge, MA, 1986.
5. Cowan JD, Sharp DH: Neural nets and artificial intelligence. Daedalus, 117(1):85-121 (1988).
6. Eberhart RC, Dobbins RW, eds: "Neural Network PC Tools: A Practical Guide." Academic Press, New York, 1990.
7. James W: "Psychology (Briefer Course)." Holt, New York, 1890.
8. McCulloch WC, Pitts W: A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115-133 (1943).
9. Hebb DO: "The Organization of Behavior." John Wiley, New York, 1949.
10. Rosenblatt F: The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386-408 (1958).
11. Widrow B, Hoff ME: Adaptive switching circuits. 1960 IRE Convention Record, Part 4: Computers: Man-Machine Systems, Los Angeles, 96-104 (1960).
12. Mead C: "Analog VLSI and Neural Systems." Addison Wesley, Reading, MA, 1989.
13. Specht DF: Vectorcardiographic diagnosis using the polynomial discriminant method of pattern recognition. IEEE Trans. on Biomedical Engineering, BME-14(2):90-95 (1967).
14. Specht DF: Generation of polynomial discriminant functions for pattern recognition. IEEE Trans. on Electronic Computers, EC-16(3):308-319 (1967).
15. Widrow B, Glover JR Jr., McCool JM, Kaunitz J, Williams CS, et al: Adaptive noise cancelling: principles and applications. Proc. IEEE, 63(12):1692-1716 (1975).

