VLSI Implementations of
Neural Networks
15.1 Introduction
In the previous chapters of this book we presented a broad exposition of neural networks,
describing a variety of algorithms for implementing supervised and unsupervised learning
paradigms. In the final analysis, however, neural networks can only gain acceptance as
tools for solving engineering problems such as pattern classification, modeling, signal
processing, and control in one of two ways:
■ Compared to conventional methods, the use of a neural network makes a significant
difference in the performance of a system for a real-world application, or else it
provides a significant reduction in the cost of implementation without compromising
performance.
■ Through the use of a neural network, we are able to solve a difficult problem for
which there is no other solution.
Given that we have a viable solution to an engineering problem based on a neural network
approach, we need to take the next step: build the neural network in hardware, and embed
the piece of hardware in its working environment. It is only when we have a working
model of the system that we can justifiably say we fully understand it. The key question
that arises at this point in the discussion is: What is the most cost-effective medium for
the hardware implementation of a neural network? A fully digital approach that comes
to mind is to use a RISC processor; RISC is the acronym for reduced instruction set
computer (Cocke and Markstein, 1990). Such a processor is designed to execute a small
number of simple instructions, preferably one instruction for every cycle of the computer
clock. Indeed, because of the very high speed of modern-day RISC processors, their use
for the emulation of neural networks is probably fast enough for some applications.
However, for certain complex applications such as speech recognition and optical character
recognition, a level of performance is required that is not attainable with existing RISC
processors, certainly within the cost limitations of the proposed applications (Ham-
merstrom, 1992). Also, there are many situations such as process control, adaptive beam-
forming, and adaptive noise cancellation where the required speed of learning is much
too fast for standard processors. To meet the computational requirements of the complex
applications and highly demanding situations described here, we may have to resort to
the use of very-large-scale integrated (VLSI) circuits, a rapidly developing technology
that provides an ideal medium for the hardware implementation of neural networks.
In the use of VLSI technology, we have the capability of fabricating integrated circuits
with tens of millions of transistors on a single silicon chip, and it is highly likely that
this number will be increased by two orders of magnitude before reaching the fundamental
limits of the technology imposed by the laws of physics (Hoeneisen and Mead, 1972;
Keyes, 1987). We thus find that VLSI technology is well matched to neural networks for
two principal reasons (Boser et al., 1992):
1. The high functional density achievable with VLSI technology permits the implemen-
tation of a large number of identical, concurrently operating neurons on a single chip,
thereby making it possible to exploit the inherent parallelism of neural networks.
2. The regular topology of neural networks and the relatively small number of well-
defined arithmetic operations involved in their learning algorithms greatly simplify
the design and layout of VLSI circuits.
Accordingly, we find that there is a great deal of research effort devoted worldwide to
VLSI implementations of neural networks on many fronts. Today, there are general-
purpose chips available for the construction of multilayer perceptrons, Boltzmann
machines, mean-field-theory machines, and self-organizing neural networks. Moreover,
various special-purpose chips have been developed for specific information-processing
functions.
VLSI technology not only provides the medium for the implementation of complex
information-processing functions that are neurobiologically inspired, but also can be seen
to serve a complementary and inseparable role as a synthetic element to build test beds
for postulates of neural organization (Mead, 1989). The successful use of VLSI technology
to create a bridge between neurobiology and information sciences will have the following
beneficial effects: deeper understanding of information processing, and novel methods for
solving engineering problems that are intractable by traditional computer techniques
(Mead, 1989). The interaction between neurobiology and information sciences via the
silicon medium may also influence the very art of electronics and VLSI technology itself
by having to solve new challenges posed by the interaction.
With all these positive attributes of VLSI technology, it is befitting that we devote this
final chapter of the book to its use as the medium for hardware implementations of neural
networks. The discussion will, however, be at an introductory level.¹

¹ For detailed treatment of analog VLSI systems, with emphasis on neuromorphic networks, see the book by Mead (1989). For specialized aspects of the subject, see the March 1991, May 1992, and May 1993 Special Issues of the IEEE Transactions on Neural Networks. The report by Andreou (1992) provides an overview of analog VLSI systems with emphasis on circuit models of neurons, synapses, and neuromorphic functions.
15.2 Major Design Considerations

the use of this technology. Specifically, we may identify the following items (Hammerstrom, 1992):
1. Sum-of-Products Computation. This is a functional requirement common to the
operation of all neurons. It involves multiplying each element of an activation pattern
(data vector) by an appropriate weight, and then summing the weighted inputs, as described
in the standard equation

$$v_j = \sum_{i=1}^{p} w_{ji} x_i \qquad (15.1)$$

where w_ji is the weight of synapse i belonging to neuron j, x_i is the input applied to the ith synapse, p is the number of synapses, and v_j is the resulting activation potential of neuron j.
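The computation of Eq. (15.1) amounts to an inner product between a weight vector and an input vector; the following minimal sketch (plain Python, purely illustrative) makes this explicit:

```python
def activation_potential(weights, inputs):
    """Sum-of-products of Eq. (15.1): v_j = sum over i of w_ji * x_i."""
    assert len(weights) == len(inputs), "one weight per synapse"
    return sum(w * x for w, x in zip(weights, inputs))

# A neuron j with p = 3 synapses:
v_j = activation_potential([0.5, -1.0, 2.0], [1.0, 1.0, 0.5])  # 0.5 - 1.0 + 1.0 = 0.5
```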
2. Data Representation. Generally speaking, neural networks have low-precision requirements, the exact specification of which is algorithm/application dependent.
3. Output Computation. The most common form of activation function at the output of a neuron is a smooth nonlinear function such as the sigmoid function described by the logistic function,

$$\varphi(v_j) = \frac{1}{1 + \exp(-v_j)} \qquad (15.2)$$

or by the hyperbolic tangent function,

$$\varphi(v_j) = \tanh\left(\frac{v_j}{2}\right) = \frac{1 - \exp(-v_j)}{1 + \exp(-v_j)} \qquad (15.3)$$

These two forms of the sigmoidal activation function are linearly related to each other; see Chapter 6. Occasionally, the threshold function

$$\varphi(v_j) = \begin{cases} 1, & v_j > 0 \\ 0, & v_j < 0 \end{cases} \qquad (15.4)$$

is considered to be sufficient.
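These functions can be sketched directly; the sketch below assumes the standard logistic form, and verifies the linear relation between the two sigmoid forms, tanh_sigmoid(v) = 2·logistic(v) − 1 (illustrative only):

```python
import math

def logistic(v):
    """Logistic sigmoid, Eq. (15.2): 1 / (1 + e^-v)."""
    return 1.0 / (1.0 + math.exp(-v))

def tanh_sigmoid(v):
    """Hyperbolic-tangent sigmoid, Eq. (15.3): (1 - e^-v) / (1 + e^-v) = tanh(v/2)."""
    return (1.0 - math.exp(-v)) / (1.0 + math.exp(-v))

def threshold(v):
    """Threshold (hard-limiter) function, Eq. (15.4)."""
    return 1.0 if v > 0 else 0.0

# The two sigmoid forms are linearly related: tanh_sigmoid(v) = 2 * logistic(v) - 1
for v in (-2.0, 0.0, 1.5):
    assert abs(tanh_sigmoid(v) - (2.0 * logistic(v) - 1.0)) < 1e-12
```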
4. Learning Complexity. Each learning algorithm has computational requirements of
its own. Several popular learning algorithms rely on the use of local computations for
making modifications to the synaptic weights of a neural network; this is a highly desirable
feature from an implementation point of view. Some other algorithms have additional
requirements, such as the back-propagation of error terms through the network, which
imposes an additional burden on the implementation of the neural network, as in the case
of a multilayer perceptron trained with the back-propagation algorithm.
5. Weight Storage. This requirement refers to the need to store the “old” values of
synaptic weights of a neural network. The “new” values of the weights are computed
by using the changes computed by the learning algorithm to update the old values.
6. Communications. Metal is expensive in terms of silicon area, which leads to signifi-
cant inefficiencies if bandwidth utilization of communication (connectivity) links among
neurons is low. Connectivity is perhaps one of the most serious constraints imposed on
the fabrication of a silicon chip, particularly as we scale up analog or digital technology
to very large neural networks. Indeed, significant innovation in communication schemes
is necessary if we are to implement very large neural networks on silicon chips efficiently.
The paper by Bailey and Hammerstrom (1988) discusses the fundamental issues involved in the connectivity problem with the VLSI implementation of neural networks in mind.
15.3 Categories of VLSI Implementations

Analog Techniques
In an eloquent and convincing address presented at the First IEEE International Conference
on Neural Networks, Mead (1987a) argued in favor of a synthetic approach combining
silicon VLSI technology and analog circuits for the implementation of neural networks.
Although analog circuits do indeed suffer from lack of precision, this shortcoming is compensated by the efficiency of computations based on the principles of classical circuit theory and the laws of physics. Analog circuits can do certain computations that are difficult or time-consuming (or both) when implemented in the conventional digital paradigm, and do them with much less power (Mead, 1989).
Figure 15.1 shows the circuit symbols for the n-channel and p-channel types of MOS (metal-oxide-silicon) transistors, which use electrons and holes as their charge carriers, respectively. The technology so based is thus called complementary MOS, or CMOS.
function of a MOS transistor may be understood by examining the drain current I d defined
as a function of the gate-source voltage V,, with the drain being maintained at a fixed
voltage (2V, say). Two regions may be identified in such a functional dependence (Andreou,
1992; Mead, 1989):
The abovethreshold region, where the drain current I d is a quadratic function of the
gate-source voltage V,, .
w The subthreshold region, where the transistor is operated at low gate-source voltages
such that the drain current Id is an exponential function of the gate-source voltage
v,.
All things being equal, the exponential nonlinearity (i.e., operation in the subthreshold
region) is preferrable, because it provides more transconductance (i.e., aZd/av,) per unit
current. Moreover, in the subthreshold region, the MOS transistor can provide two useful
functions, depending on the drain-source voltage, as described here:
FIGURE 15.1 Circuit symbols for the n-channel and p-channel MOS transistors.
■ For small drain-source voltages (approximately, less than a few hundred millivolts), the device acts essentially as a controlled conductance with perfect symmetry between the source and the drain; this mode of operation is called the ohmic or linear region.

■ For larger values of drain-source voltage, the device is essentially a voltage-controlled current source (i.e., a sink).
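The two operating regions can be illustrated with a behavioral model of the subthreshold drain current; the constants I0 and KAPPA and the thermal voltage VT below are illustrative values chosen here, not figures from the text:

```python
import math

I0 = 1e-15     # pre-exponential current, amperes (illustrative assumption)
KAPPA = 0.7    # gate-coupling coefficient (illustrative assumption)
VT = 0.0258    # thermal voltage k_B*T/q in volts, at roughly T = 300 K

def drain_current(vg, vs, vd):
    """Behavioral subthreshold drain current: exponential in the gate voltage
    (Boltzmann's law), with a difference of source and drain terms in the
    bracket capturing the ohmic behavior at small drain-source voltages."""
    return I0 * math.exp(KAPPA * vg / VT) * (math.exp(-vs / VT) - math.exp(-vd / VT))

# Small drain-source voltage: roughly linear (ohmic) behavior.
# Large drain-source voltage: the current saturates (controlled current source).
i_ohmic = drain_current(0.3, 0.0, 0.01)
i_sat_a = drain_current(0.3, 0.0, 0.3)
i_sat_b = drain_current(0.3, 0.0, 0.5)
```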
In analog VLSI implementations of neuromorphic networks, whose purpose is to mimic specific neurobiological functions, the customary practice is to operate the MOS transistors in the subthreshold region; neuromorphic networks are discussed in Section 15.4.
Subthreshold operation of CMOS transistors exhibits the following useful characteristics:

$$V_T = \frac{k_B T}{q} \qquad (15.7)$$

where k_B is Boltzmann's constant, T is the absolute temperature measured in kelvins, and q is the electron charge. Note that the exponential functions in Eqs. (15.5) and (15.6) are all due to Boltzmann's law, and the exact difference between exponential functions inside the square brackets is a result of Ohm's law. The combination of these two equations defines the kind of analog computations that can be performed with subthreshold CMOS technology.
■ Circuit Level. This second level is governed by the conservation of charge and the conservation of energy, which, respectively, yield the two familiar equations:

$$\sum_i I_i = 0 \qquad (15.8)$$

$$\sum_i V_i = 0 \qquad (15.9)$$

Equation (15.8) is recognized as Kirchhoff's current law, and Eq. (15.9) is Kirchhoff's voltage law.
■ Architectural Level. At this last level, differential equations from mathematical physics are used to implement useful functions, depending on the application of interest.
In the analog approach described by Andreou (1992) and Andreou et al. (1991), a
minimalistic design style is adopted. The approach is motivated by the belief that a single
transistor is a powerful computational element that can provide gain and also some basic
computational functions. The design methodology is based on current-mode subthreshold
CMOS circuits, according to which the signals of interest are represented as currents, and
voltages play merely an incidental role. The current-mode approach offers signal processing
at the highest possible bandwidth, given the available silicon technologies and a fixed
amount of energy resources (Andreou, 1992).
In contrast, in the analog approach described by Mead (1989), a transconductance amplifier is taken as the basic building block. This amplifier, shown in its basic form in Fig. 15.2, is a device whose output current is a function of the difference between two input voltages, V_1 and V_2. It is referred to as a transconductance amplifier because it changes a differential input voltage, V_1 − V_2, into an output current. This differential voltage is taken as the primary signal representation. The bottom transistor Q_b in Fig. 15.2 operates as a current source, supplying a constant current I_b. The current I_b is divided between the two top transistors Q_1 and Q_2 in a manner determined by the differential voltage, V_1 − V_2. Assuming that the drain-source voltages of these two transistors are large enough for them to be driven into saturation, we find that the application of Eq. (15.5) to the differential transconductance amplifier of Fig. 15.2 yields (Mead, 1989)
$$I_{\mathrm{out}} = I_b \tanh\left(\frac{\kappa (V_1 - V_2)}{2 V_T}\right) \qquad (15.10)$$
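The resulting transfer characteristic of the differential pair is the familiar tanh of Mead (1989); a behavioral sketch follows, with the gate-coupling coefficient and thermal voltage as illustrative values:

```python
import math

KAPPA = 0.7    # gate-coupling coefficient (illustrative assumption)
VT = 0.0258    # thermal voltage in volts, at roughly T = 300 K

def transamp_output(v1, v2, i_bias):
    """Output current of the differential transconductance amplifier:
    the bias current splits between Q1 and Q2 according to a tanh of the
    differential input voltage v1 - v2 (after Mead, 1989)."""
    return i_bias * math.tanh(KAPPA * (v1 - v2) / (2.0 * VT))

# Near v1 = v2 the response is approximately linear; for large differential
# voltages the output saturates at +/- i_bias.
```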
FIGURE 15.2 Basic form of the transconductance amplifier.
FIGURE 15.3 Models for (a) inhibitory and (b) excitatory synapses.
Digital Techniques
There are two key advantages to the digital approach over the analog approach (Ham-
merstrom, 1992):
■ Ease of Design and Manufacture. The use of digital VLSI technology offers the advantages of high precision, ease of weight storage, and a cost-performance advantage in “programmability” over analog VLSI technology. Moreover, digital silicon processing is more readily available than analog.

■ Flexibility. The second and most important advantage of the digital approach is that it is much more flexible, permitting the use of many more complex algorithms and expanding the range of possible applications. In some cases, solving complex problems may require significant flexibility in the neural network architecture to be able to solve the problem at all. Lack of flexibility is indeed a fundamental limitation of analog systems; in particular, the level of complexity that the technology can deal
with often limits the range and scope of problems that can be solved with analog
technology.
However, a disadvantage of digital VLSI technology is that the digital implementation
of multiplication is both area- and power-hungry. Area requirements may be reduced by
using digital, multiplexed interconnect (Hammerstrom, 1992).
The ultimate choice of digital over analog technology cannot be settled unless we know which particular algorithms are being considered for neural network applications.
If, however, general-purpose use is the aim, then the use of digital VLSI technology has
a distinct advantage over its analog counterpart. We have more to say on this issue in
Section 15.4.
Hybrid Techniques
The use of analog computation is attractive for neural VLSI for reasons of compactness,
potential speed, and absence of quantization effects. The use of digital techniques, on the
other hand, is preferred for long-distance communications, because digital signals are
known to be robust, easily transmitted and regenerated. These considerations encourage
the use of a hybrid approach for the VLSI implementation of neural networks, which
builds on the merits of both analog and digital technologies (Murray et al., 1991). A
signaling technique that lends itself to this hybrid approach is pulse modulation, the theory
and practice of which are well known in the field of communication systems (Haykin,
1983; Black, 1953). In pulse modulation, viewed in the context of neural networks, some
characteristic of a pulse stream used as carrier is varied in accordance with a neural state.
Given that the pulse amplitude, pulse duration, and pulse repetition rate are the parameters
available for variation, we may distinguish three basic pulse modulation techniques as
described here (Murray et al., 1991):
■ Pulse-amplitude modulation, in which the amplitude of a pulse is modulated in time, reflecting the variation in the neural state s_j, 0 < s_j < 1. This technique is not particularly satisfactory in neural networks, because the information is transmitted as analog voltage levels, which makes it susceptible to processing variations.
■ Pulse-width modulation, in which the width (duration) of a pulse is varied in accordance with the neural state s_j. The advantages of a hybrid scheme now become apparent, as no analog voltage is present in the modulated signal, with information being coded along the time axis. A pulse-width-modulated signal is therefore robust. Moreover, demodulation of the signal is readily accomplished via integration. The use of a constant signaling frequency, however, means that either the leading or trailing edges of the modulated signals representing neural states will occur simultaneously. The existence of this synchronism represents a drawback in massively parallel neural VLSI networks, since all the neurons (and synapses) tend to draw current on the supply lines simultaneously, with no averaging effect. It follows, therefore, that the supply lines must be oversized in order to accommodate the high instantaneous currents produced by the use of pulse-width modulation.
■ Pulse-frequency modulation, in which the instantaneous frequency of the pulse stream is varied in accordance with the neural state s_j, with the frequency ranging from some minimum to some maximum value. In this case, both the amplitude and duration of each pulse are maintained constant. Here also the use of a hybrid scheme is advantageous for the same reasons mentioned for pulse-width modulation. Since the signaling frequency is now variable, both the leading and trailing edges of the modulated signals representing the neural states become skewed. Consequently, the
massive transient demand on supply lines is avoided, and the power requirement is
averaged in time as a result of using pulse-frequency modulation.
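The pulse-frequency scheme above can be sketched behaviorally; the frequency bounds below are arbitrary illustrative values, not taken from any of the cited chips:

```python
def pfm_frequency(s, f_min=100.0, f_max=1000.0):
    """Map a neural state 0 <= s <= 1 to an instantaneous pulse frequency
    in Hz. Pulse amplitude and duration stay constant; only the rate varies."""
    assert 0.0 <= s <= 1.0
    return f_min + s * (f_max - f_min)

def pfm_decode(freq, f_min=100.0, f_max=1000.0):
    """Recover the neural state from the pulse frequency (demodulation)."""
    return (freq - f_min) / (f_max - f_min)

s = 0.25
assert abs(pfm_decode(pfm_frequency(s)) - s) < 1e-12
```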
From this discussion, it appears that pulse-frequency modulation² provides a practical
technique for signaling in massively parallel neural VLSI networks. It is also of interest
to note that it has been known for about a century that neurons in the brain signal one
another using pulse-frequency modulation (Hecht-Nielsen, 1990). Thus, recognizing the
benefits of pulse-frequency modulation, and being inspired by neurobiological models,
Churcher et al. (1993) and Murray et al. (1991) describe integrated pulse stream neural
networks, based on pulse-frequency modulation. In particular, the networks use digital
signals to convey information and control analog circuitry, while storing analog informa-
tion along the time axis. Thus the VLSI neural networks described therein are hybrid
devices, moving between the analog and digital domains as appropriate, to optimize the
robustness, compactness, and speed of the associated network chips.
There is another important hybrid technique used in the VLSI implementation of
neural networks, namely, multiplying digital-to-analog converters (MDAC) employed as
multipliers. In this technique, an analog state (i.e., input signal) can be multiplied with
a digital weight as in the Bellcore chip (Alspector et al., 1991b), or a digital state can be
multiplied with an analog weight as in the AT&T ANNA chip (Sackinger et al., 1992);
we have more to say on these hybrid chips in Section 15.4. Thus MDACs permit the
neural network designer to combine the use of analog and digital technologies in an
optimal fashion to solve a particular computation problem.
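Behaviorally, an MDAC synapse scales an analog quantity by a signed digital weight. In the sketch below, the 4-bit-plus-sign width mirrors the weight format of the Bellcore chip described later, but the full-scale normalization is an assumption made here for illustration:

```python
def mdac_multiply(analog_state, digital_weight, bits=4):
    """Behavioral model of a multiplying D/A converter used as a synapse:
    a signed digital weight ('bits' magnitude bits plus a sign) scales an
    analog input, producing an analog output. The division by the
    full-scale value is an assumed normalization."""
    full_scale = 2 ** bits - 1          # 15 for 4 bits plus sign
    assert -full_scale <= digital_weight <= full_scale
    return analog_state * digital_weight / full_scale

out_current = mdac_multiply(0.8, -15)   # full-scale negative weight -> -0.8
```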
² Another pulse modulation technique, known as pulse duty cycle modulation, may be used as the basis of VLSI implementation of synaptic weighting and summing (Moon et al., 1992). In this scheme, variations in the duty cycle of a pulse stream are used to convey information.
15.4 Neurocomputing Hardware

CNAPS
For our first VLSI-based system, we have chosen a general-purpose digital machine
called CNAPS (Connected Network of Adaptive Processors), manufactured by Adaptive
Solutions, Inc., and which is capable of high neural network performance (Hammerstrom,
1992; Hammerstrom et al., 1990).
The CNAPS system is an SIMD (Single Instruction stream, Multiple Data stream) machine, consisting of an array of processor nodes, as illustrated in Fig. 15.4. Each processor node (PN) is a simple computing element much like a digital signal processor. The array of PNs is laid out in one dimension and operates synchronously (i.e., all the PNs execute the same instruction each clock cycle). The instructions are provided by an external program sequencer, which has a program memory and instruction fetch and decode capability. The program sequencer also manages all input/output to and from the PN array.
Data representation is digital fixed-point. Each PN has a 9-bit by 16-bit multiplier, a
32-bit adder, a logic unit, a 32-word register file, a 12-bit weight address unit, and 4K
bytes of storage for weights and coefficients. The internal buses and registers are 16 bits.
Each PN can compute one multiply accumulate per clock cycle. The use of fixed-point
arithmetic is justified on the grounds of cost; and for practically all current learning
algorithms and neural network applications, the use of arithmetic precision higher than
that described here is considered unnecessary.
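A PN's multiply-accumulate step can be sketched in integer arithmetic. The 9-bit, 16-bit, and 32-bit widths follow the description above, while the saturating behavior on overflow is an assumption made here for illustration:

```python
def fixed_point_mac(acc, weight, x, acc_bits=32):
    """One multiply-accumulate step in integer (fixed-point) arithmetic,
    in the spirit of a CNAPS PN: a 9-bit by 16-bit multiply feeding a
    32-bit accumulator. Saturation on overflow is an assumption here."""
    assert -(1 << 8) <= weight < (1 << 8)     # 9-bit signed weight
    assert -(1 << 15) <= x < (1 << 15)        # 16-bit signed input
    acc += weight * x
    lo, hi = -(1 << (acc_bits - 1)), (1 << (acc_bits - 1)) - 1
    return max(lo, min(hi, acc))              # clamp to the accumulator width

acc = 0
for w, x in [(100, 2000), (-50, 1000)]:
    acc = fixed_point_mac(acc, w, x)
# acc is now 100*2000 - 50*1000 = 150000
```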
CNAPS uses on-chip memories, which makes it possible to perform on-chip learning.
The total synaptic connections per chip are as follows:

■ 2M 1-bit weights

■ 256K 8-bit weights
At 25 MHz, a single CNAPS chip can perform 1.6 billion multiply accumulates per
second. An 8-chip system can perform 12.8 billion multiply accumulates per second.
Thus, in back-propagation learning, the 8-chip system can learn at 2 billion weight updates
per second, assuming that all the PNs are busy. To get an idea of what these numbers
imply, the NETtalk network (developed originally by Sejnowski and Rosenberg, 1987),
which normally takes about 4 hours of training on a SUN SPARC workstation, would fit
onto a single CNAPS chip and require only about 7 seconds to train (Hammerstrom and
Rahfuss, 1992).
FIGURE 15.4 Single instruction stream, multiple data stream.
for use on a particular class of learning algorithms, it enjoys a wide range of applications, and in that sense it may be viewed as being of general-purpose use.
From Chapter 8 we recall that both the Boltzmann and mean-field-theory learning
algorithms are as capable as the back-propagation algorithm of learning difficult problems.
In computer simulation, back-propagation learning has the advantage in that it is often
orders of magnitude faster than Boltzmann learning; mean-field-theory learning lies some-
where between the two, though closer to back-propagation learning. However, the local
nature of both Boltzmann learning and mean-field-theory learning makes them easier to
cast into electronics than back-propagation learning. Indeed, by implementing them in
VLSI form, it becomes possible to speed up the learning process in the Boltzmann machine
and mean-field-theory machine by orders of magnitude, which makes them both attractive
for practical applications.
A key issue in the hardware implementation of Boltzmann learning and mean-field-
theory learning is how to account for the effect of temperature T, which plays the role
of a control parameter during the annealing schedule. A practical way in which this effect
may be realized is to add a physical noise term to the activation potential of each neuron
in the network. Specifically, neuron j is designed to perform the activation computation (see Fig. 15.5)

$$s_j = \varphi(v_j + n_j) \qquad (15.15)$$

where v_j and s_j are the activation potential and output signal of neuron j, respectively, and n_j is an external noise term applied to neuron j. The function φ(·) is a monotonic nonlinear function such as the hyperbolic tangent tanh(·) with a variable gain (midpoint slope) denoted by g. The details of the noise term n_j and the function φ(·) depend on whether Boltzmann learning or mean-field-theory learning is being simulated.
In simulations of the Boltzmann machine, the gain g is made high so as to permit the function φ(·) to approach a step function. The noise term n_j is chosen from a zero-mean Gaussian distribution, whose width is proportional to the temperature T. In order to account for the role of temperature T, the noise n_j is thus slowly reduced in accordance with the prescribed annealing schedule.

In simulations of mean-field-theory learning, on the other hand, the noise term is set equal to zero. But for this application, the gain g of the function φ(·) has a finite value chosen to be proportional to the reciprocal of temperature T taken from the annealing schedule. The nonlinearity of the function φ(·) is thus “sharpened” as the annealing schedule of decreasing temperature proceeds.
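The two simulation modes can be captured in one behavioral sketch; the fixed gain constant for the Boltzmann mode and the unit noise-to-temperature proportionality are assumptions made here:

```python
import math
import random

def neuron_output(v, T, mode, rng=random.Random(0)):
    """Activation computation with annealing, per the discussion above.

    Boltzmann mode: high gain (near-step nonlinearity) plus zero-mean
    Gaussian noise whose width is proportional to the temperature T.
    Mean-field mode: no noise; the gain is proportional to 1/T, so the
    tanh nonlinearity sharpens as T is lowered."""
    if mode == "boltzmann":
        g = 50.0                  # high gain approximates a step function (assumed)
        n = rng.gauss(0.0, T)     # noise width proportional to T (unit constant assumed)
    else:                         # "mean-field"
        g = 1.0 / T               # gain proportional to reciprocal temperature
        n = 0.0
    return math.tanh(g * (v + n))
```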
Alspector et al. (1991b, 1992a) describe a microchip implementation of the Boltzmann machine. The chip contains 32 neurons with 992 connections (i.e., 496 bidirectional synapses). The chip includes a noise generator that supplies 32 uncorrelated pseudorandom noise sources simultaneously to all the neurons in the system. The traditional method for
FIGURE 15.5 Circuit for simulating the activation of a neuron used in the Boltzmann
machine or mean-field-theory machine.
generating a pseudorandom bit stream is to use a linear feedback shift register (LFSR).³
However, the use of a separate LFSR for each neuron (in order to obtain uncorrelated noise
from one neuron to another) requires an unacceptable overhead for VLSI implementation.
Alspector et al. (1991a) describe a method of generating multiple, arbitrarily shifted,
pseudorandom bit streams from a single LFSR, with each bit stream being obtained by
tapping the outputs of selected cells (flip-flops) in the LFSR and feeding these tapped
outputs through a set of exclusive-OR gates. This method enables many neurons to share
a single LFSR, resulting in an acceptably small overhead for VLSI implementation.
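The idea can be sketched with a software LFSR; the register length and tap positions below are standard maximal-length choices used for illustration, not the taps of the chip itself:

```python
def lfsr_stream(n_bits, length=16, taps=(16, 14, 13, 11), seed=0xACE1):
    """Generate n_bits from a Fibonacci LFSR of the given length.
    taps = (16, 14, 13, 11) is a known maximal-length choice, giving a
    pseudorandom sequence of period 2^16 - 1."""
    state = seed
    out = []
    for _ in range(n_bits):
        out.append(state & 1)
        fb = 0
        for t in taps:                      # XOR the tapped cells for feedback
            fb ^= (state >> (length - t)) & 1
        state = (state >> 1) | (fb << (length - 1))
    return out

# A further stream can be formed by XORing the outputs of selected cells of the
# same register: by the shift-and-add property of maximal-length sequences, the
# result is a shifted copy of the same sequence, so many neurons can share one LFSR.
```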
The individual noise sources (produced in the manner described here) are summed
along with the weighted postsynaptic signals from other neurons at the input to each
neuron. This is done in order to implement the simulated annealing process of the stochastic
Boltzmann machine. The neuron amplifiers implement a nonlinear activation function
with a variable gain so as to cater to the gain-sharpening requirement of the mean-field-
theory learning technique.
Most of the area covered by the “hybrid” microchip is occupied by the array of synapses. Each synapse digitally stores a weight ranging from −15 to +15 as binary
words consisting of 4 bits plus sign. The analog voltage input from the presynaptic neuron
is multiplied by the weight stored in the synapse, producing an output current. Although
the synapses can have their weights set externally, they are designed to be adaptive. In
particular, they store the “instantaneous” correlations produced after annealing, and
therefore adjust the synaptic weight wji in an “on-line” fashion in accordance with the
learning rule
$$\Delta w_{ji} = K \, \mathrm{sgn}[\langle s_j s_i \rangle^+ - \langle s_j s_i \rangle^-] \qquad (15.16)$$
where K is a fixed step size. The learning rule of Eq. (15.16) is called Manhattan updating
(Peterson and Hartman, 1989). In the learning rule described in Eq. (8.75), the synaptic
weights are changed according to gradient descent and therefore each gradient component
(weight change) will be of different size. On the other hand, in the Manhattan learning
rule of Eq. (15.16), a step is taken in a slightly different direction along a vector whose
components are all of equal size. In this latter form of learning, everything about the
gradient is thrown away, except for the knowledge as to which quadrant the gradient lies
in, with the result that learning proceeds on a lattice. In the microchip described by
Alspector et al. (1991b), the fixed step size K = 1, and so the synaptic weight wji is
changed by one unit at each iteration of the mean-field-theory learning algorithm.
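Manhattan updating can be sketched directly; the convention sgn(0) = 0 is an assumption made here:

```python
def sgn(x):
    """Sign function: +1, -1, or 0 (the zero case is an assumed convention)."""
    return (x > 0) - (x < 0)

def manhattan_update(weights, gradient_estimate, step=1):
    """Manhattan updating: move every weight by a fixed step in the direction
    given by the sign of its gradient component, so that learning proceeds
    on a lattice (Peterson and Hartman, 1989). Everything about the gradient
    except its sign pattern is discarded."""
    return [w + step * sgn(g) for w, g in zip(weights, gradient_estimate)]

# Equal-magnitude steps regardless of the size of each gradient component:
w = manhattan_update([3, -7, 0], [0.02, -1.5, 0.0])   # -> [4, -8, 0]
```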
An on-line procedure is used for weight updates, where only a single correlation is
taken per pattern. Thus there is no basic difference between counting correlations and
counting occurrences as described in Chapter 8. Also, the use of on-line weight updates
avoids the problem of memory storage at synapses.
The chip is designed to be cascaded with similar chips in a board-level system that
can be accessed externally by a computer. The nodes of a particular chip that sum currents from synapses for the net activation potential of a neuron are available externally for connection to other chips and also for external clamping of neurons.

³ A shift register of length m is a device consisting of m consecutive two-state memory stages (flip-flops) regulated by a single timing clock. At each clock pulse, the state (represented by binary symbol 1 or 0) of each memory stage is shifted to the next stage down the line. To prevent the shift register from emptying by the end of m clock pulses, we use a logical (i.e., Boolean) function of the states of the m memory stages to compute a feedback term, and apply it to the first memory stage of the shift register. The most important special form of this feedback shift register is the linear case, in which the feedback function is obtained by using modulo-2 adders to combine the outputs of the various memory stages. A binary sequence generated by a linear feedback shift register is called a linear maximal sequence and is always periodic with a period defined by

$$N = 2^m - 1$$

where m is the length of the shift register. Linear maximal sequences are also referred to as pseudorandom or pseudonoise (PN) sequences. The term “random” comes from the fact that these sequences have many of the physical properties usually associated with a truly random binary sequence (Golomb, 1964).

Alspector et al.
(1992a) have used this system to perform learning experiments on the parity and replication (identity) problems, thereby facilitating comparisons with previous simulations (Alspector
et al., 1991b). The parity problem is a generalization of the XOR problem for arbitrary
input size. The goal of the replication problem is for the output to duplicate the bit pattern
found on the input after being encoded by the hidden layer. For real-time operation, it is
reported that the speed for on-chip learning is roughly 10⁸ synaptic connections per second
per chip.
In another study (Alspector et al., 1992b), a single chip was used to perform experiments
on content-addressable memory using mean-field-theory learning. It is demonstrated that
about 100,000 codewords per second can be stored and retrieved by the chip. Moreover,
close agreement is reported between the experimental results and the computer simulations
performed by Hartman (1991). These results demonstrate that mean-field-theory learning
is able to provide the largest storage per neuron for error-correcting memories reported
in the literature at that time.
ANNA Chip
For the description of a general-purpose hybrid chip designed with multilayer perceptrons
in mind, we have chosen a reconfigurable chip called the ANNA (Analog Neural Network
Arithmetic and logic unit) chip, which is a hybrid analog-digital neural network chip
developed by AT&T Bell Labs (Boser et al., 1992; Sackinger et al., 1992). The hybrid
architecture is designed to match the arithmetic precision of the hardware to the computa-
tional requirements of neural networks. In particular, experimental work has shown that
the precision requirements of neurons within a multilayer perceptron vary, in that higher
accuracy is often needed in the output layer, for example, for selective rejection of
ambiguous or other unclassifiable patterns (Boser et al., 1992). A hybrid architecture may
be used to deal with a situation of this kind by implementing the bulk of the neural
computations with low-precision analog devices, but critical connections are implemented
on a digital processor with higher accuracy.
Figure 15.6 shows a simplified architecture of the ANNA chip. The architectural layout
shown in this figure leaves out many design details of the chip, but it is adequate for a
description of how the multilayer perceptron designed to perform pattern classification is
implemented on the chip. The ANNA chip evaluates eight inner products of state vector
x and eight synaptic weight vectors w_j in parallel. The state vector is loaded into a barrel
shifter, and the eight weight vectors are selected from a large (4096) on-chip weight
memory by means of a multiplexer; the resulting scalar values of the inner products
w_j^T x,   j = 1, 2, . . . , 8     (15.17)
are then passed through a neuron function (sigmoidal nonlinearity) denoted by φ(·),
yielding a corresponding set of scalar neural outputs
z_j = φ(w_j^T x),   j = 1, 2, . . . , 8     (15.18)
The whole neuron-function evaluation process takes 200 ns, or four clock cycles. The
chip can be reconfigured for synaptic weight and input state vectors of varying dimension,
namely, 64, 128, and 256. These figures also correspond to the number of synapses per
neuron.
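The computation of Eqs. (15.17) and (15.18) amounts to eight dot products evaluated in parallel, followed by a pointwise nonlinearity. A minimal numerical sketch follows, with tanh standing in for the chip's sigmoidal nonlinearity and all names and values being illustrative:

```python
import numpy as np

def anna_layer(x, W, phi=np.tanh):
    """Evaluate eight neurons in parallel, as in Eqs. (15.17)-(15.18):
    z_j = phi(w_j^T x). W holds the eight weight vectors as rows."""
    return phi(W @ x)

rng = np.random.default_rng(0)
n = 64                               # one of the supported dimensions: 64, 128, 256
x = rng.standard_normal(n)           # input state vector
W = rng.standard_normal((8, n))      # eight synaptic weight vectors
z = anna_layer(x, W)                 # eight scalar neural outputs
assert z.shape == (8,)
```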
The input state vector x is supplied by a shift register that can be shifted by one, two,
three, or four positions in two clock cycles (100 ns). Correspondingly, one, two, three,
FIGURE 15.6 Simplified architecture of the ANNA chip. (From E. Sackinger et al.,
1992a, with permission of IEEE.)
or four new data values are read into the input end of the shift register. Thus, this barrel
shifter serves two useful purposes:
■ It permits the use of sequential loading.
■ It is the ideal preprocessor for convolutional networks characterized by local receptive
fields and weight sharing.
The barrel shifter on the chip has length 64. It is operated in parallel with the neuron-
function unit, such that a new state vector is available as soon as a new calculation cycle
starts.
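The barrel shifter's role for convolutional networks can be mimicked in software: each cycle shifts a few new samples into the register, and the full register contents then serve as the next receptive-field input. A sketch, with the window and stride values chosen purely for illustration:

```python
from collections import deque

def sliding_windows(data, window, stride):
    """Emulate the barrel shifter: shift `stride` new values in per
    cycle and expose the current register contents as the input
    state vector for the next receptive field."""
    buf = deque(maxlen=window)           # the shift register
    for i in range(0, len(data), stride):
        for v in data[i:i + stride]:
            buf.append(v)                # new data enters the input end
        if len(buf) == window:
            yield list(buf)

windows = list(sliding_windows(list(range(10)), window=4, stride=2))
assert windows[0] == [0, 1, 2, 3]
assert windows[1] == [2, 3, 4, 5]        # overlapping receptive fields
```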
There are a total of 4096 analog weight values stored on the chip. These values can
be grouped in a flexible way into weight vectors of varying dimension: 64, 128, and 256.
Thus, on the same chip it is possible to have, for example, simultaneously thirty-two
weight vectors of dimension 64, eight weight vectors of dimension 128, and four weight
vectors of dimension 256.
Assuming that all neurons on the chip are configured for the maximum size of 256
synapses, the chip can evaluate a maximum of 10^10 connections per second (C/s) as shown
by the following calculation:
8 neurons × 256 synapses / 200 ns = 10^10 C/s = 10 GC/s
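The calculation above is straightforward to check:

```python
neurons = 8           # neurons evaluated in parallel
synapses = 256        # maximum synapses per neuron
eval_time = 200e-9    # neuron-function evaluation time, in seconds

rate = neurons * synapses / eval_time   # connections per second
assert round(rate / 1e9) == 10          # ≈ 10^10 C/s, i.e., 10 GC/s
```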
In practice, however, the speed of operation of the chip may be lower than this number
for two reasons:
[Figure 15.7 depicts the layered structure of the OCR network: 20 × 20 (= 400) inputs feed a sequence of layers, each annotated with its number of neurons and synapses (e.g., 3,136 neurons and 78,400 synapses in the first layer); each neuron has a local receptive field.]
FIGURE 15.7 General structure of the OCR network. (From E. Sackinger et al., 1992a,
with permission of IEEE.)
Table 15.1 presents a summary of the execution time of the OCR network implemented
using the ANNA chip, compared to a SUN SPARC 1 + workstation. This table shows
that a classification rate of 1000 characters per second can be achieved using a pipelined
system consisting of the ANNA chip and a DSP. This rate corresponds to a speedup
factor of 500 over the SUN implementation.
Faggin (1991) presents a performance assessment of neurocomputation using special-purpose VLSI chips.
The following figures are presented, based on the status of VLSI technology in 1991:
Silicon Retina

In all vertebrate retinas the transformation from optical to neural image involves three
stages (Sterling, 1990):
■ Photo transduction by a layer of receptor neurons.
■ Transmission of the resulting signals (produced in response to light) by chemical
synapses to a layer of bipolar cells.
■ Transmission of these signals, also by chemical synapses, to output neurons that are
called ganglion cells.
At both synaptic stages (i.e., from receptor to bipolar cells, and from bipolar to ganglion
cells), there are specialized laterally connected neurons, called horizontal cells and
amacrine cells, respectively. The task of these neurons is to modify the transmission across
the synaptic layers. There are also centrifugal elements, called inter-plexiform cells; their
task is to convey signals from the inner synaptic layer back to the outer one.
Figure 15.8 shows a simplified circuit diagram of the silicon retina built by Mead and
Mahowald (1988), which is modeled on the distal portion of the vertebrate retina. This
diagram emphasizes the lateral spread of the resistive network, corresponding to the
horizontal cell layer of the vertebrate retina. The primary signal pathway proceeds through
the photoreceptor and the circuitry representing the bipolar cell, the latter being shown
in the inset. The image signal is processed in parallel at each node of the network.
The key element in the outer plexiform layer is the triad synapse, which is located at
the base of the photoreceptor. The triad synapse provides the point of contact among the
photoreceptor, the horizontal cells, and the bipolar cells. The computation performed at
the triad synapse proceeds as follows (Mahowald and Mead, 1989):
■ The photoreceptor computes the logarithm of the intensity of incident light.
■ The horizontal cells form a resistive network that spatio-temporally averages the
output produced by the photoreceptor.
■ The bipolar cell produces an output proportional to the difference between the signals
generated by the photoreceptor and the horizontal cell.
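The three steps above can be sketched numerically in one dimension. In this sketch the resistive network is approximated by iterative neighbor averaging, and the `coupling` and `iterations` parameters are illustrative choices, not values from the silicon implementation:

```python
import numpy as np

def retina_response(intensity, iterations=200, coupling=0.2):
    """Sketch of the triad-synapse computation: log-compress the light
    intensity (photoreceptor), smooth it over a resistive grid
    (horizontal cells), and report the difference (bipolar cell)."""
    p = np.log(intensity)                 # photoreceptor: log of intensity
    h = p.copy()                          # horizontal-cell network voltage
    for _ in range(iterations):           # relax the resistive grid
        neighbors = (np.roll(h, 1) + np.roll(h, -1)) / 2.0
        h = (1 - coupling) * h + coupling * neighbors
    return p - h                          # bipolar cell: center minus surround

# A step edge in intensity produces an edge-enhancing response:
intensity = np.concatenate([np.full(32, 1.0), np.full(32, 10.0)])
out = retina_response(intensity)
assert out[31] < 0 < out[32]              # response peaks at the edge
```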
The net result of these computations is that the silicon retina generates, in real time,
outputs that correspond directly to signals observed in the corresponding layers of biologi-
cal retinas. It demonstrates a tolerance for device imperfections that is characteristic of
a collective analog system.
A commercial product resulting from the research done by Mead and co-workers on
the silicon retina is the Synaptics OCR chip, manufactured by Synaptics Corporation for
use in a device that reads the MICR code at the bottom of cheques. The chip is of an
analog design, based on subthreshold CMOS technology, and customized for this specific
application.
Mention should also be made of independent work done by Boahen and Andreou
(1992) on a contrast-sensitive silicon retina, which models all major synaptic interactions
in the outer plexiform of the vertebrate retina, using current-mode subthreshold CMOS
technology. This silicon retina permits resolution to be traded off for enhanced signal-to-
noise ratio, thereby revealing low-contrast stimuli in the presence of large transistor
mismatch. It thus provides the basis of an edge-detection algorithm with a naturally built-
in regularization capability.
The work of Mead and Andreou and their respective fellow researchers on silicon
retinas validates an important principle enunciated by Winograd and Cowan (1963) that
it is indeed possible to design reliable networks using unreliable circuit elements.
FIGURE 15.8 The silicon retina. Diagram of the resistive network and a single pixel
element, shown in the circular window. The silicon model of the triad synapse consists of
the conductance (G) by which the photoreceptor drives the resistive network, and the
amplifier that takes the difference between the photoreceptor ( P ) output and the voltage
on the resistive network. In addition to a triad synapse, each pixel contains six resistors
and a capacitor C that represents the parasitic capacitance of the resistive network.
These pixels are tiled in a hexagonal array. The resistive network results from a
hexagonal tiling of pixels. (Reprinted from Neural Networks, 1, C.A. Mead and M.
Mahowald, “A silicon model of early visual processing,” pp. 91-97, copyright 1988 with
kind permission from Pergamon Press Ltd., Headington Hill Hall, Oxford OX3 OBW, UK.)
The use of such an approach is described by Burges et al. (1992), where dynamic programming is combined
with a neural network for segmenting and recognizing character strings.
FIGURE 15.10 Block diagram of controller combining the use of neural networks and
fuzzy logic.
description. To exploit the full information content of such signals at all times, we need
an intelligent signal processor, the design of which addresses the following issues:
■ Nonlinearity, which makes it possible to extract the higher-order statistics of the
input signals.
■ Number of degrees of freedom, which means that the system has the right number
of adjustable parameters to cope with the complexity of the underlying physical
process, avoiding the problems that arise due to underfitting or overfitting the input
data.
■ Adaptivity, which enables the system to respond to nonstationary behavior of the
unknown environment in which it is embedded. Certain applications require that
synaptic weights of the neural network be adjusted continually, while the network
is being used; that is, “training” of the network never stops during the processing
of incoming signals.
■ Prior information, the exploitation of which specializes (biases) the system design
and thereby enhances its performance.
■ Information preservation, which requires that no useful information be discarded
before the final decision-making process; such a requirement usually means that soft
decision making is preferable to hard decision making.
■ Multisensor fusion, which makes it possible to “fuse” data gathered about an
operational environment by a multitude of sensors, thereby realizing an overall level
of performance that is far beyond the capability of any of the sensors working alone.
■ Attentional mechanism, whereby, through interaction with a user or in a self-organized
manner, the system is enabled to focus its computing power around a particular
point in an image or a particular location in space for more detailed analysis.
The realization of an intelligent signal processor that can provide for these needs would
certainly require the hybridization of neural networks with other appropriate tools such
as time-frequency analysis, chaotic dynamics, and fuzzy logic.
Needless to say, current pattern classification, control, and signal processing systems
have a long way to go before they can qualify as intelligent machines.
The bulk of the material presented in this chapter has been devoted to VLSI implementa-
tions of neural networks. As with current applications of neural networks, we will certainly
have to look to VLSI chips/systems, perhaps more sophisticated than those in use today,
to build working models of intelligent machines for pattern classification, control, and
signal processing applications.
PROBLEMS
15.1 Consider Eq. (15.5) describing the behavior of an n-channel MOS transistor. Assum-
ing that the transistor is driven into saturation (i.e., the drain voltage is high enough),
we may simplify this equation as follows:
Using this relation, show that the difference between the two drain currents of the
transconductance amplifier of Fig. 15.2 is related to the differential input voltage
V_1 - V_2 as follows:
where I_b is the constant current supplied by the bottom transistor in Fig. 15.2.
15.2 The MOS transistors shown in Fig. 15.3 model inhibitory and excitatory synapses;
their input-output relations are defined by Eqs. (15.13) and (15.14). Determine the
transconductances realized by these transistors, and thereby confirm their respective
roles.
15.3 The ETANN chip (Holler et al., 1989) and the EPSILON chip (Murray et al., 1991)
use analog and hybrid approaches for the VLSI implementation of neural networks,
respectively. Study the papers cited here, and make up a list comparing their individ-
ual designs and capabilities.
15.4 Moon et al. (1992) describe a pulse modulation technique known as the pulse duty
cycle modulation for the VLSI implementation of a neural network. Referring to
this paper, identify the features that distinguish this pulse modulation technique from
pulse frequency modulation, emphasizing its advantages and disadvantages.
15.5 A systolic array (Kung and Leiserson, 1979) provides an architecture for the imple-
mentation of a parallel processor. A systolic emulation of learning algorithms is
described by Ramacher (1990) and Ramacher et al. (1991). Study this architecture
and discuss its suitability for VLSI implementation.
15.6 The contrast-sensitive silicon retina described by Boahen and Andreou (1992) appears
to exhibit a regularization capability. In light of the regularization theory presented
in Chapter 7, discuss this effect by referring to the paper cited here.