Digital VLSI Architectures for Neural Networks

S. Y. Kung and J. N. Hwang
Abstract
This paper proposes a generic iterative model for a wide variety of artificial neural networks (ANNs): single-layer feedback networks, multilayer feedforward networks, hierarchical competitive networks, as well as some probabilistic models. A unified formulation is provided for both the retrieving and learning phases of most ANNs. Based on the formulation, a programmable ring systolic array is developed. The architecture maximizes the strength of VLSI in terms of intensive and pipelined computing and yet circumvents the limitation on communication. It may be adopted as a basic structure for a universal neurocomputer architecture.

1 A Unified Generic Iterative ANN Model

A basic ANN model consists of a large number of neurons (with activation values {a_i}), linked to each other with synaptic weights {w_ij}. From an algorithmic viewpoint, two separate phases are involved in ANN processing: the retrieving phase and the learning phase.

    u_i(l+1) = Σ_{j=1}^{N} w_ij(l+1) a_j(l) + θ_i(l+1)        (1)

2. Layer Iteration: the network is a spatially iterative network. The heterogeneous neuron layers in between the input and output layers are called hidden layers.

3. Pattern Iteration: in certain models, each iteration (level) corresponds to one pattern input.

Nonlinear Activation Functions f_i  The nonlinear activation function f_i in Eq. 2 can be a deterministic function, a winner-take-all mechanism, or a stochastic decision. There are three popular deterministic nonlinear activation functions for Eq. 2: threshold, squash, and sigmoid [5]. In some pattern classification applications, winner-take-all (WTA) nonlinear mechanisms are adopted, often implemented by lateral inhibition so that only the neuron receiving the largest net input is activated.
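As a concrete illustration of Eq. 1, together with the activation step of Eq. 2, one retrieving iteration can be sketched in NumPy as follows; the function names and the choice of a sigmoid for f are our own illustrative assumptions, not the paper's:

```python
import numpy as np

def sigmoid(u):
    # One popular deterministic activation (see the discussion of f_i).
    return 1.0 / (1.0 + np.exp(-u))

def retrieve_step(W, a, theta, f=sigmoid):
    u = W @ a + theta    # Eq. 1: net inputs u_i(l+1)
    return f(u)          # Eq. 2: elementwise nonlinear activation

rng = np.random.default_rng(0)
N = 4
W, a, theta = rng.normal(size=(N, N)), rng.uniform(size=N), rng.normal(size=N)
a_next = retrieve_step(W, a, theta)   # activations a(l+1) for the next layer
```

With a sigmoid f, every component of a_next lies strictly between 0 and 1, so the output can feed directly into the next iteration.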
ISCAS '89  CH2692-2/89/0000-0445$1.00 © 1989 IEEE
The important factors in the learning phase are the recursion index m, the measure function E for the training criterion, the updating formulation Θ, the back-propagation rule of the corrective signals, and the homogeneity consideration in certain models.

Recursion Index m in the Learning Phase  The recursion index m used in the unified recursive weight updating formulation may represent either a pattern index or a sweep index.

1. Pattern Recursion: the network updates the synaptic weights after the presentation of each training pattern.

2. Sweep Recursion: the network updates the synaptic weights only after all the P training patterns are presented.

Measure Function E as Training Criterion  A criterion function E in Eq. 3 has to be selected first, so that weight training may be formulated as an iterative optimization (maximization or minimization) problem. The measure function E can be a global function of the network, or a local function of l, i.e., E(l).

Examples of Updating Formulation Θ  The updating formulation Θ in Eq. 3 can be additive or multiplicative, among others. The additive formulations lead to the gradient descent (minimization) or gradient ascent (maximization) approach:

    w_ij(l) ← w_ij(l) ∓ η ∂E/∂w_ij(l)        (4)

where the sign is determined by the maximization or minimization formulation (e.g., back-propagation learning [9]). On the other hand, if {w_ij(l)} has to satisfy the constraints w_ij(l) ≥ 0 and Σ_j w_ij(l) = 1, then the constrained optimization problem leads to a multiplicative formulation [8]:

    w_ij(l) ← w_ij(l) · η ∂E/∂w_ij(l)        (5)

Note that only a proper choice of the updating step η can ensure that the new weights {w_ij(l)} satisfy the same constraints (e.g., Baum-Welch reestimation learning [8]).

Back-Propagation of Corrective Signals  To alleviate the burden of computing the gradient, back-propagation of corrective signals based on chain-rule derivation may be adopted, where the backward propagated corrective signal δ_i(l) is defined to be ∂E/∂u_i(l) and can be calculated recursively.

1.3 Neural Network Examples of the Generic Iterative ANN

Table 1 classifies the neural network examples in terms of the factors considered in the retrieving and learning phases.

2 A Unified Ring Systolic Design for the Generic Iterative ANN Model

An important consideration in the architectural design for ANNs is to ensure that the processing in both the retrieving phase and the learning phase can share the same storage, processor hardware, and array configuration. This significantly speeds up real-time learning by avoiding the difficulty of transferring synaptic weights between the retrieving and learning phases. It will be shown that the operations in both the retrieving and learning phases of the generic iterative ANN models can be formulated as consecutive matrix-vector multiplication, outer-product updating, or consecutive vector-matrix multiplication problems. In terms of the array structure, all these formulations lead to the same universal ring systolic array architecture [6]. In terms of the functional operations, all these formulations call for a MAC (multiply and accumulate) operation. Proper digital arithmetic techniques (such as Cordic) to support the nonlinear processing are also required.

2.1 Ring Systolic Design for the Retrieving Phase

Basic Operations in the Retrieving Phase  The system dynamics in the retrieving phase of the generic iterative ANN model can be formulated as a consecutive matrix-vector multiplication (MVM) problem interleaved with the nonlinear activation function (see Eqs. 1 and 2):

    u(l+1) = W(l+1) a(l) + θ(l+1)        (8)

    a(l+1) = f[u(l+1), a(l)]             (9)

where the f[x] operator performs the nonlinear activation function f on each element of the vector x. Without loss of generality, and for the convenience of a homogeneous architectural design, it is assumed that all the iterations in the iterative ANN model have a uniform size of N neural units (it is always possible to artificially create a certain number of no-operation neural units and fixed zero weights to balance any inequality between two iterations).

A special feature of consecutive MVM problems is that the output vector of one iteration is used as the input vector for the next iteration. Therefore, the data have to be carefully arranged so that the outputs of one iteration can be immediately used as the inputs of the next iteration [6]. This leads to the ring systolic architecture shown in Figure 2. The functional operation at each PE is a MAC operation (see Figure 2). The pipelining period of this design is 1, which implies 100% utilization efficiency.

Figure 2: The ring systolic ANN at the (l+1)-th iteration in the retrieving phase.
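The ring schedule for the retrieving phase can be illustrated with a small NumPy simulation (variable names and structure are our own; this models the data movement, not the VLSI hardware): each of the N PEs performs exactly one MAC per clock while the activations circulate, so after N clocks PE i holds u_i.

```python
import numpy as np

def ring_mvm(W, a, theta):
    """Simulate one retrieving-phase MVM on an N-PE ring.

    PE i holds row i of W and starts its accumulator at the bias theta_i;
    idx[i] tracks which activation a_j is currently visiting PE i.  Every
    PE does one MAC per clock (pipelining period 1), and after N clocks
    PE i holds u_i = sum_j w_ij a_j + theta_i.
    """
    N = len(a)
    u = theta.astype(float).copy()          # accumulators, preloaded with bias
    idx = np.arange(N)                      # activation index seen by each PE
    for _ in range(N):                      # N clocks
        u += W[np.arange(N), idx] * a[idx]  # one MAC in every PE
        idx = (idx + 1) % N                 # activations shift leftward
    return u

rng = np.random.default_rng(0)
N = 4
W, a, theta = rng.normal(size=(N, N)), rng.normal(size=N), rng.normal(size=N)
u = ring_mvm(W, a, theta)                   # matches Eq. 8: W a + theta
```

Because idx is a distinct permutation at every clock, all N MACs of a clock are independent, which is what gives the design its 100% utilization.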
2.2 Ring Systolic Design for the Learning Phase

    w_ij(l+1) ← w_ij(l+1) ∘ [g_i(l+1) · h_j(l+1)]        (11)

where ∘ denotes the additive or multiplicative weight updating.
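In the additive case, the outer-product updating (OPU) of Eq. 11 can be sketched as a simple ring simulation in NumPy (the function name and simulation style are ours; it models the schedule, not the hardware): PE i keeps g_i fixed while the h_j values circulate, so every weight w_ij receives its increment g_i·h_j within N clocks.

```python
import numpy as np

def ring_opu(W, g, h):
    """One additive outer-product update sweep on an N-PE ring.

    PE i stores g_i and column slices of W; idx[i] is the column index
    whose h value is currently visiting PE i.  After N clocks every
    weight has been updated: w_ij += g_i * h_j.
    """
    N = len(g)
    W = W.astype(float).copy()
    idx = np.arange(N)                      # column seen by each PE
    for _ in range(N):                      # N clocks
        W[np.arange(N), idx] += g * h[idx]  # Δw_ij = g_i · h_j in every PE
        idx = (idx + 1) % N                 # h values circulate leftward
    return W

rng = np.random.default_rng(1)
N = 4
W, g, h = rng.normal(size=(N, N)), rng.normal(size=N), rng.normal(size=N)
W_new = ring_opu(W, g, h)                   # equals W + g h^T
```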
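The consecutive vector-matrix multiplication used to back-propagate the corrective signals, δ_i(l) = Σ_j δ_j(l+1) r_ji(l+1), can likewise be sketched as a ring simulation (a hedged illustration with our own naming, not the paper's implementation): here the accumulators acc_i circulate instead of the data.

```python
import numpy as np

def ring_vmm(delta_next, R):
    """Simulate the consecutive-VMM sweep on an N-PE ring.

    PE j holds delta_j(l+1) and row j of R; the accumulators acc_i start
    at zero and circulate leftward, each collecting one product per
    clock, so after N clocks acc_i = sum_j delta_j(l+1) * r_ji(l+1).
    """
    N = len(delta_next)
    acc = np.zeros(N)
    idx = np.arange(N)                      # which acc_i sits in each PE
    for _ in range(N):                      # N clocks
        # idx is a permutation, so the fancy-indexed += touches each
        # accumulator exactly once per clock (one MAC per PE).
        acc[idx] += delta_next * R[np.arange(N), idx]
        idx = (idx + 1) % N                 # accumulators shift leftward
    return acc

rng = np.random.default_rng(2)
N = 4
delta_next, R = rng.normal(size=N), rng.normal(size=(N, N))
delta = ring_vmm(delta_next, R)             # equals delta_next^T R
```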
Systolic Processing for OPU  The ring systolic array derived for the retrieving phase can be easily adapted to execute in parallel both the OPU and the consecutive VMM in the learning phase [6]. The OPU adopts the same ring systolic array configuration as the consecutive MVM problems, with similar MAC functional operations (see Figure 3). The operations of the OPU in the ring systolic ANN can be briefly described as follows:

1. The value of g_i(l+1) is computed and stored in the i-th PE. The value h_j(l+1) produced at the j-th PE is cyclically piped (leftward) to all other PEs in the ring systolic ANN during the N clocks.

2. When h_j(l+1) arrives at the i-th PE, it is multiplied with the stored value g_i(l+1) to yield Δw_ij(l+1), which is added to (or multiplied with) the old weight w_ij(l+1) of the previous recursion to yield the updated w_ij(l+1). Note that the old weight data {w_ij(l+1)} are retrieved in a circularly-shift-up sequence, just as in the retrieving phase.

3. After N clocks, all the N × N new weights {w_ij(l+1)} at the (l+1)-th iteration are generated. The ring systolic ANN is now ready for the weight updating of the next iteration (l).

Systolic Processing for Consecutive VMM  The consecutive VMM also leads to the same ring systolic array configuration with similar MAC functional operations (see Figure 4). The operations of the consecutive VMM in the ring systolic ANN can be briefly described as follows:

1. The back-propagated corrective signal δ_j(l+1) and the value r_ji(l+1) are available in the j-th PE at the (l+1)-th iteration. The value r_ji(l+1) is then multiplied with δ_j(l+1) at the j-th PE.

2. The product is added to the newly arrived accumulator acc_i. (The parameter acc_i is initialized at the i-th PE with a zero initial value and circularly shifted leftward across the ring array.)

3. After N such accumulation operations, the accumulator acc_i will return to the i-th PE after accumulating all the products: δ_i(l) = Σ_{j=1}^{N} δ_j(l+1) r_ji(l+1).

Figure 4: The ring systolic architecture for the consecutive VMM at the (l+1)-th iteration in the learning phase.

2.3 Implementation Considerations of Neural Processing Units

Time Efficient Design Based on a Parallel Array Multiplier  As discussed in Sections 2.1 and 2.2, most of the computations involved in the neural processing units are MAC or multiplication operations (e.g., the calculations of g_i(l+1), h_j(l+1), and r_ji(l+1)). For a time-efficient dedicated design, a parallel array multiplier (e.g., the Baugh-Wooley multiplier [1]) should be favorably considered.

There is concern about using digital hardware for the nonlinear sigmoid function, which is required in many continuous-valued ANN models. Simulations have been conducted [6] to replace the sigmoid function with a piecewise linear function. The simulations adopted special values (combinations of powers of two) for the slopes of each interval of the piecewise linear function, so that only shift-and-add operations are required for the implementation. It is observed that with 8 to 16 linear segments, the approximation produces results very similar to those of a 32-bit floating point implementation of the sigmoid function. This may be exploited to simplify the processing hardware.

Area Efficient Design Based on a Cordic Processor  For an area-efficient dedicated digital VLSI implementation of the neural processing units, a Cordic processor, which uses about 1/3 the silicon area of an array multiplier, might be a good alternative [1]. A Cordic scheme is an iterative method based on bit-level shift-and-add operations. It is especially suitable for computing a class of rotations. A 2-dimensional vector v = [x, y] can be rotated by an angle α by using a rotation operator R_α: v' = R_α v.

In the linear mode, the Cordic can be used for MAC (multiply and accumulate) operations, although it is somewhat slower than the array multiplier. Given x, y, and α, we can get the Cordic output y + xα in the linear mode, which is exactly what the MAC operations need. The key advantage of Cordic is that it may implement the sigmoid activation function by setting the Cordic in the hyperbolic mode, where the inputs are set as v = [1, 1] and α = u/2. For a more detailed design, the reader is referred to [4].

3 Conclusion

For real-time processing performance, neural network architectures will require massively parallel processing. Fortunately, today's VLSI and CAD technologies facilitate practical and cost-effective implementation of large-scale computing networks. A unified mathematical formulation of the generic iterative ANN model is proposed to prepare the ground for a versatile neural computing architecture. This leads to a programmable universal ring systolic ANN design. Thanks to the versatility of this systolic design, both the retrieving and learning phases of most neural network models can be efficiently implemented.

References

[1] H. M. Ahmed. Alternative arithmetic unit architectures for VLSI digital signal processors. In VLSI and Modern Signal Processing, chapter 16, pages 277-306, Prentice Hall, Inc., Englewood Cliffs, NJ, 1985.

[2] S. Grossberg. Adaptive pattern classification and universal recoding: Part 1. Parallel development and coding of neural feature detectors. Biological Cybernetics, 23:121-134, 1976.

[3] J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. In Proc. Natl. Acad. Sci. USA, pages 2554-2558, 1982.

[4] J. N. Hwang. Algorithms/Applications/Architectures of Artificial Neural Nets. Ph.D. dissertation, Dept. of Electrical Engineering, University of Southern California, December 1988.

[5] T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59-69, 1982.

[6] S. Y. Kung and J. N. Hwang. A unified systolic architecture for artificial neural networks. To appear in Journal of Parallel and Distributed Computing, Special Issue on Neural Networks, 1989.

[7] R. Linsker. Self-organization in a perceptual network. IEEE Computer Magazine, 21:105-117, March 1988.

[8] L. R. Rabiner and B. H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4-16, January 1986.

[9] D. E. Rumelhart, J. L. McClelland, and the PDP Research Group. Parallel Distributed Processing (PDP): Vol. 1 and 2. MIT Press, Cambridge, Massachusetts, 1986.

[10] B. Widrow and R. Winter. Neural nets for adaptive filtering and adaptive pattern recognition. IEEE Computer Magazine, 21:25-39, March 1988.