You are on page 1of 95

The CORDIC Algorithm

:
An Area-EIIicient Technique Ior FPGA-Based ArtiIicial Neural Networks





A Thesis
presented to the Faculty oI
CaliIornia Polytechnic State University
San Luis Obispo






In Partial FulIillment
oI the Requirements Ior the Degree
Master oI Science in Electrical Engineering



by
Tim McLenegan
November 2006
ii
Authorization for Reproduction of Master4s Thesis





# grant permission for the reproduction of this thesis in its entirety or any of its parts4
without further authori6ation from me.





88888888888888888888888888
9ignature :;imothy <c=enegan>

88888888888888888888888888
?ate
iii
!""ro%&'()&*e(


Title: The CORDIC Algorithm: An Area-EIIicient Technique Ior FPGA-Based
ArtiIicial Neural Networks

Author: Timothy McLenegan
Date Submitted: November 28, 2006



Dr. Lynne Slivovsky ¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸
Advisor and Committee Chair Signature


Dr. Albert Liddicoat ¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸
Committee Member Signature


Dr. Jane Zhang ¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸¸
Committee Member Signature
iv
A"stract
An artificial neural network is a computing technique that allows problems to be solved
that would otherwise require an overly complex procedural algorithm. <y designing a
large network of computing nodes based on the artificial neuron model, new solutions
can be developed to problems relating fields such as image and speech recognition.
This thesis presents the ?@ABI? algorithm as a computing technique that is capable of
supporting an artificial neural network in programmable hardware such as DEFAs. The
algorithm is presented in depth, including a derivation and precision analysis. The
designs of a parallel and bitGserial implementation are analyHed with respect to their
ability to support a large neural network. Simulations demonstrating the operation of the
unit follow a breakdown of the subcomponents in each design. It is shown that the small
resource requirements of the ?@ABI? algorithm allow for many instances in an DEFA,
allowing the parallelism inherent to an artificial neural network to be maintained.
v
!able of Contents
!"#$%&'$ ((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((( i
*&"+,-o/-0o1$,1$# ((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((( 2
3i#$-o/-*&"+,# (((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((( 2ii
3i#$-o/-4i56%,#((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((2iii
07&8$,%-9 :1$%o;6'$io1((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((( 9
07&8$,%-< !%$i/i'i&+-=,6%&+-=,$wo%?# (((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((( @
2.1 Background......................................................................................................................................... 4
2.2 ArtiIicial Neuron Model..................................................................................................................... 5
2.2.2 Summation Function .................................................................................................................. 6
2.2.3 TransIer Function ....................................................................................................................... 6
2.2.4 Scaling and Limiting .................................................................................................................. 7
2.2.5 Error Function............................................................................................................................. 7
2.2.6 Output Function.......................................................................................................................... 7
2.2.7 Learning Function ...................................................................................................................... 8
2.3 ArtiIicial Neural Network Structure.................................................................................................. 8
2.3.1 Input Layer.................................................................................................................................. 9
2.3.2 Hidden Layers............................................................................................................................. 9
2.3.3 Output Layer ............................................................................................................................. 10
2.4 Learning Modes ................................................................................................................................ 10
2.4.1 Supervised Learning................................................................................................................. 11
2.4.2 Unsupervised Learning ............................................................................................................ 11
2.4.3 Learning Rates .......................................................................................................................... 12
2.4.4 Common Learning Laws.......................................................................................................... 12
2.5 FPGA Implementations.................................................................................................................... 14
2.6 CORDIC............................................................................................................................................ 16
07&8$,%-A 4BC!# (((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((( 9D
3.1 Design Decisions .............................................................................................................................. 18
3.2 FPGA Structure ................................................................................................................................ 20
3.3 Xilinx Spartan-3 ............................................................................................................................... 22
07&8$,%-@ *7,-0EFG:0-!+5o%i$7H((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((( <I
4.1 Background....................................................................................................................................... 26
4.2 The Algorithm .................................................................................................................................. 27
4.3 Accumulator Registers ..................................................................................................................... 30
4.4 Computation Modes ......................................................................................................................... 31
4.4.1 Rotation Mode .......................................................................................................................... 32
4.4.2 Vectoring Mode........................................................................................................................ 34
4.5 Expanding the Computation Domain.............................................................................................. 36
4.6 Convergence ..................................................................................................................................... 40
"i
$%&%1 (he Con"er/ence Condition %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% $3
$%&%4 5roo6 o6 Con"er/ence Criteria%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% $1
$%&%8 Con"er/ence in 9otation :ode%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% $4
$%; <ccurac> and ?rror%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% $8
$%;%1 @uAericaB 9epreDentation%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% $8
$%;%4 9oundin/ ?rror %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% $$
$%;%8 <n/Be <pproEiAation ?rror%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% $$
!"#$%&' ) *"& +#'#,,&, -.$,&.&/%#%01/22222222222222222222222222222222222222222222222222222222222 34
F%1 GeDi/n%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% $;
F%1%1 (he ControB Hnit%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% $I
F%1%4 (he Jhi6t 9e/iDterD %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% F4
F%1%8 (he KooLup (aMBe %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% F8
F%1%$ GeDi/n JuAAar> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% F$
F%4 JiAuBation 9eDuBtD%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FF
F%4%1 JiAuBation 1N CoAputin/ e Oith CP9GIC RroSth ?rror %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FF
F%4%4 JiAuBation 4N CoAputin/ e CoApenDatin/ 6or CP9GIC RroSth %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% F&
F%4%8 JiAuBation 8N CoAputin/ e
T1
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% F;
F%4%$ JiAuBation $N CoAputin/ PutDide the GoAain o6 Con"er/ence%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FU
F%8 JuAAar> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FI
!"#$%&' 4 *"& 5&'0#, -.$,&.&/%#%01/22222222222222222222222222222222222222222222222222222222222222 46
&%1 GeDi/n%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% &3
&%1%1 ControB Hnit %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% &4
&%1%4 (he Jhi6t 9e/iDterD %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% &$
&%1%8 (he KooLup (aMBe %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% &&
&%1%$ (he JeriaB <dderD %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% &;
&%1%F GeDi/n JuAAar> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% &U
&%4 JiAuBation 9eDuBtD%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% &I
&%4%1 JiAuBation 1N CoAputin/ e Oith CP9GIC RroSth ?rror %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% ;3
&%4%4 JiAuBation 4N CoAputin/ e CoApenDatin/ 6or CP9GIC RroSth %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% ;1
&%4%8 JiAuBation 8N CoAputin/ e
T1
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% ;4
&%4%$ JiAuBation $N CoAputin/ PutDide the GoAain o6 Con"er/ence%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% ;4
&%8 JuAAar> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% ;8
!"#$%&' 7 !1/8,9:01/ #/; <9%9'& =1'>2222222222222222222222222222222222222222222222222222222222 7)
;%1 CoAparin/ the GeDi/nD %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% ;&
;%4 Inte/ration Oith <rti6iciaB @euraB @etSorLD%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% ;I
;%4%1 CP9GIC <rti6iciaB @euron%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% ;I
;%4%4 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% U4
;%8 Vuture Jtud>%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% U$
?&@&'&/8&:2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 A)
vii
List of Tables
Table 4.1—CORDIC growth factors for each computation domain................................................................ 38
Table 4.2—Functions that can be computed using the CORDIC algorithm. .................................................. 39
Table 4.3—Domains of convergence for the CORDIC algorithm. .................................................................. 41
Table 5.1—Overall device utilization for the parallel design........................................................................... 48
Table 5.2—Unconditional control unit outputs for each state in the parallel design. ..................................... 50
Table 5.3—Device utilization for the parallel control unit. .............................................................................. 52
Table 5.4—Device utilization for the parallel shift register. ............................................................................ 53
Table 5.5—Device utilization for the parallel lookup table.............................................................................. 54
Table 6.1—Overall device utilization for the serial design. ............................................................................. 61
Table 6.2—State outputs for the serial control unit........................................................................................... 64
Table 6.3—Device utilization for the serial control unit. ................................................................................. 64
Table 6.4—Device utilization for a single serial shift register (including 32:1 multiplexer)......................... 66
Table 6.5—Device utilization for the serial lookup table. ................................................................................ 66
Table 6.6—Device utilization for the serial and parallel adders. ..................................................................... 68
Table 7.1—Device resource requirements for the artificial neuron designs using both the parallel and serial
CORDIC units............................................................................................................................................ 81

viii
List of Figures
Figure 2.1—Components that comprise an artificial neuron, the basic building blocks of artificial neural
networks. ...................................................................................................................................................... 5
Figure 2.2—A basic artificial neural network. .................................................................................................... 9
Figure 3.1—Internal structure of a Xilinx Spartan-3 FPGA [8]....................................................................... 20
Figure 3.2—Block diagram of a slice in a Spartan-3. ....................................................................................... 23
Figure 3.3—Block diagram of a Complex Logic Block (CLB) in a Spartan-3. .............................................. 24
Figure 4.1—Step i in the CORDIC algorithm. .................................................................................................. 28
Figure 4.2—The CORDIC Rotation mode. ....................................................................................................... 32
Figure 4.3—CORDIC Vectoring Mode ............................................................................................................. 34
Figure 5.1—Block-level diagram of the complete parallel CORDIC unit. ..................................................... 47
Figure 5.2—State diagram for the parallel CORDIC unit. ............................................................................... 49
Figure 5.3—State transition logic for the parallel CORDIC unit. .................................................................... 50
Figure 5.4—Logic behind the state-independent outputs of the parallel control unit..................................... 51
Figure 5.5—Design of the variable shift register. ............................................................................................. 53
Figure 5.6—Slices used by the various components of the parallel CORDIC unit. ....................................... 54
Figure 5.7—Results of a ModelSim simulation attempting to compute the value of e. As a result of
CORDIC growth, the final computed value is inaccurate....................................................................... 56
Figure 5.8—Results of a ModelSim simulation computing the value of e. By accounting for CORDIC
growth, the final computed value is accurate........................................................................................... 57
Figure 5.9—Results of a ModelSim simulation computing the value of e
–1
. .................................................. 57
Figure 5.10—Results of a ModelSim simulation computing the value of e
5
. Since the initial value 5 is
outside the domain of convergence, the computed value is incorrect. ................................................... 58
Figure 6.1—Block diagram for the serial implementation of the CORDIC algorithm................................... 60
Figure 6.2—State diagram for the serial implementation of the CORDIC algorithm. ................................... 63
Figure 6.3—Design of the serial shift registers. ................................................................................................ 65
Figure 6.4—Design of the serial adder. ............................................................................................................. 67
Figure 6.5—Slices used by the various components of the serial CORDIC unit............................................ 68
Figure 6.6—Results of a ModelSim simulation of the serial CORDIC unit attempting to compute the value
of e. As a result of CORDIC growth, the final computed value is inaccurate. ...................................... 70
Figure 6.7—Results of a ModelSim simulation of the serial CORDIC unit computing the value of e. By
accounting for CORDIC growth, the final computed value is accurate................................................. 71
Figure 6.8—Results of a ModelSim simulation of the serial CORDIC unit computing the value of e
–1
. ..... 72
Figure 6.9—Results of a ModelSim simulation of the serial CORDIC unit computing the value of e
5
. Since
the initial value 5 is outside the domain of convergence, the computed value is incorrect. ................. 73
Figure 7.1—Slices required by the various components of the two CORDIC designs. ................................. 76
Figure 7.2—The effect of word size on the FPGA chip area requirements of the CORDIC unit.................. 78
Figure 7.3—The effect of word size on the execution time of the CORDIC unit........................................... 78
Figure 7.4—Block diagram of a basic artificial neuron utilizing the CORDIC unit to compute the transfer
function e
x
................................................................................................................................................... 80
Figure 7.5—Simulation of an artificial neuron using the 4 inputs (0.3, 0.02, 2.1, 0.65) and the weights (1, 2,
–0.5, 0.25)................................................................................................................................................... 82
1
Chapter 1 Introduction
Traditional design techniques typically require a systems designer to formulate a
mathematical model of the system, and then design an algorithm to operate on the system
inputs in accordance with the model. In general, this approach works well, but there are
some applications where either the mathematical model is difficult to derive, or the
algorithm is too complex to be implemented in a cost-effective manner. Image and
speech processing are two such problems that people intuitively understand but require
complicated mathematical models that in turn require a good deal of computing power to
operate [10].
In these situations, an artificial neural network can be employed. Rather than
deriving an algorithm for analyzing the image or speech pattern, the parameters of the
network are tweaked in a process called training until the desired outputs are obtained
from the network. As will be shown in Chapter 2, networks are massively parallel
networks of small computing nodes called “neurons.” This parallelism makes neural
networks naturally suited to hardware implementations. A software implementation
running on a single microprocessor can only simulate parallelism; only in a hardware
implementation can the real-time benefits of this parallelism be achieved.
Programmable hardware devices such as FPGAs are particularly suited as platforms
for implementation, since the various parameters of the network occur during training.
Programmable hardware devices also afford several other benefits to hardware designers
by eliminating a lot of the difficulties that come with custom chip design. With a platform
for these artificial neural networks chosen, a means of performing the computations in the
interconnected nodes is needed. The number of these nodes can be large; the arithmetic
2
unit must be small enough so that it can be duplicated enough to maintain the parallelism
inherent in the structure.
The CORDIC algorithm will be presented as a solution to that problem. The
algorithm lends itself to a simple hardware implementation consisting only of basic
addition and shifting operations. In addition to its small chip area footprint, the algorithm
can compute a wide variety of functions ensuring that it can be used in a variety of ways
within a system. In particular, the algorithm can compute the exponential function, e
x
by
summing together the values of sinh(x) and cosh(x), both of which are computed in
parallel by the unit. The exponential function has an application as the transfer function
in a specific type of neural network and is chosen as the function of study for this thesis.
Chapter 2 will describe neural networks in detail and describe how they operate. The
structure of the network will be analyzed, as will the design of the neurons that comprise
the network. Chapter 3 will study the design of programmable hardware devices—
FPGAs in particular. The benefits they offer hardware designers will be presented, as will
the basic structures common to most implementations. Chapter 4 introduces the CORDIC
algorithm. A general-purpose definition of the algorithm is presented, and proofs of
convergence and precision are presented. Expansion into multiple computation domains
allows for the computation of sine, cosine, square root, multiplication, hyperbolic sine,
and hyperbolic cosine.
Chapter 5 and Chapter 6 study two implementations of the CORDIC algorithm and
study how the components of the CORDIC processing unit fit into a Xilinx Spartan-3
FPGA. Simulations performed using the ModelSim software are presented. Chapter 7
3
then compares these two designs and shows how they can be used in a neural network to
facilitate their implementation in a programmable device.
"
Chapter ) *rti,icia. /eura. /et1or34
As electronic devices increasingly become more a part of society, the desire for
devices to solve more complex problems is growing. ;ome problems have no easy
algorithmic solution, or the problems themselves are not completely understood. It is in
these cases that it is occasionally easier to develop a learning machine that can be taught
what to do in a situation without needing to understand the underlying processes.
Artificial neural networks are one such technique. They are designed to emulate the
human brainAs ability to learn. The same structure can be used to solve many different
problems, depending on the training methods employed.
)51 7ac38round
The structure of an artificial neural network is similar to that of the human brain. The
human brain is made up of billions of cells called neurons. The computational power of
the brain is a result of the large volume of neurons available, coupled with learning
ability.
The neurons themselves have several different classifications and subcomponents.
The computation process comes as a result of the interconnection of these different types
of neurons. Bowever, unlike the digital realm in which artificial neural networks reside,
the neurons in the brain “form a process which is not binary, not stable, and not
synchronous” [10].
Though artificial neural networks try to mimic the way in which the human brain
operates, the methodology is not designed to replace the brain, especially since so much
"
of it is (ot u(*erstoo*- ./t0er1 /rtifi2i/3 (eur/3 (et4or5s re6rese(t / (e4 te20(i7ue for
e(gi(eers to so39e 6ro:3e;s-
!"! #$%i'i(i)*+,e.$/0+1/2e*+
!
"!"
!
"!2
i
"!"
i
"!2
i
"!$
!
"!$
! f
"
!"#$%
o
"
!
i
"!!!%&'
()ti!+ial /e1)o"

!"#$r& ()*+,-./-0&0ts t3at c-./r"s& a0 art"6"c"a7 0&$r-08 t3& bas"c b$"7:"0# b7-c;s -6 art"6"c"a7
0&$ra7 0&t<-r;s)
<rtifi2i/3 (euro(s i;it/te t0e 9/rious fe/tures of :io3ogi2/3 (euro(s- Figure >-?
s0o4s / :/si2 ;o*e3 of t0e 9/rious 2o;6o(e(ts t0/t ;/5e u6 /( /rtifi2i/3 (euro(- @/20
(euro( 0/s / set of i(6uts1 3/:e3e* i i( t0e figure1 /(* / set of out6uts *e(ote* /s o i( t0e
figure- A0e /;ou(t of t0ese i(6uts /(* out6uts 2/( 9/rB *e6e(*i(g o( t0e stru2ture of t0e
e(tire (et4or5- Cut6uts 2/( :e 2o((e2te* to ;u3ti63e (euro(s-
!"!"3"3 4ei56%i05+7)(%/$8+
D( or*er to ;i;i2 t0e sB(/6ti2 stre(gt0s of :io3ogi2/3 (euro(s1 4eig0ti(g f/2tors
E*e(ote* w i( Figure >-?F /re (ee*e* for t0e i(6uts to t0e /rtifi2i/3 (euro(- @/20 4eig0t
sig(ifies t0e i;6ort/(2e of t0eir res6e2ti9e i(6ut i( t0e 6ro2essi(g fu(2tio( of t0e (euro(-
D(6uts 4it0 3/rger 4eig0ts 4i33 2o(tri:ute ;ore to t0e (eur/3 res6o(se t0/( t0ose 4it0
6
l$%%$r '$ig*t%. -*$ pot$ntial to l$arn i% incorporat$d into t*$ arti4icial n$uron 6and t*u%
t*$ n$t'or789 :; allo'ing t*$ input '$ig*t% to :$ adapti<$ co$44ici$nt%. -*$ adaptation
proc$%% i% p$r4orm$d in r$%pon%$ to training %$t% o4 data9 and d$p$nd% on :ot* t*$
n$t'or7>% %p$ci4ic topolog; a% '$ll a% t*$ l$arning rul$ :$ing appli$d.
!"!"! #$%%&'()* ,$*-'()*
-*$ 4ir%t %t$p in t*$ op$ration o4 an arti4icial n$uron i% t*$ %ummation 4unction. ?%
t*$ nam$ impli$%9 t*i% i% u%uall; a %ummation o4 t*$ '$ig*t$d input% to t*$ n$uron. -*$
%ummation 4unction can :$ mor$ compl$@ t*an a %impl$ %ummation. Ot*$r 4unction% u%$d
ar$ minimum9 ma@imum9 maBorit;9 product9 and ot*$r normaliCing algorit*m%. ?n;
4unction t*at op$rat$% on multipl$ input% to produc$ a %ingl$ output Duali4i$%.
!"!". /0&*1230 ,$*-'()*
?4t$r t*$ input% *a<$ pa%%$d t*roug* t*$ %ummation 4unction9 t*$; ar$ t*$n 4$d
t*roug* a tran%4$r 4unction. -*i% tran%4$r 4unction i% u%uall; a nonlin$ar 4unction. On$ o4
t*$ goal% o4 arti4icial n$ural n$t'or7% i% to :$ a:l$ to pro<id$ nonlin$ar proc$%%ing.
Eo'$<$r9 t*$ a:ilit; o4 a n$ural n$t'or7 to p$r4orm in a nonlin$ar 4a%*ion i% d$p$ndant
upon t*$ tran%4$r 4unction o4 t*$ indi<idual n$uron%. F; c*oo%ing a lin$ar tran%4$r
4unction9 t*$ o<$rall n$t'or7 'ould :$ limit$d to %impl$ lin$ar com:ination% o4 t*$
input%. Gariou% tran%4$r 4unction% ar$ t;picall; u%$d. ? <$r; common tran%4$r 4unction i%
t*$ *;p$r:olic tang$nt. -*$ *;p$r:olic tang$nt i% a continuou% 4unction9 a% ar$ it%
d$ri<ati<$%.
Hn a 4$' ca%$%9 a uni4orm noi%$ g$n$rator i% add$d :$4or$ t*$ tran%4$r 4unction i%
appli$d. -*$ output o4 t*i% g$n$rator i% r$4$rr$d to a% t*$ n$uron>% It$mp$ratur$.J -*$
%urrounding t$mp$ratur$ can ad<$r%$l; a44$ct t*$ *uman :rain9 and :; adding t*i%
7
ca%a&ilit* to artificial neural network34 t5e &e5a6ior of t5e network 7ore clo3el*
e7ulate3 a 5u7an &rain.
!"!"# $%&'()* &), -(.(/()*
97%le7entation of t5i3 %ortion of t5e artificial neuron 7o:el i3 o%tional. ;5e out%ut
of t5e tran3fer function i3 7ani%ulate: in or:er to lie wit5in certain &oun:3. Scaling i3
%erfor7e: fir3t4 followe: &* 3o7e 3ort of t5re35ol: function.
Accor:ing to An:er3on an: ?cNeil ABCD4 t5e3e co7%onent3 are u3uall* u3e: w5en
eE%licitl* 3i7ulating &iological neuron 7o:el3.
!"!"0 12232 45)%/(3)
;5e raw error of t5e network i3 t5e :ifference &etween t5e :e3ire: out%ut an: t5e
actual out%ut. ;5e error function tran3for73 t5i3 raw error to 7atc5 t5e %articular network
arc5itecture in u3e. 9f it i3 :e3ire: &* t5e 3*3te7 arc5itect to con3i:er error in t5e network4
t5en t5i3 co7%onent i3 inclu:e: in t5e neuron 7o:el. 9n t5i3 ca3e4 %ro%agation :irection
of t5i3 error i3 u3uall* &ackwar:3 t5roug5 t5e network. ;5e &ackF%ro%agate: 6alue 3er6e3
a3 t5e in%ut to ot5er neuron3G learning function3.
!"!"6 75/85/ 45)%/(3)
Hut3i:e t5e neuron 7o:el4 &ut relate: to it i3 t5e out%ut function. Nor7all* t5e
out%ut of t5e neuron i3 eIual to t5e out%ut of t5e tran3fer function. J5en i7%le7ente:4
t5e out%ut function allow3 for co7%etition &etween t5e out%ut3 of 6ariou3 neuron3.
Jit5in a 37all Kneig5&or5oo:L of neuron34 a large out%ut &* one neuron will cau3e t5e
out%ut of a :ifferent neuron to :i7ini35. 9n ot5er wor:34 t5e lou:e3t neuron cau3e3 t5e
ot5er neuron3 to &e Iuieter.
"
2.2.# $earning Function
#he learnin+ ,un.tion 1odi,ie3 the input 5ei+ht3 o, the neuron6 7ther na1e3 +i8en
to thi3 ,un.tion are the adaptation ,un.tion9 or learnin+ 1ode6 #here are t5o 1ain type3
o, learnin+ 5hen dealin+ 5ith neuron3 and neural net5or;36 #he ,ir3t type9 3uper8i3ed
learnin+9 i3 a ,or1 o, rein,or.e1ent learnin+ and re<uire3 a tea.her9 u3ually in the ,or1 o,
trainin+ 3et3 or an o=3er8er6 >n3uper8i3ed learnin+ i3 the other type9 and i3 =a3ed upon
internal .riteria =uilt into the net5or;6 #he 1a?ority o, neural net5or;3 utili@e the
3uper8i3ed learnin+ 1ethod9 a3 un3uper8i3ed learnin+ i3 .urrently 1ore in the re3ear.h
real16 #he pro.e33e3 u3ed to train a net5or; are di3.u33ed in A6B6
2.1 Artificial 5eural 5etwork 8tructure
Crti,i.ial neural net5or;3 ,un.tion a3 parallel di3tri=uted .o1putin+ net5or;36 Da.h
node in the net5or; i3 an arti,i.ial neuron6 #he3e neuron3 are .onne.ted to+ether in
8ariou3 ar.hite.ture3 ,or 3pe.i,i. type3 o, pro=le136 Et i3 i1portant to note that the 1o3t
=a3i. ,un.tion o, any arti,i.ial neural net5or; i3 it3 ar.hite.ture6 #he ar.hite.ture9 alon+
5ith the al+orith1 ,or updatin+ the input 5ei+ht3 o, the indi8idual neuron39 deter1ine3
the =eha8ior o, the arti,i.ial neural net5or;6 Feuron3 are typi.ally or+ani@ed into layer3
5ith .onne.tion3 =et5een neuron3 eGi3tin+ a.ro33 layer39 =ut not 5ithin6 Da.h neuron
5ithin ea.h layer i3 o,ten ,ully .onne.ted to all neuron3 in the a33o.iated layer6 #hi3 .an
lead to a 8a3t a1ount o, .onne.tion3 eGi3tin+ 5ithin the net5or;9 e8en 5ith relati8ely
,e5 neuron3 per layer6 Hi+ure A6A 3ho53 a 3i1ple neural net5or; .ontainin+ I layer36 En
thi3 .a3e9 the layer3 are not ,ully .onne.ted6
9
input layer
output
layer
hidden
layer

!igure 2.2*+ ,asi/ arti1i/ia2 neura2 net4or6.
!"#"$ %&'() +,-./
Individual neurons are used Ior each input oI an artiIicial neural network. These
inputs could be collected data, or real world inputs Irom physical sensors. Pre-processing
oI the inputs can be done to speed up the learning process oI the network. II the inputs are
simply raw data, then the network will need to learn to process the data itselI, as well as
analyze it. This would require more time, and possibly an even larger network than with
processed inputs.
!"#"! 0122.& +,-./3
The input layer is typically connected to a hidden layer. Multiple hidden layers may
exist, with the inputs oI each hidden layer`s neurons oIten being Iully connected to the
1#
$u&pu&s $) &*e previ$us/0 /a0er2s neur$ns. 5idden /a0ers 7ere given &*a& na9e due &$ &*e
)a:& &*a& &*e0 d$ n$& see an0 rea/ 7$r/d inpu&s n$r d$ &*e0 give an0 rea/ 7$r/d $u&pu&s.
;*e0 are )ed <0 &*e inpu& /a0er2s $u&pu&s= and )eed &*e $u&pu& /a0er2s inpu&s. ;*e nu9<er
$) neur$ns 7i&*in ea:* $) &*e *idden /a0ers= as 7e// as &*e nu9<er $) *idden /a0ers
&*e9se/ves= de&er9ines &*e :$9p/e>i&0 $) &*e s0s&e9. ?*$$sing &*e rig*& a9$un& )$r ea:*
is a 9a@$r par& $) designing a 7$rAing neura/ ne&7$rA. Bigure 2.2 s*$7s $ne *idden
/a0er= <u& $&*er ne&7$rA ar:*i&e:&ures :an *ave 9u/&ip/e *idden /a0ers.
!"#"# $%&'%& )*+,-
Da:* neur$n 7i&*in &*e $u&pu& /a0er re:eives &*e $u&pu& $) ea:* neur$n 7i&*in &*e
/as& *idden /a0er. ;*e $u&pu& /a0er pr$vides rea/ 7$r/d $u&pu&s. ;*ese $u&pu&s :$u/d g$ &$
an$&*er :$9pu&er pr$:ess= a 9e:*ani:a/ :$n&r$/ s0s&e9= $r 9a0<e saved in&$ a )i/e )$r
ana/0Eing. LiAe &*e $u&pu& )un:&i$n $) an individua/ neur$n= &*e $u&pu& /a0er 9a0
par&i:ipa&e in s$9e s$r& $) :$9pe&i&i$n <e&7een $u&pu&s. ;*is /a&era/ in*i<i&i$n :an <e
seen in Bigure 2.2 as &*e d$&&ed /ines :$nne:&ing &*e $u&pu& neur$ns. Gn addi&i$n= &*e
$u&pu&s 9a0 a/s$ <e )ed <a:A in&$ previ$us neur$ns &$ assis& &*e /earning pr$:ess. Gn
Bigure 2.2= &*e resu/& )r$9 an $u&pu& neur$n is )ed <a:A in&$ a neur$n in &*e *idden /a0er.
!". ),*-/0/1 234,5
H varie&0 $) di))eren& /earning 9$des e>is& )$r de&er9ining *$7 and 7*en &*e inpu&
7eig*&s $) &*e individua/ neur$ns are upda&ed 7i&*in a ne&7$rA. ;*e &0pes $) /earning are
ei&*er supervised $r unsupervised. Hs s&a&ed ear/ier= &*e unsupervised /earning is :urren&/0
&*e 9$s& unAn$7n &0pe $) /earning. ;*e /earning ra&e $) a ne&7$rA dras&i:a//0 e))e:&s &*e
i&s per)$r9an:e.
11
!"#"$ %&'()*+,(- /(0)1+12
Learning in a su,er-ise. mo.e starts wit3 a 4om,arison o5 t3e networ6’s generate.
out,uts an. t3e .esire. out,uts. 9n,ut weig3ts o5 ea43 neuron are a.:uste. to minimi;e
an< .i55eren4es 5oun.. =3is ,ro4ess is re,eate. until t3e networ6 is .eeme. to be a44urate
enoug3. @5ter t3e training ,3aseA t3e neurons’ weig3ts are t<,i4all< 5ro;enA w3i43 allows
t3e networ6 to be use. reliabl<. =o better a.a,t to slig3t -ariationsA t3e learning rate 4an
be lowere.. Bne o5 t3e most im,ortant t3ings to .o w3en training a networ6 is to
4are5ull< 43oose t3e .ata use. 5or training. =<,i4all< .ata is se,arate. into a training set
an. a mu43 smaller test set. =3e training set is use. to train t3e networ6 to ,er5orm a tas6.
=3e test set is use. to -eri5< t3at t3e networ6 is able to generali;e w3at it 3as learne. to
slig3t -ariations. Cit3out t3is se,aration o5 .ata setsA one woul. not be able to 6now i5
t3e networ6 sim,l< memori;e. t3e .ata set or not.
!"#"! 31,&'()*+,(- /(0)1+12
Dnsu,er-ise. learning is ,er5orme. wit3out an< 5orm o5 eEternal rein5or4ement.
=3is learning mo.e re,resents a sort o5 en. goal 5or s<stems .esigners. Dsing t3is
met3o.A t3e s<stem tea43es itsel5 b< itsel5. =3e networ6 4ontains wit3in itsel5 a met3o. o5
.etermining w3en its out,uts are not w3at t3e< s3oul. be. =3is met3o. o5 learning is not
nearl< as well un.erstoo. as t3e su,er-ise. met3o.. 9t reFuires t3at t3e networ6 learn
online. Current wor6 3as been limite. to sel5Horgani;ing ma,sA w3i43 learn to 4lassi5<
in4oming .ata. Iurt3er .e-elo,ments wit3 t3is t<,e o5 learning woul. 3a-e uses in man<
situations w3ere a.a,tation to new in,uts is reFuire. regularl<.
"2
!"#"$ %&'()*)+ -'.&/
The learning rate o/ a networ1 is deter4ined 56 4an6 /actors. 9etwor1 architecture;
si<e and co4=le>it6 =la6 a 5ig role in the s=eed at which the networ1 learns. ?nother
/actor that a//ects the learning rate is the learning rule or rules e4=lo6ed. @low and /ast
learning rates each haAe their =ros and cons. ? lower rate will o5Aiousl6 ta1e longer to
arriAe at a 4ini4u4 error at the out=ut. ? /aster rate will arriAe 4ore Buic1l6; 5ut has a
tendenc6 to oAershoot the 4ini4u4. @o4e learning rules use the 5est o/ 5oth worlds; and
start o// with a high learning rate; and lower it graduall6 until a 4ini4u4 is reached.
!"#"# 01221) %&'()*)+ %'3/
Learning laws goAern how the in=ut weights o/ neurons within the networ1 are
4odi/ied. T6=icall6 the error at the out=ut is =ro=agated 5ac1 through the Aarious la6ers
o/ the networ1; adDusting the weights as it goes. Eow the error is =ro=agated 5ac1 is the
4aDor di//erence. The /ollowing; all /ro4 F"GH; are so4e laws co44onl6 used 56 networ1
architects.
!"#"#"4 5&667/ -89&
Ee55’s rule was the /irst general rule /or u=dating weights. Jt states; KJ/ a neuron
receiAes an in=ut /ro4 another neuron; and i/ 5oth are highl6 actiAe L4athe4aticall6 haAe
the sa4e signM; the weight 5etween the neurons should 5e strengthened.N Ee55 o5serAed
that 5iological neural =athwa6s are strengthened each ti4e the6 are used; and this rule is
designed to si4ulate that e//ect. Oost o/ the other rules 5uild u=on this /unda4ental law.
?s an e>a4=le o/ how this rule wor1s; su==ose a neural networ1 is 5eing trained to
control the acceleration o/ a car. @u==ose /urther that the networ1’s in=uts are the 5ra1e
and gas =edal =ositions 5eing as o=erated 56 a hu4an driAer. The acceleration and
"#
$e&ele(ation o. t/e &a( &an 0e &ompa(e$ 3it/ t/e $esi(e$ o5tp5t o. t/e $(i6e(. 8o3
s5ppose t/e $(i6e( o. t/e &a( 3ante$ to slo3 $o3n9 an$ p5s/e$ t/e 0(a:e. I. t/e o5tp5t o.
t/e ne5(al net3o(: 3as to $e&ele(ate9 t/an an< inp5t 3ei=/ts9 3/i&/ a >$e&ele(ate?
&omman$ 3as passe$ t/(o5=/9 s/o5l$ 0e in&(ease$. @/is 3o5l$ positi6el< (ein.o(&e t/e
a&&epta0le 0e/a6io( o. t/e net3o(:.
!"#"#"! $%& (&)*+ ,-)&
@/e $elta (5le is one o. t/e most &ommon lea(nin= (5les9 an$ is a 6a(iation o. Ae00Bs
C5le. It is also :no3n 0< se6e(al ot/e( names9 in&l5$in= t/e Di$(o3EAo.. Fea(nin= C5le
an$ t/e Feast Gean HI5a(e Fea(nin= C5le. It 3o(:s 0< t(ans.o(min= t/e e((o( at t/e
o5tp5t 0< t/e $e(i6ati6e o. t/e t(ans.e( .5n&tion. @/e (es5lt o. t/is t(ans.o(mation is 5se$
to a$J5st t/e inp5t 3ei=/ts asso&iate$ 3it/ t/e p(e6io5s la<e(s o5tp5ts. @/e t(ans.o(me$
e((o( (es5lt is p(opa=ate$ 0a&: t/(o5=/ all o. t/e la<e(s. Fee$E.o(3a($9 0a&:Ep(opa=ation
net3o(:s 5se t/is met/o$ o. lea(nin=.
!"#"#". $%& /0+12&3* (&45&3* ,-)&
@/e G(a$ient Mes&ent (5le is simila( to t/e Melta C5le in t/at t/e $e(i6ati6e o.
t(ans.e( .5n&tion mo$i.ies t/e o5tp5t e((o(. An a$$itional p(opo(tional &onstant (elate$ to
t/e lea(nin= (ate is a$$e$ to t/e mo$i.<in= .a&to( 0e.o(e t/e 3ei=/ts a(e a$J5ste$. @/is
met/o$ in its 0asi& .o( is :no3n to /a6e a slo3 (ate o. &on6e(=en&e. Osin= a 6a(<in=
lea(nin= (ate as 3as mentione$ 0e.o(e &an miti=ate t/is.
!"#"#"# 67%73&384 9&+0323: 9+;
@/is lea(nin= la3 is 5se$ .o( 5ns5pe(6ise$ net3o(:s. @e56o Po/onen 3as inspi(e$
0< lea(nin= in 0iolo=i&al s<stems9 an$ t/5s &ame 5p 3it/ t/e la3. Dit/ t/is la39 ne5(ons
&ompete .o( t/e oppo(t5nit< to lea(n. @/e ne5(on 3it/ t/e la(=est o5tp5t is t/e 3inne(9
14
and gets to update its weights and possibly some of its neighbors. 8sually the
neighborhoods start large, and shrink as training progresses.
!.# $%&A Implementations
Currently, most neural network research and testing is done using software
simulators. =oftware implementations allow for easy analysis of network behaviors.
Additionally, prediction-style problems in many cases do not require an embedded
hardware application. However, hardware neural networks can offer many advantages.
An optimiCed hardware implementation can yield better performance than a software
configuration running on a standard microprocessor. Additionally, a hardware
implementation can allow for applications that would be difficult to achieve in a software
set-up, such as with remote sensing applications. Designing for an FPGA also offers
many advantages, as discussed in Chapter H.
The ability of FPGAs to be re-programmed offers several specific advantages.
Jarious techniques in the category of “density enhancement” aim to increase the amount
of “effective circuit functionality per unit circuit area” [1N]. One way to accomplish this
is to separate out the various stages of the networkQs learning algorithm onto different
regions of the FPGA. Another way is to use optimiCed constant-coefficient multipliers to
handle the weighting calculations of the neuronQs inputs. These calculations will be fast.
In a training situation, these constant weights can be changed in under 6T !s [1N].
FGPAs also allow for the hardware implementation of artificial neural networks that
can dynamically adUust their topology. This allows for more complex learning algorithms
to adUust the topology, resulting in a more sophisticated final network.
15
Hardware implementations raise new issues and considerations that are not present in
soItware designs. A key consideration involves the numerical representation.
Fundamentally, there are two ways that a number can be represented in any hardware
device: Iixed-point, and Iloating-point. As discussed in 4.7.1, CORDIC uses Iixed-point
notation, which is simply a scaled integer. Floating-point is the most common
representation used in computing hardware due to the wide range oI values that can be
represented. In a hardware setting, especially a setting where the arithmetic units support
a much more complex hardware architecture (e.g. an artiIicial neural network), the chip
area requirements oI a Iloating-point unit are prohibitive. Nichols showed in |15| that it is
impractical to implement a neural network in an FPGA using Iloating-point weights.
Fixed-point units are much more attractive in that they allow Ior more chip area to be
devoted to the actual neural network. There is a large body oI ongoing research devoted
to the use Iixed-point weights in neural networks, which eliminate the need Ior Iloating-
point hardware.
Along with choosing a numerical representation comes the decision oI how much
precision the weights should have. As with any system, greater precision results in
increased computation time, greater chip area requirements and power consumption. For
any problem, in hopes oI eliminating some oI the problems associated with a greater
precision, the so-called 'minimum precision¨ must be determined. For some
applications |16|, it may be as Iew as 16 bits.
Draghici |17| perIormed a more extensive study oI precision, wherein the range and
precision oI the neuron weights were treated as variables. The aim was to derive a general
Iormula that designers could use to Iind the optimal weight Iormat Ior their systems. It
16
was determined the minimum number oI bits to represent one Iixed-point weight is

log $p "1 # $
!
"
#
$
, where the values Ior the weights is in the range

!!! ! " # .
!"# $%&'($)
Any neural network implementation needs an arithmetic unit to perIorm the various
calculations necessary in a neural network. Neural networks are large and rely on the
parallel computing power oI their neurons Ior eIIiciency. The large quantity oI neurons
means that neural networks have signiIicant hardware resource requirements.
Consequently, only very small arithmetic units can be used in a hardware
implementation.
The CORDIC algorithm satisIies that requirement. CORDIC is capable oI computing
many oI Iunctions that are used in neural networks. The Iact that it can switch between
Iunction sets so easily means that it can be used in a network that utilizes diIIerent
transIer Iunctions in diIIerent layers oI the network.
The most common transIer Iunctions used in neural networks are sigmoid Iunctions
such as the hyperbolic tangent. When in the proper computation mode, CORDIC can be
used to compute this Iunction with the help oI a binary divider, using the identity

tanh ! ( ) $
sinh ! ( )
cosh ! ( )
. CORDIC is also capable oI perIorming this division when in linear
vectoring mode, but the division is slow and would double computation time. CORDIC
can also perIorm multiplication when in the linear rotation mode, which can compute the
weights Ior the neuron inputs, though multiplication units optimized Ior multiplication by
a constant are better suited when the weights have already been determined.
1#
This thesis studies the implementation of the exponential function in an FPGA using
C;RDIC. Though not as common as hyperbolic tangent, there are applications where the
exponential function is used as a transfer function. Eu and Fatalama used the
exponential function as a transfer function in their implementation of a neural network
that acted as an associative memory. I1JK Similarly, Malgamuge used the exponential
function in the implementation of a defuzzication approximator. I19K.
1#
!"#$ter 3 FPGA.
Any designer faces a number of different onto which the system can be
implemented. Each platform comes with a set of benefits and tradeoffs. The nature of the
system in addition to its environment need to be taken into account when choosing a
platform.
3.0 1e.234 1ec2.264.
One approach is to reali?e the design as a software program and implement it using a
microprocessor of some kind. There are numerous architectures that can be chosen, each
with its own benefits. It can be easy to find one that is suited for the design. This
approach does not have any special manufacturing costs associated with itB the only cost
comes for the purchase of the processors. Cowever, this techniDue is best suited for
applications that are mostly seDuential in nature. For applications that are parallel in
nature, a custom hardware-based design can offer much in the way of performance
benefits.
Once the designer has elected to use a custom hardware implementation, the type of
implementation must then be chosen. At the lowest level, a fully custom integrated circuit
can be designed at the transistor level. This offers the designer the most amount of
fleGibility, however design costs can be higher due to the compleGity of the
implementation and fabrication of the design. An Application-Hpecific Integrated Circuit
(AHIC) can alternatively be employed to reali?e the design. Lather than designing at the
transistor level, the designer is presented with a library of logic gates with different
performance and power characteristics that can all be used in the design. Hince the design
19
process is simpler, the cost oI implementing the design can be less than a Iully custom
solution.
As an alternative to the static hardware solutions are a number oI programmable
logic devices that can be used. Programmable logic oIIers much oI the beneIits oI using a
microprocessor. Devices come already Iabricated, eliminating the cost and time
associated with Iabricating an ASIC or other custom design. Also, since the devices are
programmable, a new design can be implemented on the device without having to re-
Iabricate new chips. This makes these devices excellent choices Ior design prototyping,
and Ior applications where it is desirable to update to the design aIter it has been
deployed.
There are various architectures Ior programmable devices, that allow Ior designs oI
varying sizes to be implemented. At the smallest level are simple programmable logic
devices (PLDs) that allow the implementation oI two-level logic Iunctions. Complex
PLDs (CPLDs) were developed to allow Ior the implementation oI larger logic Iunctions.
These consist oI a collection oI PLD blocks using a programmable interconnect to allow
deep Iunctions to be implemented.
PLDs are still very limited in the number oI logic gate equivalents that can be made.
Gate arrays oIIer signiIicantly more hardware resources than PLDs, and oIten come with
other hardware blocks, such as RAM or a processor to complement a hardware design.
Gate arrays can be implemented statically with the design Irozen on silicon or in a
programmable package. Field programmable gate arrays (FPGAs) allow Ior
programming aIter the chip has been Iabricated.
"#
3.# FPGA Structure
!"#
!"#
!"# !"#
Block
RAM
)elay-Locked
Loop
)elay-Locked
Loop
)elay-Locked
Loop
)elay-Locked
Loop
234 Logic
234 Pins
Block
RAM
Block
RAM
Block
RAM
M
u
l
t
i
p
l
i
e
r
M
u
l
t
i
p
l
i
e
r
M
u
l
t
i
p
l
i
e
r
M
u
l
t
i
p
l
i
e
r

!i#$r& 3.*+,nt&rnal 1tr$2t$r& 34 a Xilinx Spartan-3 !:;< =>?.
$%e standard -P/A %as t%ree 1ain co1ponents6 con7igura:le logic :locks= I/O
:locks= and a progra11a:le interconnectA Bac% co1ponent plays a role in alloDing 7or
t%e Dide Eariety o7 %ardDare designs t%at can :e i1ple1ented in -P/AsA -igure FAG
s%oDs a :lock-leEel diagra1 o7 t%e Earious co1ponents t%at 1ake up an -P/AA
$%e con7igura:le logic :locks IJLBsM are D%ere t%e actual logic 7unctions speci7ied
:y t%e designer are i1ple1entedA In 1ost -P/A designs= t%e con7igura:le logic :locks
21
are made up oI a number oI lookup tables (LUTs). LUTs can be implemented as a
multiplexer and can implement any n-variable Boolean Iunction. The n inputs to the
Iunction can be used as the control lines on the multiplexer, and the 2
n
outputs Ior the
Iunction are data inputs to the multiplexer. These inputs are generally implemented as
one-bit SRAM cells and are what make the logic blocks programmable. Each LUT
describes the complete truth table Ior the Iunction, rather than the relationship between
the inputs and outputs in normal Boolean circuits.
A conIigurable logic block will contain multiple LUTs with multiplexers selecting
between the outputs oI the inner LUTs. The output oI the Iinal multiplexer becomes the
output oI the logic block. This allows each conIigurable logic block to implement a
Iunction oI more variables than is allowed by the individual LUTs themselves.
FPGAs contain a variety oI interconnections. Usually, there is a Iast interconnect that
connects adjacent CLBs. This allows Iunctions that require multiple CLBs to perIorm
more eIIiciently by placing the subIunctions in adjacent CLBs. For chip-wide routing,
FPGAs include a programmable interconnect that is usually implemented as a switching
matrix using six pass transistors that all possible connections vertically, horizontally, and
diagonally.
The I/O (input/output) blocks (IOBs) provide the interIace between the hardware
design and the outside world. Each IOB contains logic speciIic the input and output oI
signals. Usually, a three-state buIIer is used to conIigure the pin that the block is
connected to as an input pin, and output pin, or a bidirectional pin. Additionally, the
block contains logic that speciIies the current state oI the block (i.e. whether it is an input
or an output pin). The input logic is responsible Ior buIIering the input signal and
22
allo&ing t+e -ignal to be properly capt3red -o it i- pre-entable to t+e re-t o5 t+e FG89:
T+e o3tp3t logic allo&- 5or optional b355ering in a 5lip-5lop= or t+e -ignal can be directly
-ent to t+e pin:
Be5ore any -ignal lea?e- or enter- t+e c+ip= it pa--e- t+ro3g+ an electronic inter5ace
co@ponent t+at deter@ine- t+e electrical c+aracteri-tic- o5 t+e -ignal and control t+e
-a@pling o5 any inp3t ?al3e:
Be-ide- t+e-e t+ree ba-ic co@ponent-= @any ne&er F8G9- pro?ide additional
5acilitie- to @ake t+e de-ign proce-- -i@pler: 9n early addition &a- large array- o5 block
SR9D t+at can be 3-ed 5or data -torage: C+aining toget+er CLB- 5or -i@ple data -torage
can be c3@ber-o@e and area ine55icient= -o 3-ing t+e block SR9D can lea?e @ore c+ip
area 5or t+e +ard&are de-ign: Fa-t carry logic i- o5ten a?ailable &+ic+ allo&- 5a-t adder
de-ign- -3c+ a- carry looka+ead adder- to be i@ple@ented: So@e de-igner- are taking
t+at idea a -tep 5or&ard and i@ple@enting -tatic @3ltiplier- or ot+er 3nit- dedicated to
per5or@ing -i@ple arit+@etic operation-: Ne&er de-ign- can incl3de proce--or- -3c+ a-
t+e 8o&er8C= allo&ing -o5t&are and +ard&are de-ign- to eHi-t on t+e -a@e c+ip:
3.3! #ilin' Spartan-3
T+e XilinH Spartan-3 i- t+e plat5or@ 3-ed 5or t+e de-ign- di-c3--ed in t+e 5ollo&ing
c+apter-: T+e Spartan-3 3-e- SR9D cell- to -tore t+e progra@@ing in5or@ation: Since
t+e SR9D cell- lo-e t+eir -tate &+en t+e po&er -3pply i- t3rned o55= t+ey @3-t be
reprogra@@ed eac+ ti@e t+e c+ip i- po&ered on: T+i- reK3ire- t+e progra@ to be -tored
in a non?olatile 5or@at t+at can be tran-lated onto t+e F8G9:
23
Carry and
Control
Logic
D Q
CLK
CE
S
R
Carry and
Control
Logic
Look-Up
Table
(LUT)
O
I4
I3
I2
I1
D Q
CLK
CE
S
R
YB
Y
C
OUT
G4
G3
G2
G1
Look-Up
Table
(LUT)
O
I4
I3
I2
I1
F4
F3
F2
F1
CLK
CE
BX
C
IN
F
IN
BY
SR
YQ
XB
X
XQ

Figure 3.*—Block diagram of a 5lice in a Spartan-3.
The Spartan-3 CLBs are broken up into two slices. Figure 3.2 shows what a slice
looks like. Each slice contains two two-variable L?Ts and a multiplexer that selects
between the two. The output of each slice can be any function of five variables, or for
some restricted functions, as many as nine variables can be used. An additional
24
$%ltiple+er i- %-ed to -elect bet2een t4e o%tp%t- o5 t4e t2o -lice-6 T4i- enable- eac4 CLB
to operate on any 5%nction o5 -i+ =ariable-> or a re-tricted cla-- o5 5%nction- o5 %p to 1@
=ariable-6 Eac4 -lice al-o contain- carry and control logic t4at 5acilitate t4e
i$ple$entation o5 arit4$etic 5%nction-6 Fig%re 363 -4o2- 4o2 t4e $%ltiple+er- are %-ed
to -elect bet2een t4e t2o inner -lice-6 El-o> t4e CLB contain- t2o -torage ele$ent- t4at
can be con5ig%red eit4er a- latc4e- or 5lip-5lop-6
!"T
!"T
$
$
!"T
!"T
$
$
%!IC(%

!igure (.(*Block diagram o3 a Comple6 7ogic Block 8C7B9 in a ;partan=(.
T4e Spartan-3 contain- a n%$ber o5 block REJ- de-igned to e55iciently -tore
relati=ely large a$o%nt- o5 data6 Eac4 block contain- 1KK o5 data -torage> and anot4er 2K
5or optional parity c4ecking6 Eac4 block %-e- SREJ cell- 5or -torage6 Eac4 block can
25
have a variety oI organizations Ior any combination oI data word size and parity option.
Parity is done at the byte level, so data paths using 4 bytes will have 4 parity bits.
Adjacent to each block RAM is an 18!18 multiplier. The multipliers can be cascaded
to support operands larger than 18 bits wide or to operate on more than 3 operands. When
synthesizing a multiplication operation, the synthesizer can use either an asynchronous
multiplier or a synchronous multiplier that includes a register. Each multiplier outputs a
36-bit wide product.
The Spartan-3 has Iour diIIerent levels oI interconnection. The Iirst type oI
interconnect is the Direct Line that connects adjacent CLBs. This includes connections
that go inside the CLBs to connect the LUTs as well as Iast connections between the
CLBs to the immediate leIt and right. These lines are most Irequently used to a longer
interconnect, which in turn connects to a diIIerent direct line to the destination CLB.
Long lines connect to every sixth CLB and have low capacitance to Iacilitate the transIer
oI high-Irequency signals. Between the Direct Lines and the Long Lines in terms oI
capability are the Hex Lines and Double Lines. Hex Lines connect to every third CLB
and Double Lines connect to alternating CLBs. Both types oI interconnect oIIer a
compromise between connectivity and high-Irequency characteristics.
2#
!h#$ter 4 The !+,-.! Algorithm
The COrdinate Rotation DIgital Computer (CORDIC) algorithm is a technique that
can be used to compute the value of trigonometric functions. CORDIC differs from other
methods of computation because it can be easily implemented using only addition,
subtraction and bit shift operations. It is not as fast as table-based methods, but it can use
significantly less chip area, making it desirable for application where area is more
important than performance.
4.6 7#89gro:;<
The CORDIC algorithm was first proposed by Volder in his paper GThe CORDIC
Trigonometric Computing Technique,H published in 1JKJ L2M. It was originally intended
to be used for real-time airborne computation, but has since found other applications. In
1J71, Walther demonstrated that the algorithmPs domain could also be expanded beyond
computing trigonometric equations to also include hyperbolic and linear (multiply-add-
fused, etc.) functions.
Walther presented a possible hardware implementation using three !-bit registers,
three !-bit adderRsubtractors, three shifters, and a small look-up table. Volder presented a
serial design containing three !-bit"registers, three 1-bit adderRsubtractors, and a number
of shift gates that specified which bits from the registers get fed into the
adderRsubtractors. This thesis will explore both implementations and analySe the
capability of this algorithm to be used in an artificial neural network.
"#
4"# $%&'()*+,-.%/'
The algorithm operate0 in one o2 t3o mode05 Rotation or 7ectoring. The t3o mode0
determine 3hich 0et o2 2unction0 can ;e computed u0ing the algorithm. In Rotation mode=
the !" and #" component0 o2 the 0tarting >ector are input= a0 3ell a0 an angle o2 rotation.
The hard3are then iterati>el? compute0 the !@ and #@component0 o2 the >ector a2ter it ha0
;een rotated ;? the 0peci2ied angle o2 rotation.
In 7ectoring mode= the t3o component0 are input= and the magnitude and angle o2
the original >ector are computed. Thi0 i0 accompli0hed ;? rotating the input >ector until it
i0 aligned 3ith the !"aAi0. B? recording the angle o2 rotation to achie>e thi0 alignment=
3e kno3 the angle o2 the original >ector. Once the algorithm i0 complete= the !@
component o2 the >ector i0 eEual to the magnitude o2 the 0tarting >ector.
The CORDIC algorithm accompli0he0 it0 computation0 ;? u0ing onl? add0=
0u;tract0= and 0hi2t0. Thi0 i0 done ;? iterati>el? rotating the input >ector and 0lo3l?
con>erging on the 2inal rotated >ector. Thi0 i0 done in a 0erie0 o2 3ell@de2ined 0tep0. The
2ir0t rotation 0tep 0ee0 the >ector rotated
!
!
radian05
H4.1K
!
!
!
=±x
0
= #
0
#$% !
0
±
"
&
( )
x
!
=!!
0
= #
0
'(# !
0
±
"
&
( )

Lhere !
0
and y
0
repre0ent the input >ector aligned at the origin 3ith magnitude !
!

and angle !
!
5
H4."K
!
!
i
= #
i
cos!
i
$
i
= #
i
sin!
i

In each 0u;0eEuent 0tep= a ne3 angle o2 rotation !
i
i0 determined 0uch that5
28
(4.3)

!
!
" tan
!1
2
!!
# $
,!where ! % 0
This restriction is crucial in allowing the rotation calculations perIormed in each step
to be accomplished using only an add (or subtract) and a shiIt.
!
!
!
!
"
!+1
"
!
#
!
#
!+1
$
!

!i#$%&'()*+,-&.'i'i/'-0&'123451'67#8%i-09)
Figure 4.1 shows what each step in the CORDIC algorithm looks like. At each step,
a decision is made whether to rotate the vector by !!
!
or !"
i
. The outcome oI both oI
these decisions is shown in the Iigure. The expression Ior the rotated vector in the (i¹1)th
step is:
(4.4)

x
i"!
# !""
!"i
#$% !
i
$"
i
% &
y
i"!
# !""
!"i
%&' !
i
$"
i
% &

When applying the restriction in (4.3), the shiIting and adding become evident:
29
(4.5)

!
i+!
=
!
K
i
!
i
!$
i
"
!i
%
i
( )
%
i+!
=
!
K
i
%
i
+ $
i
"
!i
!
i
( )
#$%&%'K
i
= !+ "
!"i
(!$
i
=±!

!
"
is the magnitude error term, and !
i
corresponds to the rotation direction (¹1
corresponds to a rotation away Irom the !-axis). The !
!!
terms in the top and bottom
equations correspond to a leIt shiIt oI !
"
and !
i
, respectively (when operating in base 2).
This shiIted value is then added (or subtracted) to the current value oI the component.
Volder reIerred this operation "#$%%&'((i*i$+. It is this cross-addition that enables the
algorithm to be used eIIectively in digital hardware.
As illustrated in Figure 4.1, all rotation steps eIIect an increase in the magnitude oI
the input vector by a Iactor oI

!!"
!"!
with each rotation. This error is introduced as a
consequence oI the algorithm`s derivation Irom the Givens transIorm, which rotates a
vector by a speciIied angle:
(4.6)
!
!
! = ! cos!" "sin!
!
" = "cos!+!sin!

These terms can be rearranged using the basic trigonometric identity

tan! =
sin!
cos!
:
(4.7)
!
!
x " !"#! x " y $%&! # $
!
y " !"#! y % x $%&! # $

Using the same restriction in (4.3), we get the same relationships as in (4.5). The
error term is Irom the presence oI the cosine term in (4.7), which is independent oI the
rotation direction, since cosine is symmetric about the rotation direction, since cosine is
30
symmetric about the !-axis. Moreover, this error accumulates with each step, so iI the
number oI iterations is set, the total error in the algorithm is independent oI the input
angle.
II the Iirst rotation is given by
(4.8)

!
!
" !#"
!" ! ( )
"
#
$%& !
#
# #
!
"
!
( )
$
!
" !#"
!" ! ( )
"
#
&'( !
#
# #
!
"
!
( )

Then the second rotation is given by
(4.9)
!
!
!
= "+!
!! " ( )
"+!
!! ! ( )
"
#
$%& ! + #
#
"
#
+ #
"
"
"
( )
$
!
= "+!
!! " ( )
"+!
!! ! ( )
"
#
&'( ! + #
#
"
#
+ #
"
"
"
( )

The n-th rotation can be extended and a general deIinition derived:
(4.10)
!
!
"+!
= !+"
!" ! ( )
!+"
!" " ( )
! !+"
!"" "
#
$
%
&
'
#
#
$%& ! +$
#
"
#
+$
!
"
!
+!+$
"
"
"
( )
%
"+!
= !+"
!" ! ( )
!+"
!" " ( )
! !+"
!"" "
#
$
%
&
'
#
#
&'( ! +$
#
"
#
+$
!
"
!
+!+$
"
"
"
( )

The total increase in magnitude can be speciIied as

!
"
" 1#"
!"#
"
"
. This increase
must be accounted Ior when perIorming calculations using this algorithm.
!"# $%%&'&()*+, ./012*/,2
The CORDIC design calls Ior three accumulator registers: the X register, the Y
register, and the angle accumulator (Z). The X and Y registers hold the present #- and !-
components oI the vector as it is being rotated. The angle accumulator holds the total
rotation amount completed at the current iteration.
31
The angle accumulator stores the arguments to the sine and cosine terms in
Equation (4.10):
!
!
"
"#
!
!
!
##
"
!
"
#!##
"
!
"
. However, since this term is equivalent to
the desired rotation angleconstrained to
!
!
!
= !"n
!$
%
!!
( )
the Z register must always
contain the expression
(4.11)
!
!
!
" "
#
"#$
!%
&
!!
# $
#"%
!
"

The arctangent terms can be stored in a small lookup table. When this is done, only
an addition or a subtraction is required to compute the next value in the Z register, since
! ! "1. This table is very small, requiring only one row Ior each iteration oI the
algorithm. Since the iteration count can be Iixed in hardware, the size oI the table is
constant.
The accumulator registers are also used to determine the direction oI rotation. In
Rotation mode, the sign oI the Z register determines the direction. In Vectoring mode, the
Y register determines the direction. These modes are Iurther elaborated in the next
section.
!"! #$%&'()(*$+ -$./0
The CORDIC algorithm operates in one oI two modes: Rotation and Vectoring. The
mode oI operation determines which set oI Iunctions can be computed, and how the
values in the X, Y, and Z registers change each iteration.
32
!"!"#!$%&'&i%)*+%,-*
!"!#!$%
!
&
'
&
0
!&'
&1
!&
'
&
2
!&'
&3
!&
'
&
4
!
"
! " # $
÷ 1.0000 0.0000 30.0000°
0 1.0000 1.0000 -15.0000°
1 1.5000 0.5000 11.5651°
2 1.3750 0.8750 -2.4712°
3 1.4844 0.7031 4.6538°
4 1.4404 0.7959 1.0775°
5 1.4156 0.8409 -0.7124°
6 1.4287 0.8188 0.1828°
7 1.4223 0.8300 -0.2649°
8 1.4255 0.8244 -0.0411°
9 1.4272 0.8216 0.0709°
10 1.4264 0.8230 0.0149°

!i#$%& ()*+,-& ./012. 03454i36 738&)
In Rotation mode, the input vector is rotated over a speciIied angle. The input vector
is speciIied as the initial value the X and Y registers. The rotation amount is input into the
Z register. In order to rotate the input vector over the input angle, the goal oI each
iteration should be to reduce the value in the angle accumulator to 0. Since the arctangent
values Ior each iteration are Iixed, the only way that the value in the Z register can be
controlled is through the ! values. In Rotation Mode, the decision values are deIined to
be:
(4.12)

!
i
=
+!"#i%#&
i
!'
"!" #i%#&
i
$'
#
$
%
%
&
%
%

With this deIinition oI the decision Iunction, aIter " iterations oI the algorithms, it is
known what the values in each oI the registers will be:
(4.13)

X
n
" "
n
X
0
cos Z
0
( )!Y
0
sin Z
0
( )
"
#
$
%
Y
n
" "
n
#
0
cos Z
0
( )% $
0
sin Z
0
( )
"
#
$
%
Z
n
" 0
"
n
" 1%2
!2n
n
&

33
Figure 4.2 shows what happens to the input vector aIter each iteration in the
algorithm. In this example, the vector is initially aligned with the !-axis. The magnitude
oI the vector increases with each step, as the angle oI the vector converges on 30°. The
values oI the X, Y, and Z registers are also shown.
In the Iollowing sections, descriptions oI how various Iunctions can be computed
using Rotation mode will be discussed.
!"!"#"# $%n'()*+,%n')-n.)/+0-12*-13',%-n)41-n,5+16-3%+n)
The computation oI the sine and cosine Iunctions is intrinsic to the Rotation mode,
and can easily be derived Irom (4.13). All that is necessary is to initialize the Y register
with 0, and the X register with the desired scaling Iactor. II there were no gain in the
magnitude oI the input vector, then the X register could be initialized to 1, and values oI
sine and cosine could be read directly out oI the Y and X registers, respectively.
However, the gain means that some scaling must be perIormed beIore computation. AIter
n iterations oI the algorithm, the contents oI the registers will be:
(4.14)
!
X
n
" "
n
X
0
cos $
0
# $
Y
n
" "
n
Y
0
sin $
0
# $
Z
n
!0

ThereIore, in order to compute the sine or cosine oI an angle !, then the Z register
will be initialized with !, Y with 0, and the X register with !
"
, to account Ior the gain.
Figure 4.2 shows the CORDIC process to compute sine and cosine. The initial vector is
aligned, as required. The X register corresponds to the scaled cosine value, and the Y
register corresponds to the scaled sine value. We can determine the unscaled value by
! ! 34
$i&i$in(!)*+!,in-.!&-./+s!12! !
!"
! !#$%$& .!456!+7-89.+:!-,)+6!11!i)+6-)i5ns:!)*+!-.(56i)*8!
2i+.$s!
! sin 30! " # $
X
10
!
10
$
0.8230
1.6468
$ 0.4998 !
<*+!8+)*5$!)5!=589/)+!sin+!-n$!=5sin+!is!)*+!s-8+!,56!>5.-6?)5?@-6)+si-n!=556$in-)+!
)6-ns,568-)i5n.!Ain=+!)*+!)6-ns,568-)i5n!is!$+,in+$!)5!1+!
B4.1CD!
! ! " !"#!
# ! " #$n!
!
E..!)*-)!is!n+=+ss-62!)5!9+6,568!)*+!)6-ns,568-)i5n!is!)5!5n=+!-(-in!.5-$!!!in)5!!!-n$!
.5-$!X!Gi)*! !"
#
:!-n$!H!Gi)*!0.!
4.4.2!$%&'()*+,-.(/%-
!
i " # $
! 1.0000 1.73'1 0.0000(
0 2.7321 0.7321 45.0000°
1 3.0981 -0.6340 71.5651°
2 3.2566 0.1405 57.5288°
3 3.2741 -0.2665 64.6538°
4 3.2908 -0.0619 61.0775°
5 3.2927 0.0409 59.2876°
6 3.2934 -0.0105 60.1828°
7 3.2935 0.0152 59.7351°
8 3.2935 0.0024 59.9589°
9 3.2935 -0.0041 60.0709°
10 3.')35 +0.000) 60.014)(
i
n
i
t
i
a
l
i =
0
i = 1
i = 2
i = 3
i = 4
"
!
!"#$%&'()*+,-./0,'1&234%"5#'647&'
Jn!K+=)56in(!85$+:!)*+!+L/-)i5ns!/s+$!)5!/9$-)+!)*+!6+(is)+6s!6+8-in!)*+!s-8+:!1/)!
)*+!,/n=)i5n!)*-)!$+)+68in+s!)*+!65)-)i5n!$i6+=)i5n!=*-n(+s.!!Jn!&+=)56in(!85$+:!)*+!
"#
$%&'(i*+, *(i-. *' $%i&/ *+- i/01* 2-3*'( 4i*+ *+- ! $5i., 4+i3+ ,-$/. *+$* *+- 2$%1- i/ *+-
Y (-&i.*-( .+'1%8 3'/2-(&- '/ 9-(': ;+- 8-3i.i'/ f1/3*i'/ i/ ('*$*i/& ,'8- i.=
>?:1AB !
"
!
"!"#i%#&
"
# '
!!"#i%#&
"
" '
#
$
%

Ci&1(- ?:" .+'4. +'4 *+- i/01* 2-3*'( 3'/2-(&-. '/ *+- !D$5i. 4i*+ -$3+ i*-($*i'/ 'f
*+- $%&'(i*+,: E. i/ *+- ('*$*i'/ ,'8-, *+- ,$&/i*18- 'f *+- 2-3*'( i/3(-$.-. 4i*+ -2-(F
i*-($*i'/ 'f *+- $%&'(i*+,: ;+- .i&/ 'f Y i. 1.-8 *' 8-*-(,i/- 4+i3+ 8i(-3*i'/ *' ('*$*-,
4i*+ *+- &'$% 'f G(i/&i/& *+- 2$%1- *' 0: I/ *+i. -5$,0%-, *+- $/&%- 'f *+- i/01* 2-3*'(
4i*+ (-.0-3* *' *+- !D$5i. i. 3',01*-8 $/8 f'1/8 i/ *+- J (-&i.*-(: Ki/3- *+- $/&%- 8'-.
/'* .3$%-, *+- (-.1%* i. *+- 3'((-3* $/&%-: L.i/& *+- /-4 8-3i.i'/ f1/3*i'/, $f*-( n
i*-($*i'/. 'f *+- $%&'(i*+,, *+- (-&i.*-(. 4i%% 3'/*$i/=
>?:1MB

X
!
" "
!
X
0
2
#Y
0
2
Y
!
! 0
Z
!
" #
0
#tan
"1
Y
0
X
0
#
$
%
%
%
%
&
'
(
(
(
(
"
!
" 1#2
"2$
!
)

;+- f'%%'4i/& .-3*i'/. 8i.31.. 4+i3+ f1/3*i'/. 3$/ G- 3',01*-8 i/ 2-3*'(i/& ,'8-:
!"!"#"$ %&'()*+,*(-
E. .--/ i/ -N1$*i'/ >?:1MB, *+- $(3*$/&-/* f1/3*i'/ i. i/*(i/.i3$%%F 3',01*-8 i/ *+- J
(-&i.*-( 4+-/ i/ O-3*'(i/& ,'8-: I/ '(8-( *' 3',01*- *+- $(3*$/&-/* 'f $/ $/&%-

!, *+-/
*+- J (-&i.*-( ,1.* G- i/i*i$%i9-8 *' 0, .' *+$* *+- !
"
*-(, i. -%i,i/$*-8 i/ >?:1MB, $/8 *+-
$/&%-

! ,1.* G- -50(-..-8 $. $ ($*i' 'f *+- *4' 2$%1-. i/ *+- P $/8 Y (-&i.*-(.: I* i.
36
possible to initialize X with 1.0, and Y with

!, as is done in Figure 4.3. The result oI
!
!"#$!% &
" #
is correctly computed to be 60°.
(4.18)

!
"
" !
!
# "#$
!%
#
!
$
!
"
#
$
$
$
$
%
&
'
'
'
'
!
"
" ! # "#$
!%
&
$ %
!
"
" '!&

!"!"#"#! $%&'(r*+,-./'01%*,.1*2,r'%3/,.45(6,r*
7r,.38(r9,'/(.*
As mentioned earlier, the Iinal value in the X register contains the scaled magnitude
oI the input vector. This property, combined with the intrinsic computation oI arctangent
at the same time means that the CORDIC vectoring mode automatically does a Cartesian
to Polar coordinate transIormation, just as the Rotation mode does a Polar to Cartesian
conversion. Recall the equation Ior the Cartesian-Polar transIormation:
(4.19)

r " "
!
##
!
! " "#$
!%
#
"

The value Ior ! is computed in the X register, and
!
! in the Z register.
!":! ;x=,.1/.-*'>%*2(9=0','/(.*?(9,/.*
The CORDIC algorithm as presented thus Iar can only compute Iunctions based on
the sine and cosine Iunctions. This is a consequence oI the circular rotations perIormed in
each step. The algorithm is capable oI perIorming linear and hyperbolic rotations as well,
which expands the set oI Iunctions that the algorithm can compute. To allow Ior these
new domains, a new Iactor is introduced to the set oI CORDIC equations. Its value
37
determines in which coordinate system the algorithm will operate. This Iactor is deIined
as
(4.20)
!
!! "!" #"! " #
An ! value oI 1 corresponds to the hyperbolic domain, 0 to the linear domain, and
¹1 to the circular domain. When this Iactor is applied to Equation (4.5), the Iollowing
general-purpose deIinition oI the CORDIC algorithm is obtained (assuming n iterations):
(4.21)

!
"
" #
"
!
!
"#$ !
"
$
# $
%%
!
$$%& !
"
$
# $
!
"
#
$
%
&
%
"
" #
"
%
!
"#$ !
"
$
# $
'!
!
$$%& !
"
$
# $
!
"
#
$
%
&
&
"
" &
!
%!
"

As in 4.2,
!
! is the elementary rotation perIormed each step, with

!
!
being the sum
oI the rotations perIormed. This rotation is deIined so that the operation perIormed Ior
each step in the algorithm reduces to a shiIt and adds:
(4.22)
!
!
!
=
!"#
!$
%
!!
( )
& "=+$
%
!!
& "= '
!"#(
!$
%
!!
( )
& "=!$
"
#
$
$
$
$
$
%
$
$
$
$
$

With these conditions, the algorithm reduces to:
(4.23)
!
x
i"1
#
1
K
i
x
i
!md
i
2
!i
y
i
$ %
y
i"1
#
1
K
i
y
i
"d
i
2
!i
x
i
$ %
z
i"1
# z
i
!d
i
!
i
where K
i
# 1" m2
!2i
,!d
i
#&1,! m" !1, ),"1 ' (

38
Domain Growth Factor, !
"
Constant After
!
Circular
!"1 # $


!"!! "
"!
!
#
=!#!$$%0%'"(
15 iterations
!
Linear
!" 0 # $


1""! #
"!
!
#
=1$"""""""""


Hyperbolic
!=!1 ( )


1+ !1 ( )" 2
!!
= 0.828159361
!
#

15 iterations
Table 4.1-CORDIC growth factors for each computation domain.
The values oI the CORDIC growth Iactors are shown in Table 4.1. The values are
rounded to nine decimal places. The Constant AIter column shows how many iterations it
takes until the displayed value remains constant. For all but the linear domain, which
always has a Iactor oI 1, the displayed value can be used Ior any implementation with 15
or more iterations.
As beIore, the # and $ components are stored in the X and Y registers. The deIinition
oI Z depends on the computation mode, with hyperbolic tangent being used instead oI the
standard tan Iunction when in the hyperbolic domain, Ior example:
(4.24)
!
z
i"!
#
z
i
!d
i
tan
!!
2
!i
$ %
&! m #!
z
i
!d
i
2
!i
&! m # '
z
i
!d
i
tanh
!!
2
!!
$ %
&! m #!!
"
#
$
$
$
$
$
%
$
$
$
$
$

The Iollowing table summarizes what Iunctions can be computed and in which mode
and domain they can be computed:
39
!"#$%&'&(") +",-
."#'() /"&'&(") 0-1&"2()3
!
!i#c%&a#
!"( # $

!
! " "
n
!
!
"#$ $
!
# $
% " "
n
!
!
$%& $
!
# $
'(&$
!
"
%
!


! = "
n
!
!
"
#$
!
"
% = %
!
##$%
!&
$
!
!
!
"
#
$
$
$
$
%
&
'
'
'
'

!
!"#$%&
!" 0 # $

!
! "!
!
# "
!
#
!

!
! " !
0
#
"
0
#
0


Hyperbolic
!=!1 ( )


X = K
n
X
0
"#$h Z
0
( )
Y = K
n
X
0
$inh Z
0
( )
tanhZ
0
=
Y
X
&
Z
0
= X +Y


! " "
n
!
0
2
!$
0
2
% " %
0
# tanh
!1
$
0
!
0
"
#
$
$
$
$
%
&
'
'
'
'
ln ! $ % " 2%, with! !
0
"!#1,!$
0
"!!1
! "
1
"
n
!, with !
0
"!#
1
4
,!$
0
"!!
1
4

4'56- 789:;%)1&(")< &='& 1') 5- 1"#$%&-, %<()3 &=- !>/.?! '63"2(&=#8
In addition to the addition oI the ! domain selector, a couple other modiIications
must be made to the algorithm when Iunctioning in other modes.
In the circular domain, the algorithm starts with the registers in their initialized states
and the algorithm begins with a sequence oI " values oI the Iorm 0, 1, 2, 3, 4, . and so
on until the desired number oI iterations has been completed. In the Iirst step, no shiIt is
perIormed as a result oI " being equal to 0. In the linear domain, the sequence goes 1, 2, 3,
4, 5, . and so on, with a shiIt happening in the Iirst step. Finally, as explained in 4.6, the
hyperbolic domain`s " sequence is Iurther complicated by the necessity to repeat
iterations. The " sequence in this domain is 1, 2, 3, 4, 4, 5, . with every 3# ¹ 1 iteration
being repeated.
40
!"#! $%&'()*(&+(,
The algorithm is now fully defined, and it will now be demonstrated that the
algorithm converges on the previously noted functions. The specific region of
convergence will also be shown.
!"#"-!./(,$%&'()*(&+(,$%&0i2i%&,
Assume that the CORDIC algorithm is in vectoring mode. Let

!
!
be the angle of the
input vector after the !th iteration of the algorithm. The algorithm tries to reduce the angle
of the input vector. Therefore, after step

i +!, the angle of the vector will change such
that
(4.25)
!
!
!"!
# !
!
!"
!

where
!
!
!
is the rotation performed in the !th iteration of the algorithm. In order for
the algorithm to converge in " iterations, then all subsequent iterations must bring

! to
within
!
!
!!!
of zero. If this is the case, then when the "th iteration completes, the input
angle will be zero. Since the rotations accumulate, the following condition is derived:
(4.26)
!
!
!
! !
"
""!#1
#!1
"
$!
#!1

In order for the algorithm to converge, the condition in (4.26) must hold when the
algorithm first begins:
(4.27)
!
!
!
! "
!
!"!
"!"
"
#"
"!"

To find the domain of initial values for which the algorithm converges, the above
inequality is solved:
41
(4.28)
!
!"# !
$
" # $"
!!%
% "
"
"$$
!!%
"

Since the deIinition oI

!
!
depends only on !, the domain oI convergence can be
computed. The domains oI convergence Ior the three computation domains are shown in
Table 4.3 by using (4.28). By evaluating the limit oI the

! terms as ! approaches
!
!, and
observing that

!
!!!
"" as

i !", the

!
!!1
term can be dropped Irom (4.28).
!"#$%&'&(") +"#'() ,$$-".(#'&/ +"#'() "0 !")1/-2/)3/
!(-3%4'-

!=! # $
!
!
!
" !"#
!$
%
!!

tan
!1
2
!!
( )
!=1
"
#
$1.74329
5()/'-

m= ! # $
!
!
!
" !
!!

!
!!
!="
"
#
="
67$/-8"4(3

!"!! # $
!
!
!
= tanh
!1
2
!!

!
tanh
!1
2
!!
" #
!
"
#1.11817
! $ 1, 2, 3, 4, 4, 5, " % &

9'84/ :;<=+"#'()> "0 3")1/-2/)3/ 0"- &?/ !@A+B! '42"-(&?#;
Note that the condition (4.26) is not met when in the hyperbolic domain iI the
standard non-repeating sequence (e.g. 1, 2, 3,.) is used. In order Ior the algorithm to
converge, certain iterations must be repeated. SpeciIically, it is necessary Ior steps ¦4, 13,
40, . ,

3k +1} to be repeated. This comes as a consequence oI the Iollowing: |1|
(4.29)

!
!
! !
"
""!#1
#!1
"
#
$
%
%
%
&
'
(
(
(
(
!!
3!#1
$!
#!1

!"#"$ %&''()'()*'+,-&.-+/-)*&01-&02)
In order to demonstrate that

!
!
will converge to at most

!
!!!
, it will Iirst be proven
by induction that the Iollowing is true:
(4.30)
!
!
i
""
"!1
# "
#
#$1
"!1
"

42
First, (4.30) is true Ior ! ÷ 0, as a result oI (4.27). To prove that (4.30) is true Ior
! ¹ 1,

!
i
is subtracted Irom (4.30) and (4.26) is applied to the leIt side. This yields:
(4.31)
!
! !
!!!
" !
"
"##"!
!!!
"
#
$
%
%
&
'
(
(
$!!
#
$ "
#
!!
#
$ !
!!!
" !
"
"##"!
!!!
"
#
$
%
%
&
'
(
(

When (4.25) is applied, the Iollowing results:
(4.32)
!
!
!"!
#"
"!!
" "
#
#$!"!
"!!
"

ThereIore, by induction, it is proven that (4.30) is true Ior all ! ! 0. II
!
! " n , then
(4.33)

!
!
""
!!1

which proves that the CORDIC algorithm converges iI the input angle is within the
domain oI convergence deIined in (4.28) is satisIied.
!"#"$ %&'()*+)',) .' /&010.&' 2&3)
The domain oI convergence deIinition and prooI in 4.6.1 and 4.6.2 assumed that
CORDIC is operating in vectoring mode. II " is substituted Ior

! in the equations in
those sections, then the domain oI convergence can be Iound and proven Ior the rotation
mode. In particular, (4.28) shows that " has the same domain oI convergence as
!
! :
(4.34)

ma# !
$
" # $ ma# !
$
" #
4#
!"# $%%&'(%)*(n,*-''.'*
With the algorithm now fully defined, and its convergence proven, the accuracy and
precision of the algorithm will now be discussed. Generally speaking, for each iteration
of the algorithm, an additional bit of accuracy is obtained.
>rror creeps into the results from many sources. ?aturally, since there is not an
infinite number of bits available to any real hardware device, rounding errors are inherent
to any implementation. Additionally, since there are a finite number of iterations to any
real-world implementation of the algorithm, the desired rotation angle can never be fully
realiBed. This results in an angle approximation error.
!"#"/ 0&12'3%(4*526'272n8(83.n*
The CORDIC algorithm operates using bit-shifts, the natural representation for any
number used in the algorithm is going to be a fixed-point representation rather than
floating-point. A fixed point number is a scaled integer, where the binary point is implied
to be ! positions to the left of the JKL. As a result, the integer is 2
!
times larger than the
number it is representing. This is similar to storing 2.N4 as 2N4. This scale, combined
with the number of bits available completely determines the range and precision available
to all numbers in the algorithm. The scale determines how many bits are available to the
left and right of the binary point.
As the binary point moves to the right, greater ranges of numbers are available, but
the scale becomes coarser, with larger gaps between adOacent numbers. Jikewise, as the
binary point moves to the left, the scale is finer, but the range of possible numbers is
smaller.
44
Guard digits can also be employed at both ends oI the binary word to enhance
accuracy. The guard digits are not used in any input values and are reserved Ior bits
appearing in Iinal values. For example, at least 2 bits are necessary in circular mode,
because the !
"
Iactor is grater than 1, which means that the magnitude oI the values will
grow as the algorithm works. Guard digits on the least-signiIicant side are employed to
reduce rounding error.
!"#"$ %&'()*(+,-..&.,
When the number oI bits required to represent a value exceeds the number oI bits
available in the system, the designer has two options: round or truncate. In a truncation,
the excess bits are discarded. In a round, the representation is altered depending on the
excess bits. Rounding works better in the CORDIC algorithm, because the maximum
rounding error is only halI as large as the error resulting Irom truncation.
Rounding errors are introduced in the course oI completing each iteration oI the
algorithm. These errors accumulate, which has the eIIect oI Iurther oIIsetting the Iinal
result. However, Walther pointed out in |1| that the rounding oI the results in each oI N
iterations never results in more than log
2
! bits oI error. In order to negate this eIIect,
the system can use

! "!"#
$
! bits oI storage Ior all intermediate results.
!"#"/ 0(+12,033.&4*567*&(,-..&.,
It has been shown that the algorithm will always converge on an accurate result iI
given inIinite precision. However, Ior some values, it may take an inIinite number oI
iterations to arrive at the accurate result. As a result, the actual rotation oI the input vector
is only an approximation oI the desired angle. This angle approximation error is the
45
desired rotation angle minus the actual rotation angle (which is the sum oI all the
intermediate angles):
(4.35)
!
!! " ! " !
"
"
"
""!
#""
#
, where
!
! is the desired rotation angle
Since this error decreases with the number oI iterations, an obvious way to improve
the accuracy oI the algorithm is to increase the number oI iterations perIormed. However,
there are some limits to how many iterations can be added and still improve accuracy.
First, the rounding error discussed in 4.7.2 increases with the number oI iterations, so this
eIIect must be considered when adding to the iteration count. Further, since the angles are
quantized, the accurate angle may never Iit into the bit width provided. II B bits are used
Ior storage, and there are P bits to the right oI the binary point, then the value oI the least-
signiIicant bit in storage is given by

!
!B
!
P
. This is the smallest number that can be
accurately speciIied in the system. ThereIore the last rotation angle (the smallest) must be
able to Iit in the available space. This yields the Iollowing condition:
(4.36)
!
!
n!1
"2
!"
2
P

II this does not hold, then the angle actually stored will be 0, which will result in no
Iurther rotations oI the input vector.
"#
!"#$%&' ) *"& +#'#,,&, -.$,&.&/%#%01/
$%is (%esis s(*+ies (,- +i..e/en( i123e1en(4(i-ns -. (%e 567D95 43g-/i(%1. <-(%
+esigns 4/e s=n(%esi>e+ .-/ (%e ?i3in@ A24/(4nBC ?5CA15FF *sing (%e ?i3in@ 9AG He/si-n
I.1i 4n+ si1*34(e+ *sing J-+e3Ai1 He/si-n #.F4. $%is K%42(e/ s(*+ies 4 s(/4ig%(.-/,4/+L
24/433e3 +esign K-nsis(ing -. (,- s%i.( /egis(e/sL 4 s(4n+4/+ /egis(e/L .-*/
4++e/Ms*N(/4K(-/sL 4 3--O*2 (4N3eL 4n+ 4 K-n(/-3 *ni(. $%e 4/e4 4n+ (i1ing /eP*i/e1en(s
4/e +isK*sse+L 4n+ si1*34(i-ns -. (%e +esign 4/e 2/esen(e+.
$%e +esign is K-n.ig*/e+ (- Ne in %=2e/N-3iK /-(4(i-n 1-+e in -/+e/ (- K-12*(e (%e
e@2-nen(i43 .*nK(i-nL N*( 433 K-12-nen(s 4/e 2/esen( .-/ 4 K-123e(e3= gene/43B2*/2-se
567D95 *ni(. 6n3= 1in-/ K%4nges (- (%e K-n(/-3 *ni( ,-*3+ Ne neKess4/= (- i123e1en(
(%e -(%e/ 1-+es. $%is +esign 43s- inK3*+es (%e %4/+,4/e neKess4/= (- 2e/.-/1 (%e /e2e4(
i(e/4(i-ns /eP*i/e+ in (%e %=2e/N-3iK +-14in. $%e K-n(/-3 *ni( K4n Ne 1-+i.ie+ (- 433-,
.-/ 1-/e 4 s-2%is(iK4(e+ i(e/4(i-n /e2e(i(i-n sK%e1e.
$%is i123e1en(4(i-n *ses CQ Ni(s -. 2/eKisi-nL ,i(% (%e Nin4/= 2-in( .i@e+ s*K% (%4(
(%e/e is -ne sign Ni( 4n+ (%/ee in(ege/ Ni(s (- (%e 3e.( -. (%e Nin4/= 2-in(L 4n+ QR Ni(s
/e2/esen(ing (%e ./4K(i-n43 24/( -. (%e H43*es. $%e +esign *ses QSs K-123e1en( 4/i(%1e(iKL
giHing (%e /egis(e/s in (%e +esign 4 /4nge -. H43*es ./-1 TR.F (- I.F. $%e 2-si(i-n -. (%e
Nin4/= 2-in( is i123iKi(L 4n+ n-( s2eKi.ie+ 4n=,%e/e in (%e +esign. $%e H43*es in (%e
3--O*2 (4N3e 4/e gene/4(e+ *sing (%e s2eKi.ie+ Nin4/= 2-in( 2-si(i-nL 4n+ in -/+e/ .-/
K-//eK( /es*3(s (- Ne 4K%ieHe+L 433 in2*(s 1*s( Ne sK43e+ 4KK-/+ing3=. $%e -n3= K%4nge
neKess4/= (- 433-, .-/ 4 +i..e/en( Nin4/= 2-in( 2-si(i-n ,-*3+ Ne in (%e H43*es s(-/e+ in
(%e 3--O*2 (4N3e. U- K%4nges (- (%e +esign ,-*3+ Ne /eP*i/e+.
47
!"#! De&'()*
!
"
1
X
Y
!
"
1
X0
Y0
INIT
INIT
#
!
atanh
()T
LD
SHAMT
LD
SHAMT
LD
ADDR
SUBXY
SUBXY
SUBZ
!"#$%"&
'#($
X
Y
Z
START
LD
SUBXY
SUBZ
INIT
SHAMT
DONE
X+Y
$hifted value
un$caled value
input value
NOT./ 3he 4 and 5 regi$ter$ are $pecial regi$ter$ that al$o have
$hifting capa9ilit:. 3he output$ are a$ $ho<n a9ove.
=
X
Y
exp(Z0)
"
1
Z0
INIT
ADDR

Figure 5.1-Block-level diagram of the complete parallel CORDIC unit.
All major components oI the CORDIC processing unit are shown in Figure 5.1. The
X, Y, and Z registers contain the current values oI the X, Y, and Z components oI the
CORDIC algorithm. They each have parallel load capabilities controlled by the LD input
signal. 2:1 multiplexers control the input values to each oI the registers. An INIT output
by the control unit allows initial values to be loaded beIore computation begins. When
INIT is asserted (active high), then the multiplexers pass the initial values into the
register. During the computation phase, the INIT signal is not asserted, allowing the
output oI the adders to be Ied back into the registers.
The control unit uses the sign oI the Z register to generate the subtract signals that
are Ied into the adder/subtractors. With the appropriate minor modiIications to the control
unit, the generation oI the SUBXY and SUBZ signals can be modiIied so that the unit
48
operates in vectoring mode. The design Ior the control unit is discussed Iurther in 5.1.1.
The values oI the X and Y registers are input into a Iinal adder to generate the
exponential Iunction. AIter the algorithm is complete, the result will be
!
"
!
. The unit can
also compute
e
!Z
!
by changing the adder into a subtractor.
Rather than compute the arc-hyperbolic tangent, a small lookup table is used to store
the pre-computed values oI this Iunction. The table has one row per iteration oI the
algorithm. II a general-purpose CORDIC algorithm were to be designed, an additional
lookup table would be necessary to store the necessary arctangent values. For the linear
domain, an additional shiIt register would also be necessary. A multiplexer would then be
used to select the value Irom the appropriate table. The control unit provides the address
used to access the data Irom the table.
U#$% A'()*(+*$ U,)*)-(,)./
S*)1$# 329 13,312 2.5°
S*)1$ 2*)342*.3# 123 26,624 0.5°
LUT# 607 26,624 2.3°
IO9# 132 221 59.7°
9RAM# 1 32 3.1°
<CL># 1 8 12.5°
T(+*$ ?@ABO'$C(** %$')1$ D,)*)-(,)./ E.C ,F$ 3(C(**$* %$#)G/@
The device utilization Ior entire design is summarized in Table 5.1. The design uses
2.4° oI the XC3S1500`s available slices, leaving ample space Ior usage by a neural
network. The Xilinx synthesizer also reported that the design can handle a theoretical
maximum clock Irequency oI 75.0 MHz. Naturally, testing is necessary to determine the
true timing requirements Ior the design. The Iollowing sections will discuss the design
and area requirements IorI the major components oI the unit.
"#
5.1.1 The Control Unit
$he 'ontrol unit i/ re/pon/i1le for regulating the flo5 of data through the CORDIC
unit. $he unit i/ a finite /tate ma'hine ha>ing three /tate/? a/ /ho5n in Figure 5.2. $he
default /tate for the unit i/ the IDCD /tate. In thi/ /tate? the regi/ter/ are not loading? and
the unit i/ /taE/ in thi/ /tate until the S$GR$ /ignal i/ a//erted. Hhen the S$GR$ /ignal
i/ a//erted? the unit mo>e/ into the IRDCOGD /tate.
In the IRDCOGD /tate? the regi/ter load /ignal i/ a//erted? and the IJI$ /ignal i/
al/o a//erted? 'au/ing the initial >alue/ to 1e loaded into the regi/ter/ Kthe/e >alue/ are
repre/ented a/ LM? NM? and OM in Figure 5.PQ. $he unit then ad>an'e/ to the CORIS$D
/tate. $he regi/ter load /ignal/ remain a//erted? 1ut the IJI$ /ignal i/ not a//erted. G/ a
re/ult? the regi/ter/ are loaded 5ith the >alue/ from the 'orre/ponding adder/? rather than
the input >alue/. $he unit maintain/ an internal iteration 'ount? and 5hen the/e >alue/ are
eTual? the 'ontrol unit mo>e/ 1a'U into the IDCD /tate? 5ith the DOJD /ignal a//erted?
/ignifEing that the >alue/ output 1E the regi/ter/ are >alid. $a1le 5.2 /ho5/ the
un'onditional output/ for ea'h /tate. Other /ignal/? /u'h a/ the looUVup ta1le addre//?
SWGR$? and SSX /ignal/ >arE and are 'ontinuou/lE updated 5hile in the CORIS$D
/tate. $he logi' 1ehind the /tate tran/ition/ i/ /ho5n in Figure 5.Y.
!DL$
%R$L'AD C'M%+T$
SHAMT 0 ST'%1AL
START

Figure 5.2-State diagram for the parallel CORDIC unit.
5#
IDLE PRELOAD COMPUTE
INIT # 1 #
DONE 1 # #
LD # 1 1
Table 5.2-Unconditional control unit outputs for each state in the parallel design.
D Q
C
D Q
C
CLK
SHAMT
START
!

Figure 5.3-State transition logic for the parallel CORDIC unit.
The generation of the SHAMT signal is complicated by the need to perform repeat
iterations, and the need to wait one cloc> cycle to read hyperbolic tangent data from the
L@T. Bigure 5.4 shows the logic behind the SHAMT signal, as well as the other signals
that are state independent (S@BXH, S@BI, ADDR, and DONE). The subtract signals
S@BXH and S@BI are determined by the sign of the I register, since the PORDIP unit
is operating in rotation mode. Therefore, the sign bit of the I register is connected
directly to the subtract inputs of the corresponding adders.
The SHAMT register contains the shift amount, which also corresponds to the
current iteration count, which is then connected to the X and H registers. Since data from
the loo>-up table ta>es one cloc> cycle to be retrieved, SHAMT is reTuired to lag one
cycle behind the ADDR signal, which is fed to the loo>up-table. This is shown as a direct
connection between the output of the ADDR register and the input of the SHAMT
51
register. It is the contents oI the ADDR register that determine when a repeat iteration is
necessary.
The current ADDR is compared against the value oI the NXTRPT register. The
NXTRPT register contains the next iteration that must be repeated. It is initialized to 4. In
hyperbolic mode, every !! !" iteration must be repeated (where ! is the previously
repeated iteration). This computation can be perIormed using a single adder. 3! is
computed by inputting NXTRPT into the Iirst input oI the adder and ! ! "#$%&$
(accomplished with a bit shiIt). By asserting the carry-in input, !! NXTRPT!' is
computed. When the current value oI ADDR and NXTRPT match, then NXTRPT is
updated, and ADDR gets loaded with its current value, rather than the output oI the adder
that increments its value.
When SHAMT matches the constant STOPVAL, then the DONE signal is asserted,
and the unit returns to the IDLE state.
!"31% S'(XY
S'(!
ADDR
$%TR'T
1
"
#
+
DD
C
in
#
+
DD
C
in
$$ 1
"
1
&
S)AMT
&
STO',AL
DO12

!"#$%& ()*+,-#". /&0"12 30& 4353&6"12&7&12&13 -$37$34 -8 30& 75%599&9 .-13%-9 $1"3)
52
"#$% &'()*(+*$ ",)*)-(,)./
0*)1$# 39 13,312 0.3°
0*)1$ 2*)342*.3# 29 26,624 0.1°
5"6# 71 26,624 0.3°
789# 113 221 51.1°
9:&;# 0 32 0.0°
<=5># 1 8 12.5°
6(+*$ ?@ABC$')1$ D,)*)-(,)./ E.F ,G$ 3(F(**$* 1./,F.* D/),@
The area requirements Ior this component oI the CORDIC unit are detailed in Table
5.3. The control unit accounts Ior 12° oI the slices used by the entire design.
5.1.2!The Shift -egisters
The X and Y registers require shiIt capability. Since the shiIt amount increases with
each iteration oI the algorithm, a standard single-bit shiIt register cannot be used. A
register capable oI shiIting a variable amount is necessary. Fundamentally, there are two
approaches to designing such a register. The easiest and most area-conscious is to
perIorm a single one-bit shiIt per clock cycle and repeat as necessary until the value has
been shiIted by the desired amount. The single bit shiIt can be hard-wired, minimizing
area requirements. This approach requires the computation to pause until the shiIting is
complete. This is used as part oI the area optimization gained Irom the bit-serial
implementation discussed in Chapter 6.
It is preIerred that the shiIt be perIormed in one clock cycle. This can be
accomplished using multiplexers, and is illustrated Ior a 4-bit register in Figure 5.5. For
this design, which uses 32-bit registers, 32 32 ! # multiplexers are used Ior each shiIting
unit. The registers have two outputs, the unshiIted output direct Irom the register, and the
shiIted output, which is the output Irom the multiplexers. The SHAMT signal controls the
amount by which the output is shiIted. The shiIt register perIorms an arithmetic shiIt
53
ope'a)*o+ )o e+,u'e p'ope' e.e/u)*o+ 0he+ +e2a)*3e 3alue, a'e u,e5. 7o+,e8ue+)l9: )he
;<= >)he ,*2+ b*)@ *, /o++e/)e5 )o )he ,h*A)e5 ou)pu), oA )he Bul)*ple.e', )o allo0 Ao' a
,*2+ e.)e+,*o+ )o o//u' 0he+ ,h*A)*+2.
!"#
$"#
0
1
2
3
%H'FT*3+
0
1
2
3 0
1
2
3
0
1
2
3
%H'FT*2+
%H'FT*1+
%H'FT*0+
D*3+
D*2+
D*1+
D*0+
CL/ %HAMT

Figure 5.5—Design of the variable shift register.
Used Available Utilization
Slices CD 13:31F D.GH
Slice Flip-Flops I5 FI:IFJ D.FH
LUTs 1IF FI:IFJ D.IH
IOBs 1DJ FF1 JG.1H
BRAMs D 3F D.DH
GCLKs 1 K 1F.5H
Table 5.H—Device utilization for the parallel shift register.
The ,h*A) 'e2*,)e', u,e a ,*2+*A*/a+) po')*o+ oA )he 5e3*/e a'ea: 'ela)*3e )o )he 'e,) oA
)he 5e,*2+: a, ,ho0+ *+ Table 5.J. Mo)e )ha) )he )able o+l9 /o3e', a ,*+2le 'e2*,)e': a+5
)he'e a'e )0o ,h*A) 'e2*,)e', *+ )he )o)al 5e,*2+. The N 'e2*,)e' *, a ,)a+5a'5 'e2*,)e'
0*)hou) ,h*A)*+2 Au+/)*o+al*)9.
5.1.$ %&e )**+u- %a/0e
The lookup )able /o+)a*+, )he 3alue, oA )he a'/ h9pe'bol*/ )a+2e+) +e/e,,a'9 Ao' ea/h
*)e'a)*o+ oA )he al2o'*)hB. P 7 p'o2'aB 0'*))e+ Ao' )ha) pu'po,e 2e+e'a)e, )he QRSL Ao'
)he )able. The p'o2'aB u,e, )he 7 Ba)h l*b'a'9 Au+/)*o+, *+ /o+Uu+/)*o+ 0*)h a A*.e5V
54
point conversion function to output a 1HDL file containing a description of the lookup
table in fixed-point notation.
"#e% &'a)lable ",)l)-a,)o/
0l)1e# 0 13,312 0.0%
0l)1e 2l)p42lop# 0 26,624 0.0%
L"T# 0 26,624 0.0%
789# 38 221 17.2%
9:&;# 1 32 3.1%
<=L># 1 8 12.5%
Table 5.5ABe')1e u,)l)-a,)o/ DoE ,he paEallel lookup ,able.
The presence of the lookup table has essentially zero impact on the area requirements
for the unit, since the entire table can be synthesized into a single LRAM, as shown in
Table 5.5.
5.1.$ Design Summar1

2)HuEe 5.IA0l)1e# u#e% bJ ,he 'aE)ou# 1oKpo/e/,# oD ,he paEallel =8:B7= u/),.
The entire design uses very little of the QPSA’s resources, and Qigure 5.6 shows that
much of the QPSA’s slices are used up by the X and Y shift registers. The large
multiplexers required by the design result in massive resource requirements. Chapter 6
55
presents an alternate design that aims to Iurther reduce the area requirements, but at the
cost oI speed.
With only 2.4° oI the Spartan-3`s slices used by this design, there is still plenty oI
room leIt Ior a system such as a neural network that uses the CORDIC unit. II necessary,
the resource requirements could be Iurther reduced by reducing the precision oI the unit
(to 16 bits, Ior example).
At 31 iterations (with 2 repeats), plus a PRELOAD cycle, the algorithm will take 34
clock cycles to complete. At the theoretical maximum clock Irequency oI 75.0 MHz, the
unit will take 0.45 !s to compute the Iinal value.
5.2 $imulation Results
This section presents the results oI simulations perIormed using the design speciIied
in 5.1. The design uses 32 bits to represent each word, with 4 bits Ior the integer part oI
the number, and 28 bits Ior the Iractional part oI the number. Thus the values output by
the X, Y, Z, and the adder that computed the exponential Iunction are eIIectively scaled
up by a Iactor oI 2
28
. The simulated clock runs at 100 MHz.
5.2.1 $imulation 1: Computing e With CORDIC ;rowth
Error
The Iirst simulation perIormed demonstrates the eIIect oI the CORDIC gain K
"
. To
compute the exponential Iunction e
$
, the X register must be initialized with 1, the Y
register with 0, and the Z register with the desired argument to the exponential Iunction.
This simulation attempts to compute the value oI e, so an initial value oI 1 is placed into
the Z register.
5#

!"#$%& ()*+,&-$./- 01 2 304&.5"6 -"6$.2/"07 2//&68/"7# /0 9068$/& /:& ;2.$& 01 e) <- 2 %&-$./ 01
=>,?@= #%0A/:B /:& 1"72. 9068$/&4 ;2.$& "- "7299$%2/&)
$he results of this simulation are shown in 3igure 5.6. $he algorithm iterate7 for a
total of 89 times :in;lu7ing the <R>?OAD state an7 the two repeat iterations reDuire7 in
the hEperFoli; 7omainG. $he final heHa7e;imal Ialues in the J an7 K registers are
L9625NC5
L#
an7 P3Q262AR
L#
respe;tiIelE. Shen ;onIerte7 to 7e;imal an7 s;ale7
a;;or7inglET theE Fe;ome L.266Q
LP
for the J register an7 P.Q688
LP
for the K register. $he
Ialue of the eHponential fun;tion is the sum of these two numFers: 2.25L2
LP
T whi;h
mat;hes the output of the > signal.
$he a;tual Ialue of e to four 7e;imal pla;es is 2.6LN8. $his simulation 7emonstrates
an error in ;omputation FE P.9#6L. $his is a result of the CORDIC magnitu7e error fa;tor
"
i
7is;usse7 in 9.2 an7 illustrate7 in 3igure 9.2. In or7er to pro7u;e a;;urate resultsT the
J register nee7s to Fe preWs;ale7 FE the gain fa;tor ".
5.#.#!Simulation #. Computing e Compensating for
CORDIC :rowth
In or7er to ;ompute an a;;urate Ialue of eT the growth oFserIe7 in 9.2 must Fe
a;;ounte7 for. $o pro7u;e an uns;ale7 final IalueT the initial Ialue must Fe 7iIi7e7 FE
the growth fa;tor. $he formula use7 to ;ompute the amount of growth is shown in :9.28G.
A pre;ompute7 Ialue is shown in $aFle 9.L. 3or this simulation rather than initialiXing J
with LT J will Fe initialiXe7 with the heHa7e;imal Ialue L85L>N62
L#
T whi;h eDuals
L.2P65
LP
. K is again initialiXe7 with P an7 Y with L.
5#

!"#$%& ()*+,&-$./- 01 2 304&.5"6 -"6$.2/"07 8069$/"7# /:& ;2.$& 01 e) <= 2880$7/"7# 10% >?,@A>
#%0B/:C /:& 1"72. 8069$/&4 ;2.$& "- 288$%2/&)
$%& '&(ult( o- t%i( (imulation a'& (%o2n in 3igu'& 5.6. $%& -inal 7omput&9 :alu& o- e
i( output on t%& ; <u(. =t( %&>a9&7imal :alu& 2@#;152B
1C
7on:&'t( to 2.#163
1E
F 2%i7%
mat7%&( t%& a7tual :alu& o- e. G%&n taH&n to 3E 9&7imal pla7&( t%& :alu& output in ; i(
2.#1626166EE211EEEEEEEEEEEEEEEEE
1E
F 2%i7% 9i--&'( -'om t%& '&al :alu& o- e <I onlI
E.EEEEEEE515C2E5E165E633EEE
1E
. $%& onlI 2aI -o' t%i( &''o' to <& '&9u7&9 2oul9 <& to
u(& mo'& p'&7i(ion in t%& '&gi(t&'(. Gi9&ning t%& '&gi(t&'(F o' (%i-ting t%& <ina'I point to
t%& l&-t 7an a7%i&:& t%i(. =n a99ition to allo2ing mo'& p'&7i(ion in t%& '&(ult(F it al(o
allo2( g'&at&' p'&7i(ion -o' t%& :alu&( (to'&9 in t%& a'7 %Ip&'<oli7 tang&nt looHup ta<l&.
!"#"$ %&'()*+&,-.$/.0,'1(+&-2.!
34
.
=t i( al(o impo'tant -o' t%& unit to 7omput& t%& :alu& o- t%& &>pon&ntial -un7tion -o'
n&gati:& a'gum&nt(. Sin7& t%& 9omain o- 7on:&'g&n7& -o' t%& algo'it%m i( (Imm&t'i7
a<out EF %al- o- t%& po((i<l& a'gum&nt( a'& n&gati:&. $%i( (imulation 7omput&( t%& :alu&
o- e
–1
. L i( initialiM&9 2it% 1.2E#5
1E
to &n(u'& an un(7al&9 '&(ultF an9 N i( initialiM&9 2it%
E. $%& O '&gi(t&' 7ontain( t%& a'gum&nt -o' t%& &>pon&ntial -un7tion an9 i( t%u( initialiM&9
to –1
1E
P3EEEEEEE
1C
in 2Q( 7ompl&m&nt notationR.

!"#$%& ()D+,&-$./- 01 2 304&.5"6 -"6$.2/"07 8069$/"7# /:& ;2.$& 01 e
EF
)
5#
$he simulation results presented in Figure 5.6 sho7 that the 89R;<8 unit is a=le to
handle the negative argument. $he final value in the @ register A05@C;5#8
1E
F
corresponds to 0.HEI#I6JH5JC
10
, 7hich is of the same accuracL as the computation for
e
1
.
5.2.$ Simulation $/ Computing Outside the Domain of
Convergence
Ms discussed in J.E, the 89R;<8 algorithm does not converge on an accurate
ans7er for all input values. $a=le J.H sho7s the domains of convergence for all the
computation domains supported =L the 89R;<8 algorithm. For the rotation mode, 7hich
this design uses, the domain of convergence is a limiter on the magnitude of the initial
value of N. <f N 7ere initialiOed 7ith a value 7hose magnitude is greater than 1.11#1I
10
,
then the final computed value 7ould =e incorrect. For this simulation, the P register is
again initialiOed 7ith 1.C0I5
10
, and Q is again initialiOed to 0, =ut N is initialiOed to 5, so
that the unit 7ill attempt to compute e
5
.

!"#$%& ()*+,-&.$/0. 12 3 415&/6"7 ."7$/30"18 917:$0"8# 0;& <3/$& 12 e
(
) 6"89& 0;& "8"0"3/ <3/$& ( ".
1$0."5& 0;& 5173"8 12 918<&%#&89&= 0;& 917:$0&5 <3/$& ". "891%%&90)
$he simulation results sho7n in Figure 5.10 demonstrate that the 89R;<8 unit is
una=le to compute e
5
. $he final value output =L @ is H.056H
10
, 7hich is no7here near the
correct value A1J#.J1HC
10
F. H.056H
10
acts is an asLmptote for the computed values of @.
9nce the initial values of N move outside the domain of convergence, the computed @
value rapidlL approaches H.056H
10
, and remains at that value no matter ho7 large a value
"#
$ is initiali+e- .ith. 1 similar as4mptote e7ists 8or .hen $ is initiali+e- .ith 9al:es that
are too negati9e. In this =ase> the ? 9al:es approa=h @.AB6#
1@
.
!"#! $%&&'()
The parallel -esign is a straight8or.ar- implementation o8 the CGHDIC eJ:ations.
?9en .ith the large register si+es> the -esign has a small 8ootprint insi-e the SpartanLA.
Mith iterations lasting one =lo=N =4=le> the per8orman=e is =onsistent. Oer8orman=e =an Pe
impro9e- P4 :sing 8e.er iterations> .hi=h :s:all4 entails :sing a smaller .or- si+e. The
sim:lations -emonstrate a==eptaPle a==:ra=4> espe=iall4 gi9en the area reJ:irements 8or
the -esign. Qor s4stems that reJ:ire an e9en smaller 8ootprint> Chapter 6 presents a
smaller> slo.er PitLserial implementation that 8:rther aims to re-:=e area reJ:irements P4
:sing 1LPit a--erRs:Ptra=tors an- simple shi8t registers.
60
Chapter 6 *he +erial Implementation
A straightIorward implementation oI the CORDIC algorithm is discussed in Chapter
5. While that design required very little oI the FPGA`s resources, there might be some
applications where Iurther area reduction is necessary. It was observed in 5.1.4 that the
two variable-width shiIt registers accounted Ior halI the resource requirements Ior the
design. This chapter presents an alternate design that uses bit-serial arithmetic along with
one-bit adders to execute the CORDIC algorithm. It will be shown that this design
requires Iar less oI the FPGA`s resources, but will require more time to execute.
6.1 Design
!
!
X
!
BITC&T
'(
)*
+(
)*
SUB'+
#
!
SUB. .(
)*
S/'T
+Sign
'Sign
atanh
()T
BITC&T

!i#$r& 6.1+,lo/0 1i2#r23 4or t6& 7&ri2l i38l&3&nt2tion o4 t6& C;<=>C 2l#orit63.
As shown in Figure 6.1, the overall the design remains relatively unchanged when
compared to the parallel design shown in Figure 5.1. All the changes were made to the
individual components oI the design. The X, Y, and Z registers are now standard one-bit
shiIt registers with parallel load and parallel output capability. This keeps their internal
logic simple and reduces area requirements. The adder/subtractors are now reduced to
operating on single-bit operands, which also reduces their complexity. The adders also
61
contain a Ilip-Ilop that saves the carry output so that it can be applied to the next pair oI
operands. This allows the 32-bit addition to be perIormed one bit at a time.
Each iteration requires the X and Y cross-values to be shiIted by the current iteration
count. The two multiplexers attached to the X and Y registers allow the correct bit to be
selected Ior the corresponding register. Since the results oI the addition or subtraction are
Ied back as the serial inputs to the registers, an additional SEXT control signal and 2:1
multiplexer is needed to choose either the bit selected Irom the register or the
corresponding sign bit. Without this signal, bits computed previously in the iteration
would be Ied into the adders. The sign oI each register is saved prior to the execution oI
each iteration, allowing Ior an arithmetic shiIt oI the values being Ied into the adders.
A multiplexer was also used to allow the correct bit to be Ied Irom the arc hyperbolic
tangent lookup table to the Z-adder. The multiplexer allows the lookup table to be placed
in BRAM and prevents any additional clock cycle delays Irom being introduced into the
design, as would be the case iI the value were loaded into a shiIt register. As beIore, the
control unit provides the address into the lookup table.
This design uses a 32-bit word size, with 4 bits Ior the whole part oI the number, and
28 bits Ior the Iractional part oI the number.
Used Available Utilization
Slices 139 13,312 1.0°
Slice 2lip42lops 143 26,624 0.5°
LUTs 256 26,624 1.0°
IOBs 132 221 59.7°
BRAMs 1 32 3.1°
<CL>s 1 8 12.5°
Table 6.ABOverall device utilization for the serial design.
62
The device utilization Ior the serial design is shown in Table 6.1. This design uses
only 139 oI the XC3S1500`s available slices, as opposed to the 329 required by the
parallel design. This is a reduction oI 57.8°. In addition, the Xilinx synthesizer reported
that a maximum clock Irequency oI 134.8 MHz can be used. The parallel design tops out
at 75.0 MHz. The simplicity oI this design results in less combinational logic delay
which in turn allows Ior Iaster clock speeds.
!"1"1 C%&'(%) +&,'
The parallel design allowed each iteration to be completed in one clock cycle. This
reduced the complexity oI the control unit. The switch to a serial design means that the
number oI clock cycles required to complete each iteration depends on the word size,
since all values are computed one bit at a time. Previously, the subtract control signals
that determine whether each adder is adding or subtracting could be evaluated each clock
cycle. In the serial design, these values need to be determined once at the beginning oI
the iteration and saved until the current value is Iully computed. To accomplish this, a
new state is added to the state machine oI the control unit, as shown in Figure 6.2.
63
IDLE PRELOAD
SETUP COMPUTE
ST#RT
%ST#RT
&'T()T + ,1
#). S/#0T 1 ,1 &'T()T 1 ,1
&'T()T + ,1
#). S/#0T + ,1

!"#$%& ()*+,-.-& /".#%.0 12% -3& 4&%".5 "065&0&7-.-"27 21 -3& 89:;<8 .5#2%"-30)
The IDL* and ./*LO1D states remain unchanged. The IDL* state is active when
the algorithm is not being executed. The ./*LO1D state is used to load the initial values
into the X, Y, and Z registers. The S*TF. state is the new state and is only active the
very first clock cycle of each iteration. In the S*TF. state, the values in the X, Y, and Z
registers are evaluated to determine the subtract and sign signals. These signals are saved
into registers for use during the remaining clock cycles of the current iteration. The
JOM.FT* state is active until all 32 bits have been computed. Once this happens, IDL*
becomes the active state, otherwise if more iterations remain, the machine returns to the
S*TF. state to begin execution of the next iteration.
1 new internal control value is also added, named BITJOT. Whereas SH1MT holds
the current iteration count, BITJOT holds the number of bits that have been processed
within the current iteration. When this counter rolls over to zero, then the active iteration
is completed.
6#
IDLE PRELOAD SETUP COMPUTE
INIT $ % $ $
DONE % $ $ $
PLD $ % $ $
SHIFT $ $ % %
Table 6.2-State outputs for the serial control unit.
To manage the new shift registers, a new SHIFT signal is also added. When asserted,
the SHIFT signal causes the shift registers to load the input bit into the MSB position,
and shift out the LSB. The old LD signal has been changed to cause a parallel load
operation in the shift registers. This allows execution of the algorithm to begin sooner.
Otherwise, without the parallel load, the initial values would need to be shifted into the
register. The values of these signals in each of the four states are shown in Table 6.H.
Used Available Utilization
Slices 3J %3,3%H $.3K
Slice Flip-Flops 33 H6,6H# $.%K
LUTs 6L H6,6H# $.3K
IOBs %H6 HH% 5L.$K
BRAMs $ 3H $.$K
GCLKs % J %H.5K
Table 6.3-Device utilization for the serial control unit.
The device utiliNation for the control unit is shown in Table 6.3. The parallel control
unit uses 3O slices, and this control unit uses 3J slices, making the area reQuirements for
the two units essentially the same. This unit reQuired more slice flip-flops T33U than the
parallel unitVs HO flip-flops.
6.#.$ %he )hift -egisters
The W, X, and Y registers in the serial design are all identical shift registers with
parallel load and parallel output capability. In addition to the clock signal, the registers
have two control inputs, two data inputs, and two data outputs. The data inputs are
65
labeled as S+I- and /. S+I- is the bit to be shifted in to the MSB position of the register
and / is the parallel data input. The t=o control signals are /LD and SAIFT. Chen
SAIFT is assertedD then the contents of the register shift to the right everF clock cFcleD
=ith S+I- becoming the ne= MSB. Chen /+LD is assertedD then the parallel data in / is
loaded into the register at the next rising edge of the clock.
!L#
!
31
&'()
&'*+T
!L#
!
3-
!L#
!
-
!'*+T
31
!'*+T
3-
!'*+T
-
.L/ .L/ .L/

!"#$%& ()*+,&-"#. /0 12& -&%"34 -2"01 %&#"-1&%-)
The basic design is sho=n in Figure 6.3. Chen /LD is not assertedD then the data
inputs for each flip-flop are the data outputs of the adLacent flip-flip to the left. Chen
/LD is assertedD then the bits in / become the inputs for each of the flip-flops.
-ot sho=n in the figure are the clock enable inputs for the flip-flops. After the
algorithm is completeD the values should be held constantN that isD neither shifting nor
loading. The SAIFT and /LD inputs are connected through an OR gate to the clock
enable inputs of each of the flip-flops. As a resultD the register =ill onlF shift if SAIFT is
asserted and =ill onlF do a parallel load if /LD is asserted. In anF caseD the parallel
outputs are available as the /+OUT signalD and serial output is available as the S+OUT
signalD =hich al=aFs corresponds to the LSB of the register.
66
"sed A'aila+le ",ili-a,i./
Sli1es 2$ 1&,&12 (.2*
Sli1e 2li342l.3s &2 26,624 (.1*
5"Ts 4$ 26,624 (.2*
789s 44 221 1$.$*
9:A;s ( &2 (.(*
<=5>s 1 8 12.-*
Ta+le 6.4BCe'i1e D,ili-a,i./ E.r a si/Gle serial shiE, reGis,er Ii/1lDdi/G JKL1 NDl,i3leOerP.
Since this design does not have the large amount of multiplexers required by the
variable shift register described in -.1.2, the resource requirements of this register are
significantly reduced. The device utiliFation for one shift register is shown in Table 6.4.
The table includes the bit-select multiplexer shown in Figure 6.1 that is used to perform
the bit-shift operation. Only 2$ slices are required for this design, compared to $( for the
parallel shift register. The parallel shift registers are each 21(* larger than the serial
registers. Since the parallel shift registers were used twice in the parallel design, this area
reduction is the key to the small area requirements of the entire serial design.
!"#"$!%&' )**+,- %./0'
The lookup table remains essentially unchanged. In the serial design, a multiplexer is
added to choose the individual bits to be fed into the serial adder.
"sed A'aila+le ",ili-a,i./
Sli1es 8 1&,&12 (.1*
Sli1e 2li342l.3s ( 26,624 (.(*
5"Ts 16 26,624 (.1*
789s 12 221 -.4*
9:A;s 1 &2 &.1*
<=5>s 1 8 12.-*
Ta+le 6.QBCe'i1e D,ili-a,i./ E.r ,he serial l..RD3 ,a+le.
The parallel lookup table did not require any slices. The entire design was
implemented in one BROM. Since the serial design requires a &2:1 multiplexer, this
67
implementation oI the lookup table requires 8 slices. This small increase in area is more
than oIIset by the savings in the registers and adders.
!"#"$ %&' S'*i,- .//'*0
!
!
"
!
in
C
!
$%&'Cin
A
B
SUB
S
!
$u&

S)*

S
CLK

!"#$%& ()*+,&-"#. /0 12& -&%"34 355&%)
The adders that compute the X, Y, and Z values in the serial CORDIC unit are
simple one-bit adder/subtractors with additional hardware to support serial arithmetic.
The design oI these adders is shown in Figure 6.4. The additional hardware is necessary
to save the carry value Irom one stage and then to Ieed it to the next. Without this logic,
the Iinal sum would be incorrect since there would be no carry propagation.
The C
out
output oI the adder is connected to the input oI a D Ilip-Ilop. The data
output oI the Ilip-Ilop is in turn Ied into a 2:1 multiplexer. When a subtraction is being
perIormed, it is necessary to have an initial carry oI 1 to allow Ior the 2s complement oI
the B input. The NEW¸Cin signal is generated by the control unit and is asserted only in
the SETUP state and when a subtraction is called Ior. In the COMPUTE state, NEW¸Cin
is not asserted, allowing the carry out oI the previous stage to be passed through to the
carry in oI the current stage.
68
The adder that computes the exponential value is a 32-bit parallel adder that uses the
parallel outputs of the ; and < registers.
"#r%&' (&r&''#' )*&%'&+'# "#r%&' ,-%'%.&-%/n
"'%1#2 4 16 13,312 0.0%
"'%1# 3'%453'/42 1 0 26,624 0.0%
6,72 7 32 26,624 0.0%
89:2 7 97 221 3.2%
:;)<2 0 0 32 0.0%
=>6?2 1 1 8 12.5%
7&+'# @A@BC#*%1# D-%'%.&-%/n E/r -F# 2#r%&' &nG 4&r&''#' &GG#r2A
As the design would imply, the resource requirements for this component are quite
small. Table 6.6’s serial column shows that only 4 slices are required to implement this
design. The parallel column shows the device utilization for the adder that computes the
exponential function.
!"#"$!%&'()* ,-../01

3%HDr# @AIB"'%1#2 D2#G +J -F# *&r%/D2 1/K4/n#n-2 /E -F# 2#r%&' >9;C8> Dn%-A
This design is even more efficient at using the FPOA’s resources, requiring much
less of the chip’s area than the parallel design. As Figure 6.5 shows, the distribution of
69
the chips resources to the components is very balanced, with a near even balance between
the 3 registers and the control unit. Included in the 'Other¨ category are the three serial
adders, the 32-bit adder, and the arc hyperbolic tangent lookup table. With only 1° oI the
Spartan-3`s slices used in this design, even larger applications can be supported. The
serial nature oI the design means that extra precision can be added to the system without
having a large increase in area.
While the area requirements Ior the design are small, the timing requirements are
not. For a system using n iterations (and r repeats) with a word size oI $ bits, then the
total execution time oI the unit will be

! " "# # $"1 clock cycles. Each iteration will
require $ cycles to complete as the values are passed serially through the adders, and n ¹
r total iterations are completed aIter the algorithm Iinishes. The PRELOAD state oI the
control unit adds another clock cycle oI execution time. The added clock cycles are
partially oIIset by an increase in clock Irequency. The linear convergence oI the CORDIC
algorithm means that the number oI iterations should be close to the number oI bits in the
registers. A halving oI the word size (and a corresponding halving oI the iteration count)
will reduce the execution time by a Iactor oI 4.
This design, which perIorms 31 iterations, 2 repeated, with a word size oI 32 bits
will take 1,057 clock cycles to complete. At the theoretical maximum clock rate oI 134.8
MHz, the unit will take 7.84 !s to compute the Iinal value.
!"#! $%&'()*%+, ./0'(*0
This section presents the results oI simulations perIormed using the design speciIied
in 6.1. The design uses 32 bits to represent each word, with 4 bits Ior the integer portion,
and 28 bits Ior the Iractional portion oI the value. The same simulations are perIormed
70
here as are done in 5.2, in order to demonstrate that this unit produces identical results,
but at the cost oI additional clock cycles.
6.#.$!Simulation $/ Computing e 3ith CORDIC 9rowth
<rror
The Iirst simulation perIormed demonstrates the eIIect oI the CORDIC gain !
n
. To
compute the exponential Iunction #
$
, the X register must be initialized with 1, the Y
register with 0, and the Z register with the desired argument to the exponential Iunction.
This simulation attempts to compute the value oI #, so an initial value oI 1 is placed into
the Z register.

!igure 6.6*Results of a 2o3elSim simulation of the serial CORDIC unit attempting to =ompute the
>alue of !. ?s a result of CORDIC growthA the final =ompute3 >alue is ina==urate.
The results oI this simulation are shown in Figure 6.6. The algorithm iterated Ior a
total oI 34 times (including the two repeat iterations required in the hyperbolic domain).
The Iinal hexadecimal values in the X and Y registers are 147258C5
16
and 0F9272AB
16

respectively. When converted to decimal and scaled accordingly, they become 1.2779 Ior
the X register and 0.9733 Ior the Y register. The value oI the exponential Iunction is the
sum oI these two numbers: 2.2512, corresponding to the value oI the E signal.
71
These results correspond precisely to the values produced in 5.2.1. Whereas the
parallel unit completes in under 8 !s, the serial unit takes just over 200 !s to produce the
Iinal value.
6.#.#!Simulation #. Computing e Compensating for
CORDIC :rowth
In order to compute an accurate value oI e, the growth observed in 4.2 must be
accounted Ior. To produce an unscaled Iinal value, the initial value must be divided by
the growth Iactor. The Iormula used to compute the amount oI growth is shown in (4.23).
A precomputed value is shown in Table 4.1. For this simulation, rather than initializing X
with 1, X will be initialized with the hexadecimal value 1351E872
16
, which equals
1.2075
10
. Y is again initialized with 0 and Z with 1
10
.

!"#$%& ()*+,&-$./- 01 2 304&.5"6 -"6$.2/"07 01 /8& -&%"2. 9:,;<9 $7"/ =06>$/"7# /8& ?2.$& 01 e)
@A 2==0$7/"7# 10% 9:,;<9 #%0B/8C /8& 1"72. =06>$/&4 ?2.$& "- 2==$%2/&)
The results oI this simulation are shown in Figure 6.7. The Iinal computed value oI e
is output on the E bus. Its hexadecimal value 2B7E1524
16
converts to 2.7183
10
, which
matches the actual value oI e. When taken to 30 decimal places the value output in E is
2.718281880021100000000000000000, which diIIers Irom the real value oI e by only
0.0000000515620501850833000. These values are the same as the values produced by
the parallel unit. As beIore, the only way to increase the precision is to increase the word
72
size and iteration count, which is easier to achieve in the serial design, but comes at the
cost oI longer execution time.
6.#.$!Simulation $/ Computing e
–1

It is also important Ior the unit to compute the value oI the exponential Iunction Ior
negative arguments. Since the domain oI convergence Ior the algorithm is symmetric
about 0, halI oI the possible arguments are negative. This simulation computes the value
oI e
1
. X is initialized with 1.2075
10
to ensure an unscaled result, and Y is initialized with
0. The Z register contains the argument Ior the exponential Iunction and is thus initialized
to 1 (F0000000
16
in 2`s complement notation).

Figure 6.*—Results of a ModelSim simulation of the serial CORDIC unit computing the value of e
@A
.
The simulation results presented in Figure 6.8 show that the CORDIC unit is able to
handle the negative argument. The Iinal value oI the E signal (05E2D58C
16
) corresponds
to 0.36787943542
10
, which is oI the same accuracy as the computation Ior e
1
.
6.#.5!Simulation 5/ Computing Outside the Domain of
Convergence
As discussed in 4.6, the CORDIC algorithm does not converge on an accurate
answer Ior all input values. Table 4.3 shows the domains oI convergence Ior all the
"3
$%m'u)a)+%, -%ma+,. .u''%r)0- 1y )30 456784 a9:%r+)3m. <%r )30 r%)a)+%, m%-0= >3+$3
)3+. -0.+:, u.0.= )30 -%ma+, %? $%,@0r:0,$0 +. a 9+m+)0r %, )30 ma:,+)u-0 %? )30 +,+)+a9
@a9u0 %? A. 8? A >0r0 +,+)+a9+B0- >+)3 a @a9u0 >3%.0 ma:,+)u-0 +. :r0a)0r )3a, C.CCDC"=
)30, )30 ?+,a9 $%m'u)0- @a9u0 >%u9- 10 +,$%rr0$). <%r )3+. .+mu9a)+%,= )30 E r0:+.)0r +.
a:a+, +,+)+a9+B0- >+)3 C.FG"H
CG
= a,- I +. a:a+, +,+)+a9+B0- )% G= 1u) A +. +,+)+a9+B0- )% H
CG
=
.% )3a) )30 u,+) >+99 a))0m') )% $%m'u)0 !
H
.

Figure (.*+,esults of a 3odel5im simulation of the serial C:,;<C unit computing the value of e
5
.
5ince the initial value 5 is outside the domain of convergence, the computed value is incorrect.
J30 .+mu9a)+%, r0.u9). .3%>, +, <+:ur0 6.L -0m%,.)ra)0 )3a) )30 456784 u,+) +.
u,a190 )% $%m'u)0 !
H
. J30 ?+,a9 @a9u0 %u)'u) 1y M +. 3.GHL3
CG
= >3+$3 +. ,%>30r0 ,0ar )30
$%rr0$) @a9u0 NCOD.OC3FP. 3.GHL3 a$). +. a, a.ym')%)0 ?%r )30 $%m'u)0- @a9u0. %? M. 5,$0
)30 +,+)+a9 @a9u0. %? A m%@0 %u).+-0 )30 -%ma+, %? $%,@0r:0,$0= )30 $%m'u)0- M @a9u0
ra'+-9y a''r%a$30. 3.GHL3= a,- r0ma+,. a) )3a) @a9u0 ,% ma))0r 3%> 9ar:0 a @a9u0 A +.
+,+)+a9+B0- >+)3. Q .+m+9ar a.ym')%)0 0R+.). ?%r >30, A +. +,+)+a9+B0- >+)3 @a9u0. )3a) ar0
)%% ,0:a)+@0. 8, )3+. $a.0= )30 M @a9u0. a''r%a$3 G.3F6L.
!"# $u&&ar)
<%r 9ar:0 a''9+$a)+%,. >30r0 <STQ r0.%ur$0. ar0 .$ar$0= )30 .0r+a9 456784 u,+)
-0.$r+10- +, )3+. $3a')0r %??0r. ma,y a-@a,)a:0. %@0r )30 'ara9909 u,+) -0.$r+10- +,
74
Chapter 5. -y operating only on single bits in a serial fashion, the complexity of the shift
registers is reduced significantly. The size of the adders used in the design is also reduced
since they only need to handle one bit at a time. Precision and accuracy are not sacrificed
to achieve the reduction in resource utilization. However, the added execution time may
prevent this design from being used in more time-sensitive applications. For those, the
parallel design or a faster table-based approach would be better suited.
7#
!"#$%&' ) !*+,-./0*+ #+1 2.%.'& 3*'4
$rti(i)ia+ ne.ra+ net/or1s ha4e the potentia+ to 6e)ome in4a+.a6+e too+s (or s8stems
9esigners +oo1ing to imp+ement arti(i)ia+ inte++igen)e into their /or1. <he a6i+it8 o( these
net/or1s to 6e traine9 (or a spe)i(i) pro6+em an9 to operate a.tonomo.s+8 opens .p a
9oor to ne/ /a8s o( 9ata ana+8sis an9 other app+i)ations /here it is 9i((i).+t to 9esign
straight(or/ar9 a+gorithms to pro9.)e the 9esire9 res.+ts.
=>?$s are +i1e/ise 6e)oming in)reasing+8 .se(.+ (or 9esigners. <he a6i+it8 o( the
)hips to 6e reprogramme9 spee9s protot8ping an9 simp+i(ies the 9esign pro)ess (or +arger
s8stems. <he training pro)e9.re (or arti(i)ia+ ne.ra+ net/or1s re@.ires )hanges to interna+
/eights o( the net/or1. <he reprogramma6i+it8 o( the =>?$ means that these )hanges
)an 6e ma9e easi+8. <he a6i+it8 to p+a)e an arti(i)ia+ ne.ra+ net/or1 in an em6e99e9
en4ironment /ith rea+A/or+9 inp.ts (.rther eBpan9s the n.m6er o( app+i)ations a4ai+a6+e.
>re4io.s+8C it /as 4er8 9i((i).+t to 6.i+9 a ne.ra+ net/or1 in an =>?$ 6e)a.se o( the
9i((i).+t8 o( (itting the arithmeti) har9/are on the same )hip as the net/or1.
De.ra+ net/or1s re@.ire )omp+eB mathemati)a+ s.pport in or9er to (a)i+itate the
)omp.tation o( the s.mmationC /eightingC an9 trans(er (.n)tions. Eost arithmeti)
a+gorithms are optimiFe9 (or spee9 an9 re@.ire +arge amo.nts o( )hip area /hen
imp+emente9. De.ra+ net/or1s are inherent+8 para++e+ )omp.ter s8stems an9 in or9er to
.se a +arge arithmeti) .nitC a++ the )omp.tations /o.+9 ha4e to 6e pipe+ine9 thro.gh one
or se4era+ o( these .nits. <his /o.+9 +essen the para++e+ism o( the net/or1C /hi)h a+so
res.+ts in +onger )omp.tation times.
7#
The '()*+' implementations presented are compact enough that they potentially
could be duplicated in the F=>A, allowing each neuron to have its own '()*+' unit.
This maintains the parallelism of the network. Two alternate designs of '()*+' units
have been presented, a parallel design and a much smaller serial design that takes longer
to execute. Either of these could be used in a neural network at the neuron level.
7.# $omparing the 1esigns
The parallel design fits in 32J of the KpartanL3Ms slices and can run at a theoretical
maximum clock frequency of 7O PQz. The serial design is much smaller, needing only
13J slices, and can also run at almost twice the clock frequencyT 13U.V PQz.

Figure 7.1+,lices required by the various components of the two CORDIC designs.
The parallel design requires a large amount of 32T1 multiplexers to accomplish the
shifting. This is necessary because the amount by which the operands are shifted
77
increases each iteration of the algorith0. When switching to the serial a44roach5 these
0ulti4le7ers are eli0inate8 resulting in significant resource sa9ings. As seen in ;igure
7.15 it is this sa9ings that accounts for 0ost of the efficienc= of the 8esign.
>he effect of wor8 si?e on the area re@uire0ent of the 8esigns can Ae seen in ;igure
7.2. >he 4arallel 8esign alwa=s re@uires 0ore area than the serial 8esign5 an8 the nu0Aer
of slices re@uire8 for the 4arallel 8esign increases 8ra0aticall= with each increase of the
wor8 si?e. >he area re@uire0ent of the serial 8esign increases roughl= linearl=5 while the
4arallel 8esign follows a shar4 e74onential cur9e.
>he affect wor8 si?e has on e7ecution ti0e is shown in ;igure 7.C. >he unit is
assu0e8 to Ae o4erating at the 0a7i0u0 clock fre@uenc=5 as re4orte8 A= the Eilin7
s=nthesi?er. >his 9alue is 8ifferent for Aoth the 4arallel an8 serial 8esigns. >he 0a7i0u0
clock fre@uenc= also 8ecreases as wor8 si?e increases5 which further lengthens e7ecution
ti0e with larger wor8 si?es. >he serial 8esign is se9erel= i04lacte8 A= the increase in
wor8 si?e. F7ecution ti0e increases e74onentiall= with wor8 si?e5 while e7ecution ti0e
for the 4arallel 8esign follows a roughl= linear track.
78

Figure 7.2+The effect of word size on the F78A chip area re<uirements of the CORDIC unit.

Figure 7.C+The effect of word size on the execution time of the CORDIC unit.
As can be seen by comparing the two Iigures, the area and timing requirements Ior
these designs are inversely proportional. This is true Ior most designs. In general, when
design changes are made that reduce the timing requirements Ior the system, these same
changes result in an increased area requirement. The move to bit-serial arithmetic, which
7#
optimized the algorithm for area, requires each iteration of the algorithm to take longer.
Each bit in the operands will use one clock cycle to compute the new value. For these
designs, which use the word sizes of 32 bits and 31 iterations Cwith 2 repeatsD, the 3E
clock cycles that the paralle implementation requires are increased to 1,FG7 cycles. This
is an increase of 3FFFI. This is offset somewhat by the increase in clock frequency, but
the serial implementation still takes significantly longer to complete.
For applications that can tolerate less precision, further savings in both area and
computation time can be achieved. Jeural networks in particular are probabilistic in
nature and may not need 32 bits of precision. Kecreasing the word size to 1L bits will
drastically reduce the area requirement for the parallel design, since its area requirement
decreases exponentially with word size. The area requirement of the serial design
depends only linearly on the word size, so the area savings would not be as significant.
The execution time, however, would decrease exponentially.
Noth implementations also have complex control units that are designed with
flexibility in mind. Ot is possible that the clock frequency for both designs could be
increased if the control unit could be simplified for a specific application.
7.# Integration .ith Artificial 4eural 4etwor7s
7.#.9 :;<=I: Artificial 4euron
This section discusses the design of a basic artificial neuron that uses the PQRKOP
unit to compute its transfer function. Figure 7.E shows the design of this neuron. This
sample design uses E inputs, but the design can be scaled to accommodate any number of
inputs.
"0
WT_1
!"#$
!"#%
!"#&
WT_2
WT_3
WT_4
'
(
(
(
!"#)
'
'
'
!"#$%!
lN_X
lN_Y
lN_Z
START
RESET
CLK
E
DONE
*
+
,
-
.
/
.
0
/
0
1
-
0
"21+
34".
!"##$%&'()*"(+%&'(
%,$(!*-,)*"(+%&'(
&(."%)/-&01&(0

Figure 7.*—Bloc0 diagram of a basic artificial neuron utili9ing the CORDIC unit to compute the
transfer function e
x
.
The 4 inputs are weighted by constants that are stored internally in the neuron. 9
separate multiplier computes the weighted value for each input in parallel. 9s per the
artificial neuron model shown in Figure 2.1, these weighted values are then passed
through a summation function. Three adders are cascaded to add the four values together.
The CORDIC unit then computes the transfer function. The output of the final adder is
used as the argument to the function. In this case, since the CORDIC unit has been
designed to compute e
x
, the sum is connected to the INGH input. INGI is initialized to 0,
and INGK is initialized to 1.2075 to ensure an unscaled result. The NT9RT and RENET
signals may be provided externally or may be hardQwired to high or low. The E output of
the CORDIC unit then becomes the output of the neuron (NS9T) and may be connected
to any number of neurons in the next layer of the network.
The multipliers can be of any design, but since all inputs use a fixed point, the output
must be shifted accordingly. The design used here has 2" bits reserved for the fractional
81
component oI the values, so the Iinal product must be shiIted to the right 28 bits to ensure
that it is in the Iormat expected by the CORDIC unit.
Parallel Serial
Slices 661 (5.0°) 463 (3.5°)
Slice Flip-Flops 123 (0.5°) 141 (0.5°)
LUTs 1,273 (4.8°) 893 (3.4°)
IOBs 163 (73.8°) 163 (73.8°)
BRAMs 1 (3.1°) 1 (3.1°)
18!18 Multipliers 16 (50.0°) 16 (50.0°)
GCLKs 1 (12.5°) 1 (12.5°)
Table 7.1-Device resource requirements for the artificial neuron designs using both the parallel and
serial CORDIC units.
The resource requirements oI this design are shown in Table 7.1. The design was
synthesized using both the parallel and serial CORDIC units. The neuron with the parallel
CORDIC unit requires 661 oI the Spartan`s slices, while the serial design only needs 463.
The added area requirement Ior both designs comes Irom the 3 adders Ior the summation
Iunction and the additional routing logic required to interconnect the multipliers and
adders. The neuron is able to beneIit Irom the Spartan`s built-in multipliers. For larger
networks, Iurther chip area will be required to implement the multiplication Iunction. The
serial design is able to operate at a maximum theoretical clock Irequency oI 140.0 MHz,
and the parallel design can operate at a theoretical maximum Irequency oI 77.9 MHz. The
area/time tradeoII holds true Ior the neuron design just as it did Ior the CORDIC unit
alone. The 32-bit adders present in the serial design are Iast enough to allow the neuron
to still operate at the Iaster clock rate.
It should be noted that the serial neuron still uses parallel multipliers to compute the
weighted inputs, and serial adders to compute the summation Iunction. It should be
possible to use both serial adders and multipliers to gain additional area savings.
82
Depending on the designs used, modiIications to the CORDIC design may also be
required should a designer use this approach.
Results oI a ModelSim simulation oI the design are shown in Figure 7.5. The Iour
imputs used in the simulation are 0.3 (04CCCCCC
16
), 0.02 (0051EB85
16
), 2.1
(21999999
16
), and 0.65 (0A666666
16
). Their respective weights are 1, 2, 0.5, and 0.25.
The signals in1¸w through in4¸w show the results oI multiplying the inputs by their
weighting constants.
The signal sum¸all shows the result oI the summation Iunction: the hex value
F73D70A3
16
or 0.547500. The signal nval is the Iinal output oI the neuron and its hex
value oI 09411A07
16
(0.578394( matches the result oI !
0.5475
.
Since the implementation uses the Spartan`s internal multipliers, the intermediate
results oI computing the weighted values are not available. The only delay in the
simulation is Irom the CORDIC unit`s execution. A design utilizing custom multipliers
would have increased execution time.
!

Figure (.*+Simulation of an artificial neuron using the 4 inputs 9:.;, :.:=, =.>, :.?*@ and the weights
9>, =, C:.*, :.=*@.
7.#.#
For any implementation oI an artiIicial neural network into a programmable device
such as an FPGA, at least two possibilities exist Ior implementing the arithmetic
83
computations: a high-speed lookup table 7either on-chip or off: or either of the CORDIC
implementations discussed here.
Table-based implementations work similar to the logarithm tables found in the back
of textbooks, requiring interpolations between rows. These designs can be very fast and
accurate, but require large amounts of chip area. For large networks, it would require a
much larger 7and more expensive: FPGA to allow the lookup table to coexist with the
network. Additionally, all arithmetic calculations would have to be queued, slowing
down the operation of the entire network. If the table implementation is fast enough and
the network small enough, then performance may be comparable to a CORDIC
implementation.
The CORDIC algorithm is powerful enough and small enough to be able to support a
neural network, ideally with a CORDIC unit in each neuron. This would maintain the
parallelism that is key to the operation of the neural network. The longer computation
time is not a maLor shortcoming, since the power of a neural network lies in its
parallelism. The fact that all neurons can be computing in parallel and all neurons can
finish processing their inputs at the same time allows for a natural progression of data
through the network.
The flexibility of the algorithm with its two computation modes that can each operate
in one of three domains means that the same unit can compute a wide variety of
functions. This gives the network designer the flexibility to vary the transfer functions
used with only minimal changes and no redesigning necessary. It would also be possible
for the CORDIC unit to perform the multiplication of the weighting factors with the
"#
$n&'(s* (+,'-+ f,r ne'r,ns 1$(+ 23r-e 34,'n(s ,f $n&'(s* (+$s 1,'25 c,4e 1$(+ 3 se7ere
&erf,r43nce &en32(89
:+e s8s(e4s 5es$-ner 1,'25 c+,,se (+e &r,&er $4&2e4en(3($,n (+3( ;es( f$(s (+e s$<e
,f (+e ne(1,r= (+3( 1,'25 ;e s'&&,r(e59 :+e f3s( (3;2e>;3se5 3&&r,3c+es 1,'25 ;e
&referre5 ;ec3'se ,f (+e$r +$-+ &erf,r43nce* ;'( ,n28 (+e s4322er ne(1,r=s 1,'25 ;e 3;2e
(, effec($7e28 '($2$<e (+e49 ?,r 4e5$'4>s$<e ne(1,r=s* (+e &3r322e2 @ARCD@ 'n$( $s (+e
;es( c,4&r,4$se ;e(1een s$<e 3n5 s&ee59 En5 f,r 23r-e ne(1,r=s* (+e ser$32
$4&2e4en(3($,n 438 ;e (+e ,n28 c+,$ce9
!"# $%&%'( *&%+,
?'r(+er rese3rc+ c3n ;e 5,ne $n(, (+e ;enef$(s (+3( c3n ;e 5er$7e5 fr,4 's$n- (+e
@ARCD@ 'n$( $n 3 ne'r32 ne(1,r=9 F('58 ,f 3 2$7e* &r3c($c32 ne'r32 ne(1,r= 1$22 +e2& $n
'n5ers(3n5$n- +,1 23r-e 3 ne(1,r= c,'25 ;e s'&&,r(e5 3n5 1+$c+ $4&2e4en(3($,n s ;es(
s'$(e5 f,r ne(1,r=s ,f 73r8$n- s$<e9
F('5en(s 438 32s, 1$s+ (, s('58 138s (+3( (+e @ARCD@ 'n$( c3n ;e f'r(+er
,&($4$<e5 s&ec$f$c3228 f,r 'se $n ne'r32 ne(1,r=s9 :+e c,n(r,2 'n$( $s ,ne c,4&,nen( (+3(
$s 3n e3s8 (3r-e( f,r ,&($4$<3($,n* ;'( f'r(+er 3re3 re5'c($,n 438 ;e &,ss$;2e 1+en (+e
'n$( $s $n(e-r3(e5 $n(, (+e s(r'c('re ,f 3 ne'r,n9 :+$s 32s, $nc2'5es 5e(er4$n$n- (+e $5e32
1,r5 s$<e f,r (+e 3&&2$c3($,n9 :+e 3cc'r3c8 ,f (+e ne(1,r= nee5s (, ;e (3=en $n(, 3cc,'n(
1+en 5e(er4$n$n- (+e &rec$s$,n ,f (+e 'n5er28$n- +3r513re9
:+e @ARCD@ 'n$( &resen(e5 +ere $s +3r5>1$re5 (, (+e +8&er;,2$c r,(3($,n32 4,5e9 E
(r'28 c,4&2e(e -ener32 'n$( (+3( $s c3&3;2e ,f ,&er3($n- $n 322 (+ree 5,43$ns $n ;,(+ 4,5es
s+,'25 ;e &,ss$;2e 1$(+ ,n28 3 2$((2e 4,re +3r513re9 E s('58 ,f (+e res,'rce reG'$re4en(s
,f (+$s -ener32>&'r&,se 'n$( c,'25 32s, ;e 5,ne9
85
!"#"$"%&"'
[1] J. S. *alther, “A unified algorithm for elementary functions,” in Spring (oint
Computer Conference, vol. 38, pp.379–385, 1971.
[2] J.E. Holder, “The COLDIC Trigonometric Computing TechniOue,” 12E Trans.
Electronic Computers, vol. EC-8, no. 3, pp. 330–334, Sept. 1959.
[3] H. Dawid and H. Meyr, “COLDIC Algorithms and Architectures,” in Digital
Signal :rocessing for Multimedia Systems, ed. by K.K. Parhi and T. Nishitani,
Marcel Dekker, 1999, pp. 623–655.
[4] L. Andraka, “A survey of COLDIC algorithms for FPGA-based computers,” in
1nternational Symposium on Field :rogrammable @ate Arrays, 1998.
[5] ^. H. Hu, “The _uanti`ation Effects of the COLDIC Algorithm,” 1EEE Trans.
Signal :rocessing, vol. 40, pp. 834–844, July 1992.
[6] IEEE Std. 754-1985, “Standard for Binary Floating Point Arithmetic,” 1985.
[7] A. A. Liddicoat and L. A. Slivovsky. :rogrammable logic. 3
rd
Edition of EE
Handbook. CLC Press.
[8] cilind. Spartan-3 FPGA family: Complete data sheet. Datasheet DS099, cilind.
[9] M.M. Mano and C.L. Kime, BLS1 :rogrammable Logic Devices, Pearson
Prentice Hall, Upper Saddle Liver, NJ, 3rd edition, 2004. Supplement to Logic
and Computer Design Fundamentals.
[10] D. Anderson and G. McNeil, “Artificial Neural Networks Technology,”
DoD Data g Analysis Center for Software, August 1992.
86
|11| M. Skrbek, 'New Neuochip Architecture,¨ Doctoral Thesis. 115 p.CTU,
Faculty oI Electrical Engineering, Prague, 2000.
|12| J. Zhu and P. Sutton, 'FPGA Implementations oI Neural Networks a
Survey oI a Decade oI Progress,¨ Proceedings of -.t0 International Conference
on Field Programmable Logic and Applications ;FPL <==.>, Lisbon, Sep 2003.
|13| M. Figueiredo and C. Gloster, 'Implementation oI a Probabilistic Neural
Network Ior Multi-spectral Image ClassiIication on an FPGA Based Custom
Computing Machine,¨ Proceedings of ?t0 @raAilian Symposium on Eeural
EetForGs, Dec. 1998.
|14| S.L. Bade and B.L. Hutchings, 'FPGA-Based Stochastic Neural
NetworksImplementation,¨ Proc. of IEEE JorGs0op on FPGAs, 1994.
|15| K. Nichols and M. Moussa, 'Feasibility oI Floating-Point Arithmetic in
FPGA based ArtiIicial Neural Networks,¨ in Proceedings of t0e -?t0
International Conference on Computer Applications in Industry and
Engineering, 2002.
|16| J.L. Holt and T.E. Baker, 'Back propagation simulations using limited
precision calculations,¨ in Proceedings of International Joint Conference on
Eeural EetForGs, 1991, pp 121126 vol. 2.
|17| S. Draghici, 'On the capabilities oI neural networks using limited
precision weights,¨ Eeural EetForGs, 2002, 15: p. 395-414
87
[18] '. )u and S.N. Batalama, “An Efficient Learning Algorithm for
Associative Memories,” IEEE $rans. Neural Networks, vol. 11, no. 5, pp. 1058–
1066, Sept. 2000.
[19] S. Halgamuge, “A Trainable Transparent Universal ApproPimator for
DefuRRification in Mamdani-Type Neuro-FuRRy Controllers,” IEEE $rans. Fuzzy
Systems, vol. 6, no. 2, pp. 304–314, May 1998.