
Faculty of Computing and Information Technology Department of Robotics and Digital Technology Technical Report 94-9

A VHDL Implementation of a CORDIC Arithmetic Processor Chip


Grant Hampson, Student Member, IEEE Andrew Paplinski, Member, IEEE October 10, 1994

Enquiries: Technical Report Coordinator, Robotics and Digital Technology, Monash University, Clayton VIC 3168, Australia
tr.coord@rdt.monash.edu.au +61 3 905 3402

Contents

Abstract and Keywords
Preface
1 The CORDIC Algorithm
2 CORDIC Hardware Implementations
  2.1 CORDIC Processor Architecture
    2.1.1 A Word-Serial CORDIC Architecture
    2.1.2 A Word-Parallel CORDIC Architecture
3 Improving CORDIC Accuracy
  3.1 Estimation of CORDIC Accuracy
  3.2 The Lower Bound of CORDIC Accuracy
  3.3 Reducing the z update error
  3.4 Unexpected Truncation Errors
4 VHDL Implementation
  4.1 The Basic CORDIC Unit
  4.2 VHDL Describes Structure and Behaviour
    4.2.1 Hierarchical vs Flat Designs
    4.2.2 The Viewlogic Synthesiser
  4.3 VHDL Design of the CORDIC Unit
    4.3.1 The Rounding Unit
  4.4 Combining the CORDIC Units
    4.4.1 A Solution
  4.5 Improvements
Conclusion
A CORDIC Functions
B Upper Bound of CORDIC Error
References

List of Tables

1.1 Elementary angles of $\alpha_i$
1.2 Various values of $K_n$
4.1 Some CORDIC hardware statistics.
A.1 The six CORDIC modes.

List of Figures

1.1 Rotation of a point in 2-D space.
2.1 Generic Processor Architecture.
2.2 An Optimised Word-Serial CORDIC Architecture.
2.3 Word-Parallel CORDIC architecture with possible data pipelining.
3.1 Numerical accuracy of the CORDIC processor.
3.2 Predicted and Actual accuracy of a CORDIC processor with a 12 bit internal datapath.
3.3 A plot showing bits of error for a typical test vector rotated through all possible angles.
3.4 A 12 bit, 8 stage CORDIC processor produces 9 bit accurate results.
3.5 An 8 bit, 8 stage CORDIC processor produces 7 bit accurate results.
3.6 Simulation results from a CORDIC processor illustrating the effects of the normalisation scheme.
3.7 An 8 bit, 8 stage CORDIC processor (a) without rounding, (b) with rounding.
4.1 The basic CORDIC unit.
4.2 A Hierarchical Design of the Adder/Subtracter for n = 4.
4.3 A Flat Design of the Adder/Subtracter for n = 4.
4.4 A Behavioural Design of the Adder/Subtracter for n = 4.
4.5 The structure of CORDIC unit showing the various entities.
4.6 The top level schematic of a 4 stage CORDIC processor with Increased Convergence Range and Rounding components.

Abstract
This report describes the fundamentals of the CORDIC (Co-ordinate Rotations Digital Computer) algorithm and a possible implementation using the VHDL hardware description language. The errors associated with a fixed point implementation of CORDIC are analysed, and methods for reducing these errors are discussed. One such method is a normalisation scheme which reduces error and requires no extra hardware. Various CORDIC structures and possible VHDL implementations are described in detail, including design and language issues. Finally, a parallel hardware implementation is described and simulated. CORDIC has many applications, some of which can be used for array imaging techniques.

Keywords
CORDIC, VHDL

Preface
CORDIC is an acronym for Coordinate Rotations Digital Computer and was derived by Volder [1] in the late 1950's for the purpose of calculating trigonometric functions. Its popularity came about nearly twenty years later when VLSI solutions became a reality. The original algorithm describes the rotation of a 2-D vector, which can be applied in areas such as Digital Signal Processing [2] (Fourier Transforms, Digital Filters), Computer Graphics [3] and Robotics [4].

CORDIC processing offers high computational rates, making it attractive to applications such as computer graphics where a combination of scaling and rotations is required in real time. CORDIC is also attractive to Robotics since the fundamental operation is coordinate transformation; it could also be used for more computationally intensive processes such as motion planning and collision detection.

Array Imaging typically involves complex signal processing which may require many computationally intensive matrix operations. Increasing the complexity of the imaging model places greater demands on accuracy. Solutions to such complex systems require better, and hence more complex, algorithms. Most of these algorithms are based on matrix factorisation (decomposition) techniques, of which Singular Value Decomposition (SVD) is the most robust method. The SVD factorisation requires a two-sided transformation which involves several trigonometric operations and rotations ideally suited to dedicated VLSI hardware (CORDIC processing) for real time calculations.

CORDIC has also been applied to phase correction in dynamic range focusing when Digital Baseband Demodulation [5] techniques are employed in Interpolation Beamforming [6]. A complex signal is represented by its in-phase, I, and quadrature, Q, components, which are phase corrected by rotating the complex signal.

Haviland and Tuszynski designed and built a CORDIC processor [7] in 1980 which used an iterative process to calculate circular, linear and hyperbolic functions. A more recent implementation (1993) by Duprat and Muller [8] discusses the possibility of using a redundant number system for the representation of a signed digit.

This report is broken into four logical sections, namely, CORDIC Theory, Hardware Implementations, Improving CORDIC Accuracy and finally a VHDL Implementation.

Chapter 1 The CORDIC Algorithm


Consider a 2-D vector $(x, y)$ represented by a point $v = x + jy$ in the complex plane. If the vector is rotated by an angle $\theta$, the new co-ordinate vector is given by:

$$\tilde{v} = v\,e^{j\theta} \tag{1.1}$$

and shown in Figure (1.1).

Figure 1.1: Rotation of a point in 2-D space. (The point $v = x + jy$ is rotated by $\theta$ to $\tilde{v} = \tilde{x} + j\tilde{y}$.)

The angle $\theta$ can be expanded into a set of elementary angles $\alpha_i$ with pseudo-digits $q_i \in \{-1, +1\}$, and angle expansion error $z_n$, such that

$$\theta = \sum_{i=-1}^{n-1} q_i \alpha_i + z_n \tag{1.2}$$

and the sub-rotation angles $\alpha_i$ take on the following values:

$$\alpha_i = \begin{cases} \pi/2 & \text{for } i = -1 \\ \arctan(2^{-i}) & \text{for } i = 0, 1, \ldots, n-1 \end{cases} \tag{1.3}$$

Note that $\alpha_i$ is approximately equal to but less than $2^{-i}$ and the resulting angular expansion error is therefore $|z_n| < 2^{-(n-1)}$.

Substitution of Equation (1.2) into Equation (1.1) gives:

$$\tilde{v} = v \prod_{i=-1}^{n-1} e^{j q_i \alpha_i}\, e^{j z_n} = v\,(j q_{-1}) \prod_{i=0}^{n-1} e^{j q_i \alpha_i}\, e^{j z_n} \tag{1.4}$$

and expanding $e^{j q_i \alpha_i}$,

$$e^{j q_i \alpha_i} = \cos q_i\alpha_i + j \sin q_i\alpha_i = \cos q_i\alpha_i\,(1 + j \tan q_i\alpha_i) = \cos\alpha_i\,(1 + j q_i 2^{-i})$$

Finally

$$\tilde{v} = v\,(j q_{-1}) \left(\prod_{i=0}^{n-1} \cos\alpha_i\right) \prod_{i=0}^{n-1} \left(1 + j q_i 2^{-i}\right) e^{j z_n} \tag{1.5}$$

The range of rotation angles which can be represented by Equation (1.2) is $\pm\theta_{\max}$, where

$$\theta_{\max} = \sum_{i=-1}^{n-1} \alpha_i \approx 190^\circ \tag{1.6}$$

and some values of $\alpha_i$ are given in Table (1.1). If the expected range of rotation angles is $\pm 90^\circ$ then the initial rotation by $\pm 90^\circ$, that is, $e^{j q_{-1} \pi/2} = j q_{-1}$, does not have to be performed and the initial rotation is by $\pm 45^\circ$. The second term in Equation (1.5), the product of cosines, is a constant scaling factor. For a given value of $n$ it can be pre-evaluated using Equation (1.7), and the first 15 are evaluated in Table (1.2).

$$K_n = \prod_{i=0}^{n-1} \cos\alpha_i = \prod_{i=0}^{n-1} \left(1 + 2^{-2i}\right)^{-1/2} = \prod_{i=0}^{n-1} \frac{1}{\sqrt{1 + \frac{1}{4^i}}} \tag{1.7}$$

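The basic CORDIC algorithm which describes rotation of a unity length vector $v = x + jy$ by an angle $\theta$ can be derived from Equation (1.5) using the initial conditions

$$v_{-1} = v\,K_n, \qquad z_{-1} = \theta$$

where $z_i$ is the accumulated angular residue. Then, proceeding with $i = -1, 0, \ldots, n-1$:

$$q_i = \begin{cases} -1 & \text{if } z_i < 0 \\ +1 & \text{if } z_i \ge 0 \end{cases} \tag{1.8}$$

$$v_{i+1} = \begin{cases} v_i\, j q_i & \text{if } i = -1 \\ v_i\,(1 + j q_i 2^{-i}) & \text{if } i = 0, 1, \ldots \end{cases} \tag{1.9}$$

$$z_{i+1} = z_i - q_i \alpha_i \tag{1.10}$$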
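The recurrence in Equations (1.8) to (1.10) can be tried out numerically. The following is a minimal floating-point sketch, not the report's implementation: the function name `cordic_rotate` and its default of 16 iterations are assumptions made for illustration, and complex arithmetic is used in place of the shift-and-add hardware.

```python
import math

def cordic_rotate(x, y, theta, n=16):
    """Rotate (x, y) by theta radians following Equations (1.8)-(1.10)."""
    # Pre-evaluate the scaling factor K_n of Equation (1.7).
    k_n = 1.0
    for i in range(n):
        k_n *= math.cos(math.atan(2.0 ** -i))
    v = complex(x, y) * k_n      # v_{-1} = v K_n
    z = theta                    # z_{-1} = theta
    for i in range(-1, n):
        q = 1 if z >= 0 else -1                  # Equation (1.8)
        if i == -1:
            v = v * (1j * q)                     # rotation by +/- 90 degrees
            alpha = math.pi / 2
        else:
            v = v * complex(1.0, q * 2.0 ** -i)  # Equation (1.9)
            alpha = math.atan(2.0 ** -i)
        z -= q * alpha                           # Equation (1.10)
    return v.real, v.imag, z

x30, y30, residue = cordic_rotate(1.0, 0.0, math.radians(30))
print(x30, y30, residue)
```

The returned residue is the angle expansion error $z_n$, so the rotated vector differs from the ideal rotation by at most that angle.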

 i   Angle           Angle (degrees)   Hex    16-bit binary
 0   arctan(2^0)        45.0000        B400   101101.0000000000
 1   arctan(2^-1)       26.5651        6A43   011010.1001000011
 2   arctan(2^-2)       14.0362        3825   001110.0000100101
 3   arctan(2^-3)        7.1250        1C80   000111.0010000000
 4   arctan(2^-4)        3.5763        0E4E   000011.1001001110
 5   arctan(2^-5)        1.7899        0729   000001.1100101001
 6   arctan(2^-6)        0.8952        0395   000000.1110010101
 7   arctan(2^-7)        0.4476        01CA   000000.0111001010
 8   arctan(2^-8)        0.2238        00E5   000000.0011100101
 9   arctan(2^-9)        0.1119        0073   000000.0001110011
10   arctan(2^-10)       0.0560        0039   000000.0000111001
11   arctan(2^-11)       0.0280        001D   000000.0000011101
12   arctan(2^-12)       0.0140        000E   000000.0000001110
13   arctan(2^-13)       0.0070        0007   000000.0000000111
14   arctan(2^-14)       0.0035        0004   000000.0000000100
15   arctan(2^-15)       0.0017        0002   000000.0000000010
16   arctan(2^-16)       0.0008        0001   000000.0000000001

Table 1.1: Elementary angles of $\alpha_i$

 n   K_n
 0   0.70710678118655
 1   0.63245553203368
 2   0.61357199107790
 3   0.60883391251775
 4   0.60764825625617
 5   0.60735177014130
 6   0.60727764409353
 7   0.60725911229889
 8   0.60725447933256
 9   0.60725332108988
10   0.60725303152913
11   0.60725295913894
12   0.60725294104140
13   0.60725293651701
14   0.60725293538591
15   0.60725293510314

Table 1.2: Various values of $K_n$ (the entry for $n$ is the product of $\cos\alpha_i$ over $i = 0, \ldots, n$)
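Both tables can be regenerated from the definitions of $\alpha_i$ and $K_n$. The sketch below assumes that the 16-bit words hold the angle in degrees with 6 integer and 10 fractional bits, rounded to the nearest code; the function names are hypothetical and only illustrate the calculation.

```python
import math

def elementary_angle_words(count=17, frac_bits=10):
    """Return (i, degrees, 16-bit word) for alpha_i = arctan(2^-i)."""
    rows = []
    for i in range(count):
        degrees = math.degrees(math.atan(2.0 ** -i))
        # Assumed format: 6 integer bits, 10 fractional bits, round to nearest.
        word = round(degrees * (1 << frac_bits))
        rows.append((i, degrees, word))
    return rows

def scale_factor(last_i):
    """Product of cos(alpha_i) for i = 0 .. last_i."""
    k = 1.0
    for i in range(last_i + 1):
        k *= 1.0 / math.sqrt(1.0 + 4.0 ** -i)
    return k

for i, degrees, word in elementary_angle_words():
    print(f"{i:2d}  {degrees:8.4f}  {word:04X}")
for last_i in range(16):
    print(f"{last_i:2d}  {scale_factor(last_i):.14f}")
```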

The final rotated vector is $v_n$, with angle expansion error $z_n$:

$$v_n = \tilde{v}\,e^{-j z_n} = v\,e^{j\theta}\,e^{-j z_n} \tag{1.11}$$

$$z_n = \theta - \sum_{i=-1}^{n-1} q_i \alpha_i \tag{1.12}$$

One complex operation on $v_i$ is equivalent to two operations on real numbers. For $i = -1$:

$$x_0 + j y_0 = j q_{-1}\,(x_{-1} + j y_{-1})$$

Hence

$$x_0 = -q_{-1}\,y_{-1} \tag{1.13}$$

$$y_0 = q_{-1}\,x_{-1} \tag{1.14}$$

For $i = 0, 1, \ldots, n-1$:

$$x_{i+1} + j y_{i+1} = (x_i + j y_i)(1 + j q_i 2^{-i})$$

Hence

$$x_{i+1} = x_i - q_i\,y_i\,2^{-i} \tag{1.15}$$

$$y_{i+1} = y_i + q_i\,x_i\,2^{-i} \tag{1.16}$$

The CORDIC algorithm reduces to an iterative set of operations consisting of a binary shift and an accumulator for each of $x$, $y$ and $z$. Refer to Appendix A for a list of transcendental functions.
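To illustrate that Equations (1.13) to (1.16) need only shifts and additions, here is a small integer sketch. The fixed-point format (14 fractional bits, angles in radians), the 12 iterations and the name `cordic_rotate_fixed` are assumptions chosen for the example; the report's own datapath widths are discussed in Chapter 3.

```python
import math

def cordic_rotate_fixed(x, y, theta, n=12, frac_bits=14):
    """Integer shift-and-add rotation using Equations (1.13)-(1.16)."""
    one = 1 << frac_bits
    # Look-up table of alpha_i, quantised to the assumed fixed-point format.
    alphas = [round(math.atan(2.0 ** -i) * one) for i in range(n)]
    k_n = 1.0
    for i in range(n):
        k_n *= math.cos(math.atan(2.0 ** -i))
    xi = round(x * k_n * one)    # inputs pre-scaled by K_n
    yi = round(y * k_n * one)
    zi = round(theta * one)

    # i = -1: rotation by +/- 90 degrees, Equations (1.13) and (1.14).
    q = 1 if zi >= 0 else -1
    xi, yi = -q * yi, q * xi
    zi -= q * round(math.pi / 2 * one)

    for i in range(n):
        q = 1 if zi >= 0 else -1
        # >> on Python integers is an arithmetic shift (rounds toward minus infinity).
        xi, yi = xi - q * (yi >> i), yi + q * (xi >> i)   # Equations (1.15), (1.16)
        zi -= q * alphas[i]
    return xi / one, yi / one, zi / one

print(cordic_rotate_fixed(1.0, 0.0, math.radians(30)))
```

The result is close to, but not exactly, the ideal rotation: both the truncating shifts and the quantised angle table contribute error, which is the subject of Chapter 3.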

Chapter 2 CORDIC Hardware Implementations


A hardware implementation of a CORDIC processor is dependent on the number of functions required and the computational speed. If all functions are to be computed, then there will be a necessary overhead for selecting each function. However, a small, fast design will result if only a small number of functions is required. This chapter presents possible solutions to a mixture of design problems.

2.1 CORDIC Processor Architecture

A CORDIC algorithm can take on two primary architectures, namely, word-serial or word-parallel. A word-serial processor minimises hardware requirements by utilising a single CORDIC unit repeatedly. However, iterative algorithms which are controlled by a small number of variables can be expanded over a two-dimensional area, i.e., instead of executing a certain set of instructions n times using a single element (e.g., a CORDIC unit), n duplicated elementary cells are used in successive steps of an iteration [9]. This flattened structure can now perform many operations in parallel and is called a word-parallel CORDIC processor. A word-parallel architecture has the advantage of being up to n times faster, but due to the expansion requires, at worst, n times more hardware. However, the word-serial architecture requires complex controlling hardware and a variable shifter, decreasing the hardware saving ratio.

2.1.1 A Word-Serial CORDIC Architecture

The CORDIC algorithm has the advantage of not requiring any special hardware other than an accumulator and a variable shifter, which are generally available in most microcontrollers. A multi-function word-serial CORDIC processor architecture could be realised using a basic micro structure consisting of a two-port register file and a variable shifter combined with an ALU, interconnected by several data paths as shown in Figure (2.1). A generic controller could consist of microcode instructions for the ALU and register file, and would execute an iterative algorithm.

Figure 2.1: Generic Processor Architecture. (Blocks: ROM of $K_n$ values, ROM of $\alpha_i$ values, register file, ALU, CC register, variable shifter producing $2^{-i} y_i$ or $2^{-i} x_i$, and controlling micro-code; input data buses $x_i$, $y_i$, $z_i$; result bus $x_{i+1}$, $y_{i+1}$, $z_{i+1}$.)

This structure is similar to that of a microprocessor or DSP and allows many variations of the CORDIC algorithm, as the order of operations and the expanded instruction set increase flexibility. This type of structure illustrates that it would be possible to implement the CORDIC algorithm on any microprocessor or DSP. Optimising the generic processor structure for a word-serial CORDIC processor is achieved by reducing the functionality to only those operations required by the CORDIC algorithm. A possible word-serial architecture is shown in Figure (2.2), where the ALU now contains three adders and dedicated registers. The microcode controller has been replaced by faster Combinational Control Logic dedicated to the CORDIC operation sequence.

2.1.2 A Word-Parallel CORDIC Architecture

The word-parallel method expands the problem of a single dimensional algorithm into a two-dimensional problem and results in shorter computational times. Greater speeds of computation can be obtained by pipelining between stages so that many partial results can be calculated in parallel. A pipelined word-parallel architecture is shown in Figure (2.3), where each iteration is represented by a separate CORDIC block and a latch is placed after each iteration, or after several iterations. The following chapters will develop, implement, and simulate such a parallel CORDIC structure using the VHDL hardware description language.

Figure 2.2: An Optimised Word-Serial CORDIC Architecture. (Blocks: Combinational Control Logic with Load, Precision, Reset and Clock inputs and Next State, Select and Finished Flag outputs; a counter with Increment and Zero signals; a look-up table of $\alpha_i$ values; n-bit registers for $x$, $y$ and $z$, loaded from the initial inputs $x_0$, $y_0$, $z_0$; adders forming $x_{i+1}$, $y_{i+1}$, $z_{i+1}$ from $x_i$, $y_i$, $z_i$, $q_i y_i 2^{-i}$ and $-q_i x_i 2^{-i}$.)

Figure 2.3: Word-Parallel CORDIC architecture with possible data pipelining. (Cells #0 to #n are connected in cascade. Cell #i takes $x_i$, $y_i$, $z_i$ and $\alpha_i$, forms $q_i = \mathrm{sign}[z_i]$, $q_i y_i 2^{-i}$ and $-q_i x_i 2^{-i}$, and outputs $x_{i+1}$, $y_{i+1}$, $z_{i+1}$; a clocked latch for pipelining of data may follow each cell.)
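The effect of placing a latch after every cell can be sketched in software. The model below is an illustration under stated assumptions, not the report's design: `make_cell` and `run_pipeline` are hypothetical names, floating-point values stand in for the fixed-point buses, the initial ±90° rotation is omitted, and the caller is expected to pre-scale the input vector by $K_n$.

```python
import math

def make_cell(i):
    """Return one CORDIC cell for iteration i (fixed shift, fixed angle)."""
    alpha = math.atan(2.0 ** -i)
    def cell(state):
        x, y, z = state
        q = 1 if z >= 0 else -1
        return (x - q * y * 2.0 ** -i, y + q * x * 2.0 ** -i, z - q * alpha)
    return cell

def run_pipeline(inputs, n=8):
    """Clock a stream of (x, y, z) triples through n latched cells."""
    cells = [make_cell(i) for i in range(n)]
    latches = [None] * n                   # latch k holds the output of cell k
    outputs = []
    stream = list(inputs) + [None] * n     # extra clocks flush the pipeline
    for item in stream:
        if latches[-1] is not None:
            outputs.append(latches[-1])    # a finished result leaves the last latch
        # On each clock edge, latch k captures cell k applied to the previous latch.
        new = [None] * n
        for k in range(n):
            src = item if k == 0 else latches[k - 1]
            new[k] = cells[k](src) if src is not None else None
        latches = new
    return outputs
```

Once the pipeline is full, one result emerges on every clock even though each individual rotation still takes n clocks to pass through all cells.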

Chapter 3 Improving CORDIC Accuracy


As expected, iterative algorithms calculate results by approximation and the solution will contain errors. CORDIC is not an exception, and errors are introduced by a combination of quantisation and approximation errors. The accuracy of a CORDIC processor is dependent on the word length used for the three input variables $x$, $y$ and $z$, as well as the number of iterations or steps performed. This chapter describes the errors associated with a fixed point implementation and a means of reducing these errors.

3.1 Estimation of CORDIC Accuracy


The fundamental operation performed by a CORDIC processor is the shift-and-add process, in which fixed point arithmetic will introduce errors. For example, consider the binary scaling of the vector $v_i = (x_i, y_i)$ at the $i$th stage:

if $i \le m$ then $v_{i+1}$ is updated with the truncated value $v_i 2^{-i}$
if $i > m$ then $v_{i+1} = v_i$, and the update will be 0

where $m$ is the internal bus width of $v$ and limits the maximum number of useful iterations. Peak accuracy could be achieved after $m$ iterations since all accuracy has been exhausted in $v$. However, truncation errors may exceed the accuracy achieved by more iterations, and it is desirable to find the optimal number of iterations.

The accuracy of the rotation will be determined by how closely the input rotation angle was approximated by the summation of sub-rotation angles $\alpha_i$. The error in $v$ after $n$ iterations will be proportional to the error in $z$. An increase in the $z$ datapath width will increase the accuracy of the $z$ update and hence the $v$ update.

The numerical accuracy of the CORDIC algorithm can be calculated by the examination of truncation and approximation errors. Truncation errors are due to the finite word length and approximation errors are due to the finite number of iterations. Walther [10] analysed the $x$ and $y$ iterations independently of the $z$ iterations and concluded that $\log n$ extra bits in the data paths can provide $n$ bits of accuracy. This work was re-calculated by Kota and Cavallaro [11] in a non-independent manner, concluding that $\log n + 2$ extra bits are required to achieve $n$ bits of accuracy after $n$ iterations.

This solution represents an upper bound of error in the CORDIC processor. A graph of this function appears in Figure (3.1) from which it can be seen that to achieve 8 or 16 bit accuracy, the internal datapaths need to be 13 and 22 bits respectively.
Figure 3.1: Numerical accuracy of the CORDIC processor. (Plot titled "Datapath resolution vs Output Resolution": output resolution, $n$ bits with $n$ iterations, against internal datapath width, $n + \log(n) + 2$.)
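The datapath widths quoted for 8 and 16 bit accuracy follow from the $n + \log n + 2$ rule. As a short worked example, the sketch below assumes a base-2 logarithm, which reproduces the quoted 13 and 22 bits, and rounds the logarithm up for values of $n$ that are not powers of two; that rounding is an assumption, as is the function name.

```python
import math

def datapath_width(n):
    """Internal width n + log2(n) + 2 for n-bit accuracy after n iterations."""
    # Assumption: log is base 2 and is rounded up when n is not a power of two.
    return n + math.ceil(math.log2(n)) + 2

for n in (8, 16):
    print(n, datapath_width(n))
```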

3.2 The Lower Bound of CORDIC Accuracy


A CORDIC processor can be presented with all possible input combinations to find the lower bound of error. Simulation results are shown in Figure (3.2), where a 12 bit CORDIC processor with a variable number of stages is presented with all possible rotation angles $-\pi \le z_{-1} \le \pi$ and the resulting accuracy in bits is calculated. Kota and Cavallaro's upper bound of error (as defined by their maximum error equation in Appendix (B)) is also shown in Figure (3.2). The upper bound of error has a well defined peak of accuracy; however, the simulation results indicate that accuracy will improve if more iterations are performed.

Figure 3.2: Predicted and Actual accuracy of a CORDIC processor with a 12 bit internal datapath. (Output accuracy against number of stages $n$. Solid: predicted accuracy; dashed: actual accuracy.)

Figure (3.3) illustrates, by simulation, the accuracy of a 12 bit, 12 stage processor and the resulting bits of error produced. About 0.3% of results have more than 2 bits of error, which indicates that the error bound of a CORDIC processor is positioned between the upper and lower bounds of error.

Figure 3.3: A plot showing bits of error for a typical test vector rotated through all possible angles. (Polar plot of bits of error, 0 to 3, against rotation angle, 0° to 360°.)

The simulation results indicate that $n + \log n + 2$ is an overestimate of the data path width required, and a reduction in datapath width is possible if the number of iterations is increased. Simulation results of two 8 stage CORDIC processors with 12 bit and 8 bit datapaths are shown for comparison in Figure (3.4) and Figure (3.5) respectively. The simulation results were obtained by varying the magnitude of $v$ and $\theta$ in uniform steps. The difference in resolution obtained is two bits, indicating that the lower bound of error is closer to the error bound of CORDIC.

3.3 Reducing the z update error


In the rotational mode of CORDIC, $z_i$ converges towards zero by adding/subtracting sub-rotation angles, and the final iterations of the $z_i$ update will result in numbers approaching zero. More precisely, the angular error $z_i$ is approximately equal to $2^{-i}$, thus for a bus width $m$, only $(m - i)$ bits are used to represent the error. To reduce the $z_i$ error a floating point system could be used, but it has complex hardware implementations not suited to word-parallel structures. A simpler method to improve accuracy, i.e., to utilise all $m$ bits, is a quasi-floating point scheme or normalisation scheme, which could be implemented by scaling the existing sequence by $2^i$, i.e.,

$$\hat{z}_i = 2^i z_i$$

Therefore, the new sequence becomes

$$\hat{z}_{i+1} = 2^{i+1} z_{i+1} = 2 \cdot 2^i\,(z_i - q_i \alpha_i) = 2\,(2^i z_i - q_i 2^i \alpha_i) = 2\,(\hat{z}_i - q_i \hat{\alpha}_i) \tag{3.1}$$

which requires a shift left at each iteration, and requires no extra hardware for a word-parallel structure. A new sequence of sub-rotation angles can be defined as:

$$\hat{\alpha}_i = 2^i \alpha_i = 2^i \arctan(2^{-i}) \tag{3.2}$$

where $\hat{\alpha}_i$ approaches a finite value of 1 for increasing values of $i$, and will utilise most of the bus width. Since the scaling system results in full use of the databus width, overflow may occur if the bus width is too small. Using Equation (3.1), the maximum value $\hat{z}_{i+1}$ can have is when $\hat{z}_i$ approaches zero, giving

$$\max[\hat{z}_{i+1}] \le 2 \max[\hat{\alpha}_i] \tag{3.3}$$

Figure 3.4: A 12 bit, 8 stage CORDIC processor produces 9 bit accurate results. (Polar plot of vector magnitude against rotation angle.)

Figure 3.5: An 8 bit, 8 stage CORDIC processor produces 7 bit accurate results. (Polar plot of vector magnitude against rotation angle.)
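The equivalence between the traditional and the normalised $z$ sequences can be checked numerically. The sketch below is a floating-point illustration only: it starts at $i = 0$ with $\hat{z}_0 = z_0 = \theta$, the name `z_sequences` is hypothetical, and it says nothing about the fixed-point accuracy gain that the report obtains by simulation.

```python
import math

def z_sequences(theta, n=12):
    """Return the traditional residues z_i and the normalised residues z_hat_i."""
    z, z_hat = theta, theta                  # z_hat_0 = 2^0 * z_0
    plain, scaled = [z], [z_hat]
    for i in range(n):
        alpha = math.atan(2.0 ** -i)
        alpha_hat = (2.0 ** i) * alpha       # Equation (3.2)
        q = 1 if z >= 0 else -1
        z = z - q * alpha                    # Equation (1.10)
        z_hat = 2.0 * (z_hat - q * alpha_hat)    # Equation (3.1)
        plain.append(z)
        scaled.append(z_hat)
    return plain, scaled

plain, scaled = z_sequences(0.6)
for i, (p, s) in enumerate(zip(plain, scaled)):
    print(i, p, s)
```

The printed normalised residues stay within a fixed range of a few units while the traditional residues shrink towards zero, which is why the normalised form keeps the full bus width in use.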

To calculate the increase in accuracy is beyond the scope of this report; however, simulation indicates that there is a direct improvement in accuracy. The simulation results indicated that using the traditional scheme the accuracy of the rotation is

$$\text{accuracy} \propto \log(z_i \text{ datapath width}) + \log(\text{number of stages}) \tag{3.4}$$

whereas the normalisation scheme has the advantage of

$$\text{accuracy} \propto \log(\text{number of stages}) \tag{3.5}$$

since the $z$ datapath is always in a semi-normalised state. Using the traditional scheme, $\alpha_i \to 0$, limiting the number of useful stages. However, when normalised, there is no limit on the number of stages, and a significant reduction in hardware is possible by reducing the bus width of $z$. Figure (3.6) illustrates the error dependencies on the number of stages and bits for the scaled and unscaled CORDIC processors. Figure (3.6(a)) and Figure (3.6(b)) show the angular expansion error. Figure (3.6(c)) and Figure (3.6(d)) show the dependence of the $v$ error on the angular expansion error.

Figure 3.6: Simulation results from a CORDIC processor illustrating the effects of the normalisation scheme. (Four surface plots. Top row, "No alpha scaling" and "Alpha scaling": angle expansion error, in units of $10^{-3}$, against stages and bits. Bottom row, "No alpha scaling" and "Alpha scaling": relative $v$ error against stages/bits in $v$ and bits in $z$.)

3.4 Unexpected Truncation Errors


Using fixed point arithmetic in a CORDIC processor will introduce an unexpected truncation error. The error occurs when the vector $(x, y)$ has a negative component. Consider the final iterations, where the update of vector $v$ approaches 0 since a larger number of right shifts is performed at each iteration. This is not the case if $x$ or $y$ is negative. For example, let $x_{i \to N}$ equal the hexadecimal number X"2D", or positive 45. The right shifted value of $x_{i \to N}$ approaches zero. However, the negative of X"2D" in two's-complement form is X"D3", and the right shifted value will produce a number approaching X"FF", or $-1$, not the expected zero. This is a significant problem in the CORDIC processor, since the addition of extra iterations will only increase the error.

A simple method of removing this error would be to round the shifted value, instead of the forced truncation. A simple method for rounding values is to add the bit that was last shifted out to the shifted value. The rounder could be implemented using a half-adder and typically requires three logic gates per bit to implement. Minimal extra hardware is required in the word-serial architecture; however, a word-parallel structure requires two half-adders per stage. This will have a direct effect on the performance of the processor because of the additional delay.

Figure (3.7) shows the simulation results of two CORDIC processors, with and without rounding units. The test vector was rotated in steps of 5°, through 360°, and the rounded results are significantly more accurate. The rounding maintains monotonicity in the actual angle of rotation as well as uniform magnitude.

Figure 3.7: An 8 bit, 8 stage CORDIC processor (a) without rounding, (b) with rounding. (Two polar plots of the rotated test vector, magnitude 32.95, against rotation angle.)
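The X"2D" and X"D3" example, and the rounding rule of adding the bit that was last shifted out, can be reproduced with a few lines of integer arithmetic. This is only a sketch of the arithmetic, not of the half-adder hardware; the helper names are hypothetical.

```python
def shift_truncate(value, k):
    """Arithmetic right shift by k bits (truncation toward minus infinity)."""
    return value >> k

def shift_round(value, k):
    """Right shift by k bits, then add the bit that was last shifted out."""
    if k == 0:
        return value
    return (value >> k) + ((value >> (k - 1)) & 1)

def to_signed8(byte):
    """Interpret an 8-bit pattern as a two's-complement integer."""
    return byte - 256 if byte & 0x80 else byte

positive = to_signed8(0x2D)   # +45
negative = to_signed8(0xD3)   # -45
for k in range(1, 9):
    print(k, shift_truncate(positive, k), shift_truncate(negative, k),
          shift_round(positive, k), shift_round(negative, k))
```

For large shifts the truncated negative value settles at -1 while the rounded value settles at 0, matching the behaviour described above.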

Chapter 4 VHDL Implementation


Various tools can be used to implement the CORDIC processor; however, a standardised approach to this problem would unify the solution for further development in various applications. VHDL (VHSIC Hardware Description Language) has been used here to describe the structural and behavioural characteristics of a Word-Parallel CORDIC processor. VHDL has become the standard hardware description language and has its own IEEE standard [12].

4.1 The Basic CORDIC Unit


Any CORDIC structure will involve a basic unit containing three adders/subtracters, as shown in Figure (4.1). The binary scaler would be variable in the case of a Word-Serial device, but is much simpler in the Word-Parallel device, as a shift translates directly to a misalignment of the data bus.

Figure 4.1: The basic CORDIC unit. (Cell $i$ takes $x_i$, $y_i$, $z_i$ and $\alpha_i$ and produces $x_{i+1}$, $y_{i+1}$, $z_{i+1}$.)

This unit, together with a suitable FSM and registers, could form a word-serial structure. A word-parallel implementation can be obtained by linking n CORDIC units. The rest of this chapter deals with the development of a Word-Parallel unit and the interconnection of these devices using the VHDL language. It should be a relatively trivial task, but unfortunately the Viewlogic VHDL Synthesiser has many bugs and supports only a subset of the full VHDL standard.

The main aim of the project was to describe a CORDIC processor using the VHDL language and to allow the application designer to change the size of the structure easily. This flexibility could include fundamental changes such as variable datapath widths and a variable number of stages. Other options such as rounding intermediate nodes and pipelining could also be easily integrated.

Currently, Viewlogic's VHDL is a partial implementation of the 1987 IEEE Standard VHDL, and many constructs are missing from their implementation. Most of the useful constructs are present, but using them produces ambiguous messages stating that the construct is only partially supported. This made it very difficult to work with.

4.2 VHDL Describes Structure and Behaviour


VHDL has the ability to describe a design in two ways: in terms of its component structure, or in terms of the behavioural functionality of the design. It is also possible to integrate the two streams. A requirement for structural descriptions is that the lowest level description must be a behavioural description, to ensure portability between different synthesis libraries. An example of a lowest level operator is the logical operator AND (behavioural), used to describe the ANDing of two operands. This may be synthesised as an AND standard cell from the library. Consequently, there is no way of directly accessing a component from a cell library, which would limit portability. Consider a slightly more complex design of an n-bit adder/subtracter, which could be described by the following behavioural description:
    addsub : PROCESS(a,b,sel)
      VARIABLE res : VLBIT_VECTOR(n DOWNTO 0);
    BEGIN
      res := zero(n DOWNTO 0);        -- needs to be initialised
      IF sel = '1' THEN
        res := add2c(a,b);
      ELSE
        res := sub2c(a,b);
      END IF;
      s <= res(n-1 downto 0);         -- discard cout
    END PROCESS;

The process activates when one of the signals in the sensitivity list changes, and then produces a result in the internal variable res. The signal s is assigned the lower portion of the sum. Now consider a structural description of the same adder/subtracter where several components are used:

    c(0) <= sel;                      -- carry in
    connect: FOR i IN 0 TO n-1 GENERATE
      invert:      invf101 PORT MAP( b(i), b_bar(i) );
      mux_b_b_bar: muxf201 PORT MAP( b_bar(i), b(i), sel, b_hat(i) );
      addsub:      faf001  PORT MAP( a(i), b_hat(i), c(i), s(i), c(i+1) );
    END GENERATE;

Note that the muxf201 component is used to select between the non-inverted and inverted signals of the b bus. The components are user defined entities describing the appropriate logic gates. For example, a fragment of the faf001 component contains the following lowest level behavioural description:

    SUM <= A1 xor B1 xor CIN2;
    CO  <= (A1 and B1) or (A1 and CIN2) or (B1 and CIN2);
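As a cross-check of the structural description, the bit slices can be modelled in software and compared against plain two's-complement arithmetic. This is a sketch under an explicit assumption about polarity: here sel = 1 selects the inverted b bits so that, with the carry-in tied to sel, the result is a - b, and sel = 0 gives a + b. The port semantics of muxf201 are not given in the text, so this polarity and all function names are illustrative only.

```python
def full_adder(a, b, cin):
    """One-bit full adder using the SUM and CO equations of the faf001 fragment."""
    s = a ^ b ^ cin
    co = (a & b) | (a & cin) | (b & cin)
    return s, co

def add_sub_structural(a, b, sel, n=4):
    """Ripple adder/subtracter built from n inverter + mux + full-adder slices.

    Assumption: sel = 1 selects the inverted b bits, so with carry-in = sel
    the result is a - b; sel = 0 gives a + b.
    """
    carry = sel                          # c(0) <= sel
    result = 0
    for i in range(n):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        b_bar = b_i ^ 1                  # inverter
        b_hat = b_bar if sel else b_i    # 2:1 multiplexer
        s_i, carry = full_adder(a_i, b_hat, carry)
        result |= s_i << i
    return result                        # the final carry is discarded

def add_sub_behavioural(a, b, sel, n=4):
    """Reference: n-bit two's-complement add or subtract, carry discarded."""
    mask = (1 << n) - 1
    return (a - b) & mask if sel else (a + b) & mask

mismatches = [(a, b, sel) for a in range(16) for b in range(16) for sel in (0, 1)
              if add_sub_structural(a, b, sel) != add_sub_behavioural(a, b, sel)]
print(len(mismatches))
```

An exhaustive comparison like this is practical for n = 4 and gives some confidence that the structural and behavioural forms describe the same function before they are handed to a synthesiser.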

It is not immediately obvious which way a designer should describe a particular design; however, the next section reveals the results of the synthesiser, on which a decision may be based. In general, the easier it is for a designer to write a design in VHDL, the more optimisation the synthesiser needs to perform.

4.2.1 Hierarchical vs Flat Designs

One of the very useful features of Viewlogic's VHDL Synthesiser [13] is the ability to create either a hierarchical (top-down) or a flat (bottom-up) design. A hierarchical design allows the engineer to see lower level interconnections between design units, unlike the flat design where no (or little) hierarchy can be seen. This allows easier debugging of designs; however, it has the disadvantage of being less efficient than a flat design, which combines all the design elements together into one circuit and then performs optimisation.

Figure (4.2) illustrates the previous structural design of the Adder/Subtracter, where it can be observed that the schematic consists of higher level components than standard library cells. This feature of Viewlogic VHDL enables easy debugging of high level components when compared to a flat design. It is relatively simple to navigate between levels in a design. Most libraries contain standard cells for full adders, muxes, and inverters, but since VHDL does not allow direct access to library cells, these components had to be described by a behavioural description. A mux simply maps to an IF statement; however, no behavioural description will map to the full adder cell, so the description stated previously has to be used.

Figure 4.2: A Hierarchical Design of the Adder/Subtracter for n = 4. (Schematic of four bit slices, each built from INVF101, MUXF201 and FAF001 components, with inputs A0 to A3, B0 to B3 and SEL, and outputs S0 to S3.)

Compiling the same design using the flat (bottom-up) design approach, the synthesiser produces the following statistics when using, for example, the X2000 library. The schematic generated by the synthesiser is shown in Figure (4.3).

    ******************** Gate Usage Summary ********************
    Cell           Count   Area/Cell
    ------------------------------------------------------------
    X2000:NAND2      15      0.25
    X2000:OR2         3      0.25
    X2000:XOR2       15      0.25
    ------------------------------------------------------------
    Total Cells : 33        Total Area : 8.25

    ******************** Netlist Statistics ********************
    Maximum level of gates = 14
    Total number of nets   = 42

Figure 4.3: A Flat Design of the Adder/Subtracter for n = 4. (Flat schematic built from NAND2, OR2 and XOR2 cells, with inputs A0 to A3, B0 to B3 and SEL, and outputs S0 to S3.)

Reconsidering the behavioural description of the Adder/Subtracter and synthesising the design, the following statistics are generated, and the corresponding schematic is shown in Figure (4.4).

    ******************** Gate Usage Summary ********************
    Cell           Count   Area/Cell
    ------------------------------------------------------------
    X2000:AND2       21      0.25
    X2000:AND3        1      0.50
    X2000:INV        11      0.00
    X2000:NAND2       8      0.25
    X2000:OR2        17      0.25
    X2000:XOR2        3      0.25
    ------------------------------------------------------------
    Total Cells : 61        Total Area : 12.75

    ******************** Netlist Statistics ********************
    Maximum level of gates = 11
    Total number of nets   = 70

Figure 4.4: A Behavioural Design of the Adder/Subtracter for n = 4. (Schematic: a single level of X2000 AND2, AND3, INV, NAND2, OR2 and XOR2 gates between inputs A0 to A3, B0 to B3, SEL and outputs S0 to S3.)

From the statistics of each design, it is important to note that the total area and the maximum level of gates differ. The structural description produces a small but slow design (33 cells, 14 levels of gates), whereas the behavioural description produces a fast but large design (61 cells, 11 levels of gates). A characteristic of the synthesiser is that it maps a behavioural description to a structure by representing each output in terms of its inputs, much like a lookup table, and removes any structure. The synthesiser performs logic level optimisation on the structural description, thus producing a design with less logic.

4.2.2 The Viewlogic Synthesiser

The Viewlogic Synthesiser has the ability to alter the emphasis on speed or area when optimising a design. The statistics generated in the previous section were area optimised, and neglected the effect of gate delays. For example, optimising the behavioural design for speed, the synthesiser generates 14 more gates than before; however, there is a significant decrease in the maximum level of gates (from 11 to 9):
*********************************************
Gate Usage Summary
*********************************************
Cell            Count  Area/Cell    Cell            Count  Area/Cell
----------------------------------------------------------------------------
X2000:AND2         10       0.25    X2000:AND3          2       0.50
X2000:AND4          1       0.75    X2000:INV          15       0.00
X2000:NAND2        17       0.25    X2000:NAND3         1       0.50
X2000:NAND4         1       0.75    X2000:NOR3          2       0.50
X2000:NOR4          2       0.75    X2000:OR2          22       0.25
X2000:OR4           1       0.75    X2000:XOR2          1       0.25
----------------------------------------------------------------------------
Total Cells : 75                    Total Area : 18.75

*********************************************
Netlist Statistics
*********************************************
Maximum level of gates = 9
Total number of nets = 84

The synthesiser can optimise small designs, but when the design grows large, the memory and processing power required to optimise it are considerable. The CORDIC unit contains three adders/subtracters and takes several minutes to compile and optimise. However, when this unit is integrated into a larger design of several units, the compiler has many problems and eventually crashes after half an hour of compilation. A solution to this optimisation problem is to use a hierarchical flow and describe the components using behavioural or structural descriptions. Using this method the compiler knows nothing about large components and cannot perform any global optimisation. This is not a fully optimised solution, but it is currently the best available. However, it is possible to flatten the design below the top level, making the design slightly more efficient.

4.3 VHDL Design of the CORDIC Unit


The first stage of the design of a CORDIC processor is to create the CORDIC unit, where two approaches can be taken: a behavioural description or a structural description. Firstly, consider the following behavioural description, in which the shifting of (x_i, y_i) is done external to the CORDIC unit, in the top level design. This approach is optimal, since it only requires a misalignment of the data buses in the top level interconnections. If the shifting were contained inside the CORDIC unit, each unit would require a variable shifter and could not be optimised using the current version of Viewlogic VHDL, for reasons discussed previously. Another reason why shifting is done external to the CORDIC unit is that the LOOP variable inside the generate statement cannot be passed to any user defined function, procedure or entity. This is not stated in the manual, and the problem took many days to identify. The behavioural description is as follows:
ARCHITECTURE behaviour OF adder IS
begin
  cell_i : process (xi,xs,yi,ys,zi,ai)
    VARIABLE x_res: vlbit_vector(n downto 0);   -- temporary results
    VARIABLE y_res: vlbit_vector(n downto 0);
    VARIABLE z_res: vlbit_vector(k downto 0);
  begin
    x_res := zero(n downto 0);   -- initialise, unless comp complains
    y_res := zero(n downto 0);
    z_res := zero(k downto 0);
    if zi(k-1) = '0' then        -- z_i is positive
      x_res := add2c (xi, ys);
      y_res := sub2c (yi, xs);
      z_res := sub2c (zi, ai);
    else                         -- z_i is negative
      x_res := sub2c (xi, ys);
      y_res := add2c (yi, xs);
      z_res := add2c (zi, ai);
    end if;
    xip1 <= x_res (n-1 downto 0);
    yip1 <= y_res (n-1 downto 0);
    zip1 <= z_res (k-1 downto 0);
  end process;
END behaviour;
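The behaviour of one CORDIC unit can be checked against a small software model. The sketch below is written in Python rather than VHDL and is an illustration only: it assumes that add2c and sub2c are plain two's-complement addition and subtraction, and that slicing the result to n (or k) bits simply discards the extra bit. The helper to_signed and the default widths are hypothetical.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def cordic_unit(xi, xs, yi, ys, zi, ai, n=8, k=8):
    """One CORDIC stage; xs and ys are the pre-shifted x and y from the top level."""
    if zi >= 0:  # sign bit zi(k-1) = '0'
        x, y, z = xi + ys, yi - xs, zi - ai
    else:        # sign bit zi(k-1) = '1'
        x, y, z = xi - ys, yi + xs, zi + ai
    # keep n (or k) bits, as the slices x_res(n-1 downto 0) etc. do
    return to_signed(x, n), to_signed(y, n), to_signed(z, k)


# Stage 0: the shift is by zero places, so xs = xi and ys = yi.
print(cordic_unit(40, 40, 0, 0, 30, 45))   # (40, -40, -15)
print(cordic_unit(40, 40, 0, 0, -30, 45))  # (40, 40, 15)
```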

The synthesiser generates the following statistics for an 8 bit version of the code. The maximum level of gates is 20, since each bit requires 2 levels, plus additional gates for the multiplexer and inversion.
*********************************************
Gate Usage Summary
*********************************************
Cell            Count  Area/Cell    Cell            Count  Area/Cell
----------------------------------------------------------------------------
X2000:AND2        159       0.25    X2000:AND3          3       0.50
X2000:INV          69       0.00    X2000:NAND2        76       0.25
X2000:OR2         125       0.25    X2000:XOR2          7       0.25
----------------------------------------------------------------------------
Total Cells : 439                   Total Area : 93.25

*********************************************
Netlist Statistics
*********************************************
Maximum level of gates = 20
Total number of nets = 487

The structural description of the CORDIC unit is slightly more complex and is best represented pictorially, as shown in Figure (4.5). Each box in the figure represents a different VHDL entity (component), and some components are used more than once. The design is very bulky, and it is easier to make mistakes.

Figure 4.5: The structure of CORDIC unit showing the various entities. (Diagram: within adders.vhd, addsub_e.vhd takes zi and ai and produces zip1, while two instances of addsub_n.vhd take xi, ys and yi, xs and produce xip1 and yip1. Each adder/subtracter contains inv101.vhd (INV), muxf201.vhd (2-to-1 mux) and faf001.vhd (Full Adder).)

The structural design achieves the same functionality as the behavioural description but requires a lot more effort to make sure all the connections are correct. As stated previously, the structural design will minimise area, but will result in a slower design, as reflected by the following synthesiser statistics.

*********************************************
Gate Usage Summary
*********************************************
Cell            Count  Area/Cell    Cell            Count  Area/Cell
----------------------------------------------------------------------------
X2000:INV           3       0.00    X2000:NAND2       139       0.25
X2000:OR2          41       0.25    X2000:XOR2         75       0.25
----------------------------------------------------------------------------
Total Cells : 258                   Total Area : 63.75

*********************************************
Netlist Statistics
*********************************************
Maximum level of gates = 31
Total number of nets = 306

Using the structural design will save about 30% on area, but the design will execute about 50% slower. In an FPGA implementation speed might be more desirable than area optimisation, since the devices operate relatively slowly when compared to a custom VLSI device. A 30% increase in the number of gates will be a relatively small concern.
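The quoted percentages follow directly from the two synthesiser reports for the 8 bit CORDIC unit. As a worked check (a small Python sketch; the maximum level of gates is used here as a rough proxy for delay, which is an assumption):

```python
# Totals copied from the two synthesiser reports for the 8 bit CORDIC unit.
behavioural = {"area": 93.25, "levels": 20}
structural = {"area": 63.75, "levels": 31}

area_saving = 1 - structural["area"] / behavioural["area"]
extra_depth = structural["levels"] / behavioural["levels"] - 1

print(f"area saved by structural design: {area_saving:.1%}")   # 31.6%
print(f"extra levels of gates:           {extra_depth:.1%}")   # 55.0%
```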

4.3.1 The Rounding Unit

The rounding unit is formed by the interconnection of n half adders, or in behavioural terms, the addition of the bit shifted out during the shifting process. Describing it structurally involves using the inc001 component, which contains an AND and an XOR gate to form a half adder. The interconnection of the inc001 components is:
c(0) <= cin;                       -- first carry
connect: for i in 0 to n-1 generate
  addsub: inc001 port map( a(i), c(i), s(i), c(i+1) );
end generate;

Or, a much simpler behavioural description is created using the unsigned addition routine addum. This avoids the sign extension used in the add2c routine.
rounder : process (a,cin)
  VARIABLE res: vlbit_vector(n downto 0);   -- temporary results
begin
  res := zero(n downto 0);   -- initialise, unless comp complains
  res := addum(a,cin);       -- use addum instead of add2c as it sign
                             -- extends the cin input making it -1 not +1
  s <= res (n-1 downto 0);
end process;
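Both descriptions of the rounding unit can be compared in software. The following Python sketch models the half-adder chain and uses it to round an arithmetic right shift. It assumes that the carry input is the last bit shifted out and that the final carry c(n) is discarded; the function names are illustrative.

```python
def half_adder(a, c):
    """AND/XOR half adder: returns (sum, carry_out)."""
    return a ^ c, a & c


def round_structural(a, cin, n=8):
    """Add the single bit cin to the n-bit word a with a chain of half adders."""
    carry, s = cin, 0
    for i in range(n):
        bit, carry = half_adder((a >> i) & 1, carry)
        s |= bit << i
    return s  # final carry c(n) is dropped


def shift_and_round(value, shift, n=8):
    """Arithmetic right shift of a signed value, rounded with the last bit shifted out."""
    mask = (1 << n) - 1
    if shift == 0:
        return value & mask
    cin = (value >> (shift - 1)) & 1   # bit shifted out last
    shifted = (value >> shift) & mask  # Python's >> on ints is arithmetic
    return round_structural(shifted, cin, n)


print(shift_and_round(7, 1))        # 3.5 rounds to 4
print(hex(shift_and_round(-3, 1)))  # -1.5 rounds to -1, i.e. 0xff
```

Without the rounding step, -3 shifted right by one place would truncate to -2, which is the kind of truncation error the rounding unit is intended to reduce.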


4.4 Combining the CORDIC Units


The process of combining the CORDIC and Rounding units involves writing the top level design in the hierarchical solution. As before with structural descriptions, the generate statement is used, which allows iterative or conditional generation of a portion of the description. The first definition to be made in the top level file is the alpha_i constants, and this version implements the Alpha Normalisation Scheme. Next, the x, y, z intermediate signals between CORDIC units are shifted by the appropriate amount. The function shift_all is defined in another file, which contains user defined functions. This operation is required here because it will not work inside the generate statement: concurrent procedure calls only execute when a variable in the sensitivity list changes state, and a change in the shift value is not recognisable inside the generate statement.
-- Scaled a_i * 2^i values are decimal 45 53 56 57 57 57 57 57
ai <= X"39_39_39_39_39_38_35_2D";
sh_x: xis <= shift_all(xi);        -- shift intermediate signals
sh_y: yis <= shift_all(yi);
sh_z: zis <= shift_z(zi);
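The constants in the comment can be reproduced numerically. The sketch below (Python) assumes that one least significant bit of the z bus represents one degree, which is inferred from the values 45 and 90 used in this chapter rather than stated, and that each constant is atan(2^-i) scaled by 2^i and rounded to the nearest integer.

```python
import math


def scaled_alpha(i):
    """atan(2^-i) in degrees, scaled by 2^i and rounded to an integer."""
    return round(math.degrees(math.atan(2.0 ** -i)) * 2 ** i)


alphas = [scaled_alpha(i) for i in range(8)]
print(alphas)  # [45, 53, 56, 57, 57, 57, 57, 57]

# Pack with stage 0 in the least significant byte, as in the ai constant.
packed = "_".join(f"{a:02X}" for a in reversed(alphas))
print(packed)  # 39_39_39_39_39_38_35_2D
```

The scaled values settle at 57 because atan(2^-i) * 2^i tends to 180/pi, approximately 57.3 degrees, as i grows.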

It should be noted that the variables xis, yis, zis, xi, yi, and zi are large vectors containing several smaller vectors. This system had to be used since Viewlogic's VHDL cannot handle two-dimensional arrays of vlbit. The shifting of intermediate signals is done by the following function:
FUNCTION shift_all (x : vlbit_vector (n*(k-1)-1 downto 0)) RETURN vlbit_vector IS
  VARIABLE x_s : vlbit_vector(n*(k-1)-1 downto 0) := zero(n*(k-1)-1 downto 0);
BEGIN
  x_s(1*n-1 downto 0  ) := shiftr2c(x( 1*n-1 downto 0   ),1);   --  2 stage
  x_s(2*n-1 downto 1*n) := shiftr2c(x( 2*n-1 downto 1*n ),2);   --  3 stage
  x_s(3*n-1 downto 2*n) := shiftr2c(x( 3*n-1 downto 2*n ),3);   --  4 stage
  x_s(4*n-1 downto 3*n) := shiftr2c(x( 4*n-1 downto 3*n ),4);   --  5 stage
  x_s(5*n-1 downto 4*n) := shiftr2c(x( 5*n-1 downto 4*n ),5);   --  6 stage
  x_s(6*n-1 downto 5*n) := shiftr2c(x( 6*n-1 downto 5*n ),6);   --  7 stage
  x_s(7*n-1 downto 6*n) := shiftr2c(x( 7*n-1 downto 6*n ),7);   --  8 stage
  x_s(8*n-1 downto 7*n) := shiftr2c(x( 8*n-1 downto 7*n ),8);   --  9 stage
  x_s(9*n-1 downto 8*n) := shiftr2c(x( 9*n-1 downto 8*n ),9);   -- 10 stage
  return x_s;
END shift_all;
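The packing scheme used by shift_all can be modelled by treating the large vector as an integer and each n-bit slice as a signed sub-word. This Python sketch assumes that shiftr2c is an arithmetic (sign-extending) right shift, as its name suggests; the parameter name stages is hypothetical.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def shift_all(x, n=8, stages=9):
    """Shift slice j (bits (j+1)*n-1 downto j*n) right arithmetically by j+1 places."""
    mask = (1 << n) - 1
    out = 0
    for j in range(stages):
        piece = to_signed((x >> (j * n)) & mask, n)
        out |= ((piece >> (j + 1)) & mask) << (j * n)
    return out


# Two slices: 0x40 (+64) shifted by 1 gives 0x20, 0x80 (-128) shifted by 2 gives 0xE0.
print(hex(shift_all(0x8040, n=8, stages=2)))  # 0xe020
```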

Next comes the connection of the init component, which is used to expand the convergence range of the CORDIC processor to -190° < z < 190°. The input signals x_in, y_in and z_in are connected to a unit similar to the CORDIC unit, except that an extra bit is appended to the alpha bus to account for the expanded convergence range.

initial: init port map(
  xi   => X"00",
  xs   => x_in,
  yi   => X"00",
  ys   => y_in,
  zi   => z_in,
  ai   => B"0_0101_1010",   -- add/sub 90 degrees
  xip1 => xinit,            -- xinit = 0 +- yin
  yip1 => yinit,            -- yinit = 0 -+ xin
  zip1 => zinit );
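The effect of the init stage on the angle can be illustrated with a short model. This Python sketch assumes that the init unit applies the same add/subtract pattern as a CORDIC unit with xi = yi = 0 and an angle constant of 90, that angles are expressed in whole degrees, and that word lengths can be ignored.

```python
def init_stage(x_in, y_in, z_in):
    """Pre-rotation by 90 degrees, chosen by the sign of z_in."""
    if z_in >= 0:
        return 0 + y_in, 0 - x_in, z_in - 90
    return 0 - y_in, 0 + x_in, z_in + 90


print(init_stage(100, 20, 150))   # (20, -100, 60)
print(init_stage(100, 20, -170))  # (-20, 100, -80)

# Any input angle inside (-190, 190) leaves a residual below 100 degrees.
worst = max(abs(init_stage(1, 0, z)[2]) for z in range(-189, 190))
print(worst)  # 99
```

The residual angle passed to the first CORDIC unit therefore stays within the basic convergence range of about 99.9 degrees.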

The following code has been compressed to reduce detail; however, it can be seen that there are three separate stages: initial connection, intermediate connections, and final connection. These can be seen in Figure (4.6). (Not shown is the conditional generation of components, e.g. selection of behavioural or structural components, rounding units, etc.)
connect: for i in 0 to k-1 generate
  ls_unit: if i=0 generate
    first_unit: adder port map( ... );
  end generate ls_unit;
  i_unit: if i>0 and i<k-1 generate
    x_round: round port map ( ... );
    y_round: round port map ( ... );
    middle_units: adder port map( ... );
  end generate i_unit;
  ms_unit: if i=k-1 generate
    x_round_last: round port map ( ... );
    y_round_last: round port map ( ... );
    last_unit: adder port map( ... );
  end generate ms_unit;
end generate connect;              -- k stages

The contents of ... are similar to the port map of the init component. This represents a solution to the CORDIC problem, and is close to an optimised solution, but due to compiler and language difficulties a completely optimised solution is not possible. Under these circumstances the design has nevertheless been optimised as far as possible. There are many choices to be made about the design of the CORDIC unit, chiefly whether it is to be area or speed efficient.

4.4.1 A Solution

Figure 4.6: The top level schematic of a 4 stage CORDIC processor with Increased Convergence Range and Rounding components. (Schematic: one INIT unit followed by four ADDER units, with six ROUND units on the x and y paths between stages; inputs X_IN7 to X_IN0, Y_IN7 to Y_IN0 and A_IN8 to A_IN0; outputs X_OUT7 to X_OUT0, Y_OUT7 to Y_OUT0 and A_OUT7 to A_OUT0.)

The user can flexibly change the characteristics of the CORDIC processor by changing the value of a few constants, to achieve more or less accuracy as well as different hardware configurations. Some of the CORDIC hardware statistics generated are:

Type         Internal Bus Width  Stages  Rounding  Number of Gates
Behavioural  12 bit              8       no        5841
Behavioural  12 bit              10      no        7139
Behavioural  12 bit              12      no        8437
Behavioural  8 bit               8       no        3753
Behavioural  8 bit               8       yes       5060
Structural   8 bit               8       no        2313
Structural   8 bit               8       yes       2775

Table 4.1: Some CORDIC hardware statistics. (Remember that there is also one additional stage for increasing the convergence range.)
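Two trends can be read from Table 4.1 with simple arithmetic: the cost of an extra stage and the cost of the rounding units. The following Python sketch only restates the table; the dictionary layout is illustrative.

```python
# (type, bus width, stages, rounding) -> number of gates, copied from Table 4.1
gates = {
    ("behavioural", 12, 8, False): 5841,
    ("behavioural", 12, 10, False): 7139,
    ("behavioural", 12, 12, False): 8437,
    ("behavioural", 8, 8, False): 3753,
    ("behavioural", 8, 8, True): 5060,
    ("structural", 8, 8, False): 2313,
    ("structural", 8, 8, True): 2775,
}

# Gates added per extra 12 bit behavioural stage, from two pairs of rows.
per_stage = [
    (gates[("behavioural", 12, s + 2, False)] - gates[("behavioural", 12, s, False)]) // 2
    for s in (8, 10)
]
print(per_stage)  # [649, 649]

# Gates added by the rounding units in the 8 bit, 8 stage designs.
rounding_cost = {
    kind: gates[(kind, 8, 8, True)] - gates[(kind, 8, 8, False)]
    for kind in ("behavioural", "structural")
}
print(rounding_cost)  # {'behavioural': 1307, 'structural': 462}
```

On these figures the 12 bit behavioural design grows linearly with the number of stages, and rounding is considerably cheaper in the structural design than in the behavioural one.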

4.5 Improvements
There is one main improvement which could be made to the current design, which is to include pipelining registers between stages. If a Xilinx FPGA were used, no extra hardware for latches would be necessary, as each cell contains a latch. Another possibility is to design a Word-Serial CORDIC architecture around the already designed CORDIC unit. This would only require an FSM driver along with some additional hardware for the variable shifter.


Conclusion
The theory behind the CORDIC algorithm has been covered in detail and its possible applications in array imaging discussed. It was shown how Kota predicted the upper bound on CORDIC errors; however, simulations reveal CORDIC to be significantly more accurate. A normalisation scheme on the z datapath was introduced to maximise bus usage, and hence increase accuracy. This scheme can reduce the bus width required for z and still achieve the same or greater accuracy. Also observed were the unexpected truncation errors introduced by the twos-complement binary format, which were minimised using a half adder to perform a rounding operation. A method for increasing the convergence range to -180° < θ < 180° in one extra iteration was also introduced. Various CORDIC architectures were discussed and a word-parallel architecture was described using the VHDL hardware description language. A few design issues and implementation problems were discussed, and it was concluded that a hierarchical design flow had to be followed to avoid compiler memory problems. The solution is not completely optimal, but the resulting design could easily be implemented on an FPGA. The VHDL design of the CORDIC processor can now be easily integrated into any application, as alterations to the configuration of the CORDIC processor are easily achieved using the VHDL language.


Appendix A CORDIC Functions


The functional results from a CORDIC processor are derived from the initialisation of the three input variables x, y, and z, and the subsequent mode of operation selected. Equations (1.15,1.16) can be rewritten into a general solution, from which six modes of operation are possible, with the introduction of a mode variable m:

\[ x_{i+1} = x_i + m\, q_i\, y_i\, 2^{-i} \tag{A.1} \]
\[ y_{i+1} = y_i - q_i\, x_i\, 2^{-i} \tag{A.2} \]
\[ z_{i+1} = z_i - q_i\, \alpha_i \tag{A.3} \]

where m can take on the following values:

\[ \alpha_i = \begin{cases} \operatorname{atanh}(2^{-i}) & \text{if } m = -1 \\ 2^{-i} & \text{if } m = 0 \\ \arctan(2^{-i}) & \text{if } m = +1 \end{cases} \tag{A.4} \]

The three values of m determine the class of function being evaluated: linear (m = 0), circular (m = +1) or hyperbolic (m = -1). The values of \alpha_i were previously given for the circular functions mode, and a similar table could be given for the hyperbolic functions. The six modes of operation exist because there are two sub-classes available, depending upon whether the iterations seek to drive the variable y or z towards zero. Table (A.1) summarises all of the functions available with this configuration.
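The general iteration can be exercised in floating point before any word lengths are chosen. The following Python sketch applies (A.1) to (A.3) as written, in circular mode (m = +1) with z driven towards zero. The choice q_i = +1 when z_i >= 0 and the gain K_N as the product of sqrt(1 + 2^(-2i)) are assumptions that follow from the equations rather than from a listing in this report.

```python
import math


def cordic_rotate(x, y, z, n=16):
    """Circular mode (m = +1) with z driven to zero, following (A.1)-(A.3)."""
    m = 1
    for i in range(n):
        q = 1 if z >= 0 else -1
        x, y, z = (
            x + m * q * y * 2.0 ** -i,
            y - q * x * 2.0 ** -i,
            z - q * math.atan(2.0 ** -i),
        )
    return x, y, z


# Gain accumulated over 16 iterations (assumed form of K_N).
K = math.prod(math.sqrt(1 + 4.0 ** -i) for i in range(16))

x, y, z = cordic_rotate(1.0, 0.0, math.radians(30))
print(x / K, y / K, z)  # close to cos(30 deg), -sin(30 deg), 0
```

With the signs of (A.1) and (A.2) as written, a positive input angle turns the vector clockwise, so y comes out negative here. Sign conventions for the rotation direction vary between presentations, so it is safest to compare magnitudes when checking a result against a table of closed forms.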

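Hyperbolic, $m = -1$, $i = \{0, 1, 2, \ldots, N-1\}$ (repeat for $i = \{4, 13, 40, \ldots\}$)
  $z \to 0$: $q_i = +1$ if $z_i \ge 0$, $-1$ if $z_i < 0$
    $x_N = K_N (x_{in} \cosh(z_{in}) + y_{in} \sinh(z_{in}))$
    $y_N = K_N (x_{in} \sinh(z_{in}) + y_{in} \cosh(z_{in}))$
    $|z_{in}| \le 1.1182$
  $y \to 0$: $q_i = +1$ if $x_i y_i \ge 0$, $-1$ if $x_i y_i < 0$
    $x_N = K_N \sqrt{x_{in}^2 - y_{in}^2}$
    $z_N = z_{in} + \operatorname{atanh}(y_{in}/x_{in})$
    $|\operatorname{atanh}(y_{in}/x_{in})| \le 1.1182$

Linear, $m = 0$, $i = \{0, 1, 2, \ldots, N-1\}$
  $z \to 0$: $q_i = +1$ if $z_i \ge 0$, $-1$ if $z_i < 0$
    $x_N = x_{in}$
    $y_N = y_{in} + x_{in} z_{in}$
    $|z_{in}| \le 1$
  $y \to 0$: $q_i = +1$ if $x_i y_i \ge 0$, $-1$ if $x_i y_i < 0$
    $x_N = x_{in}$
    $z_N = z_{in} + y_{in}/x_{in}$
    $|y_{in}/x_{in}| \le 1$

Circular, $m = +1$, $i = \{0, 1, 2, \ldots, N-1\}$
  $z \to 0$: $q_i = +1$ if $z_i \ge 0$, $-1$ if $z_i < 0$
    $x_N = K_N (x_{in} \cos(z_{in}) - y_{in} \sin(z_{in}))$
    $y_N = K_N (x_{in} \sin(z_{in}) - y_{in} \cos(z_{in}))$
    $|z_{in}| \le 1.7433$ (99.9°)
  $y \to 0$: $q_i = +1$ if $y_i \ge 0$, $-1$ if $y_i < 0$
    $x_N = K_N \sqrt{x_{in}^2 + y_{in}^2}$
    $z_N = z_{in} + \operatorname{atan2}(y_{in}, x_{in})$
    $|\operatorname{atan2}(y_{in}, x_{in})| \le 1.7433$ (99.9°)

Table A.1: The six CORDIC modes.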


Appendix B Upper Bound of CORDIC Error


Kota and Cavallaro calculated the numerical accuracy of the CORDIC algorithm by examination of truncation and approximation errors. They concluded that, for a CORDIC processor with all data paths m bits wide and n iterations, the upper bound of the error is:

\[ E_u = 2^{-n} + 3.5 \cdot 2^{-m}\, n \tag{B.1} \]

This cannot be solved analytically, but a numerical solution (that is, for any given m) can be approximated graphically. The solution approximates to

\[ \text{Internal Bus Width} = m \approx n + \log_2 n + 2 \tag{B.2} \]

Hence to obtain a precision of n bits, a CORDIC processor with (n + log2 n + 2) bits and n iterations would be sufficient. This solution represents the upper bound of error.
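Equations (B.1) and (B.2) are easily evaluated for specific word lengths. In this Python sketch the bus width from (B.2) is rounded up to a whole number of bits, and the resulting bound is compared with 2^-(n-1); both the rounding and the comparison threshold are choices made for illustration.

```python
import math


def upper_bound(n, m):
    """Equation (B.1): error bound for n iterations and m bit data paths."""
    return 2.0 ** -n + 3.5 * n * 2.0 ** -m


def bus_width(n):
    """Equation (B.2), rounded up to a whole number of bits."""
    return math.ceil(n + math.log2(n) + 2)


for n in (8, 12, 16):
    m = bus_width(n)
    print(n, m, upper_bound(n, m) < 2.0 ** -(n - 1))
# 8 13 True
# 12 18 True
# 16 22 True
```

Substituting m = n + log2 n + 2 into (B.1) makes the second term 0.875 * 2^-n, so the total bound is 1.875 * 2^-n, just under 2^-(n-1).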


Bibliography
[1] J. E. Volder, "The CORDIC trigonometric computing technique," IRE Transactions on Electronic Computers, vol. EC-8, no. 3, pp. 330-334, 1959.
[2] Y. H. Hu, "CORDIC-based VLSI architectures for digital signal processing," IEEE Signal Processing Magazine, pp. 16-35, July 1992.
[3] F. Koscsis and J. Bohme, "Fast algorithms and parallel structures for form factor evaluation," The Visual Computer, no. 8, pp. 205-216, 1992.
[4] M. Kameyama, T. Amada, and T. Higuchi, "Highly parallel collision detection processor for intelligent robots," IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 500-506, 1992.
[5] M. O'Donnell et al., "Real-time phased array imaging using digital beam forming and autonomous channel control," Ultrasonics Symposium, pp. 1499-1502, 1990.
[6] G. Hampson and A. Paplinski, "Beamforming by interpolation," Tech. Rep. 93-12, Monash University, 1993.
[7] G. L. Haviland and A. A. Tuszynski, "A CORDIC arithmetic processor chip," IEEE Transactions on Computers, vol. C-29, no. 2, pp. 68-79, 1980.
[8] J. Duprat and J.-M. Muller, "The CORDIC algorithm: New results for fast VLSI implementation," IEEE Transactions on Computers, vol. 42, pp. 168-178, February 1993.
[9] A. Paplinski, "Array processor units for evaluating the exponential and logarithmic functions," Tech. Rep. TR-CS-82-07, The Australian National University, 1982.
[10] J. S. Walther, "A unified algorithm for elementary functions," Proceedings AFIPS Spring Joint Computer Conference, pp. 379-385, 1971.
[11] K. Kota and J. R. Cavallaro, "Numerical accuracy and hardware tradeoffs for CORDIC arithmetic for special-purpose processors," IEEE Transactions on Computers, vol. 42, pp. 769-779, July 1993.
[12] Experts, "IEEE Std 1076-1987, IEEE Standard VHDL Language Reference Manual," IEEE Computer Society, February 1992.
[13] ViewLogic, VHDL Reference Manual for Synthesis, Powerview 5.1.3 release ed.
