A VHDL Implementation of A CORDIC Arithmetic Processor Chip
Enquiries: Technical Report Coordinator, Robotics and Digital Technology, Monash University, Clayton VIC 3168, Australia
tr.coord@rdt.monash.edu.au, +61 3 905 3402
Contents
Abstract and Keywords
Preface
1 The CORDIC Algorithm
2 CORDIC Hardware Implementations
  2.1 CORDIC Processor Architecture
    2.1.1 A Word-Serial CORDIC Architecture
    2.1.2 A Word-Parallel CORDIC Architecture
3 Improving CORDIC Accuracy
  3.1 Estimation of CORDIC Accuracy
  3.2 The Lower Bound of CORDIC Accuracy
  3.3 Reducing the z update error
  3.4 Unexpected Truncation Errors
4 VHDL Implementation
  4.1 The Basic CORDIC Unit
  4.2 VHDL Describes Structure and Behaviour
    4.2.1 Hierarchical vs Flat Designs
    4.2.2 The Viewlogic Synthesiser
  4.3 VHDL Design of the CORDIC Unit
    4.3.1 The Rounding Unit
  4.4 Combining the CORDIC Units
    4.4.1 A Solution
  4.5 Improvements
List of Tables
1.1 Elementary angles of $\alpha_i$
1.2 Various values of $K_n$
4.1 Some CORDIC hardware statistics.
A.1 The six CORDIC modes.
List of Figures
1.1 Rotation of a point in 2-D space.
2.1 Generic Processor Architecture.
2.2 An Optimised Word-Serial CORDIC Architecture.
2.3 Word-Parallel CORDIC architecture with possible data pipelining.
3.1 Numerical accuracy of the CORDIC processor.
3.2 Predicted and Actual accuracy of a CORDIC processor with a 12 bit internal datapath.
3.3 A plot showing bits of error for a typical test vector rotated through all possible angles.
3.4 A 12 bit, 8 stage CORDIC processor produces 9 bit accurate results.
3.5 An 8 bit, 8 stage CORDIC processor produces 7 bit accurate results.
3.6 Simulation results from a CORDIC processor illustrating the effects of the normalisation scheme.
3.7 An 8 bit, 8 stage CORDIC processor (a) without rounding, (b) with rounding.
4.1 The basic CORDIC unit.
4.2 A Hierarchical Design of the Adder/Subtracter for n = 4.
4.3 A Flat Design of the Adder/Subtracter for n = 4.
4.4 A Behavioural Design of the Adder/Subtracter for n = 4.
4.5 The structure of the CORDIC unit showing the various entities.
4.6 The top level schematic of a 4 stage CORDIC processor with Increased Convergence Range and Rounding components.
Abstract
This report describes the fundamentals of the CORDIC (Co-ordinate Rotations Digital Computer) algorithm and a possible implementation using the VHDL hardware description language. The errors associated with a fixed point implementation of CORDIC are analysed, and methods for reducing these errors are discussed. One such method is a normalisation scheme which reduces error and requires no extra hardware. Various CORDIC structures and possible VHDL implementations are described in detail, including design and language issues. Finally, a parallel hardware implementation is described and simulated. CORDIC has many applications, some of which can be used for array imaging techniques.
Keywords
CORDIC, VHDL
Preface
CORDIC is an acronym for Coordinate Rotations Digital Computer and was derived by Volder [1] in the late 1950's for the purpose of calculating trigonometric functions. Its popularity came about nearly twenty years later when VLSI solutions became a reality. The original algorithm describes the rotation of a 2-D vector, which can be applied in areas such as Digital Signal Processing [2] (Fourier Transforms, Digital Filters), Computer Graphics [3] and Robotics [4]. CORDIC processing offers high computational rates, making it attractive to applications such as computer graphics where a combination of scaling and rotations is required in real time. CORDIC is also attractive to Robotics since the fundamental operation is coordinate transformation; however, it could also be used for more computationally intensive processes such as motion planning and collision detection.

Array Imaging typically involves complex signal processing which may require many computationally intensive matrix operations. Increasing the complexity of the imaging model places greater demands on accuracy. Solutions to such complex systems require better, and hence more complex, algorithms. Most of these algorithms are based on matrix factorisation (decomposition) techniques, of which Singular Value Decomposition (SVD) is the most robust method. The SVD factorisation requires a two-sided transformation which involves several trigonometric operations and rotations, and is ideally suited to dedicated VLSI hardware (CORDIC processing) for real time calculations. CORDIC has also been applied to phase correction in dynamic range focusing when Digital Baseband Demodulation [5] techniques are employed in Interpolation Beamforming [6]. A complex signal is represented by its in-phase, I, and quadrature, Q, components, and is phase corrected by rotating the complex signal.

Haviland and Tuszynski designed and built a CORDIC processor [7] in 1980 which used an iterative process to calculate circular, linear and hyperbolic functions. A more recent implementation (1993) by Duprat and Muller [8] discusses the possibility of using a redundant number system for the representation of a signed digit. This report is broken into four logical sections, namely CORDIC Theory, Hardware Implementations, Improving CORDIC Accuracy and finally a VHDL Implementation.
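[Figure 1.1 diagram: the vector $v = x + jy$ and the rotated vector $\tilde{v} = \tilde{x} + j\tilde{y}$ drawn on $x$ and $y$ axes.]
Figure 1.1: Rotation of a point in 2-D space.

The angle $\theta$ can be expanded into a set of elementary angles $\alpha_i$, each weighted by $q_i \in \{-1, +1\}$, and an angle expansion error $z_n$, such that

$$\theta = \sum_{i=-1}^{n-1} q_i \alpha_i + z_n$$

and the sub-rotation angles are

$$\alpha_i = \begin{cases} \pi/2 & \text{for } i = -1 \\ \arctan(2^{-i}) & \text{for } i = 0, 1, \ldots, n-1 \end{cases} \qquad (1.3)$$

Note that $\alpha_i$ is approximately equal to but less than $2^{-i}$ and the resulting angular expansion error is therefore $|z_n| < 2^{-(n-1)}$.

$$e^{j\theta} = \prod_{i=-1}^{n-1} e^{j q_i \alpha_i} \; e^{j z_n} \qquad (1.4)$$

Finally

$$\tilde{v} = v \, (j q_{-1}) \prod_{i=0}^{n-1} \cos\alpha_i \prod_{i=0}^{n-1} \left(1 + j q_i 2^{-i}\right) e^{j z_n} \qquad (1.5)$$

The rotation angle is limited to $|\theta| \le \theta_{\max}$, where

$$\theta_{\max} = \sum_{i=-1}^{n-1} \alpha_i \approx 190^\circ \qquad (1.6)$$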
and some values of $\alpha_i$ are given in Table (1.1). If the expected range of rotation angles is $\pm 90^\circ$ then the initial rotation by $\pm 90^\circ$, that is, $e^{j q_{-1} \pi/2} = j q_{-1}$, does not have to be performed and the initial rotation is by $\pm 45^\circ$. The second term is a constant scaling factor, and for a given value of $n$ it can be pre-evaluated using Equation (1.7); the first 16 values are evaluated in Table (1.2).
$$K_n = \prod_{i=0}^{n-1} \cos\alpha_i = \prod_{i=0}^{n-1} \left(1 + 2^{-2i}\right)^{-1/2} = \prod_{i=0}^{n-1} \frac{1}{\sqrt{1 + \frac{1}{4^i}}} \qquad (1.7)$$
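As a quick numerical check of Equation (1.7), the scaling factor can be evaluated directly. This is only an illustrative sketch; the function name `scale_factor` is an assumption and does not appear in the report.

```python
import math

def scale_factor(n):
    """Return K_n, the product of cos(alpha_i) for i = 0 .. n-1 (Equation 1.7)."""
    k = 1.0
    for i in range(n):
        # cos(arctan(2^-i)) = 1 / sqrt(1 + 4^-i)
        k /= math.sqrt(1.0 + 4.0 ** -i)
    return k

for n in (1, 2, 8, 16):
    print(n, f"{scale_factor(n):.14f}")
```

With n = 1 this gives cos 45° (about 0.7071), and the value settles near 0.60725 as n grows, in line with Table (1.2).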
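The basic CORDIC algorithm which describes rotation of a unity length vector $v = x + jy$ by an angle $\theta$ can be derived from Equation (1.5) using the initial conditions, where $z_i$ is the accumulated angular residue:

$$v_{-1} = v K_n, \qquad z_{-1} = \theta$$

and, for $i = -1, \ldots, n-1$,

$$q_i = \begin{cases} -1 & \text{if } z_i < 0 \\ +1 & \text{if } z_i \ge 0 \end{cases}$$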
Table 1.1: Elementary angles of $\alpha_i$

 i   Angle            Angle (degrees)   Hex    16-bit binary
 0   arctan(2^0)      45.0000           B400   101101.0000000000
 1   arctan(2^-1)     26.5651           6A43   011010.1001000011
 2   arctan(2^-2)     14.0362           3825   001110.0000100101
 3   arctan(2^-3)      7.1250           1C80   000111.0010000000
 4   arctan(2^-4)      3.5763           0E40   000011.1001000000
 5   arctan(2^-5)      1.7899           0729   000001.1100101001
 6   arctan(2^-6)      0.8952           0395   000000.1110010101
 7   arctan(2^-7)      0.4476           01CA   000000.0111001010
 8   arctan(2^-8)      0.2238           00E5   000000.0011100101
 9   arctan(2^-9)      0.1119           0073   000000.0001110011
10   arctan(2^-10)     0.0560           0039   000000.0000111001
11   arctan(2^-11)     0.0280           001D   000000.0000011101
12   arctan(2^-12)     0.0140           000E   000000.0000001110
13   arctan(2^-13)     0.0070           0007   000000.0000000111
14   arctan(2^-14)     0.0035           0004   000000.0000000100
15   arctan(2^-15)     0.0017           0002   000000.0000000010
16   arctan(2^-16)     0.0008           0001   000000.0000000001
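The 16-bit words in Table (1.1) appear to hold each angle in degrees with 6 integer bits and 10 fractional bits. That format is inferred from the position of the binary point in the table, not stated in the text, so the sketch below should be read as an assumption; `angle_word` is a hypothetical helper name.

```python
import math

def angle_word(i, frac_bits=10):
    """Encode arctan(2^-i), in degrees, as an integer with frac_bits fractional bits."""
    degrees = math.degrees(math.atan(2.0 ** -i))
    return round(degrees * (1 << frac_bits))

for i in range(6):
    print(i, format(angle_word(i), "04X"))
```

Rounding to nearest in this way reproduces the tabulated words for i = 0, 1, 2, 3 and 5. For i = 4 it gives 0E4E where the table lists 0E40, so that entry may have been produced differently or may contain a slip.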
Table 1.2: Various values of $K_n$

 n   K_n
 1   0.70710678118655
 2   0.63245553203368
 3   0.61357199107790
 4   0.60883391251775
 5   0.60764825625617
 6   0.60735177014130
 7   0.60727764409353
 8   0.60725911229889
 9   0.60725447933256
10   0.60725332108988
11   0.60725303152913
12   0.60725295913894
13   0.60725294104140
14   0.60725293651701
15   0.60725293538591
16   0.60725293510314
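The final rotated vector is $v_n$, with angle expansion error $z_n$:

$$v_n = \tilde{v} \, e^{-j z_n} = v \, e^{j\theta} e^{-j z_n} \qquad (1.11)$$

$$z_n = \theta - \sum_{i=-1}^{n-1} q_i \alpha_i \qquad (1.12)$$

For each iteration $i = 0, \ldots, n-1$,

$$x_{i+1} + j y_{i+1} = (x_i + j y_i)(1 + j q_i 2^{-i}) \qquad (1.14)$$

Hence

$$x_{i+1} = x_i - q_i y_i 2^{-i} \qquad (1.15)$$

$$y_{i+1} = y_i + q_i x_i 2^{-i} \qquad (1.16)$$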
The CORDIC algorithm reduces to an iterative set of operations consisting of a binary shift and an accumulator for each of $x$, $y$ and $z$. Refer to Appendix A for a list of transcendental functions.
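The iteration can be sketched in floating point as follows. This is a minimal model, not the report's hardware: it assumes rotation angles within ±90° so that the initial ±90° step (i = -1) is skipped, it takes the z update to be $z_{i+1} = z_i - q_i \alpha_i$ (consistent with Equation (1.12)), and the name `cordic_rotate` is hypothetical.

```python
import math

def cordic_rotate(x, y, theta, n=16):
    """Rotate (x, y) by theta radians (|theta| <= pi/2) with n CORDIC iterations."""
    # Pre-scale by K_n (Equation 1.7) so the vector length is preserved.
    k = 1.0
    for i in range(n):
        k /= math.sqrt(1.0 + 4.0 ** -i)
    x, y, z = x * k, y * k, theta
    for i in range(n):
        q = 1 if z >= 0 else -1              # direction of this sub-rotation
        # Equations (1.15) and (1.16): one shift and one add/subtract each.
        x, y = x - q * y * 2.0 ** -i, y + q * x * 2.0 ** -i
        z -= q * math.atan(2.0 ** -i)        # accumulate the angular residue
    return x, y, z

x, y, z = cordic_rotate(1.0, 0.0, math.pi / 6)
print(x, y, z)
```

Rotating (1, 0) by 30° should return values close to (cos 30°, sin 30°), with the residue z bounded by the last elementary angle.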
[Figure 2.1 block diagram: a ROM of $K_n$ values addressed by $n$, a ROM of $\alpha_i$ values addressed by $i$, a register file, an ALU, a variable shifter producing $2^{-i} y_i$ or $2^{-i} x_i$, and controlling micro-code.]
Figure 2.1: Generic Processor Architecture.

... file, and would execute an iterative algorithm. This structure is similar to that of a microprocessor or DSP and allows many variations of the CORDIC algorithm, as the order of operations and the expanded instruction set increase flexibility. This type of structure illustrates that it would be possible to implement the CORDIC algorithm on any microprocessor or DSP.

Optimising the generic processor structure for a word-serial CORDIC processor is achieved by reducing the functionality to only those operations required by the CORDIC algorithm. A possible word-serial architecture is shown in Figure (2.2), where the ALU now contains three adders and dedicated registers. The microcode controller has been replaced by faster Combination Control Logic dedicated to the CORDIC operation sequence.

The word-parallel method expands the problem of a single dimensional algorithm into a two-dimensional problem and results in shorter computational times. Greater speeds of computation can be obtained by pipelining between stages so that many partial results can be calculated in parallel. A pipelined word-parallel architecture is shown in Figure (2.3), where each iteration is represented by a separate CORDIC block and a latch is placed after each iteration, or after several iterations. The following chapters will develop, implement, and simulate such a parallel CORDIC structure using the VHDL hardware description language.
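The behaviour of a pipelined word-parallel structure can be illustrated with a small software model. The sketch assumes one latch after every iteration and uses floating point rather than a fixed point datapath; `cordic_cell` and `run_pipeline` are hypothetical names.

```python
import math

def cordic_cell(i, x, y, z):
    """One CORDIC iteration (cell #i): returns (x, y, z) for the next stage."""
    q = 1 if z >= 0 else -1
    return (x - q * y * 2.0 ** -i,
            y + q * x * 2.0 ** -i,
            z - q * math.atan(2.0 ** -i))

def run_pipeline(inputs, n):
    """Clock a stream of (x, y, z) triples through n latched CORDIC cells."""
    latches = [None] * n                      # latches[i] holds the output of cell i
    outputs = []
    for item in list(inputs) + [None] * n:    # trailing bubbles flush the pipeline
        # On each clock edge every cell reads the previous latch (or the input).
        prev = [item] + latches[:-1]
        latches = [cordic_cell(i, *p) if p is not None else None
                   for i, p in enumerate(prev)]
        if latches[-1] is not None:
            outputs.append(latches[-1])
    return outputs

results = run_pipeline([(1.0, 0.0, 0.5), (0.0, 1.0, -0.3)], n=8)
print(results)
```

Once the pipeline is full, one result emerges per clock, at the cost of n clocks of latency for the first result.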
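[Figure 2.2: An Optimised Word-Serial CORDIC Architecture. The diagram shows the initial inputs $x_0$, $y_0$, $z_0$ entering through a select stage, registers for $x_i$, $y_i$ and $z_i$, the shifted terms $q_i y_i 2^{-i}$ and $-q_i x_i 2^{-i}$, a look up table of $\alpha_i$ values, a clocked counter with increment and zero controls, the outputs $x_{i+1}$, $y_{i+1}$, $z_{i+1}$ and a finished flag.]

[Figure 2.3: Word-Parallel CORDIC architecture with possible data pipelining. The diagram shows cells #0 to #n connected in series; cell #i takes $x_i$, $y_i$, $z_i$ and $\alpha_i$, uses $q_i = \mathrm{sign}[z_i]$, and is followed by a clocked latch for pipelining of data; the final outputs are $x_n$, $y_n$, $z_n$.]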
This solution represents an upper bound of error in the CORDIC processor. A graph of this function appears in Figure (3.1) from which it can be seen that to achieve 8 or 16 bit accuracy, the internal datapaths need to be 13 and 22 bits respectively.
[Plot titled "Datapath resolution vs Output Resolution"; output resolution is n bits with n iterations.]
Figure 3.1: Numerical accuracy of the CORDIC processor.

[Plot of output accuracy against number of stages n.]
Figure 3.2: Predicted and Actual accuracy of a CORDIC processor with a 12 bit internal datapath.
Figure (3.3) illustrates the accuracy of a 12 bit, 12 stage processor, by simulation, and the resulting bits of error produced. About 0.3% of results are greater than 2 bits of error, which indicates that the error bound of a CORDIC processor is positioned between the upper and lower bounds of error.
[Polar plot of bits of error (radial axis, 1 to 3) against rotation angle (0° to 330°).]
Figure 3.3: A plot showing bits of error for a typical test vector rotated through all possible angles.

The simulation results indicate that $n + \log_2 n + 2$ is an overestimate of the datapath width required, and that a reduction in datapath width is possible if the number of iterations is increased. Simulation results of two 8 stage CORDIC processors, with 12 bit and 8 bit datapaths, are shown for comparison in Figure (3.4) and Figure (3.5) respectively. The simulation results were obtained by varying the magnitude of $v$ and the rotation angle in uniform steps. The difference in resolution obtained is two bits, indicating that the lower bound of error is closer to the error bound of CORDIC.
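The upper-bound estimate can be checked against the datapath widths quoted earlier for 8 and 16 bit accuracy. The sketch assumes the logarithm is base 2 and rounds it up to a whole bit; both are assumptions, and `datapath_width_upper_bound` is a hypothetical name.

```python
import math

def datapath_width_upper_bound(n):
    """Internal datapath width n + log2(n) + 2 for n-bit accuracy with n stages."""
    return n + math.ceil(math.log2(n)) + 2

for n in (8, 12, 16):
    print(n, datapath_width_upper_bound(n))
```

For n = 8 and n = 16 this gives 13 and 22 bits, matching the figures read from the upper-bound graph.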
[Polar plot over rotation angles from -180° to 180°.]
Figure 3.4: A 12 bit, 8 stage CORDIC processor produces 9 bit accurate results.
[Polar plot over rotation angles from -180° to 180°, radial axis from 0.1 to 1.0.]
Figure 3.5: An 8 bit, 8 stage CORDIC processor produces 7 bit accurate results.
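improve accuracy, i.e. to utilise all $m$ bits, a quasi-floating point scheme or normalisation scheme could be implemented by scaling the existing sequence by $2^i$, i.e.

$$\hat{z}_i = 2^i z_i$$

Therefore, the new sequence becomes

$$\hat{z}_{i+1} = 2^{i+1} z_{i+1} = 2^{i+1} (z_i - q_i \alpha_i) = 2 (2^i z_i - q_i 2^i \alpha_i) = 2 (\hat{z}_i - q_i \hat{\alpha}_i) \qquad (3.1)$$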
which requires a shift left at each iteration, and requires no extra hardware for a word-parallel structure. A new sequence of sub-rotation angles can be defined as:

$$\hat{\alpha}_i = 2^i \alpha_i = 2^i \arctan(2^{-i}) \qquad (3.2)$$

where $\hat{\alpha}_i$ approaches a finite value of 1 for increasing values of $i$, and will utilise most of the bus width. Since the scaling system results in full use of the databus width, overflow may occur if the bus width is too small. Using Equation (3.1), the maximum value $\hat{z}_{i+1}$ can have is when $\hat{z}_i$ approaches zero, giving

$$|\hat{z}_{i+1}|_{\max} = 2 \hat{\alpha}_i \qquad (3.3)$$
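The effect of the scaling on the stored angle table can be seen numerically. The following sketch assumes angles held in radians with a chosen number of fractional bits, and assumes the z update $z_{i+1} = z_i - q_i \alpha_i$; the helper names are hypothetical.

```python
import math

def quantise(value, frac_bits):
    """Round a real value to an integer with frac_bits fractional bits."""
    return round(value * (1 << frac_bits))

def angle_tables(stages, frac_bits):
    """Quantised alpha_i (traditional) and 2^i * alpha_i (normalised), in radians."""
    plain = [quantise(math.atan(2.0 ** -i), frac_bits) for i in range(stages)]
    scaled = [quantise(2 ** i * math.atan(2.0 ** -i), frac_bits) for i in range(stages)]
    return plain, scaled

def residues(theta, stages):
    """Run the traditional z sequence alongside the normalised one of Equation (3.1)."""
    z = z_hat = theta
    for i in range(stages):
        alpha = math.atan(2.0 ** -i)
        q = 1 if z >= 0 else -1
        z = z - q * alpha
        z_hat = 2.0 * (z_hat - q * (2 ** i) * alpha)   # shift left each iteration
    return z, z_hat

plain, scaled = angle_tables(stages=16, frac_bits=8)
print(plain)    # entries fall to zero once alpha_i drops below half an LSB
print(scaled)   # entries stay close to full scale
```

With 8 fractional bits the traditional table carries no information beyond stage 8, whereas every normalised entry still uses nearly the full word.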
To calculate the increase in accuracy is beyond the scope of this report; however, simulation indicates that there is a direct improvement in accuracy. The simulation results indicated that using the traditional scheme the accuracy of the rotation is

$$\text{accuracy} \propto \log(z_i \text{ datapath width}) + \log(\text{number of stages}) \qquad (3.4)$$

whereas the normalisation scheme has the advantage of

$$\text{accuracy} \propto \log(\text{number of stages}) \qquad (3.5)$$

since the $z$ datapath is always in a semi-normalised state. Using the traditional scheme, $\alpha_i \to 0$, limiting the number of useful stages. However, when normalised, there is no limit on the number of stages and a significant reduction in hardware is possible by reducing the bus width of $z$.

Figure (3.6) illustrates the error dependencies on the number of stages and bits for the scaled and unscaled CORDIC processors. Figure (3.6(a)) and Figure (3.6(b)) show the angular expansion error. Figure (3.6(c)) and Figure (3.6(d)) show the dependence of the $v$ error on the angular expansion error.
[Four surface plots. Top row: angle expansion error (scale x 10^-3) against stages and bits, for "No alpha scaling" and "Alpha scaling". Bottom row: relative v error against stages/bits in v and bits in z, for "No alpha scaling" and "Alpha scaling".]
Figure 3.6: Simulation results from a CORDIC processor illustrating the effects of the normalisation scheme.
[Two polar plots over rotation angles from -180° to 180°.]
Figure 3.7: An 8 bit, 8 stage CORDIC processor (a) without rounding, (b) with rounding.
[Figure 4.1: The basic CORDIC unit. The diagram shows cell i with inputs $x_i$, $y_i$, $z_i$ and $\alpha_i$.]
flexibility could include fundamental changes such as variable datapath widths and a variable number of stages. Other options such as rounding intermediate nodes and pipelining could also be easily integrated. Currently, Viewlogic's VHDL is a partial implementation of the 1987 IEEE Standard VHDL, and many constructs are missing from their implementation. Most of the useful constructs are present, but several produce ambiguous messages indicating that they are only partially supported. This made it very difficult to work with.
-- needs to be initialised
The process activates when one of the variables in the sensitivity list changes, and then produces a result in the internal variable res. The signal s is assigned the lower portion of the sum. Now consider a structural description of the same adder/subtracter where several components are used:
    c(0) <= sel;                                 -- carry in
    connect: FOR i IN 0 TO n-1 GENERATE
      invert:      invf101 PORT MAP( b(i), b_bar(i) );
      mux_b_b_bar: muxf201 PORT MAP( b_bar(i), b(i), sel, b_hat(i) );
      addsub:      faf001  PORT MAP( a(i), b_hat(i), c(i), s(i), c(i+1) );
    END GENERATE;
Note that the muxf201 component is used to select between the non-inverted and inverted signals of the b bus. The components are user defined entities describing the appropriate logic gates. For example, a fragment of the faf001 component contains the following lowest level behavioural description:
    SUM <= A1 xor B1 xor CIN2;
    CO  <= (A1 and B1) or (A1 and CIN2) or (B1 and CIN2);
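The structural adder/subtracter can be mirrored bit by bit in software to confirm what the generate loop computes. In this sketch the polarity of `sel` (1 selects the inverted b bus and therefore subtraction) is an assumption based on `sel` also driving the carry in; the function names are hypothetical.

```python
def full_adder(a, b, cin):
    """Same logic as the faf001 fragment: sum and carry out of three bits."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def add_sub(a, b, sel, n=4):
    """Return (a + b) mod 2**n when sel = 0, or (a - b) mod 2**n when sel = 1."""
    carry = sel                          # c(0) <= sel
    result = 0
    for i in range(n):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        b_hat = b_i ^ sel                # mux between b(i) and its inverse
        s_i, carry = full_adder(a_i, b_hat, carry)
        result |= s_i << i
    return result

print(add_sub(5, 3, 0), add_sub(5, 3, 1))
```

Inverting b and injecting a carry of 1 forms the two's complement of b, which is why a single ripple chain serves for both operations.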
It is not immediately obvious which way a designer should describe a particular design; however, the next section reveals the results of the synthesiser, on which a decision may be based. In general, the easier it is for a designer to write a design in VHDL, the more optimisation the synthesiser needs to perform.

One of the very useful features of Viewlogic's VHDL Synthesiser [13] is the ability to create either a hierarchical (top-down) or a flat (bottom-up) design. A hierarchical design allows the engineer to see lower level interconnections between design units, unlike the flat design where no (or little) hierarchy can be seen. This allows easier debugging of designs; however, it has the disadvantage of being less efficient than a flat design, which combines all the design elements together into one circuit and then performs optimisation.

Figure (4.2) illustrates the previous structural design of the Adder/Subtracter, where it can be observed that the schematic consists of higher level components than standard library cells. This feature of Viewlogic VHDL enables easy debugging of high level components when compared to a flat design, and it is relatively simple to navigate between levels in a design. Most libraries contain standard cells for full adders, muxes, and inverters, but since VHDL does not allow direct access to library cells, these components had to be described behaviourally. A mux simply maps to an IF statement; however, no behavioural description will map to the full adder cell, so the description stated previously is used instead.

Compiling the same design using the flat (bottom-up) design approach, the synthesiser produces the following statistics when, for example, the X2000 library is used. The schematic generated by the synthesiser is shown in Figure (4.3).
*********************************************
Gate Usage Summary
*********************************************
Cell           Count  Area/Cell     Cell           Count  Area/Cell
----------------------------------------------------------------------------
X2000:NAND2       15       0.25     X2000:OR2          3       0.25

[Figure 4.2: A Hierarchical Design of the Adder/Subtracter for n = 4. The schematic is built from INVF101, MUXF201 and FAF001 components.]

*********************************************
Netlist Statistics
*********************************************
Maximum level of gates = 14
Total number of nets   = 42
[Schematic built from XOR2, NAND2 and OR2 gates with inputs A0..A3, B0..B3 and SEL, and outputs S0..S3.]
Figure 4.3: A Flat Design of the Adder/Subtracter for n = 4.

Reconsidering the behavioural description of the Adder/Subtracter and synthesising the design, the following statistics are generated, and the corresponding schematic is shown in Figure (4.4).
*********************************************
Gate Usage Summary
*********************************************
Cell           Count  Area/Cell     Cell           Count  Area/Cell
----------------------------------------------------------------------------
X2000:AND2        21       0.25     X2000:AND3         1       0.50
X2000:INV         11       0.00     X2000:NAND2        8       0.25
X2000:OR2         17       0.25     X2000:XOR2         3       0.25
----------------------------------------------------------------------------
Total Cells : 61                    Total Area : 12.75

*********************************************
Netlist Statistics
*********************************************
Maximum level of gates = 11
Total number of nets   = 70
[Schematic built from AND2, AND3, INV, NAND2, OR2 and XOR2 gates with inputs A0..A3, B0..B3 and SEL, and outputs S0..S3.]
Figure 4.4: A Behavioural Design of the Adder/Subtracter for n = 4.

From the statistics of each design, it is important to note that the total area and the maximum level of gates differ. The structural description produces a small but slow design when compared to the behavioural description, which produces a fast but large design. A characteristic of the synthesiser is that a behavioural description maps to a structure by representing each output in terms of its inputs, much like a lookup table, and removes any structure. The synthesiser performs logic level optimisation on the structural description, thus producing a design with less logic.

The Viewlogic Synthesiser has the ability to alter the emphasis on speed or area when optimising a design. The statistics generated in the previous section were area optimised, and neglected the effect of gate delays. For example, optimising the behavioural design for speed, the synthesiser generates 14 more gates than before (75 rather than 61); however, there is a significant decrease in the maximum level of gates (9 rather than 11):
*********************************************
Gate Usage Summary
*********************************************
Cell           Count  Area/Cell     Cell           Count  Area/Cell
----------------------------------------------------------------------------
X2000:AND2        10       0.25     X2000:AND3         2       0.50
X2000:AND4         1       0.75     X2000:INV         15       0.00
X2000:NAND2       17       0.25     X2000:NAND3        1       0.50
X2000:NAND4        1       0.75     X2000:NOR3         2       0.50
X2000:NOR4         2       0.75     X2000:OR2         22       0.25
X2000:OR4          1       0.75     X2000:XOR2         1       0.25
----------------------------------------------------------------------------
Total Cells : 75                    Total Area : 18.75

*********************************************
Netlist Statistics
*********************************************
Maximum level of gates = 9
Total number of nets   = 84
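The totals reported by the synthesiser follow directly from the per-cell counts and areas, which makes the area and speed optimised behavioural results easy to compare. The sketch below simply re-adds the figures from the two behavioural gate usage summaries; the dictionary and function names are hypothetical.

```python
# (count, area per cell) for the behavioural adder/subtracter, area optimised
area_opt = {"AND2": (21, 0.25), "AND3": (1, 0.50), "INV": (11, 0.00),
            "NAND2": (8, 0.25), "OR2": (17, 0.25), "XOR2": (3, 0.25)}

# the same design optimised for speed
speed_opt = {"AND2": (10, 0.25), "AND3": (2, 0.50), "AND4": (1, 0.75),
             "INV": (15, 0.00), "NAND2": (17, 0.25), "NAND3": (1, 0.50),
             "NAND4": (1, 0.75), "NOR3": (2, 0.50), "NOR4": (2, 0.75),
             "OR2": (22, 0.25), "OR4": (1, 0.75), "XOR2": (1, 0.25)}

def totals(cells):
    """Return (total cell count, total area) for a gate usage summary."""
    count = sum(c for c, _ in cells.values())
    area = sum(c * a for c, a in cells.values())
    return count, area

print(totals(area_opt), totals(speed_opt))
```

The speed optimised netlist uses 14 more cells and 6.00 more area units in exchange for two fewer gate levels.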
The synthesiser can optimise small designs, but when the design grows large, the memory and processing power required to optimise it are considerable. The CORDIC unit contains three adder/subtracters and takes several minutes to compile and optimise. However, when this unit is integrated into a larger design of several units, the compiler has many problems and eventually crashes after half an hour of compilation. A way around this optimisation problem is to use a hierarchical flow and describe the components using behavioural or structural descriptions. Using this method the compiler knows nothing about large components and cannot perform any global optimisation. This is not a fully optimised solution, but it is currently the best available. However, it is possible to flatten the design below the top level, making the design slightly more efficient.
is that the LOOP variable inside the generate statement cannot be passed to any user defined function, procedure or entity. This is not stated in the manual, and the problem took many days to identify. The behavioural description is as follows:
    ARCHITECTURE behaviour OF adder IS
    begin
      cell_i : process (xi,xs,yi,ys,zi,ai)
        VARIABLE x_res: vlbit_vector(n downto 0);   -- temporary results
        VARIABLE y_res: vlbit_vector(n downto 0);
        VARIABLE z_res: vlbit_vector(k downto 0);
      begin
        x_res := zero(n downto 0);   -- initialise, unless comp complains
        y_res := zero(n downto 0);
        z_res := zero(k downto 0);
        if zi(k-1) = '0' then        -- z_i is positive
          x_res := add2c (xi, ys);
          y_res := sub2c (yi, xs);
          z_res := sub2c (zi, ai);
        else                         -- z_i is negative
          x_res := sub2c (xi, ys);
          y_res := add2c (yi, xs);
          z_res := add2c (zi, ai);
        end if;
        xip1 <= x_res (n-1 downto 0);
        yip1 <= y_res (n-1 downto 0);
        zip1 <= z_res (k-1 downto 0);
      end process;
    END behaviour;
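A bit-accurate software model of this cell is useful for generating test vectors. The sketch follows the add/subtract pattern of the listing above exactly, treats `xs` and `ys` as already shifted copies of x and y (an assumption based on their names), and models the final slice as wrap-around to n or k bits; `wrap` and `cordic_cell` are hypothetical names.

```python
def wrap(value, bits):
    """Keep the low `bits` bits and reinterpret them as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def cordic_cell(xi, xs, yi, ys, zi, ai, n=8, k=8):
    """Integer model of the behavioural cell: returns (xip1, yip1, zip1)."""
    if zi >= 0:                          # zi(k-1) = '0', z_i is positive
        x, y, z = xi + ys, yi - xs, zi - ai
    else:                                # z_i is negative
        x, y, z = xi - ys, yi + xs, zi + ai
    return wrap(x, n), wrap(y, n), wrap(z, k)

print(cordic_cell(100, 50, 20, 10, 30, 45))
print(cordic_cell(100, 50, 20, 10, -30, 45))
```

Because the outputs are truncated to the bus width, a sum that exceeds the signed range wraps around, which is one reason the choice of datapath width matters.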
The synthesiser generates the following statistics for an 8 bit version of the code. The maximum level of gates is 20, since each bit requires 2 levels, plus additional gates for the multiplexer and inversion.
*********************************************
Gate Usage Summary
*********************************************
Cell           Count  Area/Cell     Cell           Count  Area/Cell
----------------------------------------------------------------------------
X2000:AND2       159       0.25     X2000:AND3         3       0.50
X2000:INV         69       0.00     X2000:NAND2       76       0.25
X2000:OR2        125       0.25     X2000:XOR2         7       0.25
----------------------------------------------------------------------------
Total Cells : 439                   Total Area : 93.25

*********************************************
Netlist Statistics
*********************************************
Maximum level of gates = 20
Total number of nets   = 487
The structural description of the CORDIC unit is slightly more complex and is best represented pictorially, as shown in Figure (4.5). Each box in the figure represents a different VHDL entity (component), and some components are used more than once. The design is very bulky and it is easier to make mistakes.
[Block diagram of three parallel chains, each consisting of an INV (inv101.vhd), a 2-to-1 mux and a Full Adder (faf001.vhd): (zi, ai) produces zip1, (xi, ys) produces xip1, and (yi, xs) produces yip1.]
Figure 4.5: The structure of the CORDIC unit showing the various entities.

It achieves the same functionality as the behavioural description but requires a lot more effort to make sure all the connections are correct. As stated previously, the structural design will minimise area but will result in a slower design, as reflected by the following synthesiser statistics.
*********************************************
Gate Usage Summary
*********************************************
Cell           Count  Area/Cell     Cell           Count  Area/Cell
----------------------------------------------------------------------------
X2000:INV          3       0.00     X2000:NAND2      139       0.25
X2000:OR2         41       0.25     X2000:XOR2        75       0.25
----------------------------------------------------------------------------
Total Cells : 258                   Total Area : 63.75

*********************************************
Netlist Statistics
*********************************************
Maximum level of gates = 31
Total number of nets   = 306
Using the structural design will save about 30% on area (63.75 compared with 93.25) but will execute about 50% slower (31 gate levels compared with 20). In an FPGA implementation, speed might be more desirable than area optimisation since the devices operate relatively slowly when compared to a custom VLSI device. The extra area of the behavioural design will then be a relatively small concern.
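The two percentages can be reproduced from the synthesiser statistics for the 8 bit CORDIC unit. This worked calculation assumes that the maximum level of gates is an adequate proxy for delay, which ignores differences between individual gate delays.

```python
behavioural = {"area": 93.25, "levels": 20}
structural = {"area": 63.75, "levels": 31}

# Fraction of area saved by the structural description
area_saving = 1 - structural["area"] / behavioural["area"]

# Fractional increase in the longest gate path
extra_levels = structural["levels"] / behavioural["levels"] - 1

print(f"area saving: {area_saving:.1%}, extra gate levels: {extra_levels:.1%}")
```

The area saving works out at a little under 32% and the longest path grows by 55%, in line with the rounded figures quoted above.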
The rounding unit is formed by the interconnection of n half adders or, in behavioural terms, by the addition of the bit shifted out during the shifting process. Describing it structurally involves using the inc001 component, which contains an AND and an XOR gate to form a half adder. The interconnection of the inc001 components is:
    c(0) <= cin;                                 -- first carry
    connect: for i in 0 to n-1 generate
      addsub: inc001 port map( a(i), c(i), s(i), c(i+1) );
    end generate;
Or, a much simpler behavioural description is created using the unsigned addition routine addum. This avoids the sign extension used in the add2c routine.
    rounder : process (a,cin)
      VARIABLE res: vlbit_vector(n downto 0);    -- temporary results
    begin
      res := zero(n downto 0);   -- initialise, unless comp complains
      res := addum(a,cin);       -- use addum instead of add2c as it sign
                                 -- extends the cin input making it -1 not +1
      s <= res (n-1 downto 0);
    end process;
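Both views of the rounding unit can be sketched in software. `increment` mirrors the half adder chain bit by bit, and `shift_right_round` shows the arithmetic effect of adding back the last bit shifted out. Both names are hypothetical, and the use of an arithmetic (sign preserving) shift is an assumption about how the shifter feeds the rounding unit.

```python
def increment(a, cin, n):
    """Chain of n half adders: s(i) = a(i) xor c(i), c(i+1) = a(i) and c(i)."""
    carry = cin
    result = 0
    for i in range(n):
        a_i = (a >> i) & 1
        result |= (a_i ^ carry) << i
        carry = a_i & carry
    return result

def shift_right_round(value, shift):
    """Arithmetic shift right, then add the last bit that was shifted out."""
    if shift == 0:
        return value
    carry_in = (value >> (shift - 1)) & 1
    return (value >> shift) + carry_in

print(increment(0b0111, 1, 4), shift_right_round(7, 2), shift_right_round(-5, 1))
```

Adding the discarded bit turns truncation into rounding to the nearest value, so 7/4 = 1.75 becomes 2 rather than 1.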
It should be noted that the variables xis, yis, zis, xi, yi, and zi are large vectors containing several smaller vectors. This system had to be used since Viewlogic's VHDL cannot handle two-dimensional arrays of vlbit. The shifting of intermediate signals is done by the following function:
    FUNCTION shift_all (x : vlbit_vector (n*(k-1)-1 downto 0)) RETURN vlbit_vector IS
      VARIABLE x_s : vlbit_vector(n*(k-1)-1 downto 0) := zero(n*(k-1)-1 downto 0);
    BEGIN
      x_s(1*n-1 downto 0)   := shiftr2c(x( 1*n-1 downto 0   ),1);   -- 2 stage
      x_s(2*n-1 downto 1*n) := shiftr2c(x( 2*n-1 downto 1*n ),2);   -- 3 stage
      x_s(3*n-1 downto 2*n) := shiftr2c(x( 3*n-1 downto 2*n ),3);   -- 4 stage
      x_s(4*n-1 downto 3*n) := shiftr2c(x( 4*n-1 downto 3*n ),4);   -- 5 stage
      x_s(5*n-1 downto 4*n) := shiftr2c(x( 5*n-1 downto 4*n ),5);   -- 6 stage
      x_s(6*n-1 downto 5*n) := shiftr2c(x( 6*n-1 downto 5*n ),6);   -- 7 stage
      x_s(7*n-1 downto 6*n) := shiftr2c(x( 7*n-1 downto 6*n ),7);   -- 8 stage
      x_s(8*n-1 downto 7*n) := shiftr2c(x( 8*n-1 downto 7*n ),8);   -- 9 stage
      x_s(9*n-1 downto 8*n) := shiftr2c(x( 9*n-1 downto 8*n ),9);   -- 10 stage
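The packing scheme can be modelled with plain integers: one wide word holds several n-bit fields, and each field is shifted by a different amount. The sketch assumes that `shiftr2c` is a sign-extending (two's complement) shift right, which its name suggests but the text does not state; the Python function below is an illustration, not a translation of the Viewlogic routine.

```python
def shift_all(x, n, stages):
    """Shift field j (bits j*n .. j*n+n-1) of the packed word x right by j+1 places."""
    mask = (1 << n) - 1
    out = 0
    for j in range(stages):
        field = (x >> (j * n)) & mask
        # Reinterpret the n-bit field as a signed two's complement value.
        signed = field - (1 << n) if field >> (n - 1) else field
        shifted = (signed >> (j + 1)) & mask     # arithmetic shift, then re-pack
        out |= shifted << (j * n)
    return out

print(hex(shift_all(0x8040, n=8, stages=2)))
```

In this example the low field 0x40 (64) is halved to 0x20, and the high field 0x80 (-128) is divided by four to 0xE0 (-32).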
Next comes the connection of the init component, which is used to expand the convergence range of the CORDIC processor to $-190^\circ < z < 190^\circ$. The input signals x_in, y_in and z_in are connected to a unit similar to the CORDIC unit, except that an extra bit is appended to the alpha bus to account for the expanded convergence range.
    initial: init port map(
        xi   => X"00",
        xs   => x_in,
        yi   => X"00",
        ys   => y_in,
        zi   => z_in,
        ai   => B"0_0101_1010",   -- add/sub 90 degrees
        xip1 => xinit,            -- xinit = 0 +- yin
        yip1 => yinit,            -- yinit = 0 -+ xin
        zip1 => zinit );
The following code has been compressed to reduce detail; however, it can be seen that there are three separate stages: initial connection, intermediate connections, and final connection. These can be seen in Figure (4.6). (Not shown is the conditional generation of components, e.g. selection of behavioural or structural components, rounding units, etc.)
    connect: for i in 0 to k-1 generate          -- k stages
      ls_unit: if i=0 generate
        first_unit: adder port map( ... );
      end generate ls_unit;
      i_unit: if i>0 and i<k-1 generate
        x_round: round port map ( ... );
        y_round: round port map ( ... );
        middle_units: adder port map( ... );
      end generate i_unit;
      ms_unit: if i=k-1 generate
        x_round_last: round port map ( ... );
        y_round_last: round port map ( ... );
        last_unit: adder port map( ... );
      end generate ms_unit;
    end generate connect;
The contents of ... are similar to the port map of the init component. This represents a solution to the CORDIC problem and is close to an optimised solution, but due to compiler and language difficulties a completely optimised solution is not possible. Within these constraints the design has been optimised as far as possible. There are many choices to be made about the design of the CORDIC unit, chiefly whether it is to be area or speed efficient.
4.4.1 A Solution
Figure 4.6: The top level schematic of a 4 stage CORDIC processor with Increased Convergence Range and Rounding components.
[Schematic sheet "cordic", dated 24 Jul 94: ADDER and ROUND blocks connect the inputs X_IN7..0, Y_IN7..0 and A_IN8..0 to the outputs X_OUT7..0, Y_OUT7..0 and A_OUT7..0.]
The user can flexibly change the characteristics of the CORDIC processor by changing the value of a few constants to achieve more or less accuracy as well as different hardware configurations. Some of the CORDIC hardware statistics generated are:

Type          Internal Bus Width   Stages   Rounding   Number of Gates
Behavioural   12 bit               8        no         5841
Behavioural   12 bit               10       no         7139
Behavioural   12 bit               12       no         8437
Behavioural   8 bit                8        no         3753
Behavioural   8 bit                8        yes        5060
Structural    8 bit                8        no         2313
Structural    8 bit                8        yes        2775

Table 4.1: Some CORDIC hardware statistics. (Remember that there is also one additional stage for increasing the convergence range.)
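The figures in Table 4.1 can be compared directly. The short Python sketch below transcribes the table and derives two quantities from it: the gate cost of each additional stage and the gate cost of adding rounding. The helper names are hypothetical, and the derived numbers are simple differences of the tabulated values, not results reported in the original text.

```python
# Gate counts transcribed from Table 4.1:
# (type, bus width in bits, stages, rounding) -> number of gates
GATES = {
    ("behavioural", 12, 8, False): 5841,
    ("behavioural", 12, 10, False): 7139,
    ("behavioural", 12, 12, False): 8437,
    ("behavioural", 8, 8, False): 3753,
    ("behavioural", 8, 8, True): 5060,
    ("structural", 8, 8, False): 2313,
    ("structural", 8, 8, True): 2775,
}


def gates_per_stage(style, width, stages_a, stages_b):
    """Average gate cost of one extra stage between two unrounded designs."""
    delta = (GATES[(style, width, stages_b, False)]
             - GATES[(style, width, stages_a, False)])
    return delta / (stages_b - stages_a)


def rounding_overhead(style, width, stages):
    """Extra gates incurred by enabling rounding on an otherwise equal design."""
    return (GATES[(style, width, stages, True)]
            - GATES[(style, width, stages, False)])
```

On these figures each additional 12-bit behavioural stage costs 649 gates, both from 8 to 10 and from 10 to 12 stages, so the gate count grows linearly with the number of stages. Rounding adds 1307 gates to the 8-bit behavioural design but only 462 gates to the structural one.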
4.5 Improvements
There is one main improvement which could be made to the current design: the inclusion of pipelining registers between stages. If a Xilinx FPGA were used, no extra hardware for the latches would be necessary, as each cell contains a latch. Another possibility is to design a Word-Serial CORDIC architecture around the already designed CORDIC unit. This would only require an FSM driver along with some additional hardware for the variable shifter.
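The word-serial idea can be sketched in software as a loop that reuses a single CORDIC unit once per iteration, with the loop standing in for the FSM driver and the index-dependent factor 2^-i standing in for the variable shifter. The Python model below is a floating-point illustration only, for the circular mode with z driven to zero. The function names are hypothetical, and the update equations and direction rule (q_i = +1 if z_i < 0, otherwise -1) are an assumed form, chosen to agree with the direction rule listed in Table A.1.

```python
import math


def cordic_unit(x, y, z, i):
    """One circular micro-rotation that drives z towards zero.
    The multiplication by 2**-i models the variable shifter."""
    q = 1.0 if z < 0 else -1.0
    t = 2.0 ** -i
    return x + q * y * t, y - q * x * t, z + q * math.atan(t)


def word_serial_rotate(x, y, z, n_iter):
    """Apply the same unit n_iter times, as an FSM-driven design would."""
    for i in range(n_iter):
        x, y, z = cordic_unit(x, y, z, i)
    return x, y, z


def cordic_gain(n_iter):
    """Accumulated scale factor of n_iter circular micro-rotations."""
    k = 1.0
    for i in range(n_iter):
        k *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return k
```

Starting from x = 1/cordic_gain(16), y = 0 and z = 0.5, sixteen passes through the unit leave x and y close to cos(0.5) and sin(0.5), with the residual angle bounded by arctan(2^-15). A word-parallel design computes the same sequence with one hardware stage per loop iteration.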
Conclusion
The theory behind the CORDIC algorithm has been covered in detail and its possible applications in array imaging discussed. It was shown how Kota predicted the upper bound on CORDIC errors; however, simulations reveal CORDIC to be significantly more accurate. A normalisation scheme on the z datapath was introduced to maximise bus usage, and hence increase accuracy. This scheme can reduce the bus width required for z while achieving the same or greater accuracy. Also observed were the unexpected truncation errors introduced by the twos-complement binary format, which were minimised using a half adder to perform a rounding operation. A method for increasing the convergence range to $-180^\circ < \theta < 180^\circ$ in one extra iteration was also introduced.

Various CORDIC architectures were discussed and a word-parallel architecture described using the VHDL hardware description language. A few design issues and implementation problems were discussed, and it was concluded that a hierarchical design flow had to be followed to avoid compiler memory problems. The solution is not completely optimal, but the resulting design could easily be implemented on an FPGA. The VHDL design of the CORDIC processor could now be easily integrated into any application, as alterations in the configuration of the CORDIC processor can easily be achieved using the VHDL language.
$$\alpha_i = \begin{cases} \operatorname{atanh}(2^{-i}) & \text{if } m = -1 \\ 2^{-i} & \text{if } m = 0 \\ \arctan(2^{-i}) & \text{if } m = +1 \end{cases} \qquad (A.4)$$

The three values of m determine the class of function being evaluated: linear (m = 0), circular (m = +1) or hyperbolic (m = -1). The values of $\alpha_i$ were previously given for the circular functions mode, and a similar table could be given for the hyperbolic functions. The six modes of operation exist because there are two sub-classes available, depending upon whether the iterations seek to drive the variable y or z towards zero. Table (A.1) summarises all of the functions available with this configuration.
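Table A.1: The six CORDIC modes, with $i = \{0, 1, 2, \ldots, N-1\}$.

Direction of rotation:
  $z \to 0$:  $q_i = +1$ if $z_i < 0$;  $q_i = -1$ if $z_i \ge 0$
  $y \to 0$:  $q_i = +1$ if $x_i y_i \ge 0$;  $q_i = -1$ if $x_i y_i < 0$

Circular ($m = +1$)
  $z \to 0$:  $x_N$, $y_N$ [illegible]
  $y \to 0$:  $x_N = K_N \sqrt{x_{in}^2 + y_{in}^2}$;  $z_N$ [illegible]

Linear ($m = 0$)
  $z \to 0$:  $y_N$ [illegible];  $|z_{in}| \le 1$
  $y \to 0$:  $x_N = x_{in}$;  $z_N = z_{in} + y_{in}/x_{in}$;  $|y_{in}/x_{in}| \le 1$

Hyperbolic ($m = -1$)
  $z \to 0$:  $x_N$, $y_N$ [illegible];  $|z_{in}| \le 1.1182$
  $y \to 0$:  $x_N = K_N \sqrt{x_{in}^2 - y_{in}^2}$;  $z_N$ [illegible]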
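The modes of Table A.1 can be exercised with a small floating-point model. The Python sketch below is an illustration under stated assumptions, not the report's implementation: the update equations x' = x + m q y 2^-i, y' = y - q x 2^-i, z' = z + q alpha_i are an assumed form chosen to agree with the direction rules of the table (q_i = +1 if z_i < 0 when driving z to zero; q_i = +1 if x_i y_i >= 0 when driving y to zero). The index sequences are also assumptions: circular from i = 0, linear and hyperbolic from i = 1, with hyperbolic indices 4, 13, 40, ... repeated, a common convention for convergence. All function names are hypothetical.

```python
import math


def shift_sequence(m, n):
    """Assumed iteration indices for mode m (+1 circular, 0 linear, -1 hyperbolic)."""
    if m == 1:
        return list(range(n))
    if m == 0:
        return list(range(1, n + 1))
    seq, i, repeat = [], 1, 4
    while len(seq) < n:
        seq.append(i)
        if i == repeat:          # repeat i = 4, 13, 40, ... once
            seq.append(i)
            repeat = 3 * repeat + 1
        i += 1
    return seq[:n]


def alpha(m, i):
    """Elementary angle of equation (A.4)."""
    t = 2.0 ** -i
    if m == 1:
        return math.atan(t)
    if m == 0:
        return t
    return math.atanh(t)


def cordic(m, x, y, z, n=24, drive_y=False):
    """Run n iterations in mode m, driving y (drive_y=True) or z to zero."""
    for i in shift_sequence(m, n):
        if drive_y:
            q = 1.0 if x * y >= 0 else -1.0
        else:
            q = 1.0 if z < 0 else -1.0
        t = 2.0 ** -i
        x, y, z = x + m * q * y * t, y - q * x * t, z + q * alpha(m, i)
    return x, y, z


def gain(m, n=24):
    """Scale factor accumulated over the same index sequence (role of K_N)."""
    k = 1.0
    for i in shift_sequence(m, n):
        k *= math.sqrt(1.0 + m * 2.0 ** (-2 * i))
    return k
```

Under these assumptions, cordic(0, 2.0, 1.0, 0.0, drive_y=True) returns z close to 1/2 (a division), cordic(1, 3.0, 4.0, 0.0, drive_y=True) returns x close to gain(1) times 5 (a vector magnitude), and cordic(-1, 5.0, 3.0, 0.0, drive_y=True) returns x close to gain(-1) times 4, matching the y → 0 entries of the table.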
$$E_u = 2^{-n} + 3.5 \cdot 2^{-m}\, n \qquad (B.1)$$

This cannot be solved analytically, but a numerical solution (that is, for any given m) can be approximated graphically. The solution approximates to

$$\text{Internal Bus Width} = m \approx n + \log_2 n + 2 \qquad (B.2)$$

Hence, to obtain a precision of n bits, a CORDIC processor with $(n + \log_2 n + 2)$ bits and n iterations would be sufficient. This solution represents the upper bound of error.
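As a worked check of the bus-width rule, the Python sketch below evaluates it for a few precisions. It reads (B.1) as E_u = 2^-n + 3.5 n 2^-m and (B.2) as m ≈ n + log2 n + 2; both readings, the rounding up to whole bits, and the function names are assumptions made for this illustration.

```python
import math


def internal_bus_width(n):
    """Bus width m from m ~ n + log2(n) + 2, rounded up to whole bits."""
    return math.ceil(n + math.log2(n) + 2)


def error_upper_bound(n, m):
    """Upper bound E_u = 2^-n + 3.5 * n * 2^-m, as (B.1) is read here."""
    return 2.0 ** -n + 3.5 * n * 2.0 ** -m


def rounding_term(n, m):
    """The bus-width dependent part of the bound, 3.5 * n * 2^-m."""
    return 3.5 * n * 2.0 ** -m
```

For n = 8 the rule gives 8 + 3 + 2 = 13 bits, and for n = 16 it gives 22 bits. At these widths the rounding term stays below 2^-n, because log2(3.5) is a little under 2, so the total bound is less than 2 x 2^-n. Since this is an upper bound on the error, narrower buses may still give acceptable accuracy in practice.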
Bibliography
[1] J. E. Volder, "The CORDIC trigonometric computing technique," IRE Transactions on Electronic Computers, vol. EC-8, no. 3, pp. 330-334, 1959.
[2] Y. H. Hu, "CORDIC-based VLSI architectures for digital signal processing," IEEE Signal Processing Magazine, pp. 16-35, July 1992.
[3] F. Koscsis and J. Bohme, "Fast algorithms and parallel structures for form factor evaluation," The Visual Computer, no. 8, pp. 205-216, 1992.
[4] M. Kameyama, T. Amada, and T. Higuchi, "Highly parallel collision detection processor for intelligent robots," IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 500-506, 1992.
[5] M. O'Donnell et al., "Real-time phased array imaging using digital beam forming and autonomous channel control," Ultrasonics Symposium, pp. 1499-1502, 1990.
[6] G. Hampson and A. Paplinski, "Beamforming by interpolation," Tech. Rep. 93-12, Monash University, 1993.
[7] G. L. Haviland and A. A. Tuszynski, "A CORDIC arithmetic processor chip," IEEE Transactions on Computers, vol. C-29, no. 2, pp. 68-79, 1980.
[8] J. Duprat and J.-M. Muller, "The CORDIC algorithm: New results for fast VLSI implementation," IEEE Transactions on Computers, vol. 42, pp. 168-178, February 1993.
[9] A. Paplinski, "Array processor units for evaluating the exponential and logarithmic functions," Tech. Rep. TR-CS-82-07, The Australian National University, 1982.
[10] J. S. Walther, "A unified algorithm for elementary functions," Proceedings AFIPS Spring Joint Computer Conference, pp. 379-385, 1971.
[11] K. Kota and J. R. Cavallaro, "Numerical accuracy and hardware tradeoffs for CORDIC arithmetic for special-purpose processors," IEEE Transactions on Computers, vol. 42, pp. 769-779, July 1993.
[12] Experts, "IEEE Std 1076-1987, IEEE Standard VHDL Language Reference Manual," IEEE Computer Society, February 1992.
[13] ViewLogic, VHDL Reference Manual for Synthesis, Powerview 5.1.3 release ed.