A Thesis Report

On
FPGA IMPLEMENTATION OF DFT USING
CORDIC ALGORITHM

Submitted in the partial fulfillment of requirement for the award of the
Degree of
MASTER OF TECHNOLOGY
IN
VLSI DESIGN AND CAD

Submitted by
Vikas Kumar
60661004

Under the guidance of
Dr. Kulbir Singh
Assistant Professor

Electronics and Communication Engineering Department
Thapar University
Patiala-147004 (INDIA)
June, 2008




id3874406 pdfMachine by Broadgun Software - a great PDF writer! - a great PDF creator! - http://www.pdfmachine.com http://www.broadgun.com


Abstract

CORDIC is an acronym for COrdinate Rotation Digital Computer. It is a class of
shift adds algorithms for rotating vectors in a plane, which is usually used for the
calculation of trigonometric functions, multiplication, division and conversion
between binary and mixed radix number systems of DSP applications, such as Fourier
Transform. The Jack E. Volder's CORDIC algorithm is derived from the general
equations for vector rotation. The CORDIC algorithm has become a widely used
approach to elementary function evaluation when the silicon area is a primary
constraint. The implementation of CORDIC algorithm requires less complex
hardware than the conventional method.
In digital communication, the straightforward evaluation of the cited functions is
important, numerous matrix based adaptive signal processing algorithms require the
solution of systems of linear equations, the computation of eigenvalues, eigenvectors
or singular values. All these tasks can be efficiently implemented using processing
elements performing vector rotations. The (CORDIC) offers the opportunity to
calculate all the desired functions in a rather simple and elegant way. Due to the
simplicity of the involved operations the CORDIC algorithm is very well suited for
VLSI implementation. The rotated vector is also scaled making a scale factor
correction necessary. VHDL coding and simulation of selected CORDIC algorithm
for sine and cosine, the comparison of resultant implementations and the specifics of
the FPGA implementation has been discussed.
In this thesis, the CORDIC algorithm has been implemented in XILINX Spartan
3E FPGA kit using VHDL and is found to be accurate. It also contains the
implementation of Discrete Fourier Transform using radix-2 decimation-in-time
algorithm on the same FPGA kit. Due to the high speed, low cost and greater
flexibility offered by FPGAs over DSP processors the FPGA based computing is
becoming the heart of all digital signal processing systems of modern era. Moreover
the generation of test bench by Xilinx ISE 9.2i verifies the results.





LIST OF ACRONYMS
ASICs Application-Specific Integrated Circuits
CLBs Configurable Logic Blocks
CORDIC Cordic Rotation Digital Computer
DFT Digital Fourier Transform
DHT Digital Hartley Transform
DSP Digital Signal Processing
EVD Enhanced Versatile Disc
FFT Fast Fourier Transform
FPGA Field Programmable Gate Array
LUT Look Up Table
RAM Random Access Memory
ROM Read Only Memory
RTL Register Ttransfer Level
SRAM Static RAM
SVD Singular Value Deposition
ULP Unit in the Last Place
VHSIC Very High Speed Integrated Circuit
VHDL VHSIC Hardware Description Language
VLSI Very Large Scale Integration



















LIST OF FIGURES
Figure No. Title Page No.
2.1 Rotation of a vector V by the angle  6
2.2 Vector V with magnitude r and phase  7
2.3 A balance having  at one side and small weights at the
other side.
10
2.4 Inclined balance due to the difference in weight of two
sides
10
2.5
First three of 10 iteration leading from (
0 0
, x y ) to (
10
, 0 x )
rotating by +30
0

Rotation mode
18
3.1 Hardware elements needed for the CORDIC method 20
3.2 Circular, Linear and Hyperbolic CORDIC 25
3.3 Rotation of a vector V by the angle  26
3.4 Iterative vector rotation, initialized with V0 27
3.5 Iterative CORDIC 29
3.6 Unrolled CORDIC 31
3.7 Bit-serial CORDIC 33
3.8 A CORDIC-based Oscillator for sine generation 36
4.1 Basic butterfly computation in the decimation-in-time 41
4.2 Eight point decimation-in-time FFT algorithm. 41
4.3 FFT write read method 42
4.4 Fast Fourier Transform 42
5.1 FPGA Architecture 43
5.2 FPGA Configurable Logic Block 44
5.3 FPGA Configurable I/O Block 45
5.4 FPGA Programmable Interconnect 46
5.5 Design Flow of FPGA 49
5.6 Top-Down Design 53
5.7 Asynchronous: Race Condition 55
5.8 Synchronous: No Race Condition 55
5.9 Metastability - The Problem 57
6.1 Sine-Cosine value generated for input angle z0 59
LIST OF FIGURES
Figure No. Title Page No.
6.2 Sine-Cosine value generated for input angle z0 (integer
value
60
6.3 Real input/output waveforms of DFT using FFT algorithm 61
6.4 Top level RTL schematic for Sine Cosine 63
6.5 RTL schematic for Sine-Cosine 64
6.6 Top label RTL schematic of DFT 69
6.7 RTL schematic of DFT 69



































LIST OF TABLES
Table No. Title Page No.
2.1 For 8-bit Cordic hardware 9
2.2
Phase, magnitude, and CORDIC Gain for different values
of K
13
2.3
Approximate value of the function
-
arctan (2 )
i
i
 = , in
degree, for 0 9 i s s .
17
2.4 Choosing the signs of the rotation angles to force z to zero 18
3.1 Performance and CLB usage in an XC4010E 30
3.2
Performance and CLB usage for the bit-parallel and bit-
serial iterative designs.
36
6.1 Sine-Cosine value for input angle z0 60
6.2 Sine Cosine value for input angle z0 61
6.3 Real input/output values of DFT using FFT algorithm 62
6.4 Power summary 65
6.5 (a) Design summary of Sine-Cosine 66
(b) Design summary of Sine-Cosine 66
6.6 Advanced HDL Synthesis Report 67
6.7 Timing summary 68
6.8 Thermal summary 68
6.9 (a) Design summary for DFT 70
(b) Design summary for DFT 70
6.10 Advanced HDL Synthesis Report for DFT 71














CONTENTS

Certificate i
Acknowledgement ii
Abstract iii
Acronyms iv
List of Figures v
List of Tables vii
Chapter 1 INTRODUCTION 1 – 4
1.1 Preamble 1
1.2 Historical perspective 2
1.3 Thesis objective 4
1.4 Organization of thesis 4
1.5 Methodology 5
Chapter 2 CORDIC ALGORITHM 6 – 20
2.1 Basic equation of Cordic algorithm 6
2.2 Complex number representation of Cordic algorithm 11
2.2.1 Calculation of magnitude of complex number 14
2.2.2 Calculation of phase of complex number 15
2.2.3 Calculation of sine and cosine of an angle 15
2.3 Basic Cordic iterations 16
Chapter 3 COMPUTATION OF SINE COSINE 21 – 36
3.1 Cordic Hardware 22
3.2 Generalized Cordic 23
3.3 The CORDIC-Algorithm for Computing a Sine and Cosine 25
3.4 Implementation of various CORDIC Architectures 28
3.4.1 A Bit-Parallel Iterative CORDIC 29
3.4.2 A Bit-Parallel Unrolled CORDIC 30
3.4.3 A Bit-Serial Iterative CORDIC 32
3.5 Comparison of the Various CORDIC Architectures 34
3.6 Numerical Comparison 34
3.7 Other Considerations 35
3.8 Hardware Implementation 36
Chapter 4 CORDIC FOR DXT CALCULATION 37 – 42
4.1 Calculation of DXT using Cordic 38
4.2 FFT method for DFT calculation 41
Chapter 5 FPGA DESIGN FLOW 43 – 58
5.1 Field Programmable Gate Array (FPGA) 43
5.2 FPGA Architectures 43
5.2.1 Configurable Logic Blocks 44
5.2.2 Configurable Input - Output Blocks 44
5.2.3 Programmable Interconnect 45
5.2.4 Clock Circuitry 46
5.2.5 Small vs. Large Granularity 46
5.2.6 SRAM vs. Anti-fuse Programming 47
5.3 Design Flow 48
5.3.1 Writing a Specification 48
5.3.2 Choosing a Technology 50
5.3.3 Choosing a Design Entry Method 50
5.3.4 Choosing a Synthesis tool 50
5.3.5 Designing the chip 51
5.3.6 Simulating - design review 51
5.3.7 Synthesis 51
5.3.8 Place and Route 52
5.3.9 Resimulating - final review 52
5.3.10 Testing 52
5.4 Design Issues 53
5.4.1 Top-Down Design 53
5.4.2 Keep the Architecture in Mind 54
5.4.3 Synchronous Design 54
5.4.4 Race conditions 55
5.4.5 Metastability 57
5.4.6 Timing Simulation 58
Chapter 6 RESULTS AND DISCUSSIONS 59-72
6.1 ModelSim Simulation Results 59
6.1.1 For sine-cosine binary input and binary output 59
6.1.2 For sine-cosine real input and real output 61
6.1.3 For DFT using FFT Algorithm 60
6.2 XILINX simulation results 63
6.3 Discussions 72
Chapter 7 CONCLUSION 73
REFRENCES 74









































CHAPTER 1
INTRODUCTION

The CORDIC algorithm was first introduced by Jack E. Volder [1] in the year
1959 for the computation of Trigonometric functions, Multiplication, Division, Data
type conversion, Square Root and Logarithms. It is a highly efficient, low-complexity,
and robust technique to compute the elementary functions. The basic Algorithm
structure is described in [2]. Other information about CORDIC Algorithm and
different Issues are in [3]. The CORDIC algorithm has found its way in various
applications such as pocket calculators, numerical co-processors, to high performance
radar signal processing, supersonic bomber aircraft with a digital counterpart.
Bekooij, Huisken, and Nowak research tells about the application of CORDIC in the
computation of the (Fast Fourier Transform) FFT, and at the effects on the numerical
accuracy.

1.1 Preamble

CORDIC stands for COordinate Rotation DIgital Computer. It calculates the
value of trigonometric functions like sine, cosine, magnitude and phase (arctangent)
to any desired precision. It can also calculate hyperbolic functions (such as sinh, cosh,
and tanh). The CORDIC algorithm does not use Calculus based methods such as
polynomial or rational function approximation. It is used as approximation function
values on all popular graphic calculators, including HP-48G as the hardware
restriction of calculators require that the elementary functions should be computed
using only additions, subtractions, digit shifts, comparisons and stored constants.
Today cordic algorithm is used in Neural Network VLSI design [4], high performance
vector rotation DSP applications [5], advanced circuit design, optimized low power
design. CORDIC algorithm revolves around the idea of "rotating" the phase of a
complex number, by multiplying it by a succession of constant values. However, the
"multiplies" can all be powers of 2, so in binary arithmetic they can be done using just
shifts and adds; no actual "multiplier" is needed thus it simpler and do not require
complex hardware structure as in the case of multiplier. Earlier methods used are
Table look up method, Polynomial approximation method etc. for evaluation of
trigonometric functions. It is hardware efficient algorithm. No multiplier requirement
as in the case of microcontroller. The drawback in CORDIC is that after completion
of each iteration, there is a gain which is added to the magnitude of resulting vector
which can be removed by multiplying the resulting magnitude with the inverse of the
gain. There are two ways in CORDIC algorithm for calculation of trigonometric and
other related functions they are rotation mode and vectoring mode. Both methods
initialize the angle accumulator with the desired angle value. The rotation mode,
determines the right sequence as the angle accumulator approaches zero while the
vectoring mode minimizes the y component of the input vector. CORDIC is generally
faster than other approaches when a hardware multiplier is unavailable (e.g. in a
microcontroller), or when the number of gates required to implement is to be
minimized (e.g. in an FPGA). On the other hand, when a hardware multiplier is
available (e.g. in a DSP microprocessor), table-lookup methods and power series are
generally faster than CORDIC. Various CORDIC architectures like bit parallel
iterative CORDIC, a bit parallel unrolled CORDIC, a bit-serial iterative CORDIC and
the comparison of various CORDIC architecture has been discussed. It can be seen
that CORDIC is a feasible way to approximate cosine and sine. CORDIC is useful in
designing computing devices. As it was originally designed for hardware applications,
there are features that make CORDIC an excellent choice for small computing
devices. Since it is an iterative method it has the advantage over the other methods of
being able to get better accuracy by doing more iteration, where as the Taylor
approximation and the Polynomial interpolation methods need to be rederived to get
better results. These properties, in addition to getting a very accurate approximation
is perhaps the reason why CORDIC is used in many scientific calculators today. Due
to the simplicity of the involved operations the CORDIC algorithm is very well suited
for VLSI implementation. However, the CORDIC iteration is not a perfect rotation
which would involve multiplications with sine and cosine. The rotated vector is also
scaled making a scale factor correction necessary.

1.2 Historical perspective

CORDIC algorithm has found its way in many applications. The CORDIC was
introduced in 1956 by Jack Volder as a highly efficient, low-complexity, and robust
technique to compute the elementary functions. It is initially intended for navigation
technology, the CORDIC algorithm has found its way in a wide range of applications,
ranging from pocket calculators, numerical co-processors, to high performance radar
signal processing. After invention CORDIC worked as the replacement for the analog
navigation computers aboard the B-58 supersonic bomber aircraft with a digital
counterpart. The CORDIC airborne navigational computer built for this purpose,
outperformed conventional contemporary computers by a factor of 7, mainly due to
the revolutionary development of the CORDIC algorithm. Further Steve Walther [6]
continues work on CORDIC, with the application of the CORDIC algorithm in the
Hewlett-Packard calculators, such as the HP-9100 and the famous HP-35 in year
1972, the HP-41C in year1980. He told how the unified CORDIC algorithm i.e.
combining rotations in the circular, hyperbolic, and linear coordinate systems and
how it was applied in the HP-2116 floating-point numerical co-processor. Today’s
fast rotation techniques are closely related to CORDIC, to perform orthonormal
rotation at a very low cost. Although fast rotations exist for certain angles only, they
are sufficiently versatile, and have already been widely applied in signal processing.
Hekstra found a large range of known, and previously unknown, fast rotation
methods. An overall evaluation of the methods exposes the trade-offs that exist
between the angle of rotation, the accuracy in scaling and the cost of rotation. Van der
Kolk, Deprettere, and Lee [7] formalized the problem of (approximate) vectoring for
fast rotations in year 2000. They treated the fast and efficient selection of the
appropriate fast rotation, and showed the advantage to be gained when applied to
Enhanced Versatile Disc (EVD). The selection technique works equally well for
redundant arithmetic and floating-point computations. Antelo, Lang, and Bruguera[8]
considers going to a higher radix than the radix-2 for the classical algorithm, so that
less iterations are required. The choice of a higher radix implies that the scaling factor
is no longer constant. The authors propose an on-line calculation of the logarithm of
the scale factor and subsequent compensation. Hsiao, Lau, and Delosme [9]
considered multi-dimensional variants of CORDIC, such as the 4-D (dimention)
householder CORDIC transform, and their application to singular value deposition
(SVD). Rather than building a multi-dimensional transform out of a sequence of 2-D
(dimention) CORDIC operations, they proposed to work with multi-dimensional
micro-rotations, immediately at the iteration level. Their method is evaluated and
benchmarked against solutions by others. Kwak, Choi, and Swartzlander [10] aimed
to overcome the critical path in the iteration through sign prediction and addition.
They proposed to overlap the sign prediction with the addition, by computing the
results for both outcomes of the sign, and to select the proper one at the very end of
the iteration. Novel in their approach is to combine the adder logic for the
computation of both results.

1.3 Thesis objective

Based on the above discussion the thesis has following objectives:
 To study and implement CORDIC algorithm using VHDL programming
code.
 To implement DXT using CORDIC algorithm in VHDL code.
 To implement CORDIC algorithm on XILINX SPARTAN 3E kit.

1.4 Organization of thesis

Chapter 2 discusses basics of CORDIC algorithm, how it came into picture, its
basic equations, different mode of operation i.e. rotation mode and vectoring mode,
gain factor, complex form of representation, CORDIC iteration and how it works.
Chapter 3 discusses about the calculation of sine-cosine using CORDIC algorithm,
different architectures to perform CORDIC iteration and their block diagram, their
comparison on the basis of their complexity, speed, area required to implement in
chip designing, number of iteration required etc. Chapter 4 discusses use of CORDIC
algorithm for calculating DFT, DHT, calculation of DFT using FFT algorithm, basic
equations used. Chapter 5 tells about the design flow of XILINX FPGA. This chapter
includes FPGA architecture, its logic blocks and different families of FPGA, their
specification, technology used, placement and routing, testing and design issues.
Chapter 6 contains the results of simulation using ModelSim and XILINX. The thesis
concludes in chapter 7 which also discusses future scope of work.

1.5 Methodology

In this thesis, VHDL programming has been used to implement CORDIC
algorithm (to calculate Sine and Cosine value for a given angle) and DFT (Digital
Fourier Transform) and DHT (Discrete Fourier Transform). Further XILINX
SPARTAN 3E kit is used for FPGA implementation of the generated VHDL code.
Programming tools used for the implementations are :
 Operating system WINDOWS XP
 ModelSim SE PLUS 5.5c
 XILINX 9.2i
 FPGA kit SPARTAN 3E


























CHAPTER 2
CORDIC ALGORITHM

In 1959 Jack E. Volder [1] described the Coordinate Rotation Digital Computer
or CORDIC for the calculation of trigonometric functions, multiplication, division
and conversion between binary and mixed radix number systems. The CORDIC-
algorithm provides an iterative method of performing vector rotations by arbitrary
angles using only shift and add. In this chapter, it is described that how CORDIC
algorithm works and how it can be understood more clearly.

2.1 Basic equation of CORDIC algorithm
Volder's algorithm is derived from the general equations for a vector rotation. If
a vector V with coordinates ( , ) x y is rotated through an angle  then a new vector
'
V can be obtained with coordinates
' '
( , ) x y where
'
x and
'
y can be obtained using x,
y and  by the following method.
X = r cos , Y = r sin (2.1)

'
'
'
.cos( ) .sin( )
.cos( ) .sin( )
x x y
V
y x
y
 
 
| | ÷ | |
= =
|
|
|
+
\ .
\ .
(2.2)

Figure 2.1: Rotation of a vector V by the angle 

Let’s find how equation 2.1 and 2.2 came into picture. As shown in the figure
2.1, a vector V ( , ) x y can be resolved in two parts along the x - axis and y -axis
as cos r  and sin r  respectively. Figure 2.2 illustrates the rotation of a vector
x
V
y
| |
=
|
\ .
by the angle .


Figure 2.2: Vector V with magnitude r and phase 

i.e.
cos
sin
x r
y r


= ¹
`
=
)
. (2.3)
Similarly from figure 2.1 it can be seen that vector V and
'
V can be resolved in
two parts. Let V has its magnitude and phase as r and respectively and
'
V has its
magnitude and phase as r and
'
 where
'
V came into picture after anticlockwise
rotation of vector V by an angle . From figure 2.1 it can be observed

'
   ÷ = (2.4)
i. e.
'
   = + (2.5)

'
OX =
'
x =
'
cos r 
= cos( ) r   +
= (cos .cos sin .sin ) r     ÷
=( .cos ) cos ( .sin ) sin r r     ÷ (2.6)
Using figure 2.2 and equation 2.3
'
OX can be represented as

'
OX =
'
x = cos sin x y   ÷ (2.7)
Similarly,
'
OY

'
OY =
'
y = cos sin y x   + (2.8)
Similarly, value for the vector
'
V in the clockwise direction rotating the vector
V by the angle  and the equations obtain in this case be

'
cos sin x x y   = + (2.9)

'
sin cos y x y   = ÷ (2.10)
The equations (2.7), (2.8), (2.9), (2.10) can be represented in the matrix form as

'
'
cos sin
sin cos
x x
y
y
 
 
| |
| || |
=
|
| |
|
±
\ .\ .
\ .

(2.11)
The individual equations for x

and
'
y can be rewritten as [11]:

'
x = .cos( ) .sin( ) x y    (2.12)

'
y = .cos( ) .sin( ) y x   ± (2.13)
Volder observed that by factoring out a cos from both sides, resulting equation
be in terms of the tangent of the angle , the angle of which we want to find the sin
andcos . Next if it is assumed that the angle è is being an aggregate of small angles,
and composite angles is chosen such that their tangents are all inverse powers of two,
then this equation can be rewritten as an iterative formula.

'
cos ( tan ) x x y   =  (2.14)

'
cos ( tan ) y y x   = ± (2.15)
'
z z  = ± , here  is the angle of rotation ( ± sign is showing the direction of rotation)
and z is the argument. For the ease of calculation here only rotation in anticlockwise
direction is observed first. Rearranging the equation (2.7) and (2.8).

'
cos( )[ . tan( )] x x y   = ÷ (2.16)

'
cos( )[ . tan( )] y y x   = + (2.17)
The multiplication by the tangent term can be avoided if the rotation angles and
therefore tan( )  are restricted so that tan( ) 2
i

÷
= .In digital hardware this denotes a
simple shift operation. Furthermore, if those rotations are performed iteratively and in
both directions every value of tan( )  is representable. With arctan(2 )
i

÷
= the cosine
term could also be simplified and since cos( ) cos( )   = ÷ it is a constant for a fixed
number of iterations. This iterative rotation can now be expressed as:

i+1
x [ . .2 ]
i
i i i i
k x y d
÷
= ÷ (2.18)

i+1
[ . .2 ]
i
i i i i
y k y x d
÷
= + (2.19)
where, i denotes the number of rotation required to reach the required angle of the
required vector, cos(arctan(2 ))
i
i
K
÷
= and 1
i
d = ± . The product of the
i
K 's represents
the so-called K factor [6]:

1
0
n
i
i
k k
÷
=
=
I
(2.20)
where
1 n
i
i o
k
÷
=
I
=
0 1 2 3 4 1
cos cos cos cos cos .................cos
n
     
÷
( is the angle of
rotation here for n times rotation). The above rotations requirement and adding and
subtracting of the different  can be understood by the following example of a
balance.

Table 2.1: For 8-bit CORDIC hardware.
i
2 tan
i i
i
d 
÷ ÷
= = arctan(2 )
i
i

÷
=
i
 in radian
0 1
o
45
0.7854
1 0.5
o
26.565
0.4636
2 0.25
o
14.036
0.2450
3 0.125
o
7.125
0.1244
4 0.0625
o
3.576
0.0624
5 0.03125
o
1.7876
0.0312
6 0.015625
o
0.8938
0.0156
7 0.0078125
o
0.4469
0.0078

i
k is the gain and it’s value changes as the number of iteration increases. For 8-bit
hardware CORDIC approximation method the value of
i
k as

7
0 1 2 3 4 5 6 7
0
cos cos .cos .cos .cos .cos .cos .cos .cos
i i
i
k         
=
= =
I

=
o o o
cos 45 .cos 26.565 .............................cos 0.4469
= 0.6073 (2.21)
From the above table it can be seen that precision up to
o
0.4469 is possible for 8-bit
CORDIC hardware. These
i
 are stored in the ROM of the hard ware of the CORDIC
hardware as the look up table. Now by taking an example of balance it can be
understood that how the CORDIC algorithm works.

Figure 2.3: A balance having  at one side and small weights (angle) at the other
side.

In the above figure, first of all, keep the input angle  on the left pan of balance
and if the balance rotates around the anticlockwise direction then add the highest
value in the table at the other side.

Figure 2.4: Inclined balance due to the difference in weight of two sides.

Then, if balance shows a left inclination as in figure 2.4 (a) then other weights
are required to add in the right pan or in the term of angle if  is greater than total
i

then add other weights to reach as near to the  as possible but in other hand if the
balance shows a right inclination as in figure 2.4 (b) then a weight required to be
removed from the right pan or in the term of angle if  is less than total
i
 then we
subtract other weights this process is repeated to reach as near to the  as possible.
Matrix representation of the CORDIC algorithm for 8-bit hardware:
1
1
cos sin
sin cos
i i i i
i i i i
x x
y y
 
 
+
+
| | | | | |
=
| | |
±
\ . \ . \ .

(2.22)
1 0 0 1 1
0 0 1 1 1
cos sin cos sin
.........
sin cos sin cos
i
i
x
y
   
   
+
+
| | | || |
=
| | |
± ±
\ . \ . \ .
 

7 7
7 7
cos sin
........
sin cos
x
y
 
 
| || |
| |
±
\ . \ .

(2.23)

1 0 1
0 1 7
0 1 1
1 tan 1 tan
cos cos .........cos
tan 1 tan 1
i
i
x
y
 
  
 
+
+
| | | || |
=
| | |
± ±
\ . \ . \ .
 


7
7
1 tan
................
tan 1
x
y


| || |
| |
±
\ . \ .



(2.24)
Thus, Scale Factor =
0 1 7
cos cos ............cos   
= 0.6073
=
1
1.6466
(2.25)
It can be seen from equation (2.22) that cosine and sine of an angle  can be
represented in the matrix form as

0 7 1
0 7 1
1 tan 1 tan 1 tan cos 0.6073
.......
tan 1 tan 1 tan 1 sin 0
   
   
| | | | | | | | | |
=
| | | | |
± ± ±
\ . \ . \ . \ . \ .
  

(2.26)
2.2 Complex number representation of CORDIC algorithm

Let a given complex number B having its real and imaginary part as,
b
I and
b
Q
respectively.
Given complex number,
b b
B I jQ = + (2.27)
Let rotated complex number is,
' '
'
b b
B I jQ = + (2.28)
It results due to multiplication of a rotation value r where r is
r r
R I jQ = + when a
pair of complex numbers is multiplied, their phases (angles) adds and their
magnitudes multiply. Similarly, when one complex number is multiplied by the
conjugate of the other, the phase of the conjugated is subtracted (though the
magnitudes still multiply).
Therefore to add R’s phase to B,
'
. B B R = i. e.
'
. .
b r b r
b
I I I Q Q = ÷ and
'
. .
b r b r
b
Q Q I I Q = + (2.29)
To subtract R’s phase from B,
'
. B B R = where R is the conjugate complex of R
'
. .
b r b r
b
I I I Q Q = + and
'
. .
b r b r
b
Q Q I I Q = ÷ (2.30)
Thus to rotate by +90 degrees, multiply by R = 0 + j1. Similarly, to rotate by -90
degrees, multiply by R = 0 - j1. Going through the above process, the net effect is: To
add 90 degrees, multiply by 0 1 R j = + : results in
'
b
b
I Q = ÷ and
'
b
b
Q I = .To
subtract 90 degrees, multiply by 0 - 1 R j = : results in
'
b
b
I Q = and
'
b
b
Q I = ÷ .To
rotate by phases of less than 90 degrees, given complex number is multiplied by
numbers of the form " 1 /- R jK = + " where K will be decreasing powers of two,
starting with 2^0 = 1.0. Therefore, K = 1.0, 0.5, 0.25, etc. (here the symbol " i " is used
to designate the power of two itself: 0, -1, -2, etc.). Since the phase of a complex
number " I jQ + " is arctan( / ) Q I , the phase of "1 jK + " is arctan( ). K Likewise,
the phase of "1 - jK " is arctan(- ) - arctan( ) K K = .To add phases " 1 R jK = + " is
used; to subtract phases " 1 - R jK = " is used. Since the real part of this,
r
I , is equal
to 1, we can simplify our table of equations to add and subtract phases for the special
case of CORDIC multiplications to:
To add a phase, multiply by 1 R jK = + :

'
. (2 ).
i
b b b b
b
I I K Q I Q
÷
= ÷ = ÷ (2.31)

'
. (2 ).
i
b b b b
b
Q Q K I Q I
÷
= + = + (2.32)
To subtract a phase, multiply by 1 - R jK = :

'
. (2 ).
i
b b b b
b
I I K Q I Q
÷
= + = + (2.33)

'
. (2 ).
i
b b b b
b
Q Q K I Q I
÷
= ÷ = ÷ (2.34)
Let's look at the phases and magnitudes of each of these multiplier values to get
more of a feel for it. The table below lists values of L, starting with 0, and shows the
corresponding values of K, phase, magnitude, and CORDIC Gain (in table 2.2). Since
CORDIC uses powers of 2 for the K values, multiplication and division can be done
only by shifting and adding binary numbers. That's why the CORDIC algorithm
doesn't need any multiplies. Also it can be seen that starting with a phase of 45
degrees, the phase of each successive R multiplier is a little over half of the phase of
the previous R. That's the key to understanding CORDIC: here what is done that a
"binary search" on phase by adding or subtracting successively smaller phases to
reach some "target" phase.




Table 2.2: Phase, magnitude, and CORDIC Gain for different values of K

i
2
i
K
÷
=

1 R jK = +
ZR (in degree)
=tan
-1
(K)
. Mag R CORDIC gain
0 1.0 1 + j1.0 45.00000 1.41421356 1.414213562
1 0.5 1 + j0.5 26.56505 1.11803399 1.581138830
2 0.25 1 + j0.25 14.03624 1.03077641 1.629800601
3 0.125 1 + j0.125 7.12502 1.00778222 1.642484066
4 0.0625 1 + j0.0625 3.57633 1.00195122 1.645688916
5 0.03125 1 + j0.031250 1.78991 1.00048816 1.646492279
6 0.015625 1 + j0.015625 0.89517 1.00012206 1.646693254
7 0.007813 1 + j0.007813 0.44761 1.00003052 1.646743507
..
.

... ... ... ... ...

The sum of the phases in the table up to L = 3 exceeds 92 degrees, so a complex
number can be rotated by +/- 90 degrees by doing four or more " 1 /- R jK = + "
rotations. Putting that together with the ability to rotate +/-90 degrees using
" 0 /- 1 R j = + ", and it can be rotated a full +/-180 degrees. Each rotation has a
magnitude greater than 1.0. That isn't desirable, but it's the price which is to pay for
using rotations of the form "1 + jK ". The "CORDIC Gain" column in the table is
simply a "cumulative magnitude" calculated by multiplying the current magnitude by
the previous magnitude. Noticing that it converges to about 1.647; however, the actual
CORDIC Gain depends on how many iterations have been done. (It doesn't depend on
whether the phases are being added or subtracted, because the magnitudes multiply
either way.)

2.2.1 Calculation of magnitude of complex number

The magnitude of a complex number
b b
B I jQ = + can be calculated if it is
rotated to have a phase of zero; then its new
b
Q value would be zero, so the
magnitude would be given entirely by the new
b
I value.
 It can be determined whether or not the complex number " B " has a positive phase
just by looking at the sign of the "
b
Q " value: positive
b
Q means positive phase.
As the very first step, if the phase is positive, rotate it by -90 degrees; if it's
negative, rotate it by +90 degrees. To rotate by +90 degrees, just negate
b
Q , then
swap
b
I and
b
Q ; to rotate by -90 degrees, just negate
b
I , then swap. The phase of
B is now less than +/- 90 degrees, so the "1 /- jK + " rotations to follow can
rotate it to zero.
 Next, do a series of iterations with successively smaller values of K, starting with
K=1 (45 degrees). For each iteration, simply look at the sign of
b
Q to decide
whether to add or subtract phase; if
b
Q is negative, add a phase (multiplying by
"1 jK + "); if
b
Q is positive, subtract a phase (multiplying by "1 - jK "). The
accuracy of the result converges with each iteration, as the more iteration is done,
the more accurate its results.
Now, having rotated the complex number to have a phase of zero, it end up with
" B 0
b
I j = + ". The magnitude of this complex value is just "
b
I " (since "
b
Q " is
zero.) However, in the rotation process, B has been multiplied by a CORDIC Gain
(cumulative magnitude) of about 1.647. Therefore, to get the true value of magnitude
it must be multiplied by the reciprocal of 1.647, which is 0.607. (The exact CORDIC
Gain is a function of the how much iteration is done.) Unfortunately, gain-adjustment
multiplication can't be done using a simple shift/add; however, in many applications
this factor can be compensated in some other part of the system. Or, when "relative
magnitude" is all that counts (e.g. AM demodulation), it can simply be neglected.

2.2.2 Calculation of phase of complex number

For calculation of phase, complex number is rotated to have zero phase, as done
previously to calculate the magnitude.
 For each phase-addition/subtraction step, accumulating the actual number of
degrees (or radians) for which it is rotated. The "actual" will come from a table of
" arctan( ) K " values like the "Phase of R" column in the table 2.2. The phase of the
complex input value is the negative of the accumulated rotation required to bring
it to a phase of zero.

2.2.3 Calculation of sine and cosine of an angle
 Starting with a unity-magnitude value of
b b
B I jQ = + . The exact value
depends on the given phase. For angles greater than +90, B should be started with
0 1 B j = + (that is, +90 degrees); for angles less than -90, B should be started
with 0 - 1 B j = (that is, -90 degrees); for other angles, B should be started with
1 0 B j = + (that is, zero degrees). Initializing an "accumulated rotation"
variable to +90, -90, or 0 accordingly. (Of course, it should be done in terms of
radians).
 Doing a series of iterations. If the desired phase minus the accumulated rotation is
less than zero then add the next angle in the table; otherwise, subtract the next
angle. Doing this using each value in the table.
 The "cosine" output is in "
b
I "; the "sine" output is in "
b
Q ".






2.3 Basic CORDIC iterations

To simplify each rotation, picking
i
 (angle of rotation in ith iteration) such that
i
 =
i
d . 2
i ÷
.
i
d is such that it has value +1 or -1 depending upon the rotation i. e.
i
d e{+1,-1} .Then
1
2
i
i i i i
x x d y
÷
+
= ÷ (2.35)
1
2
i
i i i i
y y d x
÷
+
= + (2.36)
1
1
tan 2
i
i i i
z z d
÷ ÷
+
= ÷ (2.37)
The computation of
1 i
x
+
or
1 i
y
+
requires an i-bit right shift and an add/subtract. If
the function
1
tan 2
i ÷ ÷
is pre computed and stored in table (Table 3.1) for different
values of i, a single add/subtract suffices to compute
1 i
z
+
. Each CORDIC iteration thus
involves two shifts, a table lookup and three additions.
If the rotation is done by the same set of angles (with + or- signs), then the expansion
factor K, is a constant, and can be pre computed. For example to rotate by 30 degrees,
the following sequence of angles be followed that add up to ~30 degree.

30.0 ~ 45.0 - 26.6 + 14.0 - 7.1 + 3.6 + 1.8 -0.9 + 0.4 - 0.2 + 0.1
=30.1
In effect, what actually happens in CORDIC is that z is initialized to 30 degree
and then, in each step, the sign of the next rotation angle is selected to try to change
the sign of z; that is,
i
d =sign (
i
z ) is chosen, where the sign function is defined to be -
1 or 1 depending on whether the argument is negative or nonnegative. This is
reminiscent of no restoring division. Table3.2 shows the process of selecting the signs
of the rotation angles for a desired rotation of +30 degree. Figure 3.1 depicts the first
few steps in the process of forcing z to zero.





Table 2.3: Approximate value of the function
-
arctan (2 )
i
i
 = , in degree,
for 0 9 i s s .
i
i

0 45
1 26.6
2 14
3 7.1
4 3.6
5 1.8
6 0.9
7 0.4
8 0.2
9 0.1

In CORDIC terminology the preceding selection rule for
i
d , which makes z converge
to zero, is known as rotation mode. Rewriting the CORDIC iteration, where
1
tan 2
i
i

÷ ÷
= :
1
2
i
i i i i
x x d y
÷
+
= ÷ (2.38)
1
2
i
i i i i
y y d x
÷
+
= + (2.39)
1
i
i i i
z z d 
+
= ÷ (2.40)
After m iteration in rotation mode, when z (m) is sufficiently close to zero. we have
i
z  =
¯
, and the CORDIC equations become:
( cos sin )
m
x k x z y z = ÷ (2.40)
( cos sin )
m
y k y z x z = + (2.41)
0
m
z = (2.42)
Rule: choose { 1,1}
i
d e ÷ such that 0 z ÷
The constant K in the preceding equation is k = 1.646760258121…… Thus, to
compute cos z and sin z, one can start with x = 1/K = 0.607252935….. and y = 0.
then, as
m
z tends to 0 with CORDIC iterations in rotation mode,
m
x and
m
y converge
to cos z and sin z, respectively. Once sin z and cos z are known, tan z can be through
necessary division.

Table 2.4: Choosing the signs of the rotation angles to force z to zero

i
i i
z  ÷
1 i
z
+

0 + 30.0 – 45.0
-15
1
- 15.0 + 26.6 11.6
2
+ 11.6 - 14.0 -2.4
3
- 2.4 + 7.1 4.7
4
+ 4.7 – 3.6 1.1
5
+ 1.1 – 1.8 -0.7
6
- 0.7 + 0.9 0.2
7
+ 0.2 – 0.4 -0.2
8
- 0.2 + 0.2 0
9
+ 0.0 – 0.1 -0.1




Figure 2.5: First three of 10 iteration leading from (
0 0
, x y ) to (
10
, 0 x ) in rotating by
+30
0
,
Rotation mode.
For k bits of precision in the resulting trigonometric functions, k CORDIC
iterations are needed. The reason is that for large i it can be approximated that
tan 2 2
i i i ÷ ÷ ÷
~ . Hence, for i > k, the change in the z will be less than ulp (Unit in the
Last Place).
In the rotation mode, convergence of z to zero is possible because each angle in
table 3.1 is more than half the previous angle or, equivalently, each angle is less than
the sum of the entire angle following it. The domain of convergence is -99.7 < z
<99.7, where99.7 is the sum of all the angles in table 3.1. Fortunately, this range
includes angle from -90 to +90, or [
2 2
to
 
÷ + ] in radians. For outside the preceding
range, trigonometric identities can be converted to the problem, to one that is within
the domain of convergence:
cos( 2 ) cos z j z  ± = (2.43)
sin( 2 ) sin z j z  ± = (2.44)
cos( ) cos z z  ÷ = ÷ (2.45)
sin( ) sin z z  ÷ = ÷ (2.46)
Noting that these transformations become particularly convenient if angles are
represented and manipulated in multiples of  radians, so that z = 0.2 really means z
= 0.2 radian or converted to numbers within the domain quite easily.
In the second way of utilizing CORDIC iterations, known as “vectoring mode,” y is
made nearer to zero by choosing
i
d ( )
i i
sign x y = ÷ . After m iterations in vectoring
mode tan( )
i
y
x
 = ÷
¯
.
This means that:
[ cos( ) sin( )]
m i i
x k x y   = ÷
¯ ¯
(2.47)
2 1/ 2
( tan( )) /[1 tan ( )]
m i i
x k x y   = ÷ +
¯ ¯
(2.48)
2 2 2
( / ) /(1 / )
m
x k x y x y x = + + (2.49)
2 2 1/ 2
( )
m
x k x y = + (2.50)



The CORDIC equations thus become

2 2 1/ 2
( )
m
x k x y = + (2.51)
0
m
y = (2.52)
1
tan ( / )
m
z z y x
÷
= + (2.53)
Rule: Choose
i
d { 1,1} e ÷ such that y÷0.
one can compute
1
tan y
÷
in vectoring mode by starting with x= 1 and z = 0. This
computation always converges. However, one can take advantage of the identity
1 1
tan tan
2
y y

÷ ÷
| |
= ÷ ÷
|
\ .
(2.54)
to limit the range of fixed point numbers that is encountered. CORDIC method also
allows the computation of the other inverse trigonometric functions.









CHAPTER 3
COMPUTATION OF SINE COSINE

Elementary functions, especially trigonometric functions, play important roles in
various digital systems, such as graphic systems, automatic control systems, and so
on. The CORDIC (Coordinate Rotation DIgital Computer) [11], [12] is known as an
efficient method for the computation of these elementary functions. Recent advances
in VLSI technologies make it attractive to develop special purpose hardware such as
elementary function generators. Several function generators based on the CORDIC
have been developed [13]. The CORDIC can also be applied to matrix
triangularization, singular value decomposition, and so on [14], [6]. In this chapter,
different hardware are dealt for sine and cosine computation using CORDIC. In sine
and cosine computation by the CORDIC, iterative rotations of a point around the
origin on the X-Y plane are considered. In each rotation, the coordinates of the rotated
point and the remaining angle to be rotated are calculated. The calculations in each
iteration step are performed by shift, addition and subtraction, and recall of a prepared
constant. Since the rotation is not a pure rotation but a rotation-extension, the number
of rotations for each angle should be a constant independent of the operand so that the
scale factor becomes a constant. When implementing a sine and cosine calculator in
digital hardware, the expense of the multiplication needed for many algebraically
methods, should be kept in mind. Alternative techniques are based on polynomial
approximation, table-lookup [15] etc. as well as shift and add algorithms [15]. Among
the various properties that are desirable, we can cite speed, accuracy or the reasonable
amount of resource [15]. The architecture of FPGAs specifies suitable techniques or
might even change desirable properties. Because the number of sequential cells and
amount of storage area, needed for table-lookup algorithms, are limited but
combinational logic in terms of LUT (Look Up Table) in the FPGA's (Field
Programmable Gate Array) CLBs (Configurable Logic Blocks) is sufficiently
available, shift and add algorithms fit perfectly into an FPGA.





3.1 CORDIC Hardware
A straight forward hardware implementation for CORDIC arithmetic is shown
below in figure 3.2. It requires three registers for x, y and z, a look up table to store
the values of
1
tan 2
i
i

÷ ÷
= , and two shifter to supply the terms 2
i
x
÷
and 2
i
y
÷
to the
adder/subs tractor units. The
i
d factor (-1 and 1) is accommodated by selecting the
(shift) operand or its complement.
Of, course a single adder and one shifter can be shared by three computations if a
reduction in speed by a factor of 3 is acceptable. In the extreme, CORDIC iterations
can be implemented in firmware (micro program) or even software using the ALU
and general purpose registers of a standard microprocessor. In this case, the look up
table supplying the term
i
 can be stored in the control ROM or in main memory.

Figure. 3.1: Hardware elements needed for the CORDIC method

where high speed is not required and minimizing the hard ware cost is important (as
in calculator), the adder in fig 3.1 can be bit serial. Then with k bit operands, O(
2
k )
clock cycle would be required to complete the k CORDIC iterations. This is
acceptable for hand handled calculators, since even a delay of tens of thousands of
clock cycles constitutes a small fraction of a second and thus is hardly noticeable to a
Y
Z
X
¯

¯

shift
¯

shift
Look up
Table
human user. Intermediate between the fully parallel and fully bit-serial realizations are
a wide array of digit serial (for example decimal or radix-16) implementation that
provide trade off speed versus cost.

3.2 Generalized CORDIC

The basic CORDIC method can be generalized to provide the more powerful tool
for function evaluation. Generalized CORDIC is defined as follows:

1
1
1
2
2
i
i i i i
i
i i i i
i
i i i
x x d y
y y d x
z z d


÷
+
÷
+
+
¹ = ÷
¦
= +
`
¦
= ÷
)
(3.1)
[Generalized CORDIC iteration]
Noting that the only difference with basic CORDIC is the introduction of the
parameter  in the equation for x and redefinition of
i
 . The parameter  can
assume one of the three values:
 = 1 Circular rotations (Basic CORDIC)
i
 =
1
tan 2
i ÷ ÷

 = 0 linear rotation
i
 = 2
i ÷

 = -1 Hyperbolic rotation
i
 =
1
tanh 2
i ÷ ÷

Figure3.2 illustrates the three type of rotation in generalized CORDIC. For the
circular case with  = 1, we introduce rotations that led to expansion of vector length
by a factor
1/ 2
(1 tan )
i
 + = / cos
i
i  in each step and by K =
1.646760258121…overall, where the vector length is the familiar
2 2
i
r x y = + . With
reference to figure3.2, the rotation angle AOB can be defined in terms of the area of
sector AOB as follows:

2
2( )
( )
areaAOB
angleAOB
OU
=
The following equations, repeated here for ready comparison, characterize the results
of circular CORDIC rotations:

( cos sin )
( cos sin )
0
m
m
m
x k x z y z
y k y z x z
z
= ÷ ¹
¦
= +
`
¦
=
)
(3.2)
(Circular rotation mode, Rule: choose { 1,1}
i
d e ÷ such that 0 z ÷ )

2 2 1/ 2
1
( )
0
tan ( / )
m
m
m
x k x y
y
z z y x
÷
¹ = +
¦
=
`
¦
= +
)
(3.3)
(Circular vectoring mode, Rule: Choose
i
d { 1,1} e ÷ such that y÷0)
In linear rotations corresponding to  = 0, the end point of the vector is kept on
the line
0
x x = and the vector length is defined by
i i
r x = . Hence, the length of the
vector is always its true length OV and the scaling factor is 1. The following
equations characterized the results of linear CORDIC rotations:


0
m
m
m
x x
y y xz
z
= ¹
¦
= +
`
¦
=
)
(3.4)
(Linear rotation mode, Rule: Choose
i
d { 1,1} e ÷ such that z ÷0)
0
/
m
m
m
x x
y
z z y x
= ¹
¦
=
`
¦
= +
)
(3.5)

(Linear vectoring mode, Rule: Choose
i
d { 1,1} e ÷ such that y÷0)
hence, linear CORDIC rotations can be used to perform multiplication (rotation mode,
y = 0), multiply-add (rotation mode), division (vectoring mode, z = 0), or divide-add
(vectoring mode).
In hyperbolic rotations corresponding to  = -1, the rotation angle EOF can be
defined in terms of the area of the hyperbolic sector EOF as follows:

2
2( )
( )
areaEOB
angleEOF
OW
=
The vector length is defined as
2 2
i
r x y = + , with the length expansion due to
rotation being
1/ 2
(1 tanh )
i
 + = / cosh
i
i  . Because cosh
i
 > 1, the vector length
actually shrinks, leading to an overall shrinkage factor K’ = 0.8281593609602………
after all the iterations. The following equations characterize the results of hyperbolic
CORDIC rotations:

'
( cosh sinh )
( cosh sinh )
0
m
m
m
x k x z y z
y k y z x z
z
¹ = +
¦
= +
`
¦
=
)
(3.6)
(Hyperbolic rotation mode, Rule: choose { 1,1}
i
d e ÷ such that 0 z ÷ )

2 2 1/ 2
1
( )
0
tanh ( / )
m
m
m
x k x y
y
z z y x
÷
¹ = ÷
¦
=
`
¦
= +
)
(3.7)
(Hyperbolic vectoring mode, Rule: Choose
i
d { 1,1} e ÷ such that y÷0.
hence, hyperbolic CORDIC rotations can be used to compute the hyperbolic sine and
cosine functions (rotation mode, x = 1/k’, y = 0) or the
1
tanh
÷
function (vectoring
mode, x = 1, z = 0). Other functions can be computed indirectly [16].













Figure. 3.2: Circular, Linear and Hyperbolic CORDIC [16]

3.3 The CORDIC-algorithm for Computing a Sine and Cosine

Jack E. Volder [1] described the Coordinate Rotation Digital Computer or
CORDIC for the calculation of trigonometric functions, multiplication, division and
conversion between binary and mixed radix number systems. The CORDIC-algorithm
provides an iterative method of performing vector rotations by arbitrary angles using
only shifts and adds. Volder's algorithm is derived from the general equations for
vector rotation. If a vector v with components ( , ) x y is to be rotated through an angle
 a new vector
'
v with components
' '
( , ) x y is formed by (as in equation 2.1 and 2.2):
X = r cos ,Y = r sin (3.8)
'
'
'
.cos( ) .sin( )
.cos( ) .sin( )
x x y
V
y x
y
 
 
| | ÷ | |
= =
|
|
|
+
\ .
\ .
(3.9)
Figure 4.2 illustrates the rotation of a vector
x
V
y
| |
=
|
\ .
by the angle .

Figure 3.3: Rotation of a vector V by the angle 
The individual equations for x

and
'
y can be rewritten as [17]:
'
x = .cos( ) .sin( ) x y   ÷ (3.10)
'
y = .cos( ) .sin( ) y x   + (3.11)
and rearranged so that:
'
cos( )[ . tan( )] x x y   = ÷ (3.12)
'
cos( )[ . tan( )] y y x   = + (3.13)
The multiplication by the tangent term can be avoided if the rotation angles and
therefore tan( )  are restricted so that tan( ) 2
i

÷
= .In digital hardware this denotes a
simple shift operation. Furthermore, if those rotations are performed iteratively and in
both directions every value of tan( )  is representable. With arctan(2 )
i

÷
= the cosine
term could also be simplified and since cos( ) cos( )   = ÷ it is a constant for a fixed
number of iterations. This iterative rotation can now be expressed as:
1
[ . .2 ]
i
i i i i i
x k x y d
÷
+
= ÷ (3.14)
1
[ . .2 ]
i
i i i i i
y k y x d
÷
+
= + (3.15)
where cos(arctan(2 ))
i
i
K
÷
= and 1
i
d = ± . The product of the
i
K 's represents the so-
called K factor [6]:
1
0
n
i
i
k k
÷
=
=
I
(3.16)
This k factor can be calculated in advance and applied elsewhere in the system. A
good way to implement the k factor is to initialize the iterative rotation with a vector
of length k which compensates the gain inherent in the CORDIC algorithm. The
resulting vector
'
v is the unit vector as shown in Figure 4.3.

Figure 3.4: Iterative vector rotation, initialized with V0

Equations 4.7 and 4.8 can now be simplified to the basic CORDIC-equations:

1
[ . .2 ]
i
i i i i
x x y d
÷
+
= ÷ (3.17)

1
[ . .2 ]
i
i i i i
y y x d
÷
+
= + (3.18)

The direction of each rotation is defined by
i
d and the sequence of all
i
d 's
determines the final vector. This yields to a third equation which acts like an angle
accumulator and keeps track of the angle already rotated. Each vector v can be
described by both the vector length and angle or by its coordinates x and y .
Following this incident, the CORDIC algorithm knows two ways of determining the
direction of rotation: the rotation mode and the vectoring mode. Both methods
initialize the angle accumulator with the desired angle
0
z . The rotation mode,
determines the right sequence as the angle accumulator approaches 0 while the
vectoring mode minimizes the y component of the input vector.
The angle accumulator is defined by:
1
. (2 )
i
i i i
z z d arctan
÷
+
= ÷ (3.19)
where the sum of an infinite number of iterative rotation angles equals the input
angle [14]:
0
. (2 )
i
i
i
d arctan 
·
÷
=
=
¯
(3.20)
Those values (2 )
i
arctan
÷
can be stored in a small lookup table or hardwired
depending on the way of implementation. Since the decision is which direction to
rotate instead of whether to rotate or not,
i
d is sensitive to the sign of
i
z Therefore
i
d
can be described as:
1; 0
1; 0
i
i
i
if z
d
if z
÷ < ¦
=
´
+ >
¹
(3.21)
With equation 4.14 the CORDIC algorithm in rotation mode is described
completely. Note, that the CORDIC method as described performs rotations only
within
2

÷ and
2

. This limitation comes from the use of 2
0
for the tangent in the first
iteration. However, since a sine wave is symmetric from quadrant to quadrant, every
sine value from 0 to 2 can be represented by reflecting and/or inverting the first
quadrant appropriately.

3.4 Implementation of various CORDIC Architectures

As intended by Volder, the CORDIC algorithm only performs shift and add
operations and is therefore easy to implement and resource-friendly. However, when
implementing the CORDIC algorithm one can choose between various design
methodologies and must balance circuit complexity with respect to performance. The
most obvious methods of implementing a CORDIC, bit-serial, bit-parallel, unrolled
and iterative, are described and compared in the following sections.
3.4.1 A Bit-Parallel Iterative CORDIC

The CORDIC structure as described in equations 4.10, 4.11, 4.12 and 4.14 is
represented by the schematics in Figure 4.4 when directly translated into hardware.
Each branch consists of an adder-sub tractor combination, a shift unit and a register
for buffering the output. At the beginning of a calculation initial values are fed into
the register by the multiplexer where the MSB of the stored value in the z-branch
determines the operation mode for the adder-sub tractor. Signals in the x and y branch
pass the shift units and are then added to or subtracted from the unshifted signal in the
opposite path.

Figure 3.5: Iterative CORDIC
The z branch arithmetically combines the registers values with the values taken
from a lookup table (LUT) whose address is changed accordingly to the number of
iteration. For n iterations the output is mapped back to the registers before initial
values are fed in again and the final sine value can be accessed at the output. A simple
finite-state machine is needed to control the multiplexers, the shift distance and the
addressing of the constant values. When implemented in an FPGA the initial values
for the vector coordinates as well as the constant values in the LUT can be hardwired
in a word wide manner. The adder and the sub tractor component are carried out
separately and a multiplexer controlled by the sign of the angle accumulator
distinguishes between addition and subtraction by routing the signals as required. The


>> n >> n
Const(n)
Register
Register
Register
± ± ±
Sign(n-1)
x0 y0 z0
x(n)
y(n)
z(n)
¯ ¯ ¯
Multiplexer
Multiplexer Multiplexer
shift operations as implemented change the shift distance with the number of
iterations but those require a high fan in and reduce the maximum speed for the
application [18]. In addition the output rate is also limited by the fact that operations
are performed iteratively and therefore the maximum output rate equals 1/n times
the clock rate.

3.4.2 A Bit-Parallel Unrolled CORDIC

Instead of buffering the output of one iteration and using the same resources
again, one could simply cascade the iterative CORDIC, which means rebuilding the
basic CORDIC structure for each iteration. Consequently, the output of one stage is
the input of the next one, as shown in Figure 4.5, and in the face of separate stages
two simplifications become possible. First, the shift operations for each step can be
performed by wiring the connections between stages appropriately. Second, there is
no need for changing constant values and those can therefore be hardwired as well.
The purely unrolled design only consists of combinatorial components and computes
one sine value per clock cycle. Input values find their path through the architecture on
their own and do not need to be controlled.

Obviously the resources in an FPGA are not very suitable for this kind of
architecture. As we talk about a bit-parallel unrolled design with 16 bit word length,
each stage contains 48 inputs and outputs plus a great number of cross-connections
between single stages. Those cross-connections from the x-path through the shift
components to the y-path and vice versa make the design difficult to route in an
FPGA and cause additional delay times. From table 4.1 it can be seen how
performance and resource usage change with the number of iterations if implemented
in an XILINX FPGA. Naturally, the area and therefore the maximum path delay
increase as stages are added to the design where the path delay is an equivalent to the
speed which the application could run at.

Table 3.1: Performance and CLB usage in an XC4010E [19]

No. of Iterations 8 9 10 11 12 13
Complexity [CLB] 184 208 232 256 280 304
Max path delay[ns] 163.75 177.17 206.9 225.72 263.86 256.87

As described earlier, the area in FPGAs can be measured in CLBs, each of which
consist of two lookup tables as well as storage cells with additional control
components [20], [21]. For the purely combinatorial design the CLB's function
generators perform the add and shift operations and no storage cells are used. This
means registers could be inserted easily without significantly increasing the area.

Figure 3.6: Unrolled CORDIC

However,inserting registers between stages would also reduce the maximum path
delays and correspondingly a higher maximum speed can be achieved. It can be seen,
that the number of CLBs stays almost the same while the maximum frequency
increases as registers are inserted. The reason for that is the decreasing amount of
combinatorial logic between sequential cells. Obviously, the gain of speed when
inserting registers exceeds the cost of area and makes therefore the fully pipelined
CORDIC a suitable solution for generating a sine wave in FPGAs. Especially if a
sufficient number of CLBs is at one's disposal, as is the case in high density devices
like XILINX's Virtex or ALTERA's FLEX families, this type of architecture becomes
more and more attractive.

3.4.3 A Bit-Serial Iterative CORDIC

Problems which involve repeated evaluation of a fixed set of nonlinear, algebraic
equations appear frequently in scientific and engineering applications. Examples of
such problems can be found in the robotics, engineering graphics, and signal
processing areas. Evaluating complicated equation sets can be very time consuming in
software, even when co-processors are used, especially when these equations contain
a large number of nonlinear and transcendental functions as well as many
multiplication and division operations. Both, the unrolled and the iterative bit-parallel
designs, show disadvantages in terms of complexity and path delays going along with
the large number of cross connections between single stages. To reduce this
complexity we can change the design into a completely bit-serial iterative
architecture. Bit-serial means only one bit is processed at a time and hence the cross
connections become one bit-wide data paths. Clearly, the throughput becomes a
function of

clock rate
number of iterations word width ×


In spite of this the output rate can be almost as high as achieved with the unrolled
design. The reason is the structural simplicity of a bit-serial design and the
correspondingly high clock rate achievable. Figure 4.6 shows the basic architecture of
the bit serial CORDIC processor as implemented in a XILINX Spartan.

In this architecture the bit-serial adder-subtractor component is implemented as a
full adder where the subtraction is performed by adding the 2's complement of the
actual subtrahend [22]. The subtraction is again indicated by the sign bit of the angle
accumulator as described in section 4.2.1. A single bit of state is stored at the adder to
realize the carry chain [23] which at the same time requires the LSB to be fed in first.
The shift-by- i operation can be realized by reading the bit 1 i ÷ from it's right end in
the serial shift registers. A multiplexer can be used to change position according to the
current iteration. The initial values
0
x ,
0
y and
0
z are fed into the array at the left end of
the serial-in serial-out register and as the data enters the adder component the
multiplexer at the input switch and map back the results of the bit-serial adder into the
registers. The constant LUT for this design is implemented as a multiplexer with
hardwired choices. Finally, when all iterations are passed the input multiplexers
switch again and initial values enter the bit-serial CORDIC processor as the computed
sine values exit.
The design as implemented runs at a much higher speed than the bit-parallel
architectures described earlier and fits easily in a XILINX SPARTAN device. The
reason is the high ratio of sequential components to combinatorial components. The
performance is constrained by the use of multiplexers for the shift operation and even
more for the constant LUT. The latter could be replaced by a RAM or serial ROM
where values are read by simply incrementing the memory's address. This would
clearly accelerate the performance.



Figure 3.7: Bit-serial CORDIC
3.5 Comparison of the Various CORDIC Architectures

In the previous sections, we described various methods of implementing the
CORDIC algorithm using an FPGA. The resulting structures show differences in the
way of using resources available in the target FPGA device. Table 4.2 illustrates how
the architectures for the iterative bit-serial and iterative bit-parallel designs for 16 bit
resolution vary in terms of speed and area. The bit-serial design stands out due to it's
low area usage and high achievable speed. Whereas the latency and hence the
maximum throughput rate is much lower compared to the bit-parallel designs. The
bit-parallel unrolled and fully pipelined design uses the resources extensively but
shows the best latency per sample and maximum throughput rate. The prototyping
environment limited the implementation of the unrolled design to 13 iterations. The
iterative bit-parallel design provides a balance between unrolled and bit-serial design
and shows an optimum usage of the resources.

3.6 Numerical Comparison

By running a test program on our Taylor, Polynomial Interpolation, and CORDIC
approximations for cosine one can obtained the output attached. As one can see all
three give fairly reasonable approximations to cosine. We can see from the absolute
error that our Taylor approximation does just what is expected. As the values of x get
further away from our centers 0, è /6, è /3, and è /2, the error increases. The error then
decreases as our angle again nears the next center. The polynomial interpolation turns
out to be the worst approximation. By looking at the graph again, it appears that the
approximation should be best when x is near 0, è /4, and è /2, looking at the absolute
errors it appear that this is the case. However, the values computed in the test case
show that at most angles the polynomial doesn’t accurately correspond with the cos
function. The best values, those near our chosen points, are still off by at the most
1/50. The best approximation looking at the absolute error is definitely CORDIC. In
fact it turns out to be exact on nearly every angle (at least in terms of MATLAB’s cos
( ) function). Clearly by numerical standards, the CORDIC method is the winner.
However, note that the Taylor approximation did very well and with more centers
would do even better. As for the polynomial interpolation, it does not seem to fit the
sinusoid very well and this will apparently give a poor approximation.
3.7 Other Considerations

By the numerical comparison there is an obvious loser – polynomial interpolation,
however there may be certain conditions that require different properties other than
just accuracy. For instance, polynomial interpolation, while crude, is very fast to
calculate and in terms of complexity it is the simplest of all three.
The Taylor approximation while it is slower to calculate than the polynomial
approximation is a function and can be calculated quickly as well. However, the
complexity of the method is much greater than the simple quadratic and for
reasonable accuracy needs multiple expansions. Also, for the x values that fall in the
middle of the centers accuracy is still an issue, albeit a small one.
Finally, the CORDIC method, which by far finds the most accurate
approximations, has the most direct solution to the problem of evaluating
trigonometric functions. By rotating a unit vector in a coordinate system, it is
essentially finding (with precision limitations) the actual values of sin and cosine.
However, this method is not a function that can be easily evaluated, but rather an
iterative formula. This means that how fast it is depends on how fast it converges to
the actual answer. While the convergence isn’t great it is fairly fast, giving an error
bound of 1/2
n
≥ |cos è - x
n+1
|, where x
n+1
is the current step in the CORDIC
algorithm. This means that the algorithm gets at about twice as close to the real
solution every iteration. The complexity is great in that CORDIC has to be done n
times to get a solution for an n-bit computer. However, this is combated by the fact
that in n iterations both the cosine and sine are found. Something that the other
methods can’t do. Thus CORDIC is the best way of fast calculation by using
subtraction and addition only.

In actual fact it would be more accurate to look at the resources available in the
specific target devices rather than the specific needs in order to determine what
architecture to use. The bit-serial structure is definitely the best choice for relatively
small devices, but for FPGAs where sufficient CLBs are available one might choose
the bit-parallel and fully pipelined architecture [24] since latency is minimal and no
control unit is needed.

Table 3.2: Performance and CLB usage for the bit-parallel and bit-serial iterative
designs. [19]


CLB LUT FF Speed Latency Max. Throughput

[1] [1] [1] [MHz] [  s] [Mio. Samples /sec]
bit-serial 111 153 108 48 5.33 0.1875
bit-parallel 138 252 52 36 0.44 2.25

3.8 Hardware Implementation

As demonstrated the amplitude control can be carried out within the CORDIC
structure. Instead of hard-wiring the initial values as proposed in section 4.2.2, the
values are now fed into the CORDIC structure through a separate input. Figure 4.9
illustrates the resulting structure of the complete oscillator

Figure 3.8: A CORDIC-based Oscillator for sine generation

The oscillator has been implemented and tested in a XILINX XC4010E. The
architecture of this device provides specific resources in terms of CLBs (configurable
logic blocks), LUTs, storage cells and maximum speed.




CHAPTER 4
CORDIC FOR DFT CALCULATION


In chapter 3 it has discussed that how sine and cosine can be calculated using
CORDIC algorithm and now using this algorithm how Digital Fourier Transform
(DFT) can be calculated has been discussed in this chapter. A Fourier transform is a
special case of a wavelet transform with basis vectors defined by trigonometric
functions sine and cosine. It is concerned with the representation of the signals by a
sequence of numbers or symbols and the processing of these signals. Digital signal
processing and analog signal processing are subfields of signal processing. DSP
(Digital Signal Processing) includes subfields like: audio and speech signal
processing, sonar and radar signal processing, sensor array processing, spectral
estimation, statistical signal processing, digital image processing, signal processing
for communications, biomedical signal processing, seismic data processing, etc. Since
the goal of DSP is usually to measure or filter continuous real-world analog signals,
the first step is usually to convert the signal from an analog to a digital form, by using
an analog to digital converter. Often, the required output signal is another analog
output signal, which requires a digital to analog converter. DSP algorithms have long
been run on standard computers, on specialized processors called digital signal
processors (DSPs), or on purpose-built hardware such as application-specific
integrated circuit (ASICs). Today there are additional technologies used for digital
signal processing including more powerful general purpose microprocessors, field-
programmable gate arrays (FPGAs), digital signal controllers (mostly for industrial
apparatus such as motor control), and stream processors, among others. There are
several ways to calculate the Discrete Fourier Transform (DFT), such as solving
simultaneous linear equations or the correlation method. The Fast Fourier Transform
(FFT) (Figure 4.1 Eight point decimation-in-time FFT algorithm) is another method
for calculating the DFT. While it produces the same result as the other approaches, it
is incredibly more efficient, often reducing the computation time by hundreds. J.W.
Cooley and J.W. Tukey are the founder credit for bringing the FFT (also known as
divide and conquer algorithm). A Fast Fourier transform (FFT) is an efficient
algorithm to compute the Discrete Fourier transform (DFT) and it’s inverse. FFTs are
of great importance to a wide variety of applications, from digital signal processing
and solving partial differential equations to algorithms for quick multiplication of
large integers. This article describes the algorithms, of which there are many; see
discrete Fourier transform for properties and applications of the transform. The DFT
is defined by the formula
2
1
0
( ) ( ) 0,1, 2........, 1.
j kn
N
N
n
X k x n e for k N

÷
÷
=
= = ÷
¯

Evaluating these sums directly would take (N
2
) arithmetical operations. An FFT
is an algorithm to compute the same result in only (N log N) operations. In general,
such algorithms depend upon the factorization of N, but (contrary to popular
misconception) there are FFTs with (N log N) complexity for all N, even for prime N.
Many FFT algorithms only depend on the fact that
2 j
N
e

÷
is an Nth primitive root of
unity, and thus can be applied to analogous transforms over any finite field, such as
number-theoretic transforms
Decimation is the process of breaking down something into it's constituent parts.
Decimation in time involves breaking down a signal in the time domain into smaller
signals, each of which is easier to handle.

4.1 Calculation of DFT using CORDIC

If the input (time domain) signal, of N points, is x(n) then the frequency response
X(k) can be calculated by using the DFT.

1
0
( ) ( ) 0,1, 2........, 1.
N
kn
N
n
X k x n W for k N
÷
=
= = ÷
¯

(4.1)
where
2 / kn j kn N
N
W e
 ÷
=
For a real sample sequence f (n), where n is {0, 1, …….. , (N-1)} DFT and the DHT,
can be defined as
DFT:

1
0
( ) ( )[cos(2 / ) .sin(2 / ) ]
N
n
F k f N N kn j N kn  
÷
=
= ÷
¯

( ) ( )
x y
F k F k = + (4.2)


DHT:

1
0
( ) ( )[cos(2 / ) sin(2 / ) ]
N
n
H k f N N kn N kn  
÷
=
= +
¯
(4.3)

As it is evident from the expressions, above transforms involve trigonometric
operations on the input sample sequences. These transforms can be expressed in terms
of plane rotations. In other words, all the input samples are given a vector rotation by
the defined angle in each of the transforms. The CORDIC unit can iteratively rotate
an input vector | |
X Y
A A A = by a target angle  through small steps of elementary
angles
i
 (so that
i
i

¯
) to generate an output vector | |
X Y
B B B = . The operation
can be represented mathematically as:
| | | |
cos sin
sin cos
X Y X Y
B B A A
 
 
÷
=


(4.4)
The rotation by a certain angle can be achieved by the summation of some elementary
small rotations given by:
15
0
i
i
 
=
=
¯
for a 16-bit machine. Now the rotation by
elementary angles can be expressed in terms of sine of the angle as:
sin 2
i
i
 
÷
= = (4.5)
where i is a positive integer. Since the elementary rotational angles
i
 have been
assumed to be sufficiently small, the higher-order terms in the expansion of the sine
and the cosine can be neglected. This assumption imposes some restriction on the
allowable values of i.
Using sine cosine value generated by CORDIC algorithm DFT can be calculated
by arranging them in matrix. It can be done by arranging sine cosine value in two
different matrix one as real part and another as imaginary part then multiplying by
input sampled data ( ) f n which results in ( ) F k . Result can be stored in matrix as real
and imaginary part. In the case of DHT computation, since there is no imaginary part
the two matrixes can be shown in a single matrix by adding the two resulting matrix
of sine and matrix of cosine in the case of 8 8 × matrix the two results in the case of
DFT and DHT can be arranged as below. In the case of DFT
0 0 0 0 0 0 0 0
8 8 8 8 8 8 8 8
0 1 2 3 4 5 6 7
8 8 8 8 8 8 8 8
0 2 4 6 8 10 12 14
8 8 8 8 8 8 8 8
0 3 6 9 12 15 18 21
8 8 8 8 8 8 8 8
0 4 8 12 16 20 24 28
8 8 8 8 8 8 8 8
0 5 10 15 20 25 30 35
8 8 8 8 8 8 8 8
0 6 12 18 24 30 36 42
8 8 8 8 8 8 8 8
8
( )
R
W W W W W W W W
W W W W W W W W
W W W W W W W W
W W W W W W W W
F k
W W W W W W W W
W W W W W W W W
W W W W W W W W
W
=
0 7 14 21 28 35 42 49
8 8 8 8 8 8 8
(0)
(1)
(2)
(3)
(4)
(5)
(6)
(7)
R
f
f
f
f
f
f
f
f W W W W W W W













(4.6)
suffix R shows real part, in the above 8 8 × matrix only cosine value is stored at
corresponding positions.
0 0 0 0 0 0 0 0
8 8 8 8 8 8 8 8
0 1 2 3 4 5 6 7
8 8 8 8 8 8 8 8
0 2 4 6 8 10 12 14
8 8 8 8 8 8 8 8
0 3 6 9 12 15 18 21
8 8 8 8 8 8 8 8
0 4 8 12 16 20 24 28
8 8 8 8 8 8 8 8
0 5 10 15 20 25 30 35
8 8 8 8 8 8 8 8
0 6 12 18 24 30 36 42
8 8 8 8 8 8 8 8
8
( )
I
W W W W W W W W
W W W W W W W W
W W W W W W W W
W W W W W W W W
F k
W W W W W W W W
W W W W W W W W
W W W W W W W W
W
=
0 7 14 21 28 35 42 49
8 8 8 8 8 8 8
(0)
(1)
(2)
(3)
(4)
(5)
(6)
(7)
I
f
f
f
f
f
f
f
f W W W W W W W













(4.7)
suffix I shows real part, in the above 8 8 × matrix only sine value is stored at
corresponding positions.
Final value is given as: ( ) ( ) . ( )
Z I
F k F k j F k = + . The two result i.e. real and
imaginary part can be stored in RAM location for further use. But in the case of DHT
since there is no imaginary part the two values generated in the matrices be added
together and used further for different applications. The resulting value of ( ) H k by
using equation (4.3) .
0 0 0 0 0 0 0 0
8 8 8 8 8 8 8 8
0 1 2 3 4 5 6 7
8 8 8 8 8 8 8 8
0 2 4 6 8 10 12 14
8 8 8 8 8 8 8 8
0 3 6 9 12 15 18 21
8 8 8 8 8 8 8 8
0 4 8 12 16 20 24 28
8 8 8 8 8 8 8 8
0 5 10 15 20 25 30 35
8 8 8 8 8 8 8 8
0 6 12 18 24 30 36 42
8 8 8 8 8 8 8 8
0
8
( )
W W W W W W W W
W W W W W W W W
W W W W W W W W
W W W W W W W W
H k
W W W W W W W W
W W W W W W W W
W W W W W W W W
W
=
7 14 21 28 35 42 49
8 8 8 8 8 8 8
(0)
(1)
(2)
(3)
(4)
(5)
(6)
(7)
f
f
f
f
f
f
f
f W W W W W W W













(4.8)
In the right side of equation (4.9) each term of 8 8 × matrix can be seen as
cos(2 / ) sin(2 / ) kn N kn N   + and ( ) H k is given as: ( ) ( ) ( )
Z I
H k F k F k = + .
4.2 FFT method for DFT calculation

Fast Fourier transform (FFT) is an efficient algorithm for computing the discrete
Fourier transform. The discovery of the FFT algorithm paved the way for widespread
use of digital methods of spectrum estimation which influenced the research in almost
every field of engineering and science. DFT is a tool to estimate the samples of the
CFT at uniformly spaced frequencies. DFT of an input sampled signal as discussed
earlier in this chapter. It requires less multiplication than a simple approach of
calculating DFT [25]. The basic computation using FFT is called butterfly
computation which s shown in figure 4.2





Figure 4.1: Basic butterfly computation in the decimation-in-time.

By using the above butterfly computation technique an8 8 × DFT can be calculated as
shown figure 4.2 .

x(0) X(0)

x(4)
0
8
W -1 X(1)

x(2)
0
8
W -1 X(2)

x(6) -1
2
8
W -1 X(3)

x(1)
0
8
W -1 X(4)

x(5)
0
8
W -1
1
8
W -1 X(5)


x(3)
0
8
W -1
2
8
W -1 X(6)

x(7)
0
8
W -1
2
8
W -1
3
8
W -1 X(7)
Figure 4.2: Eight point decimation-in-time FFT algorithm.
a
b
r
N
A a W b = +
r
N
A a W b = +
-1

4-POINT

4-POINT

8-POINT

2-POINT

2-POINT

2-POINT

2-POINT

The steps involved in FFT method can be understood by the following figure 4.3

Figure 4.3: FFT divide and conquer method

We take this N point DFT, & break it down into two N/2 point DFTs by splitting the
input signal into odd & even numbered samples to get:

/ 2 1 / 2 1
2 (2 1)
0 0
1 1
( ) (2 ) (2 1)
N N
mk m k
N N
m m
X k x n W x n W
N N
÷ ÷
+
= =
= + +
¯ ¯
(4.9)
i.e. ( ) X k even number samples odd number samples = +

/ 2 1 / 2 1
2 2
0 0
1 1
( ) (2 )( ) (2 1)( )
N N
mk k rk
N N N
m m
X k x m W W x m W
N N
÷ ÷
= =
= + +
¯ ¯
(4.10)











Figure 4.4: Fast Fourier Transform

Block diagram for FFT is in figure 4.4 which tells about the different stages to
calculate DFT of a sampled signal.
Multiplication by
twiddle matrix
Write row wise
Take
DFT column
wise
DFT
row wise
Read
Column
wise
CHAPTER 5
FPGA DESIGN FLOW

5.1 Field Programmable Gate Array (FPGA)

Field Programmable Gate Arrays are called this because rather than having a
structure similar to a PAL or other programmable device, they are structured very
much like a gate array ASIC. This makes FPGAs very nice for use in prototyping
ASICs, or in places where and ASIC will eventually be used. For example, an FPGA
maybe used in a design that need to get to market quickly regardless of cost. Later an
ASIC can be used in place of the FPGA when the production volume increases, in
order to reduce cost.

5.2 FPGA Architectures



Figure5.1: FPGA Architecture
Each FPGA vendor has its own FPGA architecture, but in general terms they are
all a variation of that shown in Figure 5.1. The architecture consists of configurable
logic blocks, configurable I/O blocks, and programmable interconnect. Also, there
will be clock circuitry for driving the clock signals to each logic block, and additional
logic resources such as ALUs, memory, and decoders may be available. The two basic
types of programmable elements for an FPGA are Static RAM and anti-fuses.

5.2.1 Configurable Logic Blocks

Configurable Logic Blocks contain the logic for the FPGA. In large grain
architecture, these CLBs will contain enough logic to create a small state machine. In
fine grain architecture, more like a true gate array ASIC, the CLB will contain only
very basic logic[26]. The diagram in Figure 5.2 would be considered a large grain
block. It contains RAM for creating arbitrary combinatorial logic functions. It also
contains flip-flops for clocked storage elements, and multiplexers in order to route the
logic within the block and to and from external resources. The multiplexers also allow
polarity selection and reset and clear input selection.


Figure 5.2: FPGA Configurable Logic Block

5.2.2 Configurable I/O Blocks

A Configurable I/O Block, shown in Figure 5.3, is used to bring signals onto the
chip and send them back off again. It consists of an input buffer and an output buffer
with three state and open collector output controls. Typically there are pull up
resistors on the outputs and sometimes pull down resistors. The polarity of the output
can usually be programmed for active high or active low output and often the slew
rate of the output can be programmed for fast or slow rise and fall times. In addition,
there is often a flip-flop on outputs so that clocked signals can be output directly to
the pins without encountering significant delay. It is done for inputs so that there is
not much delay on a signal before reaching a flip-flop which would increase the
device hold time requirement.


Figure 5.3: FPGA Configurable I/O Block


5.2.3 Programmable Interconnect

The interconnect of an FPGA is very different than that of a CPLD, but is rather
similar to that of a gate array ASIC. In Figure 5.4, a hierarchy of interconnect
resources can be seen. There are long lines which can be used to connect critical
CLBs that are physically far from each other on the chip without inducing much
delay. They can also be used as buses within the chip. There are also short lines which
are used to connect individual CLBs which are located physically close to each other.
There is often one or several switch matrices, like that in a CPLD, to connect these
long and short lines together in specific ways. Programmable switches inside the chip
allow the connection of CLBs to interconnect lines and interconnect lines to each
other and to the switch matrix. Three-state buffers are used to connect many CLBs to
a long line, creating a bus. Special long lines, called global clock lines, are specially
designed for low impedance and thus fast propagation times. These are connected to
the clock buffers and to each clocked element in each CLB. This is how the clocks are
distributed throughout the FPGA.


Figure 5.4: FPGA Programmable Interconnect

5.2.4 Clock Circuitry

Special I/O blocks with special high drive clock buffers, known as clock drivers,
are distributed around the chip. These buffers are connected to clock input pads and
drive the clock signals onto the global clock lines described above. These clock lines
are designed for low skew times and fast propagation times. As we will discuss later,
synchronous design is a must with FPGAs, since absolute skew and delay cannot be
guaranteed. Only when using clock signals from clock buffers can the relative delays
and skew times are guaranteed.

5.2.5 Small vs. Large Granularity

Small grain FPGAs resemble ASIC gate arrays in that the CLBs contain only
small, very basic elements such as NAND gates, NOR gates, etc. The philosophies
that small elements can be connected to make larger functions without wasting too
much logic. In a large grain FPGA, where the CLB can contain two or more flip-
flops, a design which does not need many flip-flops will leave many of them unused.
Unfortunately, small grain architectures require much more routing resources, which
take up space and insert a large amount of delay which can more than compensate for
the better utilization.

Small Granularity Large Granularity
Better utilization Fewer levels of logic
Direct conversion to ASIC Less interconnect delay

A comparison of advantages of each type of architecture is shown in Table. The
choice of which architecture to use is dependent on your specific application.

5.2.6 SRAM vs. Anti-fuse Programming

There are two competing methods of programming FPGAs. The first, SRAM
programming, involves small Static RAM bits for each programming element.
Writing the bit with a zero turns off a switch, while writing with a one turns on a
switch. The other method involves anti-fuses which consist of microscopic structures
which, unlike a regular fuse, normally make no connection. A certain amount of
current during programming of the device causes the two sides of the anti-fuse to
connect. The advantages of SRAM based FPGAs is that they use a standard
fabrication process that chip fabrication plants are familiar with and are always
optimizing for better performance. Since the SRAMs are reprogrammable, the FPGAs
can be reprogrammed any number of times, even while they are in the system, just
like writing to a normal SRAM. The disadvantages are that they are volatile, which
means a power glitch could potentially change it. Also, SRAM based devices have
large routing delays. The advantages of Anti-fuse based FPGAs are that they are non-
volatile and the delays due to routing are very small, so they tend to be faster. The
disadvantages are that they require a complex fabrication process, they require an
external programmer to program them, and once they are programmed, they cannot be
changed.

FPGA Families

Examples of SRAM based FPGA families include the following:
■ Altera FLEX family
■ Atmel AT6000 and AT40K families
■ Lucent Technologies ORCA family
■ Xilinx XC4000 and Virtex families

Examples of Anti-fuse based FPGA families include the following:

■ Actel SX and MX families
■ Quick logic pASIC family

5.3 The Design Flow

This section examines the design flow for any device, whether it is an ASIC, an
FPGA, or a CPLD. This is the entire process for designing a device that guarantees
that you will not overlook any steps and that you will have the best chance of getting
backs a working prototype that functions correctly in your system. The design flow
consists of the steps in

5.3.1 Writing a Specification


The importance of a specification cannot be overstated. This is an absolute must,
especially as a guide for choosing the right technology and for making your needs
known to the vendor. As specification allows each engineer to understand the entire
design and his or her piece of it. It allows the engineer to design the correct interface
to the rest of the pieces of the chip. It also saves time and misunderstanding. There is
no excuse for not having a specification.
A specification should include the following information:
■ An external block diagram showing how the chip fits into the system.
■ An internal block diagram showing each major functional section.
■ A description of the I/O pins including
■ Output drive capability
■ Input threshold level
■ Timing estimates including

Figure 5.5: Design Flow of FPGA



Specification Review
Write a Specification
Design
Chip Product
System Integration on Test
Chip Test
Simulate
Design Review
Synthesize
Place and Route
Resimulate
Final Review
■ Setup and hold times for input pins
■ Propagation times for output pins
■ Clock cycle time
■ Estimated gate count
■ Package type
■ Target power consumption
■ Target price
■ Test procedures

It is also very important to understand that this is a living document. Many
sections will have best guesses in them, but these will change as the chip is being
designed.

5.3.2 Choosing a Technology

Once a specification has been written, it can be used to find the best vendor with a
technology and price structure that best meets your requirements.

5.3.3 Choosing a Design Entry Method

You must decide at this point which design entry method you prefer. For smaller
chips, schematic entry is often the method of choice, especially if the design engineer
is already familiar with the tools. For larger designs, however, a hardware description
language (HDL) such as Verilog or VHDL is used because of its portability,
flexibility, and readability. When using a high level language, synthesis software will
be required to “synthesize” the design. This means that the software creates low level
gates from the high level description.

5.3.4 Choosing a Synthesis Tool

You must decide at this point which synthesis software you will be using if you
plan to design the FPGA with an HDL. This is important since each synthesis tool has
recommended or mandatory methods of designing hardware so that it can correctly
perform synthesis. It will be necessary to know these methods up front so that
sections of the chip will not need to be redesigned later on. At the end of this phase it
is very important to have a design review. All appropriate personnel should review the
decisions to be certain that the specification is correct, and that the correct technology
and design entry method have been chosen.

5.3.5 Designing the chip

It is very important to follow good design practices. This means taking into
account the following design issues that we discuss in detail later in this chapter.
■ Top-down design
■ Use logic that fits well with the architecture of the device you have chosen
■ Macros
■ Synchronous design
■ Protect against metastability
■ Avoid floating nodes
■ Avoid bus contention

5.3.6 Simulating - design review

Simulation is an ongoing process while the design is being done. Small sections of
the design should be simulated separately before hooking them up to larger sections.
There will be many iterations of design and simulation in order to get the correct
functionality. Once design and simulation are finished, another design review must
take place so that the design can be checked. It is important to get others to look over
the simulations and make sure that nothing was missed and that no improper
assumption was made. This is one of the most important reviews because it is only
with correct and complete simulation that you will know that your chip will work
correctly in your system.

5.3.7 Synthesis

If the design was entered using an HDL, the next step is to synthesize the chip.
This involves using synthesis software to optimally translate your register transfer
level (RTL) design into a gate level design that can be mapped to logic blocks in the
FPGA. This may involve specifying switches and optimization criteria in the HDL
code, or playing with parameters of the synthesis software in order to insure good
timing and utilization.

5.3.8 Place and Route

The next step is to lay out the chip, resulting in a real physical design for a real
chip. This involves using the vendor’s software tools to optimize the programming of
the chip to implement the design. Then the design is programmed into the chip.

5.3.9 Resimulating - final review

After layout, the chip must be resimulated with the new timing numbers produced
by the actual layout. If everything has gone well up to this point, the new simulation
results will agree with the predicted results. Otherwise, there are three possible paths
to go in the design flow. If the problems encountered here are significant, sections of
the FPGA may need to be redesigned. If there are simply some marginal timing paths
or the design is slightly larger than the FPGA, it may be necessary to perform another
synthesis with better constraints or simply another place and route with better
constraints. At this point, a final review is necessary to confirm that nothing has been
overlooked.

5.3.10 Testing

For a programmable device, you simply program the device and immediately have
your prototypes. You then have the responsibility to place these prototypes in your
system and determine that the entire system actually works correctly. If you have
followed the procedure up to this point, chances are very good that your system will
perform correctly with only minor problems. These problems can often be worked
around by modifying the system or changing the system software. These problems
need to be tested and documented so that they can be fixed on the next revision of the
chip. System integration and system testing is necessary at this point to insure that all
parts of the system work correctly together. When the chips are put into production, it
is necessary to have some sort of burn-in test of your system that continually tests
your system over some long amount of time. If a chip has been designed correctly, it
will only fail because of electrical or mechanical problems that will usually show up
with this kind of stress testing.

5.4 Design Issues

In the next sections of this chapter, we will discuss those areas that are unique to
FPGA design or that are particularly critical to these devices.

5.4.1 Top-Down Design

Top-down design is the design method whereby high level functions are defined
first, and the lower level implementation details are filled in later. A schematic can be
viewed as a hierarchical tree as shown in Figure 14. The top-level block represents the
entire chip. Each lower level block represents major functions of the chip.
Intermediate level blocks may contain smaller functionality blocks combined with
gate-level logic. The bottom level contains only gates and macro functions which are
vendor-supplied high level functions. Fortunately, schematic capture software and
hardware description languages used for chip design easily allows use of the top-
down design methodology.


Figure 5.6: Top-Down Design


Top-down design is the preferred methodology for chip design for several reasons.
First, chips often incorporate a large number of gates and a very high level of
functionality. This methodology simplifies the design task and allows more than one
engineer, when necessary, to design the chip. Second, it allows flexibility in the
design. Sections can be removed and replaced with a higher-performance or
optimized designs without affecting other sections of the chip.
Also important is the fact that simulation is much simplified using this design
methodology. Simulation is an extremely important consideration in chip design since
a chip cannot be blue-wired after production. For this reason, simulation must be done
extensively before the chip is sent for fabrication. A top-down design approach allows
each module to be simulated independently from the rest of the design. This is
important for complex designs where an entire design can take weeks to simulate and
days to debug. Simulation is discussed in more detail later in this chapter.

5.4.2 Keep the Architecture in Mind

Look at the particular architecture to determine which logic devices fit best into it.
The vendor may be able to offer advice about this. Many synthesis packages can
target their results to a specific FPGA or CPLD family from a specific vendor, taking
advantage of the architecture to provide you with faster, more optimal designs.

5.4.3 Synchronous Design

One of the most important concepts in chip design, and one of the hardest to
enforce on novice chip designers, is that of synchronous design. Once an chip
designer uncovers a problem due to asynchronous design and attempts to fix it, he or
she usually becomes an evangelical convert to synchronous design. This is because
asynchronous design problems are due to marginal timing problems that may appear
intermittently, or may appear only when the vendor changes its semiconductor
process. Asynchronous designs that work for years in one process may suddenly fail
when the chip is manufactured using a newer process. Synchronous design simply
means that all data is passed through combinatorial logic and flip-flops that are
synchronized to a single clock. Delay is always controlled by flip-flops, not
combinatorial logic. No signal that is generated by combinatorial logic can be fed
back to the same group of combinatorial logic without first going through a
synchronizing flip-flop. Clocks cannot be gated - in other words, clocks must go
directly to the clock inputs of the flip-flops without going through any combinatorial
logic. The following sections cover common asynchronous design problems and how
to fix them using synchronous logic.

5.4.4 Race conditions

Figure 5.7 shows an asynchronous race condition where a clock signal is used to
reset a flip-flop. When SIG2 is low, the flip-flop is reset to a low state. On the rising
edge of SIG2, the designer wants the output to change to the high state of SIG1.
Unfortunately, since we don’t know the exact internal timing of the flip-flop or the
routing delay of the signal to the clock versus the reset input, we cannot know which
signal will arrive first - the clock or the reset. This is a race condition. If the clock
rising edge appears first, the output will remain low. If the reset signal appears first,
the output will go high. A slight change in temperature, voltage, or process may cause
a chip that works correctly to suddenly work incorrectly. A more reliable synchronous
solution is shown in Figure 16. Here a faster clock is used, and the flip-flop is reset on
the rising edge of the clock. This circuit performs the same function, but as long as
SIG1 and SIG2 are produced synchronously - they change only after the rising edge
of CLK - there is no race condition.

Figure 5.7: Asynchronous: Race Condition



Figure 5.8: Synchronous: No Race Condition

5.4.5 Metastability

One of the great buzzwords, and often misunderstood concepts, of synchronous
design is metastability. Metastability refers to a condition which arises when an
asynchronous signal is clocked into a synchronous flip-flop. While chip designers
would prefer a completely synchronous world, the unfortunate fact is that signals
coming into a chip will depend on a user pushing a button or an interrupt from a
processor, or will be generated by a clock which is different from the one used by the
chip. In these cases, the asynchronous signal must be synchronized to the chip clock
so that it can be used by the internal circuitry. The designer must be careful how to do
this in order to avoid metastability problems as shown in Figure 5.9. If the
ASYNC_IN signal goes high around the same time as the clock, we have an
unavoidable race condition. The output of the flip-flop can actually go to an undefined
voltage level that is somewhere between a logic 0 and logic 1. This is because an
internal transistor did not have enough time to fully charge to the correct level. This
meta level may remain until the transistor voltage leaks off or “decays”, or until the
next clock cycle. During the clock cycle, the gates that are connected to the output of
the flip-flop may interpret this level differently. In the figure, the upper gate sees the
level as logic 1 whereas the lower gate sees it as logic 0. In normal operation, OUT1
and OUT2 should always be the same value. In this case, they are not and this could
send the logic into an unexpected state from which it may never return. This
metastability can permanently lock up the chip.



Figure 5.9: Metastability - The Problem

The “solution” to this metastability problem by placing a synchronizer flip-flop in
front of the logic, the synchronized input will be sampled by only one device, the
second flip-flop, and be interpreted only as logic 0 or 1. The upper and lower gates
will both sample the same logic level, and the metastability problem is avoided. Or is
it? The word solution is in quotation marks for a very good reason. There is a very
small but non-zero probability that the output of the synchronizer flip-flop will not
decay to a valid logic level within one clock period. In this case, the next flip-flop will
sample an indeterminate value, and there is again a possibility that the output of that
flip-flop will be indeterminate. At higher frequencies, this possibility is greater.
Unfortunately, there is no certain solution to this problem. Some vendors provide
special synchronizer flip-flops whose output transistors decay very quickly. Also,
inserting more synchronizer flip-flops reduces the probability of metastability but it
will never reduce it to zero. The correct action involves discussing metastability
problems with the vendor, and including enough synchronizing flip-flops to reduce
the probability so that it is unlikely to occur within the lifetime of the product.

5.4.6 Timing Simulation

This method of timing analysis is growing less and less popular. It involves
including timing information in a functional simulation so that the real behavior of the
chip is simulated. The advantage of this kind of simulation is that timing and
functional problems can be examined and corrected. Also, asynchronous designs must
use this type of analysis because static timing analysis only works for synchronous
designs. This is another reason for designing synchronous chips only. As chips
become larger, though, this type of compute intensive simulation takes longer and
longer to run. Also, simulations can miss particular transitions that result in worst case
results. This means that certain long delay paths never get evaluated and a chip with
timing problems can pass timing simulation. If you do need to perform timing
simulation, it is important to do both worst case simulation and best case simulation.
The term “best case” can be misleading. It refers to a chip that, due to voltage,
temperature, and process variations, is operating faster than the typical chip.
However, hold time problems become apparent only during the best case conditions.







.





CHAPTER 6
RESULTS AND DISCUSSIONS

6.1 ModelSim Simulation Results
Figure 6.1 and table 6.1 consists the ModelSim simulation result for binary input
angle z0 and binary outputs xn1(sin(z0)), yn(cos(z0)) in the form of waveform and
their corresponding magnitude respectively. Figure 6.2 and table 6.2 consists the
ModelSim simulation result for real input angle z0 and real outputs xn1(sin(z0)),
yn(cos(z0)) in the form of waveform and their corresponding magnitude respectively.

6.1.1 For binary input and binary output




Figure 6.1: Sine-Cosine value generated for input angle z0 (binary value)






Table 6.1: Sine-Cosine value for input angle z0

Reset 1 0
Clk_enable 1 1
Input(z0)

00000000000001111000000000

00000000000001111000000000
Sine value

00000000000000000000000000

00000000000000000001111110
Cosine
value
00000000000000000000000000 00000000000000000011011110



6.1.2 For sine-cosine real input and real output




Figure 6.2: Sine-Cosine value generated for input angle z0 (integer value)




Table 6.2: Sine Cosine value for input angle z0

S. No. Reset Clk_enable
Input
angle(z0)
Sine value
Cosine
value
1

1 × × 0 0
2

0 1 30 0.49 0.867
3

0 1 45 0.711 0.699
4

0 1 60 0.867 0.49


6.1.3 For DFT using FFT Algorithm



Figure 6.3: Real input/output waveforms of DFT using FFT algorithm








Table 6.3: Real input/output values of DFT using FFT algorithm


S.N. Reset Clk_enable Input Output real
Output
imaginary
1. 0 × × 0
0

2. 1 1 1 36

0
3. 1 1 2 -4.04
9.64

4. 1 1 3 -4
4.0

5. 1 1 4 -4.1
1.61

6. 1 1 5 -4
0

7. 1 1 6 -3.95
-1.64

8. 1 1 7 -4
-4.01

9. 1 1 8 -3.98
-9.61























6.2 XILINX simulation results

Block diagram generated by XILINX 9.2i for sine-cosine using CORDIC is
shown in figure 6.4. Here inputs are z0 (input angle), clk (clock), clk_enable (clock
enable), reset and outputs are xn1 (magnitude of cosine of input angle), yn (magnitude
of sine of input angle), ce_out and dvld are chip enable signal for the next stage
blocks. Figure 6.5 shows the RTL schematic of sine-cosine generator and its internal
block diagram.








Figure 6.4: Top level RTL schematic for Sine Cosine




















Figure 6.5: RTL schematic for Sine-Cosine









Table 6.4 and table 6.5 (a),(b) shows the power summary and design summary
respectively produced by XILINX tool. Table 6.6, table 6.7 and table 6.8 have
synthesis report generated by XILINX showing number of multiplexers, number of
adder, number of flip-flops used, timing and thermal summary of the chip generated
respectively.









Table 6.4: Power summary

I(mA) P(mW)
Total estimated power consumption 81
Vccint 1.20V 26 31
Vccaux 2.50V 18 45
Vcco25 2.50V 2 5
Clocks 0 0
Inputs 0 0
Logic 0 0
Vcco25 0 0
Signals 0 0
Quiescent Vccint 1.20V 26 31
Quiescent Vccaux 2.50V 18 45
Quiescent Vcco25 2.50V 2 5



















Table 6.5: (a) Design summary of Sine-Cosine






Table 6.5: (b) Design summary of Sine-Cosine













Table 6.6: Advanced HDL Synthesis Report for sine cosine



Macro Statistics:
32x12-bit multiplier : 24
32x13-bit multiplier : 15
32x14-bit multiplier : 8
32x15-bit multiplier : 6
Total Multipliers : 73

2-bit adder : 1
3-bit adder : 1
32-bit adder : 72
4-bit adder : 1
5-bit adder : 1
6-bit adder : 1
7-bit adder : 1
8-bit adder : 1
Total Adders/Subtractors : 79

Flip-Flops : 616
Total Registers(Flip-Flops) : 616


























Table 6.7: Timing summary


Minimum period : 93.191ns (Maximum
Frequency: 10.731MHz)
Minimum input arrival time before clock : 5.176ns
Maximum output required time after clock : 4.283ns
Maximum combinational path delay : No path found






Table 6.8: Thermal summary

Estimated junction temperature: 27
0
C
Ambient temp: 25
0
C
Case temp: 26
0
C
Theta J-A range: 26 – 26
0
C/W









Figure 6.6: Top level RTL schematic of DFT


Figure 6.7: RTL schematic of DFT



Table 6.9: (a) Design summary for DFT



Table 6.9: (b) Design summary for DFT




Table 6.10: Advanced HDL Synthesis Report for DFT


Macro Statistics
10x24-bit multiplier : 16
24x10-bit multiplier : 16
24x24-bit multiplier : 32
32x32-bit multiplier : 32
# Multipliers : 96

10-bit adder : 32
2-bit adder : 64
24-bit adder : 80
24-bit subtractor : 64
3-bit adder : 64
32-bit adder : 46
32-bit subtractor : 28
4-bit adder : 64
5-bit adder : 64
6-bit adder : 64
7-bit adder : 64
8-bit adder : 80
9-bit adder : 64
# Adders/Subtractors : 778

Flip-Flops : 2576
# Registers : 2576

24-bit comparator greatequal : 32
24-bit comparator lessequal : 48
8-bit comparator greatequal : 16
# Comparators : 96

32-bit 4-to-1 multiplexer : 32
# Multiplexers : 32










6.3 Discussions

Multipliers used for implementation of CORDIC algorithm for sine cosine
generation are 73 in numbers. Number of adders/subtractors and registers are 79 and
616 respectively. In case of DFT implementation number of multiplier,
adders/subtractors, registers, comparators, multiplexers are 96, 778, 2576, 96, 32
respectively.
Minimum period for sine cosine generation is 93.191 ns (maximum frequency
10.73 MHz) power consumed by the sine cosine generator and DFT generator is 81
mW each with the junction temperature of 27
0
C. Total number of 4 input LUTs (Look
Up Tables) used 708 and 20,547 for sine cosine generator and DFT calculator
respectively. Total number of gates used 7,800 for sine cosine generator and 242,654
gates for 8 1 × DFT generator.


















CHAPTER 7
CONCLUSION

The CORDIC algorithm is a powerful and widely used tool for digital signal
processing applications and can be implemented using PDPs (Programmable Digital
Processors). But a large amount of data processing is required because of complex
computations. This affects the cost, speed and flexibility of the DSP systems. So, the
implementation of DFT using CORDIC algorithm on FPGA is the need of the day as
the FPGAs can give enhanced speed at low cost with a lot of flexibility. This is due to
the fact that the hardware implementation of a lot of multipliers can be done on FPGA
which are limited in case of PDPs.

In this thesis the sine cosine CORDIC based generator is simulated using
ModelSim which is then used for simulation of Discrete Fourier Transform. Then the
implementation of sine cosine CORDIC based generators is done on XILINX Spartan
3E FPGA which is further used to implement eight point Discrete Fourier Transform
using radix-2 decimation-in-time algorithm on FPGA. The results are verified by test
bench generated by the FPGA. This thesis shows that CORDIC is available for use in
FPGA based computing machines, which are the likely basis for the next generation
DSP systems. It can be concluded that the designed RTL model for sine cosine and
DFT function is accurate and can work for real time applications.

Future Scope of work

The future scope should include the following

 Implementation of dif algorithm
 DFT computation and simulation for more number of points
 Implementation and simulation for DHT, DCT and DST calculations
.


REFERENCES

[1] Volder J. E., “The CORDIC trigonometric computing technique”, IRE Trans.
Electronic Computing, Volume EC-8, pp 330 - 334, 1959.
[2] Lindlbauer N., www.cnmat.berkeley.edu/~norbert/CORDIC/node3.htmlhtt.
[3] Avion J.C., http://www.argo.es/~jcea/artic/CORDIC.htm
[4] Qian M., “Application of CORDIC Algorithm to Neural Networks VLSI
Design”, IMACS Multiconference on “Computational Engineering in Systems
Applications (CESA)”, Beijing, China, October 4-6, 2006.
[5] Lin C. H. and Wu A. Y., “Mixed-Scaling-Rotation CORDIC (MSR-CORDIC)
Algorithm and Architecture for High-PerformanceVector Rotational DSP
Applications”, Volume 52, pp 2385- 2398, November 2005
[6] Walther J.S.A, “Unified algorithm for elementary functions”, Spring Joint
Computer Conference, pp 379 - 385, Alantic city, 1971.
[7] Kolk K. J. V., Deprettere E.F.A. and Lee J. A., “ A Floating Point Vectoring
Algorithm Based on Fast Rotations”, Journal of VLSI Signal Processing,
Volume25, pp 125–139, Kluwer Academic Publishers, Netherlands, 2000.
[8] Antelo E., LangT. and Bruguera J. D., “Very-High Radix CORDIC Rotation
Based on Selection by Rounding”, Journal of VLSI Signal Processing, Vol.25,
141–153, Kluwer Academic Publishers, Netherlands, 2000.
[9] Delosme M. J.,Lau C. Y. and Hsiao S. F., “Redundant Constant-Factor
Implementation of Multi-Dimensional CORDIC and Its Application to
Complex SVD”, Journal of VLSI Signal Processing, Volume 25, pp 155–166,
Kluwer Academic Publishers, Netherlands, 2000.
[10] Choi J. H., Kwak J. H. and Swartzlander, Journal of VLSI Signal Processing,
Volume 25, Kluwer Academic Publishers, Netherlands, 2000.
[11] Roads C., “The Computer Music Tutorial”, MIT Press, Cambridge, 1995.
[12] Rhea T., “The Evolution of electronic musical instruments”, PhD thesis,
Nashville: George Peabody College for Teachers, 1972.
[13] Goodwin M., “Frequency-domain analysis-synthesis of musical sounds”,
Master's thesis, CNMAT and Department of Electrical Engineering and
Computer Science, UCB, 1994.
[14] Muller J. M., “Elementary Functions - Algorithms and Implementation”,
Birkhauser Boston, New York, 1997.
[15] H. M. Ahmed, J. M. Delosme, and M. Morf, “Highly concurrent com-
Comput. Mag., Volume 15, no. 1, pp. 65-82, Jan. 1982.
[16] Parhami B., “Computer Arithmetic – Algorithms and hardware designs,”
Oxford University Press, New York, 2000.
[17] Considine V., “CORDIC trigonometric function generator for DSP”, IEEE-
89th, International Conference on Acoustics, Speech and Signal Processing,
pp 2381 - 2384, Glasgow, Scotland, 1989.
[18] Andraka R.A., “Survey of CORDIC algorithms for FPGA based computers”,
Proceedings of the 1998 ACM/SIGDA sixth international symposium on
FPGAs, pp 191-200, Monterey, California, Feb.22-24, 1998.
[19] http://www.dspguru.com/info/faqs/CORDIC.htm
[20] www.xilinx.com/partinfo/#4000.pdf
[21] Troya A., Maharatna K., Krstic M., Crass E., Kraemer R.,“OFDM
synchronizer Implementation for an 1EEE 802.1 la Compliant Modem", Proc.
IASTED International Conference on Wireless and Optical Communications,
Banff, Canada,July 2002.
[22] Andraka R., “Building a high performance bit serial processor in an FPGA”,
On-Chip System Design Conference, North Kingstown, 1996.
[23] http://comparch.doc.ic.ac.uk/publications/files/osk00jvlsisp.ps.
[24] Krsti M., Troyu A., Muharutnu K. and Grass E., “Optimized low-power
synchronizer design for the IEEE 802.11a standard”, Frankfurt (Oder),
Germany, 2003.
[25] Proakis J. G.,Manolakis D. G., “Digital signal processing principles,
algorithms and applications”, Prentice Hall, Delhi, 2008.
[26] www.51protel. com/tech /Introduction.pdf






Abstract
CORDIC is an acronym for COrdinate Rotation Digital Computer. It is a class of shift adds algorithms for rotating vectors in a plane, which is usually used for the calculation of trigonometric functions, multiplication, division and conversion between binary and mixed radix number systems of DSP applications, such as Fourier Transform. The Jack E. Volder's CORDIC algorithm is derived from the general equations for vector rotation. The CORDIC algorithm has become a widely used approach to elementary function evaluation when the silicon area is a primary constraint. The implementation of CORDIC algorithm requires less complex hardware than the conventional method. In digital communication, the straightforward evaluation of the cited functions is important, numerous matrix based adaptive signal processing algorithms require the solution of systems of linear equations, the computation of eigenvalues, eigenvectors or singular values. All these tasks can be efficiently implemented using processing elements performing vector rotations. The (CORDIC) offers the opportunity to calculate all the desired functions in a rather simple and elegant way. Due to the simplicity of the involved operations the CORDIC algorithm is very well suited for VLSI implementation. The rotated vector is also scaled making a scale factor correction necessary. VHDL coding and simulation of selected CORDIC algorithm for sine and cosine, the comparison of resultant implementations and the specifics of the FPGA implementation has been discussed. In this thesis, the CORDIC algorithm has been implemented in XILINX Spartan 3E FPGA kit using VHDL and is found to be accurate. It also contains the

implementation of Discrete Fourier Transform using radix-2 decimation-in-time algorithm on the same FPGA kit. Due to the high speed, low cost and greater flexibility offered by FPGAs over DSP processors the FPGA based computing is becoming the heart of all digital signal processing systems of modern era. Moreover the generation of test bench by Xilinx ISE 9.2i verifies the results.

LIST OF ACRONYMS ASICs CLBs CORDIC DFT DHT DSP EVD FFT FPGA LUT RAM ROM RTL SRAM SVD ULP VHSIC VHDL VLSI Application-Specific Integrated Circuits Configurable Logic Blocks Cordic Rotation Digital Computer Digital Fourier Transform Digital Hartley Transform Digital Signal Processing Enhanced Versatile Disc Fast Fourier Transform Field Programmable Gate Array Look Up Table Random Access Memory Read Only Memory Register Ttransfer Level Static RAM Singular Value Deposition Unit in the Last Place Very High Speed Integrated Circuit VHSIC Hardware Description Language Very Large Scale Integration .

9 6.3 4.3 3.7 3.LIST OF FIGURES Figure No.3 5.1 5.5 5. 2.6 5. 0 ) rotating by +300 Rotation mode 3.4 Inclined balance due to the difference in weight of two sides 2.2 5.The Problem Sine-Cosine value generated for input angle z0 20 25 26 27 29 31 33 36 41 41 42 42 43 44 45 46 49 53 55 55 57 59 18 10 Page No. y0 ) to ( x10 . 2.6 3.2 3.8 5.2 2.4 3. Linear and Hyperbolic CORDIC Rotation of a vector V by the angle  Iterative vector rotation.8 4.1 2.3 Title Rotation of a vector V by the angle  Vector V with magnitude r and phase  A balance having  at one side and small weights at the other side. initialized with V0 Iterative CORDIC Unrolled CORDIC Bit-serial CORDIC A CORDIC-based Oscillator for sine generation Basic butterfly computation in the decimation-in-time Eight point decimation-in-time FFT algorithm.1 4.4 5.7 5. 6 7 10 .2 4.1 Hardware elements needed for the CORDIC method Circular.5 3.4 5.1 3.5 First three of 10 iteration leading from ( x0 . FFT write read method Fast Fourier Transform FPGA Architecture FPGA Configurable Logic Block FPGA Configurable I/O Block FPGA Programmable Interconnect Design Flow of FPGA Top-Down Design Asynchronous: Race Condition Synchronous: No Race Condition Metastability .

7 Real input/output waveforms of DFT using FFT algorithm Top level RTL schematic for Sine Cosine RTL schematic for Sine-Cosine Top label RTL schematic of DFT RTL schematic of DFT 61 63 64 69 69 Page No. 6.LIST OF FIGURES Figure No.5 6.6 6.2 Title Sine-Cosine value generated for input angle z0 (integer value 6.4 6.3 6. 60 .

2 Title For 8-bit Cordic hardware Phase. 2.7 6. 9 13 2.4 3.2 6.8 6.LIST OF TABLES Table No.1 6.10 Advanced HDL Synthesis Report for DFT 18 30 36 60 61 62 65 66 66 67 68 68 70 70 71 Page No.9 Advanced HDL Synthesis Report Timing summary Thermal summary (a) Design summary for DFT (b) Design summary for DFT 6.3 6.6 6. magnitude. 2.4 6.5 Choosing the signs of the rotation angles to force z to zero Performance and CLB usage in an XC4010E Performance and CLB usage for the bit-parallel and bitserial iterative designs.3 17 . and CORDIC Gain for different values of K Approximate value of the function  i  arctan (2-i ) . for 0  i  9 . in degree.1 3. Sine-Cosine value for input angle z0 Sine Cosine value for input angle z0 Real input/output values of DFT using FFT algorithm Power summary (a) Design summary of Sine-Cosine (b) Design summary of Sine-Cosine 6.2 6.1 2.

2 Generalized Cordic 3.6 Numerical Comparison 3.CONTENTS Certificate Acknowledgement Abstract Acronyms List of Figures List of Tables i ii iii iv v vii 1–4 1 2 4 4 5 6 – 20 6 11 14 15 15 16 21 – 36 22 23 25 28 29 30 32 34 34 35 36 Chapter 1 INTRODUCTION 1.3 A Bit-Serial Iterative CORDIC 3.2.4.2 Complex number representation of Cordic algorithm 2.3 Calculation of sine and cosine of an angle 2.3 Thesis objective 1.4 Implementation of various CORDIC Architectures 3.2.3 Basic Cordic iterations Chapter 3 COMPUTATION OF SINE COSINE 3.2 Calculation of phase of complex number 2.1 Cordic Hardware 3.4.4 Organization of thesis 1.5 Comparison of the Various CORDIC Architectures 3.2 Historical perspective 1.1 Basic equation of Cordic algorithm 2.3 The CORDIC-Algorithm for Computing a Sine and Cosine 3.1 A Bit-Parallel Iterative CORDIC 3.7 Other Considerations 3.1 Preamble 1.2 A Bit-Parallel Unrolled CORDIC 3.4.1 Calculation of magnitude of complex number 2.5 Methodology Chapter 2 CORDIC ALGORITHM 2.8 Hardware Implementation .2.

2 Configurable Input .3.design review 5.1 Configurable Logic Blocks 5.2. Large Granularity 5.1 ModelSim Simulation Results 6.6 Simulating .1.10 Testing 5.3 Design Flow 5.4 Design Issues 5.7 Synthesis 5.6 Timing Simulation Chapter 6 RESULTS AND DISCUSSIONS 6. Anti-fuse Programming 5.4.5 Small vs.2.2 Keep the Architecture in Mind 5.1 Writing a Specification 5.2.3.3.2.2 Choosing a Technology 5.1 Top-Down Design 5.3.4 Choosing a Synthesis tool 5.4.9 Resimulating .4.6 SRAM vs.2.3.3 Programmable Interconnect 5.3.3.3.3.2.4.Chapter 4 CORDIC FOR DXT CALCULATION 4.5 Metastability 5.1 For sine-cosine binary input and binary output .final review 5.3 Choosing a Design Entry Method 5.2 FPGA Architectures 5.4 Race conditions 5.8 Place and Route 5.3 Synchronous Design 5.2 FFT method for DFT calculation 37 – 42 38 41 43 – 58 43 43 44 44 45 46 46 47 48 48 50 50 50 51 51 51 52 52 52 53 53 54 54 55 57 58 59-72 59 59 Chapter 5 FPGA DESIGN FLOW 5.1 Calculation of DXT using Cordic 4.5 Designing the chip 5.4.1 Field Programmable Gate Array (FPGA) 5.3.Output Blocks 5.4 Clock Circuitry 5.4.

2 XILINX simulation results 61 60 63 6.2 For sine-cosine real input and real output 6.1.1.3 For DFT using FFT Algorithm 6.3 Discussions 72 73 74 Chapter 7 CONCLUSION REFRENCES .6.

It is hardware efficient algorithm.1 Preamble CORDIC stands for COordinate Rotation DIgital Computer. Volder [1] in the year 1959 for the computation of Trigonometric functions. and robust technique to compute the elementary functions. Data type conversion. for evaluation of trigonometric functions. high performance vector rotation DSP applications [5]. cosh. Bekooij. Multiplication. and Nowak research tells about the application of CORDIC in the computation of the (Fast Fourier Transform) FFT. The basic Algorithm structure is described in [2]. Division. subtractions. and at the effects on the numerical accuracy. Earlier methods used are Table look up method. No multiplier requirement . optimized low power design. numerical co-processors. low-complexity. advanced circuit design. It calculates the value of trigonometric functions like sine. including HP-48G as the hardware restriction of calculators require that the elementary functions should be computed using only additions. CORDIC algorithm revolves around the idea of "rotating" the phase of a complex number. However. so in binary arithmetic they can be done using just shifts and adds. to high performance radar signal processing. the "multiplies" can all be powers of 2. magnitude and phase (arctangent) to any desired precision. no actual "multiplier" is needed thus it simpler and do not require complex hardware structure as in the case of multiplier. Polynomial approximation method etc. It is used as approximation function values on all popular graphic calculators. and tanh). cosine. Today cordic algorithm is used in Neural Network VLSI design [4]. The CORDIC algorithm does not use Calculus based methods such as polynomial or rational function approximation. 1.CHAPTER 1 INTRODUCTION The CORDIC algorithm was first introduced by Jack E. It is a highly efficient. It can also calculate hyperbolic functions (such as sinh. supersonic bomber aircraft with a digital counterpart. by multiplying it by a succession of constant values. Square Root and Logarithms. comparisons and stored constants. Other information about CORDIC Algorithm and different Issues are in [3]. The CORDIC algorithm has found its way in various applications such as pocket calculators. Huisken. digit shifts.

There are two ways in CORDIC algorithm for calculation of trigonometric and other related functions they are rotation mode and vectoring mode. It is initially intended for navigation technology. Due to the simplicity of the involved operations the CORDIC algorithm is very well suited for VLSI implementation.g. determines the right sequence as the angle accumulator approaches zero while the vectoring mode minimizes the y component of the input vector.as in the case of microcontroller. CORDIC is useful in designing computing devices. numerical co-processors. the CORDIC iteration is not a perfect rotation which would involve multiplications with sine and cosine. Various CORDIC architectures like bit parallel iterative CORDIC. the CORDIC algorithm has found its way in a wide range of applications.g. there is a gain which is added to the magnitude of resulting vector which can be removed by multiplying the resulting magnitude with the inverse of the gain. These properties. when a hardware multiplier is available (e. The rotation mode. The CORDIC was introduced in 1956 by Jack Volder as a highly efficient. in a microcontroller). The drawback in CORDIC is that after completion of each iteration. in a DSP microprocessor). table-lookup methods and power series are generally faster than CORDIC.g. in an FPGA). On the other hand.2 Historical perspective CORDIC algorithm has found its way in many applications. or when the number of gates required to implement is to be minimized (e. It can be seen that CORDIC is a feasible way to approximate cosine and sine. and robust technique to compute the elementary functions. 1. there are features that make CORDIC an excellent choice for small computing devices. Both methods initialize the angle accumulator with the desired angle value. ranging from pocket calculators. The rotated vector is also scaled making a scale factor correction necessary. Since it is an iterative method it has the advantage over the other methods of being able to get better accuracy by doing more iteration. a bit parallel unrolled CORDIC. low-complexity. As it was originally designed for hardware applications. in addition to getting a very accurate approximation is perhaps the reason why CORDIC is used in many scientific calculators today. a bit-serial iterative CORDIC and the comparison of various CORDIC architecture has been discussed. where as the Taylor approximation and the Polynomial interpolation methods need to be rederived to get better results. However. CORDIC is generally faster than other approaches when a hardware multiplier is unavailable (e. to high performance radar .

Deprettere. Choi. The CORDIC airborne navigational computer built for this purpose. The authors propose an on-line calculation of the logarithm of the scale factor and subsequent compensation. An overall evaluation of the methods exposes the trade-offs that exist between the angle of rotation. Lang. The choice of a higher radix implies that the scaling factor is no longer constant. and have already been widely applied in signal processing. Lau. and Swartzlander [10] aimed to overcome the critical path in the iteration through sign prediction and addition. such as the HP-9100 and the famous HP-35 in year 1972. Hsiao.e. Kwak. Van der Kolk. hyperbolic. by computing the results for both outcomes of the sign. with the application of the CORDIC algorithm in the Hewlett-Packard calculators. and previously unknown. and their application to singular value deposition (SVD). Their method is evaluated and benchmarked against solutions by others. and to select the proper one at the very end of . fast rotation methods. to perform orthonormal rotation at a very low cost. such as the 4-D (dimention) householder CORDIC transform. and Lee [7] formalized the problem of (approximate) vectoring for fast rotations in year 2000. Although fast rotations exist for certain angles only. and Delosme [9] considered multi-dimensional variants of CORDIC. and Bruguera[8] considers going to a higher radix than the radix-2 for the classical algorithm. Further Steve Walther [6] continues work on CORDIC. they are sufficiently versatile. Hekstra found a large range of known. the HP-41C in year1980. The selection technique works equally well for redundant arithmetic and floating-point computations. immediately at the iteration level. He told how the unified CORDIC algorithm i. combining rotations in the circular. the accuracy in scaling and the cost of rotation. Rather than building a multi-dimensional transform out of a sequence of 2-D (dimention) CORDIC operations. After invention CORDIC worked as the replacement for the analog navigation computers aboard the B-58 supersonic bomber aircraft with a digital counterpart. they proposed to work with multi-dimensional micro-rotations. Today’s fast rotation techniques are closely related to CORDIC. They treated the fast and efficient selection of the appropriate fast rotation. and showed the advantage to be gained when applied to Enhanced Versatile Disc (EVD). and linear coordinate systems and how it was applied in the HP-2116 floating-point numerical co-processor. mainly due to the revolutionary development of the CORDIC algorithm. so that less iterations are required.signal processing. outperformed conventional contemporary computers by a factor of 7. They proposed to overlap the sign prediction with the addition. Antelo.

DHT. To implement CORDIC algorithm on XILINX SPARTAN 3E kit. their specification. basic equations used. testing and design issues. This chapter includes FPGA architecture. its logic blocks and different families of FPGA. Chapter 3 discusses about the calculation of sine-cosine using CORDIC algorithm.e. different architectures to perform CORDIC iteration and their block diagram. area required to implement in chip designing. its basic equations. Chapter 5 tells about the design flow of XILINX FPGA. gain factor. Chapter 4 discusses use of CORDIC algorithm for calculating DFT. The thesis concludes in chapter 7 which also discusses future scope of work. technology used.3 Thesis objective Based on the above discussion the thesis has following objectives:    To study and implement CORDIC algorithm using VHDL programming code. speed. their comparison on the basis of their complexity. Further XILINX . 1.5 Methodology In this thesis. calculation of DFT using FFT algorithm. different mode of operation i. placement and routing. Chapter 6 contains the results of simulation using ModelSim and XILINX.the iteration. VHDL programming has been used to implement CORDIC algorithm (to calculate Sine and Cosine value for a given angle) and DFT (Digital Fourier Transform) and DHT (Discrete Fourier Transform). Novel in their approach is to combine the adder logic for the computation of both results. complex form of representation. how it came into picture.4 Organization of thesis Chapter 2 discusses basics of CORDIC algorithm. number of iteration required etc. rotation mode and vectoring mode. To implement DXT using CORDIC algorithm in VHDL code. CORDIC iteration and how it works. 1. 1.

2i FPGA kit SPARTAN 3E . Programming tools used for the implementations are :     Operating system WINDOWS XP ModelSim SE PLUS 5.5c XILINX 9.SPARTAN 3E kit is used for FPGA implementation of the generated VHDL code.

sin( )  V  '  y     y. y and  by the following method. 2.sin( )  ' (2.cos( )  y. it is described that how CORDIC algorithm works and how it can be understood more clearly. Volder [1] described the Coordinate Rotation Digital Computer or CORDIC for the calculation of trigonometric functions.1: Rotation of a vector V by the angle  .1) (2.CHAPTER 2 CORDIC ALGORITHM In 1959 Jack E. y ' ) where x ' and y ' can be obtained using x. The CORDICalgorithm provides an iterative method of performing vector rotations by arbitrary angles using only shift and add. In this chapter. If a vector V with coordinates ( x. y ) is rotated through an angle  then a new vector V ' can be obtained with coordinates ( x ' .cos( )  x. Y = r sin   x '   x.1 Basic equation of CORDIC algorithm Volder's algorithm is derived from the general equations for a vector rotation. X = r cos  . division and conversion between binary and mixed radix number systems. multiplication.2) Figure 2.

sin  ) sin  Using figure 2.1 it can be seen that vector V and V ' can be resolved in two parts.3) Similarly from figure 2.2 illustrates the rotation of a vector x V    by the angle  .5)  '    OX ' = x ' = r cos  ' = r cos(   ) = r (cos  .2: Vector V with magnitude r and phase  i. As shown in the figure 2.cos  ) cos   ( r. y ) can be resolved in two parts along the x .sin  ) = ( r. e.e.7) Similarly.4) (2.1 it can be observed  '    i. Let V has its magnitude and phase as r and  respectively and V ' has its magnitude and phase as r and  ' where V ' came into picture after anticlockwise rotation of vector V by an angle  .2 came into picture. Figure 2. (2.3 OX ' can be represented as OX ' = x ' = x cos   y sin  (2.Let’s find how equation 2.axis and y -axis as r cos  and r sin  respectively. x  r cos   .1.2 and equation 2.8) .cos   sin  .6) (2. OY ' OY ' = y ' = y cos   x sin  (2.1 and 2. a vector V ( x. y  r sin   (2.  y Figure 2. From figure 2.

x '  cos( )[ x  y.2 i ] yi+1  ki [ yi  xi . resulting equation be in terms of the tangent of the angle  .7). (2.14) (2.8).di .13) y ' = y. (2. value for the vector V ' in the clockwise direction rotating the vector V by the angle  and the equations obtain in this case be x' y'  x cos   y sin   x sin   y cos  (2.17) The multiplication by the tangent term can be avoided if the rotation angles and therefore tan( ) are restricted so that tan( )  2  i . Furthermore. For the ease of calculation here only rotation in anticlockwise direction is observed first.cos( )  x. Rearranging the equation (2.19) .sin( ) Volder observed that by factoring out a cos  from both sides.12) (2.cos( )  y. then this equation can be rewritten as an iterative formula.10) can be represented in the matrix form as  x '   x   cos   '     y     y    sin   sin    cos   (2.Similarly.11) The individual equations for x’ and y ' can be rewritten as [11]: x ' = x.9) (2. Next if it is assumed that the angle è is being an aggregate of small angles. This iterative rotation can now be expressed as: x i+1  ki [ xi  yi . if those rotations are performed iteratively and in both directions every value of tan( ) is representable. With   arctan(2 i ) the cosine term could also be simplified and since cos( )  cos(  ) it is a constant for a fixed number of iterations.18) (2. (2. and composite angles is chosen such that their tangents are all inverse powers of two.15) z '  z   .sin( ) (2. tan( )] y '  cos( )[ y  x.16) (2.d i .9).7) and (2.8). the angle of which we want to find the sin and cos . x '  cos  ( x  y tan  ) y '  cos  ( y  x tan  ) (2.tan( )] (2.10) The equations (2.2 i ] (2.In digital hardware this denotes a simple shift operation. here  is the angle of rotation (  sign is showing the direction of rotation) and z is the argument.

565o....015625 0.0312 0..576o 1. Ki  cos(arctan(2 i )) and di  1 ..where. i 0 1 2 3 4 5 6 7 d  i  2 i  tan i 1 0..125 0.0078 ki is the gain and it’s value changes as the number of iteration increases..cos 5 .0156 0..25 0...20) where k io i = cos 0 cos 1 cos 2 cos 3 cos 4 ..5 0..7876o 0..0624 0...cos 3 .. For 8-bit hardware CORDIC approximation method the value of ki as 7 ki   cos i  cos 0 ....cos 4 .1: For 8-bit CORDIC hardware...0625 0..565o 14.cos n 1 (  is the angle of rotation here for n times rotation).4469o = 0.cos 1 . Table 2...4469o is possible for 8-bit CORDIC hardware.....cos 26.6073 (2.cos 6 .. These i are stored in the ROM of the hard ware of the CORDIC ...4469o i in radian 0..036o 7.cos 0. i denotes the number of rotation required to reach the required angle of the required vector.03125 0.7854 0. The product of the K i 's represents the so-called K factor [6]: n 1 k   ki i 0 n 1 (2. The above rotations requirement and adding and subtracting of the different  can be understood by the following example of a balance..21) From the above table it can be seen that precision up to 0...2450 0..8938o 0..1244 0....4636 0.125o 3.cos 7 i0 = cos 45o..cos 2 ...0078125 i  arctan(2 i ) 45o 26..

Matrix representation of the CORDIC algorithm for 8-bit hardware: . keep the input angle  on the left pan of balance and if the balance rotates around the anticlockwise direction then add the highest value in the table at the other side.4: Inclined balance due to the difference in weight of two sides. first of all. Then. Figure 2.4 (a) then other weights are required to add in the right pan or in the term of angle if  is greater than total i then add other weights to reach as near to the  as possible but in other hand if the balance shows a right inclination as in figure 2. Figure 2. if balance shows a left inclination as in figure 2.3: A balance having  at one side and small weights (angle) at the other side.hardware as the look up table.4 (b) then a weight required to be removed from the right pan or in the term of angle if  is less than total i then we subtract other weights this process is repeated to reach as near to the  as possible. In the above figure. Now by taking an example of balance it can be understood that how the CORDIC algorithm works.

.. I b and Qb respectively. cos 1  (2. B  I b  jQb Let rotated complex number is....25) It can be seen from equation (2. B ' (2.24) Thus. their phases (angles) adds and their magnitudes multiply...27) (2.cos 7    tan 0  yi 1   1 .2 Complex number representation of CORDIC algorithm Let a given complex number B having its real and imaginary part as..    tan 7  tan 0   1  1    tan 1  tan 7  x   1  y    tan 1   1  (2....      sin 7 cos 7   y  (2....... Given complex number..26) 2...6073    1  0   (2..cos 7 = 0..  1    tan 7  tan 7   0.........6073 = 1 1.... Scale Factor = cos 0 cos 1 ....23)  xi 1   1    cos 0 cos 1 .....28)  I b'  jQb' It results due to multiplication of a rotation value r where r is R  I r  jQr when a pair of complex numbers is multiplied.... when one complex number is multiplied by the .. Similarly.......22) that cosine and sine of an angle  can be represented in the matrix form as  cos    1    sin     tan 0  tan 0   1  1    tan 1  tan 1   1  ... xi 1   cos i    yi 1    sin i  sin i   xi    cos i   yi   sin 1   .6466 (2..22)  xi 1   cos 0  sin 0  cos 1     yi 1    sin 0 cos 0    sin 1  cos 7  sin 7  x  ...

0.I r  Qb .jK " is arctan(. Since CORDIC uses powers of 2 for the K values. etc.K )  .Qb Qb'  Qb  K .I r  Qb .j1 : results in I b'  Qb and Qb'   I b . That's why the CORDIC algorithm . (2.32) Qb'  Qb  K .I r  I b . Since the real part of this. given complex number is multiplied by numbers of the form " R  1  /.jK : I b'  I b  K .25. I r .I b  Qb  (2 i ). phase.5.0. and CORDIC Gain (in table 2. Since the phase of a complex number " I  jQ " is arctan(Q / I ) .R i. starting with 0.I b To subtract a phase. multiply by R  0 . multiply by R = 0 . multiply by R  1 . Going through the above process. the phase of the conjugated is subtracted (though the magnitudes still multiply). the phase of " 1 . K = 1.conjugate of the other. magnitude.Qr (2. Similarly.Qr To subtract R’s phase from B.jK " where K will be decreasing powers of two. and shows the corresponding values of K. we can simplify our table of equations to add and subtract phases for the special case of CORDIC multiplications to: To add a phase.30) Thus to rotate by +90 degrees.arctan( K ) .I b  Qb  (2 i ). -1.29) I b'  I b . B '  B. -2.I b (2.Qb  I b  (2 i ). multiplication and division can be done only by shifting and adding binary numbers. starting with 2^0 = 1.34) Let's look at the phases and magnitudes of each of these multiplier values to get more of a feel for it. 0. is equal to 1.j1.). multiply by R  0  j1 : results in I b'  Qb and Qb'  I b . etc.R where R is the conjugate complex of R I b'  I b .jK " is used.2). to rotate by -90 degrees. The table below lists values of L. the phase of " 1  jK " is arctan( K ).33) (2. B '  B. multiply by R = 0 + j1.To add phases " R  1  jK " is used.Qr and Qb'  Qb .Qb  I b  (2 i ).To rotate by phases of less than 90 degrees.To subtract 90 degrees.Qb (2. Therefore to add R’s phase to B. multiply by R  1  jK : I b'  I b  K . Likewise. the net effect is: To add 90 degrees.31) (2. to subtract phases " R  1 .I r  I b . 0. e. (here the symbol " i " is used to designate the power of two itself: 0.Qr and Qb'  Qb . Therefore.

00000 1.031250 1.015625 1 + j0. That's the key to understanding CORDIC: here what is done that a "binary search" on phase by adding or subtracting successively smaller phases to reach some "target" phase.00195122 1.78991 1.0625 1 + j0.5 26.11803399 1. the phase of each successive R multiplier is a little over half of the phase of the previous R.0625 3.581138830 2 0.doesn't need any multiplies.2: Phase.0 1 + j1.25 1 + j0.41421356 1.646693254 . Table 2.12502 1.646492279 6 0.25 14.642484066 4 0.03624 1.56505 1.00048816 1.57633 1. magnitude.645688916 5 0.00012206 1. and CORDIC Gain for different values of K i K  2 i  R (in degree) R  1  jK Mag .629800601 3 0.0 45.03125 1 + j0.03077641 1.5 1 + j0.125 7.R =tan-1 (K) CORDIC gain 0 1.00778222 1.89517 1.015625 0.125 1 + j0. Also it can be seen that starting with a phase of 45 degrees.414213562 1 0.

7

0.007813

1 + j0.007813

0.44761

1.00003052

1.646743507

.. .

...

...

...

...

...

The sum of the phases in the table up to L = 3 exceeds 92 degrees, so a complex number can be rotated by +/- 90 degrees by doing four or more " R  1  /- jK " rotations. Putting that together with the ability to rotate +/-90 degrees using " R  0  /- j1 ", and it can be rotated a full +/-180 degrees. Each rotation has a magnitude greater than 1.0. That isn't desirable, but it's the price which is to pay for using rotations of the form " 1 + jK ". The "CORDIC Gain" column in the table is simply a "cumulative magnitude" calculated by multiplying the current magnitude by the previous magnitude. Noticing that it converges to about 1.647; however, the actual CORDIC Gain depends on how many iterations have been done. (It doesn't depend on whether the phases are being added or subtracted, because the magnitudes multiply either way.)

2.2.1 Calculation of magnitude of complex number The magnitude of a complex number B  I b  jQb can be calculated if it is rotated to have a phase of zero; then its new Qb value would be zero, so the magnitude would be given entirely by the new I b value.

It can be determined whether or not the complex number " B " has a positive phase just by looking at the sign of the " Qb " value: positive Qb means positive phase. As the very first step, if the phase is positive, rotate it by -90 degrees; if it's negative, rotate it by +90 degrees. To rotate by +90 degrees, just negate Qb , then swap I b and Qb ; to rotate by -90 degrees, just negate I b , then swap. The phase of

B is now less than +/- 90 degrees, so the " 1  /- jK " rotations to follow can
rotate it to zero.

Next, do a series of iterations with successively smaller values of K, starting with K=1 (45 degrees). For each iteration, simply look at the sign of Qb to decide

whether to add or subtract phase; if Qb is negative, add a phase (multiplying by " 1  jK "); if Qb is positive, subtract a phase (multiplying by " 1 - jK "). The accuracy of the result converges with each iteration, as the more iteration is done, the more accurate its results. Now, having rotated the complex number to have a phase of zero, it end up with " B  Ib  j 0 ". The magnitude of this complex value is just " I b " (since " Qb " is zero.) However, in the rotation process, B has been multiplied by a CORDIC Gain (cumulative magnitude) of about 1.647. Therefore, to get the true value of magnitude it must be multiplied by the reciprocal of 1.647, which is 0.607. (The exact CORDIC Gain is a function of the how much iteration is done.) Unfortunately, gain-adjustment multiplication can't be done using a simple shift/add; however, in many applications this factor can be compensated in some other part of the system. Or, when "relative magnitude" is all that counts (e.g. AM demodulation), it can simply be neglected.

2.2.2 Calculation of phase of complex number For calculation of phase, complex number is rotated to have zero phase, as done previously to calculate the magnitude.

For each phase-addition/subtraction step, accumulating the actual number of degrees (or radians) for which it is rotated. The "actual" will come from a table of " arctan( K ) " values like the "Phase of R" column in the table 2.2. The phase of the complex input value is the negative of the accumulated rotation required to bring it to a phase of zero.

2.2.3 Calculation of sine and cosine of an angle

Starting with a unity-magnitude value of B  I b  jQb . The exact value depends on the given phase. For angles greater than +90, B should be started with
B  0  j1 (that is, +90 degrees); for angles less than -90, B should be started

with B  0 - j1 (that is, -90 degrees); for other angles, B should be started with
B  1  j 0 (that is, zero degrees). Initializing an "accumulated rotation"

variable to +90, -90, or 0 accordingly. (Of course, it should be done in terms of radians).

Doing a series of iterations. If the desired phase minus the accumulated rotation is less than zero then add the next angle in the table; otherwise, subtract the next angle. Doing this using each value in the table.

The "cosine" output is in " I b "; the "sine" output is in " Qb ".

2.3 Basic CORDIC iterations
To simplify each rotation, picking  i (angle of rotation in ith iteration) such that

 i = d i . 2 i . d i is such that it has value +1 or -1 depending upon the rotation i. e.
d i  {+1,-1} .Then
xi 1  xi  d i yi 2  i yi 1  yi  di xi 2 i zi 1  zi  d i tan 1 2 i

(2.35) (2.36) (2.37)

The computation of xi 1 or yi 1 requires an i-bit right shift and an add/subtract. If the function tan 1 2 i is pre computed and stored in table (Table 3.1) for different values of i, a single add/subtract suffices to compute zi 1 . Each CORDIC iteration thus involves two shifts, a table lookup and three additions. If the rotation is done by the same set of angles (with + or- signs), then the expansion factor K, is a constant, and can be pre computed. For example to rotate by 30 degrees, the following sequence of angles be followed that add up to  30 degree. 30.0  45.0 - 26.6 + 14.0 - 7.1 + 3.6 + 1.8 -0.9 + 0.4 - 0.2 + 0.1 =30.1 In effect, what actually happens in CORDIC is that z is initialized to 30 degree and then, in each step, the sign of the next rotation angle is selected to try to change

the sign of z; that is, d i =sign ( zi ) is chosen, where the sign function is defined to be 1 or 1 depending on whether the argument is negative or nonnegative. This is reminiscent of no restoring division. Table3.2 shows the process of selecting the signs of the rotation angles for a desired rotation of +30 degree. Figure 3.1 depicts the first few steps in the process of forcing z to zero.

Table 2.3: Approximate value of the function  i  arctan (2-i ) , in degree, for 0  i  9 .
i
0 1 2 3 4 5 6 7 8 9

i
45 26.6 14 7.1 3.6 1.8 0.9 0.4 0.2 0.1

In CORDIC terminology the preceding selection rule for d i , which makes z converge to zero, is known as rotation mode. Rewriting the CORDIC iteration, where

 i  tan 1 2 i :
xi 1  xi  d i yi 2  i

(2.38)

646760258121…… Thus.6 + 1.0 – 0.42)  The constant K in the preceding equation is k = 1.41) (2.2 + 0.6 -2.39) (2. tan z can be through necessary division.0 + 26. respectively. and y = 0.4: Choosing the signs of the rotation angles to force z to zero i 0 1 2 3 4 5 6 7 8 9 zi   i + 30. then.7 + 0. xm and ym converge to cos z and sin z.0 .14. as zm tends to 0 with CORDIC iterations in rotation mode.0 – 45.6 . we have i  z .1 – 1.0. to compute cos z and sin z. Table 2. one can start with x = 1/K = 0.1} such that z  0 (2.7 0..40) (2.9 + 0.2 -0.0.4 .15.2.4 4.4 + 7.yi 1  yi  di xi 2 i zi 1  zi  d i i (2.2 – 0. when z (m) is sufficiently close to zero.1 -0.1 zi 1 -15 11.2 + 0.8 .2 0 -0.607252935….6 + 11.40) After m iteration in rotation mode.1 + 4.7 – 3. Once sin z and cos z are known. and the CORDIC equations become: xm  k ( x cos z  y sin z ) ym  k ( y cos z  x sin z ) zm  0 Rule: choose d i  {1.0 .1 .7 1.

.1 is more than half the previous angle or.1. where99. to one that is within the domain of convergence: cos( z  2 j )  cos z sin( z  2 j )  sin z cos( z   )   cos z sin( z   )   sin z (2. equivalently. this range includes angle from -90 to +90.7.45) (2. In the rotation mode. for i > k.7 is the sum of all the angles in table 3.2 really means z = 0. so that z = 0. y0 ) to ( x10 . For outside the preceding 2 2 range. each angle is less than the sum of the entire angle following it. Fortunately. convergence of z to zero is possible because each angle in table 3. the change in the z will be less than ulp (Unit in the Last Place). The domain of convergence is -99.43) (2.44) (2. For k bits of precision in the resulting trigonometric functions. The reason is that for large i it can be approximated that tan  i 2 i  2 i . k CORDIC iterations are needed. trigonometric identities can be converted to the problem. 0 ) in rotating by +300 .5: First three of 10 iteration leading from ( x0 . Rotation mode. Hence.7 < z <99.2  radian or converted to numbers within the domain quite easily.Figure 2.46) Noting that these transformations become particularly convenient if angles are represented and manipulated in multiples of  radians. or [    to  ] in radians.

50) xm  k ( x  y 2 / x) /(1  y 2 / x 2 ) xm  k ( x 2  y 2 )1/ 2 The CORDIC equations thus become xm  k ( x 2  y 2 )1/ 2 (2.51) (2. However.1} such that y  0. CORDIC method also allows the computation of the other inverse trigonometric functions.47) (2. This computation always converges.In the second way of utilizing CORDIC iterations. After m iterations in vectoring mode tan(  i )   This means that: xm  k[ x cos(  i )  y sin(  i )] xm  k ( x  y tan(  i )) /[1  tan 2 (  i )]1/ 2 y .49) (2. .54) to limit the range of fixed point numbers that is encountered.” y is made nearer to zero by choosing d i   sign( xi yi ) .53) ym  0 zm  z  tan 1 ( y / x ) Rule: Choose d i  {1.52) (2. x (2. known as “vectoring mode. one can take advantage of the identity   tan 1 y     tan 1 y  2  (2.48) (2. one can compute tan 1 y in vectoring mode by starting with x= 1 and z = 0.

as well as shift and add algorithms [15]. iterative rotations of a point around the origin on the X-Y plane are considered. Because the number of sequential cells and amount of storage area. [6]. accuracy or the reasonable amount of resource [15]. such as graphic systems. and recall of a prepared constant. . we can cite speed. Since the rotation is not a pure rotation but a rotation-extension. In each rotation. Among the various properties that are desirable. The architecture of FPGAs specifies suitable techniques or might even change desirable properties. the number of rotations for each angle should be a constant independent of the operand so that the scale factor becomes a constant. the expense of the multiplication needed for many algebraically methods. In this chapter. automatic control systems. and so on. play important roles in various digital systems. In sine and cosine computation by the CORDIC. [12] is known as an efficient method for the computation of these elementary functions. Recent advances in VLSI technologies make it attractive to develop special purpose hardware such as elementary function generators. the coordinates of the rotated point and the remaining angle to be rotated are calculated. are limited but combinational logic in terms of LUT (Look Up Table) in the FPGA's (Field Programmable Gate Array) CLBs (Configurable Logic Blocks) is sufficiently available. and so on [14]. singular value decomposition. addition and subtraction. shift and add algorithms fit perfectly into an FPGA. Alternative techniques are based on polynomial approximation. The CORDIC can also be applied to matrix triangularization. When implementing a sine and cosine calculator in digital hardware. The calculations in each iteration step are performed by shift. should be kept in mind. The CORDIC (Coordinate Rotation DIgital Computer) [11]. especially trigonometric functions. table-lookup [15] etc.CHAPTER 3 COMPUTATION OF SINE COSINE Elementary functions. Several function generators based on the CORDIC have been developed [13]. different hardware are dealt for sine and cosine computation using CORDIC. needed for table-lookup algorithms.

1 can be bit serial. O( k 2 ) clock cycle would be required to complete the k CORDIC iterations. X shift  Y shift  Z  Look up Table Figure. This is acceptable for hand handled calculators. Then with k bit operands. The d i factor (-1 and 1) is accommodated by selecting the (shift) operand or its complement. y and z. In the extreme. It requires three registers for x. Of. 3. the look up table supplying the term  i can be stored in the control ROM or in main memory. a look up table to store the values of  i  tan 1 2 i .1 CORDIC Hardware A straight forward hardware implementation for CORDIC arithmetic is shown below in figure 3. the adder in fig 3. CORDIC iterations can be implemented in firmware (micro program) or even software using the ALU and general purpose registers of a standard microprocessor. course a single adder and one shifter can be shared by three computations if a reduction in speed by a factor of 3 is acceptable. since even a delay of tens of thousands of clock cycles constitutes a small fraction of a second and thus is hardly noticeable to a .3. and two shifter to supply the terms 2 i x and 2 i y to the adder/subs tractor units. In this case.2.1: Hardware elements needed for the CORDIC method where high speed is not required and minimizing the hard ware cost is important (as in calculator).

characterize the results of circular CORDIC rotations: xm  k ( x cos z  y sin z )   ym  k ( y cos z  x sin z )   zm  0  (3. where the vector length is the familiar ri  x 2  y 2 .2 Generalized CORDIC The basic CORDIC method can be generalized to provide the more powerful tool for function evaluation. With reference to figure3.1} such that z  0 ) . The parameter  can assume one of the three values: (3. we introduce rotations that led to expansion of vector length by a factor (1  tan  i )1/ 2 = i / cos  i in each step and by K = 1. 3. Intermediate between the fully parallel and fully bit-serial realizations are a wide array of digit serial (for example decimal or radix-16) implementation that provide trade off speed versus cost.2.2 illustrates the three type of rotation in generalized CORDIC. repeated here for ready comparison.1) =1 Circular rotations (Basic CORDIC)  i = tan 1 2 i  i = 2i  i = tanh 1 2 i  = 0 linear rotation  = -1 Hyperbolic rotation Figure3. the rotation angle AOB can be defined in terms of the area of sector AOB as follows: angleAOB  2(areaAOB ) (OU )2 The following equations. Generalized CORDIC is defined as follows: xi 1  xi   d i yi 2 i   yi 1  yi  di xi 2 i   zi 1  zi  d i i  [Generalized CORDIC iteration] Noting that the only difference with basic CORDIC is the introduction of the parameter  in the equation for x and redefinition of  i . For the circular case with  = 1.human user.2) (Circular rotation mode.646760258121…overall. Rule: choose d i  {1.

The following equations characterized the results of linear CORDIC rotations: xm  x   ym  y  xz   zm  0  (Linear rotation mode.xm  k ( x 2  y 2 )1/ 2   ym  0   1 zm  z  tan ( y / x)  (Circular vectoring mode. or divide-add (vectoring mode). In hyperbolic rotations corresponding to  = -1. The following equations characterize the results of hyperbolic CORDIC rotations: . multiply-add (rotation mode). Rule: Choose d i  {1. Hence. leading to an overall shrinkage factor K’ = 0. the length of the vector is always its true length OV and the scaling factor is 1. Rule: Choose d i  {1.5) (3. z = 0).4) (Linear vectoring mode. with the length expansion due to rotation being (1  tanh  i )1/ 2 = i / cosh  i . the rotation angle EOF can be defined in terms of the area of the hyperbolic sector EOF as follows: angleEOF  2(areaEOB) (OW ) 2 The vector length is defined as ri  x 2  y 2 .8281593609602……… after all the iterations. y = 0).1} such that y  0) hence. Because cosh  i > 1. linear CORDIC rotations can be used to perform multiplication (rotation mode.1} such that z  0) xm  x   ym  0  zm  z  y / x   (3. the end point of the vector is kept on the line x  x0 and the vector length is defined by ri  xi . the vector length actually shrinks.3) In linear rotations corresponding to  = 0. Rule: Choose d i  {1.1} such that y  0) (3. division (vectoring mode.

x = 1/k’.6) xm  k ( x 2  y 2 )1/ 2   ym  0   1 zm  z  tanh ( y / x)  (Hyperbolic vectoring mode.7) hence. y = 0) or the tanh 1 function (vectoring mode. 3.1} such that y  0. Rule: Choose d i  {1.2: Circular. Other functions can be computed indirectly [16]. Figure. multiplication.xm  k ' ( x cosh z  y sinh z )   ym  k ( y cosh z  x sinh z )   zm  0  (Hyperbolic rotation mode. Volder's algorithm is derived from the general equations for . hyperbolic CORDIC rotations can be used to compute the hyperbolic sine and cosine functions (rotation mode. z = 0). Volder [1] described the Coordinate Rotation Digital Computer or CORDIC for the calculation of trigonometric functions. The CORDIC-algorithm provides an iterative method of performing vector rotations by arbitrary angles using only shifts and adds.3 The CORDIC-algorithm for Computing a Sine and Cosine Jack E. Rule: choose d i  {1.1} such that z  0 ) (3. Linear and Hyperbolic CORDIC [16] 3. division and conversion between binary and mixed radix number systems. (3. x = 1.

cos( )  y.9) x Figure 4. if those rotations are performed iteratively and in both directions every value of tan( ) is representable.2): X = r cos  . y ) is to be rotated through an angle  a new vector v ' with components ( x ' . This iterative rotation can now be expressed as: . y ' ) is formed by (as in equation 2.vector rotation.cos( )  x.cos( )  y.sin( ) (3.sin( )  (3.13) The multiplication by the tangent term can be avoided if the rotation angles and therefore tan( ) are restricted so that tan( )  2  i .tan( )] (3.10) (3.Y = r sin   x '   x.3: Rotation of a vector V by the angle  The individual equations for x’ and y ' can be rewritten as [17]: x ' = x.In digital hardware this denotes a simple shift operation. With   arctan(2 i ) the cosine term could also be simplified and since cos( )  cos(  ) it is a constant for a fixed number of iterations.sin( ) and rearranged so that: x '  cos( )[ x  y.11) y ' = y.cos( )  x.2 illustrates the rotation of a vector V    by the angle  . tan( )] y '  cos( )[ y  x. If a vector v with components ( x.1 and 2.12) (3.sin( )  V'  '  y     y.  y Figure 3. Furthermore.8) (3.

d i .d i . This yields to a third equation which acts like an angle accumulator and keeps track of the angle already rotated.xi 1  ki [ xi  yi .2 i ] (3. The product of the K i 's represents the socalled K factor [6]: n 1 k   ki i 0 (3. Each vector v can be .15) where Ki  cos(arctan(2 i )) and di  1 . A good way to implement the k factor is to initialize the iterative rotation with a vector of length k which compensates the gain inherent in the CORDIC algorithm.di .18) yi 1  [ yi  xi .17) (3.8 can now be simplified to the basic CORDIC-equations: xi 1  [ xi  yi . initialized with V0 Equations 4.14) (3.4: Iterative vector rotation.2 ] i The direction of each rotation is defined by d i and the sequence of all d i 's determines the final vector. Figure 3.7 and 4.16) This k factor can be calculated in advance and applied elsewhere in the system.3.2 i ] (3.2 i ] yi 1  ki [ yi  xi . The resulting vector v ' is the unit vector as shown in Figure 4.di .

every sine value from 0 to 2 can be represented by reflecting and/or inverting the first quadrant appropriately. However. d i is sensitive to the sign of zi Therefore d i can be described as: 1.21) With equation 4. since a sine wave is symmetric from quadrant to quadrant. 3.14 the CORDIC algorithm in rotation mode is described completely. Since the decision is which direction to rotate instead of whether to rotate or not.arctan(2  i ) (3.20) Those values arctan(2 i ) can be stored in a small lookup table or hardwired depending on the way of implementation.described by both the vector length and angle or by its coordinates x and y .4 Implementation of various CORDIC Architectures As intended by Volder. The rotation mode.arctan(2  i )  i 0 (3. determines the right sequence as the angle accumulator approaches 0 while the vectoring mode minimizes the y component of the input vector. that the CORDIC method as described performs rotations only within    and . The angle accumulator is defined by: zi 1  zi  d i . the CORDIC algorithm knows two ways of determining the direction of rotation: the rotation mode and the vectoring mode. Note. Following this incident. Both methods initialize the angle accumulator with the desired angle z0 . The .19) where the sum of an infinite number of iterative rotation angles equals the input angle  [14]:    di . the CORDIC algorithm only performs shift and add operations and is therefore easy to implement and resource-friendly. However. This limitation comes from the use of 20 for the tangent in the first 2 2 iteration. when implementing the CORDIC algorithm one can choose between various design methodologies and must balance circuit complexity with respect to performance. if di   1. if zi  0 zi  0 (3.

11. are described and compared in the following sections. For n iterations the output is mapped back to the registers before initial values are fed in again and the final sine value can be accessed at the output. At the beginning of a calculation initial values are fed into the register by the multiplexer where the MSB of the stored value in the z-branch determines the operation mode for the adder-sub tractor. A simple finite-state machine is needed to control the multiplexers. The . unrolled and iterative. 4. bit-parallel. the shift distance and the addressing of the constant values. 3.14 is represented by the schematics in Figure 4.4 when directly translated into hardware. Signals in the x and y branch pass the shift units and are then added to or subtracted from the unshifted signal in the opposite path.4.1 A Bit-Parallel Iterative CORDIC The CORDIC structure as described in equations 4.10. 4. The adder and the sub tractor component are carried out separately and a multiplexer controlled by the sign of the angle accumulator distinguishes between addition and subtraction by routing the signals as required. bit-serial.5: Iterative CORDIC The z branch arithmetically combines the registers values with the values taken from a lookup table (LUT) whose address is changed accordingly to the number of iteration. x0 y0 z0   Register Register Sign(n-1) Register  >> n >> n Const(n) Multiplexer Multiplexer Multiplexer    x(n) y(n) z(n) Figure 3. a shift unit and a register for buffering the output.12 and 4. Each branch consists of an adder-sub tractor combination.most obvious methods of implementing a CORDIC. When implemented in an FPGA the initial values for the vector coordinates as well as the constant values in the LUT can be hardwired in a word wide manner.

72 12 280 263. As we talk about a bit-parallel unrolled design with 16 bit word length. Table 3. the area and therefore the maximum path delay increase as stages are added to the design where the path delay is an equivalent to the speed which the application could run at.5. one could simply cascade the iterative CORDIC. and in the face of separate stages two simplifications become possible. Input values find their path through the architecture on their own and do not need to be controlled. First. Naturally.shift operations as implemented change the shift distance with the number of iterations but those require a high fan in and reduce the maximum speed for the application [18].86 13 304 256.75 9 208 177.17 10 232 206. Obviously the resources in an FPGA are not very suitable for this kind of architecture. From table 4. there is no need for changing constant values and those can therefore be hardwired as well. which means rebuilding the basic CORDIC structure for each iteration. the shift operations for each step can be performed by wiring the connections between stages appropriately.87 . each stage contains 48 inputs and outputs plus a great number of cross-connections between single stages. the output of one stage is the input of the next one. as shown in Figure 4. of Iterations Complexity [CLB] Max path delay[ns] 8 184 163. times 3. In addition the output rate is also limited by the fact that operations are performed iteratively and therefore the maximum output rate equals 1/n the clock rate.4. Those cross-connections from the x-path through the shift components to the y-path and vice versa make the design difficult to route in an FPGA and cause additional delay times.1: Performance and CLB usage in an XC4010E [19] No.1 it can be seen how performance and resource usage change with the number of iterations if implemented in an XILINX FPGA. Consequently. The purely unrolled design only consists of combinatorial components and computes one sine value per clock cycle. Second.2 A Bit-Parallel Unrolled CORDIC Instead of buffering the output of one iteration and using the same resources again.9 11 256 225.

the area in FPGAs can be measured in CLBs. Figure 3.As described earlier. It can be seen. each of which consist of two lookup tables as well as storage cells with additional control components [20].inserting registers between stages would also reduce the maximum path delays and correspondingly a higher maximum speed can be achieved. that the number of CLBs stays almost the same while the maximum frequency increases as registers are inserted. This means registers could be inserted easily without significantly increasing the area. [21]. Obviously. The reason for that is the decreasing amount of combinatorial logic between sequential cells.6: Unrolled CORDIC However. For the purely combinatorial design the CLB's function generators perform the add and shift operations and no storage cells are used. the gain of speed when .

The subtraction is again indicated by the sign bit of the angle accumulator as described in section 4. even when co-processors are used. Bit-serial means only one bit is processed at a time and hence the cross connections become one bit-wide data paths. The shift-by. Especially if a sufficient number of CLBs is at one's disposal. this type of architecture becomes more and more attractive. Clearly. engineering graphics.2. Both. 3. The reason is the structural simplicity of a bit-serial design and the correspondingly high clock rate achievable. the throughput becomes a function of number of clock rate iterations  word width In spite of this the output rate can be almost as high as achieved with the unrolled design.4.6 shows the basic architecture of the bit serial CORDIC processor as implemented in a XILINX Spartan.inserting registers exceeds the cost of area and makes therefore the fully pipelined CORDIC a suitable solution for generating a sine wave in FPGAs. In this architecture the bit-serial adder-subtractor component is implemented as a full adder where the subtraction is performed by adding the 2's complement of the actual subtrahend [22]. Evaluating complicated equation sets can be very time consuming in software. especially when these equations contain a large number of nonlinear and transcendental functions as well as many multiplication and division operations. and signal processing areas.3 A Bit-Serial Iterative CORDIC Problems which involve repeated evaluation of a fixed set of nonlinear. show disadvantages in terms of complexity and path delays going along with the large number of cross connections between single stages. Figure 4. A single bit of state is stored at the adder to realize the carry chain [23] which at the same time requires the LSB to be fed in first. To reduce this complexity we can change the design into a completely bit-serial iterative architecture. Examples of such problems can be found in the robotics. algebraic equations appear frequently in scientific and engineering applications. the unrolled and the iterative bit-parallel designs. as is the case in high density devices like XILINX's Virtex or ALTERA's FLEX families.i operation can be realized by reading the bit i  1 from it's right end in .1.

when all iterations are passed the input multiplexers switch again and initial values enter the bit-serial CORDIC processor as the computed sine values exit. The design as implemented runs at a much higher speed than the bit-parallel architectures described earlier and fits easily in a XILINX SPARTAN device.7: Bit-serial CORDIC . A multiplexer can be used to change position according to the current iteration. The latter could be replaced by a RAM or serial ROM where values are read by simply incrementing the memory's address. Finally. The constant LUT for this design is implemented as a multiplexer with hardwired choices. This would clearly accelerate the performance.the serial shift registers. y0 and z0 are fed into the array at the left end of the serial-in serial-out register and as the data enters the adder component the multiplexer at the input switch and map back the results of the bit-serial adder into the registers. The performance is constrained by the use of multiplexers for the shift operation and even more for the constant LUT. The reason is the high ratio of sequential components to combinatorial components. Figure 3. The initial values x0 .

and è /2. the CORDIC method is the winner. The bit-parallel unrolled and fully pipelined design uses the resources extensively but shows the best latency per sample and maximum throughput rate. are still off by at the most 1/50. looking at the absolute errors it appear that this is the case. and è /2. As the values of x get further away from our centers 0. However.6 Numerical Comparison By running a test program on our Taylor. We can see from the absolute error that our Taylor approximation does just what is expected. The best approximation looking at the absolute error is definitely CORDIC. Clearly by numerical standards. As one can see all three give fairly reasonable approximations to cosine. 3. Table 4. .2 illustrates how the architectures for the iterative bit-serial and iterative bit-parallel designs for 16 bit resolution vary in terms of speed and area. note that the Taylor approximation did very well and with more centers would do even better. the values computed in the test case show that at most angles the polynomial doesn’t accurately correspond with the cos function. it does not seem to fit the sinusoid very well and this will apparently give a poor approximation. The prototyping environment limited the implementation of the unrolled design to 13 iterations.3. The iterative bit-parallel design provides a balance between unrolled and bit-serial design and shows an optimum usage of the resources. è /6. In fact it turns out to be exact on nearly every angle (at least in terms of MATLAB’s cos ( ) function). The error then decreases as our angle again nears the next center. è /3. As for the polynomial interpolation. è /4. it appears that the approximation should be best when x is near 0. The best values. The polynomial interpolation turns out to be the worst approximation. However. and CORDIC approximations for cosine one can obtained the output attached. those near our chosen points. the error increases.5 Comparison of the Various CORDIC Architectures In the previous sections. By looking at the graph again. The resulting structures show differences in the way of using resources available in the target FPGA device. we described various methods of implementing the CORDIC algorithm using an FPGA. The bit-serial design stands out due to it's low area usage and high achievable speed. Polynomial Interpolation. Whereas the latency and hence the maximum throughput rate is much lower compared to the bit-parallel designs.

but for FPGAs where sufficient CLBs are available one might choose the bit-parallel and fully pipelined architecture [24] since latency is minimal and no control unit is needed. This means that how fast it is depends on how fast it converges to the actual answer. this is combated by the fact that in n iterations both the cosine and sine are found. but rather an iterative formula. giving an error bound of 1/2n ≥ |cos è . while crude. the CORDIC method.xn+1|. . For instance. it is essentially finding (with precision limitations) the actual values of sin and cosine.7 Other Considerations By the numerical comparison there is an obvious loser – polynomial interpolation. albeit a small one. Something that the other methods can’t do. this method is not a function that can be easily evaluated. Finally. the complexity of the method is much greater than the simple quadratic and for reasonable accuracy needs multiple expansions. for the x values that fall in the middle of the centers accuracy is still an issue. Also. By rotating a unit vector in a coordinate system. In actual fact it would be more accurate to look at the resources available in the specific target devices rather than the specific needs in order to determine what architecture to use. While the convergence isn’t great it is fairly fast. has the most direct solution to the problem of evaluating trigonometric functions. Thus CORDIC is the best way of fast calculation by using subtraction and addition only.3. which by far finds the most accurate approximations. The Taylor approximation while it is slower to calculate than the polynomial approximation is a function and can be calculated quickly as well. however there may be certain conditions that require different properties other than just accuracy. is very fast to calculate and in terms of complexity it is the simplest of all three. polynomial interpolation. However. However. The bit-serial structure is definitely the best choice for relatively small devices. The complexity is great in that CORDIC has to be done n times to get a solution for an n-bit computer. However. where xn+1 is the current step in the CORDIC algorithm. This means that the algorithm gets at about twice as close to the real solution every iteration.

2.8: A CORDIC-based Oscillator for sine generation The oscillator has been implemented and tested in a XILINX XC4010E. Samples /sec] 0. Throughput [Mio. LUTs. the values are now fed into the CORDIC structure through a separate input. [19] CLB [1] bit-serial bit-parallel 111 138 LUT [1] 153 252 FF [1] 108 52 Speed [MHz] 48 36 Latency [  s] 5.Table 3.9 illustrates the resulting structure of the complete oscillator Figure 3.44 Max.2. The architecture of this device provides specific resources in terms of CLBs (configurable logic blocks). storage cells and maximum speed.8 Hardware Implementation As demonstrated the amplitude control can be carried out within the CORDIC structure. .33 0.2: Performance and CLB usage for the bit-parallel and bit-serial iterative designs.25 3.1875 2. Figure 4. Instead of hard-wiring the initial values as proposed in section 4.

digital signal controllers (mostly for industrial apparatus such as motor control). etc. or on purpose-built hardware such as application-specific integrated circuit (ASICs).W. The Fast Fourier Transform (FFT) (Figure 4. Often. on specialized processors called digital signal processors (DSPs). DSP (Digital Signal Processing) includes subfields like: audio and speech signal processing. DSP algorithms have long been run on standard computers. signal processing for communications.CHAPTER 4 CORDIC FOR DFT CALCULATION In chapter 3 it has discussed that how sine and cosine can be calculated using CORDIC algorithm and now using this algorithm how Digital Fourier Transform (DFT) can be calculated has been discussed in this chapter. such as solving simultaneous linear equations or the correlation method. and stream processors. J. Tukey are the founder credit for bringing the FFT (also known as divide and conquer algorithm). the required output signal is another analog output signal.1 Eight point decimation-in-time FFT algorithm) is another method for calculating the DFT. by using an analog to digital converter. fieldprogrammable gate arrays (FPGAs). often reducing the computation time by hundreds. biomedical signal processing. from digital signal processing . Digital signal processing and analog signal processing are subfields of signal processing.W. spectral estimation. Today there are additional technologies used for digital signal processing including more powerful general purpose microprocessors. which requires a digital to analog converter. the first step is usually to convert the signal from an analog to a digital form. statistical signal processing. While it produces the same result as the other approaches. among others. digital image processing. FFTs are of great importance to a wide variety of applications. it is incredibly more efficient. Since the goal of DSP is usually to measure or filter continuous real-world analog signals. It is concerned with the representation of the signals by a sequence of numbers or symbols and the processing of these signals. A Fourier transform is a special case of a wavelet transform with basis vectors defined by trigonometric functions sine and cosine. A Fast Fourier transform (FFT) is an efficient algorithm to compute the Discrete Fourier transform (DFT) and it’s inverse. There are several ways to calculate the Discrete Fourier Transform (DFT). Cooley and J. seismic data processing. sensor array processing. sonar and radar signal processing.

. 2. Many FFT algorithms only depend on the fact that e  j 2 N is an Nth primitive root of unity. N  1. This article describes the algorithms. Decimation in time involves breaking down a signal in the time domain into smaller signals... even for prime N.1) kn where WN  e  j 2 kn / N For a real sample sequence f (n).sin(2 / N ) kn] n 0  Fx ( k )  Fy ( k ) (4. can be defined as DFT: N 1 F (k )   f ( N )[cos(2 / N ) kn  j. . 1. is x(n) then the frequency response X(k) can be calculated by using the DFT..1...1. Evaluating these sums directly would take (N 2) arithmetical operations.1 Calculation of DFT using CORDIC If the input (time domain) signal. An FFT is an algorithm to compute the same result in only (N log N) operations. such algorithms depend upon the factorization of N.. such as number-theoretic transforms Decimation is the process of breaking down something into it's constituent parts.. but (contrary to popular misconception) there are FFTs with (N log N) complexity for all N. of which there are many. ……. and thus can be applied to analogous transforms over any finite field. see discrete Fourier transform for properties and applications of the transform.. (N-1)} DFT and the DHT. N  1..2) . 2.and solving partial differential equations to algorithms for quick multiplication of large integers.... 4. where n is {0. X ( k )   x(n)WNkn for n0 N 1 k  0.. (4.. The DFT is defined by the formula N 1 X ( k )   x ( n )e n0  j 2 kn N for k  0. In general. each of which is easier to handle... of N points.

3) As it is evident from the expressions. the higher-order terms in the expansion of the sine and the cosine can be neglected. Now the rotation by i 0 elementary angles can be expressed in terms of sine of the angle as: sin   i  2 i (4. In other words.DHT: N 1 H ( k )   f ( N )[cos(2 / N ) kn  sin(2 / N ) kn] n0 (4. The operation can be represented mathematically as: BY    AX  cos  AY    sin  15  sin   cos    (4. Using sine cosine value generated by CORDIC algorithm DFT can be calculated by arranging them in matrix. It can be done by arranging sine cosine value in two different matrix one as real part and another as imaginary part then multiplying by input sampled data f (n) which results in F (k ) .5) where i is a positive integer. These transforms can be expressed in terms of plane rotations. This assumption imposes some restriction on the allowable values of i. Since the elementary rotational angles  i have been assumed to be sufficiently small.4) The rotation by a certain angle can be achieved by the summation of some elementary small rotations given by:     i for a 16-bit machine. The CORDIC unit can iteratively rotate an input vector A   AX angles  i (so that AY  by a target angle  through small steps of elementary ) to generate an output vector B   BX   BX i i BY  . In the case of DFT . Result can be stored in matrix as real and imaginary part. above transforms involve trigonometric operations on the input sample sequences. since there is no imaginary part the two matrixes can be shown in a single matrix by adding the two resulting matrix of sine and matrix of cosine in the case of 8  8 matrix the two results in the case of DFT and DHT can be arranged as below. all the input samples are given a vector rotation by the defined angle in each of the transforms. In the case of DHT computation.

9) each term of 8  8 matrix can be seen as cos(2 kn / N )  sin(2 kn / N ) and H ( k ) is given as: H ( k )  FZ ( k )  FI ( k ) . The resulting value of H ( k ) by using equation (4. The two result i. But in the case of DHT since there is no imaginary part the two values generated in the matrices be added together and used further for different applications.W80  0 W8 W80  0 W FR (k )   80 W  80 W8 W 0  8 W80  W80 W81 W82 W80 W82 W84 W80 W83 W86 W89 W812 W815 W818 W821 W80 W84 W88 W812 W816 W820 W824 W828 W80 W85 W810 W815 W820 W825 W830 W835 W80 W86 W812 W818 W824 W830 W836 W842 W83 W86 W84 W88 W85 W810 W86 W812 W87 W814 W80   f (0)     W87   f (1)  W814   f (2)     W821   f (3)  W828   f (4)     W835   f (5)  W842   f (6)     W849  R  f (7)     (4. W80  0 W8 W80  0 W H (k )   80 W  80 W8 W 0  8 W80  W80 W80 W81 W82 W82 W84 W83 W86 W84 W88 W85 W810 W86 W812 W87 W814 W80 W83 W86 W89 W812 W815 W818 W821 W80 W84 W88 W812 W816 W820 W824 W828 W80 W85 W810 W815 W820 W825 W830 W835 W80 W86 W812 W818 W824 W830 W836 W842 W80   W87  W814   W821  W828   W835  W842   W849    f (0)   f (1)     f (2)     f (3)   f (4)     f (5)   f (6)     f (7)    (4. . Final value is given as: F (k )  FZ (k )  j. in the above 8  8 matrix only sine value is stored at corresponding positions. in the above 8  8 matrix only cosine value is stored at corresponding positions.FI (k ) .6) suffix R shows real part.e.8) In the right side of equation (4. W80  0 W8 W80  0 W FI (k )   80 W  80 W8 W 0  8 W80  W80 W81 W82 W83 W84 W80 W82 W84 W86 W88 W80 W83 W86 W89 W80 W84 W88 W812 W80 W85 W810 W815 W80 W86 W812 W818 W80   W87  W814   W821  W828   W835  W842   W849  I   f (0)   f (1)     f (2)     f (3)   f (4)     f (5)   f (6)     f (7)    W812 W816 W820 W824 (4.3) . real and imaginary part can be stored in RAM location for further use.7) W85 W810 W815 W820 W825 W830 W86 W812 W818 W824 W830 W836 W87 W814 W821 W828 W835 W842 suffix I shows real part.

. The basic computation using FFT is called butterfly computation which s shown in figure 4.2 FFT method for DFT calculation Fast Fourier transform (FFT) is an efficient algorithm for computing the discrete Fourier transform.2 . It requires less multiplication than a simple approach of calculating DFT [25].2: Eight point decimation-in-time FFT algorithm. DFT of an input sampled signal as discussed earlier in this chapter.2 a b -1 A  a  WNr b A  a  WNr b Figure 4. By using the above butterfly computation technique an 8  8 DFT can be calculated as shown figure 4. x(0) x(4) W80 -1 X(0) X(1) x(2) x(6) -1 W80 W82 -1 -1 W80 X(2) X(3) x(1) -1 X(4) x(5) W80 x(3) -1 W80 W81 -1 -1 X(5) X(6) -1 W82 x(7) W80 -1 W82 -1 W83 -1 X(7) Figure 4.1: Basic butterfly computation in the decimation-in-time. DFT is a tool to estimate the samples of the CFT at uniformly spaced frequencies. The discovery of the FFT algorithm paved the way for widespread use of digital methods of spectrum estimation which influenced the research in almost every field of engineering and science.4.

& break it down into two N/2 point DFTs by splitting the input signal into odd & even numbered samples to get: X (k )  1 N N / 2 1 m0 2 x(2n)WN mk   1 N N / 21 m0 (2 x(2n  1)WN m1) k (4.e. . X ( k )  even number X (k )  1 N N / 2 1 m0 samples  odd number x(2m)(WN2 ) mk   1 k N / 21 WN  x(2m  1)(WN2 ) rk N m 0 (4.3 Write row wise Take DFT column wise DFT row wise Multiplication by twiddle matrix Read Column wise Figure 4.The steps involved in FFT method can be understood by the following figure 4.9) samples  i.4 which tells about the different stages to calculate DFT of a sampled signal.3: FFT divide and conquer method We take this N point DFT.10) 2-POINT 4-POINT 2-POINT 8-POINT 2-POINT 4-POINT 2-POINT Figure 4.4: Fast Fourier Transform Block diagram for FFT is in figure 4.

or in places where and ASIC will eventually be used. The architecture consists of configurable .CHAPTER 5 FPGA DESIGN FLOW 5. they are structured very much like a gate array ASIC.1: FPGA Architecture Each FPGA vendor has its own FPGA architecture.2 FPGA Architectures Figure5.1 Field Programmable Gate Array (FPGA) Field Programmable Gate Arrays are called this because rather than having a structure similar to a PAL or other programmable device. an FPGA maybe used in a design that need to get to market quickly regardless of cost. Later an ASIC can be used in place of the FPGA when the production volume increases.1. 5. in order to reduce cost. For example. but in general terms they are all a variation of that shown in Figure 5. This makes FPGAs very nice for use in prototyping ASICs.

these CLBs will contain enough logic to create a small state machine. the CLB will contain only very basic logic[26]. It also contains flip-flops for clocked storage elements. memory. In large grain architecture. configurable I/O blocks. The two basic types of programmable elements for an FPGA are Static RAM and anti-fuses. It contains RAM for creating arbitrary combinatorial logic functions.2 Configurable I/O Blocks A Configurable I/O Block. there will be clock circuitry for driving the clock signals to each logic block. and programmable interconnect.1 Configurable Logic Blocks Configurable Logic Blocks contain the logic for the FPGA.2. 5.3. It consists of an input buffer and an output buffer with three state and open collector output controls. and multiplexers in order to route the logic within the block and to and from external resources. more like a true gate array ASIC. shown in Figure 5. Typically there are pull up resistors on the outputs and sometimes pull down resistors. Also. In fine grain architecture. The polarity of the output .2: FPGA Configurable Logic Block 5.logic blocks. The diagram in Figure 5. and additional logic resources such as ALUs. and decoders may be available. is used to bring signals onto the chip and send them back off again. Figure 5. The multiplexers also allow polarity selection and reset and clear input selection.2 would be considered a large grain block.2.

3: FPGA Configurable I/O Block 5.3 Programmable Interconnect The interconnect of an FPGA is very different than that of a CPLD. Three-state buffers are used to connect many CLBs to a long line. Programmable switches inside the chip allow the connection of CLBs to interconnect lines and interconnect lines to each other and to the switch matrix. Figure 5. to connect these long and short lines together in specific ways. They can also be used as buses within the chip. There are also short lines which are used to connect individual CLBs which are located physically close to each other. In Figure 5. there is often a flip-flop on outputs so that clocked signals can be output directly to the pins without encountering significant delay.2.4. like that in a CPLD. are specially designed for low impedance and thus fast propagation times. Special long lines. It is done for inputs so that there is not much delay on a signal before reaching a flip-flop which would increase the device hold time requirement. a hierarchy of interconnect resources can be seen. called global clock lines.can usually be programmed for active high or active low output and often the slew rate of the output can be programmed for fast or slow rise and fall times. There is often one or several switch matrices. creating a bus. These are connected to . In addition. There are long lines which can be used to connect critical CLBs that are physically far from each other on the chip without inducing much delay. but is rather similar to that of a gate array ASIC.

In a large grain FPGA. This is how the clocks are distributed throughout the FPGA. Large Granularity Small grain FPGAs resemble ASIC gate arrays in that the CLBs contain only small. NOR gates. are distributed around the chip. known as clock drivers. Only when using clock signals from clock buffers can the relative delays and skew times are guaranteed. since absolute skew and delay cannot be guaranteed. small grain architectures require much more routing resources. These buffers are connected to clock input pads and drive the clock signals onto the global clock lines described above. As we will discuss later. where the CLB can contain two or more flipflops. a design which does not need many flip-flops will leave many of them unused. Figure 5.2. which .4 Clock Circuitry Special I/O blocks with special high drive clock buffers. synchronous design is a must with FPGAs. very basic elements such as NAND gates. Unfortunately. The philosophies that small elements can be connected to make larger functions without wasting too much logic.4: FPGA Programmable Interconnect 5. 5. etc.5 Small vs.the clock buffers and to each clocked element in each CLB.2. These clock lines are designed for low skew times and fast propagation times.

which means a power glitch could potentially change it. The advantages of SRAM based FPGAs is that they use a standard fabrication process that chip fabrication plants are familiar with and are always optimizing for better performance. Writing the bit with a zero turns off a switch. while writing with a one turns on a switch. The disadvantages are that they are volatile.take up space and insert a large amount of delay which can more than compensate for the better utilization.6 SRAM vs. The disadvantages are that they require a complex fabrication process. normally make no connection. the FPGAs can be reprogrammed any number of times. SRAM based devices have large routing delays. just like writing to a normal SRAM. so they tend to be faster. they require an external programmer to program them. Anti-fuse Programming There are two competing methods of programming FPGAs. even while they are in the system. 5. The first. The other method involves anti-fuses which consist of microscopic structures which. SRAM programming. A certain amount of current during programming of the device causes the two sides of the anti-fuse to connect. Small Granularity Better utilization Direct conversion to ASIC Large Granularity Fewer levels of logic Less interconnect delay A comparison of advantages of each type of architecture is shown in Table. The advantages of Anti-fuse based FPGAs are that they are nonvolatile and the delays due to routing are very small. The choice of which architecture to use is dependent on your specific application. involves small Static RAM bits for each programming element. FPGA Families Examples of SRAM based FPGA families include the following: ■ Altera FLEX family ■ Atmel AT6000 and AT40K families . Also.2. Since the SRAMs are reprogrammable. unlike a regular fuse. they cannot be changed. and once they are programmed.

■ An internal block diagram showing each major functional section. This is an absolute must. This is the entire process for designing a device that guarantees that you will not overlook any steps and that you will have the best chance of getting backs a working prototype that functions correctly in your system. whether it is an ASIC. especially as a guide for choosing the right technology and for making your needs known to the vendor. It allows the engineer to design the correct interface to the rest of the pieces of the chip. or a CPLD. As specification allows each engineer to understand the entire design and his or her piece of it. There is no excuse for not having a specification. A specification should include the following information: ■ An external block diagram showing how the chip fits into the system. The design flow consists of the steps in 5.■ Lucent Technologies ORCA family ■ Xilinx XC4000 and Virtex families Examples of Anti-fuse based FPGA families include the following: ■ Actel SX and MX families ■ Quick logic pASIC family 5.1 Writing a Specification The importance of a specification cannot be overstated.3. ■ A description of the I/O pins including ■ Output drive capability ■ Input threshold level ■ Timing estimates including .3 The Design Flow This section examines the design flow for any device. an FPGA. It also saves time and misunderstanding.

Write a Specification Specification Review Design Simulate Design Review Synthesize Place and Route Resimulate Final Review Chip Test System Integration on Test Chip Product Figure 5.5: Design Flow of FPGA .

For smaller chips. especially if the design engineer is already familiar with the tools. synthesis software will be required to “synthesize” the design. For larger designs. This is important since each synthesis tool has recommended or mandatory methods of designing hardware so that it can correctly perform synthesis. 5. flexibility. but these will change as the chip is being designed. a hardware description language (HDL) such as Verilog or VHDL is used because of its portability. however.4 Choosing a Synthesis Tool You must decide at this point which synthesis software you will be using if you plan to design the FPGA with an HDL.3. 5. it can be used to find the best vendor with a technology and price structure that best meets your requirements.3.3 Choosing a Design Entry Method You must decide at this point which design entry method you prefer.2 Choosing a Technology Once a specification has been written.■ Setup and hold times for input pins ■ Propagation times for output pins ■ Clock cycle time ■ Estimated gate count ■ Package type ■ Target power consumption ■ Target price ■ Test procedures It is also very important to understand that this is a living document.3. 5. Many sections will have best guesses in them. schematic entry is often the method of choice. At the end of this phase it . and readability. When using a high level language. It will be necessary to know these methods up front so that sections of the chip will not need to be redesigned later on. This means that the software creates low level gates from the high level description.

It is important to get others to look over the simulations and make sure that nothing was missed and that no improper assumption was made.design review Simulation is an ongoing process while the design is being done. Once design and simulation are finished. Small sections of the design should be simulated separately before hooking them up to larger sections. All appropriate personnel should review the decisions to be certain that the specification is correct. and that the correct technology and design entry method have been chosen.is very important to have a design review. ■ Top-down design ■ Use logic that fits well with the architecture of the device you have chosen ■ Macros ■ Synchronous design ■ Protect against metastability ■ Avoid floating nodes ■ Avoid bus contention 5. 5. This may involve specifying switches and optimization criteria in the HDL . This is one of the most important reviews because it is only with correct and complete simulation that you will know that your chip will work correctly in your system.5 Designing the chip It is very important to follow good design practices. another design review must take place so that the design can be checked.3. This involves using synthesis software to optimally translate your register transfer level (RTL) design into a gate level design that can be mapped to logic blocks in the FPGA. 5. This means taking into account the following design issues that we discuss in detail later in this chapter.3. There will be many iterations of design and simulation in order to get the correct functionality.3. the next step is to synthesize the chip.6 Simulating .7 Synthesis If the design was entered using an HDL.

System integration and system testing is necessary at this point to insure that all parts of the system work correctly together. 5. These problems need to be tested and documented so that they can be fixed on the next revision of the chip. the chip must be resimulated with the new timing numbers produced by the actual layout. or playing with parameters of the synthesis software in order to insure good timing and utilization. you simply program the device and immediately have your prototypes. it is necessary to have some sort of burn-in test of your system that continually tests your system over some long amount of time.8 Place and Route The next step is to lay out the chip.3. These problems can often be worked around by modifying the system or changing the system software.10 Testing For a programmable device.3. If the problems encountered here are significant. chances are very good that your system will perform correctly with only minor problems. Then the design is programmed into the chip. If there are simply some marginal timing paths or the design is slightly larger than the FPGA. Otherwise. sections of the FPGA may need to be redesigned. You then have the responsibility to place these prototypes in your system and determine that the entire system actually works correctly. 5.3. 5.final review After layout. it . there are three possible paths to go in the design flow. it may be necessary to perform another synthesis with better constraints or simply another place and route with better constraints. resulting in a real physical design for a real chip. If you have followed the procedure up to this point.code. a final review is necessary to confirm that nothing has been overlooked. If everything has gone well up to this point. At this point. the new simulation results will agree with the predicted results. If a chip has been designed correctly. When the chips are put into production. This involves using the vendor’s software tools to optimize the programming of the chip to implement the design.9 Resimulating .

The top-level block represents the entire chip. chips often incorporate a large number of gates and a very high level of . Intermediate level blocks may contain smaller functionality blocks combined with gate-level logic. 5.will only fail because of electrical or mechanical problems that will usually show up with this kind of stress testing. we will discuss those areas that are unique to FPGA design or that are particularly critical to these devices. Fortunately. First. Figure 5. A schematic can be viewed as a hierarchical tree as shown in Figure 14.4.1 Top-Down Design Top-down design is the design method whereby high level functions are defined first. The bottom level contains only gates and macro functions which are vendor-supplied high level functions. and the lower level implementation details are filled in later.6: Top-Down Design Top-down design is the preferred methodology for chip design for several reasons. schematic capture software and hardware description languages used for chip design easily allows use of the topdown design methodology. 5. Each lower level block represents major functions of the chip.4 Design Issues In the next sections of this chapter.

functionality. more optimal designs. No signal that is generated by combinatorial logic can be fed back to the same group of combinatorial logic without first going through a synchronizing flip-flop. he or she usually becomes an evangelical convert to synchronous design. when necessary. clocks must go . Many synthesis packages can target their results to a specific FPGA or CPLD family from a specific vendor. This is important for complex designs where an entire design can take weeks to simulate and days to debug. Asynchronous designs that work for years in one process may suddenly fail when the chip is manufactured using a newer process. 5.4. A top-down design approach allows each module to be simulated independently from the rest of the design. to design the chip.in other words.3 Synchronous Design One of the most important concepts in chip design. Once an chip designer uncovers a problem due to asynchronous design and attempts to fix it. Sections can be removed and replaced with a higher-performance or optimized designs without affecting other sections of the chip. Clocks cannot be gated . Second. This is because asynchronous design problems are due to marginal timing problems that may appear intermittently.4. Delay is always controlled by flip-flops. it allows flexibility in the design. not combinatorial logic. This methodology simplifies the design task and allows more than one engineer. and one of the hardest to enforce on novice chip designers. For this reason. Also important is the fact that simulation is much simplified using this design methodology. or may appear only when the vendor changes its semiconductor process. Simulation is discussed in more detail later in this chapter. taking advantage of the architecture to provide you with faster. 5. is that of synchronous design. simulation must be done extensively before the chip is sent for fabrication. Synchronous design simply means that all data is passed through combinatorial logic and flip-flops that are synchronized to a single clock. The vendor may be able to offer advice about this.2 Keep the Architecture in Mind Look at the particular architecture to determine which logic devices fit best into it. Simulation is an extremely important consideration in chip design since a chip cannot be blue-wired after production.

4 Race conditions Figure 5. since we don’t know the exact internal timing of the flip-flop or the routing delay of the signal to the clock versus the reset input. voltage.the clock or the reset. the output will go high. Figure 5. the designer wants the output to change to the high state of SIG1.7 shows an asynchronous race condition where a clock signal is used to reset a flip-flop. we cannot know which signal will arrive first . On the rising edge of SIG2.there is no race condition. If the clock rising edge appears first. the output will remain low. the flip-flop is reset to a low state. 5. and the flip-flop is reset on the rising edge of the clock. A more reliable synchronous solution is shown in Figure 16. Unfortunately.4. or process may cause a chip that works correctly to suddenly work incorrectly. This circuit performs the same function. If the reset signal appears first. A slight change in temperature.directly to the clock inputs of the flip-flops without going through any combinatorial logic. The following sections cover common asynchronous design problems and how to fix them using synchronous logic.they change only after the rising edge of CLK . Here a faster clock is used. When SIG2 is low. This is a race condition.7: Asynchronous: Race Condition . but as long as SIG1 and SIG2 are produced synchronously .

Metastability refers to a condition which arises when an asynchronous signal is clocked into a synchronous flip-flop.9. If the ASYNC_IN signal goes high around the same time as the clock. the gates that are connected to the output of the flip-flop may interpret this level differently. In normal operation. OUT1 . In the figure.4. While chip designers would prefer a completely synchronous world.8: Synchronous: No Race Condition 5. This meta level may remain until the transistor voltage leaks off or “decays”. and often misunderstood concepts. In these cases. the upper gate sees the level as logic 1 whereas the lower gate sees it as logic 0. we have an unavoidable race condition. The designer must be careful how to do this in order to avoid metastability problems as shown in Figure 5. or until the next clock cycle. or will be generated by a clock which is different from the one used by the chip. the asynchronous signal must be synchronized to the chip clock so that it can be used by the internal circuitry.Figure 5. the unfortunate fact is that signals coming into a chip will depend on a user pushing a button or an interrupt from a processor. The output of the flip-flop can actually go to an undefined voltage level that is somewhere between a logic 0 and logic 1.5 Metastability One of the great buzzwords. During the clock cycle. This is because an internal transistor did not have enough time to fully charge to the correct level. of synchronous design is metastability.

the second flip-flop. there is no certain solution to this problem. The upper and lower gates will both sample the same logic level.9: Metastability . At higher frequencies. In this case. In this case. Also. the next flip-flop will sample an indeterminate value. Figure 5.and OUT2 should always be the same value. and be interpreted only as logic 0 or 1. There is a very small but non-zero probability that the output of the synchronizer flip-flop will not decay to a valid logic level within one clock period. . the synchronized input will be sampled by only one device. This metastability can permanently lock up the chip. this possibility is greater. and there is again a possibility that the output of that flip-flop will be indeterminate. they are not and this could send the logic into an unexpected state from which it may never return.The Problem The “solution” to this metastability problem by placing a synchronizer flip-flop in front of the logic. and the metastability problem is avoided. Or is it? The word solution is in quotation marks for a very good reason. Unfortunately. Some vendors provide special synchronizer flip-flops whose output transistors decay very quickly.

temperature. this type of compute intensive simulation takes longer and longer to run. . Also. asynchronous designs must use this type of analysis because static timing analysis only works for synchronous designs.6 Timing Simulation This method of timing analysis is growing less and less popular.4. is operating faster than the typical chip. . 5. The correct action involves discussing metastability problems with the vendor. though. and including enough synchronizing flip-flops to reduce the probability so that it is unlikely to occur within the lifetime of the product. It involves including timing information in a functional simulation so that the real behavior of the chip is simulated. due to voltage. However. The term “best case” can be misleading. If you do need to perform timing simulation. it is important to do both worst case simulation and best case simulation. Also. This means that certain long delay paths never get evaluated and a chip with timing problems can pass timing simulation.inserting more synchronizer flip-flops reduces the probability of metastability but it will never reduce it to zero. This is another reason for designing synchronous chips only. simulations can miss particular transitions that result in worst case results. hold time problems become apparent only during the best case conditions. As chips become larger. and process variations. It refers to a chip that. The advantage of this kind of simulation is that timing and functional problems can be examined and corrected.

1.1 consists the ModelSim simulation result for binary input angle z0 and binary outputs xn1(sin(z0)).1: Sine-Cosine value generated for input angle z0 (binary value) .2 consists the ModelSim simulation result for real input angle z0 and real outputs xn1(sin(z0)). Figure 6.1 ModelSim Simulation Results Figure 6. 6. yn(cos(z0)) in the form of waveform and their corresponding magnitude respectively.CHAPTER 6 RESULTS AND DISCUSSIONS 6.1 For binary input and binary output Figure 6.1 and table 6. yn(cos(z0)) in the form of waveform and their corresponding magnitude respectively.2 and table 6.

2 For sine-cosine real input and real output Figure 6.2: Sine-Cosine value generated for input angle z0 (integer value) .1.Table 6.1: Sine-Cosine value for input angle z0 Reset 1 0 Clk_enable 1 1 Input(z0) 00000000000001111000000000 00000000000001111000000000 Sine value Cosine value 00000000000000000000000000 00000000000000000001111110 00000000000000000000000000 00000000000000000011011110 6.

Table 6.49 0 0 0 1 1 1 30 45 60 0.3: Real input/output waveforms of DFT using FFT algorithm .699 0.2: Sine Cosine value for input angle z0 S.867 6. No.49 0.1. 1 2 3 4 Reset 1   Clk_enable Input angle(z0) Sine value 0 Cosine value 0 0.867 0.711 0.3 For DFT using FFT Algorithm Figure 6.

64 4. 9.1 -4 -3.61 .Table 6.64 -4.04 -4 -4. 6. 2. 3. 7. 5. 4. 1. 8.61 0 -1.0 1.98 0 9. Reset 0 Clk_enable Input Output real 0 Output imaginary 0   1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 4 5 6 7 8 36 -4.N.95 -4 -3.3: Real input/output values of DFT using FFT algorithm S.01 -9.

4. reset and outputs are xn1 (magnitude of cosine of input angle). Figure 6. clk_enable (clock enable). yn (magnitude of sine of input angle).4: Top level RTL schematic for Sine Cosine .2 XILINX simulation results Block diagram generated by XILINX 9. Figure 6. ce_out and dvld are chip enable signal for the next stage blocks. Here inputs are z0 (input angle).2i for sine-cosine using CORDIC is shown in figure 6. clk (clock).5 shows the RTL schematic of sine-cosine generator and its internal block diagram.6.

5: RTL schematic for Sine-Cosine .Figure 6.

8 have synthesis report generated by XILINX showing number of multiplexers.7 and table 6.50V Clocks Inputs Logic Vcco25 Signals Quiescent Vccint 1.(b) shows the power summary and design summary respectively produced by XILINX tool.50V Vcco25 2.50V Quiescent Vcco25 2. timing and thermal summary of the chip generated respectively. Table 6. Table 6. number of adder.6.20V Vccaux 2.20V Quiescent Vccaux 2.4: Power summary I(mA) Total estimated power consumption Vccint 1.4 and table 6.Table 6. table 6.50V 26 18 2 0 0 0 0 0 26 18 2 P(mW) 81 31 45 5 0 0 0 0 0 31 45 5 .5 (a). number of flip-flops used.

5: (a) Design summary of Sine-Cosine Table 6.5: (b) Design summary of Sine-Cosine .Table 6.

Table 6.6: Advanced HDL Synthesis Report for sine cosine Macro Statistics: 32x12-bit multiplier 32x13-bit multiplier 32x14-bit multiplier 32x15-bit multiplier Total Multipliers 2-bit adder 3-bit adder 32-bit adder 4-bit adder 5-bit adder 6-bit adder 7-bit adder 8-bit adder Total Adders/Subtractors Flip-Flops Total Registers(Flip-Flops) : 24 : 15 :8 :6 : 73 :1 :1 : 72 :1 :1 :1 :1 :1 : 79 : 616 : 616 .

191ns (Maximum Frequency: 10.7: Timing summary Minimum period Minimum input arrival time before clock Maximum output required time after clock Maximum combinational path delay : 93.731MHz) : 5.8: Thermal summary Estimated junction temperature: Ambient temp: Case temp: Theta J-A range: 270C 250C 260C 26 – 260C/W .283ns : No path found Table 6.Table 6.176ns : 4.

7: RTL schematic of DFT .Figure 6.6: Top level RTL schematic of DFT Figure 6.

9: (a) Design summary for DFT Table 6.9: (b) Design summary for DFT .Table 6.

Table 6.10: Advanced HDL Synthesis Report for DFT Macro Statistics 10x24-bit multiplier 24x10-bit multiplier 24x24-bit multiplier 32x32-bit multiplier # Multipliers 10-bit adder 2-bit adder 24-bit adder 24-bit subtractor 3-bit adder 32-bit adder 32-bit subtractor 4-bit adder 5-bit adder 6-bit adder 7-bit adder 8-bit adder 9-bit adder # Adders/Subtractors Flip-Flops # Registers 24-bit comparator greatequal 24-bit comparator lessequal 8-bit comparator greatequal # Comparators 32-bit 4-to-1 multiplexer # Multiplexers : 16 : 16 : 32 : 32 : 96 : 32 : 64 : 80 : 64 : 64 : 46 : 28 : 64 : 64 : 64 : 64 : 80 : 64 : 778 : 2576 : 2576 : 32 : 48 : 16 : 96 : 32 : 32 .

654 gates for 8  1 DFT generator. . comparators. 2576. 96. Number of adders/subtractors and registers are 79 and 616 respectively. 778.191 ns (maximum frequency 10. registers.800 for sine cosine generator and 242. Minimum period for sine cosine generation is 93.3 Discussions Multipliers used for implementation of CORDIC algorithm for sine cosine generation are 73 in numbers. 32 respectively. In case of DFT implementation number of multiplier. multiplexers are 96.73 MHz) power consumed by the sine cosine generator and DFT generator is 81 mW each with the junction temperature of 270C. Total number of gates used 7.6.547 for sine cosine generator and DFT calculator respectively. adders/subtractors. Total number of 4 input LUTs (Look Up Tables) used 708 and 20.

Implementation of dif algorithm DFT computation and simulation for more number of points Implementation and simulation for DHT. This thesis shows that CORDIC is available for use in FPGA based computing machines. This affects the cost. the implementation of DFT using CORDIC algorithm on FPGA is the need of the day as the FPGAs can give enhanced speed at low cost with a lot of flexibility. But a large amount of data processing is required because of complex computations. Future Scope of work The future scope should include the following    . So. The results are verified by test bench generated by the FPGA.CHAPTER 7 CONCLUSION The CORDIC algorithm is a powerful and widely used tool for digital signal processing applications and can be implemented using PDPs (Programmable Digital Processors). speed and flexibility of the DSP systems. This is due to the fact that the hardware implementation of a lot of multipliers can be done on FPGA which are limited in case of PDPs. It can be concluded that the designed RTL model for sine cosine and DFT function is accurate and can work for real time applications. Then the implementation of sine cosine CORDIC based generators is done on XILINX Spartan 3E FPGA which is further used to implement eight point Discrete Fourier Transform using radix-2 decimation-in-time algorithm on FPGA. In this thesis the sine cosine CORDIC based generator is simulated using ModelSim which is then used for simulation of Discrete Fourier Transform. which are the likely basis for the next generation DSP systems. DCT and DST calculations .

and Bruguera J. Kluwer Academic Publishers. [2] [3] [4] Lindlbauer N. “Very-High Radix CORDIC Rotation Based on Selection by Rounding”. [11] [12] Roads C. J. 141–153.334.. PhD thesis. Electronic Computing. 2000.edu/~norbert/CORDIC/node3. 1959. Kluwer Academic Publishers. Kwak J.es/~jcea/artic/CORDIC. pp 379 . 2000. Netherlands. F.. “Redundant Constant-Factor Implementation of Multi-Dimensional CORDIC and Its Application to Complex SVD”. Kluwer Academic Publishers. 2006.REFERENCES [1] Volder J. November 2005 [6] Walther J. H. http://www. “ A Floating Point Vectoring Algorithm Based on Fast Rotations”. “Mixed-Scaling-Rotation CORDIC (MSR-CORDIC) Algorithm and Architecture for High-PerformanceVector Rotational DSP Applications”. E.. “The Computer Music Tutorial”. [13] Goodwin M. . Journal of VLSI Signal Processing. pp 125–139. “Frequency-domain analysis-synthesis of musical sounds”. Volume 25. pp 2385. Netherlands. pp 155–166.. Netherlands. “The Evolution of electronic musical instruments”. LangT..cnmat. CNMAT and Department of Electrical Engineering and Computer Science. 1972.berkeley.. pp 330 . H. [9] Delosme M.Lau C. V. Deprettere E.F. UCB.. 2000. 1994. Netherlands. Cambridge.25. IMACS Multiconference on “Computational Engineering in Systems Applications (CESA)”. Nashville: George Peabody College for Teachers. [5] Lin C. “Application of CORDIC Algorithm to Neural Networks VLSI Design”. www. Kluwer Academic Publishers.. H.S.. Alantic city..htmlhtt. Journal of VLSI Signal Processing. and Swartzlander. Rhea T.. [8] Antelo E..2398. Y. October 4-6. [10] Choi J.C.htm Qian M. 2000. IRE Trans. Spring Joint Computer Conference. and Lee J. Volume 25. Vol. MIT Press. Beijing. Y.385. 1995. and Hsiao S. “Unified algorithm for elementary functions”. Journal of VLSI Signal Processing. Avion J. Volume25. J. and Wu A. D. Volume EC-8. 1971. Master's thesis. Journal of VLSI Signal Processing. [7] Kolk K.A. Volume 52.A... A. “The CORDIC trigonometric computing technique”.. China.argo.

ic. pp 191-200.“OFDM synchronizer Implementation for an 1EEE 802..doc.1 la Compliant Modem".Algorithms and Implementation”. Ahmed. Morf. [22] Andraka R. com/tech /Introduction. [19] [20] [21] http://www. “CORDIC trigonometric function generator for DSP”. California. Birkhauser Boston.uk/publications/files/osk00jvlsisp.. pp.. “Building a high performance bit serial processor in an FPGA”. Feb. Proc.ps.htm www.. Frankfurt (Oder). Speech and Signal Processing.dspguru. 2008. Canada. www.2384. Considine V. M. 2000. H. 1997. Volume 15. 65-82.22-24. Banff. IEEE89th. North Kingstown.pdf Troya A... and Grass E. Jan. Krstic M. G.com/partinfo/#4000. [23] [24] [25] Proakis J.. Maharatna K. Mag.. Proceedings of the 1998 ACM/SIGDA sixth international symposium on FPGAs.. Kraemer R.ac. G.[14] Muller J. Crass E. M. Prentice Hall. International Conference on Acoustics. Monterey. Germany.Manolakis D. Muharutnu K. 2003. no. 1998.. On-Chip System Design Conference.. http://comparch.A. algorithms and applications”. and M. Delosme. “Elementary Functions . New York. “Optimized low-power synchronizer design for the IEEE 802. 1989.11a standard”.51protel. “Computer Arithmetic – Algorithms and hardware designs.. IASTED International Conference on Wireless and Optical Communications. Troyu A. Scotland.. Krsti M. “Survey of CORDIC algorithms for FPGA based computers”. 1.July 2002.com/info/faqs/CORDIC.xilinx. 1982. Glasgow. [15] [16] [17] [18] Andraka R. Delhi.. Parhami B.” Oxford University Press. “Digital signal processing principles... 1996. “Highly concurrent comComput.pdf [26] . pp 2381 . New York. M. J.

Sign up to vote on this title
UsefulNot useful

Master Your Semester with Scribd & The New York Times

Special offer for students: Only $4.99/month.

Master Your Semester with a Special Offer from Scribd & The New York Times

Cancel anytime.