FPGA Implementation of Sine and Cosine Generators Using the CORDIC Algorithm
Tanya Vladimirova and Hans Tiggeler
Surrey Space Centre
University of Surrey, Guildford, Surrey, GU2 5XH
Tel: +44(0) 1483 879278
Fax: +44(0) 1483 876021
Abstract: This paper is concerned with FPGA implementation of CORDIC schemes
for fast and silicon-area-efficient computation of the sine and cosine functions. The
results of a theoretical investigation into redundant CORDIC are presented. A summary
of CORDIC synthesis results based on Actel and XILINX FPGAs is given. Finally,
applications of CORDIC sine and cosine generators in small satellites are discussed.
Keywords: CORDIC, sine, cosine, FPGA, synthesis, redundant signed-digit system.
1. Introduction
The name CORDIC stands for Coordinate Rotation Digital Computer. Volder
[Vold59] developed the underlying method of computing the rotation of a vector in a
Cartesian coordinate system and evaluating the length and angle of a vector. The
CORDIC method was later expanded for multiplication, division, logarithm,
exponential and hyperbolic functions. The various function computations were
summarised into a unified technique in [Walt71].
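The resulting vector $[x_n, y_n]^T$ of the rotation of a vector $[x_0, y_0]^T$ by an angle $\theta$ in
Cartesian coordinates can be computed by the following matrix operation [Pirs98]:
$$
\begin{bmatrix} x_n \\ y_n \end{bmatrix} =
\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \qquad (1)
$$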
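Using the identity $\cos\theta = 1/\sqrt{1 + \tan^2\theta}$ and factoring out $\cos\theta$, equation (1) can be
modified as follows:
$$
\begin{bmatrix} x_n \\ y_n \end{bmatrix} =
\frac{1}{\sqrt{1 + \tan^2\theta}}
\begin{bmatrix} 1 & -\tan\theta \\ \tan\theta & 1 \end{bmatrix}
\begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \qquad (2)
$$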
In the CORDIC method, the rotation by an angle $\theta$ is implemented as an iterative
process consisting of micro-rotations, during which the initial vector is rotated by
pre-determined step angles $\alpha_i$. Any angle can be represented to a certain accuracy by a
set of $n$ step angles $\alpha_i$. Specifying a direction of rotation or sign $\sigma_i$, the sum of the
step angles $\alpha_i$ approximates a given angle $\theta$ as follows:
$$
\theta \approx \sum_{i=0}^{n-1} \sigma_i \alpha_i, \qquad \sigma_i \in \{-1, 1\} \qquad (3)
$$
The sign of the difference between the angle $\theta$ and the partial sum of step angles
$\sum_{j=0}^{i-1} \sigma_j \alpha_j$ controls the sign $\sigma_i$ of the step angle $\alpha_i$. An auxiliary variable $z_i$ is
introduced that holds the difference between the angle and the accumulated partial sum
of step angles and is used to determine the sign of the next micro-rotation. To simplify
the computation of the matrix product given by (2), the step angles $\alpha_i$ are chosen such
that $\tan\alpha_i$ represents a series of powers of 2:
$$
\tan\alpha_i = 2^{-i}, \qquad i = 0, 1, 2, \ldots, n-1 \qquad (4)
$$
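The decomposition described by (3) and (4) can be illustrated with a short floating-point sketch. It is not taken from the paper: the function name `decompose_angle` and the test angle are illustrative assumptions, and the step angles are simply computed as arctan(2^-i).

```python
import math

def decompose_angle(theta, n):
    """Greedily express theta as a signed sum of the step angles arctan(2**-i).

    Returns the list of signs (+1 or -1) and the residual angle left over
    after n micro-rotations. Hypothetical helper for illustration only.
    """
    signs = []
    residual = theta  # angle that still has to be rotated
    for i in range(n):
        sign = 1 if residual >= 0 else -1  # rotate towards zero residual
        signs.append(sign)
        residual -= sign * math.atan(2.0 ** -i)
    return signs, residual

signs, residual = decompose_angle(math.pi / 6, 16)
# Signed sum of step angles, i.e. the right-hand side of (3)
approx = sum(s * math.atan(2.0 ** -i) for i, s in enumerate(signs))
```

After n steps the residual is bounded by the smallest step angle used, which is why each extra iteration adds roughly one bit of angular accuracy.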
The CORDIC method can be employed in two different modes, known as the
rotation mode and the vectoring mode. In the rotation mode, the co-ordinate
components of a vector and an angle of rotation are given and the co-ordinate
components of the original vector, after rotation through a given angle, are computed.
In the vectoring mode, the co-ordinate components of a vector are given and the
magnitude and angular argument of the original vector are computed [Vold59].
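The rotation mode of the CORDIC algorithm has three inputs that are initialised to the
co-ordinate components of the vector $x_0, y_0$ and the angle of rotation $z_0$ and is
described by the following iteration equations:
$$
\begin{aligned}
x_{i+1} &= x_i - \sigma_i 2^{-i} y_i \\
y_{i+1} &= y_i + \sigma_i 2^{-i} x_i \qquad (5) \\
z_{i+1} &= z_i - \sigma_i \arctan 2^{-i}
\end{aligned}
$$
where
$$
\sigma_i = \begin{cases} -1 & \text{if } z_i < 0 \\ +1 & \text{if } z_i \ge 0 \end{cases}
\qquad \text{and} \qquad i = 0, 1, 2, \ldots, n-1 \qquad (6)
$$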
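As a rough illustration of iteration equations (5) and (6), the following floating-point sketch performs the micro-rotations in software. The name `cordic_rotate` is an assumption made for this example; a hardware design would use shifts and fixed-point adders instead of floating-point multiplications.

```python
import math

def cordic_rotate(x0, y0, z0, n):
    """Rotation-mode CORDIC: apply n micro-rotations per equations (5) and (6)."""
    x, y, z = x0, y0, z0
    for i in range(n):
        sigma = 1 if z >= 0 else -1            # equation (6)
        # Tuple assignment evaluates the right-hand side with the old x and y.
        x, y = x - sigma * y * 2.0 ** -i, y + sigma * x * 2.0 ** -i
        z -= sigma * math.atan(2.0 ** -i)      # remaining angle
    return x, y, z

x, y, z = cordic_rotate(1.0, 0.0, 0.5, 24)
# The direction of (x, y) approaches 0.5 rad, while its length is greater
# than 1 because every micro-rotation also stretches the vector.
```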
The outputs of the rotation mode $x_n$, $y_n$ and $z_n$ are given by the following
expressions, $x_n$ and $y_n$ being the co-ordinates of the vector rotated by the angle $z_0$:
$$
\begin{aligned}
x_n &= K_n (x_0 \cos z_0 - y_0 \sin z_0) \\
y_n &= K_n (y_0 \cos z_0 + x_0 \sin z_0) \\
z_n &= 0
\end{aligned}
$$
where
$$
K_n = \prod_{i=0}^{n-1} \sqrt{1 + 2^{-2i}} \qquad (7)
$$
A CORDIC micro-rotation is not a pure rotation but a rotation-extension. The
constant $K_n$, given by (7), is referred to as a scale factor and represents the increase
in magnitude of the vector during the rotation process. When the number of
iterations/micro-rotations is fixed, the scale factor is a constant, approaching the value
of 1.647 as the number of iterations goes to infinity.
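The limiting value quoted above can be checked numerically. The sketch below evaluates the product in (7) directly; `scale_factor` is a hypothetical helper name. Each micro-rotation multiplies the squared vector length by 1 + 2^(-2i), which is where the product form comes from.

```python
import math

def scale_factor(n):
    """Scale factor K_n of equation (7) for n micro-rotations."""
    k = 1.0
    for i in range(n):
        k *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return k

# K_n grows quickly at first and then settles near 1.6468
values = {n: scale_factor(n) for n in (1, 4, 8, 16, 32)}
```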
The elementary functions sine and cosine can be computed using the rotation mode of
the CORDIC algorithm if the initial vector is of unit length and is aligned with the
abscissa. The computation of $\sin\theta$ and $\cos\theta$ is based on equations (5) and (6) with
input values $x_0 = 1$, $y_0 = 0$ and $z_0 = \theta$. The outputs after $n$ iterations are as follows:
$$
\begin{aligned}
x_n &= K_n (x_0 \cos\theta - y_0 \sin\theta) = K_n \cos\theta \qquad (8) \\
y_n &= K_n (y_0 \cos\theta + x_0 \sin\theta) = K_n \sin\theta \\
z_n &= 0
\end{aligned}
$$
An additional operation of division is required to obtain the values of $\sin\theta$ and $\cos\theta$
from (8) as a result of the increase in magnitude of the vector by the factor $K_n$ during
rotation. However, since the scale factor is a constant for a given number of iterations
$n$, the operation of division can be eliminated by setting the magnitude of the initial
vector to the reciprocal value of the scale factor, i.e. $x_0 = 1/K_n$.
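A minimal integer sketch of this idea follows, assuming a 16-bit fractional fixed-point format and a precomputed table of step angles (both choices, and the name `cordic_sincos_fixed`, are assumptions for illustration rather than the authors' design). The starting value x0 = 1/K_n is folded into a constant, so only shifts and additions remain inside the loop.

```python
import math

def cordic_sincos_fixed(theta, n=16, frac_bits=16):
    """Return (sin, cos) of theta using integer shift-and-add CORDIC iterations."""
    scale = 1 << frac_bits
    # Step-angle constants that a hardware design would hold in a small LUT.
    atan_table = [round(math.atan(2.0 ** -i) * scale) for i in range(n)]
    k_n = math.prod(math.sqrt(1.0 + 2.0 ** (-2 * i)) for i in range(n))
    x = round(scale / k_n)    # x0 = 1/K_n removes the final division
    y = 0
    z = round(theta * scale)
    for i in range(n):
        # Python's >> on negative integers is an arithmetic (sign-extending) shift.
        if z >= 0:
            x, y, z = x - (y >> i), y + (x >> i), z - atan_table[i]
        else:
            x, y, z = x + (y >> i), y - (x >> i), z + atan_table[i]
    return y / scale, x / scale

s, c = cordic_sincos_fixed(0.3)
```

The accuracy of this sketch is limited by the truncation in the shifts as well as by the number of iterations, so a real design would typically carry a few guard bits.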
In this paper we consider computation of sine and cosine of an angle $\theta$ (rad), where
$\theta$ is an n-bit binary fraction and satisfies $0 \le \theta \le \pi/2$. We compute $\sin\theta$ and $\cos\theta$
down to the $n$-th binary position.
In section 2 different approaches to CORDIC implementation are summarised.
Section 3 is dedicated to fast CORDIC methods. Section 4 discusses a redundant
adder for fast CORDIC implementation and its realisation in XILINX XC4000.
Section 5 presents CORDIC synthesis results targeting Actel and XILINX FPGAs and
using different synthesis tools. Section 6 discusses two satellite applications of
CORDIC sine and cosine generators. Finally section 7 contains concluding remarks.
2. Approaches to CORDIC Hardware Implementation
The CORDIC algorithm can be implemented in hardware using three approaches: a
sequential approach - the structure is unfolded in time, a parallel approach - the
structure is unfolded in space or a combination of the two. These three approaches
and the resulting structures are also referred to in the literature as iterative, cascaded
and cascaded fusion, respectively. A sequential CORDIC design performs one
iteration per clock cycle and consists of three n-bit adders/subtractors, two sign
extending shifters, a look-up table (LUT) for the step angle constants and a finite state
machine. A parallel CORDIC design is similar to an array multiplier structure
consisting of rows of adders/subtractors, with hardwired shifts and constants. Parallel
CORDIC can be implemented in the form of purely combinational arrays or can be
pipelined, depending on the size of the design and the required data rate. A combined
CORDIC design is based on a sequential structure where the logic for several
successive iterations is cascaded and executed within one clock cycle [Wang96].
The number of fused successive iteration stages determines the order of a combined
CORDIC design. Figure 1 summarises the structures used in hardware
implementation of the CORDIC algorithm.
Since algebraic addition is the main operation in the CORDIC algorithm, the
efficiency of the hardware implementation of the algorithm depends significantly on
the type of adder used. Adders based on the conventional two-digit binary system
have a time delay dependent on the bit length $n$; in the best case of fast hierarchical
adder structures the time delay for execution of one iteration is of logarithmic order
$O(\log_2 n)$ [Pirs98]. The time delay of the operation of addition can be made
independent of the bit length by using redundant adders that accept operands
represented in the redundant signed-digit (RSD) binary system. Numbers in the RSD system
are represented using a three-digit set $\{-1, 0, 1\}$ and may have several RSD
representations, hence the name redundant.
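The redundancy can be made concrete with a small enumeration. The sketch below lists every three-digit signed-digit representation of the value 3; the helper names and the most-significant-digit-first ordering are assumptions chosen for this example.

```python
from itertools import product

def rsd_value(digits):
    """Value of a signed-digit number; digits[0] is the most significant digit."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value

def rsd_representations(value, width):
    """All width-digit strings over {-1, 0, 1} that evaluate to value."""
    return [d for d in product((-1, 0, 1), repeat=width) if rsd_value(d) == value]

reps = rsd_representations(3, 3)
# Three different digit strings encode the same number:
# (0, 1, 1), (1, -1, 1) and (1, 0, -1)
```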
Figure 1. CORDIC hardware implementations
Bit-serial and binary adders have been used in sequential CORDIC implementations
[Andr98]. All types of adders have been tried in cascaded CORDIC designs: bit-serial
adders, carry-save adders, binary adders, redundant adders, and combinations of
binary and redundant adders [Andr98, Timm92]. A combination of the
sequential approach and bit-serial adders results in the slowest design with
minimal area, whereas the parallel approach with redundant adders gives the fastest
design with maximal area. A trade-off between area and speed determines the right
implementation approach for a given application.
3. Redundant CORDIC Schemes
The introduction of the RSD system into the internal computation of the CORDIC
method is considered to be one of the most effective ways to accelerate the algorithm
[Erce87, Taka91, Timm92, Bake76]. Cascaded designs of redundant CORDIC
schemes have outperformed array implementations of CORDIC based on carry-save
adders according to a comparative study of these methods in [Timm92]. However, the
straightforward application of the RSD representation to the CORDIC algorithm gives
rise to problems that compromise the efficiency of the algorithm, as follows:
- Converters from 2's complement representation to RSD and vice versa are
required. The conversion from 2's complement to RSD is straightforward;
however, the conversion from RSD to 2's complement requires an extra n-bit
addition operation.
- The value of the direction operator $\sigma_i$ is selected from the digit set $\{-1, 0, 1\}$, since
it depends on the sign of $z_i$, which is represented as a redundant number. The sign
evaluation of a redundant number requires detection of the sign of the most significant
non-zero digit and, in the worst case, needs inspection of all digits, which is a
very slow procedure.
- In redundant CORDIC no rotation-extension takes place for some step angles,
since zero is a valid choice for the direction operator $\sigma_i$. This makes the scale
factor $K_n$ operand dependent and no longer a constant. Two approaches
have been proposed to eliminate the variable scale factor effect: the scale factor is
calculated during computation and the function values are corrected with it at the
end of the rotation process [Erce87], or the scale factor is compensated during the
iteration process via the introduction of special iterations [Taka91, Timm92]. An
alternative approach to evaluation of the rotation operators $\sigma_i$ is to predict their
values by decomposing the angle of rotation in advance [Bake76].
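The sign-detection problem mentioned in the list above can be sketched in a few lines. The function names and the most-significant-digit-first ordering are assumptions for illustration; the point is that the scan may have to visit every digit before a non-zero one is found, whereas a 2's complement number reveals its sign in a single bit.

```python
def rsd_sign(digits):
    """Sign (-1, 0 or +1) of a signed-digit number; digits[0] is most significant.

    The first non-zero digit decides the sign, because all lower digits
    together cannot outweigh it.
    """
    for d in digits:
        if d != 0:
            return d
    return 0

def rsd_value(digits):
    """Reference value of the digit string, used only to check rsd_sign."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value

example = (0, 0, 1, -1, -1)   # 4 - 2 - 1 = 1, so the sign is positive
sign = rsd_sign(example)
```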
A comparison of the latency of conventional CORDIC and different modifications of
redundant CORDIC has been carried out with all designs being of array type
[Marx99]. The latency of the designs, expressed as a function of the bit length, is given
in Table 1 [Marx99]. The expressions are written in terms of the delay of a full adder;
$\log_2 n$, the upper bound of an n-bit non-redundant fast addition; the delay of a
redundant adder, which is independent of the bit length; and $m$, an arbitrary integer in
the correcting method [Taka91], where a correction iteration is performed every m-th
step. The termination algorithm originally proposed in [Chen72] allows the iteration
process to be quit as early as possible; modified Booth encoding can be used for the
same purpose [Timm92].
Table 1. Latency expressions of CORDIC implementations
Name Latency expression as a function of the bit length n
Non-redundant
method
n n
2
log
Double rotation
method [Taka91]
n n n
2
log 2 + +
Correcting method
[Taka91]
( )
]
( )( ) ( )
]
( ) ) ( log 1 2 1
2
+ + + + + + n m n m n n
Prediction method
[Timm92]
]
n n n n
2 2 3
log log 1 log + +
Prediction with
termination method
[Timm92]
]
) 2 log( log log 1 2 ) 1 ( ( log 2 ) 1 (
2 2 3
n n n n n + + + + +
Figure 2 [Marx99] shows graphically the latency of the CORDIC implementations
using estimated delays for XC4000XL and a ratio $r = 2$. It suggests that a
prediction technique combined with a termination method [Timm92] might lead to the
fastest FPGA implementation.
Figure 2. Estimated latency of CORDIC implementations in XC4000XL
4. Redundant Adder Implementation
In RSD representation, a number $Y$ can be viewed as the difference between two
positive binary numbers $Y^{*}$ and $Y^{**}$ as follows:
$$
Y = \sum_{i=0}^{n} y_i 2^i = \sum_{i=0}^{n} (y_i^{*} - y_i^{**}) 2^i
\qquad \text{with } y_i^{*}, y_i^{**} \in \{0, 1\} \qquad (9)
$$
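Equation (9) also describes how a signed-digit number can be converted back to an ordinary integer: collect the positive digits into one binary word, the negative digits into another, and subtract. The sketch below assumes most-significant-digit-first tuples and a hypothetical function name; the single subtraction at the end corresponds to the extra n-bit addition needed for RSD to 2's complement conversion.

```python
def rsd_to_int(digits):
    """Convert signed digits (most significant first) to an integer via (9)."""
    y_pos = 0  # Y*: bits set where the digit is +1
    y_neg = 0  # Y**: bits set where the digit is -1
    for d in digits:
        y_pos = (y_pos << 1) | (1 if d == 1 else 0)
        y_neg = (y_neg << 1) | (1 if d == -1 else 0)
    return y_pos - y_neg  # one full-width subtraction

value = rsd_to_int((1, 0, -1))   # Y* = 0b100, Y** = 0b001
```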
The conventional one-bit full adder assigns positive weights to all of its three binary
inputs and two binary outputs. Such adders can be generalised to four types of adder
cells by assigning positive and negative weights to the binary input/output terminals
[Hwan79]. The addition of two redundant signed-digit numbers $Y$ and $Z$ can be
performed by cascading two levels of generalised full adders of types 1 and 2, as
shown in Figure 3. The main drawback of this computation scheme with two numbers
in redundant form is the amount of hardware, which is twice that in the carry-save
case [Vand90].
Figure 3. Redundant signed digit adder [Vand90]
The ripple-carry adder and the redundant signed-digit adder have been implemented in
XILINX 4010XL and compared in terms of speed and area [Marx99]. The
ripple-carry adder uses the XILINX dedicated carry logic and takes 0.5 configurable logic
blocks (CLBs) per bit. The smallest redundant adder that has been achieved in
XILINX 4010XL requires two CLBs per bit. Figure 4 [Marx99] illustrates the
mapping for the minimal-area redundant adder, where S1n_generator comprises the
logic that generates the $S^{*}_{1(i+1)}$ output and S2a_generator comprises the logic that
generates the $S^{**}_{2i}$ output.
Figure 4. A minimum-area mapping of a redundant adder onto XC4010XL
The latency results are shown in Figure 5 [Marx99], where
the ripple-carry adder is referred to as RCA and the redundant adder is referred to as
ISDA. As can be seen from Figure 5, the delay of the ripple-carry adder is nearly
equivalent to the delay of the redundant adder for bit lengths below 16 bits; however,
for bit lengths above 32 bits the redundant adder gives a significant gain in performance.
Figure 5. Latency comparison between a ripple-carry adder and a redundant adder in
XILINX 4010XL.
5. Experimental Results
We have implemented iterative and cascaded sine and cosine CORDIC-based
generators in Actel and XILINX FPGAs using fast binary adders. The number of
iterations in all designs was equal to the bit length. The bit lengths used were 12, 14,
16, 24 and 32 bits for the iterative designs and 12, 14 and 16 bits for the cascaded
designs. All of the cascaded designs were non-pipelined. Redundant CORDIC designs
have not been attempted in view of the findings in section 4 above: a fourfold area
increase and no significant performance gain for bit lengths below 32 bits.
Synthesis results in terms of module count and speed are summarised in Tables 3 and 4,
where results for both area-optimised and delay-optimised designs are presented. Four
different synthesis tools have been used: Actmap 3.5.04, Synplify 5.1.4, Spectrum 5.69 and
XILINX Foundation Series Express 1.5i. The speed estimates in the two rightmost
columns of the tables are based on back-annotated delays and indicate the value of the
maximal data rate achieved and the maximal clock frequency.
The experimental results show that module count and operating speed depend
significantly on the synthesis tool used. The Actel-based designs are faster than the
XILINX-based ones; however, the Actel FPGAs are not dense enough to
accommodate cascaded designs with bit lengths higher than 16 bits.
A 32-bit 1.9 Msps iterative sine/cosine generator can be implemented in a small
FPGA (Actel SX16-3). The most area-consuming component of the iterative designs
is the sign-extending barrel shifter, which shifts over a programmable shift width. Further
optimisation should focus on a more area-economical barrel shifter design.
A 16-bit cascaded design cannot be fitted in an XC4010XL device. This is not
surprising: the parallel implementation approach is a trade-off of area for speed, where
the area increase is of quadratic order with respect to the bit length, $O(n^2)$. A 12-bit
non-pipelined cascaded CORDIC runs at 22.3 Msps (Actel SX16-3). This
performance is comparable with the performance of a 12-bit look-up table according
to our LUT synthesis results presented in Table 5.
Table 3. Summary of CORDIC synthesis results based on ACTEL FPGAs (A54SX16-3).
| Design | Length (bits) | Actmap 3.5.04 (note 1), Area/Delay (note 4) | Synplify 5.1.4 (note 1), Area/Delay (note 4) | Spectrum 5.69 (note 1), Area/Delay (note 4) | Speed (note 2), ns | Data rate, Msps | Frequency, MHz |
|---|---|---|---|---|---|---|---|
| Iterative | 12 | 420/574 | 307/334 | 347/424 | 169.5 | 5.9 | 71.4 (note 5) |
| Iterative | 14 | 538/784 | 399/414 | 428/536 | 192.3 | 5.2 | 72.5 (note 5) |
| Iterative | 16 | 674/958 | 424/462 | 501/633 | 232.5 | 4.3 | 68.5 (note 5) |
| Iterative | 24 | 1170/---- | 694/727 | 995/1248 | 357.2 | 2.8 | 66.6 (note 6) |
| Iterative | 32 | 1963/---- | 887/1000 | 1419/1710 | 526.3 | 1.9 | 62.5 (note 6) |
| Cascaded | 12 | ----/---- | 862/888 | 1326/1378 | 44.8 | 22.3 (note 6) | |
| Cascaded | 14 | ----/---- | 1970/---- | 2164/2164 | 192.3 | 5.2 (note 3) | |
| Cascaded | 16 | ----/---- | 2853/---- | 2941/3718 | 222.2 | 4.5 (note 3) | |
Table 4. Summary of CORDIC synthesis results based on XILINX FPGAs
| Design | Length (bits) | Foundation Express 1.5i (note 7), Area/Delay | Target device | Speed (note 2), ns | Data rate, Msps | Frequency, MHz |
|---|---|---|---|---|---|---|
| Iterative | 12 | 106/139 | XC4010XL-09 | 370.3 | 2.7 | 32.1 |
| Iterative | 14 | 133/145 | XC4010XL-09 | 526.3 | 1.9 | 27.5 |
| Iterative | 16 | 162/178 | XC4010XL-09 | 588.2 | 1.7 | 27.2 |
| Iterative | 24 | 317/376 | XC4062XL-09 | 1643.8 | 0.6 | 14.6 |
| Iterative | 32 | 506/626 | XC4062XL-09 | 2480.6 | 0.4 | 12.9 |
| Cascaded | 12 | 210/210 | XC4010XL-09 | 187.6 | 5.3 | |
| Cascaded | 14 | 288/288 | XC4010XL-09 | 192.9 | 5.2 | |
| Cascaded | 16 | 378/378 | XC4062XL-09 | 330.0 | 3.1 | |
Table 5. LUT synthesis results (A54SX16-3)
| Design | Length (bits) | Actmap 3.5.04 (note 1), Area/Delay (note 4) | Synplify 5.1.4 (note 1), Area/Delay (note 4) | Speed (note 2), ns | Frequency, MHz |
|---|---|---|---|---|---|
| LUT | 12 | 513/859 | 384/453 | 43.66 | 22.1 |
| LUT | 16 | 1899/---- (note 8) | 946/---- (note 8) | 84.03 | 11.9 |
Note 1: All synthesis tools were operated in a "push-button" fashion with the maximum
available optimisation enabled.
Note 2: Speed estimate based on Vital simulation using typical operating conditions.
Note 3: Estimated frequency given by Synplify.
Note 4: All module counts given by Place and Route software.
Note 5: Actel netlist selected.
Note 6: Synplify netlist selected.
Note 7: Foundation Express build 3.1.140.
Note 8: ---- Synthesis results not available.
6. Application
Two applications of CORDIC sine/cosine generators in satellite data processing
systems have been investigated: attitude determination and direct digital synthesis.
The calculation of the Earth's magnetic field is a very computationally intensive
procedure in satellite attitude determination and is usually implemented in software.
A hardware structure based on CORDIC modules has been proposed [Vlac99] for the
calculation of the Legendre polynomials, the first step of the international geomagnetic
reference field (IGRF) model [Wert85]. It consists of four blocks comprising CORDIC
modules for sine/cosine as well as other functions, and a control block. The delay of the
hardware structure was estimated based on a 32-bit iterative CORDIC module implemented in
XC4085XL. It was compared with the delay of a C program running on a Pentium
333 MHz computer for five different values of the constants $m$ and $l$. The
improvement in speed was 44% for $m = l = 10$, 37% for $m = l = 15$, 32% for
$m = l = 20$, 28% for $m = l = 25$ and 23% for $m = l = 36$ [Vlac99].
Direct digital synthesis (DDS) generates a new frequency based upon an original
reference frequency. Virtually all DDS architectures include a lookup table that
performs a sine computation function for generating sinusoidal output signals. For
comparison purposes we have designed and synthesised a LUT that is an improved
version of the modified Sunderland architecture [Vank96] (Table 5). It can be seen
that a 12-bit cascaded non-pipelined CORDIC (Table 3) achieves the same data rate
of 22 Msps as the 12-bit improved LUT design. In addition, the CORDIC design
provides both the sine and cosine functions at the same time, and its speed can be
increased further, to a data rate of about 50 Msps, if pipelining is introduced.
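A common way to organise a DDS, assumed here for illustration and not described in detail in the paper, is a phase accumulator followed by a phase-to-amplitude stage. In the sketch below `math.sin` and `math.cos` stand in for that stage, which could be a LUT or a CORDIC sine/cosine generator; the function names, the accumulator width and the tuning word are all hypothetical.

```python
import math

def dds_output_frequency(tuning_word, acc_bits, f_clk):
    """Output frequency of a phase-accumulator DDS clocked at f_clk."""
    return tuning_word * f_clk / (1 << acc_bits)

def dds_samples(tuning_word, acc_bits, count):
    """Generate (sine, cosine) sample pairs from a wrapping phase accumulator."""
    modulus = 1 << acc_bits
    phase = 0
    samples = []
    for _ in range(count):
        angle = 2.0 * math.pi * phase / modulus
        # Stand-in for the phase-to-amplitude stage (LUT or CORDIC).
        samples.append((math.sin(angle), math.cos(angle)))
        phase = (phase + tuning_word) % modulus  # accumulator wraps around
    return samples

# A 12-bit accumulator stepped by a quarter of its range yields one
# output period every four clock cycles.
quarter = dds_samples(1 << 10, 12, 4)
```

A CORDIC-based stage delivers the sine and cosine pair in the same pass, which matches the advantage over a sine-only LUT noted above.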
7. Conclusions
This paper presents theoretical and practical aspects of implementing sine/cosine
CORDIC-based generators in FPGAs. The main results can be summarised as
follows:
- A speed/area trade-off will determine the right structural approach to CORDIC
FPGA implementation for an application.
- A 32-bit 1.9 Msps iterative CORDIC can be implemented in a small FPGA (Actel
SX16-3).
- A 12-bit non-pipelined cascaded CORDIC runs at 22.3 Msps (Actel SX16-3), which
is comparable to a LUT.
- Module count and operating speed depend significantly on the synthesis tool used.
- Current radiation-tolerant FPGAs are not dense enough for the cascaded and redundant
approaches.
- Simulation has shown that the redundant adder can improve the efficiency of
CORDIC FPGA implementations for bit lengths above 32 bits.
8. References
[Andr98] R. Andraka. A Survey of CORDIC Algorithms for FPGA Based Computers,
Proc. of the 1998 ACM/SIGDA Sixth International Symposium on FPGAs, February
1998, Monterey, CA, pp. 191-200.
[Bake76] P.W. Baker. Suggestion for a Binary Cosine Generator, IEEE Transactions
on Computers, November 1976, pp. 1134-1136.
[Chen72] T.C. Chen. Automatic Computation of Exponentials, Logarithms, Ratios and
Square Roots, IBM J. Res. Development, July 1972, pp. 380-388.
[Erce87] M.D. Ercegovac, T. Lang. Fast Cosine/Sine Implementation Using CORDIC
Iterations, IEEE Trans. on Comput., vol. 40, no. 9, 1987, pp. 222-226.
[Marx99] M. Marx. FPGA Implementation of sin(x) and cos(x) Generators Using the
CORDIC Algorithm, Final Year Project Report, School of Electronic Engineering,
University of Surrey, Guildford, UK, 1999.
[Pirs98] P. Pirsch. Architectures for Digital Signal Processing, John Wiley & Sons,
1998.
[Taka91] N. Takagi. Redundant CORDIC Methods with a Constant Scale Factor for
Sine and Cosine Computation, IEEE Trans. on Comput., vol. 40, no. 9, 1991, pp. 989-
994.
[Timm92] D. Timmerman, H. Hahn, B.J. Hosticka. Low Latency Time CORDIC
Algorithms, IEEE Trans. on Comput., vol. 41, no. 8, 1992, pp. 1010-1014.
[Timm91] D. Timmerman, H. Hahn, B.J. Hosticka, B. Rix. A New Addition Scheme and
Fast Scaling Factor Compensation Methods for CORDIC Algorithms, Integration,
the VLSI Journal, vol. 11, no. 1, 1991, pp. 85-100.
[Vand90] A. Vandemeulebroecke, E. Vanzieleghem, et al. A New Carry-Free Division
Algorithm and its Application to a Single Chip 1024-b RSA Processor, IEEE Journal
of Solid-State Circuits, vol. 25, no. 3, 1990, pp. 748-755.
[Vank96] J. Vankka. Methods of Mapping from Phase to Sine Amplitude in Direct
Digital Synthesis, Proc. of the 1996 IEEE International Frequency Control
Symposium, 1996, pp. 942-950.
[Vlac99] A. Vlachos. Design and Implementation of CORDIC Modules for ADCS,
MSc Project Report, School of Electronic Engineering, University of Surrey,
Guildford, UK, 1999.
[Vold59] J. Volder. The CORDIC Computing Technique, IRE Trans. Comput., Sept.
1959, pp. 330-334.
[Walt71] J.S. Walther. A Unified Algorithm for Elementary Functions, Proc. AFIPS
Spring Joint Computer Conference, pp. 379-385, 1971.
[Wang96] S. Wang, V. Piuri. A Unified View of CORDIC Processor Design, in
Application Specific Processors, Ed. by Earl E. Swartzlander, Jr., Kluwer Academic
Press, 1996, pp. 121-160.
[Wert85] J. Wertz. Spacecraft Attitude Determination and Control, D. Reidel
Publishing Company, London, 1985.
