
IEEE paper on CORDIC


Rakesh Gangarajaiah, Student Member, IEEE, and Erik Hertz,

Department of Electrical and Information Technology, Lund University, Lund, Sweden,

Centre for Research on Embedded Systems, Halmstad University, Halmstad, Sweden,

Peter.Nilsson@eit.lth.se, Yuhang.Sun.494@student.lu.se, Rakesh.Gangarajaiah@eit.lth.se, and Erik.Hertz@hh.se

Abstract: This paper shows a novel methodology to improve unrolled CORDIC architectures. The methodology is based on removing adder stages starting from the first stage. As an example, a 19-stage CORDIC is used, but the methodology is applicable to CORDICs with an arbitrary number of stages. The CORDIC is implemented, simulated, and synthesized into hardware. The paper shows that the performance can be increased by 23% and that the dynamic power can be reduced by 27%.

I. INTRODUCTION

demanding task to do in hardware. Look-up tables are useful if the precision is low, but the size of the table grows exponentially with increasing accuracy, which makes them infeasible for high-precision hardware. Taylor expansion is another methodology for unary functions [1]. A novel methodology is to use parabolic functions to compute unary functions [2].

The focus of this paper is on the CORDIC algorithm [3]. Traditionally, iterative CORDICs [4] have been used due to their small area, which gives low silicon cost. In today's downscaled technologies, e.g. below 40 nm, the static power consumption tends to exceed the dynamic power consumption in low-speed architectures. Iterative small-area CORDICs are therefore still important.

For high-speed designs, other parameters such as

dynamic power consumption and performance are

important. A relaxed focus on silicon area will thus

make it feasible to use fast unrolled CORDICs. This

paper shows a novel methodology to reduce the power

consumption in unrolled CORDIC architectures.

However, the methodology is, to a large extent,

applicable to iterative CORDICs as well.

II.

rotations to find an approximation of a non-linear function. The main advantage of the algorithm is that it is multiplier-less: it uses additions and subtractions only. The CORDIC is built up from a number of stages, and increasing the number of stages improves the accuracy. Fig. 1 shows an example with four vector rotations, corresponding to a four-stage CORDIC. The (blue) dashed vector is the vector at the input angle, for which the approximate cosine and sine values are sought.

There are many ways to choose where to start the rotations. In this paper, the starting point is chosen to be at the coordinates (x = 1/R, y = 0), where R is the last vector length if the starting vector has length 1. A starting point at 1/R will thus produce a result that is as close to the unit circle as possible when the last stage is computed. R is a fixed coefficient for the chosen number of CORDIC stages, so it can be hard-wired into the implementation. When the last rotation is done, the coordinates give the approximate cosine and sine values for the input angle, i.e. (x4, y4) in the figure.

978-1-4673-6576-5/15/$31.00 ©2015 IEEE

positive and detected to be too large. The second does a negative rotation, the third a positive, and finally the fourth is negative again. The angles of the rotations, ci in Fig. 1, are fixed coefficients, which are applied in the order c1 = 45, c2 ≈ 27, c3 ≈ 14, c4 ≈ 7 degrees, etc. The reason is that arctan(1) = 45, arctan(1/2) ≈ 27, arctan(1/4) ≈ 14, and arctan(1/8) ≈ 7 degrees. That is, the tangents of the angles are powers of two, so the rotations correspond to binary shifts.
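The rotation angles and the fixed gain R can be tabulated with a few lines of Python. This is a minimal sketch, not the authors' code: the function names are hypothetical, the angles are indexed from zero, and floating point is used instead of the fixed-point coefficients of the hardware.

```python
import math

def cordic_angles_deg(stages):
    """Rotation angles arctan(2**-i) in degrees for i = 0 .. stages-1."""
    return [math.degrees(math.atan(2.0 ** -i)) for i in range(stages)]

def cordic_gain(stages):
    """Growth R of the vector length after `stages` unscaled rotations.

    Each shift-and-add rotation stretches the vector by sqrt(1 + 2**(-2i)),
    so R is the product of these factors over all stages.
    """
    gain = 1.0
    for i in range(stages):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return gain

# The first four angles round to the 45, 27, 14, and 7 degrees quoted in the text.
print([round(a, 2) for a in cordic_angles_deg(4)])
# Starting at x = 1/R compensates for the growth of a 19-stage CORDIC.
print(cordic_gain(19), 1.0 / cordic_gain(19))
```

Because R depends only on the number of stages, 1/R is a constant that can be fixed at design time.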

A 19-stage CORDIC is chosen to demonstrate the methodology. However, the methodology is valid for any size of the CORDIC. Fig. 2a shows the architecture of the first five stages of the 19-stage CORDIC. In the four adders at the top, the remaining angle is computed for each stage. The input signal is the angle for which the cosine and sine values are sought. At the top, the fixed coefficient angle values ci for each rotation are delivered. These are added to or subtracted from the remaining angle in each stage. That is, the remaining angle is v0 = angle − c0, v1 = angle − c0 ± c1, etc., where angle denotes the input angle.

The coefficients are, in this architecture, chosen to be 23 bits long, which after synthesis become much shorter due to many zeros. They are fixed coefficients that can be hard-wired or stored in a ROM. In the middle and the lower adder rows, the approximations x5 of the cosine and y5 of the sine of the input angle are computed and provided to the right.
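The three adder rows described above can be mimicked in software. The following is a minimal floating-point sketch, assuming the standard rotation-mode CORDIC recurrence and a stage index starting at zero; the function name is hypothetical, and the 19-bit and 23-bit fixed-point word lengths of the hardware are not modelled.

```python
import math

def cordic_cos_sin(theta, stages=19):
    """Approximate (cos(theta), sin(theta)) for theta in radians.

    Converges for |theta| up to about 1.74 rad (the sum of all angles).
    """
    angles = [math.atan(2.0 ** -i) for i in range(stages)]
    gain = 1.0
    for i in range(stages):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    x, y = 1.0 / gain, 0.0   # starting vector (1/R, 0)
    v = theta                # remaining angle, the "top adder row"
    for i in range(stages):
        # The sign of the remaining angle selects addition or subtraction.
        d = 1 if v >= 0 else -1
        # Crosswise update; multiplying by 2**-i is a right shift in hardware.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        v -= d * angles[i]
    return x, y

c, s = cordic_cos_sin(math.radians(30.0))
print(c, s)   # close to cos(30 deg) and sin(30 deg)
```

In an unrolled architecture the loop body is instantiated once per stage, which is why every removed stage shortens the critical path.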

[Fig. 2, panels (a) to (e): block diagrams of the first CORDIC stages, built from ADD/SUB cells, right shifts by 1/2, 1/4, 1/8, and 1/16, and MUXes. Signal labels: c0 to c3, v1 to v4, d1 to d4, x0 to x5, y0 to y5, the pre-calculated inputs x2,0, x2,1, y2,0, y2,1, x3,00 to x3,11, and y3,00 to y3,11, the angle coefficients c0+c1 and c0−c1, and d1-catch.]

coordinates is delivered to the left. In each stage, new intermediate vectors are determined, in order to converge towards the vector that approximates the coordinates of the input angle. In each vector rotation stage there is a crosswise

by a factor corresponding to 2^{i-1}, where i ∈ {1, 2, 3, ...}. That is, division by 1, 2, 4, 8, ..., which are the right shifts discussed in Section II. They are done by shifting the buses between the stages. There is thus no extra cost in hardware for the divisions.

shown in (1).

x_i = x_{i-1} \mp \frac{y_{i-1}}{2^{i-1}}, \qquad y_i = y_{i-1} \pm \frac{x_{i-1}}{2^{i-1}} \qquad (1)

The vector rotations, in each stage of the top row of adders, depend on the sign bits di, each of which is the MSB of the remaining angle vi. That is, each sign bit determines whether the stage performs an addition or a subtraction.

architecture in Fig. 2b can be used. The two 19-bit adders at the left can be removed, since the x0 and y0 values are fixed. The methodology is valid for any chosen input vector. In this example, the coordinates x0 = 1/R and y0 = 0 are chosen. That means that the outputs from the first stage are y1 = x0 in the middle adder row and x1 = x0 in the lower adder row, as shown in (1).

To eliminate the second adder stage, the architecture in Fig. 2c can be used. The removal of the two 19-bit adders requires two 19-bit MUXes, whose input signals are pre-calculated as fixed coefficient values, as shown in (1). The gain is lower power consumption and a shorter critical path, since there is no carry ripple in the MUXes.

In Fig. 2d, another adder stage is removed: the third stage in the middle and lower adder rows. Now, six MUXes are needed to replace the adders, and all of the input signals are fixed as before, see (1). Note that the last two indices of the input coefficients stand for the sign-bit values d1 and d2.
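The idea of replacing the first adder stages with MUXes fed by pre-calculated constants can be sketched in Python. This is an illustration under stated assumptions, not the authors' implementation: it uses floating point, a zero-based stage index, and hypothetical names (`build_table`, `cordic_with_table`). In the paper the first rotation direction appears to be fixed, so the third-stage inputs are indexed by d1 and d2 only; the sketch keeps all sign bits as table keys for simplicity.

```python
import math
from itertools import product

STAGES = 19
ANGLES = [math.atan(2.0 ** -i) for i in range(STAGES)]
GAIN = math.prod(math.sqrt(1.0 + 2.0 ** (-2 * i)) for i in range(STAGES))

def rotate(x, y, d, i):
    """One crosswise shift-and-add rotation with direction d = +1 or -1."""
    return x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i

def build_table(removed):
    """Pre-calculate (x, y) after `removed` stages for every sign pattern."""
    table = {}
    for signs in product((1, -1), repeat=removed):
        x, y = 1.0 / GAIN, 0.0
        for i, d in enumerate(signs):
            x, y = rotate(x, y, d, i)
        table[signs] = (x, y)
    return table

def cordic_with_table(theta, removed=3):
    """CORDIC where the first `removed` x/y adder stages become a lookup."""
    table = build_table(removed)   # fixed constants in hardware
    v, signs = theta, []
    for i in range(removed):       # the angle row still produces sign bits
        d = 1 if v >= 0 else -1
        signs.append(d)
        v -= d * ANGLES[i]
    x, y = table[tuple(signs)]     # the MUXes select a constant
    for i in range(removed, STAGES):
        d = 1 if v >= 0 else -1
        x, y = rotate(x, y, d, i)
        v -= d * ANGLES[i]
    return x, y

print(cordic_with_table(0.5))   # close to (cos(0.5), sin(0.5))
```

The result is the same as running all stages with adders, because the table entries are exactly the values the removed adders would have produced.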

The adder stage reduction from Fig. 2d can be expanded to a reduction in the fourth stage, fifth stage, and so on. The removal of the fourth stage will require 8+4+2 = 14 MUXes with a wordlength of 19 bits, and the elimination of five adder stages will require 30 MUXes. However, at that point the power consumption and silicon area begin to increase, although the delay, i.e. the critical path, becomes shorter. This paper stops at three eliminated adder stages in the middle and lower adder rows, but there is potential for more reductions.
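The MUX counts quoted above (2, 6, 14, and 30 for two to five removed stages) fit the closed form 2^k − 2, where k is the number of removed stages. The sketch below assumes that pattern continues; the formula is inferred from the quoted numbers and is not stated in the paper, and the function name is hypothetical.

```python
def mux_count(removed_stages):
    """Number of 2:1 MUXes for both x/y rows when the first stages are removed.

    Inferred from the counts in the text: each row needs a binary MUX tree
    with 2**(k-1) - 1 MUXes, giving 2**k - 2 in total for the two rows.
    """
    if removed_stages < 1:
        raise ValueError("at least one stage must be removed")
    return 2 ** removed_stages - 2

for k in range(1, 6):
    print(k, mux_count(k))
```

The exponential growth is consistent with the observation that area and power start to rise again once more than a few stages are removed.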

Until now, the upper adder row has been left intact. However, it is possible to eliminate adder stages in the upper row as well. In (2), the sign bit d1 can be detected by the logic function d1-catch, which is also used in the architecture in Fig. 2e. The sign-bit detection can, for instance, be realized with two inverters, three 2-input NANDs, and one 3-input NAND. One adder stage is removed by that operation. An extra MUX is also needed to select between the coefficient angle values c0+c1 and c0−c1, i.e. a positive or negative c1 rotation.

d1-catch = bit19 + bit18·bit17 + bit18·bit16·bit15·bit14    (2)

with MUXes.

detection of the function d2-catch, d3-catch, and so on. However, the design will be more complex, but there will still be a gain in shorter delay. Here, the investigation stopped with the first sign-bit detection.

V. RESULTS

used for the simulations. The designs are implemented in VHDL, and STMicroelectronics has provided the cell library. The tools used are Design Compiler and PrimeTime. For the simulations, a supply voltage of 1.2 V and a temperature of 25 degrees are chosen.

TABLE I. THE SILICON AREA WITH MAXIMUM SPEED AND MINIMUM AREA CONSTRAINT SET

Architecture   Maximum speed constraint (mm²)   Minimum area constraint (mm²)
Fig. 2a        0.1010 (100%)                    0.0361 (100%)
Fig. 2b        0.1005 (99%)                     0.0360 (100%)
Fig. 2c        0.1098 (109%)                    0.0340 (94%)
Fig. 2d        0.1195 (118%)                    0.0334 (93%)
Fig. 2e        0.1352 (134%)                    0.0332 (92%)

cases, when the tools are set with a maximum speed

constraint and when they are set with a minimum area

constraint.

Maximum speed is the frequency at which there is zero slack in the critical path, e.g. 76.7 MHz for the original design and 99.6 MHz for the final one. To find the minimum area, it is appropriate to find a suitable frequency below which the area no longer decreases. Table II shows the area for the architectures, synthesized for different frequencies.

TABLE II. THE SYNTHESIZED AREA FOR THE ARCHITECTURES AT DIFFERENT OPERATING FREQUENCIES

Architecture   1 MHz    10 MHz   20 MHz   30 MHz   50 MHz   76.7 MHz
               (mm²)    (mm²)    (mm²)    (mm²)    (mm²)    (mm²)
Fig. 2a        0.0355   0.0355   0.0358   0.0361   0.0401   0.1010
Fig. 2b        0.0354   0.0354   0.0357   0.0360   0.0405   0.1005
Fig. 2c        0.0337   0.0337   0.0340   0.0340   0.0397   0.1033
Fig. 2d        0.0331   0.0334   0.0334   0.0334   0.0394   0.1026
Fig. 2e        0.0331   0.0334   0.0334   0.0332   0.0405   0.1019

the frequency. The figure shows the area vs. frequency, with the solid (red) line for the original architecture and the dashed (blue) line for the final architecture. It can be seen that below 30 MHz there is almost no change in the area. The minimum area constraint can thus be set to the area at 30 MHz.

As shown in Table II, when the minimum area constraint is set to 30 MHz, a reduction of (361−332)/361 = 8% can be seen when the original architecture in Fig. 2a is compared to the final architecture in Fig. 2e.

[Fig. 3: area versus frequency f (MHz), from 0 to 100 MHz, with the area at 76.7 MHz and the area at 99.6 MHz marked.]

changing architectures, is almost zero at each specific frequency, as illustrated in Fig. 3. Above 30 MHz, the synthesizer starts to replace adder cells with random logic. The number of available adder cells in the cell library is, however, limited; in particular, there are no adder cells with high driving capability. Library cells like inverters, NAND, NOR, AOI, OAI, etc. that have large driving strengths are used instead of the adder cells. A change from single adder cells to synthesized adder functions, generated by a number of logic cells, costs area. If the constraint is set to maximum speed, a dramatic change can be noted. The area increases by 34% between the original and the final architecture, as shown in Table I.

TABLE III. THE CRITICAL PATH DELAY WITH MAXIMUM SPEED CONSTRAINT SET

Architecture   Max. speed constraint (ns)   Frequency (MHz)
Fig. 2a        13.03 (100%)                 76.75
Fig. 2b        13.03 (100%)                 76.75
Fig. 2c        12.03 (92%)                  83.13
Fig. 2d        11.03 (85%)                  90.66
Fig. 2e        10.04 (77%)                  99.60

is shown in Table III, when the maximum speed constraint is set. For the maximum speed constraint, the delay is reduced by up to 23%. The better performance is thus paid for by the larger area, as previously shown in Table I.
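The frequency column and the percentages in Table III follow directly from the critical path delays. A small sketch of the arithmetic, using the delay values from the table and hypothetical helper names:

```python
# Critical path delays in ns, copied from Table III.
delays_ns = {
    "Fig. 2a": 13.03,
    "Fig. 2b": 13.03,
    "Fig. 2c": 12.03,
    "Fig. 2d": 11.03,
    "Fig. 2e": 10.04,
}

def max_frequency_mhz(delay_ns):
    """Clock frequency at zero slack: f = 1 / delay, with 1/ns = 1000 MHz."""
    return 1e3 / delay_ns

def relative_percent(value, reference):
    """Value as a percentage of the reference design."""
    return 100.0 * value / reference

reference = delays_ns["Fig. 2a"]
for name, delay in delays_ns.items():
    print(name,
          round(max_frequency_mhz(delay), 2),
          round(relative_percent(delay, reference)))
```

The final architecture runs at 77% of the original delay, which is the 23% reduction quoted in the text.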

TABLE IV. THE POWER CONSUMPTION AT MAXIMUM SPEED

Architecture   Frequency (MHz)   Dynamic Power (mW)   Static Power (nW)
Fig. 2a        76.7              2.852 (100%)         291.8 (100%)
Fig. 2b        76.7              2.802 (98%)          296.2 (102%)
Fig. 2c        83.1              2.687 (94%)          319.2 (109%)
Fig. 2d        90.7              2.319 (81%)          308.8 (106%)
Fig. 2e        99.6              2.044 (72%)          316.1 (108%)

constraint is set, is shown in Table IV. In Table IV it can be seen that the dynamic power is lowered by 28%, while the static power increases by 8%. The increase in static power is expected since the area is increased.

However, it is not so interesting to compare the

power consumption when the frequencies are different,

since the power changes linearly with the frequency. In

Table V, the power consumption at 10MHz is shown.

The table shows that the dynamic power is decreased by

27% and the static power is 6% lower. The decrease in

static power is expected since the area is decreased, due

to the replacement of adder cells with MUXes.

TABLE V. THE POWER CONSUMPTION AT 10 MHZ

Architecture   Frequency (MHz)   Dynamic Power (mW)   Static Power (nW)
Fig. 2a        10                0.2399 (100%)        131.5 (100%)
Fig. 2b        10                0.2263 (94%)         131.5 (100%)
Fig. 2c        10                0.2124 (89%)         126.2 (96%)
Fig. 2d        10                0.1864 (77%)         123.5 (94%)
Fig. 2e        10                0.1748 (73%)         123.5 (94%)
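Since dynamic power grows roughly linearly with frequency, another way to compare designs synthesized for different frequencies is the dynamic energy per clock cycle, P/f. The sketch below applies this to the original and final architectures using the values from Tables IV and V; the helper name is hypothetical and the normalization is an illustration, not a figure reported in the paper.

```python
# (frequency in MHz, dynamic power in mW), copied from Tables IV and V.
table_iv = {"Fig. 2a": (76.7, 2.852), "Fig. 2e": (99.6, 2.044)}
table_v = {"Fig. 2a": (10.0, 0.2399), "Fig. 2e": (10.0, 0.1748)}

def energy_per_cycle_pj(freq_mhz, dynamic_mw):
    """Dynamic energy per clock cycle in pJ.

    mW / MHz equals nJ, so multiplying by 1000 converts to pJ.
    """
    return 1e3 * dynamic_mw / freq_mhz

for label, table in (("max speed", table_iv), ("10 MHz", table_v)):
    for name, (freq, power) in table.items():
        print(label, name, round(energy_per_cycle_pj(freq, power), 2), "pJ")
```

The designs synthesized for maximum speed use more energy per cycle than the 10 MHz designs, which is consistent with the larger area reported in Table I.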

frequency f is illustrated for the final architecture, i.e.

the architecture displayed in Fig. 2e.

close to the limit, the switching power is not increasing linearly with respect to the frequency, i.e. it is increasing more than expected. The same phenomenon can be seen for the static power, which is no longer constant.

When the three power components are added, the total power, the upper solid (red) line in Fig. 4, is the result. Two asymptotic regions appear, where the static power dominates to the left and the dynamic power to the right. It can be seen that the switching and internal power increase linearly with the frequency and that the static power is unaffected by the frequency. To the left in the diagram, the dynamic power can be ignored compared to the static power, and to the right the static power can be neglected relative to the dynamic power. Somewhere around 10 kHz, the dynamic power becomes larger than the static power.
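The crossover between static and dynamic power can be estimated from the tabulated numbers. This is a rough sketch that assumes dynamic power scales linearly down from the 10 MHz value in Table V and that static power stays constant; the function name is hypothetical, and the curves in Fig. 4 may come from a different synthesis run.

```python
def crossover_frequency_hz(static_w, dynamic_w, dynamic_freq_hz):
    """Frequency at which linearly scaled dynamic power equals static power."""
    dynamic_per_hz = dynamic_w / dynamic_freq_hz   # W per Hz of clock rate
    return static_w / dynamic_per_hz

# Final architecture (Fig. 2e) at 10 MHz: 123.5 nW static, 0.1748 mW dynamic.
f_cross = crossover_frequency_hz(123.5e-9, 0.1748e-3, 10e6)
print(round(f_cross), "Hz")
```

Under these assumptions the estimate is a few kilohertz, the same order of magnitude as the crossover of around 10 kHz read from the figure.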

Fig. 4. The power consumption in the final architecture.

and internal power. Citing the tool vendor, "The switching power is determined by the capacitive load and the frequency of the logic transitions on a cell output" [5]. Furthermore, "The internal power is caused by the charging of internal loads as well as by the short-circuit current between the N and P transistors of a gate when both are on" [5]. The sum of them is thus needed to determine the total dynamic power consumption. It can be noted that the switching power is higher than the internal power, but they are basically of the same size. The switching power is shown by the long-dashed (blue) line and the internal power by the short-dashed (red) line in Fig. 4. The static power is shown by the lower solid (blue) line and the total power consumption is shown by the upper solid (red) line.

VI. ACKNOWLEDGEMENT

We would like to thank STMicroelectronics, SSF, and Vinnova for their support.

VII. CONCLUSIONS

A methodology to improve unrolled CORDIC architectures is presented. The methodology is based on removing adder stages at the cost of a number of MUXes. It is shown that the performance can be increased by 23% and that the dynamic power consumption can be reduced by 27% by replacing adders with MUXes.

REFERENCES

[1] Peter Nilsson, Ateeq Ur Rahman Shaik, Rakesh Gangarajaiah, and Erik Hertz, "Hardware Implementation of the Exponential Function Using Taylor Series," in the Proceedings of the 32nd NORCHIP Conference, Tampere, Finland, October 27-28, 2014.

[2] ... Methodology Implemented on the Sine Function, in the Proceedings of the 2009 IEEE International Symposium on Circuits and Systems (ISCAS'09), Taipei, Taiwan, May 24-27, 2009.

[3] Jack E. Volder, "The CORDIC Trigonometric Computing Technique," in the IRE Transactions on Electronic Computers, vol. EC-8, no. 3, 1959, pp. 330-334.

[4] P. K. Meher, et al., "50 years of CORDIC: algorithms, architectures, and applications," in the IEEE Transactions on Circuits and Systems I (TCAS-I), vol. 56, pp. 1893-1907, Sep. 2009.

[5] Gordon Yip, "Expanding the Synopsys PrimeTime Solution with Power Analysis," Synopsys, Inc., http://www.synopsys.com/, June 2006.
