You are on page 1of 4

Reduced Memory Architecture for CORDIC-based FFT

Xin Xiao, Erdal Oruklu and Jafar Saniie


Department of Electrical and Computer Engineering Illinois Institute of Technology Chicago, USA, 60616
Abstract In this paper, a new pipelined, reduced memory CORDIC-based architecture is presented for any radix size FFT. A multi-bank memory structure and the corresponding addressing scheme are used to realize the parallel and in-place data accesses. The proposed memory-reduced CORDIC algorithm eliminates the need for storing twiddle factors and angles, resulting in significant area savings with no negative impact on performance. As a case study, the radix-2 and radix-4 FFT algorithms have been implemented on FPGA hardware. The synthesis results match the theoretical analysis and it can be observed that more than 20% reduction can be achieved in total memory logic.

I.

INTRODUCTION

multiplier is usually the speed bottleneck in the pipeline of the FFT processor. The Coordinate Rotation Digital Computer (CORDIC) [4] algorithm is an alternative method to realize the butterfly operation without using any dedicated multiplier hardware. CORDIC algorithm is very versatile and hardware efficient since it requires only add and shift operations, making it very suitable for the butterfly operations in FFT [5]. Instead of storing actual twiddle factors in a ROM, the CORDIC-based FFT processor needs to store only the twiddle factor angles in a ROM for the butterfly operation. Additionally, the CORDIC-based butterfly can be twice faster than traditional multiplier-based butterflies in VLSI implementations. In this study, we propose a modified CORDIC algorithm for FFT processors which eliminates the need for storing the twiddle factor angles. The algorithm generates the twiddle factor angles successively by an accumulator. With this approach, full memory requirements of an FFT processor can be reduced by more than 20%. Memory reduction improves with the increased radix size. Since the critical path is not modified with the CORDIC angle calculation, system throughput does not change. In Section II, CORDIC algorithm fundamentals and the design of CORDIC-based FFT processor are described. Then, the proposed memory efficient FFT algorithm and it's hardware architecture are presented in Section III. Hardware synthesis results for both radix-2 and radix-4 FFTs are discussed in Section IV. II.
N 1 n =0

Fast Fourier transform (FFT) is among the most widely used operations in digital signal processing. Often, a high performance FFT processor is the key component and determines most of the design metrics in many applications such as Orthogonal Frequency-Division Multiplexing (OFDM) [1], Synthetic Aperture Radar (SAR) [2] and software defined radio [3]. For embedded systems, in particular portable devices; efficient hardware realization of FFT with small area, low-power dissipation and real-time computation is a significant challenge. A typical FFT processor is composed of butterfly calculation units, memory banks and control logic (address generator for data and twiddle factor accesses). In most cases, an FFT processor uses only one butterfly unit to realize all calculations iteratively, and the in-place memory access strategy is required for the least amount of memory. With inplace strategy, the outputs of a butterfly operation are stored back to the same memory location of the inputs, saving the memory usage by one half. However, correct memory addressing scheme is required to avoid the data conflict. This study implements an efficient addressing scheme to realize the parallel, pipelined and in-place memory accessing. It produces an output at every clock cycle; furthermore the memory banks and the butterfly unit are utilized with 100% efficiency within the pipeline. In FFT processors, butterfly operation is the most computationally demanding stage. Traditionally, a butterfly unit is composed of complex adders and multipliers, and the

FFT AND CORDIC ALGORITHM


j 2 nk N

The N-point discrete Fourier transform is defined by


nk X (k ) = x(n)WN

k = 0,1,..., N 1,

nk WN = e

(1)

Fig. 1 shows the signal flow graph of 16-point decimation-in-frequency (DIF) radix-2 FFT. An FFT processing is composed of many butterfly operations as shown in Fig. 2: x m +1 ( p ) = x m ( p ) + x m (q) (2)
r x m +1 ( q ) = [ x m ( p ) x m ( q )]W N

(3)

978-1-4244-5309-2/10/$26.00 2010 IEEE

2690

If each rotate angle

xi +1

is equal to arctan 2 i , then: = cos ( xi y i 2 i )


i

(6) (7)

y i +1 = cos ( y i + xi 2 )
i

Since = arctan 2 , cos can be simplified to a constant with fixed number of iterations:

xi +1 = K i ( xi y i d i 2 i ) y i +1 = K i ( y i + xi d i 2 )
i i

(8) (9)

where K i = cos(arctan( 2 )) and d i = 1 . Product of


Fig. 1. Signal flow graph of a 16-point radix-2 FFT
xm ( p)
x m+1 ( p )

Ki's can be represented by the K factor which can be applied as a single constant multiplication either at the beginning or end of the iterations. Then, (8) and (9) can be simplified to:

xi +1 = ( xi y i d i 2 i ) y i +1 = ( y i + xi d i 2 )
xm (q )
W
r N

(10) (11)

xm+1 ( q)

Fig. 2. Butterfly unit at stage

Equations (2), (3) describe the radix-2 butterfly operation at stage m as shown in Fig.1. Each butterfly operation needs four data accesses (two read and two write); however, hardware realization of four port memory units is difficult and costly. To overcome this challenge, multi-bank memory units can be used to realize the parallel and "in-place" data accesses. Two two-port memory banks can provide four data access in each clock cycle, but in this case, a special data addressing scheme is required to prevent the data conflict. In [6], a new address scheme has been proposed to realize this function and it can be easily extended to any radix FFT. In this study, this special addressing scheme will be adopted for CORDIC based FFT implementation. CORDIC algorithm was proposed by J.E. Volder [4]. It is an iterative algorithm to calculate the rotation of a vector by using only additions and shifts. Fig. 3 shows an example for rotation of a vector Vi. Equations (4) to (7) illustrate the steps for calculating the rotation.
y

The direction of each rotation is defined by di and the sequence of all di 's determines the final vector. di is given as: 1 if z i < 0 (12) di = + 1 if z i 0 where zi is called angle accumulator and given by (13) All operations described through (10)-(13) can be realized by only additions and shifts; therefore, CORDIC algorithm does not require dedicated multipliers. As shown in (1), the key operation of the FFT processing is x ( n ) W N , ( W N = e
nk

z i +1 = ( z i d i arctan 2 i )

nk

2 nk N

is the so-called twiddle

factor). This is equivalent to "Rotate x (n) by angle 2 nk " operation which can be realized easily by the N CORDIC algorithm. Without normal complex multiplication, CORDIC-based butterfly can be very fast. Normally, a traditional FFT processor needs to store the twiddle factors in memory. Instead, a CORDIC-based FFT processor needs to store the twiddle factor angles in memory. Garrido [7] designed an angle generator to provide twiddle factor angles without the angle storage memory. However, it generates the angles by multipliers which can be slow, hardware inefficient and may lead to precision loss. This paper presents a new CORDIC FFT design using a single accumulator which can generate all the necessary angles at real-time and does not have any precision loss. By avoiding the twiddle factor angle ROM, the FFT processor can save more than 20% of total memory utilization. III. PROPOSED CORDIC-BASED FFT In the past, several multi-bank addressing schemes have been proposed to realize parallel and pipelined FFT processing [8][9], but those methods are not suitable for the proposed memory reduced CORDIC-based FFT. In these

Vi +1 ( xi +1 , yi +1 )

Vi ( xi , y i )

Fig. 3. Rotate vector Vi ( xi , yi ) to Vi +1 ( xi +1 , y i +1 )

x i +1 = r cos ( + ) = r( cos cos sin sin ) = x i cos y i sin yi +1 = r sin( + ) = r (sin cos + cos sin ) = yi cos + xi sin

(4)

(5)

2691

TABLE I. ADDRESS GENERATION TABLE OF MAS[8] DESIGN FOR 16-POINT RADIX-2 FFT
Butterfly Counter B(b2b1b0) 000 001 010 011 100 101 110 111 Stage 0 RAM Twiddle address factor b0b2b1 angle 000 0 4 100
8

Stage 1 RAM Twiddle address factor b1b0b2 angle 000 0 4 010


8

Stage 2 RAM Twiddle address factor b2b1b0 angle 000 0 4 001 8 010 0
4 8

Stage 3 RAM Twiddle address factor b0b2b1 angle 000 0 100 001 101 010 0 0 0 0 0 0 0

001 101 010 110 011 111

8
8

100 110 001 011 101 111

0
4 2 8 8

011 100 101 110 111

2
6

8
8

0
4 8

110 011

3 7

8 8

2 6

8 8

0
4 8

111

TABLE II. ADDRESS GENERATION TABLE OF THE PROPOSED DESIGN FOR 16-POINT RADIX-2 FFT
Butterfly Counter B(b2b1b0) 000 001 010 011 100 101 110 111 Stage 0 RAM Twiddle address factor b2b1b0 angle 000 0 001 010 011 100 101 110 111 Stage 1 RAM Twiddle address factor b0b2b1 angle 000 0 100 001 101 010 110 011 111 0 Stage 2 RAM Twiddle address factor b1b0b2 angle 000 0 010 0 0 0
4 4 8 8

Stage 3 RAM Twiddle address factor b2b1b0 angle 000 0 001 010 011 100 101 110 111 0 0 0 0 0 0 0

8
8

2 2
4
4

8 8
8
8

100 110 001 011 101 111

3
4
5

8
8
8

6
7

8
8

6
6

8
8

FFTs. For radix-2, the structure can be simplified by using just 4 registers, but for radix-r FFT, 2 r 2 registers are needed. Table II shows the address generation table of the proposed design for 16-point radix-2 FFT. It can be seen that twiddle factor angles are sequentially increasing in the proposed design, and every angle is a multiple of the basic angle 2 , which is 8 for 16-point FFT. For different FFT
N
Fig. 4. Proposed design for radix-2 CORDIC FFT processor

schemes, the twiddle factor angles are not in regular increasing order (see Table I) and this results in a more complex design for angle generators. Using a new addressing scheme as shown in Table II, the twiddle factor angles follow a regular, increasing order, which can be generated by a simple accumulator. Fig. 4 shows the basic structure of proposed design for radix-2 FFT processing. In proposed design, registers are used before and after the butterfly unit to buffer the intermediate data temporarily in order to group two sequential butterfly operations together. Therefore, the conflict-free in-place data accessing can be realized. This register-buffer design can be easily extended to any radix

stages, the angles increase always one step per clock cycle. Hence, an angle generator circuit composed of an accumulator, a register and a latch can realize this function, as shown in Fig. 5. Control signal for the latch enables or disables the accumulator output based on the current FFT butterfly stage and RAM address bits b2b1b0 (see Table II).

Fig. 5. Angle generator for the proposed design

2692

r2

r2

r2

r2

r2

r2

r2

r2

r2

r2

r2

r2

r2

r2

Fig. 6. Proposed design for radix-r CORDIC-based FFT TABLE III. FPGA IMPLEMENTATION RESULTS FOR RADIX-2 AND RADIX-4 FFT Proposed CORDIC FFT Design Total logic elements Total memory bits 1024-point FFT Total logic elements Total memory bits 4096-point FFT
n

Conventional CORDIC FFT Design Radix-2 1,386 10,720 1,718 41,440 1,757 164,320 Radix-4 5,763 11,800 5,797 45,592 5,863 180,760

256-point FFT

Radix-2 1,427 (19-bit accumulator) 8,672 1,773 (21-bit accumulator) 33,248 1,809 (23-bit accumulator) 131,552

Radix-4 5,892 (20-bit accumulator) 8,728 5,991 (22-bit accumulator) 33,304 5,993 (24-bit accumulator) 131,608

Total logic elements Total memory bits

For radix-2, N = 2 -point, m-bit FFT, by using the angle generator, 5mN bits memory can be reduced to 4mN ,
2 2

which corresponds to 20% reduction. For higher radix FFT, the reduction is even more significant. For radix-r FFT, the saving is (r 1)mN bits out of (3r 1)mN , which converges to
r
r

accordance with the theoretical analysis. In summary, this study proposed a new memory reduced CORDIC-based FFT design and the corresponding addressing scheme. With this new approach, the FFT processing can be realized by less memory logic without any speed and precision loss. REFERENCES
[1] C. Wey, S. Lin, and W. Tang, Efficient memory-based FFT processors for OFDM applications, IEEE International Conference on ElectroInformation Technology, pp.345 - 350, May 2007. [2] L. Fanucci, M. Forliti, and F. Gronchi, Single-chip mixed-radix FFT processor for real-time on-board SAR processing, 6th IEEE International Conference on Electronics, Circuits and Systems, ICECS '99, vol. 2, pp. 1135-1138, September 1999. [3] S. Mittal, M. Khan, and M.B. Srinivas, On the suitability of Bruuns FFT algorithm for software defined radio, 2007 IEEE Sarnoff Symposium, pp. 1-5, April 2007. [4] J. Volder, The CORDIC trigonometric computing technique, IEEE Transactions on Electronic Computers, vol. EC-8, no. 8, pp. 330-334, September 1959. [5] Jayshankar, Efficient computation of the DFT of a 2N-point real sequence using FFT with CORDIC based butterflies, IEEE Region 10 Conference, TENCON 2008, pp. 1-5, November 2008. [6] X. Xiao, E. Oruklu, and J. Saniie, Fast memory addressing scheme for radix-4 FFT implementation, IEEE International Conference on Electro/Information Technology, EIT 2009, pp. 437-440, June 2009. [7] M. Garrido, and J. Grajal, Efficient memory-less CORDIC for FFT Computation, IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2007, vol. 2, no. 2, pp. 113-116, April 2007. [8] Y. Ma, An effective memory addressing scheme for FFT processors, IEEE Transactions on Signal Processing, vol. 47, no. 3, pp. 907-911, March 1999. [9] X. Xiao, E. Oruklu, and J. Saniie, An efficient FFT engine with reduced addressing logic, IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 55, no. 11, pp.1149-1153, November 2008.

33.3% reduction. Fig. 6 shows the basic structure of proposed design for radix-r FFT. Due to finite precision, as the accumulator operates, the precision loss will be accumulate as well. In order to address this issue, more bits (wider wordlength) can be used for the fundamental angle 2/N and the accumulator. For radix-2, N = 2 n point, 16-bit FFT (each data is 32-bit complex number), if the basic angle and accumulator is greater than (16+n-5) bits, no precision loss will be observed compared to a normal angle-stored CORDIC FFT processor. For example, for 1024-point FFT, the basic angle and accumulator only have to be extended from 16 bits to 21 bits. IV.
RESULTS AND CONCLUSION

The proposed designs for both radix-2 and radix-4 FFT algorithms have been realized by Verilog-HDL and implemented on an FPGA chip (STRATIX-III EP3SE50C2). Synthesis results shown in Table III confirm that the proposed design can reduce memory usage for FFT processors without a major increase in the number of logic elements used. Furthermore, the maximum clock frequency achieved was 247MHz in all implementations, indicating no delay penalty has occurred. The implementation results are

2693

You might also like