
2009 International Conference on Electrical Engineering and Informatics 5-7 August 2009, Selangor, Malaysia

64-point Fast Efficient FFT Architecture Using Radix-2³ Single Path Delay Feedback
Trio Adiono, Muh Syafiq Irsyadi, Yan Syafri Hidayat, Ade Irawan
Electrical Engineering and Informatics School, Bandung Institute of Technology Jl. Ganesha 10, Bandung 40132, Indonesia
tadiono@paume.itb.ac.id syafiq@students.ee.itb.ac.id yayan_sh@yahoo.com ade_gawa@yahoo.com

Abstract
Here we present a new design of a 64-point Fast Fourier Transform circuit. The design is derived from the radix-2³ algorithm and implemented using the Single Path Delay Feedback architecture. This approach ensures high memory and multiplier utilization. The 64-point FFT is realized by decomposing it into a two-dimensional structure of 8-point FFTs. Each of these FFTs is further decomposed into 4-point and 2-point FFTs. This decomposition reduces the non-trivial twiddle factors to just one, so the design needs only one complex multiplier. The complex multiplier is realized using the modified Booth (radix-4) encoding algorithm to achieve faster computational speed. The validity and efficiency of the proposed circuit have been verified by functional simulation, timing simulation, and FPGA implementation. The proposed design has been successfully synthesized using Synopsys with TSMC 0.18 µm technology. The core area is 0.47 mm². The power consumption is 29.7 mW. The time delay is 6 ns. The circuit computes one serial-to-serial data set in 116 clock cycles. Thus our design has three advantages: small area, low power consumption, and fast computation.

Keywords
FFT, R2³SDF, radix-2³.

I. INTRODUCTION
The FFT is used in innumerable signal processing applications and is often an important building block in such systems. Many of these applications require real-time operation in order to be useful. While Digital Signal Processors (DSPs) are available that can perform an FFT fast enough to keep up with many real-time applications, some systems require additional computation or have speed requirements that exceed the capabilities of a DSP alone. It is in these situations that dedicated logic for computing an FFT proves useful. The pipeline FFT processor is a specific class of processor for DFT computation utilizing fast algorithms. It is characterized by real-time, non-stop processing as the data sequence passes through the processor.

II. THEORY
This algorithm is based on the fact that a radix-8 FFT can be decomposed into radix-4 and radix-2 FFTs in order to reduce computational complexity. Recall the DFT algorithm,

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad 0 \le k < N$$

Break this DFT algorithm into a three-dimensional index map.

$$n = \frac{N}{4} n_1 + \frac{N}{8} n_2 + n_3, \qquad 0 \le n_1 \le 3,\quad 0 \le n_2 \le 1,\quad 0 \le n_3 \le \frac{N}{8} - 1$$

$$k = k_1 + 4k_2 + 8k_3, \qquad 0 \le k_1 \le 3,\quad 0 \le k_2 \le 1,\quad 0 \le k_3 \le \frac{N}{8} - 1$$

Substitute the new index into the DFT algorithm.

$$X(k) = \sum_{n_3=0}^{\frac{N}{8}-1} \sum_{n_2=0}^{1} \sum_{n_1=0}^{3} x\!\left(\frac{N}{4} n_1 + \frac{N}{8} n_2 + n_3\right) W_N^{nk} = \sum_{n_3=0}^{\frac{N}{8}-1} \sum_{n_2=0}^{1} BF_4\!\left(\frac{N}{8} n_2 + n_3,\, k_1\right) W_N^{\left(\frac{N}{8} n_2 + n_3\right)(k_1 + 4k_2 + 8k_3)}$$

where $BF_4$ is the 4-point butterfly over $n_1$, obtained because $W_N^{\frac{N}{4} n_1 (k_1 + 4k_2 + 8k_3)} = W_4^{n_1 k_1}$:

$$BF_4\!\left(\frac{N}{8} n_2 + n_3,\, k_1\right) = \sum_{n_1=0}^{3} x\!\left(\frac{N}{4} n_1 + \frac{N}{8} n_2 + n_3\right) W_4^{n_1 k_1}$$

Decompose the twiddle factor.

$$W_N^{\left(\frac{N}{8} n_2 + n_3\right)(k_1 + 4k_2 + 8k_3)} = W_8^{n_2 (k_1 + 4k_2)}\, W_N^{n_3 (k_1 + 4k_2)}\, W_{N/8}^{n_3 k_3}$$

Substitute the decomposed twiddle factor.

$$X(k) = \sum_{n_3=0}^{\frac{N}{8}-1} \sum_{n_2=0}^{1} BF_4\!\left(\frac{N}{8} n_2 + n_3,\, k_1\right) W_8^{n_2 (k_1 + 4k_2)}\, W_N^{n_3 (k_1 + 4k_2)}\, W_{N/8}^{n_3 k_3} = \sum_{n_3=0}^{\frac{N}{8}-1} \left[ H(n_3, k_1, k_2)\, W_N^{n_3 (k_1 + 4k_2)} \right] W_{N/8}^{n_3 k_3}$$

Now let us recall the 64-point radix-8 FFT algorithm [2].

$$X(s + 8t) = \sum_{l=0}^{7} \left[ W_{64}^{sl} \sum_{m=0}^{7} x(l + 8m)\, W_8^{sm} \right] W_8^{lt}, \qquad 0 \le s, t \le 7$$

We can clearly see that,

978-1-4244-4913-2/09/$25.00 2009 IEEE EE-27 654

$$W_{64}^{sl} \sum_{m=0}^{7} x(l + 8m)\, W_8^{sm} = H(n_3, k_1, k_2)\, W_N^{n_3 (k_1 + 4k_2)}$$

with $l = n_3$ and $s = k_1 + 4k_2$, where N = 64 and,

$$H(n_3, k_1, k_2) = BF_4(n_3, k_1) + BF_4\!\left(n_3 + \frac{N}{8},\, k_1\right) W_8^{(k_1 + 4k_2)}$$
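The decomposition above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it evaluates $X(k)$ through $BF_4$ and $H$ and compares the result with a direct DFT. The function names (`dft`, `bf4`, `h_term`, `dft_via_index_map`) are hypothetical and chosen only for illustration.

```python
import cmath

def w(N, e):
    """Twiddle factor W_N^e = exp(-j*2*pi*e/N)."""
    return cmath.exp(-2j * cmath.pi * e / N)

def dft(x):
    """Direct DFT, used only as a reference."""
    N = len(x)
    return [sum(x[n] * w(N, n * k) for n in range(N)) for k in range(N)]

def bf4(x, base, k1):
    """4-point butterfly over n1 for the sample offset base = (N/8)*n2 + n3."""
    N = len(x)
    return sum(x[(N // 4) * n1 + base] * w(4, n1 * k1) for n1 in range(4))

def h_term(x, n3, k1, k2):
    """H(n3, k1, k2): radix-2 combination of two radix-4 butterflies."""
    N = len(x)
    return bf4(x, n3, k1) + bf4(x, n3 + N // 8, k1) * w(8, k1 + 4 * k2)

def dft_via_index_map(x):
    """Evaluate X(k) with k = k1 + 4*k2 + 8*k3 using the decomposed form."""
    N = len(x)
    X = [0j] * N
    for k1 in range(4):
        for k2 in range(2):
            for k3 in range(N // 8):
                X[k1 + 4 * k2 + 8 * k3] = sum(
                    h_term(x, n3, k1, k2)
                    * w(N, n3 * (k1 + 4 * k2))   # the only non-trivial twiddle
                    * w(N // 8, n3 * k3)         # handled by the next 8-point stage
                    for n3 in range(N // 8)
                )
    return X
```

For a 64-sample input, `dft_via_index_map(x)` should agree with `dft(x)` up to floating-point rounding.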

From the last equation we have shown that the first stage of the 64-point radix-8 FFT can be decomposed into radix-4 and radix-2 FFTs. The second stage of the radix-8 FFT can be decomposed into radix-4 and radix-2 FFTs using the same method. The real advantage of this method is that $W_8^{sm}$ and $W_8^{lt}$ are trivial twiddle factors. Each is actually an addition/subtraction operation followed by a multiplication by $1/\sqrt{2}$, which can be realized using only a hardwired shift-and-add operation [2]. The only non-trivial twiddle factor is $W_{64}^{sl}$. Detailed derivations of the radix-8 and radix-2³ FFT algorithms can be found in [2] and [3].

III. DESIGN ARCHITECTURE
The block diagram of the 64-point FFT processor derived from Section II is depicted in Figure 1. It consists of four stages of butterfly feedback structure and one reorder stage. The architecture itself is based on the Single Path Delay Feedback architecture. The reason is that the delay-feedback approach is always more efficient than the corresponding delay-commutator approach in terms of memory utilization, since the stored butterfly output can be used directly by the multiplier [2]. The unusual mixed-radix structure, which consists of a radix-4 butterfly followed by a radix-2 butterfly, followed by another radix-4 and radix-2 butterfly, is intended to retain the advantage of the radix-8 FFT: only one non-trivial twiddle factor is needed, and yet this new approach has a simpler butterfly structure and higher butterfly utilization than radix-8.

Controller: In this design we do not implement a master controller. Each butterfly has its own controller that is independent of the others. This approach leads to a modular and general butterfly structure. Each controller is activated by the head signal from the previous stage. The controller itself is a (log2 N)-bit binary counter. In each butterfly, the counter is divided into four or two groups of cycles for the radix-4 and radix-2 butterfly respectively. Each group of counts is called a phase (ph). These phases control the memory modules, the butterfly operation, and the twiddle multiplication. Another control signal called stage (st) is needed by the twiddle stage to choose the multiplicative operation.

Radix-4 butterfly (stage 1 and stage 3): Stage 1 and stage 3 are radix-4 butterfly modules. There are four phases that control the butterfly operation. In the first three phases the input data is inserted directly into the shift register, while the previous data is taken to the output. The butterfly computation only happens in the last phase.

Radix-2 butterfly (stage 2 and stage 4): Stage 2 and stage 4 are radix-2 butterfly modules. As with stages 1 and 3, the only difference between stages 2 and 4 is the shift register length. There are two phases that control the butterfly operation.

Trivial twiddle factor: In this design there are four cases of trivial twiddle factor, one for each phase. From the algorithm in Section II, we can conclude that only the second half of the data in each phase needs to be multiplied by the trivial twiddle factor. The first half remains unchanged. That is why we need another control signal, which changes every eight clock cycles, to tell the twiddle factor mechanism whether the data needs to be multiplied or not.
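The two-phase behavior of a radix-2 delay-feedback stage can be illustrated in software. This is a behavioral sketch of the generic textbook radix-2 SDF stage, assuming a feedback shift register of `delay` words and omitting twiddle multiplication; it is not the authors' HDL, and the function name `sdf_radix2_stage` is hypothetical.

```python
from collections import deque

def sdf_radix2_stage(samples, delay):
    """Stream samples through one radix-2 single-path delay-feedback stage.

    Phase 0: the input is stored in the shift register while the previously
             stored word (a difference from the last block) goes to the output.
    Phase 1: the butterfly runs; the sum goes to the output and the
             difference is fed back into the shift register.
    """
    shift_register = deque([0] * delay)
    output = []
    for count, x in enumerate(samples):
        phase = (count // delay) % 2       # derived from a free-running counter
        stored = shift_register.popleft()
        if phase == 0:
            output.append(stored)
            shift_register.append(x)
        else:
            output.append(stored + x)      # butterfly sum
            shift_register.append(stored - x)  # butterfly difference, fed back
    return output
```

Feeding `[1, 2, 3, 4, 5, 6, 7, 8]` followed by four zeros through a stage with `delay=4` yields four start-up zeros, then the sums `6, 8, 10, 12`, then the differences `-4, -4, -4, -4`.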
TABLE 1 TRIVIAL TWIDDLE FACTOR CONSTANT

Phase    Twiddle constant multiplier
00       1
01       (1 - j)/√2
10       -j
11       -(1 + j)/√2
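The four constants in Table 1 coincide with the powers $W_8^0$ to $W_8^3$. The sketch below checks this and shows how each multiplication reduces to swaps, sign changes, one addition/subtraction, and a single scaling by $1/\sqrt{2}$. The names `TABLE1` and `apply_trivial_twiddle` are hypothetical, and floating-point arithmetic stands in for the fixed-point datapath.

```python
import cmath
import math

INV_SQRT2 = 1 / math.sqrt(2)

# Table 1, indexed by the 2-bit phase.
TABLE1 = {
    0b00: 1 + 0j,
    0b01: (1 - 1j) * INV_SQRT2,
    0b10: -1j,
    0b11: -(1 + 1j) * INV_SQRT2,
}

def w8(p):
    """W_8^p = exp(-j*2*pi*p/8)."""
    return cmath.exp(-2j * cmath.pi * p / 8)

def apply_trivial_twiddle(phase, re, im):
    """Multiply (re + j*im) by the Table 1 constant without a general multiplier."""
    if phase == 0b00:
        return re, im                                      # no change
    if phase == 0b01:
        return (re + im) * INV_SQRT2, (im - re) * INV_SQRT2
    if phase == 0b10:
        return im, -re                                     # swap and invert
    return (im - re) * INV_SQRT2, -(re + im) * INV_SQRT2
```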

As can be seen in Table 1, in phase 0 and phase 2 the multiplication amounts to either no change at all or just swapping and inverting the real and imaginary parts. In phase 1 and phase 3 it involves an addition/subtraction and a multiplication by the constant $1/\sqrt{2}$. Following [2], the constant to be multiplied is known a priori, so it can be decomposed as a sum/difference of powers of 2. This in essence results in a shift-and-add architecture. The constant $1/\sqrt{2}$ can be decomposed in terms of powers of 2 into $(2^{-1} + 2^{-3} + 2^{-4} + 2^{-6} + 2^{-8})$. With this representation, the multiplication of the input data by this constant turns into an addition of right-shifted values of the input data.

Figure 1 Proposed R2³SDF pipeline FFT architecture

Non-trivial twiddle factor: This operation uses ROMs to store the twiddle factors and one complex multiplier to perform the operation. The ROMs are very simple: we implement two arrays of constants to store the twiddle factors. The real and imaginary parts of the twiddle factors are stored in the first and second array respectively. We implement a custom-built multiplier based on the radix-4 recoding technique (modified Booth recoding). This approach proved to be the most efficient multiplier in terms of AT (area × time delay) compared to the Synopsys standard multiplier (using the * operator) and the standard multiplier plus shuffle network version (intended to reduce the twiddle factor constants). The complete comparison is presented in Table 2.
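Returning to the trivial twiddle factors, the hardwired scaling by $1/\sqrt{2}$ can be sketched with integer shifts and additions. This assumes the 8-bit binary expansion 0.10110101 (that is, $2^{-1} + 2^{-3} + 2^{-4} + 2^{-6} + 2^{-8} = 0.70703125$); the name `scale_inv_sqrt2` and the truncating shifts are illustrative choices, not details taken from the design.

```python
# Assumed shift amounts: 1/sqrt(2) ~ 0.10110101 in binary.
SHIFTS = (1, 3, 4, 6, 8)

# Effective coefficient realized by the shift-and-add network.
COEFF = sum(2.0 ** -s for s in SHIFTS)

def scale_inv_sqrt2(x: int) -> int:
    """Approximate x / sqrt(2) using only right shifts and additions.

    Each shift truncates toward minus infinity, as an arithmetic
    right shift on a two's-complement word would.
    """
    return sum(x >> s for s in SHIFTS)
```

For example, `scale_inv_sqrt2(16384)` gives 11584, against an exact value of about 11585.2.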
TABLE 2 MULTIPLIER COMPARISONS

Multiplier design             Area (µm²)    Time delay (ns)    AT
Standard (*)                  59918.4       6.77               405647.8
Standard + shuffle network    59901.8       8.46               506769.3
Radix-4 recoding              87577.4       3.98               348558.2

From Table 2 it can be clearly seen that radix-4 recoding is the best choice in terms of speed and AT. The other advantage of using a custom multiplier is that the synthesized circuit is independent of the synthesis tool. The radix-4 recoding multiplier relies on a recoding process intended to reduce the number of partial products. This is achieved by recoding the multiplier from 2's-complement format into a signed-digit representation from the set {0, ±1, ±2} [5]. The radix-4 recoding starts by appending a zero to the right of $x_0$ (the multiplier LSB). Triplets are taken beginning at position $x_{-1}$ and continuing to the MSB, with one bit overlapping between adjacent triplets. If the number of bits in X (excluding $x_{-1}$) is odd, the sign (MSB) is extended one position to ensure that the last triplet contains 3 bits. In every step we get a signed digit that multiplies the multiplicand to generate a partial product. The recoding table is presented in Table 3.

TABLE 3 RADIX-4 RECODING

x(i+2) x(i+1) x(i)    Partial product
000                   0Y
001                   +1Y
010                   +1Y
011                   +2Y
100                   -2Y
101                   -1Y
110                   -1Y
111                   0Y

In the straightforward implementation, complex multiplication needs four real multipliers and two adders. So we would need four Booth recoders to implement the multiplication using radix-4 recoding. But if we closely examine the multiplication formula,

$$(a + jb)(c + jd) = (ac - bd) + j(bc + ad)$$

and always keep one pair (a and b, for example) as the multiplier and the other pair (c and d) as the multiplicand, then we only need two radix-4 recoders instead of four [7]. The circuit block diagram is presented in the figure below. There are four inputs. Inputs a and b are recoded to choose the appropriate partial products. Once the radix-4 recoded partial products have been generated, they need to be shifted and added. To produce the real part, the sum of the second set of partial products is subtracted from the sum of the first. The imaginary part is the sum of the other two. The micro-architecture for radix-4 recoding is presented in Figure 2.
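The recoding rule and the two-recoder complex multiplication can be sketched in software as follows. This is an illustrative integer model under the assumption of two's-complement operands that fit in `bits` bits; the function names are hypothetical and the code does not model the shift-and-add hardware structure.

```python
# Recoding table indexed by the triplet (x[i+1], x[i], x[i-1]).
RECODE = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
          0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}

def booth_radix4_digits(x, bits=16):
    """Recode a two's-complement integer into radix-4 signed digits, LSB first."""
    if bits % 2:
        bits += 1                      # odd width: extend the sign by one position
    pattern = x & ((1 << bits) - 1)    # sign-extended two's-complement bit pattern
    pattern <<= 1                      # append the zero at position x[-1]
    return [RECODE[(pattern >> i) & 0b111] for i in range(0, bits, 2)]

def sum_partial_products(digits, y):
    """Shift and add the partial products digit * y * 4**j."""
    return sum((d * y) << (2 * j) for j, d in enumerate(digits))

def booth_multiply(x, y, bits=16):
    """Real multiplication x * y through radix-4 recoding of x."""
    return sum_partial_products(booth_radix4_digits(x, bits), y)

def complex_multiply(a, b, c, d, bits=16):
    """(a + jb)(c + jd) using only two recoders: one for a and one for b."""
    digits_a = booth_radix4_digits(a, bits)
    digits_b = booth_radix4_digits(b, bits)
    real = sum_partial_products(digits_a, c) - sum_partial_products(digits_b, d)
    imag = sum_partial_products(digits_b, c) + sum_partial_products(digits_a, d)
    return real, imag
```

Because a and b are each recoded once and their digits are reused against both c and d, the sketch mirrors the reduction from four recoders to two.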


Figure 2 Architecture for a complex multiplier circuit with twiddle factor ROM

Reorder: The reorder stage is an integral part of the design, needed to realize ordered serial-to-serial data input-output. We implement the reorder stage using only shift registers and multiplexers. The shift registers store the data temporarily before it is taken out as the output. We need 98 blocks of shift registers for the design. As the selector, we implement a 64-to-1 mapping using multiplexers.

IV. VERIFICATION AND IMPLEMENTATION
The verification process includes functional simulation, waveform simulation, and Signal Tap capture on the FPGA. Functional simulation was done to check that the HDL design matches the model. After the functional simulation was complete, the architecture was synthesized for the TSMC 0.18 µm library using Synopsys. The synthesis result is presented in Table 4. The FPGA implementation is used to check whether the designed circuit functions correctly in the real world. We use an Altera Cyclone II EP2C35F672C6 board for this design implementation. We load the test vector and the expected result into ROM and compare the results. The output signal is captured using the Signal Tap II function of the Altera Quartus software. In Figure 3, we use a 15-cycle complex sinusoid as our test vector. The test vector signal is input continuously into the designed circuit. We use the 50 MHz internal clock to produce the clock signal. To capture the signals we implement a push button as our trigger. The push button itself only serves as a trigger and does not have any connection to the design. The head signal is automatically generated at the beginning of the first data using a counter.

Figure 3 FFT output from FPGA captured using Signal Tap II

TABLE 4 PERFORMANCE COMPARISON OF THE PROPOSED FFT CIRCUIT WITH THE REFERENCE DESIGN AND WITH AVAILABLE CHIPSETS

FFT circuit: Proposed (radix-2³ SDF) | Koushik [2] (radix-8) | T. Chen, L. Zhu [2] | T. Chen, Sunanda [2] | McCanny, D. Trainor [2]
Word length: 16 | 16 | 16 | 16 | 24
Technology (µm): 0.18 | 0.25 | 2 | 0.75 | 0.35
Cycles required: 116 | 23 (64) | 208 | 222 | 130
Core area (mm²): 0.47 | 6.8 | 282 | 156
Normalized area: 17780.6 | -
Time delay (ns): 6.03
Normalized time delay: 24.26 | -
Power (mW): 29.7 | 41 | 1300
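As a software cross-check of the test vector described in Section IV, a 15-cycle complex sinusoid can be generated and transformed with a reference DFT. This sketch assumes 15 full cycles per 64-sample frame, a positive-frequency exponent, and unit amplitude; the actual amplitude, sign convention, and fixed-point scaling used on the FPGA are not stated in the text.

```python
import cmath

N = 64        # transform length
CYCLES = 15   # assumed: 15 full cycles per 64-sample frame

def dft(x):
    """Direct DFT used as a golden reference."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]

# Complex sinusoid test vector and its spectrum.
test_vector = [cmath.exp(2j * cmath.pi * CYCLES * n / N) for n in range(N)]
spectrum = dft(test_vector)

# Index of the bin with the largest magnitude.
peak_bin = max(range(N), key=lambda k: abs(spectrum[k]))
```

Under these assumptions all the energy falls into bin 15, which gives a simple pass/fail signature for comparing captured hardware output.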
TABLE 5 AREA AND TIME DELAY SYNTHESIS RESULT

Design    Area (µm²)      Area (normalized)    Time delay (ns)    Time delay (normalized)
R2³SDF    473163.78125    17780.62478          6.03               24.25862069


V. CONCLUSIONS
A 64-point FFT architecture for high-speed WLAN systems based on OFDM transmission has been presented. This architecture is based on a decomposition of the 64-point FFT into four stages of 4-point and 2-point FFTs. The algorithm offers simple FFT computations, so the resulting algorithm-to-architecture mapping is well suited for hardware implementation. The design exhibits numerous attractive features from a VLSI point of view, which include regularity, modularity, and high throughput. The validity and efficiency of the proposed circuit have been verified by functional simulation, timing simulation, and FPGA implementation. The proposed design has been successfully synthesized using Synopsys with the TSMC 0.18 µm technology library. The core area is 0.47 mm². The power consumption is 29.7 mW. The time delay is 6 ns. The circuit computes one serial-to-serial data set in 116 clock cycles. Thus our design has three advantages: small area, low power consumption, and fast computation in terms of speed and clock latency. These advantages show that this design is well suited for high-performance WLAN systems.

REFERENCES
[1] Shousheng He, Mats Torkelson. A New Approach to Pipeline FFT Processor. Department of Applied Electronics, Lund University.
[2] Koushik Maharatna, Eckhard Grass, Ulrich Jagdhold. A 64-Point Fourier Transform Chip for High-Speed Wireless LAN Application Using OFDM. IEEE Journal of Solid State Circuit, Vol. 39, No. 3, March 2004.
[3] Modified radix-2³ FFT. Graduate Institute of Electronics Engineering, NTU.
[4] Wada Tomohisa. 64 Point Fast Fourier Transform Circuit (Version 1.0). Available: http://bw-www.ie.u-ryukyu.ac.jp/~wada/design07/spec_e.html
[5] J.A Hidalgo. A Radix-8 Multiplier Unit Design For Specific Purpose. Dept. de Electronics, E.T.S.I Industriales.
[6] Joel J. Fster, Karl S. Gugel. Pipelined 64-Point Fast Fourier Transform For Programmable Logic Devices. Dept. of Electrical and Computer Engineering, University of Florida.
[7] Geoff Knagge. ASIC Design for Signal Processing. Available: http://www.geoffknagge.com/.
[8] Lo'ai A. Tawalbeh, Alexandre F. Tenca and C. K. Koç. A Radix-4 Design of a Scalable Modular Multiplier With Recoding Techniques. School of Electrical Engineering & Computer Science, Oregon State University.
