2012 Second International Conference on Intelligent System Design and Engineering Application

A Novel Design of 1024-point Pipelined FFT Processor Based on Cordic Algorithm

Shi Jiangyi, Tian Yinghui, Wang Mingxing, Yang Zhe
Dept. of Microelectronics, Xidian University, Xian, Shaanxi, 710071, China
School of Technical Physics, Xidian University, Xian, Shaanxi, 710071, China
nmgtyh@163.com

Abstract-A novel ASIC design of a 1024-point pipelined FFT processor is introduced in this paper. The new FFT architecture is based on the radix-2 algorithm and the CORDIC algorithm. The design achieves a high-frequency FFT processor whose clock frequency can reach 330 MHz. It also saves up to fifty percent of the hardware resources compared with a traditional FFT processor.

Keywords-FFT; pipeline; cordic


The DFT plays an important role in digital signal processing; it is a basic operation for transforming a signal from the time domain to the frequency domain. The FFT, a fast algorithm for computing the DFT, is widely used in wireless communication, voice and image processing, spectrum analysis and so on. As the number of points increases, the hardware resources required multiply while the computational speed drops. The common way to enhance the performance of an FFT processor is to increase the number of pipeline stages, but this increases hardware resources and power, and the speed is still limited. Therefore, the scale of hardware resources must be reduced while high performance is maintained. The CORDIC algorithm and a new structure are adopted in this design to reduce hardware resources and achieve a high frequency and better performance. An N-point DFT is defined as:

$$X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}, \quad k = 0, 1, 2, \ldots, N-1 \qquad (1)$$

where $W_N^{nk}$ refers to $\exp(-j 2\pi nk / N)$ and x(n) is the input signal. The FFT, a fast algorithm for computing equation (1), can be classified as decimation-in-time (DIT) or decimation-in-frequency (DIF) according to the decimation method, and as radix-2, radix-4, etc. according to the radix. Using radix-2 DIT, x(n) in equation (1) is divided into an odd part and an even part. An N/2-point DFT is computed for each part to obtain X1(k) and X2(k). Based on the periodicity and symmetry of $W_N^{nk}$, the following equations are obtained:

$$X(k) = X_1(k) + W_N^{k} X_2(k) \qquad (2)$$

$$X(k + N/2) = X_1(k) - W_N^{k} X_2(k) \qquad (3)$$

where k = 0, 1, 2, ..., N/2-1. The N-point DFT of x(n) can be obtained from equations (2) and (3) once the values of X1(k) and X2(k) have been identified in the way described above.

Equations (2) and (3), also known as the butterfly algorithm, are shown in Figure 1. Figure 2 illustrates an 8-point FFT algorithm, which is composed of basic butterflies.

Figure 1. The basic element of butterfly algorithm

Figure 2. 8-point FFT algorithm

The 1024-point pipelined FFT processor presented in this paper is based on the radix-2 algorithm. The system consists of a serial-to-parallel module, ten FFT computational modules, a parallel-to-serial module and a sequence reversion module. Figure 3 shows the architecture of the FFT processor.
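The radix-2 decomposition that the processor builds on can be checked numerically. The following is a minimal software sketch, not the hardware design: it evaluates equation (1) directly and compares it with a recursive radix-2 DIT FFT assembled from the butterflies of equations (2) and (3). The function names are illustrative, and the assignment of X1(k) to the even-indexed samples and X2(k) to the odd-indexed samples is the conventional choice, assumed here because the paper does not state it.

```python
import cmath


def dft(x):
    """Direct N-point DFT, evaluated term by term as in equation (1)."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_points) for n in range(n_points))
        for k in range(n_points)
    ]


def fft_radix2_dit(x):
    """Recursive radix-2 DIT FFT; len(x) must be a power of two."""
    n_points = len(x)
    if n_points == 1:
        return [complex(x[0])]
    x1 = fft_radix2_dit(x[0::2])  # assumed: X1(k) comes from the even-indexed samples
    x2 = fft_radix2_dit(x[1::2])  # assumed: X2(k) comes from the odd-indexed samples
    half = n_points // 2
    out = [0j] * n_points
    for k in range(half):
        twiddled = cmath.exp(-2j * cmath.pi * k / n_points) * x2[k]  # W_N^k * X2(k)
        out[k] = x1[k] + twiddled         # equation (2)
        out[k + half] = x1[k] - twiddled  # equation (3)
    return out


samples = list(range(1, 9))
max_error = max(abs(a - b) for a, b in zip(dft(samples), fft_radix2_dit(samples)))
print(max_error < 1e-9)
```

Each level of the recursion corresponds to one butterfly stage, so a 1024-point transform needs ten stages, which matches the ten FFT computational modules in the architecture.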
978-0-7695-4608-7/12 $26.00 © 2012 IEEE
DOI 10.1109/ISdea.2012.503

Figure 3. New architecture of 1024-point pipelined FFT
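The sequence reversion module in this architecture restores the natural output order. A 1024-point transform needs a 10-bit address, and the paper performs the reversion by reversing the address bus. The sketch below models that in software, assuming that reversing the bus means reversing the order of the address bits (the usual bit-reversed ordering of a radix-2 FFT). The function names are hypothetical.

```python
def bit_reverse(address, width=10):
    """Reverse the lowest `width` bits of `address` (10 bits for 1024 points)."""
    reversed_address = 0
    for _ in range(width):
        reversed_address = (reversed_address << 1) | (address & 1)
        address >>= 1
    return reversed_address


def to_natural_order(bit_reversed_samples, width=10):
    """Read every output sample from its bit-reversed address."""
    return [bit_reversed_samples[bit_reverse(i, width)] for i in range(1 << width)]


print(bit_reverse(1))                          # address 0000000001 maps to 1000000000
print(to_natural_order(list(range(8)), width=3))
```

In hardware no arithmetic is required for this step: the ten address wires are simply connected in the opposite order.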

A. Serial-to-parallel and stage one module

In this paper, the serial-to-parallel module and the stage one module are connected together; they work as a whole. The structure shown in Figure 4 comprises 4 SRAMs. Every two registers work together, using ping-pong operation to ensure the continuous output of data; the four registers combined make up the ping-pong operation. In the serial-to-parallel module, the frequency of the clock controlling reads is half that of the clock controlling writes, ensuring that the number of inputs equals the number of outputs. In stage one, there is only one butterfly factor; therefore, an adder can accomplish the butterfly computation. The data passed from the serial-to-parallel module to the stage one module (just one adder) can be accessed simultaneously, so the two results go directly to the adder to accomplish the computation of stage one. Because there is only one butterfly factor in stage one, one adder can achieve its function. This design spares the CORDIC multiplier and the ROM used to store the butterfly factor, saving a lot of resources: only four registers are needed to implement the function.

Figure 4. The architecture of Serial-to-parallel and stage one module

B. FFT compute module

The first and the second stages are simplified from the third stage. The principle of stage two is the same as that of stage one. Figure 5 shows the structure of stage three, in which the depth of the memory is 128. In stage three, the depths of SRAM1, SRAM2 and SRAM3 are all 128 (SRAM1 and SRAM3 are single-port RAMs, while SRAM2 is a dual-port RAM used for synchronous read and write). The output of the previous stage is made of two parts: the result of the adder and the result of the subtracter. The result of the adder goes into SRAM1, while the result of the subtracter goes into SRAM2 and SRAM3. When the number of data coming from the adder equals 128, the result of the adder goes into the butterfly module directly; meanwhile, the data stored in SRAM1 beforehand go into the butterfly module as well, completing the butterfly computation. While the butterfly computation is running, the data from the subtracter go into SRAM3 until the butterfly computation of the adder part is over. Controlled by the control module, data in SRAM2 and SRAM3 go into the butterfly module synchronously, and data from the adder go into SRAM1 consecutively. A control module is incorporated to control the processing above, ensuring that the two states work alternately. Figure 6 and Figure 7 illustrate the compute processing and the state diagram respectively. When the stage number increases by 1, the depth of the memory decreases by fifty percent accordingly. So this structure saves a lot of resources compared to other designs.

Figure 5. The structure of stage three

Figure 6. Compute processing

C. Parallel-to-serial module

This module is similar to the serial-to-parallel module except that it is a 2-data-input and 1-data-output module.
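The halving of the memory depth described for the FFT compute module can be tabulated with a few lines of arithmetic. This is a worked illustration only: it takes the stated depth of 128 at stage three and assumes that the halving continues unchanged through stage ten, the last of the ten FFT computational modules. The function name and the printed total are illustrative, not figures from the paper.

```python
def stage_depths(first_stage=3, last_stage=10, first_depth=128):
    """Memory depth per stage, halving each time the stage number rises by 1."""
    return {
        stage: first_depth >> (stage - first_stage)
        for stage in range(first_stage, last_stage + 1)
    }


depths = stage_depths()
for stage, depth in depths.items():
    print(f"stage {stage}: depth {depth}")

# Assumed pattern: each stage keeps the same set of SRAMs as stage three,
# so this sum is the total depth per SRAM position across stages 3 to 10.
print("total depth per SRAM position:", sum(depths.values()))
```

Under these assumptions the depth falls from 128 at stage three to 1 at stage ten, and the depths of all eight stages sum to 255, less than twice the depth of stage three alone.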

Figure 7. State diagram

In Figure 7, s0, s1, s2 and s3 stand for the states, while the numbers 1, 2, 3, 4 and 5 stand for the state shift conditions, which are described in detail in Table 1.

TABLE 1. Descriptions of the Shift Conditions

Number    Description
1         ready=1
2         address=127, ready=1
3         address==127, ready=1
4         address=127, ready=1
5         address==127, ready=0, add_rom=3

D. Cordic module

The multiplier widely used in complex multiplication modules costs four multipliers and two adders in each butterfly computation. In this paper, however, a CORDIC module is used to complete the complex multiplication instead; it costs only one CORDIC module in each butterfly computation and thus saves a lot of resources. The structure of the whole module is shown in Figure 8. The cordic-pre module, cordic module and cordic-last module have been introduced in many papers, but the cordic adjust module may not be adopted frequently. The CORDIC is implemented by iteration, which consists of shift and addition operations. After each operation, the modulus increases. According to theoretical calculation, the correct value is obtained by multiplying the value output from the cordic-last module by the constant 0.60725. The iterative formulas (4) and (5) are used to achieve this goal:

$$x_{i+1} = x_i - \delta_i x_i 2^{-i} \qquad (4)$$

$$y_{i+1} = y_i - \delta_i y_i 2^{-i} \qquad (5)$$

where $\delta_i$ is 0, 1 or -1 and is chosen from Table 2. Using this module, the accuracy can be improved greatly.

TABLE 2. Description of the Shift Conditions [values of $\delta_0$ to $\delta_{15}$, each 0, 1 or -1]

The accuracy of the CORDIC is related to the number of iterations and the bit width. In the CORDIC module of the proposed design, 18 iterations are adopted to obtain a higher precision. However, errors are inevitable. Five protection bits used in this design enhance the precision greatly.

Figure 8. The structure of the whole module

E. Inversion sequence module

In this paper, a 1024-point FFT processor is designed, so a 10-bit-wide address bus is needed. To accomplish the sequence reversion, the address bus has to be reversed, as shown in Figure 9.

Figure 9. Inversion sequence

III. PERFORMANCE ANALYSIS

The 1024-point pipelined FFT processor is the core element of a radar signal processing chip. The system has been verified at RTL level in the ModelSim 6.5 SE simulation environment and has passed verification on the DC verification platform. From the DC report, the frequency can reach as high as 330 MHz and the area is smaller than that of traditional designs. The input data run from 1 to 1024. Figure 10 shows the real part and the imaginary part of the output, and Figure 11 shows the output of the Matlab simulation. The x-axis stands for the sequence of the output while the y-axis stands for its value. It is quite clear from Figure 11 that both graphs have the same trend, indicating that the result is right. From the comparison with Matlab, the precision is 0.5%.

Because a complex multiplication module uses either a general multiplier or a CORDIC multiplier, Figure 12 shows the relative error of the proposed FFT processor while Figure 13 shows the relative error of other FFT processors that use a general multiplier. For both figures, the x-axis stands for the sequence of the output and the y-axis stands for the relative error. From Figure 12, we can see that the maximum value is less than 2.5%. However, from Figure 13, we can find that the maximum value is less than 0.6%. So the precision is improved a lot in our proposed design. The ModelSim simulation diagram is shown in Figure 14. From the figure, it is quite clear that the proposed design is pipelined.

Figure 10. The output of FFT processor

Figure 11. The output of Matlab

Figure 12. The comparison between FFT processor and Matlab

Figure 13. The comparison between FFT processor and Matlab

Figure 14. The modelsim simulation diagram

IV. CONCLUSION

This paper shows the flow of our design, a 1024-point pipelined processor. By using the new structure, the SRAM is saved while the function of the pipeline is secured. By using the CORDIC algorithm, the ROM is saved while the frequency and the precision are enhanced greatly.

ACKNOWLEDGMENT

Support for this work was provided by the Natural Science Basic Research Plan in Shaanxi Province of China (2010JM8015).
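As an illustration of the CORDIC module of Section D, the sketch below multiplies a complex sample by a twiddle factor using only shift-and-add style iterations, followed by the gain correction of 0.60725. It is a floating-point model under stated assumptions, not the fixed-point RTL. The 18 iterations and the constant 0.60725 come from the paper, while the function names, the quadrant pre-rotation (a guess at the role of a pre-processing stage) and the single final multiplication in place of formulas (4) and (5) are assumptions. The five protection bits are not modeled.

```python
import math

ITERATIONS = 18            # iteration count stated in the paper
GAIN_CORRECTION = 0.60725  # constant stated in the paper


def cordic_rotate(x, y, angle, iterations=ITERATIONS):
    """Rotate (x, y) by `angle` radians, for angle in [-pi, pi]."""
    # Assumed pre-rotation by 90 degrees so the remaining angle lies within
    # the convergence range of the basic iteration.
    if angle > math.pi / 2:
        x, y, angle = -y, x, angle - math.pi / 2
    elif angle < -math.pi / 2:
        x, y, angle = y, -x, angle + math.pi / 2
    z = angle
    for i in range(iterations):
        direction = 1.0 if z >= 0.0 else -1.0
        shift = 2.0 ** -i  # a right shift by i bits in hardware
        x, y = x - direction * y * shift, y + direction * x * shift
        z -= direction * math.atan(shift)  # a small lookup table in hardware
    # Every iteration lengthens the vector; the constant removes that gain.
    return x * GAIN_CORRECTION, y * GAIN_CORRECTION


def twiddle_multiply(value, k, n_points=1024):
    """Approximate value * exp(-j*2*pi*k/n_points) with the rotation above."""
    re, im = cordic_rotate(value.real, value.imag, -2.0 * math.pi * k / n_points)
    return complex(re, im)


result = twiddle_multiply(1 + 0j, 128)  # rotation by -pi/4
expected = complex(math.cos(-math.pi / 4), math.sin(-math.pi / 4))
print(abs(result - expected) < 1e-4)
```

The rotation replaces the four multipliers and two adders of a general complex multiplier with additions and shifts, which is the resource saving that Section D describes.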