
PIPELINED 64-POINT FAST FOURIER TRANSFORM FOR PROGRAMMABLE LOGIC DEVICES

Joel J. Fster and Karl S. Gugel
Dept. of Electrical and Computer Engineering, University of Florida, Gainesville, FL
ABSTRACT
This paper describes the design and implementation of a fully pipelined 64-point Fast Fourier Transform (FFT) in programmable logic. The FFT takes 20-bit fixed-point complex numbers as input and, after a known pipeline latency, produces 20-bit complex values representing the FFT of the input. It is designed to allow continuous input of samples and is therefore suitable for use in real-time systems. The modular design allows it to be used together with other 64-point FFTs to create larger sizes, much as this design is built using smaller 8-point FFTs. Such a design has many applications in high-speed real-time systems such as wireless networking, spectral analysis, recognition systems, and more.

1. INTRODUCTION
FFTs are used in innumerable signal processing applications and are often an important building block in such systems. Many of these applications require real-time operation in order to be useful. While Digital Signal Processors (DSPs) are available that can perform an FFT fast enough to keep up with many real-time applications, some systems require additional computation or have speed requirements that exceed the capabilities of a DSP alone. It is in these situations that dedicated logic for computing an FFT can be useful. This paper describes the interface, design, implementation, and testing of a 64-point FFT that takes advantage of pipelining, memory bank switching, and smaller FFTs to create a design capable of continuous real-time operation at high speeds.

2. INTERFACING
The FFT is designed to take complex values at the input, where the real and imaginary components each have 20 bits of precision. A two's-complement fixed-point format is used, with all numbers scaled to between -1.0 and 1.0. Getting the most accuracy out of the design requires that the inputs be scaled relative to the largest value that will occur in the input data. Additional logic or processing may be required to achieve this condition. The mapping between input numbers and their corresponding hexadecimal values is as follows:

    Input value         Hexadecimal
     1                  0x7FFFF
     1.907 x 10^-6      0x00001
     0                  0x00000
    -1.907 x 10^-6      0xFFFFF
    -1                  0x80000

This mapping is more practically represented by

$$n = \begin{cases} \lfloor 2^{19} s \rfloor, & 1 > s \ge 0 \\ \lfloor 2^{19} (2 + s) \rfloor, & 0 > s \ge -1 \end{cases} \tag{2.1}$$

where s is the input number scaled to the range -1 to 1 and n is the decimal value of the binary number to be fed into the input of the FFT. Output values follow the same format as described above. However, down-scaling is done as a byproduct of some of the internal stages, as described later in this paper. As a result, the output must be multiplied by 256 to correct for this. The mapping from the output codes of the FFT to their equivalent values is given by

$$s = \begin{cases} \dfrac{n}{2^{19}}, & \text{0x00000} \le n \le \text{0x7FFFF} \\ \dfrac{n}{2^{19}} - 2, & \text{0x80000} \le n \le \text{0xFFFFF} \end{cases} \tag{2.2}$$

where n is the hexadecimal equivalent of the binary output of the FFT and s is the corresponding fractional decimal value.
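The two mappings (2.1) and (2.2) can be illustrated with a short Python sketch. This is only an illustration of the number format, not part of the hardware design; the function names `to_fixed`, `from_fixed`, and `corrected_output` are hypothetical, and the factor of 256 is the output correction stated above.

```python
import math

FRAC_BITS = 19        # 20-bit word: 1 sign bit + 19 fractional bits
MAX_CODE = 0xFFFFF    # largest 20-bit code

def to_fixed(s: float) -> int:
    """Eq. (2.1): scaled value s in [-1, 1) -> 20-bit two's-complement code."""
    if not -1.0 <= s < 1.0:
        raise ValueError("s must satisfy -1 <= s < 1")
    if s >= 0:
        return math.floor(2**FRAC_BITS * s)
    return math.floor(2**FRAC_BITS * (2 + s))

def from_fixed(n: int) -> float:
    """Eq. (2.2): 20-bit code n -> fractional value s."""
    if not 0 <= n <= MAX_CODE:
        raise ValueError("n must be a 20-bit code")
    if n <= 0x7FFFF:
        return n / 2**FRAC_BITS
    return n / 2**FRAC_BITS - 2

def corrected_output(n: int) -> float:
    """Undo the internal down-scaling: the output must be multiplied by 256."""
    return 256 * from_fixed(n)
```

For example, `to_fixed(-1.0)` gives `0x80000` and `to_fixed(-2**-19)` gives `0xFFFFF`, matching the table of input values, since 2^-19 is approximately 1.907 x 10^-6.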

[Figure 1: Pipeline structure of FFT implementation. Blocks shown: 20-bit x 2 complex input, Bank Switched Memory A, 8-point FFT Unit A, Bank Switched Memory B, Twiddle Factor Multiplier, 8-point FFT Unit B, Bank Switched Memory C, 20-bit x 2 complex output, and a Controller/Memory Address Generator.]
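The bank switched memories named in Figure 1 are described in Section 3 as two 64-entry banks, one written while the other is read, with the read addresses generated in a modulo-8 order. A minimal software model of that behavior is sketched below. The class name `BankSwitchedMemory` and the exact read address formula (a stride of 8) are assumptions made for illustration; the paper does not list the controller's address sequence.

```python
class BankSwitchedMemory:
    """Two 64-entry banks: one is filled while the other is read out."""

    DEPTH = 64

    def __init__(self):
        self.banks = [[0j] * self.DEPTH, [0j] * self.DEPTH]
        self.write_bank = 0   # bank currently being filled
        self.write_addr = 0   # next sequential write address

    def write(self, sample):
        """Store one sample; switch banks once 64 samples have arrived."""
        self.banks[self.write_bank][self.write_addr] = sample
        self.write_addr += 1
        if self.write_addr == self.DEPTH:
            self.write_addr = 0
            self.write_bank ^= 1   # bank switch

    def read(self, i):
        """Return the i-th sample, in modulo-8 order, of the last filled bank.

        Assumed pattern: reads 0..7 return addresses 0, 8, ..., 56 (the
        samples with n mod 8 = 0), reads 8..15 return 1, 9, ..., 57, etc.
        """
        addr = 8 * (i % 8) + i // 8
        return self.banks[self.write_bank ^ 1][addr]
```

After 64 calls to `write`, the filled bank becomes readable while new samples go to the other bank, which is what allows continuous input in this model.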


$$
\begin{aligned}
t(1) &= x(1) + x(5) & q(1) &= t(1) + t(3) & s(1) &= q(1) + q(2) & y(1) &= s(1) \\
t(2) &= x(2) + x(6) & q(2) &= t(2) + t(4) & s(2) &= q(1) - q(2) & y(2) &= s(5) + s(7) \\
t(3) &= x(3) + x(7) & q(3) &= t(1) - t(3) & s(3) &= q(3) - j\,q(4) & y(3) &= s(3) \\
t(4) &= x(4) + x(8) & q(4) &= t(2) - t(4) & s(4) &= q(3) + j\,q(4) & y(4) &= s(5) - s(7) \\
t(5) &= x(1) - x(5) & q(5) &= t(5) & s(5) &= q(5) - j\,(1/\sqrt{2})\,q(6) & y(5) &= s(2) \\
t(6) &= x(2) - x(6) & q(6) &= t(6) + t(8) & s(6) &= q(5) + j\,(1/\sqrt{2})\,q(6) & y(6) &= s(6) - s(8) \\
t(7) &= x(3) - x(7) & q(7) &= t(7) & s(7) &= (1/\sqrt{2})\,q(8) - j\,q(7) & y(7) &= s(4) \\
t(8) &= x(4) - x(8) & q(8) &= t(6) - t(8) & s(8) &= (1/\sqrt{2})\,q(8) + j\,q(7) & y(8) &= s(6) + s(8)
\end{aligned}
$$

Figure 2: 8-point Winograd FFT

Both the input and output busses to the FFT are synchronized to the rising edge of the clock, so that an input value is captured and an output value is available on each rising edge. The maximum clock rate for the design is determined by the speed and design of the programmable logic device (PLD) that will be used. The simulated timing information for this design on an Altera Apex FPGA device is described later in this paper.

3. DESIGN
The core FFT algorithm chosen for this design is the Winograd 8-point FFT. This algorithm significantly reduces the number of multiplications needed versus other algorithms at the expense of an increase in the number of additions and memory needed [1-3]. For PLDs, multiplication is more expensive to implement than addition in terms of computation time and number of gates, and therefore the Winograd algorithm was chosen. The equations that describe how it is computed are shown in Figure 2.

The pipeline layout of the 64-point FFT is shown in Figure 1. There are 6 stages: two 8-point Winograd FFTs, one twiddle factor multiplier, and three bank switched memories. The 8-point FFT blocks have clocked shift registers at the input and output, but the FFT itself is computed with purely combinatorial logic. The only multiplications needed within this stage are a few multiplies by 1/√2, which are also built using straight combinatorial logic units. These units perform the multiplication by using shift-add techniques. The actual multiplication value used is 0.7071. It should be noted that the internal precision of these multiplication units and the rest of the 8-point FFT blocks is 24-bit, and that the output is scaled down by a divisor of 16 as a result of how the algorithm is implemented.

The 8-point FFT units have shift registers on their inputs and outputs, each one with eight positions for a 20-bit x 2 complex number. Each time the input register is loaded with a new group of eight values, it is copied to a latch from which the actual FFT is computed. In this way, the shift register can continue to load itself with new values while the FFT is running. The output register operates in a similar fashion, copying the output of the FFT from a parallel output latch and shifting the values out one at a time.

Combined with reordering and a twiddle factor multiplication stage, the two 8-point FFT units are used to produce the 64-point FFT. The data flow diagram for the algorithm is shown in Figure 3. It should be noted that the implementation of sixteen 8-point FFT units was avoided by instead multiplexing the use of two units at the expense of increased latency.

[Figure 3: Data flow for 64-point FFT. The input x(0..63) is split into eight groups x0..x7, each passed through an 8-point FFT to give s(0..7), s(8..15), ..., s(56..63). The values s(0..63) pass through the twiddle factor multiplication to give t(0..63), which are split into eight groups t0..t7, each passed through an 8-point FFT to give X0..X7, forming the output X(0..63). Here {x, t, X}k = {x, t, X}(n) where (n mod 8) = k.]

The three bank switched memory blocks are used to realize the multiplexing as well as to facilitate the sample reordering that is done at three different times in the data flow. Specifically, each memory block consists of two banks, each of which can store 64 20-bit x 2 complex numbers. While one bank is being written with the data from the previous stage, the other bank can be read from separately. By the time the bank being written to is filled with 64 new samples, the following pipeline stages are timed to have finished reading the 64 samples from the other bank. The banks are then switched, allowing the new data to be read out and the old bank to be loaded with more samples. In this way, continuous operation is possible.

Data reordering is accomplished by controlling the memory access pattern when reading the data out of the memory banks. The controller unit for the pipeline generates the memory read addresses for each block, creating the modulo-8 reordering system. The controller is also responsible for timing the start sequences between each pipeline stage, and generating the proper indices for the ROM in the twiddle factor unit. The controller is implemented simply as a 128-state state machine, using a counter as an address generator for a ROM that stores the values for all the control signals and addresses at each state.

The twiddle factor multiplier is simply a ROM coupled with a complex multiplier. The ROM stores the 64 pre-computed twiddle factors. The complex multiplication is accomplished by breaking the operation down into three multiplies and five additions, as shown in (3.3). The total latency for this stage is 7 cycles.

$$(x_r + j x_i)(t_r + j t_i) \tag{3.1}$$

$$= x_r t_r - t_i x_i + j t_i x_r + j t_r x_i \tag{3.2}$$

$$= t_r (x_r - x_i) + x_i (t_r - t_i) + j \left( x_r (t_r + t_i) - t_r (x_r - x_i) \right) \tag{3.3}$$
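The data path of this section can be checked with a floating-point Python sketch: the 8-point Winograd equations of Figure 2, the three-multiply complex product of (3.3), and the two-pass structure of Figure 3. It is a behavioral model only and makes several assumptions not stated in the paper: the names `winograd8`, `cmul3`, `fft64`, and `dft` are hypothetical, the twiddle factor for s(8k+p) is taken to be exp(-2πj·k·p/64) (the standard 8 x 8 decomposition), and the 24-bit fixed-point precision, the shift-add constant 0.7071, and the divide-by-16 scaling of the hardware are not modeled.

```python
import cmath
import math

R = 1 / math.sqrt(2)  # the hardware approximates this constant as 0.7071

def winograd8(x):
    """8-point FFT following Figure 2, with 0-based list indices."""
    t = [x[0] + x[4], x[1] + x[5], x[2] + x[6], x[3] + x[7],
         x[0] - x[4], x[1] - x[5], x[2] - x[6], x[3] - x[7]]
    q = [t[0] + t[2], t[1] + t[3], t[0] - t[2], t[1] - t[3],
         t[4], t[5] + t[7], t[6], t[5] - t[7]]
    s = [q[0] + q[1], q[0] - q[1],
         q[2] - 1j * q[3], q[2] + 1j * q[3],
         q[4] - 1j * R * q[5], q[4] + 1j * R * q[5],
         R * q[7] - 1j * q[6], R * q[7] + 1j * q[6]]
    return [s[0], s[4] + s[6], s[2], s[4] - s[6],
            s[1], s[5] - s[7], s[3], s[5] + s[7]]

def cmul3(x, t):
    """Complex product using three real multiplies and five additions, eq. (3.3)."""
    xr, xi, tr, ti = x.real, x.imag, t.real, t.imag
    shared = tr * (xr - xi)            # multiply 1, used in both parts
    real = shared + xi * (tr - ti)     # multiply 2
    imag = xr * (tr + ti) - shared     # multiply 3
    return complex(real, imag)

# Assumed ROM contents: entry i = 8k + p holds exp(-2*pi*j*k*p/64).
TWIDDLE = [cmath.exp(-2j * cmath.pi * (i // 8) * (i % 8) / 64) for i in range(64)]

def fft64(x):
    """64-point FFT built from two passes of eight 8-point FFTs (Figure 3)."""
    s = []
    for k in range(8):                 # group x_k holds x(n) with n mod 8 = k
        s.extend(winograd8(x[k::8]))   # produces s(8k) .. s(8k+7)
    t = [cmul3(s[i], TWIDDLE[i]) for i in range(64)]
    X = [0j] * 64
    for p in range(8):                 # group t_p holds t(n) with n mod 8 = p
        X[p::8] = winograd8(t[p::8])   # produces X(n) with n mod 8 = p
    return X

def dft(x):
    """Direct O(N^2) DFT, used only as a reference for comparison."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]
```

Comparing `fft64(x)` with `dft(x)` for an arbitrary 64-sample complex input should agree to within floating-point rounding, which is a quick way to confirm that the signs in Figure 2 and (3.3) are consistent with each other.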
4. IMPLEMENTATION AND TESTING
VHDL was chosen as the hardware description language with which to build the FFT. The choice was made based mostly on the ready availability of tools to compile and simulate VHDL designs. The Quartus II design system from Altera was used to compile and simulate the system. The target device used for the timing simulation was the Apex EP20K600E [4]. This FPGA contains 24320 logic elements (LEs), 7326 of which are used by the FFT. A timing analysis of the worst-case propagation time shows that the maximum speed for the FFT design in this FPGA is 33.54 MHz. This means that the design can perform a 64-point FFT in 1.908 µs (64 samples at 33.54 MHz). In contrast, a 64-point FFT on a common DSP chip, the TI TMS320C3X at 75 MHz, takes 19.75 µs [5]. Thus the dedicated hardware FFT is faster by more than a factor of ten. It should be noted, however, that the C3X family of DSPs is 32-bit floating-point, which gives greater dynamic range than the fixed-point implementation. On the other hand, the quoted speed of the C3X does not include the time required to convert the samples to the TMS320-specific floating-point format, which may be a concern in actual system implementations and would slow the algorithm down further.

While this design is capable of processing up to 33.54 Msamples/s in at least one type of FPGA, the cost of the pipelined architecture is in the latency of each stage. Each bank switched memory stage adds 66 cycles of latency, due to the number of cycles it takes to load 64 samples before a bank switch. The 8-point Winograd FFT stages each add 16 cycles: 8 cycles to shift in 8 samples and another 8 cycles to allow the FFT to complete. It should be noted that the FFT itself, without the attached shift registers, is not pipelined. It is given the maximum amount of time possible, 8 cycles, to compute its result. After 8 cycles, the input shift register would begin to lose data, making this the upper limit. This wait state allows the rest of the system to be pipelined while keeping the 8-point FFT combinatorial. The twiddle factor multiplier stage contributes 7 cycles of latency. This latency arises from the delays associated with the twiddle factor ROM, as well as the pipelined multipliers used to form the complex multiplication unit. This makes the total latency for the entire 64-point FFT 237 cycles (3 x 66 + 2 x 16 + 7).

To test the FFT, it was given a set of artificially generated complex input values, and the outputs were compared with a software FFT implementation. The first eight points of the results are shown in Table 1. The output values of the hardware implementation sometimes differ from the software version by one. This is likely due to the rounding caused by the shift-add implementation of the multiplication by 1/√2, or simply the fact that the software implementation has much higher internal precision, particularly in the multiplication units.

Table 1: First 8 points of simulation results (Real and Imag values for FFT output samples 1 to 8, listed in source order)

    Software FFT    Hardware FFT
    0x07D62         0x07D61
    0xF7222         0xF7221
    0x02E15         0x02E14
    0x062ED         0x062EC
    0x03302         0x03301
    0xFF3D3         0xFF3D2
    0xFE819         0xFE818
    0x00A2E         0x00A2E
    0x07D62         0x07D61
    0xF6115         0xF6114
    0xF7099         0xF7098
    0xFDB5E         0xFDB5E
    0x023AA         0x023A9
    0x01C12         0x01C12
    0xFED87         0xFED86
    0xFD8DD         0xFD8DC

5. CONCLUSIONS
This paper presented an architecture for a pipelined 64-point FFT for implementation in a PLD. It is suitable for relatively high-speed applications where the typical DSP is not sufficiently fast to process the data, and particularly for real-time designs. Further work might include a simple extension of the radix-8 algorithm to the next step, a 512-point design, further investigation of the off-by-one errors, or perhaps further optimizations of the FFT design.

6. ACKNOWLEDGEMENT
This work was supported in part by the University Scholars Program at the University of Florida.

7. REFERENCES
[1] Oppenheim, A.V., Schafer, R.W., Discrete-Time Signal Processing, 2nd ed., Prentice Hall, New Jersey, 1999.
[2] Press, W.H., Flannery, B.P., Teukolsky, S.A., and Vetterling, W.T., Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, January 1993.
[3] Smith, Steven, The Scientist and Engineer's Guide to Digital Signal Processing, California Technical Publishing, 1997.
[4] APEX 20K Programmable Logic Device Family, Product Data Sheet, ver. 4.0, Altera Corporation, August 2001.
[5] TMS320C3x General-Purpose Applications User Guide, Texas Instruments, January 1998.
