
A 0.18 µm VLSI Technology Based
64-Point Fast Fourier Transform Kernel

Duo Ding
Joonsoo Lee
Yousof Mortazavi
Report for final project of VLSI-I Spring 2007

Abstract: In this report, we present a VLSI implementation of a
64-point FFT/IFFT IP core with signed fixed-point 16-bit word
length, intended primarily for IEEE 802.11a wireless Local Area
Network applications. Such a kernel could also be integrated
into a wide range of modern imaging radar systems and real-time
signal processing systems. At the algorithm level, the 64-point
FFT is decomposed into a 2-D structure of 8-point FFTs.
Compared with a traditional radix-2 64-point FFT, this
decomposition greatly reduces the workload of the complex
multiplier unit and improves both processing speed and power
consumption. Complex multiplications are realized by shifters
and adders with double precision, and no RAM cell is required
for coefficient storage. The proposed FFT kernel is based on a
0.18 µm CMOS technology; it is simulated in the Synopsys VCS
environment and compiled and synthesized in the design_vision
environment. The core area of the chip is 2.0 mm². Dynamic
power consumption is 15 mW at a 68 MHz operating frequency and
a 1.8 V supply voltage. To sum up, our design greatly
outperforms the original target specifications, and the overall
performance of our FFT kernel is satisfactory.

I. INTRODUCTION
In most of today's wireless communication standards,
Orthogonal Frequency Division Multiplexing (OFDM) is used
to cope with the multipath fading wireless channel.
OFDM is based on the Fast Fourier Transform (FFT), which is
computationally intensive, especially for a large number of
inputs. At the algorithm level, the complexity of the FFT is
O(N log N). As a result, baseband processors need a dedicated
FFT processing unit that is both fast and low power. Power is
of primary importance because of the mobility requirement in
wireless receivers and in many other handheld real-time signal
processing and imaging devices.
In this work, we have chosen a particularly low-power FFT
unit from the literature and implemented it in RTL. The FFT
unit is that of [1], which requires only 23 clock cycles to
compute and occupies only 6.8 mm² of core area. Compared to
other hardware FFT implementations, the work of [1] offers
the most attractive specifications for wireless communication
and many other signal processing applications.
This paper is organized as follows. Design decisions are
discussed in Section II, and results are presented in
Section III. Finally, the paper is summarized and concluded in
Section IV.
II. DESIGN DECISIONS
A. Specifications


Figure 1. Interface Diagram

The figure above gives a high-level overview of the FFT
kernel interface. Detailed descriptions of the input/output
ports are given below in Table 1:

signal | direction | description
CLK | input | System clock
RESET | input | The FFT kernel resets itself when RESET goes low
X[31:0] | input | Serial input signal at rising edge of CLK, each 32 bits long; it takes 57 clock cycles to start core FFT computation
MODE | input | Controls the FFT/inverse FFT functionality: MODE = 0 outputs the FFT, MODE = 1 outputs the IFFT
Y[31:0] | output | Serial output signal at rising edge of CLK, each 32 bits long; every 64 sets of outputs represent a period of FFT/IFFT calculation
O_STB | output | Data ready signal; O_STB = 1 means the output data sets are ready and valid
Table 1. Signal Specifications of FFT Core

Since VLSI-based FFT IP cores are an existing technology,
we have a clear set of target specifications. Listed below are
those of a 64-point FFT core implemented in a 0.25 µm
technology and published in 2004 [1]:
Target Item | Reference value
Core Area | 6.8 mm²
Clock Frequency | 20 MHz
Dynamic Power | 41 mW
Data Representation | Signed fixed-point
No. Flip-Flops | 7134
Core computational clock cycles | 23
Points of FFT | 64
Word Length | 16
Table 2. Initial Target Specifications
B. Implementation

(I) A Breakdown of the 64-Point FFT: Algorithm Background

The Discrete Fourier Transform $W(k)$ of a complex time
series $w(n)$, where $n, k \in \{0, 1, 2, \ldots, N-1\}$, can be
defined as the following:
$$W(k) = \sum_{n=0}^{N-1} w(n)\, W_N^{nk} \tag{1}$$
While the Inverse DFT takes the form of the following:
$$w(n) = \frac{1}{N} \sum_{k=0}^{N-1} W(k)\, W_N^{-nk} \tag{2}$$
where $W_N = e^{-j 2\pi / N}$. Suppose that $N = MT$, $k = s + Tt$,
and $n = l + Mm$, where $l, t \in \{0, 1, \ldots, M-1\}$ and
$m, s \in \{0, 1, \ldots, T-1\}$ (for $M = T = 8$, every index
runs over $\{0, 1, 2, \ldots, 7\}$).
Substituting these into equation (1), we get the following:
$$W(s + Tt) = \sum_{l=0}^{M-1} W_M^{lt}\, W_{MT}^{sl} \left[ \sum_{m=0}^{T-1} w(l + Mm)\, W_T^{sm} \right] \tag{3}$$
As can be observed in equation (3), an (M×T)-point DFT can be
broken down into a 2-D structure of DFTs with M and T points
respectively. Our proposal for the 64-point FFT kernel comes
from a direct application of this algorithm:
$$W(s + 8t) = \sum_{l=0}^{7} W_8^{lt}\, W_{64}^{sl} \left[ \sum_{m=0}^{7} w(l + 8m)\, W_8^{sm} \right] \tag{4}$$
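
The decomposition in equation (4) can be checked numerically. The following is a minimal Python sketch, not the authors' RTL: it assumes the convention W_N = exp(-j2π/N), uses a direct O(N²) DFT as both the 8-point kernel and the 64-point reference, and all function names are illustrative.

```python
import cmath

def twiddle(n_points, exponent):
    """Return W_N^exponent with W_N = exp(-j*2*pi/N)."""
    return cmath.exp(-2j * cmath.pi * exponent / n_points)

def dft(x):
    """Direct O(N^2) DFT, used as the 8-point kernel and as the reference."""
    n = len(x)
    return [sum(x[i] * twiddle(n, i * k) for i in range(n)) for k in range(n)]

def fft64_2d(w):
    """64-point DFT via equation (4): computes W(s + 8t) for s, t in 0..7."""
    assert len(w) == 64
    # First stage: for each l, an 8-point DFT over m of w(l + 8m), indexed by s.
    inner = [dft([w[l + 8 * m] for m in range(8)]) for l in range(8)]
    out = [0j] * 64
    for s in range(8):
        # Inter-stage multiplication by the coefficients W_64^(s*l).
        scaled = [inner[l][s] * twiddle(64, s * l) for l in range(8)]
        # Second stage: 8-point DFT over l, indexed by t.
        second = dft(scaled)
        for t in range(8):
            out[s + 8 * t] = second[t]
    return out

def max_abs_diff(a, b):
    """Largest magnitude difference between two complex sequences."""
    return max(abs(p - q) for p, q in zip(a, b))

# Arbitrary deterministic complex test input (illustrative only).
samples = [complex((n * 7) % 11 - 5, (n * 3) % 5 - 2) for n in range(64)]
print(max_abs_diff(fft64_2d(samples), dft(samples)) < 1e-9)
```

Only the middle step multiplies by W_64 coefficients; the two outer stages are plain 8-point transforms, which is the structure the architecture below exploits.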

(II) Design Blueprint: An Architectural View

The block diagram of our 64-point FFT/IFFT core is shown in
Figure 2. We divide the core into four sub-modules: the
Input/Output Unit, two 8-point FFT Units, the Complex
Multiplier Unit and the Internal Register Bank Unit. This
structure has its own advantages compared with other competing
proposals, as discussed further in the following section.

Figure 2. Block Diagram of the proposed 64-point FFT

(III) Pipelining vs. Parallel Working

Notice that in this architecture we distribute the work evenly
between pipelined and parallel units, rather than sharing one
physical functional unit and leaving everything else to a
pipelining register bank.
There is a competing proposal for FFT implementation in which
only one butterfly unit is integrated and a large register bank
handles the pipelining workload. After discussion, we decided
against that proposal for two reasons. (1) From a thermal
analysis point of view, it might work well for a 16-point FFT
unit, but for a 64-point FFT a large share of the workload
would fall on the pipelining unit alone, making it very hot
while the rest of the core stays cool; we consider this a
circuit design pitfall to avoid. (2) It may not scale well: the
workload and complexity of the pipelining unit grow
dramatically when the implemented IP cores are later integrated
into more complicated cores.
Due to these concerns and time pressure, we chose the current
proposal for our 64-point FFT kernel implementation.

(IV) Design Environment and Technology Library

For logic design and synthesis, we use design_vision
installed on the Sun stations of the ECE LRC. For simulation,
we use the VCS (Verilog Compiler Simulator) tool suite from
Synopsys; for verification and testing, we use the VCS, VirSim
and MATLAB environments. Since we employ a signed fixed-point
representation, most of the data format conversions in the
Verilog test bench interface with the Fixed-Point Toolbox of
MATLAB, with a version higher than 7.0.
The technology library linked for compilation comes from
Lab 3 of VLSI-I, namely HT018.db.

(V) Modular Design

Module 1: Input/Output Unit Design

The following Table 3 lists in detail the basic input/output
port descriptions of the implemented I/P module.
Table 3. Signal description of I/P module

The I/P unit performs a serial-to-parallel conversion of the
input data and interfaces with the first 8-point FFT module,
under control of the Control_Counter. It also contains embedded
buffers for temporary data storage, since some of the parallel
multiplications need more than one clock cycle to complete.
This is elaborated further in the Multiplier Unit description.


Figure 3. Overall structure of Input Module

The block diagram in Figure 3 illustrates the basic working
principles of the input unit, which contains the necessary
combinational logic (C.L block), a swapping block (SWAP) and
internal counters. The swapping unit offers a data path for the
IFFT functionality and is controlled by mode, an input listed
in Table 3.
The Combinational Logic block in Figure 3 is controlled by a
5-bit counter, which paces the I/P unit, O/P unit and
Multiplier Unit. The counter is also one of the outputs of the
FFT processor, giving the user a strobe port for observing the
internal working of the 64-point FFT kernel.
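
One common way a swapping block can provide an IFFT data path is the identity IDFT(X) = swap(DFT(swap(X)))/N, where swap exchanges the real and imaginary parts of each sample. The Python sketch below illustrates that identity only; the report does not state that this is exactly how the SWAP block is wired, and the 1/N scaling is an assumption that the hardware may handle differently.

```python
import cmath

def dft(x):
    """Direct forward DFT with W_N = exp(-j*2*pi/N)."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]

def swap(z):
    """Exchange real and imaginary parts: a + jb -> b + ja."""
    return complex(z.imag, z.real)

def idft_by_swapping(spectrum):
    """Inverse DFT built only from the forward DFT plus input/output swaps."""
    n = len(spectrum)
    forward = dft([swap(v) for v in spectrum])
    return [swap(v) / n for v in forward]  # 1/N scaling is an assumption here

# Round trip: forward transform, then the swap-based inverse.
signal = [complex(n % 5 - 2, (3 * n) % 7 - 3) for n in range(64)]
recovered = idft_by_swapping(dft(signal))
roundtrip_error = max(abs(a - b) for a, b in zip(signal, recovered))
print(roundtrip_error < 1e-9)
```

The appeal of this approach is that the forward FFT datapath is reused unchanged; only wiring at the input and output differs between the two modes.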
For a closer look at the Parallel Conversion block, see
Figure 4:

Figure 4. Detail of Parallel Conversion Block
There is a 32×57 register array in the Parallel Conversion
block. Once every 8 target points are ready for parallel output,
these data are fed into the first 8-point FFT block on the
rising edge of the system clock, as shown in Figure 4.

Module 2: 8-point FFT Units (1st and 2nd FFT units)

signal | direction | description
data_in | input | 255 bits input data
data_out | output | 255 bits output data
Table 4. Port descriptions of the 8-point FFT units
Table 4 summarizes the input/output port characterizations
of the 8-point FFT unit. This is a pure combinational logic
unit.

Figure 5. Flow Chart of Decimation-In-Time FFT (N=8)
I/P module signals (body of Table 3):
Signal | Description | I/O
CLK | clock | input
RST | Reset signal | input
Data_start | Enabling signal | input
Data_in | 32-bit input data | input
mode | FFT/IFFT control | input
Control counter | Interface port with control counter | input
Start_count | Enabling signal | output
Data_out | 255-bit output data in parallel | output

Figure 6. Basic building block of butterfly structure
As a purely combinational logic unit, the 8-point FFT takes
in parallel input data and assigns the corresponding FFT
results to the output wires, as shown in Figure 5 and Figure 6.
Each output is computed in parallel and no flops are employed.
A total of 12 butterfly subunits and 5 complex multiplier
subunits are placed. This adds extra area and power consumption
to the butterfly units, yet both the register bank and the
multiplier unit benefit greatly from the trade-off, and the
overall performance of the kernel remains satisfactory.
Within the 8-point FFT module there are 5 complex
multiplications; our design keeps the number of actual complex
multiplications as small as possible. In the 8-point FFT unit
specifically, only one true complex multiplication is used,
while all other multiplications are achieved by proper
swapping and assigning.
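
The butterfly and multiplier counts quoted above can be reproduced with a small software model. The Python sketch below is a textbook radix-2 decimation-in-time 8-point FFT, offered as an illustration rather than the authors' netlist; it counts the butterflies and records which twiddle exponents of W_8 are used. Names are illustrative.

```python
import cmath

def bit_reverse3(i):
    """Reverse the 3-bit index i (0..7)."""
    return ((i & 1) << 2) | (i & 2) | ((i & 4) >> 2)

def fft8_dit(x):
    """Radix-2 DIT 8-point FFT; returns (outputs, butterfly count, exponents)."""
    data = [x[bit_reverse3(i)] for i in range(8)]
    butterflies = 0
    exponents = []  # exponent e of W_8^e applied in each butterfly
    span = 1
    while span < 8:
        step = 8 // (2 * span)  # exponent stride for this stage
        for start in range(0, 8, 2 * span):
            for j in range(span):
                e = j * step
                w = cmath.exp(-2j * cmath.pi * e / 8)
                a = data[start + j]
                b = data[start + j + span] * w
                data[start + j] = a + b
                data[start + j + span] = a - b
                butterflies += 1
                exponents.append(e)
        span *= 2
    return data, butterflies, exponents

outputs, butterflies, exponents = fft8_dit([complex(n, 0) for n in range(8)])
non_unity = sum(1 for e in exponents if e != 0)
print(butterflies, non_unity)  # prints: 12 5
```

Of the five non-unity twiddles, three are W_8^2 = -j, which needs only a swap and a sign change. The remaining two, W_8^1 and W_8^3, both scale by 1/√2; how the design shares that work to reach a single multiplication is not detailed in the report.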


Figure 7. Double precision multiplication mechanism

Since a signed fixed-point representation is employed
throughout the design, complex multiplications fall into two
cases: for positive numbers, shifting and addition are carried
out with 0s shifted into the word; for negative numbers,
shifting and addition are carried out with 1s shifted into the
word. As shown in Figure 7, a double precision calculation is
used: we double the input length before the shifting and
addition, then truncate the result back to a 16-bit word
length. In simulation, this visibly improved accuracy.
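
The effect of the double precision step can be illustrated with a small Python sketch. It multiplies a 16-bit value by an approximation of 1/√2 using only shifts and adds. The shift set is an assumption chosen for illustration (the report does not list its coefficients), and Python's `>>` on negative integers is an arithmetic shift, which matches the "1s shifted in" behaviour described above.

```python
# Assumed shift set: 2^-1 + 2^-3 + 2^-4 + 2^-6 + 2^-8 = 0.70703125 (about 1/sqrt(2)).
SHIFTS = (1, 3, 4, 6, 8)

def mul_single(x):
    """Shift-and-add directly on the 16-bit word: each shift drops low bits."""
    return sum(x >> k for k in SHIFTS)

def mul_double(x):
    """Widen to 32 bits, shift-and-add, then truncate back to 16 bits."""
    wide = x << 16
    return sum(wide >> k for k in SHIFTS) >> 16

# Exact product 1001 * 0.70703125 is about 707.74.
print(mul_single(1001), mul_double(1001))    # prints: 705 707
print(mul_single(-1001), mul_double(-1001))  # prints: -710 -708
```

In the single precision version every partial term is truncated separately, so the errors add up; in the widened version only the final truncation loses information.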






Module 3: Multiplier Unit


Figure 8. Block Diagram of the Multiplier Unit

In the 2-D breakdown of the 64-point FFT algorithm, the
outputs of the first 8-point FFT unit have to be multiplied by
complex coefficients before the data are fed into the second
FFT unit. Here 8 complex numbers have to be dealt with.
Techniques are employed so that a minimal number of operations
is carried out. As in the 8-point FFT unit, the double
precision method is used, and reusable results are recycled
and swapped. To keep the design modular, each of the eight
constants is kept in its own sub-module. All operations are
monitored by control signals. Table 5 gives an input/output
port summary of the multiplier unit:

signal | direction | description
COUNT | input | controlling 5-bit counter
Input_data | input | 255 bits data
Output_data | output | 255 bits data
Table 5. Port description of the Multiplier Unit

Module 4: Internal Register Bank

The internal register bank (CB) is integrated into the system
for temporary storage of the 64 complex data values coming from
the multiplier unit. The CB has 8 wired 255-bit inputs in
parallel and 8 wired 255-bit outputs in parallel, which are fed
directly into the second 8-point FFT unit. At every clock
cycle, the appropriate data at the output of the CB are aligned
with the target input of the second 8-point FFT unit. Since the
second 8-point FFT unit is purely combinational, the 255 bits
of input data are processed before the next cycle arrives.
Therefore the downward shifting in the CB can be carried out
every cycle without interruption.
Essentially, the CB unit is the same as the input unit except
that it has no swapping blocks and no buffering registers.
Table 6 summarizes the input/output ports of the CB unit:
signal | direction | description
CLK | input | System clock
RST | input | Reset signal
Data_in | input | 255 bits input
COUNT | input | Control counter
Data_out | output | 255 bits output
Table 6. Port summary of CB unit

Module 5: Output Unit

signal | direction | description
CLK | input | system clock
RST | input | reset signal
Data_in | input | 255 bits input
mode | input | FFT/IFFT control
Data_out | output | 32 bits output
O_STB | output | data ready signal
Table 7. Port summary of O/P unit

Table 7 above gives the basic input/output port description
of the O/P unit.
Similar to the I/P unit, the O/P unit converts the parallel
signals back to serial signals and interfaces with the user
and/or an LCD display. There are no buffering registers in this
module, and the swapping function is selected by the input port
"mode" to send out the FFT or IFFT result.
C. Optimization
As discussed in the previous sections, several optimization
techniques were employed during module design. Many versions of
the structures, modules and code were modified and tested
before we finalized the design. Here we list the two most
important techniques:

(I) Computing Accuracy

Initially, our multiplications were carried out with 16 bits
of accuracy, the same word length as the data passing through
the FFT kernel. Simulations, however, showed unsatisfactory
errors when compared with the expected outcomes from the
MATLAB 7.0 simulator.
To address this, we doubled the bit length of each word after
it enters a complex multiplication block, and then truncated
the 32-bit word back to 16 bits before the output port. With
this mechanism, the accuracy becomes quite satisfactory.
Further demonstration is given in the Testing and
Verifications section that follows.

(II) Intermediate Results Recycling

Although breaking down the 64-point FFT greatly reduces the
computational complexity, a considerable number of complex
operations remain. To further reduce the calculation
complexity of the 8-point FFT units, and particularly of the
Multiplier Unit, we reuse intermediate results wherever
possible, which reduces power and increases speed. This is why
our design has far fewer complex multiplications than a direct
64-point FFT kernel would require.
Further results and figures are provided in the following
section on Testing and Verifications.
D. Testing and Verifications
Our test bench for the 64-point FFT kernel interfaces with
both the Synopsys and Matlab environments through file
operations. Our testing target is to show the calculation
error; the actual spectra of the test cases are given later in
Section III. Our test cases consist of six well-known time
series signals from digital signal processing, plus one super
test case that involves 1000 randomized input data sets. With
this methodology we covered more than 64000 input data points
with square error analysis, while also showing some direct
applications of our implemented 64-point FFT kernel in static
spectrum analysis.
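
The golden-model comparison described above can be sketched in Python. This is a hypothetical stand-in, not the authors' MATLAB/Verilog test bench: it assumes a Q1.15 interpretation of the 16-bit words, assumes a 1/N output scaling to keep values in range, and replaces the kernel's output file with a quantized copy of the reference so that the script is self-contained.

```python
import cmath
import random

FRACTION_BITS = 15  # assumed Q1.15; the report only says "16-bit signed fixed point"

def to_fixed(value):
    """Quantize a real value to a saturated 16-bit signed integer."""
    scaled = int(round(value * (1 << FRACTION_BITS)))
    return max(-(1 << 15), min((1 << 15) - 1, scaled))

def from_fixed(word):
    """Convert a 16-bit signed integer back to a real value."""
    return word / (1 << FRACTION_BITS)

def quantize(z):
    """Quantize real and imaginary parts independently."""
    return complex(from_fixed(to_fixed(z.real)), from_fixed(to_fixed(z.imag)))

def reference_dft(x):
    """Floating-point reference; the 1/N scaling is an assumption."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n)) / n
            for k in range(n)]

def squared_errors(actual, expected):
    """Per-bin squared magnitude of the difference."""
    return [abs(a - e) ** 2 for a, e in zip(actual, expected)]

random.seed(0)
stimulus = [quantize(complex(random.uniform(-1, 1), random.uniform(-1, 1)))
            for _ in range(64)]
expected = reference_dft(stimulus)
actual = [quantize(v) for v in expected]  # stand-in for the kernel's output file
errors = squared_errors(actual, expected)
print(max(errors))
```

In a real flow, `actual` would be read from the file written by the Verilog simulation, and the same `squared_errors` comparison would be run per test case.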

Test case 1:

Input time series is a rectangle shaped pulse, detailed testing
case in 16-bit fixed point representation is listed in Appendix 2.
We analyzed and plotted the square error of the physical
unit output with respect to the standard output of our Matlab
simulator, as follows,
[Plot: Testbench1 (MSE=-2.3283e-010+1.397e-009i); x-axis: points in frequency (0 to 60); y-axis: error value (0 to 8, ×10⁻⁹)]

Figure 9. MSE of the 64-FFT kernel for test case 1

As illustrated in Figure 9, the maximum mean square error is
suppressed below 8×10⁻⁹, and 82% of the physical output is
100% accurate.

Test case 2:

Input time series is a cosine shaped wave, detailed testing case
in 16-bit fixed point representation is listed in Appendix 2.
We analyzed and plotted the mean square error of the
physical unit output with respect to the standard output of our
Matlab simulator, as follows,
[Plot: Testbench2 (MSE=-7.5437e-008-3.574e-008i); x-axis: points in frequency (0 to 60); y-axis: error value (-1 to 1, ×10⁻⁵)]

Figure 10. MSE of the 64-FFT kernel for test case 2
As illustrated in Figure 10, the maximum mean square error is
suppressed below 7×10⁻⁶.

Test case 3:

Input time series is a cos + j sin shaped wave (a complex
exponential); the detailed test case in 16-bit fixed-point
representation is listed in Appendix 2. We analyzed and plotted
the mean square error of the physical unit output with respect
to the standard output of our Matlab simulator, as follows,
[Plot: Testbench3 (MSE=1.2444e-006-5.8627e-007i); x-axis: points in frequency (0 to 60); y-axis: error value (-1 to 1, ×10⁻⁴)]

Figure 11. MSE of the 64-FFT kernel for test case 3
Test case 4:

Input time series is a real constant-valued function; the
detailed test case in 16-bit fixed-point representation is
listed in Appendix 2. The mean square error of the physical
unit output with respect to the standard output of our Matlab
simulator is plotted as follows; the accuracy is 100%.
[Plot: Testbench4 (MSE=0); x-axis: points in frequency (0 to 60); y-axis: error value (-1 to 1)]

Figure 12. MSE of the 64-FFT kernel for test case 4
Test case 5:

Input time series is a truncated pulse series, detailed testing
case in 16-bit fixed point representation is listed in Appendix 2.
We analyzed and plotted the mean square error of the physical
unit output with respect to the standard output of our Matlab
simulator, as follows,
[Plot: Testbench5 (MSE=-4.773e-009+6.6357e-009i); x-axis: points in frequency (0 to 60); y-axis: error value (-2 to 5, ×10⁻⁷)]

Figure 13. MSE of the 64-FFT kernel for test case 5

Test case 6:

Input time series is a triangle shaped time series, detailed
testing case in 16-bit fixed point representation is listed in
Appendix 2. We analyzed and plotted the mean square error of
the physical unit output with respect to the standard output of
our Matlab simulator, as follows,
[Plot: Testbench6 (MSE=-3.0268e-009-5.1223e-009i); x-axis: points in frequency (0 to 60); y-axis: error value (-4 to 1, ×10⁻⁷)]

Figure 14. MSE of the 64-FFT kernel for test case 6
As illustrated in figures above, the precision of the proposed
64-points FFT kernel is satisfactory with respect to the six
testing cases in signal processing application.

Massive input data test case

Corner cases alone are not enough for thorough testing, so we
construct 1000 input data sets, concatenate them into one
massive input data file, and feed it into the test bench, where
1000 cycles of the 64-point FFT are carried out continuously
within the implemented FFT kernel. Figure 15 shows the absolute
error value of the 64000 input cases; the calculation precision
is quite satisfactory for a fixed-point representation:

[Plot: Error between MATLAB & Module (random signal); x-axis: number of points (0 to 6, ×10⁴); y-axis: absolute error value (0 to 5, ×10⁻³)]

Figure 15. Absolute Error of the 64-FFT kernel for super test case

III. RESULTS
A. Functionality
For all the test cases used in the Testing and Verifications
section, here we give the spectrum plots, which are the actual
outputs of our implemented 64-point FFT kernel. The Verilog
HDL code for the FFT kernel is listed in Appendix 1.

Test case 1:

Input time series is a rectangle shaped pulse, therefore the
Amplitude Spectrum is a Sinc function shaped spectrum, as
shown in Figure 16 below,

[Plot: Testbench1 (MSE=-2.3283e-010+1.397e-009i); x-axis: points in frequency (0 to 60); y-axis: amplitude spectrum (0 to 0.14); series: actual, expected]

Figure 16. Amplitude spectrum of test case 1 by our FFT core

Test case 2:

Input time series is a Sinusoidal shaped waveform; therefore
the Amplitude Spectrum is a double-pulse function shaped
spectrum, as shown in Figure 17

[Plot: Testbench2 (MSE=-7.5437e-008-3.574e-008i); x-axis: points in frequency (0 to 60); y-axis: amplitude spectrum (0 to 1.5); series: actual, expected]

Figure 17. Amplitude spectrum of test case 2 by our FFT core

Test case 3:

Input time series is a complex exponential shaped waveform,
therefore the Amplitude Spectrum is a single pulse shaped
spectrum, as shown in Figure 18 below,

[Plot: Testbench3 (MSE=1.2444e-006-5.8627e-007i); x-axis: points in frequency (0 to 60); y-axis: amplitude spectrum (0 to 2); series: actual, expected]

Figure 18. Amplitude spectrum of test case 3 by our FFT core

Test case 4:

Input time series is a constant function, therefore the
Amplitude Spectrum is a pulse function shaped spectrum with
the pulse located at the zero frequency, as shown in Figure 19
below,
[Plot: Testbench4 (MSE=0); x-axis: points in frequency (0 to 60); y-axis: amplitude spectrum (0 to 1.5); series: actual, expected]

Figure 19. Amplitude spectrum of test case 4 by our FFT core

Test case 5:

Input time series is a truncated pulse series, therefore the
Amplitude Spectrum is an amplitude modulated series of
pulses in frequency domain, as shown in Figure 20 below.
Such a signal is often used for Pulse-Doppler Radar Imaging
systems.

[Plot: Testbench5 (MSE=-4.773e-009+6.6357e-009i); x-axis: points in frequency (0 to 60); y-axis: amplitude spectrum (0 to 2); series: actual, expected]

Figure 20. Amplitude spectrum of test case 5 by our FFT core

Test case 6:

Input time series is a triangular shaped waveform, therefore
the Amplitude Spectrum is a product of two Sinc functions, as
shown in Figure 21 below,

[Plot: Testbench6 (MSE=-3.0268e-009-5.1223e-009i); x-axis: points in frequency (0 to 60); y-axis: amplitude spectrum (0 to 2); series: actual, expected]

Figure 21. Amplitude spectrum of test case 6 by our FFT core

From the figures above, we could observe that the static
spectrum is accurate to the best of our knowledge.

B. Timing / Area Synthesis
For an initial clock period of 20 ns, the synthesized area of
our 64-point FFT kernel is 2.0 mm², the timing is 14.8 ns, and
the dynamic power consumption is 15 mW. Three rounds of
optimization were tried, and Table 8 lists the optimized
results. For detailed reports, please refer to Appendix 3. More
optimizations could be carried out if time allowed.

target item | reference | our design
core area | 6.8 mm² (0.25 µm) | 2.0 mm² (0.18 µm)
clock freq | 20 MHz | 68 MHz
No. registers | 7134 | 5713
power | 41 mW | 15 mW
Table 8. A comparison between our design and reference design

C. APR and Physical Layout
The chip is placed and routed in the Cadence/2006 Encounter
environment; connectivity is verified and no violation or
warning is detected. A snapshot of the physical layout of the
64-point FFT kernel is shown in Figure 22:

Figure 22. APR physical layout of the 64-point FFT core
IV. SUMMARY AND CONCLUSIONS
In this paper, we have described the design and
implementation of a serial 64 point FFT suitable for wireless
and modern signal/image processing applications. We
described the modular design in register-transfer level (RTL),
and synthesized and optimized our modules using Design
Vision. We verified our design at various stages, namely at the
RTL level and post-synthesis. We used golden model test
benches where MATLAB was used to generate valid
input/output vectors. Then the input vectors were applied to
the FFT and the output was compared. We demonstrated that
our processor passed the functionality tests with more than
64,000 data points.
Our FFT chip operates well beyond the target frequency of
20 MHz and occupies only 2.0 mm² in a 0.18 µm process.
Once the serial data is in the FFT unit, only 23 clock cycles
are required to produce the output. Therefore, the FFT can be
computed in less than a microsecond.
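
As a quick check of this timing claim, assuming the 23 core cycles run at the synthesized 68 MHz clock reported in Table 8 (input and output serialization excluded):

$$T_{\text{clk}} = \frac{1}{68\ \text{MHz}} \approx 14.7\ \text{ns}, \qquad T_{\text{core}} = 23 \times T_{\text{clk}} \approx 338\ \text{ns} < 1\ \mu\text{s}$$

At the 20 MHz target clock the same 23 cycles would take $23 \times 50\ \text{ns} = 1.15\ \mu\text{s}$, so the sub-microsecond figure depends on the higher achieved clock rate.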
Overall, the current work allowed us to go through the
design cycles and learn to make important design decisions for
achieving the goal of small area and minimal delay. It was
also a great exercise in teamwork and collaboration with
colleagues. As a result of this project, we feel we were able
to apply many of the concepts learned in the course; hence, the
objectives of the course were fulfilled.
REFERENCES
[1] K. Maharatna, E. Grass, and U. Jagdhold, "A 64-Point Fourier Transform
Chip for High-Speed Wireless LAN Application Using OFDM," IEEE
JSSC, vol. 39, no. 3, March 2004.
