Linköping Studies in Science and Technology

Thesis No. 1030
STUDIES ON IMPLEMENTATION OF
LOW POWER FFT PROCESSORS
Weidong Li
LiU-Tek-Lic-2003:29
Department of Electrical Engineering
Linköpings universitet, SE-581 83 Linköping, Sweden
Linköping, June 2003
Studies on Implementation of
Low Power FFT Processors
Copyright © 2003 Weidong Li
Department of Electrical Engineering
Linköpings universitet
SE-581 83 Linköping
Sweden
ISBN 91-7373-692-9 ISSN 0280-7971
To the memory of my father.
Abstract
In the last decade, the interest in high-speed wireless and cable
communication has increased. Orthogonal Frequency Division
Multiplexing (OFDM) is a strong candidate and has been suggested
or standardized in those communication systems. One key
component in OFDM-based systems is the FFT processor, which
performs efficient modulation/demodulation.
There are many FFT architectures. Among them, the pipeline
architectures are suitable for real-time communication systems. This
thesis presents the implementation of pipeline FFT processors with
low power consumption.
We select the meet-in-the-middle design methodology for the
implementation of the FFT processors. A resource analysis for the
pipeline architectures is presented. This resource analysis determines
the number of memories, butterflies, and complex multipliers needed
to meet the specification.
We present a wordlength optimization method for the pipeline
architectures. We show that the high-radix butterfly can be efficiently
implemented with the carry-save technique, which reduces the hardware
complexity and the delay. We also present an efficient implementation
of a complex multiplier using distributed arithmetic (DA). The
implementation of low voltage memories is also discussed.
Finally, we present a 16-point butterfly using constant multipliers
that reduces the total number of complex multiplications. The FFT
processor using the 16-point butterflies is a competitive candidate
for low power applications.
Acknowledgement
I would like to thank my supervisor Professor Lars Wanhammar for
his support and guidance of this research. Also, I would like to thank
the whole group, Electronics Systems at Linköping University, for
their help in the discussions in research as well as other matters.
Lastly, I would like to express my gratitude to Oscar Gustafsson,
Henrik Ohlsson, and Per Löwenberg for the proofreading.
Finally, and most importantly, I would like to thank my family,
relatives, and friends, especially A Phung, for their boundless
support and encouragement.
This work was financially supported by the Swedish Foundation for
Strategic Research (SSF) under the INTELECT program.
Table of Contents
1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1. DFT and FFT ..................................................................... 2
1.2. OFDM Basics .................................................................... 3
1.3. Power Consumption .......................................................... 6
1.4. Thesis Outline ................................................................... 7
1.5. Contributions ..................................................................... 8
2. FFT ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1. Cooley-Tukey FFT Algorithms ......................................... 9
2.1.1. Eight-Point DFT ....................................................... 10
2.1.2. Basic Formula ........................................................... 12
2.1.3. Generalized Formula ................................................ 13
2.2. Sande-Tukey FFT Algorithms ........................................ 18
2.3. Prime Factor FFT Algorithms ......................................... 20
2.4. Other FFT Algorithms ..................................................... 23
2.4.1. Split-Radix FFT Algorithm ...................................... 23
2.4.2. Winograd Fourier Transform Algorithm .................. 26
2.5. Performance Comparison ................................................ 26
2.5.1. Multiplication Complexity ....................................... 27
2.5.2. Addition Complexity ................................................ 29
2.6. Other Issues ..................................................................... 30
2.6.1. Scaling and Rounding Issue ..................................... 31
2.6.2. IDFT Implementation ............................................... 35
2.7. Summary ......................................................................... 36
3. LOW POWER TECHNIQUES . . . . . . . . . . . . . . . . . . . . . . 37
3.1. Power Dissipation Sources .............................................. 37
3.1.1. Short-Circuit Power .................................................. 37
3.1.2. Leakage Power ......................................................... 38
3.1.3. Switching Power ....................................................... 39
3.2. Low Power Techniques ................................................... 40
3.2.1. System Level ............................................................ 40
3.2.2. Algorithm Level ....................................................... 42
3.2.3. Architecture Level .................................................... 44
3.2.4. Logic Level ............................................................... 47
3.2.5. Circuit Level ............................................................. 50
3.3. Low Power Guidelines .................................................... 51
3.4. Summary ......................................................................... 52
4. FFT ARCHITECTURES . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.1. General-Purpose Programmable DSP Processors ........... 53
4.2. Programmable FFT Specific Processors ......................... 54
4.3. Algorithm-Specific Processors ........................................ 56
4.3.1. Radix-2 Multipath Delay Commutator ..................... 57
4.3.2. Radix-2 Single-Path Delay Feedback ....................... 58
4.3.3. Radix-4 Multipath Delay Commutator ..................... 59
4.3.4. Radix-4 Single-Path Delay Commutator .................. 60
4.3.5. Radix-4 Single-Path Delay Feedback ....................... 61
4.3.6. Radix-2² Single-Path Delay Commutator ................ 62
4.4. Summary ......................................................................... 63
5. IMPLEMENTATION OF FFT PROCESSORS . . . . . . . . 65
5.1. Design Method ................................................................ 65
5.2. High-level Modeling of an FFT Processor ...................... 67
5.2.1. Resource Analysis .................................................... 67
5.2.2. Validation of the High-Level Model ........................ 70
5.2.3. Wordlength Optimization ......................................... 71
5.3. Subsystems ...................................................................... 72
5.3.1. Memory .................................................................... 73
5.3.2. Butterfly .................................................................... 79
5.3.3. Complex Multiplier .................................................. 83
5.4. Final FFT Processor Design ............................................ 93
5.5. Summary ......................................................................... 96
6. CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7. REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
1. INTRODUCTION
The Fast Fourier Transform (FFT) is one of the most used algorithms
in digital signal processing. The FFT, which facilitates the efficient
transformation between the time domain and the frequency domain
for a sampled signal, is used in many applications, e.g., radar,
communication, sonar, and speech signal processing.
In the last decade, the interest in high-speed wireless and cable
communication has increased. The Orthogonal Frequency Division
Multiplexing (OFDM) technique, which is a special Multicarrier
Modulation (MCM) method, has been demonstrated to be an
efficient and reliable approach for high-speed data transmission. The
immunity to multipath fading channels and the capability for parallel
signal processing make it a promising candidate for the next
generation of wide-band communication systems. The modulation and
demodulation of OFDM-based communication systems can be
efficiently implemented with an FFT, which has made the FFT
valuable for those communication systems. OFDM-based
communication systems have high performance requirements on both
throughput and power consumption. These performance requirements
necessitate an application-specific integrated circuit (ASIC)
solution for the FFT implementation. This thesis addresses the problem
of designing efficient application-specific FFT processors for
OFDM-based wide-band communication systems.
In this chapter, we give a short review of the DFT and the FFT. Then
introductions to OFDM and power consumption are presented.
Finally, the outline of the thesis is described.
1.1. DFT and FFT
The Discrete Fourier Transform (DFT) of an N-point data sequence
$\{x(k)\}$, $k = 0, 1, \ldots, N-1$, is defined as

$$X(n) = \sum_{k=0}^{N-1} x(k)\, W_N^{nk} \qquad (1.1)$$

for $n = 0, 1, \ldots, N-1$, where $W_N = e^{-j2\pi/N}$ is the primitive
N-th root of unity. The number N is also called the transform length. The
indices $k$ and $n$ are referred to as the time-domain and frequency-domain
index, respectively.

The inverse DFT (IDFT) of a data sequence $\{X(n)\}$
($n = 0, 1, \ldots, N-1$) is

$$x(k) = \frac{1}{N} \sum_{n=0}^{N-1} X(n)\, W_N^{-nk} \qquad (1.2)$$

for $k = 0, 1, \ldots, N-1$.

Direct computation of an N-point DFT according to Eq. (1.1)
requires $N(N-1)$ complex additions and $N(N-1)$ complex
multiplications. The complexity of computing an N-point DFT directly is
therefore $O(N^2)$. With the contribution from Cooley and Tukey
[13], the complexity of computing an N-point DFT can be
reduced to $O(N \log N)$. The Cooley-Tukey approach and the
later developed algorithms that reduce the complexity of the DFT
computation are called fast Fourier transform (FFT) algorithms.

Among the FFT algorithms, two algorithms are especially
noteworthy. One is the split-radix algorithm, published in 1984, which
treats the even and odd parts with different radices. The other is the
Winograd Fourier Transform Algorithm (WFTA), published in 1976,
which requires the least known number of multiplications among
practical algorithms for moderate-length DFTs.

Many implementation approaches for the FFT have been
proposed since the discovery of the FFT algorithms. Due to the high
computational workload and intensive memory access, the
implementation of FFT algorithms is still a challenging task.
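To make the definitions concrete, the following sketch evaluates Eq. (1.1) directly and checks it against a library FFT (Python with NumPy assumed; this example is added for illustration and is not part of the original text):

```python
import numpy as np

def dft_direct(x):
    """Direct O(N^2) evaluation of Eq. (1.1): X(n) = sum_k x(k) * W_N^(n*k)."""
    N = len(x)
    W_N = np.exp(-2j * np.pi / N)        # primitive N-th root of unity
    k = np.arange(N)
    return np.array([np.sum(x * W_N ** (n * k)) for n in range(N)])

x = np.random.randn(16) + 1j * np.random.randn(16)
assert np.allclose(dft_direct(x), np.fft.fft(x))   # agrees with an O(N log N) FFT
```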
1.2. OFDM Basics
OFDM is a special MCM technique. The idea of MCM is to divide the
transmission bandwidth into many narrow subchannels (subcarriers),
which transmit data in parallel [5].
The principle of MCM is shown in Fig. 1.1. The high-rate data
stream at $M f_{sym}$ bits/s is grouped into blocks with M bits per block
at a rate of $f_{sym}$. A block is called a symbol. A symbol allocates
$m_k$ of its M bits for the modulation of carrier k at frequency $f_{c,k}$,
and in total M bits for the modulation of the N carriers. This results in
N subchannels, which send symbols at a rate of $f_{sym}$.
In conventional MCM, the N subchannels are non-overlapping.
Each subchannel has its own modulator and demodulator.
This leads to inefficient usage of the spectrum and excessive
hardware requirements.
The OFDM technique can overcome those drawbacks. With
OFDM, the spectrum can be used more efficiently since overlapping
of subchannels is allowed. The overlapping does not cause
interference between subchannels, due to the orthogonal modulation.
Figure 1.1. A multicarrier modulation system.
The orthogonality can be explained in the frequency domain. The
symbol rate is $f_{sym}$, i.e., each symbol is sent during a symbol time $T$
(which is equal to $1/f_{sym}$). The frequency spacing between adjacent
subchannels is set to $1/T$ Hz, and the carrier signals can be expressed
as follows:

$$f_k = f_0 + \frac{k}{T}, \qquad 0 \le k \le N-1 \qquad (1.3)$$

$$g_k(t) = \begin{cases} e^{j2\pi f_k t}, & 0 \le t < T \\ 0, & \text{otherwise} \end{cases} \qquad (1.4)$$

where $f_0$ is the system base frequency and $g_k$ is the signal for carrier
$k$ at frequency $f_k$. If the frequency of subcarrier $k$ and the base
function are chosen according to Eq. (1.3) and Eq. (1.4), the spectrum
of a subcarrier is a sinc function with zeros at the frequencies $f_0 + l/T$
($l$ an integer) except at $l = k$, i.e., at $f_k$. This means that the selected
functions cause no interference to the other subchannels.

This orthogonality can also be found in the time domain. For two
carrier signals, $g_k$ and $g_l$, the integral over a symbol time is

$$\int_0^T g_k(t)\, g_l^{*}(t)\, dt = \begin{cases} T, & k = l \\ 0, & \text{otherwise} \end{cases} \qquad (1.5)$$

which shows that the two carriers are orthogonal.

OFDM also overcomes the inefficient implementation of the
modulator and demodulator of conventional MCM. From Fig. 1.1,
the transmitted signal $x(t)$ is the sum of the symbol transmissions in
all subchannels, i.e.,

$$x(t) = \sum_{k=0}^{N-1} S_k\, g_k(t) = e^{j2\pi f_0 t} \sum_{k=0}^{N-1} S_k\, e^{j2\pi k t / T}$$
Figure 1.2. Spectrum overlapping of subcarriers for OFDM.
where $S_k$ is the modulated signal of the $m_k$ bits, which should be
transmitted by subchannel k. This is an N-point Inverse Discrete Fourier
Transform (IDFT) combined with baseband modulation (with
$e^{j2\pi f_0 t}$). The IDFT can be computed efficiently by an Inverse
Fast Fourier Transform (IFFT) algorithm. Hence the OFDM modulator
can be implemented with one IFFT processor and a baseband modulator
for the N subcarriers, instead of the N modulators of conventional MCM.
In a similar way, the OFDM demodulator can be implemented more
efficiently than that of conventional MCM. The simplified OFDM
system based on the FFT is shown in Fig. 1.3.
In reality, interference between subchannels exists due to
non-ideal channel characteristics and frequency offsets in
transmitters and receivers. This interference affects the performance of
the OFDM system. The frequency offset can, in most cases, be
compensated.
Other issues, for instance intersymbol interference, can be
reduced by techniques like the cyclic prefix.
Figure 1.3. OFDM system based on FFT.
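As an illustration of this equivalence (a minimal sketch assuming Python/NumPy and arbitrary QPSK subcarrier symbols; the variable names are ours, not the thesis'), the OFDM modulator reduces to one IFFT and the demodulator to one FFT:

```python
import numpy as np

N = 64                                            # number of subcarriers (example value)
bits = np.random.randint(0, 4, N)                 # 2 bits per subcarrier (QPSK)
S = np.exp(1j * (np.pi / 4 + np.pi / 2 * bits))   # modulated symbols S_k

x = np.fft.ifft(S)        # modulator: one IFFT replaces N modulators
S_hat = np.fft.fft(x)     # demodulator: one FFT (ideal, noiseless channel)
assert np.allclose(S, S_hat)
```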
1.3. Power Consumption
Moore's Law predicts the exponential increase in circuit
integration and clock frequency seen during the last three decades.
Table 1.1 shows the expectations for the near future from the
Semiconductor Industry Association.
The power consumption decreases as the feature size and the
power supply voltage are reduced. However, the power consumption
increases or remains almost the same as the technology advances,
according to Table 1.1. This is due to the potential increase in
workload.
During the last decade, the power consumption has grown from
a secondary constraint to one of the main constraints in the design of
integrated circuits. In portable applications, low power consumption
has long been the main constraint. Several other factors, for
instance more functionality, higher workload, and longer operation
time, contribute to making the power consumption and energy
efficiency even more critical. In high performance applications,
where the power consumption traditionally was a secondary
Year                                                     2003       2004       2005       2010
Feature size                                             107 nm     90 nm      80 nm      45 nm
ASIC usable Mega-transistors/cm² (auto layout)           142        178        225        714
ASIC maximum functions per chip (Mega-transistors/chip)  810        1020       1286       4081
Package cost (cents/pin), maximum/minimum                1.24/0.70  1.17/0.66  1.11/0.61  0.98/0.49
On-chip, local clock (MHz)                               3088       3990       5173       11511
Supply voltage Vdd (V) (high performance)                1.0        1.0        0.9        0.6
Power consumption, high performance with heatsink (W)    150        160        170        218
Power consumption, battery (hand-held) (W)               2.8        3.2        3.2        3.0

Table 1.1. Technology roadmap from the International Technology
Roadmap for Semiconductors (ITRS).
constraint, the low power techniques gain more ground due to the
steadily increasing cost of cooling and packaging. Besides those
factors, the increasing power consumption has resulted in higher on-
chip temperatures, which in turn reduce the reliability. The delivery
of the power supply to the chip has also raised many problems, like
power rail design, noise immunity, IR-drop, etc. Therefore, the low
power techniques are important for current and future integrated
circuits.
1.4. Thesis Outline
In this thesis we summarize some implementation aspects of a low
power FFT processor for an OFDM communication system. The
system specification for the FFT processor has been defined as:
• Transform length is 1024
• Transform time is less than 40 µs (continuous computation; see the consistency check after this list)
• Continuous I/O
• 25.6 Msamples/s throughput
• Complex 24-bit I/O data
• Low power
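As a quick consistency check (an illustrative sketch, not part of the specification itself), the transform time follows directly from the transform length and the continuous throughput:

```python
samples = 1024
throughput = 25.6e6          # samples/s, from the specification
print(samples / throughput)  # 4e-05 s, i.e., 40 us per 1024-point transform
```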
In chapter 2, we introduce several FFT algorithms, which are the
starting point for the implementation. The basic idea of the FFT
algorithms, i.e., divide and conquer, is demonstrated through a few
examples. Several FFT algorithms and their performance are also
given.
An overview of low power techniques is given in chapter 3.
Different techniques are introduced at the different abstraction levels.
The main focus of the low power techniques is the reduction of the
dynamic power consumption. A general guideline is found at the
end of that chapter.
The choice of FFT architecture is important for the implementation.
A few architectures, including the pipeline architectures, are
introduced in chapter 4. The pipeline architectures are discussed in
more detail since they are the dedicated architectures for our target
application.
In chapter 5, more detailed implementation steps for the FFT
processors are provided. Both the design method and the design of the
FFT processors are discussed in this chapter.
The conclusions for the FFT processor implementation are given
in chapter 6.
1.5. Contributions
The main contributions of this thesis are:
• A method for minimizing the wordlengths in the pipelined
FFT architectures, as outlined in Section 5.2.3.
• An approach to construct efficient high-radix butterflies,
presented in Section 5.3.2.2.
• A complex multiplier using distributed arithmetic and the
overturned-stairs tree, given in Section 5.3.3.
• A 16-point butterfly with constant multipliers, which reduces
the total number of complex multiplications. This is
described in Section 5.4.
• Various generators for different components, for instance, the
ripple-carry adder, Brent-Kung adder, complex multiplier,
etc. This is found in Chapter 5.
2. FFT ALGORITHMS
In FFT processor design, the mathematical properties of the FFT must
be exploited for an efficient implementation, since the selection of
FFT algorithm has a large impact on the implementation in terms of
speed, hardware complexity, power consumption, etc. This chapter
focuses on a review of FFT algorithms.
2.1. Cooley-Tukey FFT Algorithms
The technique for efficient computation of DFTs is based on the divide
and conquer approach. This technique works by recursively
breaking down a problem into two or more sub-problems of the
same (or related) type. The sub-problems are then independently
solved and their solutions are combined to give a solution to the
original problem. This technique can be applied to DFT computation
by dividing the data sequence into smaller data sequences until the
DFTs of the small data sequences can be computed efficiently.
Although the technique was described as early as 1805 [26], it was not
applied to DFT computation until 1965 [13]. Cooley and Tukey
demonstrated the simplicity and efficiency of the divide and conquer
approach for DFT computation and made the FFT algorithms widely
accepted. We first give a simple example of the divide and conquer
approach. Then a basic and a generalized FFT formulation are given.
2.1.1. Eight-Point DFT
In this section, we illustrate the idea of the divide and conquer
approach and show why dividing is also conquering for DFT
computation.

Let us consider an 8-point DFT, i.e., $N = 8$ and data sequence
$\{x(k)\}$, $k = 0, 1, \ldots, 7$. The DFT of $\{x(k)\}$ is given by

$$X(n) = \sum_{k=0}^{7} x(k)\, W_8^{nk} \qquad (2.1)$$

for $n = 0, 1, \ldots, 7$.

One way to break down a long data sequence into shorter ones is
to group the data sequence according to the indices. Let $\{x_o(l)\}$
and $\{x_e(l)\}$ ($l = 0, 1, 2, 3$) be two sequences. The grouping of
$\{x(k)\}$ into $\{x_o(l)\}$ and $\{x_e(l)\}$ can be done intuitively by
separating the members with odd and even indices:

$$x_o(l) = x(2l+1) \qquad (2.2)$$

$$x_e(l) = x(2l) \qquad (2.3)$$

for $l = 0, 1, 2, 3$.

The DFT of $\{x(k)\}$ can be rewritten as

$$X(n) = \sum_{l=0}^{3} x_o(l)\, W_8^{n(2l+1)} + \sum_{l=0}^{3} x_e(l)\, W_8^{n(2l)} = W_8^{n} \sum_{l=0}^{3} x_o(l)\, W_4^{nl} + \sum_{l=0}^{3} x_e(l)\, W_4^{nl} = W_8^{n} X_o(n) + X_e(n) \qquad (2.4)$$

where $W_8^{n(2l)} = \left(e^{-j2\pi/8}\right)^{n(2l)} = \left(e^{-j2\pi/4}\right)^{nl} = W_4^{nl}$,
and $X_o(n)$ and $X_e(n)$ are the 4-point DFTs of $\{x_o(l)\}$ and
$\{x_e(l)\}$, respectively.

Eq. (2.4) shows that the computation of an 8-point DFT can be
decomposed into two 4-point DFTs plus summations. Direct
computation of an 8-point DFT requires $8(8-1) = 56$ complex
additions and $8(8-1) = 56$ complex multiplications. The
computation of the two 4-point DFTs requires $2 \cdot 4(4-1) = 24$
complex additions and $2 \cdot 4(4-1) = 24$ complex multiplications.
With the additional complex multiplications with $W_8^{n}$ for $X_o(n)$
and the complex additions for combining the two halves, the 8-point
DFT computation according to Eq. (2.4) requires in total 30 complex
multiplications and 32 complex additions. Only two 4-point DFTs are
required for the 8-point DFT due to the fact that $X_o(n) = X_o(n-4)$
and $X_e(n) = X_e(n-4)$ for $n \ge 4$. Furthermore, the number of
complex multiplications for $W_8^{n} X_o(n)$ can be reduced from 7 to 3,
since $W_8^{n} = -W_8^{n-4}$ for $n \ge 4$. The total number of complex
additions and complex multiplications is then 32 and 27, respectively.
This is shown in Fig. 2.1.

The above 8-point DFT example shows that the decomposition
of a long data sequence into smaller data sequences reduces the
computational complexity.

Figure 2.1. An 8-point DFT computation with two 4-point DFTs.
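The decomposition in Eq. (2.4) can be transcribed directly (a sketch assuming Python/NumPy; it mirrors the equations rather than an optimized butterfly network):

```python
import numpy as np

def dft(v):
    """Direct DFT used for the 4-point sub-transforms."""
    N = len(v)
    l = np.arange(N)
    return np.array([np.sum(v * np.exp(-2j * np.pi * n * l / N)) for n in range(N)])

def dft8_via_two_dft4(x):
    """Eq. (2.4): X(n) = W_8^n X_o(n) + X_e(n), with X_o, X_e periodic in 4."""
    Xo = dft(x[1::2])                    # 4-point DFT of the odd-indexed samples
    Xe = dft(x[0::2])                    # 4-point DFT of the even-indexed samples
    n = np.arange(8)
    W8 = np.exp(-2j * np.pi * n / 8)     # twiddle factors W_8^n
    return W8 * Xo[n % 4] + Xe[n % 4]    # X_o(n) = X_o(n - 4) for n >= 4

x = np.random.randn(8) + 1j * np.random.randn(8)
assert np.allclose(dft8_via_two_dft4(x), np.fft.fft(x))
```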
2.1.2. Basic Formula
The 8-point DFT example illustrates the principle of the Cooley-
Tukey FFT algorithm. We now introduce a more mathematical
formulation of the FFT algorithm.

Let N be a composite number, i.e., $N = r_1 \times r_0$. The index k can
then be expressed by a two-tuple $(k_1, k_0)$ as

$$k = r_0 k_1 + k_0 \qquad (0 \le k_0 < r_0, \quad 0 \le k_1 < r_1) \qquad (2.5)$$

In a similar way, the index n can be described by $(n_1, n_0)$ as

$$n = r_1 n_1 + n_0 \qquad (0 \le n_1 < r_0, \quad 0 \le n_0 < r_1) \qquad (2.6)$$

The term $W_N^{nk}$ can be factorized as

$$W_N^{nk} = W_N^{(r_1 n_1 + n_0)(r_0 k_1 + k_0)} = W_N^{r_1 r_0 n_1 k_1}\, W_{r_1}^{n_0 k_1}\, W_{r_0}^{n_1 k_0}\, W_N^{n_0 k_0} = W_{r_1}^{n_0 k_1}\, W_{r_0}^{n_1 k_0}\, W_N^{n_0 k_0} \qquad (2.7)$$

where $W_N^{r_1 r_0 n_1 k_1} = W_N^{N n_1 k_1} = e^{-j2\pi N n_1 k_1 / N} = 1$.

With Eq. (2.7), Eq. (1.1) can be rewritten as

$$X(n_1, n_0) = \sum_{k_1=0}^{r_1-1} \sum_{k_0=0}^{r_0-1} x(k_1, k_0)\, W_{r_1}^{n_0 k_1}\, W_{r_0}^{n_1 k_0}\, W_N^{n_0 k_0} = \sum_{k_0=0}^{r_0-1} \left[ \left( \sum_{k_1=0}^{r_1-1} x(k_1, k_0)\, W_{r_1}^{n_0 k_1} \right) W_N^{n_0 k_0} \right] W_{r_0}^{n_1 k_0} \qquad (2.8)$$

Eq. (2.8) indicates that the DFT computation can be performed
in three steps:
1 Compute $r_0$ different $r_1$-point DFTs (inner parenthesis).
2 Multiply the results with $W_N^{n_0 k_0}$.
3 Compute $r_1$ different $r_0$-point DFTs (outer parenthesis).

The $r_0$ $r_1$-point DFTs require $r_1 r_0 (r_1 - 1)$, or $N(r_1 - 1)$,
complex multiplications and additions. The second step requires N
complex multiplications. The final step requires $N(r_0 - 1)$
complex multiplications and additions. Therefore the total number
of complex multiplications using Eq. (2.8) is $N(r_0 + r_1 - 1)$
and the number of complex additions is $N(r_0 + r_1 - 2)$. This is a
reduction from $O(N^2)$ to $O(N(r_1 + r_0))$, i.e., the decomposition
of the DFT reduces the computational complexity.

The numbers $r_0$ and $r_1$ are called radices. If $r_0$ and $r_1$ are equal to r,
the number system is called a radix-r system. Otherwise, it is called a
mixed-radix system. The multiplications with $W_N^{n_0 k_0}$ are called
twiddle factor multiplications.

Example 2.1. For $N = 8$, we apply the basic formula by
decomposing $N = 8 = 4 \times 2$ with $r_0 = 2$ and $r_1 = 4$. This results in the
8-point DFT example given in the section above, which is shown in
Fig. 2.1. It is a mixed-radix FFT algorithm.

A closer study of the given 8-point DFT example shows that the
input data do not need to be stored in memory after the computation
of the two 4-point DFTs, since the results can overwrite the inputs.
This reduces the total memory size, which is important for
memory-constrained systems. An algorithm with this property is
called an in-place algorithm.
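The three steps of Eq. (2.8) can be sketched as follows (Python/NumPy assumed; the data is stored as a matrix a[k1, k0] per Eq. (2.5), and the function names are illustrative only):

```python
import numpy as np

def dft(v):
    N = len(v)
    l = np.arange(N)
    return np.array([np.sum(v * np.exp(-2j * np.pi * n * l / N)) for n in range(N)])

def fft_two_factor(x, r1, r0):
    """Eq. (2.8) for N = r1*r0: r1-point DFTs, twiddle factors, r0-point DFTs."""
    N = r1 * r0
    a = np.asarray(x).reshape(r1, r0)                   # a[k1, k0] = x(r0*k1 + k0)
    # Step 1: r0 different r1-point DFTs over k1, giving index n0
    b = np.stack([dft(a[:, k0]) for k0 in range(r0)], axis=1)    # b[n0, k0]
    # Step 2: twiddle factor multiplications W_N^(n0*k0)
    n0 = np.arange(r1).reshape(-1, 1)
    k0 = np.arange(r0)
    b = b * np.exp(-2j * np.pi * n0 * k0 / N)
    # Step 3: r1 different r0-point DFTs over k0, giving index n1
    c = np.stack([dft(b[i, :]) for i in range(r1)], axis=0)      # c[n0, n1]
    X = np.empty(N, dtype=complex)
    X[r1 * np.arange(r0).reshape(-1, 1) + np.arange(r1)] = c.T   # X(r1*n1 + n0)
    return X

x = np.random.randn(8) + 1j * np.random.randn(8)
assert np.allclose(fft_two_factor(x, r1=4, r0=2), np.fft.fft(x))   # Example 2.1
```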
2.1.3. Generalized Formula
If $r_0$ and/or $r_1$ are not prime, a further reduction of the
computational complexity can be achieved by applying the divide and
conquer approach recursively to the $r_1$-point and/or $r_0$-point DFTs [7].

Let $N = r_{p-1} \times r_{p-2} \times \cdots \times r_0$. The indices k and n can be
written as

$$k = r_0 r_1 \cdots r_{p-2}\, k_{p-1} + \cdots + r_0 k_1 + k_0 \qquad (2.9)$$

$$n = r_{p-1} r_{p-2} \cdots r_1\, n_{p-1} + \cdots + r_{p-1} n_1 + n_0 \qquad (2.10)$$

where $k_i, n_{p-i-1} \in [0, r_i - 1]$ for $i = 0, 1, \ldots, p-1$.

The factorization of $W_N^{nk}$ can be expressed as

$$W_N^{nk} = W_N^{n(r_0 r_1 \cdots r_{p-2} k_{p-1} + \cdots + r_0 k_1 + k_0)} = W_N^{r_0 r_1 \cdots r_{p-2}\, n k_{p-1}} \cdots W_N^{r_0\, n k_1}\, W_N^{n k_0} = W_{r_{p-1}}^{n k_{p-1}} \cdots W_{N/r_0}^{n k_1}\, W_N^{n k_0} \qquad (2.11)$$

where $W_N^{r_0 r_1 \cdots r_i\, n k_{i+1}} = W_{N/(r_0 r_1 \cdots r_i)}^{n k_{i+1}}$.

Eq. (1.1) can then be written as

$$X(n_{p-1}, n_{p-2}, \ldots, n_0) = \sum_{k_0=0}^{r_0-1} \sum_{k_1=0}^{r_1-1} \cdots \left( \sum_{k_{p-1}=0}^{r_{p-1}-1} x(k_{p-1}, k_{p-2}, \ldots, k_0)\, W_{r_{p-1}}^{n_0 k_{p-1}} \right) W_{r_{p-1} r_{p-2}}^{n k_{p-2}} \cdots W_{N/r_0}^{n k_1}\, W_N^{n k_0} \qquad (2.12)$$

Note that the inner sum can be recognized as an $r_{p-1}$-point
DFT for each $n_0$. Define

$$x_1(n_0, k_{p-2}, \ldots, k_0) = \sum_{k_{p-1}=0}^{r_{p-1}-1} x(k_{p-1}, k_{p-2}, \ldots, k_0)\, W_{r_{p-1}}^{n_0 k_{p-1}} \qquad (2.13)$$

With Eq. (2.13), the index $k_{p-1}$ is "replaced" by $n_0$. Equation (2.12)
can now be rewritten as

$$X(n_{p-1}, n_{p-2}, \ldots, n_0) = \sum_{k_0=0}^{r_0-1} \sum_{k_1=0}^{r_1-1} \cdots \left( \sum_{k_{p-2}=0}^{r_{p-2}-1} x_1(n_0, k_{p-2}, \ldots, k_0)\, W_{r_{p-1} r_{p-2}}^{n k_{p-2}} \right) \cdots W_{N/r_0}^{n k_1}\, W_N^{n k_0} \qquad (2.14)$$

The term $W_{r_{p-1} r_{p-2}}^{n k_{p-2}}$ can be factorized, using Eq. (2.10)
and the periodicity of $W$, as

$$W_{r_{p-1} r_{p-2}}^{n k_{p-2}} = W_{r_{p-1} r_{p-2}}^{(r_{p-1} n_1 + n_0) k_{p-2}} = W_{r_{p-1} r_{p-2}}^{n_0 k_{p-2}}\, W_{r_{p-2}}^{n_1 k_{p-2}}$$

The inner sum over $k_{p-2}$ in Eq. (2.14) can then be written as

$$\sum_{k_{p-2}=0}^{r_{p-2}-1} x_1(n_0, k_{p-2}, \ldots, k_0)\, W_{r_{p-1} r_{p-2}}^{n k_{p-2}} = \sum_{k_{p-2}=0}^{r_{p-2}-1} \left[ x_1(n_0, k_{p-2}, \ldots, k_0)\, W_{r_{p-1} r_{p-2}}^{n_0 k_{p-2}} \right] W_{r_{p-2}}^{n_1 k_{p-2}} \qquad (2.15)$$

which can be computed through twiddle factor multiplications and
$r_{p-2}$-point DFTs:

$$x_1'(n_0, k_{p-2}, \ldots, k_0) = x_1(n_0, k_{p-2}, \ldots, k_0)\, W_{r_{p-1} r_{p-2}}^{n_0 k_{p-2}} \qquad (2.16)$$

$$x_2(n_0, n_1, k_{p-3}, \ldots, k_0) = \sum_{k_{p-2}=0}^{r_{p-2}-1} x_1'(n_0, k_{p-2}, \ldots, k_0)\, W_{r_{p-2}}^{n_1 k_{p-2}} \qquad (2.17)$$

Eq. (2.14) can then be rewritten as

$$X(n_{p-1}, n_{p-2}, \ldots, n_0) = \sum_{k_0=0}^{r_0-1} \sum_{k_1=0}^{r_1-1} \cdots \left( \sum_{k_{p-3}=0}^{r_{p-3}-1} x_2(n_0, n_1, k_{p-3}, \ldots, k_0)\, W_{r_{p-1} r_{p-2} r_{p-3}}^{n k_{p-3}} \right) \cdots W_{N/r_0}^{n k_1}\, W_N^{n k_0} \qquad (2.18)$$

This process, from Eq. (2.14) to Eq. (2.17), can be repeated
$p-2$ times until the index $k_0$ is replaced by $n_{p-1}$:

$$x_p(n_0, n_1, \ldots, n_{p-1}) = \sum_{k_0=0}^{r_0-1} x_{p-1}'(n_0, \ldots, n_{p-2}, k_0)\, W_{r_0}^{n_{p-1} k_0} \qquad (2.19)$$

Eq. (2.14) can then be expressed as

$$X(n_{p-1}, n_{p-2}, \ldots, n_0) = x_p(n_0, n_1, \ldots, n_{p-1}) \qquad (2.20)$$

Eq. (2.20) reorders the output data to natural order. This process
is called unscrambling. The unscrambling process requires a special
addressing mode that converts the address $(n_0, \ldots, n_{p-1})$ to
$(n_{p-1}, n_{p-2}, \ldots, n_0)$. In the radix-2 number system, each $n_i$
represents one bit. The addressing for unscrambling then reverses the
address bits and is hence called bit-reverse addressing.
In a radix-r (r > 2) number system, it is called digit-reverse
addressing.

Example 2.2. 8-point DFT. Let $N = 2 \times 2 \times 2$. The factorization
of $W_N^{nk}$ can be expressed as

$$W_N^{nk} = \left( W_2^{n_0 k_2} \right) \left( W_4^{n_0 k_1}\, W_2^{n_1 k_1} \right) \left( W_8^{(2 n_1 + n_0) k_0}\, W_2^{n_2 k_0} \right) \qquad (2.21)$$

By using the generalized formula, the 8-point DFT can be
computed with the following sequence of equations [7]:

$$x_1(n_0, k_1, k_0) = \sum_{k_2=0}^{1} x(k_2, k_1, k_0)\, W_2^{n_0 k_2} \qquad (2.22)$$

$$x_1'(n_0, k_1, k_0) = x_1(n_0, k_1, k_0)\, W_4^{n_0 k_1} \qquad (2.23)$$

$$x_2(n_0, n_1, k_0) = \sum_{k_1=0}^{1} x_1'(n_0, k_1, k_0)\, W_2^{n_1 k_1} \qquad (2.24)$$

$$x_2'(n_0, n_1, k_0) = x_2(n_0, n_1, k_0)\, W_8^{(2 n_1 + n_0) k_0} \qquad (2.25)$$

$$x_3(n_0, n_1, n_2) = \sum_{k_0=0}^{1} x_2'(n_0, n_1, k_0)\, W_2^{n_2 k_0} \qquad (2.26)$$

$$X(n_2, n_1, n_0) = x_3(n_0, n_1, n_2) \qquad (2.27)$$

where Eq. (2.22) corresponds to the $W_2^{n_0 k_2}$ term in Eq. (2.21), Eq.
(2.23) corresponds to the $W_4^{n_0 k_1}$ term, and so on.
The result is shown in Fig. 2.2.

Figure 2.2. 8-point DFT with the Cooley-Tukey algorithm.
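Bit-reverse addressing for the unscrambling step can be sketched as follows (Python assumed; the radix-2 case of the digit reversal described above):

```python
def bit_reverse(addr, n_bits):
    """Reverse an n_bits-bit address, e.g., 6 = 110b -> 011b = 3 for n_bits = 3."""
    rev = 0
    for _ in range(n_bits):
        rev = (rev << 1) | (addr & 1)    # shift in the lowest bit of addr
        addr >>= 1
    return rev

# Scrambled order of an 8-point transform (cf. the input ordering of Fig. 2.2):
print([bit_reverse(i, 3) for i in range(8)])   # [0, 4, 2, 6, 1, 5, 3, 7]
```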
The recursive usage of the divide and conquer approach for an
8-point DFT is shown in Fig. 2.3. As illustrated in the figure, the inputs
are divided into smaller and smaller groups. This class of algorithms
is called decimation-in-time (DIT) algorithms.

Figure 2.3. The divide and conquer approach for DFT.
2.2. Sande-Tukey FFT Algorithms
Another class of algorithms is called decimation-in-frequency (DIF)
algorithms, which divide the outputs into smaller and smaller DFTs.
This kind of algorithm is also called the Sande-Tukey FFT
algorithm.
The computation of the DFT with a DIF algorithm is similar to the
computation with a DIT algorithm. For the sake of simplicity, we do
not derive the DIF algorithm but illustrate it with an
example.
Example 2.3. 8-point DFT. The factorization of $W_N^{nk}$ can be
expressed as

$$W_N^{nk} = \left( W_2^{k_2 n_0}\, W_8^{(2 k_1 + k_0) n_0} \right) \left( W_2^{k_1 n_1}\, W_4^{k_0 n_1} \right) \left( W_2^{k_0 n_2} \right) \qquad (2.28)$$
The sequential equations can be constructed in a similar way as
those in Eq. (2.22) through Eq. (2.27). The result is shown in Fig.
2.4.

Figure 2.4. 8-point DFT with DIF algorithm.

The computation of the DFT with DIF algorithms can be expressed
with sequential equations similar to those of the DIT
algorithms. Using the same notation for the indices n and k as in Eq.
(2.9) and Eq. (2.10), the computation of an N-point DFT with the DIF
algorithm is

$$x_i(n_0, \ldots, n_{i-1}, k_{p-i-1}, \ldots, k_0) = \left[ \sum_{k_{p-i}=0}^{r_{p-i}-1} x_{i-1}(n_0, \ldots, n_{i-2}, k_{p-i}, \ldots, k_0)\, W_{r_{p-i}}^{n_{i-1} k_{p-i}} \right] W_{N/(r_{p-1} \cdots r_{p-i+1})}^{\,n_{i-1} (r_{p-i-2} \cdots r_0\, k_{p-i-1} + \cdots + r_0 k_1 + k_0)} \qquad (2.29)$$

where $x_0(k_{p-1}, k_{p-2}, \ldots, k_0) = x(k_{p-1}, k_{p-2}, \ldots, k_0)$ and
$i = 1, 2, \ldots, p$. The unscrambling process is done by

$$X(n_{p-1}, \ldots, n_0) = x_p(n_0, \ldots, n_{p-1}) \qquad (2.30)$$

Comparing Fig. 2.4 with Fig. 2.2, we find that the signal-
flow graph (SFG) for DFT computation with the DIF algorithm is the
transposition of that with the DIT algorithm. Hence, many properties of
the DIT and DIF algorithms are the same. For instance, the computation
workloads for the DIT and DIF algorithms are the same. The
unscrambling process is required for both DIF and DIT algorithms.
However, there are clear differences between the DIF and DIT
algorithms, e.g., the position of the twiddle factor multiplications. The
DIF algorithms have the twiddle factor multiplications after the
DFTs, and the DIT algorithms have the twiddle factor multiplications
before the DFTs.
2.3. Prime Factor FFT Algorithms
In the Cooley-Tukey and Sande-Tukey algorithms, twiddle factor
multiplications are required for the DFT computation. If the
factors in the decomposition of N are relatively prime, there exists
another type of FFT algorithm, the prime factor FFT algorithm, which
eliminates the twiddle factor multiplications.

In the Cooley-Tukey and Sande-Tukey algorithms, the indices n and
k are expressed with Eq. (2.9) and Eq. (2.10). This representation of the
index numbers is called index mapping. If $r_1$ and $r_0$ are relatively prime,
i.e., the greatest common divisor gcd($r_1$, $r_0$) = 1, there exists another
index mapping, the so-called Good's mapping [19]. An index n can be
expressed as

$$n = \left( r_0 \left( n_1 r_0^{-1} \bmod r_1 \right) + r_1 \left( n_0 r_1^{-1} \bmod r_0 \right) \right) \bmod N \qquad (2.31)$$

where $N = r_1 \times r_0$, $0 \le n_1 < r_1$, $0 \le n_0 < r_0$, $r_0^{-1}$ is the
multiplicative inverse of $r_0$ modulo $r_1$, i.e., $r_0 r_0^{-1} \bmod r_1 = 1$,
and $r_1^{-1}$ is the multiplicative inverse of $r_1$ modulo $r_0$. This
mapping is a variant of the Chinese Remainder Theorem.
Example 2.4. Construct the index mapping for the 15-point DFT inputs
according to Good's mapping.

We have $N = 3 \times 5$ with $r_1 = 3$ and $r_0 = 5$. $r_1^{-1}$ is 2 since
$r_1 r_1^{-1} \bmod r_0 = 3 \cdot 2 \bmod 5 = 1$, and $r_0^{-1} = 2$. The index
can be computed according to

$$k = \left( 5\,(2 k_1 \bmod 3) + 3\,(2 k_0 \bmod 5) \right) \bmod 15 \qquad (2.32)$$

The mapping can be illustrated with an index matrix, where row $k_1$
and column $k_0$ give the index $k$:

     0   6  12   3   9
    10   1   7  13   4
     5  11   2   8  14

Figure 2.5. Good's mapping for 15-point DFT inputs.

The mapping for the outputs is simpler. It can be constructed by
$n = (r_0 n_1 + r_1 n_0) \bmod N$ ($0 \le n_1 < r_1$, $0 \le n_0 < r_0$).

Example 2.5. Construct the index mapping for the 15-point DFT outputs.

We have $N = 3 \times 5$ with $r_1 = 3$ and $r_0 = 5$. The index
mapping for the outputs can be constructed by
$n = (5 n_1 + 3 n_0) \bmod 15$ for $0 \le n_1 < 3$ and $0 \le n_0 < 5$. The
result is shown in Fig. 2.6, where row $n_1$ and column $n_0$ give the index $n$:

     0   3   6   9  12
     5   8  11  14   2
    10  13   1   4   7

Figure 2.6. Index mapping for 15-point DFT outputs.
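The index matrices in Fig. 2.5 and Fig. 2.6 can be regenerated with a short sketch (Python assumed; pow(x, -1, m) computes a multiplicative inverse modulo m):

```python
N, r1, r0 = 15, 3, 5
r0_inv = pow(r0, -1, r1)    # inverse of r0 modulo r1 -> 2
r1_inv = pow(r1, -1, r0)    # inverse of r1 modulo r0 -> 2

# Input mapping, Eq. (2.32): rows k1 = 0..2, columns k0 = 0..4 (Fig. 2.5)
for k1 in range(r1):
    print([(r0 * (k1 * r0_inv % r1) + r1 * (k0 * r1_inv % r0)) % N
           for k0 in range(r0)])

# Output mapping n = (r0*n1 + r1*n0) mod N (Fig. 2.6)
for n1 in range(r1):
    print([(r0 * n1 + r1 * n0) % N for n0 in range(r0)])
```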
The computation with prime factor FFT algorithms is similar to
the computation with the Cooley-Tukey algorithm. It can be divided
into two steps:
1 Compute $r_0$ different $r_1$-point DFTs. This performs column-wise
DFTs on the input index matrix.
2 Compute $r_1$ different $r_0$-point DFTs. This performs row-wise DFTs
on the output index matrix.

Example 2.6. 15-point DFT with the prime factor mapping FFT
algorithm.

The input and output index matrices can be constructed as shown
in Fig. 2.5 and Fig. 2.6. Following the computation steps above, the
15-point DFT can be performed by five 3-point
DFTs followed by three 5-point DFTs.

The 15-point DFT with the prime factor mapping FFT algorithm is
shown in Fig. 2.7.
Figure 2.7. 15-point FFT with prime factor mapping.
The prime factor mapping based FFT algorithm above is also an
in-place algorithm.
Swapping the input and output index matrices gives another FFT
algorithm, which does not need twiddle factor multiplications outside
the butterflies either.
Although the prime factor FFT algorithms are similar to the
Cooley-Tukey and Sande-Tukey FFT algorithms, they are derived
from convolution based DFT computations [19] [49] [31]. This later
led to the Winograd Fourier Transform Algorithm (WFTA) [61].
2.4. Other FFT Algorithms
In this section, we discuss two other FFT algorithms. One is the
split-radix FFT algorithm (SRFFT) and the other is the Winograd
Fourier Transform Algorithm (WFTA).
2.4.1. Split-Radix FFT Algorithm
Split-radix FFT algorithms (SRFFT) were proposed nearly
simultaneously by several authors in 1984 [17] [18]. The algorithms
belong to the FFT algorithms with twiddle factors. As a matter of fact,
the split-radix FFT algorithms are based on an observation about the
Cooley-Tukey and Sande-Tukey FFT algorithms: different
decompositions can be used for different parts of an
algorithm. This gives the possibility to select the most suitable
algorithm for each part in order to reduce the computational
complexity.
For instance, the signal-flow graph (SFG) for a 16-point radix-2
DIF FFT algorithm is shown in Fig. 2.8.
The SRFFT algorithms exploit this idea by using both a radix-2
and a radix-4 decomposition in the same FFT algorithm. It is
obvious that all twiddle factors are equal to 1 for the even indexed
outputs with a radix-2 FFT computation, i.e., no twiddle factor
multiplications are required there. In the radix-4 FFT computation,
there is no such general rule (see Fig. 2.9). For the odd indexed
outputs, a radix-4 decomposition increases the computational
efficiency, because the four-point DFT is the largest multiplication-
free butterfly and the radix-4 FFT is more efficient than
the radix-2 FFT from the multiplication complexity point of view.
Consequently, the DFT computation uses different radix FFT
Figure 2.8. Signal-flow graph for a 16-point DIF FFT algorithm.
algorithms for odd and even indexed outputs. This reduces the
number of complex multiplications and additions/subtractions. A
16-point SRFFT is shown in Fig. 2.10.
Figure 2.9. Radix-4 DIF algorithm for 16-point DFT.

Figure 2.10. SFG for 16-point DFT with SRFFT algorithm.
Although the SRFFT algorithms are derived from an
observation about the radix-2 and radix-4 FFT algorithms, they cannot be
derived by index mapping. This could be the reason that the
algorithms were discovered so late [18]. The SRFFT can also be
generalized to lengths $N = p^k$, where p is a prime number [18].
2.4.2. Winograd Fourier Transform Algorithm
The Winograd Fourier transform algorithm (WFTA) [61] uses the
cyclic convolution method to compute the DFT. This is based on
Rader's idea [49] for prime-length DFT computation.
The computation of an N-point DFT (N being a product of two
co-prime numbers $r_1$ and $r_0$) with the WFTA can be divided into five
steps: two pre-addition steps, a multiplication step in the middle, and
two post-addition steps (see Fig. 2.11). The number of arithmetic
operations depends on N; the number of multiplications is $O(N)$.
The aim of Winograd's algorithm is to minimize the number of
multiplications, and the WFTA succeeds in minimizing the number of
multiplications to the smallest number known. However, this
minimization results in a complicated computation
ordering and a large increase in the other arithmetic operations, e.g.,
additions. Furthermore, the irregularity of the WFTA makes it
impractical for most real applications.
2.5. Performance Comparison
For the algorithm implementation, the computational load is of great
concern. Usually the numbers of additions and multiplications are
two important measures of the computational workload. We
compare the discussed algorithms from the addition and
multiplication complexity point of view.
Figure 2.11. General structure of WFTA: $r_1$-point input additions
($r_0$ sets), $r_0$-point input additions ($r_1$ sets), N-point
multiplications, $r_0$-point output additions ($r_1$ sets), and
$r_1$-point output additions ($r_0$ sets).
Due to the restrictions on the transform length for the prime factor
based algorithms and the WFTA, the comparison is not always for
exactly the same transform length but rather for a nearby transform
length.
2.5.1. Multiplication Complexity
Since multiplication has a large impact on the speed and the power
consumption, the multiplication complexity is important for the
selection of FFT algorithms.
In many DFT computations, both complex and real
multiplications are required. For the purpose of comparison, the
counting is based on the number of real multiplications. A complex
multiplication can be realized directly with 4 real multiplications
and 2 real additions, as shown in Fig. 2.12 (a). With a simple
transformation, the number of real multiplications can be reduced to
3, but the number of real additions increases to 3, as shown in Fig.
2.12 (b). We consider a complex multiplication as 3 real
multiplications and 3 real additions in the following analysis.
Figure 2.12. Realization of a complex multiplication:
(a) direct realization, (b) transformed realization.

For a DFT with transform length $N = 2^n$, the number of
complex multiplications can be estimated as half the total number of
butterfly operations, i.e., $N \log_2(N) / 2$. This number is
overestimated: for example, a complex multiplication with twiddle factor
$W_N^k$ does not require any multiplications when k is a multiple of
N/4. Furthermore, it requires only 2 real multiplications and 2
additions when k is an odd multiple of N/8. Taking these
simplifications into account, the number of real multiplications for a
DFT with the radix-2 algorithm and transform length $N = 2^n$ is
$M = (3N/2)\log_2 N - 5N + 8$ [25]. The radix-4 algorithm for a
DFT with transform length $N = 4^n$ requires
$M = (9N/8)\log_2 N - 43N/12 + 16/3$ real multiplications
[18]. For the split-radix FFT algorithm, the number of real
multiplications is $M = N\log_2 N - 3N + 4$ for a DFT with $N = 2^n$
[18].

If the transform length is a product of two or more co-prime
numbers, there is no simple analytic expression for the number of
real multiplications. However, there are lower bounds that can be
attained by algorithms for those transform lengths, and these lower
bounds can be computed [18].

As mentioned previously, the WFTA has been proven to have the
lowest number of multiplications for those transform lengths that are
less than 16; it requires the fewest multiplications of the
existing algorithms.

From the multiplication complexity point of view, the most
attractive algorithm is the WFTA, followed by the prime factor
algorithm, the split-radix algorithm, and the fixed-radix algorithms.
The number of real multiplications for various FFT algorithms on
complex data is shown in the following table [18].

N      Radix-2  Radix-4  SRFFT  PFA    WFTA
16     24       20       20     -      -
60     -        -        -      200    136
64     264      208      196    -      -
240    -        -        -      1100   632
256    1800     1392     1284   -      -
504    -        -        -      2524   1572
512    4360     -        3076   -      -
1008   -        -        -      5804   3548
1024   10248    7856     7172   -      -

Table 2.1. Multiplication complexity for various FFT algorithms.
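The closed-form multiplication counts above can be evaluated directly; the sketch below (Python assumed) reproduces the radix-2, radix-4, and SRFFT columns of Table 2.1:

```python
from math import log2

def m_radix2(N):                # N = 2**n
    return 3 * N / 2 * log2(N) - 5 * N + 8

def m_radix4(N):                # N = 4**n
    return 9 * N / 8 * log2(N) - 43 * N / 12 + 16 / 3

def m_srfft(N):                 # N = 2**n
    return N * log2(N) - 3 * N + 4

for N in (16, 64, 256, 1024):
    print(N, m_radix2(N), m_radix4(N), m_srfft(N))
# N = 1024 gives 10248, 7856, 7172, matching Table 2.1.
```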
2.5.2. Addition Complexity
In a radix-2 or radix-4 FFT algorithm, the addition and subtraction
operations are used for realizing the butterfly operations and the
complex multiplications. Since a subtraction has the same complexity
as an addition, we count a subtraction as equivalent to an addition.

The additions for the butterfly operations constitute the larger part of
the addition complexity. For each radix-2 butterfly operation (a 2-point
DFT), the number of real additions is four, since each complex
addition/subtraction requires two real additions. For a transform length
$N = 2^n$, a DFT requires N/2 radix-2 DFTs in each stage. So the
total number of real additions is $4 (N/2) \log_2 N$, or $2nN$, with
radix-2 FFT algorithms. For a transform length $N = 4^n$, a DFT
requires N/4 radix-4 DFTs in each stage. Each radix-4 DFT requires
8 complex additions/subtractions, i.e., 16 real additions. The total
number of real additions is $16 (N/4) \log_4 N$, or $4nN$. Both
radix-2 and radix-4 FFT algorithms require the same number of
butterfly additions for a DFT with a transform length that is a power of 4.

The number of additions required for the complex
multiplications is smaller than that for the butterfly operations.
Nevertheless, it cannot be ignored. As described previously, a
complex multiplication generally requires 3 additions. The exact
total number of real additions [25] is $A = (7N/2)\log_2 N - 5N + 8$
for a DFT with transform length $N = 2^n$ using the radix-2 algorithm.
The number of additions for a DFT with transform length $N = 4^n$ is
$A = (25N/8)\log_2 N - 43N/12 + 16/3$ for the radix-4 algorithm
[18]. The split-radix algorithm has the best result for the addition
complexity: $A = 3N\log_2 N - 3N + 4$ additions for an $N = 2^n$
DFT [18].
From the addition complexity point of view, the WFTA is a poor
choice. In fact, the irregularity and the increase in addition complexity
make the WFTA less attractive for practical implementations. The
number of real additions for various FFTs on complex data is given
in the following table [18].

N      Radix-2  Radix-4  SRFFT  PFA    WFTA
16     152      148      148    -      -
60     -        -        -      888    888
64     1032     976      964    -      -
240    -        -        -      4812   5016
256    5896     5488     5380   -      -
504    -        -        -      13388  14540
512    13566    -        12292  -      -
1008   -        -        -      29548  34668
1024   30728    28336    27652  -      -

Table 2.2. Addition complexity for various FFT algorithms.
2.6. Other Issues
Many issues are related to the FFT algorithm implementations, e.g.,
scaling and rounding considerations, inverse FFT implementation,
parallelism of FFT algorithms, in-place and/or in-order computation,
regularity of FFT algorithms, etc. We discuss the first two issues in
more detail.
2.6.1. Scaling and Rounding Issue
In hardware it is not possible to implement an algorithm with infinite
accuracy. To obtain sufficient accuracy, the scaling and rounding
effects must be considered.

Without loss of generality, we assume that the input data {x(n)}
are scaled, i.e., |x(n)| < 1/2 for all n. To avoid overflow of the number
range, we apply the safe scaling technique [58]. This ensures that an
overflow cannot occur. We take the 16-point DFT with the radix-2 DIF
FFT algorithm (see Fig. 2.8) as an example.

The basic operation for the radix-2 DIF FFT algorithm consists
of a radix-2 butterfly operation and a complex multiplication, as
shown in Fig. 2.13.

Figure 2.13. Basic operation for the radix-2 DIF FFT algorithm.

For two numbers u and v with |u| < 1/2 and |v| < 1/2, we have

$$|U| = |u + v| \le |u| + |v| < 1 \qquad (2.33)$$

$$|V| = |(u - v) \cdot W_N^{p}| = |u - v| \le |u| + |v| < 1 \qquad (2.34)$$

where the magnitude of the twiddle factor $W_N^{p}$ is equal to 1.

To retain the magnitude, the results must be scaled by a factor
1/2. After scaling, rounding is applied in order to keep the same
input and output wordlengths. This introduces an error, which is
called quantization noise. The noise for a real number is modeled as
an additive white noise source with zero mean and variance
$\Delta^2/12$, where $\Delta$ is the weight of the least significant bit.

Figure 2.14. Model for scaling and rounding of the radix-2 butterfly.

The additive noise for U and V is complex. Assume that
the quantization noises for U and V are $Q_U$ and $Q_V$, respectively. For
$Q_U$ we have

$$E\{Q_U\} = E\{Q_{Ure} + jQ_{Uim}\} = E\{Q_{Ure}\} + E\{jQ_{Uim}\} = 0 \qquad (2.35)$$

$$\mathrm{Var}\{Q_U\} = E\{Q_{Ure}^2 + Q_{Uim}^2\} = E\{Q_{Ure}^2\} + E\{Q_{Uim}^2\} = \frac{2\Delta^2}{12} \qquad (2.36)$$

Since the quantization noise is independent of the twiddle factor
multiplication, we have

$$E\{Q_V W_N^{p}\} = E\{Q_V\} \cdot E\{W_N^{p}\} = 0 \qquad (2.37)$$

$$\mathrm{Var}\{Q_V W_N^{p}\} = E\{(Q_V W_N^{p})(Q_V W_N^{p})^{*}\} = E\{Q_V Q_V^{*}\} = \frac{2\Delta^2}{12} \qquad (2.38)$$
After the analysis of the basic radix-2 butterfly operation, we
consider the scaling and quantization effects in an 8-point DIF FFT
algorithm. The noise propagation path for the output X(0) is
highlighted with bold solid lines in Fig. 2.15.

Figure 2.15. Noise propagation.

For the sake of clarity, we assume that $\Delta$ is equal for each stage,
i.e., the internal wordlength is the same for all stages. If we analyze
backwards from X(0), i.e., from stage l back to stage 1, it is easy to find
that the noise from stage l-1 is scaled with 1/2 to stage l, and that stage
l-1 has exactly twice as many noise sources as stage l. Generally, if the
transform length is N and the number of stages is n, where $N = 2^n$,
the variance of a noise source from stage l is scaled with
$(1/2)^{2(n-l)}$ and the number of noise sources in stage l is $2^{n-l}$.
Hence the total quantization noise variance for an output X(k) is

$$\mathrm{Var}\{Q_{X(k)}\} = \sum_{l=1}^{n} 2^{n-l}\, \frac{1}{2^{2(n-l)}}\, \frac{\Delta^2}{6} = \frac{\Delta^2}{6} \sum_{l=1}^{n} \frac{1}{2^{n-l}} = \frac{\Delta^2}{6}\left(2 - \frac{1}{2^{n-1}}\right) \qquad (2.39)$$
If the input sequence is zero-mean white noise with variance $\delta^2$,
the variance of an output X(k) can be derived by the following
equation

$$\mathrm{Var}\{X(k)\} = E\{X(k) X^{*}(k)\} = E\left\{ \frac{1}{N} \sum_{n=0}^{N-1} x(n) W_N^{nk} \cdot \frac{1}{N} \sum_{m=0}^{N-1} x^{*}(m) W_N^{-mk} \right\} = \frac{1}{N^2} \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} E\{x(n) x^{*}(m)\}\, W_N^{(n-m)k} = \frac{1}{N^2} \sum_{n=0}^{N-1} E\{x(n) x^{*}(n)\} = \frac{1}{N^2}\, N \delta^2 = \frac{\delta^2}{N} \qquad (2.40)$$

where the factor 1/N comes from the safe scaling with 1/2 in each of
the n stages and $E\{x(n) x^{*}(m)\} = 0$ for $n \ne m$ from the white noise
assumption. If $\Delta_{in}$ is the weight of the least significant bit of the real
or imaginary part of the input, the input variance is equal to
$\delta^2 = 2\Delta_{in}^2 / 12$. The signal-to-noise ratio (SNR) for the output
X(k) is therefore [60]

$$\mathrm{SNR} = \frac{(2\Delta_{in}^2/12)/N}{\dfrac{\Delta^2}{6}\left(2 - \dfrac{1}{2^{n-1}}\right)} = \frac{\Delta_{in}^2}{\Delta^2} \cdot \frac{1/2^n}{2 - 2^{-n+1}} = \frac{\Delta_{in}^2}{\Delta^2} \cdot \frac{1}{2\,(2^n - 1)}$$

For a radix-r DIF FFT algorithm, a similar analysis [60] yields

$$\mathrm{SNR} = \frac{(2\Delta_{in}^2/12)/N}{\dfrac{\Delta^2}{6}\left(2 - \dfrac{1}{r^{n-1}}\right)} = \frac{\Delta_{in}^2}{\Delta^2} \cdot \frac{1/r^n}{2 - r^{-n+1}} = \frac{\Delta_{in}^2}{\Delta^2} \cdot \frac{1}{2N - r}$$

This result, which is based on the white noise assumption, can be
used to determine the required internal wordlength.

The finite wordlength effect of finite precision coefficients is
more complicated. Simulation is typically used to determine the
wordlength of the coefficients.
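The SNR expression can be used to size the internal wordlength. The sketch below (Python assumed; guard_bits denotes the number of extra internal fractional bits, so that $\Delta_{in}/\Delta = 2^{guard\_bits}$) estimates how many guard bits keep the FFT round-off noise below the input quantization noise:

```python
from math import ceil, log2, log10

def snr_db(N, r, guard_bits):
    """SNR (dB) from the radix-r expression above: (D_in/D)^2 / (2N - r)."""
    return 10 * log10(4.0 ** guard_bits / (2 * N - r))

N, r = 1024, 2
g = ceil(log2(2 * N - r) / 2)      # smallest g with 4**g >= 2N - r
print(g, snr_db(N, r, g))          # 6 guard bits -> SNR of about +3 dB
```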
2.6.2. IDFT Implementation
An OFDM system requires both the DFT and the IDFT for the signal
processing. The IDFT implementation is therefore also critical for the
OFDM system.

There are various approaches to the IDFT implementation. The
straightforward one is to compute the IDFT directly according to Eq.
(1.2), which has a computational complexity of $O(N^2)$. This
approach is obviously not efficient.

The second approach is similar to the FFT computation. If we ignore
the scaling factor 1/N, the only difference between the DFT and the
IDFT is the twiddle factor, which is $W_N^{-nk}$ instead of $W_N^{nk}$. This
can easily be realized by changing the read addresses of the twiddle
factor ROM(s) for the twiddle factor multiplications. It also requires a
reordering of the input when a radix-r DFT is used. This approach adds
an overhead to each butterfly operation and changes the access order
of the coefficient ROM.

The third approach converts the computation of the IDFT into a
computation of a DFT, as shown by the following equation:

$$x(k) = \frac{1}{N} \sum_{n=0}^{N-1} X(n)\, e^{j2\pi nk/N} = \frac{1}{N} \sum_{n=0}^{N-1} \left[ X_{re}(n)\, e^{j2\pi nk/N} + j X_{im}(n)\, e^{j2\pi nk/N} \right] = \left( \frac{1}{N} \sum_{n=0}^{N-1} \left[ X_{re}(n)\, e^{-j2\pi nk/N} - j X_{im}(n)\, e^{-j2\pi nk/N} \right] \right)^{*} = \left( \frac{1}{N} \sum_{n=0}^{N-1} X^{*}(n)\, e^{-j2\pi nk/N} \right)^{*}$$

where the sum within the parentheses is by definition a DFT (of the
conjugated sequence) and $a^{*}$ denotes the conjugate of $a$.
Swapping the real and imaginary parts of a complex number
corresponds to conjugation followed by a multiplication by j, and the
j factors from an input swap and an output swap cancel. Hence, the
IDFT can be computed with a DFT by adding two swaps and one
scaling: swap the real and imaginary parts of the input before the DFT
computation, swap the real and imaginary parts of the output from the
DFT, and scale with the factor 1/N.
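The third approach can be sketched as follows (Python/NumPy assumed; swap() exchanges the real and imaginary parts as described above):

```python
import numpy as np

def swap(z):
    return z.imag + 1j * z.real          # swap real and imaginary parts

def idft_via_dft(X):
    """IDFT via a forward DFT: input swap, DFT, output swap, scaling by 1/N."""
    return swap(np.fft.fft(swap(X))) / len(X)

X = np.random.randn(16) + 1j * np.random.randn(16)
assert np.allclose(idft_via_dft(X), np.fft.ifft(X))
```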
2.7. Summary
In this chapter we discussed the most commonly used FFT
algorithms, e.g., the Cooley-Tukey and Sande-Tukey algorithms.
Each computation step was given in detail for the Cooley-Tukey
algorithms. Other algorithms, like the prime factor algorithm, the
split-radix algorithm, and the WFTA, were also discussed.
We compared the different algorithms in terms of the number of
additions and multiplications. Some other aspects, for instance
memory requirements, will be discussed later.
3. LOW POWER TECHNIQUES
Low power consumption has emerged as a major challenge in the
design of integrated circuits.
In this chapter, we discuss the basic principles for power
consumption in standard CMOS circuits. Afterwards, a review of
low-power techniques for CMOS circuits is given.
3.1. Power Dissipation Sources
In CMOS circuits, the main contributions to the power consumption
are from short-circuit, leakage, and switching currents. In the
following subsections, we introduce them separately.
3.1.1. Short-Circuit Power
In a static CMOS circuit, there are two complementary networks: the
p-network (pull-up) and the n-network (pull-down). The logic functions
of the two networks are complementary. Normally, when the input
and output states are stable, only one network is turned on and
conducts the output either to the power supply node or to the ground
node, while the other network is turned off and blocks the current from
flowing. A short-circuit current exists during the transitions, when one
network is turned on and the other network is still active. For
example, assume the input signal to an inverter is switching from 0 to
$V_{dd}$. There exists a short time interval where the input voltage is
larger than $V_{tn}$ but less than $V_{dd} - |V_{tp}|$. During this time
interval, both the
PMOS transistor (p-network) and the NMOS transistor (n-network) are
turned on, and the short-circuit current flows through both kinds of
transistors from the power supply line to ground.
The exact analysis of the short-circuit current, even in a simple inverter
[6], is complex; it can be studied by simulation using SPICE. It is
observed that the short-circuit current is proportional to the slope of
the input signals, the output loads, and the transistor sizes [54]. The
short-circuit current typically consumes less than 10% of the total
power in a "well-designed" circuit [54].
3.1.2. Leakage Power
There are two contributions to the leakage currents: one from the
currents that flow through the reverse-biased diodes, and the other
from the currents that flow through non-conducting transistors.
The leakage currents are proportional to the leakage area and depend
exponentially on the threshold voltage. The leakage currents depend on
the technology and cannot be modified by the designers, except in
some logic styles.
The leakage current is on the order of picoamperes with current
technologies, but it will increase as the threshold voltage is reduced. In
some cases, like large RAMs, the leakage current is one of the main
concerns. The leakage current is currently not a severe problem in
most digital designs. However, the power consumed by the leakage
currents can be as large as the power consumed by the switching
currents for a 0.06 µm technology. The usage of multiple threshold
voltages can reduce the leakage current in deep-submicron
technologies.

Figure 3.1. Leakage current types: (a) reverse biased diode
current, (b) subthreshold leakage current.
3.1.3. Switching Power
The switching currents are due to the charging and discharging of
node capacitances. The node capacitances mainly include gate,
overlapping, and interconnection capacitances.
The power consumed by the switching currents [63] can be expressed
as

$$P = \alpha C_L f V_{dd}^{2} / 2 \qquad (3.1)$$

where $\alpha$ is the switching activity factor, $C_L$ is the capacitance load, $f$
is the clock frequency, and $V_{dd}$ is the power supply voltage.
The equation shows that the switching power depends on a few
quantities that are readily observable and measurable in CMOS
circuits. It is applicable to almost every digital circuit and gives
guidance for low power design.
The power consumed by the switching currents is the dominant part
of the total power consumption. Reducing the switching currents is the
focus of most low power design techniques.
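As a worked illustration with assumed values (not from the thesis):
for $\alpha = 0.1$, $C_L = 20$ pF, $f = 50$ MHz, and $V_{dd} = 1.5$ V, Eq. (3.1) gives

$$P = 0.1 \cdot 20 \times 10^{-12} \cdot 50 \times 10^{6} \cdot 1.5^{2} / 2 \approx 0.11\ \mathrm{mW}$$

and halving $V_{dd}$ would reduce this power by a factor of four, which
is why voltage scaling is so effective.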
3.2. Low Power Techniques
Low power techniques can be discussed at various levels of
abstraction: system level, algorithm and architecture level, logic
level, circuit level, and technology level. Fig. 3.2 shows some
examples of techniques at the different levels.

Figure 3.2. Low-power design methodology at different abstraction
levels: system (partitioning, power-down), algorithm (parallelism,
pipelining), architecture (voltage scaling), logic (logic styles and
manipulation, data encoding), circuit (energy recovery, transistor
sizing), and technology (threshold reduction, double-threshold
devices).

In the following, we give an overview of the different low power
techniques, organized by abstraction level.
3.2.1. System Level
A system typically consists of both hardware and software
components, which affect the power consumption.
The system design includes the hardware/software partitioning,
hardware platform selection (application-specific or general-
purpose processors), resource sharing (scheduling) strategy, etc. The
system design usually has the largest impact on the power
consumption and hence the low power techniques applied at this
level have the most potential for power reduction.
At the system level, it is hard to find the best solution for low
power in the large design space and there is a shortage of accurate
power analysis tools at this level. However, if, for example, the
instruction-level power models for a given processor are available,
software power optimization can be performed [56]. It is observed
that faster code and frequent use of the cache are most likely to
reduce the power consumption. The order of instructions also has
an impact on the internal switching within processors and hence on
the power consumption.
Power-down and clock gating are two of the most used low
power techniques at the system level. Non-active hardware units are
shut down to save power. The clock drivers, which often
consume 30-40% of the total power, can be gated to
reduce the switching activities, as illustrated in Fig. 3.3.

Figure 3.3. Clock gating.

The power-down can be extended to the whole system. This is
called sleep mode and is widely used in low power processors. The
StrongARM SA-1100 processor has three power states and the
average power varies for each state [29]. These power states can be
utilized by the software through the advanced configuration and power
management interface (ACPI). In recent years, power
management has gained a lot of attention in operating system design.
For example, the Microsoft desktop operating system supports
advanced power management (APM).

Figure 3.4. Power states for the StrongARM SA-1100 processor
(RUN: 400 mW, IDLE: 50 mW, SLEEP: 160 µW).
The system is designed for peak performance. However, the
computation requirement is time varying. Adapting the clock
frequency and/or dynamically scaling the voltage to match the
performance constraints is another low power technique. The lower
performance requirement during certain time intervals can be used
to reduce the
power supply voltage. This requires either a feedback mechanism
(load monitoring and voltage control) or predetermined timing to
activate the voltage down-scaling.
Another less explored domain for low power design is the use of
asynchronous design techniques. Asynchronous designs have
many attractive features, like non-global clocking, automatic power-
down, no spurious transitions, and low peak current. The power
consumption can be reduced further by combining the
asynchronous design technique with other low power techniques, for
instance dynamic voltage scaling [42]. This is illustrated
in Fig. 3.5.

Figure 3.5. Asynchronous design with dynamic voltage scaling.
3.2.2. Algorithm Level
The selection of algorithm has a large impact on the power
consumption. For example, using the fast Fourier transform instead of
direct computation of the DFT reduces the number of operations
by a factor of 102.4 for a 1024-point Fourier transform, and the
power consumption is likely to be reduced by a similar factor.
The task of algorithm design is to select the most energy-
efficient algorithm that just satisfies the constraints. The cost of an
algorithm includes the computation part and the
communication/storage part. The complexity measurement for an
algorithm includes the number of operations and the cost of
communication/storage. Reducing the number of operations, the
cost per operation, and the long-distance communications are the key
issues in algorithm selection.
One important technique for low power at the algorithmic level
is algorithmic transformations [45] [46]. This technique exploits the
complexity, concurrency, regularity, and locality of an algorithm.
Reducing the complexity of an algorithm reduces the number of
operations and hence the power consumption. The possibility of
increasing concurrency in an algorithm allows the use of other
techniques, e.g., voltage scaling, to reduce the power consumption.
The regularity and locality of an algorithm affect the control and
communications in the hardware.
The loop unrolling technique [9] [10] is a transformation that
aims to enhance the speed, but it can also be used to reduce
the power consumption. With loop unrolling, the critical path can be
reduced and hence voltage scaling can be applied to reduce the
power consumption. In Fig. 3.6, the unrolling reduces the critical
path and allows a voltage reduction of 26% [10]. This reduces the
power consumption by 20% even though the capacitance load increases
by 50% [10]. Furthermore, this technique can be combined with
other techniques at the architectural level, for instance pipelining and
interleaving, to save more power.
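As a software analogy of the transformation in Fig. 3.6, the
following C sketch unrolls the first-order recursion
y(n) = b0·x(n) + a1·y(n−1) by a factor of two (coefficient names
follow the figure; the code is our illustration and assumes y[0] and
y[1] are already computed):

/* Original first-order recursion: one output per iteration. */
void iir_first_order(const double *x, double *y, int n,
                     double b0, double a1)
{
    for (int i = 1; i < n; i++)
        y[i] = b0 * x[i] + a1 * y[i - 1];
}

/* Unrolled by two: each output depends on y[i-2] only, so the
 * recursive bound is relaxed from one to two sample periods
 * (tail iteration for odd n omitted for clarity). */
void iir_first_order_unrolled(const double *x, double *y, int n,
                              double b0, double a1)
{
    for (int i = 2; i + 1 < n; i += 2) {
        y[i]     = b0 * x[i]     + a1 * b0 * x[i - 1] + a1 * a1 * y[i - 2];
        y[i + 1] = b0 * x[i + 1] + a1 * b0 * x[i]     + a1 * a1 * y[i - 1];
    }
}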
In some cases, like wave digital filters, the faster algorithms,
combined with voltage-scaling, can be chosen for energy-efficient
applications [58].
Figure 3.6. (a) Original signal flow graph. (b) Unrolled signal flow
graph.
3.2.3. Architecture Level
As the algorithm is selected, the architecture can be determined for
the given algorithm.
As we can see from Eq. (3.1), an efficient way to reduce the
dynamic power consumption is the voltage scaling. When supply
voltage is reduced, the power consumption is reduced. However, this
increases the gate delay. The delay of a min-size inverter (0.35 µm
standard CMOS technology) increases as the supply voltage is
reduced, which is shown in Fig. 3.7.
Thus, reducing the power supply voltage reduces the power
consumption but increases the delay. To compensate for the
delay, we use low power techniques like parallelism and pipelining
[11].
We demonstrate an example of architecture transformation.
Figure 3.7. Delay vs. supply voltage for an inverter.
Example 3.1. Parallelism [11].
The use of two parallel datapaths is equivalent to interleaving two
computational tasks. A datapath that determines the larger of
C and (A + B) is shown in Fig. 3.8. It requires an adder and a
comparator. The original clock frequency is 40 MHz [11].
In order to maintain the throughput while reducing the power
supply voltage, we use a parallel architecture. The parallel
architecture with twice the amount of resources is shown in Fig. 3.9.
The clock frequency can be reduced to half, from 40 MHz to 20
MHz, since two tasks are executed concurrently. This allows the
supply voltage to be scaled down from 5 V to 2.9 V [11]. Since
extra routing is required to distribute computations to the two parallel
units, the capacitance load is increased by a factor of 2.15 [11]. Still,
this gives a significant power saving [11]:
$$P_{par} = C_{par}V_{par}^{2}f_{par} = (2.15C_{orig})(0.58V_{orig})^{2}\left(\frac{f_{orig}}{2}\right) \approx 0.36P_{orig}$$

Figure 3.8. Original datapath.
Example 3.2. Pipelining [11].
Pipelining is another method for increasing the throughput. By
adding a pipelining register after the adder in Fig. 3.8, the
throughput can be increased from $1/(T_{add} + T_{comp})$ to
$1/\max(T_{add}, T_{comp})$. If $T_{add}$ is equal to $T_{comp}$, this increases the
throughput by a factor of 2. With this enhancement, the supply
voltage can also in this case be scaled down to 2.9 V (the gate delay
doubles) [11]. The effective capacitance increases by a factor of 1.15
because of the insertion of latches [11]. The power consumption for
pipelining [11] is
$$P_{pipe} = C_{pipe}V_{pipe}^{2}f_{pipe} = (1.15C_{orig})(0.58V_{orig})^{2}f_{orig} \approx 0.39P_{orig}$$

Figure 3.9. Parallel implementation.
One benefit of pipelining is the low area overhead in comparison
with using parallel datapaths. The area overhead equals the area of
the inserted latches. Another benefit is that the amount of glitches
can be reduced.
Further power saving can be obtained by parallelism and/or
pipelining. However, since the delay increases significantly as the
voltage approaches the threshold voltage and the capacitance load
for routing and/or pipeline registers increases, there exists an
optimal power supply voltage. Reducing the supply voltage below
this optimum increases the power consumption.
Locality is also an important issue in architecture trade-offs.
On-chip communication through long buses requires a significant
amount of power, so reducing such communications is important.
3.2.4. Logic Level
The power consumption depends on the switching activity factor,
which in turn depends on the statistical characteristics of the data.
However, most low power techniques from the system level down to
the architecture level do not concentrate on this issue. The low power
techniques at the logic level, however, focus mainly on reducing the
switching activity factor by exploiting signal correlations and, of
course, on reducing the node capacitances.
Figure 3.10. Pipeline implementation.
As we know from clock gating, the clock input to a non-active
functional block does not change when gated, which reduces the
switching of the clock network. Precomputation [1] uses the
same concept to reduce the switching activity factor: a selective
precomputation of the output of a circuit is done before the outputs
are required, and this reduces the switching activity by gating those
inputs to the circuit. This is illustrated in Fig. 3.11. The input data
is partitioned into two parts, corresponding to registers R1 and R2.
One part, R1, is computed in the precomputation block g one clock cycle
before the main computation A is performed. The result from g
decides the gating of R2. Power can then be saved by reducing the
switching activity factor in A.

Figure 3.11. A precomputation structure for low power.
An example of precomputation for low power is the comparator.
The comparator takes the MSBs of the two numbers to register R1 and
the other bits to R2. The comparison of the MSBs is performed in g. If
the two MSBs are not equal, the output from g gates the remaining
inputs. In this way, only a small portion of the inputs to the comparator's
main block A (a subtractor) is changed, and therefore the switching
activity is reduced.
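The idea can be illustrated in C as follows (a software analogy
under our assumptions, with 16-bit operands; the hardware gates
register R2 instead of short-circuiting the evaluation):

#include <stdint.h>

int compare_gt(uint16_t a, uint16_t b)
{
    int msb_a = (a >> 15) & 1;      /* block g: precompute on MSBs (R1) */
    int msb_b = (b >> 15) & 1;
    if (msb_a != msb_b)             /* g decides: R2 inputs stay gated */
        return msb_a > msb_b;
    /* MSBs equal: the main block A (subtractor on the low bits, R2)
     * must be evaluated. */
    return (a & 0x7FFF) > (b & 0x7FFF);
}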
Gate reorganization [12] [32] [57] is a technique to restructure
the circuit. This can be decomposition of a complex gate into simple
gates, composition of simple gates into a complex gate, duplication of
a gate, or deletion/addition of wires. The decomposition of a complex
gate and the duplication of a gate help to separate the critical and non-
critical paths and to reduce the size of the gates in the non-critical path,
and, hence, the power consumption. In some cases, the decomposition
of a complex gate increases the circuit speed and gives more space for
power supply voltage scaling. The composition of simple gates into a
complex gate can reduce the power consumption if the complex gate
reduces the charging/discharging of a frequently switching node.
Deleting wires reduces the capacitance load and circuit size. Adding
wires can provide an intermediate circuit that may eventually
lead to a better one.
Encoding defines the way data bits are represented on the
circuits. The encoding is usually optimized for reduction of delay or
area. In low power design, the encoding is optimized for reduction
of switching activities since various encoding schemes have
different switching properties.
In a counter design, counters with binary and Gray code have the
same functionality. For an N-bit counter with binary code, a full
counting cycle requires $2(2^N - 1)$ transitions [63]. A full counting
cycle for a Gray coded N-bit counter requires only $2^N$ transitions.
For instance, the full counting cycle for a 2-bit binary coded counter
is 00, 01, 10, 11, and back to 00, which requires 6 transitions.
The full counting cycle for a 2-bit Gray coded counter is 00, 01,
11, 10, and back to 00, which requires 4 transitions. The binary
coded counter thus has about twice as many transitions as the Gray
coded counter when N is large. Using a binary coded counter therefore
consumes more power than using a Gray coded counter under the same
conditions.
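The transition counts can be verified with a small C program (our
illustration):

#include <stdio.h>

static unsigned popcount(unsigned x) {
    unsigned c = 0;
    for (; x; x >>= 1) c += x & 1;
    return c;
}

int main(void)
{
    const unsigned N = 4, M = 1u << N;
    unsigned bin_tr = 0, gray_tr = 0;
    for (unsigned i = 0; i < M; i++) {
        unsigned next = (i + 1) % M;                 /* full cycle incl. wrap */
        bin_tr  += popcount(i ^ next);               /* binary code */
        gray_tr += popcount((i ^ (i >> 1)) ^ (next ^ (next >> 1))); /* Gray */
    }
    /* Prints 30 and 16 for N = 4, i.e. 2(2^4 - 1) and 2^4. */
    printf("binary: %u, Gray: %u\n", bin_tr, gray_tr);
    return 0;
}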
As we can see from the previous example, the coding style
has a large impact on the number of transitions. Traditionally,
coding styles have been used to enhance the speed performance.
A careful choice of coding style is important to meet the speed
requirement and minimize the power consumption. This can be
applied to finite state machines, where the states can be coded with
different schemes.
A bus is an on-chip communication channel that has a large
capacitance. As the on-chip transfer rate increases, the buses
contribute a significant portion of the total power. Bus
encoding is a technique that exploits the properties of the transmitted
signal to reduce the power consumption. For instance, adding an extra
bit that selects between the inverted and the non-inverted bits at the
receiver end can save power [53]. Low swing techniques can also be
applied to the bus [27].
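A C sketch of this scheme, commonly known as bus-invert coding,
is given below for an 8-bit bus (function and variable names are
ours):

#include <stdint.h>

static unsigned popcount8(uint8_t x) {
    unsigned c = 0;
    for (; x; x >>= 1) c += x & 1;
    return c;
}

/* Encode the next word; *bus holds the current line values and
 * *invert the state of the extra invert line. */
void bus_invert_encode(uint8_t data, uint8_t *bus, uint8_t *invert)
{
    if (popcount8(*bus ^ data) > 4) {   /* more than half would toggle */
        *bus = (uint8_t)~data;
        *invert = 1;
    } else {
        *bus = data;
        *invert = 0;
    }
    /* The receiver recovers data as (*invert ? ~*bus : *bus). */
}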
3.2.5. Circuit Level
At the circuit level, the potential power savings are often smaller than
those at higher abstraction levels. However, they cannot be ignored.
The power savings can be significant since the basic cells are frequently
used. A few percent improvement of a D flip-flop can significantly
reduce the power consumption in deeply pipelined systems.
In CMOS circuits, the dynamic power consumption is caused by
the transitions. Spurious transitions typically consume between 10%
and 40% of the switching activity power in typical combinational
logic [20]. In some cases, like array multipliers, the number of
spurious transitions is large. To reduce the spurious transitions, the
delays of signals from registers that converge at a gate should be
roughly equal. This can be achieved by insertion of buffers and by
device sizing [33]. The insertion of buffers increases the total load
capacitance but can still reduce the spurious transitions. This
technique is called path balancing.
Many logic gates have inputs that are logically equivalent, i.e.,
swapping the inputs does not modify the logic function of the gate.
Example gates are NANDs, NORs, XORs, etc. From the power
consumption point of view, however, the order of the inputs does
affect the power consumption. For instance, the A-input, which is
near the output in a two-input NAND gate, consumes less power than
the B-input close to ground for the same switching activity factor.
Pin ordering assigns the more frequently switching signals to the
input pins that consume less power. In this way, the power
consumption is reduced without cost. However, the statistics of the
switching activity factors for the different pins must be known in
advance, and this limits the use of pin ordering [63].

Figure 3.12. NAND gate.
Different logic styles have different electrical characteristics.
The selection of logic style affects the speed and power
consumption. In most cases, the standard CMOS logic is a good
starting point for speed and power trade-off. In some cases, for
instance the XOR/XNOR implementation, other logic styles, like
complementary pass-transistor logic (CPL), are more efficient. CPL
implements a full-adder with fewer transistors than standard
CMOS. The evaluation of the full-adder is done only with an NMOS
transistor network, which also gives a small layout.
Transistor sizing affects both delay and power consumption.
Generally, a gate with smaller size has smaller capacitance and
consumes less power. This is paid for by a larger delay. To minimize
the transistor sizes while meeting the speed requirement is a trade-off.
Typically, transistor sizing uses static timing analysis to find the
gates (those whose slack time is larger than 0) whose sizes can be
reduced. Transistor sizing is generally applicable for different
technologies.
3.3. Low Power Guidelines
Several approaches to reduce the power consumption have been
briefly discussed. Below we summarize some of the most commonly
used low power techniques.
• Reduce the number of operations. The selection of algorithm
and/or architecture has significant impact on the power
consumption.
• Power supply voltage scaling. Voltage scaling is an
efficient way to reduce the power consumption. Since the
throughput is reduced as the voltage is reduced, this may need
to be compensated for with parallelism and/or pipelining
techniques.
Figure 3.13. CPL logic network.
• I/Os between chips can consume large power due to the large
capacitive loads. Reducing the number of chips is a promising
approach to reduce the power consumption.
• Power management. In many systems, the most power
consuming parts are often idle. For example, in a laptop
computer, the display and hard disk could consume
more than 50% of the total power consumption. Using power
management strategies to shut down these components when
they have been idle for a long time can achieve good power savings.
• Reducing the effective capacitance. The effective capacitance
can be reduced by several approaches, for example, compact
layout and efficient logic style.
• Reduce the number of transitions. To minimize the number of
transitions, especially the glitches, is important.
3.4. Summary
In this chapter we discussed some low power techniques that are
applicable at different abstraction levels.
4
FFT ARCHITECTURES
Not only have several variations of the FFT algorithm been
developed since Cooley and Tukey's publication, but also various
implementations. Generally, the FFT can be implemented in
software, in general-purpose digital signal processors, in application-
specific processors, or in algorithm-specific processors.
Implementations in software on general-purpose computers can
be found in the literature and are still being explored in some
projects, for instance the FFTW project in the Laboratory for
Computer Science at MIT [28]. Software implementations are not
suitable for our target application since the power consumption is too
high.
Since it is hard to summarize all other implementations, we will
concentrate on algorithm-specific architectures and only give a
brief overview of some FFT architectures.
4.1. General-Purpose Programmable DSP Processors
Many commercial programmable DSP processors include the
special instructions for the FFT computation. Although the
performance varies from one to another, most of them have, from an
architectural point of view, a Harvard architecture. A
processor with a Harvard architecture has separate buses for program
and data.
A typical programmable DSP processor has on-chip data and
program memories, an address generator, a program controller, a MAC,
an ALU, and I/O interfaces, as illustrated in Fig. 4.1.

Figure 4.1. General-purpose programmable DSP processor.
The computation of FFT with general-purpose DSP processor
does not differ too much from the software computation of FFT in a
general-purpose computer.
To compute the FFT with a general-purpose DSP processor
requires three steps: first the data input, then the FFT/IFFT
computation, and finally the data output. In some DSP processors, for
instance TI's TMS320C3x, bit-reversed addressing is available to
accelerate the unscrambling of the data output. Typical FFT/IFFT
execution times are about 1 ms [2] [41] [55], which is far slower than
more specialized implementations. Implementations with
general-purpose programmable DSP processors therefore cannot
meet the throughput requirement.
4.2. Programmable FFT Specific Processors
Several programmable FFT processors have been developed for the
FFT/IFFT computations. These processors are 5 to 10 times faster
than the general-purpose programmable DSP processors.
The programmable FFT-specific processors have specific butter-
flies and at least one complex multiplier [65]. The butterfly is usually
radix-2 or radix-4. There is often an on-chip coefficient ROM, which
stores the sine and cosine coefficients. This type of programmable
FFT-specific processor is often provided with windowing
functions in either the time or the frequency domain.
The Zarlink’s (former Plessey) PDSP16515A processor
performs decimation in time, radix 4, forward or inverse Fast Fourier
Transforms [65]. Data are loaded into an internal workspace RAM
in normal sequential order, processed, and then read-out in correct
order. The processor has two internal workspace RAMs, one output
buffer, and one coefficient ROM.
Although the PDSP1615A processor accelerates the FFT
computation, it is still hard to meet the throughput requirement with
a single processor due to the slow I/O. The processor requires 98 µs
to perform 1024-point FFT with a system clock of 40 MHz. Using
multiple processor configuration can achieve a higher throughput,
but the power consumption is then substantially higher.
A recently released FFT-specific processor from DoubleBW
Systems B.V. has a higher throughput (100 Msamples/s) [16], but
consumes 8 W at 3.3 V.
Figure 4.2. FFT-specific processor PDSP16515A.
4.3. Algorithm-Specific Processors
Non-programmable algorithm-specific processors can also be
designed for the computation of FFT algorithms. These processors are
designed mostly for fixed-length FFTs. The architecture of an
algorithm-specific FFT processor is therefore optimized with
respect to memory structure, control units, and processing elements.
There are mainly three types of algorithm-specific processors:
fully parallel FFT processors, column FFT processors, and pipelined
FFT processors.
All three types of algorithm-specific processors represent
different mappings of the signal-flow graph of the FFT onto hardware
structures. The hardware structure in a fully parallel FFT processor
is an isomorphic mapping of the signal-flow graph [3]. For example,
the signal-flow graph for an 8-point FFT algorithm is shown in Fig.
4.3. An 8-point fully parallel FFT processor requires 24 complex
adders and 5 complex multipliers. The hardware requirement is
excessive, and hence this approach is not power efficient.
To reduce the hardware complexity, a column or a pipelined FFT
processor can be used. A set of processing elements in a column FFT
processor [21] computes one stage at a time. The results are fed back
to the same set of processing elements to compute the next stage. For
long transform lengths, the routing for the processing elements is
complex and difficult.
Figure 4.3. Signal-flow graph for an 8-point FFT ($W^n$ denotes
multiplication by $W_8^n$).
For a pipelined FFT processor, each stage has its own set of
processing elements. All the stages are computed as soon as data are
available. Pipelined FFT processors have features like simplicity,
modularity, and high throughput. These features are important for
real-time, in-place applications where the input data often arrive in
natural sequential order. We therefore select the pipeline
architecture for our FFT processor implementation.
The most common groups of pipelined FFT architectures are
• Radix-2 multipath delay commutator (R2MDC)
• Radix-2 single-path delay feedback (R2SDF)
• Radix-4 multipath delay commutator (R4MDC)
• Radix-4 single-path delay commutator (R4SDC)
• Radix-4 single-path delay feedback (R4SDF)
• Radix-2² single-path delay commutator (R2²SDC)
We will discuss these pipeline architectures in more detail.
4.3.1. Radix-2 Multipath Delay Commutator
The Radix-2 Multipath Delay Commutator (R2MDC) architecture is
the most straightforward approach to implement the radix-2 FFT
algorithm using a pipeline architecture [48]. An 8-point R2MDC
FFT is shown in Fig. 4.4.

Figure 4.4. An 8-point DIF R2MDC architecture.
When a new frame arrives, the first four input data are multi-
plexed to the top-left delay elements in the figure and the next four
input data directly to the butterfly. In this way the first input data is
delayed by four samples and arrives at the butterfly simultaneously
with the fourth input sample. This completes the start-up of the first
stage of the pipeline. The outputs from the first stage butterfly and
the multiplier are then fed into the multipath delay commutator
between stage 1 and stage 2. There are two paths (multipath) with
delay elements and one switch (commutator). The multipath delay
commutator alleviates the data dependency problem. The first and
second outputs from the upper side of the butterfly are fed into the
two upper delay elements. After this, the switch changes and the
third and fourth outputs from the upper output of the first butterfly
are sent directly to the butterfly at stage 2. However, the first and
second outputs from the multiplier at the first stage are now delayed
by the upper delay elements, which makes them arrive together with
the fifth and sixth outputs from the top.
The butterfly and the multiplier are idle half the time to wait for
the new inputs. Hence the utilization of the butterfly and the multi-
plier is 50%. The total number of delay elements is 4 + 2 + 2 + 1 +
1 = 10 for the 8-point FFT. The total number of delay elements for
an N-point FFT can be derived in a similar way and is
N/2 + N/2 + N/4 + ... + 2, i.e., 3N/2 − 2. Each stage (except the last one)
has one multiplier, and the number of multipliers is $\log_2(N) - 1$.
4.3.2. Radix-2 Single-Path Delay Feedback
Herbert L. Groginsky and George A. Works introduced a feedback
mechanism in order to minimize the number of delay elements [22].
In the proposed architecture one half of outputs from each stage are
fed back to the input data buffer when the input data are directly sent
to the butterfly. This architecture is called Radix-2 Single-path Delay
Feedback (R2SDF). Fig. 4.5 shows the principle of an 8-point
R2SDF FFT.

Figure 4.5. An 8-point DIF R2SDF FFT.
The delay elements at the first stage save four input samples
before the computation starts. During the execution they store one
output from the butterfly of the first stage and one output is immedi-
ately transferred to the next stage. Thus, during the next half
frame, when the delay elements are being filled with fresh input samples,
the results of the previous frame are sent to the next stage. The butterfly
is provided with a feedback loop. The modified butterfly is shown in
the right side of Fig. 4.5. When the mux is 0, the butterfly is idle and
data passes by. When the mux is 1, the butterfly processes the
incoming samples. Because of the feedback mechanism, the number
of delay elements is reduced from 3N/2 to N − 1 (N/2 + N/4 +
... + 1), which is minimal. The number of multipliers is exactly the same
as for the R2MDC FFT architecture, i.e., $\log_2(N) - 1$. The utilization of
the multipliers and butterflies remains the same, namely 50%.
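A behavioral C sketch of one radix-2 SDF stage is given below; it
models the dataflow just described (fill the delay line for half a
frame, then butterfly with feedback) and is our simplified
illustration, not the thesis's hardware. D is the delay-line length,
N/2 for the first stage:

typedef struct { double re, im; } cplx;

typedef struct {
    cplx *line;     /* feedback delay line, D entries (zero-initialized) */
    int D, pos, t;  /* length, ring position, sample counter (start 0) */
} sdf_stage;

/* One sample in, one sample out; the first D outputs of the very
 * first frame are don't-care. */
cplx sdf_step(sdf_stage *s, cplx in)
{
    cplx old = s->line[s->pos], out;
    if ((s->t / s->D) % 2 == 0) {      /* mux = 0: fill/bypass phase */
        s->line[s->pos] = in;          /* shift input into the delay line */
        out = old;                     /* pass previous differences onward */
    } else {                           /* mux = 1: butterfly phase */
        out.re = old.re + in.re;       /* sum goes to the next stage */
        out.im = old.im + in.im;
        s->line[s->pos].re = old.re - in.re;  /* difference is fed back */
        s->line[s->pos].im = old.im - in.im;
    }
    s->pos = (s->pos + 1) % s->D;
    s->t++;
    return out;
}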
4.3.3. Radix-4 Multipath Delay Commutator
This architecture is similar to R2MDC. Input data are separated by
a 4-to-1 multiplexer and 3N/2 delay elements at the first stage. A 4-
path delay commutator is used between two stages. Computation
takes place only when the last 1/4 of the data is multiplexed to the
butterfly. The utilization of the butterflies and the multipliers is 25%.
The length of the FFT has to be $4^n$. A length-64 DIF Radix-4
Multipath Delay Commutator (R4MDC) FFT is shown in Fig. 4.6.

Figure 4.6. A 64-point DIF R4MDC FFT.

Each stage (except the last) has 3 multipliers, and the
R4MDC FFT therefore requires $3(\log_4(N) - 1)$ multipliers in total for an N-
point FFT, which is more than for the R2MDC or R2SDF. Moreover, the
memory requirement is 5N/2 − 4, which is the largest among the three
discussed architectures. From the viewpoint of hardware and utilization,
it is not a good structure.
4.3.4. Radix-4 Single-Path Delay Commutator
To increase the utilization of the butterflies, G. Bi and E. V. Jones [4]
proposed a simplified radix-4 butterfly. The simplified radix-4
butterfly produces only one output, compared with four in the
conventional butterfly. To provide the same four outputs, the
butterfly works four times instead of just once. Due to this
modification the butterfly has a utilization of 4 · 25% = 100%. To
accommodate this change we must provide the same four data at four
different times to the butterfly. A few more delay elements are
required with this architecture. Furthermore, the simplified butterfly
needs additional control signals, and so do the commutators. The
number of multipliers is $\log_4(N) - 1$, which is less than for the R4MDC
FFT architecture. The utilization of the multipliers is 75% due to the
fact that at least one-fourth of the data are multiplied by the trivial
twiddle factor 1 (no multiplication is needed). The structure of a 16-
point DIF Radix-4 Single-Path Delay Commutator (R4SDC) FFT is
shown in Fig. 4.7.

Figure 4.7. A 16-point DIF R4SDC FFT.

The main benefit of this architecture is the improved utilization
of the butterflies. The cost for R4SDC is an increased number of
delay elements.
4.3.5. Radix-4 Single-Path Delay Feedback
Radix-4 single-path delay feedback (R4SDF) [15] [62] is a radix-4
version of R2SDF. Since we use the radix-4 algorithm, we can reduce
the number of multipliers to $\log_4(N) - 1$, compared to $\log_2(N) - 2$ for
R2SDF. But the utilization of the butterflies is reduced to 25%. The
radix-4 SDF butterflies also become more complicated than the
radix-2 SDF butterflies. A 64-point DIF R4SDF FFT is illustrated in
Fig. 4.8.
Figure 4.8. A 64-point DIF R4SDF FFT.
The radix-4 SDF butterfly is shown in Fig. 4.9. The data are sent
to the butterfly for processing when the mux is 1; otherwise the data
are shifted into a delay-line with a length of 3N/4 (first stage).

Figure 4.9. Radix-4 SDF butterfly.
4.3.6. Radix-2² Single-Path Delay Commutator
The Radix-2² Single-path Delay Commutator (R2²SDC)
architecture [24] uses a modified radix-4 DIF FFT algorithm. It has the
same butterfly structure as the radix-2 DIF FFT, but places the
multipliers at the same positions as in the radix-4 DIF FFT. Basically,
two kinds of radix-2 SDF butterflies are used to achieve the same outputs
(but not in the same order) as a radix-4 butterfly. By reducing the
radix from 4 to 2 we increase the utilization of the butterflies from 25%
to 50%. We reduce the number of multipliers compared to the
conventional radix-2 algorithm. This approach is based on a 4-point
DFT.
The outputs are bit-reversed instead of 4-reversed as in a
conventional radix-4 algorithm.
Figure 4.10. A 64-point DIF R2²SDC FFT.
4.4. Summary
In this chapter, several FFT implementation classes were discussed.
Programmable DSP or FFT-specific processors cannot meet the
requirements of both high throughput and low power applications.
Algorithm-specific implementations, especially pipelined FFT
architectures, are better in this respect.
5
IMPLEMENTATION OF
FFT PROCESSORS
In this chapter, we discuss the implementation of FFT processors.
In VLSI design, the design method is an important guide for the
implementation. We follow the meet-in-the-middle design method.
5.1. Design Method
As the transistor feature size is scaled down, more and more
functionalities can be integrated in a single chip. High speed, high
complexity, and short design time are several requirements for VLSI
designs. This requires that the design methodology must cope with
the increasing complexity using a systematic approach. A design
methodology is the overall strategy to organize and solve the design
tasks at the different steps of the design process [24].
The bottom-up methodology, which builds the system by
assembling existing building blocks, can hardly keep up with the
performance and communication requirements of current
systems. Hence the bottom-up methodology is not suitable for the
design of complex systems.
In the top-down design methodology, the system requirements
and organization are developed by successive decomposition.
Typically, a high-level design language is used to define the system
functionality. After a number of decomposition steps, the system is
described in an HDL, which can be used for automatic logic
synthesis. A drawback of this design approach is that the result
relies heavily on the synthesis tools. If the final result fails to meet the
performance requirements, the whole design has to be redone.
In the meet-in-the-middle methodology, the specification-synthesis
process is carried out in an essentially top-down fashion, but the
actual design of the building blocks is performed bottom-up. This is
illustrated in Fig. 5.1. The design process is therefore divided into
two almost independent parts that meet in the middle. The circuit
design phase can be shortened by using efficient circuit design tools
or even automatic logic synthesis tools. Often, some of the building
blocks are already available in a circuit library.

Figure 5.1. The meet-in-the-middle methodology.
In our target application, the requirement for the FFT processor
is specified. The design process starts with the creation of a functional
specification of the FFT processor. This results in a high-level
model. The high-level model is then validated by a testbench for the
FFT algorithm. The testbench can be reused for successive models.
After the system functionality is validated by simulation, the
functional specification is mapped into an architectural
specification.
In the architectural specification, the detailed computation process
is mapped onto hardware. The different functionalities are partitioned
and mapped into hardware or software components. The detailed
communications between the components are decided in
the architecture specification. After the architecture model is
created, it needs to be simulated for performance and
validation. Basically, the software and hardware designs are
separated after this architectural partitioning. Since the FFT
processor is completely implemented in hardware, the partitioning
of software and hardware is not necessary.
Once an architecture is selected, the individual hardware blocks
are refined by adding implementation details and constraints. In
this phase, we apply the bottom-up design methodology. The different
subblocks are built from cells and combinations of blocks.
5.2. High-level Modeling of an FFT Processor
High-level modeling serves two purposes: to create a cycle-true
model of the algorithm and hardware architecture, and to simulate,
validate, and optimize the high-level model.
Since the whole FFT processor is implemented in hardware, the
software and hardware co-design is not needed. As mentioned
previously, we do not need to determine the system specification
since it is given. The system specification for the FFT processor has
been defined as
• Transform length is 1024
• Transform time is less than 40 µs (continuously)
• Continuous I/O
• 25.6 Msamples/sec. throughput
• Complex 24 bits I/O data
According to the meet-in-the-middle design methodology, the
high-level design is a top-down process. We start with the resource
analysis.
5.2.1. Resource Analysis
The high-level design can be divided into several tasks:
• Architecture selection
• Partitioning
• Scheduling
• RTL model generation
• Validation of models
The first three tasks are associated with each other, and the aim is
to allocate resources to meet the system specification. In an ASIC
implementation, the resources are constrained. Hence a resource
analysis is required.
There are many possible architectures for FFT processors.
Among them, the pipelined FFT architectures are particularly
suitable for real-time applications since they can easily
accommodate the sequential nature of sampling.
The pipelined FFT architectures can be divided into a datapath and a
control part. Since the control part is much simpler than the datapath
with respect to both hardware and power consumption, the resource
analysis concentrates on the datapath.
The datapath for the FFT processor consists of memories,
butterflies and complex multipliers. We discuss them separately.
5.2.1.1. Butterflies
From the specification, the computation time for the 1024-point FFT
processor is

$$t_{FFT} = 4 \times 10^{-5}\ \mathrm{s} \qquad (5.2)$$

With the radix-2 algorithm, the number of butterfly operations is
$(N/r)\log_r(N) = (N/2)\log_2(N) = 5120$. A butterfly can be
implemented with parallel adders/subtractors using one clock cycle.
Hence the minimum number of butterflies is

$$N_{BF} = \frac{No_{BFop} \times t_{BFop}}{t_{FFT}} = \frac{5120}{4 \times 10^{-5} \times 25.6 \times 10^{6}} = 5 \qquad (5.3)$$

This is optimal under the assumption that ALL data are available
to ALL stages, which is impossible for continuous data streams.
Each butterfly has to be idle 50% of the time in order to reorder the
incoming data. The allocation of butterfly operations from two stages
to the same butterfly is not possible with as soon as possible (ASAP)
scheduling. Therefore the number of butterflies is 10, i.e., equal to
the number of stages.
By a similar argument, the number of butterflies for a radix-4
pipeline architecture is equal to the number of stages.
5.2.1.2. Complex Multipliers
The number of complex multiplications is

$$N_{cmult} \approx \frac{N}{r}(r-1)(\log_r(N) - 1) \qquad (5.4)$$

where N is the transform length and r is the radix. It does not include
the complex multiplications within the r-point DFTs.
For the radix-2 algorithm, the number of complex multiplications is
about 4068. A complex multiplication can be computed in either
one clock cycle or two clock cycles (pipelined). The minimum
number of complex multipliers, assuming fast complex
multipliers (one complex multiplication per clock cycle), is

$$N_{cmult,min} = \frac{N_{cmult} \times t_{cmult}}{t_{FFT}} = \frac{4068}{4 \times 10^{-5} \times 25.6 \times 10^{6}} \approx 4 \qquad (5.5)$$

Since resource sharing between two stages is not possible for
pipeline architectures, the number of complex multipliers is 9, i.e.,
each stage except the last has its own complex multiplier.
For the radix-4 algorithm, the number of complex multipliers is 4.
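The resource analysis above can be reproduced with a small C
program (our illustration; one butterfly operation or complex
multiplication per clock cycle is assumed):

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double N = 1024.0, r = 2.0;
    const double t_fft = 4e-5;       /* s, required transform time */
    const double f_clk = 25.6e6;     /* Hz, clock/sample frequency */
    double stages  = log(N) / log(r);                      /* 10 */
    double bf_ops  = (N / r) * stages;                     /* 5120 */
    double n_bf    = ceil(bf_ops / (t_fft * f_clk));       /* 5, Eq. (5.3) */
    double n_cmult = (N / r) * (r - 1.0) * (stages - 1.0); /* per Eq. (5.4) */
    printf("stages: %.0f, butterfly ops: %.0f, min butterflies: %.0f, "
           "complex mults: %.0f\n", stages, bf_ops, n_bf, n_cmult);
    return 0;
}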
5.2.1.3. Memories
The memory requirement increases linearly with the transform
length. For a 1024-point FFT processor, the memories dissipate more
power than the complex multipliers.
The sizes of the memories are determined by the maximum
amount of live data, which is determined by the architecture. In
general, the architectures with feedback are efficient in terms of
memory utilization.
5.2.2. Validation of the High-Level Model
After the resource analysis, the next step is to model the FFT
algorithm at a high level. For fast evaluation, the algorithm is
described in a high-level programming language, like C or Matlab.
The validation of the high-level model is done through simulation
and comparison. The interface between the model and the testbench is
plain text files: the input data is stored in a text file and read by the
model, and the output data from the model is also saved in a text file.
This gives freedom in the construction of the model and the testbench:
the model can be written in C, Matlab, or VHDL, and the testbench can
also be written in C or Matlab. Moreover, it is easy to convert from
floating-point arithmetic to fixed-point arithmetic. The same
testbench can be reused by changing the output/input file arithmetic.
Architecture   Memory requirement [words]    Memory utilization
R2MDC          N/2+N/2+...+2 = 3N/2-2        66%
R2SDF          N/2+N/4+...+1 = N-1           100%
R4MDC          3N/4+3N/16+...+12 = 5N/2-4    40%
R4SDF          3N/4+3N/16+...+3 = N-1        100%
R4SDC          3N/2+3N/8+...+6 = 2N-2        50%
R2²SDF         N/2+N/4+...+1 = N-1           100%
Table 5.1. Memory requirement and utilization for pipelined
architectures.
Figure 5.2. Testbench.
5.2.3. Wordlength Optimization
In pipelined FFT architectures, most research effort has been
devoted to regular, modular implementations, which use a fixed
wordlength for both data and coefficients in every stage. The
possibility of using different wordlengths, which is provided by the
pipeline architecture, is often ignored in order to achieve modular
solutions.
Based on the observation that the wordlengths of the different stages
in a pipelined FFT processor can differ, we proposed a
wordlength optimization method [34]. We first tune the wordlength
of the data memory (data RAM) at each stage separately to make sure
that the precision requirement is met, and then we adjust the
wordlength of the coefficient ROM at each stage. Because our focus
is placed on reducing the power consumption in the data memory, the
strategy is that the larger the RAM block in a stage, the shorter its
wordlength should be. The conventional uniform wordlength
scheme for both data memory and coefficient ROM has also been
simulated. To obtain the optimal wordlength profile, numerous
design iterations have been performed.
Figure 5.3. Wordlength optimization for pipelined FFTs.
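The tuning loop of Fig. 5.3 can be sketched as follows (rough C
pseudocode under our assumptions; simulate_ok() stands for a
fixed-point FFT simulation checked against the precision
requirement and is not the thesis's actual tool):

#include <stdbool.h>

#define STAGES 10

extern bool simulate_ok(const int wl[STAGES]);  /* precision met? */

/* Starting from a safe uniform profile, greedily shorten the
 * wordlength of each stage while the precision requirement is still
 * met; in a pipeline, stage 0 holds the largest RAM and is visited
 * first so that it gets the shortest feasible wordlength. */
void tune_wordlengths(int wl[STAGES])
{
    for (int s = 0; s < STAGES; s++) {
        while (wl[s] > 1) {
            wl[s]--;
            if (!simulate_ok(wl)) {  /* too short: back off and stop */
                wl[s]++;
                break;
            }
        }
    }
}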
Two types of test vectors are used in our simulations. One is a
sine wave and the other is random numbers. The sine wave stimuli
are sensitive to the precision of the coefficient representation, and the
random samples are effective stimuli for checking the precision of the
butterfly calculations. To make the results highly reliable, 100,000
sets of random samples are generated and fed into the simulator.
The optimization results for 1024-point pipelined FFT architectures
are shown in Table 5.2.

Architecture   Memory size for     Memory size for         Saving
               fixed wordlength    optimized wordlength
R2MDC          42952 bits          42824 bits              0%
R2SDF          28644 bits          28580 bits              0%
R4MDC          71552 bits          61488 bits              14%
R4SDF          28644 bits          24708 bits              14%
R4SDC          57288 bits          49176 bits              14%
R2²SDF         28672 bits          28580 bits              0%
Table 5.2. Wordlength optimization.
5.3. Subsystems
Once the RTL model of the FFT processor is created and
validated, the subsystems can be constructed according to the
meet-in-the-middle design methodology.
For the subsystem design, there are two design methods: the
semi-custom method and the full custom method. The semi-custom
design method has a shorter design time. The RTL description in an
HDL can be synthesized with a synthesis tool, and the synthesis result
is fed to a place-and-route tool for the final layout. However, this design
methodology relies on the synthesis and place-and-route tools.
The designer has less control over the design process. Moreover,
most synthesis tools use static timing analysis and do not
consider the interconnections during synthesis. The designers have
to increase the timing margins during synthesis to meet the speed
requirement after place and route. The resulting designs are often
unnecessarily large. In our case, the impact of power supply voltage
scaling is hard to predict since the characterization of the cells is done at
normal supply voltage. We therefore select a full custom design for
the FFT processor, but use the semi-custom design method for the
control path, where the timing is not critical.
In the following, we introduce the subsystem designs for the FFT
processor. The main subsystems are the memories, the butterflies, and the
complex multipliers. The datapath of one stage is shown in Fig. 5.4.

Figure 5.4. Datapath for a stage in a pipelined FFT processor.
5.3.1. Memory
In many DSP processors, the memory contributes a significant portion
of the area and power consumption of the whole processor. In the 1024-
point FFT processor, the memory is the most significant part
in both area and power consumption. Hence the low power design of
the memories is a key issue for the FFT processor.
5.3.1.1. RAM
The data are stored in RAMs. There are mainly two types of RAM:
static RAM (SRAM) and dynamic RAM (DRAM). Since DRAM
often requires a special process technology and is not
available for standard CMOS technology and SRAM is more
suitable for low voltage operations, we select the SRAM for the data
storage. An overview of SRAM is shown in Fig. 5.5.
The SRAM consists of four parts [30]:
• memory cell array
• decoders
• sense amplifiers
• periphery circuits
We discuss the implementation of the first three parts, which are
the main parts of the SRAM.

Figure 5.5. Overview of a SRAM.
Memory array
The memory array dominates the SRAM area. The key issues in the
design of the memory array are the cell area and the noise immunity
of the bit-lines.
The memory cell is the basic building block in the SRAM, and the
size of the memory cell is of importance. Even though a 4-transistor
(4-T) memory cell has a smaller area than a 6-transistor
(6-T) cell, its current leakage at low voltage is considerably larger. We
therefore select a 6-T memory cell. A typical 6-T memory cell
is shown in Fig. 5.6.
The stability of the memory cell during the read and write
operations determines the device sizing [52]. The cell designer has
to consider process variations, short channel effects, soft error rate,
and low supply voltage [23]. The width ratio $\beta = W_n / W_a$ is
determined by the read operation, and a larger ratio β means less
chance that a SRAM cell changes its state during a read operation. The
read stability can be measured by the static noise margin (SNM).
The width ratio $\alpha = W_p / W_a$ affects the write operation, and a
larger α means that data is more difficult to write into the cell. The width
of the access NMOS transistor, $W_a$, is set to minimum size for minimal
cell area. Normally, the value for α is 1~2 and β is 2~3.

Figure 5.6. SRAM cell.
Example 6. SNM for a SRAM cell.
The SNM can be simulated with SPICE. The SNM for a SRAM
cell with β = 2 in a standard 0.35 µm CMOS technology is shown in
Fig. 5.7.
In order to reduce the power consumption and speed up the read
access, most SRAMs read data through a pair of bit-lines with a small
swing. The voltage swing between the two bit-lines is usually about 100
mV to 300 mV, which is sensitive to noise. As the power supply
voltage decreases, the effect of noise becomes more important. To
reduce noise from outside, the memory array is surrounded by a
guard-ring, which reduces the substrate-coupling noise. To avoid
coupling noise from nearby bit-line pairs, we use a twisted bit-line
layout. Thus the coupling from nearby bit-line pairs does not affect
the swing difference of the bit-lines.
Figure 5.7. SNM of a SRAM cell vs. power supply voltage.
Figure 5.8. Noise reduction for memory array.
Decoder
The decoder can be realized using a hierarchical structure, which
reduces both the delay and the activity factor. The row decoder can
use either a NOR-NAND decoder or a tree decoder. The tree decoder
requires fewer transistors, but suffers from speed degradation due to
the serial connection of pass-transistors, which increases the
delay (and becomes worse at lower power supply voltages). The NOR-
NAND decoder has a regular layout, but requires more transistors.
The tree decoder is preferred for small decoders, while the NOR-NAND
decoder is preferred for larger decoders.
For a large decoder, a word-line enable signal is added to the
decoder, which controls the width of the word-line pulse and
reduces the glitches of the word-line drivers. This reduces the power
consumption of the decoder.
Sense Amplifier
The sense amplifier is used to amplify the small-swing bit-line
signals during the read operation. To achieve fast access, the sense
amplifier is designed with a high gain. This high gain requirement in
turn requires a high current and hence a high power consumption for
the sense amplifier. One way to reduce the power consumption is to
reduce the active time of the sense amplifier. This can be achieved by
using a pulsed sense enable signal.
At low supply voltages, the current mode sense amplifier is less
suitable. We therefore modified an STC D-flip-flop [64] to form a
two-stage latch-type sense amplifier. The sense amplifier is
functional at supply voltages as low as 0.9 V.
Figure 5.9. STC D-flip-flop.
The simulated waveforms for the read operation are shown in Fig.
5.10. The access time is 11 ns using a standard 0.35 µm CMOS
technology with a typical process corner at 85°C. The power
consumption for the sense amplifier is 59.5 µW per bit at 50 MHz.
The simulated waveforms for the write operation are shown in
Fig. 5.11. The total power consumption is 83.4 µW per bit at 50
MHz.
Figure 5.10. Read operation.
Figure 5.11. Write operation.
The pulse widths of the word-line signal and the sense enable signal
must be selected carefully. A short pulse dissipates less power,
but it needs to be sufficiently long to guarantee the read operation under
process variations, low power supply voltage, etc.
Among the periphery circuits, the I/O drivers have large capacitive
loads. Reducing the short-circuit current is an important issue for the
I/O drivers. Avoiding simultaneous switching of the PMOS and NMOS
transistors is an efficient technique for reducing the short-circuit
current.
5.3.1.2. Implementation
A 256-word × 26-bit SRAM with separate I/O (Fig. 5.12) has been
implemented with the techniques discussed above. The SRAM, which
runs at 1.5 V and 50 MHz, consumes 2.6 mW. A module generator
for the SRAM is under development.
5.3.2. Butterfly
The butterfly is one of the characteristic building blocks in an FFT
processor.
The butterfly consists mainly of adders/subtractors. Hence we
discuss the implementation of the adder/subtractor first and then the
complete butterfly.
5.3.2.1. Adder design
The adder is one of the fundamental arithmetic components. There
are many adder structures [47].
Figure 5.12. SRAM macro (1.27 × 0.33 mm²).
The ripple-carry adder (RCA) is constructed from full-adders. The
RCA is the slowest among the different implementations. However, it is
simple and consumes a small amount of power for 16-bit adder
implementations. If the wordlength is small, it is suitable to select the
RCA for the butterfly.
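A bit-level C model of the ripple-carry structure is shown below
(our illustration); the loop mirrors the full-adder chain, whose carry
propagation forms the critical path:

#include <stdint.h>

uint32_t ripple_carry_add(uint32_t a, uint32_t b, int n)
{
    uint32_t sum = 0, carry = 0;
    for (int i = 0; i < n; i++) {
        uint32_t ai = (a >> i) & 1u, bi = (b >> i) & 1u;
        sum   |= (ai ^ bi ^ carry) << i;            /* full-adder sum */
        carry  = (ai & bi) | (carry & (ai ^ bi));   /* full-adder carry */
    }
    return sum;   /* result modulo 2^n */
}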
When the speed is important, for instance for the vector merge
adder in the multiplier, the RCA cannot meet the speed requirement.
In these cases, other carry-accelerating adder structures are
attractive. We select the Brent-Kung adder for the high speed adder
implementation. The Brent-Kung adder has a short delay and a
regular structure. It will be discussed later in the complex multiplier
design.
RCA implementation
We have developed a program that generates the schematic and the
layout for the RCA. A CMOS full-adder layout is shown in Fig. 5.13.
A 3-bit RCA layout with sign-extension is shown in Fig. 5.14.
Figure 5.13. CMOS full-adder. Size: 18.7 × 15.0 µm².
Figure 5.14. Layout of 3-bit ripple-carry adder.
5.3.2.2. High radix butterfly architecture
The use of higher radices tends to reduce the memory access rate
and the arithmetic workload, and hence the power consumption [39] [60].
Efficient design of high-radix butterflies is therefore important. In
practice, the commonly used high radix butterflies are radix-4 and
radix-8 butterflies. Butterflies with a radix higher than 8 are
often decomposed into lower radix butterflies.
A conventional butterfly is often based on an isomorphic
mapping of the signal-flow graph. Signal-flow graph for a radix-4
butterfly is shown in Fig. 5.15. The butterfly requires 8 complex
adders/subtractors and a delay of 2 additions/subtractions.
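For reference, the radix-4 butterfly computes the following 4-point DFT (a floating-point sketch for illustration; the hardware operates on fixed-point data):

```python
import cmath

# The radix-4 butterfly of Fig. 5.15: eight complex additions/subtractions,
# where the multiplication by -j is a trivial swap/negate of the real and
# imaginary parts in hardware.
def radix4_butterfly(x0, x1, x2, x3):
    a, b = x0 + x2, x0 - x2
    c, d = x1 + x3, -1j * (x1 - x3)
    return (a + c, b + d, a - c, b - d)

# Check against the 4-point DFT definition.
x = [1 + 2j, -3 + 1j, 0.5j, 2 - 1j]
X = radix4_butterfly(*x)
for k in range(4):
    ref = sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / 4) for n in range(4))
    assert abs(X[k] - ref) < 1e-9
```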
Figure 5.14. Layout of 3-bit ripple-carry adder.
Figure 5.15. Signal-flow graph for 4-point DFT.
To reduce the complexity, we proposed a carry-save based butterfly [36]. The computation for a radix-4 butterfly is divided into two steps. The first step is a 4-2 compression with addition/subtraction controlled inputs. The second step is a normal addition. The delay is changed from two additions/subtractions to one addition and one 4-2 compression. This implementation reduces the hardware since a fast adder is more complex than a 4-2 compressor. The total delay is also reduced since the delay of a 4-2 compressor is smaller. The implementation of a radix-4 butterfly with carry-save adders is shown in Fig. 5.16. In the figure, only real additions are shown, so it appears more complicated than Fig. 5.15, where the additions are complex.
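The two-step computation can be modeled at the bit level as follows (an illustrative sketch, not the synthesized VHDL; the add/subtract control inputs are omitted for brevity):

```python
# Bit-level model of the two-step butterfly computation: a 4-2 compressor
# stage reduces four addends to a sum/carry pair with wordlength-independent
# delay, and a single fast adder resolves the pair in the second step.

def full_adder(a, b, c):
    """Single-bit full adder: returns (sum, carry)."""
    s = a ^ b ^ c
    cy = (a & b) | (a & c) | (b & c)
    return s, cy

def compress_4_2(x0, x1, x2, x3, width):
    """Compress four non-negative integers into a (sum, carry) pair."""
    sum_vec = carry_vec = 0
    t = 0  # intermediate carry chain between bit positions
    for i in range(width):
        b = [(x >> i) & 1 for x in (x0, x1, x2, x3)]
        s1, c1 = full_adder(b[0], b[1], b[2])   # first full-adder row
        s2, c2 = full_adder(s1, b[3], t)        # second full-adder row
        t = c1                                  # feeds bit position i + 1
        sum_vec |= s2 << i
        carry_vec |= c2 << (i + 1)              # carries weigh one bit more
    return sum_vec, carry_vec

# Step 2: one fast (vector merge) addition resolves the redundant result.
s, c = compress_4_2(3, 5, 7, 9, width=8)
assert s + c == 3 + 5 + 7 + 9
```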
Carry-save radix-4 butterfly implementation.
A carry-save radix-4 butterfly (with a wordlength of 15 bits for the real and imaginary parts of the input) was described in VHDL and synthesized in a 0.8 µm AMS standard CMOS technology [36]. The synthesis results show an area saving of up to 21% for the carry-save radix-4 butterfly, and the delay is reduced by 22%.
The radix-2/4 split-radix butterfly and radix-8 butterfly can also
be implemented using the carry-save adder.
Architecture    Area        Delay @ 3.3 V, 25˚C
Conventional    10504.16    12.32 ns
Carry-save      8266.48     9.59 ns
Table 5.3. Performance comparison for two radix-4 butterflies.
Figure 5.16. Parallel radix-4 butterfly, built from (4,2)-counters, a fast adder, and inverters.
5.3.3. Complex Multiplier
There is no question that the complex multipliers are among the critical units in FFT processors. From a speed point of view, the complex multiplier is the slowest part of the data path; with pipelining, the throughput can be increased while the latency remains the same. From a power consumption point of view, complex multipliers accounted for about 70% to 80% of the total power consumption in previous FFT implementations [39] [60]. This share has dropped below 50% of the total power consumption because the power consumption of the memories grows as the transform length of the FFT increases [37]. Hence the complex multipliers are key components in FFT design.
A straightforward implementation (see Fig. 5.17 (a)) of a complex multiplication requires four real multiplications, one addition, and one subtraction. However, the number of multiplications can be reduced to three by using a transformation, at the cost of extra pre- and post-additions (see Fig. 5.17 (b)). A more efficient way to reduce the cost of the multiplication is to utilize distributed arithmetic [35] [58].
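For illustration, the transformation of Fig. 5.17 (b) can be sketched as follows (hypothetical names; the pre-added coefficient terms are constants that can be precomputed):

```python
# Complex multiplication (C_R + jC_I)(X_R + jX_I) with three real
# multiplications instead of four, at the cost of extra pre-/post-additions.
def complex_mult_3(CR, CI, XR, XI):
    t = CR * (XR + XI)               # shared product
    ZR = t - (CR + CI) * XI          # = CR*XR - CI*XI
    ZI = t - (CR - CI) * XR          # = CR*XI + CI*XR
    return ZR, ZI

assert complex_mult_3(3, -2, 5, 7) == (3*5 - (-2)*7, 3*7 + (-2)*5)
```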
5.3.3.1. Distributed Arithmetic
Distributed arithmetic (DA) uses precomputed partial sums for an
efficient computation of inner products of a constant vector and a
variable vector [14].
Figure 5.17. Realization of a complex multiplication: (a) with four real multiplications; (b) with three real multiplications.
Let $C_R + jC_I$ and $X_R + jX_I$ be two complex numbers, of which $C_R + jC_I$ is the coefficient and $X_R + jX_I$ is a variable complex number. In the case of a complex multiplication, we have

$$Z_R + jZ_I = (C_R + jC_I)(X_R + jX_I) = (C_R X_R - C_I X_I) + j(C_R X_I + C_I X_R) \qquad (5.1)$$

Hence, a complex multiplication can be considered as two inner products of two vectors of length two. We realize the real and imaginary parts separately.

For the sake of simplicity, we consider only the first inner product in Eq. (5.1), i.e., the real part. The complex coefficient $C_R + jC_I$ is assumed to be fixed, and two's-complement representation is used for both the coefficient and the data. The data is scaled so that $|Z_R + jZ_I|$ is less than 1. With a data wordlength of $W_d$ bits, the inner product $Z_R$ can be rewritten

$$Z_R = C_R\left( -x_{R0} + \sum_{k=1}^{W_d-1} x_{Rk} 2^{-k} \right) - C_I\left( -x_{I0} + \sum_{k=1}^{W_d-1} x_{Ik} 2^{-k} \right)$$

where $x_{Rk}$ and $x_{Ik}$ are the $k$th bits in the real and imaginary parts, respectively. By interchanging the order of the two summations we get

$$Z_R = -(C_R x_{R0} - C_I x_{I0}) + \sum_{k=1}^{W_d-1} (C_R x_{Rk} - C_I x_{Ik}) 2^{-k} \qquad (5.2)$$

which can be written as

$$Z_R = -F_k(x_{R0}, x_{I0}) + \sum_{k=1}^{W_d-1} F_k(x_{Rk}, x_{Ik}) 2^{-k} \qquad (5.3)$$

where $F_k(x_{Rk}, x_{Ik}) = C_R x_{Rk} - C_I x_{Ik}$.

$F_k$ is a function of two binary variables, i.e., the $k$th bits in $X_R$ and $X_I$. Since $F_k$ can take on only four values, it can be computed and stored in a look-up table.

In the same way, the corresponding binary function for the imaginary part is $G_k(x_{Rk}, x_{Ik}) = C_R x_{Ik} + C_I x_{Rk}$.
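To make the look-up table concrete, here is a small numerical sketch of Eq. (5.3) (illustrative only; the bit-parallel multiplier discussed in Section 5.3.3.3 sums the partial products in an adder tree rather than accumulating them serially):

```python
from fractions import Fraction

# Numerical sketch of Eq. (5.3): Z_R computed bit by bit from a four-entry
# look-up table F_k (exact rational arithmetic keeps the check noise-free).
W_d = 8
C_R, C_I = Fraction(3, 8), Fraction(-5, 16)   # the fixed coefficient

# F_k indexed by the bit pair (x_Rk, x_Ik):
F = {(0, 0): 0, (0, 1): -C_I, (1, 0): C_R, (1, 1): C_R - C_I}

def ts_bits(x):
    """W_d-bit two's-complement fraction -> bit list, sign bit first."""
    n = int(x * 2 ** (W_d - 1)) % 2 ** W_d
    return [(n >> (W_d - 1 - k)) & 1 for k in range(W_d)]

def Z_R(X_R, X_I):
    xr, xi = ts_bits(X_R), ts_bits(X_I)
    acc = -F[(xr[0], xi[0])]                  # sign-bit term of Eq. (5.3)
    for k in range(1, W_d):
        acc += F[(xr[k], xi[k])] * Fraction(1, 2 ** k)
    return acc

X_R, X_I = Fraction(5, 8), Fraction(-3, 4)
assert Z_R(X_R, X_I) == C_R * X_R - C_I * X_I
```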
Further reduction of the look-up table can be done by offset binary coding [14].
5.3.3.2. Offset Binary Coding
The offset binary coding can be applied to distributed arithmetic by using the following expression for the data:

$$x = -(x_0 - \bar{x}_0)2^{-1} + \sum_{i=1}^{W_d-1} (x_i - \bar{x}_i)2^{-i-1} - 2^{-W_d} \qquad (5.4)$$

where $\bar{x}_i$ denotes the inverse of bit $x_i$.

Without any loss of generality, we assume that the magnitudes of $C_R$, $C_I$, $X_R$, and $X_I$ are all less than 1, and that the wordlength of $X_R$ and $X_I$ is $W_d$. Substituting Eq. (5.4) for $X_R$ and $X_I$ in $Z_R + jZ_I = (C_R X_R - C_I X_I) + j(C_R X_I + C_I X_R)$, the real part becomes

$$Z_R = C_R\left( -(x_{R0} - \bar{x}_{R0})2^{-1} + \sum_{i=1}^{W_d-1} (x_{Ri} - \bar{x}_{Ri})2^{-i-1} - 2^{-W_d} \right) - C_I\left( -(x_{I0} - \bar{x}_{I0})2^{-1} + \sum_{i=1}^{W_d-1} (x_{Ii} - \bar{x}_{Ii})2^{-i-1} - 2^{-W_d} \right)$$

and similarly for the imaginary part. Collecting the terms bitwise, the complex multiplication can be written

$$Z_R + jZ_I = \left( -F(x_{R0}, x_{I0})2^{-1} + \sum_{i=1}^{W_d-1} F(x_{Ri}, x_{Ii})2^{-i-1} + F(0, 0)2^{-W_d} \right) + j\left( -G(x_{R0}, x_{I0})2^{-1} + \sum_{i=1}^{W_d-1} G(x_{Ri}, x_{Ii})2^{-i-1} + G(0, 0)2^{-W_d} \right) \qquad (5.5)$$

The functions $F$ and $G$ can be expressed as follows:

$$F(x_{Ri}, x_{Ii}) = C_R(x_{Ri} - \bar{x}_{Ri}) - C_I(x_{Ii} - \bar{x}_{Ii}) \qquad (5.6)$$

$$G(x_{Ri}, x_{Ii}) = C_I(x_{Ri} - \bar{x}_{Ri}) + C_R(x_{Ii} - \bar{x}_{Ii}) \qquad (5.7)$$
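Eq. (5.4) can be checked exhaustively for a short wordlength (a sketch using exact rational arithmetic):

```python
from fractions import Fraction

# Exhaustive check of the offset binary coding identity, Eq. (5.4), for all
# W_d-bit two's-complement fractions x = -x_0 + sum_{i>=1} x_i * 2^(-i).
W_d = 6
for n in range(-2 ** (W_d - 1), 2 ** (W_d - 1)):
    x = Fraction(n, 2 ** (W_d - 1))
    b = [(n >> (W_d - 1 - i)) & 1 for i in range(W_d)]
    inv = [1 - bit for bit in b]                  # the inverted bits
    obc = (-(b[0] - inv[0]) * Fraction(1, 2)
           + sum((b[i] - inv[i]) * Fraction(1, 2 ** (i + 1))
                 for i in range(1, W_d))
           - Fraction(1, 2 ** W_d))
    assert obc == x
```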
In Eqs. (5.6) and (5.7), the factor $(x_i - \bar{x}_i)$ is either 1 or –1. Hence the partial product, i.e., the function $F$ ($G$), for each bit is of the form $\pm C_R \pm C_I$. All possible partial products are tabulated in Table 5.4. Obviously, it is sufficient to store only the two coefficients $-(C_R - C_I)$ and $-(C_R + C_I)$, since $(C_R - C_I)$ and $(C_R + C_I)$ can easily be generated from the two former coefficients by inverting all bits and adding 1 in the least-significant position.
The complex multiplier with distributed arithmetic is illustrated in Fig. 5.18. The accumulators, which add the partial products, are the same as in a real multiplication, and the partial product generation is only slightly more complicated than for a real multiplier. Hence, the complexity of the complex multiplier in terms of chip area corresponds to approximately two real multipliers.
x_Ri   x_Ii   F(x_Ri, x_Ii)   G(x_Ri, x_Ii)
0      0      –(C_R – C_I)    –(C_R + C_I)
0      1      –(C_R + C_I)    (C_R – C_I)
1      0      (C_R + C_I)     –(C_R – C_I)
1      1      (C_R – C_I)     (C_R + C_I)
Table 5.4. Partial product generation.
Figure 5.18. Block schematic for the complex multiplier: two partial product generators (F and G) feed two accumulators that produce Z_R and Z_I.
5.3.3.3. Implementation Considerations
Multipliers can be divided into three types: bit-parallel, bit-serial, and digit-serial. Although a bit-serial or digit-serial multiplier has a smaller chip area than a bit-parallel multiplier, it requires a higher clock frequency than a bit-parallel one for the same throughput. To achieve a high throughput, a bit-serial or digit-serial multiplier often needs several parallel units, which increases the activity factor of the local clock. To meet the speed requirement, we therefore select a bit-parallel multiplier. The complex multiplier with DA is shown in Fig. 5.18.
The selection of the pre-computed values from Table 5.4, which corresponds to the partial product generation in the real (imaginary) datapath, can be realized with a 2:1 multiplexer and an XOR gate, as shown in Fig. 5.19.
An alternative is to use a 4:1 multiplexer circuit. The benefit of this implementation is that the generation of the select signal ($x_{Ri} \oplus x_{Ii}$) is not required. Hence the delay for the partial product generation is reduced.
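The selection corresponding to Fig. 5.19 (a) can be modeled at word level as follows (a sketch with hypothetical names; in hardware the multiplexer and XOR operate bitwise on the stored coefficient words):

```python
# Fig. 5.19 (a) modeled at word level: the XOR of the two data bits drives a
# 2:1 mux selecting one of the two stored words, and x_Ri selects the sign.
# Two's-complement negation (invert all bits, add 1) supplies the other two
# rows of Table 5.4, so only -(C_R - C_I) and -(C_R + C_I) are stored.
def partial_product_F(x_Ri, x_Ii, CR, CI):
    stored = -(CR - CI) if (x_Ri ^ x_Ii) == 0 else -(CR + CI)
    return -stored if x_Ri else stored

# Matches the F column of Table 5.4 for all four bit pairs.
CR, CI = 5, 3
assert [partial_product_F(a, b, CR, CI) for a, b in
        [(0, 0), (0, 1), (1, 0), (1, 1)]] == [-(CR - CI), -(CR + CI),
                                              (CR + CI), (CR - CI)]
```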
For the accumulator design, the selection of the structure is important. The usual structures for accumulators are array, carry-save, and tree structures. The tree structure is the fastest. It is also suitable for our low power strategy, i.e., designing a faster circuit and using voltage scaling to reduce the power consumption. We select the tree structure for the accumulator.
Figure 5.19. Circuits for partial product generation: (a) 2:1 multiplexer with an XOR-generated select signal; (b) 4:1 multiplexer.
The fastest, i.e., lowest height, multi-operand tree is the Wallace
tree. The Wallace tree has complex wiring and is therefore difficult
to optimize and the layout becomes irregular. The overturned-stairs
tree [40], which has a regular layout and the same height as the
Wallace tree when the data wordlength is less than 19, is used in the
design of the complex multipliers.
The overturned-stairs adder tree was suggested by Mou and Jutand [40]. The main features of the overturned-stairs adder tree are
• A recursive structure that yields regular routing and simplifies the design of the layout generator.
• A low tree height, i.e., $O(\sqrt[p]{N})$, where p depends on the type of overturned-stairs tree.
There are several types of overturned-stairs adder trees [40]. The first-order overturned-stairs adder tree, which has the same speed bound as the Wallace tree when the number of operands is less than 19, is chosen.
The construction of the overturned-stairs tree is illustrated in Fig. 5.20. The trees of height 1 to 3 are shown in Fig. 5.20. When the height is more than three, we can construct the tree with only three building blocks: body, root, and connector. The body can be constructed recursively according to Fig. 5.20. The body of height $j$ ($j > 2$) consists of a body of height $j-1$, a branch of height $j-2$, and a connector. The branch of height $j-2$ is formed by using $j-2$ carry-save adders (CSAs) on "top of each other" with proper interconnections [40]. The connector connects three feed-throughs from the body of height $j-1$ and two outputs from the branch of height $j-2$ to construct the body of height $j$. A root (CSA) is connected to the outputs of the connector to form the whole tree of height $j+1$.
Since there are only three feed-throughs from the body of height $j-1$ to the body of height $j$ in the overturned-stairs tree, the routing planning in the accumulator design is also easy.
The full-adder is essential for the accumulator, and the choice of full-adder has a large impact on the performance of the accumulator. We compared several full-adders and found the most suitable one for our implementation. However, a large number of new adder cells have recently been proposed [51], and they should be evaluated in future work.
The first type of full-adder is the conventional static CMOS adder. When the voltage is as low as 1.5 V, the conventional static CMOS full-adder, with its large stack height, is too slow. Furthermore, it is not competitive from a power consumption point of view.
Figure 5.20. Overturned-stairs trees: trees of height 1 to 3, a branch of n CSAs, and the recursive construction of a body of height j and a tree of height j+1 from body, branch, connector, and root.
A second type of full-adder is a full-adder with transmission gates (TG). This full-adder realizes the XOR gate with transmission gates, and both the power consumption and the chip area are smaller than those of a conventional static CMOS full-adder.
Figure 5.21. Conventional static CMOS full-adder.
Figure 5.22. Transmission gates full-adder.
A third type of full-adder is the Reusens full-adder [50]. This full-adder is fast and compact but requires buffers for the outputs. Buffer insertion is usually considered a drawback since it introduces delay and increases the power consumption. However, in the accumulator the buffer insertion is necessary anyway in order to drive the long interconnections. There is no direct path from V_DD to V_SS in this full-adder, which tends to reduce the power consumption.
5.3.3.4. Accumulator Implementation
After the selection of the structure and the adder cell, the accumulator can be implemented.
Software for the automatic generation of overturned-stairs adder trees has been developed. The software can handle different wordlengths for the data and the coefficient. The generated structural VHDL code can be validated by applying random test vectors in a testbench.
A handcrafted accumulator using the overturned-stairs tree in a 0.35 µm standard CMOS technology is shown in Fig. 5.24. The worst-case delay is 26 ns at 1.5 V and 25˚C according to SPICE simulation. The power consumption for this complex multiplier is 15 mW at 1.5 V and 72.6 mW at 3.3 V, both at 25 MHz.
Adder type     Transistor count   Delay (ns) @ 1.5 V   Power (µW) @ 1.5 V
Static CMOS    24                 4.2                  4.3
TG             16                 3.5                  2.5
Reusens        16                 3.2                  2.1
Table 5.5. Comparison of full-adders in 0.35 µm technology.
Figure 5.23. Reusens full-adder.
5.3.3.5. Brent-Kung Adder Implementation
The Brent-Kung adder is used as the vector merge adder. It belongs to the prefix adders, which use the carry propagate and generate properties of the full-adder to accelerate the carry propagation.
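The propagate/generate principle can be sketched as follows (an illustrative model; Brent-Kung evaluates the same prefix computation as an O(log n) tree of operator nodes with a regular layout, rather than the serial scan used here for clarity):

```python
# Illustrative model of the propagate/generate principle behind prefix
# adders such as Brent-Kung: the carry into every bit position is a prefix
# computation over (generate, propagate) pairs with an associative operator.

def prefix_add(a, b, width):
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # generate bits
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # propagate bits

    def op(hi, lo):
        """(g, p) o (g', p') = (g | (p & g'), p & p')."""
        return hi[0] | (hi[1] & lo[0]), hi[1] & lo[1]

    carry = [0] * (width + 1)
    acc = (0, 1)  # identity element: no generate, always propagate
    for i in range(width):
        acc = op((g[i], p[i]), acc)
        carry[i + 1] = acc[0]

    s = 0
    for i in range(width):
        s |= (p[i] ^ carry[i]) << i  # sum bit: propagate XOR incoming carry
    return s

assert prefix_add(23, 42, 8) == 65
```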
A program for schematic generation of the Brent-Kung adder has been developed, and the layout generator is under construction.
The generated schematic of a 32-bit Brent-Kung adder is illustrated in Fig. 5.25. The layout of a 32-bit Brent-Kung adder is shown in Fig. 5.26.
Figure 5.24. Accumulator layout (size: 704.3×479.5 µm²).
Figure 5.25. Block diagram for a 32-bit Brent-Kung adder.
5.4. Final FFT Processor Design
After the design of the components and the selection of FFT
architecture, we apply the meet-in-the-middle methodology to
combine the components into the complete implementation.
An observation is that a large portion of the total power is consumed by the computation of complex multiplications in the FFT processor. We have implemented a complex multiplier that consumes 72.6 mW at a power supply voltage of 3.3 V and 25 MHz in a standard 0.35 µm CMOS technology. A 1024-point FFT processor requires four complex multipliers and hence consumes about 290 mW@3.3 V, 25 MHz. Even with bypass techniques for trivial complex multiplications, the power consumption for the computation of the complex multiplications is still more than 210 mW. Hence the reduction of the number of complex multiplications is vital.
Using high-radix butterflies can reduce the number of complex multiplications outside the butterflies. However, it is not common to use high-radix butterflies in VLSI implementations due to two main drawbacks: the number of complex multiplications within the butterflies increases if the radix is larger than 4, and the routing complexity increases as well. Overcoming these two drawbacks is the key to using high-radix butterflies.
Figure 5.26. 32-bit Brent-Kung adder (size: 0.25×0.16 mm²).
As is well known, adders consume much less power than multipliers with the same wordlength, because the adder has less hardware and far fewer glitches. We have implemented a 32-bit (real) Brent-Kung adder that consumes 1.5 mW@3.3 V, 25 MHz, which is much less than a 17×13-bit complex multiplier (72.6 mW@3.3 V, 25 MHz). Therefore it is efficient to replace the complex multipliers with constant multipliers when possible.
We use constant multipliers in the design of the 16-point butterfly in order to reduce the number of complex multipliers. For a 16-point FFT butterfly, there are three types of non-trivial complex multiplications within the butterfly, i.e., multiplications with $W_{16}^1$, $W_{16}^2$, and $W_{16}^3$. The multiplications with $W_{16}^1$ and $W_{16}^3$ can share coefficients, since $\cos(\pi/8) = \sin(\pi/2 - \pi/8) = \sin(3\pi/8)$ and $\sin(\pi/8) = \cos(\pi/2 - \pi/8) = \cos(3\pi/8)$. We can therefore use constant multipliers, which reduces the complexity. The implementation of a multiplication with $W_{16}^1$ is illustrated in Fig. 5.27.
Figure 5.27. Complex multiplication with $W_{16}^1$, using the constant coefficients $\cos(\pi/8)$, $\cos(\pi/8)+\sin(\pi/8)$, and $\cos(\pi/8)-\sin(\pi/8)$.
The selection of the FFT algorithm affects the number and positions of the constant multipliers. For a 16-point DFT, the radix-4 FFT and SRFFT algorithms are more efficient than the radix-2 FFT algorithm in terms of the number of multiplications. Moreover, both the radix-2 and split-radix algorithms require three multipliers (two multipliers with $W_{16}^2$ and one multiplier with $W_{16}^1$), while the radix-4 algorithm requires only two multipliers (one multiplier with $W_{16}^2$ and one multiplier with $W_{16}^1$/$W_{16}^3$). Hence the 16-point butterfly based on radix-4 is more efficient and is selected for our implementation.
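The structure of Fig. 5.27 can be sketched as follows (floating-point for illustration; the hardware uses fixed-point constant multipliers):

```python
import math

# Multiplication by W_16^1 = cos(pi/8) - j*sin(pi/8) with three real constant
# multiplications, using the coefficients shown in Fig. 5.27.
def mul_W16_1(a, b):                          # input a + jb
    c, s = math.cos(math.pi / 8), math.sin(math.pi / 8)
    t = c * (a + b)                           # shared product
    return t - (c - s) * b, t - (c + s) * a   # (real, imaginary)

re, im = mul_W16_1(0.7, -0.3)
ref = complex(0.7, -0.3) * complex(math.cos(math.pi / 8),
                                   -math.sin(math.pi / 8))
assert abs(complex(re, im) - ref) < 1e-12
```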
By replacing the complex multiplications with constant multiplications within the 16-point butterfly, the power consumption for the complex multiplications within the 16-point butterfly is reduced to 10 mW@3.3 V, 25 MHz. The number of non-trivial complex multiplications is reduced to 1776, and the total number of complex multipliers is reduced to two for a 1024-point FFT due to the use of 16-point butterflies. The number of non-trivial complex multiplications required for a 1024-point FFT with different algorithms is shown in Table 5.6.
In the 1024-point FFT processor, there are only two complex multipliers and two constant multipliers, which consume less than 160 mW. Hence, a power saving of more than 20% for the computation of the complex multiplications is achieved. This is less than the theoretical saving of 35% (the ratio of the numbers of complex multiplications, 1 − 1776/2732 ≈ 35%) due to the computation of the complex multiplications within the 16-point butterfly.
To cope with the complex routing associated with high-radix butterflies, it is better to divide the 16-point butterfly into four stages, since the radix-2 butterfly has the simplest routing.
As mentioned in the resource analysis, the most memory-efficient architectures are those with single-path feedback, since they give the minimum data memory, e.g., only N − 1 words for an N-point FFT.
Algorithm             R2FFT   R4FFT   SRFFT   Our approach
No. of Comp. Mult.    3586    2732    2390    1776
Table 5.6. The number of non-trivial complex multiplications for different FFT algorithms.
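As a quick check of the memory figure mentioned above (a sketch assuming a radix-2 single-path delay feedback pipeline):

```python
# In a radix-2 single-path delay feedback pipeline, the feedback buffers
# hold N/2, N/4, ..., 1 words, i.e. N - 1 words in total.
N = 1024
stages = [N >> (s + 1) for s in range(N.bit_length() - 1)]  # [512, ..., 1]
assert sum(stages) == N - 1
```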
The radix-4 algorithm can be decomposed into a radix-2 algorithm as done in [24]. Hence the 16-point butterfly can be mapped onto four pipelined radix-2 butterflies, each with its own feedback memory. The 16-point butterfly is illustrated in Fig. 5.28.
The power consumption for the data memory is estimated at 300 mW (the power consumption for memories of 128 words or more is given by the vendor; smaller memories are estimated through linear approximation down to 32 words). The butterflies consume about 30 mW.
The total power consumption for the three main subsystems is 490 mW. Assuming 15% overhead for, for instance, clock buffers and communication buses, the power consumption for the FFT processor is estimated at about 550 mW at 3.3 V [38].
The 1024-point FFT processor can also run at 1.5 V, which gives further power savings. The total power consumption of the 1024-point FFT processor is then less than 200 mW at 1.5 V in a 0.35 µm standard CMOS process. The memories contribute 55% of the total power consumption, the computation units for butterfly operations and complex multiplications 37%, and others 8%.
5.5. Summary
In this chapter, we have discussed the implementation of a 1024-
point FFT processor.
A resource analysis gave a starting point for the implementation. We proposed a wordlength optimization method for the pipelined FFT architectures. This method gave a memory saving of up to 14%.
Figure 5.28. 16-point butterfly: four pipelined radix-2 butterfly elements, each with its own feedback memory, and a constant multiplier.
We discussed the implementation of the subblocks, i.e., the butterflies, the memories, and the complex multipliers. We proposed high-radix butterflies using the carry-save technique, which is efficient in terms of delay and area. We constructed a complex multiplier using DA and the overturned-stairs tree, which is area efficient. All these subblocks can operate at low power supply voltages and are suitable for voltage scaling.
Finally, we discussed the implementation of an FFT processor using a 16-point butterfly. The proposed 16-point butterfly reduces the number of complex multiplications and retains the minimum memory requirement, which is power efficient.
6. CONCLUSIONS
This thesis has discussed the essential parts of low power pipelined FFT processor design.
The selection of the FFT algorithm is an important starting point for the FFT processor implementation. An FFT algorithm with fewer multiplications and additions is attractive.
The selection of the low power strategy affects the FFT hardware design. Supply voltage scaling is an efficient low power technique and was used for the FFT processor design.
After the selection of the FFT algorithm and the low power strategy, it is important to reduce the hardware complexity. The wordlengths in the stages of the pipelined FFT processor may differ and can therefore be optimized. A simulation-based method has been developed for wordlength optimization of the pipelined FFT architectures. In some cases, the wordlength optimization can reduce the size of the memories by up to 14% compared with using a uniform wordlength in each stage. This also results in a power saving of 14% for the memories. The reduction of the wordlength also reduces the power consumption in the complex multipliers and the butterflies proportionally.
For the detailed design, we proposed a carry-save technique for the implementation of the butterflies. This technique is generally applicable to high-radix butterflies. The proposed high-radix butterflies reduce both the area and the delay by more than 20%. In the complex multiplier design, we use distributed arithmetic to reduce the hardware complexity. We select the overturned-stairs tree for the realization of the complex multiplier, since it has a regular structure and the same performance as the Wallace tree when the data wordlength is less than 19. Simulation shows that the complex multiplier operates at up to 30 MHz at 1.5 V. The power consumption is 15 mW at 25 MHz with a 1.5 V power supply voltage. In the SRAM design, we modified an STC D-flip-flop to form a two-stage sense amplifier. The sense amplifier can be operated at low power supply voltages.
With optimized wordlengths, the data memory size is reduced by 10%. Using the proposed 16-point butterfly, the number of complex multiplications can be reduced, which results in a power saving of more than 20% for the complex multiplications. With all these efforts, the total power consumption of the 1024-point pipelined FFT processor, with a continuous throughput of 25 Msamples/s and an equivalent wordlength of 12 bits, is less than 200 mW at 1.5 V in a 0.35 µm standard CMOS process. The memories contribute 55% of the total power consumption, the computation units for butterfly operations and complex multiplications 37%, and others 8%. The memories consume the most significant part of the total power, which indicates that optimization of the memory structure could be important for the implementation of low power FFT processors.
REFERENCES
[1] M. Alidina, J. Monteiro, S. Devadas, A. Ghosh, and M. Papaefthymiou, “Precomputation-based sequential logic optimization for low power,” IEEE Trans. on VLSI Systems, Vol. 2, No. 4, pp. 426–436, Dec., 1994.
[2] Analog Devices Inc., ADSP-21060 SHARC Super Harvard
Architecture Computer, Norwood, MA, 1993.
[3] A. Antola, R. Negrini, and N. Scarabottolo, “Arrays for discrete Fourier Transform,” Proc. European Signal Process. Conf. (EUSIPCO), Amsterdam, Netherlands, Sep. 1988, Vol. 2, pp. 915–918.
[4] G. Bi and E. V. Jones, “A pipeline FFT processor for word-
sequential data,” IEEE Trans. on Acoustics, Speech, Signal
Processing, Vol. 37, No. 12, pp. 1982–1985, Dec., 1989.
[5] J. A. C. Bingham, “Multicarrier modulation for data transmission: An idea whose time has come,” IEEE Commun. Mag., Vol. 28, pp. 5–14, May 1990.
[6] L. Bisdounis, O. Koufopavlou, and S. Nikolaidis, “Accurate evaluation of CMOS short-circuit power dissipation for short channel devices,” Intern. Symp. on Low Power Electronics & Design, Monterey, CA, Aug., 1996, pp. 181–192.
[7] E. O. Brigham, The Fast Fourier Transform and Its
Applications, Prentice Hall, 1988.
[8] C. S. Burrus, “Index mappings for multidimensional formulation of DFT and convolution,” IEEE Trans. on Acoustics, Speech, Signal Processing, Vol. ASSP–25, No. 3, pp. 239–242, June, 1977.
[9] A. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Kluwer, 1995.
[10] A. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R. W. Brodersen, “Optimizing power using transformations,” IEEE Trans. on Computer-Aided Design, Vol. 14, No. 1, pp. 12–31, Jan., 1995.
[11] A. Chandrakasan, S. Sheng, and R. W. Brodersen, “Low-power CMOS digital design,” IEEE Journal of Solid-State Circuits, Vol. 27, No. 4, pp. 472–484, April, 1992.
[12] S. Chang, M. Marek-Sadowska, and K. Cheng, “Perturb and simplify: multilevel Boolean network optimizer,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, Vol. 15, No. 12, pp. 1494–1504, Dec., 1996.
[13] J. W. Cooley and J. W. Tukey, “An algorithm for the machine computation of complex Fourier series,” Mathematics of Computation, Vol. 19, pp. 297–301, April, 1965.
[14] A. Croisier, D.J. Esteban, Levilion, and V. Riso, Digital Filter
for PCM Encoded Signals, U.S. Patent 3777 130, Dec., 1973.
[15] A. M. Despain, “Fourier transform computer using CORDIC
iterations,” IEEE Trans. on Computers, Vol. C–23, No. 10, pp.
993–1001, 1974.
[16] DoubleBW Systems B.V., PowerFFT™ processor data sheet, Delft, the Netherlands, March, 2002.
[17] P. Duhamel and H. Hollmann, “Split-radix FFT algorithm,”
Electronics Letters, Vol. 20, No. 1, pp. 14–16, Jan., 1984.
[18] P. Duhamel and M. Vetterli, “Fast Fourier transforms: A
tutorial review and a state of the art,” Signal Processing,
Vol. 19, No. 4, pp. 259–299, April, 1990.
[19] I. J. Good, “The interaction algorithm and practical Fourier
analysis,” J. Royal Statist. Soc., ser. B, Vol. 20, pp. 361–372,
1958.
[20] A. Ghosh, S. Devadas, K. Keutzer, and J. White, “Estimation
of average switching activity in combinational and sequential
circuits,” In Proc. of the 29th Design Automation Conf., June,
1992, pp. 253–259.
[21] S. F. Gorman and J. M. Wills, “Partial Column FFT pipelines,” IEEE Trans. on Circuits and Systems–II, Vol. 42, No. 6, June, 1995.
[22] H. L. Groginsky and G. A. Works, “A Pipeline Fast Fourier Transform,” IEEE Trans. on Computers, Vol. C–19, No. 11, pp. 1015–1019, 1970.
[23] D. Hang and Y. Kim “A deep sub-micron SRAM cell design
and analysis methodology,” In Proc. of Midwest Symp. on
Circuits and Systems, Dayton, Ohio, USA, Aug., 2001,
pp. 858–861.
[24] S. He and M. Torkelson, “A New Approach to Pipeline FFT
Processor,” In Proc. of the 10th Intern. Parallel Processing
Symp. (IPPS), Honolulu, Hawaii, USA, pp. 766–770, 1996.
[25] M. T. Heideman and C. S. Burrus, “On the number of multiplications necessary to compute a Length-2^n DFT,” IEEE Trans. on Acoustics, Speech, Signal Processing, Vol. ASSP–34, No. 1, Feb., 1986.
[26] M. T. Heideman, D. H. Johnson, and C. S. Burrus, “Gauss and
the history of the FFT,” IEEE Acoustics, Speech, and Signal
Processing Magazine, Vol. 1, pp 14–21, Oct., 1984.
[27] M. Hiraki, H. Kojima, et al., “Data-dependent logic swing internal bus architecture for ultralow-power LSI’s,” IEEE Journal of Solid-State Circuits, Vol. 30, No. 4, pp. 379–402, April, 1995.
[28] http://theory.lcs.mit.edu/~fftw
[29] Intel Corp., SA-1100 Microprocessor Technical Reference
Manual, Santa Clara, CA., USA., 1998.
[30] K. Itoh et al., “Trends in Low-power RAM Circuit Technologies,” Proc. of IEEE, pp. 524–543, April, 1995.
[31] D. P. Kolba and T. W. Parks, “A prime factor algorithm using high-speed convolution,” IEEE Trans. on Acoustics, Speech, Signal Processing, Vol. ASSP–31, No. 4, pp. 281–294, Aug., 1977.
[32] S. Krishnamoorthy and A. Khouja, “Efficient power analysis
of combinational circuits,” In Proc. of Custom Integrated
Circuit Conf., San Diego, California, USA, 1996,
pp. 393–396.
[33] C. Lemonds and S. S. Mahant Shetti, “A low power 16 by 16 multiplier using transition reduction circuitry,” In Proc. of the Intern. Workshop on Low Power Design, Napa, California, USA, April, 1994, pp. 139–142.
[34] W. Li, Y. Ma, and L. Wanhammar, “Word length estimation for memory efficient pipeline FFT/IFFT processors,” ICSPAT, Orlando, Florida, USA, Nov., 1999, pp. 326–330.
[35] W. Li and L. Wanhammar, “A Complex Multiplier Using
‘Overturned-Stairs’ Adder Tree,” In Proc. of Intern. Conf. on
Electronic Circuits and Systems (ICECS), Paphos, Cyprus,
September, 1999, Vol. 1, pp. 21–24.
[36] W. Li and L. Wanhammar, “Efficient Radix-4 and Radix-8
Butterfly Elements,” In Proc. of NorChip Conf., Oslo,
Norway, Nov., 1999, pp. 262–267.
[37] W. Li and L. Wanhammar, “A Pipeline FFT Processor,” In
Proc. of IEEE Workshop on Signal Processing Systems
(SiPS), Taipei, China, Nov., 1999, pp. 654–662.
[38] W. Li and L. Wanhammar, “An FFT processor based on 16-
point module,” In Proc. of NorChip Conf., Stockholm,
Sweden, Nov., 2001, pp. 125–130.
[39] J. Melander, Design of SIC FFT Architectures, Linköping
Studies in Science and Technology, Thesis No. 618,
Linköping University, Sweden, May, 1997.
[40] Z. Mou and F. Jutand, “‘Overturned-Stairs’ Adder Trees and Multiplier Design,” IEEE Trans. on Computers, Vol. C–41, pp. 940–948, 1992.
[41] Motorola Inc., DSP96002 IEEE Floating-Point Dual-Port
Processor User’s Manual, Phoenix, AZ, 1989.
[42] L. Nielsen, C. Nielsen, J. Sparsø, and K. van Berkel, “Low-power operation using self-timed circuits and adaptive scaling of the supply voltage,” IEEE Trans. on VLSI Systems, Vol. 2, No. 4, pp. 391–397, Dec., 1994.
[43] E. Nordhamn, Design of an Application Specific FFT
Processor, Linköping Studies in Science and Technology,
Thesis No. 324, Linköping University, Sweden, June, 1992.
[44] M. C. Pease, “An adaptation of fast Fourier transform for
parallel processing,” Journal of the Association for
Computing Machinery, Vol. 15, No. 2, pp. 252–264, April,
1968.
[45] M. Potkonjak and J. Rabaey, “Algorithm selection: A quantitative optimization intensive approach,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, Vol. 18, No. 5, pp. 524–532, May, 1999.
[46] J. Rabaey, L. Guerra, and R. Mehra, “Design guidance in the
power dimension,” In Proc. of Intern. Conf. on Acoustics,
Speech and Signal Processing, Detroit, Michigan, USA, May,
1995, Vol. 5, pp. 2837–2840.
[47] J. Rabaey and M. Pedram (Ed.), Low Power Design
Methodologies, Kluwer, 1996.
[48] L. R. Rabiner and B. Gold, Theory and Application of Digital
Signal Processing, Prentice Hall, 1975.
[49] C. M. Rader, “Discrete Fourier transforms when the number
of data samples is prime,” Proc. of IEEE, Vol. 56, pp. 1107–
1108, June, 1968.
[50] P. P. Reusens, High Performance VLSI Digital Signal
Processing Architecture and Chip Design, Cornell University,
Thesis, Aug., 1983.
[51] M. Sayed and W. Badawy, “Performance analysis of single-bit full adder cells using 0.18, 0.25, and 0.35 µm CMOS technologies,” In Proc. of IEEE Intern. Symp. on Circuits and Systems (ISCAS), Vol. 3, Scottsdale, Arizona, USA, May, 2002, pp. 559–563.
[52] E. Seevinck et al., “Static-Noise Margin Analysis of MOS SRAM Cells,” IEEE Journal of Solid-State Circuits, Vol. SC–22, No. 5, pp. 748–754, Oct., 1987.
[53] M. R. Stan and W. P. Burleson, “Bus-invert coding for low-power I/O,” IEEE Trans. on VLSI Systems, Vol. 3, No. 1, pp. 49–58, March, 1995.
[54] H. J. M. Veendrick, “Short-circuit dissipation of static CMOS circuitry and its impact on the design of buffer circuits,” IEEE Journal of Solid-State Circuits, Vol. 19, pp. 468–473, Aug., 1984.
[55] Texas Instruments Incorporated, “An Implementation of FFT,
DCT, and Other Transforms on the TMS320C30,”
Application report: SPRA113, Dallas, Texas, 1997.
[56] V. Tiwari, S. Malik, and P. Ashar, “Compilation techniques for
low energy: an overview,” In Proc. of 1994 IEEE Symp. on
Low Power Electronics, San Diego, California, USA, Oct.,
1994, pp. 38–39.
[57] Q. Wang and S. Vrudhula, “Multi-level logic optimization for
low power using local logic transformation,” Proc. of Intern.
Conf. of Computer-Aided Design, San Jose, California, USA,
pp. 270–277, 1996.
[58] L. Wanhammar, DSP Integrated Circuits, Academic Press,
1999.
[59] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI
Design, Addison-Wesley, second edition, 1993.
[60] T. Widhe, Efficient Implementation of FFT Processing
Elements, Linköping Studies in Science and Technology,
Thesis No. 619, Linköping University, Sweden, June, 1997.
[61] S. Winograd, “On computing the discrete Fourier transform,” Proc. Nat. Acad. Sci. USA, Vol. 73, No. 4, pp. 1005–1006, April, 1976.
[62] E. H. Wold and A. M. Despain, “Pipeline and Parallel-pipeline FFT Processors for VLSI Implementation,” IEEE Trans. on Computers, Vol. C–33, No. 5, pp. 414–426, 1984.
[63] G. Yeap, Practical Low Power Digital VLSI Design, Kluwer,
1998.
[64] J. Yuan, High Speed CMOS Circuit Technique, Linköping
Studies in Science and Technology, Thesis No. 132,
Linköping University, Sweden, 1988.
[65] Zarlink Semiconductor Inc., PDSP16515A Stand Alone FFT Processor Advance Information, April, 1999.

Studies on Implementation of Low Power FFT Processors Copyright © 2003 Weidong Li Department of Electrical Engineering Linköpings universitet SE-581 83 Linköping Sweden ISBN 91-7373-692-9 ISSN 0280-7971

To memory of my father.

Abstract
In the last decade, the interest for high speed wireless and on cable communication has increased. Orthogonal Frequency Division Multiplexing (OFDM) is a strong candidates and has been suggested or standardized in those communication systems. One key component in OFDM-based systems is FFT processor, which performs the efficient modulation/demodulation. There are many FFT architectures. Among them, the pipeline architectures are suitable for the real-time communication systems. This thesis presents the implementation of pipeline FFT processors that has low power consumptions. We select the meet-in-the-middle design methodology for the implementation of FFT processors. A resource analysis for the pipeline architectures is presented. This resource analysis determines the number of memories, butterflies, and complex multipliers to meet the specification. We present a wordlengths optimization method for the pipeline architectures. We show that the high radix butterfly can be efficiently implemented with carry-save technique, which reduce the hardware complexity and the delay. We present also an efficient implementation of complex multiplier using distributed arithmetic (DA). The implementation of low voltage memories is also discussed. Finally, we present a 16-point butterfly using constant multipliers that reduces the total number of complex multiplications. The FFT processor using the 16-point butterflies is a competitive candidate for low power applications.

Electronics Systems at Linköping University. Henrik Ohlsson. Also. and friends. relatives. Finally. I would like to express my gratitude to Oscar Gustafsson. especially A Phung. and Per Löwenberg for the proofreading. I would like to thank my family. and most importantly. Lastly. .Acknowledgement I would like to thank my supervisor Professor Lars Wanhammar for his support and guidance of this research. This work was financially supported by the Swedish Strategic Fund (SSF) under INTELECT program. for their help in the discussions in research as well as other matters. for their boundless support and encouragement. I would like to thank the whole group.

.

..............1....2.... 30 2....... ..3.. ... FFT ALGORITHMS ... ........ 9 2... Sande-Tukey FFT Algorithms ............ . ......... 36 i ...... Thesis Outline ......1......... .6......... Generalized Formula ...... 9 2............ INTRODUCTION .......... Split-Radix FFT Algorithm ..... 1 1.. Prime Factor FFT Algorithms . Other FFT Algorithms . ........ 18 2.............5.2... Cooley-Tukey FFT Algorithms ..... .... ... ..... 10 2.... IDFT Implementation ....... ..4........ ..Table of Contents 1.. 20 2. ........... 31 2................................ ...... ........ 3 1. ......... 27 2.............. .... .............................................. ............. ....5.. 26 2...... 12 2....6............... Eight-Point DFT ...... . .......... ............................... ..2............ Summary ................ ........... 6 1.................. .. 23 2..... Contributions .... 8 2.... ....... Basic Formula ............ 2 1........3..................... ......... ..........1. 29 2.. 26 2................................ .......... Scaling and Rounding Issue ........................ ................... ................................... ....... 23 2................7...1.....4..5...... . .... Addition Complexity . Other Issues .2.............. ............................. ...........1.... ......... 7 1.......... Power Consumption .........1........ ....... ............... DFT and FFT ........ ....... .. Multiplication Complexity ... OFDM Basics ................ . ............................ Performance Comparison .. .. ............................................1. .......... ......... ............1......................... .. ....................... ..... .6........ ... .......1......... .. Winograd Fourier Transform Algorithm ............4......... ...... .3.. ....2..2. 13 2............4........... ...... ............. ................. 35 2........... .5........ ..

.... ... Design Method ..... .............. Low Power Guidelines ......... FFT ARCHITECTURES ..............4........................... 67 ii ................ Radix-4 Multipath Delay Commutator .... ...... . Algorithm-Specific Processors .... Summary .........4............ .. Radix-2 Multipath Delay Commutator ...... ........6.. 67 5....... . Low Power Techniques ...........2..................... ..........3.. 58 4............... Summary ........... 37 3.............. 53 4....2.......2........... Switching Power ... 37 3.......3.....1......1.4............. . ................. General-Purpose Programmable DSP Processors ..............5....1.....2. 47 3..................... System Level ............ ...... . ... .......... ............ ...1...... ..... .. .. .......... ...... Circuit Level . 40 3....3.....1... 54 4............................ ... Algorithm Level .................3....... Radix-4 Single-Path Delay Feedback ..4...2..................................... 39 3.......... Logic Level .....3............. 44 3................2................. .... 50 3.3..... 40 3... .................................... . 53 4........ ...................... 61 4......... LOW POWER TECHNIQUES ..3.......2................. Power Dissipation Sources .2....5. 62 4.. .................................... Short-Circuit Power ..........................................2. ......... ... .. 59 4.. Architecture Level .......2. .. ... Resource Analysis ..... Radix-2 Single-Path Delay Feedback .... .. 60 4..... .... Leakage Power ..... ... ................. 65 5.1.. 57 4. .1... 38 3.......... 52 4.3. .......... ......... 51 3.1.............................3........ ..1......3..... 56 4..... ..... ................. ............. Programmable FFT Specific Processors .....2........1......... 37 3.... . ............ .............. Radix-4 Single-Path Delay Commutator ........... 65 5.. 63 5. ..... ............................. . IMPLEMENTATION OF FFT PROCESSORS .... Radix-22 Single-Path Delay Commutator . .....3............. .. ......2........ 42 3........... ....3.. ....................... High-level Modeling of an FFT Processor ................

........... ... ... ....................... ...... 70 5.3.... 96 6........ .............. Validation of the High-Level Model ....2. 72 5..... Summary .. .. Complex Multiplier . ... ...... ........ ..... ... ........ ..... .... Memory ... Subsystems .......... ...... ....... .... .... . . ... ...... 79 5. 73 5....... .4.... ..... .............. ... ...3.. ................3... ... . ........ REFERENCES .. ................5............ ........................ Butterfly ... .....1......2... .... .. 71 5.... 101 iii .....3....... .. . Final FFT Processor Design . 93 5.... .. . .... ........... ..... . . .... ............5..... 83 5...... . .. ..... ..... .3. .... Wordlength Optimization ..2.... ................ .. ........2...... ..... ... ..... .... .. 99 7...........3.......... . .................. . .. .. ..... CONCLUSIONS ......

iv .

which facilitates the efficient transformation between the time domain and the frequency domain for a sampled signal. This thesis addresses the problem of designing efficient application-specific FFT processors for OFDM based wide-band communication systems. e.g. speech signal processing. communication. This performance requirement necessitates an application-specific integrated circuit (ASIC) solution for FFT implementation. The OFDM based communication systems have high performance requirement in both throughput and power consumption. is used in many applications. radar. we give a short review on DFT and FFT. Then an introduction to OFDM and power consumption are presented. which has made the FFT valuable for those communication systems. Finally. The immunity to multipath fading channel and the capability for parallel signal processing make it a promising candidate for the next generation wide-band communication systems. the outline of the thesis is described. the interest for high speed wireless and on cable communication has increased. 1 .1 INTRODUCTION The Fast Fourier Transform (FFT) is one of the most used algorithms in digital signal processing. In the last decade. has been demonstrated to be an efficient and reliable approach for high-speed data transmission. The modulation and demodulation of OFDM based communication systems can be efficiently implemented with an FFT. In this chapter. sonar. which is a special Multicarrier Modulation (MCM) method. Orthogonal Frequency Division Multiplexing (OFDM) technique.. The FFT.

The number N is also called transform length.∑ X ( n )W N N n=0 N–1 (1. …. k = 0.1) for n = 0. 1. 1. The complexity for computing an N-point DFT is therefore O ( N 2 ) . With the contribution from Cooley and Tukey [13]. Among the FFT algorithms. the complexity for computation of an N-point DFT can be reduced to O ( N log ( N ) ) .1) requires N ( N – 1 ) complex additions and N ( N – 1 ) complex multiplications. Direct computation of an N-point DFT according to Eq. One algorithm is the split-radix algorithm. …. 2 Chapter 1 . the implementation of FFT algorithms is still a challenging task. was published in 1984. 1. The inverse DFT (IDFT) for data sequence { X ( n ) } ( n = 0. Another algorithm is Winograd Fourier Transform Algorithm (WFTA). …. N – 1 is defined as X (n) = N–1 k=0 ∑ x ( k )W N nk (1. which reduces of complexity for DFT computation. which treats the even part and odd part with different radix.2) for k = 0. DFT and FFT The Discrete Fourier transform (DFT) for an N-point data sequence { x ( k ) } . where W N = e – 2π ⁄ N is the primitive Nth root of unity. respectively. two algorithms are especially noteworthy. are called fast Fourier transform (FFT) algorithms. 1.1. which requires the least known number of multiplications among practical algorithms for moderate lengths DFTs and was published in 1976. (1. Due to the high computation workload and intensive memory access. The index k respective n is referred to as time-domain and frequencydomain index. The Cooley and Tukey’s approach and later developed algorithms. N – 1 . …. N – 1 .1. N – 1 ) is –n k 1 x ( k ) = --. Many implementation approaches for the FFT have been proposed since the discovery of FFT algorithms.

The overlapping does not cause interference of subchannels due to the orthogonal modulation. The principle for MCM is shown in Fig. INTRODUCTION 3 .0 Figure 1. The high rate data stream at M f sym bits/s is grouped into blocks with M bits per block at a rate of f sym . 1.1. k and totally M bits for modulation of N carriers. OFDM Basics OFDM is a special MCM technique. which transmit data in parallel [5]. This leads to inefficient usage of spectrum and excessive hardware requirement. Each subchannel has its own modulator and demodulator.n-2 Parallel to serial demodulator n-1 demodulator n-2 Output demodulator 0 fc. In the conventional MCM.n-1 fc.0 fsym symbol/s Channel noise fc.1. the spectrum can be used more efficient since overlapping of subchannels is allowed. With OFDM. the N subchannels are nonoverlapping. The OFDM technique can overcome those drawbacks.n-2 Input Mfsym b/s x(t) modulator 0 M bits (a symbol) fc.n-1 fc. Serial to parallel mn-1 bits modulator n-1 modulator n-2 fc.1. A block is called a symbol. The idea for MCM is to divide transmission bandwidth into many narrow subchannels (subcarriers). A symbol allocates mkbits of M bits for modulation of a carrier k at f c. which send symbols at a rate of f sym .2. A multicarrier modulation system. This results in N subchannels.

For two carrier signals.3) (1. 4 Chapter 1 . The symbol rate is fsym. (1..1. gk. It means that there is no interference to other subchannels with the selected functions. the carrier signals can be expressed as following: k f k = f 0 + --T  e j2πf k t gk ( t ) =  0  0≤k<N–1 0≤t<T otherwise (1. f 1/T Figure 1.2.4) where f0 is the system base frequency and g k is the signal for carrier k at frequency fk. the sending signal x(t) is the summation of symbol transmission in all subchannels.. each symbol is sent during a symbol time T (which is equal to 1/fsym). e. the integral over a symbol time is T ∫ g k ( t )g l∗ ( t ) dt =  0  0 T k = l otherwise (1. OFDM overcomes the inefficient implementation of the modulator and demodulator for conventional MCM. and gl. Spectrum overlapping of subcarriers for OFDM.5) which shows that two carriers are orthogonal. From Fig.3) and Eq. If the frequency of subcarrier k and the base function are chosen according to Eq. 1. (1.g. its spectrum is a sinc function with zero points at f 0 + l ⁄ T (l is integer) except l = k or fk. This orthogonality can also be found in the time domain.The orthogonality can be explained in frequency domain.4). The frequency spacing between adjacent subchannels is set to be 1/T Hz.g. e.

The other issues. intersymbol interference. the interference between subchannels exists due to the non-ideal channel characteristics and frequency offset in transmitters and receivers. INTRODUCTION 5 . which should be transmitted by subchannel k. 1. This interference effects the performance of the OFDM system. can be reduced by techniques like cyclic prefix.3. be compensated. for instance. Hence the OFDM modulator can be implemented with one IFFT processor and baseband modulator for N subcarriers instead of N modulators for conventional MCM. In reality. The simplified OFDM system based on the FFT is shown in Fig. In similar way. the OFDM demodulator can be implemented more efficient than that of conventional MCM. OFDM system based on FFT. in most case.x(t ) = N–1 k=0 ∑ Sk gk ( t ) = e j2πf 0 t N–1 ∑ j2πkt ------------Sk e T k=0 where Sk is the modulated signal of mk bit. This is an N-point Inverse Discrete Fourier Transform (IDFT) and baseband modulation (with e j2π f 0 t ). e2jpf0t IFFT D/A Channel noise e2jpf0t FFT A/D Input Output Figure 1. The IDFT can be computed efficiently by Inverse Fast Fourier Transform (IFFT) algorithm.3. The frequency offset can.

11/ 0.49 11511 0. In the high performance applications.66 3990 1. This is due to the potential workload increase.1.24/ 0. Year Feature size ASIC usable Mega transistors/cm2 (auto layout) ASIC maximum functions per chip (Mega transistors/chip) Package cost (cents/pin)maximum/minimum On-chip.70 3088 1. higher workload. In portable applications.3. contribute to make the power consumption and energy efficiency even more critical.1. the power consumption has grown from a secondary constraint to one of the main constraints in the design of integrated circuit. and longer operation time. However. the power consumption increases or retains almost the same as the advance of technology according to the table above.0 150 2.2 2005 80 nm 225 1286 1. more functionality. Table 1 shows the expectation for the near future from Semiconductor Industry Association. for instances. The power consumption decreases as the feature size and the power supply voltage are reduced.6 218 3.2 2010 45 nm 714 4081 0. where the power consumption traditionally was a secondary 6 Chapter 1 . local clock (MHz) Supply Vdd (V) (high performance) Power consumption for High performance with heatsink (W) Power consumption for Battery(W)-(Hand-held) 2003 107 nm 142 810 1. Technology Roadmap from the International Technology for Semiconductors (ITRS).17/ 0.0 160 3. low power consumption has long been the main constraint.0 Table 1.9 170 3.61 5173 0.8 2004 90 nm 178 1020 1. Power Consumption The famous Moore’s Law predicts the exponential increase in circuit integration and clock frequency during the last three decades. During the last decade.98/ 0. Several other factors.

which in turn reduces the reliability. The basic idea of FFT algorithms. IR-drop etc. The pipeline architectures are discussed in more detail since they are the dedicated architectures for our target application. the low power techniques gain more ground due to the steady increasing cost for cooling and packaging. A general guideline is found in the end of this chapter.. divide and conquer. is demonstrated through a few examples. which is the starting point for the implementation. An overview for low power techniques is given in chapter 3. e. The delivery of power supply to the chip has also raised many problems like power rails design. The choice of FFT architectures is important for the implementation.constraint. noise immunity. Besides those factors.6 Msamples/sec. INTRODUCTION 7 . including the pipeline architectures. A few architectures. Thesis Outline In this thesis we summarize some implementation aspects of a low power FFT processors for an OFDM communication system. Therefore the low power techniques are important for the current and future integrated circuits. throughput • Complex 24 bits I/O data • Low power In chapter 2. are introduced in chapter 4. the increasing power consumption has resulted in higher onchip temperature. The main focus of the for low power techniques is reduction of dynamic power consumption.4. Different techniques are introduced at different abstraction level. The system specification for the FFT processor has been defined as • Transform length is 1024 • Transform time is less than 40 ms (continuously) • Continuous I/O • 25. we introduce several FFT algorithms. 1. Several FFT algorithms and their performance are given also.g.

3. This reduces of the total number of complex multiplications and is described in Section 5. presented in Section 5. Contributions The main contributions of this thesis are: • A method for minimizing the wordlengths in the pipelined FFT architectures. The conclusions for the FFT processor implementation are given in chapter 6. the ripple-carry adder. given in Section 5. 1. more detailed implementation steps for FFT processors are provided.3.In chapter 5.5. 8 Chapter 1 . etc.2. as outlined in Section 5.3.2. complex multiplier. • A 16-point butterfly with constant multipliers. • A complex multiplier using distributed arithmetic and the overturned-stairs tree.4 • Various generators for different components. Brent-Kung adder.2.3. • An approach to construct efficient high-radix butterflies. for instance. Both design method and the design for FFT processors are discussed in this chapter. This is found in Chapter 5.

sub-problems of the same (or related) type. 2. The sub-problems are then independently solved and their solutions are combined to give a solution to the original problem.2 FFT ALGORITHMS In FFT processor design. or more. it was not applied to DFT computation until 1965 [13]. This technique can be applied to DFT computation by dividing the data sequence into smaller data sequences until the DFTs for small data sequences can be computed efficiently. Cooley and Tukey demonstrated the simplicity and efficiency of the divide and conquer approach for DFT computation and made the FFT algorithms widely accepted. hardware complexity. This chapter focuses on the review of FFT algorithms. 9 . Then a basic and a generalized FFT formulation are given. This technique works by recursively breaking down a problem into two. Although the technique was described in 1805 [26]. the mathematical properties of FFT must be exploited for an efficient implementation since the selection of FFT algorithm has large impact on the implementation in term of speed. We give a simple example for the divide and conquer approach. power consumption etc.1. Cooley-Tukey FFT Algorithms The technique for efficient computation of DFTs is based on divide and conquer approach.

Let { x o ( l ) } and { x e ( l ) } ( l = 0. the DFT of { x ( k ) } is given by X (n) = ∑ k=0 7 x ( k )W 8 nk (2.2. e. x o ( l ) = x ( 2l + 1 ) x e ( l ) = x ( 2l ) for l = 0.1. …. 7 .3) ∑ l=0 3 3 x o ( l )W 8 n n ( 2l + 1 ) + + ∑ l=0 3 3 x e ( l )W 8 n ( 2l ) (2.4) ∑ x o ( l )W 8 W 8 n ( 2l ) ∑ x e ( l )W 8 nl n ( 2l ) l=0 = W8 n n ∑ l=0 3 x o ( l )W 4 + nl ∑ l=0 3 l=0 x e ( l )W 4 = W 8 X o(n) + X e(n) 10 Chapter 2 .2) (2. the grouping of { x ( k ) } to { x o ( l ) } and { x e ( l ) } can be done intuitively through separating members by odd and even index.g. Eight-Point DFT In this section. 3 ) be two sequences. Let us consider an 8-point DFT. 3 . 1. 1. One way to break down a long data sequence into shorter ones is to group the data sequence according to their indices. we illustrate the idea of the divide and conquer approach and show why dividing is also conquering for DFT computation. 1. The DFT for { x ( k ) } can be rewritten X (n) = = (2. 1. k = 0. 2.1. …. N = 8 and data sequence { x ( k ) } .. 7 .1) for n = 0. 2.

 – 2π n ( 2l )  – 2π nl ----------------n ( 2l ) nl  8    where W 8 = e = e 4 = W 4 . Furthermore. the number of n complex multiplications for W 8 X o ( n ) can be reduced to 3 from 7 n n–4 since W 8 = – W 8 for n ≥ 4 . This can be shown in Fig. It requires only two 4-point DFTs for the 8point DFT due to the fact that X o ( n ) = X o ( n – 4 ) and X e ( n ) = X e ( n – 4 ) for n ≥ 4 . 2. (2. respectively. (2. Eq. With additional 8 – 1 complex multiplications for W 8 X o ( n ) and 7 complex additions.4). x(0) x(2) x(4) x(6) x(1) x(3) x(5) x(7) W0 4-point DFT W1 W2 W3 Wn x(0) x(1) x(2) x(3) x(4) x(5) x(6) x(7) n multiplication with W8 Figure 2. respectively. X o ( n ) and X e ( n ) are 4-point DFTs of { x o ( l ) } and { x e ( l ) } . The above 8-point DFT example shows that the decomposition of a long data sequence into smaller data sequences reduces the computation complexity. The direct computation of an 8-point DFT requires 8 ( 8 – 1 ) = 56 complex additions and 8 ( 8 – 1 ) = 56 complex multiplications. The total number of complex additions and complex multiplications is 32 and 27. The computation of two 4-point DFTs requires 2 ⋅ 4 ⋅ ( 4 – 1 ) = 24 complex additions and 2 ⋅ 4 ⋅ ( 4 – 1 ) = 24 complex multiplican tions.4) shows that the computation of an 8-point DFT can be decomposed into two 4-point DFTs and summations. An 8-point DFT computation with two 4-point DFTs.1. 4-point DFT FFT ALGORITHMS 11 . it requires totally 30 complex multiplications and 32 complex additions for the 8-point DFT computation according to Eq.1.

2.1.2. Basic Formula

The 8-point DFT example illustrates the principle of the Cooley-Tukey FFT algorithm. We now introduce a more mathematical formulation of the FFT algorithm. Let N be a composite number, N = r_1 × r_0. The index k can be expressed by a two-tuple (k_1, k_0) as

k = r_0 k_1 + k_0  (0 ≤ k_0 < r_0, 0 ≤ k_1 < r_1)   (2.5)

In a similar way, the index n can be described by (n_1, n_0) as

n = r_1 n_1 + n_0  (0 ≤ n_1 < r_0, 0 ≤ n_0 < r_1)   (2.6)

The term W_N^{nk} can be factorized as

W_N^{nk} = W_N^{(r_1 n_1 + n_0)(r_0 k_1 + k_0)} = W_N^{r_1 r_0 n_1 k_1} W_{r_1}^{n_0 k_1} W_{r_0}^{n_1 k_0} W_N^{n_0 k_0} = W_{r_1}^{n_0 k_1} W_{r_0}^{n_1 k_0} W_N^{n_0 k_0}   (2.7)

where W_N^{r_1 r_0 n_1 k_1} = W_N^{N n_1 k_1} = e^{-j2π N n_1 k_1 / N} = e^{-j2π n_1 k_1} = 1. With Eq. (2.5) through Eq. (2.7), Eq. (1.1) can be rewritten as

X(n_1, n_0) = Σ_{k_0=0}^{r_0-1} Σ_{k_1=0}^{r_1-1} x(k_1, k_0) W_{r_1}^{n_0 k_1} W_{r_0}^{n_1 k_0} W_N^{n_0 k_0}
            = Σ_{k_0=0}^{r_0-1} [ ( Σ_{k_1=0}^{r_1-1} x(k_1, k_0) W_{r_1}^{n_0 k_1} ) W_N^{n_0 k_0} ] W_{r_0}^{n_1 k_0}   (2.8)

Eq. (2.8) indicates that the DFT computation can be performed in three steps:

1. Compute r_0 different r_1-point DFTs (inner parenthesis).
2. Multiply the results with W_N^{n_0 k_0}.
3. Compute r_1 different r_0-point DFTs (outer parenthesis).

The multiplications with W_N^{n_0 k_0} are called twiddle factor multiplications. The numbers r_0 and r_1 are called radices. If r_0 and r_1 are equal to r, the number system is called a radix-r system; otherwise it is called a mixed-radix system.

The r_0 r_1-point DFTs require r_1 r_0 (r_1 − 1), or N(r_1 − 1), complex multiplications and additions. The second step requires N complex multiplications. The final step requires N(r_0 − 1) complex multiplications and additions. The total number of complex multiplications using Eq. (2.8) is therefore N(r_0 + r_1 − 1) and the number of complex additions is N(r_0 + r_1 − 2). This is a reduction from O(N²) to O(N(r_1 + r_0)). Hence the decomposition of the DFT reduces the computational complexity.

Example 2.1. For N = 8, we apply the basic formula by decomposing N = 8 = 4 × 2 with r_0 = 2 and r_1 = 4. This results in the 8-point DFT example given in the section above. It is a mixed-radix FFT algorithm.

A closer study of the given 8-point DFT example, which is shown in Fig. 2.1, shows that the input data need not be stored in memory after the computation of the two 4-point DFTs. An algorithm with this property is called an in-place algorithm. It can reduce the total memory size and is important for memory-constrained systems.

2.1.3. Generalized Formula

If r_0 or/and r_1 are not prime, further reduction of the computational complexity can be achieved by applying the divide and conquer approach recursively to the r_1-point or/and r_0-point DFTs [7]. Let N = r_{p-1} × r_{p-2} × … × r_0. The indices k and n can then be written as

k = r_0 r_1 … r_{p-2} k_{p-1} + … + r_0 k_1 + k_0   (2.9)
n = r_{p-1} r_{p-2} … r_1 n_{p-1} + … + r_{p-1} n_1 + n_0   (2.10)

where k_i, n_{p-i-1} ∈ [0, r_i − 1] for i = 0, 1, …, p − 1.
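The three steps of Eq. (2.8) can be made concrete with a short sketch. The following Python code (ours; the function names are illustrative, not from the thesis) follows the index maps of Eqs. (2.5) and (2.6) and verifies the decomposition of Example 2.1, N = 8 = 4 × 2 with r_1 = 4 and r_0 = 2, against a direct DFT:

```python
# A sketch of the three computation steps of Eq. (2.8) for a composite
# length N = r1*r0, with k = r0*k1 + k0 and n = r1*n1 + n0.
import cmath

def W(N, e):                      # twiddle factor W_N^e
    return cmath.exp(-2j * cmath.pi * e / N)

def dft(x):
    N = len(x)
    return [sum(x[k] * W(N, n * k) for k in range(N)) for n in range(N)]

def ct_fft(x, r1, r0):
    N = r1 * r0
    # Step 1: r0 different r1-point DFTs over k1 (inner parenthesis).
    inner = [dft([x[r0 * k1 + k0] for k1 in range(r1)]) for k0 in range(r0)]
    # Step 2: twiddle factor multiplications with W_N^{n0*k0}.
    tw = [[inner[k0][n0] * W(N, n0 * k0) for k0 in range(r0)] for n0 in range(r1)]
    # Step 3: r1 different r0-point DFTs over k0 (outer parenthesis).
    outer = [dft(tw[n0]) for n0 in range(r1)]
    # Reassemble X(n) with n = r1*n1 + n0.
    return [outer[n % r1][n // r1] for n in range(N)]

x = [complex(i, 2 - i) for i in range(8)]
assert all(abs(a - b) < 1e-9 for a, b in zip(ct_fft(x, r1=4, r0=2), dft(x)))
```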

The factorization of W_N^{nk} can be expressed as

W_N^{nk} = W_N^{n(r_0 r_1 … r_{p-2} k_{p-1} + … + r_0 k_1 + k_0)}
         = W_N^{r_0 r_1 … r_{p-2} n k_{p-1}} … W_N^{r_0 n k_1} W_N^{n k_0}
         = W_{r_{p-1}}^{n k_{p-1}} … W_{N/r_0}^{n k_1} W_N^{n k_0}   (2.11)

where W_N^{r_0 r_1 … r_i n k_{i+1}} = W_{N/(r_0 r_1 … r_i)}^{n k_{i+1}}. Eq. (1.1) can then be written

X(n_{p-1}, n_{p-2}, …, n_0) = Σ_{k_0=0}^{r_0-1} Σ_{k_1=0}^{r_1-1} … ( Σ_{k_{p-1}=0}^{r_{p-1}-1} x(k_{p-1}, k_{p-2}, …, k_0) W_{r_{p-1}}^{n k_{p-1}} ) W_{r_{p-1} r_{p-2}}^{n k_{p-2}} … W_{N/r_0}^{n k_1} W_N^{n k_0}   (2.12)

Note that the inner sum can be recognized as an r_{p-1}-point DFT for n_0, i.e., index k_{p-1} is "replaced" by n_0. Define

x_1(n_0, k_{p-2}, …, k_0) = Σ_{k_{p-1}=0}^{r_{p-1}-1} x(k_{p-1}, k_{p-2}, …, k_0) W_{r_{p-1}}^{n_0 k_{p-1}}   (2.13)

With Eq. (2.13), Eq. (2.12) can now be rewritten as

X(n_{p-1}, n_{p-2}, …, n_0) = Σ_{k_0=0}^{r_0-1} … Σ_{k_{p-2}=0}^{r_{p-2}-1} x_1(n_0, k_{p-2}, …, k_0) W_{r_{p-1} r_{p-2}}^{n k_{p-2}} … W_{N/r_0}^{n k_1} W_N^{n k_0}   (2.14)

The term W_{N/(r_0 r_1 … r_{i-1})}^{n k_i} can be factorized further. Using the digit representation of n in Eq. (2.10), only the digits with weight smaller than r_{p-1} r_{p-2} … r_i contribute to the exponent, so that

W_{N/(r_0 … r_{i-1})}^{n k_i} = W_{r_{p-1} r_{p-2} … r_i}^{(r_{p-1} … r_{i+1} n_{p-i-1} + … + r_{p-1} n_1 + n_0) k_i}
                              = W_{r_{p-1} … r_i}^{(r_{p-1} … r_{i+2} n_{p-i-2} + … + n_0) k_i} W_{r_i}^{n_{p-i-1} k_i}

The inner sum over k_{p-2} in Eq. (2.14) can then be written as

Σ_{k_{p-2}=0}^{r_{p-2}-1} x_1(n_0, k_{p-2}, …, k_0) W_{r_{p-1} r_{p-2}}^{n k_{p-2}}   (2.15)

which can be computed through twiddle factor multiplications followed by r_{p-2}-point DFTs:

x_1'(n_0, k_{p-2}, …, k_0) = x_1(n_0, k_{p-2}, …, k_0) W_{r_{p-1} r_{p-2}}^{n_0 k_{p-2}}   (2.16)

x_2(n_0, n_1, k_{p-3}, …, k_0) = Σ_{k_{p-2}=0}^{r_{p-2}-1} x_1'(n_0, k_{p-2}, …, k_0) W_{r_{p-2}}^{n_1 k_{p-2}}   (2.17)

Eq. (2.14) can then be rewritten as

X(n_{p-1}, …, n_0) = Σ_{k_0=0}^{r_0-1} … Σ_{k_{p-3}=0}^{r_{p-3}-1} x_2(n_0, n_1, k_{p-3}, …, k_0) W_{r_{p-1} r_{p-2} r_{p-3}}^{n k_{p-3}} … W_{N/r_0}^{n k_1} W_N^{n k_0}   (2.18)

This process from Eq. (2.14) to Eq. (2.17) can be repeated p − 2 times, until index k_0 has been replaced by n_{p-1}.

The final step, in which k_0 is replaced by n_{p-1}, is

x_p(n_0, n_1, …, n_{p-1}) = Σ_{k_0=0}^{r_0-1} x_{p-1}'(n_0, n_1, …, k_0) W_{r_0}^{n_{p-1} k_0}   (2.19)

Eq. (2.20) then reorders the output data to natural order:

X(n_{p-1}, n_{p-2}, …, n_0) = x_p(n_0, n_1, …, n_{p-1})   (2.20)

This reordering is called unscrambling. The unscrambling process requires a special addressing mode that converts the address (n_0, n_1, …, n_{p-1}) to (n_{p-1}, …, n_1, n_0). In the case of a radix-2 number system, each n_i represents a bit; the addressing for unscrambling then reverses the address bits and is hence called bit-reverse addressing. In the case of a radix-r (r > 2) number system, it is called digit-reverse addressing.

Example 2.2. Let N = 2 × 2 × 2. By using the generalized formula, the factorization of W_N^{nk} can be expressed as

W_N^{nk} = (W_2^{n_0 k_2})(W_4^{n_0 k_1} W_2^{n_1 k_1})(W_8^{(2n_1 + n_0) k_0} W_2^{n_2 k_0})   (2.21)

The 8-point DFT can then be computed with the following sequential equations [7]:

x_1(n_0, k_1, k_0) = Σ_{k_2=0}^{1} x(k_2, k_1, k_0) W_2^{n_0 k_2}   (2.22)

x_1'(n_0, k_1, k_0) = x_1(n_0, k_1, k_0) W_4^{n_0 k_1}   (2.23)

x_2(n_0, n_1, k_0) = Σ_{k_1=0}^{1} x_1'(n_0, k_1, k_0) W_2^{n_1 k_1}   (2.24)

x_2'(n_0, n_1, k_0) = x_2(n_0, n_1, k_0) W_8^{(2n_1 + n_0) k_0}   (2.25)

x_3(n_0, n_1, n_2) = Σ_{k_0=0}^{1} x_2'(n_0, n_1, k_0) W_2^{n_2 k_0}   (2.26)

X(n_2, n_1, n_0) = x_3(n_0, n_1, n_2)   (2.27)

where Eq. (2.22) corresponds to the W_2^{n_0 k_2} term in Eq. (2.21), Eq. (2.23) corresponds to the W_4^{n_0 k_1} term, and so on. The result is shown in Fig. 2.2.

Figure 2.2. 8-point DFT with the Cooley-Tukey algorithm. [signal-flow graph: inputs x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7) in bit-reversed order, three butterfly stages with twiddle factor multiplications W^n of W_8, outputs in natural order]
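Bit-reverse addressing is easy to express in code. The sketch below (ours, assuming N = 2^p) reverses the p address bits; for p = 3 it reproduces the scrambled input order x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7) of Fig. 2.2:

```python
# A small sketch of bit-reverse addressing for unscrambling, for N = 2^p.
def bit_reverse(addr, bits):
    rev = 0
    for _ in range(bits):
        rev = (rev << 1) | (addr & 1)   # shift addr's LSB into rev
        addr >>= 1
    return rev

# For an 8-point FFT (p = 3): position 1 (binary 001) holds index 4 (100).
print([bit_reverse(a, 3) for a in range(8)])   # -> [0, 4, 2, 6, 1, 5, 3, 7]
```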

The recursive usage of the divide and conquer approach for an 8-point DFT is shown in Fig. 2.3. As illustrated in the figure, the inputs are divided into smaller and smaller groups. This class of algorithms is called decimation-in-time (DIT) algorithms.

Figure 2.3. The divide and conquer approach for the DFT. [an 8-point DFT is split into two 4-point DFTs plus a combine stage; each 4-point DFT is in turn split into two 2-point DFTs plus a combine stage]

2.2. Sande-Tukey FFT Algorithms

Another class of algorithms is called decimation-in-frequency (DIF) algorithms, which divide the outputs into smaller and smaller DFTs. This kind of algorithm is also called the Sande-Tukey FFT algorithm. The computation of a DFT with a DIF algorithm is similar to the computation with a DIT algorithm. For the sake of simplicity, we do not derive the DIF algorithm but illustrate it with an example.

Example 2.3. 8-point DFT. The factorization of W_N^{nk} can be expressed as

W_N^{nk} = (W_2^{k_2 n_0} W_8^{(2k_1 + k_0) n_0})(W_2^{k_1 n_1} W_4^{k_0 n_1})(W_2^{k_0 n_2})   (2.28)

The sequential equations can be constructed in a similar way as those in Eq. (2.22) through Eq. (2.27). By using the same notation for the indices n and k as in Eq. (2.9) and Eq. (2.10), the computation of an N-point DFT with a DIF algorithm is

x_i(n_0, …, n_{i-1}, k_{p-i-1}, …, k_0) = ( Σ_{k_{p-i}=0}^{r_{p-i}-1} x_{i-1}(n_0, …, k_{p-i}, …, k_0) W_{r_{p-i}}^{n_{i-1} k_{p-i}} ) W_{N/(r_{p-1} … r_{p-i+1})}^{n_{i-1}(r_{p-i-2} … r_0 k_{p-i-1} + … + r_0 k_1 + k_0)}   (2.29)

where x_0(k_{p-1}, …, k_0) = x(k_{p-1}, k_{p-2}, …, k_0) and i = 1, …, p. The unscrambling is done by

X(n_{p-1}, …, n_0) = x_p(n_0, …, n_{p-1})   (2.30)

The result for the 8-point example is shown in Fig. 2.4.

Figure 2.4. 8-point DFT with the DIF algorithm. [signal-flow graph: inputs in natural order, three butterfly stages with the twiddle factor multiplications W^n of W_8 placed after the butterflies, outputs x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7) in bit-reversed order]

Comparing Fig. 2.4 with Fig. 2.2, we find that the signal-flow graph (SFG) for DFT computation with the DIF algorithm is the transposition of that with the DIT algorithm. Hence, many properties of the DIT and DIF algorithms are the same. For instance, the computational workload of the DIT and DIF algorithms is the same, and the unscrambling process is required for both. However, there are also clear differences, e.g., the position of the twiddle factor multiplications: the DIF algorithms have the twiddle factor multiplications after the butterflies, while the DIT algorithms have them before the butterflies.

2.3. Prime Factor FFT Algorithms

In the Cooley-Tukey or Sande-Tukey algorithms, the indices n and k are expressed with Eq. (2.9) and Eq. (2.10). This representation of the index numbers is called index mapping, and twiddle factor multiplications are required for the DFT computation. However, if the decomposition of N is relatively prime, there exists another type of FFT algorithm, the prime factor FFT algorithm, which reduces the twiddle factor multiplications.

If r_1 and r_0 are relatively prime, i.e., the greatest common divisor gcd(r_1, r_0) = 1, there exists another index mapping, the so-called Good's mapping [19]. This mapping is a variant of the Chinese Remainder Theorem. An index n can be expressed as

n = ( r_0 (r_0^{-1} n_1 mod r_1) + r_1 (r_1^{-1} n_0 mod r_0) ) mod N   (2.31)

where N = r_1 × r_0, 0 ≤ n_1 < r_1, and 0 ≤ n_0 < r_0. Here r_0^{-1} is the multiplicative inverse of r_0 modulo r_1, i.e., (r_0 r_0^{-1}) mod r_1 = 1, and r_1^{-1} is the multiplicative inverse of r_1 modulo r_0.

Example 2.4. Construct the index mapping for the 15-point DFT inputs according to Good's mapping. We have N = 3 × 5 with r_1 = 3 and r_0 = 5. The inverse r_0^{-1} is 2, since (5 ⋅ 2) mod 3 = 1, and r_1^{-1} is 2, since (3 ⋅ 2) mod 5 = 1. The index can then be computed according to

k = ( 5 (2k_1 mod 3) + 3 (2k_0 mod 5) ) mod 15   (2.32)

The mapping can be illustrated with an index matrix, shown in Fig. 2.5.

Figure 2.5. Good's mapping for 15-point DFT inputs.
[rows k_1 = 0, 1, 2; columns k_0 = 0, …, 4]
   0   6  12   3   9
  10   1   7  13   4
   5  11   2   8  14

Example 2.5. Construct the index mapping for the 15-point DFT outputs. The mapping for the outputs is simple. It can be constructed by n = (r_0 n_1 + r_1 n_0) mod N (0 ≤ n_1 < r_1, 0 ≤ n_0 < r_0), i.e., n = (5n_1 + 3n_0) mod 15 for 0 ≤ n_1 < 3 and 0 ≤ n_0 < 5. The result is shown in Fig. 2.6.

Figure 2.6. Index mapping for 15-point DFT outputs.
[rows n_1 = 0, 1, 2; columns n_0 = 0, …, 4]
   0   3   6   9  12
   5   8  11  14   2
  10  13   1   4   7
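A small script can reproduce both index matrices. The following Python sketch (ours) computes Good's mapping of Eq. (2.32) for the inputs and the simple mapping for the outputs; pow(a, -1, m) gives the multiplicative inverses:

```python
# Reproduces the index matrices of Figs. 2.5 and 2.6 for N = 15
# (r1 = 3, r0 = 5), following Eqs. (2.31)-(2.32).
r1, r0, N = 3, 5, 15
r0_inv = pow(r0, -1, r1)          # 5^-1 mod 3 = 2
r1_inv = pow(r1, -1, r0)          # 3^-1 mod 5 = 2

# Good's mapping for the inputs: rows k1, columns k0.
for k1 in range(r1):
    print([(r0 * ((r0_inv * k1) % r1) + r1 * ((r1_inv * k0) % r0)) % N
           for k0 in range(r0)])
# -> [0, 6, 12, 3, 9], [10, 1, 7, 13, 4], [5, 11, 2, 8, 14]

# Simple mapping for the outputs: n = (r0*n1 + r1*n0) mod N.
for n1 in range(r1):
    print([(r0 * n1 + r1 * n0) % N for n0 in range(r0)])
# -> [0, 3, 6, 9, 12], [5, 8, 11, 14, 2], [10, 13, 1, 4, 7]
```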

The computation with the prime factor FFT algorithms is similar to the computation with the Cooley-Tukey algorithm. It can be divided into two steps:

1. Compute r_0 different r_1-point DFTs. This performs column-wise DFTs on the input index matrix.
2. Compute r_1 different r_0-point DFTs. This performs row-wise DFTs on the output index matrix.

Example 2.6. 15-point DFT with the prime factor mapping FFT algorithm. The input and output index matrices can be constructed as shown in Fig. 2.5 and Fig. 2.6. Following the computation steps above, the 15-point DFT can be computed by five 3-point DFTs followed by three 5-point DFTs. The 15-point DFT with the prime factor mapping FFT algorithm is shown in Fig. 2.7.
[figure: the inputs x(0), x(10), x(5), x(6), x(1), x(11), x(12), x(7), x(2), x(3), x(13), x(8), x(9), x(4), x(14) feed five 3-point DFTs (k_0 = 0, …, 4), whose outputs feed three 5-point DFTs (n_0 = 0, 1, 2) that produce the outputs X(0), X(3), X(6), X(9), X(12), X(5), X(8), X(11), X(14), X(2), X(10), X(13), X(1), X(4), X(7)]

Figure 2.7. 15-point FFT with prime factor mapping.


The prime factor mapping based FFT algorithm above is also an in-place algorithm. Swapping the input and output index matrices gives another FFT algorithm, which does not need twiddle factor multiplications outside the butterflies either. Although the prime factor FFT algorithms are similar to the Cooley-Tukey or Sande-Tukey FFT algorithms, the prime factor FFT algorithms are derived from convolution-based DFT computations [19] [49] [31]. This later led to the Winograd Fourier Transform Algorithm (WFTA) [61].

2.4. Other FFT Algorithms
In this section, we discuss two other FFT algorithms. One is the split-radix FFT algorithm (SRFFT) and the other is the Winograd Fourier Transform Algorithm (WFTA).

2.4.1. Split-Radix FFT Algorithm
Split-radix FFT algorithms (SRFFT) were proposed nearly simultaneously by several authors in 1984 [17] [18]. The algorithms belong to the FFT algorithms with twiddle factors. As a matter of fact, the split-radix FFT algorithms are based on an observation about the Cooley-Tukey and Sande-Tukey FFT algorithms: different decompositions can be used for different parts of an algorithm. This gives the possibility to select the most suitable algorithm for each part in order to reduce the computational complexity.


For instance, the signal-flow graph (SFG) for a 16-point radix-2 DIF FFT algorithm is shown in Fig. 2.8.

Figure 2.8. Signal-flow graph for a 16-point DIF FFT algorithm.

The SRFFT algorithms exploit this idea by using both a radix-2 and a radix-4 decomposition in the same FFT algorithm. It is obvious that all twiddle factors are equal to 1 for the even indexed outputs in the radix-2 FFT computation, i.e., no twiddle factor multiplications are required there. In the radix-4 FFT computation there is no such general rule (see Fig. 2.9). For the odd indexed outputs, a radix-4 decomposition increases the computational efficiency, because the four-point DFT is the largest multiplication-free butterfly and the radix-4 FFT is more efficient than the radix-2 FFT from the multiplication complexity point of view. Consequently, the DFT computation uses different radix FFT


algorithms for the odd and even indexed outputs. This reduces the number of complex multiplications and additions/subtractions. A 16-point SRFFT is shown in Fig. 2.10.

Figure 2.9. Radix-4 DIF algorithm for a 16-point DFT. [signal-flow graph with radix-4 butterflies and twiddle factors W of W_16]

Figure 2.10. SFG for a 16-point DFT with the SRFFT algorithm. [signal-flow graph where the even indexed outputs are computed with radix-2 butterflies, while the odd indexed outputs use split radix-2/4 butterflies with twiddle factors W of W_16 and W_8]

Although the SRFFT algorithms are derived from observations on the radix-2 and radix-4 FFT algorithms, they cannot be derived by index mapping. This could be the reason that the algorithms were discovered so late [18]. The SRFFT can also be generalized to lengths N = p^k, where p is a prime number [18].

2.4.2. Winograd Fourier Transform Algorithm

The Winograd Fourier transform algorithm (WFTA) [61] uses the cyclic convolution method to compute the DFT. This is based on Rader's idea [49] for prime-length DFT computation. The aim of Winograd's algorithm is to minimize the number of multiplications, and WFTA succeeds in minimizing the number of multiplications to the smallest number known: the number of multiplications is O(N). The computation of an N-point DFT (N being a product of two coprime numbers r_1 and r_0) with WFTA can be divided into five steps: two pre-addition steps, two post-addition steps, and a multiplication step in the middle, as shown in Fig. 2.11. However, the minimization of multiplications results in a complicated computation ordering and a large increase of other arithmetic operations, e.g., additions. Furthermore, the irregularity of WFTA makes it impractical for most real applications.

Figure 2.11. General structure of WFTA. [r_1-point input additions (r_0 sets), r_0-point input additions (r_1 sets), N-point multiplications, r_0-point output additions (r_1 sets), r_1-point output additions (r_0 sets)]

2.5. Performance Comparison

For the algorithm implementation, the computational load is of great concern. Usually the numbers of additions and multiplications are two important measures of the computational workload, and the number of arithmetic operations depends on N. We compare the discussed algorithms from the addition and multiplication complexity point of view.

2.5.1. Multiplication Complexity

Since multiplication has a large impact on the speed and power consumption, the multiplication complexity is important for the selection of FFT algorithms. In many DFT computations, both complex multiplications and real multiplications are required. For the purpose of comparison, the counting is based on the number of real multiplications. Due to the restrictions on the transform length for the prime factor based algorithms and WFTA, the comparison is not strictly on the same transform length but rather on nearby transform lengths.

A complex multiplication can be realized directly with 4 real multiplications and 2 real additions, as shown in Fig. 2.12 (a). With a simple transformation, the number of real multiplications can be reduced to 3, but the number of real additions increases to 3, as shown in Fig. 2.12 (b). We consider a complex multiplication as 3 real multiplications and 3 real additions in the following analysis.

Figure 2.12. Realization of a complex multiplication Z = XC. (a) Direct realization. (b) Transformed realization.

For a DFT with transform length N = 2^n, the number of complex multiplications can be estimated as half of the total number of butterfly operations, i.e., (N log_2 N)/2. This number is overestimated: a complex multiplication with twiddle factor W_N^k does not require any multiplications when k is a multiple of N/4, and it requires only 2 real multiplications and 2 additions when k is an odd multiple of N/8. Taking these simplifications into account, exact expressions for the number of real multiplications can be derived.
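The transformed realization can be sketched in a few lines. In the following Python fragment (ours; the variable names are illustrative), the sums C_I + C_R and C_I − C_R of the fixed twiddle factor are assumed to be precomputed, so the complex multiplication costs 3 real multiplications and 3 real additions at run time; the exact grouping in Fig. 2.12 (b) may differ, but the operation count is the same:

```python
# Complex multiplication Z = X*C with 3 real multiplications and
# 3 real additions; (ci + cr) and (ci - cr) are precomputed constants.
def cmul3(xr, xi, cr, ci_plus_cr, ci_minus_cr):
    t = cr * (xr + xi)            # multiplication 1 (addition 1)
    zr = t - ci_plus_cr * xi      # multiplication 2 (addition 2): cr*xr - ci*xi
    zi = t + ci_minus_cr * xr     # multiplication 3 (addition 3): cr*xi + ci*xr
    return zr, zi

cr, ci = 0.6, 0.8                 # example twiddle factor C = 0.6 + j0.8
print(cmul3(1.0, 2.0, cr, ci + cr, ci - cr))   # -> (-1.0, 2.0) = (1+2j)*(0.6+0.8j)
```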

28 Chapter 2 .1. WFTA has been proven that it has the lowest number of multiplications for those transform lengths that are less than 16. From the multiplication complexity point of view.DFT with radix-2 algorithm and transform length of N = 2 n is M = 3N ⁄ ( 2 log 2N ) – 5N + 8 [25]. The number of real multiplication for various FFT algorithms on complex data is shown in the following table [18]. N 16 60 64 240 256 504 512 1008 1024 10248 7856 7172 4360 3076 5804 3548 1800 1392 1284 2524 1572 264 208 196 1100 632 Radix-2 24 Radix-4 20 SRFFT 20 200 136 PFA WFTA Table 2. there are lower bounds that can be attained by algorithms for those transform lengths. As mentioned previously. If the transform length is a product of two or more co-prime numbers. following by the prime factor algorithm. The radix-4 algorithm for a DFT with transform length of requires N = 4n M = 9N ⁄ ( 8 log 2N ) – 43N ⁄ 12 + 16 ⁄ 3 real multiplications [18]. the split-radix algorithm. For the split-radix FFT algorithm. It requires the lowest number of multiplications of the existing algorithms. the most attractive algorithm is WFTA. Multiplication complexity for various FFT algorithms. However. the number of real multiplications is M = N log 2N – 3N + 4 for a DFT with N = 2 n [18]. there is no simple analytic expression for the number of real multiplications. These lower bounds can be computed [18]. and the fixed-radix algorithm.

2.5.2. Addition Complexity

In a radix-2 or radix-4 FFT algorithm, addition and subtraction operations are used for realizing the butterfly operations and the complex multiplications. Since a subtraction has the same complexity as an addition, we consider a subtraction equivalent to an addition.

The additions for the butterfly operations are the larger part of the addition complexity. For each radix-2 butterfly operation (a 2-point DFT), the number of real additions is four, since each complex addition/subtraction requires two real additions. For a transform length of N = 2^n, a DFT requires N/2 radix-2 DFTs for each stage, so the total number of real additions is 4 (log_2 N)(N/2), or 2nN, with the radix-2 FFT algorithms. For a transform length of N = 4^n, a DFT requires N/4 radix-4 DFTs for each stage. Each radix-4 DFT requires 8 complex additions/subtractions, i.e., 16 real additions, so the total number of real additions is 16 (log_4 N)(N/4), or 4nN with n = log_4 N. Hence both the radix-2 and radix-4 FFT algorithms require the same number of butterfly additions for a DFT with a transform length that is a power of 4.

The number of additions required for the complex multiplications is smaller than that for the butterfly operations, since, as described previously, a complex multiplication generally requires 3 real additions. Nevertheless, it cannot be ignored. The exact number [25] is A = (7N/2) log_2 N − 5N + 8 for a DFT with transform length N = 2^n using the radix-2 algorithm. The number of additions for a DFT with transform length N = 4^n is A = (25N/8) log_2 N − 43N/12 + 16/3 for the radix-4 algorithm [18]. The split-radix algorithm has the best result for addition complexity: A = 3N log_2 N − 3N + 4 additions for an N = 2^n DFT [18].

The numbers of real additions for various FFTs on complex data are given in the following table [18].

Table 2.2. Addition complexity for various FFT algorithms.

  N    | Radix-2 | Radix-4 | SRFFT |   PFA |  WFTA
  16   |     152 |     148 |   148 |       |
  60   |         |         |       |   888 |   888
  64   |    1032 |     976 |   964 |       |
  240  |         |         |       |  4812 |  5016
  256  |    5896 |    5488 |  5380 |       |
  504  |         |         |       | 13388 | 14540
  512  |   13566 |         | 12292 |       |
  1008 |         |         |       | 29548 | 34668
  1024 |   30728 |   28336 | 27652 |       |

From the addition complexity point of view, WFTA is a poor choice. In fact, the irregularity and the increase in addition complexity make WFTA less attractive for practical implementations.

2.6. Other Issues

Many issues are related to the FFT algorithm implementations, e.g., scaling and rounding considerations, inverse FFT implementation, in-place and/or in-order issues, parallelism of the FFT algorithms, regularity of the FFT algorithms, etc. We discuss the first two issues in more detail.

2.6.1. Scaling and Rounding Issue
In hardware it is not possible to implement an algorithm with infinite accuracy. To obtain sufficient accuracy, the scaling and rounding effects must be considered. Without loss of generality, we assume that the input data {x(n)} are scaled, i.e., |x(n)| < 1/2, for all n. To avoid overflow of the number range, we apply the safe scaling technique [58]. This ensures that an overflow cannot occur. We take the 16-point DFT with radix-2 DIF FFT algorithm (see Fig. 2.8) as an example. The basic operation for the radix-2 DIF FFT algorithm consists of a radix-2 butterfly operation and a complex multiplication as shown in Fig. 2.13.
[figure: radix-2 butterfly B with inputs u and v and outputs U and V, where V is multiplied by the twiddle factor W_N^p]

Figure 2.13. Basic operation for the radix-2 DIF FFT algorithm.

For two numbers u and v with |u| < 1/2 and |v| < 1/2, we have

|U| = |u + v| ≤ |u| + |v| < 1   (2.33)

|V| = |(u − v) W_N^p| = |u − v| ≤ |u| + |v| < 1   (2.34)

where the magnitude of the twiddle factor is equal to 1. To retain the magnitude, the results must be scaled with a factor 1/2. After scaling, rounding is applied in order to have the same input and output wordlengths. This introduces an error, which is


called quantization noise. The quantization noise for a real number is modeled as an additive white noise source with zero mean and variance Δ²/12, where Δ is the weight of the least significant bit.

[figure: radix-2 butterfly B with 1/2 scaling; the rounding is modeled as additive noise sources n_U and n_V at the outputs U_Q and V_Q]

Figure 2.14. Model for scaling and rounding of a radix-2 butterfly.

The additive noise for U respectively V is complex. Assume that the quantization noise for U and V is Q_U and Q_V, respectively. For Q_U we have

E{Q_U} = E{Q_U,re + jQ_U,im} = E{Q_U,re} + jE{Q_U,im} = 0   (2.35)

Var{Q_U} = E{Q_U,re² + Q_U,im²} = E{Q_U,re²} + E{Q_U,im²} = 2Δ²/12   (2.36)

Since the quantization noise is independent of the twiddle factor multiplication, we have

E{Q_V W_N^p} = E{Q_V} E{W_N^p} = 0   (2.37)

Var{Q_V W_N^p} = E{(Q_V W_N^p)(Q_V W_N^p)*} = E{Q_V Q_V*} = 2Δ²/12   (2.38)

After analysis of the basic radix-2 butterfly operation, we consider the scaling and quantization effects in an 8-point DIF FFT algorithm. The noise propagation path for the output X(0) is highlighted with bold solid lines in Fig. 2.15.
[figure: 8-point radix-2 DIF FFT signal-flow graph with three stages; the noise propagation path to the output X(0) is highlighted with bold solid lines]

Figure 2.15. Noise propagation.

For the sake of clarity, we assume that Δ is equal for each stage, i.e., the internal wordlength is the same for all stages. If we analyze backwards for X(0), i.e., from the last stage back to stage 1, it is easy to see that the noise from stage l − 1 is scaled with 1/2 on its way to stage l, and that stage l − 1 has exactly twice as many noise sources as stage l. Generally, if the transform length is N and the number of stages is n, where N = 2^n, the variance of a noise source from stage l is scaled with (1/2)^{2(n−l)} and the number of noise sources in stage l is 2^{n−l}. Hence the total quantization noise variance for an output X(k) is

Var{Q_X(k)} = Σ_{l=1}^{n} 2^{n−l} (1/2^{2(n−l)}) (Δ²/6) = (Δ²/6) Σ_{l=1}^{n} 1/2^{n−l} = (Δ²/6)(2 − 1/2^{n−1})   (2.39)

The output variance for an output X(k) can be derived by the following equation, if the input sequence is zero mean white noise with variance δ²:

Var{X(k)} = E{X(k) X(k)*}
          = E{ (1/N) Σ_{n=0}^{N−1} x(n) W_N^{nk} (1/N) Σ_{m=0}^{N−1} x*(m) W_N^{−mk} }
          = (1/N²) Σ_{n=0}^{N−1} Σ_{m=0}^{N−1} E{x(n) x*(m)} W_N^{(n−m)k}
          = (1/N²) Σ_{n=0}^{N−1} E{x(n) x*(n)} = (1/N²) N δ² = δ²/N   (2.40)

where E{x(n) x*(m)} = 0 for n ≠ m from the white noise assumption, and where the factor 1/N in front of the sums stems from the safe scaling with 1/2 in each of the n stages. If Δ_in is the weight of the least significant bit for the real or imaginary part of the input, the input variance δ² is equal to 2Δ_in²/12. The signal-to-noise ratio (SNR) for the output X(k) is therefore [60]

SNR = Var{X(k)} / Var{Q_X(k)} = ( (2Δ_in²/12)(1/N) ) / ( (Δ²/6)(2 − 2^{−n+1}) ) = Δ_in² / ( Δ² 2(2^n − 1) )

For a radix-r DIF FFT algorithm, a similar analysis [60] yields

SNR = ( (2Δ_in²/12)(1/N) ) / ( (Δ²/6)(2 − r^{−n+1}) ) = Δ_in² / ( Δ² (2N − r) )

This result, which is based on the white noise assumption, can be used to determine the required internal wordlength. The finite wordlength effect of finite precision coefficients is more complicated; simulation is typically used to determine the wordlength of the coefficients.
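The noise model can also be checked by simulation. The following Python sketch (ours, and only a rough model: the rounding after the 1/2 scaling does not behave exactly as ideal white noise, so only order-of-magnitude agreement is expected) implements a radix-2 DIF FFT with safe scaling and rounding to a step d after each stage and compares the measured output noise variance with Eq. (2.39):

```python
# Monte-Carlo check of the safe-scaling/rounding noise model.
import cmath, random

def rnd(z, d):                                   # round re/im to the grid d
    return complex(round(z.real / d) * d, round(z.imag / d) * d)

def dif_fft(x, d=None):                          # returns (1/N)*DFT, bit-reversed order
    x = list(x); N = len(x); h = N // 2
    while h >= 1:
        for base in range(0, N, 2 * h):
            for i in range(h):
                u, v = x[base + i], x[base + i + h]
                w = cmath.exp(-2j * cmath.pi * i / (2 * h))
                x[base + i] = (u + v) / 2        # safe scaling by 1/2
                x[base + i + h] = (u - v) * w / 2
                if d is not None:                # model the rounding
                    x[base + i] = rnd(x[base + i], d)
                    x[base + i + h] = rnd(x[base + i + h], d)
        h //= 2
    return x

n, d = 5, 2.0 ** -12                             # N = 32, 12 fractional bits
err = []
for _ in range(200):
    x = [complex(random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5))
         for _ in range(2 ** n)]
    err += [a - b for a, b in zip(dif_fft(x, d), dif_fft(x))]
measured = sum(abs(e) ** 2 for e in err) / len(err)
print(measured, d * d / 6 * (2 - 2.0 ** (1 - n)))   # measured vs. Eq. (2.39)
```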

2.6.2. IDFT Implementation

An OFDM system requires both the DFT and the IDFT for the signal processing, so the IDFT implementation is also critical for the OFDM system. There are various approaches to the IDFT implementation. The straightforward one is to compute the IDFT directly according to Eq. (1.2), which has a computational complexity of O(N²). This approach is obviously not efficient.

The second approach is similar to the FFT computation. If we ignore the scaling factor 1/N, the only difference between the DFT and the IDFT is the twiddle factor, which is W_N^{−nk} instead of W_N^{nk}. This can easily be handled by changing the read addresses of the twiddle factor ROM(s) for the twiddle factor multiplications. However, this approach adds an overhead to each butterfly operation and changes the access order of the coefficient ROM. It also requires a reordering of the input when a radix-r DFT is used.

The third approach converts the computation of the IDFT into the computation of a DFT. This is shown by the following equation:

x(k) = (1/N) Σ_{n=0}^{N−1} X(n) e^{j2πnk/N}
     = (1/N) Σ_{n=0}^{N−1} [ X_re(n) e^{j2πnk/N} + jX_im(n) e^{j2πnk/N} ]
     = (1/N) ( Σ_{n=0}^{N−1} [ X_re(n) e^{−j2πnk/N} − jX_im(n) e^{−j2πnk/N} ] )*
     = (1/N) ( Σ_{n=0}^{N−1} X*(n) e^{−j2πnk/N} )*

where the term within the parenthesis is a definition of a DFT and a* is the conjugate of a.
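The conjugation approach is easy to verify numerically. The sketch below (ours) computes the IDFT via a conjugated DFT and also via the equivalent real/imaginary swaps discussed next; both agree to machine precision:

```python
# IDFT computed as a conjugated DFT, and equivalently with re/im swaps.
import cmath

def dft(x):
    N = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * n * k / N)
                for k in range(N)) for n in range(N)]

def idft_via_dft(X):
    N = len(X)
    y = dft([v.conjugate() for v in X])        # DFT of the conjugated spectrum
    return [v.conjugate() / N for v in y]      # conjugate and scale with 1/N

def swap(z):                                    # swap re/im (equals j * conj(z))
    return complex(z.imag, z.real)

X = [complex(i, -i) for i in range(8)]
a = idft_via_dft(X)
b = [swap(v) / 8 for v in dft([swap(v) for v in X])]   # swap-in, DFT, swap-out
assert all(abs(p - q) < 1e-9 for p, q in zip(a, b))
```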

The conjugations at the input and output of the DFT can be realized by swapping the real and imaginary parts. Hence, the IDFT can be computed with a DFT by adding two swaps and one scaling: swap the real and imaginary parts at the input before the DFT computation, swap the real and imaginary parts at the output from the DFT, and scale with a factor 1/N.

2.7. Summary

In this chapter we discussed the most commonly used FFT algorithms, e.g., the Cooley-Tukey and Sande-Tukey algorithms. Each computation step was given in detail for the Cooley-Tukey algorithms. Other algorithms like the prime factor algorithm, the split-radix algorithm, and the WFTA were also discussed. We compared the different algorithms in terms of the number of additions and multiplications. Some other aspects, for instance memory requirements, will be discussed later.

3 LOW POWER TECHNIQUES

Low power consumption has emerged as a major challenge in the design of integrated circuits. In this chapter, we discuss the basic principles of power consumption in standard CMOS circuits. Afterwards, a review of low power techniques for CMOS circuits is given.

3.1. Power Dissipation Sources

In CMOS circuits, the main contributions to the power consumption are the short-circuit, leakage, and switching currents. In the following subsections we introduce them separately.

3.1.1. Short-Circuit Power

In a static CMOS circuit there are two complementary networks: the p-network (pull-up) and the n-network (pull-down). The logic functions of the two networks are complementary. Normally, when the input and output states are stable, only one network is turned on and connects the output either to the power supply node or to the ground node, while the other network is turned off and blocks the current from flowing. A short-circuit current exists during a transition, when one network has turned on while the other network is still active. For example, assume that the input signal to an inverter is switching from 0 to V_dd. There exists a short time interval where the input voltage is larger than V_tn but less than V_dd − V_tp. During this time interval both networks conduct simultaneously.

The short-circuit current then flows through both the PMOS transistors (p-network) and the NMOS transistors (n-network) from the power supply line to ground. The exact analysis of the short-circuit current, even in a simple inverter [6], is complex, but it can be studied by simulation using SPICE. It is observed that the short-circuit current is proportional to the slope of the input signals, the output loads, and the transistor sizes [54]. The short-circuit current typically consumes less than 10% of the total power in a "well-designed" circuit [54].

3.1.2. Leakage Power

There are two contributions to the leakage currents: one from the currents that flow through the reverse biased diodes, and one from the currents that flow through transistors that are non-conducting, as illustrated in Fig. 3.1.

Figure 3.1. Leakage current types: (a) reverse biased diode current, (b) subthreshold leakage current.

The leakage currents are proportional to the leakage area and exponential in the threshold voltage. The leakage current is in the order of pico-amperes with current technologies, but it will increase as the threshold voltage is reduced. The leakage currents depend on the technology and cannot be modified by the designer, except in some logic styles. The leakage current is currently not a severe problem in most digital designs. However, in some cases, like large RAMs, the leakage current is one of the main concerns, and the power consumed by the leakage currents can be as large as the power consumed by the switching currents, for example in a 0.06 µm technology.

The usage of multiple threshold voltages can reduce the leakage current in deep-submicron technologies.

3.1.3. Switching Power

The switching currents are due to the charging and discharging of the node capacitances. The node capacitances mainly include gate, overlap, and interconnection capacitances. The power consumed by the switching currents [63] can be expressed as

P = α C_L f V_dd² / 2   (3.1)

where α is the switching activity factor, C_L is the capacitance load, f is the clock frequency, and V_dd is the power supply voltage. The power consumed by the switching currents is the dominant part of the power consumption. Eq. (3.1) depends only on quantities that are readily observable and measurable in CMOS circuits; it is applicable to almost every digital circuit and gives guidance for low power design. Reducing the switching power is the focus of most low power design techniques.
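As a numerical illustration of Eq. (3.1) (our sketch; the parameter values are assumptions, not measurements from the thesis), reducing nothing but the supply voltage from 5.0 V to 2.9 V scales the switching power by (2.9/5.0)² ≈ 0.34:

```python
# Back-of-the-envelope evaluation of Eq. (3.1): P = a*C_L*f*Vdd^2/2.
def switching_power(alpha, c_load, f_clk, vdd):
    return alpha * c_load * f_clk * vdd ** 2 / 2

p_5v0 = switching_power(alpha=0.2, c_load=100e-12, f_clk=40e6, vdd=5.0)
p_2v9 = switching_power(alpha=0.2, c_load=100e-12, f_clk=40e6, vdd=2.9)
print(p_5v0 * 1e3, "mW ->", p_2v9 * 1e3, "mW")   # 10 mW -> ~3.4 mW
```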

3.2. Low Power Techniques

Low power techniques can be discussed at various levels of abstraction: system level, algorithm and architecture level, logic level, circuit level, and technology level. In the following we give an overview of low power techniques, organized after the abstraction levels. Fig. 3.2 shows some examples of techniques at the different levels.

Figure 3.2. Low-power design methodology at different abstraction levels. [System: partitioning, power-down. Algorithm: parallelism, pipelining. Architecture: voltage scaling. Logic: logic styles and manipulation, data encoding. Circuit: energy recovery, transistor sizing. Technology: threshold reduction, double-threshold devices.]

3.2.1. System Level

A system typically consists of both hardware and software components, and both affect the power consumption. The system design usually has the largest impact on the power consumption, and hence the low power techniques applied at this level have the largest potential for power reduction. The system design includes hardware/software partitioning, hardware platform selection (application-specific or general-purpose processors), resource sharing (scheduling) strategy, etc. However, it is hard to find the best solution for low power in the large design space, and there is a shortage of accurate power analysis tools at this level.

If, for example, instruction-level power models are available for a given processor, software power optimization can be performed [56]. It is observed that faster code and frequent usage of the cache are most likely to reduce the power consumption. The order of the instructions also has an impact on the internal switching within the processor and hence on the power consumption.

Power-down and clock gating are two of the most used low power techniques at the system level. Non-active hardware units are shut down to save power. This is called sleep mode and is widely used in low power processors. The StrongARM SA-1100 processor, for example, has three power states, and the average power varies for each state [29].

Figure 3.3. Power states for the StrongARM SA-1100 processor. [RUN (400 mW), IDLE (50 mW), and SLEEP (160 µW) states with transition times in the range 10 µs to 160 ms]

In recent years, power management has also gained a lot of attention in operating system design. For example, the Microsoft desktop operating systems support advanced power management (APM), and the power states of the hardware can be utilized by the software through the Advanced Configuration and Power Interface (ACPI). The power-down can thus be extended to the whole system. The clock drivers, which often consume 30-40% of the total power, can be gated to reduce the switching activities, as illustrated in Fig. 3.4.

Figure 3.4. Clock gating. [the clock and a block enable signal drive an AND gate whose output feeds the block clock network]

Adapting the clock frequency and/or dynamically scaling the supply voltage to match the performance constraints is another low power technique. A system is designed for its peak performance, but the computation requirement is time varying, and the lower performance requirement during certain time intervals can be used to reduce the power supply voltage.

This can be done using, for instance, the dynamic voltage scaling technique [42]. It requires either a feedback mechanism (load monitoring and voltage control) or predetermined timing to activate the voltage down-scaling.

Another, less explored, domain for low power design is the use of asynchronous design techniques. Asynchronous designs have many attractive features, like non-global clocking, automatic power-down, no spurious transitions, and low peak current. An asynchronous design with dynamic voltage scaling is illustrated in Fig. 3.5. It is easy to reduce the power consumption further by combining the asynchronous design technique with other low power techniques.

Figure 3.5. Asynchronous design with dynamic voltage scaling. [a load monitor controls a DC-DC converter that supplies the processing unit; input and output buffers with synchronous/asynchronous interfaces surround the processing unit]

3.2.2. Algorithm Level

The algorithm selection has a large impact on the power consumption. The task of the algorithm design is to select the most energy-efficient algorithm that just satisfies the constraints. For example, using the fast Fourier transform instead of direct computation of the DFT reduces the number of operations by a factor of 102.4 for a 1024-point Fourier transform, and the power consumption is likely to be reduced by a similar factor.

The complexity measure of an algorithm includes both the number of operations and the cost of communication/storage.

Reducing the number of operations, the cost per operation, and the amount of long distance communication are key issues in algorithm selection. Reducing the complexity of an algorithm reduces the number of operations and hence the power consumption. The regularity and locality of an algorithm affect the control and communication structures in the hardware. In some cases, fast algorithms with low computational workload, like wave digital filters, can be chosen for energy-efficient applications [58].

One important technique for low power at the algorithmic level is algorithmic transformations [45] [46]. This technique exploits the complexity, concurrency, regularity, and locality of an algorithm. The possibility of increasing the concurrency in an algorithm allows the use of other techniques, e.g., voltage scaling, pipelining, and interleaving, to reduce the power consumption.

The loop unrolling technique [9] [10] is a transformation that aims to enhance the speed, and it can also be used for reducing the power consumption. With loop unrolling, the critical path can be reduced and hence voltage scaling can be applied. In Fig. 3.6, the unrolling reduces the critical path and gives a voltage reduction of 26% [10]. This reduces the power consumption by 20%, even though the capacitance load increases by 50% [10]. Furthermore, this technique can be combined with other techniques at the architecture level, for instance pipelining and interleaving, to save more power.

Figure 3.6. (a) Original signal flow graph. (b) Unrolled signal flow graph. [the first-order recursion y(n) = b_0 x(n) + a_1 y(n−1) is unrolled one step to y(n) = b_0 x(n) + a_1 b_0 x(n−1) + a_1² y(n−2), replacing the delay D by 2D]
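The unrolled recursion of Fig. 3.6 (b) can be checked functionally. The Python sketch below (ours; the coefficient values are arbitrary) produces two outputs per iteration using y(n) = b_0 x(n) + a_1 b_0 x(n−1) + a_1² y(n−2) and matches the original first-order recursion:

```python
# Functional check of one-step loop unrolling of y[n] = b0*x[n] + a1*y[n-1].
b0, a1 = 0.5, 0.9

def iir(x):
    y, y1 = [], 0.0
    for xn in x:
        y1 = b0 * xn + a1 * y1
        y.append(y1)
    return y

def iir_unrolled(x):                    # two samples per loop iteration
    y, y2 = [], 0.0                     # y2 holds y[n-2] at loop entry
    for i in range(0, len(x), 2):
        y.append(b0 * x[i] + a1 * y2)                            # y[n-1]
        y.append(b0 * x[i + 1] + a1 * b0 * x[i] + a1 ** 2 * y2)  # y[n]
        y2 = y[-1]
    return y

x = [float(i % 7) for i in range(16)]
assert all(abs(a - b) < 1e-9 for a, b in zip(iir(x), iir_unrolled(x)))
```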

3.2.3. Architecture Level

Once the algorithm is selected, the architecture can be determined for the given algorithm. As we can see from Eq. (3.1), an efficient way to reduce the dynamic power consumption is to reduce the power supply voltage. However, this increases the gate delay. The delay of a minimum-size inverter (0.35 µm standard CMOS technology) increases as the supply voltage is reduced, as shown in Fig. 3.7. To compensate for the delay, we use low power techniques like parallelism and pipelining [11]. We demonstrate this with an example of architecture transformation.

Figure 3.7. Delay vs. power supply voltage for an inverter. [the delay, a fraction of a nanosecond at 3.5 V, increases steeply towards 2 ns as the supply voltage is reduced to 1 V]

Example 3.1. A datapath that determines the largest of C and (A + B) is shown in Fig. 3.8. It requires an adder and a comparator. The original clock frequency is 40 MHz [11].

Figure 3.8. Original datapath [11]. [registers A, B, and C clocked at 1/T feed an adder and a comparator]

In order to maintain the throughput while reducing the power supply voltage, we use a parallel architecture. The parallel architecture, with twice the amount of resources, is shown in Fig. 3.9. The use of two parallel datapaths is equivalent to interleaving two computational tasks. The clock frequency can be reduced to half, from 40 MHz to 20 MHz, since two tasks are executed concurrently. This allows the supply voltage to be scaled down from 5 V to 2.9 V [11]. Since extra routing is required to distribute the computations to the two parallel units, the capacitance load is increased by a factor of 2.15 [11]. Still, this gives a significant power saving [11]:

P_par = C_par V_par² f_par = (2.15 C_orig)(0.58 V_orig)² (f_orig / 2) ≈ 0.36 P_orig

Figure 3.9. Parallel implementation [11]. [two copies of the datapath, clocked at 1/(2T), with input distribution and output selection at the full rate 1/T]

Example 3.2. Pipelining is another method for increasing the throughput. By adding a pipeline register after the adder in Fig. 3.8, the throughput is increased from 1/(T_add + T_comp) to 1/max(T_add, T_comp). If T_add is equal to T_comp, this increases the throughput by a factor of 2. With this enhancement, the supply voltage can also in this case be scaled down to 2.9 V (the gate delay doubles) [11]. The effective capacitance increases by a factor of 1.15 because of the inserted latches [11]. The power consumption for pipelining [11] is

P_pipe = C_pipe V_pipe² f_pipe = (1.15 C_orig)(0.58 V_orig)² f_orig ≈ 0.39 P_orig
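The two power estimates follow directly from P = C V² f with the quoted scale factors. A quick check (ours), normalized to C = V = f = 1 for the original datapath:

```python
# Power ratios of Examples 3.1 and 3.2, relative to the original datapath.
def p_ratio(c_scale, v_scale, f_scale):
    return c_scale * v_scale ** 2 * f_scale

print(round(p_ratio(2.15, 0.58, 0.5), 2))   # parallel:  ~0.36
print(round(p_ratio(1.15, 0.58, 1.0), 2))   # pipelined: ~0.39
```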

47 . off course. 3. focus mainly on the reduction of switching activity factor by using the signal correlation and. Another benefit is that the amount of glitches can be reduced. Reduction of supply voltage lower than the optimal voltage increases the power consumption. Logic Level The power consumption depends on the switching activity factor.4. the node capacitances.10. However. To reduce such communications is important. Locality is also an important issue for architecture trade-off.2. there exists an optimal power supply voltage.A 1/2T B 1/2T C 1/2T 1/2T 1/2T Comparator A>B LOW POWER TECHNIQUES Figure 3. Pipeline implementation. However. One benefit of pipelining is the low area overhead in comparison with using parallel datapaths. The on-chip communication through long buses requires significant amount of power. The area overhead equals the area of the inserted latches. Further power saving can be obtained by parallelism and/or pipelining. however. The low power techniques at the logic level. since the delay increases significantly as the voltage approaches the threshold voltage and the capacitance load for routing and/or pipeline registers increases. which in turn depends on the statistical characteristics of data. most low power techniques do not concentrate on this issue from the system level to the architecture level.

As we know from clock gating, the clock input to a non-active functional block does not change when gated, which reduces the switching of the clock network. Precomputation [1] uses the same concept to reduce the switching activity factor: a selective precomputation of the output of a circuit is done one clock cycle before the output is required, and the result is used to gate inputs to the circuit. This is illustrated in Fig. 3.11.

Figure 3.11. A precomputation structure for low power. [input registers R1 and R2 feed the main block A; a precomputation block g, fed from R1, produces an enable signal that gates R2 and the output register R3]

An example of precomputation for low power is a comparator. The input data are partitioned into two parts, corresponding to the registers R1 and R2. The MSBs of the two numbers go to register R1 and the remaining bits to R2. The comparison of the MSBs is performed in the precomputation block g one clock cycle before the main computation in A (a subtractor) is performed. If the two MSBs are not equal, the output from g gates the remaining inputs, so only a small portion of the inputs to the comparator's main block A changes. Therefore the switching activity factor in A, and hence the power consumption, is reduced.

Gate reorganization [12] [32] [57] is a technique to restructure the circuit. This can be decomposition of a complex gate into simple gates, composition of simple gates into a complex gate, duplication of a gate, or deletion/addition of wires. The decomposition of a complex gate and the duplication of a gate help to separate the critical and non-critical paths and to reduce the size of the gates in the non-critical paths. In some cases, the decomposition of a complex gate also increases the circuit speed and gives more room for power supply voltage scaling.

The composition of simple gates into a complex gate can reduce the power consumption if the complex gate reduces the charging/discharging of frequently switching nodes. The deletion of wires reduces the capacitance load and the circuit size, while the addition of wires can provide an intermediate circuit that eventually leads to a better one.

Encoding defines the way data bits are represented in the circuit. Traditionally, the encoding is optimized for reduction of delay or area. In low power design, the encoding is instead optimized for reduction of switching activity, since different encoding schemes have different switching properties. In a counter design, the coding style has a large impact on the number of transitions. For instance, counters with binary and Gray code have the same functionality, but the full counting cycle of a 2-bit binary coded counter, 00, 01, 10, 11, and back to 00, requires 6 transitions, while the full counting cycle of a 2-bit Gray coded counter, 00, 01, 11, 10, and back to 00, requires only 4 transitions. For an N-bit counter, a full counting cycle requires 2(2^N − 1) transitions with binary code [63] but only 2^N transitions with Gray code, i.e., the binary coded counter has twice as many transitions as the Gray coded counter when N is large. Using a binary coded counter therefore requires more power than using a Gray coded counter under the same conditions. The same idea can be applied to finite state machines, where the states can be coded with different schemes. A careful choice of coding style is important to meet the speed requirement and minimize the power consumption.

A bus is an on-chip communication channel that has a large capacitance, and as the on-chip transfer rates increase, the buses contribute a significant portion of the total power. Bus encoding is a technique that exploits properties of the transmitted signals to reduce the power consumption. For instance, adding an extra bit that selects between the inverted and non-inverted bits at the receiver end can save power [53]. Low swing techniques can also be applied to the buses [27].
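The transition counts for the counters above are easy to reproduce. The following sketch (ours) counts the bit flips over a full counting cycle for binary and standard (reflected) Gray coding, matching 2(2^N − 1) and 2^N:

```python
# Counts bit transitions over a full counting cycle, including wrap-around.
def transitions(codes):
    return sum(bin(a ^ b).count("1")
               for a, b in zip(codes, codes[1:] + codes[:1]))

for n in (2, 4, 8):
    binary = list(range(2 ** n))
    gray = [i ^ (i >> 1) for i in range(2 ** n)]      # reflected Gray encoding
    print(n, transitions(binary), transitions(gray))  # 2*(2^n - 1) vs 2^n
```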

In CMOS circuits, the dynamic power consumption is caused by transitions, and some of these are spurious transitions, i.e., glitches. Spurious transitions typically consume between 10% and 40% of the switching activity power in typical combinational logic [20]. In some cases, like array multipliers, the amount of spurious transitions is large and cannot be ignored. To reduce the spurious transitions, the delays of the signals from registers that converge at a gate should be roughly equal. This technique is called path balancing and can be achieved by insertion of buffers and by device sizing [33]. The insertion of buffers increases the total load capacitance but can still reduce the spurious transitions.

Many logic gates have inputs that are logically equivalent, i.e., swapping the inputs does not modify the logic function of the gate. Example gates are NANDs, NORs, XORs, etc. However, from the power consumption point of view, the order of the inputs does affect the power consumption. For instance, in a two-input NAND gate (Fig. 3.12), the A-input, which is near the output, consumes less power than the B-input close to the ground for the same switching activity factor. Pin ordering assigns the more frequently switching signals to the input pins that consume less power. In this way, the power consumption is reduced without cost. However, the statistics of the switching activity factors for the different pins must be known in advance, and this limits the use of pin ordering [63].

Figure 3.12. NAND gate. [a two-input CMOS NAND with input A near the output and input B near ground]

3.2.5. Circuit Level

At the circuit level, the potential power savings are often smaller than those at the higher abstraction levels. However, since the basic cells are used frequently, the savings can still be significant. A few percent improvement of a D flip-flop can significantly reduce the power consumption in a deeply pipelined system.

The selection of logic style affects both the speed and the power consumption, since different logic styles have different electrical characteristics. In most cases, the standard CMOS logic is a good starting point for the speed and power trade-off.

For some functions, for instance the XOR/XNOR implementation, other logic styles, like complementary pass-transistor logic (CPL), are more efficient. CPL implements a full adder with fewer transistors than the standard CMOS, and the evaluation of the full adder is done with an NMOS transistor network only (Fig. 3.13). This gives a small layout as well, but is paid for by a larger delay.

Figure 3.13. CPL logic network. [complementary inputs drive an NMOS pass-transistor network that produces complementary outputs]

Transistor sizing affects both the delay and the power consumption. Generally, a gate of smaller size has a smaller capacitance and consumes less power; this is paid for by a larger delay. To minimize the transistor sizes and still meet the speed requirement is therefore a trade-off. Typically, transistor sizing uses static timing analysis to find the gates whose slack time is larger than zero and whose sizes thus can be reduced. Transistor sizing is generally applicable for different technologies.

3.3. Low Power Guidelines

Several approaches to reduce the power consumption have been briefly discussed. Below we summarize some of the most commonly used low power techniques.

• Reduce the number of operations. The selection of the algorithm and/or architecture has a significant impact on the power consumption.

• Power supply voltage scaling. Voltage scaling is an efficient way to reduce the power consumption. Since the throughput is reduced as the voltage is reduced, this may need to be compensated for with parallelism and/or pipelining.

• Reduce the I/O. The I/Os between chips can consume large power due to the large capacitive loads. Reducing the number of chips is a promising approach to reduce the power consumption.

• Power management. In many systems, the most power consuming parts are often idle. For example, in a lap-top computer the display and the hard disk can consume more than 50% of the total power. Using power management strategies to shut down these components when they have been idle for a long time can achieve good power savings.

• Reduce the number of transitions. To minimize the number of transitions, especially the glitches, is important.

• Reduce the effective capacitance. The effective capacitance can be reduced by several approaches, for example compact layout and efficient logic styles.

3.4. Summary

In this chapter we discussed some low power techniques that are applicable at different abstraction levels.

4 FFT ARCHITECTURES

Not only have several variations of the FFT algorithm been developed after Cooley and Tukey's publication, but also various implementations. Generally, the FFT can be implemented in software, on general-purpose digital signal processors, on application-specific processors, or on algorithm-specific processors. Implementations in software on general-purpose computers can be found in the literature and are still being explored in some projects, for instance the FFTW project in the Laboratory for Computer Science at MIT [28]. Software implementations are not suitable for our target application, as the power consumption is too high. Since it is hard to summarize all other implementations, we will concentrate on algorithm-specific architectures and only give a brief overview of some FFT architectures.

4.1. General-Purpose Programmable DSP Processors

Many commercial programmable DSP processors include special instructions for the FFT computation. Although the performance varies from one to another, most of them belong to the Harvard architecture from an architectural point of view. A processor with Harvard architecture has separate buses for data and program.

A typical programmable DSP processor, for instance TI's TMS320C3x, has on-chip data and program memories, a MAC, an ALU, an address generator, program control, and I/O interfaces, as illustrated in Fig. 4.1. In some DSP processors, bit-reverse addressing is available to accelerate the unscrambling of the data output.

Figure 4.1. General-purpose programmable DSP processor. [program controller, program memory, data memory, address generator, MAC & ALU, and I/O interfaces connected through program and data buses]

To compute the FFT with a general-purpose DSP processor requires three steps: first the data input, then the FFT/IFFT computation, and finally the data output. The computation of the FFT on a general-purpose DSP processor therefore does not differ much from the software computation of the FFT on a general-purpose computer. Typical FFT/IFFT execution times are about 1 ms [2] [41] [55], which is far from the performance of more specialized implementations. Implementations with general-purpose programmable DSP processors are therefore not applicable due to the throughput requirement.

4.2. Programmable FFT-Specific Processors

Several programmable FFT processors have been developed for the FFT/IFFT computations. These processors are 5 to 10 times faster than the general-purpose programmable DSP processors. The programmable FFT-specific processors have specific butterflies and at least one complex multiplier [65]. The butterfly is usually radix-2 or radix-4. There is often an on-chip coefficient ROM that stores the sine and cosine coefficients.

This type of programmable FFT-specific processor is often provided with windowing functions in either the time or the frequency domain.

Zarlink's (former Plessey) PDSP16515A processor performs decimation-in-time, radix-4, forward or inverse fast Fourier transforms [65]. The processor has two internal workspace RAMs, one output buffer, and one coefficient ROM. Data are loaded into an internal workspace RAM in normal sequential order, processed, and then read out in correct order. The processor requires 98 µs to perform a 1024-point FFT with a system clock of 40 MHz. Using a multiple-processor configuration can achieve a higher throughput, but the power consumption is then substantially higher.

Figure 4.2. FFT-specific processor PDSP16515A. [input, 3-term window operator, coefficient ROM, two workspace RAMs, radix-4 datapath, and output buffer]

Although the PDSP16515A processor accelerates the FFT computation, it is still hard to meet the throughput requirement with a single processor due to the slow I/O. A recently released FFT-specific processor from DoubleBW Systems B.V. has a higher throughput (100 Msamples/s), but consumes 8 W at 3.3 V [16].

the routing for the processing elements is complex and difficult. x(0) x(4) x(2) x(6) x(1) x(5) x(3) x(7) W0 W2 W0 W1 W0 W2 Wn W2 W3 x(0) x(1) x(2) x(3) x(4) x(5) x(6) x(7) n multiplication with W8 Figure 4.4. For the long transform length. the signal-flow graph for an 8-point FFT algorithm is shown in Fig. The results are fed back to the same set of process elements to compute the next stage. a column or a pipelined FFT processor can be used. The 8-point fully parallel FFT processor requires 24 complex adders and 5 complex multipliers.3. is not power efficient.3. A set of process elements in a column FFT processor [21] compute one stage at a time. hence. To reduce the hardware complexity. column FFT processors. The architecture of an algorithmic-specific FFT processor is therefore optimized with respect to memory structure. Algorithm-Specific Processors Non programmable algorithm-specific processors can also be designed for the computation of FFT algorithms. Signal-flow graph for an 8-point FFT. The hardware requirement is excessive. The processors are designed mostly for fixed-length FFTs.3. All three types of algorithm-specific processors represent different mapping of the signal-flow graph for FFT to hardware structures. 56 Chapter 4 . 4. and processing elements. and. control units. For example. The hardware structure in a fully parallel FFT processor is an isomorphic mapping of the signal-flow graph [3]. and pipelined FFT processors. There are mainly three types of algorithm-specific processors: fully parallel FFT processors.

4. pipelined FFT processors have features like simplicity. the first four input data are multiplexed to the top-left delay elements in the figure and the next four input data directly to the butterfly. each stage has its own set of processing elements. in-place applications where the input data often arrive in a natural sequential order. All the stages are computed as soon as data are available. This complete the start-up of the first stage of the pipeline. The most common groups of the pipelined FFT architecture are • Radix-2 multipath delay commutator (R2MDC) • Radix-2 single-path delay feedback (R2SDC) • Radix-4 multipath delay commutator (R4MDC) • Radix-4 single-path delay commutator (R4SDC) • Radix-4 single-path delay feedback (R4SDF) • Radix-22 single-path delay commutator (R22SDC) We will discuss these pipeline architectures in more detail.4. An 8-point R2MDC FFT is shown in Fig. Radix-2 Multipath Delay Commutator The Radix-2 Multipath Delay Commutator (R2MDC) architecture is the most straightforward approach to implement the radix-2 FFT algorithm using a pipeline architecture [48]. In this way the first input data is delayed by four samples and arrives to the butterfly simultaneously with the fourth input sample. The outputs from the first stage butterfly and the multiplier are then fed into the multipath delay commutator FFT ARCHITECTURES 57 .1.4.For a pipelined FFT processor.3. modularity and high throughput. When a new frame arrives. These features are important for real-time. We therefore select the pipeline architecture for our FFT processor implementation. 4. Radix-2 Butterfly Radix-2 Butterfly Radix-2 Butterfly 4 Input Mux 2 2 Switch 1 1 Switch Output0 Output1 Figure 4. An 8-point DIF R2MDC architecture.

However.. Each stage (except the last one) has one multiplier and the number of multipliers is log 2( N ) – 1 . i. 4.e..3. the first and second outputs from the multiplier at the first stage are now delayed by the upper delay elements. 3N/2-2. Works introduced a feedback mechanism in order to minimize the number of delay elements [22]. Groginsky and George A. The multipath delay commutator alleviates the data dependency problem. The butterfly and the multiplier are idle half the time to wait for the new inputs.+2. The total number of delay elements is 4 + 2 + 2 + 1 + 1 = 10 for the 8-point FFT. which make the first and second outputs from the multiplier of the first stage arrive together with the fifth and sixth outputs from the top. In the proposed architecture one half of outputs from each stage are fed back to the input data buffer when the input data are directly sent 58 Chapter 4 .2. Radix-2 Single-Path Delay Feedback Herbert L. There are two paths (multipath) with delay elements and one switch (commutator).between stage 1 and stage 2. the switch changes and the third and fourth outputs from the upper output of the first butterfly are sent directly to the butterfly at stage 2.. After this. Hence the utilization of the butterfly and the multiplier is 50%. The total number of delay elements for an N-point FFT can be derived in similar way and is N/2+N/2+N/4+. The first and second outputs from the upper side of the butterfly are fed into the two upper delay elements.

The delay elements at the first stage save four input samples before the computation starts. This architecture is called Radix-2 Single-path Delay Feedback (R2SDF). + 1) which is minimal. 4.e.. A 4path delay commutator is used between two stages. Fig. Input data are separated by a 4-to-1 multiplexer and 3N/2 delay elements at the first stage.. Radix-4 Multipath Delay Commutator This architecture is similar to R2MDC. The utilization of multiplier and butterflies remains the same. 4 Radix-2 SDF Butterfly 2 Radix-2 SDF Butterfly 1 Radix-2 SDF Butterfly 0 1 0 1 Input Output Radix-2 SDF Butterfly Radix-2 Butterfly Mux Figure 4.. An 8-point DIF R2SDF FFT. the butterfly processes the incoming samples. During the execution they store one output from the butterfly of the first stage and one output is immediately transferred to the next stage. When the mux is 1. 4. Computation is taking place only when the last 1/4 part of data is multiplexed to the FFT ARCHITECTURES 59 . in the new interim half frame when the delay elements are filled with fresh input sample. the butterfly is idle and data passes by. Because of the feedback mechanism we reduce the requirement of delay elements from 3N/2 to N – 1 (N/2 + N/4 + . The number of multiplier is exact the same as R2MDC FFT architecture. i.5 shows the principle of an 8-point R2SDF FFT. 4. The butterfly is provided with a feedback loop.to the butterfly. the results of the previous frame are sent to the next stage.5. Thus. log 2( N ) – 1 .3.5. The modified butterfly is shown in the right side of Fig. When the mux is 0. namely 50%.3.

The utilization of the multiplier is 75% due to the fact that at least one-fourth of the data are multiplied with the trivial 60 Chapter 4 Radix-4 Butterfly 48 12 3 Output0 Output1 Output2 Output3 . it is not a good structure. A few more delay elements are required with this architecture. only one output is produced in comparison with 4 in the conventional butterfly. which is less than the R4MDC FFT architecture.butterfly. the simplified butterfly needs additional control signals. A length-64 DIF Radix-4 Multipath Delay Commutator (R4MDC) FFT is shown in Fig. From the view of hardware and utilization. and so do the commutators. Each stage (except the last stage) has 3 multipliers and the R4MDC FFT requires in total 3 ⋅ ( log 4( N ) – 1 ) multipliers for an Npoint FFT which is more than the R2MDC or R2SDF. The length of the FFT has to be 4 n . the butterfly works four times instead of just one. A 64-point DIF R4MDC FFT. Radix-4 Butterfly Radix-4 Butterfly 32 12 8 4 Switch 8 4 3 2 1 2 1 Input 16 Mux Switch Figure 4. To accommodate this change we must provide the same four data at four different times to the butterfly.6. Furthermore.6.4. 4. V. G. To provide the same four outputs.3. The utilization of the butterflies and the multipliers is 25%. Radix-4 Single-Path Delay Commutator To increase the utilization of the butterflies. Moreover the memory requirement is 5N/2-4. Due to this modification the butterfly has a utilization of 4 ⋅ 25% or 100%. The number of multipliers is log 4( N ) – 1 . Bi and E. 4. Jones [4] proposed a simplified radix-4 butterfly. In the simplified radix-4 butterfly. which is the largest among the three discussed architectures.

5. A64-point DIF R4SDF FFT is illustrated in Fig. The cost for R4SDC is the increase amounts of delay elements. But the utilization of the butterflies are reduced to 25%.8.3. Since we use the radix-4 algorithm we can reduce the number of multipliers to log4(N) – 1 compared to log2(N) – 2 for R2SDF. The radix-4 SDF butterflies also become more complicated than the radix-2 SDF butterflies. Radix-4 Single-Path Delay Feedback Radix-4 single-path delay feedback (R4SDF) [15] [62] is a radix-4 version of R2SDF.7. 16 16 16 4 4 4 1 1 1 Radix-4 SDF Butterfly Radix-4 SDF Butterfly Radix-4 SDF Butterfly Input Output Figure 4. The main benefit of this architecture is the utilization improvement for butterflies. Input Single-path Delay Commutator 24 Simplified Radix-4 Butterfly Single-path Delay Commutator 6 Simplified Radix-4 Butterfly Output Figure 4.8.twiddle factor 1 (no multiplication is needed). A 64-point DIF R4SDF FFT. 4. FFT ARCHITECTURES 61 . The structure of a 16point DIF Radix-4 Single-Path Delay Commutator (R4SDC) FFT is shown below. 4. A 16-point DIF R4SDC FFT.

The outputs are bit-reversed instead of 4-reversed as in a conventional radix-4 algorithm. We reduce the number of multipliers compared to the conventional radix-2 algorithm.10.3. 4.6. Basically two kinds of radix-2 SDF butterflies are used to achieve the same output (but not of the same order) as a radix-4 butterfly. 2 Radix-2 SDF (I) Butterfly Element 1 Radix-2 SDF (II) Butterfly Element Input Output Figure 4.The radix-4 SDF butterfly is shown in Fig. It has the same butterfly structure as the radix-2 DIF FFT. 4. By reducing the radix from 4 to 2 we increase utilization of the butterflies from 25% to 50%. but places the multipliers at the same places as for the radix-4 DIF FFT. 62 Chapter 4 . Radix-22 Single-Path Delay Commutator The Radix-22 Single-path Delay Commutator (R22SDC) architecture [24] uses a modified radix-4 DIF FFT algorithm. Radix-4 SDF butterfly. Radix-4 SDF Butterfly Radix-4 Butterfly 32 Radix-2 SDF (I) Butterfly Element 16 Radix-2 SDF (II) Butterfly Element 8 Radix-2 SDF (I) Butterfly Element 4 Radix-2 SDF (II) Butterfly Element 0 1 0 1 0 1 0 1 Mux Figure 4.9. otherwise the data are shifted into a delay-line with a length of 3N/4 (first stage). The data are sent to the butterfly for processing when the mux is 1. This approach is based on a 4-point DFT. A 64-point DIF R22SDC FFT.9.

4. The programmable DSP or FFT-specific processors cannot meet the requirements in both high throughput and low power applications. several FFT implementation classes are discussed. Algorithm-specific implementations. FFT ARCHITECTURES 63 .4. especially with pipelined FFT architectures are better in this respect. Summary In this chapter.

64 Chapter 4 .

a high-level design language is used to define the system functionality. which builds the system by assembling the existing building blocks. We follow the meet-at-the-middle design method.1. Design Method As the transistor feature size is scaled down. we discuss implementation of FFT processors. high complexity. more and more functionalities can be integrated in a single chip. The bottom-up methodology. High speed. After a number of decomposition steps. In VLSI design. A design methodology is the overall strategy to organize and solve the design tasks at the different steps of the design process [24]. the system requirements and organization is developed by a successive decomposition. In the top-down design methodology. the system is described by a HDL. which can be used for automatic logic 65 .5 IMPLEMENTATION OF FFT PROCESSORS In this chapter. can hardly catch up with the high performance and communication requirements of current system. This requires that the design methodology must cope with the increasing complexity using a systematic approach. Hence the bottom-up methodology is not suitable for the design of complex systems. Typically. 5. the design method is an important guide for implementation. and short design time are several requirements for VLSI designs.

but the actual design of the building blocks is Architecture performed in bottom-up. The design process starts with creation of functional specification of the FFT processor. the building blocks are already available in a circuit library. the whole design has to be redesigned. Different functionalities are partitioned and mapped into hardware or software components. After the architecture model is created. the requirement for the FFT processor is specified.1. This results in a high-level model. Often. The design process is therefore Module Logic divided into two almost Gate Cell independent parts that meet in the middle.1. the detail computation process is mapped to the hardware. In our target application. The high-level model is then validated by a testbench for the FFT algorithm. 5. the specificationsynthesis process is carried out Algorithm in essentially a top-down Scheduling fashion. the functional specification is mapped into an architectural specification. In the meet-in-the-middle Specification/Validation methodology. The testbench can be reused for successive models. Basically. the software and hardware design are 66 Chapter 5 . This is MEET-IN-THE-MIDDLE illustrated in Fig. The circuit design Transistor Layout phase can be shortened by Layout using efficient circuit design tools or even automatic logic synthesis tools. The meet-in-the-middle methodology. the model needs to be simulated for performance and validation. If the final result fails to meet the performance requirement. A drawback with this design approach is that the result highly relies on the synthesis tools. In the architectural specification. Detailed communications between different components are to be decided at the architecture specification. some of Figure 5. After the system functionality is validated by simulation.synthesis.

Resource Analysis The high-level design can be divided into several tasks: • • • • • Architecture selection Partitioning Scheduling RTL model generation Validation of models IMPLEMENTATION OF FFT PROCESSORS 67 . throughput • Complex 24 bits I/O data According to the meet-in-the-middle design methodology. High-level Modeling of an FFT Processor High level modeling serves two purposes: to create a cycle-true model for the algorithm and hardware architecture.2. the software and hardware co-design is not needed. As mentioned previously. Since the whole FFT processor is implemented in hardware. Since the FFT processor is completely implemented in hardware.2. to simulate. 5. Different subblocks are built from cells in combination of blocks.6 Msamples/sec. validate and optimize the high-level model. 5. the high-level design is a top-down process. The system specification for the FFT processor has been defined as • Transform length is 1024 • Transform time is less than 40 µs (continuously) • Continuous I/O • 25. We start with the resource analysis. we do not need to determine the system specification since it is given. the individual hardware blocks are refined by adding the implementation details and constraints. In this phase. Once an architecture is selected. the partitioning of software and hardware is not necessary.separated after this architectural partitioning. we apply the bottom-up design methodology.1.

butterflies and complex multipliers. the resource is constrained.1. There are many possible architectures for FFT processors. In the ASIC implementation. Therefore the number of butterflies is 10. A butterfly can be implemented with parallel adders/subtractors using one clock cycle.2) With radix-2 algorithm.. Hence the resource analysis is required.6 × 10 This is optimal with the assumption that ALL data are available to ALL stages. i. Butterflies From the specification. equal to the number of stages. 5. 68 Chapter 5 . Since the control part is much simpler than the datapath in respect of both hardware and power consumption.The first three tasks are associated with each other and the aim is to allocate the resource to meet the system specification.= ---------------------------------------------------.= 5 (5. Each butterfly has to be idle for 50% in order to reorder the incoming data. the pipelined FFT architectures are particularly suitable for real-time applications since they can easily accommodate the sequential nature of sampling. the number of butterfly operations is ( N ⁄ r ) log r( N ) = ( N ⁄ 2 ) log 2( N ) = 5120 . The datapath for the FFT processor consists of memories.1. The pipelined FFT architectures can be divided into datapath and control part.e. the computation time for the 1024-point FFT processor is t FFT = 4 × 10 –5 s (5.3) –5 6 t FFT 4 × 10 × 25. which is impossible for continuous data streams. We discuss them separately. the resource analysis concentrates on the datapath. Among them.2. The allocation of butterfly operations from two stages to the same butterfly is not possible with as soon as possible (ASAP) scheduling. Hence the minimum number of butterflies is N o BFop × t BFop 5120 N BF = --------------------------------------.

5) –5 6 t FFT 4 × 10 × 25. the number of butterflies for a radix-4 pipeline architecture is equal to the number of stages.( r – 1 ) ( log r( N ) – 1 ) (5. N cmult t cmult 4068 N cmult = ------------------------------.2.1. The number of complex multipliers is 9. For a 1024-point FFT processor. The size of the memories are determined by the maximum amount of live data. the number of complex multipliers is 4.≈ 4 (5. For radix-2 algorithm.3. IMPLEMENTATION OF FFT PROCESSORS 69 . the architectures with feedback are efficient in terms of the utilization of memories.6 × 10 Since the resource sharing between two stages is not possible for pipeline architectures. 5. The complex multiplication can be computed either in one clock cycle and two clock cycles (pipelining).With similar discussion.2. the number of complex multiplications is about 4068. each stage except the last stage has its own set of complex multipliers. For radix-4 algorithm. Complex Multipliers The number of complex multiplications is N N cmult ≈ --.= ---------------------------------------------------. which is determined by the architectures. i. Memories The memories requirement increases linearly with the transform length.4) r where N is the transform length and r is the radix. 5. It does not include the complex multiplications within the r-point DFT.e. The minimum number of complex multipliers is. it dissipates more power than the complex multipliers. with assumption of fast complex multipliers (one complex multiplication per clock cycle). In general.2.1.

This gives a freedom for the construction of model and testbench: the model can be written in either C..2... For a fast evaluation.2.. Memory requirement and utilization for pipelined architectures. The validation of high-level model is done through simulation and comparison. input text file output text file 70 Chapter 5 . The same testbench can be reused by changing the output/input file arithmetic.+3 = N-1 3N/2+3N/8+.Architecture R2MDC R2SDF R4MDC R4SDF R4SDC R22SDF Memory requirement [words] Memory utilization 66% 100% 40% 100% 50% 100% N/2+N/2+..+6 = 2N-2 N/2+N/4+.. Testbench. Device Under Test Test Bench Figure 5..+1 = N-1 3N/4+3N/16+... the testbench can also written in C or Matlab.+12 = 5N/2-4 3N/4+3N/16+. 5.+2 = 3N/2-2 N/2+N/4+..1. Moreover. The interface between model and testbench is plain text files: the input data is stored in a text file and read in by the model and the output data from model is saved in a text file also. like C or Matlab. Matlab or VHDL. the next step is to model the FFT algorithm at high-level. Validation of the High-Level Model After the resource analysis. the algorithm is described with high-level programming language.+1 = N-1 Table 5. it is easy to convert from floating-point arithmetic to fixed-point arithmetic...2.

Based on the observation that the wordlength for different stages in the pipelined FFT processor can be various.2. Wordlength optimization for pipelined FFTs. Because our focus is placed on reducing power consumption in the data memory.3. which uses fixed wordlength for both data and coefficients for each stages. the shorter its wordlength should be.5. which is provided by the pipeline architecture. and then we adjust the wordlength of the coefficient ROM at each stage. Wordlength Optimization In the pipelined FFT architectures.3. The possibility to use different wordlength. IMPLEMENTATION OF FFT PROCESSORS 71 . We first tune the wordlength of data memory (data RAM) at each stage separately to make sure that the precision requirement is met. the most research effort has been relative to the regular modular implementations. Start Coefficient wordlengths No Sine wave test vectors OK? Yes Data wordlengths Random test vectors No OK? Yes End Figure 5. the strategy is that the larger the RAM block in a stage. we proposed a wordlength optimization method [34]. is often ignored to achieve modular solutions. To obtain the optimal word lengths profile numerous design iterations have been performed. The conventional uniform wordlength scheme for both data memory and coefficient ROM is also simulated.

The designers have to increase timing margins during synthesis to meet the speed requirement after place and route. there are two design methods: the semi-custom method and full custom method. this design methodology relies on the synthesis tools and place and route tools. 100. Subsystems Once the RTL-level model of FFT processor is created and validated. Architecture R2MDC R2SDF R4MDC R4SDF R4SDC R22SDF Memory size for fixed wordlength 42952 bits 28644 bits 71552 bits 28644 bits 57288 bits 28672 bits Memory size for optimized wordlength 42824 bits 28580 bits 61488 bits 24708 bits 49176 bits 28580 bits Saving 0% 0% 14% 14% 14% 0% Table 5. The optimization result is shown table below for 1024-point pipelined FFT architectures. The resulting designs are often unnecessary large. the impact of power supply voltage scaling is hard to predict since the characterization of cells is done at 72 Chapter 5 . The designer have less control over the design process.2. Semi-custom design method has a shorter design time. For the subsystems design.000 sets of random samples are generated and fed into the simulator. Moreover. In our case. the most synthesis tools use static timing analysis and do not consider the interconnections during synthesis. The RTL-level description in a HDL can be synthesized with synthesis tool and the synthesis result is fed to place and route tool for final layout. The sine wave stimuli is sensitive to the precision of the coefficient representation. One is sine wave and the other is random numbers. To make the results obtained highly reliable. Wordlength optimization. However. 5.Two types of testing vectors are used in our simulation.3. and the samples of random numbers are effective stimuli to check the precision of butterfly calculations. the subsystems can be constructed according to the meetat-the-middle design methodology.

5. We select therefore full custom design for the FFT processor.3.1. In the following we introduce the subsystem design for the FFT processor. dynamic RAM (DRAM). Hence the low power design of memories are a key issue for the FFT processor. butterflies. In the RAM design. Datapath for a stage in a pipelined FFT processor. Complex Output Multiplier Input Data reordering (Memories) Butterfly Figure 5.3. 5. In the 1024point FFT processor. but use semi-custom design method for the control path. where the timing is not critical. The main subsystems are memories.1. the memory becomes the most significant part in both area and power consumption.normal supply voltage. there are mainly two types of RAMs: static RAM (SRAM). Since the DRAM often requires special technology and is not IMPLEMENTATION OF FFT PROCESSORS 73 .4.1. RAM The data are stored in RAMs. and complex multipliers. the memory contributes significant portion of area and power consumption for the whole processor. Memory In many DSP processors.

which is the main parts of the SRAM. A typical 6-T memory cell is shown in Fig. 5. The SARM consists of four parts [30]: • memory cell array • decoders • sense amplifiers • periphery circuits We discuss the implementation for the first three parts. we select the SRAM for the data storage. Decoder Cell Array Ctl circuits Sense Amplifiers Data I/O Address Data Figure 5.available for standard CMOS technology and SRAM is more suitable for low voltage operations. Overview of a SRAM.5. We therefore select to use a 6-T memory cell. 5. 74 Chapter 5 .6. The memory cell is basic building block in the SRAM and the size of the memory cell is of importance. The keys for the design of memory array are cell area and noise immunity of bit-lines. Even through a 4transistors (4-T) memory cell has less area than that of a 6-transistors (6-T).5. the current leakage at low voltage is considerable larger. Memory array Memory array dominates the SRAM area. An overview of SRAM is shown in Fig.

the value for α is 1~2 and β is 2~3. Normally. SRAM cell. Wa. The width ratio α = W p ⁄ W a affects the write operation and a larger α means data is more difficult to write into the cell. short channel effects. is set to min-size for minimal cell area. The cell designer have to consider process variations. The width for the access NMOS transistor.6. IMPLEMENTATION OF FFT PROCESSORS 75 . BL_bar Wa BL Layout The stability of the memory cell during the read and write operations determines the device sizing [52]. soft error rate and low supply voltage [23]. The width ratio β = W n ⁄ W a is determined by the read operation and a larger ratio β means less chance for a SRAM cell changes its state during read operation.WWL Wp Wn Schematic Figure 5. The read stability can be measured by the static noise margin (SNM).

In order to reduce the power consumption and speed up the read access. SNM of a SRAM cell vs.5 Power supply voltage (V) 3 3.5 Figure 5.35 µm CMOS technology is shown in Fig. The SNM can be simulated with SPICE. Guard-ring Bit-lines Figure 5. SNM for a SRAM cell.8.5 SNM (V) 0. As the power supply voltage decreases.52 0.54 0. The voltage swing between two bit-lines is usually about 100 mV to 300 mV. Noise reduction for memory array. To reduce the noise from outside.44 0. which is sensitive to noise.7.7. the affect of noise becomes more important. we use the twisted bit-lines layout.5 2 2. the memory array is surrounded with guard-ring to reduce the substrate-coupling noise. SNM vs power supply voltage 0. power supply voltage. most SRAMs read data through a par bit-lines with small swing. 5. The SNM for a SRAM cell with β = 2 in standard 0. To avoid the coupling noise from nearby bit-line pars. 76 Chapter 5 .Example 6. Thus the coupling from nearby bit-line pars does not affect the swing difference of bit-lines.46 0.48 0.42 1 1.

The row decoder can use either NOR-NAND decoder or tree decoder. Sense Amplifier The sense amplifier is used to amplify the bit-line signals with small swing during read operation. The NORNAND decoder has a regular layout. This reduces the power consumption of decoder. which reduce both the delay and the activity factor. In small decoders the tree decoder is preferred and the NOR-NAND decoders is preferred for larger decoders.Decoder The decoder can be realized by using a hierarchical structure. One way to reduce the power consumption is to reduce the active time for sense amplifier. the current mode sense amplifier is less suitable. This can be achieved by using pulsed sense enable signal. but requires more transistors. This high gain requirement in turn requires high current and hence high power consumption for sense amplifier. IMPLEMENTATION OF FFT PROCESSORS 77 . which controls the width of word-line acting pulse and reduces the glitches of word-line drivers.9. STC D-flip-flop. SE_bar BL BL_bar Dout_bar Dout Figure 5. We therefore modified an STC D-flip-flop [64] to form a two stage latch type sense amplifier. The sense amplifier is functional when the supply voltage is as low as 0. a word-line enable signal is added to the decoder. which could increase the delay (it becomes worse for lower power supply voltage). but suffer from speed degradation due to the serial-connection of pass-transistors. For the large decoder. To have a fast access.9 V. Tree decoder requires fewer transistors. At low supply voltage. the sense amplifier is designed with high gain.

The total power consumption is 83. The power SENSE AMPLIFIER TEST Wave :A0:v(out0) :A0:v(out1) :A0:v(prech) 1:A0:v(wwl) 1. The access time is 11 ns using standard 0.4 Symbol 1.3 1.11.2 1.5 1.2 1.35 µm CMOS technology with typical process under 85˚C.10. consumption for the sense amplifier is 59.10.11. The simulated waveforms for write operation are shown in Figure 5. 5.4 1.1 1000m 900m Voltages (lin) 800m 11 ns 700m 600m 500m 400m 300m 200m 100m 0 -100m 120n 122n 124n 126n 128n 130n Time (lin) (TIME) 132n 134n 136n 138n 140n Figure 5.The simulated waveforms for read operation are shown in Fig.5 1. Read operation.5 µW per bit at 50 MHz.4 µW per bit at 50 MHz.1 1000m 900m Voltages (lin) 800m 700m 600m 500m 400m 300m 200m 100m 0 120n Time (lin) (TIME) 140n Figure 5. Write operation. 78 Chapter 5 . WRITE TEST Wave D0:A0:v(clout0) D0:A0:v(clout1) D0:A0:v(prech) D0:A0:v(wwl) Symbol 1.3 1.

low power supply voltage. The butterfly consists mainly of adders/subtractors.27×0. the I/O drivers have large capacitance load.12. 5.2. The SRAM. To reduce the short-circuit current is an important issue for the I/O drivers. etc. which runs at 1. IMPLEMENTATION OF FFT PROCESSORS 79 . Avoiding switching of the PMOS and NMOS simultaneously is a efficient technique for reducing of short-circuit current. 5.3. 5. Adder design The adder is one of the fundamental arithmetic components.3.1.1. but needs to be sufficient long to guarantee the read operation under process variation.2.3. Hence we discuss the implementation of adder/subtractor first and later the complete butterfly. 5.5 V and 50 MHz. Butterfly The butterfly is one of the characteristic building blocks in an FFT processor. Figure 5.The pulse width for the word-line signal and sense enable signal must be selected carefully. A module generator for SRAM is under development.2. There are many adder structures [47].33 mm2). Short pulse width dissipates less power.6 mW. consumes 2. Implementation A 256 w×26 b SRAM with separate I/O (Fig. SRAM macro (1. For the periphery circuits.12) has been implemented with above discussed techniques.

It will be discussed later in the complex multiplier design. Figure 5. for instance. However. it is suitable to select RCA for the butterfly. The RCA is slowest among the different implementations.The ripple-carry adder (RCA) is constructed with full-adder. When the speed is important. A 3-bit RCA layout with sign-extension is shown in Fig.0 µm2 . other carry accelerating adder structures are attractive. 5.13. We select the Brent-Kung adder for high speed adder implementation. 80 Chapter 5 Size: 18.14. 5. An CMOS full-adder layout is shown in Fig. If the wordlength is small. RCA implementation We have developed a program that generates the schematic and the layout for RCA. The Brent-Kung adder has a short delay and a regular structure.7 × 15.13. the RCA cannot meet the speed requirement. it is simple and consumes small amount of power for 16-bit adder implementations. In these cases. CMOS full-adder. the vector merge adder in the multiplier.

The computation for a radix-4 butterfly is divided into two steps.2. 5. Layout of 3-bit ripple-carry adder. we proposed a carry-save based butterfly [36]. To reduce the complexity. The delay is changed from two additions/subtractions to IMPLEMENTATION OF FFT PROCESSORS 81 .15.2. x(0) x(1) x(2) x(3) j X(0) X(1) X(2) X(3) Figure 5. the power consumption [39] [60]. arithmetic workload. The second step is a normal addition. and.15. hence. A conventional butterfly is often based on an isomorphic mapping of the signal-flow graph. In practice. the commonly used high radix butterflies are radix-4 and radix-8 butterflies.Sign extension Full-adder Full-adder Full-adder Figure 5. Efficient design of high-radix butterflies is therefore important.3. Signal-flow graph for 4-point DFT. Signal-flow graph for a radix-4 butterfly is shown in Fig. The first step is a 4-2 compression with addition/subtraction controlled inputs.14. High radix butterfly architecture The use of higher radix tends to reduce the memory access rate. 5. Butterflies with higher radix than radix-8 are often decomposed to lower radix butterflies. The butterfly requires 8 complex adders/subtractors and a delay of 2 additions/subtractions.

82 Chapter 5 .3. In the figure. A carry-save radix-4 butterfly (wordlength is 15 for real and imaginary part of input) was described in VHDL-code and synthesized using AMS 0. only real additions are shown and it appears more complicated than that of Fig.15. 25˚C 12.one addition and one 4-2 compression.16.59 ns Table 5. Parallel radix-4 butterfly. The implementation of a radix-4 butterfly with carry-save adders is shown in Fig. This implementation reduces the hardware since a fast adder is more complex than a 4-2 compressor.16.2)-counter Fast Adder Figure 5.8 µm standard CMOS technology [36]. 5. where additions are complex additions. 5. The radix-2/4 split-radix butterfly and radix-8 butterfly can also be implemented using the carry-save adder. Architecture Conventional Carry-save Area 10504.3 V. xre(0) xim(0) xre(1) xim(1) xre(2) xim(2) xre(3) xim(3) Xre(0) Xim(0) Inverter Xre(1) Xim(1) Xre(2) Xim(2) Xre(3) Xim(3) (4.48 Delay@3. The synthesis result shows that the area saving can be up to 21% for the carry-save radix-4 butterfly. The delay can be reduced with 22%. Performance comparison for two radix-4 butterflies.16 8266. Carry-save radix-4 butterfly implementation.32 ns 9. The total delay is also reduced since the delay for a 4-2 compressor is smaller.

A straightforward implementation (see Fig.3. the complex multiplier is slowest part in the data path. 5. Hence the complex multipliers are the key components in FFT design. IMPLEMENTATION OF FFT PROCESSORS 83 .3.5.1.3.and post additions (see Fig. 5. the throughput can be increased while the latency retains the same. Distributed Arithmetic Distributed arithmetic (DA) uses precomputed partial sums for an efficient computation of inner products of a constant vector and a variable vector [14].17 (b)). Realization of a complex multiplication. From the speed point of view. A more efficient way to reduce the cost of multiplication is to utilize distributed arithmetic [35] [58]. 5. This has been reduced to less than 50% of the total power consumption due to the increase of power consumption for the memories as the transform length of FFT increases [37]. complex multipliers stand for about 70% to 80% of the total power consumption in the previous FFT implementations [39] [60].3. With pipelining.17. However. Complex Multiplier There is no question that complex multipliers are one of the critical units in FFT processors. one addition and one subtraction.17 (a)) of a complex multiplication requires four real multiplications. the number of multiplications can be reduced to three by using transformation at the cost of extra pre. XR XR CR CI XI CI XR CR XI CI+CR XI CR CI-CR XI XR ZR ZI ZR ZI (a) (b) Figure 5. From the power consumption point of view.

By interchanging the order of the two summations we get W –1 Z R = – C R x R0 + C I x I 0 + ∑ ( C R x Rk – C I x Ik )2 – k k=1 which can be written as Wd – 1 d (5..e. The complex coefficient.3) k=1 where F k ( x Rk. x Ik ) = C R x Ik + C x Rk . respectively. x Ik )2 – k (5. i. i.2) Z R = – F k ( x R0. (5. x Ik ) = C R x Rk – C I x Ik . We will realize the real and imaginary parts separately. x I 0 ) + ∑ F k ( x Rk. we consider only the first inner product in Eq.1).. Since Fk can take on only four values. The inner product ZR can be rewritten Wd – 1 Z R = C R – x R0 + ∑ Wd – 1 x Rk 2 – k –C I – x I 0 + ∑ x Ik 2 – k k=1 k=1 where xRk and xIk are the kth bits in the real and imaginary parts. the kth bits in XR and XI. I 84 Chapter 5 . The data is scaled so that |ZR + jZI | is less than 1.Let CR + jCI and XR + jXI be two complex numbers of which CR + jCI is the coefficient and XR + jXI is a variable complex number. In the case of a complex multiplication. it can be computed and stored in a look-up table. the real part. a complex multiplication can be considered as two inner products of two vectors of length two.1) Hence. In the same way we get the corresponding binary function for the imaginary part is G k ( x Rk.e. Fk is a function of two binary variables. we have Z R + jZ I = ( C R + jC I ) ( X R + jX I ) = (CR X R – CI X I ) + j(CR X I + CI X R) (5. For sake of simplicity. CR + jCI is assumed to be fixed and two’s-complement representation is used for both the coefficient and data.

3.3.7) IMPLEMENTATION OF FFT PROCESSORS 85 . x Ii )2 + F ( 0. 0 )2    i=1  –W d  –i–1 + j  – G ( x R0.4) i=1 where b is the inverse of bit b . x Ri ) = C I ( x Ri – x Ri ) + C R ( x Ii – x Ii ) (5.5) The functions F and G can be expressed as follows. Wd – 1 x = ( x 0 – x 0 )2 – 1 + ∑ ( x i – x i )2 –i–1 –2 –W d (5. Without any loss of generality.Further reduction of the look-up table can be done by Offset Binary Coding [14]. Then. 0 )2    i=1 Wd – 1 Wd – 1 Wd – 1 Wd – 1 Wd – 1 Wd – 1 (5. and XI is Wd. CI.6) (5. we assume that the magnitudes of CR. XR. x I 0 ) + ∑ G ( x Ri. x Ri ) = C R ( x Ri – x Ri ) – C I ( x Ii – x Ii ) G ( x Ri. The wordlength of XR. and XI all are less than 1. 5. x Ii )2 + G ( 0. the complex multiplication can be written ZR + jZI = (CR XR– CIXI) + j(CRXI + CIXR)  –W d  –1 –i–1 =  C R ( x R0 – x R0 )2 + ∑ C R ( x Ri – x Ri )2 – – CR2    i=1  –W d  –1 –i–1 –  – C I ( x I 0 – x I 0 )2 + ∑ C I ( x Ii – x Ii )2 – CI 2    i=1  –W d  –1 –i–1 + j  – C R ( x I 0 – x I 0 )2 + ∑ C R ( x Ii – x Ii )2 – CR2    i=1  –W d  –1 –i–1 + j  – C I ( x R0 – x R0 )2 + ∑ C I ( x Ri – x Ri )2 – CI 2    i=1  –W d  –i–1 =  – F ( x R0. x I 0 ) + ∑ F ( x Ri. F ( x Ri. Offset Binary Coding The offset binary coding can be applied to distributed arithmetic by using the following expression for the data.2.

In Eq. i.e. The accumulators. XR XI -(CI+CR) -(CR-CI) Partial product Generation F Partial product Generation G Accumulator ZR Accumulator ZI Figure 5. function Fk (Gk). 5.e. are the same as in a real multiplication. Hence. which adds the partial products.. 86 Chapter 5 .7).4.18.18.. The complex multiplier with distributed arithmetic is illustrated in Fig. The partial product generation is only slightly more complicated than for a real multiplier. xRi 0 0 1 1 xIi 0 1 0 1 F(xRi. Hence the partial product. for each bit is of the form CR ± CI. the complexity of the complex multiplier in term of chip area corresponds to approximately two real multipliers. All possible partial products are tabulated in following table. –(CR – CI) and –(CR + CI).xIi) –(CR–CI) –(CR+CI) (CR+CI) (CR–CI) G(xRi. i. (5. Partial product generation. Block schematic for complex multiplier.xIi) –(CR+CI) (CR–CI) –(CR–CI) (CR+CI) Table 5. only two coefficients. are sufficient to store since (CR – CI) and (CR + CI) easily can be generated from the two former coefficients by inverting all bits and adding 1 in the least-significant position. Obviously. the factor ( x i – x i ) ⁄ 2 is either 1 or –1.

Hence the delay for the partial product generation is reduced. Circuits for partial product generation. or digit-serial. a bit-serial. Although a bit-serial.4. the selection of structure is important. carrysave and tree structures. To achieve high throughput.18. The usual structures for accumulators are: array. Implementation Considerations Multipliers can be divided into three types: bit-parallel.3. -(CR-CI)i -(CR+CI)i XRi xor XIi XRi 0 1 0 1 0 1 XIi -(CR-CI)i -(CR+CI)i 0 PPi (a) PPi (b) 1 XRi Figure 5. IMPLEMENTATION OF FFT PROCESSORS 87 . i. multiplier has less chip area than that of bit-parallel multiplier. bit-serial. We select the tree structure for accumulator. which increases activity factor for the local clock. It is also suitable for our low power strategy. To meet the speed requirement. we therefore select a bitparallel multiplier.19. designing a faster circuit and using voltage scaling to reduce the power consumption. The tree structure is the fastest. can be realized with a 2:1 multiplexer and an XOR gate as shown in Fig.. An alternative is to use a 4:1 multiplexer circuit. which correspond to the partial product generation in a real (imaginary) datapath.19. it requires a higherspeed clock than that of bit-parallel one for the same throughput. and digit-serial.3. 5. or digit-serial multiplier often needs several parallel units. 5. For the accumulator design. The benefit of this implementation is that the delay is reduced since the generation of select signal (XRi⊕XIi) is not required.e.3. Complex multiplier with DA is shown in Fig.5. The selection of pre-computed values from Table 5.

we can construct the tree with only three building blocks: body.20. It is also easy for the routing planning in the accumulator design. The overturned-stairs tree [40].. 5. Since there are only three feed-throughs between body of height j – 1 to body of height j in overturned-stairs tree. The Wallace tree has complex wiring and is therefore difficult to optimize and the layout becomes irregular. lowest height. 88 Chapter 5 . The construction of overturned-stairs tree is illustrated in Fig. The trees of height 1 to 3 are shown in Fig. A root (CSA) is connected to the outputs of the connector to form the whole tree of height j + 1 . The connector connects three feed-throughs from the body of height j – 1 and two outputs from the branch of height j – 2 to construct the body of height j. The body can be constructed repeatedly according to Fig. is chosen.e. • The tree height is low. When the height is more than three. 5.e.20. a branch of height j – 2 . multi-operand tree is the Wallace tree. The first-order overturned-stairs adder tree. root. and connector.The fastest. The body of height j ( j > 2 ) consists of a body of height j – 1 . The main features of overturned-stairs adder tree are • Recursive structure that yields regular routing and simplifies the design of the layout generator. There are several types of overturned-stairs adder trees [40]. which has a regular layout and the same height as the Wallace tree when the data wordlength is less than 19. O( p N ). 5. The branch of height j – 2 is formed by using j – 2 carry-save adders (CSAs) on “top of each other” with proper interconnections [40]. i. The overturned-stairs adder tree was suggested by Mou and Jutand [40]. i. and a connector. which has the same speed bound to that of Wallace tree when the number of the operands is less than 19. where p depends on the type of overturned-stairs tree.20. is used in the design of the complex multipliers.

5 V. IMPLEMENTATION OF FFT PROCESSORS Height j-2 Body height j-1 Branch 89 .CSA CSA CSA CSA CSA CSA CSA CSA CSA CSA Tree 1 n CSAs Root Body 2 CSA Tree 2 Tree 3 Root CSA Branch n Connector CSA CSA CSA Connector Body of height j Root Tree of height j+1 Figure 5. However. The choice of full-adder has large impact on the performance of accumulator. Overturned-stairs tree. it is not competitive from a power consumption point of view. The first type of full-adder is a conventional static CMOS adder. the conventional static CMOS full adder. with large stack height.20. recently a large number of new adder cells has been proposed [51] and they should be evaluated in the future work. We compared several full-adders and found the most suitable for our implementation. The full-adder is essential for the accumulator. When the voltage is as low as 1. is too slow. Furthermore.

Conventional static CMOS full-adder. However. A third type of full adder is Reusens full-adder [50]. This fulladder is fast and compact but requires buffers for the outputs.21. The buffer insertion is usually considered as a drawback since it introduces delay and increases the power consumption. A second type of full-adder is a full-adder with transmission gates (TG). x y z S C Figure 5. Transmission gates full-adder.22.y z x z y x y y x z C S Figure 5. This full-adder realizes the XOR-gate with transmission gates and both the power consumption and chip area are smaller than that of a conventional static CMOS full-adder. in 90 Chapter 5 .

5 V and 72. A handcrafted accumulator using overturned-stairs tree with 0. There is no direct path from VDD or VSS in this full adder.35 µm technology. which tends to reduce the power consumption. The worst case delay is 26 ns at 1.4.5 V and 25 ˚C with SPICE simulation.5 V Power (µW)@1. both run at 25 MHz.6 mW at 3.5 V 24 16 16 4. The generated structural VHDL-code can be validated by applying random test vectors in a testbench. Adder type Static CMOS TG Reusens Transistor count Delay (ns)@1. x y z y S C Figure 5. the accumulator can be implemented.3 2.1 Table 5. The software can handle different wordlengths for the data and coefficient.35 µm standard CMOS technology is shown in Fig.3.5 3.the accumulator the buffer insertion is necessary anyway in order to drive the long interconnections. The power consumption for this complex multiplier is 15 mW at 1. Comparison of full-adders in 0. IMPLEMENTATION OF FFT PROCESSORS 91 . A software for the automatic generation of overturned-stairs adder trees has been developed. 5. Reusens full-adder.23.5 2.2 3.2 4.3 V.24. Accumulator Implementation After the selection of structure and adder cell. 5.5.3.

Brent-Kung Adder Implementation The Brent-Kung adder is used as the vector merge adder. 5.26 Figure 5.25. Accumulator layout.Figure 5.3 × 479. The layout of a 32-bit Brent-Kung adder is shown in Fig. A program for schematic generation for Brent-Kung adder has been developed and the layout generator is under construction. The generated schematic of a 32-bit Brent-Kung adder is illustrated in Fig. 92 Chapter 5 Size: 704. 5.5 µm2 .3.25.5. The BrentKung adder belongs to the prefix adder. Block diagram for a 32-bit Brent-Kung adder.24. which uses the propagation and generation property of carry bit in a full-adder to accelerate the carry propagation.3. 5.

However.35 µm CMOS technology.3 V at 25 MHz in a standard 0. IMPLEMENTATION OF FFT PROCESSORS Size: 0. and it increases the routing complexity as well. An observation is that a large portion of the total power are consumed by the computation of complex multiplications in the FFT processor. consumes 290 mW@3. Final FFT Processor Design After the design of the components and the selection of FFT architecture.3 V. the power consumption for the computation of the complex multiplications is still more than 210 mW. hence. 25 MHz. and. 32-bit Brent-Kung adder. Using high radix butterflies can reduce the number of complex multiplications outsides the butterflies.16 mm2 93 . Overcoming this two drawbacks is the key for using high radix butterflies.25 × 0. For a 1024-point FFT processor. we apply the meet-in-the-middle methodology to combine the components into the complete implementation. it requires four complex multipliers.26. We have implement a complex multiplier that consumes 72.Figure 5.6 mW with power supply voltage of 3. Even with bypass techniques for trivial complex multiplications. 5. it is not common to use high radix butterfly for VLSI implementations due to two main drawbacks: it increases the number of complex multiplications within the butterflies if the radix is larger than 4. Hence the reduction of the number of complex multiplication is vital.4.

which is much less than a 17 × 13 bit complex multiplier (72. 25 MHz. We have implemented a 32-bit Brent-Kung adder (real) that consumes 1. the radix-4 FFT and SRFFT algorithm is more efficient than that of radix-2 FFT algorithm in term of number of multiplications. Real Input cos(p/8)+sin(p/8) cos(p/8) Imaginary Input cos(p/8)-sin(p/8) Real Input (p = π) 1 Imaginary Input Figure 5. 5. For 16-point DFT.e.27. This is because the adder has less hardware and has much fewer glitches. i. Moreover..3 V. adders consume much less power than the multipliers with the same wordlength. and W 16 . both radix-2 and split-radix algorithm require three multipliers (two 2 1 multipliers with W 16 and one multiplier with W 16 ) while the radix-4 algorithm requires only two multipliers (one multiplier 94 Chapter 5 .3 V. The multiplications with W 16 and W 16 can share coefficients since cos ( π ⁄ 8 ) = sin ( π ⁄ 2 – π ⁄ 8 ) = sin ( 3π ⁄ 8 ) and sin ( π ⁄ 8 ) = cos ( π ⁄ 2 – π ⁄ 8 ) = cos ( 3π ⁄ 8 ) . multiplications with W 16 . We use constant multipliers in the design of 16-point butterfly in order to reduce the number of complex multipliers.6 mW@3.27. The 1 implementation of a multiplication with W 16 is illustrated in Fig. 2 3 1 3 W 16 . 25 MHz). We therefore can use constant multipliers. The selection of FFT algorithm affects the number and positions of constant multipliers. Complex multiplication with W 16 .As is well-known. which reduce the complexity. Therefore it is efficient to replace the complex multipliers with constant multiplier when possible.5 mW@3. there are three type non-trivial complex 1 multiplications within the butterfly. For a 16-point FFT butterfly.

only N – 1 words for an N-point FFT. Hence the 16point butterfly with radix-4 is more efficient and is selected for our implementation.6.with W 16 and one multiplier with W 16 / W 16 ). e. of Comp. Algorithm No. the most memory efficient architectures are the architectures with single-path feedback since it gives the minimum data memory. since the radix-2 butterfly have the simplest routing. 25 MHz. there is only two complex multipliers and two constant multipliers.g. Mult. The number of non-trivial complex multiplications required for 1024-point FFT for different algorithms is shown in the following table. The number of non-trivial complex multiplications for different FFT architectures. In the 1024-point FFT processor.. Hence. The number of non-trivial complex multiplications can be reduced to 1776.3 V. This is less than the theoretical saving of 35% (the ratio for the number of complex multiplications) due to the computation for complex multiplications within the 16-point butterfly. which consumes less than 160 mW. As mentioned in the resource analysis. a power saving of more than 20% for the computation of complex multiplications can be achieved. IMPLEMENTATION OF FFT PROCESSORS 95 . R2FFT 3586 R4FFT 2732 SRFFT 2390 Our approach 1776 2 1 3 Table 5. the power consumption for complex multiplication within 16-point butterfly is reduced to 10 mW@3. By replacing the complex multiplications with constant multiplications within the 16-point butterfly. The total number of complex multipliers is reduced to two for a 1024-point FFT due to the use of 16-point butterflies. To cope with the complex routing associated with high radix butterflies it is better to divide the 16-point butterfly into four stages.

The 1024-point FFT processor can also run at 1. Mem Butterfly Element Mem Butterfly Element Mem Butterfly Element Mem Butterfly Element Intput Output Constant multiplier Figure 5. We proposed a wordlength optimization method for the pipelined FFT architectures. The butterflies consumes about 30 mW. The power consumption for the data memory is estimated to 300 mW (the power consumption for 128 words or higher memory is given by the vendor and the smaller memory is estimated through linear approximation down to 32 words). for instance. The total power consumption for the three main subsystems is 490 mW.28. etc.35 µm standard CMOS process. This method gave a memory saving up to 14%. the power consumption for the FFT processor is therefore estimated to about 550 mW at 3. 16-point Butterfly. 5. Summary In this chapter. the clock buffers. the computation units for butterfly operations and complex multiplications with 37%.5 V.5 V for 0. we have discussed the implementation of a 1024point FFT processor. communication buses. The total power consumption of the 1024-point FFT processor is less than 200 mW at 1.28.The radix-4 algorithm can be decomposed into radix-2 algorithm as done in [24]. 96 Chapter 5 .5. 5. A resource analysis gave a start point for the implementation. By assuming the 15% of overhead.3 V [38]. Hence the mapping of 16-point butterfly can be done with four pipelined radix-2 butterflies.. The 16-point butterfly is illustrated in Fig. and others with 8%. Each butterfly has its own feedback memory. The memories contribute 55% of the total power consumption. which gives more power saving.

which is efficient in term of delay and area.e.We discussed the implementation of subblocks. The use of proposed 16-point butterfly reduces the number of complex multiplications and retains the minimum memory requirement. which is power efficient. i. All those subblocks can be operate at low power supply voltage and suitable for the voltage scaling. We constructed a complex multiplier using DA and overturned-stairs tree. We proposed the high radix butterflies using carry-save technique. IMPLEMENTATION OF FFT PROCESSORS 97 . Finally. we discussed the implementation of FFT processor using a 16-point butterfly. memories. which is area efficient. and complex multipliers.. butterflies.

98 Chapter 5 .

In some cases. we use distributed arithmetic to reduce the hardware complexity. The wordlengths in each stage of the pipelined FFT processor may be different and therefore optimized. it is important to reduce the hardware complexity. The FFT algorithm with less multiplications and additions is attractive. The selection of low power strategy affects FFT hardware design. A simulation-based method has been developed for wordlength optimization of the pipelined FFT architectures.6 CONCLUSIONS This thesis discussed the essential parts of low power pipelined FFT processor design. In the complex multiplier design. The selection of FFT algorithm is an important start point for the FFT processor implementation. The supply voltage scaling is an efficient low power technique and was used for the FFT processor design. This technique is generally applicable for high-radix butterflies. For the detail design.The proposed highradix butterflies reduce both the area and the delay with more than 20%. we proposed that a carry-save technique was used for implementation of the butterflies. This also results in a power saving of 14% for the memories. the wordlength optimization can reduce the size of memories up to 14% compared with using uniform wordlength in each stage. After the selection of the FFT algorithm and the low power strategy. We select overturned-stairs tree 99 . The reduction of wordlength also reduces the power consumption in the complex multipliers and the butterflies proportionally.

is therefore used.5 V power supply voltage. Using proposed 16-point butterfly.5 V. the data memory size is reduced with 10%. With optimized word length. and others with 8%.35 µm standard CMOS process. Simulation shows that the complex multiplier operate up to 30 MHz at 1.5 V for 0. which indicates that the optimization of the memory structure could be important for the implementation of low power FFT processors. the total power consumption of the 1024-point pipelined FFT processor with a continuous throughput of 25 Msamples/s and equivalent wordlength of 12-bit is less than 200 mW at 1. 100 Chapter 6 . the computation units for butterfly operations and complex multiplications with 37%. With all those efforts. The memories consume the most significant part of the total power consumption. The overturned-stairs tree has a regular structure and the same performance as the Wallace tree when the data wordlength is less than 19. The memories contribute to 55% of the total power consumption. In the SRAM design. The power consumption is 15 mW at 25 MHz with a 1.for the realization of complex multiplier. The sense amplifier can be operated at low power supply voltage. the number of complex multiplications can be reduced and results a power saving more than 20% for complex multiplications. we modified an STC Dflip-flop to form a two stage sense amplifier.
