Thesis No. 1030
STUDIES ON IMPLEMENTATION OF
LOW POWER FFT PROCESSORS
Weidong Li
LiU-Tek-Lic-2003:29
Department of Electrical Engineering
Linköpings universitet, SE-581 83 Linköping, Sweden
Linköping, June 2003
Studies on Implementation of
Low Power FFT Processors
Copyright © 2003 Weidong Li
Department of Electrical Engineering
Linköpings universitet
SE-581 83 Linköping
Sweden
ISBN 91-7373-692-9    ISSN 0280-7971
To the memory of my father.
Abstract
In the last decade, interest in high-speed wireless and cable-based communication has increased. Orthogonal Frequency Division Multiplexing (OFDM) is a strong candidate and has been suggested or standardized for those communication systems. One key component in OFDM-based systems is the FFT processor, which performs the modulation/demodulation efficiently.
There are many FFT architectures. Among them, the pipeline architectures are suitable for real-time communication systems. This thesis presents the implementation of pipeline FFT processors that have low power consumption.
We select the meet-in-the-middle design methodology for the implementation of FFT processors. A resource analysis for the pipeline architectures is presented. This resource analysis determines the number of memories, butterflies, and complex multipliers required to meet the specification.
We present a wordlength optimization method for the pipeline architectures. We show that the high-radix butterfly can be efficiently implemented with the carry-save technique, which reduces both the hardware complexity and the delay. We also present an efficient implementation of a complex multiplier using distributed arithmetic (DA). The implementation of low-voltage memories is also discussed.
Finally, we present a 16-point butterfly using constant multipliers that reduces the total number of complex multiplications. The FFT processor using the 16-point butterflies is a competitive candidate for low power applications.
Acknowledgement
I would like to thank my supervisor, Professor Lars Wanhammar, for his support and guidance of this research. I would also like to thank the whole Electronics Systems group at Linköping University for their help in discussions on research as well as other matters.
Lastly, I would like to express my gratitude to Oscar Gustafsson,
Henrik Ohlsson, and Per Löwenberg for the proofreading.
Finally, and most importantly, I would like to thank my family,
relatives, and friends, especially A Phung, for their boundless
support and encouragement.
This work was financially supported by the Swedish Foundation for Strategic Research (SSF) under the INTELECT program.
Table of Contents

1. INTRODUCTION
   1.1. DFT and FFT
   1.2. OFDM Basics
   1.3. Power Consumption
   1.4. Thesis Outline
   1.5. Contributions
2. FFT ALGORITHMS
   2.1. Cooley-Tukey FFT Algorithms
      2.1.1. Eight-Point DFT
      2.1.2. Basic Formula
      2.1.3. Generalized Formula
   2.2. Sande-Tukey FFT Algorithms
   2.3. Prime Factor FFT Algorithms
   2.4. Other FFT Algorithms
      2.4.1. Split-Radix FFT Algorithm
      2.4.2. Winograd Fourier Transform Algorithm
   2.5. Performance Comparison
      2.5.1. Multiplication Complexity
      2.5.2. Addition Complexity
   2.6. Other Issues
      2.6.1. Scaling and Rounding Issue
      2.6.2. IDFT Implementation
   2.7. Summary
3. LOW POWER TECHNIQUES
   3.1. Power Dissipation Sources
      3.1.1. Short-Circuit Power
      3.1.2. Leakage Power
      3.1.3. Switching Power
   3.2. Low Power Techniques
      3.2.1. System Level
      3.2.2. Algorithm Level
      3.2.3. Architecture Level
      3.2.4. Logic Level
      3.2.5. Circuit Level
   3.3. Low Power Guidelines
   3.4. Summary
4. FFT ARCHITECTURES
   4.1. General-Purpose Programmable DSP Processors
   4.2. Programmable FFT-Specific Processors
   4.3. Algorithm-Specific Processors
      4.3.1. Radix-2 Multipath Delay Commutator
      4.3.2. Radix-2 Single-Path Delay Feedback
      4.3.3. Radix-4 Multipath Delay Commutator
      4.3.4. Radix-4 Single-Path Delay Commutator
      4.3.5. Radix-4 Single-Path Delay Feedback
      4.3.6. Radix-2^2 Single-Path Delay Commutator
   4.4. Summary
5. IMPLEMENTATION OF FFT PROCESSORS
   5.1. Design Method
   5.2. High-Level Modeling of an FFT Processor
      5.2.1. Resource Analysis
      5.2.2. Validation of the High-Level Model
      5.2.3. Wordlength Optimization
   5.3. Subsystems
      5.3.1. Memory
      5.3.2. Butterfly
      5.3.3. Complex Multiplier
   5.4. Final FFT Processor Design
   5.5. Summary
6. CONCLUSIONS
7. REFERENCES
1
INTRODUCTION
The Fast Fourier Transform (FFT) is one of the most used algorithms in digital signal processing. The FFT, which facilitates the efficient transformation between the time domain and the frequency domain for a sampled signal, is used in many applications, e.g., radar, communication, sonar, and speech signal processing.
In the last decade, interest in high-speed wireless and cable-based communication has increased. The Orthogonal Frequency Division Multiplexing (OFDM) technique, which is a special Multicarrier Modulation (MCM) method, has been demonstrated to be an efficient and reliable approach for high-speed data transmission. The immunity to multipath fading channels and the capability for parallel signal processing make it a promising candidate for next-generation wideband communication systems. The modulation and demodulation of OFDM-based communication systems can be efficiently implemented with an FFT, which has made the FFT valuable for those communication systems. OFDM-based communication systems have high performance requirements on both throughput and power consumption. These requirements necessitate an application-specific integrated circuit (ASIC) solution for the FFT implementation. This thesis addresses the problem of designing efficient application-specific FFT processors for OFDM-based wideband communication systems.
In this chapter, we give a short review of the DFT and FFT. Then introductions to OFDM and power consumption are presented. Finally, the outline of the thesis is described.
1.1. DFT and FFT

The Discrete Fourier Transform (DFT) for an N-point data sequence $\{x(k)\}$, $k = 0, 1, \ldots, N-1$, is defined as

$$X(n) = \sum_{k=0}^{N-1} x(k)\, W_N^{nk} \qquad (1.1)$$

for $n = 0, 1, \ldots, N-1$, where $W_N = e^{-j2\pi/N}$ is the primitive N-th root of unity. The number N is also called the transform length. The indices $k$ and $n$ are referred to as the time-domain and frequency-domain index, respectively.

The inverse DFT (IDFT) for a data sequence $\{X(n)\}$ ($n = 0, 1, \ldots, N-1$) is

$$x(k) = \frac{1}{N} \sum_{n=0}^{N-1} X(n)\, W_N^{-nk} \qquad (1.2)$$

for $k = 0, 1, \ldots, N-1$.

Direct computation of an N-point DFT according to Eq. (1.1) requires $N(N-1)$ complex additions and $N(N-1)$ complex multiplications. The complexity for computing an N-point DFT is therefore $O(N^2)$. With the contribution from Cooley and Tukey [13], the complexity for computation of an N-point DFT can be reduced to $O(N \log N)$. Cooley and Tukey's approach and later developed algorithms, which reduce the complexity of the DFT computation, are called fast Fourier transform (FFT) algorithms.

Among the FFT algorithms, two are especially noteworthy. One is the split-radix algorithm, which treats the even part and the odd part with different radices and was published in 1984. Another is the Winograd Fourier Transform Algorithm (WFTA), which requires the least known number of multiplications among practical algorithms for moderate-length DFTs and was published in 1976.

Many implementation approaches for the FFT have been proposed since the discovery of FFT algorithms. Due to the high computation workload and intensive memory access, the implementation of FFT algorithms is still a challenging task.
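The reduction from $O(N^2)$ to $O(N \log N)$ is easy to illustrate numerically. The sketch below (my own illustration, not part of the thesis) evaluates Eq. (1.1) directly and compares the result with a recursive radix-2 decimation-in-time FFT; both must agree to within rounding error:

```python
import cmath

def dft_direct(x):
    """Direct evaluation of Eq. (1.1): O(N^2) operations."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)          # primitive N-th root of unity
    return [sum(x[k] * W ** (n * k) for k in range(N)) for n in range(N)]

def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT: O(N log N), N a power of two."""
    N = len(x)
    if N == 1:
        return list(x)
    Xe = fft_radix2(x[0::2])                    # DFT of even-indexed samples
    Xo = fft_radix2(x[1::2])                    # DFT of odd-indexed samples
    W = [cmath.exp(-2j * cmath.pi * n / N) for n in range(N // 2)]
    return ([Xe[n] + W[n] * Xo[n] for n in range(N // 2)] +
            [Xe[n] - W[n] * Xo[n] for n in range(N // 2)])

x = [complex(k, 0) for k in range(8)]
assert all(abs(a - b) < 1e-9 for a, b in zip(dft_direct(x), fft_radix2(x)))
```

The recursion already uses the even/odd splitting that Chapter 2 derives in detail.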
1.2. OFDM Basics

OFDM is a special MCM technique. The idea of MCM is to divide the transmission bandwidth into many narrow subchannels (subcarriers), which transmit data in parallel [5].

The principle of MCM is shown in Fig. 1.1. The high-rate data stream at $M f_{sym}$ bits/s is grouped into blocks with M bits per block at a rate of $f_{sym}$. A block is called a symbol. A symbol allocates $m_k$ of its M bits for modulation of a carrier k at frequency $f_{c,k}$, and in total M bits for modulation of the N carriers. This results in N subchannels, which send symbols at a rate of $f_{sym}$.

In conventional MCM, the N subchannels are non-overlapping. Each subchannel has its own modulator and demodulator. This leads to inefficient usage of the spectrum and excessive hardware requirements.

The OFDM technique can overcome those drawbacks. With OFDM, the spectrum can be used more efficiently since overlapping of subchannels is allowed. The overlapping does not cause interference between subchannels due to the orthogonal modulation.

Figure 1.1. A multicarrier modulation system.
The orthogonality can be explained in the frequency domain. The symbol rate is $f_{sym}$, i.e., each symbol is sent during a symbol time T (equal to $1/f_{sym}$). The frequency spacing between adjacent subchannels is set to 1/T Hz, so the carrier signals can be expressed as follows:

$$f_k = f_0 + \frac{k}{T}, \qquad 0 \le k \le N-1 \qquad (1.3)$$

$$g_k(t) = \begin{cases} e^{j2\pi f_k t}, & 0 \le t < T \\ 0, & \text{otherwise} \end{cases} \qquad (1.4)$$

where $f_0$ is the system base frequency and $g_k$ is the signal for carrier k at frequency $f_k$. If the frequency of subcarrier k and the base function are chosen according to Eq. (1.3) and Eq. (1.4), its spectrum is a sinc function with zero points at $f_0 + l/T$ (l an integer) except at $l = k$, i.e., at $f_k$. This means that there is no interference to the other subchannels with the selected functions.

This orthogonality can also be found in the time domain. For two carrier signals, $g_k$ and $g_l$, the integral over a symbol time is

$$\int_0^T g_k(t)\, g_l^*(t)\, dt = \begin{cases} T, & k = l \\ 0, & \text{otherwise} \end{cases} \qquad (1.5)$$

which shows that the two carriers are orthogonal.
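Eq. (1.5) can be checked numerically. The sketch below (an illustration added here, with assumed values T = 1 and 64 sample points) approximates the integral by a Riemann sum; note that the base frequency $f_0$ cancels in the product $g_k g_l^*$:

```python
import cmath

def carrier_inner_product(k, l, T=1.0, samples=64):
    """Riemann-sum approximation of Eq. (1.5): the integral over [0, T)
    of g_k(t) * conj(g_l(t)) dt, where g_k(t) = exp(j*2*pi*(f0 + k/T)*t).
    The product reduces to exp(j*2*pi*(k - l)*t/T); f0 drops out."""
    dt = T / samples
    total = 0j
    for m in range(samples):
        t = m * dt
        total += cmath.exp(2j * cmath.pi * (k - l) * t / T) * dt
    return total

# Energy T on the diagonal, zero for distinct carriers (orthogonality)
assert abs(carrier_inner_product(3, 3) - 1.0) < 1e-9
assert abs(carrier_inner_product(3, 5)) < 1e-9
```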
Figure 1.2. Spectrum overlapping of subcarriers for OFDM.

OFDM overcomes the inefficient implementation of the modulator and demodulator of conventional MCM. From Fig. 1.1, the transmitted signal x(t) is the summation of the symbol transmissions in all subchannels, i.e.,
$$x(t) = \sum_{k=0}^{N-1} S_k\, g_k(t) = e^{j2\pi f_0 t} \sum_{k=0}^{N-1} S_k\, e^{j2\pi k t / T}$$

where $S_k$ is the modulated signal of the $m_k$ bits to be transmitted on subchannel k. This is an N-point Inverse Discrete Fourier Transform (IDFT) followed by baseband modulation (with $e^{j2\pi f_0 t}$). The IDFT can be computed efficiently with an Inverse Fast Fourier Transform (IFFT) algorithm. Hence the OFDM modulator can be implemented with one IFFT processor and one baseband modulator for all N subcarriers, instead of the N modulators of conventional MCM. In a similar way, the OFDM demodulator can be implemented more efficiently than that of conventional MCM. The simplified OFDM system based on the FFT is shown in Fig. 1.3.

In reality, interference between subchannels exists due to non-ideal channel characteristics and frequency offsets in transmitters and receivers. This interference affects the performance of the OFDM system. The frequency offset can, in most cases, be compensated.

Other issues, for instance intersymbol interference, can be reduced by techniques like the cyclic prefix.

Figure 1.3. OFDM system based on FFT.
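That the modulator reduces to an IDFT can be verified by sampling the baseband x(t) at t = mT/N. The sketch below (mine, not from the thesis; baseband only, i.e., $f_0 = 0$) compares the samples with a direct IDFT per Eq. (1.2); the QPSK symbol values are arbitrary:

```python
import cmath

def ofdm_samples(S):
    """Sample x(t) = sum_k S_k exp(j*2*pi*k*t/T) at t = m*T/N; T cancels."""
    N = len(S)
    return [sum(S[k] * cmath.exp(2j * cmath.pi * k * m / N) for k in range(N))
            for m in range(N)]

def idft(X):
    """Direct IDFT according to Eq. (1.2)."""
    N = len(X)
    return [sum(X[n] * cmath.exp(2j * cmath.pi * n * k / N) for n in range(N)) / N
            for k in range(N)]

S = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]   # hypothetical QPSK symbols on 4 carriers
samples = ofdm_samples(S)
# The time samples equal N times the IDFT of the symbol vector:
assert all(abs(a - len(S) * b) < 1e-9 for a, b in zip(samples, idft(S)))
```

In a real modulator the IDFT would of course be computed with an IFFT, which is the whole point of the architecture.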
1.3. Power Consumption
Moore's Law has predicted the exponential increase in circuit integration and clock frequency during the last three decades. Table 1.1 shows the expectation for the near future from the Semiconductor Industry Association.

The power consumption per device decreases as the feature size and the power supply voltage are reduced. However, according to the table below, the total power consumption increases or remains almost the same as the technology advances. This is due to the potential increase in workload.
During the last decade, power consumption has grown from a secondary constraint to one of the main constraints in the design of integrated circuits. In portable applications, low power consumption has long been the main constraint. Several other factors, for instance more functionality, higher workload, and longer operation time, contribute to making power consumption and energy efficiency even more critical. In high performance applications, where power consumption traditionally was a secondary
Year                                          2003       2004       2005       2010
-----------------------------------------------------------------------------------
Feature size                                107 nm      90 nm      80 nm      45 nm
ASIC usable Mtransistors/cm^2 (auto layout)    142        178        225        714
ASIC max functions/chip (Mtransistors/chip)    810       1020       1286       4081
Package cost (cents/pin), max/min        1.24/0.70  1.17/0.66  1.11/0.61  0.98/0.49
On-chip local clock (MHz)                     3088       3990       5173      11511
Supply Vdd (V), high performance               1.0        1.0        0.9        0.6
Power, high performance w. heatsink (W)        150        160        170        218
Power, battery (handheld) (W)                  2.8        3.2        3.2        3.0

Table 1.1. Technology roadmap from the International Technology Roadmap for Semiconductors (ITRS).
constraint, low power techniques gain more ground due to the steadily increasing cost of cooling and packaging. Besides those factors, the increasing power consumption has resulted in higher on-chip temperatures, which in turn reduce reliability. The delivery of the power supply to the chip has also raised many problems, such as power rail design, noise immunity, IR-drop, etc. Therefore, low power techniques are important for current and future integrated circuits.
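The quadratic dependence on the supply voltage is the key lever here. A quick sketch using the first-order switching-power model P = α C V_dd² f (discussed in Chapter 3); the parameter values below are purely illustrative assumptions, not figures from the thesis:

```python
def dynamic_power(alpha, C, Vdd, f):
    """First-order switching-power estimate: P = alpha * C * Vdd^2 * f (watts)."""
    return alpha * C * Vdd ** 2 * f

# Assumed values: activity factor 0.15, 1 nF total switched capacitance,
# 300 MHz clock. Scaling Vdd from 1.0 V down to 0.6 V cuts power by 64 %.
p_high = dynamic_power(0.15, 1e-9, 1.0, 300e6)   # 45 mW at 1.0 V
p_low = dynamic_power(0.15, 1e-9, 0.6, 300e6)
assert abs(p_low / p_high - 0.36) < 1e-12
```

This is why the table above can show falling supply voltages alongside roughly constant total power: the per-operation saving is spent on a higher workload.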
1.4. Thesis Outline
In this thesis we summarize some implementation aspects of a low power FFT processor for an OFDM communication system. The system specification for the FFT processor has been defined as:
• Transform length: 1024
• Transform time: less than 40 μs (continuous operation)
• Continuous I/O
• 25.6 Msamples/s throughput
• Complex 24-bit I/O data
• Low power
In Chapter 2, we introduce several FFT algorithms, which are the starting point for the implementation. The basic idea of FFT algorithms, i.e., divide and conquer, is demonstrated through a few examples. Several FFT algorithms and their performance are also given.

An overview of low power techniques is given in Chapter 3. Different techniques are introduced at different abstraction levels. The main focus of the low power techniques is the reduction of dynamic power consumption. A general guideline is found at the end of that chapter.
The choice of FFT architecture is important for the implementation. A few architectures, including the pipeline architectures, are introduced in Chapter 4. The pipeline architectures are discussed in more detail since they are the dedicated architectures for our target application.
In Chapter 5, more detailed implementation steps for FFT processors are provided. Both the design method and the design of the FFT processor are discussed in this chapter.

The conclusions of the FFT processor implementation are given in Chapter 6.
1.5. Contributions
The main contributions of this thesis are:
• A method for minimizing the wordlengths in the pipeline FFT architectures, as outlined in Section 5.2.3.
• An approach to construct efficient high-radix butterflies, presented in Section 5.3.2.2.
• A complex multiplier using distributed arithmetic and the overturned-stairs tree, given in Section 5.3.3.
• A 16-point butterfly with constant multipliers, which reduces the total number of complex multiplications, described in Section 5.4.
• Various generators for different components, for instance the ripple-carry adder, Brent-Kung adder, complex multiplier, etc., found in Chapter 5.
2
FFT ALGORITHMS
In FFT processor design, the mathematical properties of the FFT must be exploited for an efficient implementation, since the selection of the FFT algorithm has a large impact on the implementation in terms of speed, hardware complexity, power consumption, etc. This chapter focuses on a review of FFT algorithms.
2.1. Cooley-Tukey FFT Algorithms

The technique for efficient computation of DFTs is based on the divide and conquer approach. This technique works by recursively breaking down a problem into two or more subproblems of the same (or a related) type. The subproblems are then independently solved, and their solutions are combined to give a solution to the original problem. This technique can be applied to DFT computation by dividing the data sequence into smaller data sequences until the DFTs of the small data sequences can be computed efficiently.

Although the technique was described in 1805 [26], it was not applied to DFT computation until 1965 [13]. Cooley and Tukey demonstrated the simplicity and efficiency of the divide and conquer approach for DFT computation and made the FFT algorithms widely accepted. We first give a simple example of the divide and conquer approach. Then a basic and a generalized FFT formulation are given.
2.1.1. Eight-Point DFT

In this section, we illustrate the idea of the divide and conquer approach and show why dividing is also conquering for DFT computation.

Let us consider an 8-point DFT, i.e., $N = 8$, with data sequence $\{x(k)\}$, $k = 0, 1, \ldots, 7$. The DFT of $\{x(k)\}$ is given by

$$X(n) = \sum_{k=0}^{7} x(k)\, W_8^{nk} \qquad (2.1)$$

for $n = 0, 1, \ldots, 7$.

One way to break a long data sequence down into shorter ones is to group the data sequence according to the indices. Let $\{x_o(l)\}$ and $\{x_e(l)\}$ ($l = 0, 1, 2, 3$) be two sequences. The grouping of $\{x(k)\}$ into $\{x_o(l)\}$ and $\{x_e(l)\}$ can be done intuitively by separating the members by odd and even index:

$$x_o(l) = x(2l+1) \qquad (2.2)$$

$$x_e(l) = x(2l) \qquad (2.3)$$

for $l = 0, 1, 2, 3$.

The DFT of $\{x(k)\}$ can be rewritten

$$\begin{aligned}
X(n) &= \sum_{l=0}^{3} x_o(l)\, W_8^{n(2l+1)} + \sum_{l=0}^{3} x_e(l)\, W_8^{n(2l)} \\
     &= W_8^{n} \sum_{l=0}^{3} x_o(l)\, W_8^{n(2l)} + \sum_{l=0}^{3} x_e(l)\, W_8^{n(2l)} \\
     &= W_8^{n} \sum_{l=0}^{3} x_o(l)\, W_4^{nl} + \sum_{l=0}^{3} x_e(l)\, W_4^{nl} \\
     &= W_8^{n} X_o(n) + X_e(n)
\end{aligned} \qquad (2.4)$$

where $W_8^{n(2l)} = \bigl(e^{-j2\pi/8}\bigr)^{n(2l)} = \bigl(e^{-j2\pi/4}\bigr)^{nl} = W_4^{nl}$, and $X_o(n)$ and $X_e(n)$ are the 4-point DFTs of $\{x_o(l)\}$ and $\{x_e(l)\}$, respectively.

Eq. (2.4) shows that the computation of an 8-point DFT can be decomposed into two 4-point DFTs plus a recombination stage. The direct computation of an 8-point DFT requires $8(8-1) = 56$ complex additions and $8(8-1) = 56$ complex multiplications. The computation of the two 4-point DFTs requires $2 \cdot 4 \cdot (4-1) = 24$ complex additions and $2 \cdot 4 \cdot (4-1) = 24$ complex multiplications. With the additional $8 - 1 = 7$ complex multiplications for $W_8^{n} X_o(n)$ (the multiplication with $W_8^0 = 1$ is trivial) and 8 complex additions, the 8-point DFT computation according to Eq. (2.4) requires in total 31 complex multiplications and 32 complex additions. Only two 4-point DFTs are needed, owing to the fact that $X_o(n) = X_o(n-4)$ and $X_e(n) = X_e(n-4)$ for $n \ge 4$. Furthermore, the number of complex multiplications for $W_8^{n} X_o(n)$ can be reduced from 7 to 3, since $W_8^{n} = -W_8^{n-4}$ for $n \ge 4$. The total number of complex additions and complex multiplications is then 32 and 27, respectively. This is shown in Fig. 2.1.

The above 8-point DFT example shows that the decomposition of a long data sequence into smaller data sequences reduces the computation complexity.
Figure 2.1. An 8-point DFT computation with two 4-point DFTs.
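Eq. (2.4) is easy to verify numerically. The following sketch (added here for illustration, not part of the thesis) computes the two 4-point DFTs and recombines them with the twiddle factor $W_8^n$, using the period-4 symmetry of $X_o$ and $X_e$:

```python
import cmath

def dft(x):
    """Direct DFT, Eq. (2.1)."""
    N = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * n * k / N) for k in range(N))
            for n in range(N)]

x = [1, 2, 3, 4, 5, 6, 7, 8]
Xo = dft(x[1::2])      # 4-point DFT of odd-indexed samples,  x_o(l) = x(2l+1)
Xe = dft(x[0::2])      # 4-point DFT of even-indexed samples, x_e(l) = x(2l)

# Recombine per Eq. (2.4): X(n) = W_8^n X_o(n) + X_e(n), exploiting the
# period-4 symmetry X_o(n) = X_o(n - 4) and X_e(n) = X_e(n - 4).
X = [cmath.exp(-2j * cmath.pi * n / 8) * Xo[n % 4] + Xe[n % 4] for n in range(8)]
assert all(abs(a - b) < 1e-9 for a, b in zip(X, dft(x)))
```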
2.1.2. Basic Formula

The 8-point DFT example illustrates the principle of the Cooley-Tukey FFT algorithm. We now introduce a more mathematical formulation of the FFT algorithm.

Let N be a composite number, i.e., $N = r_1 \times r_0$. The index k can then be expressed by a two-tuple $(k_1, k_0)$ as

$$k = r_0 k_1 + k_0 \qquad (0 \le k_0 < r_0,\; 0 \le k_1 < r_1) \qquad (2.5)$$

In a similar way, the index n can be described by $(n_1, n_0)$ as

$$n = r_1 n_1 + n_0 \qquad (0 \le n_1 < r_0,\; 0 \le n_0 < r_1) \qquad (2.6)$$

The term $W_N^{nk}$ can be factorized as

$$W_N^{nk} = W_N^{(r_1 n_1 + n_0)(r_0 k_1 + k_0)}
          = W_N^{r_1 r_0 n_1 k_1}\, W_{r_1}^{n_0 k_1}\, W_{r_0}^{n_1 k_0}\, W_N^{n_0 k_0}
          = W_{r_1}^{n_0 k_1}\, W_{r_0}^{n_1 k_0}\, W_N^{n_0 k_0} \qquad (2.7)$$

where $W_N^{r_1 r_0 n_1 k_1} = e^{-j2\pi N n_1 k_1 / N} = e^{-j2\pi n_1 k_1} = 1$.

With Eq. (2.7), Eq. (1.1) can be rewritten

$$X(n_1, n_0) = \sum_{k_0=0}^{r_0-1} \left[ \left( \sum_{k_1=0}^{r_1-1} x(k_1, k_0)\, W_{r_1}^{n_0 k_1} \right) W_N^{n_0 k_0} \right] W_{r_0}^{n_1 k_0} \qquad (2.8)$$

Eq. (2.8) indicates that the DFT computation can be performed in three steps:
1. Compute $r_0$ different $r_1$-point DFTs (inner parenthesis).
2. Multiply the results with $W_N^{n_0 k_0}$.
3. Compute $r_1$ different $r_0$-point DFTs (outer parenthesis).
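The three steps can be exercised for any factorization $N = r_1 r_0$. The sketch below (an illustration, not the thesis implementation) applies the index maps of Eqs. (2.5) and (2.6) and evaluates Eq. (2.8) step by step, checking the result against the direct DFT:

```python
import cmath

def W(N, e):
    """Twiddle factor W_N^e = exp(-j*2*pi*e/N)."""
    return cmath.exp(-2j * cmath.pi * e / N)

def dft_three_step(x, r1, r0):
    """Eq. (2.8) with k = r0*k1 + k0 and n = r1*n1 + n0 per Eqs. (2.5)-(2.6)."""
    N = r1 * r0
    assert len(x) == N
    # Step 1: r0 different r1-point DFTs (inner parenthesis)
    s1 = [[sum(x[r0 * k1 + k0] * W(r1, n0 * k1) for k1 in range(r1))
           for k0 in range(r0)] for n0 in range(r1)]
    # Step 2: twiddle factor multiplications W_N^{n0*k0}
    s2 = [[s1[n0][k0] * W(N, n0 * k0) for k0 in range(r0)] for n0 in range(r1)]
    # Step 3: r1 different r0-point DFTs (outer parenthesis)
    X = [0j] * N
    for n0 in range(r1):
        for n1 in range(r0):
            X[r1 * n1 + n0] = sum(s2[n0][k0] * W(r0, n1 * k0) for k0 in range(r0))
    return X

x = [complex(k) for k in range(8)]
direct = [sum(x[k] * W(8, n * k) for k in range(8)) for n in range(8)]
assert all(abs(a - b) < 1e-9 for a, b in zip(dft_three_step(x, 4, 2), direct))
```

The decomposition with r1 = 4, r0 = 2 reproduces the mixed-radix 8-point example of Section 2.1.1.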
The $r_0$ $r_1$-point DFTs require $r_0 r_1 (r_1 - 1) = N(r_1 - 1)$ complex multiplications and additions. The second step requires N complex multiplications. The final step requires $N(r_0 - 1)$ complex multiplications and additions. Therefore the total number of complex multiplications using Eq. (2.8) is $N(r_0 + r_1 - 1)$, and the number of complex additions is $N(r_0 + r_1 - 2)$. This is a reduction from $O(N^2)$ to $O(N(r_1 + r_0))$. The decomposition of the DFT thus reduces the computation complexity.

The numbers $r_0$ and $r_1$ are called radices. If $r_0$ and $r_1$ are both equal to r, the number system is called a radix-r system. Otherwise, it is called a mixed-radix system. The multiplications with $W_N^{n_0 k_0}$ are called twiddle factor multiplications.

Example 2.1. For $N = 8$, we apply the basic formula by decomposing $N = 8 = 4 \times 2$ with $r_0 = 2$ and $r_1 = 4$. This results in the 8-point DFT example given in the section above, which is shown in Fig. 2.1. It is a mixed-radix FFT algorithm.

A closer study of the given 8-point DFT example shows that the input data need not be stored in memory after the computation of the two 4-point DFTs. This reduces the total memory size and is important for memory-constrained systems. An algorithm with this property is called an in-place algorithm.
2.1.3. Generalized Formula
If $r_0$ and/or $r_1$ are not prime, a further reduction of the computation complexity can be achieved by applying the divide and conquer approach recursively to the $r_1$-point and/or $r_0$-point DFTs [7].

Let $N = r_{p-1} \times r_{p-2} \times \cdots \times r_0$. The indices k and n can be written as

$$k = r_0 r_1 \cdots r_{p-2}\, k_{p-1} + \cdots + r_0 k_1 + k_0 \qquad (2.9)$$

$$n = r_{p-1} r_{p-2} \cdots r_1\, n_{p-1} + \cdots + r_{p-1} n_1 + n_0 \qquad (2.10)$$

where $k_i, n_{p-i-1} \in [0, r_i - 1]$ for $i = 0, 1, \ldots, p-1$.
The factorization of $W_N^{nk}$ can be expressed as

$$\begin{aligned}
W_N^{nk} &= W_N^{n(r_0 r_1 \cdots r_{p-2} k_{p-1} + \cdots + r_0 k_1 + k_0)} \\
         &= W_N^{r_0 r_1 \cdots r_{p-2}\, n k_{p-1}} \cdots W_N^{r_0 n k_1}\, W_N^{n k_0} \\
         &= W_{r_{p-1}}^{n k_{p-1}} \cdots W_{N/r_0}^{n k_1}\, W_N^{n k_0}
\end{aligned} \qquad (2.11)$$

where $W_N^{r_0 r_1 \cdots r_i\, n k_{i+1}} = W_{N/(r_0 r_1 \cdots r_i)}^{n k_{i+1}}$.

Eq. (2.1) can then be written

$$X(n_{p-1}, n_{p-2}, \ldots, n_0) = \sum_{k_0=0}^{r_0-1} \sum_{k_1=0}^{r_1-1} \cdots \left( \sum_{k_{p-1}=0}^{r_{p-1}-1} x(k_{p-1}, k_{p-2}, \ldots, k_0)\, W_{r_{p-1}}^{n_0 k_{p-1}} \right) \cdot W_{r_{p-1} r_{p-2}}^{n k_{p-2}} \cdots W_{N/r_0}^{n k_1}\, W_N^{n k_0} \qquad (2.12)$$

Note that the inner sum can be recognized as an $r_{p-1}$-point DFT with output index $n_0$. Define

$$x_1(n_0, k_{p-2}, \ldots, k_0) = \sum_{k_{p-1}=0}^{r_{p-1}-1} x(k_{p-1}, k_{p-2}, \ldots, k_0)\, W_{r_{p-1}}^{n_0 k_{p-1}} \qquad (2.13)$$

With Eq. (2.13), the index $k_{p-1}$ is "replaced" by $n_0$. Equation (2.12) can now be rewritten as

$$X(n_{p-1}, n_{p-2}, \ldots, n_0) = \sum_{k_0=0}^{r_0-1} \sum_{k_1=0}^{r_1-1} \cdots \left( \sum_{k_{p-2}=0}^{r_{p-2}-1} x_1(n_0, k_{p-2}, \ldots, k_0)\, W_{r_{p-1} r_{p-2}}^{n k_{p-2}} \right) \cdot W_{r_{p-1} r_{p-2} r_{p-3}}^{n k_{p-3}} \cdots W_{N/r_0}^{n k_1}\, W_N^{n k_0} \qquad (2.14)$$
The term $W_{N/(r_0 r_1 \cdots r_{i-1})}^{n k_i}$ can be factorized as

$$\begin{aligned}
W_{N/(r_0 r_1 \cdots r_{i-1})}^{n k_i}
 &= W_{r_{p-1} r_{p-2} \cdots r_i}^{(r_{p-1} r_{p-2} \cdots r_1 n_{p-1} + \cdots + r_{p-1} n_1 + n_0)\, k_i} \\
 &= W_{r_{p-1} r_{p-2} \cdots r_i}^{(r_{p-1} \cdots r_{i+1} n_{p-i-1} + \cdots + r_{p-1} n_1 + n_0)\, k_i} \\
 &= W_{r_i}^{n_{p-i-1} k_i}\; W_{r_{p-1} r_{p-2} \cdots r_i}^{(r_{p-1} \cdots r_{i+2} n_{p-i-2} + \cdots + n_0)\, k_i}
\end{aligned}$$

since $N/(r_0 r_1 \cdots r_{i-1}) = r_{p-1} r_{p-2} \cdots r_i$ and the terms in the exponent that are multiples of $r_{p-1} r_{p-2} \cdots r_i$ vanish.

The inner sum over $k_{p-2}$ in Eq. (2.14) can then be written as

$$\sum_{k_{p-2}=0}^{r_{p-2}-1} x_1(n_0, k_{p-2}, \ldots, k_0)\, W_{r_{p-1} r_{p-2}}^{n k_{p-2}}
 = \sum_{k_{p-2}=0}^{r_{p-2}-1} \left[ x_1(n_0, k_{p-2}, \ldots, k_0)\, W_{r_{p-1} r_{p-2}}^{n_0 k_{p-2}} \right] W_{r_{p-2}}^{n_1 k_{p-2}} \qquad (2.15)$$

which can be done through twiddle factor multiplications

$$x_1'(n_0, k_{p-2}, \ldots, k_0) = x_1(n_0, k_{p-2}, \ldots, k_0)\, W_{r_{p-1} r_{p-2}}^{n_0 k_{p-2}} \qquad (2.16)$$

and $r_{p-2}$-point DFTs

$$x_2(n_0, n_1, k_{p-3}, \ldots, k_0) = \sum_{k_{p-2}=0}^{r_{p-2}-1} x_1'(n_0, k_{p-2}, \ldots, k_0)\, W_{r_{p-2}}^{n_1 k_{p-2}} \qquad (2.17)$$

Eq. (2.14) can be rewritten as

$$X(n_{p-1}, n_{p-2}, \ldots, n_0) = \sum_{k_0=0}^{r_0-1} \sum_{k_1=0}^{r_1-1} \cdots \left( \sum_{k_{p-3}=0}^{r_{p-3}-1} x_2(n_0, n_1, k_{p-3}, \ldots, k_0)\, W_{r_{p-1} r_{p-2} r_{p-3}}^{n k_{p-3}} \right) \cdot W_{r_{p-1} r_{p-2} r_{p-3} r_{p-4}}^{n k_{p-4}} \cdots W_{N/r_0}^{n k_1}\, W_N^{n k_0} \qquad (2.18)$$

This process, from Eq. (2.14) to Eq. (2.17), can be repeated $p-2$ times until the index $k_0$ is replaced by $n_{p-1}$.
$$x_{p-1}(n_0, n_1, \ldots, n_{p-1}) = \sum_{k_0=0}^{r_0-1} x_{p-2}'(n_0, \ldots, n_{p-2}, k_0)\, W_{r_0}^{n_{p-1} k_0} \qquad (2.19)$$

Eq. (2.14) can then be expressed as

$$X(n_{p-1}, n_{p-2}, \ldots, n_0) = x_{p-1}(n_0, n_1, \ldots, n_{p-1}) \qquad (2.20)$$

Eq. (2.20) reorders the output data to natural order. This process is called unscrambling. The unscrambling process requires a special addressing mode that converts the address $(n_0, \ldots, n_{p-1})$ to $(n_{p-1}, n_{p-2}, \ldots, n_0)$. In a radix-2 number system, each $n_i$ represents a bit. The addressing for unscrambling then reverses the address bits and is hence called bit-reverse addressing. In a radix-r (r > 2) number system, it is called digit-reverse addressing.

Example 2.2. 8-point DFT. Let $N = 2 \times 2 \times 2$. The factorization of $W_N^{nk}$ can be expressed as

$$W_N^{nk} = \left( W_2^{n_0 k_2} \right) \left( W_4^{n_0 k_1}\, W_2^{n_1 k_1} \right) \left( W_8^{(2 n_1 + n_0) k_0}\, W_2^{n_2 k_0} \right) \qquad (2.21)$$

By using the generalized formula, the 8-point DFT can be computed with the following sequential equations [7]:

$$x_1(n_0, k_1, k_0) = \sum_{k_2=0}^{1} x(k_2, k_1, k_0)\, W_2^{n_0 k_2} \qquad (2.22)$$

$$x_1'(n_0, k_1, k_0) = x_1(n_0, k_1, k_0)\, W_4^{n_0 k_1} \qquad (2.23)$$

$$x_2(n_0, n_1, k_0) = \sum_{k_1=0}^{1} x_1'(n_0, k_1, k_0)\, W_2^{n_1 k_1} \qquad (2.24)$$

$$x_2'(n_0, n_1, k_0) = x_2(n_0, n_1, k_0)\, W_8^{(2 n_1 + n_0) k_0} \qquad (2.25)$$
$$x_3(n_0, n_1, n_2) = \sum_{k_0=0}^{1} x_2'(n_0, n_1, k_0)\, W_2^{n_2 k_0} \qquad (2.26)$$

$$X(n_2, n_1, n_0) = x_3(n_0, n_1, n_2) \qquad (2.27)$$

where Eq. (2.22) corresponds to the $W_2^{n_0 k_2}$ term in Eq. (2.21), Eq. (2.23) corresponds to the $W_4^{n_0 k_1}$ term, and so on.

The result is shown in Fig. 2.2.

Figure 2.2. 8-point DFT with the Cooley-Tukey algorithm.
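Example 2.2, together with the bit-reverse addressing of Eq. (2.20), is essentially what an iterative radix-2 DIT FFT does in practice. The sketch below (my own, assuming a power-of-two transform length) applies the unscrambling on the input side, which is the common in-place arrangement, and then runs log2(N) butterfly stages:

```python
import cmath

def bit_reverse(n, bits):
    """Reverse the `bits`-bit binary representation of n (bit-reverse addressing)."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (n & 1)
        n >>= 1
    return r

def fft_dit(x):
    """Iterative radix-2 decimation-in-time FFT; output in natural order."""
    N = len(x)
    bits = N.bit_length() - 1
    a = [x[bit_reverse(m, bits)] for m in range(N)]      # unscramble inputs
    size = 2
    while size <= N:
        half = size // 2
        Wstep = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, N, size):
            w = 1 + 0j
            for j in range(half):
                u = a[start + j]
                t = w * a[start + j + half]              # twiddle, then ...
                a[start + j] = u + t                     # ... butterfly
                a[start + j + half] = u - t
                w *= Wstep
        size *= 2
    return a

x = [complex(k) for k in range(8)]
direct = [sum(x[k] * cmath.exp(-2j * cmath.pi * n * k / 8) for k in range(8))
          for n in range(8)]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft_dit(x), direct))
```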
The recursive use of the divide and conquer approach for an 8-point DFT is shown in Fig. 2.3. As illustrated in the figure, the inputs are divided into smaller and smaller groups. This class of algorithms is called decimation-in-time (DIT) algorithms.
2.2. Sande-Tukey FFT Algorithms

Another class of algorithms is called decimation-in-frequency (DIF) algorithms, which divide the outputs into smaller and smaller DFTs. This kind of algorithm is also called the Sande-Tukey FFT algorithm.

The computation of the DFT with a DIF algorithm is similar to the computation with a DIT algorithm. For the sake of simplicity, we do not derive the DIF algorithm but illustrate it with an example.

Example 2.3. 8-point DFT. The factorization of $W_N^{nk}$ can be expressed as
$$W_N^{nk} = \left( W_2^{k_2 n_0}\, W_8^{(2 k_1 + k_0) n_0} \right) \left( W_2^{k_1 n_1}\, W_4^{k_0 n_1} \right) \left( W_2^{k_0 n_2} \right) \qquad (2.28)$$

Figure 2.3. The divide and conquer approach for DFT.
The sequential equations can be constructed in a similar way as those in Eq. (2.22) through Eq. (2.27). The result is shown in Fig. 2.4.

Figure 2.4. 8-point DFT with the DIF algorithm.

The computation of the DFT with a DIF algorithm can be expressed with sequential equations similar to those of the DIT algorithms. Using the same notation for the indices n and k as in Eq. (2.9) and Eq. (2.10), the computation of an N-point DFT with the DIF algorithm is

$$\begin{aligned}
x_i(n_0, \ldots, n_{i-1}, k_{p-i-1}, \ldots, k_0) ={}& \left[ \sum_{k_{p-i}=0}^{r_{p-i}-1} x_{i-1}(n_0, \ldots, n_{i-2}, k_{p-i}, \ldots, k_0)\, W_{r_{p-i}}^{n_{i-1} k_{p-i}} \right] \\
& \cdot\; W_{N/(r_{p-1} \cdots r_{p-i+1})}^{n_{i-1}\,(r_{p-i-2} \cdots r_0\, k_{p-i-1} + \cdots + r_0 k_1 + k_0)}
\end{aligned} \qquad (2.29)$$

where $x_0(k_{p-1}, k_{p-2}, \ldots, k_0) = x(k_{p-1}, k_{p-2}, \ldots, k_0)$ and $i = 1, 2, \ldots, p$. The unscrambling is done by

$$X(n_{p-1}, \ldots, n_0) = x_p(n_0, \ldots, n_{p-1}) \qquad (2.30)$$

Comparing Fig. 2.4 with Fig. 2.2, we find that the signal flow graph (SFG) for DFT computation with the DIF algorithm is a transposition of that with the DIT algorithm. Hence, many properties of the DIT and DIF algorithms are the same. For instance, the computation
workload for the DIT and DIF algorithms is the same. The unscrambling process is required for both DIF and DIT algorithms.
However, there are clear differences between DIF and DIT
algorithms, e.g., the position of twiddle factor multiplications. The
DIF algorithms have the twiddle factor multiplications after the
DFTs and the DIT algorithms have the twiddle factor multiplications
before the DFTs.
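To make the DIT structure concrete, the decimation of the inputs and the position of the twiddle factor multiplications (before the combining DFTs) can be sketched in Python. This is an illustrative sketch, not code from the thesis; `dft` and `fft_dit` are our own names.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def fft_dit(x):
    """Recursive radix-2 decimation-in-time FFT (N must be a power of two).
    The twiddle factors multiply the odd-half results *before* the
    combining butterflies, matching the DIT structure described above."""
    N = len(x)
    if N == 1:
        return x[:]
    even = fft_dit(x[0::2])   # decimate the *inputs* into even/odd groups
    odd = fft_dit(x[1::2])
    W = [cmath.exp(-2j * cmath.pi * k / N) for k in range(N // 2)]
    return ([even[k] + W[k] * odd[k] for k in range(N // 2)] +
            [even[k] - W[k] * odd[k] for k in range(N // 2)])
```

Running both on the same 8-point input and comparing the outputs element-wise confirms the factorization.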
2.3. Prime Factor FFT Algorithms
In the Cooley-Tukey and Sande-Tukey algorithms, twiddle factor multiplications are required for the DFT computation. If N is decomposed into relatively prime factors, there exists another type of FFT algorithm, i.e., the prime factor FFT algorithm, which reduces the twiddle factor multiplications.
In the Cooley-Tukey and Sande-Tukey algorithms, the index n or k is expressed with Eq. (2.9) and Eq. (2.10). This representation of the index number is called index mapping. If r_1 and r_0 are relatively prime, i.e., the greatest common divisor gcd(r_1, r_0) = 1, there exists another index mapping, the so-called Good's mapping [19]. An index n can be expressed as

n = (r_0 n_1 (r_0^{−1} mod r_1) + r_1 n_0 (r_1^{−1} mod r_0)) mod N    (2.31)

where N = r_1 × r_0, 0 ≤ n_1 < r_1, and 0 ≤ n_0 < r_0. Here r_0^{−1} is the multiplicative inverse of r_0 modulo r_1, i.e., r_0 r_0^{−1} mod r_1 = 1, and r_1^{−1} is the multiplicative inverse of r_1 modulo r_0. This mapping is a variant of the Chinese Remainder Theorem.
Example 2.4. Construct index mapping for 15point DFT inputs
according to Good’s mapping.
We have N = 3 × 5 with r_1 = 3 and r_0 = 5. The inverse r_1^{−1} is 2 since r_1 r_1^{−1} mod r_0 = 3 · 2 mod 5 = 1, and likewise r_0^{−1} = 2 since 5 · 2 mod 3 = 1. The index can be computed according to

k = (5 · 2k_1 + 3 · 2k_0) mod 15 = (10k_1 + 6k_0) mod 15    (2.32)

The mapping can be illustrated with an index matrix (rows indexed by k_1, columns by k_0):

     0   6  12   3   9
    10   1   7  13   4
     5  11   2   8  14

Figure 2.5. Good's mapping for 15-point DFT inputs.

The mapping for the outputs is simple. It can be constructed by n = (r_0 n_1 + r_1 n_0) mod N, with 0 ≤ n_1 < r_1 and 0 ≤ n_0 < r_0.

Example 2.5. Construct index mapping for 15-point DFT outputs.
We have N = 3 × 5 with r_1 = 3 and r_0 = 5. The index mapping for the outputs can be constructed by n = (5n_1 + 3n_0) mod 15 for 0 ≤ n_1 < 3 and 0 ≤ n_0 < 5. The result is shown in Fig. 2.6 (rows indexed by n_1, columns by n_0):

     0   3   6   9  12
     5   8  11  14   2
    10  13   1   4   7

Figure 2.6. Index mapping for 15-point DFT outputs.
The computation with prime factor FFT algorithms is similar to the computation with the Cooley-Tukey algorithm. It can be divided
into two steps:
1 Compute r_0 different r_1-point DFTs. This performs column-wise DFTs on the input index matrix.
2 Compute r_1 different r_0-point DFTs. This performs row-wise DFTs on the output index matrix.
Example 2.6. 15point DFT with prime factor mapping FFT
algorithm.
The input and output index matrices can be constructed as shown in Fig. 2.5 and Fig. 2.6. Following the computation steps above, the 15-point DFT can be computed by five 3-point DFTs followed by three 5-point DFTs.
The 15point DFT with prime factor mapping FFT algorithm is
shown in Fig. 2.7.
[Figure 2.7 shows the 15-point FFT structure: the permuted inputs x(0), x(10), x(5), … are transformed by a bank of 3-point DFTs and a bank of 5-point DFTs with no twiddle factors in between, producing the permuted outputs X(0), X(3), X(6), ….]
Figure 2.7. 15-point FFT with prime factor mapping.
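The two computation steps can be sketched in Python for N = 15 (an illustrative sketch, not code from the thesis; `pfa_15` and `dft` are our own names, and the roles of the two index mappings can equally be swapped, as noted below):

```python
import cmath

def dft(x):
    """Direct reference DFT."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def pfa_15(x):
    """15-point DFT via the prime factor algorithm: column-wise 3-point
    DFTs over the Good's-mapped inputs (Fig. 2.5), then row-wise 5-point
    DFTs, with the results landing at the output indices of Fig. 2.6.
    No twiddle factor multiplications are needed between the stages."""
    r1, r0, N = 3, 5, 15
    t1 = pow(r1, -1, r0)   # 3^-1 mod 5
    t0 = pow(r0, -1, r1)   # 5^-1 mod 3
    X = [0j] * N
    # Step 1: r0 = 5 column-wise 3-point DFTs.
    col = [[sum(x[(r0 * t0 * a + r1 * t1 * b) % N] *
                cmath.exp(-2j * cmath.pi * a * k1 / r1) for a in range(r1))
            for b in range(r0)]
           for k1 in range(r1)]
    # Step 2: r1 = 3 row-wise 5-point DFTs.
    for k1 in range(r1):
        for k0 in range(r0):
            X[(r0 * k1 + r1 * k0) % N] = sum(
                col[k1][b] * cmath.exp(-2j * cmath.pi * b * k0 / r0)
                for b in range(r0))
    return X
```

The cross terms in the exponent vanish modulo N because r_1 and r_0 are coprime, which is exactly why no twiddle factors appear between the two DFT banks.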
The prime factor mapping based FFT algorithm above is also an in-place algorithm.
Swapping the input and output index matrices gives another FFT algorithm, which does not need twiddle factor multiplications outside the butterflies either.
Although the prime factor FFT algorithms are similar to the Cooley-Tukey and Sande-Tukey FFT algorithms, they were derived from convolution-based DFT computations [19] [49] [31]. This work later led to the Winograd Fourier Transform Algorithm (WFTA) [61].
2.4. Other FFT Algorithms
In this section, we discuss two other FFT algorithms. One is the
splitradix FFT algorithm (SRFFT) and the other one is Winograd
Fourier Transform algorithm (WFTA).
2.4.1. SplitRadix FFT Algorithm
Split-radix FFT algorithms (SRFFT) were proposed nearly simultaneously by several authors in 1984 [17] [18]. The algorithms belong to the FFT algorithms with twiddle factors. As a matter of fact, split-radix FFT algorithms are based on an observation about the Cooley-Tukey and Sande-Tukey FFT algorithms: different decompositions can be used for different parts of an algorithm. This makes it possible to select the most suitable algorithm for each part in order to reduce the computational complexity.
For instance, the signalﬂow graph (SFG) for a 16point radix2
DIF FFT algorithm is shown in Fig. 2.8.
The SRFFT algorithms exploit this idea by using both a radix-2 and a radix-4 decomposition in the same FFT algorithm. It is obvious that all twiddle factors are equal to 1 for the even-indexed outputs with a radix-2 FFT computation, i.e., no twiddle factor multiplication is required there. In the radix-4 FFT computation there is no such general rule (see Fig. 2.9). For the odd-indexed outputs, a radix-4 decomposition increases the computational efficiency, because the four-point DFT is the largest multiplication-free butterfly and the radix-4 FFT is more efficient than the radix-2 FFT from the multiplication complexity point of view. Consequently, the DFT computation uses different-radix FFT
[Figure 2.8 shows the 16-point radix-2 DIF SFG: inputs x(0)…x(15) in natural order, outputs X(0), X(8), X(4), X(12), … in bit-reversed order, with twiddle factors W_16, W_8, and W_4 between the four stages.]
Figure 2.8. Signal-flow graph for a 16-point DIF FFT algorithm.
algorithms for the odd- and even-indexed outputs. This reduces the number of complex multiplications and additions/subtractions. A 16-point SRFFT is shown in Fig. 2.10.
[Figure 2.9 shows the 16-point radix-4 DIF SFG: inputs x(0)…x(15), outputs X(0), X(4), X(8), X(12), X(1), …, built from radix-4 butterflies with twiddle factors W_16.]
Figure 2.9. Radix-4 DIF algorithm for 16-point DFT.
[Figure 2.10 shows the 16-point split-radix SFG: a radix-2 decomposition for the even-indexed outputs and a radix-4 decomposition for the odd-indexed outputs, combined through radix-2/4 butterflies with twiddle factors W_16 and W_8 and trivial multiplications by j.]
Figure 2.10. SFG for 16-point DFT with SRFFT algorithm.
Although the SRFFT algorithms are derived from observations on the radix-2 and radix-4 FFT algorithms, they cannot be derived by index mapping. This could be the reason why the algorithms were discovered so late [18]. The SRFFT can also be generalized to lengths N = p^k, where p is a prime number [18].
2.4.2. Winograd Fourier Transform Algorithm
The Winograd Fourier transform algorithm (WFTA) [61] uses the
cyclic convolution method to compute the DFT. This is based on
Rader’s idea [49] for prime number DFT computation.
The computation of an N-point DFT (where N is a product of two coprime numbers r_1 and r_0) with WFTA can be divided into five steps: two pre-addition steps, two post-addition steps, and a multiplication step in the middle. The number of arithmetic operations depends on N. The number of multiplications is O(N).
The aim of Winograd’s algorithm is to minimize the number of
multiplications. WFTA succeeds in minimizing the number of
multiplications to the smallest number known. However, the
minimization of multiplications results in complicated computation
ordering and large increase of other arithmetic operations, e.g.,
additions. Furthermore, the irregularity of WFTA makes it
impractical for most real applications.
2.5. Performance Comparison
For the algorithm implementation, the computation load is of great
concern. Usually the number of additions and multiplications are
two important measurements for the computation workload. We
compare the discussed algorithms from the addition and
multiplication complexity point of view.
[Figure 2.11 shows the general WFTA structure: r_0 sets of r_1-point input additions, r_1 sets of r_0-point input additions, the N-point multiplications, r_1 sets of r_0-point output additions, and r_0 sets of r_1-point output additions.]
Figure 2.11. General structure of WFTA.
Due to the restrictions on transform length for the prime-factor-based algorithms and WFTA, the comparison is not strictly at the same transform length but rather at nearby transform lengths.
2.5.1. Multiplication Complexity
Since multiplication has large impact on the speed and power
consumption, the multiplication complexity is important for the
selection of FFT algorithms.
In many DFT computations, both complex multiplications and
real multiplications are required. For the purpose of comparison, the
counting is based on the number of real multiplications. A complex
multiplication can be realized directly with 4 real multiplications
and 2 real additions, which is shown in Fig. 2.12 (a). With a simple
transformation, the number of real multiplications can be reduced to
3, but the number of real additions increases to 3 as shown in Fig.
2.12 (b). We consider a complex multiplication as 3 real
multiplications and 3 real additions in the following analysis.
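One common way to realize the transformed version of Fig. 2.12 (b) is the following three-multiplication factorization; this is an illustrative sketch (the exact factorization in the figure may differ), and the function names are our own. Note that when c is a fixed twiddle factor, the coefficient sums (c_i − c_r) and (c_i + c_r) can be precomputed, leaving 3 multiplications and 3 run-time additions.

```python
def cmul_direct(xr, xi, cr, ci):
    """Direct complex multiplication: 4 multiplications, 2 additions."""
    return xr * cr - xi * ci, xr * ci + xi * cr

def cmul_3mult(xr, xi, cr, ci):
    """Transformed complex multiplication with 3 multiplications.
    (ci - cr) and (ci + cr) depend only on the coefficient and can be
    precomputed for a constant twiddle factor."""
    k1 = cr * (xr + xi)
    k2 = xr * (ci - cr)
    k3 = xi * (ci + cr)
    return k1 - k3, k1 + k2

print(cmul_direct(1.5, -2.0, 0.25, 0.75))  # (1.875, 0.625)
print(cmul_3mult(1.5, -2.0, 0.25, 0.75))   # (1.875, 0.625)
```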
For a DFT with transform length N = 2^n, the number of complex multiplications can be estimated as half of the total number of butterfly operations, i.e., N log_2(N)/2. This number is overestimated: for example, a complex multiplication with twiddle factor W_N^k does not require any multiplications when k is a multiple of N/4. Furthermore, it requires only 2 real multiplications and 2 additions when k is an odd multiple of N/8. Taking these simplifications into account, the number of real multiplications for a
[Figure 2.12 shows two realizations of the complex multiplication Z = X · C: (a) the direct realization with four real multiplications and two real additions; (b) the transformed realization with three real multiplications, using the precomputed coefficient terms C_R, C_I − C_R, and C_I + C_R, and three real additions.]
Figure 2.12. Realization of a complex multiplication. (a) Direct realization (b) Transformed realization
DFT with the radix-2 algorithm and transform length N = 2^n is M = (3N/2) log_2 N − 5N + 8 [25]. The radix-4 algorithm for a DFT with transform length N = 4^n requires M = (9N/8) log_2 N − 43N/12 + 16/3 real multiplications [18]. For the split-radix FFT algorithm, the number of real multiplications is M = N log_2 N − 3N + 4 for a DFT with N = 2^n [18].

If the transform length is a product of two or more coprime numbers, there is no simple analytic expression for the number of real multiplications. However, there are lower bounds that can be attained by algorithms for those transform lengths, and these lower bounds can be computed [18].

As mentioned previously, WFTA has been proven to have the lowest number of multiplications among the existing algorithms for transform lengths less than 16.

From the multiplication complexity point of view, the most attractive algorithm is WFTA, followed by the prime factor algorithm, the split-radix algorithm, and the fixed-radix algorithms. The number of real multiplications for various FFT algorithms on complex data is shown in the following table [18].

  N      Radix-2  Radix-4  SRFFT    PFA    WFTA
  16        24       20      20
  60                                 200     136
  64       264      208     196
  240                               1100     632
  256     1800     1392    1284
  504                               2524    1572
  512     4360             3076
  1008                              5804    3548
  1024   10248     7856    7172

Table 2.1. Multiplication complexity for various FFT algorithms.
2.5.2. Addition Complexity
In a radix2 or radix4 FFT algorithm, the addition and subtraction
operations are used for realizing the butterﬂy operations and the
complex multiplications. Since subtraction has the same complexity
as addition, we consider a subtraction equivalent to an addition.
The additions for the butterfly operations constitute the larger part of the addition complexity. For each radix-2 butterfly operation (a 2-point DFT), the number of real additions is four, since each complex addition/subtraction requires two real additions. For a transform length of N = 2^n, a DFT requires N/2 radix-2 DFTs in each stage, so the total number of real additions is 4 (log_2 N)(N/2), or 2nN, with the radix-2 FFT algorithms. For a transform length of N = 4^n, a DFT requires N/4 radix-4 DFTs in each stage. Each radix-4 DFT requires 8 complex additions/subtractions, i.e., 16 real additions. The total number of real additions is 16 (log_4 N)(N/4), or 4nN. Both the radix-2 and radix-4 FFT algorithms thus require the same number of additions for a DFT whose transform length is a power of 4.

The number of additions required for the complex multiplications is smaller than that for the butterfly operations. Nevertheless, it cannot be ignored. As described previously, a complex multiplication generally requires 3 additions. The exact number [25] is A = (7N/2) log_2 N − 5N + 8 for a DFT with transform length N = 2^n using the radix-2 algorithm. The number of additions for a DFT with transform length N = 4^n is A = (25N/8) log_2 N − 43N/12 + 16/3 for the radix-4 algorithm [18]. The split-radix algorithm has the best result for addition complexity: A = 3N log_2 N − 3N + 4 additions for an N = 2^n DFT [18].
From the addition complexity point of view, WFTA is a poor choice. In fact, the irregularity and the increase in addition complexity make the WFTA less attractive for practical implementation. The number of real additions for various FFT algorithms on complex data is given in the following table [18].
2.6. Other Issues
Many issues are related to the FFT algorithm implementations, e.g.,
scaling and rounding considerations, inverse FFT implementation,
parallelism of FFT algorithms, inplace and/or inorder issue,
regularity of FFT algorithms etc. We discuss the ﬁrst two issues in
more detail.
  N      Radix-2  Radix-4  SRFFT     PFA     WFTA
  16       152      148      148
  60                                  888      888
  64      1032      976      964
  240                                4812     5016
  256     5896     5488     5380
  504                               13388    14540
  512    13566             12292
  1008                              29548    34668
  1024   30728    28336    27652

Table 2.2. Addition complexity for various FFT algorithms.
2.6.1. Scaling and Rounding Issue
In hardware it is not possible to implement an algorithm with infinite accuracy. To obtain sufficient accuracy, the scaling and rounding effects must be considered.
Without loss of generality, we assume that the input data {x(n)} are scaled, i.e., |x(n)| < 1/2 for all n. To avoid overflow of the number range, we apply the safe scaling technique [58]. This ensures that an overflow cannot occur. We take the 16-point DFT with the radix-2 DIF FFT algorithm (see Fig. 2.8) as an example.
The basic operation of the radix-2 DIF FFT algorithm consists of a radix-2 butterfly operation and a complex multiplication, as shown in Fig. 2.13.
For two numbers u and v with |u| < 1/2 and |v| < 1/2, we have

|U| = |u + v| ≤ |u| + |v| < 1    (2.33)
|V| = |(u − v) · W_N^p| = |u − v| ≤ |u| + |v| < 1    (2.34)

where the magnitude of the twiddle factor W_N^p is equal to 1.
To retain the magnitude, the results must be scaled with a factor
1/2. After scaling, rounding is applied in order to have the same
input and output wordlengths. This introduces an error, which is
[Figure 2.13 shows the butterfly B with inputs u and v and outputs U = u + v and V = (u − v) W_N^p.]
Figure 2.13. Basic operation for radix-2 DIF FFT algorithm.
called quantization noise. This noise for a real number is modeled as an additive white noise source with zero mean and variance Δ²/12, where Δ is the weight of the least significant bit.

[Figure 2.14 shows the butterfly model with scaling by 1/2 and rounding: noise sources n_U and n_V are added to the scaled outputs, giving U_Q and V_Q.]
Figure 2.14. Model for scaling and rounding of radix-2 butterfly.

The additive noise for U and V, respectively, is complex. Assume that the quantization noise for U and V is Q_U and Q_V, respectively. For Q_U, we have

E{Q_U} = E{Q_Ure + jQ_Uim} = E{Q_Ure} + E{jQ_Uim} = 0    (2.35)

Var{Q_U} = E{Q_Ure² + Q_Uim²} = E{Q_Ure²} + E{Q_Uim²} = 2Δ²/12    (2.36)

Since the quantization noise is independent of the twiddle factor multiplication, we have

E{Q_V · W_N^p} = E{Q_V} · E{W_N^p} = 0    (2.37)

Var{Q_V · W_N^p} = E{(Q_V · W_N^p)(Q_V · W_N^p)*} = E{Q_V Q_V*} = 2Δ²/12    (2.38)
After analysis of the basic radix2 butterﬂy operation, we
consider the scaling and quantization effects in an 8point DIF FFT
algorithm. The noise propagation path for the output X(0) is
highlighted with bold solid lines in Fig. 2.15.
For the sake of clarity, we assume that Δ is equal for each stage, i.e., the internal wordlength is the same for all stages. If we analyze backwards from X(0), i.e., from stage l back to stage 1, it is easy to find that noise from stage l−1 is scaled with 1/2 into stage l, and stage l−1 has exactly twice as many noise sources as stage l. Generally, if the transform length is N and the number of stages is n, where N = 2^n, the variance of a noise source from stage l is scaled with (1/2)^{2(n−l)} and the number of noise sources in stage l is 2^{n−l}. Hence the total quantization noise variance for an output X(k) is

Var{Q_X(k)} = Σ_{l=1}^{n} 2^{n−l} · (1/2^{2(n−l)}) · (Δ²/6) = (Δ²/6) Σ_{l=1}^{n} 1/2^{n−l} = (Δ²/6)(2 − 1/2^{n−1})    (2.39)

if the input sequence is zero-mean white noise with variance δ².

[Figure 2.15 shows the 8-point DIF SFG with the noise propagation path to the output X(0) highlighted through stages 1, 2, and 3.]
Figure 2.15. Noise propagation.
The output variance for an output X(k) can be derived from the following equation:

Var{X(k)} = E{X(k) X(k)*}
          = E{ (1/N) Σ_{n=0}^{N−1} x(n) W_N^{nk} · (1/N) Σ_{m=0}^{N−1} x(m)* W_N^{−mk} }
          = (1/N²) Σ_{n=0}^{N−1} Σ_{m=0}^{N−1} E{x(n) x(m)*} W_N^{(n−m)k}
          = (1/N²) Σ_{n=0}^{N−1} E{x(n) x(n)*} = (1/N²) · Nδ² = δ²/N    (2.40)

where E{x(n) x(m)*} = 0 for n ≠ m from the white noise assumption. If Δ_in is the weight of the least significant bit for the real or imaginary part of the input, the input variance is equal to δ² = 2Δ_in²/12. The signal-to-noise ratio (SNR) for the output X(k) is therefore [60]

SNR = [(2Δ_in²/12)/N] / [(Δ²/6)(2 − 1/2^{n−1})] = (Δ_in²/Δ²) · (1/2^n)/(2 − 2^{−n+1}) = (Δ_in²/Δ²) · 1/(2(2^n − 1))

For a radix-r DIF FFT algorithm, a similar analysis [60] yields

SNR = [(2Δ_in²/12)/N] / [(Δ²/6)(2 − 1/r^{n−1})] = (Δ_in²/Δ²) · 1/(2N − r)

This result, which is based on the white noise assumption, can be used to determine the required internal wordlength.
The finite wordlength effect of finite precision coefficients is more complicated. Simulation is typically used to determine the wordlength of the coefficients.
2.6.2. IDFT Implementation
An OFDM system requires both DFT and IDFT for signal
processing. The IDFT implementation is also critical for the OFDM
system.
There are various approaches for the IDFT implementation. The straightforward one is to compute the IDFT directly according to Eq. (1.2), which has a computational complexity of O(N²). This approach is obviously not efficient.
The second approach is similar to the FFT computation. If we ignore the scaling factor 1/N, the only difference between the DFT and the IDFT is the twiddle factor, which is W_N^{−nk} instead of W_N^{nk}. This can easily be handled by changing the read addresses of the twiddle factor ROM(s) for the twiddle factor multiplications. It also requires reordering of the input when a radix-r DFT is used. This approach adds an overhead to each butterfly operation and changes the access order of the coefficient ROM.
The third approach converts the computation of the IDFT into a computation of the DFT. This is shown by the following equation:

x(k) = (1/N) Σ_{n=0}^{N−1} X(n) e^{j2πnk/N}
     = (1/N) Σ_{n=0}^{N−1} [X_re(n) e^{j2πnk/N} + jX_im(n) e^{j2πnk/N}]
     = ( (1/N) Σ_{n=0}^{N−1} [X_re(n) e^{−j2πnk/N} − jX_im(n) e^{−j2πnk/N}] )*
     = ( (1/N) Σ_{n=0}^{N−1} X*(n) e^{−j2πnk/N} )*

where the term within the parentheses is the definition of a DFT and a* denotes the conjugate of a.
The conjugation of a complex number can be realized, up to a factor j, by swapping its real and imaginary parts; the extra j factors introduced at the input and the output cancel each other. Hence, the IDFT can be computed with a DFT by adding two swaps and one scaling: swap the real and imaginary parts of the input before the DFT computation, swap the real and imaginary parts of the output from the DFT, and scale with the factor 1/N.
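The swap-DFT-swap recipe can be sketched in a few lines of Python (illustrative code, not from the thesis; `dft` and `idft_via_dft` are our own names):

```python
import cmath

def dft(x):
    """Direct forward DFT, used as the core transform."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def idft_via_dft(X):
    """IDFT computed with a forward DFT plus two real/imaginary swaps
    and a 1/N scaling, as described above."""
    N = len(X)
    swap = lambda z: complex(z.imag, z.real)   # swap re/im = j * conj(z)
    Y = dft([swap(v) for v in X])
    return [swap(y) / N for y in Y]
```

Transforming a sequence forward with `dft` and back with `idft_via_dft` recovers the original samples, which confirms that the j factors cancel.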
2.7. Summary
In this chapter we discussed the most commonly used FFT algorithms, e.g., the Cooley-Tukey and Sande-Tukey algorithms. Each computation step was given in detail for the Cooley-Tukey algorithms. Other algorithms, like the prime factor algorithm, the split-radix algorithm, and WFTA, were also discussed.
We compared the different algorithms in terms of the number of additions and multiplications. Some other aspects, for instance memory requirements, will be discussed later.
3
LOW POWER TECHNIQUES
Low power consumption has emerged as a major challenge in the
design of integrated circuits.
In this chapter, we discuss the basic principles for power
consumption in standard CMOS circuits. Afterwards, a review of
lowpower techniques for CMOS circuits is given.
3.1. Power Dissipation Sources
In CMOS circuits, the main contributions to the power consumption
are from shortcircuit, leakage, and switching currents. In the
following subsections, we introduce them separately.
3.1.1. ShortCircuit Power
In a static CMOS circuit, there are two complementary networks: p
network (pullup) and nnetwork (pulldown). The logic functions
for the two networks are complementary. Normally when the input
and output state are stable, only one network is turned on and
conducts the output either to power supply node or to ground node
and the other network is turned off and blocks the current from
ﬂowing. Shortcircuit current exists during the transitions as one
network is turned on and the other network is still active. For
example, when the input signal to an inverter switches from 0 to V_dd, there exists a short time interval where the input voltage is larger than V_tn but less than V_dd − V_tp. During this time interval, both
the PMOS transistor (p-network) and the NMOS transistor (n-network) are turned on, and the short-circuit current flows through both kinds of transistors from the power supply line to ground.
The exact analysis of the short-circuit current even in a simple inverter [6] is complex, but it can be studied by simulation using SPICE. It is observed that the short-circuit current is proportional to the slope of the input signals, the output loads, and the transistor sizes [54]. The short-circuit current typically consumes less than 10% of the total power in a "well-designed" circuit [54].
3.1.2. Leakage Power
There are two contributions to leakage currents: one from the
currents that ﬂow through the reverse biased diodes, the other from
the currents that ﬂow through transistors that are nonconducting.
The leakage currents are proportional to the leakage area and depend exponentially on the threshold voltage. The leakage currents depend on the technology and cannot be modified by the designers, except in some logic styles.
The leakage current is on the order of picoamperes in current technology, but it will increase as the threshold voltage is reduced. In
some cases, like large RAMs, the leakage current is one of the main
concerns. The leakage current is currently not a severe problem in
most digital designs. However, the power consumed by leakage
current can be as large as the power consumed by the switching
[Figure 3.1 illustrates the two leakage mechanisms in a MOS structure: the reverse-biased diode current I_reverse and the subthreshold leakage current I_sub.]
Figure 3.1. Leakage current types: (a) reverse-biased diode current, (b) subthreshold leakage current.
current in a 0.06 µm technology. The usage of multiple threshold voltages can reduce the leakage current in deep-submicron technologies.
3.1.3. Switching Power
The switching currents are due to the charging and discharging of
node capacitances. The node capacitances mainly include gate,
overlapping, and interconnection capacitances.
The power consumed by the switching currents [63] can be expressed as

P = α C_L f V_dd² / 2    (3.1)

where α is the switching activity factor, C_L is the capacitive load, f is the clock frequency, and V_dd is the power supply voltage.
The equation shows that the switching power depends on a few quantities that are readily observable and measurable in CMOS circuits. It is applicable to almost all digital circuits and gives guidance for low power design.
The power consumed by the switching currents is the dominant part of the power consumption. Reducing the switching current is therefore the focus of most low power design techniques.
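Eq. (3.1) is simple enough to evaluate directly; the following sketch (our own function name, with made-up example numbers) also shows the quadratic leverage of the supply voltage:

```python
def switching_power(alpha, c_load, f, vdd):
    """Eq. (3.1): P = alpha * C_L * f * Vdd^2 / 2."""
    return alpha * c_load * f * vdd ** 2 / 2

# Halving the supply voltage quarters the switching power, which is
# why voltage scaling is such an effective low power technique.
p1 = switching_power(0.2, 1e-12, 100e6, 3.3)
p2 = switching_power(0.2, 1e-12, 100e6, 1.65)
print(round(p1 / p2, 6))   # 4.0
```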
3.2. Low Power Techniques
Low power techniques can be discussed at various levels of
abstractions: system level, algorithm and architecture level, logic
level, circuit level, and technology level. Fig. 3.2 shows some
examples of techniques at the different levels.
In the following, we give an overview of different low power techniques, organized by abstraction level.
3.2.1. System Level
A system typically consists of both hardware and software
components, which affect the power consumption.
The system design includes the hardware/software partitioning,
hardware platform selection (applicationspeciﬁc or general
purpose processors), resource sharing (scheduling) strategy, etc. The
system design usually has the largest impact on the power
consumption and hence the low power techniques applied at this
level have the most potential for power reduction.
At the system level, it is hard to ﬁnd the best solution for low
power in the large design space and there is a shortage of accurate
power analysis tools at this level. However, if, for example, the
instructionlevel power models for a given processor are available,
software power optimization can be performed [56]. It is observed
[Figure 3.2 lists example techniques at each abstraction level: System: partitioning, power-down; Algorithm and Architecture: parallelism, pipelining, voltage scaling; Logic: logic styles and manipulation, data encoding; Circuit: energy recovery, transistor sizing; Technology: threshold reduction, double-threshold devices.]
Figure 3.2. Low-power design methodology at different abstraction levels.
that faster code and frequent usage of the cache are most likely to reduce the power consumption. The order of instructions also has an impact on the internal switching within processors and hence on the power consumption.
Power-down and clock gating are two of the most used low power techniques at the system level. Non-active hardware units are shut down to save power. The clock drivers, which often consume 30-40% of the total power, can be gated to reduce the switching activity, as illustrated in Fig. 3.3.
The power-down can be extended to the whole system. This is called sleep mode and is widely used in low power processors. The StrongARM SA1100 processor has three power states, and the average power varies for each state [29]. These power states can be utilized by the software through the advanced configuration and power management interface (ACPI). In recent years, power management has gained a lot of attention in operating system design. For example, the Microsoft desktop operating systems support advanced power management (APM).
The system is designed for peak performance. However, the computational requirements are time varying. Adapting the clock frequency and/or dynamically scaling the voltage to match the performance constraints is another low power technique. A lower performance requirement during certain time intervals can be used to reduce the
[Figure 3.3 shows an AND gate combining the clock with a block enable signal before driving the block clock network.]
Figure 3.3. Clock gating.
[Figure 3.4 shows the three power states RUN (400 mW), IDLE (50 mW), and SLEEP (160 µW), with transition times ranging from 10 µs to 160 ms.]
Figure 3.4. Power states for StrongARM SA1100 processor.
power supply voltage. This requires either a feedback mechanism (load monitoring and voltage control) or predetermined timing to activate the voltage down-scaling.
Another, less explored, domain for low power design is the use of asynchronous design techniques. Asynchronous designs have many attractive features, like non-global clocking, automatic power-down, no spurious transitions, and low peak current. The power consumption can be reduced further by combining the asynchronous design technique with other low-power techniques, for instance the dynamic voltage scaling technique [42]. This is illustrated in Fig. 3.5.
3.2.2. Algorithm Level
The algorithm selection has a large impact on the power consumption. For example, using the fast Fourier transform instead of direct computation of the DFT reduces the number of operations by a factor of 102.4 for a 1024-point Fourier transform, and the power consumption is likely to be reduced by a similar factor.
The task of algorithm design is to select the most energy
efﬁcient algorithm that just satisﬁes the constraints. The cost of an
algorithm includes the computation part and the
communication/storage part. The complexity measurement for an
algorithm includes the number of operations and the cost of
[Figure 3.5 shows a processing unit fed through an input buffer, with a load monitor controlling a DC-DC converter between the power supply and the processing unit, and synchronous/asynchronous interfaces at the input and output.]
Figure 3.5. Asynchronous design with dynamic voltage scaling.
communication/storage. Reducing the number of operations, the cost per operation, and the long distance communications are key issues in algorithm selection.
One important technique for low power at the algorithmic level is algorithmic transformation [45] [46]. This technique exploits the complexity, concurrency, regularity, and locality of an algorithm.
Reducing the complexity of an algorithm reduces the number of
operations and hence the power consumption. The possibility of
increasing concurrency in an algorithm allows the use of other
techniques, e.g., voltage scaling, to reduce the power consumption.
The regularity and locality of an algorithm affects the controls and
communications in the hardware.
The loop unrolling technique [9] [10] is a transformation that aims to enhance the speed. It can also be used to reduce the power consumption: with loop unrolling, the critical path is reduced, and hence voltage scaling can be applied. In Fig. 3.6, the unrolling reduces the critical path and allows a voltage reduction of 26% [10]. This reduces the power consumption by 20%, even though the capacitance load increases by 50% [10]. Furthermore, this technique can be combined with other techniques at the architectural level, for instance pipelining and interleaving, to save more power.
In some cases, like wave digital filters, a faster algorithm combined with voltage scaling can be chosen for energy-efficient applications [58].
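The unrolling of the first-order recursion in Fig. 3.6 can be sketched as follows (an illustrative sketch with our own function names; the point is that the unrolled loop produces two outputs per recursion, doubling the time available per critical path):

```python
def iir_original(x, b0, a1):
    """Fig. 3.6(a): y[n] = b0*x[n] + a1*y[n-1], one output per iteration."""
    y, prev = [], 0.0
    for xn in x:
        prev = b0 * xn + a1 * prev
        y.append(prev)
    return y

def iir_unrolled(x, b0, a1):
    """Fig. 3.6(b): two outputs per iteration,
    y[n]   = b0*x[n]   + a1*y[n-1]
    y[n+1] = b0*x[n+1] + a1*b0*x[n] + a1^2*y[n-1]
    (assumes len(x) is even)."""
    y, prev = [], 0.0          # prev holds y[n-1]
    for n in range(0, len(x), 2):
        y.append(b0 * x[n] + a1 * prev)
        y.append(b0 * x[n + 1] + a1 * b0 * x[n] + a1 * a1 * prev)
        prev = y[-1]
    return y
```

Both versions compute the same output sequence (up to rounding), but the unrolled loop body spans two sample periods, which is what permits the supply voltage reduction.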
[Figure 3.6 shows (a) the original first-order recursive signal-flow graph with coefficients b_0 and a_1 and a delay D, and (b) the unrolled signal-flow graph computing two outputs per iteration with coefficients b_0, a_1 b_0, and a_1², and a delay 2D.]
Figure 3.6. (a) Original signal-flow graph. (b) Unrolled signal-flow graph.
3.2.3. Architecture Level
As the algorithm is selected, the architecture can be determined for
the given algorithm.
As we can see from Eq. (3.1), an efficient way to reduce the dynamic power consumption is voltage scaling. When the supply voltage is reduced, the power consumption is reduced, but the gate delay increases. The delay of a minimum-size inverter (0.35 µm standard CMOS technology) as a function of the supply voltage is shown in Fig. 3.7. To compensate for the increased delay, low power techniques like parallelism and pipelining can be used [11].
We demonstrate an example of architecture transformation.
Figure 3.7. Delay vs. supply voltage for an inverter.
Example 3.1. Parallelism [11].
The use of two parallel datapaths is equivalent to interleaving two computational tasks. A datapath that determines the larger of C and (A + B) is shown in Fig. 3.8. It requires an adder and a comparator. The original clock frequency is 40 MHz [11].
In order to maintain the throughput while reducing the power supply voltage, a parallel architecture can be used. The parallel architecture with twice the amount of resources is shown in Fig. 3.9. The clock frequency can be halved, from 40 MHz to 20 MHz, since two tasks are executed concurrently. This allows the supply voltage to be scaled down from 5 V to 2.9 V [11]. Since extra routing is required to distribute computations to the two parallel units, the capacitance load is increased by a factor of 2.15 [11]. Still, this gives a significant power saving [11]:
Figure 3.8. Original datapath.
P_par = C_par · V_par² · f_par = (2.15 C_orig) · (0.58 V_orig)² · (f_orig / 2) ≈ 0.36 P_orig
Example 3.2. Pipelining [11].
Pipelining is another method for increasing the throughput. By adding a pipeline register after the adder in Fig. 3.8, the throughput can be increased from 1/(T_add + T_comp) to 1/max(T_add, T_comp). If T_add is equal to T_comp, this increases the throughput by a factor of 2. With this enhancement, the supply voltage can also in this case be scaled down to 2.9 V (the gate delay doubles) [11]. The effective capacitance increases by a factor of 1.15 because of the insertion of latches [11]. The power consumption for pipelining [11] is
Figure 3.9. Parallel implementation.
P_pipe = C_pipe · V_pipe² · f_pipe = (1.15 C_orig) · (0.58 V_orig)² · f_orig ≈ 0.39 P_orig
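As a quick sanity check of Examples 3.1 and 3.2, the scaling estimates follow directly from P = C·V²·f. The sketch below (plain Python, capacitance/voltage/frequency factors taken from [11], relative to the original datapath) reproduces the 0.36 and 0.39 figures:

```python
# Numeric check of the voltage-scaling estimates in Examples 3.1 and 3.2,
# using P = C * V^2 * f with relative factors quoted from [11].

def relative_power(c_factor, v_factor, f_factor):
    """Power relative to the original datapath (all factors dimensionless)."""
    return c_factor * v_factor**2 * f_factor

p_parallel = relative_power(2.15, 0.58, 0.5)   # 2x resources, half clock rate
p_pipeline = relative_power(1.15, 0.58, 1.0)   # extra latches, same clock rate

assert abs(p_parallel - 0.36) < 0.005   # ~0.36 * P_orig
assert abs(p_pipeline - 0.39) < 0.005   # ~0.39 * P_orig
```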
One beneﬁt of pipelining is the low area overhead in comparison
with using parallel datapaths. The area overhead equals the area of
the inserted latches. Another beneﬁt is that the amount of glitches
can be reduced.
Further power saving can be obtained by parallelism and/or
pipelining. However, since the delay increases signiﬁcantly as the
voltage approaches the threshold voltage and the capacitance load
for routing and/or pipeline registers increases, there exists an
optimal power supply voltage. Reducing the supply voltage below this optimum increases the power consumption.
Locality is also an important issue in architecture trade-offs. On-chip communication through long buses requires a significant amount of power, so reducing such communication is important.
3.2.4. Logic Level
The power consumption depends on the switching activity factor,
which in turn depends on the statistical characteristics of data.
However, most low power techniques from the system level down to the architecture level do not concentrate on this issue. The low power techniques at the logic level, in contrast, focus mainly on reducing the switching activity factor by exploiting signal correlations and, of course, the node capacitances.
Figure 3.10. Pipeline implementation.
As we know from gated clocking, the clock input to a non-active functional block is kept unchanged by the gating, which reduces the switching of the clock network. Precomputation [1] uses the same concept to reduce the switching activity factor: a selected part of the output of a circuit is precomputed before the output is required, and the result is used to gate the remaining inputs to the circuit. This is illustrated in Fig. 3.11. The input data is partitioned into two parts, corresponding to registers R1 and R2. One part, R1, is processed in the precomputation block g one clock cycle before the main computation A is performed. The result from g decides the gating of R2. Power is then saved by the reduced switching activity factor in A.
An example of precomputation for low power is the comparator. The comparator takes the MSBs of the two numbers to register R1 and the remaining bits to R2. The comparison of the MSBs is performed in g. If the two MSBs are not equal, the output from g gates the remaining inputs. In this way, only a small portion of the inputs to the comparator's main block A (a subtractor) changes, and the switching activity is therefore reduced.
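The comparator example can be sketched behaviorally. The code below is a hypothetical model, not a gate-level design: block g compares the MSBs (register R1) and keeps the main block A gated unless they are equal, so A is activated for only about half of uniformly random input pairs.

```python
# Behavioral sketch of the precomputation comparator: the MSBs (register R1)
# are compared first in block g; only when they are equal are the remaining
# bits (register R2) fed to the main block A. Names mirror Fig. 3.11;
# the 8-bit width and the random test data are arbitrary examples.
import random

def compare_precomputed(a, b, bits=8):
    """Returns (a > b, main_block_active)."""
    msb_a, msb_b = a >> (bits - 1), b >> (bits - 1)
    if msb_a != msb_b:                 # g decides from the MSBs alone;
        return msb_a > msb_b, False    # inputs to A stay gated (no switching)
    return a > b, True                 # MSBs equal: main block A must evaluate

random.seed(1)
pairs = [(random.randrange(256), random.randrange(256)) for _ in range(1000)]
activations = sum(compare_precomputed(a, b)[1] for a, b in pairs)

# The result is always correct, but for uniformly random inputs the MSBs
# differ about half the time, so A switches for only ~50% of the comparisons.
assert all(compare_precomputed(a, b)[0] == (a > b) for a, b in pairs)
assert activations < 600
```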
Gate reorganization [12] [32] [57] is a technique to restructure the circuit. This can be decomposition of a complex gate into simple gates, composition of simple gates into a complex gate, duplication of a gate, or deletion/addition of wires. The decomposition of a complex gate and the duplication of a gate help to separate the critical and non-critical paths and to reduce the size of the gates in the non-critical path, and, hence, the power consumption. In some cases, the decomposition of a complex gate increases the circuit speed and gives more space for
Figure 3.11. A precomputation structure for low power.
power supply voltage scaling. The composition of simple gates can reduce the power consumption if the complex gate reduces the charge/discharge of a frequently switching node. The deletion of wires reduces the capacitance load and the circuit size. The addition of wires helps to provide an intermediate circuit that may eventually lead to a better one.
Encoding defines the way data bits are represented in the circuit. The encoding is usually optimized for reduction of delay or area. In low power design, the encoding is instead optimized for reduction of switching activity, since different encoding schemes have different switching properties.
In a counter design, counters with binary and Gray code have the same functionality. For an N-bit counter with binary code, a full counting cycle requires 2(2^N − 1) transitions [63]. A full counting cycle for a Gray coded N-bit counter requires only 2^N transitions. For instance, the full counting cycle for a 2-bit binary coded counter is 00, 01, 10, 11, and back to 00, which requires 6 transitions. The full counting cycle for a 2-bit Gray coded counter is 00, 01, 11, 10, and back to 00, which requires 4 transitions. The binary coded counter has twice as many transitions as the Gray coded counter when N is large. Using a binary coded counter therefore consumes more power than using a Gray coded counter under the same conditions.
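The transition counts in this example are easy to verify by simulation. The sketch below counts the toggled bits over a full counting cycle for both encodings, using the standard binary-to-Gray mapping i XOR (i >> 1):

```python
# Counting bit transitions in a full counting cycle: 2*(2^N - 1) for a
# binary counter versus 2^N for a Gray counter.

def cycle_transitions(n_bits, encode):
    """Total toggled bits over one full counting cycle, including wrap-around."""
    total, count = 0, 2 ** n_bits
    for i in range(count):
        cur, nxt = encode(i), encode((i + 1) % count)
        total += bin(cur ^ nxt).count("1")   # bits that toggle on this step
    return total

binary = lambda i: i
gray = lambda i: i ^ (i >> 1)                # standard binary-to-Gray mapping

assert cycle_transitions(2, binary) == 6     # 00, 01, 10, 11, 00
assert cycle_transitions(2, gray) == 4       # 00, 01, 11, 10, 00
for n in range(1, 8):
    assert cycle_transitions(n, binary) == 2 * (2 ** n - 1)
    assert cycle_transitions(n, gray) == 2 ** n
```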
As we can see from the previous example, the logic coding style has a large impact on the number of transitions. Traditionally, the logic coding style is used to enhance speed. Careful choice of coding style is important to meet the speed requirement while minimizing the power consumption. This can be applied to finite state machines, where the states can be coded with different schemes.
A bus is an on-chip communication channel with a large capacitance. As the on-chip transfer rate increases, the buses contribute a significant portion of the total power. Bus encoding is a technique that exploits properties of the transmitted signal to reduce the power consumption. For instance, adding an extra bit that selects between the inverted and non-inverted bits at the receiver end can save power [53]. Low swing techniques can also be applied to buses [27].
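A minimal sketch of the invert-bit scheme above (often called bus-invert coding), assuming an 8-bit bus and counting only data-line transitions; a complete account must also charge the transitions on the extra invert line:

```python
# Sketch of bus-invert coding [53]: if more than half of the bus lines would
# toggle, transmit the inverted word plus an asserted "invert" line instead.
# Bus width and data values are arbitrary examples.

def bus_invert_encode(words, width=8):
    """Yields (transmitted_word, invert_bit) pairs."""
    prev = 0                                   # assumed initial bus state
    for w in words:
        toggles = bin(prev ^ w).count("1")
        if toggles > width // 2:               # majority of lines would switch
            w_tx, inv = (~w) & ((1 << width) - 1), 1
        else:
            w_tx, inv = w, 0
        yield w_tx, inv
        prev = w_tx

def transitions(seq, start=0):
    """Total toggled data-line bits for a word sequence."""
    total, prev = 0, start
    for w in seq:
        total += bin(prev ^ w).count("1")
        prev = w
    return total

data = [0x00, 0xFF, 0x0F, 0xF0]
plain = transitions(data)
encoded = transitions([w for w, _ in bus_invert_encode(data)])
assert encoded <= plain    # never worse on the data lines
```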
3.2.5. Circuit Level
At the circuit level, the potential power savings are often smaller than at the higher abstraction levels. However, they cannot be ignored. The power savings can be significant since the basic cells are frequently used. A few percent improvement for a D flip-flop can significantly reduce the power consumption in deeply pipelined systems.
In CMOS circuits, the dynamic power consumption is caused by transitions. Spurious transitions typically consume between 10% and 40% of the switching activity power in typical combinational logic [20]. In some cases, like array multipliers, the amount of spurious transitions is large. To reduce the spurious transitions, the delays of signals from registers that converge at a gate should be roughly equal. This can be achieved by insertion of buffers and device sizing [33]. The buffer insertions increase the total load capacitance but can still reduce the spurious transitions. This technique is called path balancing.
Many logic gates have inputs that are logically equivalent, i.e., swapping the inputs does not modify the logic function of the gate. Examples are NAND, NOR, and XOR gates. From the power consumption point of view, however, the order of the inputs does affect the power consumption. For instance, the A input, which is near the output in a two-input NAND gate, consumes less power than the B input close to ground for the same switching activity factor. Pin ordering assigns the more frequently switching signals to the input pins that consume less power. In this way, the power consumption is reduced without cost. However, the statistics of the switching activity factors for the different pins must be known in advance, which limits the use of pin ordering [63].
Different logic styles have different electrical characteristics.
The selection of logic style affects the speed and power
consumption. In most cases, the standard CMOS logic is a good
starting point for speed and power tradeoff. In some cases, for
Figure 3.12. NAND gate.
instance, the XOR/XNOR implementation, other logic styles, like complementary pass-transistor logic (CPL), are more efficient. CPL implements a full adder with fewer transistors than standard CMOS. The evaluation in a CPL full adder is done only with an NMOS transistor network. This gives a small layout as well.
Transistor sizing affects both delay and power consumption. Generally, a gate with smaller size has smaller capacitance and consumes less power, at the cost of a larger delay. To minimize the transistor sizes while meeting the speed requirement is a trade-off. Typically, transistor sizing uses static timing analysis to find the gates whose slack time is larger than zero and whose size therefore can be reduced. Transistor sizing is generally applicable for different technologies.
3.3. Low Power Guidelines
Several approaches to reduce the power consumption have been briefly discussed. Below we summarize some of the most commonly used low power techniques.
• Reduce the number of operations. The selection of algorithm
and/or architecture has signiﬁcant impact on the power
consumption.
• Power supply voltage scaling. Voltage scaling is an efficient way to reduce the power consumption. Since the throughput is reduced as the voltage is reduced, this may need to be compensated for with parallelism and/or pipelining.
Figure 3.13. CPL logic network.
• I/Os between chips can consume significant power due to their large capacitive loads. Reducing the number of chips is a promising approach to reduce the power consumption.
• Power management. In many systems, the most power consuming parts are often idle. For example, in a laptop computer, the display and the hard disk can consume more than 50% of the total power. Using power management strategies to shut down these components when they have been idle for a long time can achieve good power savings.
• Reducing the effective capacitance. The effective capacitance
can be reduced by several approaches, for example, compact
layout and efﬁcient logic style.
• Reduce the number of transitions. To minimize the number of
transitions, especially the glitches, is important.
3.4. Summary
In this chapter we discussed some low power techniques that are
applicable at different abstraction levels.
4
FFT ARCHITECTURES
Not only have several variations of the FFT algorithm been developed since Cooley and Tukey's publication, but also various implementations. Generally, the FFT can be implemented in software, on general-purpose digital signal processors, on application-specific processors, or on algorithm-specific processors.
Software implementations on general-purpose computers can be found in the literature and are still being explored in some projects, for instance, the FFTW project at the MIT Laboratory for Computer Science [28]. Software implementations are not suitable for our target application as the power consumption is too high.
Since it is hard to summarize all other implementations, we will concentrate on algorithm-specific architectures and give only a brief overview of some FFT architectures.
4.1. General-Purpose Programmable DSP Processors
Many commercial programmable DSP processors include special instructions for the FFT computation. Although the performance varies from one to another, most of them are, from the architectural point of view, Harvard architectures. A processor with Harvard architecture has separate buses for program and data.
A typical programmable DSP processor has on-chip data and program memories, an address generator, program control, a MAC, an ALU, and I/O interfaces, as illustrated in Fig. 4.1.
The computation of the FFT with a general-purpose DSP processor does not differ much from the software computation of the FFT on a general-purpose computer.
To compute the FFT with a general-purpose DSP processor requires three steps: first the data input, then the FFT/IFFT computation, and finally the data output. In some DSP processors, for instance TI's TMS320C3x, bit-reversed addressing is available to accelerate the unscrambling for the data output. Typical FFT/IFFT execution times are about 1 ms [2] [41] [55], which is far slower than more specialized implementations. Implementations with general-purpose programmable DSP processors are therefore not applicable due to the throughput requirement.
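The unscrambling that bit-reversed addressing accelerates can be sketched in a few lines. This is an illustrative behavioral model of the addressing mode, not TI-specific code:

```python
# Sketch of bit-reversed addressing for unscrambling radix-2 FFT data:
# address i maps to the address formed by reversing its log2(N) bits.

def bit_reverse(i, n_bits):
    """Reverse the lowest n_bits of i."""
    r = 0
    for _ in range(n_bits):
        r = (r << 1) | (i & 1)   # shift the lowest bit of i into r
        i >>= 1
    return r

def unscramble(seq):
    """Reorder a bit-reverse-ordered sequence into natural order."""
    n_bits = (len(seq) - 1).bit_length()
    return [seq[bit_reverse(i, n_bits)] for i in range(len(seq))]

# For N = 8, the bit-reversed order is 0, 4, 2, 6, 1, 5, 3, 7
# (the input order seen in the signal-flow graph of Fig. 4.3).
assert [bit_reverse(i, 3) for i in range(8)] == [0, 4, 2, 6, 1, 5, 3, 7]
assert unscramble([0, 4, 2, 6, 1, 5, 3, 7]) == list(range(8))
```

In hardware, the same effect is obtained by a dedicated address generator that wires the counter bits in reverse order, which costs no extra cycles.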
4.2. Programmable FFT-Specific Processors
Several programmable FFT processors have been developed for FFT/IFFT computations. These processors are 5 to 10 times faster than the general-purpose programmable DSP processors.
The programmable FFT-specific processors have dedicated butterflies and at least one complex multiplier [65]. The butterfly is usually radix-2 or radix-4. There is often an on-chip coefficient ROM, which
Figure 4.1. General-purpose programmable DSP processor.
stores sine and cosine coefficients. This type of programmable FFT-specific processor is often provided with windowing functions in either the time or the frequency domain.
Zarlink's (formerly Plessey) PDSP16515A processor performs decimation-in-time, radix-4, forward or inverse Fast Fourier Transforms [65]. Data are loaded into an internal workspace RAM in normal sequential order, processed, and then read out in correct order. The processor has two internal workspace RAMs, one output buffer, and one coefficient ROM.
Although the PDSP16515A processor accelerates the FFT computation, it is still hard to meet the throughput requirement with a single processor due to the slow I/O. The processor requires 98 µs to perform a 1024-point FFT with a system clock of 40 MHz. A multiple processor configuration can achieve a higher throughput, but the power consumption is then substantially higher.
A recently released FFT-specific processor from DoubleBW Systems B.V. has a higher throughput (100 Msamples/s) [16], but consumes 8 W at 3.3 V.
Figure 4.2. FFT-specific processor PDSP16515A.
4.3. Algorithm-Specific Processors
Non-programmable algorithm-specific processors can also be designed for the computation of FFT algorithms. These processors are designed mostly for fixed-length FFTs. The architecture of an algorithm-specific FFT processor is therefore optimized with respect to memory structure, control units, and processing elements.
There are mainly three types of algorithm-specific processors: fully parallel FFT processors, column FFT processors, and pipelined FFT processors.
All three types of algorithm-specific processors represent different mappings of the FFT signal-flow graph to hardware structures. The hardware structure in a fully parallel FFT processor is an isomorphic mapping of the signal-flow graph [3]. For example, the signal-flow graph for an 8-point FFT algorithm is shown in Fig. 4.3. An 8-point fully parallel FFT processor requires 24 complex adders and 5 complex multipliers. The hardware requirement is excessive, and such a processor is therefore not power efficient.
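The signal-flow graph of Fig. 4.3 corresponds to the recursive radix-2 DIF computation sketched below (a behavioral model for checking the dataflow, not a hardware description). Each stage forms sums and twiddled differences: exactly the butterfly-plus-multiplier structure that the architectures below map to hardware.

```python
# Minimal radix-2 DIF FFT corresponding to the signal-flow graph of
# Fig. 4.3, verified against a direct DFT. N must be a power of two.
import cmath

def fft_dif(x):
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    w = [cmath.exp(-2j * cmath.pi * k / n) for k in range(half)]
    # DIF butterfly stage: sums feed the even-index half-size FFT,
    # twiddled differences feed the odd-index half-size FFT.
    top = [x[k] + x[k + half] for k in range(half)]
    bot = [(x[k] - x[k + half]) * w[k] for k in range(half)]
    even, odd = fft_dif(top), fft_dif(bot)
    out = [0] * n
    out[0::2], out[1::2] = even, odd   # interleave back to natural order
    return out

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

x = [complex(i, 0) for i in range(8)]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft_dif(x), dft(x)))
```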
To reduce the hardware complexity, a column or a pipelined FFT processor can be used. A set of processing elements in a column FFT processor [21] computes one stage at a time. The results are fed back to the same set of processing elements to compute the next stage. For long transform lengths, the routing for the processing elements is complex and difficult.
Figure 4.3. Signal-flow graph for an 8-point FFT.
For a pipelined FFT processor, each stage has its own set of processing elements. Each stage is computed as soon as its data are available. Pipelined FFT processors have features like simplicity, modularity, and high throughput. These features are important for real-time, in-place applications where the input data often arrive in a natural sequential order. We therefore select the pipeline architecture for our FFT processor implementation.
The most common groups of pipelined FFT architectures are
• Radix-2 multipath delay commutator (R2MDC)
• Radix-2 single-path delay feedback (R2SDF)
• Radix-4 multipath delay commutator (R4MDC)
• Radix-4 single-path delay commutator (R4SDC)
• Radix-4 single-path delay feedback (R4SDF)
• Radix-2² single-path delay commutator (R2²SDC)
We will discuss these pipeline architectures in more detail.
4.3.1. Radix-2 Multipath Delay Commutator
The Radix-2 Multipath Delay Commutator (R2MDC) architecture is the most straightforward approach to implement the radix-2 FFT algorithm using a pipeline architecture [48]. An 8-point R2MDC FFT is shown in Fig. 4.4.
When a new frame arrives, the first four input data are multiplexed to the top-left delay elements in the figure and the next four input data go directly to the butterfly. In this way the first input sample is delayed by four samples and arrives at the butterfly simultaneously with input sample x(4). This completes the start-up of the first stage of the pipeline. The outputs from the first stage butterfly and the multiplier are then fed into the multipath delay commutator
Figure 4.4. An 8-point DIF R2MDC architecture.
between stage 1 and stage 2. There are two paths (multipath) with delay elements and one switch (commutator). The multipath delay commutator alleviates the data dependency problem. The first and second outputs from the upper side of the butterfly are fed into the two upper delay elements. After this, the switch changes, and the third and fourth outputs from the upper output of the first butterfly are sent directly to the butterfly at stage 2. The first and second outputs from the multiplier at the first stage are meanwhile delayed by the upper delay elements, which makes them arrive together with the fifth and sixth outputs from the top.
The butterfly and the multiplier are idle half the time, waiting for new inputs. Hence the utilization of the butterfly and the multiplier is 50%. The total number of delay elements is 4 + 2 + 2 + 1 + 1 = 10 for the 8-point FFT. The total number of delay elements for an N-point FFT can be derived in a similar way and is N/2 + N/2 + N/4 + ... + 2, i.e., 3N/2 − 2. Each stage (except the last one) has one multiplier, so the number of multipliers is log₂(N) − 1.
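The resource counts derived above can be captured in a short helper, assuming N is a power of two:

```python
# Resource counts for an N-point R2MDC pipeline as derived in the text:
# delay elements N/2 + N/2 + N/4 + ... + 2 = 3N/2 - 2, and one complex
# multiplier per stage except the last, i.e. log2(N) - 1.

def r2mdc_delays(n):
    total, stage = n // 2, n // 2      # input multiplexer delay line: N/2
    while stage >= 2:                  # commutator delays between stages
        total += stage
        stage //= 2
    return total

def r2mdc_multipliers(n):
    return n.bit_length() - 2          # log2(n) - 1 for n a power of two

assert r2mdc_delays(8) == 10           # 4 + 2 + 2 + 1 + 1 from the text
assert all(r2mdc_delays(1 << s) == 3 * (1 << s) // 2 - 2
           for s in range(2, 12))
assert r2mdc_multipliers(1024) == 9
```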
4.3.2. Radix-2 Single-Path Delay Feedback
Herbert L. Groginsky and George A. Works introduced a feedback mechanism in order to minimize the number of delay elements [22]. In the proposed architecture, one half of the outputs from each stage is fed back to the input data buffer while the input data are sent directly
to the butterfly. This architecture is called Radix-2 Single-path Delay Feedback (R2SDF). Fig. 4.5 shows the principle of an 8-point R2SDF FFT.
The delay elements at the first stage store four input samples before the computation starts. During the execution they store one output from the butterfly of the first stage while the other output is immediately transferred to the next stage. Thus, while the delay elements are filled with fresh input samples of the next half frame, the results of the previous frame are sent to the next stage. The butterfly is provided with a feedback loop. The modified butterfly is shown on the right side of Fig. 4.5. When the mux is 0, the butterfly is idle and data pass by. When the mux is 1, the butterfly processes the incoming samples. Because of the feedback mechanism, the required number of delay elements is reduced from 3N/2 − 2 to N − 1 (N/2 + N/4 + ... + 1), which is minimal. The number of multipliers is exactly the same as for the R2MDC FFT architecture, i.e., log₂(N) − 1. The utilization of the multipliers and butterflies remains the same, namely 50%.
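The minimal-memory claim is easy to check: the feedback buffers of sizes N/2, N/4, ..., 1 sum to N − 1, compared with 3N/2 − 2 delay elements for the R2MDC:

```python
# The feedback buffers of an N-point R2SDF pipeline hold N/2, N/4, ..., 1
# samples, summing to N - 1 -- the minimum possible.

def r2sdf_delays(n):
    """Sum of the per-stage feedback buffer lengths: N/2 + N/4 + ... + 1."""
    return sum(n >> s for s in range(1, n.bit_length()))

assert r2sdf_delays(8) == 7            # 4 + 2 + 1
assert all(r2sdf_delays(1 << s) == (1 << s) - 1 for s in range(1, 12))
```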
4.3.3. Radix-4 Multipath Delay Commutator
This architecture is similar to the R2MDC. Input data are separated by a 1-to-4 multiplexer and 3N/2 delay elements at the first stage. A 4-path delay commutator is used between two stages. Computation takes place only when the last quarter of the data is multiplexed to the
Figure 4.5. An 8-point DIF R2SDF FFT.
butterfly. The utilization of the butterflies and the multipliers is 25%. The length of the FFT has to be a power of four, i.e., N = 4^n. A 64-point DIF Radix-4 Multipath Delay Commutator (R4MDC) FFT is shown in Fig. 4.6.
Each stage (except the last stage) has 3 multipliers, so the R4MDC FFT requires in total 3(log₄(N) − 1) multipliers for an N-point FFT, which is more than the R2MDC or R2SDF. Moreover, the memory requirement is 5N/2 − 4, which is the largest among the three discussed architectures. From the point of view of hardware cost and utilization, it is not a good structure.
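For comparison, the closed-form resource counts quoted so far can be tabulated; this sketch uses only the formulas from the text:

```python
# Side-by-side resource comparison of the pipeline architectures discussed
# in this section, using the closed-form counts quoted in the text.

def resources(n):
    """Delay elements and complex multipliers; n must be a power of four
    for the radix-4 entry to be meaningful."""
    log2n = n.bit_length() - 1
    log4n = log2n // 2
    return {
        "R2MDC": {"delays": 3 * n // 2 - 2, "multipliers": log2n - 1},
        "R2SDF": {"delays": n - 1,          "multipliers": log2n - 1},
        "R4MDC": {"delays": 5 * n // 2 - 4, "multipliers": 3 * (log4n - 1)},
    }

r = resources(64)
assert r["R4MDC"]["delays"] == 156      # 5N/2 - 4, largest of the three
assert r["R4MDC"]["multipliers"] == 6   # 3 * (log4(64) - 1)
assert r["R2SDF"]["delays"] == 63       # N - 1, the minimum
```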
4.3.4. Radix-4 Single-Path Delay Commutator
To increase the utilization of the butterflies, G. Bi and E. V. Jones [4] proposed a simplified radix-4 butterfly. The simplified radix-4 butterfly produces only one output at a time, in comparison with 4 for the conventional butterfly. To provide the same four outputs, the butterfly operates four times instead of once. Due to this modification the butterfly has a utilization of 4 · 25%, or 100%. To accommodate this change we must provide the same four data at four different times to the butterfly. A few more delay elements are required with this architecture. Furthermore, the simplified butterfly needs additional control signals, and so do the commutators. The number of multipliers is log₄(N) − 1, which is less than for the R4MDC FFT architecture. The utilization of the multipliers is 75% due to the fact that at least one-fourth of the data are multiplied with the trivial
Figure 4.6. A 64-point DIF R4MDC FFT.
twiddle factor 1 (no multiplication is needed). The structure of a 16-point DIF Radix-4 Single-Path Delay Commutator (R4SDC) FFT is shown in Fig. 4.7.
The main benefit of this architecture is the improved utilization of the butterflies. The cost for the R4SDC is an increased number of delay elements.
4.3.5. Radix-4 Single-Path Delay Feedback
Radix-4 single-path delay feedback (R4SDF) [15] [62] is a radix-4 version of the R2SDF. Since the radix-4 algorithm is used, the number of multipliers is reduced to log₄(N) − 1, compared to log₂(N) − 2 for the R2SDF. However, the utilization of the butterflies is reduced to 25%. The radix-4 SDF butterflies also become more complicated than the radix-2 SDF butterflies. A 64-point DIF R4SDF FFT is illustrated in Fig. 4.8.
Figure 4.7. A 16-point DIF R4SDC FFT.
Figure 4.8. A 64-point DIF R4SDF FFT.
The radix-4 SDF butterfly is shown in Fig. 4.9. The data are sent to the butterfly for processing when the mux is 1; otherwise the data are shifted into a delay line with a length of 3N/4 (first stage).
4.3.6. Radix-2² Single-Path Delay Commutator
The Radix-2² Single-path Delay Commutator (R2²SDC) architecture [24] uses a modified radix-4 DIF FFT algorithm. It has the same butterfly structure as the radix-2 DIF FFT, but places the multipliers at the same positions as in the radix-4 DIF FFT. Basically, two kinds of radix-2 SDF butterflies are used to achieve the same outputs (but not in the same order) as a radix-4 butterfly. By reducing the radix from 4 to 2, the utilization of the butterflies increases from 25% to 50%. The number of multipliers is reduced compared to the conventional radix-2 algorithm. This approach is based on a 4-point DFT.
The outputs are bit-reversed instead of 4-reversed as in a conventional radix-4 algorithm.
Figure 4.9. Radix-4 SDF butterfly.
Figure 4.10. A 64-point DIF R2²SDC FFT.
4.4. Summary
In this chapter, several FFT implementation classes were discussed. The programmable DSP and FFT-specific processors cannot meet the requirements of both high throughput and low power applications. Algorithm-specific implementations, especially with pipelined FFT architectures, are better in this respect.
5
IMPLEMENTATION OF
FFT PROCESSORS
In this chapter, we discuss the implementation of FFT processors. In VLSI design, the design method is an important guide for the implementation. We follow the meet-in-the-middle design methodology.
5.1. Design Method
As the transistor feature size is scaled down, more and more functionality can be integrated on a single chip. High speed, high complexity, and short design time are typical requirements on VLSI designs. The design methodology must therefore cope with the increasing complexity using a systematic approach. A design methodology is the overall strategy to organize and solve the design tasks at the different steps of the design process [24].
The bottom-up methodology, which builds the system by assembling existing building blocks, can hardly keep up with the performance and communication requirements of current systems. Hence the bottom-up methodology is not suitable for the design of complex systems.
In the top-down design methodology, the system requirements and organization are developed by successive decomposition. Typically, a high-level design language is used to define the system functionality. After a number of decomposition steps, the system is described by an HDL, which can be used for automatic logic synthesis. A drawback with this design approach is that the result relies heavily on the synthesis tools. If the final result fails to meet the performance requirements, the whole design has to be redone.
In the meet-in-the-middle methodology, the specification-synthesis process is carried out in an essentially top-down fashion, but the actual design of the building blocks is performed bottom-up. This is illustrated in Fig. 5.1. The design process is therefore divided into two almost independent parts that meet in the middle. The circuit design phase can be shortened by using efficient circuit design tools or even automatic logic synthesis tools. Often, some of the building blocks are already available in a circuit library.
In our target application, the requirement for the FFT processor
is speciﬁed. The design process starts with creation of functional
speciﬁcation of the FFT processor. This results in a highlevel
model. The highlevel model is then validated by a testbench for the
FFT algorithm. The testbench can be reused for successive models.
After the system functionality is validated by simulation, the
functional speciﬁcation is mapped into an architectural
speciﬁcation.
In the architectural specification, the detailed computation process is mapped onto hardware. Different functionalities are partitioned and mapped into hardware or software components. Detailed communications between the components are decided in the architecture specification. After the architecture model is created, it needs to be simulated for performance evaluation and validation. Basically, the software and hardware designs are
Figure 5.1. The meet-in-the-middle methodology.
separated after this architectural partitioning. Since the FFT
processor is completely implemented in hardware, the partitioning
of software and hardware is not necessary.
Once an architecture is selected, the individual hardware blocks are refined by adding implementation details and constraints. In this phase, we apply the bottom-up design methodology. Different sub-blocks are built from cells and combined into blocks.
5.2. High-level Modeling of an FFT Processor
High-level modeling serves two purposes: to create a cycle-true model of the algorithm and hardware architecture, and to simulate, validate, and optimize the high-level model.
Since the whole FFT processor is implemented in hardware, software/hardware co-design is not needed. As mentioned previously, we do not need to determine the system specification since it is given. The system specification for the FFT processor has been defined as
• Transform length is 1024
• Transform time is less than 40 µs (continuously)
• Continuous I/O
• 25.6 Msamples/sec. throughput
• Complex 24-bit I/O data
According to the meet-in-the-middle design methodology, the high-level design is a top-down process. We start with the resource analysis.
5.2.1. Resource Analysis
The highlevel design can be divided into several tasks:
• Architecture selection
• Partitioning
• Scheduling
• RTL model generation
• Validation of models
68 Chapter 5
The first three tasks are associated with each other, and the aim is to allocate resources that meet the system specification. In an ASIC implementation the resources are constrained; hence a resource analysis is required.
There are many possible architectures for FFT processors. Among them, the pipelined FFT architectures are particularly suitable for real-time applications, since they easily accommodate the sequential nature of sampling.
A pipelined FFT architecture can be divided into a datapath and a control part. Since the control part is much simpler than the datapath with respect to both hardware and power consumption, the resource analysis concentrates on the datapath.
The datapath of the FFT processor consists of memories, butterflies, and complex multipliers. We discuss them separately.
5.2.1.1. Butterﬂies
From the specification, the computation time for the 1024-point FFT processor is

t_FFT = 40 µs = 4 x 10^(-5) s   (5.2)

With the radix-2 algorithm, the number of butterfly operations is (N/r) log_r(N) = (N/2) log_2(N) = 5120. A butterfly can be implemented with parallel adders/subtractors using one clock cycle. Hence the minimum number of butterflies is

N_BF = No_BFop · t_BFop / t_FFT = 5120 / (4 x 10^(-5) x 25.6 x 10^6) = 5   (5.3)

This is optimal under the assumption that ALL data are available to ALL stages, which is impossible for continuous data streams. Each butterfly has to be idle 50% of the time in order to reorder the incoming data. The allocation of butterfly operations from two stages to the same butterfly is not possible with as-soon-as-possible (ASAP) scheduling. Therefore the number of butterflies is 10, i.e., equal to the number of stages.
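As a sanity check, the butterfly allocation above can be reproduced in a few lines; the function `min_butterflies` and its ceiling-division form are our own sketch, not code from the thesis:

```python
import math

def min_butterflies(n, radix, t_fft, f_clk):
    """Lower bound on the number of butterflies, as in Eq. (5.3): total
    butterfly operations divided by the clock cycles available in t_fft,
    assuming one butterfly operation per clock cycle."""
    ops = (n // radix) * round(math.log(n, radix))   # (N/r) * log_r(N)
    cycles = t_fft * f_clk                           # cycles available in t_fft
    return math.ceil(ops / cycles)

# 1024-point radix-2 FFT, 40 us transform time, 25.6 MHz clock
print(min_butterflies(1024, 2, 40e-6, 25.6e6))       # prints 5
```

With continuous I/O and ASAP scheduling this lower bound cannot be reached, which is why the text allocates one butterfly per stage.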
By a similar argument, the number of butterflies for a radix-4 pipeline architecture is equal to the number of stages.
5.2.1.2. Complex Multipliers
The number of complex multiplications is

N_cmult ≈ (N/r)(r - 1)(log_r(N) - 1)   (5.4)

where N is the transform length and r is the radix. It does not include the complex multiplications within the r-point DFTs.
For the radix-2 algorithm, the number of complex multiplications is about 4068. A complex multiplication can be computed in either one clock cycle or two clock cycles (pipelining). Assuming fast complex multipliers (one complex multiplication per clock cycle), the minimum number of complex multipliers is

N_cmult,min = N_cmult · t_cmult / t_FFT = 4068 / (4 x 10^(-5) x 25.6 x 10^6) ≈ 4   (5.5)

Since resource sharing between two stages is not possible for pipeline architectures, the number of complex multipliers is 9, i.e., each stage except the last has its own complex multiplier.
For the radix-4 algorithm, the number of complex multipliers is 4.

5.2.1.3. Memories

The memory requirement increases linearly with the transform length. For a 1024-point FFT processor, the memories dissipate more power than the complex multipliers.
The size of the memories is determined by the maximum amount of live data, which in turn is determined by the architecture. In general, the architectures with feedback are efficient in terms of memory utilization.
5.2.2. Validation of the High-Level Model

After the resource analysis, the next step is to model the FFT algorithm at a high level. For fast evaluation, the algorithm is described in a high-level programming language, such as C or Matlab.
The validation of the high-level model is done through simulation and comparison. The interface between the model and the testbench is plain text files: the input data is stored in a text file and read by the model, and the output data from the model is also saved in a text file. This gives freedom in the construction of the model and the testbench: the model can be written in C, Matlab, or VHDL, and the testbench can also be written in C or Matlab. Moreover, it is easy to convert from floating-point to fixed-point arithmetic. The same testbench can be reused by changing the input/output file arithmetic.
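The file-based testbench idea can be sketched as follows; the file names and the trivial stand-in "model" are illustrative only, not the actual FFT model:

```python
import tempfile, os

def write_vectors(path, vectors):
    """One complex sample per line: real and imaginary parts as text."""
    with open(path, "w") as f:
        for re, im in vectors:
            f.write(f"{re} {im}\n")

def read_vectors(path):
    with open(path) as f:
        return [tuple(map(float, line.split())) for line in f]

def run_model(in_path, out_path):
    # stand-in for the device under test: here it just scales the data;
    # the real model could equally be a C, Matlab, or VHDL program
    data = read_vectors(in_path)
    write_vectors(out_path, [(re * 0.5, im * 0.5) for re, im in data])

d = tempfile.mkdtemp()
in_file, out_file = os.path.join(d, "input.txt"), os.path.join(d, "output.txt")
write_vectors(in_file, [(1.0, -2.0), (0.25, 0.75)])
run_model(in_file, out_file)
print(read_vectors(out_file))   # [(0.5, -1.0), (0.125, 0.375)]
```

Because model and testbench only meet at the text files, either side can be swapped out (e.g. floating-point model for fixed-point model) without touching the other.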
Table 5.1. Memory requirement and utilization for pipelined architectures.

Architecture | Memory requirement [words] | Memory utilization
R2MDC  | N/2 + N/2 + ... + 2 = 3N/2 - 2     | 66%
R2SDF  | N/2 + N/4 + ... + 1 = N - 1        | 100%
R4MDC  | 3N/4 + 3N/16 + ... + 12 = 5N/2 - 4 | 40%
R4SDF  | 3N/4 + 3N/16 + ... + 3 = N - 1     | 100%
R4SDC  | 3N/2 + 3N/8 + ... + 6 = 2N - 2     | 50%
R2^2SDF | N/2 + N/4 + ... + 1 = N - 1       | 100%
Figure 5.2. Testbench: an input text file feeds the device under test, and the response is written to an output text file.
5.2.3. Wordlength Optimization
In the pipelined FFT architectures, most research effort has been devoted to regular, modular implementations, which use a fixed wordlength for both data and coefficients in every stage. The possibility of using different wordlengths, which the pipeline architecture provides, is often ignored in order to achieve modular solutions.
Based on the observation that the wordlengths of different stages in a pipelined FFT processor can differ, we proposed a wordlength optimization method [34]. We first tune the wordlength of the data memory (data RAM) at each stage separately to make sure that the precision requirement is met, and then we adjust the wordlength of the coefficient ROM at each stage. Because our focus is on reducing the power consumption of the data memory, the strategy is that the larger the RAM block in a stage, the shorter its wordlength should be. The conventional uniform wordlength scheme for both data memory and coefficient ROM was also simulated. To obtain the optimal wordlength profile, numerous design iterations have been performed.
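A minimal sketch of the per-stage wordlength tuning step, assuming a simple round-to-nearest fixed-point quantizer and a worst-case error criterion; the thesis tunes against full FFT simulations with the test vectors described below, not against this toy criterion:

```python
import random

def quantize(x, bits):
    """Round x to a fixed-point grid with `bits` fractional bits."""
    step = 2.0 ** (-bits)
    return round(x / step) * step

def shortest_wordlength(samples, max_error, max_bits=24):
    """Shortest fractional wordlength whose worst-case quantization error
    over the given test vectors stays within max_error."""
    for bits in range(1, max_bits + 1):
        if max(abs(quantize(x, bits) - x) for x in samples) <= max_error:
            return bits
    return max_bits

random.seed(1)
vectors = [random.uniform(-1, 1) for _ in range(1000)]
print(shortest_wordlength(vectors, max_error=2.0 ** -10))
```

In the real flow this search is repeated per stage, first for the data RAM wordlengths and then for the coefficient ROM wordlengths, as Fig. 5.3 shows.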
Figure 5.3. Wordlength optimization for pipelined FFTs: the data wordlengths are tuned first with random test vectors, then the coefficient wordlengths with sine-wave test vectors; each step is iterated until the precision check passes.
Two types of test vectors are used in our simulations: sine waves and random numbers. The sine-wave stimuli are sensitive to the precision of the coefficient representation, while random samples are effective for checking the precision of the butterfly calculations. To make the results highly reliable, 100,000 sets of random samples were generated and fed into the simulator. The optimization results for 1024-point pipelined FFT architectures are shown in Table 5.2.
5.3. Subsystems
Once the RTL model of the FFT processor is created and validated, the subsystems can be constructed according to the meet-in-the-middle design methodology.
For the subsystem design there are two methods: the semi-custom method and the full-custom method. The semi-custom method has a shorter design time: the RTL description in an HDL is synthesized with a synthesis tool, and the synthesis result is fed to a place-and-route tool for the final layout. However, this methodology relies on the synthesis and place-and-route tools, and the designer has less control over the design process. Moreover, most synthesis tools use static timing analysis and do not consider the interconnections during synthesis. The designer has to increase the timing margins during synthesis to meet the speed requirement after place-and-route. The resulting designs are often unnecessarily large. In our case, the impact of power supply voltage scaling is hard to predict, since the characterization of the cells is done at the normal supply voltage. We therefore select a full-custom design for the FFT processor, but use the semi-custom method for the control path, where the timing is not critical.

Table 5.2. Wordlength optimization.

Architecture | Memory size, fixed wordlength | Memory size, optimized wordlength | Saving
R2MDC   | 42952 bits | 42824 bits | 0%
R2SDF   | 28644 bits | 28580 bits | 0%
R4MDC   | 71552 bits | 61488 bits | 14%
R4SDF   | 28644 bits | 24708 bits | 14%
R4SDC   | 57288 bits | 49176 bits | 14%
R2^2SDF | 28672 bits | 28580 bits | 0%
In the following we introduce the subsystem designs for the FFT processor. The main subsystems are the memories, the butterflies, and the complex multipliers.

5.3.1. Memory

In many DSP processors, the memory contributes a significant portion of the area and power consumption of the whole processor. In the 1024-point FFT processor, the memory is the most significant part in both area and power consumption. Hence the low power design of the memories is a key issue for the FFT processor.
5.3.1.1. RAM
The data are stored in RAMs. There are two main types of RAM: static RAM (SRAM) and dynamic RAM (DRAM). Since DRAM often requires a special process that is not available in a standard CMOS technology, and SRAM is more suitable for low voltage operation, we select SRAM for the data storage. An overview of an SRAM is shown in Fig. 5.5.

Figure 5.4. Datapath for a stage in a pipelined FFT processor: data reordering (memories), butterfly, and complex multiplier between input and output.

The SRAM consists of four parts [30]:
• memory cell array
• decoders
• sense amplifiers
• periphery circuits
We discuss the implementation of the first three parts, which are the main parts of the SRAM.
Memory array

The memory array dominates the SRAM area. The keys to the design of the memory array are the cell area and the noise immunity of the bitlines.
The memory cell is the basic building block of the SRAM, and its size is of importance. Even though a four-transistor (4T) memory cell has a smaller area than a six-transistor (6T) cell, its leakage current at low voltage is considerably larger. We therefore use a 6T memory cell. A typical 6T memory cell is shown in Fig. 5.6.
Figure 5.5. Overview of an SRAM: address decoder, cell array, sense amplifiers, data I/O, and control circuits.
The stability of the memory cell during the read and write operations determines the device sizing [52]. The cell designer has to consider process variations, short-channel effects, soft error rate, and low supply voltage [23]. The width ratio β = W_n / W_a is determined by the read operation: a larger β means less chance that the SRAM cell changes its state during a read operation. The read stability can be measured by the static noise margin (SNM). The width ratio α = W_p / W_a affects the write operation: a larger α means that data is more difficult to write into the cell. The width of the access NMOS transistor, W_a, is set to minimum size for minimal cell area. Normally, α is 1~2 and β is 2~3.

Figure 5.6. SRAM cell (schematic and layout), with bitlines BL and BL_bar, wordline WWL, and transistor widths W_p, W_n, and W_a.
Example 6. SNM for an SRAM cell.
The SNM can be simulated with SPICE. The SNM of an SRAM cell with β = 2 in a standard 0.35 µm CMOS technology is shown in Fig. 5.7.
In order to reduce the power consumption and speed up the read access, most SRAMs read data through a pair of bitlines with a small voltage swing. The swing between the two bitlines is usually about 100 mV to 300 mV, which makes it sensitive to noise. As the power supply voltage decreases, the effect of noise becomes more important. To reduce noise from outside, the memory array is surrounded by a guard ring, which reduces the substrate-coupling noise. To avoid coupling noise from nearby bitline pairs, we use a twisted-bitline layout. Thus the coupling from nearby bitline pairs does not affect the voltage difference between the bitlines.
Figure 5.7. SNM of an SRAM cell vs. power supply voltage.

Figure 5.8. Noise reduction for the memory array: guard ring and twisted bitlines.
Decoder

The decoder can be realized with a hierarchical structure, which reduces both the delay and the activity factor. The row decoder can use either a NOR-NAND decoder or a tree decoder. The tree decoder requires fewer transistors but suffers from speed degradation due to the serial connection of pass transistors, which increases the delay (and becomes worse at lower power supply voltages). The NOR-NAND decoder has a regular layout but requires more transistors. For small decoders the tree decoder is preferred, and for larger decoders the NOR-NAND decoder is preferred.
For a large decoder, a wordline enable signal is added to the decoder. It controls the width of the active wordline pulse and reduces the glitches of the wordline drivers. This reduces the power consumption of the decoder.
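At the logic level, the behaviour that either decoder style realizes can be sketched as follows; the function name is ours, and the gate-level differences between the NOR-NAND and tree realizations (transistor count, delay) are of course not visible at this level:

```python
def row_decode(address_bits):
    """Behavioural model of a row decoder: from n address bits (MSB
    first), exactly one of the 2^n wordlines goes active."""
    address = 0
    for bit in address_bits:
        address = (address << 1) | bit
    return [1 if i == address else 0 for i in range(1 << len(address_bits))]

wordlines = row_decode([1, 0, 1])           # address 5 of a 3-to-8 decoder
print(wordlines.index(1), sum(wordlines))   # prints: 5 1
```

A hierarchical decoder splits this function into predecoders and a final stage, so that only a small part of the logic toggles for a given address change, which is where the activity-factor saving comes from.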
Sense Amplifier

The sense amplifier amplifies the small-swing bitline signals during a read operation. For fast access, the sense amplifier is designed with a high gain. This high gain requirement in turn demands a high current, and hence a high power consumption, in the sense amplifier. One way to reduce the power consumption is to reduce the active time of the sense amplifier, which can be achieved with a pulsed sense enable signal.
At low supply voltage, the current-mode sense amplifier is less suitable. We therefore modified an STC D-flipflop [64] to form a two-stage latch-type sense amplifier. The sense amplifier is functional at supply voltages as low as 0.9 V.

Figure 5.9. STC D-flipflop, with inputs BL, BL_bar, and SE_bar and outputs Dout and Dout_bar.
The simulated waveforms for the read operation are shown in Fig. 5.10. The access time is 11 ns in a standard 0.35 µm CMOS technology with a typical process at 85 ˚C. The power consumption of the sense amplifier is 59.5 µW per bit at 50 MHz.
The simulated waveforms for the write operation are shown in Fig. 5.11. The total power consumption is 83.4 µW per bit at 50 MHz.
Figure 5.10. Read operation (simulated waveforms; the marked access time is 11 ns).
Figure 5.11. Write operation (simulated waveforms).
The pulse widths of the wordline signal and the sense enable signal must be selected carefully. A short pulse dissipates less power, but it must be sufficiently long to guarantee the read operation under process variations, low power supply voltage, etc.
Among the periphery circuits, the I/O drivers have a large capacitive load, and reducing their short-circuit current is an important issue. Avoiding simultaneous switching of the PMOS and NMOS transistors is an efficient technique for reducing the short-circuit current.
5.3.1.2. Implementation
A 256-word x 26-bit SRAM with separate I/O (Fig. 5.12) has been implemented with the techniques discussed above. The SRAM, which runs at 1.5 V and 50 MHz, consumes 2.6 mW. A module generator for SRAMs is under development.
5.3.2. Butterfly

The butterfly is one of the characteristic building blocks of an FFT processor. It consists mainly of adders/subtractors. Hence we first discuss the implementation of the adder/subtractor, and then the complete butterfly.

5.3.2.1. Adder design

The adder is one of the fundamental arithmetic components, and there are many adder structures [47].
Figure 5.12. SRAM macro (1.27 x 0.33 mm²).
The ripple-carry adder (RCA) is constructed from full-adders. The RCA is the slowest among the different implementations, but it is simple and consumes a small amount of power for 16-bit implementations. If the wordlength is small, the RCA is a suitable choice for the butterfly.
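The serial carry propagation that makes the RCA slow (and simple) can be illustrated with a small behavioural model of our own; each loop iteration corresponds to one full-adder in the chain:

```python
def full_adder(a, b, cin):
    """One-bit full-adder: sum and carry-out from two bits and a carry-in."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(x, y, width):
    """Bit-level ripple-carry addition: the carry ripples serially through
    the chain of full-adders, so the delay grows linearly with the width."""
    s, carry = 0, 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        s |= bit << i
    return s, carry

print(ripple_carry_add(13, 7, 5))   # prints (20, 0) for a 5-bit adder
```

The returned carry is the carry-out of the most significant full-adder, which signals overflow of the unsigned result.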
When speed is important, for instance for the vector merge adder in the multiplier, the RCA cannot meet the speed requirement. In these cases, other carry-accelerating adder structures are attractive. We select the Brent-Kung adder for the high-speed adder implementation: it has a short delay and a regular structure. It is discussed later in the complex multiplier design.
RCA implementation

We have developed a program that generates the schematic and the layout for an RCA. A CMOS full-adder layout is shown in Fig. 5.13, and a 3-bit RCA layout with sign extension is shown in Fig. 5.14.
Figure 5.13. CMOS full-adder (size: 18.7 x 15.0 µm²).
5.3.2.2. High radix butterfly architecture

The use of a higher radix tends to reduce the memory access rate and the arithmetic workload, and hence the power consumption [39] [60]. Efficient design of high-radix butterflies is therefore important. In practice, the commonly used high-radix butterflies are the radix-4 and radix-8 butterflies. Butterflies of higher radix than 8 are often decomposed into lower-radix butterflies.
A conventional butterfly is often based on an isomorphic mapping of the signal-flow graph. The signal-flow graph for a radix-4 butterfly is shown in Fig. 5.15. The butterfly requires 8 complex adders/subtractors and has a delay of 2 additions/subtractions.
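The signal-flow graph of Fig. 5.15 corresponds to the following computation; this is a plain behavioural sketch (our own), not the hardware mapping, and it shows why the radix-4 butterfly needs 8 complex additions/subtractions in two levels while the -j twiddle is free:

```python
def radix4_butterfly(x0, x1, x2, x3):
    """4-point DFT, i.e., one radix-4 butterfly without external twiddles."""
    # first level of complex additions/subtractions
    a, b = x0 + x2, x0 - x2
    c, d = x1 + x3, x1 - x3
    # second level; multiplying d by -j is just a swap with a sign change
    return (a + c, b - 1j * d, a - c, b + 1j * d)

print(radix4_butterfly(1, 0, 0, 0))   # an impulse gives a flat spectrum
```

Counting the operations above gives exactly the 8 complex additions/subtractions and the depth of 2 mentioned in the text.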
To reduce the complexity, we proposed a carry-save based butterfly [36]. The computation of a radix-4 butterfly is divided into two steps. The first step is a 4:2 compression with addition/subtraction-controlled inputs. The second step is a normal addition. The delay is changed from two additions/subtractions to one addition and one 4:2 compression. This implementation reduces the hardware, since a fast adder is more complex than a 4:2 compressor. The total delay is also reduced, since the delay of a 4:2 compressor is smaller. The implementation of a radix-4 butterfly with carry-save adders is shown in Fig. 5.16. In the figure only real additions are shown, so it appears more complicated than Fig. 5.15, where the additions are complex additions.

Figure 5.14. Layout of the 3-bit ripple-carry adder (three full-adders plus sign extension).

Figure 5.15. Signal-flow graph for the 4-point DFT.
Carry-save radix-4 butterfly implementation.

A carry-save radix-4 butterfly (with a wordlength of 15 bits for the real and imaginary parts of the input) was described in VHDL and synthesized in the AMS 0.8 µm standard CMOS technology [36]. The synthesis results show an area saving of up to 21% for the carry-save radix-4 butterfly, and a delay reduction of 22%.
The radix-2/4 split-radix butterfly and the radix-8 butterfly can also be implemented using carry-save adders.

Table 5.3. Performance comparison for two radix-4 butterflies.

Architecture | Area     | Delay @ 3.3 V, 25 ˚C
Conventional | 10504.16 | 12.32 ns
Carry-save   |  8266.48 |  9.59 ns
Figure 5.16. Parallel radix-4 butterfly built from (4,2)-counters, fast adders, and inverters.
5.3.3. Complex Multiplier
There is no question that the complex multipliers are among the critical units in FFT processors. From a speed point of view, the complex multiplier is the slowest part of the datapath; with pipelining, the throughput can be increased while the latency remains the same. From a power consumption point of view, the complex multipliers accounted for about 70% to 80% of the total power consumption in previous FFT implementations [39] [60]. This share has dropped below 50%, since the power consumption of the memories grows as the FFT transform length increases [37]. Hence the complex multipliers are key components in FFT design.
A straightforward implementation (see Fig. 5.17 (a)) of a complex multiplication requires four real multiplications, one addition, and one subtraction. The number of multiplications can, however, be reduced to three by a transformation, at the cost of extra pre- and post-additions (see Fig. 5.17 (b)). A more efficient way to reduce the cost of the multiplication is to use distributed arithmetic [35] [58].
5.3.3.1. Distributed Arithmetic
Distributed arithmetic (DA) uses precomputed partial sums for an
efﬁcient computation of inner products of a constant vector and a
variable vector [14].
Figure 5.17. Realization of a complex multiplication: (a) directly with four real multiplications, (b) with three real multiplications and extra pre- and post-additions using the factors C_I + C_R and C_I - C_R.
Let C_R + jC_I and X_R + jX_I be two complex numbers, of which C_R + jC_I is the coefficient and X_R + jX_I is a variable. For a complex multiplication we have

Z_R + jZ_I = (C_R + jC_I)(X_R + jX_I) = (C_R X_R - C_I X_I) + j (C_R X_I + C_I X_R)   (5.1)

Hence, a complex multiplication can be considered as two inner products of vectors of length two. We realize the real and imaginary parts separately.
For the sake of simplicity, we consider only the first inner product in Eq. (5.1), i.e., the real part. The complex coefficient C_R + jC_I is assumed to be fixed, and two's-complement representation is used for both the coefficient and the data. The data is scaled so that |Z_R + jZ_I| is less than 1. The inner product Z_R can be rewritten as

Z_R = C_R (-x_R0 + Σ_{k=1}^{W_d-1} x_Rk 2^(-k)) - C_I (-x_I0 + Σ_{k=1}^{W_d-1} x_Ik 2^(-k))

where x_Rk and x_Ik are the k-th bits of the real and imaginary parts, respectively. By interchanging the order of the summations we get

Z_R = -(C_R x_R0 - C_I x_I0) + Σ_{k=1}^{W_d-1} (C_R x_Rk - C_I x_Ik) 2^(-k)   (5.2)

which can be written as

Z_R = -F(x_R0, x_I0) + Σ_{k=1}^{W_d-1} F(x_Rk, x_Ik) 2^(-k)   (5.3)

where F(x_Rk, x_Ik) = C_R x_Rk - C_I x_Ik.
F is a function of two binary variables, i.e., the k-th bits of X_R and X_I. Since F can take on only four values, it can be precomputed and stored in a lookup table.
In the same way, the corresponding binary function for the imaginary part is G(x_Rk, x_Ik) = C_I x_Rk + C_R x_Ik.
Further reduction of the lookup table can be done by Offset
Binary Coding [14].
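The decomposition of Eqs. (5.1)-(5.3) can be verified numerically; the wordlength, the helper names, and the coefficient values below are example choices of ours:

```python
WD = 12                                   # data wordlength W_d (example)

def to_bits(x):
    """Two's-complement fractional bits of x = -x0 + sum x_k 2^-k, |x| < 1."""
    q = round(x * (1 << (WD - 1)))
    q &= (1 << WD) - 1                    # WD-bit two's-complement pattern
    return [(q >> (WD - 1 - k)) & 1 for k in range(WD)]

def da_real_part(cr, ci, xr, xi):
    """Z_R accumulated bit-serially from the 4-entry lookup table F."""
    xr_bits, xi_bits = to_bits(xr), to_bits(xi)
    F = {(br, bi): cr * br - ci * bi for br in (0, 1) for bi in (0, 1)}
    acc = -F[(xr_bits[0], xi_bits[0])]    # the sign bits carry negative weight
    for k in range(1, WD):
        acc += F[(xr_bits[k], xi_bits[k])] * 2.0 ** (-k)
    return acc

cr, ci, xr, xi = 0.71875, -0.40625, 0.3125, -0.15625
print(abs(da_real_part(cr, ci, xr, xi) - (cr * xr - ci * xi)) < 1e-9)   # True
```

The four table entries are 0, C_R, -C_I, and C_R - C_I, addressed by one bit of X_R and one bit of X_I per iteration, exactly as in Eq. (5.3).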
5.3.3.2. Offset Binary Coding
The offset binary coding can be applied to distributed arithmetic by using the following expression for the data:

x = -(x_0 - x̄_0) 2^(-1) + Σ_{i=1}^{W_d-1} (x_i - x̄_i) 2^(-i-1) - 2^(-W_d)   (5.4)

where b̄ denotes the inverse of bit b.
Without any loss of generality, we assume that the magnitudes of C_R, C_I, X_R, and X_I are all less than 1, and that the wordlength of X_R and X_I is W_d. The complex multiplication can then be written

Z_R + jZ_I = (C_R X_R - C_I X_I) + j (C_R X_I + C_I X_R)
  = [ -F(x_R0, x_I0) 2^(-1) + Σ_{i=1}^{W_d-1} F(x_Ri, x_Ii) 2^(-i-1) + F(0, 0) 2^(-W_d) ]
  + j [ -G(x_R0, x_I0) 2^(-1) + Σ_{i=1}^{W_d-1} G(x_Ri, x_Ii) 2^(-i-1) + G(0, 0) 2^(-W_d) ]   (5.5)

The functions F and G can be expressed as

F(x_Ri, x_Ii) = C_R (x_Ri - x̄_Ri) - C_I (x_Ii - x̄_Ii)   (5.6)

G(x_Ri, x_Ii) = C_I (x_Ri - x̄_Ri) + C_R (x_Ii - x̄_Ii)   (5.7)
In Eqs. (5.6) and (5.7), the factor (x_i - x̄_i) is either 1 or -1. Hence the partial product, i.e., the function F (G), for each bit is of the form ±C_R ± C_I. All possible partial products are tabulated in Table 5.4. Obviously, it is sufficient to store only the two coefficients -(C_R - C_I) and -(C_R + C_I), since (C_R - C_I) and (C_R + C_I) are easily generated from them by inverting all bits and adding 1 in the least significant position.
The complex multiplier with distributed arithmetic is illustrated in Fig. 5.18. The accumulators, which add the partial products, are the same as in a real multiplication. The partial product generation is only slightly more complicated than for a real multiplier. Hence, the complexity of the complex multiplier in terms of chip area corresponds approximately to two real multipliers.
Table 5.4. Partial product generation.

x_Ri | x_Ii | F(x_Ri, x_Ii) | G(x_Ri, x_Ii)
 0   |  0   | -(C_R - C_I)  | -(C_R + C_I)
 0   |  1   | -(C_R + C_I)  |  (C_R - C_I)
 1   |  0   |  (C_R + C_I)  | -(C_R - C_I)
 1   |  1   |  (C_R - C_I)  |  (C_R + C_I)

Figure 5.18. Block schematic of the complex multiplier: X_R and X_I address the partial product generation for F and G from the precomputed factors (C_I + C_R) and (C_R - C_I), followed by two accumulators producing Z_R and Z_I.
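Table 5.4 follows directly from Eqs. (5.6) and (5.7), which a small check confirms; the coefficient values are example choices of ours:

```python
def F(cr, ci, xr, xi):
    """Eq. (5.6) with (x - xbar) rewritten as 2x - 1 for bits in {0, 1}."""
    return cr * (2 * xr - 1) - ci * (2 * xi - 1)

def G(cr, ci, xr, xi):
    """Eq. (5.7), likewise."""
    return ci * (2 * xr - 1) + cr * (2 * xi - 1)

cr, ci = 0.75, 0.25
table = {(xr, xi): (F(cr, ci, xr, xi), G(cr, ci, xr, xi))
         for xr in (0, 1) for xi in (0, 1)}
print(table[(0, 0)] == (-(cr - ci), -(cr + ci)))                 # True
print(all(abs(f) in (cr - ci, cr + ci) for f, g in table.values()))  # True
```

Every entry is ±(C_R - C_I) or ±(C_R + C_I), which is why storing only the two negative values suffices.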
5.3.3.3. Implementation Considerations
Multipliers can be divided into three types: bit-parallel, bit-serial, and digit-serial. Although a bit-serial or digit-serial multiplier has a smaller chip area than a bit-parallel multiplier, it requires a higher clock frequency for the same throughput. To achieve high throughput, a bit-serial or digit-serial multiplier often needs several parallel units, which increases the activity factor of the local clock. To meet the speed requirement, we therefore select a bit-parallel multiplier. The complex multiplier with DA is shown in Fig. 5.18.
The selection of the precomputed values from Table 5.4, which corresponds to the partial product generation in the real (imaginary) datapath, can be realized with a 2:1 multiplexer and an XOR gate, as shown in Fig. 5.19 (a).
An alternative is to use a 4:1 multiplexer, as shown in Fig. 5.19 (b). The benefit of this implementation is that the generation of the select signal (X_Ri ⊕ X_Ii) is not needed, so the delay of the partial product generation is reduced.
For the accumulator, the selection of structure is important. The usual structures for accumulators are array, carry-save, and tree structures. The tree structure is the fastest. It is also suitable for our low-power strategy, i.e., designing a faster circuit and using voltage scaling to reduce the power consumption. We select the tree structure for the accumulator.
Figure 5.19. Circuits for partial product generation: (a) a 2:1 multiplexer selecting between (C_R - C_I)_i and (C_R + C_I)_i with select signal X_Ri xor X_Ii, (b) a 4:1 multiplexer addressed directly by X_Ri and X_Ii.
The fastest, i.e., lowest-height, multi-operand tree is the Wallace tree. The Wallace tree has complex wiring, is therefore difficult to optimize, and its layout becomes irregular. The overturned-stairs tree [40], which has a regular layout and the same height as the Wallace tree when the number of operands is less than 19, is used in the design of the complex multipliers.
The overturned-stairs adder tree was suggested by Mou and Jutand [40]. Its main features are:
• A recursive structure that yields regular routing and simplifies the design of the layout generator.
• A low tree height, i.e., O(N^p), where p depends on the type of overturned-stairs tree.
There are several types of overturned-stairs adder trees [40]. We choose the first-order overturned-stairs tree, which has the same speed bound as the Wallace tree when the number of operands is less than 19.
The construction of the overturned-stairs tree is illustrated in Fig. 5.20, which shows the trees of height 1 to 3. When the height is more than three, we can construct the tree from only three building blocks: body, root, and connector. The body is constructed recursively according to Fig. 5.20: the body of height j (j > 2) consists of a body of height j - 1, a branch of height j - 2, and a connector. The branch of height j - 2 is formed by placing j - 2 carry-save adders (CSAs) on top of each other with proper interconnections [40]. The connector connects three feedthroughs from the body of height j - 1 and two outputs from the branch of height j - 2 to form the body of height j. A root (CSA) is connected to the outputs of the connector to form the whole tree of height j + 1.
Since there are only three feedthroughs between the body of height j - 1 and the body of height j in the overturned-stairs tree, routing planning in the accumulator design is also easy.
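As a generic illustration of multi-operand carry-save reduction (the principle behind both the Wallace and the overturned-stairs trees; the specific overturned-stairs wiring with bodies and connectors is not modelled here), each CSA turns three operands into two without carry propagation:

```python
def csa(a, b, c):
    """Carry-save adder on integer bit-vectors: sum word and carry word."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def reduce_to_two(operands):
    """Reduce many operands to two by levels of CSAs; returns the final
    pair and the number of CSA levels (the tree height)."""
    ops, levels = list(operands), 0
    while len(ops) > 2:
        nxt = []
        while len(ops) >= 3:
            s, c = csa(ops.pop(), ops.pop(), ops.pop())
            nxt += [s, c]
        nxt += ops                 # zero, one, or two operands pass through
        ops, levels = nxt, levels + 1
    return ops, levels

vals = [9, 14, 3, 25, 7, 10, 2, 31]
two, levels = reduce_to_two(vals)
print(sum(two) == sum(vals), levels)   # prints: True 4
```

Only the final pair needs a carry-propagate (vector merge) adder, which is where the Brent-Kung adder of the next section comes in.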
The full-adder is essential for the accumulator, and the choice of full-adder has a large impact on the accumulator's performance. We compared several full-adders and selected the one most suitable for our implementation. Recently, however, a large number of new adder cells have been proposed [51]; they should be evaluated in future work.
The first type of full-adder is the conventional static CMOS adder. At a supply voltage as low as 1.5 V, the conventional static CMOS full-adder, with its large stack height, is too slow. Furthermore, it is not competitive from a power consumption point of view.
Figure 5.20. Overturned-stairs trees: trees of height 1 to 3, a branch of n CSAs, and the recursive construction of the body of height j from a body of height j - 1, a branch of height j - 2, and a connector, with a root CSA completing the tree of height j + 1.
A second type of full-adder is the transmission-gate (TG) full-adder. It realizes the XOR gates with transmission gates, and both its power consumption and chip area are smaller than those of the conventional static CMOS full-adder.
A third type of full-adder is the Reusens full-adder [50]. This full-adder is fast and compact but requires output buffers. Buffer insertion is usually considered a drawback, since it introduces delay and increases the power consumption. However, in the accumulator the buffers are needed anyway in order to drive the long interconnections. There is no direct path from V_DD or V_SS in this full-adder, which tends to reduce the power consumption.

Figure 5.21. Conventional static CMOS full-adder.

Figure 5.22. Transmission-gate full-adder.
5.3.3.4. Accumulator Implementation
After the selection of the structure and the adder cell, the accumulator can be implemented.
A program for the automatic generation of overturned-stairs adder trees has been developed. It can handle different wordlengths for the data and the coefficient, and the generated structural VHDL code can be validated by applying random test vectors in a testbench.
A handcrafted accumulator using the overturned-stairs tree in a 0.35 µm standard CMOS technology is shown in Fig. 5.24. The worst-case delay is 26 ns at 1.5 V and 25 ˚C according to SPICE simulation. The power consumption of this complex multiplier is 15 mW at 1.5 V and 72.6 mW at 3.3 V, both at 25 MHz.
Table 5.5. Comparison of full-adders in 0.35 µm technology.

Adder type  | Transistor count | Delay (ns) @ 1.5 V | Power (µW) @ 1.5 V
Static CMOS | 24               | 4.2                | 4.3
TG          | 16               | 3.5                | 2.5
Reusens     | 16               | 3.2                | 2.1

Figure 5.23. Reusens full-adder.
5.3.3.5. Brent-Kung Adder Implementation

The Brent-Kung adder is used as the vector merge adder. It belongs to the prefix adders, which use the carry generate and propagate properties of the full-adder to accelerate the carry computation.
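The prefix-carry principle can be sketched behaviourally; this model of ours computes the same generate/propagate recurrence serially, whereas the Brent-Kung adder evaluates it with an associative prefix operator in a sparse tree of logarithmic depth:

```python
def prefix_carry_add(x, y, width):
    """Addition via generate/propagate: c_{i+1} = g_i | (p_i & c_i),
    sum bit i = p_i ^ c_i. Here the prefix is evaluated serially; a
    Brent-Kung tree computes the same carries in O(log n) levels."""
    g = [(x >> i) & (y >> i) & 1 for i in range(width)]        # generate
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(width)]      # propagate
    carries, gk = [0], g[0]
    for i in range(width - 1):
        carries.append(gk)                 # carry into bit i + 1
        gk = g[i + 1] | (p[i + 1] & gk)    # running prefix of (g, p)
    s = 0
    for i in range(width):
        s |= (p[i] ^ carries[i]) << i
    return s

print(prefix_carry_add(1234, 4321, 16))    # prints 5555
```

Because the prefix operator is associative, the carries can be regrouped into a tree, which is what gives the Brent-Kung adder its short delay and regular structure.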
A program for schematic generation of Brent-Kung adders has been developed; the layout generator is under construction. The generated schematic of a 32-bit Brent-Kung adder is illustrated in Fig. 5.25, and the layout is shown in Fig. 5.26.
Figure 5.24. Accumulator layout (size: 704.3 x 479.5 µm²).

Figure 5.25. Block diagram of a 32-bit Brent-Kung adder.
5.4. Final FFT Processor Design
After the design of the components and the selection of the FFT architecture, we apply the meet-in-the-middle methodology to combine the components into the complete implementation.
An observation is that a large portion of the total power is consumed by the computation of the complex multiplications in the FFT processor. We have implemented a complex multiplier that consumes 72.6 mW at a power supply voltage of 3.3 V and 25 MHz in a standard 0.35 µm CMOS technology. A 1024-point FFT processor requires four complex multipliers, which hence consume 290 mW at 3.3 V and 25 MHz. Even with bypass techniques for the trivial complex multiplications, the power consumption of the complex multiplications is still more than 210 mW. Hence reducing the number of complex multiplications is vital.
Using high-radix butterflies reduces the number of complex multiplications outside the butterflies. However, high-radix butterflies are not common in VLSI implementations due to two main drawbacks: for a radix larger than 4 the number of complex multiplications within the butterflies increases, and the routing complexity increases as well. Overcoming these two drawbacks is the key to using high-radix butterflies.
Figure 5.26. 32-bit Brent-Kung adder. Size: 0.25 × 0.16 mm².
As is well known, an adder consumes much less power than a
multiplier of the same wordlength, since the adder has less hardware
and far fewer glitches. We have implemented a 32-bit Brent-Kung
adder (real) that consumes 1.5 mW at 3.3 V, 25 MHz, which is much
less than the 17 × 13-bit complex multiplier (72.6 mW at 3.3 V,
25 MHz). It is therefore efficient to replace the complex
multipliers with constant multipliers when possible.
We use constant multipliers in the design of the 16-point butterfly
in order to reduce the number of complex multipliers. For a 16-point
FFT butterfly, there are three types of nontrivial complex
multiplications within the butterfly, i.e., multiplications with
W_16^1, W_16^2, and W_16^3. The multiplications with W_16^1 and
W_16^3 can share coefficients since
cos(π/8) = sin(π/2 − π/8) = sin(3π/8) and
sin(π/8) = cos(π/2 − π/8) = cos(3π/8). We can therefore use
constant multipliers, which reduces the complexity. The
implementation of a multiplication with W_16^1 is illustrated in
Fig. 5.27.
The selection of FFT algorithm affects the number and positions
of the constant multipliers. For the 16-point DFT, the radix-4 FFT
and SRFFT algorithms are more efficient than the radix-2 FFT
algorithm in terms of the number of multiplications. Moreover, both
the radix-2 and split-radix algorithms require three constant
multipliers (two multipliers with W_16^2 and one multiplier with
W_16^1), while the radix-4 algorithm requires only two multipliers
(one multiplier with W_16^2 and one multiplier with W_16^1/W_16^3).
Hence the 16-point butterfly based on radix-4 is more efficient and
is selected for our implementation.
Figure 5.27. Complex multiplication with W_16^1, using the three
real constant coefficients cos(π/8) + sin(π/8), cos(π/8) − sin(π/8),
and cos(π/8) on the real and imaginary inputs.
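The shared-coefficient structure of Fig. 5.27 is an instance of the classic three-real-multiplier complex multiplication. A minimal floating-point sketch (the hardware realizes the three constants as fixed shift-and-add multipliers; the function name is ours):

```python
from math import cos, sin, pi

# The three constant coefficients of Fig. 5.27.
C = cos(pi / 8)                  # cos(pi/8)
CPS = cos(pi / 8) + sin(pi / 8)  # cos(pi/8) + sin(pi/8)
CMS = cos(pi / 8) - sin(pi / 8)  # cos(pi/8) - sin(pi/8)

def mult_w16_1(a, b):
    """(a + jb) * W_16^1, with W_16^1 = cos(pi/8) - j*sin(pi/8),
    computed with three real multiplications instead of four."""
    t = C * (a + b)       # shared product
    re = t - b * CMS      # = a*cos(pi/8) + b*sin(pi/8)
    im = t - a * CPS      # = b*cos(pi/8) - a*sin(pi/8)
    return re, im
```

Expanding the expressions confirms the identity: t − b(C − S) = aC + bS and t − a(C + S) = bC − aS, which are exactly the real and imaginary parts of (a + jb)(C − jS).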
By replacing the complex multiplications with constant
multiplications within the 16-point butterfly, the power consumption
for the complex multiplications within the 16-point butterfly is
reduced to 10 mW at 3.3 V, 25 MHz. The number of nontrivial complex
multiplications can be reduced to 1776. The total number of
complex multipliers is reduced to two for a 1024-point FFT due to
the use of 16-point butterflies. The number of nontrivial complex
multiplications required for a 1024-point FFT with different
algorithms is shown in Table 5.6.
In the 1024-point FFT processor, there are only two complex
multipliers and two constant multipliers, which consume less than
160 mW. Hence, a power saving of more than 20% for the
computation of complex multiplications can be achieved. This is less
than the theoretical saving of 35% (the ratio of the numbers of
complex multiplications) due to the computation of the complex
multiplications within the 16-point butterfly.
To cope with the complex routing associated with high-radix
butterflies, it is better to divide the 16-point butterfly into four
radix-2 stages, since the radix-2 butterfly has the simplest routing.
As mentioned in the resource analysis, the most memory-efficient
architectures are those with single-path feedback, since they give
the minimum data memory, i.e., only N − 1 words for an N-point FFT.
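The N − 1 figure follows from the feedback FIFO sizes in a radix-2 single-path delay-feedback (R2SDF) pipeline, where the stage FIFOs hold N/2, N/4, …, 1 words. A small check (the function name is ours):

```python
def sdf_feedback_words(n):
    """Total feedback-memory words in a radix-2 single-path
    delay-feedback (R2SDF) pipeline for an n-point FFT: the
    per-stage FIFOs hold n/2, n/4, ..., 1 words."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    return sum(n >> s for s in range(1, n.bit_length()))

# For any power-of-two n, the geometric sum n/2 + n/4 + ... + 1 = n - 1.
```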
Algorithm            R2FFT  R4FFT  SRFFT  Our approach
No. of Comp. Mult.   3586   2732   2390   1776
Table 5.6. The number of nontrivial complex multiplications
for different FFT algorithms.
The radix-4 algorithm can be decomposed into radix-2 stages, as
done in [24]. Hence the 16-point butterfly can be mapped onto four
pipelined radix-2 butterflies, each with its own feedback memory.
The 16-point butterfly is illustrated in Fig. 5.28.
The power consumption for the data memory is estimated to be
300 mW (the power consumption for memories of 128 words or more is
given by the vendor; for the smaller memories it is estimated through
linear approximation down to 32 words). The butterflies consume
about 30 mW.
The total power consumption for the three main subsystems is
490 mW. Assuming 15% overhead for, for instance, clock
buffers and communication buses, the power consumption for the
FFT processor is estimated to be about 550 mW at 3.3 V [38].
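The estimate can be tallied directly from the subsystem numbers given in the text:

```python
# Approximate subsystem power at 3.3 V, 25 MHz (numbers from the text, mW).
budget_mw = {"data memory": 300, "multipliers": 160, "butterflies": 30}
subtotal = sum(budget_mw.values())  # 490 mW for the three main subsystems
total = subtotal * 1.15             # add 15% overhead (clock buffers, buses)
# subtotal == 490; total is roughly 560 mW, i.e., about 550 mW as quoted.
```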
The 1024-point FFT processor can also run at 1.5 V, which gives
further power savings. The total power consumption of the 1024-point
FFT processor is less than 200 mW at 1.5 V in a 0.35 µm standard
CMOS process. The memories contribute 55% of the total power
consumption, the computation units for butterfly operations and
complex multiplications 37%, and the rest 8%.
5.5. Summary
In this chapter, we have discussed the implementation of a
1024-point FFT processor.
A resource analysis gave a starting point for the implementation.
We proposed a wordlength optimization method for the pipelined FFT
architectures. This method gave a memory saving of up to 14%.
Figure 5.28. 16-point butterfly: four cascaded radix-2 butterfly
elements, each with its own feedback memory (Mem), and a constant
multiplier between input and output.
We discussed the implementation of the sub-blocks, i.e., butterflies,
memories, and complex multipliers. We proposed high-radix
butterflies using the carry-save technique, which is efficient in
terms of delay and area. We constructed a complex multiplier using
DA and an overturned-stairs tree, which is area efficient. All these
sub-blocks can operate at low power supply voltages and are suitable
for voltage scaling.
Finally, we discussed the implementation of an FFT processor
using a 16-point butterfly. The proposed 16-point butterfly
reduces the number of complex multiplications while retaining the
minimum memory requirement, which is power efficient.
6. CONCLUSIONS
This thesis discussed the essential parts of low power pipelined FFT
processor design.
The selection of FFT algorithm is an important starting point for
the FFT processor implementation. An FFT algorithm with fewer
multiplications and additions is attractive.
The selection of low power strategy affects the FFT hardware
design. Supply voltage scaling is an efficient low power
technique and was used for the FFT processor design.
After the selection of the FFT algorithm and the low power
strategy, it is important to reduce the hardware complexity. The
wordlengths in the stages of the pipelined FFT processor may
differ and can therefore be optimized individually. A
simulation-based method has been developed for wordlength
optimization of the pipelined FFT architectures. In some cases, the
wordlength optimization can reduce the size of the memories by up to
14% compared with using a uniform wordlength in each stage. This
also results in a power saving of 14% for the memories. The
reduction of wordlength also reduces the power consumption in the
complex multipliers and the butterflies proportionally.
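A minimal sketch of such a simulation-based evaluation is shown below: quantize the output of every FFT stage to a candidate wordlength and measure the resulting SNR against a high-precision reference. The actual method optimizes each stage separately; a uniform wordlength and the function names are simplifications of ours.

```python
import cmath, math

def quantize(x, w):
    """Round a complex sample to w fractional bits (fixed-point model)."""
    scale = 1 << w
    return complex(round(x.real * scale) / scale,
                   round(x.imag * scale) / scale)

def fft_quantized(x, w):
    """Radix-2 DIT FFT with every stage output quantized to w
    fractional bits (uniform wordlength, for simplicity)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_quantized(x[0::2], w)
    odd = fft_quantized(x[1::2], w)
    out = [0j] * n
    for k in range(n // 2):
        t = odd[k] * cmath.exp(-2j * cmath.pi * k / n)
        out[k] = quantize(even[k] + t, w)
        out[k + n // 2] = quantize(even[k] - t, w)
    return out

def snr_db(ref, approx):
    """Signal-to-quantization-noise ratio in dB."""
    sig = sum(abs(r) ** 2 for r in ref)
    err = sum(abs(r - a) ** 2 for r, a in zip(ref, approx))
    return 10 * math.log10(sig / err)
```

Sweeping w and recording the SNR for representative input data indicates the shortest wordlength per stage that still meets the accuracy target, which is the basis for the memory savings quoted above.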
For the detailed design, we proposed a carry-save technique for
the implementation of the butterflies. This technique is
generally applicable to high-radix butterflies. The proposed
high-radix butterflies reduce both the area and the delay by more
than 20%. In the complex multiplier design, we use distributed
arithmetic to reduce the hardware complexity. We selected the
overturned-stairs tree for the realization of the complex
multiplier, since it has a regular structure and the same
performance as the Wallace tree when the data wordlength is less
than 19. Simulation shows that the complex multiplier operates at up
to 30 MHz at 1.5 V. The power consumption is 15 mW at 25 MHz with a
1.5 V power supply voltage. In the SRAM design, we modified an STC
D flip-flop to form a two-stage sense amplifier. The sense amplifier
can be operated at a low power supply voltage.
With optimized wordlengths, the data memory size is reduced
by 10%. Using the proposed 16-point butterfly, the number of
complex multiplications can be reduced, which results in a power
saving of more than 20% for the complex multiplications. With all
these efforts, the total power consumption of the 1024-point
pipelined FFT processor with a continuous throughput of
25 Msamples/s and an equivalent wordlength of 12 bits is less than
200 mW at 1.5 V in a 0.35 µm standard CMOS process. The memories
contribute 55% of the total power consumption, the computation
units for butterfly operations and complex multiplications 37%,
and the rest 8%. The memories consume the most significant part of
the total power, which indicates that optimization of the memory
structure could be important for the implementation of low power
FFT processors.
REFERENCES
[1] M. Alidina, J. Monteiro, S. Devadas, A. Ghosh, and M.
Papaefthymiou, “Precomputation-based sequential logic
optimization for low power,” IEEE Trans. on VLSI Systems,
Vol. 2, No. 4, pp. 426–436, Dec., 1994.
[2] Analog Devices Inc., ADSP21060 SHARC Super Harvard
Architecture Computer, Norwood, MA, 1993.
[3] A. Antola, R. Negrini, and N. Scarabottolo, “Arrays for
discrete Fourier transform,” Proc. European Signal Process.
Conf. (EUSIPCO), Amsterdam, Netherlands, Sep. 1988, Vol. 2,
pp. 915–918.
[4] G. Bi and E. V. Jones, “A pipelined FFT processor for
word-sequential data,” IEEE Trans. on Acoustics, Speech, Signal
Processing, Vol. 37, No. 12, pp. 1982–1985, Dec., 1989.
[5] J. A. C. Bingham, “Multicarrier modulation for data
transmission: An idea whose time has come,” IEEE Commun.
Mag., Vol. 28, pp. 5–14, May 1990.
[6] L. Bisdounis, O. Koufopavlou, and S. Nikolaidis, “Accurate
evaluation of CMOS short-circuit power dissipation for
short-channel devices,” Intern. Symp. on Low Power Electronics &
Design, Monterey, CA, Aug., 1996, pp. 181–192.
[7] E. O. Brigham, The Fast Fourier Transform and Its
Applications, Prentice Hall, 1988.
[8] C. S. Burrus, “Index mappings for multidimensional
formulation of DFT and convolution,” IEEE Trans. on
Acoustics, Speech, Signal Processing, Vol. ASSP–25, No. 3,
pp. 239–242, June, 1977.
[9] A. Chandrakasan and R. W. Brodersen, Low Power Digital
CMOS Design, Kluwer, 1995.
[10] A. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R.
W. Brodersen, “Optimizing power using transformations,”
IEEE Trans. on Computer-Aided Design, Vol. 14,
No. 1, pp. 12–31, Jan., 1995.
[11] A. Chandrakasan, S. Sheng, and R. W. Brodersen, “Low-power
CMOS digital design,” IEEE Journal of Solid-State Circuits,
Vol. 27, No. 4, pp. 472–484, April, 1992.
[12] S. Chang, M. Marek-Sadowska, and K. Cheng, “Perturb and
simplify: multilevel Boolean network optimizer,” IEEE Trans.
on Computer-Aided Design of Integrated Circuits and
Systems, Vol. 15, No. 12, pp. 1494–1504, Dec., 1996.
[13] J. W. Cooley and J. W. Tukey, “An algorithm for the machine
calculation of complex Fourier series,” Mathematics of
Computation, Vol. 19, pp. 297–301, April, 1965.
[14] A. Croisier, D.J. Esteban, Levilion, and V. Riso, Digital Filter
for PCM Encoded Signals, U.S. Patent 3777 130, Dec., 1973.
[15] A. M. Despain, “Fourier transform computer using CORDIC
iterations,” IEEE Trans. on Computers, Vol. C–23, No. 10, pp.
993–1001, 1974.
[16] DoubleBW Systems B.V., PowerFFT™ processor data sheet,
Delft, the Netherlands, March, 2002.
[17] P. Duhamel and H. Hollmann, “Split-radix FFT algorithm,”
Electronics Letters, Vol. 20, No. 1, pp. 14–16, Jan., 1984.
[18] P. Duhamel and M. Vetterli, “Fast Fourier transforms: A
tutorial review and a state of the art,” Signal Processing,
Vol. 19, No. 4, pp. 259–299, April, 1990.
[19] I. J. Good, “The interaction algorithm and practical Fourier
analysis,” J. Royal Statist. Soc., ser. B, Vol. 20, pp. 361–372,
1958.
[20] A. Ghosh, S. Devadas, K. Keutzer, and J. White, “Estimation
of average switching activity in combinational and sequential
circuits,” In Proc. of the 29th Design Automation Conf., June,
1992, pp. 253–259.
[21] S. F. Gorman and J. M. Wills, “Partial Column FFTpipelines,”
IEEE Trans. on Circuits and SystemsII, Vol. 42, No. 6, June,
1995.
[22] H. L. Groginsky and G. A. Works, “A pipeline fast Fourier
transform,” IEEE Trans. on Computers, Vol. C–19, No. 11, pp.
1015–1019, 1970.
[23] D. Hang and Y. Kim “A deep submicron SRAM cell design
and analysis methodology,” In Proc. of Midwest Symp. on
Circuits and Systems, Dayton, Ohio, USA, Aug., 2001,
pp. 858–861.
[24] S. He and M. Torkelson, “A New Approach to Pipeline FFT
Processor,” In Proc. of the 10th Intern. Parallel Processing
Symp. (IPPS), Honolulu, Hawaii, USA, pp. 766–770, 1996.
[25] M. T. Heideman and C. S. Burrus, “On the number of
multiplications necessary to compute a length-2^n DFT,”
IEEE Trans. on Acoustics, Speech, Signal Processing,
Vol. ASSP–34, No. 1, Feb., 1986.
[26] M. T. Heideman, D. H. Johnson, and C. S. Burrus, “Gauss and
the history of the FFT,” IEEE Acoustics, Speech, and Signal
Processing Magazine, Vol. 1, pp 14–21, Oct., 1984.
[27] M. Hiraki, H. Kojima, et al., “Data-dependent logic swing
internal bus architecture for ultralow-power LSI’s,” IEEE
Journal of Solid-State Circuits, Vol. 30, No. 4, pp. 397–402,
April, 1995.
[28] http://theory.lcs.mit.edu/~fftw
[29] Intel Corp., SA1100 Microprocessor Technical Reference
Manual, Santa Clara, CA., USA., 1998.
[30] K. Itoh et al., “Trends in Low-Power RAM Circuit
Technologies,” Proc. of IEEE, pp. 524–543, April, 1995.
[31] D. P. Kolba and T. W. Parks, “A prime factor FFT algorithm
using high-speed convolution,” IEEE Trans. on Acoustics, Speech,
Signal Processing, Vol. ASSP–25, No. 4, pp. 281–294, Aug.,
1977.
[32] S. Krishnamoorthy and A. Khouja, “Efﬁcient power analysis
of combinational circuits,” In Proc. of Custom Integrated
Circuit Conf., San Diego, California, USA, 1996,
pp. 393–396.
[33] C. Lemonds and S. S. Mahant-Shetti, “A low power 16 by 16
multiplier using transition reduction circuitry,” In Proc. of the
Intern. Workshop on Low-Power Design, Napa, California,
USA, April, 1994, pp. 139–142.
[34] W. Li, Y. Ma, and L. Wanhammar, “Word length estimation
for memory efficient pipeline FFT/IFFT processors,” ICSPAT,
Orlando, Florida, USA, Nov., 1999, pp. 326–330.
[35] W. Li and L. Wanhammar, “A Complex Multiplier Using
‘Overturned-Stairs’ Adder Tree,” In Proc. of Intern. Conf. on
Electronic Circuits and Systems (ICECS), Paphos, Cyprus,
September, 1999, Vol. 1, pp. 21–24.
[36] W. Li and L. Wanhammar, “Efficient Radix-4 and Radix-8
Butterfly Elements,” In Proc. of NorChip Conf., Oslo,
Norway, Nov., 1999, pp. 262–267.
[37] W. Li and L. Wanhammar, “A Pipeline FFT Processor,” In
Proc. of IEEE Workshop on Signal Processing Systems
(SiPS), Taipei, China, Nov., 1999, pp. 654–662.
[38] W. Li and L. Wanhammar, “An FFT Processor Based on a
16-Point Module,” In Proc. of NorChip Conf., Stockholm,
Sweden, Nov., 2001, pp. 125–130.
[39] J. Melander, Design of SIC FFT Architectures, Linköping
Studies in Science and Technology, Thesis No. 618,
Linköping University, Sweden, May, 1997.
[40] Z. Mou and F. Jutand, “‘Overturned-Stairs’ Adder Trees and
Multiplier Design,” IEEE Trans. on Computers, Vol. C–41,
pp. 940–948, 1992.
[41] Motorola Inc., DSP96002 IEEE Floating-Point Dual-Port
Processor User’s Manual, Phoenix, AZ, 1989.
[42] L. Nielsen, C. Nielsen, J. Sparsø, and K. van Berkel, “Low-power
operation using self-timed circuits and adaptive scaling
of the supply voltage,” IEEE Trans. on VLSI Systems, Vol. 2,
No. 4, pp. 391–397, Dec., 1994.
[43] E. Nordhamn, Design of an Application Speciﬁc FFT
Processor, Linköping Studies in Science and Technology,
Thesis No. 324, Linköping University, Sweden, June, 1992.
[44] M. C. Pease, “An adaptation of fast Fourier transform for
parallel processing,” Journal of the Association for
Computing Machinery, Vol. 15, No. 2, pp. 252–264, April,
1968.
[45] M. Potkonjak and J. Rabaey, “Algorithm selection: A
quantitative optimization-intensive approach,” IEEE Trans. on
Computer-Aided Design of Integrated Circuits and Systems, Vol.
18, No. 5, pp. 524–532, May, 1999.
[46] J. Rabaey, L. Guerra, and R. Mehra, “Design guidance in the
power dimension,” In Proc. of Intern. Conf. on Acoustics,
Speech and Signal Processing, Detroit, Michigan, USA, May,
1995, Vol. 5, pp. 2837–2840.
[47] J. Rabaey and M. Pedram (Ed.), Low Power Design
Methodologies, Kluwer, 1996.
[48] L. R. Rabiner and B. Gold, Theory and Application of Digital
Signal Processing, Prentice Hall, 1975.
[49] C. M. Rader, “Discrete Fourier transforms when the number
of data samples is prime,” Proc. of IEEE, Vol. 56, pp. 1107–
1108, June, 1968.
[50] P. P. Reusens, High Performance VLSI Digital Signal
Processing Architecture and Chip Design, Cornell University,
Thesis, Aug., 1983.
[51] M. Sayed and W. Badawy, “Performance analysis of single-bit
full adder cells using 0.18, 0.25, and 0.35 µm CMOS
technologies,” In Proc. of IEEE Intern. Symp. on Circuits and
Systems (ISCAS), Vol. 3, Scottsdale, Arizona, USA, May,
2002, pp. 559–563.
[52] E. Seevinck et al., “Static-Noise Margin Analysis of MOS
SRAM Cells,” IEEE Journal of Solid-State Circuits, Vol.
SC–22, No. 5, pp. 748–754, Oct., 1987.
[53] M. R. Stan and W. P. Burleson, “Bus-invert coding for
low-power I/O,” IEEE Trans. on VLSI Systems, Vol. 3, No. 1, pp.
49–58, March, 1995.
[54] H. J. M. Veendrick, “Short-circuit dissipation of static CMOS
circuitry and its impact on the design of buffer circuits,” IEEE
Journal of Solid-State Circuits, Vol. 19, pp. 468–473, Aug.,
1984.
[55] Texas Instruments Incorporated, “An Implementation of FFT,
DCT, and Other Transforms on the TMS320C30,”
Application report: SPRA113, Dallas, Texas, 1997.
[56] V. Tiwari, S. Malik, and P. Ashar, “Compilation techniques for
low energy: an overview,” In Proc. of 1994 IEEE Symp. on
Low Power Electronics, San Diego, California, USA, Oct.,
1994, pp. 38–39.
[57] Q. Wang and S. Vrudhula, “Multilevel logic optimization for
low power using local logic transformations,” Proc. of Intern.
Conf. on Computer-Aided Design, San Jose, California, USA,
pp. 270–277, 1996.
[58] L. Wanhammar, DSP Integrated Circuits, Academic Press,
1999.
[59] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI
Design, Addison-Wesley, second edition, 1993.
[60] T. Widhe, Efﬁcient Implementation of FFT Processing
Elements, Linköping Studies in Science and Technology,
Thesis No. 619, Linköping University, Sweden, June, 1997.
[61] S. Winograd, “On computing the discrete Fourier
transform,” Proc. Nat. Acad. Sci. USA, Vol. 73, No. 4,
pp. 1005–1006, April, 1976.
[62] E. H. Wold and A. M. Despain, “Pipeline and Parallel-Pipeline
FFT Processors for VLSI Implementations,” IEEE Transactions
on Computers, Vol. C–33, No. 5, pp. 414–426, 1984.
[63] G. Yeap, Practical Low Power Digital VLSI Design, Kluwer,
1998.
[64] J. Yuan, High Speed CMOS Circuit Technique, Linköping
Studies in Science and Technology, Thesis No. 132,
Linköping University, Sweden, 1988.
[65] Zarlink Semiconductor Inc., PDSP16515A Stand Alone FFT
Processor, Advance Information, April, 1999.
Studies on Implementation of Low Power FFT Processors Copyright © 2003 Weidong Li Department of Electrical Engineering Linköpings universitet SE581 83 Linköping Sweden ISBN 9173736929 ISSN 02807971
To memory of my father.
Abstract
In the last decade, the interest for high speed wireless and on cable communication has increased. Orthogonal Frequency Division Multiplexing (OFDM) is a strong candidates and has been suggested or standardized in those communication systems. One key component in OFDMbased systems is FFT processor, which performs the efﬁcient modulation/demodulation. There are many FFT architectures. Among them, the pipeline architectures are suitable for the realtime communication systems. This thesis presents the implementation of pipeline FFT processors that has low power consumptions. We select the meetinthemiddle design methodology for the implementation of FFT processors. A resource analysis for the pipeline architectures is presented. This resource analysis determines the number of memories, butterﬂies, and complex multipliers to meet the speciﬁcation. We present a wordlengths optimization method for the pipeline architectures. We show that the high radix butterﬂy can be efﬁciently implemented with carrysave technique, which reduce the hardware complexity and the delay. We present also an efﬁcient implementation of complex multiplier using distributed arithmetic (DA). The implementation of low voltage memories is also discussed. Finally, we present a 16point butterﬂy using constant multipliers that reduces the total number of complex multiplications. The FFT processor using the 16point butterﬂies is a competitive candidate for low power applications.
Electronics Systems at Linköping University. Henrik Ohlsson. Also. and friends. relatives. Finally. I would like to express my gratitude to Oscar Gustafsson. especially A Phung. and Per Löwenberg for the proofreading. I would like to thank my family. and most importantly. Lastly. .Acknowledgement I would like to thank my supervisor Professor Lars Wanhammar for his support and guidance of this research. This work was ﬁnancially supported by the Swedish Strategic Fund (SSF) under INTELECT program. for their help in the discussions in research as well as other matters. for their boundless support and encouragement. I would like to thank the whole group.
.
..............1....2.... 30 2....... ..3.. ... FFT ALGORITHMS ... ........ 9 2... SandeTukey FFT Algorithms ............ . ......... 36 i ...... Thesis Outline ......1......... .6......... Generalized Formula ...... 9 2............ INTRODUCTION .......... SplitRadix FFT Algorithm ..... 1 1.. Prime Factor FFT Algorithms . Other FFT Algorithms . ........ 18 2.............5.2... CooleyTukey FFT Algorithms ..... .... ... ..... 10 2.... IDFT Implementation ....... ..4........ ..Table of Contents 1.. 20 2. ........... 31 2................................ ...... ........ 3 1. ......... 27 2.............. .... .............................................. ............. ....5.. 26 2...... 12 2....6............... EightPoint DFT ...... . .......... ............................... ..2............ Summary ................ ........... 6 1.................. .. 23 2..... Contributions .... 8 2.... ....... Basic Formula ............ 2 1........3..................... ......... ..........1. 29 2.. 26 2................................ .......... Scaling and Rounding Issue ........................ ................... ................................... ....... 23 2................7...1.....4..5...... . .... Addition Complexity . Other Issues .2.............. ............................. ...........1.... ......... 7 1.......... Power Consumption .........1........ ....... ............... DFT and FFT ........ ....... .. Multiplication Complexity ... OFDM Basics ................ . ............................ Performance Comparison .. .. ............................................1. .......... ......... ............1......................... .. ....................... ..... .6........ ... .......1......... .. Winograd Fourier Transform Algorithm ............4......... ...... .3.. ....2..2. 13 2............4........... ...... ............. ................. 35 2........... .5........ ..
.... ... Design Method ..... .............. Low Power Guidelines ......... FFT ARCHITECTURES ..............4........................... 67 ii ................ Radix4 Multipath Delay Commutator .... ...... . AlgorithmSpecific Processors .... Summary .........4............ .. Radix2 Multipath Delay Commutator ...... ........6.. 67 5....... . Low Power Techniques ...........2..................... ..........3.. 58 4............... Summary ........... 37 3.............. 53 4....2.......2........... Switching Power ... 37 3.......3.....1......1.4............. . ................. GeneralPurpose Programmable DSP Processors ..............5....1.....2. 47 3..................... System Level ............ ...... . ... .......... ............ ...1...... ..... .. .. .......... ...... Circuit Level . 40 3....3.....1... 54 4............................ ... Algorithm Level .................3....... Radix4 SinglePath Delay Feedback ..4...2..................................... 39 3.......... Logic Level .....3............. 44 3................2................. .... 50 3.3..... 40 3... .................................... . 53 4........ ...................... 61 4......... LOW POWER TECHNIQUES ..3.......2................. Power Dissipation Sources .2....5. 62 4.. .................................... ShortCircuit Power ..........................................2. ......... ... .. 59 4.. Architecture Level .......2. .. ... Resource Analysis ..... Radix2 SinglePath Delay Feedback .... .. 60 4..... .... Leakage Power ..... ... ................. 65 5.1.. 57 4. .1... 38 3.......... 52 4.3. .......... ......... 51 3.1.............................3........ ..1......3..... 56 4..... ..... ................. ............. Programmable FFT Specific Processors .....2........1......... 37 3.... . ............ .............. Radix4 SinglePath Delay Commutator ........... 65 5.. 63 5. ..... ............................. . IMPLEMENTATION OF FFT PROCESSORS .... Radix22 SinglePath Delay Commutator . 
.....3............. .. ......2........ 42 3........... ....3.. ....................... Highlevel Modeling of an FFT Processor ................
........... ... ... ....................... ...... 70 5.3.... 96 6........ .............. Validation of the HighLevel Model ....2. 72 5..... Summary .. .. Complex Multiplier . ... ...... ........ ..... ... ........ ..... .... Memory ... Subsystems .......... ...... ....... .... .... . . ... ...... 79 5. 73 5....... .4.... ..... .............. ... ...3.. ................3... ... . ........ REFERENCES .. ................5............ ........................ Butterfly ... .....1......2... .... .. 71 5.... 101 iii .....3....... .. . Final FFT Processor Design . 93 5.... .. . .... ........... ..... . . .... ............5..... 83 5...... . .. ..... ..... .3. .... Wordlength Optimization ..2.... ................ .. ........2...... ..... ... ..... .... .. 99 7...........3.......... . .................. . .. .. ..... CONCLUSIONS ......
iv .
which facilitates the efﬁcient transformation between the time domain and the frequency domain for a sampled signal. This thesis addresses the problem of designing efﬁcient applicationspeciﬁc FFT processors for OFDM based wideband communication systems. e.g. speech signal processing. communication. This performance requirement necessitates an applicationspeciﬁc integrated circuit (ASIC) solution for FFT implementation. The OFDM based communication systems have high performance requirement in both throughput and power consumption. is used in many applications. radar. we give a short review on DFT and FFT. Then an introduction to OFDM and power consumption are presented. which has made the FFT valuable for those communication systems. Finally. The immunity to multipath fading channel and the capability for parallel signal processing make it a promising candidate for the next generation wideband communication systems. the outline of the thesis is described. the interest for high speed wireless and on cable communication has increased. 1 .1 INTRODUCTION The Fast Fourier Transform (FFT) is one of the most used algorithms in digital signal processing. In the last decade. has been demonstrated to be an efﬁcient and reliable approach for highspeed data transmission. The modulation and demodulation of OFDM based communication systems can be efﬁciently implemented with an FFT. In this chapter. sonar. which is a special Multicarrier Modulation (MCM) method. Orthogonal Frequency Division Multiplexing (OFDM) technique.. The FFT.
The number N is also called transform length.∑ X ( n )W N N n=0 N–1 (1. …. k = 0.1) for n = 0. 1. 1. The complexity for computing an Npoint DFT is therefore O ( N 2 ) . With the contribution from Cooley and Tukey [13]. Among the FFT algorithms. the complexity for computation of an Npoint DFT can be reduced to O ( N log ( N ) ) .1) requires N ( N – 1 ) complex additions and N ( N – 1 ) complex multiplications. Direct computation of an Npoint DFT according to Eq. One algorithm is the splitradix algorithm. …. 2 Chapter 1 . the implementation of FFT algorithms is still a challenging task. was published in 1984. 1. The inverse DFT (IDFT) for data sequence { X ( n ) } ( n = 0. Another algorithm is Winograd Fourier Transform Algorithm (WFTA). …. N – 1 is deﬁned as X (n) = N–1 k=0 ∑ x ( k )W N nk (1. which reduces of complexity for DFT computation. which treats the even part and odd part with different radix.2) for k = 0. DFT and FFT The Discrete Fourier transform (DFT) for an Npoint data sequence { x ( k ) } . where W N = e – 2π ⁄ N is the primitive Nth root of unity. respectively. two algorithms are especially noteworthy. are called fast Fourier transform (FFT) algorithms. 1.1. which requires the least known number of multiplications among practical algorithms for moderate lengths DFTs and was published in 1976. (1. Due to the high computation workload and intensive memory access. The index k respective n is referred to as timedomain and frequencydomain index. The Cooley and Tukey’s approach and later developed algorithms. N – 1 . …. N – 1 .1. N – 1 ) is –n k 1 x ( k ) = . Many implementation approaches for the FFT have been proposed since the discovery of FFT algorithms.
The overlapping does not cause interference of subchannels due to the orthogonal modulation. The principle for MCM is shown in Fig. INTRODUCTION 3 .0 Figure 1. The high rate data stream at M f sym bits/s is grouped into blocks with M bits per block at a rate of f sym . 1.1. k and totally M bits for modulation of N carriers. OFDM Basics OFDM is a special MCM technique. which transmit data in parallel [5]. This leads to inefﬁcient usage of spectrum and excessive hardware requirement. Each subchannel has its own modulator and demodulator.n2 Parallel to serial demodulator n1 demodulator n2 Output demodulator 0 fc. In the conventional MCM.n1 fc.0 fsym symbol/s Channel noise fc.1. the spectrum can be used more efﬁcient since overlapping of subchannels is allowed. With OFDM. the N subchannels are nonoverlapping. The OFDM technique can overcome those drawbacks.n2 Input Mfsym b/s x(t) modulator 0 M bits (a symbol) fc.n1 fc. Serial to parallel mn1 bits modulator n1 modulator n2 fc.1. A block is called a symbol. The idea for MCM is to divide transmission bandwidth into many narrow subchannels (subcarriers). A symbol allocates mkbits of M bits for modulation of a carrier k at f c. which send symbols at a rate of f sym .2. A multicarrier modulation system. This results in N subchannels.
For two carrier signals.3) (1. 4 Chapter 1 . The symbol rate is fsym. (1..1. gk. It means that there is no interference to other subchannels with the selected functions. the carrier signals can be expressed as following: k f k = f 0 + T e j2πf k t gk ( t ) = 0 0≤k<N–1 0≤t<T otherwise (1. f 1/T Figure 1.2.4) where f0 is the system base frequency and g k is the signal for carrier k at frequency fk. the sending signal x(t) is the summation of symbol transmission in all subchannels.. each symbol is sent during a symbol time T (which is equal to 1/fsym). e. the integral over a symbol time is T ∫ g k ( t )g l∗ ( t ) dt = 0 0 T k = l otherwise (1. OFDM overcomes the inefﬁcient implementation of the modulator and demodulator for conventional MCM. and gl. Spectrum overlapping of subcarriers for OFDM.5) which shows that two carriers are orthogonal. From Fig.3) and Eq. If the frequency of subcarrier k and the base function are chosen according to Eq. 1. (1.g. its spectrum is a sinc function with zero points at f 0 + l ⁄ T (l is integer) except l = k or fk. This orthogonality can also be found in the time domain.The orthogonality can be explained in frequency domain.4). The frequency spacing between adjacent subchannels is set to be 1/T Hz.g. e.
In OFDM, the sending signal x(t) is the summation of the symbol transmissions in all subchannels:

  x(t) = Σ_{k=0}^{N−1} S_k g_k(t) = e^{j2π f_0 t} Σ_{k=0}^{N−1} S_k e^{j2πkt/T}

where S_k is the modulated signal of the m_k bits which should be transmitted by subchannel k. This is an N-point Inverse Discrete Fourier Transform (IDFT) followed by baseband modulation (with e^{j2π f_0 t}). The IDFT can be computed efficiently by an Inverse Fast Fourier Transform (IFFT) algorithm. Hence the OFDM modulator can be implemented with one IFFT processor and a baseband modulator for the N subcarriers, instead of the N modulators required for conventional MCM. OFDM thus overcomes the inefficient implementation of the modulator and demodulator in conventional MCM. In a similar way, the OFDM demodulator can be implemented more efficiently than that of conventional MCM. The simplified OFDM system based on the FFT is shown in Fig. 1.3.

Figure 1.3. OFDM system based on FFT.

In reality, interference between subchannels exists due to the non-ideal channel characteristics and frequency offsets in transmitters and receivers. This interference affects the performance of the OFDM system. The frequency offset can, in most cases, be compensated. Other issues, for instance intersymbol interference, can be reduced by techniques like the cyclic prefix.
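The IDFT/DFT view of the modulator and demodulator can be illustrated with a small sketch (with f_0 = 0 and arbitrary example symbols): the modulator evaluates the sampled sum of modulated carriers, and the forward DFT recovers the symbols.

```python
import cmath

# Sketch of the FFT-based OFDM idea (f0 = 0, arbitrary example symbols):
# the modulator evaluates the carrier sum at t = nT/N, which is an N-point
# IDFT, and the demodulator recovers the symbols S_k with the forward DFT.
N = 8
S = [complex(k % 3 - 1, (k * 2) % 5 - 2) for k in range(N)]  # arbitrary symbols

# Modulator: x[n] = sum_k S_k * exp(j*2*pi*k*n/N)  (sampled carrier sum)
x = [sum(S[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N))
     for n in range(N)]

# Demodulator: S_k = (1/N) * sum_n x[n] * exp(-j*2*pi*k*n/N)  (forward DFT)
S_hat = [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N)) / N
         for k in range(N)]

assert all(abs(a - b) < 1e-9 for a, b in zip(S, S_hat))
```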
11/ 0.49 11511 0. In the high performance applications.66 3990 1. This is due to the potential workload increase.1.24/ 0. Year Feature size ASIC usable Mega transistors/cm2 (auto layout) ASIC maximum functions per chip (Mega transistors/chip) Package cost (cents/pin)maximum/minimum Onchip.70 3088 1. higher workload. In portable applications.3. contribute to make the power consumption and energy efﬁciency even more critical.1. the power consumption has grown from a secondary constraint to one of the main constraints in the design of integrated circuit. and longer operation time. However. the power consumption increases or retains almost the same as the advance of technology according to the table above.0 150 2.2 2005 80 nm 225 1286 1. more functionality. Table 1 shows the expectation for the near future from Semiconductor Industry Association. for instances. The power consumption decreases as the feature size and the power supply voltage are reduced.6 218 3.2 2010 45 nm 714 4081 0. where the power consumption traditionally was a secondary 6 Chapter 1 . local clock (MHz) Supply Vdd (V) (high performance) Power consumption for High performance with heatsink (W) Power consumption for Battery(W)(Handheld) 2003 107 nm 142 810 1. Technology Roadmap from the International Technology for Semiconductors (ITRS).17/ 0.0 160 3. low power consumption has long been the main constraint.0 Table 1.9 170 3.61 5173 0.8 2004 90 nm 178 1020 1. Power Consumption The famous Moore’s Law predicts the exponential increase in circuit integration and clock frequency during the last three decades. During the last decade.98/ 0. Several other factors.
constraint, the low power techniques gain more ground due to the steadily increasing cost for cooling and packaging. Besides those factors, the increasing power consumption has resulted in higher on-chip temperatures, which in turn reduce the reliability. The delivery of the power supply to the chip has also raised many problems, like power rail design, noise immunity, IR-drop, etc. Therefore the low power techniques are important for the current and future integrated circuits.

1.4. Thesis Outline

In this thesis we summarize some implementation aspects of a low power FFT processor for an OFDM communication system. The system specification for the FFT processor has been defined as
• Transform length is 1024
• Transform time is less than 40 µs (continuously)
• Continuous I/O
• 25.6 Msamples/s throughput
• Complex 24-bit I/O data
• Low power

In chapter 2, we introduce several FFT algorithms. The basic idea of FFT algorithms, divide and conquer, is demonstrated through a few examples. Several FFT algorithms and their performance are given as well.

An overview of low power techniques is given in chapter 3. Different techniques are introduced at different abstraction levels. The main focus of the low power techniques is the reduction of the dynamic power consumption. A general guideline is found at the end of that chapter.

The choice of FFT architecture is important for the implementation, and it is the starting point for the implementation. A few architectures, including the pipeline architectures, are introduced in chapter 4. The pipeline architectures are discussed in more detail since they are the dedicated architectures for our target application.
In chapter 5, more detailed implementation steps for FFT processors are provided. Both the design method and the design of the FFT processors are discussed in this chapter. The conclusions for the FFT processor implementation are given in chapter 6.

1.5. Contributions

The main contributions of this thesis are:
• A method for minimizing the wordlengths in the pipelined FFT architectures, presented in Section 5.2.
• An approach to construct efficient high-radix butterflies, given in Section 5.3.2.
• A complex multiplier using distributed arithmetic and the overturned-stairs tree, described in Section 5.3.3.
• A 16-point butterfly with constant multipliers, which reduces the total number of complex multiplications, outlined in Section 5.4.
• Various generators for different components, for instance the ripple-carry adder, the Brent-Kung adder, the complex multiplier, etc. These are found in Chapter 5.
2 FFT ALGORITHMS

In FFT processor design, the mathematical properties of the FFT must be exploited for an efficient implementation, since the selection of FFT algorithm has a large impact on the implementation in terms of speed, hardware complexity, power consumption, etc. This chapter focuses on a review of FFT algorithms.

2.1. Cooley-Tukey FFT Algorithms

The technique for efficient computation of DFTs is based on the divide and conquer approach. This technique works by recursively breaking down a problem into two or more subproblems of the same (or a related) type. The subproblems are then independently solved and their solutions are combined to give a solution to the original problem. This technique can be applied to DFT computation by dividing the data sequence into smaller data sequences until the DFTs for the small data sequences can be computed efficiently. Although the technique was described in 1805 [26], it was not applied to DFT computation until 1965 [13]. Cooley and Tukey demonstrated the simplicity and efficiency of the divide and conquer approach for DFT computation and made the FFT algorithms widely accepted. We give a simple example of the divide and conquer approach. Then a basic and a generalized FFT formulation are given.
2.1.1. Eight-Point DFT

In this section, we illustrate the idea of the divide and conquer approach and show why dividing is also conquering for DFT computation. Let us consider an 8-point DFT, i.e., N = 8 with data sequence {x(k)}, k = 0, 1, …, 7. The DFT of {x(k)} is given by

  X(n) = Σ_{k=0}^{7} x(k) W_8^{nk}    (2.1)

for n = 0, 1, …, 7. One way to break down a long data sequence into shorter ones is to group the data sequence according to the indices. The grouping of {x(k)} into {x_o(l)} and {x_e(l)} can be done intuitively by separating the members with odd and even indices. Let {x_o(l)} and {x_e(l)} (l = 0, 1, 2, 3) be the two sequences

  x_o(l) = x(2l + 1)
  x_e(l) = x(2l)    (2.2)

for l = 0, 1, 2, 3. The DFT of {x(k)} can then be rewritten as

  X(n) = Σ_{l=0}^{3} x_o(l) W_8^{n(2l+1)} + Σ_{l=0}^{3} x_e(l) W_8^{n(2l)}    (2.3)
       = W_8^n Σ_{l=0}^{3} x_o(l) W_8^{n(2l)} + Σ_{l=0}^{3} x_e(l) W_8^{n(2l)}
       = W_8^n Σ_{l=0}^{3} x_o(l) W_4^{nl} + Σ_{l=0}^{3} x_e(l) W_4^{nl}    (2.4)
       = W_8^n X_o(n) + X_e(n)
where W_8^{n(2l)} = e^{−j2πn(2l)/8} = e^{−j2πnl/4} = W_4^{nl}, and X_o(n) and X_e(n) are the 4-point DFTs of {x_o(l)} and {x_e(l)}, respectively.

Eq. (2.4) shows that the computation of an 8-point DFT can be decomposed into two 4-point DFTs and summations. Only two 4-point DFTs are required for the 8-point DFT due to the fact that X_o(n) = X_o(n − 4) and X_e(n) = X_e(n − 4) for n ≥ 4. This is shown in Fig. 2.1.

Figure 2.1. An 8-point DFT computation with two 4-point DFTs.

The direct computation of an 8-point DFT requires 8(8 − 1) = 56 complex additions and 8(8 − 1) = 56 complex multiplications. The computation of the two 4-point DFTs requires 2 · 4 · (4 − 1) = 24 complex additions and 24 complex multiplications. With an additional 8 − 1 = 7 complex multiplications for W_8^n X_o(n) and 8 complex additions, it requires in total 31 complex multiplications and 32 complex additions for the 8-point DFT computation according to Eq. (2.4). Furthermore, the number of complex multiplications for W_8^n X_o(n) can be reduced from 7 to 3, since W_8^n = −W_8^{n−4} for n ≥ 4. The total number of complex additions and complex multiplications is then 32 and 27, respectively.

The above 8-point DFT example shows that the decomposition of a long data sequence into smaller data sequences reduces the computational complexity.
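The decomposition in Eq. (2.4) can be verified numerically. The sketch below (with an arbitrary input sequence) compares the direct 8-point DFT with the combination of the two 4-point DFTs.

```python
import cmath

# Numerical check of Eq. (2.4): an 8-point DFT equals W_8^n X_o(n) + X_e(n),
# with X_o(n) and X_e(n) periodic in n with period 4. Input values are arbitrary.
def dft(x):
    N = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * n * k / N) for k in range(N))
            for n in range(N)]

x = [complex(k, -k) for k in range(8)]
X = dft(x)

Xo = dft(x[1::2])   # 4-point DFT of the odd-indexed samples
Xe = dft(x[0::2])   # 4-point DFT of the even-indexed samples

W8 = lambda n: cmath.exp(-2j * cmath.pi * n / 8)
X_split = [W8(n) * Xo[n % 4] + Xe[n % 4] for n in range(8)]

assert all(abs(a - b) < 1e-9 for a, b in zip(X, X_split))
```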
2.1.2. Basic Formula

The 8-point DFT example illustrates the principle of the Cooley-Tukey FFT algorithm. We now introduce a more mathematical formulation of the FFT algorithm. Let N be a composite number, N = r_1 × r_0. The index k can then be expressed by the two-tuple (k_1, k_0) as

  k = r_0 k_1 + k_0   (0 ≤ k_0 < r_0, 0 ≤ k_1 < r_1)    (2.5)

In a similar way, the index n can be described by (n_1, n_0) as

  n = r_1 n_1 + n_0   (0 ≤ n_1 < r_0, 0 ≤ n_0 < r_1)    (2.6)

The term W_N^{nk} can be factorized as

  W_N^{nk} = W_N^{(r_1 n_1 + n_0)(r_0 k_1 + k_0)}
           = W_N^{r_1 r_0 n_1 k_1} W_{r_1}^{n_0 k_1} W_{r_0}^{n_1 k_0} W_N^{n_0 k_0}
           = W_{r_1}^{n_0 k_1} W_{r_0}^{n_1 k_0} W_N^{n_0 k_0}    (2.7)

where W_N^{r_1 r_0 n_1 k_1} = W_N^{N n_1 k_1} = e^{−j2πN n_1 k_1 / N} = e^{−j2π n_1 k_1} = 1. With Eq. (2.7), the DFT can be rewritten as

  X(n_1, n_0) = Σ_{k_0=0}^{r_0−1} [ W_N^{n_0 k_0} Σ_{k_1=0}^{r_1−1} x(k_1, k_0) W_{r_1}^{n_0 k_1} ] W_{r_0}^{n_1 k_0}    (2.8)

Eq. (2.8) indicates that the DFT computation can be performed in three steps:
1 Compute r_0 different r_1-point DFTs (inner parenthesis).
2 Multiply the results with the twiddle factors W_N^{n_0 k_0}.
3 Compute r_1 different r_0-point DFTs (outer parenthesis).
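The three steps above can be sketched in code for an arbitrary composite length, here N = 12 = 4 × 3 (an arbitrary choice), and checked against a direct DFT.

```python
import cmath

# Sketch of the three-step computation in Eq. (2.8) for N = r1 * r0
# (here 12 = 4 * 3, an arbitrary choice), checked against a direct DFT.
r1, r0 = 4, 3
N = r1 * r0
W = lambda M, e: cmath.exp(-2j * cmath.pi * e / M)

x = [complex(k * k % 7, -k) for k in range(N)]          # arbitrary input
# index map of Eq. (2.5): x(k1, k0) with k = r0*k1 + k0
xs = [[x[r0 * k1 + k0] for k0 in range(r0)] for k1 in range(r1)]

# Step 1: r0 different r1-point DFTs (over k1, one per k0)
s1 = [[sum(xs[k1][k0] * W(r1, n0 * k1) for k1 in range(r1))
       for k0 in range(r0)] for n0 in range(r1)]
# Step 2: twiddle factor multiplications W_N^{n0*k0}
s2 = [[s1[n0][k0] * W(N, n0 * k0) for k0 in range(r0)] for n0 in range(r1)]
# Step 3: r1 different r0-point DFTs (over k0, one per n0); n = r1*n1 + n0
X = [None] * N
for n0 in range(r1):
    for n1 in range(r0):
        X[r1 * n1 + n0] = sum(s2[n0][k0] * W(r0, n1 * k0) for k0 in range(r0))

X_direct = [sum(x[k] * W(N, n * k) for k in range(N)) for n in range(N)]
assert all(abs(a - b) < 1e-9 for a, b in zip(X, X_direct))
```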
The r_0 r_1-point DFTs require r_0 r_1 (r_1 − 1), or N(r_1 − 1), complex multiplications and additions. The second step requires N complex multiplications. The final step requires N(r_0 − 1) complex multiplications and additions. Therefore the total number of complex multiplications using Eq. (2.8) is N(r_0 + r_1 − 1) and the number of complex additions is N(r_0 + r_1 − 2). This is a reduction from O(N²) to O(N(r_1 + r_0)). Therefore the decomposition of the DFT reduces the computational complexity. If r_0 or/and r_1 are not prime, further improvement can be achieved by applying the divide and conquer approach recursively to the r_1-point or/and r_0-point DFTs [7].

The numbers r_0 and r_1 are called radices. If r_0 and r_1 are equal to r, the number system is called a radix-r system. Otherwise, it is called a mixed-radix system. The multiplications with W_N^{n_0 k_0} are called twiddle factor multiplications.

Example 2.1. For N = 8, we apply the basic formula by decomposing N = 8 = 4 × 2 with r_0 = 2 and r_1 = 4. This results in the 8-point DFT example given in the previous section. It is a mixed-radix FFT algorithm.

From a closer study of the given 8-point DFT example, which is shown in Fig. 2.1, it is easy to see that the input data need not be stored in memory after the computation of the two 4-point DFTs. An algorithm with this property is called an in-place algorithm. It can reduce the total memory size, which is important for memory-constrained systems.

2.1.3. Generalized Formula

Let N = r_{p−1} × r_{p−2} × … × r_0. The indices k and n can be written as

  k = r_0 r_1 … r_{p−2} k_{p−1} + … + r_0 k_1 + k_0    (2.9)
  n = r_{p−1} r_{p−2} … r_1 n_{p−1} + … + r_{p−1} n_1 + n_0    (2.10)

where k_i, n_{p−i−1} ∈ [0, r_i − 1] for i = 0, 1, …, p − 1.
The factorization of W_N^{nk} can be expressed as

  W_N^{nk} = W_N^{(r_0 r_1 … r_{p−2} k_{p−1} + … + r_0 k_1 + k_0) n}
           = W_N^{r_0 r_1 … r_{p−2} n k_{p−1}} … W_N^{r_0 n k_1} W_N^{n k_0}
           = W_{r_{p−1}}^{n k_{p−1}} … W_{N/r_0}^{n k_1} W_N^{n k_0}    (2.11)

where W_N^{r_0 r_1 … r_i n k_{i+1}} = W_{N/(r_0 r_1 … r_i)}^{n k_{i+1}}. The DFT can then be written

  X(n_{p−1}, n_{p−2}, …, n_0)
    = Σ_{k_0=0}^{r_0−1} Σ_{k_1=0}^{r_1−1} … Σ_{k_{p−1}=0}^{r_{p−1}−1} x(k_{p−1}, k_{p−2}, …, k_0) W_{r_{p−1}}^{n_0 k_{p−1}}
      · W_{r_{p−1} r_{p−2}}^{n k_{p−2}} … W_{N/r_0}^{n k_1} W_N^{n k_0}    (2.12)

Note that the inner sum can be recognized as an r_{p−1}-point DFT for n_0, since W_{r_{p−1}}^{n k_{p−1}} depends only on n mod r_{p−1} = n_0; index k_{p−1} is "replaced" by n_0. Define

  x_1(n_0, k_{p−2}, …, k_0) = Σ_{k_{p−1}=0}^{r_{p−1}−1} x(k_{p−1}, k_{p−2}, …, k_0) W_{r_{p−1}}^{n_0 k_{p−1}}    (2.13)

With Eq. (2.13), Eq. (2.12) can then be rewritten as

  X(n_{p−1}, n_{p−2}, …, n_0)
    = Σ_{k_0=0}^{r_0−1} Σ_{k_1=0}^{r_1−1} … Σ_{k_{p−2}=0}^{r_{p−2}−1} x_1(n_0, k_{p−2}, …, k_0) W_{r_{p−1} r_{p−2}}^{n k_{p−2}} … W_{N/r_0}^{n k_1} W_N^{n k_0}    (2.14)
The term W_{N/(r_0 r_1 … r_{i−1})}^{n k_i} can be factorized as

  W_{N/(r_0 r_1 … r_{i−1})}^{n k_i}
    = W_{r_{p−1} r_{p−2} … r_i}^{(r_{p−1} r_{p−2} … r_{i+1} n_{p−i−1} + … + r_{p−1} n_1 + n_0) k_i}
    = W_{r_{p−1} r_{p−2} … r_i}^{(r_{p−1} … r_{i+2} n_{p−i−2} + … + n_0) k_i} W_{r_i}^{n_{p−i−1} k_i}

With i = p − 2, the inner sum over k_{p−2} in Eq. (2.14) can then be written as

  Σ_{k_{p−2}=0}^{r_{p−2}−1} x_1(n_0, k_{p−2}, …, k_0) W_{r_{p−1} r_{p−2}}^{n k_{p−2}}
    = Σ_{k_{p−2}=0}^{r_{p−2}−1} [ x_1(n_0, k_{p−2}, …, k_0) W_{r_{p−1} r_{p−2}}^{n_0 k_{p−2}} ] W_{r_{p−2}}^{n_1 k_{p−2}}    (2.15)

which can be done through twiddle factor multiplications

  x_1'(n_0, k_{p−2}, …, k_0) = x_1(n_0, k_{p−2}, …, k_0) W_{r_{p−1} r_{p−2}}^{n_0 k_{p−2}}    (2.16)

and r_{p−2}-point DFTs

  x_2(n_0, n_1, k_{p−3}, …, k_0) = Σ_{k_{p−2}=0}^{r_{p−2}−1} x_1'(n_0, k_{p−2}, …, k_0) W_{r_{p−2}}^{n_1 k_{p−2}}    (2.17)

Eq. (2.14) can then be rewritten as

  X(n_{p−1}, n_{p−2}, …, n_0)
    = Σ_{k_0=0}^{r_0−1} Σ_{k_1=0}^{r_1−1} … Σ_{k_{p−3}=0}^{r_{p−3}−1} x_2(n_0, n_1, k_{p−3}, …, k_0) W_{r_{p−1} r_{p−2} r_{p−3}}^{n k_{p−3}}
      · W_{r_{p−1} r_{p−2} r_{p−3} r_{p−4}}^{n k_{p−4}} … W_{N/r_0}^{n k_1} W_N^{n k_0}    (2.18)

This process, from Eq. (2.14) to Eq. (2.18), can be repeated p − 2 times until index k_0 is replaced by n_{p−1}.
This gives the last step

  x_p(n_0, n_1, …, n_{p−1}) = Σ_{k_0=0}^{r_0−1} x_{p−1}'(n_0, …, n_1, k_0) W_{r_0}^{n_{p−1} k_0}    (2.19)

  X(n_{p−1}, n_{p−2}, …, n_0) = x_p(n_0, n_1, …, n_{p−1})    (2.20)

Eq. (2.20) reorders the output data to natural order. This process is called unscrambling. The unscrambling process requires a special addressing mode that converts the address (n_0, n_1, …, n_{p−1}) to (n_{p−1}, …, n_1, n_0). In the case of a radix-2 number system, each n_i represents a bit; the addressing for unscrambling reverses the address bits and is hence called bit-reverse addressing. In the case of a radix-r (r > 2) number system, it is called digit-reverse addressing.

Example 2.2. 8-point DFT. Let N = 2 × 2 × 2. The factorization of W_N^{nk} can be expressed as

  W_N^{nk} = (W_2^{n_0 k_2}) (W_4^{n_0 k_1} W_2^{n_1 k_1}) (W_8^{(2n_1 + n_0) k_0} W_2^{n_2 k_0})    (2.21)

By using the generalized formula, the computation of an 8-point DFT can be performed with the following sequential equations [7]:

  x_1(n_0, k_1, k_0) = Σ_{k_2=0}^{1} x(k_2, k_1, k_0) W_2^{n_0 k_2}    (2.22)

  x_1'(n_0, k_1, k_0) = x_1(n_0, k_1, k_0) W_4^{n_0 k_1}    (2.23)

  x_2(n_0, n_1, k_0) = Σ_{k_1=0}^{1} x_1'(n_0, k_1, k_0) W_2^{n_1 k_1}    (2.24)

  x_2'(n_0, n_1, k_0) = x_2(n_0, n_1, k_0) W_8^{(2n_1 + n_0) k_0}    (2.25)
  x_3(n_0, n_1, n_2) = Σ_{k_0=0}^{1} x_2'(n_0, n_1, k_0) W_2^{n_2 k_0}    (2.26)

  X(n_2, n_1, n_0) = x_3(n_0, n_1, n_2)    (2.27)

where Eq. (2.22) corresponds to the W_2^{n_0 k_2} term in Eq. (2.21), Eq. (2.23) corresponds to the W_4^{n_0 k_1} term, and so on. The result is shown in Fig. 2.2.

Figure 2.2. 8-point DFT with the Cooley-Tukey algorithm.
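The sequential equations above correspond to an in-place computation with bit-reversed input ordering. A minimal iterative radix-2 DIT sketch (with an arbitrary input, checked against a direct DFT):

```python
import cmath

# Sketch of Eqs. (2.22)-(2.27) as an in-place radix-2 DIT FFT: bit-reversed
# input order (the unscrambling of Eq. (2.20) moved to the input side),
# log2(N) butterfly stages with twiddle factors, natural-order output.
def fft_dit(x):
    N = len(x)
    bits = N.bit_length() - 1
    # bit-reverse permutation of the input indices
    a = [x[int(format(i, f'0{bits}b')[::-1], 2)] for i in range(N)]
    m = 2
    while m <= N:
        wm = cmath.exp(-2j * cmath.pi / m)
        for start in range(0, N, m):
            for j in range(m // 2):
                t = (wm ** j) * a[start + j + m // 2]   # twiddle multiplication
                u = a[start + j]
                a[start + j] = u + t                    # radix-2 butterfly
                a[start + j + m // 2] = u - t
        m *= 2
    return a

x = [complex(3 * k % 5, k) for k in range(8)]           # arbitrary input
X = fft_dit(x)
X_direct = [sum(x[k] * cmath.exp(-2j * cmath.pi * n * k / 8) for k in range(8))
            for n in range(8)]
assert all(abs(a - b) < 1e-9 for a, b in zip(X, X_direct))
```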
The recursive usage of the divide and conquer approach for an 8-point DFT is shown in Fig. 2.3. As illustrated in the figure, the inputs are divided into smaller and smaller groups. This class of algorithms is called decimation-in-time (DIT) algorithms.

Figure 2.3. The divide and conquer approach for DFT.

2.2. Sande-Tukey FFT Algorithms

Another class of algorithms is the decimation-in-frequency (DIF) algorithms, which divide the outputs into smaller and smaller DFTs. This kind of algorithm is also called the Sande-Tukey FFT algorithm. The computation of a DFT with the DIF algorithm is similar to the computation with the DIT algorithm. For the sake of simplicity, we do not derive the DIF algorithm but illustrate it with an example.

Example 2.3. 8-point DFT. The factorization of W_N^{nk} can be expressed as

  W_N^{nk} = (W_2^{k_2 n_0} W_8^{(2k_1 + k_0) n_0}) (W_2^{k_1 n_1} W_4^{k_0 n_1}) (W_2^{k_0 n_2})    (2.28)
The sequential equations can be constructed in a similar way as those in Eq. (2.22) through Eq. (2.27). The result is shown in Fig. 2.4.

Figure 2.4. 8-point DFT with DIF algorithm.

In general, by using the same notation for the indices n and k as in Eq. (2.9) and Eq. (2.10), the computation of an N-point DFT with the DIF algorithm is

  x_i(n_0, …, n_{i−1}, k_{p−i−1}, …, k_0)
    = [ Σ_{k_{p−i}=0}^{r_{p−i}−1} x_{i−1}(n_0, …, n_{i−2}, k_{p−i}, …, k_0) W_{r_{p−i}}^{n_{i−1} k_{p−i}} ]
      · W_{N/(r_{p−1} … r_{p−i+1})}^{n_{i−1} (r_{p−i−2} … r_0 k_{p−i−1} + … + r_0 k_1 + k_0)}    (2.29)

for i = 1, 2, …, p, where x_0(k_{p−1}, k_{p−2}, …, k_0) = x(k_{p−1}, k_{p−2}, …, k_0). The unscrambling process is done by

  X(n_{p−1}, …, n_0) = x_p(n_0, …, n_{p−1})    (2.30)

Comparing Fig. 2.4 with Fig. 2.2, we can find that the signal-flow graph (SFG) for the DFT computation with the DIF algorithm is the transposition of that with the DIT algorithm. Hence, many properties of the DIT and DIF algorithms are the same. For instance, the computation
workload for the DIT and DIF algorithms is the same, and the unscrambling process is required for both the DIF and the DIT algorithms. However, there are clear differences between the DIF and DIT algorithms, e.g., the position of the twiddle factor multiplications: the DIF algorithms have the twiddle factor multiplications after the DFTs, and the DIT algorithms have the twiddle factor multiplications before the DFTs.

2.3. Prime Factor FFT Algorithms

In the Cooley-Tukey and Sande-Tukey algorithms, index n or k is expressed with Eq. (2.9) and Eq. (2.10). This representation of the index number is called an index mapping. In the Cooley-Tukey and Sande-Tukey algorithms, the twiddle factor multiplications are required for the DFT computation. If the decomposition of N is relatively prime, i.e., the greatest common divisor gcd(r_1, r_0) = 1, there exists another type of FFT algorithms, the prime factor FFT algorithms, which remove the twiddle factor multiplications.

If r_1 and r_0 are relatively prime, there exists another index mapping, the so-called Good's mapping [19]. This mapping is a variant of the Chinese Remainder Theorem. An index n can be expressed as

  n = (r_0 (r_0^{−1} n_1 mod r_1) + r_1 (r_1^{−1} n_0 mod r_0)) mod N    (2.31)

where N = r_1 × r_0, 0 ≤ n_1 < r_1, 0 ≤ n_0 < r_0, r_0^{−1} is the multiplicative inverse of r_0 modulo r_1, i.e., (r_0 r_0^{−1}) mod r_1 = 1, and r_1^{−1} is the multiplicative inverse of r_1 modulo r_0.
Example 2.4. Construct the index mapping for the 15-point DFT outputs. We have N = 3 × 5 with r_1 = 3 and r_0 = 5. The mapping for the outputs is simple. It can be constructed by

  n = (r_0 n_1 + r_1 n_0) mod N = (5 n_1 + 3 n_0) mod 15   (0 ≤ n_1 < r_1, 0 ≤ n_0 < r_0)

The mapping can be illustrated with an index matrix, shown in Fig. 2.5.

  n_1 \ n_0    0    1    2    3    4
     0         0    3    6    9   12
     1         5    8   11   14    2
     2        10   13    1    4    7

Figure 2.5. Index mapping for 15-point DFT outputs.

Example 2.5. Construct the index mapping for the 15-point DFT inputs according to Good's mapping. We have N = 3 × 5 with r_1 = 3 and r_0 = 5. The inverse r_1^{−1} is 2, since (3 · 2) mod 5 = 1, and r_0^{−1} = 2, since (5 · 2) mod 3 = 1. According to Eq. (2.31), the index can be computed as

  k = (5 (2 k_1 mod 3) + 3 (2 k_0 mod 5)) mod 15    (2.32)

The result is shown in Fig. 2.6.

  k_1 \ k_0    0    1    2    3    4
     0         0    6   12    3    9
     1        10    1    7   13    4
     2         5   11    2    8   14

Figure 2.6. Good's mapping for 15-point DFT inputs.
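The two index matrices can be reproduced with a few lines of code:

```python
# Sketch reproducing the index matrices of Fig. 2.5 and Fig. 2.6 for
# N = 15 = 3 * 5 (r1 = 3, r0 = 5).
N, r1, r0 = 15, 3, 5

# Output map: n = (5*n1 + 3*n0) mod 15
out_map = [[(r0 * n1 + r1 * n0) % N for n0 in range(r0)] for n1 in range(r1)]
assert out_map == [[0, 3, 6, 9, 12], [5, 8, 11, 14, 2], [10, 13, 1, 4, 7]]

# Input map, Eq. (2.32): k = (5*(2*k1 mod 3) + 3*(2*k0 mod 5)) mod 15,
# where 2 is the multiplicative inverse of 5 mod 3 and of 3 mod 5
in_map = [[(r0 * ((2 * k1) % r1) + r1 * ((2 * k0) % r0)) % N
           for k0 in range(r0)] for k1 in range(r1)]
assert in_map == [[0, 6, 12, 3, 9], [10, 1, 7, 13, 4], [5, 11, 2, 8, 14]]
```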
The computation with prime factor FFT algorithms is similar to the computation with the Cooley-Tukey algorithm. It can be divided into two steps:
1 Compute r_0 different r_1-point DFTs. This performs column-wise DFTs on the input index matrix.
2 Compute r_1 different r_0-point DFTs. This performs row-wise DFTs on the output index matrix.
Example 2.6. 15-point DFT with the prime factor mapping FFT algorithm. The input and output index matrices can be constructed as shown in Fig. 2.5 and Fig. 2.6. Following the computation steps above, the computation of the 15-point DFT can be performed by five 3-point DFTs followed by three 5-point DFTs. The 15-point DFT with the prime factor mapping FFT algorithm is shown in Fig. 2.7.
Figure 2.7. 15-point FFT with prime factor mapping.
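The two-step computation can be sketched in code and checked against a direct 15-point DFT (the input values are arbitrary); note that no twiddle factor multiplications appear between the two DFT steps.

```python
import cmath

# Sketch of the two-step prime factor computation for N = 15 = 3 * 5:
# column-wise 3-point DFTs on the input index matrix, then row-wise
# 5-point DFTs on the output index matrix -- with no twiddle factors.
N, r1, r0 = 15, 3, 5
W = lambda M, e: cmath.exp(-2j * cmath.pi * e / M)

x = [complex(k % 4, (7 * k) % 5) for k in range(N)]      # arbitrary input

in_map = [[(r0 * ((2 * k1) % r1) + r1 * ((2 * k0) % r0)) % N
           for k0 in range(r0)] for k1 in range(r1)]      # Fig. 2.6
out_map = [[(r0 * n1 + r1 * n0) % N for n0 in range(r0)] for n1 in range(r1)]

# Step 1: r0 = 5 column-wise 3-point DFTs
t = [[sum(x[in_map[k1][k0]] * W(r1, n1 * k1) for k1 in range(r1))
      for k0 in range(r0)] for n1 in range(r1)]
# Step 2: r1 = 3 row-wise 5-point DFTs
X = [None] * N
for n1 in range(r1):
    for n0 in range(r0):
        X[out_map[n1][n0]] = sum(t[n1][k0] * W(r0, n0 * k0) for k0 in range(r0))

X_direct = [sum(x[k] * W(N, n * k) for k in range(N)) for n in range(N)]
assert all(abs(a - b) < 1e-9 for a, b in zip(X, X_direct))
```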
The prime factor mapping based FFT algorithm above is also an in-place algorithm. Swapping the input and output index matrices gives another FFT algorithm, which does not need twiddle factor multiplications outside the butterflies either. Although the prime factor FFT algorithms are similar to the Cooley-Tukey and Sande-Tukey FFT algorithms, the prime factor FFT algorithms are derived from convolution-based DFT computations [19] [49] [31]. This later led to the Winograd Fourier Transform Algorithm (WFTA) [61].
2.4. Other FFT Algorithms
In this section, we discuss two other FFT algorithms. One is the splitradix FFT algorithm (SRFFT) and the other one is Winograd Fourier Transform algorithm (WFTA).
2.4.1. Split-Radix FFT Algorithm
Split-radix FFT algorithms (SRFFT) were proposed nearly simultaneously by several authors in 1984 [17] [18]. The algorithms belong to the FFT algorithms with twiddle factors. As a matter of fact, the split-radix FFT algorithms are based on an observation about the Cooley-Tukey and Sande-Tukey FFT algorithms: different decompositions can be used for different parts of an algorithm. This gives the possibility to select the most suitable algorithm for each part in order to reduce the computational complexity.
For instance, the signalﬂow graph (SFG) for a 16point radix2 DIF FFT algorithm is shown in Fig. 2.8.
Figure 2.8. Signal-flow graph for a 16-point DIF FFT algorithm.

The SRFFT algorithms exploit this idea by using both a radix-2 and a radix-4 decomposition in the same FFT algorithm. It is obvious that all twiddle factors are equal to 1 for the even-indexed outputs with the radix-2 FFT computation, i.e., the twiddle factor multiplications are not required there. In the radix-4 FFT computation, there is no such general rule (see Fig. 2.9). For the odd-indexed outputs, a radix-4 decomposition increases the computational efficiency, because the radix-4 FFT is more efficient than the radix-2 FFT from the multiplication complexity point of view (the four-point DFT is the largest multiplication-free butterfly). Consequently, the DFT computation uses different radix FFT
algorithms for the even and odd indexed outputs. This reduces the number of complex multiplications and additions/subtractions. A 16-point SRFFT is shown in Fig. 2.10.

Figure 2.9. Radix-4 DIF algorithm for 16-point DFT.

Figure 2.10. SFG for 16-point DFT with SRFFT algorithm.
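The split-radix decomposition can be sketched as a recursive routine: a radix-2 split (one half-length DFT) feeds the even-indexed outputs, and a radix-4 split (two quarter-length DFTs with twiddle factors) feeds the odd-indexed ones. The input values are arbitrary, and the routine is checked against a direct DFT.

```python
import cmath

# Recursive sketch of the split-radix idea: radix-2 decomposition for the
# even-indexed outputs, radix-4 for the odd-indexed ones.
def srfft(x):
    N = len(x)
    if N == 1:
        return list(x)
    if N == 2:
        return [x[0] + x[1], x[0] - x[1]]
    U = srfft(x[0::2])      # N/2-point DFT (even-indexed inputs)
    Z = srfft(x[1::4])      # N/4-point DFT (inputs 4l + 1)
    Zp = srfft(x[3::4])     # N/4-point DFT (inputs 4l + 3)
    X = [0] * N
    for k in range(N // 4):
        a = cmath.exp(-2j * cmath.pi * k / N) * Z[k]         # W_N^k Z[k]
        b = cmath.exp(-2j * cmath.pi * 3 * k / N) * Zp[k]    # W_N^{3k} Z'[k]
        X[k] = U[k] + (a + b)
        X[k + N // 2] = U[k] - (a + b)
        X[k + N // 4] = U[k + N // 4] - 1j * (a - b)
        X[k + 3 * N // 4] = U[k + N // 4] + 1j * (a - b)
    return X

x = [complex(k % 3, (5 * k) % 7) for k in range(16)]         # arbitrary input
X_direct = [sum(x[k] * cmath.exp(-2j * cmath.pi * n * k / 16) for k in range(16))
            for n in range(16)]
assert all(abs(a - b) < 1e-9 for a, b in zip(srfft(x), X_direct))
```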
Although the SRFFT algorithms are derived from observations of the radix-2 and radix-4 FFT algorithms, they cannot be derived by index mapping. This could be the reason that the algorithms were discovered so late [18]. The SRFFT can also be generalized to lengths N = p^k, where p is a prime number [18].

2.4.2. Winograd Fourier Transform Algorithm

The Winograd Fourier transform algorithm (WFTA) [61] uses the cyclic convolution method to compute the DFT. It is based on Rader's idea [49] for prime-length DFT computation. The aim of Winograd's algorithm is to minimize the number of multiplications, and WFTA succeeds in minimizing the number of multiplications to the smallest number known: the number of multiplications is O(N). The computation of an N-point DFT (N is a product of two coprime numbers r_1 and r_0) with WFTA can be divided into five steps: two pre-addition steps, a multiplication step in the middle, and two post-addition steps, as shown in Fig. 2.11.

Figure 2.11. General structure of WFTA: r_1-point input additions (r_0 sets), r_0-point input additions (r_1 sets), N-point multiplications, r_0-point output additions (r_1 sets), and r_1-point output additions (r_0 sets).

However, the minimization of the number of multiplications results in a complicated computation ordering and a large increase of other arithmetic operations, e.g., additions. Furthermore, the irregularity of WFTA makes it impractical for most real applications.

2.5. Performance Comparison

For the algorithm implementation, the computational load is of great concern. Usually the numbers of additions and multiplications are two important measures of the computational workload. The number of arithmetic operations depends on N. We compare the discussed algorithms from the addition and multiplication complexity points of view.
2.5.1. Multiplication Complexity

Since multiplication has a large impact on the speed and the power consumption, the multiplication complexity is important for the selection of FFT algorithms. In many DFT computations, both complex multiplications and real multiplications are required. For the purpose of comparison, the counting is based on the number of real multiplications. Since the transform lengths for the prime factor based algorithms and WFTA are restricted, the comparison is not strictly on the same transform length but rather on nearby transform lengths.

A complex multiplication can be realized directly with 4 real multiplications and 2 real additions, as shown in Fig. 2.12 (a). With a simple transformation, the number of real multiplications can be reduced to 3, but the number of real additions increases to 3, as shown in Fig. 2.12 (b). We consider a complex multiplication as 3 real multiplications and 3 real additions in the following analysis.

Figure 2.12. Realization of a complex multiplication. (a) Direct realization. (b) Transformed realization.

Furthermore, a complex multiplication with a twiddle factor W_N^k does not require any multiplications when k is a multiple of N/4, and it requires only 2 real multiplications and 2 real additions when k is an odd multiple of N/8.

For a DFT with a transform length of N = 2^n, the number of complex multiplications can be estimated as half of the total number of butterfly operations, e.g., (N log_2 N)/2. This number is overestimated. Taking these simplifications into account, the number of real multiplications for a
28 Chapter 2 .1. WFTA has been proven that it has the lowest number of multiplications for those transform lengths that are less than 16. From the multiplication complexity point of view.DFT with radix2 algorithm and transform length of N = 2 n is M = 3N ⁄ ( 2 log 2N ) – 5N + 8 [25]. The number of real multiplication for various FFT algorithms on complex data is shown in the following table [18]. N 16 60 64 240 256 504 512 1008 1024 10248 7856 7172 4360 3076 5804 3548 1800 1392 1284 2524 1572 264 208 196 1100 632 Radix2 24 Radix4 20 SRFFT 20 200 136 PFA WFTA Table 2. there are lower bounds that can be attained by algorithms for those transform lengths. As mentioned previously. If the transform length is a product of two or more coprime numbers. following by the prime factor algorithm. The radix4 algorithm for a DFT with transform length of requires N = 4n M = 9N ⁄ ( 8 log 2N ) – 43N ⁄ 12 + 16 ⁄ 3 real multiplications [18]. the splitradix algorithm. For the splitradix FFT algorithm. It requires the lowest number of multiplications of the existing algorithms. the most attractive algorithm is WFTA. Multiplication complexity for various FFT algorithms. However. the number of real multiplications is M = N log 2N – 3N + 4 for a DFT with N = 2 n [18]. there is no simple analytic expression for the number of real multiplications. These lower bounds can be computed [18]. and the ﬁxedradix algorithm.
2.5.2. Addition Complexity

In a radix-2 or radix-4 FFT algorithm, the addition and subtraction operations are used for realizing the butterfly operations and the complex multiplications. Since subtraction has the same complexity as addition, we consider a subtraction equivalent to an addition.

The additions for the butterfly operations are the larger part of the addition complexity. For each radix-2 butterfly operation (a 2-point DFT), the number of real additions is four, since each complex addition/subtraction requires two real additions. For a transform length of N = 2^n, a DFT requires N/2 radix-2 DFTs for each stage, so the total number of real additions is 4 (log_2 N)(N/2), or 2nN, with the radix-2 FFT algorithms. Each radix-4 DFT requires 8 complex additions/subtractions, i.e., 16 real additions. For a transform length of N = 4^n, a DFT requires N/4 radix-4 DFTs for each stage, and the total number of real additions is 16 (log_4 N)(N/4), or 4nN. Hence, both radix-2 and radix-4 FFT algorithms require the same number of butterfly additions for a DFT with a transform length that is a power of 4.

The number of additions required for the complex multiplications is less than that for the butterfly operations, but it cannot be ignored: as described previously, a complex multiplication generally requires 3 real additions. The exact number [25] for a DFT with a transform length of N = 2^n using the radix-2 algorithm is

  A = (7N/2) log_2 N − 5N + 8

The number of additions for a DFT with a transform length of N = 4^n is

  A = (25N/8) log_2 N − 43N/12 + 16/3

for the radix-4 algorithm [18]. The split-radix algorithm has the best result for addition complexity:

  A = 3N log_2 N − 3N + 4

additions for an N = 2^n DFT [18].
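The closed-form multiplication and addition counts can be checked numerically against the entries of Table 2.1 and Table 2.2:

```python
from math import log2

# Numerical check of the closed-form operation counts: real multiplications
# (M) and real additions (A) for radix-2 and split-radix with N = 2^n and
# radix-4 with N = 4^n, compared against entries of Table 2.1 and Table 2.2.
def m_radix2(N): return 3 * N / 2 * log2(N) - 5 * N + 8
def m_radix4(N): return 9 * N / 8 * log2(N) - 43 * N / 12 + 16 / 3
def m_srfft(N):  return N * log2(N) - 3 * N + 4

def a_radix2(N): return 7 * N / 2 * log2(N) - 5 * N + 8
def a_radix4(N): return 25 * N / 8 * log2(N) - 43 * N / 12 + 16 / 3
def a_srfft(N):  return 3 * N * log2(N) - 3 * N + 4

assert round(m_radix2(1024)) == 10248 and round(a_radix2(1024)) == 30728
assert round(m_radix4(1024)) == 7856 and round(a_radix4(1024)) == 28336
assert round(m_srfft(1024)) == 7172 and round(a_srfft(1024)) == 27652
assert round(m_radix2(16)) == 24 and round(a_radix2(16)) == 152
assert round(m_radix4(16)) == 20 and round(a_radix4(16)) == 148
```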
From the addition complexity point of view, the WFTA is a poor choice. In fact, the irregularity and the increased addition complexity make the WFTA less attractive for practical implementation.

The number of real additions for various FFT algorithms on complex data is given in Table 2.2 [18].

    N       Radix-2   Radix-4   SRFFT   PFA     WFTA
    16      152       148       148     -       -
    60      -         -         -       888     888
    64      1032      976       964     -       -
    240     -         -         -       4812    5016
    256     5896      5488      5380    -       -
    504     -         -         -       13388   14540
    512     13566     -         12292   -       -
    1008    -         -         -       29548   34668
    1024    30728     28336     27652   -       -

    Table 2.2. Addition complexity for various FFT algorithms.

2.6. Other Issues

Many issues are related to the implementation of FFT algorithms, e.g., scaling and rounding considerations, inverse FFT implementation, in-place and/or in-order computation, parallelism of the FFT algorithms, regularity of the FFT algorithms, etc. We discuss the first two issues in more detail.
2.6.1. Scaling and Rounding Issue
In hardware it is not possible to implement an algorithm with infinite accuracy. To obtain sufficient accuracy, the scaling and rounding effects must be considered. Without loss of generality, we assume that the input data {x(n)} are scaled so that |x(n)| < 1/2 for all n. To avoid overflow of the number range, we apply the safe scaling technique [58], which ensures that an overflow cannot occur. We take the 16-point DFT with the radix-2 DIF FFT algorithm (see Fig. 2.8) as an example. The basic operation for the radix-2 DIF FFT algorithm consists of a radix-2 butterfly operation and a complex multiplication, as shown in Fig. 2.13.
Figure 2.13. Basic operation for the radix-2 DIF FFT algorithm.

For two numbers u and v with |u| < 1/2 and |v| < 1/2, we have

    |U| = |u + v| ≤ |u| + |v| < 1                        (2.33)
    |V| = |(u − v)·W_N^p| = |u − v| ≤ |u| + |v| < 1      (2.34)

where the magnitude of the twiddle factor is equal to 1. To retain the magnitude within the number range, the results must be scaled with a factor 1/2. After scaling, rounding is applied in order to have the same input and output wordlengths. This introduces an error, which is
called quantization noise. This noise for a real number is modeled as an additive white noise source with zero mean and variance ∆²/12, where ∆ is the weight of the least significant bit.
Figure 2.14. Model for scaling and rounding of the radix-2 butterfly.

The additive noise for U and V, respectively, is complex. Assume that the quantization noise for U and V is Q_U and Q_V, respectively. For Q_U, we have

    E{Q_U} = E{Q_U,re + j·Q_U,im} = E{Q_U,re} + j·E{Q_U,im} = 0            (2.35)

    Var{Q_U} = E{Q_U,re² + Q_U,im²} = E{Q_U,re²} + E{Q_U,im²} = 2∆²/12     (2.36)
Since the quantization noise is independent of the twiddle factor multiplication, we have

    E{Q_V·W_N^p} = E{Q_V}·E{W_N^p} = 0                                     (2.37)

    Var{Q_V·W_N^p} = E{(Q_V·W_N^p)(Q_V·W_N^p)*} = E{Q_V·Q_V*} = 2∆²/12     (2.38)
After the analysis of the basic radix-2 butterfly operation, we consider the scaling and quantization effects in an 8-point DIF FFT algorithm. The noise propagation path for the output X(0) is highlighted with bold solid lines in Fig. 2.15.
Figure 2.15. Noise propagation.

For the sake of clarity, we assume that ∆ is equal for each stage, i.e., the internal wordlength is the same for all stages. If we analyze backwards for X(0), i.e., from stage l back to stage 1, it is easy to find that the noise from stage l−1 is scaled with 1/2 on its way to stage l, and that stage l−1 has exactly twice as many noise sources as stage l. Generally, if the transform length is N and the number of stages is n, where N = 2^n, the variance of a noise source from stage l is scaled with (1/2)^(2(n−l)) and the number of noise sources in stage l is 2^(n−l). Hence the total quantization noise variance for an output X(k) is

    Var{Q_X(k)} = (∆²/6)·Σ_{l=1}^{n} 2^(n−l)·(1/2)^(2(n−l))
                = (∆²/6)·Σ_{l=1}^{n} (1/2)^(n−l)
                = (∆²/6)·(2 − 2^(−n+1))                                    (2.39)
The output variance for an output X(k) can be derived by the following equation:

    Var{X(k)} = E{X(k)·X*(k)}
              = E{ (1/N)·Σ_{n=0}^{N−1} x(n)·W_N^{nk} · (1/N)·Σ_{m=0}^{N−1} x*(m)·W_N^{−mk} }
              = (1/N²)·Σ_{n=0}^{N−1} Σ_{m=0}^{N−1} E{x(n)·x*(m)}·W_N^{(n−m)k}
              = (1/N²)·Σ_{n=0}^{N−1} E{x(n)·x*(n)}
              = (1/N²)·N·δ² = δ²/N                                         (2.40)

if the input sequence is zero-mean white noise with variance δ², where E{x(n)·x*(m)} = 0 for n ≠ m follows from the white noise assumption. If ∆_in is the weight of the least significant bit for the real or imaginary part of the input, the input variance δ² is equal to 2∆_in²/12. The signal-to-noise ratio (SNR) for the output X(k) is therefore [60]

    SNR = (δ²/N) / Var{Q_X(k)} = (2∆_in²/12) / (N·(∆²/6)·(2 − 2^(−n+1))) = (∆_in²/∆²)·1/(2(2^n − 1))

For a radix-r DIF FFT algorithm, a similar analysis [60] yields

    SNR = (∆_in²/∆²)·r/(2(2N − r))

This result, which is based on the white noise assumption, can be used to determine the required internal wordlength. The finite wordlength effect of finite precision coefficients is more complicated. Simulation is typically used to determine the wordlength of the coefficients.
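Eq. (2.39) is easy to sanity-check by comparing the stage-by-stage sum with the closed form; a minimal sketch (the function names are ours):

```python
def noise_var_sum(n, delta=1.0):
    # 2^(n-l) noise sources at stage l, each of variance 2*delta^2/12,
    # attenuated by (1/2)^(2(n-l)) on the way to the output
    return (delta ** 2 / 6) * sum(2 ** (n - l) * 0.25 ** (n - l)
                                  for l in range(1, n + 1))

def noise_var_closed(n, delta=1.0):
    # Eq. (2.39): (delta^2/6) * (2 - 2^(-n+1))
    return (delta ** 2 / 6) * (2 - 2.0 ** (-n + 1))
```

The two expressions agree for any number of stages n, confirming the geometric-series step in the derivation.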
2.6.2. IDFT Implementation

An OFDM system requires both the DFT and the IDFT for signal processing, so the IDFT implementation is also critical for the OFDM system. There are various approaches to the IDFT implementation. The straightforward one is to compute the IDFT directly according to Eq. (1.2), which has a computational complexity of O(N²). This approach is obviously not efficient.

The second approach is similar to the FFT computation. If we ignore the scaling factor 1/N, the only difference between the DFT and the IDFT is the twiddle factor, which is W_N^{−nk} instead of W_N^{nk}. This can easily be accommodated by changing the read addresses of the twiddle factor ROM(s) for the twiddle factor multiplications. This approach adds an overhead to each butterfly operation and changes the access order of the coefficient ROM. It also requires a reordering of the input when a radix-r DFT is used.

The third approach converts the computation of the IDFT into a computation of the DFT. This is shown by the following equation:

    x(k) = (1/N)·Σ_{n=0}^{N−1} X(n)·e^{j2πnk/N}
         = (1/N)·Σ_{n=0}^{N−1} [X_re(n)·e^{j2πnk/N} + j·X_im(n)·e^{j2πnk/N}]
         = (1/N)·( Σ_{n=0}^{N−1} [X_re(n)·e^{−j2πnk/N} − j·X_im(n)·e^{−j2πnk/N}] )*
         = (1/N)·( Σ_{n=0}^{N−1} X*(n)·e^{−j2πnk/N} )*

where the term within the parenthesis is a definition of the DFT and a* is the conjugate of a.
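The third approach can be sketched directly in software (a direct DFT stands in here for the FFT datapath; the function names are ours):

```python
import cmath

def dft(x):
    # direct DFT, O(N^2); stands in for the FFT/DFT datapath
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def idft_via_dft(X):
    # x(k) = (1/N) * conj( DFT( conj(X) ) ), i.e. the third approach above
    N = len(X)
    return [v.conjugate() / N for v in dft([s.conjugate() for s in X])]
```

A round trip through dft followed by idft_via_dft recovers the original sequence to numerical precision, which is how the same hardware datapath can serve both the modulator and the demodulator.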
The conjugation of a complex number can be done by swapping the real and imaginary parts. Hence, the IDFT can be computed with a DFT by adding two swaps and one scaling: swap the real and imaginary parts at the input before the DFT computation, swap the real and imaginary parts at the output from the DFT, and scale with the factor 1/N.

2.7. Summary

In this chapter we discussed the most commonly used FFT algorithms, for instance, the Cooley-Tukey and Sande-Tukey algorithms. Each computation step was given in detail for the Cooley-Tukey algorithms. Other algorithms like the prime factor algorithm, the split-radix algorithm, and the WFTA were also discussed. We compared the different algorithms in terms of the number of additions and multiplications. Some other aspects, e.g., memory requirements, will be discussed later.
3 LOW POWER TECHNIQUES

Low power consumption has emerged as a major challenge in the design of integrated circuits. In this chapter, we discuss the basic principles for power consumption in standard CMOS circuits. Afterwards, a review of low-power techniques for CMOS circuits is given.

3.1. Power Dissipation Sources

In CMOS circuits, the main contributions to the power consumption are from short-circuit, leakage, and switching currents. In the following subsections, we introduce them separately.

3.1.1. Short-Circuit Power

In a static CMOS circuit, there are two complementary networks: the p-network (pull-up) and the n-network (pull-down). The logic functions for the two networks are complementary. Normally, when the input and output states are stable, only one network is turned on and conducts the output either to the power supply node or to the ground node, while the other network is turned off and blocks the current from flowing. Short-circuit current exists during the transitions, when one network is turned on and the other network is still active. For example, assume the input signal to an inverter is switching from 0 to V_dd. There exists a short time interval where the input voltage is larger than V_tn but less than V_dd − V_tp. During this time interval, both
the PMOS transistor (p-network) and the NMOS transistor (n-network) are turned on, and the short-circuit current flows through both kinds of transistors from the power supply line to the ground. The exact analysis of the short-circuit current in a simple inverter [6] is complex, but it can be studied by simulation using SPICE. It is observed that the short-circuit current is proportional to the slope of the input signals, the output loads, and the transistor sizes [54]. The short-circuit current typically consumes less than 10% of the total power in a "well-designed" circuit [54].

3.1.2. Leakage Power

There are two contributions to the leakage currents: one from the currents that flow through the reverse-biased diodes, and the other from the currents that flow through transistors that are non-conducting (see Fig. 3.1).

Figure 3.1. Leakage current types: (a) reverse-biased diode current, (b) subthreshold leakage current.

The leakage currents are proportional to the leakage area and exponential in the threshold voltage. They depend on the technology and cannot be modified by the designers except in some logic styles. The leakage current is in the order of picoamperes with current technology, but it will increase as the threshold voltage is reduced. The leakage current is therefore currently not a severe problem in most digital designs. However, in some cases, like large RAMs, the leakage current is one of the main concerns. In some cases, the power consumed by leakage current can be as large as the power consumed by the switching
current for a 0.06 µm technology. The usage of multiple threshold voltages can reduce the leakage current in deep-submicron technologies.

3.1.3. Switching Power

The switching currents are due to the charging and discharging of the node capacitances. The node capacitances mainly include gate, overlap, and interconnection capacitances. The power consumed by the switching currents [63] can be expressed as

    P = α·C_L·f·V_dd²/2    (3.1)

where α is the switching activity factor, C_L is the capacitance load, f is the clock frequency, and V_dd is the power supply voltage. The equation shows that the switching power depends on a few quantities that are readily observable and measurable in CMOS circuits. It is applicable to almost every digital circuit and gives guidance for low power design. The power consumed by the switching currents is the dominant part of the power consumption, and reducing the switching currents is the focus of most low power design techniques.
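A small sketch of Eq. (3.1); the example numbers are hypothetical:

```python
def switching_power(alpha, c_load, freq, vdd):
    # Eq. (3.1): P = alpha * C_L * f * Vdd^2 / 2
    return alpha * c_load * freq * vdd ** 2 / 2

# hypothetical node: activity 0.1, 1 pF load, 100 MHz clock, 3.3 V supply
p = switching_power(0.1, 1e-12, 100e6, 3.3)
```

Note the quadratic dependence on V_dd, which is why the voltage scaling discussed later in this chapter is so effective.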
3.2. Low Power Techniques

Low power techniques can be discussed at various levels of abstraction: the system level, the algorithm level, the architecture level, the logic level, the circuit level, and the technology level. Fig. 3.2 shows some examples of techniques at the different levels, organized after the abstraction level.

Figure 3.2. Low-power design methodology at different abstraction levels: system (partitioning, power-down), algorithm (parallelism, pipelining), architecture (voltage scaling), logic and circuit (logic styles and manipulation, data encoding, energy recovery, transistor sizing), technology (threshold reduction, double-threshold devices).

In the following, we give an overview of the different low power techniques.

3.2.1. System Level

A system typically consists of both hardware and software components. The system design usually has the largest impact on the power consumption, and hence the low power techniques applied at this level have the greatest potential for power reduction. However, it is hard to find the best solution for low power in the large design space, and there is a shortage of accurate power analysis tools at this level. The system design includes the hardware/software partitioning, the hardware platform selection (application-specific or general-purpose processors), the resource sharing (scheduling) strategy, etc., which all affect the power consumption. If, for example, instruction-level power models for a given processor are available, software power optimization can be performed [56]. It is observed
that faster code and frequent usage of the cache are most likely to reduce the power consumption. The order of the instructions also has an impact on the internal switching within the processor and hence on the power consumption.

In recent years, power management has gained a lot of attention in operating system design. For example, the Microsoft desktop operating systems support advanced power management (APM). Power-down and clock gating are two of the most used low power techniques at the system level. The non-active hardware units are shut down to save power. This is called sleep mode and is widely used in low power processors. The StrongARM SA-1100 processor has three power states, and the average power varies for each state [29]. These power states can be utilized by the software through the advanced configuration and power interface (ACPI).

Figure 3.3. Power states for the StrongARM SA-1100 processor: RUN (400 mW), IDLE (50 mW), and SLEEP (160 µW), with state transition times ranging from 10 µs to 160 ms.

The power-down can be extended to the whole system. The clock drivers, which often consume 30-40% of the total power consumption, can be gated to reduce the switching activities, as illustrated in Fig. 3.4.

Figure 3.4. Clock gating: the block enable signal is ANDed with the clock that drives the block's clock network.

Adapting the clock frequency and/or dynamically scaling the supply voltage to match the performance constraints is another low power technique. The system is designed for the peak performance. However, the computation requirement is time varying. The lower requirement for performance in certain time intervals can be used to reduce the
power supply voltage. This requires either a feedback mechanism (load monitoring and voltage control) or predetermined timing to activate the voltage down-scaling, as in the dynamic voltage scaling technique [42].

Another less explored domain for low power design is the use of asynchronous design techniques. The asynchronous designs have many attractive features, like non-global clocking, automatic power-down, no spurious transitions, and low peak currents. It is easy to reduce the power consumption further by combining the asynchronous design technique with other low power techniques. This is illustrated in Fig. 3.5.

Figure 3.5. Asynchronous design with dynamic voltage scaling: a load monitor controls a DC-DC converter, which supplies an asynchronous processing unit accessed through synchronous/asynchronous interfaces with input and output buffers.

3.2.2. Algorithm Level

The algorithm selection has a large impact on the power consumption. The task of algorithm design is to select the most energy-efficient algorithm that just satisfies the constraints. For example, using the fast Fourier transform instead of direct computation of the DFT reduces the number of operations by a factor of 102.4 for a 1024-point Fourier transform, and the power consumption is likely to be reduced by a similar factor.

The cost of an algorithm includes the computation part and the communication/storage part. The complexity measure for an algorithm includes the number of operations and the cost of
communication/storage. Reducing the complexity of an algorithm reduces the number of operations and hence the power consumption. The reduction of the number of operations, the cost per operation, and the long distance communications are key issues in algorithm selection. The regularity and locality of an algorithm affect the controls and communications in the hardware. In some cases, faster algorithms, like wave digital filters, can be chosen for energy-efficient applications [58].

One important technique for low power at the algorithmic level is algorithmic transformations [45] [46]. This technique exploits the complexity, concurrency, regularity, and locality of an algorithm. The possibility of increasing the concurrency in an algorithm allows the use of other techniques, e.g., voltage scaling, pipelining, and interleaving, to reduce the power consumption.

The loop unrolling technique [9] [10] is a transformation that aims to enhance the speed. This technique can also be used for reducing the power consumption. With loop unrolling, the critical path can be reduced, and hence voltage scaling can be applied to reduce the power consumption. In the example in Fig. 3.6, the unrolling reduces the critical path and gives a voltage reduction of 26% [10]. This reduces the power consumption by 20%, even though the capacitance load is increased by 50% [10]. Furthermore, this technique can be combined with other techniques at the architectural level to save more power.

Figure 3.6. (a) Original signal flow graph of the first-order recursion y(n) = b0·x(n) + a1·y(n−1). (b) Unrolled signal flow graph computing two outputs per iteration with the coefficients b0, a1·b0, and a1².
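The transformation in Fig. 3.6 can be sketched in software. Both routines below compute the same first-order recursion, but the unrolled version produces two outputs per iteration from y(n−1) alone, which shortens the recursive critical path (a sketch assuming an even number of input samples; function names are ours):

```python
def iir_original(x, b0, a1):
    # y[n] = b0*x[n] + a1*y[n-1], one output per iteration
    y, prev = [], 0.0
    for xn in x:
        prev = b0 * xn + a1 * prev
        y.append(prev)
    return y

def iir_unrolled(x, b0, a1):
    # unrolled by two: y[n+1] = b0*x[n+1] + a1*b0*x[n] + a1^2 * y[n-1],
    # so both outputs of an iteration depend only on y[n-1]
    y, prev = [], 0.0
    for n in range(0, len(x) - 1, 2):
        y0 = b0 * x[n] + a1 * prev
        y1 = b0 * x[n + 1] + a1 * b0 * x[n] + (a1 ** 2) * prev
        y += [y0, y1]
        prev = y1
    return y
```

In hardware, the shorter recursive path is what allows the supply voltage to be lowered while the original sample rate is maintained.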
3.2.3. Architecture Level

Once the algorithm is selected, the architecture can be determined for the given algorithm. As we can see from Eq. (3.1), an efficient way to reduce the dynamic power consumption is to reduce the power supply voltage. However, this increases the gate delay. The delay of a minimum-size inverter (in a 0.35 µm standard CMOS technology) increases as the supply voltage is reduced, which is shown in Fig. 3.7. To compensate for the delay, we use low power techniques like parallelism and pipelining [11]. We demonstrate this with an example of an architecture transformation.

Figure 3.7. Delay vs. power supply voltage for an inverter (delay in ns versus supply voltage from 0.5 V to 3.5 V, 0.35 µm standard CMOS).
Example 3.1. A datapath to determine the largest of C and (A + B) is shown in Fig. 3.8. It requires an adder and a comparator. The original clock frequency is 40 MHz [11].

Figure 3.8. Original datapath.

In order to maintain the throughput while reducing the power supply voltage, we use a parallel architecture. The parallel architecture with twice the amount of resources is shown in Fig. 3.9. The clock frequency can be reduced to half, from 40 MHz to 20 MHz, since two tasks are executed concurrently. The use of two parallel datapaths is equivalent to the interleaving of two computational tasks. This allows the supply voltage to be scaled down from 5 V to 2.9 V [11]. Since extra routing is required to distribute the computations to the two parallel units, the capacitance load is increased by a factor of 2.15 [11]. Still, this gives a significant power saving [11]:

    P_par = C_par·V_par²·f_par = (2.15·C_orig)·(0.58·V_orig)²·(f_orig/2) ≈ 0.36·P_orig
Figure 3.9. Parallel implementation.

Example 3.2. Pipelining is another method for increasing the throughput. By adding a pipelining register after the adder in Fig. 3.8, the throughput can be increased from 1/(T_add + T_comp) to 1/max(T_add, T_comp). If T_add is equal to T_comp, this increases the throughput by a factor of 2. With this enhancement, the supply voltage can also in this case be scaled down to 2.9 V (the gate delay doubles) [11]. The effective capacitance increases by a factor of 1.15 because of the insertion of latches [11]. The power consumption for pipelining [11] is

    P_pipe = C_pipe·V_pipe²·f_pipe = (1.15·C_orig)·(0.58·V_orig)²·f_orig ≈ 0.39·P_orig
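Both examples follow directly from Eq. (3.1); a one-line sketch makes the bookkeeping explicit:

```python
def scaled_power_ratio(cap_factor, volt_factor, freq_factor):
    # P is proportional to C * V^2 * f (Eq. 3.1, constant activity factor)
    return cap_factor * volt_factor ** 2 * freq_factor

parallel = scaled_power_ratio(2.15, 0.58, 0.5)   # Example 3.1
pipelined = scaled_power_ratio(1.15, 0.58, 1.0)  # Example 3.2
```

The quadratic voltage term dominates: even with the extra capacitance, both transformed architectures end up at roughly a third of the original power.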
Figure 3.10. Pipeline implementation.

One benefit of pipelining is the low area overhead in comparison with using parallel datapaths: the area overhead equals the area of the inserted latches. Another benefit is that the amount of glitches can be reduced. Further power saving can be obtained by combining parallelism and/or pipelining.

However, there exists an optimal power supply voltage. Reduction of the supply voltage below the optimal voltage increases the power consumption, since the delay increases significantly as the voltage approaches the threshold voltage and the capacitance load for the routing and/or pipeline registers increases.

Locality is also an important issue in architecture trade-offs. The on-chip communication through long buses requires a significant amount of power, and reducing such communication is important. However, most low power techniques from the system level down to the architecture level do not concentrate on this issue.

3.2.4. Logic Level

The power consumption depends on the switching activity factor, which in turn depends on the statistical characteristics of the data. The low power techniques at the logic level focus mainly on the reduction of the switching activity factor by using the signal correlation and, of course, on the node capacitances.
As we know from gated clocking, the clock input to a non-active functional block does not change when gated, and this reduces the switching of the clock network. Precomputation [1] uses the same concept to reduce the switching activity factor: a selective precomputation of the output of a circuit is done before the output is required, and the switching activity is reduced by gating part of the inputs to the circuit. This is illustrated in Fig. 3.11.

Figure 3.11. A precomputation structure for low power (registers R1 and R2 feed the main block A; the precomputation block g controls the enable of R2).

An example of precomputation for low power is the comparator. The input data is partitioned into two parts, corresponding to registers R1 and R2. The comparator takes the MSBs of the two numbers into register R1 and the remaining bits into R2. One part, R1, is processed in the precomputation block g one clock cycle before the main computation in A is performed. If the two MSBs are not equal, the output from g gates the remaining inputs, and only a small portion of the inputs to the comparator's main block A (a subtractor) changes. The power can then be saved by reducing the switching activity factor in A, and hence the power consumption.

Gate reorganization [12] [32] [57] is a technique to restructure the circuit. This can be decomposition of a complex gate into simple gates, composition of simple gates into a complex gate, duplication of a gate, or deletion/addition of wires. The decomposition of a complex gate and the duplication of a gate help to separate the critical and non-critical paths and to reduce the size of the gates in the non-critical path. In some cases, the decomposition of a complex gate increases the circuit speed and gives more space for power supply voltage scaling.
The deletion of wires reduces the capacitance load and the circuit size. The addition of wires helps to provide an intermediate circuit that may eventually lead to a better one.

Encoding defines the way data bits are represented in the circuits. The encoding is usually optimized for reduction of delay or area. In low power design, the encoding is optimized for reduction of switching activities, since various encoding schemes have different switching properties. For instance, counters with binary and Gray code have the same functionality, but the logic coding style has a large impact on the number of transitions. The full counting cycle for a 2-bit binary coded counter is 00, 01, 10, 11, and back to 00, which requires 6 transitions. The full counting cycle for a 2-bit Gray coded counter is 00, 01, 11, 10, and back to 00, which requires only 4 transitions. For an N-bit counter with binary code, a full counting cycle requires 2(2^N − 1) transitions [63], whereas a full counting cycle for a Gray coded N-bit counter requires only 2^N transitions. The binary coded counter thus has about twice as many transitions as the Gray coded counter when N is large. The same reasoning applies to finite state machines, where the states can be coded with different schemes. Careful choice of the coding style is important to meet the speed requirement and minimize the power consumption.

A bus is an on-chip communication channel that has a large capacitance. As the on-chip transfer rate increases, the use of buses contributes a significant portion of the total power. Bus encoding is a technique that exploits the properties of the transmitted signals to reduce the power consumption. For instance, adding an extra bit to select between the inverted and the non-inverted bits at the receiver end can save power [53]. Low swing techniques can also be applied to the bus [27].
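The transition counts for the two counter coding styles can be checked with a short sketch:

```python
def transitions(seq):
    # total bit flips over one full counting cycle, including the wrap-around
    return sum(bin(a ^ b).count("1") for a, b in zip(seq, seq[1:] + seq[:1]))

n = 4
binary = list(range(2 ** n))
gray = [v ^ (v >> 1) for v in binary]  # standard binary-to-Gray conversion

# binary counter: 2*(2^n - 1) = 30 flips; Gray counter: 2^n = 16 flips
```

The Gray sequence flips exactly one bit per step, which is where the factor-of-two saving for large N comes from.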
The composition of simple gates into a complex gate can reduce the power consumption if the complex gate reduces the charging/discharging of a frequently switching node. Using a binary coded counter therefore requires more power than using a Gray coded counter under the same conditions. Traditionally, the logic coding style is used for enhancement of the speed performance; in low power design, the coding style is also chosen to minimize the number of transitions.
In CMOS circuits, the dynamic power consumption is caused by the transitions. In some cases, the amount of spurious transitions (glitches) is large, for example in array multipliers. Spurious transitions typically consume between 10% and 40% of the switching activity power in typical combinational logic [20], so they cannot be ignored. To reduce the spurious transitions, the delays of the signals from registers that converge at a gate should be roughly equal. This technique is called path balancing and can be accomplished by insertion of buffers and device sizing [33]. The insertion of buffers increases the total load capacitance but can still reduce the spurious transitions.

Many logic gates have inputs that are logically equivalent, i.e., the swapping of inputs does not modify the logic function of the gate. Example gates are NANDs, NORs, XORs, etc. However, from the power consumption point of view, the order of the inputs does affect the power consumption. For instance, in a two-input NAND gate (Fig. 3.12), the A-input, which is near the output, consumes less power than the B-input, which is close to the ground, for the same switching activity factor. Pin ordering assigns the more frequently switching signals to the input pins that consume less power. In this way, the power consumption is reduced without cost. However, the statistics of the switching activity factors for the different pins must be known in advance, and this limits the use of pin ordering [63].

Figure 3.12. NAND gate.

3.2.5. Circuit Level

At the circuit level, the potential power savings are often smaller than those at the higher abstraction levels. In most cases, however, this cannot be ignored: a few percent improvement of a D flip-flop can significantly reduce the power consumption in deeply pipelined systems, and the power savings can be significant since the basic cells are frequently used. The selection of logic style affects the speed and power consumption, as different logic styles have different electrical characteristics. For instance,
the standard CMOS logic is a good starting point for speed and power tradeoff.
In some cases, other logic styles, like complementary pass-transistor logic (CPL), are more efficient, for instance for the XOR/XNOR implementation. CPL implements a full adder with fewer transistors than the standard CMOS logic. The evaluation in the full adder is done only with an NMOS transistor network, which gives a small layout as well. This is paid for by a larger delay.

Figure 3.13. CPL logic network (complementary inputs drive an NMOS pass-transistor network that produces complementary outputs).

Transistor sizing affects both delay and power consumption. Generally, a gate with smaller size has smaller capacitance and consumes less power. To minimize the transistor sizes and still meet the speed requirement is a trade-off. Typically, the transistor sizing uses static timing analysis to find those gates whose slack time is larger than zero and whose sizes can therefore be reduced. The transistor sizing is generally applicable for different technologies.

3.3. Low Power Guidelines

Several approaches to reduce the power consumption have been briefly discussed. Below we summarize some of the most commonly used low power techniques.

• Reduce the number of operations. The selection of algorithm and/or architecture has a significant impact on the power consumption.

• Power supply voltage scaling. The voltage scaling is an efficient way to reduce the power consumption. Since the throughput is reduced as the voltage is reduced, this may need to be compensated for with parallel and/or pipelining techniques.
• I/Os between chips can consume large power due to the large capacitive loads. Reducing the number of chips is a promising approach to reduce the power consumption.

• Power management. In many systems, the most power consuming parts are often idle. For example, in a laptop computer, the display and the hard disk could consume more than 50% of the total power consumption. Using power management strategies to shut down these components when they are idle for a long time can achieve good power savings.

• Reduce the number of transitions. To minimize the number of transitions, especially the glitches, is important.

• Reduce the effective capacitance. The effective capacitance can be reduced by several approaches, for example compact layout and efficient logic styles.

3.4. Summary

In this chapter we discussed some low power techniques that are applicable at different abstraction levels.
4 FFT ARCHITECTURES

Not only have several variations of the FFT algorithm been developed after the Cooley-Tukey publication, but also various implementations. Generally, the FFT can be implemented in software, with general-purpose digital signal processors, with application-specific processors, or with algorithm-specific processors. Implementations in software on general-purpose computers can be found in the literature and are still being explored in some projects, for instance the FFTW project in the Laboratory for Computer Science at MIT [28]. Software implementations are not suitable for our target application, as the power consumption is too high. Since it is hard to summarize all other implementations, we will concentrate on algorithm-specific architectures and only give a brief overview of some FFT architectures.

4.1. General-Purpose Programmable DSP Processors

Many commercial programmable DSP processors include special instructions for the FFT computation. Although the performance varies from one to another, most of them belong to the Harvard architecture from the architectural point of view. A processor with Harvard architecture has separate buses for program and data.
A typical programmable DSP processor has on-chip data and program memories, a MAC (multiply-accumulate) unit, an ALU, an address generator, program control, and I/O interfaces, as illustrated in Fig. 4.1.

Figure 4.1. General-purpose programmable DSP processor.

The computation of the FFT with a general-purpose DSP processor does not differ much from the software computation of the FFT on a general-purpose computer. To compute the FFT with a general-purpose DSP processor requires three steps: first the data input, then the FFT/IFFT computation, and finally the data output. In some DSP processors, for instance TI's TMS320C3x, bit-reversed addressing is available to accelerate the unscrambling of the data output. Typical FFT/IFFT execution times are about 1 ms [2] [41] [55], which is far from that of more specialized implementations. The implementation with general-purpose programmable DSP processors is therefore not applicable due to the throughput requirement.

4.2. Programmable FFT-Specific Processors

Several programmable FFT processors have been developed for the FFT/IFFT computations. These processors are 5 to 10 times faster than the general-purpose programmable DSP processors. The programmable FFT-specific processors have specific butterflies and at least one complex multiplier [65]. The butterfly is usually radix-2 or radix-4. There is often an on-chip coefficient ROM, which
The processor has two internal workspace RAMs. The processor requires 98 µs to perform 1024point FFT with a system clock of 40 MHz. radix 4. This type of programmable FFTspeciﬁc processors are often provided with windowing functions in either time or frequency domain. but consumes 8 W at 3. Using multiple processor conﬁguration can achieve a higher throughput. The Zarlink’s (former Plessey) PDSP16515A processor performs decimation in time. Although the PDSP1615A processor accelerates the FFT computation. Data are loaded into an internal workspace RAM in normal sequential order. A recent released FFT speciﬁc processor from DoubleBW systems B. has higher throughput (100 Msamples/s) [16]. Input 3 Term Window Operator Coefficient ROM Workspace RAM Workspace RAM Radix4 Datapath Output Buffer Output Figure 4. FFTspeciﬁc processor PDSP16515A. and then readout in correct order. one output buffer. FFT ARCHITECTURES 55 .3 V.stores sinus and cosine coefﬁcients. it is still hard to meet the throughput requirement with a single processor due to the slow I/O. and one coefﬁcient ROM.2. forward or inverse Fast Fourier Transforms [65]. V. but the power consumption is then substantially higher. processed.
4.3. Algorithm-Specific Processors

Non-programmable algorithm-specific processors can also be designed for the computation of FFT algorithms. These processors are designed mostly for fixed-length FFTs. There are mainly three types of algorithm-specific processors: fully parallel FFT processors, column FFT processors, and pipelined FFT processors. All three types represent different mappings of the signal-flow graph of the FFT onto hardware structures. The architecture of an algorithm-specific FFT processor is therefore optimized with respect to memory structure, control units, and processing elements.

The hardware structure in a fully parallel FFT processor is an isomorphic mapping of the signal-flow graph [3]. For example, the signal-flow graph for an 8-point FFT algorithm is shown in Fig. 4.3. The 8-point fully parallel FFT processor requires 24 complex adders and 5 complex multipliers. For long transform lengths the hardware requirement is excessive, the routing between the processing elements is complex and difficult, and, hence, this approach is not power efficient.

Figure 4.3. Signal-flow graph for an 8-point FFT (W^n denotes multiplication with W_8^n).

To reduce the hardware complexity, a column FFT processor or a pipelined FFT processor can be used. A set of processing elements in a column FFT processor [21] computes one stage at a time. The results are fed back to the same set of processing elements to compute the next stage.
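The operation counts above can be checked with a small model of the 8-point decimation-in-frequency signal-flow graph. The sketch below (Python, written for this text and not part of the thesis) performs the radix-2 DIF FFT and counts complex additions and non-unity twiddle-factor multiplications; for N = 8 it gives the 24 complex additions and 5 twiddle multiplications mentioned above.

```python
import cmath

def dif_fft(x):
    """Radix-2 decimation-in-frequency FFT, counting complex adds and
    non-unity twiddle multiplications (an isomorphic map of the SFG)."""
    x = [complex(v) for v in x]
    N = len(x)
    n_add = n_mul = 0
    span = N
    while span > 1:
        half = span // 2
        for start in range(0, N, span):
            for i in range(half):
                a = x[start + i]
                b = x[start + i + half]
                x[start + i] = a + b              # upper butterfly output
                t = a - b                         # lower butterfly output
                n_add += 2
                if i != 0:                        # W^0 = 1 needs no multiplier
                    t *= cmath.exp(-2j * cmath.pi * i / span)
                    n_mul += 1
                x[start + i + half] = t
        span = half
    # outputs are in bit-reversed order; undo the permutation
    bits = N.bit_length() - 1
    out = [0j] * N
    for k in range(N):
        out[int(format(k, f'0{bits}b')[::-1], 2)] = x[k]
    return out, n_add, n_mul
```

Each butterfly contributes one complex addition and one complex subtraction, so the 12 butterflies of the 8-point graph give 24 complex adders, and only the twiddle factors different from 1 require multipliers.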
For a pipelined FFT processor, each stage has its own set of processing elements, and all stages are computed as soon as data are available. Pipelined FFT processors have features like simplicity, modularity, and high throughput. These features are important for real-time, in-place applications where the input data often arrive in a natural sequential order. We therefore select a pipeline architecture for our FFT processor implementation.

The most common groups of pipelined FFT architectures are

• Radix-2 multipath delay commutator (R2MDC)
• Radix-2 single-path delay feedback (R2SDF)
• Radix-4 multipath delay commutator (R4MDC)
• Radix-4 single-path delay commutator (R4SDC)
• Radix-4 single-path delay feedback (R4SDF)
• Radix-2² single-path delay commutator (R2²SDC)

We will discuss these pipeline architectures in more detail.

4.3.1. Radix-2 Multipath Delay Commutator

The radix-2 multipath delay commutator (R2MDC) architecture is the most straightforward approach to implement the radix-2 FFT algorithm with a pipeline architecture [48]. An 8-point R2MDC FFT is shown in Fig. 4.4.

Figure 4.4. An 8-point DIF R2MDC architecture.

When a new frame arrives, the first four input data are multiplexed to the top-left delay elements in the figure, and the next four input data go directly to the butterfly. In this way the first input sample is delayed by four samples and arrives at the butterfly simultaneously with input sample x(4). This completes the start-up of the first stage of the pipeline. The outputs from the first-stage butterfly and the multiplier are then fed into the multipath delay commutator between stage 1 and stage 2.
There are two paths (multipath) with delay elements and one switch (commutator). The first and second outputs from the upper side of the first-stage butterfly are fed into the two upper delay elements. After this, the switch changes, and the third and fourth outputs from the upper output of the first butterfly are sent directly to the butterfly at stage 2, while the first and second outputs from the multiplier of the first stage are delayed so that they arrive together with the fifth and sixth outputs from the top. The multipath delay commutator thus alleviates the data dependency problem.

However, the butterfly and the multiplier are idle half the time, waiting for new inputs. Hence the utilization of the butterfly and the multiplier is 50%. Each stage except the last one has one multiplier, so the number of multipliers is log2(N) − 1. The total number of delay elements is 4 + 2 + 2 + 1 + 1 = 10 for the 8-point FFT. The total number of delay elements for an N-point FFT can be derived in a similar way and is N/2 + N/2 + N/4 + ... + 2 = 3N/2 − 2.

4.3.2. Radix-2 Single-Path Delay Feedback

Herbert L. Groginsky and George A. Works introduced a feedback mechanism in order to minimize the number of delay elements [22]. In the proposed architecture, one half of the outputs from each stage are fed back to the input data buffer while the input data are directly sent to the butterfly.
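The delay-element count is easy to tabulate. The following sketch (Python, illustrative only) sums the series N/2 + N/2 + N/4 + ... + 2 and reproduces the totals quoted in the text:

```python
def r2mdc_delay_elements(N):
    """Delay elements in an N-point R2MDC pipeline: N/2 at the input
    plus N/2 + N/4 + ... + 2 in the commutators (each commutator's
    delays are split equally between its two paths)."""
    total = N // 2
    k = N // 2
    while k >= 2:
        total += k
        k //= 2
    return total
```

For N = 8 this gives 4 + (4 + 2) = 10 delay elements, the same total as the 4 + 2 + 2 + 1 + 1 count in the figure, and 3N/2 − 2 in general.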
The delay elements at the first stage save four input samples before the computation starts. Fig. 4.5 shows the principle of an 8-point R2SDF FFT. The butterfly is provided with a feedback loop; the modified butterfly is shown on the right side of Fig. 4.5.

Figure 4.5. An 8-point DIF R2SDF FFT.

When the mux is 0, the butterfly is idle and data pass by. When the mux is 1, the butterfly processes the incoming samples. During the execution, the delay elements store one output from the butterfly of the first stage while the other output is immediately transferred to the next stage. The stored results are sent to the next stage in the next interim half frame, while the delay elements are filled with fresh input samples. Thus, the results of the previous frame are sent on while a new frame is received.

Because of the feedback mechanism, the requirement on delay elements is reduced from 3N/2 − 2 to N − 1 (N/2 + N/4 + ... + 1), which is minimal. The number of multipliers is exactly the same as for the R2MDC FFT architecture, i.e., log2(N) − 1, and the utilization of the multiplier and the butterflies also remains the same, namely 50%. This architecture is called radix-2 single-path delay feedback (R2SDF).

4.3.3. Radix-4 Multipath Delay Commutator

This architecture is similar to R2MDC, but the length of the FFT has to be 4^n. Input data are separated by a 4-to-1 multiplexer and 3N/2 delay elements at the first stage, and a 4-path delay commutator is used between two stages. Computation takes place only when the last quarter of the input data is multiplexed to the butterfly.
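To make the timing of the feedback mechanism concrete, here is a small cycle-true sketch of an 8-point DIF R2SDF pipeline (Python, written for this text; the class and signal names are illustrative, not from the thesis). Each stage idles for L cycles while its FIFO fills, then performs butterflies for L cycles, feeding the differences back and passing the sums on; the stored differences are twiddled as they stream out.

```python
import cmath

class R2SDFStage:
    """One radix-2 single-path delay-feedback stage with an L-word FIFO."""
    def __init__(self, L):
        self.L = L
        self.fifo = [0j] * L
        self.t = 0                      # cycle counter acting as mux control

    def step(self, x):
        p = self.t % (2 * self.L)
        self.t += 1
        if p < self.L:                  # mux = 0: butterfly idle
            # stored difference streams out, multiplied by W_{2L}^p
            out = self.fifo.pop(0) * cmath.exp(-2j * cmath.pi * p / (2 * self.L))
            self.fifo.append(x)         # fresh sample fills the FIFO
            return out
        a = self.fifo.pop(0)            # mux = 1: butterfly active
        self.fifo.append(a - x)         # difference is fed back
        return a + x                    # sum goes to the next stage

def r2sdf_fft8(x):
    stages = [R2SDFStage(4), R2SDFStage(2), R2SDFStage(1)]
    stream = [complex(v) for v in x] + [0j] * 7   # 4+2+1 cycles of latency
    raw = []
    for s in stream:
        for st in stages:
            s = st.step(s)
        raw.append(s)
    order = [0, 4, 2, 6, 1, 5, 3, 7]              # outputs are bit-reversed
    X = [0j] * 8
    for i, k in enumerate(order):
        X[k] = raw[7 + i]
    return X
```

The three FIFOs hold 4 + 2 + 1 = N − 1 = 7 words in total, matching the minimal delay-element count stated above.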
A 64-point DIF radix-4 multipath delay commutator (R4MDC) FFT is shown in Fig. 4.6.

Figure 4.6. A 64-point DIF R4MDC FFT.

Each stage (except the last) has 3 multipliers, so the R4MDC FFT requires in total 3 · (log4(N) − 1) multipliers for an N-point FFT, which is more than the R2MDC or R2SDF architectures. A few more delay elements are also required: the memory requirement is 5N/2 − 4, which is the largest among the three architectures discussed so far. The utilization of the butterflies and the multipliers is only 25%. From the viewpoint of hardware and utilization, it is not a good structure.

4.3.4. Radix-4 Single-Path Delay Commutator

To increase the utilization of the butterflies, G. Bi and E. V. Jones [4] proposed a simplified radix-4 butterfly. In the simplified radix-4 butterfly, only one output is produced per operation, in comparison with 4 in the conventional butterfly. To provide the same four outputs, the butterfly works four times instead of just once; to accommodate this, the same four data must be provided to the butterfly at four different times, and so must the commutators. Due to this modification the butterfly has a utilization of 4 · 25%, or 100%. Furthermore, the simplified butterfly needs additional control signals. The number of multipliers is log4(N) − 1, which is less than for the R4MDC FFT architecture. The utilization of the multiplier is 75%, due to the fact that at least one-fourth of the data are multiplied with the trivial twiddle factor 1 (no multiplication is needed).
The structure of a 16-point DIF radix-4 single-path delay commutator (R4SDC) FFT is shown in Fig. 4.7. The main benefit of this architecture is the improved utilization of the butterflies; the cost is an increased number of delay elements.

Figure 4.7. A 16-point DIF R4SDC FFT.

4.3.5. Radix-4 Single-Path Delay Feedback

The radix-4 single-path delay feedback (R4SDF) architecture [15] [62] is a radix-4 version of R2SDF. A 64-point DIF R4SDF FFT is illustrated in Fig. 4.8.

Figure 4.8. A 64-point DIF R4SDF FFT.

Since we use the radix-4 algorithm, the number of multipliers is reduced to log4(N) − 1, compared to log2(N) − 1 for R2SDF. However, the utilization of the butterflies is reduced to 25%, and the radix-4 SDF butterflies are more complicated than the radix-2 SDF butterflies.
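The multiplier counts quoted for the different pipelines can be collected in one place. The sketch below (Python, illustrative; the function name is mine) evaluates the closed-form expressions from this section:

```python
import math

def multiplier_count(arch, N):
    """Complex multipliers per architecture, from the closed forms in the text."""
    log2N = int(math.log2(N))
    log4N = log2N // 2                      # N is assumed to be a power of 4
    counts = {
        'R2MDC': log2N - 1,
        'R2SDF': log2N - 1,
        'R4MDC': 3 * (log4N - 1),
        'R4SDC': log4N - 1,
        'R4SDF': log4N - 1,
    }
    return counts[arch]
```

For N = 1024 this gives 9 multipliers for R2MDC/R2SDF, 12 for R4MDC, and 4 for the radix-4 single-path architectures; the latter two figures reappear in the resource analysis of Chapter 5.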
The radix-4 SDF butterfly is shown in Fig. 4.9.

Figure 4.9. Radix-4 SDF butterfly.

4.3.6. Radix-2² Single-Path Delay Commutator

The radix-2² single-path delay commutator (R2²SDC) architecture [24] uses a modified radix-4 DIF FFT algorithm. This approach is based on a 4-point DFT: it has the same butterfly structure as the radix-2 DIF FFT, but places the multipliers at the same positions as in the radix-4 DIF FFT. The number of multipliers is thereby reduced compared to the conventional radix-2 algorithm. Basically, two kinds of radix-2 SDF butterfly elements are used to achieve the same outputs (though not in the same order) as a radix-4 butterfly. By reducing the radix from 4 to 2, the utilization of the butterflies increases from 25% to 50%. The outputs are bit-reversed instead of 4-reversed as in a conventional radix-4 algorithm.

A 64-point DIF R2²SDC FFT is shown in Fig. 4.10. The data are sent to the butterfly for processing when the mux is 1; otherwise the data are shifted into a delay line, which has a length of N/2 at the first stage.

Figure 4.10. A 64-point DIF R2²SDC FFT.
4.4. Summary

In this chapter, several FFT implementation classes were discussed. The programmable DSP and FFT-specific processors cannot meet the requirements of both high throughput and low power. Algorithm-specific implementations, especially those with pipelined FFT architectures, are better in this respect.
5 IMPLEMENTATION OF FFT PROCESSORS

In this chapter, we discuss the implementation of FFT processors.

5.1. Design Method

As the transistor feature size is scaled down, more and more functionality can be integrated on a single chip. High speed, high complexity, and short design time are typical requirements on VLSI designs. This requires that the design methodology cope with the increasing complexity in a systematic way. In VLSI design, the design method is an important guide for the implementation. A design methodology is the overall strategy to organize and solve the design tasks at the different steps of the design process [24].

In the top-down design methodology, the system requirements and organization are developed by successive decomposition. Typically, a high-level design language is used to define the system functionality. After a number of decomposition steps, the system is described in an HDL, which can be used for automatic logic synthesis.

The bottom-up methodology, which builds the system by assembling existing building blocks, can hardly keep up with the high performance and communication requirements of current systems. Hence the bottom-up methodology alone is not suitable for the design of complex systems. We follow the meet-in-the-middle design method.
In the meet-in-the-middle methodology, the specification-synthesis process is carried out in an essentially top-down fashion, but the actual design of the building blocks is performed bottom-up. The design process is therefore divided into two almost independent parts that meet in the middle. This is illustrated in Fig. 5.1. Often, some of the building blocks are already available in a circuit library. The circuit design phase can also be shortened by using efficient circuit design tools or even automatic logic synthesis tools. A drawback with this design approach is that the result relies heavily on the synthesis tools. If the final result fails to meet the performance requirement, the whole design has to be redone.

Figure 5.1. The meet-in-the-middle methodology.

The design process starts with the creation of a functional specification of the FFT processor. In our target application, the requirements for the FFT processor are given. This results in a high-level model. The high-level model is then validated with a testbench for the FFT algorithm; the testbench can be reused for successive models. After the system functionality has been validated by simulation, the functional specification is mapped onto an architectural specification. In the architectural specification, the different functionalities are partitioned and mapped into hardware or software components, and the detailed communication between the different components is decided; the detailed computation process is then mapped onto the hardware. After the architecture model is created, the model needs to be simulated for performance and validation. Basically, the software and hardware designs are separated after this architectural partitioning.
Since the whole FFT processor is implemented in hardware, the partitioning into software and hardware is not necessary, and software/hardware co-design is not needed.

5.2. High-Level Modeling of an FFT Processor

High-level modeling serves two purposes: to create a cycle-true model of the algorithm and hardware architecture, and to simulate, validate, and optimize the high-level model. The system specification for the FFT processor has been defined as

• Transform length is 1024
• Transform time is less than 40 µs (continuously)
• Continuous I/O
• 25.6 Msamples/s throughput
• Complex 24-bit I/O data

According to the meet-in-the-middle design methodology, the high-level design is a top-down process. As mentioned previously, we do not need to determine the system specification, since it is given. Once an architecture is selected, the individual hardware blocks are refined by adding the implementation details and constraints. In this phase, we apply the bottom-up design methodology: the different sub-blocks are built from cells and combinations of blocks. We start with the resource analysis.

5.2.1. Resource Analysis

The high-level design can be divided into several tasks:

• Architecture selection
• Partitioning
• Scheduling
• RTL model generation
• Validation of models
The first three tasks are associated with each other, and the aim is to allocate resources that meet the system specification. There are many possible architectures for FFT processors. Among them, the pipelined FFT architectures are particularly suitable for real-time applications, since they easily accommodate the sequential nature of sampling. In an ASIC implementation the resources are constrained, and hence a resource analysis is required.

A pipelined FFT architecture can be divided into a datapath and a control part. The datapath for the FFT processor consists of memories, butterflies, and complex multipliers. Since the control part is much simpler than the datapath with respect to both hardware and power consumption, the resource analysis concentrates on the datapath. We discuss the parts separately.

Butterflies

From the specification, the computation time for the 1024-point FFT processor is

    t_FFT = 4 × 10⁻⁵ s                                               (5.2)

With the radix-2 algorithm, the number of butterfly operations is (N/r) log_r(N) = (N/2) log2(N) = 5120. A butterfly can be implemented with parallel adders/subtractors using one clock cycle. Hence the minimum number of butterflies is

    N_BF = (N_BFop × t_BFop) / t_FFT = 5120 / (4 × 10⁻⁵ × 25.6 × 10⁶) = 5    (5.3)

This is optimal under the assumption that ALL data are available to ALL stages, which is impossible for continuous data streams. The allocation of butterfly operations from two stages to the same butterfly is not possible with as-soon-as-possible (ASAP) scheduling, and each butterfly has to be idle 50% of the time in order to reorder the incoming data. Therefore the number of butterflies is 10, i.e., equal to the number of stages. By a similar argument, the number of butterflies for a radix-4 pipeline architecture is also equal to the number of stages.

Complex Multipliers

The number of complex multiplications is

    N_cmult ≈ (N/r)(r − 1)(log_r(N) − 1)                             (5.4)

where N is the transform length and r is the radix. It does not include the complex multiplications within the r-point DFTs. For a 1024-point FFT processor with the radix-2 algorithm, the number of complex multiplications is about 4068. The minimum number of complex multipliers, with the assumption of fast complex multipliers (one complex multiplication per clock cycle), is

    N_cmult × t_cmult / t_FFT = 4068 / (4 × 10⁻⁵ × 25.6 × 10⁶) ≈ 4           (5.5)

The complex multiplication can be computed in either one clock cycle or two clock cycles (pipelining). Since resource sharing between two stages is not possible in a pipeline architecture, each stage except the last has its own set of complex multipliers: for the radix-2 algorithm the number of complex multipliers is 9, and for the radix-4 algorithm it is 4.

Memories

The memory requirement increases linearly with the transform length. The size of the memories is determined by the maximum amount of live data, which in turn is determined by the architecture. In general, the architectures with feedback are efficient in terms of the utilization of memories. The memory dissipates more power than the complex multipliers.
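The arithmetic in Eqs. (5.3) and (5.5) is easily mechanized. The sketch below (Python, illustrative; the names are mine, not the thesis') recomputes the minimum unit counts from the specification:

```python
import math

def min_units(n_ops, f_clk_hz, t_fft_s):
    """Minimum number of hardware units needed to finish n_ops operations
    (one operation per clock cycle each) within the transform time."""
    cycles_per_unit = f_clk_hz * t_fft_s     # cycles available per unit
    return math.ceil(n_ops / cycles_per_unit)

N, f_clk, t_fft = 1024, 25.6e6, 40e-6
n_bfly_ops = (N // 2) * int(math.log2(N))    # 5120 radix-2 butterfly operations
n_cmults = 4068                              # complex multiplications (from the text)
```

With a 25.6 MHz clock, 1024 cycles are available per transform, giving the 5 butterflies and 4 multipliers of Eqs. (5.3) and (5.5); the pipeline constraints then raise these to one butterfly per stage and one multiplier per stage except the last.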
Table 5.1 summarizes the memory requirement and utilization for the pipelined architectures.

Architecture   Memory requirement [words]   Memory utilization
R2MDC          3N/2 − 2                     66%
R2SDF          N − 1                        100%
R4MDC          5N/2 − 4                     40%
R4SDF          N − 1                        100%
R4SDC          2N − 2                       50%
R2²SDF         N − 1                        100%

Table 5.1. Memory requirement and utilization for pipelined architectures.

5.2.2. Validation of the High-Level Model

After the resource analysis, the next step is to model the FFT algorithm at high level. For a fast evaluation, the algorithm is described in a high-level programming language, like C or Matlab. Moreover, it is then easy to convert from floating-point arithmetic to fixed-point arithmetic. The validation of the high-level model is done through simulation and comparison. The interface between model and testbench is plain text files: the input data are stored in a text file and read in by the model, and the output data from the model are saved in a text file as well. This gives freedom in the construction of model and testbench: the model can be written in C, Matlab, or VHDL, and the testbench can likewise be written in C or Matlab. The same testbench can be reused by exchanging the input/output files.

Figure 5.2. Testbench.
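The file-based testbench arrangement can be sketched as follows (Python, written for this text; here a direct DFT stands in for the device under test, and all file names are illustrative). The model reads a stimulus file, writes a response file, and the testbench compares the response against a golden reference:

```python
import cmath
import os
import tempfile

def write_vectors(path, xs):
    with open(path, 'w') as f:
        for x in xs:
            f.write(f"{x.real} {x.imag}\n")

def read_vectors(path):
    with open(path) as f:
        return [complex(float(a), float(b))
                for a, b in (line.split() for line in f)]

def model(in_path, out_path):
    """Device under test: a direct DFT standing in for the FFT model."""
    xs = read_vectors(in_path)
    N = len(xs)
    X = [sum(x * cmath.exp(-2j * cmath.pi * k * n / N)
             for n, x in enumerate(xs)) for k in range(N)]
    write_vectors(out_path, X)

def testbench(tol=1e-9):
    d = tempfile.mkdtemp()
    fin, fout = os.path.join(d, 'in.txt'), os.path.join(d, 'out.txt')
    xs = [complex(n + 1, 0) for n in range(8)]
    write_vectors(fin, xs)
    model(fin, fout)                       # run the device under test
    got = read_vectors(fout)
    ref = [sum(x * cmath.exp(-2j * cmath.pi * k * n / 8)
               for n, x in enumerate(xs)) for k in range(8)]
    return max(abs(a - b) for a, b in zip(got, ref)) < tol
```

Because only the text files couple the two sides, the same testbench can exercise a C, Matlab, or VHDL model without modification.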
5.2.3. Wordlength Optimization

In a pipelined FFT architecture, the wordlengths can differ from stage to stage. This possibility is, however, often ignored in order to achieve modular solutions: most research effort has gone into regular, modular implementations that use a fixed wordlength for both data and coefficients in every stage. Based on the observation that the wordlength for the different stages in the pipelined FFT processor can vary, we proposed a wordlength optimization method [34].

Because our focus is on reducing the power consumption in the data memory, the strategy is that the larger the RAM block in a stage, the shorter its wordlength should be. We first tune the wordlength of the data memory (data RAM) at each stage separately to make sure that the precision requirement is met, and then we adjust the wordlength of the coefficient ROM at each stage. To obtain the optimal wordlength profile, numerous design iterations have been performed. The conventional uniform wordlength scheme, with the same wordlength for both data memory and coefficient ROM in all stages, was also simulated for comparison.

Figure 5.3. Wordlength optimization for pipelined FFTs.
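As a rough illustration of the inner loop of such an optimization (not the method of [34] itself), the sketch below quantizes the data to a given number of fractional bits after every FFT stage and searches for the smallest wordlength whose worst-case output error stays below a target. All names and the error threshold are illustrative.

```python
import cmath

def quantize(v, frac_bits):
    s = float(1 << frac_bits)
    return complex(round(v.real * s) / s, round(v.imag * s) / s)

def fft_fixed(x, frac_bits):
    """Radix-2 DIF FFT with the data re-quantized after each stage."""
    x = [complex(v) for v in x]
    N = len(x)
    span = N
    while span > 1:
        half = span // 2
        for start in range(0, N, span):
            for i in range(half):
                a, b = x[start + i], x[start + i + half]
                w = cmath.exp(-2j * cmath.pi * i / span)
                x[start + i] = a + b
                x[start + i + half] = (a - b) * w
        x = [quantize(v, frac_bits) for v in x]
        span = half
    return x        # bit-reversed order; irrelevant for the error norm

def max_error(x, frac_bits):
    exact = fft_fixed(x, 52)             # 52 fractional bits ~ double precision
    approx = fft_fixed(x, frac_bits)
    return max(abs(a - b) for a, b in zip(exact, approx))

def min_frac_bits(x, target):
    for w in range(4, 32):
        if max_error(x, w) < target:
            return w
```

A per-stage version of the same loop, with a different wordlength per RAM block and both sine-wave and random stimuli, gives the kind of iteration depicted in Fig. 5.3.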
Two types of test vectors are used in our simulations: sine waves and random numbers. The sine-wave stimuli are sensitive to the precision of the coefficient representation, while the random samples are effective stimuli for checking the precision of the butterfly calculations. To make the results highly reliable, 100,000 sets of random samples are generated and fed into the simulator. The optimization results for 1024-point pipelined FFT architectures are shown in the table below.

Architecture   Memory size, fixed wordlength   Memory size, optimized wordlength   Saving
R2MDC          42952 bits                      42824 bits                          0%
R2SDF          28644 bits                      28580 bits                          0%
R4MDC          71552 bits                      61488 bits                          14%
R4SDF          28644 bits                      24708 bits                          14%
R4SDC          57288 bits                      49176 bits                          14%
R2²SDF         28672 bits                      28580 bits                          0%

Table 5.2. Wordlength optimization.

5.3. Subsystems

Once the RTL model of the FFT processor has been created and validated, the subsystems can be constructed according to the meet-in-the-middle design methodology. For the subsystem design there are two methods: the semi-custom method and the full-custom method.

The semi-custom design method has a shorter design time: the RTL description in an HDL is synthesized with a synthesis tool, and the synthesis result is fed to a place-and-route tool for the final layout. However, this design methodology relies on the synthesis and place-and-route tools, and the designer has less control over the design process. Most synthesis tools use static timing analysis and do not consider the interconnections during synthesis, so the designers have to increase the timing margins during synthesis to meet the speed requirement after place and route. The resulting designs are often unnecessarily large. Moreover, the impact of power supply voltage scaling is hard to predict, since the characterization of the cells is done at normal supply voltage.

We therefore select full-custom design for the FFT processor, but use the semi-custom design method for the control path, where the timing is not critical. In the following, we introduce the subsystem designs for the FFT processor. The main subsystems are memories, butterflies, and complex multipliers, as illustrated for one pipeline stage in Fig. 5.4.

Figure 5.4. Datapath for a stage in a pipelined FFT processor (data reordering memories, butterfly, and complex multiplier).

5.3.1. Memory

In many DSP processors, the memory contributes a significant portion of the area and power consumption of the whole processor. In the 1024-point FFT processor, the memory becomes the most significant part in both area and power consumption. Hence, low-power design of the memories is a key issue for the FFT processor.

RAM

The data are stored in RAMs. In RAM design, there are mainly two types of RAM: static RAM (SRAM) and dynamic RAM (DRAM). Since DRAM often requires a special process technology that is not available in a standard CMOS technology, and SRAM is more suitable for low-voltage operation, we select SRAM for the data storage.
An overview of the SRAM is shown in Fig. 5.5. The SRAM consists of four parts [30]:

• memory cell array
• decoders
• sense amplifiers
• periphery circuits

We discuss the implementation of the first three parts, which are the main parts of the SRAM.

Figure 5.5. Overview of a SRAM (cell array, decoder, sense amplifiers, control circuits, and data I/O).

The memory cell is the basic building block of the SRAM, and the size of the memory cell is therefore of importance. Even though a 4-transistor (4T) memory cell has a smaller area than a 6-transistor (6T) cell, its leakage current at low supply voltage is considerably larger. We therefore select a 6T memory cell; a typical 6T cell is shown in Fig. 5.6.

Memory array

The memory array dominates the SRAM area. The key issues in the design of the memory array are the cell area and the noise immunity of the bitlines.
Figure 5.6. SRAM cell (schematic and layout).

The stability of the memory cell during the read and write operations determines the device sizing [52]. The width of the access NMOS transistors, Wa, is set to minimum size for minimal cell area. The width ratio α = Wp/Wa affects the write operation: a larger α means that data are more difficult to write into the cell. The width ratio β = Wn/Wa is determined by the read operation: a larger β means less risk that the SRAM cell changes state during a read operation. Normally, α is 1–2 and β is 2–3. The cell designer also has to consider process variations, short-channel effects, soft error rate, and low supply voltage [23]. The read stability can be measured by the static noise margin (SNM).
The SNM can be simulated with SPICE. The simulated SNM for a SRAM cell with β = 2 in a standard 0.35 µm CMOS technology is shown in Fig. 5.7 as a function of the power supply voltage.

Figure 5.7. SNM of a SRAM cell vs. power supply voltage.

In order to reduce the power consumption and speed up the read access, most SRAMs read data through a pair of bitlines with a small voltage swing, usually about 100 mV to 300 mV, which is sensitive to noise. As the power supply voltage decreases, the effect of noise becomes more important. To avoid coupling noise from nearby bitline pairs, we use a twisted-bitline layout; the coupling from nearby bitline pairs then does not affect the voltage difference within a bitline pair. To reduce noise from outside, the memory array is surrounded by a guard ring, which reduces the substrate-coupling noise.

Figure 5.8. Noise reduction for the memory array (twisted bitlines and guard ring).
Decoder

The decoder can be realized using a hierarchical structure, which reduces both the delay and the activity factor. The row decoder can use either a NOR-NAND decoder or a tree decoder. The NOR-NAND decoder has a regular layout but requires more transistors. The tree decoder requires fewer transistors, but suffers from speed degradation due to the serial connection of pass transistors, which can increase the delay and becomes worse at lower power supply voltages. In small decoders the tree decoder is preferred, while the NOR-NAND decoder is preferred for larger decoders. For fast access, a wordline enable signal is added to the decoder. It controls the width of the wordline pulse and reduces the glitches of the wordline drivers, which also reduces the power consumption of the decoder.

Sense amplifier

The sense amplifier is used to amplify the bitline signals, which have a small swing, during the read operation. To be fast, the sense amplifier is designed with high gain. The high-gain requirement in turn requires a high current, and hence a high power consumption, in the sense amplifier. At low supply voltage, the current-mode sense amplifier is less suitable. One way to reduce the power consumption is to reduce the active time of the sense amplifier, which can be achieved by using a pulsed sense-enable signal. We have therefore modified an STC D-flipflop [64] to form a two-stage latch-type sense amplifier, shown in Fig. 5.9. The sense amplifier is functional at supply voltages as low as 0.9 V.

Figure 5.9. STC D-flipflop.
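The idea of a hierarchical (predecoded) row decoder can be illustrated at the logic level. In the sketch below (Python, purely behavioral and invented for this text), the row address is split into two fields, each field is predecoded to a one-hot bus, and every wordline is the AND of one line from each bus, so only a small part of the decoder switches for a given access:

```python
def hierarchical_decoder(addr, addr_bits):
    """Behavioral model of a two-level row decoder with predecoding."""
    lo_bits = addr_bits // 2
    hi_bits = addr_bits - lo_bits
    lo = addr & ((1 << lo_bits) - 1)
    hi = addr >> lo_bits
    pre_lo = [int(i == lo) for i in range(1 << lo_bits)]   # one-hot field buses
    pre_hi = [int(i == hi) for i in range(1 << hi_bits)]
    # each wordline ANDs one predecoded line from each bus
    return [pre_hi[row >> lo_bits] & pre_lo[row & ((1 << lo_bits) - 1)]
            for row in range(1 << addr_bits)]
```

Exactly one wordline is asserted per access; in hardware, the final AND per row is where the wordline enable signal mentioned above gates the pulse width.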
The simulated waveforms for the read operation are shown in Fig. 5.10. The access time is 11 ns in a standard 0.35 µm CMOS technology with a typical process at 85˚C. The power consumption for the sense amplifier is 59.5 µW per bit at 50 MHz.

Figure 5.10. Read operation.

The simulated waveforms for the write operation are shown in Fig. 5.11. The total power consumption is 83.4 µW per bit at 50 MHz.

Figure 5.11. Write operation.
The pulse widths for the wordline signal and the sense-enable signal must be selected carefully. A short pulse width dissipates less power, but the pulse needs to be sufficiently long to guarantee the read operation under process variations and low power supply voltage.

Implementation

A 256-word × 26-bit SRAM with separate I/O (Fig. 5.12) has been implemented using the techniques discussed above. The SRAM, which runs at 1.5 V and 50 MHz, consumes 2.6 mW. A module generator for the SRAM is under development.

Figure 5.12. SRAM macro (1.27 × 0.33 mm²).

For the periphery circuits, the I/O drivers have a large capacitive load. Reducing the short-circuit current is an important issue for the I/O drivers; avoiding simultaneous switching of the PMOS and NMOS transistors is an efficient technique for reducing the short-circuit current.

5.3.2. Butterfly

The butterfly is one of the characteristic building blocks in an FFT processor. The butterfly consists mainly of adders/subtractors. Hence we discuss the implementation of the adder/subtractor first, and then the complete butterfly.

Adder design

The adder is one of the fundamental arithmetic components. There are many adder structures [47].
The ripple-carry adder (RCA) is constructed from full adders. The RCA is the slowest among the different adder structures, but it is simple and consumes a small amount of power for 16-bit adder implementations. If the wordlength is small, it is suitable to select the RCA for the butterfly. A CMOS full-adder layout is shown in Fig. 5.13.

Figure 5.13. CMOS full adder (size 18.7 × 15.0 µm²).

RCA implementation

We have developed a program that generates the schematic and the layout for RCAs. A 3-bit RCA layout with sign extension is shown in Fig. 5.14.

When the speed requirement is high, the RCA cannot meet the timing, for instance for the vector merge adder in the multiplier. In these cases, other carry-accelerating adder structures are attractive. We select the Brent-Kung adder for the high-speed adder implementations, since it has a short delay and a regular structure. It will be discussed later in connection with the complex multiplier design.
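The two adder styles can be contrasted at the bit level. The sketch below (Python, illustrative) models a ripple-carry adder directly and computes the same carries with the Brent-Kung parallel-prefix network (an up-sweep followed by a down-sweep over generate/propagate pairs); both must agree with ordinary integer addition.

```python
def ripple_carry_add(a, b, width):
    """Bitwise ripple-carry addition: the carry crawls through every stage."""
    s, c = 0, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s |= (ai ^ bi ^ c) << i
        c = (ai & bi) | (c & (ai ^ bi))
    return s

def brent_kung_add(a, b, width):
    """Carries via the Brent-Kung prefix tree (width must be a power of two)."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    G, P = g[:], p[:]
    step = 1
    while step < width:                      # up-sweep
        for i in range(2 * step - 1, width, 2 * step):
            G[i] |= P[i] & G[i - step]
            P[i] &= P[i - step]
        step *= 2
    step = width // 4
    while step >= 1:                         # down-sweep
        for i in range(3 * step - 1, width, 2 * step):
            G[i] |= P[i] & G[i - step]
            P[i] &= P[i - step]
        step //= 2
    s = p[0]
    for i in range(1, width):
        s |= (p[i] ^ G[i - 1]) << i          # carry into bit i is G[i-1]
    return s
```

The ripple adder's critical path grows linearly with the wordlength, while the prefix network settles in a logarithmic number of levels, which is why the Brent-Kung structure is preferred for the fast adders.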
Figure 5.14. Layout of a 3-bit ripple-carry adder (three full adders and a sign-extension cell).

5.3.2. High radix butterfly architecture

The use of a higher radix tends to reduce the memory access rate and the arithmetic workload, and, hence, the power consumption [39], [60]. Efficient design of high-radix butterflies is therefore important. In practice, the commonly used high-radix butterflies are the radix-4 and radix-8 butterflies. Butterflies with a radix higher than 8 are often decomposed into lower-radix butterflies.

A conventional butterfly is often based on an isomorphic mapping of the signal-flow graph. The signal-flow graph for a radix-4 butterfly, i.e., a 4-point DFT, is shown in Fig. 5.15. The butterfly requires 8 complex adders/subtractors and has a delay of two additions/subtractions.

Figure 5.15. Signal-flow graph for the 4-point DFT.

To reduce the complexity, we proposed a carry-save based butterfly [36]. The computation for a radix-4 butterfly is divided into two steps. The first step is a 4-2 compression with addition/subtraction-controlled inputs. The second step is a normal addition. The delay is thereby changed from two additions/subtractions to one addition and one 4-2 compression.
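The radix-4 butterfly of Fig. 5.15 is the 4-point DFT; since W4 = e^(-j2π/4) = -j, all its twiddle factors are ±1 or ±j, so no general multipliers are needed. A small illustrative sketch (not synthesizable code):

```python
# Radix-4 butterfly = 4-point DFT: X(k) = sum_n x(n) * (-j)^(k*n).
# Every factor is +1, -1, +j or -j, i.e. a swap of real and imaginary parts
# with sign changes, so the butterfly consists only of adders/subtractors.

def radix4_butterfly(x):
    x0, x1, x2, x3 = x
    a, b = x0 + x2, x0 - x2      # first rank of additions/subtractions
    c, d = x1 + x3, x1 - x3
    # second rank; multiplying d by -j implements the W4 rotation
    return [a + c, b - 1j * d, a - c, b + 1j * d]
```

The two ranks of additions visible here are exactly the two addition/subtraction delays that the carry-save formulation replaces with one 4-2 compression and one addition.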
The implementation of a radix-4 butterfly with carry-save adders is shown in Fig. 5.16. In the figure, only real additions are shown, so it appears more complicated than Fig. 5.15, where the additions are complex additions. This implementation reduces the hardware, since a fast adder is more complex than a 4-2 compressor. The total delay is also reduced, since the delay of a 4-2 compressor is smaller. The radix-2/4 split-radix butterfly and the radix-8 butterfly can also be implemented using carry-save adders.

Figure 5.16. Carry-save radix-4 butterfly implementation: (4,2) counters followed by fast adders, with inverters implementing the subtractions.

A carry-save radix-4 butterfly (wordlength 15 for the real and imaginary parts of the input) was described in VHDL and synthesized using an AMS 0.8 µm standard CMOS technology [36]. The synthesis results show that the area saving can be up to 21% for the carry-save radix-4 butterfly, and that the delay can be reduced by 22%.

Table 5.2. Performance comparison for the two radix-4 butterflies.

Architecture    Area        Delay @ 3.3 V, 25 °C
Conventional    10504.16    12.32 ns
Carry-save       8266.48     9.59 ns
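The 4-2 compression step can be modeled on whole words: two layers of 3:2 counters (carry-save adders) reduce four operands to a redundant sum/carry pair, and a single carry-propagate addition finishes the result. A hedged word-level sketch:

```python
# Word-level model of carry-save 4:2 compression: no carries propagate inside
# the compressor, so its delay is independent of the wordlength; only the
# final fast adder performs carry propagation. Illustrative only.

def counter_3to2(a, b, c):
    """3:2 counter (carry-save adder) applied bitwise to whole words."""
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1

def compress_4to2(a, b, c, d):
    """4:2 compression built from two chained 3:2 counters."""
    s1, c1 = counter_3to2(a, b, c)
    return counter_3to2(s1, c1, d)
```

The pair returned by `compress_4to2` always sums to the sum of the four operands, which is what allows the butterfly to defer carry propagation to one final fast adder.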
5.3.3. Complex Multiplier

There is no question that the complex multipliers are among the critical units in FFT processors. From the power consumption point of view, the complex multipliers stood for about 70% to 80% of the total power consumption in previous FFT implementations [39], [60]. This share has been reduced to less than 50% of the total power consumption, because the power consumption of the memories increases as the transform length of the FFT increases [37]. From the speed point of view, the complex multiplier is the slowest part in the data path. With pipelining, the throughput can be increased while the latency remains the same. Hence the complex multipliers are key components in FFT design.

A straightforward implementation (see Fig. 5.17 (a)) of a complex multiplication requires four real multiplications, one addition, and one subtraction. The number of multiplications can be reduced to three by using a transformation, at the cost of extra pre- and post-additions (see Fig. 5.17 (b)). A more efficient way to reduce the cost of the multiplications is to utilize distributed arithmetic [35], [58].

Figure 5.17. Realization of a complex multiplication: (a) four real multiplications with the coefficients CR and CI; (b) three real multiplications with the precomputed coefficients CR, CI+CR, and CI−CR.

5.3.3.1. Distributed Arithmetic

Distributed arithmetic (DA) uses precomputed partial sums for an efficient computation of inner products between a constant vector and a variable vector [14].
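The three-multiplication realization of Fig. 5.17 (b) can be sketched with the standard shared-product factorization, assuming the precomputed coefficients (CR + CI) and (CI − CR):

```python
# Complex multiplication with three real multiplications (Fig. 5.17 (b)):
#   t  = CR*(XR + XI)
#   ZR = t - (CR + CI)*XI   = CR*XR - CI*XI
#   ZI = t + (CI - CR)*XR   = CR*XI + CI*XR
# For a fixed coefficient, (CR + CI) and (CI - CR) are computed once, so one
# complex product costs 3 real multiplications plus the pre-/post-additions.

def complex_mult_3m(cr, ci, xr, xi):
    t = cr * (xr + xi)
    return t - (cr + ci) * xi, t + (ci - cr) * xr
```

This trades one real multiplier for two extra adders, which is worthwhile because a real multiplier is far larger than an adder.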
Let $C_R + jC_I$ and $X_R + jX_I$ be two complex numbers, of which $C_R + jC_I$ is the coefficient and $X_R + jX_I$ is a variable complex number. We have

$$Z_R + jZ_I = (C_R + jC_I)(X_R + jX_I) = (C_R X_R - C_I X_I) + j(C_R X_I + C_I X_R). \tag{5.1}$$

Hence, a complex multiplication can be considered as two inner products of two vectors of length two. We realize the real and imaginary parts separately. For the sake of simplicity, we consider only the first inner product in Eq. (5.1), i.e., the real part. The coefficient $C_R + jC_I$ is assumed to be fixed, two's-complement representation is used for both the coefficient and the data, and the data is scaled so that $|Z_R + jZ_I|$ is less than 1. With $x_{Rk}$ and $x_{Ik}$ denoting the $k$th bits of the real and imaginary parts, i.e., the $k$th bits of $X_R$ and $X_I$, the inner product $Z_R$ can be rewritten

$$Z_R = C_R\Big({-x_{R0}} + \sum_{k=1}^{W_d-1} x_{Rk}\,2^{-k}\Big) - C_I\Big({-x_{I0}} + \sum_{k=1}^{W_d-1} x_{Ik}\,2^{-k}\Big).$$

By interchanging the order of the two summations we get

$$Z_R = -(C_R x_{R0} - C_I x_{I0}) + \sum_{k=1}^{W_d-1} (C_R x_{Rk} - C_I x_{Ik})\,2^{-k},$$

which can be written as

$$Z_R = -F_k(x_{R0}, x_{I0}) + \sum_{k=1}^{W_d-1} F_k(x_{Rk}, x_{Ik})\,2^{-k} \tag{5.2}$$

where

$$F_k(x_{Rk}, x_{Ik}) = C_R x_{Rk} - C_I x_{Ik}. \tag{5.3}$$

Since $F_k$ is a function of two binary variables, it can take on only four values and can therefore be computed in advance and stored in a look-up table. In the same way, the corresponding binary function for the imaginary part is

$$G_k(x_{Rk}, x_{Ik}) = C_R x_{Ik} + C_I x_{Rk}.$$
5.3.3.2. Offset Binary Coding

A further reduction of the look-up table can be obtained with offset binary coding [14]. The offset binary coding can be applied to distributed arithmetic by using the following expression for the data, where $\bar{b}$ denotes the inverse of bit $b$:

$$x = -(x_0 - \bar{x}_0)2^{-1} + \sum_{i=1}^{W_d-1} (x_i - \bar{x}_i)\,2^{-i-1} - 2^{-W_d}. \tag{5.4}$$

Without any loss of generality, we assume that the magnitudes of $C_R$, $C_I$, $X_R$, and $X_I$ are all less than 1 and that the wordlength of $X_R$ and $X_I$ is $W_d$. Then the complex multiplication can be written

$$Z_R + jZ_I = (C_R X_R - C_I X_I) + j(C_R X_I + C_I X_R)$$
$$= -C_R(x_{R0} - \bar{x}_{R0})2^{-1} + \sum_{i=1}^{W_d-1} C_R(x_{Ri} - \bar{x}_{Ri})2^{-i-1} - C_R 2^{-W_d}$$
$$\quad + C_I(x_{I0} - \bar{x}_{I0})2^{-1} - \sum_{i=1}^{W_d-1} C_I(x_{Ii} - \bar{x}_{Ii})2^{-i-1} + C_I 2^{-W_d}$$
$$\quad + j\Big({-C_R}(x_{I0} - \bar{x}_{I0})2^{-1} + \sum_{i=1}^{W_d-1} C_R(x_{Ii} - \bar{x}_{Ii})2^{-i-1} - C_R 2^{-W_d}\Big)$$
$$\quad + j\Big({-C_I}(x_{R0} - \bar{x}_{R0})2^{-1} + \sum_{i=1}^{W_d-1} C_I(x_{Ri} - \bar{x}_{Ri})2^{-i-1} - C_I 2^{-W_d}\Big)$$
$$= -F(x_{R0}, x_{I0})2^{-1} + \sum_{i=1}^{W_d-1} F(x_{Ri}, x_{Ii})2^{-i-1} + F(0,0)2^{-W_d}$$
$$\quad + j\Big({-G(x_{R0}, x_{I0})2^{-1}} + \sum_{i=1}^{W_d-1} G(x_{Ri}, x_{Ii})2^{-i-1} + G(0,0)2^{-W_d}\Big) \tag{5.5}$$

where the functions $F$ and $G$ can be expressed as

$$F(x_{Ri}, x_{Ii}) = C_R(x_{Ri} - \bar{x}_{Ri}) - C_I(x_{Ii} - \bar{x}_{Ii}) \tag{5.6}$$
$$G(x_{Ri}, x_{Ii}) = C_I(x_{Ri} - \bar{x}_{Ri}) + C_R(x_{Ii} - \bar{x}_{Ii}). \tag{5.7}$$
In Eqs. (5.6) and (5.7), each factor $(x_i - \bar{x}_i)$ is either $+1$ or $-1$. Hence the partial product, i.e., the function $F_k$ ($G_k$), for each bit is of the form $\pm(C_R \pm C_I)$. All possible partial products are tabulated in Table 5.4.

Table 5.4. Partial product generation.

xRi  xIi  F(xRi, xIi)   G(xRi, xIi)
0    0    −(CR − CI)    −(CR + CI)
0    1    −(CR + CI)      CR − CI
1    0      CR + CI     −(CR − CI)
1    1      CR − CI       CR + CI

Obviously, it is sufficient to store only the two coefficients −(CR − CI) and −(CR + CI), since (CR − CI) and (CR + CI) can easily be generated from the two former coefficients by inverting all bits and adding a 1 in the least-significant position. The partial product generation is therefore only slightly more complicated than for a real multiplier. The accumulators, which add the partial products, are the same as in a real multiplication. Hence, the complexity of the complex multiplier in terms of chip area corresponds to approximately two real multipliers. The complex multiplier with distributed arithmetic is illustrated in Fig. 5.18.

Figure 5.18. Block schematic for the complex multiplier: partial product generation for F and G from XR, XI and the stored coefficients (CI+CR) and (CR−CI), followed by one accumulator for ZR and one for ZI.
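The bit-serial behavior of Eqs. (5.4) to (5.7) and Table 5.4 can be checked numerically. Below is a hedged fixed-point model (two's-complement fractions with Wd model bits); it mimics the arithmetic, not the hardware:

```python
# Distributed-arithmetic complex multiplication with offset binary coding:
# at each bit position the partial products F and G take one of the four
# values +-(CR - CI), +-(CR + CI) of Table 5.4, selected by the data bits,
# and are accumulated with the weights of Eq. (5.5).

def da_complex_mult(cr, ci, xr, xi, wd):
    """cr, ci, xr, xi are two's-complement fractions in [-1, 1)."""
    def bits(v):
        n = round(v * (1 << (wd - 1))) & ((1 << wd) - 1)
        return [(n >> (wd - 1 - k)) & 1 for k in range(wd)]
    xrb, xib = bits(xr), bits(xi)
    # (2b - 1) is the factor (x - xbar) = +-1 used in Table 5.4
    F = lambda br, bi: (2 * br - 1) * cr - (2 * bi - 1) * ci
    G = lambda br, bi: (2 * br - 1) * ci + (2 * bi - 1) * cr
    zr = -F(xrb[0], xib[0]) / 2 + F(0, 0) * 2.0 ** -wd
    zi = -G(xrb[0], xib[0]) / 2 + G(0, 0) * 2.0 ** -wd
    for i in range(1, wd):
        zr += F(xrb[i], xib[i]) * 2.0 ** (-i - 1)
        zi += G(xrb[i], xib[i]) * 2.0 ** (-i - 1)
    return zr, zi
```

For data that is exactly representable in Wd bits the model reproduces the direct complex product, which is a useful sanity check on the sign conventions of Table 5.4.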
5.3.3.3. Implementation Considerations

Multipliers can be divided into three types: bit-parallel, bit-serial, and digit-serial. Although a bit-serial or digit-serial multiplier has a smaller chip area than a bit-parallel multiplier, it requires a higher-speed clock than the bit-parallel one for the same throughput, which increases the activity factor for the local clock. A bit-serial or digit-serial multiplier therefore often needs several parallel units to meet the speed requirement. We select a bit-parallel multiplier, which is also suitable for our low power strategy: designing a faster circuit and using voltage scaling to reduce the power consumption.

The selection of the precomputed values from Table 5.4, i.e., the partial product generation in the real (imaginary) data path, can be realized with a 2:1 multiplexer and an XOR gate, as shown in Fig. 5.19 (a). An alternative is to use a 4:1 multiplexer circuit, Fig. 5.19 (b). The benefit of the latter implementation is that the generation of the select signal (XRi ⊕ XIi) is not required. Hence the delay for the partial product generation is reduced.

Figure 5.19. Circuits for partial product generation: (a) 2:1 multiplexer with an XOR-generated select signal; (b) 4:1 multiplexer.

For the accumulator design, the selection of the structure is important. The usual structures for accumulators are the array, carry-save, and tree structures. The tree structure is the fastest. To achieve a high throughput, we select the tree structure for the accumulator.
The fastest multi-operand adder tree, i.e., the tree with the lowest height, is the Wallace tree. However, the Wallace tree has complex wiring; it is therefore difficult to optimize, and the layout becomes irregular. The overturned-stairs adder tree, suggested by Mou and Jutand [40], is used instead in the design of the complex multipliers. The main features of the overturned-stairs adder tree are:

• A recursive structure that yields regular routing and simplifies the design of the layout generator.
• A low tree height, which grows logarithmically with the number of operands, with a factor that depends on the type of overturned-stairs tree.

There are several types of overturned-stairs adder trees [40]. The first-order overturned-stairs tree, which has the same speed bound as the Wallace tree when the number of operands is less than 19, is chosen. It has a regular layout and the same height as the Wallace tree when the data wordlength is less than 19, and it also simplifies the routing planning in the accumulator design.

The construction of the overturned-stairs tree is illustrated in Fig. 5.20. The trees of height 1 to 3 are shown directly. When the height is more than three, we can construct the tree with only three building blocks: body, root, and connector. The branch of height j − 2 is formed by placing j − 2 carry-save adders (CSAs) on top of each other with proper interconnections [40]. The body of height j (j > 2) consists of a body of height j − 1, a branch of height j − 2, and a connector. The connector joins three feedthroughs from the body of height j − 1 and two outputs from the branch of height j − 2 to construct the body of height j; since there are only three feedthroughs between the body of height j − 1 and the body of height j, the body can be constructed repeatedly according to Fig. 5.20. A root (a CSA) is connected to the outputs of the connector to form the whole tree of height j + 1.
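The function of such a multi-operand CSA tree, independent of whether it is wired as a Wallace or an overturned-stairs tree, is repeated 3:2 reduction. A hedged model that also counts the reduction levels:

```python
# Carry-save adder tree: each level applies 3:2 counters to groups of three
# operands, so the operand count shrinks by roughly one third per level until
# two words remain for the final carry-propagate (vector merge) adder.
# This models the tree's function, not the overturned-stairs wiring.

def csa(a, b, c):
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1

def csa_tree_sum(operands):
    ops, levels = list(operands), 0
    while len(ops) > 2:
        nxt = []
        for i in range(0, 3 * (len(ops) // 3), 3):
            s, c = csa(ops[i], ops[i + 1], ops[i + 2])
            nxt += [s, c]
        nxt += ops[3 * (len(ops) // 3):]   # operands left over at this level
        ops, levels = nxt, levels + 1
    return ops[0] + (ops[1] if len(ops) > 1 else 0), levels
```

The different tree types trade the number and placement of these levels against wiring regularity; the arithmetic result is the same.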
5 V. IMPLEMENTATION OF FFT PROCESSORS Height j2 Body height j1 Branch 89 .CSA CSA CSA CSA CSA CSA CSA CSA CSA CSA Tree 1 n CSAs Root Body 2 CSA Tree 2 Tree 3 Root CSA Branch n Connector CSA CSA CSA Connector Body of height j Root Tree of height j+1 Figure 5. However. The choice of fulladder has large impact on the performance of accumulator. Overturnedstairs tree. it is not competitive from a power consumption point of view. The ﬁrst type of fulladder is a conventional static CMOS adder. the conventional static CMOS full adder. with large stack height.20. recently a large number of new adder cells has been proposed [51] and they should be evaluated in the future work. We compared several fulladders and found the most suitable for our implementation. The fulladder is essential for the accumulator. When the voltage is as low as 1. is too slow. Furthermore.
Figure 5.21. Conventional static CMOS full adder.

A second type of full adder is the full adder with transmission gates (TG), Fig. 5.22. This full adder realizes the XOR gates with transmission gates, and both its power consumption and chip area are smaller than those of the conventional static CMOS full adder.

Figure 5.22. Transmission-gate full adder.

A third type of full adder is the Reusens full adder [50]. This full adder is fast and compact, but it requires buffers for the outputs. Buffer insertion is usually considered a drawback, since it introduces delay and increases the power consumption. However, in
the accumulator, the buffer insertion is necessary anyway in order to drive the long interconnections. There is no direct path from VDD or VSS in this full adder, which tends to reduce the power consumption.

Figure 5.23. Reusens full adder.

Table 5.3. Comparison of full adders in a 0.35 µm technology.

Adder type     Transistor count   Delay (ns) @ 1.5 V   Power (µW) @ 1.5 V
Static CMOS    24                 4.5                  3.2
TG             16                 3.3                  2.5
Reusens        16                 2.2                  2.1

5.3.3.4. Accumulator Implementation

After the selection of the structure and the adder cell, the accumulator can be implemented. A software for the automatic generation of overturned-stairs adder trees has been developed. The software can handle different wordlengths for the data and the coefficient. The generated structural VHDL code can be validated by applying random test vectors in a testbench.

A handcrafted accumulator using the overturned-stairs tree in a 0.35 µm standard CMOS technology is shown in Fig. 5.24. The worst-case delay is 26 ns at 1.5 V and 25 °C according to SPICE simulation. The power consumption for this complex multiplier is 15 mW at 1.5 V and 72.6 mW at 3.3 V, both running at 25 MHz.
Figure 5.24. Accumulator layout. Size: 704.3 × 479.5 µm².

5.3.3.5. Brent-Kung Adder Implementation

The Brent-Kung adder is used as the vector merge adder. The Brent-Kung adder belongs to the prefix adders, which use the carry propagate and generate properties of the full adder to accelerate the carry propagation. A program for the schematic generation of Brent-Kung adders has been developed, and the layout generator is under construction. The generated schematic of a 32-bit Brent-Kung adder is illustrated in Fig. 5.25. The layout of a 32-bit Brent-Kung adder is shown in Fig. 5.26.

Figure 5.25. Block diagram for a 32-bit Brent-Kung adder.
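The prefix formulation behind the Brent-Kung adder can be sketched as follows. The scan below uses simple distance doubling for brevity; the Brent-Kung network computes the same prefixes with a sparser, more regular tree (illustrative sketch only):

```python
# Prefix adder sketch: per-bit generate g_i = x_i & y_i and propagate
# p_i = x_i ^ y_i are combined with the associative operator
#   (g, p) o (g', p') = (g | (p & g'), p & p')
# so that all carries are available after O(log width) combining levels.

def prefix_add(x, y, width):
    g = [((x >> i) & (y >> i)) & 1 for i in range(width)]
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(width)]
    G, P = g[:], p[:]
    d = 1
    while d < width:                      # logarithmic number of levels
        for i in range(width - 1, d - 1, -1):
            G[i] |= P[i] & G[i - d]       # extend the group generate
            P[i] &= P[i - d]
        d *= 2
    s = 0
    for i in range(width):
        carry_in = G[i - 1] if i else 0   # carry into bit i
        s |= (p[i] ^ carry_in) << i
    return s                              # sum modulo 2**width
```

Compared with the ripple-carry chain, the carry into the most significant bit is ready after a logarithmic rather than linear number of stages, which is why the Brent-Kung adder suits the vector merge position.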
However.35 µm CMOS technology.3 V at 25 MHz in a standard 0. IMPLEMENTATION OF FFT PROCESSORS Size: 0. and it increases the routing complexity as well. An observation is that a large portion of the total power are consumed by the computation of complex multiplications in the FFT processor. consumes 290 mW@3. Final FFT Processor Design After the design of the components and the selection of FFT architecture.3 V. the power consumption for the computation of the complex multiplications is still more than 210 mW. hence. 25 MHz. and. 32bit BrentKung adder. Using high radix butterﬂies can reduce the number of complex multiplications outsides the butterﬂies.16 mm2 93 . Overcoming this two drawbacks is the key for using high radix butterﬂies.25 × 0. For a 1024point FFT processor. we apply the meetinthemiddle methodology to combine the components into the complete implementation. it requires four complex multipliers.26. We have implement a complex multiplier that consumes 72.Figure 5.6 mW with power supply voltage of 3. Even with bypass techniques for trivial complex multiplications. 5. it is not common to use high radix butterﬂy for VLSI implementations due to two main drawbacks: it increases the number of complex multiplications within the butterﬂies if the radix is larger than 4. Hence the reduction of the number of complex multiplication is vital.4.
As is well known, adders consume much less power than multipliers with the same wordlength, because an adder has less hardware and far fewer glitches. We have implemented a 32-bit (real) Brent-Kung adder that consumes 1.5 mW at 3.3 V and 25 MHz, which is much less than a 17 × 13 bit complex multiplier (72.6 mW at 3.3 V, 25 MHz). It is therefore efficient to replace the complex multipliers with constant multipliers when possible. We use constant multipliers in the design of a 16-point butterfly in order to reduce the number of complex multipliers.

For a 16-point FFT butterfly, there are three types of non-trivial complex multiplications within the butterfly, i.e., multiplications with $W_{16}^1$, $W_{16}^2$, and $W_{16}^3$. The multiplications with $W_{16}^1$ and $W_{16}^3$ can share coefficients, since cos(π/8) = sin(π/2 − π/8) = sin(3π/8) and sin(π/8) = cos(π/2 − π/8) = cos(3π/8). We can therefore use constant multipliers, which reduces the complexity. The implementation of a multiplication with $W_{16}^1$ is illustrated in Fig. 5.27.

Figure 5.27. Complex multiplication with $W_{16}^1$, using the constant coefficients cos(π/8) + sin(π/8), cos(π/8), and cos(π/8) − sin(π/8).

The selection of the FFT algorithm affects the number and positions of the constant multipliers. For a 16-point DFT, the radix-4 FFT and split-radix FFT (SRFFT) algorithms are more efficient than the radix-2 FFT algorithm in terms of the number of multiplications: both the radix-2 and the split-radix algorithms require three multipliers (two multipliers with $W_{16}^2$ and one multiplier with $W_{16}^1$), while the radix-4 algorithm requires only two multipliers (one multiplier
with $W_{16}^2$ and one multiplier with $W_{16}^1$/$W_{16}^3$). As mentioned in the resource analysis, the most memory-efficient architectures are the architectures with single-path feedback, since they give the minimum data memory, only N − 1 words for an N-point FFT. Hence, the 16-point butterfly based on the radix-4 decomposition is more efficient and is selected for our implementation. Since the radix-2 butterfly has the simplest routing, and to cope with the complex routing associated with high-radix butterflies, it is better to divide the 16-point butterfly into four stages.

The number of non-trivial complex multiplications required for a 1024-point FFT with different algorithms is shown in Table 5.6. With the 16-point butterflies, the number of non-trivial complex multiplications is reduced to 1776, and the total number of complex multipliers is reduced to two.

Table 5.6. The number of non-trivial complex multiplications for a 1024-point FFT with different algorithms.

Algorithm      No. of Comp. Mult.
R2FFT          3586
R4FFT          2732
SRFFT          2390
Our approach   1776

By replacing the complex multiplications with constant multiplications within the 16-point butterfly, the power consumption for the complex multiplications within a 16-point butterfly is reduced to 10 mW at 3.3 V and 25 MHz. In the 1024-point FFT processor there are then only two complex multipliers and two constant multipliers, which consume less than 160 mW. Hence, a power saving of more than 20% for the computation of the complex multiplications can be achieved. This is less than the theoretical saving of 35% (the ratio of the numbers of complex multiplications) due to the computation of the complex multiplications within the 16-point butterflies.
The radix-4 algorithm can be decomposed into a radix-2 algorithm, as done in [24]. Hence, the mapping of the 16-point butterfly can be done with four pipelined radix-2 butterflies, where each butterfly has its own feedback memory. The 16-point butterfly is illustrated in Fig. 5.28.

Figure 5.28. 16-point butterfly: four pipelined butterfly elements, each with its own feedback memory, followed by a constant multiplier.

The power consumption for the data memory is estimated to be 300 mW (the power consumption for memories of 128 words or more is given by the vendor; the smaller memories are estimated through linear approximation down to 32 words). The butterflies consume about 30 mW. The total power consumption for the three main subsystems is thus 490 mW. Assuming 15% overhead for, e.g., the clock buffers and the communication buses, the power consumption for the FFT processor is estimated to be about 550 mW at 3.3 V. The memories contribute 55% of the total power consumption, the computation units for the butterfly operations and the complex multiplications 37%, and the others 8%.

The 1024-point FFT processor can also run at 1.5 V in the 0.35 µm standard CMOS process, which gives further power savings. The total power consumption of the 1024-point FFT processor is then less than 200 mW at 1.5 V [38].

5.5. Summary

In this chapter, we have discussed the implementation of a 1024-point FFT processor. A resource analysis gave a starting point for the implementation. We proposed a wordlength optimization method for the pipelined FFT architectures; this method gave a memory saving of up to 14%.
We discussed the implementation of the subblocks, i.e., the memories, the butterflies, and the complex multipliers. We proposed high-radix butterflies using the carry-save technique, which is efficient in terms of delay and area. We constructed a complex multiplier using distributed arithmetic and the overturned-stairs tree, which is area efficient. All these subblocks can operate at a low power supply voltage and are suitable for voltage scaling.

Finally, we discussed the implementation of an FFT processor using a 16-point butterfly. The proposed 16-point butterfly reduces the number of complex multiplications while retaining the minimum memory requirement, which is power efficient.
6 CONCLUSIONS

This thesis discussed the essential parts of low power pipelined FFT processor design.

The selection of the FFT algorithm is an important starting point for the FFT processor implementation: an FFT algorithm with fewer multiplications and additions is attractive. The selection of the low power strategy also affects the FFT hardware design. Supply voltage scaling is an efficient low power technique and was used for the FFT processor design.

After the selection of the FFT algorithm and the low power strategy, it is important to reduce the hardware complexity. The wordlengths in each stage of the pipelined FFT processor may be different and can therefore be optimized. A simulation-based method has been developed for the wordlength optimization of the pipelined FFT architectures. In some cases, the wordlength optimization can reduce the size of the memories by up to 14% compared with using a uniform wordlength in each stage. This also results in a power saving of 14% for the memories. The reduction of the wordlength also reduces the power consumption in the complex multipliers and the butterflies proportionally.

For the detailed design, we proposed that a carry-save technique be used for the implementation of the butterflies. The proposed high-radix butterflies reduce both the area and the delay by more than 20%. This technique is generally applicable to high-radix butterflies.

In the complex multiplier design, we use distributed arithmetic to reduce the hardware complexity. We select the overturned-stairs tree
for the realization of the complex multiplier. The overturned-stairs tree has a regular structure and the same performance as the Wallace tree when the data wordlength is less than 19, and it is therefore used. Simulation shows that the complex multiplier operates at up to 30 MHz at a 1.5 V power supply voltage. The power consumption is 15 mW at 25 MHz with a 1.5 V power supply voltage.

In the SRAM design, we modified an STC D-flip-flop to form a two-stage sense amplifier. The sense amplifier can be operated at a low power supply voltage.

Using the proposed 16-point butterfly, the number of complex multiplications can be reduced, which results in a power saving of more than 20% for the complex multiplications. With the optimized wordlengths, the data memory size is reduced by 10%.

With all these efforts, the total power consumption of the 1024-point pipelined FFT processor, with a continuous throughput of 25 Msamples/s and an equivalent wordlength of 12 bits, is less than 200 mW at 1.5 V in a 0.35 µm standard CMOS process. The memories consume the most significant part of the total power: the memories contribute 55% of the total power consumption, the computation units for the butterfly operations and the complex multiplications 37%, and the others 8%. This indicates that the optimization of the memory structure could be important for the implementation of low power FFT processors.
REFERENCES

[1] M. Alidina, J. Monteiro, S. Devadas, A. Ghosh, and M. Papaefthymiou, "Precomputation-based sequential logic optimization for low power," IEEE Trans. on VLSI Systems, Vol. 2, No. 4, pp. 426–436, Dec. 1994.
[2] Analog Devices Inc., ADSP-21060 SHARC Super Harvard Architecture Computer, Norwood, MA, 1996.
[3] A. Antola, R. Negrini, and N. Scarabottolo, "Arrays for discrete Fourier transform," Proc. of the European Signal Processing Conf. (EUSIPCO), Amsterdam, Netherlands, pp. 915–918, Sep. 1988.
[4] G. Bi and E. V. Jones, "A pipeline FFT processor for word-sequential data," IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. 37, No. 12, pp. 1982–1985, Dec. 1989.
[5] J. A. C. Bingham, "Multicarrier modulation for data transmission: An idea whose time has come," IEEE Commun. Mag., Vol. 28, No. 5, pp. 5–14, May 1990.
[6] L. Bisdounis, O. Koufopavlou, and S. Nikolaidis, "Accurate evaluation of CMOS short-circuit power dissipation for short channel devices," Proc. of the Intern. Symp. on Low Power Electronics & Design, Monterey, CA, pp. 181–192, 1996.
[7] E. O. Brigham, The Fast Fourier Transform and Its Applications, Prentice Hall, 1988.
[8] C. S. Burrus, "Index mappings for multidimensional formulation of the DFT and convolution," IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. ASSP-25, No. 3, pp. 239–242, June 1977.
[9] A. Chandrakasan and R. Brodersen, Low Power Digital CMOS Design, Kluwer, 1995.
[10] A. Chandrakasan, S. Sheng, and R. Brodersen, "Low-power CMOS digital design," IEEE Journal of Solid-State Circuits, Vol. 27, No. 4, pp. 472–484, April 1992.
[11] A. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R. Brodersen, "Optimizing power using transformations," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, Vol. 14, No. 1, pp. 12–31, Jan. 1995.
[12] S. C. Chang, M. Marek-Sadowska, and K. T. Cheng, "Perturb and simplify: multilevel Boolean network optimizer," IEEE Trans. on Computer-Aided Design, Vol. 15, No. 12, pp. 1494–1504, Dec. 1996.
[13] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, Vol. 19, pp. 297–301, April 1965.
[14] A. Croisier, D. Esteban, M. Levilion, and V. Riso, Digital Filter for PCM Encoded Signals, U.S. Patent 3 777 130, Dec. 1973.
[15] A. M. Despain, "Fourier transform computers using CORDIC iterations," IEEE Trans. on Computers, Vol. C-23, No. 10, pp. 993–1001, Oct. 1974.
[16] DoubleBW Systems B.V., PowerFFT™ processor data sheet, Delft, the Netherlands, 2002.
[17] P. Duhamel and H. Hollmann, "Split-radix FFT algorithm," Electronics Letters, Vol. 20, No. 1, pp. 14–16, Jan. 1984.
[18] P. Duhamel and M. Vetterli, "Fast Fourier transforms: A tutorial review and a state of the art," Signal Processing, Vol. 19, No. 4, pp. 259–299, April 1990.
[19] I. J. Good, "The interaction algorithm and practical Fourier analysis," J. Royal Statist. Soc., ser. B, Vol. 20, pp. 361–372, 1958.
[20] A. Ghosh, S. Devadas, K. Keutzer, and J. White, "Estimation of average switching activity in combinational and sequential circuits," Proc. of the 29th Design Automation Conf., pp. 253–259, June 1992.
[21] S. F. Gorman and J. M. Wills, "Partial column FFT pipelines," IEEE Trans. on Circuits and Systems II, Vol. 42, No. 6, June 1995.
[22] H. Groginsky and G. Works, "A pipeline fast Fourier transform," IEEE Trans. on Computers, Vol. C-19, No. 11, pp. 1015–1019, Nov. 1970.
[23] D. Hang and Y. Kim, "A deep submicron SRAM cell design and analysis methodology," Proc. of the Midwest Symp. on Circuits and Systems, Dayton, Ohio, USA, pp. 858–861, Aug. 2001.
[24] S. He and M. Torkelson, "A new approach to pipeline FFT processor," Proc. of the 10th Intern. Parallel Processing Symp. (IPPS), Honolulu, Hawaii, USA, pp. 766–770, April 1996.
[25] M. T. Heideman and C. S. Burrus, "On the number of multiplications necessary to compute a length-2ⁿ DFT," IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. ASSP-34, No. 1, Feb. 1986.
[26] M. T. Heideman, D. H. Johnson, and C. S. Burrus, "Gauss and the history of the FFT," IEEE Acoustics, Speech, and Signal Processing Magazine, Vol. 1, No. 4, pp. 14–21, Oct. 1984.
[27] K. Itoh et al., "Trends in low-power RAM circuit technologies," Proc. of IEEE, pp. 524–543, April 1995.
[28] http://theory.lcs.mit.edu/~fftw
[29] Intel Corp., SA-1100 Microprocessor Technical Reference Manual, Santa Clara, CA, 1998.
[30] H. Kojima et al., "Data-dependent logic swing internal bus architecture for ultra-low-power LSI's," IEEE Journal of Solid-State Circuits, Vol. 30, No. 4, pp. 397–402, April 1995.
[31] D. Kolba and T. Parks, "A prime factor FFT algorithm using high-speed convolution," IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. ASSP-25, pp. 281–294, Aug. 1977.
[32] S. Krishnamoorthy and A. Khouja, "Efficient power analysis of combinational circuits," Proc. of the Custom Integrated Circuit Conf., San Diego, California, USA, pp. 393–396, May 1996.
[33] C. Lemonds and S. Mahant-Shetti, "A low power 16 by 16 multiplier using transition reduction circuitry," Proc. of the Intern. Workshop on Low Power Design, Napa, California, USA, pp. 139–142, April 1994.
[34] W. Li and L. Wanhammar, "A complex multiplier using 'overturned-stairs' adder tree," Proc. of the Intern. Conf. on Electronic Circuits and Systems (ICECS), Paphos, Cyprus, pp. 21–24, Sep. 1999.
[35] W. Li and L. Wanhammar, "A pipeline FFT processor," Proc. of the IEEE Workshop on Signal Processing Systems (SiPS), Taipei, pp. 654–662, Nov. 1999.
[36] W. Li and L. Wanhammar, "Efficient radix-4 and radix-8 butterfly elements," Proc. of the NorChip Conf., Oslo, Norway, pp. 262–267, Nov. 1999.
[37] W. Li and L. Wanhammar, "Word length estimation for memory efficient pipeline FFT/IFFT processors," Proc. of the Intern. Conf. on Signal Processing Applications and Technology (ICSPAT), Orlando, Florida, USA, pp. 326–330, Nov. 1999.
[38] W. Li and L. Wanhammar, "An FFT processor based on 16-point module," Proc. of the NorChip Conf., Stockholm, Sweden, pp. 125–130, Nov. 2001.
[39] J. Melander, Design of SIC FFT Architectures, Linköping Studies in Science and Technology, Thesis No. 618, Linköping University, Sweden, 1997.
[40] Z. Mou and F. Jutand, "'Overturned-stairs' adder trees and multiplier design," IEEE Trans. on Computers, Vol. C-41, No. 8, pp. 940–948, Aug. 1992.
[41] Motorola Inc., DSP96002 IEEE Floating-Point Dual-Port Processor User's Manual, 1989.
[42] L. Nielsen, C. Niessen, J. Sparsø, and K. van Berkel, "Low-power operation using self-timed circuits and adaptive scaling of the supply voltage," IEEE Trans. on VLSI Systems, Vol. 2, No. 4, pp. 391–397, Dec. 1994.
[43] E. Nordhamn, Design of an Application-Specific FFT Processor, Linköping Studies in Science and Technology, Thesis No. 324, Linköping University, Sweden.
[44] M. Pease, "An adaptation of the fast Fourier transform for parallel processing," Journal of the Association for Computing Machinery, Vol. 15, No. 2, pp. 252–264, April 1968.
[45] M. Potkonjak and J. Rabaey, "Algorithm selection: A quantitative optimization intensive approach," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, Vol. 18, No. 5, pp. 524–532, May 1999.
[46] J. Rabaey, L. Guerra, and R. Mehra, "Design guidance in the power dimension," Proc. of the Intern. Conf. on Acoustics, Speech, and Signal Processing, Detroit, Michigan, USA, pp. 2837–2840, May 1995.
[47] J. Rabaey and M. Pedram (Eds.), Low Power Design Methodologies, Kluwer, 1996.
[48] L. Rabiner and B. Gold, Theory and Application of Digital Signal Processing, Prentice Hall, 1975.
[49] C. Rader, "Discrete Fourier transforms when the number of data samples is prime," Proc. of IEEE, Vol. 56, pp. 1107–1108, June 1968.
[50] P. Reusens, High Performance VLSI Digital Signal Processing Architecture and Chip Design, Ph.D. Thesis, Cornell University, 1983.
[51] M. Sayed and W. Badawy, “Performance analysis of single-bit full adder cells using 0.18, 0.25, and 0.35 µm CMOS technologies,” In Proc. of IEEE Intern. Symp. on Circuits and Systems (ISCAS), Scottsdale, Arizona, USA, May 2002, pp. 559–563.
[52] E. Seevinck et al., “Static-Noise Margin Analysis of MOS SRAM Cells,” IEEE Journal of Solid-State Circuits, Vol. SC-22, No. 5, pp. 748–754, Oct. 1987.
[53] M. Stan and W. Burleson, “Bus-invert coding for low-power I/O,” IEEE Trans. on VLSI Systems, Vol. 3, No. 1, pp. 49–58, March 1995.
[54] H. Veendrick, “Short-circuit dissipation of static CMOS circuitry and its impact on the design of buffer circuits,” IEEE Journal of Solid-State Circuits, Vol. SC-19, No. 4, pp. 468–473, Aug. 1984.
[55] Texas Instruments Incorporated, “An Implementation of FFT, DCT, and Other Transforms on the TMS320C30,” Application report: SPRA113, Dallas, Texas, USA, 1997.
[56] V. Tiwari, S. Malik, and P. Ashar, “Compilation techniques for low energy: an overview,” In Proc. of 1994 IEEE Symp. on Low Power Electronics, San Diego, California, USA, Oct. 1994, pp. 38–39.
[57] Q. Wang and S. Vrudhula, “Multi-level logic optimization for low power using local logic transformation,” In Proc. of Intern. Conf. on Computer-Aided Design, San Jose, California, USA, 1996, pp. 270–277.
[58] L. Wanhammar, DSP Integrated Circuits, Academic Press, 1999.
[59] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, second edition, Addison-Wesley, 1993.
[60] T. Widhe, Efficient Implementation of FFT Processing Elements, Linköping Studies in Science and Technology, Thesis No. 619, Linköping University, Linköping, Sweden, 1997.
[61] S. Winograd, “On computing the discrete Fourier transform,” Proc. Nat. Acad. Sci. USA, Vol. 73, No. 4, pp. 1005–1006, April 1976.
[62] E. Wold and A. Despain, “Pipeline and Parallel-pipeline FFT Processors for VLSI Implementation,” IEEE Trans. on Computers, Vol. C-33, No. 5, pp. 414–426, 1984.
[63] G. Yeap, Practical Low Power Digital VLSI Design, Kluwer, 1998.
[64] J. Yuan, High Speed CMOS Circuit Technique, Linköping Studies in Science and Technology, Thesis No. 132, Linköping University, Sweden, 1988.
[65] Zarlink Semiconductor Inc., PDSP16515A Stand Alone FFT Processor Advance Information, 1999.