IAETSD-Pipelined Parallel FFT Architecture Through Folding Transformation

Proceedings of International Conference On Current Innovations In Engineering And Technology
ISBN : 978 - 1502851550
Pipelined Parallel FFT Architecture through Folding Transformation

M. S. Krishna Priya, M.Tech Student, Department of ECE, Shri Vishnu Engg. College for Women
D. Murali Krishna, Sr. Asst. Professor, Department of ECE, Shri Vishnu Engg. College for Women
ABSTRACT:

This project presents a FUSING FFT system for an OFDM application. It is demonstrated by a software-reconfigurable OFDM system using a programmable floating-point DSP. A new VLSI architecture for a real-time pipelined FFT processor is proposed. High-radix floating-point butterflies are implemented more efficiently with two fused floating-point operations: a two-term dot product and an add-subtract unit. Both discrete and fused radix processors are implemented and compared in terms of area. The clock cycles required to demodulate OFDM data using looped and straight-line FFT programming methods are also provided. Higher execution speed is achieved by using straight-line code instead of looped code; the tradeoff of this optimization is a larger program memory requirement for the straight-line assembly code.

KEYWORDS: Fusing, OFDM, FFT, Radix, Dot product, Folding transformation, Optimization, Complex-valued Fourier transform, Add-subtract unit.

1. INTRODUCTION:

OFDM is a multimode modulation and multiple-access technique used in a number of commercial wired and wireless applications. On the wired side, it is used for a variant of digital subscriber line (DSL). For wireless, OFDM is the basis for several television and radio broadcast applications, including the European digital broadcast television standard, as well as digital radio in North America. The FUSING platform provides software control of a variety of modulation schemes, wideband or narrowband operation, communications security functions such as frequency hopping, and the waveform requirements of current and evolving standards over a broad frequency range. It is viewed as a single radio platform providing services to multiple cellular standards.

The fast Fourier transform (FFT) is widely used in digital signal processing (DSP) applications such as filtering and spectral analysis to compute the discrete Fourier transform (DFT). The FFT plays a critical role in modern digital communications such as digital video broadcasting and orthogonal frequency division multiplexing (OFDM) systems. Much research has been carried out on designing pipelined architectures for computing the FFT of complex-valued signals (CFFT). Various algorithms have been developed to reduce the computational complexity, of which the Cooley-Tukey radix-2 FFT [1] is very popular.

The IEEE floating-point format used here is not the only way to represent floating-point numbers; it is just the standard way of doing it. The representation has three fields:
International Association Of Engineering & Technology For Skill Development

www.iaetsd.in
| S | E | F |

S is one bit representing the sign of the number.
E is an 8-bit biased integer representing the exponent.
F is an unsigned n-bit integer representing the mantissa field.

The decimal value represented is:

$$(-1)^{S} \times f \times 2^{e}$$

where $e = E - \text{bias}$ and $f = F / 2^{n} + 1$.

For the single-precision representation (the emphasis here), n = 23 and bias = 127.
For the double-precision representation (a 64-bit representation), n = 52 (there are 52 bits for the mantissa field) and bias = 1023 (there are 11 bits for the exponent field).

2. FUSED FLOATING-POINT ADD-SUBTRACT UNIT:

The floating-point fused add-subtract unit (Fused AS) performs an addition and a subtraction in parallel on the same pair of data. It is based on a conventional floating-point adder [8]. Although higher-speed adder designs are available (see [9] for example), the basic design shown here serves to demonstrate the concept. A block diagram of the fused add-subtract unit is shown in Fig. 5 (after the initial design from [10]). Some details, such as the LZA and normalization logic, are omitted to simplify the figure. The exponent difference calculation, significand swapping, and significand shifting for both the add and the subtract operations are performed with a single set of hardware, and the results are shared by both operations. This significantly reduces the required circuit area. The significand swapping and shifting are done based solely on the values of the exponents (i.e., without comparing the significands).

To demonstrate the utility of the fused dot product (Fused DP) and Fused AS units for FFT implementation, FFT butterfly units were designed using both the discrete and the fused units. First, a radix-2 decimation-in-frequency FFT butterfly was designed. All lines carry complex pairs of 32-bit IEEE-754 numbers and all operations are complex. The complex add, subtract, and multiply operations can be realized with a discrete implementation that uses two real adders to perform the complex add or subtract, and four real multipliers and two real adders to perform the complex multiply.
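The operation counts of the discrete implementation can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: it runs in double precision rather than on a 32-bit datapath, complex values are modeled as (re, im) tuples, and all function names are hypothetical.

```python
def complex_add(a, b):
    """Complex add built from two real adders."""
    return (a[0] + b[0], a[1] + b[1])


def complex_sub(a, b):
    """Complex subtract built from two real adders (used as subtractors)."""
    return (a[0] - b[0], a[1] - b[1])


def complex_mul(a, b):
    """Complex multiply built from four real multipliers and two real adders."""
    re = a[0] * b[0] - a[1] * b[1]  # two multiplies, one add/subtract
    im = a[0] * b[1] + a[1] * b[0]  # two multiplies, one add
    return (re, im)


def dif_butterfly(x0, x1, w):
    """Radix-2 decimation-in-frequency butterfly.

    top    = x0 + x1
    bottom = (x0 - x1) * w, where w is the twiddle factor
    """
    top = complex_add(x0, x1)
    bottom = complex_mul(complex_sub(x0, x1), w)
    return top, bottom
```

In this form, the real and imaginary parts of the product are each a two-term dot product, and the butterfly forms a sum and a difference of the same operand pair. These are the two patterns that the fused dot product and fused add-subtract units are meant to cover.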
The nodes A0..A7 represent the eight butterflies in the first stage of the FFT, and B0..B7 represent the butterflies in the second stage. Assume that each butterfly has only one multiplier, at the bottom output, instead of one at both outputs.
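As a rough software model of such a flow graph, the sketch below runs a radix-2 decimation-in-frequency FFT in which each butterfly multiplies only its bottom output by a twiddle factor. The 16-point size is an assumption inferred from the eight butterflies per stage; the function names are hypothetical, and the code models the arithmetic only, not the folded hardware. With this structure the results come out in bit-reversed order, which the helper `bit_reverse` makes explicit.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def dif_fft(x):
    """In-place radix-2 DIF FFT; len(x) must be a power of two.

    Returns the transform with outputs in bit-reversed index order.
    """
    n = len(x)
    data = list(x)
    half = n // 2
    while half >= 1:
        for start in range(0, n, 2 * half):
            for k in range(half):
                a = data[start + k]
                b = data[start + k + half]
                w = cmath.exp(-2j * cmath.pi * k / (2 * half))
                data[start + k] = a + b               # top output: no multiplier
                data[start + k + half] = (a - b) * w  # bottom output: one multiplier
        half //= 2
    return data


def dft(x):
    """Direct O(n^2) DFT, used only as a reference."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
        for k in range(n)
    ]
```

For a 16-point input, each of the four stages executes eight butterflies, and `dif_fft(x)[i]` holds the DFT coefficient with index `bit_reverse(i, 4)`.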
Although there is a multiplicative factor j after the first stage, the first two stages consist of only a real-valued datapath. We just need to combine the real and imaginary parts and send them as an input to the multiplier in the next stage; a full complex butterfly stage is not needed for this. The factor is handled in the second butterfly stage using bypass logic, which forwards the two samples as the real and imaginary parts to the input of the multiplier. The adder and subtractor in the butterfly remain inactive during that time.

Scheduling Method 2: Another way of scheduling is proposed, which modifies the architecture slightly and also reduces the required number of delay elements. In this scheduling, the input samples are processed sequentially, instead of processing the even and odd samples separately. This can be derived using the following folding sets:

3. HIGH THROUGHPUT ARCHITECTURE:

The proposed FFT architecture consists of the following main parts, together with their specific novelties and advantages:

(i) A memory unit composed of 16 dual-port memory banks, which facilitates 16-way parallel data access.
(ii) A memory bank index and address generation unit (BAGU), which generates conflict-free, in-place memory bank indexes and addresses for the radix-16 FFT operation.
(iii) Four commutator blocks, located in front of the input side and after the output side of the memory, which provide an efficient data routing mechanism governed by the BAGU signals.
(iv) A scaling unit (SU), which coordinates controlled scaling for block floating point (BFP) operations and yields a higher signal-to-quantization-noise ratio (SQNR) than the existing designs.
(v) The kernel processing engine, a high-performance computing engine for radix-16 butterfly operations. It comprises four radix-16 PEs (PE_R16 0 through PE_R16 3), two sets of radix-2 PEs (each set contains four radix-2 PEs), and four sets of complex multipliers (each set contains four complex multipliers) for twiddle factor multiplications. These multipliers are optimized with a common-subexpression sharing technique and a new twiddle-factor multiplication scheme. All the function units inside the kernel processing engine are detailed.

To avoid possible conflicts when simultaneously reading (or writing) 16 data words from (or to) the memory banks during FFT operations, a proper memory addressing scheme is necessary. The well-known conflict-free memory addressing schemes [5], [7] are only applicable to the radix-2 FFT algorithm. Although the addressing scheme in [6] is for general radix-r FFT operations, its FFT size must be a power of r. Besides, those schemes are limited to single-PE architectures. On the other hand, the radix-2 addressing scheme for multiple PEs [16] is relatively inefficient compared with higher-radix schemes. The proposed scheme has three special features. First, it ensures conflict-free FFT butterfly executions during the entire FFT operation. Second, it supports parallel data outputs with normal ordering. This feature is always desirable for providing immediate, normal-order FFT outputs to the succeeding functional blocks, such as a channel estimator, for timely operations. Third, like many other designs, it adopts the in-place FFT computation strategy for low memory overhead.

4. REORDERING OF THE OUTPUT SAMPLES:

Reordering of the output samples is an inherent problem in FFT computation. In the serial architectures, the outputs are obtained in bit-reversed order [5]. In general, the problem is solved using a memory of size N (the number of FFT points): samples are stored in the memory in natural order using a counter for the addresses, and then they are read in bit-reversed order by reversing the bits of the counter. In embedded DSP systems, special memory addressing schemes are developed to solve this problem. In real-time systems, however, this leads to an increase in latency and area. The order of the output samples in the proposed architectures is not the bit-reversed order. The output order changes for different architectures because of the different folding sets/scheduling schemes.

5. BOOTH ENCODER FOR MULTIPLICATION:

We use the sign extension circuitry developed in [2] and [3]. The conventional modified Booth encoding (MBE) partial product array has two drawbacks: 1) an additional partial product term at the (n-2)th bit position; 2) poor performance at the LSB part compared with the non-Booth design when using the TDM algorithm. To remedy these two drawbacks, the LSB part of the partial product array is modified: the Row_LSB (gray circle) and Neg_cin terms are combined and further simplified using Boolean minimization. All of this is efficiently implemented using this advanced modified Booth algorithm.

The figure below shows the architecture of the commonly used modified Booth multiplier. The inputs of the multiplier are the multiplicand X and the multiplier Y. The Booth encoder encodes input Y and derives the encoded signals. The Booth decoder generates the partial products from the encoded signals and the other input X. The carry-save tree computes the last two rows by adding the generated partial products; these last two rows are then added, using carry-save addition, to generate the final multiplication result.

Fig: Modified Booth Encoder
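The encoder, decoder, and summation steps described in this section can be sketched arithmetically in Python. This is a minimal model of radix-4 (modified) Booth recoding, assuming an even word length n and a two's-complement multiplier Y; it does not model the gate-level circuit, the sign extension logic, or the modified LSB part, and the plain `sum` merely stands in for the carry-save tree and final adder. All function names are hypothetical.

```python
def booth_encode(y, n):
    """Radix-4 Booth digits of an n-bit two's-complement multiplier y.

    Returns n/2 digits, least significant first, each in {-2, -1, 0, 1, 2},
    such that y == sum(d * 4**i for i, d in enumerate(digits)).
    """
    if n % 2 != 0:
        raise ValueError("n must be even")
    # Python's >> sign-extends negative ints, so this yields two's-complement bits.
    bits = [(y >> i) & 1 for i in range(n)]
    digits = []
    prev = 0  # implicit bit y[-1] = 0
    for i in range(0, n, 2):
        digits.append(-2 * bits[i + 1] + bits[i] + prev)
        prev = bits[i + 1]
    return digits


def booth_partial_products(x, y, n):
    """Decoder step: one partial product per Booth digit, already weighted."""
    return [d * x * (4 ** i) for i, d in enumerate(booth_encode(y, n))]


def booth_multiply(x, y, n):
    """Sum the partial products to obtain x * y."""
    return sum(booth_partial_products(x, y, n))
```

Because each digit covers two multiplier bits, an n-bit multiplier produces n/2 partial products instead of n, which is the main reason Booth recoding is used ahead of the reduction tree.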

6. RESULT:

7. CONCLUSION:

This paper describes the design of two new fused floating-point arithmetic units and their application to the implementation of FFT butterfly operations. Although the fused add-subtract unit is specific to FFT applications, the fused dot product is applicable to a wide variety of signal processing applications. Both the fused dot product unit and the fused add-subtract unit are smaller than parallel implementations constructed with discrete floating-point adders and multipliers. The fused dot product is faster than the conventional implementation, since rounding and normalization are not required as a part of each multiplication. Due to longer interconnections, the fused add-subtract unit is slightly slower than the discrete implementation. An efficient and more flexible FFT architecture is designed, and experimental results are obtained with XILINX.

REFERENCES:

[1] J. W. Cooley and J. Tukey, "An algorithm for machine calculation of complex Fourier series," Math. Comput., vol. 19, pp. 297-301, Apr. 1965.
[2] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 1998.
[3] P. Duhamel, "Implementation of split-radix FFT algorithms for complex, real, and real-symmetric data," IEEE Trans. Acoust., Speech, Signal Process., vol. 34, no. 2, pp. 285-295, Apr. 1986.
[4] S. He and M. Torkelson, "A new approach to pipeline FFT processor," in Proc. of IPPS, 1996, pp. 766-770.
[5] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1975.
[6] E. H. Wold and A. M. Despain, "Pipeline and parallel-pipeline FFT processors for VLSI implementation," IEEE Trans. Comput., vol. C-33, no. 5, pp. 414-426, May 1984.
[7] A. M. Despain, "Fourier transform using CORDIC iterations," IEEE Trans. Comput., vol. C-23, no. 10, pp. 993-1001, Oct. 1974.
[8] E. E. Swartzlander, W. K. W. Young, and S. J. Joseph, "A radix-4 delay commutator for fast Fourier transform processor implementation," IEEE J. Solid-State Circuits, vol. SC-19, no. 5, pp. 702-709, Oct. 1984.
[9] E. E. Swartzlander, V. K. Jain, and H. Hikawa, "A radix-8 wafer scale FFT processor," J. VLSI Signal Process., vol. 4, no. 2/3, pp. 165-176, May 1992.
[10] G. Bi and E. V. Jones, "A pipelined FFT processor for word-sequential data," IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 12, pp. 1982-1985, Dec. 1989.
[11] Y. W. Lin, H. Y. Liu, and C. Y. Lee, "A 1-GS/s FFT/IFFT processor for UWB applications," IEEE J. Solid-State Circuits, vol. 40, no. 8, pp. 1726-1735, Aug. 2005.
[12] J. Lee, H. Lee, S. I. Cho, and S. S. Choi, "A high-speed two parallel radix- FFT/IFFT processor for MB-OFDM UWB systems," in Proc. IEEE Int. Symp. Circuits Syst., 2006, pp. 4719-4722.
[13] J. Palmer and B. Nelson, "A parallel FFT architecture for FPGAs," Lecture Notes Comput. Sci., vol. 3203, pp. 948-953, 2004.
[14] M. Shin and H. Lee, "A high-speed four parallel radix- FFT/IFFT processor for UWB applications," in Proc. IEEE ISCAS, 2008, pp. 960-963.
[15] M. Garrido, "Efficient hardware architectures for the computation of the FFT and other related signal processing algorithms in real time," Ph.D. dissertation, Dept. Signal, Syst., Radio Commun., Univ. Politecnica Madrid, Madrid, Spain, 2009.
