DOI 10.1007/s00034-015-0106-5
Selcuk Keskin
selcuk.keskin@eng.bahcesehir.edu.tr
Omer Cetin
omer_cetin@outlook.com.tr
Taskin Kocak
taskin.kocak@eng.bahcesehir.edu.tr
1 Turkish Air Force Academy Aeronautics and Space Technologies Institute, Istanbul 34149,
Turkey
2 Department of Computer Engineering, Bahcesehir University, Istanbul 34353, Turkey
Circuits Syst Signal Process
Keywords Fast Fourier transform (FFT) · General purpose graphics processing unit
(GPGPU) · Compute unified device architecture (CUDA) · Orthogonal frequency
division multiplexing (OFDM)
1 Introduction
Increasing data traffic and multimedia services in recent years have paved the way
for the development of optical transmission methods to be used in high-bandwidth
communications systems. Meanwhile, thanks to data-intensive applications and the
proliferation of mobile devices such as smartphones and tablet computers, there is a
huge demand for high-throughput wireless communications. To address this demand,
the communication industry aims to design next-generation wireless networks with
the capacity to support tens of gigabits per second, similar to wired networks.
OFDM is a very attractive technique for high-rate data transmission in multipath
environments. The main idea behind OFDM is to split the data stream to be
transmitted into N parallel streams of reduced data rate and to transmit each of them on
a separate subcarrier. These carriers are made orthogonal by appropriately choosing
the frequency spacing between them. The advantage of applying OFDM in high data
rate communication systems is a relatively long symbol duration compared with the
delay spread of the channel, so that inter-symbol interference (ISI) can be
eliminated by adding a guard interval (GI) [30]. The main issue is the requirement for
very high-throughput FFT and inverse FFT (IFFT) processors and other circuits [18].
FPGAs and DSPs are usually employed to process OFDM blocks in real time. Because
these are mostly hardware-level solutions, they are difficult to adapt to new standards.
In addition, signal resolution is sacrificed and many FPGAs and DSPs are needed to
reach high data rates. Qian et al. [19] implemented a transmitter on an FPGA-based
DSP platform and demonstrated that it works in real time. Schmogrow et al. [22]
demonstrated an optical OFDM transmitter with 16-QAM modulation. Inan et al. [12]
improved on this work with further optimizations. Buchali et al. [6] demonstrated a
256-point FPGA-based IFFT implementation at a data rate of 12.1 Gb/s. However,
integer arithmetic was used to reduce complexity, with a resolution of 10 bits at
most.
GPUs were used only for 3D graphics rendering in the first years of their
evolution. As the demand for high-performance parallel computing has increased across
many areas of science, medicine, engineering, and finance, GPU performance
has grown steadily over the last decade thanks to intensive, multithreaded and
highly parallel computation. To meet that demand, GPU manufacturers keep improving
their computing architectures. In this paper, the
proposed parallel FFT algorithm is programmed on two different NVIDIA GPU
architectures, namely Fermi and Kepler. The Fermi GPU, released in early 2010, has three
billion transistors and features up to 512 cores. In the first quarter of 2013, the Kepler
GPU with 7.1 billion transistors was announced. The GK110 chip of the Kepler GPU was
designed for high-performance computing, with much higher 64-bit floating-point
performance than its predecessors.
[Figure: OFDM receiver block diagram. Blocks: received OFDM signal, frame detector, CFO estimator and corrector, IQ imbalance corrector, guard interval extraction, MAC de-framer, demodulated bit data]
As explained, the main focus of the paper is the implementation of the FFT and
the exploration of the feasibility of using the computational capability of GPUs. In this
paper, NVIDIA GPGPU architectures and their computing platform, compute
unified device architecture (CUDA), are used to achieve over 10 Gbps baseband
throughput using the OFDM specification. The FFT is a single instruction multiple
data (SIMD) type computation and it can be parallelized as described in the literature
[1,2,13,16]. For a GPU-based parallel FFT algorithm, the best performance can be
achieved only by overcoming a number of challenges, the three major ones being
I/O limitations, memory management and the development of a suitable FFT
algorithm that is highly dependent on the hardware [23]. Several enhancements are
proposed and implemented in this paper to improve FFT computation performance
at different programming levels by considering different hardware architectures. The
approaches defined in this paper aim to minimize inter-processor communication
and to reduce the number of shared memory accesses.
This paper is organized as follows. Section 2 gives a brief overview of the OFDM
system and introduces the general purpose graphics processing unit (GPGPU)
architecture used in this paper. In the next section, optimizations for
the parallel FFT calculation process on the GPU are defined and the effect of
each optimization is presented. In Sect. 5, the experimental work is described
and the performance of the parallel FFT calculation process for an
OFDM-based system is demonstrated. Finally, conclusions are given in Sect. 6. The
Appendix section "FFT algorithm design" explains the FFT algorithm design,
including a description of the Cooley-Tukey algorithm and the parallel approach.
2 Preliminaries
OFDM is a method of encoding digital data on multiple carrier frequencies [14]. The
fundamental processes of OFDM modulation and demodulation can
be seen in Fig. 1. After the received signal is converted by the frame detector to a
predetermined frequency, the converted signal is input to the quadrature demodulator,
which performs detection using the carrier from a numerically controlled oscillator
(NCO) and outputs a baseband OFDM modulation wave. The in-phase output (I signal)
and the quadrature output (Q signal) are the real and imaginary components of the OFDM
modulation wave, respectively. After the IQ imbalance corrector, the FFT is performed on the
received OFDM modulation wave. The output of the FFT is also fed to the channel
estimator, whose output is supplied to the NCO so that the frequency
of the local oscillator carrier is controlled. An information symbol is separated from the
output of the estimator. The output is deinterleaved by a de-tone interleaver to reduce
the influence of burst errors. The deinterleaved data are decoded by the forward
error correction (FEC) decoder. The last operation, performed by a circuit called the
descrambler, applies the inverse of the rule used for scrambling. The FFT is the most
computationally intensive function among all these OFDM blocks.

[Table: column headers MCS index, Data rate (Mb/s), Modulation scheme, Spreading factor, Coding mode, FEC rate; sub-headers msb 8b, msb 8b; rows not recoverable]
For both the transmitter and receiver, the input and output data of the FFT and
IFFT blocks are arrays of symbols. One symbol can be represented as a complex
number, and each complex number can be represented as two float variables (one for
the real part and one for the imaginary part). Combining 512 such pairs gives one
512-point array for the FFT block. Many 512-point
arrays together form a larger array, which is called a batch. Every data point in
the batch is processed with the same instruction sequence in the FFT, so the
computation is suitable for the SIMD type of parallel approach.
The specification focused on in this paper has a maximum data rate of 5.775 Gbps,
which can be achieved by using a 64-QAM scheme with a 5/8 FEC rate as implemented
in IEEE 802.15.3c and shown in Table 1 [11]. For an OFDM-based modem with 512
subcarriers, this means that about 2304 complex multiplications and 4608 complex
additions (the radix-2 counts (N/2) log2 N and N log2 N for N = 512), together with
ancillary operations such as loading the data into the input registers of the FFT
processor, must be completed in 193 ns at most according to the specification. In order to meet
[Figure: a batch of 512-point complex data sets (FFT #1, FFT #2, ..., FFT #batchsize) after host-to-device memory transfer; each of threads #0 to #63 groups data into 8-point values, calculates twiddle factors, performs the butterfly calculation, and stores data]
that keeps the intermediate values during the computation and the size of these values,
etc. depend on the hardware.
The 512-point FFT required for the OFDM-based system described
in Sect. 2 can be performed as a 3-stage computation module, where each stage is
an 8-point FFT as described in the Appendix section "FFT algorithm design". The major
steps of each stage are: fetch the correct data, which arrive
from the channel and are transferred from host to device memory (H2D);
calculate the twiddle factors and perform the butterfly operations for each radix; and
then store the results back. These steps are repeated three times, once per stage, to
complete the 512-point FFT. Finally, the outputs of the FFT on the GPU are transferred
from device to host memory (D2H).
If the 3-stage computation module is used to calculate the 512-point FFT,
64 independent threads (64 × 8 = 512) are needed to cover all 512 points. By
running these threads in parallel on different cores, the FFT
is computed in parallel as shown in Fig. 3. Each thread fetches 8 complex data
from different locations of the 512 complex data in each stage. Sixty-four threads with
the 3-stage computation module are able to calculate a 512-point FFT and store the
results back to global memory. However, a kernel can be executed by multiple equally
shaped thread blocks. The input data are transferred in batches, where batchsize
is the number of FFTs executed together.

[Figure: thread-block layout. Blocks #0 to #(batch/2) each hold threads 0 to 63 and threads 64 to 127; each thread takes 8 complex data from a 512-point complex data set and performs 8-point FFT calculations that exchange data through shared memory]
There are two major kinds of optimization for a CUDA algorithm with respect to the
device architecture: the first is related to memory and the other is related
to threads. Both approaches are used to optimize the FFT algorithm for better
performance, as discussed in the following section. To speed up the memory
access between 8-point FFT computation blocks, we used shared memory instead of
global memory for communication.
Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional
grid of thread blocks. We used a two-dimensional grid to describe the parallel
blocks. As described before, each block has 64 independent threads, and these 64
threads calculate one 512-point FFT. We can calculate two different 512-point FFTs in
one block simultaneously to use system resources and memory space more efficiently,
as shown in Fig. 4. Thus, the total number of blocks equals half of batchsize. By
implementing two 512-point FFT calculations in one block, the thread layout becomes
[64 × 2] in CUDA block notation, which equals 128 independent
parallel threads in one block. Each thread uses the 3-stage computation module and shared
memory to calculate the 512-point FFT. After the calculations, 1024 complex data
(512 × 2 = 1024) are stored back to global memory as the results of two different
512-point FFTs. The total number of parallel threads in the whole kernel execution is
128 × batchsize/2. The 3-stage FFT algorithm, which is optimized for the Fermi
architecture, is summarized in Algorithm 1.
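The thread and block counts above can be checked with a small host-side sketch. It is written in plain C++ so that it compiles without nvcc; `LaunchShape`, `make_launch_shape` and `total_threads` are hypothetical names introduced here for illustration and do not come from the paper's code. The sketch assumes that batchsize is even, as the two-FFTs-per-block layout implies.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type (not from the paper): a launch shape in the
// spirit of CUDA's dim3, kept as plain C++.
struct LaunchShape {
    std::size_t threads_x;  // threads cooperating on one 512-point FFT
    std::size_t threads_y;  // FFTs handled by one block
    std::size_t blocks;     // thread blocks in the grid
};

constexpr std::size_t kFftSize = 512;     // points per FFT
constexpr std::size_t kRadix = 8;         // points handled per thread per stage
constexpr std::size_t kFftsPerBlock = 2;  // two FFTs share one block

// Builds the [64 x 2] block shape and the batchsize/2 block count.
// Assumption: batchsize is even.
inline LaunchShape make_launch_shape(std::size_t batchsize) {
    assert(batchsize % kFftsPerBlock == 0);
    return LaunchShape{kFftSize / kRadix, kFftsPerBlock,
                       batchsize / kFftsPerBlock};
}

// Total number of parallel threads in the whole kernel execution.
inline std::size_t total_threads(const LaunchShape& s) {
    return s.threads_x * s.threads_y * s.blocks;
}
```

For a batch of 1024 FFTs, `make_launch_shape(1024)` gives 64 × 2 threads per block and 512 blocks, that is, 128 × 1024/2 threads in total.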
architectures. Each optimization step and its effect on performance are evaluated
with Microsoft Visual Studio and the NVIDIA Occupancy Calculator [27].
FFT function, the butterfly operation can be embedded directly in the code block to
eliminate branching.
In addition, CUDA programs by default usually consist of two separate code files: a
C++ main function file and a CUDA GPU kernel function file. Here, the C++ main
function and the CUDA kernel procedures
are gathered in one file and compiled together to minimize the total number of
branch operations. The reduction in branching from the first-pass algorithm to the
optimized code is summarized in Table 3. The branch counts for each optimization
step shown in the table are taken from the CUDA profiler tool.
After the optimizations that reduce the number of branch operations in the kernel
code, another critical point is the representation of data types in memory. As described
in Sect. 2, IFFT and FFT computations must be completed in about 193 ns while
using single-precision (32-bit) data to achieve over 10 Gbps throughput on an
OFDM-based system. FFT input and output data can be treated as streams and
represented as arrays. Each signal value (C_k) is a complex number with a
real part (a) and an imaginary part (b), in the form C_k = a + bi. The real and
imaginary parts of the signal values can be stored as float values in memory in two
different arrays. The lengths of the arrays depend on the batch size: each array holds
512 × batch size values. Each complex value is represented as 2 × float, so its size is
8 bytes, and the total array size in memory is 8 × 512 × batch bytes. This
memory for input and output data must be allocated in system memory before
the parallel function can run.
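The buffer size stated above can be written as a one-line computation. This is a minimal sketch; `complex_buffer_bytes` is a hypothetical helper name, and the sketch assumes a 4-byte float, which is checked at compile time.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kPointsPerFft = 512;

// Bytes needed for one buffer (input or output) that holds `batch` FFTs,
// with every complex value stored as two 4-byte floats.
inline std::size_t complex_buffer_bytes(std::size_t batch) {
    static_assert(sizeof(float) == 4, "sketch assumes a 32-bit float");
    assert(batch > 0);
    return 2 * sizeof(float) * kPointsPerFft * batch;
}
```

A single FFT therefore needs 4096 bytes per buffer, and a batch of 1024 FFTs needs 8 × 512 × 1024 = 4,194,304 bytes per buffer.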
To implement the previous enhancements and to apply the following ones, a standard
signal value representation must be defined. There are two data formats in the CUDA
environment for a complex value. The first uses the float2 type, a single
variable that can be accessed through one variable name and pointer. The second
uses two separate float variables for each complex number, one for the real
part and one for the imaginary part. With the second definition, every memory access
becomes a 32-bit access; otherwise it is a 64-bit access. Shared memory banks
are organized such that successive 32-bit words are assigned to successive banks, and
the bandwidth is 32 bits per bank per clock cycle, so 32-bit accesses fit the bank
layout better than 64-bit accesses. By changing the representation of complex numbers
to separate float values for the real and imaginary parts, we can reduce the transfer
and computation time.
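The two representations can be contrasted with a short host-side sketch. `Float2` is a stand-in defined here so that the code compiles as plain C++ (in CUDA the built-in float2 type would play this role); `SplitComplex` and `to_split` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for CUDA's float2: one 64-bit element holding both parts.
struct Float2 {
    float x;  // real part
    float y;  // imaginary part
};

// Split form: real and imaginary parts live in separate float arrays,
// so each element access is a single 32-bit word.
struct SplitComplex {
    std::vector<float> re;
    std::vector<float> im;
};

// Converts the packed (float2-style) layout into the split layout.
inline SplitComplex to_split(const std::vector<Float2>& packed) {
    SplitComplex out;
    out.re.reserve(packed.size());
    out.im.reserve(packed.size());
    for (const Float2& c : packed) {
        out.re.push_back(c.x);
        out.im.push_back(c.y);
    }
    assert(out.re.size() == out.im.size());
    return out;
}
```

In the split layout, neighbouring threads that read `re[i]` and `re[i + 1]` touch successive 32-bit words, which matches the bank organization described above.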
Using two float variables to represent a single signal value in memory brings
additional enhancement opportunities besides organizing 32-bit memory access
patterns. The first is using single-precision floating-point operations, as shown
in Fig. 6, when calculating the twiddle factor values. In CUDA C, floating-point
literals without a suffix are double precision, so arithmetic that involves them is
carried out in double precision by default. We converted these operations to
single-precision math by adding the f suffix to the values and by using the
use_fast_math compiler option.
By using single-precision floating-point operations, we achieved better computation
performance.
Another enhancement opportunity is halving the shared memory (SMEM)
requirement for each 512-point FFT data set. As shown in Fig.
7, at the end of each stage, 512 × 2 float values (4 bytes each) are transferred
from register space to shared memory for a single 512-point FFT. Shared
memory size (48 KB at most) is therefore a limiting factor for parallel execution on
the GPU, as seen in Table 2. By reducing shared memory usage, the number of threads
that can fit into the shared memory of a single SM can be increased. Instead of writing
the real and imaginary parts of the data to shared memory together, the real
parts are written first and then moved back to register space, and finally the imaginary
values are written to shared memory. By swapping the real and imaginary parts
between register space and shared memory, the shared memory requirement is
halved for a single 512-point FFT. So the number of threads, which is
limited by the shared memory size of an SM, is doubled, and the maximum number of
parallel blocks rises from 4096 to 8192.
Fig. 8 The implementation of the CUDA code for the sine and cosine look-up tables
After the new data type definition and memory usage optimizations, the NVIDIA
Occupancy Calculator reports an occupancy of 66 %. In particular, after applying
the overwriting enhancement on shared memory, the NVIDIA
profiler output shows that the number of active warps rises from 16 to 32. Better
performance is achieved, but the limits on warp size, register usage and shared
memory usage still constrain how much of the code can run in parallel on the GPU.
The next enhancement concerns the arithmetic operations of the GPU-accelerated
parallel Cooley-Tukey-based FFT algorithm. In the Fermi architecture,
there are four Special Function Units (SFUs) and 32 CUDA cores on each SM.
Trigonometric functions are calculated in the SFUs, not on the cores. In this case, a
memory transfer delay is preferable to computing many trigonometric functions on a
limited number of SFUs. Because of the Cooley-Tukey algorithm, the cosine and sine
values of the twiddle factors in each stage are the same for every thread,
so there is no need to calculate them separately in each parallel thread.
Thus, a pre-computed look-up table is created to supply the trigonometric function
results instead of computing them each time. The CUDA architecture has a small but
fast on-chip memory space called constant memory, which
is used to hold the trigonometric look-up tables. Part of the trigonometric
look-up tables and their usage can be seen in Fig. 8.
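The idea of a pre-computed table can be sketched on the host side as follows. This is not the code of Fig. 8: `TwiddleTable` and `make_twiddle_table` are hypothetical names, the table is indexed by k with angle 2πk/N as an assumption of this sketch, and the copy into constant memory that a CUDA build would need is not shown.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Host-side look-up table for an N-point FFT: entry k holds
// cos(2*pi*k/N) and sin(2*pi*k/N) in single precision.
struct TwiddleTable {
    std::vector<float> cos_tab;
    std::vector<float> sin_tab;
};

// Fills the table once so that threads can read values instead of
// evaluating sine and cosine repeatedly.
inline TwiddleTable make_twiddle_table(std::size_t n) {
    assert(n > 0);
    const double two_pi = 2.0 * std::acos(-1.0);
    TwiddleTable t;
    t.cos_tab.resize(n);
    t.sin_tab.resize(n);
    for (std::size_t k = 0; k < n; ++k) {
        const double angle =
            two_pi * static_cast<double>(k) / static_cast<double>(n);
        t.cos_tab[k] = static_cast<float>(std::cos(angle));
        t.sin_tab[k] = static_cast<float>(std::sin(angle));
    }
    return t;
}
```

For N = 512, two float tables of 512 entries each take 4 KB in total, which is small enough for a constant-memory table of the kind described above.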
Fig. 9 Memory coalescing (fast access) and not-coalesced (slow access) representation
contiguous memory region. By default, all accesses are cached through L1, which has
128-byte lines.
The first and simplest case of coalescing can be achieved by any CUDA-enabled
device: the kth thread accesses the kth word in memory, as shown in Fig. 9. Not all
threads need to participate. For example, if the threads of a warp access adjacent 4-byte
words (e.g., adjacent float values), a single 128-byte L1 cache line, and therefore a
single coalesced transaction, will service that memory access. Memory coalescing
minimizes the number of bus transactions and thereby maximizes global memory
bandwidth. As mentioned, global memory has the slowest access time in the GPU memory
hierarchy. Therefore, memory coalescing plays an important role in obtaining
a better FFT computation time.
In the primary algorithm, inputs were written from global memory to shared
memory. In the iteration, inputs were copied from shared memory to registers and
results were calculated in registers. After the iteration, outputs were read again
from shared memory and written to global memory. Many threads access the shared
memory, so it is divided into banks, to which successive 32-bit words are assigned.
A memory can service as many simultaneous accesses as it has banks. Multiple
simultaneous accesses to the same bank result in a bank conflict, which causes the
accesses to be serialized. To reduce shared memory accesses and to have few or no
bank conflicts, we changed the data flow
as shown in Fig. 10. Inputs are read from global memory into registers. In the
iteration, calculations are done and the results are written to shared memory, from
which the inputs for the next iteration are read. After the iteration, the values
needed are already in the registers, so they are used to calculate the outputs, which
are written to global memory.
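The effect of the access stride on bank conflicts can be illustrated with a small host-side model. It is a simplified sketch, not a profiler: it assumes 32 banks and a warp of 32 threads (values that are assumptions of this example, not figures stated in the text), treats every pair of accesses to the same bank as a conflict, and excludes the stride-0 broadcast case. `conflict_degree` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kBanks = 32;     // assumed number of shared memory banks
constexpr std::size_t kWarpSize = 32;  // assumed threads per warp

// Thread t accesses 32-bit word (t * stride_words). Successive words map to
// successive banks, so the bank index is the word index modulo kBanks.
// Returns the largest number of threads that land on one bank: 1 means
// conflict-free, k means the request is serialized into k steps.
inline std::size_t conflict_degree(std::size_t stride_words) {
    assert(stride_words > 0);  // stride 0 would be a broadcast, not modeled
    std::array<std::size_t, kBanks> hits{};
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        ++hits[(t * stride_words) % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under this model a stride of 1 word is conflict-free, a stride of 2 gives a 2-way conflict, and a stride of 32 sends all threads to the same bank, which is why the data layout between FFT stages matters.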
Another enhancement when calculating the FFT on the Fermi and Kepler architectures is
compiler-related optimization. The NVCC compiler of CUDA has different computing
...
Ax1 = In[index].x;
Bx1 = In[index2].x;
...
// Several operators in this listing were lost in extraction; each affected
// spot is marked /* ? */ and left unfilled.
for (int c = 7; c > -1; c--, d /* ? */= 1)
{
    smem[thid] = Ax1 + Bx1;
    // calculation of others
    __syncthreads();
    Ax1 = smem[index3];
    Bx1 = cosf(((thid /* ? */ c) /* ? */ c) * Omega) * smem[index3 + d]
          /* ? */ sinf(((thid /* ? */ c) /* ? */ c) * Omega) * smem[index3 + d3 + d];
    // calculation of others
    __syncthreads();
}
Out[index].x = Ax1 + Bx1;
// calculation of others
...
5 Experimental Results
Our work is implemented in the CUDA 5.0 environment and executed on Tesla K20c
and GeForce GTX590 GPUs. The Cooley-Tukey-based FFT algorithm is designed
and optimized for each target architecture: Algorithm 1 is optimized for the
Fermi-based GPU and Algorithm 2 is optimized for the Kepler-based GPU. All the
experiments are conducted on a 3.20 GHz Intel Core i7-960 CPU with 12 GB of memory.
A soft modem, shown in Fig. 11 and consisting of the modulator and demodulator
modules of the OFDM-based system, is implemented in MATLAB by using MATLAB
Table 4 Summary of optimizations
(columns: Fermi, Kepler; the per-architecture check marks are not recoverable)
Branch reduction
No operator overload
Embedded function
Data type definition
Float values for input/output
Float2 values for input/output
Floating-point operations
Trigonometric functions
Using look-up tables
Computing with SFU
Memory coalescing
SMEM optimizations
Half size usage by swapping
Reducing of access count
Compiler optimizations
Fig. 11 Integration of OFDM soft modem with CUDA based parallel FFT computation
Fig. 12 Executed on GTX590 (time per FFT in nanoseconds and GFLOPS versus batch size; batch sizes 256, 512, 1024, 2048 and 4096)
On the other hand, the Tesla K20c is an example of a Kepler-based graphics card.
It has 2496 CUDA cores, organized in groups of 192 cores
into 13 SMX units, roughly a fivefold increase in the number of cores. To achieve
a better result with the K20c, the number of threads was increased. To calculate the
512-point FFT, a 9-stage algorithm design is used instead of the 3-stage design,
as shown in Fig. 5, and each stage calculates only a 2-point FFT as demonstrated in
Fig. 18. To cover all 512 points, 256 independent threads (256 × 2 = 512) are
used in parallel. The Tesla K20c has more Special Function Units than the Fermi
architecture, so mathematical functions (like sine and cosine) are calculated faster
than on the GTX590. On this architecture, fetching the trigonometric values from
look-up tables is therefore more costly than computing them, and the twiddle factors
are calculated directly in the computation process. The other enhancements
implemented in the code are
the same as on the Fermi architecture. Algorithm 2 achieves 85 ns per FFT computation
Fig. 13 Executed on K20c (time per FFT in nanoseconds and GFLOPS versus batch size; batch sizes 256, 512, 1024, 2048 and 4096)
For both algorithms, the biggest bottleneck is shared memory. Algorithm 1 calculates
two FFTs in one block and writes 512 × 2 float values (real or imaginary parts) to
SMEM at the same time. Algorithm 2 calculates one FFT in one block and writes
512 float2 values (real and imaginary parts) to SMEM at the same time. In both cases
one block uses 4096 bytes (4 KB) of SMEM. The number of blocks that fit in SMEM can
be seen in Fig. 14.
While 12 FFT blocks can be stored in SMEM, only 8 blocks can be calculated at
the same time because of the thread count limitation of an SM/SMX. The performance
for different SMEM sizes can be seen in Fig. 15.
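Taking 4 bytes per float and 8 bytes per float2, and the 48 KB maximum shared memory size quoted earlier, the per-block footprint and the resulting block count work out as follows (a worked check of the figures in the text, not an additional measurement):

```latex
\begin{align*}
\text{Algorithm 1:}\quad & 2 \times 512 \times 4\,\text{B} = 4096\,\text{B} = 4\,\text{KB per block} \\
\text{Algorithm 2:}\quad & 512 \times 8\,\text{B} = 4096\,\text{B} = 4\,\text{KB per block} \\
\text{Blocks that fit in SMEM:}\quad & 48\,\text{KB} \,/\, 4\,\text{KB} = 12
\end{align*}
```

This agrees with the 12 FFT blocks that can be stored in SMEM; the limit of 8 concurrently calculated blocks comes from the thread count, not from shared memory.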
Our GPU-based FFT algorithms can be compared with NVIDIA's cuFFT library.
cuFFT achieves better performance than our algorithms when the batch
size is more than 2048, as seen in Fig. 16. However, our algorithms are able to run
in real time with a batch size of 1024. The FFT must be
finished within 193 ns to give a real-time result according to the IEEE 802.15.3c
specification. As mentioned in Sect. 5.1, the proposed algorithms finish in under
190 ns. In other words, cuFFT computes the FFT faster than our algorithms at larger
batch sizes, but we do not need to increase the batch size to run in real time. If the
batch size is increased, cuFFT also meets the real-time deadline, but the memory
required for buffering increases. At that point, the proposed algorithms are
faster than cuFFT and provide real-time processing.

[Figure: blocks #1 to #12 placed in SMEM, with markers at 16 KB, 32 KB and 48 KB]

[Figure: throughput (Gbps) versus SMEM size for GTX590 (SM) and K20c (SMX)]
Moreover, cuFFT is closed source. Before calling the FFT,
a plan function is executed with the transform size, data type and number of
transforms. Processing must wait until this plan function has finished; in
other words, the next process in the OFDM system cannot be started. To
accelerate all processes in the OFDM system on the GPU, a single kernel can be used
to implement the whole OFDM process, avoiding the kernel initialization overhead
of each separate process. That is possible with the proposed algorithms because they
do not require a plan. The kernel usage with cuFFT and with
our GPGPU-based algorithm, named BauFFT, is illustrated in Fig. 17. At least three
kernel initializations are required to use cuFFT. Pre-FFT processes must be
finished, and the plan and input must be prepared according to the outputs of the
previous functions; only then can the host side call cuFFT. With BauFFT, on the other
hand, one kernel initialization is enough to compute the OFDM system including the
pre-FFT, FFT and post-FFT processes.

[Figure: Alg. 1, Alg. 2 and cuFFT compared over batch sizes 256, 512, 1024, 2048 and 4096]

[Figure: kernel usage with cuFFT and with BauFFT, each between pre-FFT processes and post-FFT processes]
6 Conclusions
algorithms have achieved 12.29 Gbps on the Fermi-based GPU and 23.71 Gbps on the
Kepler-based GPU, respectively. To achieve such high throughput, several techniques,
including algorithmic optimizations, efficient data structures and memory access
optimizations, are applied as enhancements to the first-pass form of the algorithm for
Fermi- and Kepler-based GPUs separately. Future work includes implementing
all processes of the OFDM system on the GPU and integrating the proposed FFT
algorithm to decrease the computation time of the OFDM system.
Acknowledgments A part of this work is financially supported by KDDI R&D Laboratories Inc., Japan.
Appendix
An FFT computes the discrete Fourier transform (DFT) of a set of signal data
and produces exactly the same result as evaluating the DFT definition directly; the
only difference is that an FFT is much faster. The DFT decomposes
a sequence of values into components of different frequencies. Computing the DFT of
N points in the naive way, directly from the definition, is slow, and an FFT is a way
to compute the same result more quickly.
The difference in speed can be enormous, especially for long data sets where N may
be in the thousands or more.
\[
X_k = \sum_{n=0}^{N-1} x_n \, e^{-i 2\pi k n / N}, \qquad k = 0, \ldots, N-1 \tag{3}
\]
If we assume that x_0, ..., x_{N-1} are complex numbers, Eq. 3 defines the DFT.
Evaluating this definition directly requires O(N^2) operations: there are N outputs
(X_k), and each output requires a sum of N terms. An FFT is a method to compute the
same results in O(N log N) operations.
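The O(N^2) cost is easy to see when the definition is coded directly: two nested loops over N. The following is a minimal reference sketch in double precision, assuming the usual forward-transform sign convention (a negative exponent); `naive_dft` is a hypothetical name and this is not the GPU code of the paper.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Direct evaluation of the DFT definition: N outputs, each a sum of N terms.
inline std::vector<Complex> naive_dft(const std::vector<Complex>& x) {
    const std::size_t n = x.size();
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<Complex> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        Complex sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            // exp(-i * 2*pi * k * j / N) written with cosine and sine
            const double angle = -two_pi * static_cast<double>(k * j) /
                                 static_cast<double>(n);
            sum += x[j] * Complex(std::cos(angle), std::sin(angle));
        }
        out[k] = sum;
    }
    return out;
}
```

A constant input such as (1, 1, 1, 1) transforms to (4, 0, 0, 0), which makes a convenient sanity check for any faster implementation.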
Cooley-Tukey Algorithm
The publication by Cooley and Tukey in 1965 of an efficient algorithm for the
calculation of the DFT was a major turning point in the development of digital signal
processing [8]. Various extensions and modifications were later made to the original
algorithm [7]. The Cooley-Tukey algorithm is by far the most commonly used FFT,
ahead of others such as the prime-factor FFT algorithm [9], Bruun's FFT algorithm [5],
Rader's FFT algorithm [21], and Bluestein's FFT algorithm [3].
\[
\begin{aligned}
y_0 &= x_0 + x_1 w^k \\
y_1 &= x_0 - x_1 w^k
\end{aligned} \tag{4}
\]
The FFT butterfly operation, which is the basic calculation element in the FFT,
takes two complex points and converts them into two other complex points
as shown in Fig. 18. In the case of the 2-point (radix-2) Cooley-Tukey algorithm, the
butterfly is simply a DFT of size 2 that takes two complex inputs (x_0, x_1), which are
the corresponding outputs of the two sub-transforms, and gives two complex outputs
(y_0, y_1) by using Eq. 4. The factor w^k is called the twiddle factor. A twiddle factor
in FFT algorithms is any of the trigonometric constant coefficients that are multiplied
by the data in the course of the algorithm.

Fig. 19 8-point FFT calculation flow diagram (unordered input ordered output); inputs enter in the order x(0), x(4), x(2), x(6) (even points), x(1), x(5), x(3), x(7) (odd points), and outputs leave in the order y(0) to y(7)
More specifically, twiddle factors originally referred to the root-of-unity complex
multiplicative constants in the butterfly operations of the Cooley-Tukey FFT
algorithm, used to recursively combine smaller discrete Fourier transforms [29]. This
remains the term's most common meaning, but it may also be used for any
data-independent multiplicative constant in an FFT.
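A radix-2 butterfly and its twiddle factor can be sketched in a few lines. The sketch assumes w = exp(-2πi/N), the usual forward-transform convention, and uses double precision for clarity; `twiddle` and `butterfly` are hypothetical helper names rather than functions from the paper's kernels.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <utility>

using Complex = std::complex<double>;

// Twiddle factor w^k for an N-point transform, with w = exp(-2*pi*i/N).
// The sign convention is an assumption of this sketch.
inline Complex twiddle(unsigned k, unsigned n) {
    assert(n > 0);
    const double two_pi = 2.0 * std::acos(-1.0);
    const double angle =
        -two_pi * static_cast<double>(k) / static_cast<double>(n);
    return Complex(std::cos(angle), std::sin(angle));
}

// Radix-2 butterfly: y0 = x0 + x1 * w^k and y1 = x0 - x1 * w^k.
// The product x1 * w^k is computed once and reused for both outputs.
inline std::pair<Complex, Complex> butterfly(Complex x0, Complex x1,
                                             Complex wk) {
    const Complex t = x1 * wk;
    return {x0 + t, x0 - t};
}
```

With x_0 = 1, x_1 = 2 and w^0 = 1, the butterfly returns y_0 = 3 and y_1 = -1; one complex multiplication serves both outputs, which is where the operation count saving of the FFT comes from.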
An N-point signal is decomposed into N signals, each containing a single point. Each
stage of the FFT uses an interlaced decomposition, separating the even- and
odd-numbered samples [25]. The DFT of an N-point sequence can be calculated from the two
N/2-point DFTs of the even-index terms x_0, x_2, ..., x_{N-2} and the odd-index terms
x_1, x_3, ..., x_{N-1}; those two results are then combined to produce the DFT of the
whole sequence [4,17,20]. This idea can be applied recursively to reduce the overall
runtime to O(N log N).
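The even/odd decomposition translates almost literally into a recursive routine. The following is a textbook-style reference sketch in double precision, assuming a power-of-two length and the forward sign convention; `fft_recursive` is a hypothetical name, and the code is meant for checking results, not as the parallel GPU algorithm.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Recursive radix-2 decimation-in-time FFT. The length must be a power of two.
inline std::vector<Complex> fft_recursive(const std::vector<Complex>& x) {
    const std::size_t n = x.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    if (n == 1) return x;

    // Interlaced decomposition into even- and odd-indexed samples.
    std::vector<Complex> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = x[2 * i];
        odd[i] = x[2 * i + 1];
    }
    const std::vector<Complex> e = fft_recursive(even);
    const std::vector<Complex> o = fft_recursive(odd);

    // Combine the two half-size DFTs with one butterfly per output pair.
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<Complex> out(n);
    for (std::size_t k = 0; k < n / 2; ++k) {
        const double angle =
            -two_pi * static_cast<double>(k) / static_cast<double>(n);
        const Complex t = Complex(std::cos(angle), std::sin(angle)) * o[k];
        out[k] = e[k] + t;
        out[k + n / 2] = e[k] - t;
    }
    return out;
}
```

For the input (1, 2, 3, 4) the routine returns (10, -2+2i, -2, -2-2i), the same values that direct evaluation of the DFT definition gives.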
This simplified form assumes that N is a power of two; since the number of sample
points N can usually be chosen freely by the application, this is often not an important
restriction. How many stages to use in computing the FFT depends on how the algorithm
is designed for the capabilities of the specific hardware. For example, to calculate a
512-point FFT, a 3-stage algorithm (8 × 8 × 8) can be designed in which each stage
calculates an 8-point FFT. The major bottlenecks that limit the choice of the number
of stages are hardware issues such as memory size and inter-processor communication
costs.
There are two different approaches to calculating the 8-point FFT in each stage. The first
of them, which we used, requires reordering the input data so that the outputs are
produced in natural order. The 8-point FFT itself consists of three internal stages. Each of
these stages calculates 2-point FFT results after the twiddle factors have been calculated.
The flow chart of the 8-point FFT calculation can be seen in Fig. 19.
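A possible reading of this scheme is sketched below in Python. It assumes that the input reordering is the usual bit-reversal permutation and that the sign convention is $e^{-2\pi i k/N}$; neither is confirmed by the flow chart itself. The names `bit_reverse` and `fft8` are illustrative.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft8(x):
    """8-point FFT: reorder the inputs, then run three stages of butterflies."""
    n, bits = 8, 3
    if len(x) != n:
        raise ValueError("fft8 expects exactly 8 samples")
    # Input reordering (assumed bit reversal) so that outputs come out in order.
    a = [complex(x[bit_reverse(i, bits)]) for i in range(n)]
    half = 1
    for _ in range(bits):                   # three stages of 2-point butterflies
        span = 2 * half
        for start in range(0, n, span):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / span)   # stage twiddle factor
                t = w * a[start + k + half]
                u = a[start + k]
                a[start + k] = u + t
                a[start + k + half] = u - t
        half = span
    return a
```

With this ordering the inputs are read at indices 0, 4, 2, 6, 1, 5, 3, 7, and each of the three stages performs four butterflies.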
References
1. G. Bergland, A parallel implementation of the fast Fourier transform algorithm. IEEE Trans. Comput.
C-21(4), 366–370 (1972)
2. G. Bergland, D. Wilson, A fast Fourier transform global, highly parallel processor. IEEE Trans. Audio
Electroacoust. 17(2), 125–127 (1969)
3. L.I. Bluestein, A linear filtering approach to the computation of discrete Fourier transform. IEEE Trans.
Audio Electroacoust. 18(4), 451–455 (1970)
4. E.O. Brigham, The Fast Fourier Transform and its Applications, 1st edn. (Prentice-Hall Inc., Englewood
Cliffs, 1988)
5. G. Bruun, z-Transform DFT filters and FFTs. IEEE Trans. Acoust. Speech Signal Process. 26(1), 56–63
(1978)
6. F. Buchali, R. Dischler, A. Klekamp, M. Bernhard, D. Efinger, Realisation of a real-time 12.1 Gb/s
optical OFDM transmitter and its application in a 109 Gb/s transmission system with coherent reception,
in European Conference on Optical Communication (ECOC), pp. 1–2 (2009)
7. J.W. Cooley, P.A.W. Lewis, P.D. Welch, Historical notes on the fast Fourier transform. Proc. IEEE
55(10), 1675–1677 (1967)
8. J.W. Cooley, J.W. Tukey, An algorithm for the machine calculation of complex Fourier series. Math.
Comput. 19(90), 297–301 (1965)
9. I.J. Good, The interaction algorithm and practical Fourier analysis. J. R. Stat. Soc. B 20, 361–372
(1958)
10. N. Hinitt, T. Kocak, GPU-based FFT computation for multi-gigabit wirelessHD baseband processing.
EURASIP J. Wirel. Commun. Netw. 2010, 359081 (2010). doi:10.1155/2010/359081
11. IEEE standards association. http://standards.ieee.org/findstds/standard/802.15.3c-2009.html (2013)
12. B. Inan, S. Adhikari, O. Karakaya, P. Kainzmaier, M. Mocker, H. von Kirchbauer, N. Hanik, S.L.
Jansen, Real-time 93.8-Gb/s polarization-multiplexed OFDM transmitter with 1024-point IFFT. Opt.
Express 19(26), B64–B68 (2011)
13. L.H. Jamieson, P.T. Mueller, H.J. Siegel, FFT algorithms for SIMD parallel processing systems. J. Parallel
Distrib. Comput. 3(1), 48–71 (1986)
14. V. Kanwar, H. Thakur, N. Sharma, Performance evaluation of OFDM system under various modulation
techniques and various channels. Int. J. Res. Eng. Adv. Technol. 1(3), 1–5 (2013)
15. D.B. Kirk, W.M.W. Hwu, Programming Massively Parallel Processors, 2nd edn. (Morgan Kaufmann,
Boston, 2012)
16. Y. Li, J.R. Diamond, X. Wang, H. Lin, Y. Yang, Z. Han, Large-scale fast Fourier transform on a
heterogeneous multi-core system. Int. J. High Perform. Comput. Appl. 26(2), 148–158 (2012)
17. A.V. Oppenheim, R. Schafer, Discrete-Time Signal Processing, 3rd edn. (Prentice-Hall Inc., Englewood
Cliffs, 2009)
18. C.H. Peng, K.T. Shr, M.H. Lin, Y.H. Huang, A baseband receiver for optical OFDM systems, in
International Symposium on VLSI Design, Automation and Test (VLSI-DAT), pp. 1–4 (2011)
19. D. Qian, T.O. Kwok, N. Cvijetic, J. Hu, T. Wang, 41.25 Gb/s real-time OFDM receiver for variable
rate WDM-OFDMA-PON transmission, in Optical Fiber Communication, Collocated National Fiber
Optic Engineers Conference (OFC/NFOEC), pp. 1–3 (2010)
20. L.R. Rabiner, B. Gold, Theory and Application of Digital Signal Processing, 1st edn. (Prentice-Hall
Inc., Englewood Cliffs, 1975)
21. C.M. Rader, Discrete Fourier transforms when the number of data samples is prime. Proc. IEEE 56(6),
1107–1108 (1968). doi:10.1109/PROC.1968.6477
22. R. Schmogrow, M. Winter, D. Hillerkuss, B. Nebendahl, S. Ben-Ezra, J. Meyer, M. Dreschmann, M.
Huebner, J. Becker, C. Koos, W. Freude, J. Leuthold, Real-time OFDM transmitter beyond 100 Gbit/s.
Opt. Express 19(13), 12740–12749 (2011)
23. B.P. Sinha, J. Dattagupta, A. Sen, Improvement in the speed of FFT processors using segmented
memory and parallel arithmetic units. Signal Process. 8(2), 267–274 (1985)
24. B.P. Sinha, J. Dattagupta, A. Sen, Parallel implementation of wavelet-based image denoising on
programmable PC-grade graphics hardware. Signal Process. 90(8), 2396–2411 (2010)
25. S.W. Smith, The Scientist and Engineer's Guide to Digital Signal Processing (California Technical
Publishing, San Diego, 1997)
26. J.B. Srivastava, R. Pandey, J. Jain, Implementation of digital signal processing algorithm in general
purpose graphics processing unit (GPGPU). Int. J. Innov. Res. Comput. Commun. Eng. 1(4),
1006–1012 (2013)
27. The CUDA Occupancy Calculator.
http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls (2013)
28. The CUDA Programming Guide. https://developer.nvidia.com/category/zone/cuda-zone (2013)
29. M. Vetterli, H.J. Nussbaumer, Simple FFT and DCT algorithms with reduced number of operations.
Sig. Process. 6(4), 267–278 (1984)
30. M. Yoshida, T. Taniguchi, An LDPC-coded OFDM receiver with pre-FFT iterative equalizer for ISI
channels, in IEEE 61st Vehicular Technology Conference, vol 2, pp. 767–772 (2005)