
Circuits Syst Signal Process

DOI 10.1007/s00034-015-0106-5

Real-Time FFT Computation Using GPGPU


for OFDM-Based Systems

Omer Cetin1 · Selcuk Keskin2 · Taskin Kocak2

Received: 26 November 2014 / Revised: 9 June 2015 / Accepted: 10 June 2015


© Springer Science+Business Media New York 2015

Abstract In optical and wireless communications systems, the goal is to reach


10 Gbps or above data rates. In order to support such extremely high data rates, the
physical layer generally uses orthogonal frequency division multiplexing (OFDM)
modulation. Unlike serial transmission of symbols, the OFDM modulation transmits
data with many parallel sub-carriers, which help to provide high bandwidth. Field
programmable gate arrays (FPGAs) and digital signal processors (DSPs) are usually
employed to process OFDM blocks in real time. However, FPGAs and DSPs are not
cost effective, and they are difficult to adapt to new standards. One of the most com-
putationally intensive functions in OFDM systems is the fast Fourier transform (FFT)
computation process. This paper aims to accelerate the FFT process to achieve high
communication throughput in real time. Two parallel approaches are implemented for
two different NVIDIA graphics processing unit (GPU) architectures. To obtain the
best performance values, several optimizations are implemented. Our general purpose
graphics processing unit (GPGPU)-based FFT computation achieves up to 24 Gbps
throughput in real time.

B Selcuk Keskin
selcuk.keskin@eng.bahcesehir.edu.tr
Omer Cetin
omer_cetin@outlook.com.tr
Taskin Kocak
taskin.kocak@eng.bahcesehir.edu.tr

1 Turkish Air Force Academy Aeronautics and Space Technologies Institute, Istanbul 34149,
Turkey
2 Department of Computer Engineering, Bahcesehir University, Istanbul 34353, Turkey

Keywords Fast Fourier transform (FFT) · General purpose graphics processing unit
(GPGPU) · Compute unified device architecture (CUDA) · Orthogonal frequency
division multiplexing (OFDM)

1 Introduction

Increasing data traffic and multimedia services in recent years have paved the way
for the development of optical transmission methods to be used in high-bandwidth
communications systems. Meanwhile, thanks to data intensive applications and the
proliferation of mobile devices such as smartphones and tablet computers, there is a
huge demand for high throughput wireless communications. To address this demand,
the communications industry aims to design next-generation wireless networks with the
capacity to support tens of gigabits per second, similar to wired networks.
OFDM is a very attractive technique for high-rate data transmission in multipath
environments. The main idea behind OFDM is to split the data stream to be trans-
mitted into N parallel streams of reduced data rate and to transmit each of them on
a separate subcarrier. These carriers are made orthogonal by appropriately choosing
the frequency spacing between them. The advantage of applying OFDM in high data
rate communication systems is a relatively long symbol duration compared with the
delay spread of the channel, in which inter-symbol interference (ISI) can be elimi-
nated by adding a guard interval (GI) [30]. The main issue is the requirement of very
high throughput FFT, Inverse FFT (IFFT) processor, and other circuits [18]. FPGAs
and DSPs are usually employed to process OFDM blocks in real time. Being mostly
hardware-level solutions, it is also very difficult for them to adapt to new standards.
To reach high data rates, the signal resolution is sacrificed and many FPGAs and DSPs are
used. Qian et al. [19] implemented a transmitter and demonstrated that it works in real
time using an FPGA-based DSP platform. Schmogrow et al. [22] demonstrated an
optical OFDM transmitter with 16-QAM modulation. The work was improved with
some optimizations, and better results were obtained by Inan et al. [12]. Buchali et al. [6]
demonstrated a 256-point FPGA-based IFFT implementation at a high data rate
of 12.1 Gb/s. However, integer arithmetic operations are used to reduce complexity, i.e.,
10-bit resolution at maximum.
GPUs were only used for 3D graphics rendering in the first years of their evolution.
As the demand for high-performance parallel computing increases across
many areas of science, medicine, engineering, and finance, the performance of GPUs
has improved considerably over the last decade thanks to their intensive, multithreaded
and highly parallel computations. To meet that demand, powerful
GPU computing architectures have been developed by GPU manufacturers. In this paper, the
proposed parallel FFT algorithm is programmed in two different NVIDIA GPU archi-
tectures, namely Fermi and Kepler. The Fermi GPU, released in early 2010, has three
billion transistors and features up to 512 cores. In the first quarter of 2013, the Kepler
GPU with 7.1 billion transistors was announced. The GK110 chip of Kepler GPU was
designed for high-performance computing with a much higher 64-bit floating-point
performance than its predecessor graphics chips.

Fig. 1 Receiver diagram of OFDM system (frame detector, CFO estimator and corrector, IQ imbalance corrector, guard interval extraction, FFT, NCO, CFO tracking, channel estimator and corrector, de-tone interleaver, constellation de-mapping, stuff bits extraction, FEC decoder, de-scrambler, MAC de-framer, demodulated bit data)

As explained, the main focus of the paper is the implementation of the FFT and the
exploration of the feasibility of using the computational capability of GPUs. In this
paper, NVIDIA GPGPU architectures and their computing platform, called compute
unified device architecture (CUDA), are used to achieve over 10 Gbps baseband
throughput using the OFDM specification. The FFT process is a single instruction multiple
data (SIMD) type computation, and it can be parallelized as described in the literature
[1,2,13,16]. For a GPU-based parallel FFT algorithm, the best performance can be
achieved by overcoming a number of challenges, the major three of which are
I/O limitations, memory management and the development of a new suitable FFT
algorithm that is highly dependent on the hardware [23]. Several enhancements are
proposed and implemented in this paper to improve FFT computation performance
at different programming levels by considering different hardware architectures. The
approaches defined in this paper aim to minimize inter-processor communication
in order to reduce the number of shared memory accesses.
This paper is organized as follows. Section 2 gives a brief overview of the OFDM
system and introduces the general purpose graphics processing unit (GPGPU)
architecture used in this paper. Section 3 presents the GPGPU-based parallel FFT
algorithm design, and Sect. 4 describes the optimizations of the parallel FFT calculation
process on the GPU together with the effect of each optimization. In Sect. 5, the experimental
work is described and the performance of the parallel FFT calculation process for an
OFDM-based system is demonstrated. Finally, conclusions are given in Sect. 6. The
Appendix explains the FFT algorithm design, including a description of the
Cooley–Tukey algorithm and the parallel approach.

2 Preliminaries

OFDM is a method of encoding digital data on multiple carrier frequencies [14]. Basically,
the fundamental processes applied to an OFDM signal for modulation and demodulation can
be seen in Fig. 1. After a received signal is converted by the frame detector to obtain a
predetermined frequency, the converted signal is input to the quadrature demodulator

Table 1 Modulation and coding schemes (MCS)

MCS index   Data rate (Mb/s)   Modulation scheme   Spreading factor   Coding mode   FEC rate (MSB / LSB)
0           32.1               QPSK                48                 EEP           1/2
1           1540               QPSK                1                  EEP           1/2
2           2310               QPSK                1                  EEP           3/4
3           2695               QPSK                1                  EEP           7/8
4           3080               16-QAM              1                  EEP           1/2
5           4620               16-QAM              1                  EEP           3/4
6           5390               16-QAM              1                  EEP           7/8
7           5775               64-QAM              1                  EEP           5/8
8           1925               QPSK                1                  UEP           1/2 / 3/4
9           2503               QPSK                1                  UEP           3/4 / 7/8
10          3850               16-QAM              1                  UEP           1/2 / 3/4
11          5005               16-QAM              1                  UEP           3/4 / 7/8

that performs detection using a carrier from the numerically controlled oscillator (NCO) and
outputs a baseband OFDM modulation wave. The in-phase output (I signal)
and the quadrature output (Q signal) are the real and imaginary components of the OFDM
modulation wave, respectively. After the IQ imbalance corrector, the FFT is performed on the
received OFDM modulation wave. The output of the FFT process is also fed to the channel
estimator, which generates an output supplied to the NCO so that the frequency
of the local oscillator carrier is controlled. An information symbol is separated from the
output of the estimator. The output is deinterleaved by a de-tone interleaver to reduce
the influence of burst errors. The deinterleaved data are decoded by the forward
error correction (FEC) decoder. The last operation is accomplished according to the
rule that is the inverse of the one used for scrambling, performed by a circuit called
the descrambler. The FFT process is the most computationally intensive function among all
these OFDM blocks.
For both the transmitter and the receiver, the input and output data types of the FFT and
IFFT blocks are arrays of symbols. One symbol can be represented as a complex
number, and each complex number can be represented as two float variables (one for the
real part and one for the imaginary part of the number). By combining 512 such complex
numbers, each stored as two floats, a 512-point array is formed for the FFT block. Numerous
512-point arrays of data form another large array, which is called a batch. Each point of
data in the batch is computed with the same instruction set for the FFT process, so it is
suitable for the SIMD type parallel approach.
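As an illustration of this layout (a sketch with assumed array and function names, not the code used in the paper), point k of the b-th 512-point FFT in a batch can be addressed as follows, with the real and imaginary parts kept in two separate float arrays:

// Hypothetical helper: fetch one complex point of one FFT from a batch stored as
// two float arrays of length 512 * batchsize (real parts and imaginary parts).
void get_point(const float *batch_re, const float *batch_im,
               int b, int k, float *re, float *im)
{
    int idx = b * 512 + k;   // FFT number b, point k (0 <= k < 512)
    *re = batch_re[idx];     // real part of the symbol
    *im = batch_im[idx];     // imaginary part of the symbol
}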
The specification focused on in this paper has a maximum data rate of 5.775 Gbps,
which can be achieved by using a 64-QAM scheme with a 5/8 FEC rate as implemented
in IEEE 802.15.3c, as shown in Table 1 [11]. For an OFDM-based modem with 512 sub-
carriers, this means that about 2304 complex multiplications, 4608 complex additions
and the ancillary operations, such as loading the data into the input registers of the FFT
processor, must be completed in at most 193 ns according to the specification. In order to meet

Fig. 2 Streaming multiprocessors

this deadline, we explore the highly parallel computation performance of GPUs in this paper.
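As a quick check of these figures (a brief calculation using the standard radix-2 operation counts for N = 512 and the 2016-bit payload per 512-point FFT quoted in Sect. 5.1):

\frac{N}{2}\log_2 N = 256 \times 9 = 2304 \ \text{complex multiplications}, \qquad N\log_2 N = 512 \times 9 = 4608 \ \text{complex additions}

\frac{336 \times 6\ \text{bits}}{193 \times 10^{-9}\ \text{s}} = \frac{2016\ \text{bits}}{193\ \text{ns}} \approx 10.4\ \text{Gbps}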
A GPU provides a parallel architecture, which combines raw computation power
with programmability [26]. A GPU provides extremely high computational throughput
by employing many cores working on a large set of data in parallel [24]. CUDA,
developed by NVIDIA, is a widely used programming approach in massively parallel
computing applications [15,28]. NVIDIA GPU architectures consist of multiple
streaming multiprocessors (SM). Each SM consists of pipelined cores and instruction
dispatch units. During execution, each dispatch unit can issue a wide SIMD
instruction, which is executed on a group of cores. Although CUDA provides the
possibility to unleash the GPU's computational power, several restrictions prevent pro-
grammers from achieving peak performance. The programmer should pay attention
to the hardware-related aspects to achieve near-peak performance.
A GPGPU based on the Fermi architecture (GF-codenamed chip) consists of streaming
multiprocessors (SM), and each of them contains stream processors (cores). The mul-
tiprocessors are called next-generation streaming multiprocessors (SMX) in the Kepler
architecture (GK-codenamed chip). A Fermi GF110 in the GTX590 implementation
includes 16 SM units. Each SM unit features 32 CUDA cores and 16 load/store
units. On the other hand, a Kepler GK110 in the K20c implementation includes 13 SMX
units and six 64-bit memory controllers. Each SMX has 192 single-precision CUDA
cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store
units (LD/ST). The streaming multiprocessors can be seen in Fig. 2.

Table 2 Specifications of evaluation system

                                          Fermi GF110   Kepler GK110
Compute capability                        2.0           3.5
Threads/warp                              32            32
Max warps/multiprocessor                  48            64
Max threads/multiprocessor                1536          2048
Max thread blocks/multiprocessor          8             16
32-bit registers/multiprocessor           32,768        65,536
Max registers/thread                      63            255
Max threads/thread block                  1024          1024
Shared memory size configurations (KB)    16/48         16/32/48

Each streaming multiprocessor has 64 KB of on-chip memory that can be split between L1
cache and shared memory by setting parameters on the host side of the code before calling
kernels. The global memory of the GPU is an off-chip memory. All SM(X) units can access
the global memory, but its access time is the slowest. In addition, the Fermi architecture has
better memory bandwidth and a faster operating frequency; however, Kepler has more
streaming multiprocessors and supports CUDA Compute Capability 3.5 instead of the older
Compute Capability 2.0 supported by the Fermi GF110. Detailed hardware information
can be seen in Table 2.
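For example, the host-side configuration of this cache/shared-memory split could look as follows (an illustrative sketch; the kernel name RunTestBA is the one shown later in Fig. 6):

// Prefer 48 KB shared memory / 16 KB L1 cache for one specific kernel ...
cudaFuncSetCacheConfig(RunTestBA, cudaFuncCachePreferShared);
// ... or set a device-wide preference before any kernel launch.
cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);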
CUDA comes with a software environment that allows developers to use C as a high-
level programming language. CUDA C provides a simple path for users familiar with
the C programming language to easily write programs for execution by the device.
It consists of a minimal set of extensions to the C language and a runtime library.
CUDA C extends C by allowing the programmer to define C functions, called kernels,
that, when called, are executed N times in parallel by N different CUDA threads, as
opposed to only once like regular C functions.
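A minimal example of this execution model (a hypothetical kernel, not from the paper): each of the N launched threads handles exactly one array element.

__global__ void scale(float *data, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global thread index
    data[i] = data[i] * factor;                      // each thread touches one element
}

// Host side: launch N threads organized as (N / 64) blocks of 64 threads each.
// scale<<<N / 64, 64>>>(d_data, 0.5f);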
CUDA threads may access data from multiple memory spaces during their execu-
tion. Each thread has private local memory. Each thread block has shared memory
visible to all threads of the block and with the same lifetime as the block. All threads
have access to the same global memory. There are also two additional read-only mem-
ory spaces accessible by all threads: the constant and texture memory spaces.
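The corresponding CUDA qualifiers for these memory spaces look like this (illustrative names only, not the paper's code):

__constant__ float twiddle_lut[256];            // constant memory: read-only for all threads

__global__ void stage_kernel(const float *g_in, float *g_out)  // g_in/g_out live in global memory
{
    __shared__ float s_buf[512];                // shared memory: visible to one thread block
    float r = g_in[threadIdx.x];                // r: per-thread register / local variable
    s_buf[threadIdx.x] = r * twiddle_lut[threadIdx.x % 256];
    __syncthreads();
    g_out[threadIdx.x] = s_buf[threadIdx.x];
}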

3 GPGPU-Based Parallel FFT Algorithm Design

In GPGPU-based parallel computing, the hardware architecture is very important when
designing the FFT computation algorithm to achieve peak performance. To achieve the
best performance, different algorithms are designed for the two NVIDIA architectures,
and the differences between the algorithms are put forward with reasons detailed in the
following part of the paper. Major key points in the algorithm design, such as pre-
computed values like twiddle factors, the number of stages in the FFT computation, the batch
size (the number of FFTs that will be computed in parallel), and the memory architecture

Fig. 3 Thread organization of the GPU parallel calculation process (a batch of 512-point complex data sets is transferred from host to device; for each FFT, 64 threads group data into 8-point values, calculate twiddle factors, perform butterfly calculations and store the data over three stages of 8-point FFTs, 8 × 8 × 8 = 512-point FFT; the results are transferred from device to host)

that keeps the intermediate values during the computation and the size of these values,
etc. depend on the hardware.
The 512-point FFT process that is required for the OFDM-based system described
in Sect. 2 can be performed as a 3-stage computation module, where each stage is
an 8-point FFT, as described in the Appendix. The major steps of each stage can be
summarized as follows: fetch the correct data, which come from the channel into device
memory via a host-to-device (H2D) memory transfer; calculate the twiddle factors and
implement the butterfly operations for each radix; and then store the results back. Three
repetitions of these blocks are implemented to achieve the 512-point FFT process as three
stages. Finally, the outputs of the FFT process on the GPU are transferred from device to
host memory (D2H).
If the 3-stage computation module is used to calculate the 512-point FFT process,
64 independent threads (64 × 8 = 512) are needed to reach the size of 512. By
computing these threads in parallel on different cores, parallel computing of the FFT
process is implemented as shown in Fig. 3. Each thread fetches 8 complex data

Fig. 4 Block representation of the parallel computing (blocks #0 to #(batch/2); each block holds 128 threads computing two 512-point FFTs through three 8-point FFT stages, exchanging 8-point complex data through shared memory, with device-to-shared (D2S) and shared-to-device (S2D) memory transfers around the stages)

from different locations of the 512 complex data in each stage. Sixty-four threads with
the 3-stage computation module are able to calculate a 512-point FFT and store the results
back to the global memory. However, a kernel can be executed by multiple equally
shaped thread blocks. The input data are transferred as batchsize sets, where batchsize
represents the number of FFTs that will be executed together.
There are two major kinds of optimizations for a CUDA algorithm when the device
architecture is considered: the first is related to memory and the other to threads. The FFT
algorithm is optimized using both approaches, which are discussed in the following
section. To speed up the memory access between the 8-point FFT computation blocks, we
use shared memory instead of global memory for this communication.
Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional
grid of thread blocks. We used a two-dimensional grid architecture to describe the parallel
blocks. As described before, each block has 64 independent threads, and these 64
threads calculate one 512-point FFT. We can calculate two different 512-point FFTs in
one block simultaneously to use the system resources and memory spaces more efficiently,
as shown in Fig. 4. Thus, the total number of blocks will be equal to half of the batchsize. By

implementing two 512-point FFT calculations in one block, the thread representation will
be [64 × 2] in CUDA block representation form, which equals 128 independent
parallel threads in one block. Each thread uses the 3-stage computation module and shared
memory to calculate the 512-point FFT. After the calculations, 1024 complex data (512
× 2 = 1024) are stored back to global memory as the results of two different 512-point
FFT processes. The total number of parallel threads in the whole kernel execution will
be 128 × batchsize/2. The 3-stage FFT algorithm, which is optimized for the Fermi
architecture, is summarized in Algorithm 1.

1 Read 8 signal values from global memory
2 Fetch the twiddle factors from constant memory
3 for i ← 1 to 2 do
4   Calculate 4 butterfly operations
5   Write 8 signal values to shared memory
6   Read 8 signal values from shared memory
7   Fetch the twiddle factors from constant memory
8 end
9 Calculate 4 butterfly operations
10 Write 8 signal values to global memory

Algorithm 1: Pseudo-code for a thread on a Fermi-based GPU
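The corresponding kernel launch configuration would look roughly as follows (a sketch with a hypothetical kernel name; the block and grid shapes follow the text above):

dim3 block(64, 2);               // 64 threads per FFT, two 512-point FFTs per block
dim3 grid(batchsize / 2);        // each block computes two FFTs of the batch
fft512_fermi<<<grid, block>>>(d_in_re, d_in_im, d_out_re, d_out_im);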

The three-stage algorithm design is especially suitable for the Fermi-based GeForce
GTX590. However, when the same algorithm runs on the Kepler architecture, threads and
resources are wasted. The Tesla K20c has 2496 cores (192 cores/SMX × 13 SMX = 2496
cores), while the GeForce GTX590 has 512 cores, roughly a five-fold increase. The
algorithm design is therefore different. To achieve a better result with the K20c, we must
increase the number of threads in one block to use the resources and cores effectively.
Instead of the 3-stage architecture, a 9-stage architecture is defined for use on the
Tesla K20c.
As shown in Fig. 5, a 2-point FFT is performed by one thread in each stage. To achieve
the size of 512, we use 256 independent threads (256 × 2 = 512). We are not able
to compute two FFTs per block, as in the other approach, because of the resources of the K20c.
Thus, the thread representation will be [256 × 1], and the total number of parallel threads in
the whole kernel execution will be 256 × batchsize. The 9-stage FFT algorithm, which
is optimized for the Kepler architecture, is summarized in Algorithm 2.
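As a sketch of the 2-point butterfly that each Kepler thread performs in Algorithm 2 (Eq. 4 in the Appendix), with illustrative function and parameter names and the twiddle factor computed on the fly with sincosf:

__device__ void butterfly2(float x0r, float x0i, float x1r, float x1i,
                           int k, int N,
                           float *y0r, float *y0i, float *y1r, float *y1i)
{
    float s, c;
    sincosf(-2.0f * 3.14159265f * k / N, &s, &c);   // twiddle factor w^k = e^{-2*pi*i*k/N}
    float tr = x1r * c - x1i * s;                   // real part of x1 * w^k
    float ti = x1r * s + x1i * c;                   // imaginary part of x1 * w^k
    *y0r = x0r + tr;  *y0i = x0i + ti;              // y0 = x0 + x1 * w^k  (Eq. 4)
    *y1r = x0r - tr;  *y1i = x0i - ti;              // y1 = x0 - x1 * w^k
}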

4 Optimizations of the GPGPU-Based Parallel FFT Algorithm

As described in the Appendix, to achieve better performance values we must optimize the
parallel FFT algorithm considering the GPU device architecture. The primary GPGPU-based
parallel FFT computation algorithm is inspired by the Hinitt and Kocak algorithm [10]. Since
that work, GPU capabilities and architectures have advanced to a completely different point.
In this part of the paper, the algorithm and the optimizations on the primary approach are
explained in detail for both GPU

Fig. 5 9-stage 512-point FFT design (256 threads per stage, each computing one 2-point butterfly; data are exchanged through shared memory between the nine stages)

1 Read 2 signal values from global memory
2 Calculate the twiddle factors
3 for i ← 1 to 8 do
4   Calculate the butterfly operation
5   Write 2 signal values to shared memory
6   Read 2 signal values from shared memory
7   Calculate the twiddle factors
8 end
9 Calculate the butterfly operation
10 Write 2 signal values to global memory

Algorithm 2: Pseudo-code for a thread on a Kepler-based GPU

architectures. Each optimization step and its effect on the performance are evaluated
with Microsoft Visual Studio and the NVIDIA Occupancy Calculator [27].

4.1 Reduction in Branch Operations Due to Function Calls

The major operation in the Cooley–Tukey algorithm-based FFT process is the butterfly
operation, as described in the previous section, and the CUDA or C libraries do not natively
support it. To accomplish the butterfly operation, first of all, complex math operations
(i.e., addition, subtraction and multiplication) must be defined. The math operators for the
complex numbers used in the butterfly operation can be defined as separate operator
overload functions. However, using operator overload functions causes many
branch operations because of the function calls. Besides the complex number operations,
the butterfly operation can be defined as a function, too. Instead of using a separate 2-point

Table 3 The reduction in branching

Optimization                 No. of branch operations
No optimization              684,432
Embedded math operations     258,048
Embedded butterfly           49,152
Single code file             12,288

FFT function, the butterfly operation can be embedded in the code block directly to dispose
of the branching.
In addition, there are usually two separate code files in the default form of CUDA
programs: one is the C++ main function file and the other is the CUDA GPU kernel
function. The main function of the C++ block and the kernel procedures of
CUDA are gathered in one file and compiled together to minimize the total number of
branch operations. The reduction in branching from the first-pass algorithm to the
optimized code is summarized in Table 3. The branch counts for each optimization step
shown in the table are taken from the CUDA profiler tool.
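To illustrate the difference (hypothetical helper names, not the paper's exact code): the first form routes every complex multiply through a separate function, whereas the embedded form, which the paper reports reduces branch operations, writes the arithmetic directly into the kernel body.

// (a) Separate helper function for complex multiplication.
__device__ float2 cmul(float2 a, float2 b)
{
    return make_float2(a.x * b.x - a.y * b.y,
                       a.x * b.y + a.y * b.x);
}

// (b) Embedded form used after the optimization: the same arithmetic written out
//     directly inside the kernel, as in Fig. 6, with no separate call site.
//     Value4 = (Ax1 - Bx1) * Omegax - (Ay1 - By1) * Omegay;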

4.2 Data Type Definition and Memory Usage Optimizations

After completing the optimizations on the number of branch operations in the kernel
code, another critical point is the data type representation in memory. As described
in Sect. 2, the IFFT and FFT computations must be completed in about 193 ns while
using single (32-bit) data precision to achieve over 10 Gbps throughput in the OFDM-
based system. The FFT input and output data can be treated as a stream, and they can
be represented as arrays. Each signal value (C_k) is a complex number, which includes
real (a) and imaginary (b) parts in the form C_k = a + bi. The real and imaginary parts of
the signal values can be represented as float values in memory by two different
arrays. The lengths of the arrays depend on the batch size, so the array lengths are
512 × batchsize. Each complex value can be represented as 2 floats, so its size is
8 bytes. Thus the array size in memory is equal to 8 × 512 × batchsize bytes. These
memory spaces for input and output data must be allocated in system memory before
the parallel function can run.
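A sketch of the corresponding allocations and transfers (assumed variable names such as h_in_re/h_in_im for the host arrays; error checking omitted):

size_t n = 512 * batchsize;                       // complex points in one batch
float *d_in_re, *d_in_im, *d_out_re, *d_out_im;   // separate real/imaginary arrays
cudaMalloc((void **)&d_in_re,  n * sizeof(float));
cudaMalloc((void **)&d_in_im,  n * sizeof(float));
cudaMalloc((void **)&d_out_re, n * sizeof(float));
cudaMalloc((void **)&d_out_im, n * sizeof(float));
// H2D transfer of one batch before the kernel call, D2H transfer of the results after it.
cudaMemcpy(d_in_re, h_in_re, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_in_im, h_in_im, n * sizeof(float), cudaMemcpyHostToDevice);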
To implement the previous enhancements and to apply the following ones, a standard signal
value representation must be defined. There are two different data formats in the CUDA
environment for defining a complex value: the first form uses the float2 type, a variable
defined with a single variable name and pointer; the second way uses separate variables,
i.e., two float variables per complex number, one for the real part and one for the imaginary
part. With the second data type definition, every memory access becomes a 32-bit access;
otherwise it is a 64-bit access. Shared memory banks are organized such that successive
32-bit words are assigned to successive banks, and the bandwidth is 32 bits per bank per
clock cycle, so a 32-bit access is better than a 64-bit access. By changing the representation
of complex numbers to float values with separate real and imaginary parts, we can reduce
the transfer and computation time.

__global__ void RunTestBA(float *In, float *Out, int sign) {
  ...
  tempsum = BlockShift + ThreadID;
  Ax1 = In[tempsum];
  Bx1 = In[tempsum + 256];
  ...
  Omegax = cosf(-0.012271846f * ThreadID);
  Omegay = sinf(-0.012271846f * ThreadID);
  Value0 = Ax1 + Bx1;
  Value4 = (Ax1 - Bx1) * Omegax - (Ay1 - By1) * Omegay;
  ... }

Fig. 6 Single-precision floating-point operation implementation

Fig. 7 Overwriting approach to reduce the shared memory size (a single shared memory buffer of 512 floats is reused: (1) write real part of results, (2) read real part of data, then (3) write imaginary part of results, (4) read imaginary part of data, with synchronization between the steps)

Using two float variables to represent a single signal value in memory brings additional
enhancement opportunities besides organizing 32-bit memory access patterns. The first
of them is using single-precision floating-point operations, as shown in Fig. 6, while
implementing the operations that calculate the twiddle factor values. The GPU default mode
for arithmetic on floating-point values is double precision. It is converted from the default
double-precision to single-precision math operations by adding the f suffix to the values and
by using the -use_fast_math compiler option. By using single-precision floating-point
operations, we achieved better computation performance results.
Another enhancement opportunity is reducing the shared memory (SMEM) requirement
to half the size for each 512-point FFT computation data set. As shown in Fig. 7, at the end
of each stage, 512 × 2 float values (4 bytes each) are transferred from register space to
shared memory for a single 512-point FFT process. Thus, the shared memory size (48 KB
at most) is a limiting factor for parallel execution on the GPU, as seen from Table 2. By
reducing the shared memory usage, the number of threads that can fit into the shared memory
of a single SM can be increased. Instead of writing the real and imaginary parts of the data
to the shared memory together, first the real parts are written and then moved back to
register space, and finally the imaginary values are written to the shared memory. By
swapping the real and imaginary parts between register space and the shared memory, the
shared memory requirement becomes half the size for a single 512-point FFT process.
So the number of threads, which is limited by the shared memory size in an SM, is doubled,
and the maximum number of parallel blocks reaches 8192 from 4096.
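A minimal sketch of this overwriting scheme (illustrative names and a simplified one-thread-per-point, 512-thread block; the partner index would follow the stage's actual butterfly pattern):

__global__ void stage_exchange(float *g_re, float *g_im)
{
    __shared__ float s_buf[512];               // single 512-float buffer, reused twice per stage
    int writeIdx = threadIdx.x;                // where this thread writes its result
    int readIdx  = (threadIdx.x + 256) % 512;  // illustrative partner index for the next stage
    float reg_re = g_re[threadIdx.x];
    float reg_im = g_im[threadIdx.x];

    // Steps 1-2 of Fig. 7: exchange the real parts through shared memory.
    s_buf[writeIdx] = reg_re;  __syncthreads();
    reg_re = s_buf[readIdx];   __syncthreads();

    // Steps 3-4: reuse the same buffer for the imaginary parts.
    s_buf[writeIdx] = reg_im;  __syncthreads();
    reg_im = s_buf[readIdx];   __syncthreads();

    g_re[threadIdx.x] = reg_re;
    g_im[threadIdx.x] = reg_im;
}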

__constant__ float sinsin[256] = {0.0000000f, -0.0122713f, ...};
__constant__ float coscos[256] = {1.0000000f, 0.9999247f, ...};
__global__ void RunTestBA(float *In, float *Out, int sign) {
  ...
  tempsum = BlockShift + ThreadID;
  Ax1 = In[tempsum];
  Bx1 = In[tempsum + 256];
  ...
  Omegax = coscos[ThreadID];
  Omegay = sinsin[ThreadID];
  Value0 = Ax1 + Bx1;
  Value4 = (Ax1 - Bx1) * Omegax - (Ay1 - By1) * Omegay;
  ... }

Fig. 8 The implementation of the CUDA code for the sine and cosine look-up tables

After applying the new data type definition and memory usage optimizations, the NVIDIA
occupancy calculator measures the occupancy as 66 %. In particular, after applying the
overwriting enhancement on shared memory, the NVIDIA profiler outputs show that the
number of active warps increases from 16 to 32. A better performance result is achieved,
but the limitations on the warp size, register and shared memory usage still constrain how
the code runs in parallel on the GPU.

4.3 Using Look-Up Tables Instead of Calculation of Trigonometric Functions

The next enhancement concerns the calculation operations for the GPU-accelerated
parallel Cooley–Tukey-based FFT computation algorithm. In the Fermi architecture,
there are four special function units (SFU) and 32 CUDA cores on each SM. Trigono-
metric functions are calculated in the SFUs, not on the cores. Thus, in this case, a memory
transfer delay is acceptable instead of performing many trigonometric calculations on the
limited number of SFUs. The cosine and sine values of the twiddle factors in
the stages are the same for every thread because of the Cooley–Tukey algorithm,
so there is no need to calculate the sine and cosine values for each parallel thread.
Thus, a pre-computed look-up table is created to obtain the trigonometric function
results instead of computing them each time. There is a small but fast on-chip memory space
in the CUDA architecture called constant memory. The constant
memory space is used to keep the trigonometric look-up tables. Part of the trigonometric
look-up tables and their usage can be seen in Fig. 8.
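The tables themselves could be filled from the host as follows (a sketch assuming the entries are samples of e^{-2*pi*i*k/512}, which is consistent with the constants shown in Fig. 8):

float h_sin[256], h_cos[256];
for (int k = 0; k < 256; ++k) {
    h_sin[k] = sinf(-2.0f * 3.14159265f * k / 512.0f);  // k = 1 -> about -0.01227
    h_cos[k] = cosf(-2.0f * 3.14159265f * k / 512.0f);  // k = 1 -> about  0.9999247
}
// Copy into the __constant__ arrays declared in Fig. 8.
cudaMemcpyToSymbol(sinsin, h_sin, sizeof(h_sin));
cudaMemcpyToSymbol(coscos, h_cos, sizeof(h_cos));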

4.4 Memory Coalescing

Perhaps the single most important performance consideration in programming for
the CUDA architecture is the coalescing of global memory accesses. The concurrent
accesses of the threads of a warp will coalesce into a number of transactions equal
to the number of cache lines necessary to service all of the threads of the warp. To
improve memory system efficiency, the accesses of scalar threads are combined into a single access to a small,

Fig. 9 Memory coalescing (fast access) and non-coalesced (slow access) representation of thread IDs mapped onto a memory region

contiguous memory region. By default, all accesses are cached through L1, which has
128-byte lines.
The first and simplest case of coalescing can be achieved by any CUDA-enabled
device: the kth thread accesses the kth word in memory, as shown in Fig. 9. Not all
threads need to participate. For example, if the threads of a warp access adjacent 4-byte
words (e.g., adjacent float values), a single 128-byte L1 cache line, and therefore a single
coalesced transaction, will service that memory access. The number of bus transactions
is minimized by memory coalescing to maximize global memory bandwidth. As we
mentioned, global memory has the slowest access time in the memory hierarchy of the GPU.
Therefore, memory coalescing optimization occupies an important place in obtaining a
better FFT computation time.
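A small illustrative kernel exhibiting the coalesced pattern of the upper part of Fig. 9 (not the paper's code): thread k of a warp reads the k-th consecutive float, so a 32-thread warp touches a single 128-byte cache line.

__global__ void copy_coalesced(const float *in, float *out)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive threads -> consecutive words
    out[k] = in[k];
}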

4.5 Reduction in Shared Memory Access

In the primary algorithm, the inputs were written to the shared memory from global
memory. In each iteration, the inputs were read into registers from shared memory and
the results were calculated in registers. After the iteration, the outputs were read again
from shared memory and written to global memory. Many threads access the shared
memory; therefore, the memory is divided into banks which are assigned by 32-bit words.
A memory can service as many simultaneous accesses as it has banks. Multiple simul-
taneous accesses to a bank result in a bank conflict, which makes the accesses serialized.
To reduce the shared memory accesses, with fewer or no bank conflicts, we changed the data
flow as shown in Fig. 10. The inputs are read into registers from global memory. In each
iteration, the calculations are done and written to shared memory, and then the inputs are read
from shared memory for the next iteration. After the iteration, the outputs are already in
the registers, so these values are used to calculate the outputs, which are written to global
memory.

4.6 Compiler-Related Optimizations

Another enhancement while calculating the FFT on the Fermi and Kepler architectures is
compiler-related optimization. The NVCC compiler of CUDA has different compute capability

...
Ax1 = In[index].x;
Bx1 = In[index2].x;
...
for (int c = 7; c > -1; c--, d <<= 1)
{
  smem[thid] = Ax1 + Bx1;
  // calculation of others
  __syncthreads();

  Ax1 = smem[index3];
  Bx1 = cosf(((thid >> c) << c) * Omega) * smem[index3 + d]
      - sinf(((thid >> c) << c) * Omega) * smem[index3 + d3 + d];
  // calculation of others
  __syncthreads();
}
Out[index].x = Ax1 + Bx1;
// calculation of others
...

Fig. 10 Data flow after the SMEM enhancement

and architecture support options. The compiler-related optimizations may depend on
the CUDA version. In this paper, we used the CUDA 5.0 release and the following compiler
options:
– The compiler must be set to compute capability 2.0 by using the -arch=sm_20 parameter
for Fermi, and to 3.5 by using -arch=sm_35 for Kepler.
– The compiler option -use_fast_math, which forces each function to compile to its
intrinsic counterpart, must be set. In addition to reducing the accuracy of the affected
functions, it may also cause some differences in special case handling. A more
robust approach is to selectively replace mathematical function calls by calls to
intrinsic functions only where it is merited by the performance gains and where
changed properties such as reduced accuracy and different special case handling
can be tolerated.
– Limit the maximum register usage per thread to 32 on the Fermi architecture by
using the -maxrregcount compiler option. There is no need to limit it on Kepler, so
it uses up to 64 registers as the default value.
– Turn off the compiler heuristic code optimizations.
A summary of the optimizations for both architectures is given in Table 4.
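Put together, the corresponding nvcc invocations would look roughly like this (an illustrative sketch; the source file name is hypothetical):

nvcc -arch=sm_20 -use_fast_math -maxrregcount=32 -o fft_fermi  fft.cu    # Fermi GTX590
nvcc -arch=sm_35 -use_fast_math                  -o fft_kepler fft.cu    # Kepler K20c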

5 Experimental Results

Our work is implemented in the CUDA 5.0 environment and executed on Tesla K20c
and GeForce GTX590 GPUs. The Cooley–Tukey-based FFT algorithm is designed
and optimized according to the architecture used: Algorithm 1 is optimized for the Fermi-
based GPU and Algorithm 2 for the Kepler-based GPU. All the experiments
are conducted on a 3.20 GHz Intel Core i7-960 CPU with 12 GB of memory.
A soft modem, shown in Fig. 11, consisting of the modulator and demodulator modules
of the OFDM-based system, is implemented in MATLAB by using the MATLAB

Table 4 Summary of optimizations for the Fermi and Kepler implementations

Branch reduction: no operator overload; embedded function
Data type definition: float values for input/output; float2 values for input/output; floating-point operations
Trigonometric functions: using look-up tables; computing with SFU
Memory coalescing
SMEM optimizations: half-size usage by swapping; reducing of access count
Compiler optimizations

Fig. 11 Integration of the OFDM soft modem with CUDA-based parallel FFT computation (input data, other modulator modules (FEC 5/8 and 64-QAM) in MATLAB R2012b, IFFT in CUDA & C++ via a MEX interface, a 512 sub-channel F/O channel, FFT in CUDA & C++ via a MEX interface, other demodulator modules (FEC 5/8 and 64-QAM) in MATLAB R2012b, output data)

Communications System Toolbox. By creating a special MEX interface between the C++
programming side and MATLAB, the CUDA-based parallel FFT calculation functions can
be accessed from the MATLAB environment. Data conversions are used to establish
a correct format between the CUDA code and the other functions of the soft modem apart
from the FFT/IFFT process. The soft modem validates the results; thus, our algorithms can
be used in the simulation of the OFDM-based system.

5.1 Parallel FFT Computation Performance on GPUs

The GeForce GTX590 is an example of a Fermi-based graphics card, which consists
of 512 CUDA cores, with groups of 32 CUDA cores organized into 16 SMs. Note
that the GeForce GTX590 has two GPU processors, but in this paper we used only one of

Fig. 12 Time per FFT (ns, left axis) and GFLOPS (right axis) versus batch size (256–4096), executed on GTX590

them. Compiled with the NVCC compiler in an MS Windows 7 OS environment,
Algorithm 1 is able to perform the process in real time. The performance of Algorithm 1
with different batch sizes can be seen in Fig. 12.
In the OFDM-based system used in this work, there are 336 data subcarriers and
each subcarrier holds 6 bits of data. This means that each 512-point FFT produces 336 × 6
bits = 2016 bits of payload data. The algorithm with the optimized batch size achieves a 164
ns computation time per FFT. Accordingly, the throughput of the FFT process with the
GeForce GTX590 becomes as shown in Eq. 1:

\frac{336 \times 6\ \text{bits}}{164 \times 10^{-9}\ \text{s}} = \frac{2016\ \text{bits}}{164 \times 10^{-9}\ \text{s}} = 12.29\ \text{Gbps} \qquad (1)

On the other hand, the Tesla K20c is an example of a Kepler-based graphics card,
which consists of 2496 CUDA cores, with groups of 192 CUDA cores organized
into 13 SMX units, roughly a five-fold increase in the number of cores. To achieve
a better result with the K20c, the number of threads was increased. To calculate the 512-
point FFT, the 9-stage algorithm design is used instead of the 3-stage one,
as shown in Fig. 5, and each stage calculates only a 2-point FFT as demonstrated in
Fig. 18. To achieve the size of 512, 256 independent threads (256 × 2 = 512) are
used in parallel. The Tesla K20c has extra special function units compared to the Fermi
architecture, so mathematical functions (like sine and cosine) are calculated faster than
on the GTX590. Fetching the trigonometric values from look-up tables is therefore more
costly than computing them on this architecture; thus, the twiddle factors are calculated
directly in the computation process. The other enhancements implemented in the code are
the same as on the Fermi architecture. Algorithm 2 achieves 85 ns per FFT computation

Fig. 13 Time per FFT (ns, left axis) and GFLOPS (right axis) versus batch size (256–4096), executed on K20c

as its peak performance; thus, it performs a real-time computation like Algorithm 1. Its
performance with different batch sizes can be seen in Fig. 13.
The throughput of the FFT block with the Tesla K20c becomes as shown in Eq. 2:

\frac{336 \times 6\ \text{bits}}{85 \times 10^{-9}\ \text{s}} = \frac{2016\ \text{bits}}{85 \times 10^{-9}\ \text{s}} = 23.71\ \text{Gbps} \qquad (2)

5.2 Impact of SMEM Size

For both algorithms, the biggest bottleneck is the shared memory. Algorithm 1 calculates
two FFTs in one block and writes 512 × 2 float values (real or imaginary parts) to
SMEM at the same time. Likewise, Algorithm 2 calculates one FFT in one block and writes
512 float2 values (real and imaginary parts) to SMEM at the same time. That means
one block uses 4096 bytes (4 KB) of SMEM. The available number of blocks in SMEM can
be seen in Fig. 14.
While 12 FFT blocks can be stored in SMEM, we can only calculate 8 blocks at
the same time because of the thread count limitation in the SM/SMX. The performance
with different SMEM sizes can be seen in Fig. 15.
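The arithmetic behind these figures (using only the sizes quoted above and the 48 KB maximum shared memory configuration of Table 2):

512 \times 2 \times 4\ \text{bytes} = 4096\ \text{bytes per block}, \qquad \frac{48\ \text{KB}}{4\ \text{KB}} = 12\ \text{blocks}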

5.3 Comparison of Performance

Our GPU-based FFT algorithms can be compared with NVIDIA's cuFFT library. The
cuFFT algorithm achieves better performance than our algorithms when the batch
size is more than 2048, as seen in Fig. 16. However, our algorithms are able to perform

Fig. 14 Block placement in SMEM (blocks #1–#12 fit within the 16 KB, 32 KB, and 48 KB shared memory configurations)

Fig. 15 The throughput (Gbps) according to SMEM size for GTX590 (SM) and K20c (SMX)

a real-time process when the batch size is 1024. The FFT process must be
finished in at most 193 ns to give a real-time result according to the IEEE 802.15.3c
specification. As we mentioned in Sect. 5.1, the proposed algorithms finish in under 190 ns.
In other words, cuFFT calculates the FFT process faster than ours when a larger batch size
is used, but we do not need to increase the batch size to perform the process in real time. If
the batch size is increased, cuFFT meets the deadline of the real-time process; however, the
memory requirement for buffering will increase. At that point, the proposed algorithms are
faster than the cuFFT algorithm and provide a real-time process.
Moreover, cuFFT is a closed-source library. Before calling the FFT algorithm,
a plan function is executed using the transform size, data type and number of
transforms. The process must wait until this plan function is finished; in other
words, the next process in the OFDM system cannot be started. Considering

Fig. 16 The performance comparison of FFT algorithms (Alg. 1, Alg. 2, and cuFFT) over batch sizes 256–4096

Fig. 17 Timeline comparison between cuFFT and BauFFT (pre-FFT processes, FFT, post-FFT processes, and kernel initializations over time)

accelerating all processes in the OFDM system on the GPU, only one kernel can be used
to implement the whole OFDM process without much kernel initialization overhead
for each process. That is possible with the proposed algorithms, because no plan is
required for them. The kernel usage with cuFFT and with our GPGPU-based algorithm,
namely BauFFT, is illustrated in Fig. 17. At least three kernel initializations are required to
use the cuFFT algorithm: the pre-FFT processes must be finished and the plan and input must
be prepared according to the outputs of the previous functions, and only then can the host
side call the cuFFT algorithm. On the other hand, one kernel initialization is enough to
calculate the OFDM system including the pre-FFT, FFT and post-FFT processes.

6 Conclusions

This paper presents the design, methodology and implementation of GPGPU-based
FFT algorithms for OFDM systems operating at 10 Gbps and above. The two optimized FFT

algorithms achieve 12.29 Gbps on the Fermi-based GPU and 23.71 Gbps on the
Kepler-based GPU, respectively. To achieve such high throughput, several techniques,
including algorithmic optimizations, efficient data structures and memory access
optimizations, are used as enhancements on the first-pass form of the algorithm for the
Fermi- and Kepler-based GPUs separately. Future work includes the implementation of
all processes of the OFDM system on the GPU and the integration of the proposed FFT
algorithm to decrease the computation time of the OFDM system.

Acknowledgments A part of this work is financially supported by KDDI R&D Laboratories Inc., Japan.

Appendix

FFT Algorithm Design

An FFT process computes the discrete Fourier transform (DFT) for a set of signal data,
and it produces exactly the same result as evaluating the DFT definition directly; the
only difference is that an FFT is much faster. The DFT is obtained by decomposing
a sequence of values into components of different frequencies. An FFT is a way to
compute the same result more quickly than computing the DFT of N points in the naive way.
The difference in speed can be enormous, especially for long data sets where N may
be in the thousands or more.


X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k \frac{n}{N}}, \qquad k = 0, \ldots, N-1 \qquad (3)

If we assume that $x_0, \ldots, x_{N-1}$ are complex numbers, Eq. 3 defines the DFT. Eval-
uating this definition directly requires $O(N^2)$ operations: there are $N$ outputs ($X_k$),
and each output requires a sum of $N$ terms. An FFT is a method to compute the same
results in $O(N \log N)$ operations.

Cooley–Tukey Algorithm

The publication by Cooley and Tukey in 1965 of an efficient algorithm for the cal-
culation of the DFT was a major turning point in the development of digital signal
processing [8]. Then, various extensions and modifications were made to the original
algorithm [7]. By far the most commonly used FFT is the Cooley–Tukey algorithm,
compared with others like the prime-factor FFT algorithm [9], Bruun's FFT algorithm [5],
Rader's FFT algorithm [21], and Bluestein's FFT algorithm [3].

y_0 = x_0 + x_1 w^k, \qquad y_1 = x_0 - x_1 w^k \qquad (4)

The FFT butterfly operation, which is the basic calculation element in the FFT
process, takes two complex points and converts them into two other complex points

Fig. 18 Butterfly operation in FFT (2-point input, 2-point output)

Fig. 19 8-point FFT calculation flow diagram (unordered input, ordered output; even and odd points x(0)–x(7) are combined over three stages of butterfly operations to produce y(0)–y(7))

as shown in Fig. 18. In the case of the 2-point (radix-2) Cooley–Tukey algorithm, the
butterfly is simply a DFT of size 2 that takes two complex inputs ($x_0$, $x_1$), which are
the corresponding outputs of the two sub-transforms, and gives two complex outputs ($y_0$,
$y_1$) by using Eq. 4. The $w^k$ is called the twiddle factor. A twiddle factor in FFT algorithms
is any of the trigonometric constant coefficients that are multiplied by the data in the
course of the algorithm.
More specifically, twiddle factors originally referred to the root-of-unity complex
multiplicative constants in the butterfly operations of the Cooley–Tukey FFT algo-
rithm, used to recursively combine smaller discrete Fourier transforms [29]. This
remains the term's most common meaning, but it may also be used for any data-
independent multiplicative constant in an FFT.
An N-point signal is decomposed into N signals each containing a single point. Each
stage of the FFT uses an interlace decomposition, separating the even- and odd-numbered
samples [25]. The DFT of an N-point sequence can be simply calculated from the two
N/2-point DFTs of the even index terms $x_0, x_2, \ldots, x_{N-2}$ and the odd index terms
$x_1, x_3, \ldots, x_{N-1}$; those two results are then combined to produce the DFT of the whole
sequence [4,17,20]. This idea can then be applied recursively to reduce the overall
runtime to $O(N \log N)$.
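A minimal host-side reference sketch of this recursive even/odd decomposition (plain C++, not the GPU implementation; names are illustrative):

#include <complex>
#include <vector>
#include <cmath>

using cpx = std::complex<float>;

// Recursive radix-2 decimation-in-time Cooley-Tukey FFT (N must be a power of two).
std::vector<cpx> fft_recursive(const std::vector<cpx>& x) {
    const std::size_t N = x.size();
    if (N == 1) return x;                          // DFT of one point is the point itself

    std::vector<cpx> even(N / 2), odd(N / 2);
    for (std::size_t i = 0; i < N / 2; ++i) {      // interlace decomposition
        even[i] = x[2 * i];
        odd[i]  = x[2 * i + 1];
    }
    std::vector<cpx> E = fft_recursive(even);
    std::vector<cpx> O = fft_recursive(odd);

    const float PI = 3.14159265358979f;
    std::vector<cpx> X(N);
    for (std::size_t k = 0; k < N / 2; ++k) {
        float a = -2.0f * PI * static_cast<float>(k) / static_cast<float>(N);
        cpx w(std::cos(a), std::sin(a));           // twiddle factor w^k = e^{-2*pi*i*k/N}
        X[k]         = E[k] + w * O[k];            // y0 = x0 + x1 * w^k  (Eq. 4)
        X[k + N / 2] = E[k] - w * O[k];            // y1 = x0 - x1 * w^k
    }
    return X;
}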
This simplified form assumes that N is a power of two; since the number of sample
points N can usually be chosen freely by the application, this is often not an important
restriction. How many stages to use in solving the FFT depends on the algorithm that
is designed for the specific hardware capabilities. For example, to calculate a 512-point

FFT, a 3-stage algorithm (8 × 8 × 8) can be designed, where each stage calculates an 8-
point FFT. The major bottlenecks that limit the number of stages that can be used are hardware
issues like memory size and inter-processor communication costs.
There are two different approaches to calculating the 8-point FFT in each stage; the first
one, which we used, requires reordering the input data so as to provide
ordered outputs. There are three stages inside the 8-point FFT, and each stage
calculates 2-point FFT results after the calculation of the twiddle factors. The flow diagram of
the 8-point FFT calculation can be seen in Fig. 19.

References
1. G. Bergland, A parallel implementation of the fast Fourier transform algorithm. IEEE Trans. Comput. C-21(4), 366–370 (1972)
2. G. Bergland, D. Wilson, A fast Fourier transform global, highly parallel processor. IEEE Trans. Audio Electroacoust. 17(2), 125–127 (1969)
3. L.I. Bluestein, A linear filtering approach to the computation of discrete Fourier transform. IEEE Trans. Audio Electroacoust. 18(4), 451–455 (1970)
4. E.O. Brigham, The Fast Fourier Transform and its Applications, 1st edn. (Prentice-Hall Inc., Englewood Cliffs, 1988)
5. G. Bruun, z-Transform DFT filters and FFTs. IEEE Trans. Acoust. Speech Signal Process. 26(1), 56–63 (1978)
6. F. Buchali, R. Dischler, A. Klekamp, M. Bernhard, D. Efinger, Realisation of a real-time 12.1 Gb/s optical OFDM transmitter and its application in a 109 Gb/s transmission system with coherent reception, in European Conference on Optical Communication (ECOC), pp. 1–2 (2009)
7. J.W. Cooley, P.A.W. Lewis, P.D. Welch, Historical notes on the fast Fourier transform. Proc. IEEE 55(10), 1675–1677 (1967)
8. J.W. Cooley, J.W. Tukey, An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19(90), 297–301 (1965)
9. I.J. Good, The interaction algorithm and practical Fourier analysis. J. R. Stat. Soc. B 20, 361–372 (1958)
10. N. Hinitt, T. Kocak, GPU-based FFT computation for multi-gigabit wirelessHD baseband processing. EURASIP J. Wirel. Commun. Netw. 2010, 359081 (2010). doi:10.1155/2010/359081
11. IEEE Standards Association. http://standards.ieee.org/findstds/standard/802.15.3c-2009.html (2013)
12. B. Inan, S. Adhikari, O. Karakaya, P. Kainzmaier, M. Mocker, H. von Kirchbauer, N. Hanik, S.L. Jansen, Real-time 93.8-Gb/s polarization-multiplexed OFDM transmitter with 1024-point IFFT. Opt. Express 19(26), B64–B68 (2011)
13. L.H. Jamieson, P.T. Mueller, H.J. Siegel, FFT algorithms for SIMD parallel processing systems. J. Parallel Distrib. Comput. 3(1), 48–71 (1986)
14. V. Kanwar, H. Thakur, N. Sharma, Performance evaluation of OFDM system under various modulation techniques and various channels. Int. J. Res. Eng. Adv. Technol. 1(3), 1–5 (2013)
15. D.B. Kirk, W.M.W. Hwu, Programming Massively Parallel Processors, 2nd edn. (Morgan Kaufmann, Boston, 2012)
16. Y. Li, J.R. Diamond, X. Wang, H. Lin, Y. Yang, Z. Han, Large-scale fast Fourier transform on a heterogeneous multi-core system. Int. J. High Perform. Comput. Appl. 26(2), 148–158 (2012)
17. A.V. Oppenheim, R. Schafer, Discrete-Time Signal Processing, 3rd edn. (Prentice-Hall Inc., Englewood Cliffs, 2009)
18. C.H. Peng, K.T. Shr, M.H. Lin, Y.H. Huang, A baseband receiver for optical OFDM systems, in International Symposium on VLSI Design, Automation and Test (VLSI-DAT), pp. 1–4 (2011)
19. D. Qian, T.O. Kwok, N. Cvijetic, J. Hu, T. Wang, 41.25 Gb/s real-time OFDM receiver for variable rate WDM-OFDMA-PON transmission, in Optical Fiber Communication, Collocated National Fiber Optic Engineers Conference (OFC/NFOEC), pp. 1–3 (2010)
20. L.R. Rabiner, B. Gold, Theory and Application of Digital Signal Processing, 1st edn. (Prentice-Hall Inc., Englewood Cliffs, 1975)
21. C.M. Rader, Discrete Fourier transforms when the number of data samples is prime. Proc. IEEE 56(6), 1107–1108 (1968). doi:10.1109/PROC.1968.6477
22. R. Schmogrow, M. Winter, D. Hillerkuss, B. Nebendahl, S. Ben-Ezra, J. Meyer, M. Dreschmann, M. Huebner, J. Becker, C. Koos, W. Freude, J. Leuthold, Real-time OFDM transmitter beyond 100 Gbit/s. Opt. Express 19(13), 12740–12749 (2011)
23. B.P. Sinha, J. Dattagupta, A. Sen, Improvement in the speed of FFT processors using segmented memory and parallel arithmetic units. Signal Process. 8(2), 267–274 (1985)
24. B.P. Sinha, J. Dattagupta, A. Sen, Parallel implementation of wavelet-based image denoising on programmable PC-grade graphics hardware. Signal Process. 90(8), 2396–2411 (2010)
25. S.W. Smith, The Scientist and Engineer's Guide to Digital Signal Processing (California Technical Publishing, San Diego, 1997)
26. J.B. Srivastava, R. Pandey, J. Jain, Implementation of digital signal processing algorithm in general purpose graphics processing unit (GPGPU). Int. J. Innov. Res. Comput. Commun. Eng. 1(4), 1006–1012 (2013)
27. The CUDA Occupancy Calculator. http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls (2013)
28. The CUDA Programming Guide. https://developer.nvidia.com/category/zone/cuda-zone (2013)
29. M. Vetterli, H.J. Nussbaumer, Simple FFT and DCT algorithms with reduced number of operations. Sig. Process. 6(4), 267–278 (1984)
30. M. Yoshida, T. Taniguchi, An LDPC-coded OFDM receiver with pre-FFT iterative equalizer for ISI channels, in IEEE 61st Vehicular Technology Conference, vol. 2, pp. 767–772 (2005)
