J Sign Process Syst. DOI 10.1007/s11265-011-0617-7
Implementation of a High Throughput 3GPP Turbo Decoder on GPU
Michael Wu · Yang Sun · Guohui Wang · Joseph R. Cavallaro
Received: 27 January 2011 / Revised: 1 June 2011 / Accepted: 7 August 2011
© Springer Science+Business Media, LLC 2011
Abstract  Turbo code is a computationally intensive channel code that is widely used in current and upcoming wireless standards. The general-purpose graphics processing unit (GPGPU) is a programmable commodity processor that achieves high computation throughput by using many simple cores. In this paper, we present a 3GPP LTE compliant Turbo decoder accelerator that takes advantage of the processing power of the GPU to offer fast Turbo decoding throughput. Several techniques are used to improve the performance of the decoder. To fully utilize the computational resources on the GPU, our decoder can decode multiple codewords simultaneously, divide the workload for a single codeword across multiple cores, and pack multiple codewords to fit the single instruction multiple data (SIMD) instruction width. In addition, we use shared memory judiciously to enable hundreds of concurrent threads while keeping frequently used data local so that memory accesses stay fast. To improve the efficiency of the decoder in the high SNR regime, we also present a low complexity early termination scheme based on average extrinsic LLR statistics. Finally, we examine how different workload partitioning choices affect the error correction performance and the decoder throughput.
Keywords  GPGPU · Turbo decoding · Accelerator · Parallel computing · Wireless · Error control codes · Turbo codes
M. Wu (B) · Y. Sun · G. Wang · J. R. Cavallaro
Rice University, Houston, TX, USA
e-mail: mbw2@rice.edu
1 Introduction
Turbo code [1] has become one of the most important research topics in coding theory since its discovery in 1993. As a practical code that can offer near channel capacity performance, Turbo codes are widely used in many 3G and 4G wireless standards such as CDMA2000, WCDMA/UMTS, IEEE 802.16e WiMax, and 3GPP LTE (Long Term Evolution). The inherently large decoding latency and the complex iterative decoding algorithm have made Turbo decoding very difficult to implement on a general purpose CPU or DSP. As a result, Turbo decoders are typically implemented in hardware [2–8]. Although ASIC and FPGA designs are more power efficient and can offer extremely high throughput, there are a number of applications and research fields, such as cognitive radio and software based wireless testbed platforms such as WARPLAB [9], which require support for multiple standards. As a result, we want an alternative to dedicated silicon that supports a variety of standards and yet delivers good throughput performance.

GPGPU is an alternative to dedicated silicon that is flexible and can offer high throughput. A GPU employs hundreds of cores to process data in parallel, which is well suited to a number of wireless communication algorithms. For example, many computationally intensive blocks such as channel estimation, MIMO detection, channel decoding, and digital filters can be implemented on the GPU. The authors in [10] implemented a complete 2 × 2 WiMAX MIMO receiver on the GPU. In addition, there are a number of recent papers on MIMO detection [11, 12], as well as a number of GPU based LDPC channel decoders [13]. Despite the popularity of Turbo codes, there are few existing Turbo decoder implementations on GPU [14, 15].
 
Compared to LDPC decoding, implementing a Turbo decoder on the GPU is more challenging, as the algorithm is fairly sequential and difficult to parallelize.

In our implementation, we attempt to increase computational resource utilization by decoding multiple codewords simultaneously and by dividing a codeword into several sub-blocks to be processed in parallel. As the underlying hardware architecture is single instruction multiple data (SIMD), we pack multiple sub-blocks to fit the SIMD vector width. Finally, as excessive use of shared memory decreases the number of threads that run concurrently on the device, we attempt to keep frequently used data local while reducing the overall shared memory usage. We include an early termination algorithm that evaluates the average extrinsic LLR from each decoding cycle to improve throughput performance in the high signal to noise ratio (SNR) regime. Finally, we provide both throughput and bit error rate (BER) performance of the decoder and show that we can parallelize the workload on the GPU while maintaining reasonable BER performance.

The rest of the paper is organized as follows. In Sections 2 and 3, we give an overview of the CUDA architecture and the Turbo decoding algorithm. In Section 4, we discuss the implementation aspects on the GPU. Finally, we present BER performance and throughput results and analyses in Section 5 and conclude in Section 6.
2 Compute Unified Device Architecture (CUDA)
A programmable GPU offers extremely high computation throughput by processing data in parallel using many simple stream processors (SP) [16]. Nvidia's Fermi GPU offers up to 512 SPs grouped into multiple stream multiprocessors (SM). Each SM consists of 32 SPs and two independent dispatch units. Each dispatch unit on an SM can dispatch a 32-wide SIMD instruction, a warp instruction, to a group of 16 SPs. During execution, a group of 16 SPs processes the dispatched warp instruction in a data parallel fashion. Input data is stored in a large amount of external device memory (>1 GB) connected to the GPU. As latency to device memory is high, there are fast on-chip resources to keep data on-die. The fastest on-chip resource is the register file. There is a small amount (64 KB) of fast memory per SM, split between user-managed shared memory and L1 cache. In addition, there is an L2 cache per GPU device which further reduces the number of slow device memory accesses.

There are two ways to leverage the computational power of Nvidia GPUs. Compute Unified Device Architecture (CUDA) [16] is an Nvidia-specific software programming model, while OpenCL is a portable open standard which can target different many-core architectures such as GPUs and conventional CPUs. The two programming models are very similar but use different terminology. Although we implemented our design using CUDA, the design can be readily ported to OpenCL to target other multi-core architectures.

In the CUDA programming model, the programmer specifies the parallelism explicitly by defining a kernel function, which describes a sequence of operations applied to a data set. Multiple thread-blocks are spawned on the GPU during a kernel launch. Each thread-block consists of multiple threads, where each thread is arranged on a grid and has a unique 3-dimensional ID. Using this unique ID, each thread selects a data set and executes the kernel function on the selected data set.

At runtime, each thread-block is assigned to an SM and executed independently. Thread-blocks typically are synchronized by writing to device memory and terminating the kernel. Unlike thread-blocks, threads within a thread-block, which reside on a single SM, can be synchronized through barrier synchronization and can share data through shared memory. Threads within a thread-block execute in groups of 32 threads. When 32 threads share the same set of operations, they share the same warp instruction and are processed in parallel in an SIMD fashion. If threads do not share the same instruction, the threads are executed serially.

To achieve peak performance on a programmable GPU, the programmer needs to keep the available computation resources fully utilized. Underutilization occurs due to horizontal and vertical waste. Vertical waste occurs when an SM stalls and cannot find an instruction to issue, while horizontal waste occurs when the issue width is larger than the available parallelism.

Vertical waste occurs primarily due to pipeline stalls, which happen for several reasons. As the floating point arithmetic pipeline is long, register-to-register dependencies can cause a multi-cycle stall. In addition, an SM can stall waiting for device memory reads or writes. In both cases, the GPU has hardware support for fine-grained multithreading to hide stalls: multiple concurrent threads can be mapped onto an SM and executed on it simultaneously, and the GPU minimizes stalls by switching to another independent warp instruction whenever one stalls. In the case where a stall is due to memory access, the programmer can fetch frequently used data into shared memory to reduce memory access latency.
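To make the programming model above concrete, the following is a minimal, generic CUDA sketch rather than code from our decoder: each thread derives a unique global index from its thread and block IDs, the thread-block stages data in shared memory, and a barrier synchronization makes the staged data visible to all threads in the block. The kernel name, block size, and the simple scaling operation are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 128  // threads per thread-block (a multiple of the 32-wide warp)

// Illustrative kernel: each thread-block loads a tile of the input into shared
// memory, synchronizes at a barrier, and then each thread scales its own element.
__global__ void scale_kernel(const float *in, float *out, int n, float gain)
{
    __shared__ float tile[BLOCK_SIZE];        // user-managed on-chip shared memory

    int tid = threadIdx.x;                    // thread index within the thread-block
    int idx = blockIdx.x * blockDim.x + tid;  // unique global index from thread/block IDs

    if (idx < n)
        tile[tid] = in[idx];                  // one coalesced read from device memory
    __syncthreads();                          // barrier: the tile is now visible to the whole block

    if (idx < n)
        out[idx] = gain * tile[tid];          // compute from the fast on-chip copy
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    // Grid of thread-blocks; each block is assigned to an SM and executed independently.
    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    scale_kernel<<<blocks, BLOCK_SIZE>>>(d_in, d_out, n, 0.5f);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Decoding many codewords or sub-blocks at once then amounts to launching enough such thread-blocks to keep every SM supplied with independent warps.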
 
However, as the number of concurrent threads is limited by the amount of shared memory and registers used per thread-block, the programmer needs to balance the amount of on-chip memory resources used. Shared memory increases computational throughput by keeping data on-chip, but an excessive amount of shared memory used per thread-block reduces the number of concurrent threads and leads to vertical waste.

Although shared memory can improve the performance of a program significantly, there are several limitations. Shared memory on each SM is banked 16 ways. An access takes one cycle if 16 consecutive threads access the same shared memory address (a broadcast) or if none of the threads access the same bank (one-to-one). However, a mixed layout with some broadcast and some one-to-one accesses will be serialized and cause a stall. The programmer may need to modify the memory access pattern to improve efficiency.

Horizontal waste occurs when there is an insufficient workload to keep all of the cores busy. On a GPU device, this occurs if the number of thread-blocks is smaller than the number of SMs. The programmer needs to create more thread-blocks to handle the workload; alternatively, the programmer can solve multiple problems at the same time to increase efficiency. Horizontal waste can also occur within an SM, for example when the number of threads in a thread-block is not a multiple of 32. In this case, the programmer needs to divide the workload of a thread-block across multiple threads if possible. An alternative is to pack multiple sub-problems into one thread-block, as close to the width of the SIMD instruction as possible. However, packing multiple problems into one thread-block may increase the amount of shared memory used, which leads to vertical waste. Therefore, the programmer may need to balance horizontal waste against vertical waste to maximize performance.

As a result, it is a challenging task to implement an algorithm that keeps the GPU cores from idling: we need to partition the workload across cores and use shared memory effectively to reduce device memory accesses, while ensuring a sufficient number of concurrently executing thread-blocks to hide stalls.
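The bank behavior just described can also be illustrated with a short sketch. The 16 × 16 tile, the kernel name, and the one-column padding below are illustrative assumptions, not details of the decoder in this paper; padding a shared array by one element per row is simply a common way to turn a conflicting column access into a conflict-free one on a 16-bank SM.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel showing shared-memory bank behavior on a 16-bank SM
// (one 32-bit word per bank). Only the first 16 threads of the block are used.
__global__ void bank_access_demo(float *out)
{
    // Unpadded tile: element (r, c) sits at address r*16 + c, i.e. in bank c for every row.
    __shared__ float tile[16][16];
    // Padded tile: the extra column shifts each row by one bank, so a column read by
    // 16 consecutive threads touches 16 different banks.
    __shared__ float padded[16][17];

    int t = threadIdx.x;

    if (t < 16) {                     // the first half-warp fills the elements read below
        tile[0][t]   = (float)t;
        tile[t][0]   = (float)t;
        padded[t][0] = (float)t;
    }
    __syncthreads();                  // every thread in the block reaches the barrier

    if (t < 16) {
        float row_read    = tile[0][t];    // row-wise: thread t hits bank t -> one-to-one, fast
        float col_read    = tile[t][0];    // column-wise: every thread hits bank 0 -> 16-way conflict, serialized
        float padded_read = padded[t][0];  // column-wise on the padded tile: bank (17*t) % 16 = t -> no conflict

        out[t] = row_read + col_read + padded_read;   // keep the loads live
    }
}
```

When a conflicting pattern cannot be avoided by reordering the computation, padding the shared array in this way is usually the cheaper fix, at the cost of a small amount of extra shared memory.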
3 Turbo Decoding Algorithm
Turbo decoding is an iterative algorithm that can achieve error performance close to the channel capacity. A Turbo decoder consists of two component decoders and two interleavers, as shown in Fig. 1. The Turbo decoding algorithm consists of multiple passes through the two component decoders, where one iteration consists of one pass through both decoders.
Figure 1  Overview of Turbo decoding. (The diagram shows Decoder 0 and Decoder 1 exchanging extrinsic LLRs La and Le through the interleaver Π and deinterleaver Π^-1, with the channel LLRs Lc(y^s), Lc(y^p0), and Lc(y^p1) as inputs.)
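The iteration structure in Fig. 1 can be sketched as a host-side control loop. This is only an illustration of the data flow: the kernel names (map_decoder_kernel, permute_kernel), their argument lists, and the fixed iteration count are hypothetical placeholders rather than the interface of the decoder presented in this paper.

```cuda
#include <cuda_runtime.h>

// Hypothetical MAP half-iteration kernel; its body (the forward/backward traversal
// described in the text) is not shown here. When 'pi' is non-null the kernel reads
// the systematic channel LLRs through that index table (interleaved order).
__global__ void map_decoder_kernel(const float *Lc_sys, const float *Lc_par,
                                   const float *La, float *Le, int n, const int *pi);

// Hypothetical gather kernel: out[i] = in[idx[i]], so passing the interleaver table
// produces the interleaved sequence and passing its inverse undoes it.
__global__ void permute_kernel(const float *in, float *out, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[idx[i]];
}

// Host-side control loop for one codeword (sketch).
void turbo_decode(const float *d_Lc_sys,  // channel LLRs, systematic bits (natural order)
                  const float *d_Lc_p0,   // channel LLRs, parity bits of encoder 0
                  const float *d_Lc_p1,   // channel LLRs, parity bits of encoder 1
                  float *d_La,            // a priori LLRs fed to the active decoder
                  float *d_Le,            // extrinsic LLRs produced by the active decoder
                  const int *d_pi,        // interleaver indices
                  const int *d_pi_inv,    // deinterleaver indices
                  int n, int max_iter, dim3 grid, dim3 block)
{
    for (int it = 0; it < max_iter; ++it) {
        // First half-iteration: decoder 0 uses the deinterleaved extrinsic LLRs from
        // decoder 1 (in d_La) together with the channel LLRs in natural order.
        map_decoder_kernel<<<grid, block>>>(d_Lc_sys, d_Lc_p0, d_La, d_Le, n, nullptr);

        // Interleave decoder 0's extrinsic output to become decoder 1's a priori input.
        permute_kernel<<<grid, block>>>(d_Le, d_La, d_pi, n);

        // Second half-iteration: decoder 1 works on the interleaved sequence.
        map_decoder_kernel<<<grid, block>>>(d_Lc_sys, d_Lc_p1, d_La, d_Le, n, d_pi);

        // Deinterleave decoder 1's extrinsic output for the next pass through decoder 0.
        permute_kernel<<<grid, block>>>(d_Le, d_La, d_pi_inv, n);

        // An early-termination test on the average extrinsic LLR magnitude, as proposed
        // later in the paper, would be evaluated here to break out of the loop early.
    }
}
```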
Although both decoders perform the same sequence of computations, they generate different log-likelihood ratios (LLRs) because the two decoders have different inputs. The inputs of the first decoder are the deinterleaved extrinsic LLRs from the second decoder and the input LLRs from the channel. The inputs of the second decoder are the interleaved extrinsic LLRs from the first decoder and the input LLRs from the channel.

Each component decoder is a MAP (maximum a posteriori) decoder. The principle of the decoding algorithm is based on the BCJR or MAP algorithm [17]. Each component decoder generates an output LLR for each information bit. The MAP decoding algorithm can be summarized as follows. To decode a codeword with N information bits, each decoder performs a forward trellis traversal to compute N sets of forward state metrics, one α set per trellis stage. The forward traversal is followed by a backward trellis traversal which computes N sets of backward state metrics, one β set per trellis stage. Finally, the forward and the backward metrics are merged to compute the output LLRs. We will now describe the metric computations in detail.

As shown in Fig. 2, the trellis structure is defined by the encoder. The 3GPP LTE Turbo code trellis has eight states per stage. For each state in the trellis, there are two incoming paths, one corresponding to an input bit of ub = 0 and one to ub = 1.
Figure 2  3GPP LTE Turbo code trellis with eight states. (The diagram shows the transitions from trellis stage k to stage k+1; each state has one incoming branch for ub = 1 and one for ub = 0.)
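To make the forward traversal concrete, the following device-side sketch computes one α update over the eight-state trellis in the max-log domain. The trellis connectivity tables, the ±1 bit mapping, the branch-metric form, and the omission of metric normalization are assumptions made for this sketch, not details taken from the paper's kernels.

```cuda
#include <cuda_runtime.h>

#define NUM_STATES 8   // the 3GPP LTE constituent code trellis has eight states

// Trellis connectivity (placeholder values): prev_state[u][s] is the predecessor of
// state s when the information bit is u, and parity_bit[u][s] is the parity output on
// that branch. The real tables are fixed by the LTE constituent encoder polynomials.
__constant__ int prev_state[2][NUM_STATES];
__constant__ int parity_bit[2][NUM_STATES];

// One forward-recursion step (stage k -> k+1) in the max-log domain.
// alpha_in / alpha_out hold the eight forward state metrics; Lc_sys and Lc_par are the
// channel LLRs of the systematic and parity bits at stage k; La is the a priori LLR.
__device__ void alpha_step(const float *alpha_in, float *alpha_out,
                           float Lc_sys, float Lc_par, float La)
{
    for (int s = 0; s < NUM_STATES; ++s) {
        // Branch metrics of the two incoming paths, with bits mapped to +/-1
        // (gamma = 0.5*u*(Lc_sys + La) + 0.5*p*Lc_par under one common convention).
        float g0 = -0.5f * (Lc_sys + La) + 0.5f * (parity_bit[0][s] ? Lc_par : -Lc_par);
        float g1 =  0.5f * (Lc_sys + La) + 0.5f * (parity_bit[1][s] ? Lc_par : -Lc_par);

        // Max-log combination of the u = 0 and u = 1 paths entering state s.
        float m0 = alpha_in[prev_state[0][s]] + g0;
        float m1 = alpha_in[prev_state[1][s]] + g1;
        alpha_out[s] = fmaxf(m0, m1);

        // A full decoder would also normalize the metrics (e.g. subtract alpha_out[0])
        // to keep them bounded over long trellises; that step is omitted here.
    }
}
```

The backward (β) recursion has the same structure but traverses the trellis in the opposite direction, and the output LLR of each stage is obtained by combining the α, branch, and β metrics of all branches with ub = 1 against those with ub = 0.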
