memory and registers used per thread-block, the programmer needs to balance the amount of on-chip memory resources used. Shared memory increases computational throughput by keeping data on-chip. However, an excessive amount of shared memory per thread-block reduces the number of concurrent threads and leads to vertical waste.

Although shared memory can improve the performance of a program significantly, there are several limitations. Shared memory on each SM is banked 16 ways. An access takes one cycle if 16 consecutive threads access the same shared memory address (broadcast) or if none of the threads access the same bank (one-to-one). However, a random layout with some broadcast and some one-to-one accesses will be serialized and cause a stall. The programmer may need to modify the memory access pattern to improve efficiency.

Horizontal waste occurs when there is an insufficient workload to keep all of the cores busy. On a GPU device, this occurs if the number of thread-blocks is smaller than the number of SMs. The programmer needs to create more thread-blocks to handle the workload. Alternatively, the programmer can solve multiple problems at the same time to increase efficiency. Horizontal waste can also occur within an SM, for example when the number of threads in a thread-block is not a multiple of 32. In this case, the programmer needs to divide the workload of a thread-block across multiple threads if possible. An alternative is to pack multiple sub-problems into one thread-block, matching the width of the SIMD instruction as closely as possible. However, packing multiple problems into one thread-block may increase the amount of shared memory used, which leads to vertical waste. Therefore, the programmer may need to balance horizontal waste against vertical waste to maximize performance.

As a result, it is a challenging task to implement an algorithm that keeps the GPU cores from idling: we need to partition the workload across cores and use shared memory effectively to reduce device memory accesses, while ensuring a sufficient number of concurrently executing thread-blocks to hide stalls.
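As an illustration of the bank-conflict behavior described above, the following CUDA sketch uses the common padding idiom: one extra column on a shared-memory tile so that column-wise accesses by a half-warp fall into 16 distinct banks. The kernel name, tile size, and launch configuration are hypothetical and serve only to make the access patterns visible; this is a minimal sketch, not code from the implementation discussed here.

#define TILE 16

// Transposes one 16x16 tile; assumes a 16x16 thread-block.
__global__ void transpose_tile(const float *in, float *out)
{
    // The +1 padding column makes the row stride 17, so threads of a
    // half-warp reading a column hit 16 different banks instead of one.
    __shared__ float tile[TILE][TILE + 1];

    int x = threadIdx.x;
    int y = threadIdx.y;

    tile[y][x] = in[y * TILE + x];   // row-wise store: one-to-one banks
    __syncthreads();

    out[y * TILE + x] = tile[x][y];  // column-wise load: conflict-free
                                     // only because of the padding
}

Without the padding, the column-wise load would be a 16-way bank conflict, and the half-warp's accesses would be fully serialized.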
3 Turbo Decoding Algorithm
Turbo decoding is an iterative algorithm that can achieve error performance close to the channel capacity. A Turbo decoder consists of two component decoders and two interleavers, as shown in Fig. 1. The Turbo decoding algorithm consists of multiple passes through the two component decoders, where one iteration consists of one pass through both decoders.
Figure 1. Overview of Turbo decoding.
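As a minimal sketch of this iteration structure (an assumed host-side driver, not the implementation described in this paper), the loop below alternates between the two component decoders and permutes the extrinsic LLRs between half-iterations. The names map_decode and pi are placeholders: map_decode stands in for a full component decoder pass and is left as a declaration, and pi is an assumed interleaver table.

#include <vector>

// Placeholder for one component MAP decoder pass: combines channel
// LLRs with a-priori (extrinsic) LLRs and produces new extrinsic LLRs.
void map_decode(const std::vector<float> &ch_llr,
                const std::vector<float> &apriori,
                std::vector<float> &extrinsic);

// ch0: channel LLRs in natural order; ch1: channel LLRs in
// interleaved order; pi: interleaver permutation, pi[i] in [0, n).
void turbo_decode(const std::vector<float> &ch0,
                  const std::vector<float> &ch1,
                  const std::vector<int> &pi, int iterations)
{
    int n = (int)pi.size();
    std::vector<float> ext0(n, 0.0f), ext1(n), tmp(n);

    for (int it = 0; it < iterations; ++it) {
        map_decode(ch0, ext0, tmp);      // half-iteration: decoder 0
        for (int i = 0; i < n; ++i)
            ext1[i] = tmp[pi[i]];        // interleave extrinsic LLRs

        map_decode(ch1, ext1, tmp);      // half-iteration: decoder 1
        for (int i = 0; i < n; ++i)
            ext0[pi[i]] = tmp[i];        // deinterleave for decoder 0
    }
}

One iteration thus corresponds to one pass through both decoders, matching Fig. 1.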
Although both decoders perform the same sequence of computations, the decoders generate different log-likelihood ratios (LLRs) as the two decoders have different inputs. The inputs of the first decoder are the deinterleaved extrinsic LLRs from the second decoder and the input LLRs from the channel. The inputs of the second decoder are the interleaved extrinsic LLRs from the first decoder and the input LLRs from the channel.

Each component decoder is a MAP (maximum a posteriori) decoder. The principle of the decoding algorithm is based on the BCJR or MAP algorithm. Each component decoder generates an output LLR for each information bit. The MAP decoding algorithm can be summarized as follows. To decode a codeword with N information bits, each decoder performs a forward trellis traversal to compute N sets of forward state metrics, one α set per trellis stage. The forward traversal is followed by a backward trellis traversal, which computes N sets of backward state metrics, one β set per trellis stage. Finally, the forward and the backward metrics are merged to compute the output LLRs. We will now describe the metric computations in detail.

As shown by Fig. 2, the trellis structure is defined by the encoder. The 3GPP LTE Turbo code trellis has eight states per stage. For each state in the trellis, there are two incoming paths: one path for an input bit of 0 and the other for an input bit of 1.
Figure 2. 3GPP LTE Turbo code trellis with eight states.
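To preview the forward traversal in code, the device-side sketch below computes one trellis stage of forward state metrics for an eight-state trellis under the max-log approximation, i.e., alpha_next[s] = max over b of (alpha[prev_state[s][b]] + gamma). The trellis table and the branch_metric placeholder are illustrative assumptions rather than this paper's exact formulation, which is described in the following sections; the backward recursion is symmetric, running from the last stage toward the first.

#define NUM_STATES 8

// prev_state[s][b]: predecessor of state s on the branch with input
// bit b. The table depends on the encoder and is assumed to be
// filled in elsewhere; the real LTE trellis is not hard-coded here.
__device__ int prev_state[NUM_STATES][2];

// Placeholder branch metric gamma for the transition (from -> to) at
// trellis stage k; a real decoder derives it from the channel and
// a-priori LLRs.
__device__ float branch_metric(int from, int to, int k);

// One step of the max-log-MAP forward recursion.
__device__ void forward_step(const float *alpha, float *alpha_next, int k)
{
    for (int s = 0; s < NUM_STATES; ++s) {
        int p0 = prev_state[s][0], p1 = prev_state[s][1];
        float m0 = alpha[p0] + branch_metric(p0, s, k);
        float m1 = alpha[p1] + branch_metric(p1, s, k);
        alpha_next[s] = fmaxf(m0, m1);   // max-log approximation
    }
}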