

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 15, NO. 7, JULY 2007


SIMD Processor-Based Turbo Decoder Supporting Multiple Third-Generation Wireless Standards
Myoung-Cheol Shin, Member, IEEE, and In-Cheol Park, Senior Member, IEEE
Abstract—A programmable turbo decoder is designed to support multiple third-generation wireless communication standards. We propose a hybrid architecture of hardware and software that combines the small size, low power, and high performance of hardware implementations with the flexibility and programmability of software. It mainly consists of a configurable hardware soft-input soft-output (SISO) decoder and a 16-b single-instruction multiple-data (SIMD) processor, which is equipped with five processing elements and special instructions customized for interleaving in order to provide interleaved data at the speed of the hardware SISO. A fast and flexible software implementation of the block interleaving algorithm is also proposed. The interleaver generation is split into two parts, preprocessing and on-the-fly generation, to reduce the timing overhead of changing the interleaver structure. We present detailed descriptions of the interleaving implementation applied to the W-CDMA and cdma2000 standard turbo codes. The decoder occupies 8.90 mm² in a 0.25-µm CMOS process with five metal layers and exhibits a maximum decoding rate of 5.48 Mb/s.

Index Terms—cdma2000, parallel algorithm, turbo code, turbo interleaver, W-CDMA.



AS TURBO codes [1], or parallel concatenated convolutional codes, have extremely impressive performance, they have entered the field of standardized systems in recent years. One of the most important examples is the channel coding for high-speed data transmission in third-generation (3G) mobile communication systems such as W-CDMA [2] and cdma2000 [3]. Flexible and programmable turbo decoders are required for 3G communications for two reasons: 1) global roaming is recommended between different 3G standards and 2) even within a standard, the turbo coding frame size may change on a frame-by-frame basis.

The turbo decoder consists of interleavers and soft-input soft-output (SISO) decoders that decode recursive systematic convolutional (RSC) codes. Flexible and programmable implementation is especially needed for the turbo interleaver, as each 3G standard has a distinct and complicated interleaver. The simplest approach to implementing an interleaver is to store the interleaved patterns in a ROM. This approach is not adequate for a turbo decoder supporting multiple wireless standards, as

it needs a large ROM to store all of the possible interleaved patterns. Although some implementations based on digital signal processors (DSPs) are programmable enough to support various standards, their performance is far below the maximum bit rate of up-to-date wireless systems [4], [5]. In [6], Bekooij et al. proposed a flexible turbo decoder in the form of a very long instruction word (VLIW) microprocessor. However, dedicated hardware blocks, which are not programmable, are employed to implement its interleaver and SISO decoder.

In this paper, we propose a multiple-standard turbo decoder that combines a dedicated hardware part, which processes computationally intensive but regular tasks such as SISO decoding, with a software part running on a programmable single-instruction multiple-data (SIMD) processor for the tasks that require flexibility. The turbo interleaving, which differs largely between standards, is also implemented in software. In this way, we can achieve the small area, high performance, and low power consumption of hardware, as well as the flexibility and programmability of software needed to support multiple standards.

In addition, we address a software interleaving implementation that can change the interleaver structure in a very short period of time and requires only a small amount of memory. The interleaver construction is split into two parts, preprocessing and on-the-fly generation, in order to hide the timing overhead of interleaver changing effectively. Since the proposed SIMD processor is equipped with instructions specialized for turbo interleaving algorithms, the processor can provide interleaved data at the speed of the hardware SISO decoder.

The remainder of this paper is organized as follows. Section II presents the fundamentals of turbo coding and iterative decoding, leading to a discussion on the turbo codes adopted by 3G wireless systems in Section III.
In Section IV, the proposed turbo decoder is described in detail. Section V explains the proposed interleaver implementation, and Section VI describes its application to standards including W-CDMA and cdma2000. In Section VII, experimental results on the implemented chip are presented. Finally, we make concluding remarks in Section VIII.

II. ITERATIVE DECODING OF TURBO CODES

Here, we summarize the most essential parts of iterative turbo decoding and the notation to be used in the following sections. A turbo encoder consists of two binary RSC encoders separated by an N-bit interleaver, together with an optional puncturing mechanism [1]. A typical example is shown in Fig. 1. The function of the interleaver is to take each incoming frame of data bits and rearrange them in a pseudorandom fashion prior to

Manuscript received November 8, 2002; revised March 29, 2007. This work was supported in part by the Institute of Information Technology Assessment through the ITRC and IC Design Education Center (IDEC). M.-C. Shin was with the Korea Advanced Institute of Science and Technology, Daejeon 305-701, Korea. He is now with Samsung Electronics Company, Ltd., Giheung, 449-711, Korea (e-mail: I.-C. Park is with the Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 305-701, Korea (e-mail: Digital Object Identifier 10.1109/TVLSI.2007.899237

1063-8210/$25.00 © 2007 IEEE

the second encoding. Unlike other interleavers, which are mainly used to spread burst errors, it is important that the turbo interleaver sort the bits as randomly as possible so that no apparent order, or uniformity, can be found. Any uniformity may degrade the error-correcting performance of turbo decoding. As shown in Fig. 1, the encoder processes the input sequence of N bits to produce the systematic sequence and the parity sequences, where the superscript represents an interleaved sequence [1]. The puncturing mechanism periodically deletes some code bits to increase the coding rate. All of these sequences are binary bits, but the modulated sequences are generally assumed to have the value of +1 or -1.

Fig. 1. Turbo encoder for cdma2000.

The general structure of the iterative turbo decoder is shown in Fig. 2. Two component decoders are linked by interleavers in a feedback structure. Component decoders are usually called SISO decoders as they deal with the probabilities of the inputs instead of hard decisions. Each SISO receives the systematic channel bits and the parity bits coming from the noisy channel, together with the extrinsic information from the other SISO as additional a priori information. The received bits are modeled as the sum of modulated bits and channel noise following (1). As the feedback iterations go on, the bit-error-rate (BER) performance of the decoded bits improves due to the extrinsic information. Since the improvement obtained with additional iterations decreases as the number of iterations increases, four to eight iterations are usually used.

The SISO that is optimal in terms of minimizing the decoded BER should produce the a posteriori probability that each encoder input bit was 1 or 0 given the received symbol sequence y and the a priori probability. Bahl et al. originally proposed this maximum a posteriori probability (MAP) algorithm [1], [7]. As it requires a huge dynamic range of intermediate values and a large number of expensive operations such as multiplications and divisions, its simplified versions, the Log-MAP and Max-Log-MAP algorithms [8], are usually used in hardware implementations. For an encoder input bit, the soft input and output of the SISO decoder are typically represented in terms of the so-called log likelihood ratio (LLR) given by (2).

If we use capital Greek letters to indicate the logarithm of the variables, the algorithms can be represented with the following equations. The branch metric of each transition is calculated as (3), where s is a current trellis state and s' is a previous state. The a priori term in (3) is the logarithm of the a priori probability, which is equal to the extrinsic LLR output obtained from the previous SISO, and the received bits are weighted by the channel reliability measure. The forward path metric is computed recursively as (4). In the Max-Log-MAP algorithm, the max* operation is approximated as the real maximum operation, which causes some BER performance degradation. However, the Log-MAP algorithm calculates the max* operation exactly by adding a correction term as in (5). The correction function can be implemented with a small lookup table (LUT). Similarly, the backward metric is calculated as (6). The logarithm of the a posteriori probability for each input bit is then obtained by (7) and the extrinsic information to be fed to the next SISO by (8).
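The difference between the two simplified algorithms can be illustrated with a short sketch. It relies on the standard identity max*(a, b) = ln(e^a + e^b) = max(a, b) + ln(1 + e^(-|a - b|)); the function names, the eight-entry table, and its 0.5 step are assumptions made for illustration and are not taken from the decoder described in this paper.

```python
import math

def max_star_exact(a: float, b: float) -> float:
    """Log-MAP: exact max*, i.e. max(a, b) plus the correction ln(1 + e^-|a-b|)."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def max_star_maxlog(a: float, b: float) -> float:
    """Max-Log-MAP: the correction term is simply dropped."""
    return max(a, b)

# Hypothetical LUT: the correction sampled at |a-b| = 0, 0.5, ..., 3.5.
LUT_STEP = 0.5
CORRECTION_LUT = [math.log1p(math.exp(-i * LUT_STEP)) for i in range(8)]

def max_star_lut(a: float, b: float) -> float:
    """Log-MAP with the correction read from a small table (zero beyond its range)."""
    index = int(abs(a - b) / LUT_STEP)
    correction = CORRECTION_LUT[index] if index < len(CORRECTION_LUT) else 0.0
    return max(a, b) + correction
```

For large |a - b| the correction vanishes and the three variants coincide, which is why a table with only a few entries is usually sufficient.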

III. TURBO CODES IN 3G STANDARDS

We discuss here the similarities and differences of the turbo codes of the W-CDMA [2] and cdma2000 [3], which are shown in Table I, and why a multistandard turbo decoder requires a programmable interleaver.

TABLE I. DIFFERENCES BETWEEN THE cdma2000 AND W-CDMA TURBO CODES

Fig. 1 shows the turbo encoder architecture of cdma2000. The constraint length of the RSC code for W-CDMA and cdma2000 is four, and the number of states is eight. Rate-1/2, -1/3, -1/4, and -1/5 turbo codes are realized with appropriate puncturing patterns. For the W-CDMA, the code rate is fixed to 1/3 and the encoder is obtained simply by removing the paths to two of the parity outputs from Fig. 1. The code rates other than 1/3 can be implemented by applying the rate-matching process [2].

For both cdma2000 and W-CDMA systems, turbo codes are terminated in a similar way. Since the turbo codes work in a frame-wise fashion, the trellises of the two constituent encoders need to be terminated at the end of a frame. Tail bits come from the shift registers in order to bring the trellises back to the all-zero state. The dotted lines in Fig. 1 are active during the trellis termination. Because the contents of the shift registers are different in each constituent encoder, the tail bit sequences are different and the systematic code bit of the second encoder should be transmitted. The cdma2000 termination sequence varies depending on the rate, and the termination sequence of W-CDMA is the same as that of the rate-1/2 cdma2000 turbo code. The RSC decoder of the W-CDMA is the same as that of cdma2000 when the rate is 1/3, as the RSC code of W-CDMA is actually a subset of cdma2000. We can conclude that a SISO compatible with both standards can be implemented without much difficulty.

Though the turbo decoders for cdma2000 and W-CDMA share the constituent encoder structure, they have significant differences in the interleaver implementation, because their interleaver frame size, individual operations, and other parameters used in generating interleaved addresses are quite different. Compared with W-CDMA, the cdma2000 interleaver is much simpler and suitable for hardware implementation. On the other hand, the W-CDMA interleaver is more complicated and not straightforward to implement in hardware. The detailed descriptions of both interleavers are given in Section VI. Due to those differences, only a fully programmable processor is appropriate for a unified interleaver used for multiple standards. By combining a hardware SISO decoder and a programmable processor, the performance of hardware and the flexibility of software can be achieved together.

Fig. 2. Iterative turbo decoder structure.

IV. TURBO DECODER ARCHITECTURE

As shown in Fig. 3, the proposed turbo decoder employs a hybrid implementation of hardware and software. It has the simplest time-multiplex architecture [8] that contains only one SISO, one interleaver, and one extrinsic LLR memory, which are repeatedly used for both the first and the second SISO decoding processes in one iteration. The data are accessed in a sequential order for the first SISO decoding of an iteration and in an interleaved order for the second decoding. Since the write address is the same as the read address: 1) the content in memory is always sequential; 2) the read address can be buffered in order to use it later as the write address; and 3) the decoder contains no deinterleaver but only one interleaver.
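The in-place access pattern can be modelled in a few lines. The sketch below is an illustration under assumptions, not the chip's control logic: `siso_pass`, the `update` callback standing in for the SISO computation, and the latency value are hypothetical names and numbers. It buffers each read address in a queue and reuses it as the write address once the result leaves the pipeline, so every value returns to the location it was read from.

```python
from collections import deque

def siso_pass(mem, read_addresses, siso_latency, update):
    """Run one SISO pass over `mem` in place.

    Each read address is pushed into an address queue; after `siso_latency`
    further reads the corresponding result is written back to that address.
    """
    address_queue = deque()   # buffered read addresses
    in_flight = deque()       # results still inside the modelled SISO pipeline
    for addr in read_addresses:
        address_queue.append(addr)
        in_flight.append(update(mem[addr]))
        if len(address_queue) > siso_latency:
            mem[address_queue.popleft()] = in_flight.popleft()
    while address_queue:      # drain the pipeline at the end of the frame
        mem[address_queue.popleft()] = in_flight.popleft()
    return mem

# First pass: sequential order. Second pass: interleaved order (hypothetical pattern).
extrinsic = [10, 20, 30, 40, 50, 60]
siso_pass(extrinsic, range(6), siso_latency=3, update=lambda v: v + 1)
siso_pass(extrinsic, [3, 0, 5, 2, 4, 1], siso_latency=3, update=lambda v: v * 2)
```

After both passes the memory is still indexed sequentially, which is why a single interleaver suffices and no deinterleaver is needed.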

As shown by the dotted lines in Fig. 3, the SIMD processor calculates the interleaved addresses of various standards, and the address queue, whose length is equal to the SISO latency, saves the read addresses in order to use them again as the write addresses. During the first SISO decoding process of an iteration, which does not need an interleaver, the processor controls the hardware blocks, interfaces with the external host, and processes the trellis termination and a stopping criterion, which indicates whether or not further iterations may improve the BER. The processor stops the decoding iteration to reduce power consumption based on the cyclic redundancy check (CRC) or a simple stopping criterion presented in [10].

Fig. 3. Block diagram of the proposed turbo decoder.

A. SIMD Processor

To keep pace with the hardware SISO decoder, parallel processing is indispensable in the design of a programmable interleaver. A simple SIMD architecture depicted in Fig. 4 is chosen, as it is suitable for the simple and repetitive interleaved address generation and has simpler control and lower power consumption than VLIW or superscalar architectures. The number of processing elements (PEs) of the SIMD processor is set to five because the number of rows in the W-CDMA interleaver, which is the unit of interleaved address generation, is 5, 10, or 20. That of cdma2000 varies between 18 and 24 if we discard the rows that always produce invalid addresses.

The bit widths of instructions and data are all 16, and the processor has four pipeline stages: IF (instruction fetch), ID (instruction decode), EX (execution), and MEM (memory operation). The write back is carried out at the end of the MEM stage. The branch delay is one cycle and the processor always executes the instruction in a branch delay slot. The processor employs a Harvard architecture, i.e., its instruction memory and data memory are separated, and has a separate I/O port. Thus the processor can fetch an instruction, load data, and send an interleaved address through the I/O port at the same time.

The first PE, PE0, plays the role of controlling the other PEs and functions as a complete scalar processor. PE0 has an additional 16-entry scalar register file to store nonvector data or special control data. The scalar instructions running only on PE0 include control instructions such as branch and call and multicycle operations such as multiply, divide, and remainder. Basic arithmetic/logic instructions and a few customized instructions for interleaving, which are described in detail in Section V, can be performed by all of the PEs. All of the common register files of the five PEs form a five-element vector register file of 16 entries.

The five PEs share a single data memory port and a single I/O port in order to save memory access power consumption and support a simple I/O interface. For this reason, a SIMD instruction is not executed in all PEs at the same time, but executed in one PE after another so that a data memory port and an I/O port can be shared in a time-multiplexed fashion. An example of this situation is shown in Fig. 5. The rectangles filled with light gray represent the data memory instructions and the dark gray the I/O instructions. The figure shows that those instructions share a port without any conflict.

B. SISO Decoder

As shown in Fig. 6, the architecture of the proposed SISO decoder is similar to the memory architecture presented in [11]. Input data are read into a memory sized to the sliding window and used three times for calculating the forward metrics, the backward metrics, and the extrinsic LLRs. The SISO decoder contains one additional memory for temporarily storing the computed metrics and an additional section of add-compare-select-add (ACSA) units to calculate dummy backward metrics for the sliding window algorithm [11].

To support multiple standards, the SISO decoder employs configurable ACSA units shown in Fig. 7. The ACSA can support the rate-1/2 to rate-1/5 turbo codes with arbitrary transfer functions by configuring the input multiplexers in Fig. 7. For branch metric calculation, inputs are added in a carry save adder (CSA) to reduce the timing delay of multiple additions.

Generally, the Log-MAP algorithm outperforms the Max-Log-MAP algorithm [8]. However, Worm et al. demonstrated in [12] that the Max-Log-MAP is more tolerant to the channel estimation error than the Log-MAP algorithm. If we can obtain reliable channel estimation such as received signal strength indication (RSSI), we can choose the Log-MAP algorithm for better performance. On the other hand, if nothing is known about the channel, we should use the Max-Log-MAP to avoid errors caused by the misestimation of the channel. The ACSA unit is therefore designed to select one of the two algorithms with the last multiplexer in Fig. 7.

The bit widths of data and internal operations are determined based on the approach presented in [13]. Let us denote a fixed-point representation by a pair whose first and second elements represent the total bit width and the bit width of the fractional part, respectively. Based on the result of a performance simulation, we selected (5, 2) for inputs and (7, 2) for the extrinsic LLRs at the cost of slight performance degradation instead of (6, 3) and (8, 3), respectively, which produce almost ideal results. By choosing the smaller widths, the power and area of the buffer memory were

saved by 15.6%, and, in addition, the size and speed of the LUT for Log-MAP were improved dramatically. Given the bit width of inputs, we can induce the required bit width of internal calculations, which is ten in this case [13].

Fig. 4. Architecture of the SIMD processor.
Fig. 5. Execution timing of the processing elements.
Fig. 6. Overall architecture of the SISO decoder.
Fig. 7. ACSA unit used to calculate a forward metric A(s).

V. TURBO INTERLEAVER IMPLEMENTATION

Here, we describe common features of 3G standard turbo interleavers and present the proposed interleaved address generation algorithm and the specialized instructions to implement these multiple turbo interleavers with the proposed SIMD processor.

A. Prunable Interleavers

Standard turbo codes commonly employ block interleavers. Although the complicated operations and parameters used in interleaving are quite different, W-CDMA and cdma2000 share the general concept of prunable block interleavers presented in Fig. 8.

They can implement interleavers of arbitrary size N by first building the mother interleaver of a predefined size and then pruning unnecessary indexes [14]. An example of a simple prunable interleaver with N = 18 is shown in Fig. 8, where the data indexes are written in a matrix form. Data are written in a two-dimensional (2-D) mother interleaver matrix row by row and permuted within each row. The intrarow permutation rule applied to Fig. 8(b) is (9), which gives the permuted index of the ith row and jth column from a row base b and a row multiplier q. After the rows are also permuted among one another, the interleaved data are read out column by column. Fig. 8(c) shows the interrow permutation result, which will be read out column by column as a sequence of 18 addresses, since the elements exceeding the range of interest, 18 and 19, are pruned.

Fig. 8. Prunable interleaver with N = 18. (a) Incoming data indexes. (b) Intrarow permutation. (c) Interrow permutation.

B. Preprocessing and On-the-Fly Generation

All of the 3G mobile communication standards support variable bit rates that may change the interleaver size on a 10-ms or 20-ms frame basis, and W-CDMA even supports multiple separately coded transport channels. This means the interleaver structure may change every 10 ms. If the interleaver size changes by more than the granularity of the mother interleavers, the entire interleaver structure should be reconstructed. Therefore, an efficient method to change the interleaver structure in a short period of time is essential to the turbo decoding of 3G communications.

Generating the whole interleaved address pattern at once is time-consuming and requires a large memory to store the pattern. We propose a solution to this problem by dividing the interleaver construction into two parts: preprocessing for interleaving and incremental on-the-fly address generation. When the bit rate changes, only the preprocessing is performed to prepare a relatively small number of seed parameters and variables selected to make the on-the-fly generation as simple as possible. Whenever the interleaved address sequence is required, the processor generates it column by column using the parameters and variables. This dividing technique reduces the timing overhead when the interleaver structure changes. It also does not require a large memory because it does not save the whole interleaved address pattern, but only the minimal seed data to calculate it. This scheme exploits the properties of prunable interleavers.

Let us explain our approach with the example of Fig. 8. In order to remove from (9) the computationally expensive multiplication and modulo operations that take a lot of clock cycles, we use an incremental vector w and a cumulative vector instead of q, as defined in (10). Then, (9) can be rewritten as (11), and the cumulative vector can be obtained recursively as (12). Since both the cumulative value and the increment are smaller than the modulus, (12) is equivalent to the conditional form (13), where the multiplication and modulo operations are replaced by cheaper operations: the multiplication by an addition, and the modulo by a subtract if greater or equal (SUBGE) instruction described in Section V-C.

As shown in Fig. 9(a), the base b, the increment w, and the cumulative value for the first column of the block interleaver are calculated and stored in vector registers of the SIMD processor in the preprocessing for interleaving. The figure shows that they are stored in the order of interrow permutation in advance so as to simplify the on-the-fly generation. In the column-by-column on-the-fly address generation shown in Fig. 9(b), the SIMD processor updates the cumulative vector according to (13) and calculates the addresses based on (11). Calculated addresses are sent to the address queue if they are smaller than N.

Fig. 9. Proposed implementation applied to Fig. 8. (a) Preprocessing for interleaving. (b) Incremental on-the-fly address generation.

C. SIMD Instructions for Turbo Interleavers

To speed up the interleaved address generation, we introduced three SIMD processor instructions: STOLT (store to output port if less than), SUBGE (subtract if greater or equal), and LOOP. Each of them substitutes a sequence of three ordinary instructions but takes only one clock cycle to execute.
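The split into preprocessing and on-the-fly generation can be sketched as follows. Because the exact forms of (9) to (13) are not reproduced here, the sketch assumes a permutation rule of the same family, address = base + (column × multiplier) mod C, and every name and parameter value in it (`base`, `q`, the 4 × 5 matrix) is hypothetical. The inner conditional subtraction plays the role of SUBGE, the guarded append plays the role of STOLT, and the outer loop corresponds to LOOP.

```python
def direct_addresses(base, q, num_cols, n):
    """Reference version: one multiplication and one modulo per address."""
    out = []
    for j in range(num_cols):                  # read out column by column
        for b, qi in zip(base, q):
            addr = b + (j * qi) % num_cols
            if addr < n:                       # prune indexes outside 0..n-1
                out.append(addr)
    return out

def incremental_addresses(base, q, num_cols, n):
    """Same sequence using only additions, compares, and subtractions."""
    # Preprocessing: seed values, already in interrow-permuted order.
    w = [qi % num_cols for qi in q]            # increment per row
    x = [0] * len(q)                           # cumulative value per row
    out = []
    for _ in range(num_cols):                  # LOOP
        for i in range(len(base)):             # one row per processing element
            addr = base[i] + x[i]
            if addr < n:                       # STOLT: output only valid addresses
                out.append(addr)
            x[i] += w[i]
            if x[i] >= num_cols:               # SUBGE: x and w are both < num_cols,
                x[i] -= num_cols               # so one subtraction replaces the modulo
    return out

# Hypothetical 4 x 5 mother interleaver pruned to N = 18.
BASE = [15, 5, 10, 0]      # row bases in an assumed interrow-permuted order
Q = [1, 2, 3, 4]           # row multipliers, each coprime with the column count
ADDRESSES = incremental_addresses(BASE, Q, num_cols=5, n=18)
```

Only the three short vectors are kept between frames, which reflects the memory saving argued above; the full address pattern is never stored.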

The STOLT instruction sends an address output to the queue only if the generated address is in the range of the interleaver size, which is the same as the pruning mentioned earlier. It is equivalent to a sequence of three instructions: CMP (compare), BGE (branch if greater or equal), and STO (store to output port). Another conditional instruction, SUBGE, which is equivalent to the sequence of CMP, BLT (branch if less than), and SUB (subtract), is quite useful for the block interleavers that commonly use modulo operations. It substitutes a modulo or remainder operation if the range condition on its operands is satisfied, which corresponds to (13). The LOOP instruction is adopted from DSP processors to reduce the loop overhead. This instruction conforms to the sequence of CMP, BNE (branch if not equal), and SUB instructions, and at once decrements the loop count and branches.

Using the special instructions, we can reduce the lengths of the on-the-fly generation program loops of W-CDMA and cdma2000 to six and five instructions, respectively. This indicates that the five PEs can provide one address per cycle for cdma2000 turbo decoding.

In addition to the three special instructions, there are a few instructions necessary to implement interleavers. The MUL (multiply) and REM (remainder) operations, which take several cycles to complete, are required in the preprocessing of the W-CDMA interleaver. The MASK instruction, which disables one or more of the PEs, is convenient when the column length of the block interleaver is not a multiple of five. Multistandard turbo decoding can be realized by loading several interleaver programs and control programs into the memory and switching between them whenever needed.

VI. APPLICATION TO STANDARDS

We have applied the technique described in Section V to the W-CDMA [2] and cdma2000 [3] turbo interleavers. We also applied the proposed turbo decoder and turbo interleaver implementation to a few other standard turbo codes.

A. W-CDMA

The W-CDMA turbo interleaver is quite similar to the example of Fig. 8 but much more complex. A prime number and its associated primitive root are involved in the whole interleaver generation process [2]. First, the number of rows R of the block interleaver matrix is determined as 5, 10, or 20, depending on the range of the given interleaver frame size. Then, the prime number and the number of columns are determined: the prime number as the minimum prime that satisfies the size condition in [2], and the number of columns as that prime minus one, the prime itself, or the prime plus one. The intrarow permutation rule of the most common case is (14), where the base sequence follows the recursion (15) and the row-specific prime integer is the least prime integer satisfying a g.c.d. condition [2]. The interrow permutation rule is also determined by the frame size and the table in [2]. For a more detailed description of the W-CDMA interleaver, readers are referred to [2].

We can divide the complicated process into a preprocessing part and an on-the-fly generation part such that no timing overhead exists in changing the frame rate. In the preprocessing part, the number of rows, the prime number, and the number of columns are found following the method explained above. Then, the permutation function is obtained using the recursive (15) and saved in the data memory, which is the most time-consuming part of the preprocessing. Finally, for the on-the-fly generation, we save the seed variable vectors of length R: address base, cumulative variable, and increment value. The relation between two neighboring entries in a row allows the cumulative variable to be updated by adding the increment. We save the reduced values instead of the raw ones in order to replace the modulo operations with SUBGEs in the on-the-fly generation part. The computationally expensive modulo operations required in the on-the-fly part are thus substituted by the SUBGE instructions. The SUBGE safely substitutes the modulo because the condition is satisfied (both operands are within the modulus before they are added). Since the W-CDMA allows R to be 5, 10, or 20 and the SIMD processor has five PEs, each vector is stored in four separate vector registers when R = 20.

The pseudocode of the on-the-fly address generation for W-CDMA when R = 5 is shown in Fig. 10. Each line corresponds to an instruction of the processor code. Line 2 corresponds to SUBGE, line 5 corresponds to STOLT, and line 6 corresponds to LOOP. If R = 10 or 20, lines 1–5 are repeated twice or four times to produce an entire column of the interleaver matrix with five PEs. The actual order of processor instructions is changed from the pseudocode to avoid data dependencies that stall the processor pipeline.

Fig. 10. Pseudocode of the W-CDMA on-the-fly address generation when R = 5.

B. cdma2000

The cdma2000 turbo interleaver has a dimension of 32 × 2^n, where n is an integer between four and ten. The intrarow permutation is defined by the column index and a row-specific value found in an LUT given in [3]. The rows are then shuffled according to a bit-reversal rule. This procedure is equivalent to the bit-level processing illustrated in Fig. 11 [3]. As one can expect from the figure, the turbo interleaver of the cdma2000 standard is easy to implement in hardware.

The preprocessing of the cdma2000 is quite simple. Like the W-CDMA turbo interleaver, we save the address base, address offset, and increment obtained by the LUT into the vector registers of the SIMD processor.
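The row structure of the cdma2000 interleaver can be checked with a small sketch. It is written under assumptions rather than taken from this paper: the list of twelve frame sizes and the address layout (five bit-reversed row bits above n column bits) follow a reading of the cdma2000 specification [3], the function names are hypothetical, and the LUT-based column offset is deliberately left out because it does not affect which rows can produce valid addresses.

```python
# Twelve turbo interleaver sizes, assumed from the cdma2000 specification [3].
CDMA2000_SIZES = [378, 570, 762, 1146, 1530, 2298,
                  3066, 4602, 6138, 9210, 12282, 20730]

def column_bits(n_turbo: int) -> int:
    """Smallest n such that 32 rows of 2**n columns hold n_turbo addresses."""
    n = 0
    while n_turbo > (1 << (n + 5)):
        n += 1
    return n

def bit_reverse5(row: int) -> int:
    """Reverse the five bits of a row index (the interrow shuffle)."""
    return int(format(row, "05b")[::-1], 2)

def useful_rows(n_turbo: int) -> list:
    """Rows whose smallest possible address is still below n_turbo.

    Every address in a row is at least bit_reverse5(row) * 2**n, so a row
    whose base is already >= n_turbo can only produce pruned addresses.
    """
    n = column_bits(n_turbo)
    return [row for row in range(32) if (bit_reverse5(row) << n) < n_turbo]

ROW_COUNTS = {size: len(useful_rows(size)) for size in CDMA2000_SIZES}
```

Under these assumptions every size keeps between 18 and 24 of the 32 rows, so each column needs four or five passes of the five PEs.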

Fig. 11. Turbo interleaver generation procedure of cdma2000.

Fig. 12. Pseudocode of the cdma2000 on-the-fly address generation.

The values obtained by the LUT, such as the address offset and the increment, are loaded into the vector registers of the SIMD processor. Because every address generated in a row is larger than or equal to the address offset of that row, a row whose offset is already out of range will always produce an invalid address if we do not discard it. Such rows are all discarded and their spaces are overwritten. After discarding those rows, the number of remaining rows becomes 18–24 out of 32 for any of the 12 predefined cdma2000 interleavers. The pseudocode of the on-the-fly generation part is shown in Fig. 12 and is almost identical to that of the W-CDMA, and only four instructions are enough to form the on-the-fly generation loop.

C. Other Standards

The Forward Link Only (FLO) mobile multicasting standard adopted the turbo code of cdma2000 with a fixed frame size [15]. The FLO turbo decoder can be implemented in the same way as that of cdma2000.

The turbo interleaver used in the Consultative Committee for Space Data Systems (CCSDS) standard [16] also belongs to the category of prunable interleavers. The proposed SIMD processor can implement the interleaver. However, the constraint length of the CCSDS turbo code is five, which requires a SISO decoder that is twice as large as that of the other standards.

The turbo codes adopted by the Digital Video Broadcasting standard of return channel via satellite (DVB-RCS) [17] and IEEE Standard 802.16 of broadband wireless access (BWA) systems [18] are double binary circular recursive systematic convolutional (CRSC) codes. The double binary encoders split input data into two sequences and introduce nonuniformity by simply shuffling them with each other instead of applying the complex nonuniform interleaving to a single data sequence. We can implement a turbo decoder for DVB-RCS and IEEE Standard 802.16 using the proposed architecture shown in Fig. 3. The control of the data flow from the memory to the SISO decoder will become a little more complex. However, the software interleavers can be very easily implemented by utilizing the SUBGE instruction and four-way parallel processing of the proposed SIMD processor.

All of the 3G standards adopted convolutional codes as the channel coding as well as turbo codes. Since both the turbo decoders and the Viterbi decoders are based on similar trellis decoding, most parts of the decoders can be shared and reused [19]. The proposed turbo decoder in Fig. 3 can also work as a Viterbi decoder by rearranging the data flow, by repeatedly using only one group of ACSA units in Fig. 6, and by utilizing the SIMD processor for traceback.

VII. EXPERIMENTAL RESULTS

We implemented an entire turbo decoder that supports both W-CDMA and cdma2000 1x RTT [3] turbo codes in a 0.25-µm CMOS technology [20]. The summary of the chip characteristics and the layout are given in Table II and Fig. 13, respectively. The maximum operating frequency of 135 MHz at 2.5 V and the energy/bit/iteration of 6.89 nJ were obtained by measuring 40 working samples on a test machine. The estimated maximum data rate of 5.48 Mb/s indicates that the decoder can easily cover 3G standards whose maximum rate is 2 Mb/s.

TABLE II. SUMMARY OF THE CHIP IMPLEMENTATION

Table III compares the complexity of the proposed turbo decoder and that of [19], where a turbo decoder is implemented for the W-CDMA standard. As shown in Table III, our implementation is smaller than that in [19]. Although the work in [19] contains more circuits such as the first interleaver of the W-CDMA receiver and the bit precision of the computation is larger than ours, we can safely claim that the proposed multistandard turbo decoder based on a programmable processor is comparable in size to average hardware turbo decoder implementations.

We measured the BER performance of the proposed turbo decoder by a simulation program in C language. We used the W-CDMA turbo code, the Log-MAP decoding algorithm, and the CRC stopping criterion, assuming an additive white Gaussian noise (AWGN) channel. The performances of the proposed decoder and an ideal decoder are shown in Fig. 14. The ideal turbo decoder is implemented with double-precision floating-point arithmetic and the original Log-MAP algorithm that does not exploit the sliding window technique and the stopping criterion.
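To make the contrast between the ideal and the finite-precision decoder more concrete, the following is a minimal sketch of the core Log-MAP operation (the Jacobian logarithm, often written max*) evaluated in double precision and on a saturating fixed-point grid. The function names and the bit widths (8 bits in total, 2 fractional bits) are assumptions chosen for illustration only; they are not the parameters of the fabricated decoder.

```python
import math


def max_star(a: float, b: float) -> float:
    """Jacobian logarithm ln(e^a + e^b), computed in double precision."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def saturate_quantize(x: float, total_bits: int = 8, frac_bits: int = 2) -> float:
    """Round x to a signed fixed-point grid and clip it to the representable range.

    The bit widths are illustrative assumptions, not values from the paper.
    """
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    q = math.floor(x * scale + 0.5)          # round to the nearest grid point
    q = max(lowest, min(highest, q))         # saturation instead of wrap-around
    return q / scale


def max_star_fixed(a: float, b: float, total_bits: int = 8, frac_bits: int = 2) -> float:
    """max* with quantized inputs and a quantized, saturated result."""
    qa = saturate_quantize(a, total_bits, frac_bits)
    qb = saturate_quantize(b, total_bits, frac_bits)
    return saturate_quantize(max_star(qa, qb), total_bits, frac_bits)


if __name__ == "__main__":
    for a, b in [(0.0, 0.0), (1.3, -0.7), (40.0, 39.0)]:
        print(a, b, max_star(a, b), max_star_fixed(a, b))
```

With these assumed widths, small operands lose at most half a quantization step per operation, whereas large operands such as the last pair are clipped by saturation. Both effects accumulate over the trellis, which is the kind of finite-precision error that separates a fixed-point decoder from a floating-point reference.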

We obtained the BER curves on the right when the interleaver frame size is 1024 and the maximum number of iterations is six, and the others when the frame size is 5114 and the maximum number of iterations is ten. Fig. 14 shows that the performance degradation of our implementation is within 0.05 dB compared with the ideal case, which is mainly due to finite-precision effects such as saturation and quantization error.

Fig. 13. Micrograph of the chip.

Fig. 14. BER performance comparison with the ideal turbo decoder.

TABLE III. AREA COMPARISON OF TURBO DECODER IMPLEMENTATIONS

TABLE IV. CYCLE COUNTS FOR THE MOST CRITICAL INTERLEAVER SIZES OF W-CDMA

In addition, the performance of the proposed interleaving algorithm is analyzed in terms of the cycle counts for four critical interleaver sizes of the W-CDMA turbo decoding. For brevity, the simpler and faster cdma2000 interleaver case is omitted. As shown in Table IV, the preprocessing time is shorter than the SISO decoding time, which completely hides the preprocessing delay as the preprocessing completes during the first SISO decoding. The second SISO decoding with the interleaved addresses can start as soon as the first SISO decoding finishes, since the SIMD processor is ready for the on-the-fly generation of the new interleaver structure. The last column of Table IV shows that the on-the-fly address generation is almost as fast as one address per clock cycle for large interleaver sizes, which imply high bit rates. Since a small interleaver size implies a low bit rate in 3G systems, the relatively larger overheads for small-sized interleavers shown in the first and second rows of Table IV actually have a negligible effect on the system speed.

VIII. CONCLUSION

We have presented a turbo decoder designed for multiple 3G wireless communication standards. It contains a configurable hardware SISO decoder and a 16-b SIMD processor with five PEs and specialized instructions to perform incremental block interleaving. It can decode both W-CDMA and cdma2000 bit streams by changing the software running on the SIMD processor. The performance and power efficiency of the hardware and the flexibility of the software are thus achieved together. The size of the multistandard turbo decoder is comparable to average W-CDMA standard hardware turbo decoders. The proposed decoder implemented in a 0.25-µm CMOS technology shows 5.48 Mb/s performance, which is sufficient for 3G communication standards.

In addition, a fast incremental software implementation of turbo interleavers has been proposed, which is suitable for use with real-time structure change such as VBR in 3G wireless communications. To hide the timing overhead of interleaver changing, the interleaver generation is split into two parts, preprocessing and incremental on-the-fly generation. The proposed software implementation of the turbo interleaver running on the SIMD processor is sufficiently fast that the timing overhead of interleaver changing is hidden in most cases, and the rate of the interleaved address generation for the 3G standards is almost one interleaved address per cycle.
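The split into preprocessing and incremental on-the-fly generation can be illustrated with a cdma2000-style prunable interleaver. The sketch below is a scalar Python model under stated assumptions, not the SIMD program of the chip: the function names are hypothetical, `placeholder_lut` returns made-up odd multipliers instead of the lookup table of the cdma2000 standard, and the "subtract if greater or equal" step is only our reading of what a SUBGE-like instruction would do. Preprocessing keeps the rows whose address offset is in range; the on-the-fly loop then replaces the per-address multiplication with an accumulate-and-wrap update.

```python
def bit_reverse5(r: int) -> int:
    """Reverse the five bits of a row index (0..31)."""
    return int(f"{r:05b}"[::-1], 2)


def placeholder_lut(n: int) -> list:
    """Hypothetical odd n-bit multipliers; NOT the table of the standard."""
    return [(2 * r + 1) % (1 << n) for r in range(32)]


def column_bits(frame_size: int) -> int:
    """Smallest n such that frame_size <= 2**(n + 5)."""
    if frame_size <= 32:
        raise ValueError("frame size too small for this sketch")
    return (frame_size - 1).bit_length() - 5


def preprocess(frame_size: int, lut: list):
    """Keep only the rows whose address offset is below the frame size."""
    n = column_bits(frame_size)
    rows = []
    for r in range(32):
        offset = bit_reverse5(r) << n
        if offset < frame_size:              # otherwise every address is invalid
            rows.append((offset, lut[r]))
    return n, rows


def on_the_fly(frame_size: int, n: int, rows: list):
    """Yield interleaved addresses incrementally, without multiplication."""
    size = 1 << n
    acc = [inc for _, inc in rows]           # column 0: (0 + 1) * inc mod 2**n
    for _ in range(size):
        for k, (offset, inc) in enumerate(rows):
            addr = offset + acc[k]
            if addr < frame_size:            # prune remaining invalid addresses
                yield addr
            acc[k] += inc
            if acc[k] >= size:               # subtract if greater or equal
                acc[k] -= size


def direct_addresses(frame_size: int, lut: list) -> list:
    """Reference model: counter-based generation with a multiplication."""
    n = column_bits(frame_size)
    out = []
    for counter in range(1 << (n + 5)):
        row, col = counter & 31, counter >> 5
        low = ((col + 1) * lut[row]) % (1 << n)
        addr = (bit_reverse5(row) << n) | low
        if addr < frame_size:
            out.append(addr)
    return out


if __name__ == "__main__":
    lut = placeholder_lut(column_bits(378))
    n, rows = preprocess(378, lut)
    addrs = list(on_the_fly(378, n, rows))
    print(len(rows), addrs[:8])
```

For a frame size of 378 this model keeps 24 of the 32 rows, and for 570 it keeps 18, which agrees with the 18–24 range quoted for the predefined cdma2000 interleavers. Because the pruned rows never enter the inner loop, fewer invalid addresses have to be skipped during generation.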

REFERENCES

[1] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limit error-correcting coding and decoding: Turbo-codes,” in Proc. Int. Conf. Commun., 1993, pp. 1064–1070.
[2] 3rd Generation Partnership Project, Technical Specification Group Radio Access Network, Multiplexing and Channel Coding (FDD), 3GPP TS 25.212.
[3] Physical Layer Standard for cdma2000 Spread Spectrum Systems, 3GPP2 C.S0002-B.
[4] J. Vogt, K. Koora, A. Finger, and G. Fettweis, “Comparison of different turbo decoder realization for IMT-2000,” in Proc. IEEE GLOBECOM, 1999, pp. 2704–2708.
[5] U. Walther and G. Fettweis, “DSP implementation issues for UMTS-channel coding,” in Proc. ICASSP, 2000, pp. 3219–3222.
[6] M. Bekooij, J. Dielissen, F. Harmsze, S. Sawitzki, J. Huisken, A. van der Werf, and J. van Meerbergen, “Power-efficient application-specific VLIW processor for turbo decoding,” in IEEE Int. Solid State Circuits Conf. Dig. Tech. Papers, 2001, pp. 180–181.
[7] L. R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, “Optimal decoding of linear codes for minimizing symbol error rate,” IEEE Trans. Inf. Theory, vol. IT-20, no. 2, pp. 284–287, Mar. 1974.
[8] P. Robertson, E. Villebrun, and P. Hoeher, “A comparison of optimal and sub-optimal MAP decoding algorithms operating in the log domain,” in Proc. Int. Conf. Commun., Seattle, WA, 1995, pp. 1009–1013.
[9] J. P. Woodard, “Implementation of high rate turbo decoders for third generation mobile communication,” in Proc. IEE Colloq. Turbo Codes Digit. Broadcasting, 1999, pp. 12/1–12/6.
[10] R. Y. Shao, S. Lin, and M. P. C. Fossorier, “Two simple stopping criteria for turbo decoding,” IEEE Trans. Commun., vol. 47, no. 8, pp. 1117–1120, Aug. 1999.
[11] G. Masera, G. Piccinini, M. R. Roch, and M. Zamboni, “VLSI architectures for turbo codes,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 7, no. 3, pp. 369–379, Sep. 1999.
[12] A. Worm, P. Hoeher, and N. Wehn, “Turbo-decoding without SNR estimation,” IEEE Commun. Lett., vol. 4, no. 6, pp. 193–195, Jun. 2000.
[13] G. Montorsi and S. Benedetto, “Design of fixed-point iterative decoders for concatenated codes with interleavers,” IEEE J. Sel. Areas Commun., vol. 19, no. 5, pp. 871–882, May 2001.
[14] M. Eroz and A. R. Hammons Jr., “On the design of prunable interleavers for turbo codes,” in Proc. IEEE Veh. Technol. Conf., Houston, TX, 1999, pp. 1669–1673.
[15] Forward Link Only Air Interface Specification for Terrestrial Mobile Multimedia Multicast, TIA-1099, 2006.
[16] Recommendation for Space Data System Standards: Telemetry Channel Coding, CCSDS 101.0-B-6, Blue Book, 2002.
[17] Digital Video Broadcasting (DVB): Interaction Channel for Satellite Distribution Systems, EN 301 790.
[18] IEEE Standard for Local and Metropolitan Area Networks, Part 16: Air Interface for Fixed Broadband Wireless Access Systems, IEEE Standard 802.16-2004, 2004.
[19] M. Bickerstaff, D. Garrett, T. Prokop, C. Thomas, B. Widdup, G. Zhou, C. Nicol, and R.-H. Yan, “A unified turbo/viterbi channel decoder for 3GPP mobile wireless in 0.18-µm CMOS,” in IEEE Int. Solid State Circuits Conf. Dig. Tech. Papers, 2002, pp. 124–125.
[20] M.-C. Shin and I.-C. Park, “A programmable turbo decoder for multiple 3G wireless standards,” in IEEE Int. Solid State Circuits Conf. Dig. Tech. Papers, 2003, pp. 154–155.

Myoung-Cheol Shin (S’97–M’07) received the B.S., M.S., and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 1996, 1998, and 2004, respectively. In 2003, he joined the System LSI Division, Samsung Electronics Company, Ltd., Giheung, Korea, as a Senior Engineer working on VLSI architectures and implementations of 3G wireless communication receivers and mobile digital broadcasting receivers. He holds one patent in this area.

In-Cheol Park (S’88–M’92–SM’02) received the B.S. degree in electronic engineering from Seoul National University, Seoul, Korea, in 1986, and the M.S. and Ph.D. degrees in electrical engineering and computer science from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 1988 and 1992, respectively. Since June 1996, he has been an Assistant Professor and is now a Professor with the School of Electrical Engineering and Computer Science, KAIST. Prior to joining KAIST, he was with the IBM T. J. Watson Research Center, Yorktown, NY, from May 1995 to May 1996, where he performed research on high-speed circuit design. His current research interest includes computer-aided design algorithms for high-level synthesis and VLSI architectures for general-purpose microprocessors. Prof. Park was the recipient of the Best Paper Award at ICCD in 1999 and the Best Design Award at ASP-DAC in 1997.