
This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the ICC 2007 proceedings.

Pankaj Bhagawat*, Weihuang Wang*, Momin Uppal*, Gwan Choi*, Zixiang Xiong*, Mark Yeary**, and Alan Harris*** (* Texas A&M University, College Station; ** University of Oklahoma; *** University of North Florida)
Abstract: Dirty paper coding (DPC) can be used in a number of communication network applications: broadcast channels, multiuser interference channels, and ISI channels, to name a few. We study various bottlenecks and issues in implementing a DPC pre-coder based on the nested trellis technique. The aim is a practical hardware realization of the pre-coder for wireless LAN/DSL applications. We describe the architectural development process and the realization of the pre-coder on a Xilinx Virtex 2V8000 FPGA. To the best of our knowledge this is the first reported DPC pre-coder hardware implementation.

Index Terms: Pre-coding, Dirty Paper Codes, Nested Trellis, Hardware Architectures.
An FPGA Implementation of Dirty Paper Precoder

I. INTRODUCTION
Consider a scenario where a sender wishes to transmit a signal X with a power constraint E[X^2] <= P_X over an additive white Gaussian noise channel with noise Z ~ N(0, P_Z), i.e., Z is normally distributed with mean 0 and variance P_Z. In addition, the received signal is corrupted by an additive interference S which is available at the sender but not at the user. The received signal Y is given by

Y = X + S + Z.  (1)

If the interference is also known to the decoder, the Shannon capacity of this channel is given by the well-known formula

C = (1/2) log(1 + P_X / P_Z).  (2)

Costa [1] considered the scenario where the interference is known only at the transmitter, and proved the surprising result that the capacity of the channel under this assumption is still given by (2); i.e., the capacity of a channel where the interference is known only at the transmitter is the same as if the interference were not present at all. The problem of communicating data over a noisy interference channel with the interference available at the sender is known as channel coding with side information at the transmitter. For the special case where the noise is Gaussian, the problem is also referred to in the literature as Dirty-Paper Coding (DPC). Costa's work did not draw much attention at first, primarily because it did not consider any applications of its results. However, it is now known that several communication scenarios can be modeled by the channel given by (1). For example, consider a broadcast channel where a single base station intends to communicate with multiple users (wireless

LANs (WLANs), Digital Subscriber Line (DSL), and high speed interconnect design are some systems where such a situation arises). Each user ends up receiving not only its own signal, but also sees the signals intended for other users as unknown interference. This interference, if left as is, can cause a significant degradation in capacity. However, since the base station has knowledge of the signals of all users, it can use DPC to mitigate the effect of this inter-user interference. Indeed, recent results on the capacity of the multiple-input multiple-output (MIMO) broadcast channel [3] indicate that DPC is the only capacity achieving strategy. Besides broadcast channels, DPC finds applications in numerous other situations: inter-symbol interference (ISI) channels, cooperative networks, and digital watermarking, to name a few. Driven by these applications, the past few years have seen a surge in practical designs for DPC. Notable amongst these are [4], [5], [6] for low spectral efficiencies, which mostly find application in digital watermarking, and [7], [8], [9] for higher spectral efficiencies, which are more suitable for communication scenarios such as broadcast channels. Although DPC promises significant gains over other commonly used interference cancellation schemes, it has yet to find widespread use in code designs for practical applications. Probably the major factor hindering its application in practical scenarios is its perceived hardware complexity, especially in situations requiring real time encoding and decoding. This serves as a strong motivation for us to build a hardware implementation of DPC and analyze whether it is feasible for use in a real time system. To the best of our knowledge, this paper presents the first hardware implementation of DPC. Specifically, we consider the simplest scheme of [8], which employs trellis coded quantization (TCQ) as the source code and trellis coded modulation (TCM) as the channel code.
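To make the promised gain concrete, the following sketch (with purely illustrative power values) compares the capacity formula (2) against a receiver that must treat the interference as additional noise:

```python
import math

def awgn_capacity(p_signal, p_noise):
    # Shannon capacity of an AWGN channel, Eq. (2)
    return 0.5 * math.log2(1 + p_signal / p_noise)

# illustrative powers: unit signal, strong interference, weak noise
p_x, p_s, p_z = 1.0, 10.0, 0.1
c_naive = awgn_capacity(p_x, p_z + p_s)  # interference lumped into the noise
c_dpc = awgn_capacity(p_x, p_z)          # Costa: interference-free capacity
# c_dpc (about 1.73 bits) far exceeds c_naive (about 0.07 bits)
```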
Although there are other DPC schemes [2] involving the same source and channel codes, the reason we consider the design of [8] is that it can easily be extended to a stronger dirty-paper code involving turbo codes. As we explain later in this paper, a hardware implementation of DPC based on [8] poses considerable challenges, especially if high throughput and good bit error rate (BER) performance are required. We also conclude that for good BER performance (which warrants the use of turbo codes), the major bottleneck is the size and organization of the storage elements. The paper is organized as follows. Section II starts with a review of dirty paper codes (DPC), and we explain how quantization helps in mitigating the interference. Starting with

1-4244-0353-7/07/$25.00 © 2007 IEEE


scalar quantization, we eventually describe vector quantization and various available strategies to implement it. Section III describes the detailed data flow involved in implementing DPC using the nested trellis scheme; specifically, we explain the Trellis Coded Quantization/Trellis Coded Modulation (TCQ/TCM) scheme and then go through the proposed hardware architecture for TCQ/TCM. In Section IV we discuss the implementation results on an FPGA, likely implementation issues for more advanced DPCs, and future research directions. We end this paper by drawing conclusions in Section V.

II. PRACTICAL DIRTY PAPER CODING

In this section we give a brief overview of the scheme of [8], which we implement in hardware. Before describing the specific scheme, we first look at how modulo operations can be employed for interference cancellation without a significant penalty on transmission power. Such modulo operations can naturally be tied to quantization, which indicates the importance of employing a source code in an apparent channel coding problem.

A. Modulo Pre-coding

Consider a signal U with power E[U^2] = P_U to be transmitted over a dirty-paper channel characterized by an additive interference S with E[S^2] = P_S and noise Z ~ N(0, P_Z). At first glance one would consider pre-subtracting the side information from the transmitted signal in order to cancel the interference, i.e., transmitting X_s = U - S. Indeed, the received signal will now be Y_s = X_s + S + Z = U + Z, and hence interference free. A closer look at this approach, however, reveals that this pre-subtraction pays a severe power penalty. Assuming that U and S are independent, the transmit power will be E[X_s^2] = P_U + P_S. Since P_S can be arbitrarily large, E[X_s^2] can be much higher than P_U, which results in a severely reduced transmission rate compared to (2).
In order to avoid this power penalty, we can use modulo arithmetic to constrain the transmitted signal to a finite interval. Let the codeword to be transmitted be constrained to a finite interval of length Δ, i.e., U in [0, Δ]. The signal transmitted to the channel is X = (U - S) mod Δ. Hence X is now limited to the same finite interval as U and does not suffer the power penalty which a simple pre-subtraction would. At the decoder, the same mod operation is performed to obtain an estimate of U. In the absence of noise, it can be shown that this modulo pre-coding guarantees that the codeword U is recovered without error at the decoder. This modulo pre-coding is also sometimes referred to as Tomlinson-Harashima Pre-coding (THP). As pointed out in [2], THP suffers a significant loss from capacity, especially at low signal-to-noise ratios. The main drawback of THP is that it uses only the current value of the side information S and does not consider future values.
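A minimal numerical sketch of this modulo pre-coding, with an illustrative interval length DELTA standing in for Δ:

```python
DELTA = 4.0  # interval length (illustrative value)

def thp_encode(u, s):
    # Tomlinson-Harashima pre-coding: pre-subtract the interference modulo DELTA
    return (u - s) % DELTA

def thp_decode(y):
    return y % DELTA

u, s = 1.3, 27.6        # codeword in [0, DELTA) and a large interference
x = thp_encode(u, s)    # transmitted signal stays inside [0, DELTA)
y = x + s               # channel adds the interference back (noiseless here)
u_hat = thp_decode(y)   # recovers u exactly in the absence of noise
```

Note that Python's `%` always returns a value with the sign of the divisor, which is exactly the wrap-into-interval behavior needed here.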

The mod operation is performed on a symbol-by-symbol basis, which is equivalent to performing the mod operation over a high dimensional cuboid. Based on the information theoretic proof of Costa's capacity, the transmitted signal should have an i.i.d. Gaussian distribution in each dimension. The loss due to the transmitted signal not being Gaussian is called the shaping loss. In order to avoid this loss, the mod operation should be performed over a high dimensional sphere instead of a cuboid, which results in a Gaussian quantization error. This indicates that the solution to recovering the shaping loss lies in vector quantization. Moreover, in addition to the quantization (source coding) required to satisfy the power constraint, one needs to add error protection to the transmission. Thus one needs to view DPC as a joint source and channel code design problem. Zamir et al. [9] proposed such a scheme based on a practical binning strategy using nested lattice codes, where the source lattice code is nested within the channel lattice code.

B. DPC Using Nested Trellis Codes

Figure 2.1: Nested trellis DPC scheme of [8]: (a) Encoder, (b) Decoder.

A binning approach for DPC using nested trellis codes was presented in [8]; the block diagram is shown in Fig. 2.1. The nested trellis structure is constructed via a rate-k/n/m concatenated code (denoted by C1 + C2, with C1 and C2 being a non-systematic rate-k/n and a systematic rate-n/m convolutional code, respectively). The message to be transmitted, w, is input to an inverse syndrome mapper H^-1, where H is the parity check matrix of code C1. We point out that the inverse syndrome mapper H^-1 is in fact the pseudoinverse of the parity check matrix H. In other words, w^L selects the coset (bin) to be used for quantization. The channel coding scheme is trellis coded modulation (TCM) characterized by the convolutional code C2. The encoder employs trellis coded quantization (TCQ) on the concatenated code C1 + C2 to


quantize the side information sequence S^L using the coset selected by the message sequence w^L, where L is the sequence length (hence we refer to this scheme as TCQ/TCM in the subsequent discussion). Note that the source code is nested within the channel code, i.e., every codeword of the source code is also a codeword of the channel code. At the decoder, the received sequence is first decoded using the Viterbi algorithm based on C2. The decoded output is then multiplied by the parity check matrix H to obtain the decoded message estimate ŵ. According to the results reported in [8], at a transmission rate of 1 bit/sample the TCQ/TCM scheme performs approximately 5.2 dB away from Costa's capacity. This large difference is due to the relatively weak channel code being used (see Fig. 2.1). Indeed, as reported in [7], [8], using turbo codes instead of the simple convolutional code C2 can reduce this gap by a significant margin.

III. ARCHITECTURE FOR A DPC PRECODER USING NESTED TRELLIS

In this section we begin with a detailed data flow description of nested trellis-based DPC pre-coding. We then present our architecture.

A. Data Flow for a Nested Trellis DPC Encoder

As explained earlier, a nested trellis-based DPC scheme involves two codes, C1 and C2. C2 is the channel code used for error correction, and C1 concatenated with C2 serves as the source code. DPC pre-coding using a nested trellis is best explained by an example.

Example 3.1: The nested trellis scheme consists of two codes, C1 and C2. For illustration purposes let C1 and C2 be as shown in Fig. 3.1 (in practice C2 is usually a Recursive Systematic Convolutional (RSC) code). Data bits are first multiplied by the pseudoinverse of C1's parity check matrix (H^-1). The output of this multiplication is then fed into the input of C2 via two XOR gates. It is clear how the data-bits influence the state diagram of the overall source code. For example, when the data-bits are 00, a state input of 1 will cause the next state to be 10000 with output 100, whereas for data-bits 11 an input of 1 will cause the state to go to 10100 with output 101. Consequently, the trellis structure changes depending on the data-bits. This means that the connections between states need to be reconfigured on the fly as a function of the data-bits. This is the principal difference between the Viterbi algorithm and nested trellis pre-coding.
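The coset-selection step can be sketched over GF(2); the parity-check matrix H, its right pseudoinverse, and the codeword below are hypothetical toy stand-ins, not the actual matrices of Fig. 3.1:

```python
def matmul_gf2(M, v):
    # matrix-vector product over GF(2)
    return [sum(m * x for m, x in zip(row, v)) % 2 for row in M]

H = [[1, 0, 1],
     [0, 1, 1]]          # toy parity-check matrix (illustrative)
H_pinv = [[1, 0],
          [0, 1],
          [0, 0]]        # right pseudoinverse: H * H_pinv = I over GF(2)

w = [1, 0]                                  # message bits select the coset
t = matmul_gf2(H_pinv, w)                   # coset representative, H*t = w
c = [1, 1, 1]                               # a codeword of C1, so H*c = 0
v = [(a + b) % 2 for a, b in zip(t, c)]     # any member of the selected coset
# the decoder recovers the message simply by re-applying H:
assert matmul_gf2(H, v) == w
```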

Figure 3.1: A TCQ/TCM Scheme.

Just like in TCQ, we choose a path in the trellis that minimizes the Euclidean distance between two sequences. Here, one of the sequences is the scaled version of the interference sequence S, which is available at the transmitter noncausally. The other is the sequence of PAM symbols. In the example, we have to use an 8-PAM constellation since there are three output bits. Thus, each combination of the bits O1, O2, O3 is uniquely mapped to a PAM symbol. This mapping is usually done such that the minimum distance of the resulting code is maximized [10].

Figure 3.2: A sample of trellis behavior as a function of data-bits.

Fig. 3.2 shows an example of how the trellis connections change with the data-bits. Notice also how the outputs associated with branches coming out of the same state change with the data-bits. These output bits are pre-mapped to a PAM constellation. For example, the following could be a possible mapping:

Figure 3.3: A possible 8-PAM mapping.
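A hypothetical natural-binary version of such a mapping; the actual mapping of Fig. 3.3 would be chosen to maximize the minimum distance of the resulting code [10]:

```python
def pam8(o1, o2, o3):
    # map output bits (O1, O2, O3) to the unit-spaced 8-PAM levels
    # -3.5, -2.5, ..., +3.5 (natural-binary labeling, for illustration only)
    index = (o1 << 2) | (o2 << 1) | o3   # 3 bits -> 0..7
    return index - 3.5
```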


A branch metric unit (BMU) operates on two operands: the first is the interference sample (a real Gaussian random variable), and the second is a mapped PAM symbol. Suppose at a particular stage of the trellis the scaled interference sample is 2.7 and the PAM symbol associated with that branch is 1.5. Then the BMU computes the metric as (2.7 - 1.5)^2. However, if the interference sample were 6.5, the BMU would first subtract 7 from it (6.5 - 7 = -0.5) and compute the metric as (-0.5 - 1.5)^2. This essentially means that we are replicating the constellation shown in Fig. 3.3 to accommodate the interference sample of 6.5. In general, when the scaled interference sample X exceeds the maximum PAM symbol Y, the BMU computes ((X + Y) mod 2Y) - Y before computing the Euclidean distance.

B. Proposed Architecture

The architecture presented here processes all the states at a particular stage of the trellis in parallel (see Fig. 3.4). This is commonly referred to as a state parallel architecture and is desirable for high throughput decoding of convolutional codes using the Viterbi algorithm [12]. However, the proposed architecture differs in that it can accommodate a dynamically changing trellis structure at run time.
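The BMU's modulo folding can be sketched as follows, assuming the unit-spaced 8-PAM levels -3.5 ... +3.5 (so Y = 3.5 and the constellation period is 2Y = 7); this reproduces both numerical examples above:

```python
Y = 3.5  # maximum PAM symbol for the assumed 8-PAM constellation

def fold(x):
    # replicate the constellation: map x into [-Y, Y)
    return (x + Y) % (2 * Y) - Y

def branch_metric(x, pam_symbol):
    # squared Euclidean distance after folding the interference sample
    d = fold(x) - pam_symbol
    return d * d

branch_metric(2.7, 1.5)   # 2.7 is already in range: (2.7 - 1.5)^2 = 1.44
branch_metric(6.5, 1.5)   # folds 6.5 to -0.5: (-0.5 - 1.5)^2 = 4.0
```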

Figure 3.4: Proposed State Parallel Architecture.

Registers Reg.1 through Reg.N on the left hand side are the storage elements retaining the cumulative metrics from the previous stage, and the ones on the right hand side store the cumulative metrics for the next stage. To accommodate variation in the trellis structure at each stage, we need a programmable interconnection unit and a programmable PAM symbol mapper (because the outputs of the state machine shown in Fig. 3.1 map to a PAM symbol, which is also a function of the data-bits). We also need to store the survivor paths for each state and the data-bits for each stage (these are needed during the trace-back operation). This affects both the forward recursion (computing metrics along the trellis) and the Trace Back Unit (obtaining the final output). To achieve the dynamic interconnect configuration for the forward recursion, we feed the data bits to the input of the switch network. Depending on the data combination, the switch network is configured to realize a particular trellis structure. To compute the branch metrics at a given trellis stage, we use the programmable symbol mapper, a look-up table containing the output symbols of the convolutional encoder for each trellis structure (see Figs. 3.1 and 3.2). The branch metric computation unit (BMU) then uses the interference sample and the mapped PAM symbol to compute the Euclidean distance between them. The Accumulate Compare Select (ACS) unit selects the least cumulative metric and forwards it to the next stage. The survivor path (mapped PAM symbol) is also sent to memory to facilitate the trace-back operation.

Next, we describe the architecture of the trace-back unit (TBU). The TBU computes the output sequence, which in our case is a sequence that is jointly source and channel coded. The process flow for the TBU is shown in Fig. 3.5.

Figure 3.5: Architecture for the Trace-Back Unit.

Starting from the last stage in the trellis, a sorting unit picks the state with the minimum cumulative metric. Data-bits are used to decide the trellis structure at a particular stage. The survivor block holds the survivor path associated with the state in the Current-State block. Given these three quantities, we can uniquely identify the previous state, its associated data-bits, and its survivor path. These quantities can then be used to step back to the previous stage. This continues until we arrive at the origin of the trellis. In the presented implementation we have used the Block RAMs (BRAMs) available in the FPGA for storing the survivors and data-bits.
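The TBU process flow can be sketched in software. The survivor memory below is a toy stand-in that stores, per stage and state, the previous state and the surviving branch's PAM symbol (quantities the real design recovers from the data-bits and a reverse look-up table):

```python
def trace_back(survivors, final_metrics):
    # pick the state with the minimum cumulative metric (the sorting unit)
    state = min(range(len(final_metrics)), key=final_metrics.__getitem__)
    symbols = []
    for stage in reversed(survivors):       # walk from the last stage back
        prev_state, symbol = stage[state]
        symbols.append(symbol)
        state = prev_state
    return list(reversed(symbols))          # output sequence in forward order

# toy example: two stages, four states; entries are (previous state, symbol)
survivors = [
    [(0, -0.5), (0, 1.5), (1, 2.5), (1, -1.5)],   # stage 1
    [(2, 0.5), (3, -2.5), (0, 3.5), (1, -3.5)],   # stage 2
]
metrics = [4.0, 1.0, 2.5, 3.0]                    # state 1 has the minimum
trace_back(survivors, metrics)                    # -> [-1.5, -2.5]
```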


Figure 3.6: Memory organization for trace-back.

Fig. 3.6 shows the memory organization for the proposed state parallel architecture. Each row indicates an independent memory unit, and each column indicates a stage of the trellis. Each memory location (each square block in Fig. 3.6) stores the survivor for a state, which is just one bit in our case, since each state has two incoming branches and only one of them survives. The sorting unit at the end of the trellis returns a state number, which serves as the pointer to that state. Using this pointer we access the location in the last column of memory. The contents of this location, along with the state number, are then fed to the Reverse LUT (RLUT), which uniquely identifies the previous state and the PAM symbol of the survivor branch. This process continues until we arrive at the first column of the memory. The selected sequence of PAM symbols from across the trellis stages is the final output sequence.
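For a fixed shift-register trellis (the static special case of the dynamically reconfigured trellis used here), the one-bit survivor suffices to identify the previous state: the two predecessors of a state share its upper bits and differ only in their most significant bit. K below is an illustrative number of state bits:

```python
K = 5  # number of state bits (illustrative)

def predecessor(state, survivor_bit):
    # For next_state = ((state << 1) | input) mod 2^K, the two states that
    # can lead into `state` agree with state >> 1 in their low K-1 bits and
    # differ only in the MSB; the stored one-bit survivor selects between them.
    return (state >> 1) | (survivor_bit << (K - 1))
```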

The throughput of the architecture also depends on the number of available independent memories, since this determines the parallelism factor (i.e., the number of states processed in a single clock cycle). Let Ns be the number of states at each stage of the trellis and Nd the number of data bits (see Fig. 3.1). The size of the trace-back memory is then Mtb = Ltb (Ns + Nd) bits, where Ltb is the trace-back length. The memory requirement of the RLUT can be computed as follows. The RLUT memory size MRLUT depends on the number of states Ns, the PAM constellation size NPAM, and the number of incoming/outgoing branches R per state, which in turn depends on the number of state inputs: if the number of state inputs is IS, then R = 2^IS. MRLUT is computed as R Ns (log2(Ns) + log2(NPAM)) bits. The total memory for the RLUT is MRLUT 2^Nd, where Nd is the number of data-bits. Note that in our case, since C2 is an RSC, only one data-bit actually influences the trellis, so the total RLUT memory is MRLUT 2^(Nd - 1).

In the presented implementation of DPC we have used two serially concatenated convolutional codes, one of which serves as the channel code. In applications requiring very good bit error rate (BER) performance, turbo codes or convolutional codes with very large constraint lengths are preferable. The encoding procedure and data flow graph for DPC with a turbo code are very similar to those described earlier in this paper. However, good turbo codes have block lengths on the order of a few thousand bits, which means the hardware must accommodate trace-back lengths of a few thousand trellis stages. This requires large amounts of memory, since we store the survivor path for every state at each stage and the data-bits for every stage of the trellis. The memory requirements for a 256-state DPC where C2 is a length-2048 turbo code can be estimated as follows: Ns = 256, Ltb = 2048, NPAM = 8, Nd = 1, and R = 2 (since C2 is an RSC), giving Mtb = 2048 × (256 + 1) = 526.336 Kbits. The total memory for the RLUT is Mtotal-lut = 2 × 256 × (log2(256) + log2(8)) = 5.632 Kbits. The overall memory requirement is therefore about 530 Kbits.
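These sizing formulas can be captured in a short helper; parameter names mirror the text, and the values reproduce the 256-state estimate:

```python
import math

def traceback_mem_bits(L_tb, N_s, N_d):
    # one survivor bit per state per stage, plus the data-bits per stage
    return L_tb * (N_s + N_d)

def rlut_mem_bits(N_s, N_pam, I_s):
    R = 2 ** I_s  # incoming/outgoing branches per state
    return R * N_s * (int(math.log2(N_s)) + int(math.log2(N_pam)))

m_tb = traceback_mem_bits(2048, 256, 1)   # 526336 bits
m_rlut = rlut_mem_bits(256, 8, 1)         # 5632 bits
```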
IV. DISCUSSION OF RESULTS AND FUTURE DIRECTIONS

In this section we present FPGA implementation estimates for the 256-state DPC pre-coder. The following table shows the resource usage on a Xilinx Virtex-2V8000 FPGA.

Figure 3.7: A Reverse LUT.

Fig. 3.7 shows the RLUT. This LUT stores the state connections in reverse, from next state to previous state, along with the PAM symbol associated with each branch emanating from a state. The size of this memory scales directly with the number of states and the trace-back length.

Table 1: FPGA implementation statistics (256 states).
Slices: 18570 (39%)
Block RAM: 8 (4%)
LUT: 33588 (36%)
Critical path: 60 ns
Throughput: 51 Mbps

It should be noted that a critical path of 60 ns means we can trace back approximately 17 million trellis stages per second. But for each


stage we have 3 bits of output (since we are using 8-PAM). Hence, the total throughput is about 51 Mbps.

It is well known that Fano's algorithm and the M/T-algorithms are suboptimal ways to implement trellis decoding [11]. These algorithms reduce the storage requirements; however, their BER performance is not as good as that of the Viterbi algorithm. To the best of our knowledge, the BER performance of DPC schemes using the M-algorithm or Fano's algorithm has not been studied. In the present implementation, the multipliers used in the BMU are expensive both in terms of hardware size and speed. Also, the sorting unit that picks the state with the minimum cumulative metric contributes about 30 ns to the critical path and takes up more than 3000 slices. We are therefore exploring ways to avoid the use of multipliers in the branch metric computation. For instance, computing the branch metric as |x - y| rather than (x - y)^2 eliminates the multipliers in the BMU, but our preliminary simulations show that this degrades the BER performance by about 0.7 dB at a BER of 10^-5. Efforts are being made to narrow this degradation by using other metrics that do not require multipliers. Moreover, we are trying to design a hardware unit that picks the state with the minimum metric more efficiently.

V. CONCLUSION

A DPC pre-coder based on the nested trellis scheme was presented, realized with a state parallel architecture. The design was implemented on a Xilinx Virtex-2V8000 FPGA, and resource usage and performance results were obtained for a 256-state nested trellis DPC. To the best of our knowledge this is the first reported implementation of DPC in the open literature, and hence a comparative study against prior implementations is not possible.

REFERENCES

[1] M. Costa, "Writing on dirty paper," IEEE Trans. Inform. Theory, vol. 29, pp. 439-441, May 1983.
[2] W. Yu, D. Varodayan, and J. Cioffi, "Trellis and convolutional precoding for transmitter-based interference pre-subtraction," IEEE Trans. Communications, vol. 53, pp. 1220-1230, July 2005.
[3] H. Weingarten, Y. Steinberg, and S. Shamai, "Capacity region of the MIMO broadcast channel," IEEE Trans. Inform. Theory, to appear.
[4] U. Erez and S. ten Brink, "A close-to-capacity dirty paper coding scheme," IEEE Trans. Inform. Theory, vol. 51, pp. 3417-3432, October 2005.
[5] A. Bennatan, D. Burshtein, G. Caire, and S. Shamai, "Superposition coding for side-information channels," IEEE Trans. Inform. Theory, vol. 52, pp. 1872-1889, May 2006.
[6] Y. Sun, A. D. Liveris, V. Stankovic, and Z. Xiong, "Near-capacity dirty-paper code designs based on TCQ and IRA codes," Proc. ISIT'05, Adelaide, Australia, September 2005, pp. 184-188.
[7] Y. Sun, M. Uppal, A. Liveris, S. Cheng, V. Stankovic, and Z. Xiong, "Nested turbo codes for the Costa problem," revised for IEEE Trans. Communications, October 2006.
[8] J. Chou, "Channel Coding with Side Information: Theory, Practice and Applications," Ph.D. dissertation, Dept. of Elect. Eng., University of California at Berkeley, Berkeley, CA, 2002.
[9] R. Zamir, S. Shamai, and U. Erez, "Nested linear/lattice codes for structured multiterminal binning," IEEE Trans. Inform. Theory, vol. 48, pp. 1250-1276, June 2002.
[10] G. Ungerboeck, "Channel coding with multilevel/phase signals," IEEE Trans. Inform. Theory, vol. 28, no. 1, pp. 55-67, Jan. 1982.
[11] P. A. Bengough and S. J. Simmons, "Sorting-based VLSI architectures for the M-algorithm and T-algorithm trellis decoders," IEEE Trans. Communications, vol. 43, no. 2/3/4, pp. 514-522, Feb./Mar./Apr. 1995.
[12] Y. F. Zhang and P. Csillag, "Parallel architecture for high-speed Viterbi decoding of convolutional codes," Electronics Letters, vol. 25, no. 14, pp. 887-888, July 1989.
