
A Low Cost Multi-standard Near-Optimal Soft-Output Sphere Decoder: Algorithm and Architecture

Özgün Paker
ST Ericsson High Tech Campus 41 Eindhoven, The Netherlands Email: ozgun.paker@stericsson.com

Sebastian Eckert and Andreas Bury


Blue Wonder Communications GmbH, Am Waldschlösschen 1, D-01099 Dresden, Germany. Email: {sebastian.eckert,andreas.bury}@bluwo.com

Abstract: We present an algorithm and architecture of a soft-output sphere decoder with an optimized hardware implementation for 2x2 MIMO-OFDM reception. We introduce a novel table look-up approach for symbol enumeration that simplifies the implementation of soft-output decoders. The HW implementation is targeted towards WLAN (IEEE 802.11n) with stringent latency and throughput requirements. The current implementation supports all modulation schemes (BPSK, QPSK, 16-QAM, 64-QAM) and shows near-optimal real-time performance. To achieve this, the sphere decoder computes, in the worst case, the Euclidean distances of 4.1 Giga QAM symbols per second. This challenging requirement is met by a scalable, multi-standard HW architecture which can be tuned to other applications such as LTE and WiMax with no re-design effort. The current instance for WLAN occupies an area of only 0.17 mm² in 45 nm CMOS technology while providing a guaranteed throughput of 374 Msoftbits/s at a 312 MHz clock rate (i.e. outputting 2 × 6 softbits every 10 clock cycles in the worst case).
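The throughput figure in the abstract can be checked with a few lines of arithmetic. The sketch below assumes that the worst-case output is 2 × 6 = 12 softbits (two spatial streams, six bits per 64-QAM symbol) every 10 clock cycles; the function and constant names are illustrative only.

```python
# Figures quoted in the abstract; the reading of the worst-case output as
# 2 streams x 6 bits = 12 softbits per 10 cycles is an assumption.
CLOCK_HZ = 312e6
SOFTBITS_PER_BURST = 2 * 6
CYCLES_PER_BURST = 10


def guaranteed_throughput(clock_hz, softbits, cycles):
    """Worst-case softbits per second for a fixed softbits-per-cycles rate."""
    return clock_hz * softbits / cycles


throughput = guaranteed_throughput(CLOCK_HZ, SOFTBITS_PER_BURST, CYCLES_PER_BURST)
print(f"{throughput / 1e6:.1f} Msoftbits/s")  # prints 374.4 Msoftbits/s
```

Under this reading the result agrees with the quoted 374 Msoftbits/s.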

[Figure 1 block diagram. Blocks: Synch. and FFT (two instances each), Channel Estimation, QR decomposition, Sphere Search, Space-Frequency De-Interleaver, FEC Decoder.]

Fig. 1. A 2x2 MIMO-OFDM WLAN receiver employing a sphere decoder for equalization and bit detection.

I. INTRODUCTION

While MIMO transmission offers increased capacity and link robustness [1], a key challenge in the receiver design is to separate the interference caused by multiple transmit antennas. A topic that attracts a lot of attention in both industry and academia revolves around this challenge: MIMO bit detection. There are a variety of techniques, ranging from linear schemes (zero-forcing/MMSE) to non-linear approaches (sphere decoding [2]), involving a tradeoff between complexity and BER performance. Linear detection schemes (e.g. MMSE) can even be mapped onto software for real-time processing [3], though on a powerful programmable vector processor. Although a software implementation is attractive (i.e. reduced risk, fast time-to-market), such approaches suffer from poor BER performance. Owing to its near-optimal detection performance, the efficient HW implementation of sphere decoding has received significant attention in the literature [4]–[7]. Sphere decoding achieves the BER performance of optimal maximum-likelihood (ML) decoders at a much lower complexity, essentially because of the reduced search space.

Though the original algorithm is tuned to finding the most likely transmitted symbol vector (i.e. hard decision), we are interested in an algorithm that generates soft outputs to fully exploit the error correction capabilities of state-of-the-art channel decoders. In this paper we propose such a variant that uses
1. multiple search trees, i.e. one for each spatial dimension, similar to the approach in [8] but further improving the search efficiency by visiting fewer nodes,
2. a novel look-up table for enumerating QAM symbols without explicitly sorting and/or calculating the associated partial Euclidean distances as done in [4]–[6].

We also present an HW architecture that is flexible and scalable in terms of compute power. This enables addressing different applications with differing requirements such as WLAN (IEEE 802.11n), LTE and WiMax (IEEE 802.16e). For this paper, we present results for IEEE 802.11n, as WLAN has challenging data rate and latency requirements to be met.

Fig 1 shows a 2x2 MIMO-OFDM receiver targeting WLAN. The task of the receiver as shown is
1. to synchronize to the received signal,
2. to demodulate the signal (FFT) and estimate the channel,
3. to equalize,
4. to generate soft outputs (log-likelihood ratios) for the channel decoder (bit detection), and
5. to de-interleave and decode the coded bitstream by a channel decoder such as Viterbi, Turbo etc.

The role of the sphere decoder in MIMO reception is to replace the tasks of equalization and bit detection as shown in the figure. In WLAN, a key timing constraint is the SIFS (Short Inter Frame Spacing). The receiver has to acknowledge the reception of a packet after the so-called SIFS interval, otherwise the transmitter assumes the packet is lost and re-transmits the same

978-3-9810801-6-2/DATE10 2010 EDAA

packet leading to a significant drop in throughput. The sphere decoder has to meet both the SIFS (latency) constraint and the throughput constraint (i.e. process an OFDM symbol with 52 sub-carriers in 4 µs). The SIFS constraint allowed our decoder only a 2 µs latency budget, which amounts to the processing of 4.1 Giga QAM symbols per second (4 levels/subcarrier × 40 QAM symbols/level/subcarrier × 52 subcarriers in 2 µs).

Our Contributions: Similar to [8], we also employ multiple search trees. We form the search trees by swapping the last column of the channel matrix H with its other columns, to make sure each spatial stream ends up at the top level of a search tree after QR decomposition (explained in section II-D). But unlike [8], we do not process all the QAM symbols of the top level for each search tree. Instead we opt for a depth-first search scheme that prunes the tree earlier. We observed through simulations that to guarantee near-ML BER performance, we only need to process a maximum of 40 QAM symbols (in contrast to 64 in [8]) for a 64-QAM constellation, reducing the search complexity by 37.5%. We also present a novel table look-up approach that simplifies the implementation significantly. The use of the look-up table (LUT) eliminates the need to sort the QAM symbols by explicitly calculating partial Euclidean distances. This leads to a reduced critical path in the implementation of the depth-first search scheme. The LUT does not perform an exact Schnorr-Euchner ordering [9], but this has negligible impact on the BER performance for soft-output sphere decoding. To the best of our knowledge, this has not been reported in the literature so far. Our final contribution is a multi-standard architecture that can also address other wireless applications with no re-design effort.
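The workload and complexity figures quoted above follow from simple arithmetic, sketched below. Reading the "4 levels per subcarrier" as 2 search trees × 2 tree levels for the 2x2 case is an assumption, and the constant names are illustrative.

```python
# Numbers taken from the text; the decomposition of "4 levels" into
# 2 trees x 2 levels is an assumption for the 2x2 case.
LEVELS_PER_SUBCARRIER = 4
SYMBOLS_PER_LEVEL = 40        # worst-case QAM symbols processed per level
SUBCARRIERS = 52
LATENCY_BUDGET_S = 2e-6       # 2 microseconds allowed by the SIFS constraint

symbols_per_ofdm_symbol = LEVELS_PER_SUBCARRIER * SYMBOLS_PER_LEVEL * SUBCARRIERS
rate = symbols_per_ofdm_symbol / LATENCY_BUDGET_S

# Saving from visiting at most 40 instead of 64 top-level symbols.
saving = 1 - 40 / 64

print(symbols_per_ofdm_symbol)            # prints 8320
print(f"{rate / 1e9:.2f} G symbols/s")    # prints 4.16 G symbols/s
print(f"{saving:.1%}")                    # prints 37.5%
```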
The paper is organized as follows: Section II presents background on sphere decoding and our algorithm, Section III explains the HW architecture, Section IV benchmarks our approach against others, and Section V concludes the paper.

II. THE ALGORITHM

A. MIMO Transmission

Let us consider a MIMO system with $N_t$ transmit and $N_r$ receive antennas. The coded bit-stream is mapped to a symbol vector of size $N_t \times 1$ denoted by $s$. Each symbol $s_i$ in the vector is selected from a complex signal constellation $C$, where the number of constellation points is given by $|C| = 2^M$. Here $M$ represents the number of bits associated with a single symbol; for instance, $M = 6$ for 64-QAM. We denote by $x$ the transmitted bit pattern, and by $x_{i,q}$ the $q$th bit of the $i$th symbol in $s = [s_1, s_2, \ldots, s_{N_t}]^T$. The MIMO transmission in the baseband can be given by

$$ y = Hs + n \tag{1} $$

B. Hard-decision sphere decoding

In its original form, the sphere decoding algorithm can be used to find the transmitted symbol vector with the smallest Euclidean distance to the received signal $y$. This is called the maximum likelihood (ML) solution and is given by:

$$ s^{ML} = \arg\min_{s \in C^{N_t}} \| y - Hs \|^2 \tag{2} $$

where the bit pattern comprising the $s^{ML}$ vector represents the hard decisions for each bit. Calculating the ML solution using a brute-force approach such as trying out all possible QAM symbols is not practical from an implementation point of view. The classical sphere decoder, on the other hand, performs a limited search around the received signal. The channel matrix $H$ in (2) can be decomposed further using QR factorization, and the equivalent ML estimate can be represented as:

$$ s^{ML} = \arg\min_{s \in C^{N_t}} \| Q^H y - Rs \|^2 \tag{3} $$

where $Q$ is a unitary matrix and $R$ is an upper-triangular matrix whose diagonal elements are real. Using the upper-triangular nature of $R$, the decoding of the symbols starts from the last row. An associated partial Euclidean distance (PED) $d_i$ can be calculated for symbol $s_i$ at each row during the search process and is given as:

$$ d_i = d_{i+1} + |e_i|^2, \qquad e_i = \tilde{y}_i - \sum_{j=i}^{N_t} R_{i,j} s_j \tag{4} $$

In (1), $H$ denotes the $N_r \times N_t$ complex channel matrix and $n$ is an i.i.d. zero-mean complex Gaussian noise vector of size $N_r \times 1$ with variance $N_o$ per complex entry.

In (4), $i = N_t, N_t - 1, \ldots, 1$ and $\tilde{y} = Q^H y$, with $\tilde{y}_i$ being the $i$th row of $\tilde{y}$. Once the PED for a symbol is below a certain radius $r$ defined during the initialization phase, i.e. $d_i \le r$ (the so-called sphere constraint), the search progresses to an upper row. In case the PED is larger than the radius, the current symbol is discarded, and if there are no other symbols left to try at the current level, the search backtracks, i.e. drops to a lower row. This process is called depth-first search and is performed on the tree shown in Fig 2.a. For $N_t = 3$, $s_3$ corresponds to the top level of the tree, which is connected to an imaginary root node. When the search backtracks to the root node, the search process is finished. When the search reaches one of the leaves of the tree, the Euclidean distance for the corresponding path is evaluated and the radius is shrunk accordingly.

Another important aspect is the ordering of symbols at each level. If the symbols are ordered by increasing PED (Schnorr-Euchner enumeration [9]), the search can be stopped for that level when a symbol that violates the sphere constraint (i.e. a symbol with a PED larger than the current radius) is reached. There is no point in continuing the search with the other symbols at that level, as none will result in a smaller PED than the current symbol. For this reason, the bottom level of the tree in Fig 2.a consists of a single leaf node per path; the other nodes at that level are pruned away. We adopt depth-first search combined with radius shrinking and symbol ordering as this combination leads to fast pruning

for MIMO bit detection. Rewriting the LLRs of (5) using $\lambda^{ML}$, $\bar{\lambda}^{ML}_{i,q}$ and $s^{ML}$, we get:

$$ L(x_{i,q}) = \begin{cases} \frac{1}{2}\left(\lambda^{ML} - \bar{\lambda}^{ML}_{i,q}\right), & x^{ML}_{i,q} = 0 \\ \frac{1}{2}\left(\bar{\lambda}^{ML}_{i,q} - \lambda^{ML}\right), & x^{ML}_{i,q} = 1 \end{cases} \tag{7} $$
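As a minimal sketch of how (7) turns the search results into soft outputs, the function below takes the ML distance, the ML hard decisions and the counter-hypothesis distances for a set of bits. The function name, argument layout and example numbers are hypothetical, and the factor 1/2 is assumed to be the one appearing in (5).

```python
def llrs_from_distances(ml_bits, lambda_ml, lambda_bar):
    """Max-log LLRs following the case split of (7).

    ml_bits[k]    : ML hard decision (0 or 1) for bit k
    lambda_ml     : distance of the ML solution
    lambda_bar[k] : smallest distance found with bit k flipped
    """
    llrs = []
    for bit, lam_bar in zip(ml_bits, lambda_bar):
        half_gap = 0.5 * (lambda_ml - lam_bar)   # never positive, since lam_bar >= lambda_ml
        llrs.append(half_gap if bit == 0 else -half_gap)
    return llrs


# Illustrative numbers only.
example = llrs_from_distances([0, 1, 1], 1.0, [3.0, 1.5, 5.0])
print(example)  # prints [-1.0, 0.25, 2.0]
```

The sign carries the hard decision (negative for an ML bit of 0 under this convention) and the magnitude grows with the gap between the two hypotheses.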

[Fig. 2(a): tree representing the initial column order of H; tree levels from the root: s3, s2, s1.]

D. Our Soft-output Sphere Decoder

Multiple Search Threads: A straightforward approach to calculating $\bar{\lambda}^{ML}_{i,q}$ requires that, for each bit in the counter hypothesis, one performs a separate search using the same tree. Calculation of $\lambda^{ML}$ requires a single search to find the $s^{ML}$ solution. Therefore, in total $N_t M + 1$ search threads would suffice to find all the log-likelihood ratios. This would be cumbersome, as such a repetitive search strategy means processing the same tree nodes multiple times in different search threads [11]. We avoid this to a large extent by searching for all minimum distances (including those of the counter hypothesis bits) for the bits of $s_i$ in parallel, reducing the number of search threads to $N_t$. Fig 2 shows the three search threads required for a 3x3 MIMO receiver. Each search thread is mainly responsible (i.e. associated with a stopping criterion) for finding the minimum distances for the bits of the $s_i$ which is at the top level of its tree (e.g. $s_3$ in Fig 2.a, $s_2$ in Fig 2.b etc.).

If we were to search for the minimum distances of the bits of $s_2$ in the tree shown in Fig 2.a, we would have to visit all four subtrees (a subtree here is formed by assuming any symbol from $s_3$ as the root). For example, let us assume we are searching for the minimum distance where the first bit of $s_2$ has a logical value of 0. We need to investigate all the paths from the root to the leaves where $s_2$ is 00 or 01. Even though the symbols of $s_2$ within a subtree are ordered, there is no order that can be derived for the symbols of $s_2$ (denoted with the blue color) across subtrees without explicitly calculating the total Euclidean distances for all the paths involving the 00 and 01 nodes of $s_2$. This means we cannot properly make use of the efficient depth-first search scheme with early stopping. However, if we use the tree in Fig 2.b to search for the same distance value, we only need to investigate two subtrees, where the root symbols are denoted with the blue color.
If we mainly search for the bits of $s_2$ using this tree, we can make use of symbol ordering and early stopping at the top level. The main issue with using the search tree of Fig 2.a for $s_2$ or $s_1$ is the lack of a proper stopping criterion. Furthermore, if we use this tree for $s_1$, there is an additional problem. As there is only a single leaf node per path at the bottom level, the search thread could possibly span the entire tree and still not find a leaf node that assumes the value 00 or 01. That means a wasted search effort with no estimate for a particular distance value. Therefore, to be able to make use of symbol ordering and pruning, the trees are formed in such a way that each spatial dimension $s_i$ is treated at the top level once.
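The tree construction described above can be sketched in plain Python as follows: for each stream, the last column of H is swapped with that stream's column and the result is QR-decomposed, so that the stream lands in the last row of R, i.e. at the top level of its tree. This is only an illustration under stated assumptions: the helper names `mgs_qr` and `search_trees` are hypothetical, the modified Gram-Schmidt routine is a textbook version rather than the paper's datapath, and matrices are given as lists of rows.

```python
def mgs_qr(H):
    """Modified Gram-Schmidt QR. Returns Q as a list of columns and R as rows."""
    rows, cols = len(H), len(H[0])
    V = [[H[r][c] for r in range(rows)] for c in range(cols)]  # working columns
    Q = []
    R = [[0j] * cols for _ in range(cols)]
    for i in range(cols):
        norm = sum(abs(x) ** 2 for x in V[i]) ** 0.5
        R[i][i] = complex(norm)                 # real diagonal entry
        q = [x / norm for x in V[i]]
        Q.append(q)
        for j in range(i + 1, cols):
            R[i][j] = sum(qc.conjugate() * v for qc, v in zip(q, V[j]))
            V[j] = [v - R[i][j] * qc for v, qc in zip(V[j], q)]
    return Q, R


def search_trees(H):
    """One (column order, Q, R) triple per stream; order[-1] is the top-level stream."""
    n = len(H[0])
    trees = []
    for k in range(n - 1, -1, -1):
        order = list(range(n))
        order[k], order[n - 1] = order[n - 1], order[k]   # swap column k with the last
        Hk = [[row[c] for c in order] for row in H]
        Q, R = mgs_qr(Hk)
        trees.append((order, Q, R))
    return trees


# Arbitrary 2x2 channel used only for illustration.
H = [[1 + 1j, 0.5 - 0.2j], [0.3j, 2 + 0j]]
trees = search_trees(H)
print([order[-1] for order, _, _ in trees])  # prints [1, 0]
```

For a 3x3 channel the same routine yields the column orders (s1, s2, s3), (s1, s3, s2) and (s3, s2, s1), which put s3, s2 and s1 at the top level respectively, matching the three trees of Fig 2.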

[Fig. 2(b): tree generated by swapping columns of H (for s2 and s3); tree levels from the root: s2, s3, s1.]

[Fig. 2(c): tree generated by swapping columns of H (for s1 and s3); tree levels from the root: s1, s2, s3.]

Fig. 2. Sphere search trees for a 3x3 MIMO receiver using QPSK.

of the tree and limits the number of visited tree nodes significantly [10]. Another search scheme employs breadth-first search (also known as K-best decoding), which selects the K best possible paths at each level of the search [5], [6].

C. Computation of Log-likelihood ratios

To limit complexity, similar to [11], we employ the max-log approximation as given in:

$$ L(x_{i,q}) = \frac{1}{2} \left( \min_{s \in C^0_{i,q}} \| y - Hs \|^2 - \min_{s \in C^1_{i,q}} \| y - Hs \|^2 \right) \tag{5} $$

where $C^0_{i,q}$ represents the set of transmitted symbol vectors whose $i$th symbol has the value 0 at bit position $q$. Similarly, $C^1_{i,q}$ represents the set of symbol vectors whose $i$th symbol has the value 1 at bit position $q$. By definition these are disjoint sets. For each bit, one of the two minima in (5) is given by $\lambda^{ML} = \| y - Hs^{ML} \|^2$, where $s^{ML}$ represents the ML solution which, as seen in the previous section, can be calculated by a sphere decoder. The other minima in (5) are given by

$$ \bar{\lambda}^{ML}_{i,q} = \min_{s \in C^{\overline{x^{ML}_{i,q}}}_{i,q}} \| y - Hs \|^2 \tag{6} $$

where $\overline{x^{ML}_{i,q}}$, also called the counter hypothesis, denotes the binary complement of the $q$th bit in the $i$th symbol of $s^{ML}$. Looking at (5), we can conclude that the calculation of $\lambda^{ML}$, $\bar{\lambda}^{ML}_{i,q}$ and $s^{ML}$ is sufficient to achieve the max-log approximation

We form the search trees by swapping the last column of the channel matrix with its other columns. The question is how to order the columns of the channel matrix in the first place. This so-called layer ordering problem has been investigated previously [12] and thus will not be dealt with here. Note, however, that for the 2x2 sphere decoder presented in this paper, layer ordering is not an issue as the two search trees cover all possible orders.

We define for each $s_i$ multiple radius registers representing the minimum distances for each bit value. As we need to calculate two distances (for bit values 0 and 1) for each bit, there are $2M$ distances per $s_i$ to be determined. A meaningful stopping criterion is then to stop the search thread for $s_i$ when all relevant distance values are minimized (see the section on LLR clipping and the soft-decision search constraint for a more elaborate discussion). In order to further limit the search space, we allow different search threads to interact by modifying a common distance table, which is simply the union of all radius registers. The total size of the common distance table is $2 N_t M$. If a search thread for $s_i$ finds a smaller distance for a bit $q = 1 \ldots M$ belonging to the search thread for $s_k$, where $k = 1 \ldots N_t$, $k \neq i$, it simply updates the distance table with its value, which helps the thread for $s_k$ to prune its tree earlier if needed.

LLR clipping and soft-decision search constraint: Similar to [11], we also clip the LLR at run-time by introducing a soft-decision sphere constraint. A search thread backtracks when the accumulated PED given in (4) violates this constraint, i.e. exceeds the sum of the current minimum distance $r_{min}$ taken from the common distance table and the maximum LLR, maxLLR:

$$ d_i \le r_{min} + \mathrm{maxLLR} \tag{8} $$
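The common distance table and the constraint (8) can be sketched as below. This is an illustration under stated assumptions, not the register-level design: the class and function names are hypothetical, the table is indexed as stream, bit position, bit value, and entries that have not been found yet are represented by infinity.

```python
class DistanceTable:
    """Common distance table: one entry per stream, bit position and bit value."""

    def __init__(self, n_streams, bits_per_symbol):
        inf = float("inf")
        self.d = [[[inf, inf] for _ in range(bits_per_symbol)]
                  for _ in range(n_streams)]

    def size(self):
        """Number of entries, i.e. 2 * N_t * M."""
        return sum(len(pair) for stream in self.d for pair in stream)

    def r_min(self):
        """Smallest distance currently stored in the table."""
        return min(v for stream in self.d for pair in stream for v in pair)

    def update(self, leaf_bits, distance):
        """Any thread reaching a leaf may lower the entries of every stream.

        leaf_bits[i][q] is the value of bit q of symbol s_i on the leaf path.
        """
        for i, bits in enumerate(leaf_bits):
            for q, bit in enumerate(bits):
                if distance < self.d[i][q][bit]:
                    self.d[i][q][bit] = distance


def passes_soft_constraint(ped, table, max_llr):
    """Constraint (8): keep extending a path while d_i <= r_min + maxLLR."""
    return ped <= table.r_min() + max_llr


# Illustrative use for 2x2 64-QAM (N_t = 2, M = 6) with made-up leaf paths.
table = DistanceTable(n_streams=2, bits_per_symbol=6)
table.update([[0, 1, 0, 0, 1, 1], [1, 1, 0, 1, 0, 0]], 3.2)
table.update([[0, 1, 1, 0, 1, 1], [1, 1, 0, 1, 0, 0]], 2.5)
print(table.size(), table.r_min())               # prints 24 2.5
print(passes_soft_constraint(4.0, table, 2.0))   # prints True
print(passes_soft_constraint(5.0, table, 2.0))   # prints False
```

Because every thread writes into the same table, a short path found in one tree immediately tightens the bound used by the other tree.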

sphere decoding, calculating an exact order is not necessary for near-optimal performance. To determine an order for the symbols at level $i$, we first calculate $b_i$, which together with $R_{ii}$ determines the distance increment $e_i$ to the current PED, rewriting (4) as:

$$ e_i = b_i - R_{ii} s_i, \qquad \text{where} \qquad b_i = \tilde{y}_i - \sum_{j=i+1}^{N_t} R_{i,j} s_j \tag{9} $$
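One way to picture a table-driven enumeration built on (9) is sketched below: the point b_i / R_ii is sliced to the closest constellation symbol, and a precomputed table then lists all symbols by their distance to that closest symbol, with ties resolved arbitrarily. The table organization (one full list per possible closest symbol), the unnormalized 64-QAM levels and all names are assumptions made for illustration; the organization of the actual hardware LUT is not specified here.

```python
LEVELS = (-7, -5, -3, -1, 1, 3, 5, 7)   # unnormalized 64-QAM amplitudes per axis
CONSTELLATION = [complex(re, im) for re in LEVELS for im in LEVELS]


def build_lut(constellation):
    """For every possible closest symbol, list all symbols by distance to it.

    sorted() is stable, so equidistant symbols keep their constellation order,
    which is one arbitrary way of breaking ties.
    """
    def sq_dist(a, b):
        return (a.real - b.real) ** 2 + (a.imag - b.imag) ** 2

    return {centre: sorted(constellation, key=lambda s: sq_dist(s, centre))
            for centre in constellation}


def slice_axis(x):
    """Round one coordinate to the nearest odd level, clamped to [-7, 7]."""
    level = 2 * round((x - 1) / 2) + 1
    return max(-7, min(7, level))


def enumerate_symbols(b_i, r_ii, lut, limit=None):
    """Candidate order for one tree level: closest symbol first, then the LUT order."""
    z = b_i / r_ii
    closest = complex(slice_axis(z.real), slice_axis(z.imag))
    return lut[closest][:limit]


lut = build_lut(CONSTELLATION)
order = enumerate_symbols(complex(2.6, -0.7) * 1.5, 1.5, lut, limit=40)
print(order[0], len(order))   # prints (3-1j) 40
```

The first candidate is the exact closest symbol, while later candidates are ordered around that symbol rather than around b_i / R_ii itself, so the order only approximates an exact PED-based sort and requires no distance calculation or sorting at run time.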

Finding the symbol $s_i$ that leads to the minimum distance increment then amounts to rounding $b_i / R_{ii}$ to the closest QAM symbol in the constellation. Consequently, we enumerate the remaining modulation symbols according to their Euclidean distance with respect to the closest symbol. When two or more symbols have the same distance to the closest symbol, an almost arbitrary selection can be made, hence yielding the imperfect order mentioned earlier.

III. HARDWARE ARCHITECTURE FOR 2X2 MIMO

The overall architecture is shown in Fig 3. The channel $H$ and the demodulated received signal $y$ are fed to the pre-processing block, which performs two QR decompositions using the modified Gram-Schmidt algorithm. The QR block essentially generates two search trees as explained in Fig 2. To save computational effort, we maximized the re-use of intermediate variables in the algorithm. The QR block makes use of the high concurrency available in the input stream in order to fully utilize the HW by employing pipelined interleaving. This is possible due to the OFDM symbol structure, which consists of 52 independent subcarriers that can be processed in parallel. The datapath resembles a VLIW-like architecture with 12 multiply-accumulate and 4 multiply units and a 31x16-bit register file. The control is based on a tiny microprogram that implements the relevant state machine for sequencing the arithmetic operations.

We also investigated whether programming this part on a powerful vector DSP such as EVP [13] could save silicon area. As the EVP was already a part of the baseband architecture, simply mapping as many tasks as possible onto it is a logical architecting goal. Even though there were available clock cycles on the EVP, we realized that due to the block processing capability (SIMD), a SW mapping would imply an increased latency for the QR part of the decoder and hence a reduced latency budget for the search threads.
We realized that the area increase in the search part (due to the reduction of the latency budget) outweighs the area saving of the SW implementation by roughly 80 Kgates. In other words, we save silicon area by implementing the QR block in HW, which in hindsight is a counter-intuitive optimization.

The other fundamental block that is important to understand is the thread unit (TU). A pair of TUs processes two search threads for a single subcarrier in parallel and outputs softbits, which are collected by the rightmost block of Fig 3. Due to the variable execution time, this block synchronizes with the output of each TU and makes sure the results appear in order at the output of the decoder. All the TUs share the

Note that $r_{min}$ becomes equal to $\lambda^{ML}$ when the ML solution is found by one of the search threads. Using this constraint and allowing the search to run long enough, the distance table would hold all the distances: the $\lambda^{ML}$ value for the bits corresponding to the ML solution, and either $\bar{\lambda}^{ML}_{i,q}$ or $\lambda^{ML} + \mathrm{maxLLR}$ for the counter hypothesis bits, depending on whether or not a minimum distance is found for the particular counter hypothesis bit. To be able to guarantee a desired BER performance as well as a predictable real-time execution time, we investigated an upper bound for the search effort in terms of the number of nodes to be visited (40 QAM symbols at the top level for a 2x2 64-QAM constellation).

Novel Table Lookup Approach: Contrary to the current approaches [4]–[6], where an exact ordering of QAM symbols at each level of the tree is done based on the actual calculation of PEDs, we employ a simple and unique look-up table that avoids all the complexity related to PED calculation and sorting. The simplicity is due to the fact that we calculate an order that is not exact but a close-enough approximation to the Schnorr-Euchner order. This is not a problem: we observed through extensive simulations that for soft-decision

[Fig. 3 block diagram. Labels: inputs H and r, QR Decomp., {R, y}, Allocate, Thread unit (several instances), LUT, Collect, Softbits.]

Fig. 3. Hardware architecture of the 2x2 soft-output sphere decoder.

[Fig. 4 plot. Title: Simulation Mode I: FER vs. SNR, 2x2 64-QAM, IEEE channel model D. Horizontal axis: SNR in dB (22 to 40). Vertical axis: FER (logarithmic). Curves: maxLog MAP, SSD and MMSE, each for r=5/6 and r=2/3. Annotated gaps: 3.2 dB and 5.2 dB.]

Fig. 4. FER performance comparison for 802.11n 2x2 MIMO 64-QAM.

look-up table (LUT), which is shown at the top of the figure. Initially, each TU requests a read from this memory. A TDMA approach is used for arbitration, i.e. each TU requests a read every clock cycle until its turn for the subsequent access to the LUT. Each read, however, retrieves N QAM symbols to enable LUT-access-free internal processing for up to N clock cycles. For efficient utilization of the TUs and the memory bandwidth, the number of TUs should be equal to N. For WLAN, we chose N = 8 to meet the tight latency requirement.

Note that as two thread units cooperate in the processing of a single subcarrier (refer to the common distance update explained in section II-D), we decided not to implement pipelined interleaving in this block. The amount of state that needs to be replicated is quite large, and the benefits of HW sharing are outweighed by the increase in the area of a single TU and the associated complexity of managing pipelined interleaving. This means that two thread units work on the two search threads associated with a single subcarrier and only take on a new subcarrier when both threads finish their search. This simplifies the design. The allocate unit dispatches a new subcarrier in a circular order to the available TUs. To get a deterministic and simple data flow, we force an in-order dispatch in the allocate unit, similar to the collect unit. The idle TUs that are waiting for their turn simply go into a low-power mode enabled by clock gating. We guarantee worst-case throughput, and if the average-case workload is lower, we achieve a smaller power consumption thanks to clock gating.

IV. RESULTS AND DISCUSSION

Fig 4 shows simulation results of our soft sphere decoder (SSD) for IEEE 802.11n 2x2 64-QAM (channel model D, 20 MHz bandwidth) for 2 different coding rates and compares it with the optimal max-log MAP and MMSE. We limited the search to 160 tree nodes for a single subcarrier, to be able to guarantee a worst-case throughput for an implementation.
The SSD shows almost identical FER performance to a max-log-MAP decoder and outperforms MMSE by 5.2 dB at a code rate of r=5/6. The TUs occupy roughly 69% of the overall area. Our unique table look-up approach costs only 5%, but also enables us to avoid the area and throughput inefficiencies resulting from the symbol ordering and sorting circuitry heavily used in K-best approaches. For instance, in [14], the sorting circuitry costs as much as 37%. The QR decomposition block costs 15%, allowing us the HW/SW tradeoff mentioned earlier.

Benchmarking the existing soft-decision sphere decoders is not easy. Not only are there many variations of the algorithm, but the specifications (MIMO setup, modulation scheme, inclusion/exclusion of pre-processing circuitry, target latency, throughput etc.) also differ significantly. To make as fair a comparison as possible, we use computational density as a figure of merit. As sphere decoders are essentially tree-search algorithms, silicon efficiency can be measured by computing how many QAM symbols a particular design could process given the same amount of silicon area and execution time. In Table I, we provide such a metric. To be fair to variable-complexity decoders, we consider the maximum number of nodes each design could process when performing near-optimal detection. For comparison, all delays and areas are normalized to 45 nm CMOS technology.

It is stated in [4] that a search for a 4x4 16-QAM received vector is limited to 256 steps. Each step equals the processing of up to 3 valid candidate QAM symbols without affecting the throughput. This amounts to 768 nodes visited per received vector in 0.417 µs. Table I shows our design is a factor of two more efficient than [4]. We believe the difference is due to sorting the paths at runtime, which impacts area and throughput in [4]. The K-best approach in [5] extends 20 paths at each level of the tree, where processing a path for two consecutive levels corresponds to processing a complex QAM symbol.
It takes 60 cycles to process two levels at a clock rate of 200 MHz (i.e.

Design     MIMO setup   Modulation   CMOS (nm)   Delay (µs)   Area (mm²)   GNodes/s/mm²
Proposed   2x2          64-QAM       45          0.038        0.17         24.7
[4]        4x4          16-QAM       180         0.417        10           11.6
[5]        4x4          16-QAM       130         0.3          0.56         2.8
[6]        8x8          QPSK         130         0.278        12           2.8

TABLE I. COMPARISON WITH SOFT-DECISION SPHERE DECODERS.
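The computational-density entry for the proposed design in Table I can be reproduced from numbers given in the text, as sketched below. The sketch assumes that the 160-node search limit per subcarrier is the node count behind that entry; the function name is illustrative. No technology normalization is attempted for the other designs, whose densities are taken directly from the table.

```python
def density_gnodes_per_s_per_mm2(nodes, delay_us, area_mm2):
    """Nodes processed per second and per mm^2, expressed in GNodes/s/mm^2."""
    return nodes / (delay_us * 1e-6) / area_mm2 / 1e9


# Proposed design: 160 nodes per subcarrier (assumed), 0.038 us, 0.17 mm^2.
proposed = density_gnodes_per_s_per_mm2(nodes=160, delay_us=0.038, area_mm2=0.17)
print(f"{proposed:.2f}")   # prints 24.77

# Ratios between the densities listed in Table I.
print(f"{24.7 / 11.6:.1f}")   # prints 2.1
print(f"{24.7 / 2.8:.1f}")    # prints 8.8
```

The first value is in line with the 24.7 GNodes/s/mm² listed for the proposed design, and the two ratios correspond to the "factor of two" and "almost 9 times" comparisons made in the text.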

20 symbols in 0.3 µs). After normalizing the area and delay, we observe that the design in [5] achieves a computational density of 2.8 GNodes/s/mm². The efficiency of our sphere decoder is almost 9 times higher than that of [5]. Note that [5] does not include preprocessing area overhead as the other designs in Table I. Though highly pipelined and efficient, [5] suffers from the overhead of runtime ordering of extendible paths. The authors of [5] report that 20 cycles out of 30 are consumed for sorting in the pipeline stages.

Similarly, the K-best approach in [6] extends 16 paths out of 64 after the second level of their 8x8 complex QPSK symbol based search tree. For the first 2 levels, they process 4 and 16 nodes. This amounts to 404 nodes to be processed per received vector at a rate of 3.6 MHz. Compared to our design, the difference is roughly a factor of 9. Like [5], the design of [6] suffers from the overhead of sorting. To be fair to [6], the preprocessing for an 8x8 configuration is also more complex and area consuming than ours, though we could not quantify this from [6] due to the lack of area breakdown figures.

The soft-output sphere decoder design of [11] achieves roughly 1 node visit per clock cycle at a clock rate of 71 MHz, resulting in 6.1 GNodes/s/mm². Though they do not implement the preprocessing circuitry, the price paid in [11] is the long critical path, which involves an exact symbol ordering mechanism based on PED calculations. Note that our LUT avoids such complexity.

Finally, [7] reports 75 Msoftbit/s and an area of 0.092 mm² for a 2x2 64-QAM setup without pre-processing circuitry. For the same CMOS technology and throughput, the design of [7] is 7% bigger than ours (0.15 vs. 0.14 mm²). They approximate the distances using the Manhattan metric (i.e. no multipliers for LLR calculations, slightly trading off performance), suggesting that the difference between the two approaches would actually be bigger than this first-order comparison if we opted for the same distance metric.
Still, we believe the high efficiency of [7] is mainly due to a simple heuristic for symbol enumeration (i.e. LUT-like), which also enables parallelization during the search.

V. CONCLUSION

We presented an algorithm and architecture of a soft-output sphere decoder with an optimized implementation for 2x2 MIMO-OFDM reception. The decoder HW supports all modulation schemes up to 64-QAM and shows near-optimal BER performance in real time. Due to the flexibility of the architecture, we can target other wireless applications (LTE, WiMax) with no re-design effort.

We have shown that the computational density, measured in terms of processed nodes/s/mm², is among the best compared with existing soft-output decoders. This is mainly achieved by a novel LUT that simplifies symbol enumeration significantly by avoiding the use of expensive sorting circuitry. We have also shown that soft-output sphere decoding does not necessarily need exact symbol ordering for near-optimal BER performance. Contrary to the general belief, we have also demonstrated that multiple-tree search schemes can be cost-efficient. Currently we are investigating how the approach scales, from an implementation perspective, to a bigger MIMO setup such as 3x3 and/or 4x4.

ACKNOWLEDGMENT

The authors would like to thank Sebastien Mouy from NXP Semiconductors for the VHDL design and implementation.

REFERENCES
[1] G. Foschini and M. Gans, "On limits of wireless communications in a fading environment when using multiple antennas," Wireless Personal Communications, vol. 6, pp. 311–335, 1998.
[2] B. Hochwald and S. ten Brink, "Achieving near-capacity on a multiple-antenna channel," IEEE Transactions on Communications, vol. 51, no. 3, pp. 389–399, March 2003.
[3] O. Paker, K. van Berkel, and K. Moerman, "Hardware and software implementations of an MMSE equalizer for MIMO-OFDM based WLAN," in IEEE Workshop on Signal Processing Systems Design and Implementation, Nov. 2005, pp. 1–6.
[4] D. Garrett, L. Davis, S. ten Brink, B. Hochwald, and G. Knagge, "Silicon complexity for maximum likelihood MIMO detection using spherical decoding," IEEE Journal of Solid State Circuits, vol. 39, no. 9, pp. 1544–1552, Sept. 2004.
[5] Z. Guo and P. Nilsson, "Algorithm and implementation of the K-best sphere decoding for MIMO detection," IEEE Journal on Selected Areas in Communications, vol. 24, no. 3, pp. 491–503, March 2006.
[6] G. Knagge, M. Bickerstaff, B. Ninness, S. Weller, and G. Woodward, "A VLSI 8×8 MIMO Near-ML Decoder Engine," in IEEE Workshop on Signal Processing Systems Design and Implementation, Oct. 2006, pp. 387–392.
[7] R. Fasthuber, M. Li, D. Novo, P. Raghavan, L. V. D. Perre, and F. Catthoor, "Novel energy-efficient scalable soft-output SSFE MIMO detector architectures," in International Symposium on Systems, Architectures, Modeling, and Simulation, July 2009, pp. 165–171.
[8] M. Siti and M. Fitz, "A Novel Soft-Output Layered Orthogonal Lattice Detector for Multiple Antenna Communications," in IEEE International Conference on Communications, June 2006, pp. 1686–1691.
[9] C. P. Schnorr and M. Euchner, "Lattice basis reduction: Improved practical algorithms and solving subset sum problems," Mathematical Programming, vol. 66, pp. 181–191, 1994.
[10] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H. Bölcskei, "VLSI implementation of MIMO detection using the sphere decoding algorithm," IEEE Journal of Solid State Circuits, vol. 40, no. 7, pp. 1566–1577, July 2005.
[11] C. Studer, A. Burg, and H. Bölcskei, "Soft-output sphere decoding: algorithms and VLSI implementation," IEEE Journal on Selected Areas in Communications, vol. 26, no. 2, pp. 290–300, February 2008.
[12] D. W. Waters and J. R. Barry, "The Chase Family of Detection Algorithms for Multiple-Input Multiple-Output Channels," IEEE Transactions on Signal Processing, vol. 56, no. 2, pp. 739–747, Feb. 2008.
[13] K. van Berkel, F. Heinle, P. Meuwissen, K. Moerman, and M. Weiss, "Vector processing as an enabler for Software-defined radio in handheld devices," EURASIP Journal on Applied Signal Processing, pp. 2613–2625, January 2005.
[14] S. Chen, T. Zhang, and Y. Xin, "Relaxed K-Best MIMO Signal Detector Design and VLSI Implementation," IEEE Transactions on VLSI Systems, vol. 15, no. 3, pp. 328–337, March 2007.
