

2, FEBRUARY 2010 411

A 74.8 mW Soft-Output Detector IC for 8×8 Spatial-Multiplexing MIMO Communications
Chun-Hao Liao, To-Ping Wang, and Tzi-Dar Chiueh, Senior Member, IEEE

Abstract—In this paper, VLSI implementation of a configurable, soft-output MIMO detector is presented. The proposed chip can support up to 8×8 64-QAM spatial multiplexing MIMO communications, which surpasses all reported MIMO detector ICs in antenna number and modulation order. Moreover, this chip provides configurable antenna number from 2×2 up to 8×8 and modulation order from QPSK to 64-QAM. Its outputs include bit-wise log likelihood ratios (LLRs) and a candidate list, making it compatible with powerful soft-input channel decoders and iterative decoding system. The MIMO detector adopts a novel sphere decoding algorithm with high decoding efficiency and superior error rate performance, called modified best-first with fast descent (MBF-FD). Moreover, a low-power pipelined quad-dual-heap (quad-DEAP) circuit for efficient node pool management and several circuit techniques are implemented in this chip. When this chip is configured as 4×4 64-QAM and 8×8 64-QAM soft-output MIMO detectors, it achieves average throughputs of 431.8 Mbps and 428.8 Mbps with only 58.2 mW and 74.8 mW respective power consumption and reaches $10^{-5}$ coded bit error rate (BER) at signal-to-noise ratio (SNR) of 24.2 dB and 22.6 dB, respectively.

Index Terms—Multiple-input multiple-output (MIMO) detection, soft-output sphere decoder, VLSI implementation.

Manuscript received February 02, 2009; revised July 06, 2009. Current version published February 05, 2010. This paper was approved by Associate Editor Bevan Baas. This work was supported in part by the National Science Council, Taiwan, R.O.C., under Grant NSC98-2752-M-002-002-PAE and NSC97-2219-E-002-011. The work of Chun-Hao Liao is also partially sponsored by the Institute for Integrated Signal Processing Systems, RWTH Aachen University, Aachen, Germany.
The authors are with the Graduate Institute of Electronics Engineering and the Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan 10617 (e-mail:
Digital Object Identifier 10.1109/JSSC.2009.2037292
0018-9200/$26.00 © 2010 IEEE

I. INTRODUCTION

MULTIPLE-INPUT multiple-output (MIMO) techniques have recently enjoyed high degree of popularity in wireless communications as they significantly enhance spectrum resource utilization [1]. In particular, a MIMO technique called spatial multiplexing can increase the data throughput almost linearly with the number of antennas [2]. Hence, the spatial multiplexing MIMO technique has been adopted in many current wireless communication standards. For example, the IEEE 802.11n wireless LAN standard adopts MIMO configurations with up to 4×4 spatial multiplexing, and the latest IEEE 802.16e mobile WiMAX standard also includes a 4-stream spatial multiplexing mode.

Systems with higher number of antennas are on the horizon. For instance, it was proposed that 8×8 spatial multiplexing may be necessary in the next-generation (4G) mobile communication standard to achieve peak spectrum efficiency [3]. Recently, in an amazing demonstration, 5 Gbps downlink wireless communication is achieved using spatial multiplexing on 12×12 antenna configuration and 100 MHz bandwidth, reaching a record-breaking 50 bps/Hz spectrum efficiency [4]. Moreover, along with the development of the RF front-end circuit technology, short-range data communication through the extremely high frequency (EHF) band is no longer infeasible. In this frequency band, antenna size can be shrunk to several millimeters, making MIMO systems with a large number of antennas practical even for portable devices.

One of the most challenging tasks in MIMO communication systems is data detection at the receiver when spatial multiplexing is applied. Multiple streams of signals, coupled with noise and channel fading, interfere with each other when traveling in space and are received by a plurality of antennas. The optimal detection solution mandates exhaustive search among the entire transmitted signal space and requires complexity that scales exponentially with the number of antennas. To reduce the search complexity, sphere decoding (SD) was proposed and it is capable of achieving optimal detection performance with much reduced complexity [5]. To further improve the error rate performance, the original hard-output sphere decoding has been modified to provide soft-valued outputs, making it applicable in iterative detection and decoding architectures to attain significantly enhanced detection performance [5]. The complexity of the hard-output and soft-output sphere decoding algorithms depends to a large extent on the adopted search method. Several previous research works proposed a variety of search algorithms, such as K-best [6]–[11], depth-first [5], [12], [13] etc. However, owing to the limitation in search scalability these algorithms are mainly applicable to MIMO systems with either fewer antenna elements or lower-order modulation.

In light of the trend in spatial-multiplexing MIMO communications toward higher-order modulation, more spatial streams and soft-valued output, we propose, in this paper, a configurable soft-output MIMO detector IC based on a novel complex-plane sphere decoding algorithm. In this IC, several architecture and circuit techniques are proposed and implemented to achieve the following advanced features:
• First MIMO detector IC supporting 8×8 64-QAM spatial multiplexing.
• Support for antenna configuration from 2×2 to 8×8 and modulation from QPSK to 64-QAM.
• Provision of soft-valued outputs and candidate list, making it compatible with soft-input error-correction-code (ECC) decoders and iterative detection and decoding system.

Authorized licensed use limited to: BANGALORE INSTITUTE OF TECHNOLOGY. Downloaded on March 30,2010 at 02:20:45 EDT from IEEE Xplore. Restrictions apply.

• Novel modified best-first with fast descent (MBF-FD) MIMO detection algorithm enhancing detection efficiency and performance.
• Low-latency, pipelined quad-dual-heap (quad-DEAP) circuit facilitating node pool maintenance.
• Tabular enumeration scheme providing fast and efficient enumeration.
• Optimized node processing circuit enabling high clock rate and low power consumption.
• Average throughput of 431.8 Mbps with 58.2 mW in 4×4 64-QAM configuration and 428.82 Mbps with 74.8 mW in 8×8 64-QAM configuration.

The rest of this paper is organized as follows: after introducing the MIMO detection problem and the conventional sphere decoding algorithms in Section II, we will give the main idea of the MBF-FD algorithm for MIMO detection in Section III and expound on related simulation results and comparison with existing solutions. Section IV presents circuit design and implementation of the proposed IC, including hardware architecture and circuit techniques. Then, Section V reports the chip measurements and compares the proposed chip with several reported MIMO detector chips. Finally the paper is concluded in Section VI.

II. DETECTION IN SPATIAL-MULTIPLEXING MIMO SYSTEMS

Let us consider spatial streams, each transmitting -bit data per symbol using -QAM modulation over an MIMO system with transmitting and receiving antennas. Denote these -bit data as a binary row vector with , , and let be the QAM-mapped complex constellation symbol vector having complex symbols, . The received complex symbol, , is then given by

where is the channel matrix that is assumed known beforehand and is the complex Gaussian noise vector. For simplicity, in the rest of the paper we assume

Hard-output MIMO detectors try to find the symbol vector (and correspondingly the binary vector ) that maximizes the likelihood of the received vector. On the other hand, soft-output MIMO detectors compute the extrinsic bit-wise log-likelihood ratio (LLR) of each bit in under the max-log maximum a posteriori (MAP) criterion according to [5]

(2)

(3)

where and are respectively the extrinsic and a priori LLR of the bit ; is the search space ; is the noise power spectral density, and is the row vector that contains the a priori information about . In iterative detection and decoding system, the MIMO detector first computes the LLR outputs without the a priori information; the LLRs are then passed to a soft-in-soft-out ECC decoder, whose outcome will then be fed back to the MIMO detector as a priori information; and the iteration goes on.

Generally speaking, sphere decoders are very effective as soft-output MIMO detectors due to the efficient search strategy that confines the search space to include only the vectors whose costs are smaller than the sphere constraint. However, as searching the whole space under the sphere constraint in each iteration is still time-consuming, we adopted a compromised solution proposed in [5]. This scheme generates a candidate list during the first MIMO detection and afterwards confines the search to only among the solutions in that candidate list. In the candidate-list-based MIMO detectors, where the a priori information can be ignored, we can rewrite the cost function as

(4)

where , and and are respectively a unitary matrix and an upper-triangle matrix that satisfy . Since is an upper-triangular matrix, the complex symbols in can be determined sequentially from bottom to top. The decoding can then be mapped into a search over an -layer -ary tree, whose leaf nodes correspond to the possible solution vectors. By expanding (4), we recursively define the partial cost of an intermediate node in layer with partial solution

where and are respectively elements in and , and

Several tree search schemes have been studied in the context of sphere decoding MIMO detection. Among them, the breadth-first, depth-first and best-first algorithms are the most popular. Breadth-first algorithms [6]–[11] are favored due to their regular memory arrangement and amenability to pipelined and paralleled implementation. However, for systems with a larger number of antennas and/or higher modulation order, breadth-first algorithms tend to require enormous computational complexity to achieve acceptable performance. On the other hand, depth-first algorithms [5], [12], [13] have better search efficiency, although their tree traversing strategy still leaves room for improvement. In [14], a best-first algorithm is proposed and shown to be a better search method. This best-first algorithm maintains a pool of nodes to visit, which are not necessarily in the same sub-tree. When the current best node with the lowest partial cost has been visited and processed, the best-first algorithm starts from the next best node in the pool. Namely, it can hop within the tree without being restricted by the structure and connectivity of the tree and always looks into the most promising nodes.
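The relations that this section describes in words can be collected in one place. The block below is a reconstruction in conventional sphere-decoding notation and not the paper's own typesetting: the symbols $N$, $\mathbf{y}$, $\mathbf{H}$, $\mathbf{s}$, $\mathbf{n}$, $\mathbf{Q}$, $\mathbf{R}$, $\tilde{\mathbf{y}}$ and $T_i$ are assumptions, a square channel matrix is assumed, and any scaling by the noise power is left out. The LLR expressions (2) and (3) are not reproduced because their exact form cannot be inferred from the surrounding text.

```latex
\begin{align*}
\mathbf{y} &= \mathbf{H}\mathbf{s} + \mathbf{n}
  && \text{(received vector)} \\
\mathbf{H} &= \mathbf{Q}\mathbf{R}, \qquad
  \tilde{\mathbf{y}} = \mathbf{Q}^{H}\mathbf{y}
  && \text{(QR decomposition, } \mathbf{Q} \text{ unitary, } \mathbf{R} \text{ upper triangular)} \\
\lVert \mathbf{y} - \mathbf{H}\mathbf{s} \rVert^{2}
  &= \lVert \tilde{\mathbf{y}} - \mathbf{R}\mathbf{s} \rVert^{2}
  && \text{(cost of a full-length solution)} \\
T_{i} &= T_{i+1} + \Bigl\lvert \tilde{y}_{i}
  - \sum_{j=i+1}^{N} R_{ij}\, s_{j} - R_{ii}\, s_{i} \Bigr\rvert^{2},
  \qquad T_{N+1} = 0
  && \text{(partial cost at layer } i\text{)}
\end{align*}
```

Because $\mathbf{R}$ is upper triangular, $T_i$ depends only on $s_i, \dots, s_N$, which is what allows the symbols to be decided layer by layer from $i = N$ down to $i = 1$, with $T_1$ equal to the full cost.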


Fig. 1. Operation of the best-first algorithms: (a) original best-first, (b) MBF, and (c) Modified best-first with fast descent algorithm MBF-FD.

III. A LOW-COMPLEXITY SEARCH ALGORITHM

A. Algorithm Description

By continuously starting the search from the current best node with the lowest partial cost, the aforementioned best-first algorithm avoids the limitation of adjacency in the tree suffered by the depth-first and thus achieves a better search efficiency. However, in the original best-first algorithm, the nodes are connected in a traditional -ary tree: each node has children and each child node can be reached only from its parent. So individual partial cost of all children nodes must be evaluated before the search can move downward to the next level as indicated in Fig. 1(a). Evaluation of all child nodes’ partial costs often makes the best-first algorithm’s efficiency less than desirable. What’s worse is that in a tree with high degree, pushing in many nodes and removing only one parent node can quickly bloat the node pool with useless nodes.

In the modified best-first (MBF) algorithm [15], the original -ary tree is converted into an equivalent binary tree, as illustrated in Fig. 1(b). When a node is visited, we can replace this node in the pool by only two new nodes: its best child node in the next layer and its best yet-to-visit sibling. Afterwards, the next best node in the sorted node pool is examined and visited and so on. By adding these two nodes into the pool (and deleting the current node), the legacy of the current node is preserved, downward by its child node and horizontally by its sibling node. This procedure is similar to encoding a general ordered -ary tree (e.g., 4-ary, 16-ary, or 64-ary) into a binary tree by a method called first-child/next-sibling binary tree [16]. The MBF algorithm greatly reduces the degree of a node by introducing horizontal connections and thus effectively decreases the complexity of child node evaluation in the original best-first algorithm. It also makes the node pool more efficient in capturing promising nodes for future visit.

Although the MBF algorithm successfully resolves the complexity and node pool issues of the traditional best-first algorithm, it still has the problem of spending too much time searching on higher layers and may not reach even one leaf node (for a full-length solution) under a time constraint. In order to reach more leaf nodes, the MBF algorithm is further modified to include the flavor of depth-first tree search. The final algorithm, called modified best-first with fast descent (MBF-FD) [15], continuously searches downward for the best child nodes and pushes best sibling nodes along the search path into the node pool until a leaf node is reached. Then a new forward search is started from the best node in the node pool. The MBF-FD algorithm preserves the benefits of the MBF algorithm while guaranteeing enough full-length solutions for soft-output MIMO detection. Fig. 1(c) illustrates the operation of the MBF-FD algorithm.

B. Simulation Results

We compare the proposed MBF-FD algorithm with the modified K-best Schnorr-Euchner (MKSE) algorithm [7] and the single tree search (STS) algorithm [13], which are popular breadth-first-based and depth-first-based algorithms, respectively. To make a fair comparison, we evaluate them in terms of the computational complexity measured in average number of required partial cost calculations (PCC) to reach coded bit error rate (BER) of at certain SNR. The data are coded in a rate systematic convolutional code with constraint length 3, and interleaved with a 128×72 row-in-column-out interleaver. A spatially uncorrelated Rayleigh channel matrix is assumed in each case and its elements are complex zero-mean Gaussian random variables with variance 0.5 per dimension. The sphere constraint is set to 2 in all algorithms initially, which leads to a fair search space reduction while maintaining good error rate performance from extensive simulations. The different sphere decoding algorithms are compared under various run-time constraint settings, e.g., maximum number of visited nodes, in MBF-FD and STS, and in MKSE.

Fig. 2 depicts the required average number of PCC and minimum SNR to achieve coded BER for each algorithm under different run-time constraints, where the SNR is defined

(6)
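The MBF-FD traversal of Section III-A can be sketched in software as follows. This is a minimal illustration and not the chip's implementation: the function name `mbf_fd`, the argument names, the unbounded pool built on `heapq`, and the brute-force ranking of children (the paper uses tabular enumeration instead) are all assumptions made for the example. Each visited node pushes its best yet-to-visit sibling into the pool and then descends to its best child until a leaf is reached, after which the search restarts from the best node in the pool.

```python
import heapq
import itertools


def mbf_fd(R, y, alphabet, max_leaves=None):
    """Return the full-length solutions found as (cost, symbols) pairs.

    R is an upper-triangular matrix (list of rows) and y the rotated
    received vector; both names are illustrative. Layer n-1 is decided first.
    """
    n = len(y)
    tie = itertools.count()  # unique tiebreak so heap entries always compare

    def node(layer, above, parent_cost, rank):
        # `above` holds the symbols already decided for layers layer+1..n-1.
        b = y[layer] - sum(R[layer][layer + 1 + k] * s for k, s in enumerate(above))
        ranked = sorted(alphabet, key=lambda s: abs(b - R[layer][layer] * s))
        if rank >= len(ranked):
            return None  # no yet-to-visit sibling left under this parent
        s = ranked[rank]
        cost = parent_cost + abs(b - R[layer][layer] * s) ** 2
        return (cost, next(tie), layer, above, parent_cost, rank, s)

    pool, leaves = [], []
    current = node(n - 1, (), 0.0, 0)  # best node of the top layer
    while current is not None:
        while True:  # fast descent until a leaf is reached
            cost, _, layer, above, parent_cost, rank, s = current
            sibling = node(layer, above, parent_cost, rank + 1)
            if sibling is not None:
                heapq.heappush(pool, sibling)  # best yet-to-visit sibling
            if layer == 0:
                leaves.append((cost, (s,) + above))
                break
            current = node(layer - 1, (s,) + above, cost, 0)  # best child
        if max_leaves is not None and len(leaves) >= max_leaves:
            break  # run-time constraint expressed as a leaf budget
        current = heapq.heappop(pool) if pool else None  # restart from best node
    return leaves
```

With `max_leaves=None` the sketch enumerates every leaf, so the lowest-cost entry is the exhaustive-search solution; a small `max_leaves` mimics a run-time constraint, and the first leaf is always available after a single descent.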


Fig. 2. Comparison of MBF-FD, STS, and MKSE algorithms under different run-time constraints.

From Fig. 2, we can see that the low complexity advantage of the MBF-FD algorithm over the STS and MKSE algorithms is quite obvious. In particular, when BER is mandated for channels with 17.5 dB SNR in the 4×4 16-QAM configuration, the average number of PCC required by the MBF-FD algorithm is only 41% of the STS algorithm and 13% of the MKSE algorithm. In the 8×8 64-QAM configuration this advantage is even more pronounced, where for channels with 21.5 dB SNR the MBF-FD algorithm needs on average only 9.8% and 3.3% of the complexity of the STS algorithm and the MKSE algorithm, respectively.


The proposed MIMO detector consists of three major parts for the implementation of the MBF-FD algorithm. First, a node pool holds the information of the nodes for future visit. Secondly, the node processing part performs the MBF-FD tree traversal. Finally, a third part generates the soft detection result and the candidate list output.

A. Node Pool

Node pool, the most critical block in the proposed IC, maintains a group of nodes to visit in a way that the best and the worst nodes can be readily identified. The size of the node pool should be properly determined to guarantee satisfactory BER performance. From extensive simulation, we found that a node pool with about 30–40 nodes is sufficient. The best node (with minimum partial cost) is the next node to visit, while the worst node (with maximum partial cost) is to be removed when a new node is inserted and the pool is already full. The nodes come in and out so frequently that the efficiency of the node pool significantly affects the MIMO detector performance. A simple-minded design using two comparators to search for the best and the worst nodes is obviously not acceptable due to high circuit complexity and long delay. In the following, we will propose the pipelined quad-DEAP for implementing the node pool. Techniques that improve throughput, accuracy, power and complexity will also be presented.

Dual-heap (DEAP) [17], consisting of a minimum and a maximum heap¹ arranged in a back-to-back fashion, is a data structure dedicated for efficiently maintaining the minimum and the maximum among a group of numbers. Fig. 3 depicts an example of DEAP. Note that a leaf node in the minimum heap is less than or equal to the corresponding leaf node in the maximum heap. Upon retrieval of the minimum or maximum node, DEAP can be easily maintained through node exchanges propagating from one end to the other.

¹A minimum (maximum) heap is a tree in which a parent node always holds a value less (greater) than the values that its child nodes hold.

Fig. 3. An example of the DEAP structure.

With the above DEAP structure, we next present several techniques for efficient circuit implementation of DEAP. First, DEAP suffers from possible long latency which scales linearly with the number of layers due to propagation of node exchanges. To reduce the number of layers, we propose to use quad trees instead of binary trees. Fig. 4 depicts the adopted 6-layer quad-DEAP structure, which contains 42 nodes to guarantee satisfactory BER performance.

Moreover, an interlaced pipelining scheme is implemented in the node exchange operations to improve the node processing rate and circuit utilization, as illustrated in Fig. 4. To implement node exchanges, the pipelining stages operate in a period of two clock cycles. Specifically, in the first clock cycle, two root nodes in layers 1 and 6 update their values with the respective inputs if necessary, while nodes in layers 2, 3, 4, 5 that have been updated in the previous cycle compare with associated nodes in layers 3, 2, 5, 4 and exchange values whenever necessary. In the second clock cycle, similar node exchanges are performed between layers 1 and 2, layers 3 and 4, layers 5 and 6, respectively. Note that both upward and downward propagation of node exchanges are possible. In addition, these two types of propagation can happen simultaneously in the pipelining stages. Therefore, the circuit is designed to handle upward and downward node exchanges concurrently. Finally, the comparators are shared between the two phases (even-cycle phase and odd-cycle phase) to increase circuit utilization.

Fig. 4. Operation of the pipelined quad-DEAP.

Although pipelining improves the node pool, it also introduces possible incoherence in the best node value when the node exchanges associated with a best node replacement are not completed in time, leading to degradation in error rate performance. To this end, we include a best node cache, which holds the best node of the pool while the quad-DEAP handles the other nodes in the pool. Armed with this cache, the aforementioned incoherence and possible BER degradation are avoided. Finally, we introduce two more low-power circuit techniques for the node pool. First, for the idle nodes which are not on the path of propagation, we turn off the associated circuits by clock gating. Second, when node exchange procedure halts at a certain stage, the inputs of the comparators in the ensuing stages are frozen to minimize signal switching. A 38.2% saving in power is achieved by these techniques according to gate-level power simulation.

B. Node Processing

This part performs the main operations of MBF-FD tree traversal, including identifying the child and sibling nodes and calculating their partial costs. A dedicated pipelining strategy is introduced to cut down the possible long delay path. We partition the computation involved with a child node into three stages: the inter-antenna interference cancellation (IAIC) block first cancels the interference from the QAM symbols that have been decided in the previous layers; the child node processing (CNP) block then finds the best child node; and finally the partial cost calculation (PCC) block computes and accumulates the squared error. The operation involved with a sibling node is similarly


partitioned into the sibling node processing (SNP) block and PCC as well. With the above partitioning and pipelining, the clock speed of the proposed chip can be close to 200 MHz.

Referring to Fig. 1, note that and depend on the decision of . Thus, we propose a pipelining schedule as shown in Fig. 5. The top three blocks refer to the processing of the child node . At time , the decision of is already available though the partial cost of is yet to be computed. Consequently, we parallelize the processing of and with the PCC of . Tree traversal for the following layers is similarly performed until a leaf node is reached. Note that in parallel another SNP block determines the best sibling of the best node retrieved from the node pool, [see Fig. 1(c)].

Fig. 5. Pipelining schedule of node processing.

The above scheduling has several advantages. First, only one set of IAIC, CNP, SNP, and PCC circuits is implemented. Next, since the tree is traversed sequentially, by adjusting the schedule this architecture can be configured to support different number of antennas (layers), different modulations, and run-time constraints. Also, the rate of node processing matches that of the node pool, i.e., two clock cycles per node, thus enhancing circuit utilization. We next introduce the circuit techniques adopted in these blocks.

1) Inter-Antenna Interference Cancellation: Assuming that the current node is in layer , IAIC computes the first two terms inside the square norm of (5):

To reduce the critical path delay, the associated terms inside the summation in (7) are computed and accumulated as early as possible, namely, during the processing of nodes at layers through . Hence, can be computed with only one final multiplication and addition. Rearranging the calculation of greatly facilitates design configurability over the number of antennas. For the proposed 8×8 MIMO detector IC, seven such IAIC units are implemented to compute through . In configurations with smaller number of antennas, fewer IAIC units are needed and unused ones are simply turned off.

To further reduce the critical path delay, two more circuit techniques are adopted. First, as is a QAM constellation point and thus its real part (as well as imaginary part) has at most eight possible values, the multiplier uses a simplified radix-4 Booth encoding to reduce the number of partial products to two. Second, the complex multiplication and the following subtraction are integrated in one carry-save adder whose final addition is a multi-stage carry-select adder. Gate-level synthesis results show that the above techniques reduce 38.2% of the critical path.

2) Child Node Processing: Assuming the newly-popped-out node is in layer , CNP finds recursively the constellation point and its corresponding difference of the best child node in layer ,

where runs from layer down to layer 0 and is the quantization function that converts its argument to the nearest constellation point. To avoid the division in (8), we adopt a search over all constellation points instead:

(10)

By the orthogonality between the real and imaginary parts, we can search the real and the imaginary part independently for the closest constellation point. In addition, as the signs of the real/imaginary parts of the closest point are identical to those of , we search only constellation points in the first quadrant. In summary, we compare and with , , in parallel to find the real and imaginary parts of simultaneously. Concurrently, all possible combinations of the corresponding difference are computed and then selected by the results of aforementioned comparison to reduce the path delay. Fig. 6 depicts the circuit diagram of the CNP block. From synthesis results, the proposed simplified child node search circuit saves 70.4% of area and 39.7% of circuit delay when compared with the straightforward implementation.

3) Sibling Node Processing and Tabular Enumeration: Finding the next sibling node requires sorting the yet-to-visit constellation points according to their partial costs, which can account for a significant portion of the complexity in tree-search MIMO detection hardware. To avoid that, we apply the tabular enumeration (TE) technique proposed in [15] for fast node order look-up. Fig. 7 illustrates how this technique works. First, suppose the constellation point closest to the equalized and interference-cancelled signal, , has been found and denoted as . The region around this constellation point is then divided into eight triangular sub-regions. For each sub-region, the most likely visiting order of all other constellation points is computed in advance and stored in a table. Extensive simulation indicates that TE introduces negligible BER degradation when compared to the exact enumeration order.

Direct implementation of TE requires eight tables for each constellation point, each with entries. To reduce the required storage, we further unify these tables into one by utilizing the symmetry in the eight sub-regions and the shift invariance property of the partial cost function. Fig. 7 shows the unified node order table with a maximum offset of which supports up to 64-QAM. Note that the node order and


the offset to the current are listed inside and around the table, respectively. As there is only one table for all possible , boundary check is necessary to skip those offsets that lead to points outside the constellation. For different QAM modulations, the same table can be reused by merely modifying the boundary. In sum, the unified table is implemented in only 1.76 K bits, which is 0.88% of the straightforward design. Although the unified table significantly reduces the storage, repeated table look-up to skip the invalid offsets can be a speed bottleneck. To prevent this, eight parallel boundary check units are implemented.

Fig. 6. Circuit diagram of the child node processing (CNP) block.

Fig. 7. Illustration of triangular partitions and node ordering table in tabular

Note that except for in Fig. 1 all other sibling nodes in MBF-FD are always the second best among all nodes of the same parent. Therefore, we further propose simplified TE (STE) for processing these sibling nodes during fast descent. Assume that falls in sub-region 0 without loss of generality. Then there are only three possible cases of TE for these sibling nodes as illustrated in Fig. 8, and hence the table can be reduced to only three entries: (0, 2), (2, 0), (−2, 0). These entries are processed and boundary checked in parallel so that a sibling node can be found in one clock cycle. Fig. 9(a) and (b) show the circuit diagrams of the proposed TE and STE, respectively, where the index TN is the sub-region index and the flip block handles the symmetry processing of the offsets according to TN. Finally, the SNP circuit adopting STE is depicted in Fig. 10. Note that the first two bits of TN are simply the sign values of the real and imaginary parts of the difference, , while the third bit of TN, , needs one more comparison. To reduce the critical path, two STE blocks are implemented to process the two possible cases of . Moreover, many possible partial results for the difference are available from the CNP unit. These two techniques result in a 56.9% saving in critical path.

Fig. 8. Three possible cases of the second best sibling node assuming sub-region 0 is considered.

4) Partial Cost Computation: The PCC block squares the differences obtained in CNP and SNP blocks and accumulates the partial cost according to

(11)

To reduce complexity and shorten critical path, a special squarer is designed and its outcomes are fed to a carry-save adder that updates the partial cost according to (11).
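The behavior required of the node pool in Section IV-A (retrieve the best node, evict the worst node when a new one arrives and the pool is full) can be illustrated in software. The sketch below is a functional stand-in only and not a DEAP: it uses two `heapq` heaps with lazy deletion, and the class name `NodePool` and its methods are assumptions made for this example. As a side note, the stated 42-node, 6-layer quad-DEAP is consistent with two three-layer quad trees, since 2 × (1 + 4 + 16) = 42.

```python
import heapq
import itertools


class NodePool:
    """Bounded pool that can return its best node and evict its worst."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._min_heap = []   # (cost, id): best node on top
        self._max_heap = []   # (-cost, id): worst node on top
        self._live = {}       # id -> (cost, payload) for nodes still pooled
        self._ids = itertools.count()

    def __len__(self):
        return len(self._live)

    def push(self, cost, payload):
        """Insert a node; return the evicted worst node if the pool overflows."""
        node_id = next(self._ids)
        self._live[node_id] = (cost, payload)
        heapq.heappush(self._min_heap, (cost, node_id))
        heapq.heappush(self._max_heap, (-cost, node_id))
        if len(self._live) > self.capacity:
            return self._pop(self._max_heap)
        return None

    def pop_best(self):
        """Remove and return the node with the minimum partial cost."""
        return self._pop(self._min_heap)

    def _pop(self, heap):
        while heap:
            _, node_id = heapq.heappop(heap)
            if node_id in self._live:  # skip entries already removed via the other heap
                return self._live.pop(node_id)
        raise IndexError("pop from an empty node pool")
```

A hardware DEAP keeps both extremes in one structure with bounded exchange depth; the two-heap version trades that for brevity and leaves stale entries behind until they surface.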
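The division-free child node search of the CNP block can be sketched as a per-dimension comparison against scaled decision thresholds. The operands of the comparison are lost in this text, so the following is an interpretation under stated assumptions: the diagonal entry `r` is real and positive, the constellation levels are the odd integers ±1, ±3, ±5, ±7 per dimension, and the thresholds are 2r, 4r and 6r. The function names are hypothetical.

```python
def nearest_level(v, r, levels=8):
    """Nearest odd amplitude to v / r, found with comparisons only (r > 0)."""
    magnitude = abs(v)                   # search the first quadrant only
    level = 1
    for k in range(1, levels // 2):      # thresholds 2r, 4r, ... between levels
        if magnitude > 2 * k * r:
            level = 2 * k + 1
    return level if v >= 0 else -level   # restore the sign afterwards


def best_child(b, r, levels=8):
    """Return (symbol, difference) for the constellation point nearest to b / r."""
    symbol = complex(nearest_level(b.real, r, levels),
                     nearest_level(b.imag, r, levels))
    return symbol, b - r * symbol
```

The real and imaginary parts are handled independently, and in hardware the threshold comparisons would run in parallel rather than in a loop; the returned difference is the quantity that the PCC block squares and accumulates.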
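The idea behind tabular enumeration (a visiting order precomputed per sub-region, stored as offsets from the nearest constellation point, with a boundary check at look-up time) can be sketched as below. The table contents of Fig. 7 are not available here, so this sketch builds its own tables: the octant numbering, the representative point used to rank offsets for each sub-region, and all names are assumptions. It keeps eight offset tables rather than the single unified table of the paper, so it exploits shift invariance but not the symmetry folding.

```python
import cmath
import math

AMPS = (-7, -5, -3, -1, 1, 3, 5, 7)   # 64-QAM levels per dimension
OFFSETS = [complex(dx, dy) for dx in range(-14, 15, 2)
           for dy in range(-14, 15, 2) if (dx, dy) != (0, 0)]


def representative(region):
    """Assumed stand-in point for one of the eight sub-regions (octants)."""
    return cmath.rect(0.6, (region + 0.5) * math.pi / 4)


# One visiting-order table per sub-region, computed once in advance.
ORDER_TABLES = [sorted(OFFSETS, key=lambda o, c=representative(k): abs(o - c))
                for k in range(8)]


def sub_region(d):
    """Octant index 0..7 of the residual d (signal minus nearest point)."""
    return int((cmath.phase(d) % (2 * math.pi)) / (math.pi / 4)) % 8


def enumerate_siblings(nearest, d):
    """Yield the other constellation points in table order, skipping invalid ones."""
    for offset in ORDER_TABLES[sub_region(d)]:
        point = nearest + offset
        if point.real in AMPS and point.imag in AMPS:   # boundary check
            yield point
```

The order is exact only when the residual coincides with the representative point of its sub-region and approximate elsewhere, which mirrors the trade-off described in the text; lower-order modulations would reuse the same tables with a tighter `AMPS` boundary.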


Fig. 9. Circuit diagram of (a) tabular enumeration (TE) and (b) simplified tabular enumeration (STE).

Fig. 10. Circuit diagram of sibling node processing (SNP) block adopting simplified tabular enumeration (STE) algorithm.

Fig. 11. Block diagram of the MBF-FD MIMO detector.

C. Soft-Output Generation and Candidate List
In the proposed MIMO detector IC, SOG generates the LLR values according to the max-log LLR criterion in (2), and these values are essential for high-performance soft-input ECC decoders in the first iteration. Two register files, each with 48 registers, are used to store, for each bit, the costs of the two hypotheses, 0 and 1. This configuration thus supports up to 8 × 8 64-QAM MIMO detection. Each register is initialized with a maximum cost, which can be regarded as the initial sphere constraint of the MBF-FD algorithm. When a full-length solution is found, the SOG block updates the registers of the relevant hypotheses whenever the cost of the found solution is smaller than the register contents.
When the search terminates, the LLR values are obtained by computing cost differences in pairs of counter hypotheses. Note that for sphere decoders under a run-time constraint, the costs of counter hypotheses may not be found during the tree search. In our design, the initial sphere constraint is used as an approximation for the unavailable cost.
The candidate list block maintains a list of full-length solutions with lower costs during the search. As proposed in [18], a 4-layer binary heap with 15 entries is used in this work. In addition to the low-power design techniques used in quad-DEAP, clock gating turns off unused SOG and CL units when the detector IC is configured for a low antenna number and/or low-order QAM modulation. The candidate list is quite useful in iterative detection and decoding systems where, in later iterations, a reduced search over the candidate list, rather than over the whole solution space, is sufficient.

D. Summary
To sum up, Fig. 11 depicts the block diagram of the proposed MBF-FD MIMO detector. Note that in the node pool, a side information memory (SIM) works together with the DEAP circuit
to provide detailed information about the nodes for future visits. For node processing, the signals flow through IAIC, CNP, SNP and PCC. The outputs of the two SNP blocks are fed into the node pool. Finally, SOG and CL receive the full-length solutions, generate the soft-output LLR values, and maintain the list of candidate solutions.

Fig. 12. Chip microphotograph.
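As a rough software analogue of the SOG and CL behaviour described in Section C, the following Python sketch keeps one minimum-cost register per bit and per hypothesis, and a bounded list of the lowest-cost full-length solutions. The class and method names, the `init_cost` value, the LLR sign convention, and the use of Python's `heapq` module in place of the memory-based heap of [18] are all assumptions made for illustration; this is not a description of the actual circuit.

```python
import heapq


class SoftOutputGenerator:
    """Per-bit minimum-cost registers for the two hypotheses, 0 and 1."""

    def __init__(self, n_bits=48, init_cost=1000.0):
        # 48 bits = 8 streams x 6 bits per 64-QAM symbol.
        # init_cost stands in for the initial sphere constraint (assumed value).
        self.cost = [[init_cost] * n_bits, [init_cost] * n_bits]

    def update(self, bits, cost):
        """For each bit of a full-length solution, lower the matching register."""
        for i, b in enumerate(bits):
            if cost < self.cost[b][i]:
                self.cost[b][i] = cost

    def llrs(self):
        """Cost difference of the counter hypotheses (sign convention assumed).

        A register that was never updated still holds init_cost, which plays
        the role of the approximation for an unavailable counter-hypothesis cost.
        """
        return [c0 - c1 for c0, c1 in zip(self.cost[0], self.cost[1])]


class CandidateList:
    """Keeps the `capacity` lowest-cost full-length solutions seen so far."""

    def __init__(self, capacity=15):
        self.capacity = capacity
        self._heap = []  # max-heap on cost, implemented by negating the key

    def offer(self, bits, cost):
        entry = (-cost, tuple(bits))
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif cost < -self._heap[0][0]:
            # The new solution beats the current worst entry: replace it.
            heapq.heapreplace(self._heap, entry)

    def sorted_candidates(self):
        """Return (cost, bits) pairs in ascending order of cost."""
        return sorted((-neg_cost, bits) for neg_cost, bits in self._heap)
```

In this sketch every full-length solution found by the search would be passed to both `update` and `offer`, and `llrs` would be read once the search terminates.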
Significant saving in power, delay, and circuit complexity has
been attained through several circuit techniques adopted in de-
signing the proposed MIMO detector IC. Table I summarizes all
the techniques used and their improvements in power reduction,
clock speed-up, and circuit/storage complexity.

The proposed IC is fabricated in a 0.13-micron CMOS technology. To validate the feasibility of the proposed IC for high-speed MIMO receivers, two copies of the circuit in Fig. 11, each called a processing element (PE), are integrated in this IC. Each PE can independently execute MBF-FD MIMO detection for a received -element signal vector, . The core area of the IC is mm . Fig. 12 depicts the chip microphotograph.
The maximum operating clock rates of the chip under different supply voltages are plotted in Fig. 13. At the nominal 1.3 V supply voltage, the chip can operate up to 198 MHz, about 1% less than the post-simulation result. Fig. 14 depicts the power consumption of the IC when it is configured in four different modes and operating at the maximum frequencies under several supply voltages. As expected, more power is consumed when the detector IC operates with more antennas and/or higher-order QAM.

Fig. 13. Maximum clock rate of the proposed IC versus supply voltage.
Fig. 14. Power consumption versus supply voltage of the proposed detector IC in four operation modes.

The throughput of the proposed IC is formulated as

where is the clock rate, is the average number of visited nodes, is the number of PEs, and is the average number of clock cycles to visit a node, which is 2.53 in the proposed IC. Operating at the maximum frequency and under good channel conditions, the proposed IC achieves 431.8 Mbps and 428.8 Mbps throughput in the 4 × 4 64-QAM and 8 × 8 64-QAM configurations by constraining to 8 and 16, respectively. Specifically, these configurations can reach 10 coded BER at SNRs of 24.2 dB and 22.6 dB. However, when the channel conditions are poor, the MIMO detector may need to visit more nodes and require a longer run time to obtain more precise soft-output values for an acceptable BER. As such, the achievable throughput can become lower.
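The relation between throughput and the four quantities named above can be illustrated with a short Python sketch. The throughput expression itself is not legible in this copy, so the formula below is an assumption: it multiplies the bits carried per signal vector by the number of vectors that the PEs can finish per second. The function and parameter names are hypothetical, and no claim is made that this sketch reproduces the measured figures.

```python
import math


def throughput_bps(f_clk_hz, n_pe, n_streams, qam_order,
                   avg_nodes, cycles_per_node=2.53):
    """Estimated detector throughput in bits per second (assumed formula).

    f_clk_hz        : clock rate in Hz
    n_pe            : number of processing elements working in parallel
    n_streams       : number of spatial streams (4 for a 4 x 4 system)
    qam_order       : constellation size (4, 16 or 64)
    avg_nodes       : average number of visited nodes per signal vector
    cycles_per_node : average clock cycles per visited node (2.53 per the text)
    """
    bits_per_vector = n_streams * math.log2(qam_order)
    vectors_per_second = n_pe * f_clk_hz / (avg_nodes * cycles_per_node)
    return bits_per_vector * vectors_per_second
```

Under this assumed form, throughput scales linearly with the clock rate and the number of PEs, and inversely with the average number of visited nodes, which is consistent with the remark that poor channel conditions lower the achievable throughput.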

Table II lists the overall performance of the proposed IC and several reported sphere decoding MIMO detector ICs. The proposed IC is the only one that supports a maximum of eight antennas and 64-QAM modulation. The detector IC in [8] supports 8 × 8 MIMO systems, but only for QPSK modulation. In contrast, the proposed IC, capable of providing 21 configurations from 2 × 2 to 8 × 8 and from QPSK to 64-QAM, is the most configurable among all reported ICs. Only one other implementation provides some degree of configurability, but with less flexibility in antenna number [9]. Moreover, the proposed chip is one of the very few chips that provide both soft LLR and candidate list output, which are indispensable for MIMO detectors in advanced iterative detection and decoding systems. Therefore, although its throughput is not the highest, satisfactory BER performance is guaranteed. In other words, if one is willing to sacrifice BER performance, the proposed IC can achieve even higher throughput by setting a smaller maximum number of visited nodes. Finally, the proposed IC has the best measured power performance. Note that the power consumption is normalized considering supply voltage and adopted technology using

VI. CONCLUSIONS
This paper presents the design of a novel configurable soft-output MIMO detector IC. From the algorithmic aspect,
a new and efficient sphere decoding algorithm, the MBF-FD algorithm, is shown to be very effective in soft-output MIMO detection, especially when the antenna number or the modulation order is high. In terms of VLSI implementation, new hardware architectures are proposed for a better hardware design. These include the pipelined quad-DEAP and tabular enumeration. Several circuit techniques are adopted in the design of function blocks to further improve the performance of the IC. Measurement results show that the proposed IC outperforms all the other implementations in terms of normalized power. Moreover, the proposed IC is the first soft-output sphere decoding MIMO detector IC that can support up to 8 × 8 64-QAM MIMO systems. When the chip is configured in 4 × 4 64-QAM and 8 × 8 64-QAM and constraining to 8 and 16, its throughput can reach 431.8 Mbps and 428.8 Mbps, respectively. With such performance, the proposed IC is very competitive among all soft-output MIMO detectors.

ACKNOWLEDGMENT
The authors greatly appreciate the Chip Implementation Center (CIC) of Taiwan for the fabrication and measurement of the proposed chip. They also thank the anonymous reviewers for the valuable suggestions that greatly improved this paper.

REFERENCES
[1] G. J. Foschini, “Layered space-time architecture for wireless communication in a fading environment when using multi-element antennas,” Bell Labs. Tech. J., vol. 1, no. 2, pp. 41–59, 1996.
[2] A. Paulraj, R. Nabar, and D. Gore, Introduction to Space-Time Wireless Communications. Cambridge, U.K.: Cambridge Univ. Press, 2003.
[3] 3GPP Technical Report 36.913: Requirements for Further Advancements for Evolved Universal Terrestrial Radio Access (E-UTRA) (LTE-Advanced) [Online]. Available: Specs/html-info/36913.htm.
[4] NTT DoCoMo Press Release [Online]. Available: http://www.nttdo-
[5] B. Hochwald and S. ten Brink, “Achieving near-capacity on a multiple-antenna channel,” IEEE Trans. Communications, vol. 51, no. 3, pp. 389–399, 2003.
[6] K.-W. Wong, C.-Y. Tsui, R.-K. Cheng, and W.-H. Mow, “A VLSI architecture of a K-best lattice decoding algorithm for MIMO channels,” in Proc. ISCAS, 2002, vol. 3, pp. 273–276.
[7] Z. Guo and P. Nilsson, “Algorithm and implementation of the K-best sphere decoding for MIMO detection,” IEEE J. Sel. Areas Commun., vol. 24, no. 3, pp. 491–503, 2006.
[8] G. Knagge, M. Bickerstaff, B. Ninness, S. R. Weller, and G. Woodward, “A VLSI 8 × 8 MIMO near-ML decoder engine,” in Proc. IEEE 2006 Workshop on Signal Processing Systems (SiPS), Oct. 2006.
[9] R. Shariat-Yazdi and T. Kwasniewski, “Configurable K-best MIMO detector architecture,” in Proc. 3rd ISCCSP, 2008, pp. 1565–1569.
[10] M. Wenk, M. Zellweger, A. Burg, N. Felber, and W. Fichtner, “K-best MIMO detection VLSI architectures achieving up to 424 Mbps,” in Proc. ISCAS, 2006, pp. 1151–1154.
[11] S. Chen, T. Zhang, and Y. Xin, “Relaxed K-best MIMO signal detector design and VLSI implementation,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 3, pp. 328–337, Mar. 2007.
[12] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H. Bolcskei, “VLSI implementation of MIMO detection using the sphere decoding algorithm,” IEEE J. Solid-State Circuits, vol. 40, pp. 1566–1577, 2005.
[13] C. Studer, A. Burg, and H. Bolcskei, “Soft-output sphere decoding: Algorithms and VLSI implementation,” IEEE J. Sel. Areas Commun., vol. 26, no. 2, pp. 290–300, 2008.
[14] A. Murugan, H. Gamal, M. Damen, and G. Caire, “A unified framework for tree search decoding: Rediscovering the sequential decoder,” IEEE Trans. Information Theory, vol. 52, no. 3, pp. 933–953, 2006.
[15] T.-P. Wang, T.-H. Lee, and T.-D. Chiueh, “Low-complexity soft-output MIMO detection for iterative decoding using modified best-first tree search,” IEEE Trans. Wireless Commun., submitted for publication.
[16] D. E. Knuth, The Art of Computer Programming, 3rd ed. Reading, MA: Addison Wesley, 1997, vol. 1, Fundamental Algorithms.
[17] A. Carlsson, “The DEAP: A double-ended heap to implement double-ended priority queues,” Information Process. Lett., vol. 26, pp. 33–36,
[18] P. Salmela, J. Antikainen, O. Silven, and J. Takala, “Memory-based list updating for list sphere decoders,” in Proc. IEEE Workshop on Signal Processing Systems (SiPS), 2007, pp. 633–638.
[19] C. Hess, M. Wenk, A. Burg, P. Luethi, C. Studer, N. Felber, and W. Fichtner, “Reduced-complexity MIMO detector with close-to ML error rate performance,” in Proc. 17th ACM Great Lakes Symp. VLSI (GLSVLSI), 2007, pp. 200–203.
[20] M. Shabany and P. Gulak, “Scalable VLSI architecture for K-best lattice decoders,” in Proc. ISCAS, 2008, pp. 940–943.

Chun-Hao Liao was born in Taichung, Taiwan, in 1983. He received the B.S. degree in electrical engineering and the M.S. degree from the Graduate Institute of Electronics Engineering at National Taiwan University, Taipei, Taiwan, in 2006 and 2009, respectively.
From 2008 to 2009, he also worked as a research assistant in the Institute for Integrated Signal Processing Systems at RWTH Aachen University, Aachen, Germany. His research interests include baseband signal processing of communication systems, VLSI design, LDPC codes, wireless channel enumerator, and sphere decoding algorithms.

To-Ping Wang was born in Taipei, Taiwan, in 1983. He received the B.S. degree in electrical engineering and the M.S. degree from the Graduate Institute of Electronics Engineering at National Taiwan University, Taipei, Taiwan, in 2005 and 2007, respectively.
His research interests include baseband signal processing of communication systems, MIMO channel enumerator and sphere decoding algorithms.

Tzi-Dar Chiueh (S’87–M’90–SM’03) received the B.S. and Ph.D. in electrical engineering from National Taiwan University and California Institute of Technology in 1983 and 1989, respectively.
He is now a Professor in the Department of Electrical Engineering and Graduate Institute of Electronics Engineering at National Taiwan University. His research interests include algorithm, architecture, and integrated circuits for baseband communication systems.
Dr. Chiueh has received the Acer Longtern Award 11 times and the Golden Silicon Award in 2002, 2005, 2007, and 2009. His teaching efforts were recognized five times by the Teaching Excellence Award from NTU. Prof. Chiueh was the recipient of the Outstanding Research Award from National Science Council, Taiwan in 2004–2007. In 2005, he received the Outstanding Electrical Engineering Professor from the Chinese Institute of Electrical Engineers (Taiwan), and was awarded the Himax Chair Professorship at NTU in 2006. In 2009, he received the Outstanding Industry Contribution Award from the Ministry of Economic Affairs, Taiwan.
