Professional Documents
Culture Documents
2, FEBRUARY 2008
Authorized licensed use limited to: Worcester Polytechnic Institute. Downloaded on November 12, 2008 at 13:24 from IEEE Xplore. Restrictions apply.
HUANG et al.: SYSTEM ARCHITECTURE AND IMPLEMENTATION OF MIMO SPHERE DECODERS ON FPGA 189
VLSI implementation in order to achieve maximal performance thus far. The FPGA implementations provide the flexibility
in real-time. Several hardware implementations have been re- for varying the number of antennas and signal constella-
ported by prototyping the VB algorithm, the SE algorithm, or tions.
their modified versions [16]–[18]. Most recently, the applica- 2) The SoC approach is introduced to simplify the design of
tion-specific integrated circuit (ASIC) implementation by Berg the sphere decoder and to improve the efficiency. The hard-
et al. has demonstrated the decoding rate up to 73 Mb/s at 20-dB ware/software codesign techniques partition the compli-
signal-to-noise ratio (SNR) [19]. It is a modified searching al- cated preprocessing tasks such as matrix factorizations and
gorithm based on the Pohst strategy and incorporates the depth- inversions to an embedded processor and the real-time de-
first tree traversal with progressive radius reduction. However, coding functions to customized hardware modules.
an ASIC implementation is generally refined for a fixed number 3) Both the VB and the SE decoding algorithms are imple-
of antennas and a certain signal constellation, and is optimized mented on an FPGA platform and are evaluated for de-
for low power high frequency circuit design. The limitation of coding rates, BER performance, power consumptions, and
an ASIC implementation is lack of flexibility when the number area utilizations. The implementations on a DSP are also
of antennas or the signal constellation changes. presented for comparison.
Field-programmable gate-array (FPGA) devices are widely 4) Three levels of parallelism are explored to accelerate the
used in signal processing, communications, and network appli- processing: the concurrent execution of the preprocessing
cations because of their reconfigurability and support of par- on an embedded processor and the decoding on hardware
allelism. FPGA has at least three advantages over a DSP pro- cores; the parallel decoding of real/imaginary parts if com-
cessor: the inherit parallelism of an FPGA is equipped for vector plex constellation applies; and the concurrent execution of
processing; it has reduced instruction overhead; the processing multiple steps during the closest lattice point search.
capacity is scalable if the FPGA resource is available. The dis- This paper is organized as follows. Section II reviews the
advantage is that the development cycle of the FPGA design is principles of the VB and SE decoding algorithms. The hard-
usually longer than the DSP implementation. But once an ef- ware/software codesign architecture is described in Section III.
ficient architecture is developed and the parallel implementa- Three levels of parallelism are explored in Section IV. The data
tion is explored, FPGA is able to significantly improve the pro- dependency among the iterative lattice search procedures is also
cessing speed because of its intrinsic density advantage [20]. analyzed in this section. A comparison of the experimental re-
Furthermore, FPGA also has several advantages over an ASIC sults between these two algorithms among the FPGA and DSP
implementation: an FPGA device is reconfigurable to accom- implementations is presented in Section V. The conclusion is
modate system configuration changes even in run-time; it has given in Section VI.
significantly reduced prototyping latency comparing to ASIC;
it is a cost-effective solution to meet the low volume short cycle II. SPHERE DECODING ALGORITHM
product requirement [21]. Considering an MIMO system with transmit and re-
In addition, the system-on-chip (SoC) concept has been ceive antennas, the received signal is given by
adopted to FPGA lately by introducing one or more embedded
processors into the FPGA design [22], e.g., the PowerPC hard (1)
processor cores and the MicroBlaze soft processors on Xilinx where is a transmitted signal vector and is an additive white
FPGAs as well as the Nios soft processors on Altera FPGA Gaussian noise vector. is the channel matrix that
devices.1 2 The SoC architecture significantly improves the can be assumed as known from perfect channel estimation and
interporability and reduces the design complexity of many synchronization. For selected modulation scheme, each element
complex computational algorithms. Consequently, the hard- of the transmit vector is a constellation point and the channel
ware/software codesign technique can be applied to partition matrix generates a lattice. The ML decoding algorithm is to
the computational algorithm into customized hardware and find the minimal distance between the received point and the
embedded software. For instance, one or more embedded pro- examining lattice point that
cessors can be instantiated in an FPGA to execute processing
tasks that are less time critical but highly sequential or consid- (2)
erably complicated for direct circuit implementation. In this
paper, we introduce the FPGA-based SoC architectures for two where is the decoded vector. The entries of are chosen from
typical sphere decoding algorithms. a complex constellation . The set of all possible transmitted
The main contributions of this paper are summarized as fol- vector symbols is denoted by . Thus, the ML-based sphere
lows. decoding system can be summarized as follows.
1) We present the FPGA-based system architectures and im- Input: The channel lattice generation matrix and the re-
plementations for MIMO sphere decoding. To the authors’ ceived signal vector .
knowledge, the real-time performance results of the system Output: A vector such that is a lattice point that
prototypes are among the fastest MIMO decoders reported is the closest to .
1[Online]. Available: http://www.altera.com/technology/embedded/emb-
In this section, we summarize the two typical sphere coding
index.html algorithms for MIMO detection, namely the VB algorithm and
2[Online]. Available: http://www.xilinx.com/ise/embedded_design_prod/ the SE algorithm. Both algorithms are ML lattice decoding algo-
index.htm rithms, which try to enumerate the lattice points inside a sphere,
Authorized licensed use limited to: Worcester Polytechnic Institute. Downloaded on November 12, 2008 at 13:24 from IEEE Xplore. Restrictions apply.
190 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 16, NO. 2, FEBRUARY 2008
and find the best lattice point with minimum distance to the re- 1) Preprocessing: Transform into an upper triangular ma-
ceived point. The main difference between these two algorithms trix by Cholesky decomposition algorithm [10]. Ini-
is the investigating order inside the lattice structure. The VB al- tialize the sphere radius by an adaptive method [11], set
gorithm searches from the lower bound to the upper bound in dimension index and , and find the upper
each search layer and examines all possible lattice points falling bound and index .
into a certain sphere in the lattice structure with an initial radius 2) Finite-State Machine (FSM): Upgrade . If
. Meanwhile, the SE algorithm spreads out from a nearby ( and ), then go to State A; If ( and
lattice point of the received signal and terminates once the total ), then go to State B; If , then go to State
distance is greater than the best distance and the search proce- C.
dure reaches the bottom layer. No initial radius is needed in the 3) State A: Expand the search into sublayer, find the
SE algorithm, and it is not required to upgrade the upper and parameters used to upgrade and , and go to State D.
lower bounds of each layer using the time consuming square 4) State B: Compute . If , record the cur-
root functions as needed in the VB algorithm. The detail proce- rently best distance and the best lattice point, set ,
dures for each of the sphere decoding algorithms are presented and go to State D. If , then go to FSM.
as follows. 5) State C: If , stop the algorithm. Otherwise, move
the search one layer up and go to FSM.
A. VB Decoding Algorithm 6) State D: Upgrade and that involve square root com-
putations, and then go to FSM.
The direct approach to find the closest lattice point is to enu-
merate all lattice points falling inside a sphere centered at the B. SE Decoding Algorithm
received point so as to identify the closest lattice point in the Eu-
Instead of examining the lattice points within a sphere, the SE
clidean metric. The VB decoding algorithm is developed based
algorithm searches the closest lattice point in the nearest hyper-
on the aforementioned principle. Basis reduction can be per-
plane. So no initial radius is required in the SE algorithm. The
formed on the lattice generation matrix to reduce the com-
search of the closest lattice point starts from the Babai point
plexity of the decoding procedure. In this case, Cholesky fac-
[23], and spreads out within the distance between the Babai
torization is applied to the Gram matrix and it
point and the received vector . The lattice generation matrix
yields , where is an upper triangular matrix.
is factorized into a lower triangular matrix and an or-
The closest lattice point search problem is formulated as
thonormal matrix using KZ reduction or LLL reduction [24],
where . The closest lattice point problem is formu-
(3)
lated as
(6)
(4)
(7)
Starting from the bottom row of matrix and working back- where the round function finds the closest integer to .
wards, the upper and lower bound of the examining lattice point The orthogonal distance to the current -dimensional layer
can be determined from (4). We use to represent the index can be found as
of the examined layer and to denote the upper bound of .
The closest lattice point search starts from the bottom layer to (8)
the top layer and scans each lattice index from the lower bound
to the upper bound. When the algorithm reaches the top layer where is the diagonal element of . Thus, the partial Eu-
without violating the bound constraint, a valid lattice point is clidean distance (PED) up to the -dimensional layer is calcu-
found. Then the new distance between the valid lattice lated as
point and the received vector is calculated and compared with
the current best distance . A closer lattice point to the re- (9)
ceived point is found if is less than . This lattice point
is saved and the search radius is upgraded as . The process
iterates until all of the lattice points within the sphere are exam- If the is less than the current best distance , it is
ined. More details about the theory of the decoding algorithm stored and the search procedure expands to dimensional
can be founded in [11]. sublayer with an update
The step-by-step procedures of the VB algorithm are listed in
the following. where (10)
Authorized licensed use limited to: Worcester Polytechnic Institute. Downloaded on November 12, 2008 at 13:24 from IEEE Xplore. Restrictions apply.
HUANG et al.: SYSTEM ARCHITECTURE AND IMPLEMENTATION OF MIMO SPHERE DECODERS ON FPGA 191
Authorized licensed use limited to: Worcester Polytechnic Institute. Downloaded on November 12, 2008 at 13:24 from IEEE Xplore. Restrictions apply.
192 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 16, NO. 2, FEBRUARY 2008
Authorized licensed use limited to: Worcester Polytechnic Institute. Downloaded on November 12, 2008 at 13:24 from IEEE Xplore. Restrictions apply.
HUANG et al.: SYSTEM ARCHITECTURE AND IMPLEMENTATION OF MIMO SPHERE DECODERS ON FPGA 193
Fig. 6. Example of the closest point search process: (a) sequential implemen-
tation and (b) parallel implementation.
Authorized licensed use limited to: Worcester Polytechnic Institute. Downloaded on November 12, 2008 at 13:24 from IEEE Xplore. Restrictions apply.
194 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 16, NO. 2, FEBRUARY 2008
TABLE II
FPGA RESOURCE USAGE OF THE LATTICE DECODING MODULES FOR A 4 24
MIMO SYSTEM WITH 16-QAM MODULATION
Based on the SoC architecture described in Section III, the where denotes the statistic percentage for visiting state
lattice point search procedures are mapped as customized hard- , which can be obtained from high-level simulation.
ware units on the FPGA to speedup the decoding rate. For the is the number of clock cycles used for state captured from
4 4 MIMO system, the FPGA resource usages and the post the FPGA circuit simulation. The simulation results show that
place-and-route (PAR) frequencies are compared between the the average processing time per state visit is 8.5 cycles for the
SE and VB algorithms as shown in Table II. The results show SE algorithm and 7.7 cycles for the VB algorithm. Based on
that the VB-based decoder uses more embedded multipliers and these parameters, the decoding rate of the sphere decoder for
more FPGA slices than the SE decoder because the VB algo- the 4 4 system with 16-QAM modulation at 20-dB SNR is
rithm involves more complicated computations. given in Table III. The results show that the SE based decoder is
more than 2.21 times faster than the VB based decoder. These
C. Decoding Rate Performance results match the theoretical analysis in [13].
These two algorithms are also implemented on DSP in com-
The data rate for the MIMO sphere decoder with transmit
parison with the FPGA-based decoders. A fixed-point DSP pro-
and receive antennas is determined by
cessor TMS320C6201 is chosen as the benchmark platform,
which operates at 200 MHz [25]. Fig. 7 gives a data rate compar-
(15)
ison between the FPGA and DSP implementations. The results
Authorized licensed use limited to: Worcester Polytechnic Institute. Downloaded on November 12, 2008 at 13:24 from IEEE Xplore. Restrictions apply.
HUANG et al.: SYSTEM ARCHITECTURE AND IMPLEMENTATION OF MIMO SPHERE DECODERS ON FPGA 195
TABLE IV TABLE V
DATA RATES FOR THE DSP-BASED SPHERE POWER AND AREA COMPARISON OF THE FPGA AND DSP IMPLEMENTATIONS
DECODING ALGORITHM IMPLEMENTATIONS
Authorized licensed use limited to: Worcester Polytechnic Institute. Downloaded on November 12, 2008 at 13:24 from IEEE Xplore. Restrictions apply.
196 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 16, NO. 2, FEBRUARY 2008
sented in this paper. Hardware/software codesign technique is [11] E. Viterbo and J. Boutros, “A universal lattice code decoder for fading
applied when partitioning the decoding algorithm on a single channels,” IEEE Trans. Inf. Theory, vol. 45, no. 5, pp. 1639–1642, May
1999.
FPGA device, with channel matrix preprocessing on an em- [12] C. P. Schnorr and M. Euchner, “Lattice basis reduction: Improved prac-
bedded processor and the decoding functions on customized tical algorithms and solving subset sum problems,” Math. Program.,
hardware modules. Multiple levels of parallelism are explored vol. 66, no. 2, pp. 181–191, 1994.
[13] E. Agrell, T. Eriksson, A. Vardy, and K. Zeger, “Closest point search in
in the system design in order to achieve faster throughput. The lattices,” IEEE Trans. Inf. Theory, vol. 48, no. 8, pp. 2201–2214, Aug.
decoders for a 4 4 MIMO system with 16-QAM modulation 2002.
are prototyped on a Xilinx XC2VP30 device with a MicroB- [14] A. Adjoudani, E. C. Beck, A. P. Burg, G. M. Djuknic, T. G. Gvoth,
D. Haessig, S. Manji, M. A. Milbrodt, M. Rupp, D. Samardzija, A.
laze soft core processor. Compared to the VB algorithm, the B. Siegel, I. Sizer, T. , C. Tran, S. Walker, S. A. Wilkus, and P. W.
SE algorithm has lower circuit complexity, smaller silicon area, Wolniansky, “Prototype experience for MIMO blast over third-genera-
and fewer lattice search iterations. The results show that the tion wireless system,” IEEE J. Sel. Areas Commun., vol. 21, no. 3, pp.
440–451, Mar. 2003.
SE based decoder is more than 2.21 times faster than the VB [15] M. Guillaud, A. Burg, M. Rupp, E. Beck, and S. Das, “Rapid proto-
based decoder and uses 30% fewer FPGA slices. The same al- 2
typing design of a 4 4 blast-over-umts system,” in Proc. 35th Asilomar
gorithms are also implemented on a TMS320C6201 DSP pro- Conf. Signals, Syst. Comput., 2001, pp. 1256–1260, vol. 2.
[16] Z. Guo and P. Nilsson, “A VLSI architecture of the soft-output sphere
cessor to compare the performance. The FPGA prototypes of decoder for MIMO systems,” in Proc. IEEE Int. Midwest Symp. Cir-
the SE and VB algorithms show that they support up to 81.5 cuits Syst., 2005, pp. 1195–1198.
and 36.1-Mb/s data rates at 20-dB SNR, which are about 22 and [17] O. Damen, A. Chkeif, and J. C. Belfiore, “Lattice code decoder for
space-time codes,” IEEE Commun. Lett., vol. 4, no. 5, pp. 161–163,
97 times faster than their respective implementations in a DSP. May 2000.
Power consumptions of the FPGA and DSP implementations [18] D. Garrett, L. Davis, S. ten Brink, B. Hochwald, and G. Knagge,
are comparable, but the areas of the FPGA designs are smaller. “Silicon complexity for maximum likelihood MIMO detection using
spherical decoding,” IEEE J. Solid-State Circuits, vol. 39, no. 9, pp.
The throughput advantage of the FPGA is due to its rich com- 1544–1552, Sep. 2004.
putational resources and the parallel processing using the cus- [19] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H.
tomized circuits. Future research will further optimize the VLSI Bolcskei, “Vlsi implementation of MIMO detection using the sphere
decoding algorithm,” IEEE J. Solid-State Circuits, vol. 40, no. 7, pp.
implementations to reduce the power and area utilization and to 1566–1577, Jul. 2005.
achieve faster system throughput on a single chip. [20] A. DeHon, “The density advantage of configurable computing,” IEEE
Comput., vol. 33, no. 4, pp. 41–49, Apr. 2000.
[21] R. Hartenstein, “A decade of reconfigurable computing: A visionary
retrospective,” in Proc. Int. Conf. Des., Autom. Test Eur., 2001, pp.
ACKNOWLEDGMENT 642–649.
[22] M. Vuletid, L. Pozzi, and P. Ienne, “Seamless hardware-software
The authors would like to thank Y. Liu (University of Col- integration in reconfigurable computing systems,” IEEE Des. Test
orado, Boulder) for the generous help on the decoding algo- Comput., vol. 22, no. 2, pp. 102–113, Feb. 2005.
[23] W. W. Hager, Applied Numerical Linear Algebra. Englewood Cliffs,
rithms from the information theory prospective. They would NJ: Prentice-Hall, 1988.
also like to thank the anonymous reviewers for their careful re- [24] H. Cohen, A Course in Computational Algebraic Number Theory.
view and precious comments. Berlin, Germany: Springer-Verlag, 1993.
[25] Texas Instruments Inc., Dallas, TX, “TMS320C6201 fixed-point digital
signal processor,” 2004.
[26] K. Compton and S. Hauck, “Automatic design of area-efficient config-
REFERENCES urable asic cores,” IEEE Trans. Comput., vol. 56, no. 5, pp. 662–672,
May 2007.
[1] G. J. Foschini and M. Gans, “On the limits of wireless communication [27] Texas Instruments Inc., Dallas, TX, “TMS320C62X/C67X power con-
in a fading environment when using multiple antennas,” Wireless Per- sumption summary (SPRA486C),” 2002.
sonal Commun., vol. 6, pp. 311–335, 1998. [28] J. Neel, S. Srikanteswara, J. H. Reed, and P. M. Athanas, “A com-
[2] B. M. Hochwald and S. Ten Brink, “Achieving near-capacity on a parative study of the suitability of a custom computing machine and a
multiple-antenna channel,” IEEE Trans. Commun., vol. 51, no. 3, pp. VLIW DSP for use in 3G applications,” in Proc. IEEE Workshop Signal
389–399, Mar. 2003. Process. Syst., 2004, pp. 188–193.
[3] D. Gesbert, M. Shafi, S. Da-shan, P. J. Smith, and A. Naguib, “From
theory to practice: An overview of mimo space-time coded wireless
systems,” IEEE J. Sel. Areas Commun., vol. 21, no. 3, pp. 281–302,
Mar. 2003.
[4] “Multiple-input multiple-output in UTRA (Release 7),” 3GPP Xinming Huang (M’01) received the Ph.D. degree in
TR25.876 (V7.0.0), 2006. electrical engineering from Virginia Polytechnic In-
[5] Draft Amendment to Wireless LAN Media Access Control (MAC) stitute and State University, Blacksburg, VA in 2001.
and Physical Layer (PHY) Specifications: Enhancements for Higher He is an Assistant Professor with the Department
Throughput, IEEE P802.11n/D1.0, 2006. of Electrical and Computer Engineering, Worcester
[6] Part 16: Air Interface for Fixed and Mobile Broadband Wireless Access Polytechnic Institute (WPI), Worcester, MA. He
Systems, IEEE802.16e standard for local and metropolitan area, 2006. was a member of Technical Staff with the Wireless
[7] A. Paulraj, R. Nabar, and D. Gore, Introduction to Space-Time Wireless Advanced Technology Laboratory, Bell Labs of
Communications. New York: Cambridge Univ. Press, 2003. Lucent Technologies, Whippany, NY, from 2001 to
[8] M. O. Damen, H. El Gamal, and G. Caire, “On maximum-likelihood 2003. He was also an Assistant Professor with the
detection and the search for the closest lattice point,” IEEE Trans. Inf. Department of Electrical Engineering, University of
Theory, vol. 49, no. 10, pp. 2389–2402, Oct. 2003. New Orleans, New Orleans, LA, from 2003 to 2006.
[9] U. Fincke and M. Pohst, “Improved methods for calculating vectors Dr. Huang was a recipient of a DARPA/MTO Young Faculty Award in 2007,
of short length in a lattice, including a complexity analysis,” Math. the IBM Faculty Fellowship Award in 2004, and the Central Bell Labs Annual
Comput., vol. 44, no. 170, pp. 463–471, 1985. Excellence and Teamwork Award in 2002. His research interests include the
[10] M. Pohst, “On the computation of lattice vectors of minimal length, areas of circuit design and system architecture, with an emphasis on recon-
successive minima and reduced bases with applications,” SIGSAM figurable computing, wireless communications, and networked embedded sys-
Bull., vol. 15, no. 1, pp. 37–44, 1981. tems.
Authorized licensed use limited to: Worcester Polytechnic Institute. Downloaded on November 12, 2008 at 13:24 from IEEE Xplore. Restrictions apply.
HUANG et al.: SYSTEM ARCHITECTURE AND IMPLEMENTATION OF MIMO SPHERE DECODERS ON FPGA 197
Cao Liang (S’06) received the B.S. degree in Jing Ma (M’02) received the B.S. degree from
communication and information engineering from Northwestern Polytechnic University, Xi’an, China,
the University of Electronic Science and Technology in 1993, the M.S. degree from the National Uni-
of China, Chengdu, China, in 2001, and the M.S. versity of Singapore, Singapore, in 1998, and the
degree in electrical engineering from the University Ph.D. degree from the Virginia Polytechnic Institute
of New Orleans, New Orleans, LA, in 2006. He is and State University, Blacksburg, in 2002, all in
currently pursuing the Ph.D. degree in electrical and electrical engineering.
computer engineering from Worcester Polytechnic She is currently with The MathWords Inc., Natick,
Institute, Worcester, MA. MA. She was an Assistant Professor with the Depart-
His research interests include high performance ment of Electrical Engineering, University of New
computing architecture, VLSI design, and routing in Orleans, New Orleans, LA, from 2003 to 2006. Her
collaborative computing networks. research interests include reconfigurable computing, wireless sensor networks,
and computer-aided design.
Authorized licensed use limited to: Worcester Polytechnic Institute. Downloaded on November 12, 2008 at 13:24 from IEEE Xplore. Restrictions apply.