This article has been accepted for publication in a future issue of this journal, but has not been

fully edited. Content may change prior to final publication.


Software-Defined Sphere Decoding for FPGA-based MIMO Detection
Xuezheng Chu, Member, IEEE, and John McAllister, Member, IEEE

Abstract Sphere Decoding (SD) is a highly effective detection technique for Multiple-Input Multiple-Output (MIMO) wireless communications receivers, offering quasi-optimal accuracy with relatively low computational complexity as compared to the ideal ML detector. Despite this, the computational demands of even low-complexity SD variants, such as Fixed Complexity SD (FSD), remains such that implementation on modern software-defined network equipment is a highly challenging process, and indeed real-time solutions for MIMO systems such as 4 × 4 16-QAM 802.11n are unreported. This paper overcomes this barrier. By exploiting large-scale networks of fine-grained softwareprogrammable processors on Field Programmable Gate Array (FPGA), a series of unique SD implementations are presented, culminating in the only single-chip, real-time quasi-optimal SD for 4×4 16-QAM 802.11n MIMO. Furthermore, it demonstrates that the high performance software-defined architectures which enable these implementations exhibit cost comparable to dedicated circuit architectures. Index Terms MIMO, Sphere Decoder, FPGA, Multicore.

I. I NTRODUCTION Multiple-Input, Multiple-Output (MIMO) communications systems [1] exploit spatial diversity to provide wireless communications channels of unprecedented capacity and throughput, prompting their adoption in wireless communications standards such as 802.11n [2]. A generic MIMO system employing M transmit and N receive antennas is shown in Fig. 1. Effectively harnessing the benefits of MIMO technology, however, relies on the existence of accurate, high throughput receiver equipment - a very significant embedded architecture design problem. This difficulty is due to two main factors. Firstly, the high computational complexity of accurate detector algorithms such as Sphere Decoders (SDs) is apparent in the current absence of reported real-time implementations for even moderate MIMO systems, such as the 4 × 4 16-QAM topologies employed in 802.11n [3], [4], [5], [6], [7], [8]. Furthermore, such realisations should ideally be ’software-defined’, for integration in modern network equipment and design processes
Authors are with the Institute of Electronics, Communications and Information Technology (ECIT), Queen’s University Belfast e-mail: xchu01,

June 22, 2012


(c) 2011 Crown Copyright. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.


Modulation & Mapping


hM 2

h1 N





ˆ s2



MIMO Channel


ˆ sN

Fig. 1.

Generic MIMO Communication System

[9], [10], whilst current implementations of parts of SD algorithms require custom circuit architectures to achieve real-time processing. Combining these two to achieve real-time, software-defined detection is a highly challenging implementation problem. This paper presents a unique design approach which overcomes these barriers. It extends the work in [11], [12] to create a series of unique real-time, software-defined SD processing architectures for 4 × 4 16-QAM 802.11n MIMO. Specifically, four contributions are made: 1) A highly efficient, software-defined FPGA processing architecture for baseband DSP [11], [12] is presented. 2) The architecture from 1) is used to create the first known real-time preprocessing architecture for 802.11n FSD. 3) It is shown how 1) enables the only known software-defined FSD metric calculation and sorting architecture. 4) The implementations from 2) and 3) are combined to create the only known full FSD detector for 4 × 4 16-QAM 802.11n. The remainder of this paper is organized as follows. Section III describes the software-defined FPGA processing paradigm, before it is used to create real-time architectures for preprocessing and metric calculation and sorting in Sections IV and V respectively. Finally, Section VI exploits this approach to create the only recorded single-chip, real-time SD architecture for 4 × 4 16-QAM 802.11n MIMO. II. BACKGROUND AND M OTIVATION In an M -transmit, N-receive antenna MIMO system, the M -element transmitted symbol vector s suffers multipath distortion and noise corruption (v) when propagating across the channel to the receiver. Hence the N-element received symbol vector y is formulated mathematically as (1), where H ∈ CN ×M represents the MIMO channel, used typically as a parallel set of flat-fading subchannels via Orthogonal Frequency Division Multiplexing (OFDM).

y = Hs + v

Demodulation & Separation





ˆ s1


SD is a receiver baseband signal processing approach employed to estimate s. It offers near-ideal detection performance with significantly reduced computational complexity relative to the ideal ML detector [13], [14].
June 22, 2012 DRAFT

(c) 2011 Crown Copyright. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing

3 Despite this. [16]. A range of simplified SD variants have emerged in an attempt to alleviate this complexity problem whilst maintaining quasi-ML detection accuracy. SD algorithms in general remain computationally complex and present a significant implementation challenge. FSD has a two-phase behaviour: 1) Pre-Processing (PP): The symbols of y are ordered for detection and the centre of the decoding sphere is initialised using Zero Forcing n2 . but has not been fully edited. this is a challenging real-time implementation problem. where Hi is the channel matrix with previously selected columns zeroed. such as the maximum 480 Mbps. groups of M symbols undergo detection via a tree-search structure illustrated in Fig. permission must be obtained from the IEEE by emailing pubs-permissions@ieee. [5]. Content may change prior to final publication. (3) June 22.11n [3]. even recorded custom circuit architectures have not proven capable of supporting real-time quasi-ML SD for 4 × 4 16-QAM 802. 2(a)) via an M -phase iterative process: 1) Calculate Wi according to (2). nfs is given by (3).. where only a single candidate detected symbol is maintained between layers. Personal use is permitted. 2) Metric Calculation & Sorting (MCS): An M -level decode tree performs a Euclidean distance based statistical estimation of s. In the context of the industrywide move toward software-programmable or ’software-defined’ DSP architectures which emphasise flexibility to support multiple radio standards along with implementation cost and performance [9]. [15]. √ nf s = M −1 . 2012 DRAFT (c) 2011 Crown Copyright. the worst distorted symbols undergo Full Search (FS). This is achieved by reordering the columns of H to give H† (the general form of which is illustrated in Fig. particularly in base-station equipment where demanding real-time performance metrics must be met. At each MCS tree level. . For any other purposes.This article has been accepted for publication in a future issue of this journal. 4µs latency required by 4×4 16-QAM 802. . where the search space is fully enumerated resulting in P child nodes at level i + 1 per node at level i. The node distribution at each level in the tree is given by nS = (n1 . (4) and (5) are performed.. For full diversity. deterministic behaviour and quasi-ML accuracy [19].. Fixed-Complexity SD (FSD) is exceptional since it uniquely combines relatively low complexity. PP orders the received symbols according to the perceived distortion experienced by each. where P is the number of QAM constellation points. Wi = (H−H )H−1 i i 2) The signal sk to be detected is selected according to ˆ   arg max (H† ) j i j k=  arg min (H† ) j (2) 2 2 if ni = P if ni = P i j Post-ordering. Amongst these.11n MIMO [2]. The remaining nss (nss = M − nf s) least distorted symbols subsequently undergo Single Search (SS). 2(b). nM )T . During the first nfs levels. [18]. [17].

obtained via QR decomposition of H during PP.nss) Least Distorted Symbol (N. Di = di + Di+1 (5) In (4) and (5). 2.nss) (2. Content may change prior to final publication.nss+1) (N. FSD Algorithm Components si = sZF. rij refers to an entry in R.j − sj ) s ˆ r j=i+1 ii 2 Mt (4) di = j=i 2 rij sZF. Personal use is permitted. but has not been fully edited.M) (2.M-1) (1.j − sj ˆ ˆ . Since Di+1 can be considered as the Accumulated Partial Euclidean Distance (APED) at June 22.M-1) (N.1) (1. 2012 DRAFT (c) 2011 Crown Copyright. .This article has been accepted for publication in a future issue of this journal.nss) (1. 4 nss (1.M-1) (2.i − ˜ ˆ Mt rij (ˆZF.2) (2.1) (2. permission must be obtained from the IEEE by emailing pubs-permissions@ieee. For any other purposes.M) (N. sZF is the center ˆ of the constrained FSD sphere and sj is the j th detected data.2) (1.1) (N.M) Most Distorted Symbol Increasing Distortion Detection Order (a) General Form of H† Received Symbols 1) Preprocessing 2) Metric Calculation & Sorting (MCS) (i) Full Search (FS) (ii) Single Search (SS) (iii) Sorting Detected Symbols (b) FSD Tree Search Structure Fig.nss+1) ( which is sliced to sj in subsequent iterations of the ˜ ˆ detection process [20].2) (N.nss+1) nfs (1.

quasi-optimal SD algorithms known [21]. prohibited real-time implementation of even PP. MCS and full detection are described in Sections IV. whilst modern equipment design processes require software-defined architectures. 2) FSD detection of all 108 OFDM subcarriers in 802. to date. the APED can be obtained by recursively applying (5) from level i = M to i = 1. e.This article has been accepted for publication in a future issue of this journal. Their programmability implies the need for data storage and circuitry for datapath control. two key issues currently prevent software-defined FPGA architectures from resolving this problem: 1) Software-defined FPGA architectures. Further. In 802. The architecture of the FPE is shown in Fig. in the form of Look Up Tables (LUTs) and Block RAM (BRAMs). 3) Existing real-time MCS realisations [22] rely on custom dedicated circuits. FSD is amongst the lowest complexity. Content may change prior to final publication.g. software-defined MIMO detection on FPGA.11n: 1) Single-chip detectors remain elusive: there is no recorded real-time architecture which integrates both PP and MCS for all 108 802.11n is a large scale 2) The high computational complexity of PP (the M iterations of the O(M 3 ) pseudo-inverse in (2) results in a O(M 4 ) algorithm) has.11n OFDM subcarriers.11n MIMO. This paper presents a unique solution which demonstrates the viability of real-time. a new approach to software-defined realisation of SD is required. Whilst technologies such as Field Programmable Gate Array (FPGA) are computationally capable of hosting real-time MCS at least. software-defined architectures capable of exploiting these resources are lacking. A unique. V and VI respectively. 5 level j = i + 1 of the MCS tree and di as the PED in level i. The resulting candidate symbols are sorted based on their Euclidean distance measurements. permission must be obtained from the IEEE by emailing pubs-permissions@ieee. To resolve this issue. Section III describes the processing architecture exploited. 2012 DRAFT (c) 2011 Crown Copyright. [23]. by realising FSD PP. June 22. before its effectiveness for FSD PP. existing FPGA processors are typically resource hungry and performance limited. but has not been fully edited. and the final result produced post-sorting. MCS and full detector architectures which meet the 480 Mbps.11n. Hence whilst modern FPGA house very high levels of programmable computational capacity. three outstanding issues remain in real-time software-defined FSD for standards such as 802. . 3. requiring a highly scalable processing architecture. lean processing architecture known as the FPGA Processing Element (FPE) is proposed to resolve this deficiency.11n FSD. For any other purposes. once per OFDM subcarrier employed. T HE FPGA-BASED P ROCESSING E LEMENT (FPE) The emergence of components such as the DSP48E on recent generations of Xilinx FPGA offer unprecedented levels of computational capacity enclosed in programmable datapath components. along with the highly parallel nature of FSD this has proven effective in enabling real-time FSD MCS [22]. we show how the resulting realisations exhibit cost comparable to custom circuit solutions. Personal use is permitted. this collective behaviour must be replicated 108 times. [24] are too costly and low performance to meet the real-time demands of 802. However. but despite modern FPGA housing plentiful resources with which to realise these. III. 4 µS latency requirements of 802.

employing flat criteria. Tx/Rx ports No. permission must be obtained from the IEEE by emailing pubs-permissions@ieee. Personal use is permitted. The FPE is configurable such that its architecture can be customised pre-synthesis in terms of the aspects listed in Table I. 2012 DRAFT (c) 2011 Crown Copyright. When implemented on Xilinx Virtex 5 VLX110T FPGA. The FPE architecture The FPE houses the minimum set of resources required for programmable operation: the instructions pointed to by the Program Counter (PC) are loaded from Program Memory (PM) and decoded by the Instruction Decoder (ID). or in the case of immediate data Immediate Memory (IMM) and processed by the ALU (implemented using a Xilinx DSP48E). the computational capability and cost of six FPE configurations . it is currently programmed manually at the assembly level. with neither speed nor area prioritized. in addition. 6 Branch Detection COMM Zerooverhead Loop ID DSP48E IMM PM RF ALU DM PC Coprocessor Branch Control Instruction Fetch ID/RF Source Select EXE1 EXE2 Result Select Write Back Fig. Data operands are read either from Register File (RF). a Data Memory (DM) is used for bulk data storage and a Communication Adapter (COMM) performs on/off-FPE communications. the FPE is programmable via the instruction set described in Table II. 32 bit Complex (32C) and 32 bit Real (32R) variants . Content may change prior to final publication. and the instruction set is extensible to incorporate new instructions for specific coprocessors./size immediate data Values 16/32 bits Real/complex 1-4 Unlimited Unlimited ≤1024 Unlimited memory locations 1 All synthesis results are post place-and-route.This article has been accepted for publication in a future issue of this journal. June 22. .are as described in Table III1 . For any other purposes. Width DM/ RF Capacity No. but has not been fully edited. to accelerate critical operations. the ALU can be extended with custom coprocessors. DSP48E slices PM Capacity/Instrn. In addition. TABLE I FPE C ONFIGURATION PARAMETERS Parameter DataWidth DataType ALUWidth PMDepth/PMWidth DMDepth/RFDepth TxCOMM/RxCOMM IMMDepth/IMM Width Meaning Data wordsize Type of data No. Further.16 bit Real (16R).org. 3.

MCS and full detector architectures using the FPE processing paradigm.5 474 215. Sections IV .8 increase in computational capacity. 2012 DRAFT (c) 2011 Crown Copyright. to the best of the authors’ knowledge. . These metrics imply that the FPE is the most likely of any software-defined FPGA architecture to support real-time FSD.VI examine implementation of TABLE III FPE A RITHMETIC P ERFORMANCE Resource Config LUTs 16 R 90 132 16 C 172 140 185 32 R 182 3 DSP48Es 1 1 2 4 2 Latency (Cycles) 4 7 5 5 6 7 Clock (MHz) 483 476 453 474 431 431 Throughput (MMACs/s) 483 119 226. 7 TABLE II FPE I NSTRUCTION S ET Instruction LOOP/RPT BEQ/BGT/BLT CTRL JMP GET/PUT GETCH/CLRCH NOP MUL/ADD/SUB ALU MULADD/MULSUB(FWD) COPROC LD/ST MEM LDIMM/STIMM LDIAR Function loop/repeat branch if equal/greater/less jump load/push data from/to channel load data from/clear channels no operation multiply/add/subtract multiply-add/subtract (& forward) coprocessor access load/store data from/to memory load/store data from/to IMM updata IMM address register Table III describes a range of performance/cost metrics that. Personal use is permitted. whilst enabling a factor 2. the FPE occupies only 18% of the resource of a conventional MicroBlaze processor. permission must be obtained from the IEEE by emailing pubs-permissions@ieee.5 431 June 22. for instance. For any other purposes.This article has been accepted for publication in a future issue of this journal. are unmatched in any other software-defined FPGA architecture. Content may change prior to final publication. but has not been fully edited.

and order are used for FSD MCS as defined in (4) and (5). Non-restoring 2 For clarity. where qi is the ith column of Q. Hence. 24 and 16 bit fixed-point variants and is compared with the ideal floatingpoint version in Fig. this avoids the M iterations of QRD in V-BLAST. Note the merged ordering and QRD of H in Phase 2. but has not been fully edited. This section tests the ability of the FPE to support real-time SQRD-based PP for FSD. As Fig. prohibited real-time implementation for MIMO systems such as 4 × 4 802. For any other purposes.11n: 1) SQRD remains highly computationally demanding.This article has been accepted for publication in a future issue of this journal. . the behaviours under 32 and 24 bit fixed point conditions are omitted from Fig. R. and hence are usually preferred. to date. It operates in two phases. A. 4 due to the high detection performance of 16 bit solutions. FPE-BASED P RE . FPE-based realisation is viable and 16 bit fixed-point arithmetic with a 10 bit fractional part is chosen for FPE-based SQRD. [26] can potentially overcome this issue. however since FPE-based realisation requires reduced-precision fixed-point arithmetic. in each of which the k th lowest entry in norm is identified (lines 9 & 10) before the corresponding column of R and elements in order and norm are permuted with the ith (line 11) and orthogonalized (line 12-18). June 22. 4 shows. 2) The square root (line 12) and division (line 13) operations used in SQRD offer very low performance on sequential processors [27]. enabling an observed order-of-magnitude complexity reduction for FSD PP. the BER performance of SQRD-based FSD detection of a 4 × 4 16-QAM Rayleigh Fading MIMO channel has been performed for 32. there are two major issues that must be resolved to enable FPE-based SQRD PP for 4 × 4 802. it appears that a large-scale multi-FPE architecture is required to enable SQRD for 4 × 4 802. The resulting Q. whilst deriving order. particularly when integer wordsizes of 9 or 10 bits are employed. 16 bit arithmetic in the SQRD phase is sufficient to enable detection performance almost equal to that of the ideal floating-point solution. recurrence algorithms generally exhibit lower complexity and latency. 42 . 2012 DRAFT (c) 2011 Crown Copyright. FPE-based Division Acceleration Binary division is usually achieved using digital recurrence or convergence algorithms [27]. similar verification under these conditions is required. Personal use is permitted. However the recent emergence of sub-optimal PP algorithms such as Sorted QRD (SQRD) [25]. order. special consideration of these is required for real-time PP for 802. norm and nf s are initialized as shown in lines 2-5 of Algorithm 1.PROCESSING U SING SQRD Section II described how the complexity of SD PP has. Despite its relatively low complexity and suitability for fixed-point implementation.11n MIMO. To this end.11n. Phase 2 comprises M iterations. Content may change prior to final publication. 8 IV. The quasi-ML accuracy of floating-point SQRD-based FSD is demonstrated in [26]. R. given the capabilities of a single FPE. SQRD-based ordering for FSD transforms the input channel matrix H to the product of a unitary matrix Q and an upper-triangular R via QR decomposition.11n. In Phase 1 Q. as described in Algorithm 1 [26]. Of these alternatives. the order of detection of the received symbols during MCS. as outlined in Table IV. permission must be obtained from the IEEE by emailing

Personal use is permitted. 5. For any other purposes.l end end Algorithm 1: Sorted QR decomposition for FSD recurrence algorithms generally enable higher performance FPGA implementations by avoiding sophisticated control overhead [28]. R = 0M . nf s = for i ← 1 to M do normi = qi end Phase 2: SQRD ordering for i ← 1 to M do k = min (nf s + 1.l · qi norml = norml − r2 i. This equates to approximately 1. M output: Q. · · · .the structure of these coprocessors are outlined in Fig. FPE-R4 ) when implemented on Virtex 5 FPGA is described in Table V. Content may change prior to final publication.i = normi qi = qi /ri.l = qH · ql i ql = ql − ri. 9 input : H. but has not been fully edited. permission must be obtained from the IEEE by emailing pubs-permissions@ieee. order 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 Phase 1: Initialization Q = H. to accelerate FPE-based division. As this shows. The performance. or TP/LUT) of the programmed FPE implementation (FPE-P) and radix-2/4 coprocessor augmented FPEs (FPE-R2 .11n would require at least 100 FPEs dedicated solely to division. Radix-2/4 non-restoring division coprocessors [27] are considered in this context . which means that to achieve the 120 MDIV/s required by real-time 4 × 4 SQRD for 2012 DRAFT (c) 2011 Crown Copyright.2 MDIV/s (millions of divisions per second).This article has been accepted for publication in a future issue of this journal. norm and Q √ ri. R. M ].··· . FPE-R2 and FPE-R4 increase June 22. M − i + 1) k 2 √ M −1 ki = arg min normj j=i. specifically its ability to incorporate coprocessors within the ALU. order.M Exchange columns i and ki in R. The high resource cost such a solution could entail may potentially be avoided by exploiting the configurable nature of the FPE. cost and efficiency (in terms of throughput per LUT.i for l ← i + 1 to M do ri. . order = [1. Non-restoring 16 bit division [27] requires 312 cycles on a 16R FPE.

Personal use is permitted. enabling a 93. Content may change prior to final publication. 4×4 16-QAM Fixed Point Simulation TABLE IV 4 × 4 SQRD O PERATIONAL C OMPLEXITY Operation +/ − × ÷ √ No. FPE-C offers simultaneous increases in throughput and efficiency by factors of 23 and 10 respectively as compared June 22. For any other purposes. There are two primary options for achieving this: software-based execution on the native FPE. . the implied implementation cost and performance metrics of each option are summarised in Table V. B. Accordingly. 2012 DRAFT (c) 2011 Crown Copyright. Given the need for 120 MDIV/s for SQRD-based detection of 4 × 4 802.12 throughput by factors of 8. 4.12 0.7 as compared to FPE-P respectively. This suggests that FPE-R2 represents the lowest cost real-time solution. FPE-based Acceleration Of Square Root Operations Implementing square root operations poses a similar problem to that of division .11n system. 10 10 0 4x4 16−QAM FSD SQRD ORDERING Fixed Point Simulation 10 −1 BER (Bit Error Rate) 10 −2 10 −3 10 −4 16 bit−8 fraction 16 bit−9 fraction 16 bit−10 fraction 16 bit−11 fraction 16 bit−12 fraction 16 bit−13 fraction Floating Point 6 8 10 12 14 16 SNR(dB) 18 20 22 24 26 Fig. or by using a standard CORDIC component available in vendor IP libraries [29]. The programmed solution (FPE-P) is compared with that incorporating the CORDIC coprocessor (FPE-C) in Table VI. per second (×109 ) As this shows.9 and 13.24 12.4 and 10. using the pencil-and-paper method [27].11n MIMO systems.3 and hardware efficiency by factors of 9. this approach is adopted in the FPE-based SQRD implementation.120 MSQRT/s (million square root operations per second) are required for real-time SQRD-based detection of a 4 × 4 802. permission must be obtained from the IEEE by emailing pubs-permissions@ieee.72 0. but has not been fully edited.This article has been accepted for publication in a future issue of this journal.4% reduction in resource cost relative to FPE-P.

For any other purposes. Personal use is permitted. TABLE VI C OMPARISON OF 16 BIT PSQRT. Content may change prior to final publication.This article has been accepted for publication in a future issue of this journal.1 June 22. and is adopted for realising FPE-based square root operations. 2012 DRAFT (c) 2011 Crown Copyright.11n are summarised in Table VII. Hence FPE-C enables real-time operation whilst incurring only 11% of the resource required by FPE-P. CSQRT ON FPE FPE-P [27] PM/RF locations Cost LUTs DSP48Es Clock (MHz) Latency (Cycles) Throughput (MSQRT/s) TP/LUT (×10− 3) 29/14 142 1 367.7 191 1.600 900 944 Throughput (MDiv/s) 120 120 144 to FPE-P. . permission must be obtained from the IEEE by emailing pubs-permissions@ieee.93 13. This implies that the resources required to realise real-time square-root for SQRD-based detection of 4 × 4 11 Q16-j Quotient MSB Partial remainder Divisor Add/Sub Complement 16 16 1/2 16-bit Adder 16 1/2 1 quotient bit obtained per iteration Fig.6 FPE-C [29] 8/1 330 0 350 8 43.6 132. but has not been fully edited. SQRD Divider Coprocessor Architecture TABLE V SQRD D IVISION I MPLEMENTATIONS Resource Solution FPEs FPE-P FPE-R2 FPE-R4 100 5 4 DSP48Es 100 5 4 LUTs 13. 5.

7.1 -T2. Multiple Data (MIMD) processing architecture. division and square root. These results are notable since.3 ). D. illustrated in Fig. Hence. in accordance with the analysis in Section IV. with the separate parts of norm gathered by FPE4 to undergo ordering. 6(a) describes the SQRD algorithm as a flow chart composed of four main iterative tasks (T1 . Fig. 2012 DRAFT (c) 2011 Crown Copyright. the FPEs employ 16 bit datapaths. to implement PP for all 108 subcarriers of 802. . R and norm respectively (lines 14 . According to the metrics quoted in Table VIII(b). Initially.3 permuting and updating Q. Inter-FPE communication occurs via point-to-point FIFO links. Across the architecture.11n. the cost and performance of the 802. Implementation Evaluation When implemented on Xilinx Virtex 5 VSX240T FPGA. 7 are as quoted in Table IX.1 .This article has been accepted for publication in a future issue of this journal. integrating these components into a coherent processing architecture to perform SQRD. and are otherwise configured as described in Table VIII(a). To realise this behaviour a 4-FPE Multiple Instruction. 7. and computes the diagonal elements of R (lines 11 . to the best of the authors’ June 22. with the subsequent concurrent tasks T2. permission must be obtained from the IEEE by emailing pubs-permissions@ieee. These resulting metrics are distributed to the outer FPEs for independent iterative permutation and update of Q. conducts the iterative channel norm ordering. T2. and replicating that behaviour to provide PP for the 108 subcarriers of 802.BASED SQRT I MPLEMENTATIONS Resource Solution FPEs FPE-P FPE-C 63 3 DSP48Es 63 3 LUTs 8946 990 Throughput (MSqrt/s) 121.FPE3 perform permutation of Q.3 in Fig. The first task. For any other purposes.6 130. R and norm and iterative updating (T2. FPE-based SQRD Whilst Sections IV-A and IV-B have provided FPE-based components of sufficiently high performance and low cost to enable real-time division and square-root. The mapping of subcarriers to groups is as described in Fig. The performance and cost of the 4-FPE grouping is given in Table VIII(b). Personal use is permitted. Content may change prior to final publication. is used.8 C.13 in Algorithm 1). incorporating 36 groups of the 4-FPE array.11n.T2. H and the calculation of norm are distributed amongst the FPEs. but has not been fully edited. chosen due to their relatively low cost on FPGA and implicit ability to synchronize the multi-FPE architecture in a data-driven manner whilst avoiding data access conflicts. is used.11n PP architecture (FPE-SQRD) described in Fig. 6(a)).T2.18 in Algorithm 1).1 . R and norm. 6(b). the throughput of each 4-FPE group is sufficient to support SQRD-based PP of 3 subcarriers within the real-time constraints of 802. whilst FPE4 calculates the diagonal elements of R (T1 ). the architecture illustrated in Fig. SQRD-based PP of a 4 × 4 matrix occurs in three phases. FPE1 . challenging implementation problem. T1 . 12 TABLE VII MIMO is a large scale.

i  norm i q i  q i ri . they do not implement it. The June 22.. R ki T1 FPE4 FPE4 for l in i  1  M for l in i  1 to M ri. such as [31]. 4 × 4 SQRD FPE-MIMD Mapping TABLE VIII 4-FPE BASED SQRD (b) 4 × 4 SQRD Architecture (a) FPE Configuration Parameter PM Depth RFDepth IMMDepth DMDepth TxComm RxComm Value 350 Cost 32 32 64 32 32 (b) FPE-SQRD Metrics Aspect LUTs DSP48Es BRAMs Clock (MHz) T (MSQRD/s) Latency (µS) Value 2109 4 0 3.9 knowledge. permission must be obtained from the IEEE by emailing pubs-permissions@ieee.whilst a number of SD implementations rely on SQRD-based PP. end Y i<44 i N end T2.1 N permute R i . comparing with other PP realisations is difficult . Table IX compares FPE-SQRD with a custom circuit-based V-BLAST detector [30]. balanced objective comparison is difficult since it reports only the resource cost for an ASIC-based 2 × 2 SQRD. Personal use is permitted. they constitute the only recorded real-time SQRD PP and FSD PP architectures for 4 × 4 MIMO. They achieve 32.3 FPE3 (a) 4 × 4 SQRD Fig.1 FPE2 Y T2.2 permute Q i .3 permute norm i . 2012 DRAFT (c) 2011 Crown Copyright. M k ri . but has not been fully edited. Given that they are unique in enabling real-time PP for SD-based detection for 4×4 MIMO. [32].org. norm ki T2. Further.5 MSQRD/s (millions of SQRD operations per second) exceeding the required 30 MSQRD/s.07 0.l  q iH  q l q l  q l  ri . 6. Q ki T2. whilst the work in [33] describes an SQRD implementation. giving no measure of real-time performance. .15 1. Content may change prior to final publication.l  q i end end for l in i  1 to M norm l  norm l  ri 2l . 13 start i =1 T1 ki  arg min norm j j  i .i i =i+1 i4 T2.2 FPE1 T2. For any other purposes.This article has been accepted for publication in a future issue of this journal.

Indeed. June 22. Section V investigates its ability to support the second major suboperation of FSD: MCS.0 87 22 1. this analysis shows that the FPE has enabled the only software-defined real-time PP architecture for 4 × 4 16-QAM 802.560 144 0 152. Personal use is permitted.512 426 N/A 276. further.11n FSD. combining these relates to an overall reduction of 45% in terms of Equivalent LUTs (ELUTs)3 . whilst enabling an increased throughput of almost 50%.5 265 32. 4 × 4 SQRD Mapping TABLE IX 4 × 4 SQRD I MPLEMENTATIONS Ref LUTs Resource DSP48Es BRAMs ELUTs (×103 ) Clock (MHz) T (MSQRD/s) L (µS) FPE-SQRD 70. 7. it significantly reduces the demand for DSP48E resources by 66%.This article has been accepted for publication in a future issue of this journal.43 implementation in [30] does not achieve the required throughput for real-time processing and whilst FPE-SQRD consumes significantly more LUT resource. Content may change prior to final publication. For any other purposes. whilst incurring resource costs comparable to existing circuit architectures. 3 ELUTs combine measurement of resource on modern Xilinx FPGA in a single quantity.1 [30] 33. permission must be obtained from the IEEE by emailing pubs-permissions@ieee. . 2012 DRAFT (c) 2011 Crown Copyright. the FPE-SQRD it is the only real-time realisation of any kind. the reader is referred to [34] for details.5 1. but has not been fully edited. Hence. 14 Subcarrier Subcarrier Subcarrier Subcarrier Subcarrier Subcarrier 5 6 2 3 4 1 FPE1 Subcarrier Subcarrier Subcarrier 108 106 107 FPE2 FPE4 FPE4 FPE1 FPE2 FPE2 FPE1 FPE1 FPE2 FPE3 FPE4 FPE4 FPE4 FPE4 FPE4 FPE4 Core 36 FPE3 FPE3 FPE3 Core 1 Core 2 Core 3

can significantly reduce these penalties. per second (×109 ) 32.11n is even more computationally demanding than SQRD-based PP. . Content may change prior to final publication.BASED FSD MCS FOR 802.5 Table XI reports a large increase in resource cost for 16R-MCS as compared to the basic 16R reported in Table III.20 TABLE XI FPE.7% of the total number of instructions. in a manner similar to that described in Section IV to accelerate division and square root operations. but has not been fully edited. For any other purposes. Personal use is permitted.BASED MCS I MPLEMENTATIONS 16R-MCS PM/RF Locations Cost LUTs DSP48Es Clock (MHz) Latency (Cycles) Throughput (MOP/s) 4591/32 2520 1 367. A SWITCH coprocessor (Fig. As a result. as described in Table 15 V. June 22. 8(a)). the wasted cycles these NOPs represent dramatically increase cost and reduce throughput . but their ability to eliminate wasted instructions can significantly reduce the PM size leading to an overall cost decrease and performance increase when these coprocessors are used.indeed branch and NOP instructions represent 50. permission must be obtained from the IEEE by emailing pubs-permissions@ieee. FPGA.7 3281 1. Each of these coprocessors costs 20 LUTs. optimising the FPE architecture to reduce the impact of these brach instructions could have a significant impact on the MCS cost/performance.11 N FSD O PERATIONAL C OMPLEXITY Operation +/ − × No. whose number is swollen by the FPE’s deep pipeline [12]. whilst a MIN coprocessor (Fig. which compares the input to one of a number of pre-defined options can be used to accelerate slicing. A significant factor in this large number of instructions are the comparison operations required for slicing (equation (4)) and sorting the PED metrics.37 19. 8(b)) can accelerate the sorting operation.This article has been accepted for publication in a future issue of this journal. 2012 DRAFT (c) 2011 Crown Copyright. The observed order of magnitude increase is a consequence of the large PM required to house the 4591 instructions required. which require branch instructions. Associated with these branch instructions are NOPs. TABLE X 802. the performance and cost are as reported as 16R-MCS in Table XI.9 16R + Coprocessors 1420/32 805 0 350 1420 4.11 N The MCS stage of FSD for 4 × 4 16 QAM 802. When a single 4×4 16-QAM FSD MCS is implemented on a 16R FPE. Employing ALU coprocessors.

SIMD-based Implementation of 802. This produces an implementation capable of realising FSD MCS for a single 802. Note that the PC. a collection of smaller SIMDs is used. 16 > 1 2 3 SWITCH 4 MIN SUB DSP48E (a) Switch Coprocessor Fig. each tree branch performs an identical sequence of operations on distinct data streams . For any other purposes. PM.11n FSD MCS A large.the definition of Single Instruction Multiple Data (SIMD) behaviour. Table XII defines the configurable aspects of the SIMD processor. BGT and BLT) can be used as SIMD instructions. 2(b)).11n subcarrier in real-time. . Personal use is permitted. ID and IMM are now all centralised in the SIMD. Content may change prior to final publication. 8. 2012 DRAFT (c) 2011 Crown Copyright. All of the FPE instructions (except BEQ. permission must be obtained from the IEEE by emailing and hence do not appear in each FPE way. 2) The number of FPEs required to implement MCS for all 108 OFDM subcarriers on a single. the FPE is used as a foundation for a configurable SIMD processor component. 9. SIMD Processor Architecture June 22. A. including these components results in a 68% reduction in resource cost and a factor 2. providing a good foundation unit for implementing MCS for all 108 subcarriers. as illustrated in Fig. To enable these multi-SIMD architectures. very wide SIMD processor implies limitations on the achieveable clock rate as a result of high signal fan-outs to broadcast instructions from a central PM to a very large number of ALUs.This article has been accepted for publication in a future issue of this journal. Two important observations of the application’s behaviour help guide the choice of multiprocessing architecture: 1) In the tree-structured FSD MCS (Fig. FPE Coprocessors for Switch and Min Acceleration (b) Min Coprocessor as described in column 3 of Table XI. coherent collection of FPEs is required to implement FSD MCS for all 108 subcarriers of 802.11n MIMO. As this shows. 9.3 increase in throughput. but has not been fully edited. PC RF IMM FPE FPE FPE FPE RF RF RF … PM ALU ALU ALU … ALU ID Fig. Hence. restricting performance [11].

the Level 2 SIMD is composed of 16 ways. parallel FPE elements No. Level 1 consists of 8 SIMDs.11n. where j is the set of subcarriers j=1 i=0 processed by Core i. Subcarrier Subcarrier 105 106 Subcarrier Subcarrier Subcarrier 9 Subcarrier 10 1 2 Subcarrier Subcarrier 16 8 Subcarrier 108 Level 1 (8 x 16-way SIMD Cores) Core 0 Core 1 Core 7 Fig.This article has been accepted for publication in a future issue of this journal. and communication between the two levels exploit 8-element FIFO queues. The Level 1 SIMDs incorporate SWITCH coprocessors to accelerate the slicing operation. All processors exploit P M Depth = 128. 17 TABLE XII SIMD P ROCESSOR C ONFIGURATION PARAMETERS Parameter SIMDways IMMDepth/IMMWidth PMDepth/PMWidth Meaning No. as illustrated in Fig./width of PM locations Values Unlimited Unlimited Unlimited The increasing limiting effect of instruction broadcast from the central PM results in 16-way SIMD configurations offering the best cost/performance balance. hence each FPE is configured to exploit 16 bit real-valued arithmetic. Sorting for the subcarriers implemented in each Level 1 SIMD is performed by adjacent pairs of ways in the Level 2 SIMD . whilst the Level 2 SIMDs support the MIN ALU extension to accelerate the sort operation.11n subcarriers is implemented on a dual-layer network of such processors. The 802. .11n OFDM MCS-SIMD Mapping June 22. but has not been fully edited. For any other purposes. The analysis in [22] shows that 16 bit data is sufficient for FSD-based detection of 4 × 4 16-QAM 802. Content may change prior to final publication. RF Depth = 32 and DM Depth = 0. 2012 Level 2 (16-way SIMD) 8 DRAFT (c) 2011 Crown Copyright. The 16 branches of the MCS tree for each subcarrier are processed in parallel across the 16 ways of the Level 1 SIMD onto which they have been 10. accordingly FSD MCS for all 108 802. 802.hence given the 8 Level 1 SIMDs. 10.11n subcarriers are clustered into 8 groups {Gi = {j : (j − 1) mod 8 = i}108 }7 ./width of IMM locations No. Personal use is permitted. permission must be obtained from the IEEE by emailing pubs-permissions@ieee.

11n. with the empty parts of the program flow representing NOP instructions.2 PUT FPE16 Slice4.1 Slice2.2 APED2.1 APED2.15 Slice2.12 PUT FSD Slicing /APED FPE2 Slice4.2 APED4. permission must be obtained from the IEEE by emailing pubs-permissions@ieee.587 0 0 N/A 52 27.1 APED3.1 Slice4. As this shows.16 Program Flow APED3.0 150 600 N/A [17] 18. independent instructions from another.2 Slice3. interleaving branches occupies wasted NOP cycles.2 APED1.15 APED4. 11(b).16 … Slice2.2 APED3.16 PUT PUT Program Flow NOPs 8 Interleaved FSD tasks (a) Original FSD Threads Fig.1 APED1. but has not been fully edited.2 Slice2.601 144 0 98.1 APED4.15 Slice3.1 Slice1.16 Slice2.2 Slice1.1 Slice3. the NOP cycles in one branch can be occupied by the useful. it comfortably exceeds the real-time performance criteria of 802. FPE Branch Interleaving (b) Interleaved Threads When implemented on Xilinx Virtex 5 VSX240T FPGA the performance and cost of the FSD-MCS for 802.15 APED3.11n are reported as FPE-MCS in Table XIII.16 APED2.15 Slice1. These NOP cycles represent 29% of the total instruction count but since they represent ALU idle cycles they should preferably be eliminated.16 Slice1.15 APED1. FPE1 FPE8 FPE1 Slice4. 11. TABLE XIII 4 × 4 16-QAM FSD I MPLEMENTATIONS Ref LUT DSP48E BRAM ELUT (×103 ) Clock (MHz) T (Mbps) L (µS) FPE-MCS 16.7 N/A June 22.16 APED2.16 Slice3.5 296 502.16 Slice3.1 Slice1.16 APED1.1 Slice3.2 PUT PUT Slice4.16 APED3.2 Slice3.1 APED1.1 APED2. To achieve this. .2 APED3. Content may change prior to final publication.e the branches may be interleaved as illustrated in Fig.1 Slice2.5 0.16 APED1.16 APED4. For any other purposes. used to properly synchronise movement of data into and out of memory.16 PUT Slice4.1 APED4.15 APED2. i. 18 The program flow for each Level 1 SIMD is as illustrated in Fig.each FPE performs a single branch of the MCS tree.9 [22] 13. 11(a). Personal use is permitted.197 160 49 168.2 Slice2. and is the first software-defined implementation to do so. 2012 DRAFT (c) 2011 Crown Copyright.893 64 12 N/A 100 200 N/A [18] 6.1 APED3.16 APED4. As this shows. to the extent that when two branches are interleaved the proportion of wasted cycles is reduced to 4%.15 APED4. As this shows.2 APED1.This article has been accepted for publication in a future issue of this journal.2 Slice1.2 APED2.16 Slice1.

and the architectural changes necessary to enable real-time performace are such that direct comparison is very difficult. and hence is inherently more flexible for adaption to other SD or even other more general DSP algorithms.3 Clock (MHz) T L (µS) June 22. but in common with [6] the resource and performance implications of adapting this to enable all 108 802. The work in [22] displays a slightly lower LUT cost and higher Personal use is permitted.11n. the FPE-based design approach is applied to the design of a full FSD detector. it trades absolute performance/cost for flexibility and ease of design: it offers a predominately software-based design approach.BASED S OFTWARE -D EFINED FSD FOR 802. Finally.This article has been accepted for publication in a future issue of this journal. described in Sections IV and V respectively.11n.115 408 N/A 328 189 483 2. are combined to create a full FSD detector implementation. Given that this implementation realises the real-time processing requirements of 4 × 4 16 QAM 802. 2012 DRAFT (c) 2011 Crown Copyright. Similarly to the FPE-SQRD presented in Section IV. it shows that massively parallel networks of simple processors (> 140 in this case) on FPGA can support real-time processing with resource costs comparable to custom circuits. but the FPE-MCS enables software-programmability whilst maintaining real-time behaviour and comparable cost. making comparison difficult. the novel aspect of the FPE-MCS is clear: it is the only software-defined approach which supports real-time FSD MCS for 4 × 4. The work in [8] presents a very high performance 800 Mbps single subcarrier architecture on Altera FPGA. Given these comparisons. despite its software-defined nature. The architectures in [17].11n subcarriers is unknown. as well as being more suited to existing software radio design processes. the cost and performance of the implementation are as reported in Table XIV. For any other purposes. target technology variations between the FPE-MCS and the 2 × 2 ASIC-based custom circuit in [35] make comparison very difficult also. VI. to the best of the authors’ knowledge it is the only single-chip implementation to do so. 19 Table XIII also compares the FPE-MCS with existing Xilinx FPGA custom circuit SD realisations. FPGA. [18] operate well below realtime. but has not been fully edited. 16 QAM 802.11 N When the PP and MCS implementation strategies. Content may change prior to final publication. Like all software-defined radio platforms. TABLE XIV 4 × 4 SQRD FSD F ULL D ETECTOR I MPLEMENTATIONS Aspect LUTs Resource DSP48Es BRAMs ELUTs (×103 ) FPE-FSD 96. In Section VI. . permission must be obtained from the IEEE by emailing pubs-permissions@ieee.

. the resulting software-defined architectures satisfy the demanding real-time performance metrics of modern MIMO standards. B. [8] M. Catthoor. June 2009. C. and F. Matthew Milford for their valuable assistance in this work. IEEE Intl. R. [2] IEEE802. [9] J.” 2009. “802. Jul 2005. [4] X. Foschini. John Thompson. Van Der Perre. and S.” Design. Fichtner. permission must be obtained from the IEEE by emailing pubs-permissions@ieee. Li.11n MIMO.” e Elsevier Journal of Signal Processing. Valenzuela. Prof. whilst incurring resource costs of the order of existing dedicated circuit architectures. 90. [3] A. Roger Woods. C ONCLUSION This paper has presented a uniquely capable approach for implementing SD detectors for MIMO receivers: it is the first software-defined platform to support real-time detection for applications such as 4 × 4 16 QAM 802.” IEEE Journal of Solid-State Circuits. M. 1 –5. Zellweger. Choi.-D. VLSI Systems. June 22. Ma.11n MIMO. Bolcskei.” in 1998 URSI Int. by composing fine-grained. J. 40. 2. Software Defined Radio: The Software Communications Architecture. vol. G. and M. 68 –73. Oct. G. M.This article has been accepted for publication in a future issue of this journal. this is only the case given supporting Computer Aided Design (CAD) and software compilation infrastructure. 2008. Conf. 2007. M.” in 2009 IEEE Intl. This paper has concentrated on demonstrating the feasibility of the architectures to support such realisations. Signals. 7. 3) The only recorded single-chip integrated quasi-optimal detector for 4 × 4 16-QAM 802. 1998. Automation and Test in Europe (DATE). [5] M. Content may change prior to final publication. “Efficient FPGA Implementation of MIMO Decoder for Mobile WiMAX System. Wolniansky. This paper has shown how. Xu. pp. “Programmable Processor Implementations of K-best List Sphere Detector for MIMO Receiver. and R. This work is supported by the UK Engineering and Physical Sciences Research Council (EPSRC). Juntti. For any other purposes. 2008. Computer Design (ICCD). [7] J. M. pp. no. Creating these technologies is left as future work of similar significance to the demonstration of architectural viability presented here. R EFERENCES [1] P. and J. Novo. Dr Chengwei Zheng and Mr. pp. pp. L. Golden. 2) The only recorded real-time software-defined FSD MCS architecture for 4 × 4 16-QAM 802. 16.” in Proc. W. Dash. pp. 1. Abdallah. pp. Kovarik Jr. “V-BLAST: An Architecture for Realizing Very High Data Rates Over The Rich-Scattering Wireless Channel. Silv´ n. Personal use is permitted. 313–323.11n MIMO. Symp. under grant EP/F031017/1. Conf. and Electronics. no. but has constructed and programmed them manually at the Register Transfer Level (RTL) and assembly level respectively. Borgmann. 20 VII. vol. March 2008. ACKNOWLEDGMENT The authors would like to thank Prof.11n. 295–300. on Communications (ICC’09). Huang. Burg. Khairy.” IEEE Trans. However. vol. 2009. no. Bard and V. . W. “Optimizing Near-ML MIMO Detector for SDR Baseband on Parallel Programmable Architectures. very high performance programmable components into large scale multiprocessing architectures on FPGA. “Dynamically Reconfigurable Soft Output MIMO Detector. 444–449. It is important to note that their software-defined nature implicitly eases the design process for architectures such as these. We have demonstrated this by creating three unique implementations: 1) The only recorded SQRD-based PP architecture for 4 × 4 802. but has not been fully edited. [6] P. and G. Wiley. D. Bougard. Liang. “VLSI Implementation of MIMO Detection Using The Sphere Decoding Algorithm. Bhagawat.11n MIMO. and H. Wenk. Janhunen. 1566–1577. O.11n-2009 IEEE Local and metropolitan area networks–Specific requirements Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications Amendment 5: Enhancements for Higher Throughput. Systems. “System Architecture and Implementation of MIMO Sphere Decoders on 188–197. 2012 DRAFT (c) 2011 Crown Copyright. pp. Habib.

Chu. no. pp. 97–106. 2131 –2142. W. [14] C. [12] X. Yu. March 2010. “MMSE Extension of V-BLAST based on Sorted QR Decomposition. McAllister. J. vol. “LogiCORE IP CORDIC v4. “Fixing the Complexity of the Sphere Decoder for MIMO Detection. “A Novel VLSI Architecture of Fixed-Complexity Sphere Decoder. 2011. Automation Test in Europe (DATE). pp. May 2009. 2009. on Field-Programmable Technology (FPT). Oct. [18] B. 601 –604. Content may change prior to final publication. “A Low-Area Flexible MIMO Detector for WiFi/WiMAX Standards. pp.” in IEEE Intl. Lemieux. Webb. Kuhn. Bohnke. pp. on Signals. and T. P. 1.” IEEE Trans. “Real-Valued Fixed-Complexity Sphere Decoder for High Dimensional QAM-MIMO Systems. Thompson. vol. G. but has not been fully edited. and R.-D. Successive Minima and Reduced Bases with Applications. on Reconfigurable Computing: Architectures. Chu. pp. J. pp. M. Silven.” Mathematical Programming. T. [13] M. Single and Multi-carrier Quadrature Amplitude Modulation: Principles and Applications for Personal Communications. 6. 174 –179. no. [23] P. permission must be obtained from the IEEE by emailing pubs-permissions@ieee. S. Kim. Barbero and J. June 2008. and C. 1. on Field Programmable Gate Arrays (FPGA). Janhunen. Nov. 59.1. [33] J. [22] L. Moezzi-Madani. 8–11. Euchner. Wubben.” SIGSAM Bull. and K. Tools and Applications (ARC). pp. O. Feb. [32] M. 181–199. McAllister. 2010. Chu and J. Nov. 2010. 4493 –4499. 100–107. McAllister. 737 –744. M. 2003. vol. 66. K. 2002. on Embedded Computer Systems: Architectures. “Fine-Grain Performance Scaling of Soft Vector Processors. Qi and C. 479 –484. and R. G. Systems and Computers. Steffan. 1994.” in 2011 Wireless Innovation Forum (SDR’11-WINNComm). and R. Kammeyer. pp. Computer Arithmetic: Algorithms and Hardware Designs. 2009. G. 19. Personal use is permitted. June 22. [19] L.This article has been accepted for publication in a future issue of this journal. “Implementation of High-Speed Fixed-Point Dividers on FPGA. Juntti. 508 – 512 Vol. Masera. pp. “Rapid Prototyping of an Improved Cholesky Decomposition Based MIMO Detector on FPGAs. P. Myllyla. Sept. on Digital System Design: Architectures.” in 2003 IEEE Vehicular Technology Conference (VTC 2003). ACM/SIGDA Symp. 2011.” IEEE Trans. Juntti. no. 1633 –1636. M. Conf. [25] D. Oct. pp. J. Juntti. [24] J. Im. on Communications. and M. Conf. [21] C. on Circuits and Systems (ISCAS). Benkrid. Wu and G. “Architecture Design and Implementation of the Metric First List Sphere Detector Algorithm. Chu. Sorokin.” in 7th Intl.” Journal of Computer Science & Technology.” in Intl. 15. 943 –947. Takala. 2010. Myllyla. 2nd ed. Chakrabarti. 133–144. and M. [20] L. Woods. “A Low Complexity Real-Time MIMO-Preprocessing For Fixed-Complexity Sphere Decoder. pp. 2010. Conf.0. and Simulation (SAMOS).org. “Vector Processing as a Soft-core CPU Accelerator. 21 [10] J. Chu. [15] J. 3082–3087. Sept. 222–232. Salmela. R. Jun. 9. “Rapid Prototyping of a Fixed-Throughput Sphere Decoder for MIMO Systems. 2006. Record of the Forty-First Asilomar Conf. pp. Keller.” in Design. For any other purposes. [16] J. “Low-Power Low-Complexity MIMO-OFDM Baseband Processor for Wireless LANs. Cho. Rose. Silven. and M. pp. Jung. no. pp. Davis. McAllister. and W. pp.” in IEEE Intl. 2010. Signal Processing. 2008. 5.” 2011. Zheng.” in IEEE Workshop on Signal Processing Systems (SIPS). Y.” in IEEE Int. “A Pipeline Interleaved Heterogeneous SIMD Soft Processor Array Architecture for MIMO-OFDM Detection. Methods and Tools. ACM. [11] X. 895 –899. 2006. V. on Compilers. and J. [31] N. O. and J. H. pp. pp. pp. Thompson. . Dec. 1981.. Antikainen. Woods. 2011. 1. Myllyla and. OUP USA. Software Radio: A Modern Approach To Radio Engineering. Yiannacouras. [26] X. 2007. and J.” in Conf. Symp. 2000.. Hanzo. J. Barbero and J. pp. May 2011. Pohst.” in Intl.” in Intl. Schnorr and M. vol. Eagleston. “On The Computation of Lattice Vectors of Minimal Length. Prentice Hall. Mar. VLSI Systems. Woods. “Lattice Basis Reduction: Improved Practical Algorithms and Solving Subset Sum Problems. Architecture and Synthesis for Embedded Systems (CASES). Thompson. Conf. Reed. X. [27] B. 601 –604. [17] Q. 2007. Thorolfsson.” in NASA/ESA Conf. “FPGA Based Soft-core SIMD Processing: A MIMO-OFDM Fixed-Complexity Sphere Decoder Case Study. [28] N. Wireless Communications. J. Jul 2008. no. pp.” IEEE Trans. Cavallaro. Parhami. [29] Xilinx Inc. Conf. 369–375. “Application-Specific Instruction Set Processor Implementation of List Sphere Detector. WLANs and Broadcasting. 37–44. on Adaptive Hardware and Systems (AHS). 2012 DRAFT (c) 2011 Crown Copyright.” in 13th Euromicro Conf. “Software Defined Radio Implementation of K-best List Sphere Detector Algorithm. vol. J. “Parallel High Throughput Soft-output Sphere Decoder. Oct. Modeling. [30] X.

“Low-Complexity High Throughput VLSI Architecture of Soft-output ML MIMO Detector. Sheldon. 1396 –1401. .” in IEEE/ACM Intl. and D.” in Proc. Content may change prior to final publication. Conf. F. 2006. 261–268. Tomasoni. [35] T. Tullsen. on Computer-Aided Design. March 2010. Personal use is permitted. Cupaiuolo. R. Design. and A.This article has been accepted for publication in a future issue of this journal. For any other purposes. June 22. Siti. 2012 DRAFT (c) 2011 Crown Copyright. permission must be obtained from the IEEE by emailing pubs-permissions@ieee. Lysecky. Automation Test in Europe (DATE).org. M. “Application-Specific Customization of Parameterized FPGA Soft-Core Processors. R. pp. 22 [34] D. Vahid. Kumar. but has not been fully edited. pp.