
Analog Integr Circ Sig Process (2011) 69:119–129

DOI 10.1007/s10470-011-9765-8

Implementation of sphere decoder for MIMO-OFDM on FPGAs using high-level synthesis tools

Juanjo Noguera • Stephen Neuendorffer • Sven Van Haastregt • Jesus Barba • Kees Vissers • Chris Dick

Received: 14 February 2011 / Revised: 6 July 2011 / Accepted: 16 August 2011 / Published online: 23 September 2011
© Springer Science+Business Media, LLC 2011

J. Noguera (✉)
Xilinx Ireland, Dublin, Ireland
e-mail: juanjo.noguera@xilinx.com

S. Neuendorffer · K. Vissers · C. Dick
Xilinx, Inc, San Jose, CA, USA

S. Van Haastregt
Leiden University, Leiden, The Netherlands

J. Barba
University of Castilla-La Mancha, Ciudad Real, Spain

Abstract In this study we explain the implementation of a sphere detector for spatial multiplexing in broadband wireless systems using high-level synthesis (HLS) tools. These modern FPGA design tools accept C/C++ descriptions as input specifications, and automatically generate a register transfer level (RTL) description for FPGA implementation using traditional FPGA implementation tools. We have used AutoESL’s AutoPilot HLS tool to implement this demanding algorithm on a Virtex-5 running at a clock frequency of 225 MHz. The obtained results show that these modern HLS tools produce quality of results competitive with that obtained using a traditional RTL design approach, while significantly abstracting the designer from the low-level FPGA implementation details.

Keywords Sphere decoding · High level synthesis · FPGAs

1 Introduction

Spatial division multiplexing MIMO (Multiple Input and Multiple Output) processing significantly increases the spectral efficiency, and hence capacity, of a wireless communication system: it is a core component of next generation wireless systems, for example, WiMAX (Worldwide Interoperability for Microwave Access) and other OFDM-based wireless communication standards. Spatial multiplexing MIMO processing is a computationally intensive application that implements highly demanding signal processing algorithms. A specific example of spatial multiplexing in MIMO systems is sphere decoding, which is a complexity-efficient method to solve the MIMO detection problem, while maintaining a bit-error rate (BER) performance comparable to the optimal maximum-likelihood (ML) detection algorithm. However, even this reduced complexity algorithm is generally not feasible to implement on a digital signal processor (DSP) in real-time.

Field programmable gate arrays (FPGAs) are an attractive target platform for the implementation of complex DSP-intensive algorithms, like the sphere decoder (SD). Modern FPGAs are high-performance parallel computing platforms that provide the high performance of dedicated hardware solutions, while keeping the flexibility of programmable DSP processors. There are several studies showing that FPGAs could achieve 100× higher performance and 30× better cost/performance than traditional DSP processors in a number of signal-processing applications [1].

Despite this tremendous performance advantage, FPGAs are not generally used in wireless signal processing since they are perceived as devices that are difficult to use for traditional DSP programmers. The key barrier for the widespread adoption of FPGAs in wireless applications is the traditional hardware-centric design flow and tools. Currently, the use of FPGAs requires significant hardware design experience, for example, being familiar with


using hardware description languages (e.g., VHDL, Verilog).

Recently, new high-level synthesis (HLS) tools [2] have become available as design tools for FPGAs. These design tools take as input a high-level algorithm description and generate RTL that can be used with standard FPGA implementation tools (e.g., Xilinx ISE/EDK). These tools offer an increase in design productivity and a reduction in development time, while producing good quality of results [3]. This study describes the FPGA implementation of a complex wireless algorithm on a modern FPGA (i.e., a sphere detector for spatial multiplexing MIMO in 802.16e systems) using HLS tools. We have used AutoESL’s AutoPilot HLS tool to target a Xilinx Virtex-5 running at 225 MHz. We present a comprehensive study reporting on our experiences in using HLS tools for this particular wireless algorithm and compare the results to an implementation generated using a traditional FPGA hardware-centric design approach.

2 Related work

WiMAX, based on the IEEE 802.16e-2005 standard, refers to a new generation of (mobile) wireless broadband access networks. WiMAX supports multiple input and multiple output antenna configurations, meaning that both the transmitter and the receiver use multiple antennas to increase bandwidth efficiency and link reliability. Decoding the data from the different antennas using a maximum likelihood (ML) detector yields the optimal bit error rate (BER) performance. However, the complexity of an ML detector implementation grows exponentially with the number of antennas and the choice of modulation scheme, making it cost-prohibitive for high-data rate systems with large numbers of antennas. Alternatively, channel decoding can be realized using a SD, whose implementation is less expensive while still achieving a BER performance comparable to that of an ML detector [4]. Hence, in this paper we focus on a SD.

Most of the existing literature on SDs focuses on the architecture of the SD. For example, in [5] the authors described different designs optimized for implementation on an FPGA. In [6], a VLSI architecture for the K-best MIMO detector algorithm was presented and implemented on both FPGA and ASIC technology. A multi-core architecture for a 16 × 16 antenna configuration was presented in [7] whereas we use a 4 × 4 configuration in this study. An implementation of a Layered ORthogonal Lattice Detector (LORD) for a 2 × 2 configuration was presented in [8]. Although the LORD algorithm follows a different approach to the one implemented in sphere detectors, comparisons are valuable. In principle, the LORD algorithm is more suitable for hardware implementations due to the opportunities for parallelization that it offers. However, the implementation in [8] provides ML error performance for the 2 × 2 implementation while demanding a considerable amount of resources when compared to a 4 × 4 SD. The work of [9] showed that the optimal channel ordering not only depends on the channel matrix, but also on the received points. In this study, however, we do not take this into account.

Instead of focusing primarily on the application architecture, we focus in this paper on the design method used to implement SD application blocks. Both the work of [10] and the particular implementation [11] on which our work is based were specified in MATLAB and subsequently implemented using Xilinx’ System Generator for DSP. Instead, we have taken the MATLAB reference implementation, converted it in a literal fashion to C++ code and worked to optimize the implementation through the HLS tool flow. Similar approaches were followed in [12–14] for algorithms that had lower complexity or lower throughput requirements. In [12], the PICO high level synthesis tool was employed to implement image processing kernels. In [13], the Catapult-C high level synthesis tool was employed to implement a 64-QAM decoder in both FPGA and ASIC technology. In [14], the Impulse C tool suite was employed to implement a computed tomography back-projection algorithm. The independent BDTI HLS tool certification program [3] has been set up to evaluate HLS tools for FPGA design. As of mid-2010, both the PICO and AutoPilot HLS tools have been certified by this program.

3 Sphere decoder

Sphere detection is a prominent method of simplifying the detection complexity in spatial multiplexing systems while maintaining BER performance comparable with optimum ML detection. The block diagram of the MIMO 802.16e wireless receiver is shown in Fig. 1. It is assumed that the channel matrix is perfectly known to the receiver, which can be accomplished by classical means of channel estimation. After channel reordering and QR decomposition, the sphere detector (SD) is applied. In preparation for engaging a soft-input-soft-output channel decoder (e.g., a Turbo decoder), soft outputs are produced by computing the log-likelihood ratio (LLR) of the detected bits. A detailed explanation of this algorithm can be found in [9]. In the following we briefly introduce the three key building blocks in this algorithm.

3.1 Channel matrix reordering

The order in which the antennas are processed by the sphere detector has a profound impact on the BER


performance. Therefore, prior to sphere detection, channel reordering is applied. By utilizing a channel matrix pre-processor that realizes a type of successive interference cancellation similar in concept to that employed in BLAST (Bell Labs Layered Space Time) processing, the detector achieves close to ML performance.

Fig. 1 Block diagram for sphere decoder

The method implemented by the channel reordering block determines the optimum detection order of the columns of the complex channel matrix over several iterations. Depending on the iteration count, the row with the maximum or minimum norm is selected. The row with the minimum Euclidean norm represents the influence of the strongest antenna while the row with the maximum Euclidean norm represents the influence of the weakest antenna. The novel approach first processes the weakest stream. All subsequent iterations process the streams from highest to lowest power.

To meet the high data rate requirements, the channel ordering block is realized using the pipelined architecture shown in Fig. 2, which processes 5 channels in a time division multiplexing (TDM) approach. This approach provides more processing time between the matrix elements of the same channel while sustaining high data throughput. The calculation of the matrix inverse, which is realized using QR decomposition (QRD), is the most demanding process in Fig. 2. A common method for realizing QRD is based on Givens rotations. The proposed implementation performs the complex rotations in the Diagonal and OffDiagonal cells, which are the fundamental computation units in the systolic array we are using. After QRD and prior to the reordering operations, back substitution of the decomposed matrix is done.

Fig. 2 Iterative channel matrix reordering algorithm

3.2 Modified real-valued QR decomposition

After obtaining the optimal ordering of the channel matrix columns, the QR decomposition on the real-valued matrix coefficients is applied. The functional unit used for this QRD processing is similar to the QRD engine designed to compute the inverse matrix, but with some modifications. The input data in this case are real valued and the systolic array structure has a correspondingly higher degree. In order to meet the desired timing constraints, the input data consumption rate had to be 1 input sample per clock cycle. This introduced processing-latency challenges that could not be addressed with a 5-channel TDM structure. The number of channels in a TDM group was increased to 15 to provide more time between the successive elements of the same channel matrix.
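As a rough illustration of the Givens-rotation step behind the QRD described above, the following C++ sketch computes one real-valued rotation that zeroes an element (the role the text assigns to a diagonal cell) and applies the same rotation to another pair of elements (the role of an off-diagonal cell). It is only a sketch under stated assumptions: it uses double-precision floating point rather than the fixed-point arithmetic of the actual design, it ignores the systolic-array scheduling and TDM, and the names `Rotation`, `make_rotation` and `apply_rotation` are hypothetical and not taken from the implementation discussed in this paper.

```cpp
#include <cmath>

// Hypothetical type: cosine/sine pair describing one Givens rotation.
struct Rotation {
    double c;
    double s;
};

// Diagonal-cell role: choose (c, s) so that rotating (a, b) yields (r, 0),
// where r is the Euclidean norm of the pair.
Rotation make_rotation(double a, double b) {
    const double r = std::hypot(a, b);
    if (r == 0.0) {
        return {1.0, 0.0};  // nothing to zero; fall back to the identity
    }
    return {a / r, b / r};
}

// Off-diagonal-cell role: apply an already computed rotation to a pair
// (x, y) taken from the same two rows of the matrix.
void apply_rotation(const Rotation& g, double& x, double& y) {
    const double new_x = g.c * x + g.s * y;
    const double new_y = -g.s * x + g.c * y;
    x = new_x;
    y = new_y;
}
```

Calling `make_rotation(a, b)` and then `apply_rotation` on the pair `(a, b)` itself leaves the norm in `a` and (up to rounding) zero in `b`; applying the same rotation to the remaining pairs of the two rows completes one elimination step of the decomposition.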


3.3 Sphere detector (SD)

The iterative sphere detection algorithm can be viewed as a tree traversal with each level of the tree i corresponding to processing symbols from the ith antenna. The tree traversal can be performed using several different methods. In our implementation we selected a breadth-first search due to the attractive hardware-friendly nature of the approach. At each level only the K nodes with the smallest partial Euclidean distance (T_i) are chosen for expansion. This type of detector is called a K-best detector.

The norm computation is done in the partial Euclidean distance (PED) blocks of the sphere detector. Depending on the level of the tree, three different PED blocks are used. The root node PED block calculates all possible PEDs (tree level index is i = M = 8). The second level PED block computes 8 possible PEDs for each of the 8 survivor paths generated in the previous level. This gives 64 generated PEDs for the tree level index i = 7. The third type of PED block is used for all other tree levels, which compute the closest-node PED for all PEDs computed on the previous level. This fixes the number of branches on each level to K = 64, thus propagating to the last level i = 1 and producing 64 final PEDs along with their detected symbol sequences.

The pipeline architecture of the SD allows data processing on every clock cycle, thus the number of PED blocks necessary at every tree level is only one. The total number of PED units is equal to the number of tree levels, which for 4 × 4 64-QAM is 8. The block diagram of the SD is illustrated in Fig. 3.

3.4 FPGA performance implementation targets

The target FPGA device is a Xilinx Virtex-5 FPGA, with a target clock frequency of 225 MHz. The channel matrix is estimated for every data subcarrier, which limits the available processing time for every channel matrix. For the selected clock frequency and a communication bandwidth of 5 MHz (corresponding to 360 data sub-carriers in a WiMAX system), the available number of processing clock cycles per channel matrix interval is calculated as follows:

$$\text{num\_cycles} = \frac{102.9\,\mu\text{s} / 360}{1 / 225\,\text{MHz}} \approx 64. \tag{1}$$

As mentioned earlier, the most computationally demanding configuration with 4 × 4 antennas and 64-QAM modulation scheme has been designed. The achievable raw data rate in this case is 83.965 Mbps.

4 High-level synthesis for FPGA

High-level synthesis tools take as their input a high-level description of the specific algorithm to implement and generate the RTL description of the FPGA implementation. Modern HLS tools accept untimed C/C++ descriptions as input specifications. These tools give two interpretations to the same C/C++ code: (1) sequential semantics for input/output behavior; and (2) architecture specification based on C/C++ code and compiler directives. Based on the C/C++ code, compiler directives and target throughput requirements, these HLS tools generate high-performance pipelined architectures. Among other features, HLS tools enable automatic insertion of pipeline stages and resource sharing to reduce FPGA resource utilization. In summary, HLS tools raise the level of abstraction for FPGA design, and make transparent the time-consuming and error-prone RTL design tasks. We have focused on using C++ descriptions, with the goal of leveraging C++ template classes to represent arbitrary precision integer types and template functions to represent parameterized blocks in the architecture.

The overall design approach is shown in Fig. 4, where the starting point is a reference C/C++ code that could have been derived from a MATLAB functional description. As illustrated in this figure, the first step in implementing an application on any hardware target is often to restructure the reference C/C++ code. By "restructuring," we mean rewriting the initial C code (which is typically coded for clarity and ease of conceptual understanding rather than for optimized performance) into a format more suitable for the target processing engine. For example, on a DSP it may be required to rearrange an application’s code so that the

algorithm makes an efficient use of the cache memories. When targeting FPGAs, this code restructuring might involve, for example, rewriting the code so it represents an architecture specification that can achieve the desired throughput, or rewriting the code to make efficient use of specific FPGA features like embedded DSP macros.

Fig. 3 Sphere detector processing pipeline

Fig. 4 High-level synthesis for FPGAs

Fig. 5 Iterative C/C++ refinement design approach

The functional verification of this implementation C/C++ code is achieved using traditional C/C++ compilers (e.g., gcc) and reusing C/C++ level testbenches developed for the verification of the reference C/C++ code. The C/C++ code implementation is the main input to the HLS tools. However, there are additional inputs to the HLS tools that significantly influence the generated hardware, its performance and the number of FPGA resources used. Two essential constraints are the target FPGA family (i.e., technology) and the target clock frequency, which have an effect on the number of pipeline stages in the generated architecture. Additionally, HLS tools accept compiler directives (e.g., pragmas inserted in the C/C++ code) which can be applied to different sections of the C/C++ code. There are different types of directives; for example, there are directives that are applied to loops (e.g., loop unrolling), while other directives can be applied to arrays (e.g., to specify which FPGA resource must be used for the implementation of the array).

Based on all these inputs, the HLS tools generate an output architecture (in RTL) and report the throughput of the individual generated modules. Depending on this throughput, the designer can then modify the directives and/or the implementation C/C++ code. If the generated architecture meets the required throughput, then the output RTL is used as the input to the FPGA implementation tools (ISE/EDK). The final achievable clock frequency and number of FPGA resources used are reported only after running logic synthesis and place and route. If the design does not meet timing or the FPGA resources are not the expected ones, the designer should modify the implementation C/C++ code or the compiler directives.

The last step in this workflow, once the synthesis report confirms that the design achieves the demanded performance, is RTL verification. Unlike non-HLS design flows, this stage occurs only once at the very end of the process instead of taking place in each iteration. Thus, the designer is saved a non-negligible number of RTL simulation hours, which raises productivity. Moreover, in our case, AutoESL’s AutoPilot HLS tool supports the automatic generation of the entire co-simulation infrastructure that enables reusing the C/C++ testbenches to feed the generated RTL through a set of SystemC adapters.

5 Sphere decoder high-level synthesis implementation

We have implemented the three key building blocks of the WiMAX SD shown in Fig. 1 using AutoPilot 2010.07.ft from AutoESL. It is important to emphasize that the algorithm is exactly the algorithm described in [11], and hence has exactly the same BER. In this section we give specific examples of code rewriting and compiler directives that we have used for this particular implementation.

5.1 Design approach: iterative C/C++ refinement

The original reference C code, derived from a MATLAB functional description, included approximately 2000 lines of code, including synthesizable and verification C code. It contains


only fixed-point arithmetic using C built-in data types. All the required floating point operations (e.g., sqrt) have been approximated by an FPGA-friendly implementation.

In addition to the reference C code describing the functions to synthesize in the FPGA, there is a complete C-level verification testbench. The input test vectors, as well as the golden output reference files, were generated from the MATLAB description. The original C reference code is bit accurate with the MATLAB specification, and passes the entire regression suite consisting of multiple data sets.

This reference C code has gone through different types of code restructuring. As examples, Fig. 5 shows three instances of code restructuring that we have used, which are explained in the following subsections. A key concept to keep in mind is that the C-level verification infrastructure has been re-used to verify any change to the implementation C/C++ code. All verification has been carried out at the C level, not at the RTL level, avoiding time-consuming RTL simulations and hence contributing to the reduction in the overall development time.

5.2 Macro-architecture specification

Probably the most important code refactoring example is to rewrite the C/C++ code to describe the macro-architecture that would efficiently implement a specific functionality. In other words, the designer is accountable for the macro-architecture specification, while the HLS tools are in charge of the micro-architecture generation. This type of code restructuring has a major impact on the obtained throughput and quality of results.

In the case of the SD, there are several instances of this type of code restructuring. For example, to meet the high throughput of the channel ordering block, the designer should describe in C/C++ the macro-architecture shown in Fig. 2. Such C/C++ code would consist of several function calls communicating using arrays. The high-level synthesis tools might automatically translate these arrays into ping-pong buffers to allow parallel execution of the several matrix calculation blocks in the pipeline. Another example of code restructuring at this level would be to decide how many channels should be employed in the TDM structure of a specific block (e.g., 5 channels in the Channel Matrix Reordering block, or 15 channels in the modified real-valued QR decomposition block).

Figure 6 is a specific example of macro-architecture specification. This snippet of C++ code describes the sphere detector block diagram shown in Fig. 3. We can observe a pipeline of nine function calls, each one representing a block as shown in Fig. 3. The communication between functions is achieved through arrays, which are mapped to streaming interfaces (not embedded BRAM memories in the FPGA) by using the appropriate directives (pragmas) in lines 5 and 7 of Fig. 6.

Fig. 6 Sphere detector macro-architecture description

Fig. 7 Example of code parameterization

5.3 Parameterization

Parameterization is another key example of code rewriting. We have extensively leveraged C++ template functions to represent parameterized modules in the architecture.

In the implementation of the SD, there are several examples of this type of code rewriting. A specific example would be the different matrix operations used in the channel reordering block. The Matrix Calculations blocks (4 × 4, 3 × 3 and 2 × 2) shown in Fig. 2 use different types of matrix operations like Matrix Inverse or Matrix Multiply. These blocks are coded as C++ template functions with the dimensions of the matrix as template parameters.
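The idea of coding a matrix block once and instantiating it for several dimensions can be sketched as follows. This is a minimal, hypothetical illustration and not the code of the paper's figures: the function name `matrix_multiply`, the parameter names `ROWS`, `INNER` and `COLS`, and the use of plain `int` elements are assumptions made for the example. The `MM_II` template parameter mirrors the initiation-interval parameter that the paper describes for its Matrix Multiply template; here it is only carried along and documented, because the directive syntax that would consume it is tool-specific and is deliberately not reproduced.

```cpp
#include <cstddef>

// Hypothetical sketch of a dimension-parameterized matrix multiply.
// ROWS, INNER and COLS are template parameters, so each distinct size
// (e.g. 4x4, 3x3, 2x2) becomes its own instance at compile time.
// MM_II stands for a target initiation interval; a standard C++ compiler
// ignores it, whereas an HLS flow would pass it to a pipelining directive.
// T defaults to int here; a real design would use a fixed-point type.
template <std::size_t ROWS, std::size_t INNER, std::size_t COLS,
          int MM_II = 1, typename T = int>
void matrix_multiply(const T (&a)[ROWS][INNER],
                     const T (&b)[INNER][COLS],
                     T (&out)[ROWS][COLS]) {
    for (std::size_t i = 0; i < ROWS; ++i) {
        for (std::size_t j = 0; j < COLS; ++j) {
            // A tool-specific pipeline directive using MM_II would
            // typically be attached to this loop level.
            T acc = 0;
            for (std::size_t k = 0; k < INNER; ++k) {
                acc += a[i][k] * b[k][j];
            }
            out[i][j] = acc;
        }
    }
}
```

With this shape, a call such as `matrix_multiply(a, b, out)` deduces the dimensions from the array arguments, so the same source serves every matrix size, and only the template arguments differ between instances.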


Figure 7 shows the C++ template function for Matrix Multiply. In addition to the matrix dimension, this template function has a third parameter MM_II (Initiation Interval for Matrix Multiply), which is used to specify the number of clock cycles between two consecutive loop iterations (see the directive (pragma) in line 9 of Fig. 7, which is used to parameterize the required throughput for a specific instance). This is an important feature, since it has a major impact on the generated micro-architecture, namely on the ability of the HLS tools to exploit resource sharing and hence reduce the FPGA resources used in the implementation. For example, just by modifying this Initiation Interval (II) parameter and using exactly the same C++ code, the HLS tools automatically achieve different levels of resource sharing in the implementation of the different Matrix Inverse (4 × 4, 3 × 3, 2 × 2) blocks.

5.4 Time division multiplexing

For designs without feedback, the HLS tool is generally able to instantiate registers freely to increase clock frequency and throughput. However, in pipelines that are part of a feedback loop, registers cannot be inserted freely without introducing pipeline stalls. Hence, recurrences, or feedback loops, in a design are the key limiter of throughput. For example, Fig. 8 shows the high level structure of the 8 × 8 M-RVD QRD loop nest. Although there are several recurrences in the application, the critical recurrence in this code occurs when the result X[j][0][t] of the diagonal function call is used as an argument to the next diagonal call. Synthesis of the diagonal function results in a 14-stage pipeline. Hence, to accommodate the recurrence without introducing pipeline stalls, we apply TDM over 15 datasets. The parts typeset in boldface in Fig. 8 explicitly reflect TDM or c-slowing over separate datasets through the inner t-loop.

Fig. 8 Applying TDM to the 8 × 8 M-RVD QRD C code

We observe several characteristics of this design. First, the code accurately reflects the order in which data is processed in the reference design. Second, the number of channels to iterate over, that is, the TDM depth, cannot be determined without knowing the size of the critical recurrences. Although AutoPilot does not compute the number of channels automatically, the HLS process does analyze the source code for recurrences and report to the designer where recurrences are not met. In the SD application, since 360 independent data subcarriers have to be processed for each frame, TDM is a straightforward way to handle the presence of the recurrence while incurring additional buffering and a small amount of additional processing latency.

5.5 FPGA optimizations

FPGA optimization is the last example of code rewriting. The designer can rewrite the C/C++ code to more efficiently utilize specific FPGA resources, and hence improve timing and reduce area. There are two very specific examples of this type of optimization: (1) bit-width optimizations; and (2) efficient use of embedded DSP blocks (i.e., DSP48s). The reference C/C++ code was written using built-in C/C++ data types (e.g., short, int), while the design uses 18-bit fixed point data types to represent the matrix elements. We have leveraged C++ template classes to represent arbitrary precision fixed-point data types, hence reducing FPGA resources and minimizing impact on timing.

Figure 9 is a C++ template function that implements a multiplication followed by a subtraction, where the width of the input operands is parameterized. These two arithmetic operations can be mapped into a single embedded DSP block (i.e., DSP48 block). By efficiently using DSP48s, timing is improved and FPGA resource utilization is minimized. In Fig. 9, we can also observe two directives that instruct the HLS tool to use a maximum of two cycles to schedule these operations and use a register for the output return value.

Fig. 9 FPGA optimization for DSP48 utilization

6 Productivity metrics

6.1 Development time

In Fig. 10 we plot how the size of the design (i.e., FPGA resources) generated using AutoESL’s AutoPilot evolves over time and compare it to a traditional System Generator (i.e., RTL) implementation. It is important to observe in this figure that by using HLS tools we are able to implement many valid solutions, which differ in size over time. On the other hand, there is only one RTL solution, which


requires a long development time. Many solutions can be obtained using HLS tools. Depending on the amount of code restructuring, the designer can trade off how fast a solution is obtained against the size of that solution.

We have observed that it is relatively quick to obtain several fast solutions, which use significantly more FPGA resources (i.e., area) than the traditional RTL solution. On the other hand, the designer might decide to generate many more expert solutions by implementing more advanced C/C++ code restructuring techniques (e.g., FPGA-specific optimizations) to reduce FPGA resource utilization.

Finally, since all verification has been carried out at the C/C++ level, not at the RTL level, we avoided the time-consuming RTL simulations. We found that doing the design verification at the C/C++ level significantly contributed to the reduction in the overall development time.

Fig. 10 Development time

6.2 Quality of results

In Table 1, we compare final FPGA resource utilization and overall development time for the complete SD implemented using HLS tools and the reference System Generator implementation, which is basically a structural RTL design, explicitly instantiating FPGA primitives such as DSP48 blocks. Note that the development time for AutoESL includes learning the tool, producing results, design space exploration and detailed verification.

Table 1 Quality of results (development time and FPGA resources)

Metric                     SysGen expert   AutoESL expert   % Diff
Development time (weeks)   16.5            15               −9
LUTs                       27,870          29,060           +4
Registers                  42,035          31,000           −26
DSP48s                     237             201              −15
18 K BRAM                  138             99               −28

To have accurate comparisons, we have re-implemented the reference RTL design using the latest Xilinx ISE 12.1 tools targeting a Virtex-5 FPGA. The RTL generated by AutoESL’s AutoPilot has also been implemented using ISE 12.1 targeting the same FPGA. From Table 1, we can observe that AutoESL’s AutoPilot achieves significant savings in FPGA resources, which is mainly the result of the application of resource sharing techniques in the implementation of the matrix inverse blocks. We can also observe a significant reduction in the number of registers and a slightly higher utilization of Look-up Tables (LUTs). This result is partially due to the fact that delay lines are mapped onto SRL16s (i.e., LUTs) in the AutoESL implementation, while the delay lines are implemented using registers in the System Generator implementation. There are other modules where we trade off BRAMs for LUTRAM, resulting in lower BRAM usage in the channel preprocessor.

For a better understanding of how the area savings are obtained, it is instructive to look more closely at smaller blocks of the design. The RVD-QRD block, summarized in Table 2, operates at II = 1, completing an 8 × 8 QR decomposition of 18-bit fixed point values every 64 cycles. The block implements a standard Givens-rotation based systolic array consisting of diagonal and off-diagonal cells,


where the diagonal cells compute an appropriate rotation, zeroing one of the matrix elements, and the off-diagonal cells apply this rotation to the other matrix elements in the same row. To meet the required throughput, one row of the systolic array is instantiated, consisting of one diagonal cell and 8 off-diagonal cells, and the remaining rows are time multiplexed over the single row. In addition, since the systolic array includes a recurrence, 15 channels are time-division multiplexed over the same hardware.

Table 2 8 × 8 RVD-QRD implementation results

Metric                  RTL expert   AutoPilot expert   AutoPilot expert
Dev. time (man-weeks)   4.5          3                  5
LUTs                    5,082        6,344              3,862
Registers               5,699        5,692              4,931
DSP48s                  30           46                 30
18 K BRAMs              19           19                 19

In brief, AutoPilot was able to generate exactly the same architecture but with a more optimized pipeline for the non-critical off-diagonal cell, resulting in slightly lower resource usage after optimization. After only 3 weeks, the AutoPilot design had met timing and throughput goals, but required more logic resources than the RTL design. Additional optimization and synthesis constraints on the DSP48 mapping led AutoPilot to the same DSP48 mapping as the RTL design (3 DSP48s to implement the off-diagonal cell rotations and 6 DSP48s to implement the diagonal cell computation and rotation), including mapping onto the DSP48 post-adder.

Table 3 details multiple implementations of the Matrix-Multiply Inverse components, consisting of the combined Matrix Multiply, QR Decomposition, and Back Substitution blocks. This combination implements $(A^{T}A)^{-1}$ for various dimensions of 18-bit complex fixed-point matrices. In both RTL and AutoPilot design approaches, the 4 × 4 case was implemented first, and the 3 × 3 and 2 × 2 cases were derived from the 4 × 4 case. In RTL, resource sharing was implemented in a similar way for each case, with real and imaginary components time-multiplexed over a single datapath. Deriving and verifying the 3 × 3 and 2 × 2 cases took approximately 1 week each. In AutoPilot, the three cases were implemented as C++ template functions, parameterized by the size of the matrix. All three cases were implemented concurrently, using a script to run multiple tool invocations in parallel. Depending on the matrix dimension, different initiation intervals (II) were targeted, resulting in a variety of resource sharing architectures for each block, as shown in Fig. 11.

Fig. 11 Complex QRD architectures: (a) 4 × 4 case (top-level II = 3, OD II = 3), 28 DSP48; (b) 3 × 3 case (top-level II = 5, OD II = 1), 12 DSP48; (c) 2 × 2 case (top-level II = 9, OD II = 3), 4 DSP48

In the 4 × 4 case, the off-diagonal cell implements fine-grained resource sharing, with one resource-shared complex multiplier. In the 3 × 3 case, the off-diagonal cell contains 3 complex multipliers and the off-diagonal cell itself is resource shared at a coarser granularity. In the 2 × 2 case, all of the off-diagonal cell operations are time multiplexed on a single complex multiplier, combining both coarse-grained and fine-grained resource sharing techniques. In AutoPilot, the only difference between these blocks is the different target II, resulting in significant resource sharing. An RTL designer could certainly have achieved such optimized architectures, given the appropriate insight, but at the cost of a complete RTL-level redesign. We do observe that AutoPilot uses additional BRAM to implement this block relative to the RTL implementation, because AutoPilot requires tool-implemented double-buf-
Table 3 Matrix-multiply inverse results. autopilot (AP) versus RTL fers to only be read or written in a single loop. When con-
implementation sidered as part of the overall design, however, we were able
Metric 494 494 393 393 292 292 to reduce BRAM usage by converting BRAMs to
RTL AP RTL AP RTL AP LUTRAM due to the improved architecture of this block.
Dev. Time (man- 4 4 1 0 1 0
weeks)
7 Conclusions
LUTs 9,016 7,997 6,969 5,028 5,108 3,858
Registers 11,028 7,516 8,092 4,229 5,609 3,441
In this study we have presented the implementation of a
DSP48 s 57 48 44 32 31 24
complex and demanding wireless MIMO receiver using a
18 K BRAMs 16 22 14 18 12 14
HLS tool targeting a Xilinx FPGA.
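To make the template-based QR decomposition blocks from the results above concrete, the following is a minimal sketch, not the authors' source code. It shows a real-valued, double-precision Givens-rotation QR decomposition written as a C++ template function parameterized by matrix size. The names `Matrix`, `Rotation`, `diagonal_cell`, `off_diagonal_cell` and `givens_qr` are hypothetical, and the sketch assumes plain sequential software execution: the 18-bit complex fixed-point arithmetic, the systolic time-multiplexing and the target II of the actual design are not modeled.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical helper type: an N x N matrix of doubles (the real design
// uses 18-bit fixed-point values, which this sketch does not model).
template <std::size_t N>
using Matrix = std::array<std::array<double, N>, N>;

// A Givens rotation described by its cosine and sine.
struct Rotation {
    double c;
    double s;
};

// "Diagonal cell": compute the rotation that zeroes b against a.
inline Rotation diagonal_cell(double a, double b) {
    const double r = std::hypot(a, b);
    if (r == 0.0) {
        return {1.0, 0.0};  // both inputs are zero, so no rotation is needed
    }
    return {a / r, b / r};
}

// "Off-diagonal cell": apply a rotation to one pair of elements that sit
// in the same column of the two rows being combined.
inline void off_diagonal_cell(const Rotation& g, double& x, double& y) {
    const double tx = g.c * x + g.s * y;
    const double ty = -g.s * x + g.c * y;
    x = tx;
    y = ty;
}

// In-place QR decomposition by Givens rotations. On return, a holds the
// upper-triangular factor R and qt holds Q^T, so that qt * A_original = R.
// One definition serves every matrix size N.
template <std::size_t N>
void givens_qr(Matrix<N>& a, Matrix<N>& qt) {
    // Start from the identity; every rotation applied to a is mirrored here.
    for (std::size_t i = 0; i < N; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            qt[i][j] = (i == j) ? 1.0 : 0.0;
        }
    }
    for (std::size_t col = 0; col < N; ++col) {
        for (std::size_t row = col + 1; row < N; ++row) {
            // Rotation that zeroes a[row][col] against the pivot a[col][col].
            const Rotation g = diagonal_cell(a[col][col], a[row][col]);
            for (std::size_t k = 0; k < N; ++k) {
                off_diagonal_cell(g, a[col][k], a[row][k]);
                off_diagonal_cell(g, qt[col][k], qt[row][k]);
            }
        }
    }
}
```

Instantiating `givens_qr` for 4 × 4, 3 × 3 and 2 × 2 matrices from this single definition mirrors how one description can cover all three matrix sizes. In an HLS flow the target II would be supplied separately as a tool constraint rather than written into the algorithm.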


This evaluation has demonstrated that AutoESL's AutoPilot achieves significant abstraction from low-level FPGA implementation details (e.g., timing and pipeline design), while producing quality of results highly competitive with those obtained using a traditional RTL design approach. C/C++ level verification contributes to the reduction in the overall development time by avoiding time-consuming RTL simulations. However, obtaining excellent results for complex and challenging designs requires good macro-architecture definition and FPGA tools knowledge (e.g., understanding and interpreting FPGA tool reports).

Juanjo Noguera received his BSc degree in Computer Science from the Autonomous University of Barcelona, Spain, in 1997, and his PhD degree in Computer Science from the Technical University of Catalonia, Barcelona, Spain, in 2005. He has worked for the Spanish National Centre for Microelectronics, the Technical University of Catalonia, and Hewlett-Packard Inkjet Commercial Division. Since 2006, he has been with the Xilinx Research Labs, Ireland. His interests include high-level system design, reconfigurable architectures and next-generation wireless communications.

Stephen Neuendorffer graduated in 1998 from the University of Maryland with BS degrees in Electrical Engineering and Computer Science. He graduated with University Honors, departmental Honors in Electrical Engineering, and was named Outstanding Graduate in the Computer Science department. He completed his PhD at UC Berkeley in 2003 after being one of the key architects of Ptolemy II. He has been working in the Research Labs at Xilinx ever since on various aspects of system design in FPGAs.

Sven van Haastregt received the BSc (with honors) and MSc (with honors) degrees in computer science from Leiden University, Leiden, The Netherlands, in 2006 and 2008, respectively. From January 2010 to April 2010, he was a guest researcher at Xilinx Research Labs, San Jose, CA. He is currently working towards his PhD degree at the Leiden Embedded Research Center of Leiden University. His research interests include automated parallelization using polyhedral analysis, high level synthesis, and prototyping on Field Programmable Gate Arrays (FPGAs).


Jesús Barba received the MS and PhD degrees in Computer Engineering from the University of Castilla-La Mancha (UCLM), Spain, in 2001 and 2008 respectively. He has been working as a Teaching Assistant with the Department of Information and Systems Technology since 2001. His research interests include SoCs, HW/SW integration and reconfigurable systems.

Kees Vissers graduated from Delft University in the Netherlands. He worked at Philips Research on processors, image processing, HDTV processing and hardware-software co-design. He was a visiting industrial fellow at CMU, working on High Level Synthesis, and a visiting industrial fellow at UC Berkeley working on streaming models of computation. He was a director of Architecture at Trimedia, CTO at Chameleon Systems, and consulted for Intel, Nvidia and Xilinx. He is today a distinguished engineer in the CTO office of Xilinx. He is working with the teams that work on designing and programming systems that consist of processors and reconfigurable fabric. He has a quantitative approach to tools and architectures. He has a broad interest in Signal Processing Systems.

Chris Dick is the DSP Chief Architect at Xilinx and the engineering manager for the Xilinx Communications Signal Processing Group. Chris has worked with signal processing technology for two decades and his work has spanned the commercial, military and academic sectors. Prior to joining Xilinx in 1997 he was a professor at La Trobe University, Melbourne, Australia for 13 years and managed a DSP Consultancy called Signal Processing Solutions. He has been an invited speaker at many international signal processing symposiums and workshops and has authored more than 100 journal and conference publications, including many papers in the fields of parallel computing, inverse synthetic aperture radar (ISAR), FPGA implementation of wireless communication system PHYs and the use of FPGA custom computing. Chris' work and research interests are in the areas of fast algorithms for signal processing, digital communication, MIMO, OFDM, software defined radios, VLSI architectures for DSP, adaptive signal processing, synchronization, hardware architectures for real-time signal processing, and the use of Field Programmable Gate Arrays (FPGAs) for custom computing machines and real-time signal processing. He holds bachelor's and PhD degrees in the areas of computer science and electronic engineering.
