
Advances in DSP Design Tool Flows for FPGAs

Mark Jervis
Manager, DSP Tools Altera Corporation mjervis@altera.com
Abstract: This paper highlights recent advances in digital signal processing (DSP) design tools for field-programmable gate arrays (FPGAs), concentrating on model-based design and high-level synthesis. Next-generation FPGA model-based design tools provide a mechanism for abstracted design definition at the algorithmic rather than the implementation level. The tools use FPGA structural and timing knowledge, together with mathematical and graph-theory techniques, to optimize and technology-map the algorithm to a pipelined FPGA implementation, controlled by high-level parameters and threshold settings. Such tools allow simple design space exploration and retargeting of algorithms to different device families. This design-once, retarget-as-required method improves productivity over manual hardware description language (HDL) coding, especially for projects that are subject to change. The simple design style, the optimized generated hardware, and the productivity in design change and implementation exploration are highlighted with two examples: a direct radio-frequency (RF) radar design and a simple 8 by 8 beamforming design.

Keywords: DSP, FPGA, model-based design, high-level synthesis, direct RF, radar, beamforming

CP-01070-1.0, October 2010

I. INTRODUCTION

Many of today's DSP designs, such as those for military software defined radio (SDR) systems and radar, are modeled using the industry-standard MATLAB/Simulink tools. FPGA vendor and third-party tools are available that translate the MATLAB/Simulink design descriptions to structural HDL. Traditionally, the generated structural HDL is not optimized for performance or latency, nor pipelined for optimum time-division multiplexing (TDM). This must be done painstakingly by the designer over days, if not weeks, and must be repeated every time the design specification changes or a different device family is targeted.

Next-generation MATLAB/Simulink synthesis tools offer a range of improvements. Design entry is based on untimed schematic designs, and timed HDL code with optimal pipelining is generated automatically from the untimed schematic, sample rate, clock rate, and target device. These improvements remove the need for hand-optimizing the generated HDL code and provide an order-of-magnitude improvement in designer productivity. Algorithmic and implementation optimizations, such as duplicate code removal, canonical signed digit constant multipliers, and 6:3 compressor trees, can be performed at the pre-synthesis level on the untimed algorithm, thereby offering improvements that are impossible to achieve in latency-constrained synthesis tools.

A higher level of abstraction uses vectors and matrices to represent repeated elements, such as multiple channels or repeated structures within a channel, rather than blocks copied or replicated with some variation. With this level of abstraction, users can design at a simplified level and rely on the above optimizations to reduce any duplication. In combination, these features provide a good framework for design space exploration within a Simulink environment, and offer a method to produce easy-to-maintain, portable designs.

This paper demonstrates, with examples from radar and communications, how a direct RF design targeted at a specific device can be retargeted to a different device through the change of a single parameter, and how a different design with different levels of time-sharing and target-specific pipelining is generated as a result. A second example, based on beam-forming, demonstrates how large, repetitive designs can be folded automatically, based on the ratio of sample rate to system clock rate. This design also demonstrates the high abstraction level achieved through the use of vectors, which is usually unavailable in schematic-based tools.

II. SIGNAL PROCESSING IN FPGAS

FPGAs are now frequently used to implement high-performance DSP systems across a spectrum of applications, including high-capacity CDMA and OFDMA communication systems, medical imaging systems, and military communications and radar systems. Such DSP systems, requiring many channels or higher throughput, are well suited to optimal implementation in FPGAs. The availability of massive parallel processing is one of the most compelling advantages of FPGAs for DSP-rich applications. For example, calculating the processing power (measured in giga multiply-accumulate operations per second, or GMACs) of Altera's EP4SE230 FPGA, with 1288 18x18 multipliers, gives an impressive 708 GMACs, which corresponds to every multiplier completing one multiply-accumulate per cycle at roughly 550 MHz. This figure shows that FPGAs now have a great deal of DSP horsepower and offer benefits for applications requiring many parallel multiplication operations.

FPGAs can be used to complement digital signal processors or even replace them entirely. The net benefit is to improve system performance, lower system cost, reduce board space, and lower system thermal power dissipation. The DSP portfolio from FPGA vendors now comprises silicon features, design software, intellectual property (IP) cores, IEEE 754-compliant floating-point functions, reference designs, and development platforms that enable system designers to implement a broad range of high-performance DSP designs. With these capabilities at the user's disposal, the problem becomes one of making the best use of FPGAs and exploring the possibilities.

III. DSP DESIGN TOOL FLOWS

The focus for DSP designers is often higher productivity. A design flow is required that enables design space exploration, promotes design reuse, and automates the implementation of algorithmic non-value-add design tasks (such as pipelining, TDM control logic, DSP datapath precision changes, efficient mapping of generic designs to FPGA resources, and parameterization by common system parameters), all while providing the quality of results of hand-optimized HDL.

Many of today's DSP designs are modeled using the industry-standard MATLAB/Simulink tools. Algorithmic development in a simulation environment allows most of the design exploration and debugging to be done up front. FPGA vendor and third-party tools are available that translate these simulatable MATLAB/Simulink design descriptions to structural HDL. This HDL can then be verified in HDL simulation tools, such as ModelSim, compiled for a target FPGA, and debugged with hardware debug tools in co-simulation, if necessary. Traditionally, the structural HDL generated by such tools is not optimized for performance, pipelined for timing, or shared for TDM. This painstaking work requires days, if not weeks, of hand-optimizing the code.

Altera's DSP Builder Advanced Blockset (DSPB-AB) is one example of an FPGA model-based design tool. It generates HDL code that is automatically optimized to meet the performance, latency, or pipelining specified at the system level. It provides timing-driven Simulink synthesis, while building in powerful mathematical optimization techniques such as linear programming solvers and simulated annealing. With such next-generation high-level synthesis tools, users can use system-level information, thresholds, fMAX, and latency constraints to automatically generate timing-optimized HDL targeting specified devices. This capability can save hours, if not days, of hand-tweaking the HDL code.

Such design flows are particularly relevant for wireless RF designs and for military applications such as radar, electronic warfare, and SDR. At its simplest, the designer develops an algorithm, targets a device, and specifies how fast the design should run and how fast the data arrives. Internally, the tool models the silicon, and pipelines and time-division multiplexes the implementation accordingly. Multichannel designs are made easy by the handling of vectors and by specification through top-level parameters. A textbook design style, as illustrated in figure 1 for a multi-channel infinite impulse response (IIR) filter example, enables the designer to concentrate on the algorithm, rather than on device implementation details and cycle counting. Tools such as DSPB-AB also provide libraries of memories and registers that can be accessed within the DSP datapath and via an external interface, to allow easy configuration of coefficients and run-time parameters as well as read-back of calculated values. Bus logic, to write to or read from the correct memory addresses, is automatically generated.

IV. DSP DESIGN TOOL INTERNALS

Within such high-level tools, the design schematic is mapped to a data flow graph (DFG). The DFG is realized as a netlist of basic functional units. These functional units have a direct register-transfer level (RTL) implementation. One Simulink block may map to zero (in the case of simple wiring), one, or many functional units. The tool applies optimization transforms both within and across groups of functional units. These transforms may alter latency, provided the change is applied consistently.
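To illustrate what altering latency consistently means on a data flow graph, the following minimal sketch pads the early-arriving inputs of each functional unit so that all inputs line up in the same cycle. The function name, the node names, and the latency values are hypothetical, and the method is a simple longest-path (as-soon-as-possible) balance on a feed-forward graph. It is not the algorithm used by any particular tool, and it does not attempt to minimize the number of inserted registers.

```python
from collections import deque

def balance_latencies(latency, edges):
    """Balance a feed-forward data flow graph.

    latency: dict mapping node name -> pipeline latency in cycles (hypothetical values).
    edges:   list of (source, destination) pairs.
    Returns (arrival, padding), where arrival[n] is the cycle in which the
    output of n is valid and padding[(src, dst)] is the number of delay
    registers to add on that edge so that all inputs of dst are aligned.
    """
    preds = {n: [] for n in latency}
    succs = {n: [] for n in latency}
    for src, dst in edges:
        preds[dst].append(src)
        succs[src].append(dst)

    # Kahn's algorithm: visit each node once all of its inputs are known.
    pending = {n: len(preds[n]) for n in latency}
    queue = deque(n for n in latency if pending[n] == 0)
    arrival, padding = {}, {}
    while queue:
        n = queue.popleft()
        ready = max((arrival[p] for p in preds[n]), default=0)
        for p in preds[n]:
            padding[(p, n)] = ready - arrival[p]  # delay the early inputs
        arrival[n] = ready + latency[n]
        for s in succs[n]:
            pending[s] -= 1
            if pending[s] == 0:
                queue.append(s)
    if len(arrival) != len(latency):
        raise ValueError("feedback loops are not handled in this sketch")
    return arrival, padding

# y = a * b + a, with a 3-cycle multiplier and a 1-cycle adder (assumed values).
latency = {"a": 0, "b": 0, "mul": 3, "add": 1}
edges = [("a", "mul"), ("b", "mul"), ("mul", "add"), ("a", "add")]
arrival, padding = balance_latencies(latency, edges)
print(arrival["add"], padding[("a", "add")])
```

In this example the direct path from `a` to the adder needs three delay registers to match the pipelined multiplier. If the multiplier latency changes, the padding changes with it, which is the bookkeeping a tool can automate.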

Figure 1: Textbook-like design of a 20-channel IIR filter in model-based design. All design-specified delays are algorithmic.

The addition of pipelining cannot easily be done later, in synthesis or place-and-route tools, because at that stage the algorithm is indistinct from the implementation. At such later stages, one cannot distinguish which registers are required as part of the algorithm and which have been inserted for timing closure. In a high-level synthesis tool, however, only the algorithm is expressed. The tool therefore knows, through basic graph cutting theory, how the algorithm can be manipulated in a way that does not change its meaning. Parameters can constrain the level of manipulation, such as the level of pipelining applied or the resultant total latency.

Powerful mathematical optimization techniques can be brought to bear on problems that designers coding by hand would be hard-pressed to replicate, and certainly hard-pressed to turn around quickly following a change of specification. For example, maintaining a balance of latencies while pipelining to achieve a clock rate target, using the minimum possible number of inserted registers, can be framed as an integer linear programming (ILP) problem and solved with a standard ILP solver.

Optimizations across or on groups of functional units include, for example, reorganization of adder trees, bringing adder tree stages that follow multipliers into DSP blocks, compressor-based addition where a large number of low bit-width inputs must be summed, Karatsuba multiplication for high bit-width integer multiplication, splitting and pipelining adder chains, duplicate code removal, canonical signed digit representation of constant multipliers, and automated TDM. All of these techniques are known and documented, for example in Altera's Advanced Synthesis Cookbook, and are methods that would be employed by a dedicated HDL coder looking to optimize a design implementation. However, applying them is often time consuming, and the exact implementation might vary from family to family. With next-generation model-based design tools, these techniques are carried out on each simulation, and their application can be controlled indirectly according to target threshold parameters. Crucially, they use device knowledge, so that switching the target device or changing the target clock frequency results in a completely different optimization result with the push of a button.

We will now illustrate this retargeting, and how the abstracted design entry can help productivity, with some examples.

V. DIRECT RF EXAMPLE

Radar direct RF is the ability to connect the FPGA directly to multi-gigabit analog-to-digital and digital-to-analog converters (ADCs and DACs). The application exploits DSP techniques instead of temperature-dependent analog components, replacing analog RF components in the radar receive chain. The RF section is limited to conditioning the output of a DAC using anti-aliasing filters and similar components. Pushing more of the application architecture into the digital domain reduces power, provides a higher level of integration within the system, and reduces overall cost.

The potential barrier to FPGA implementation of such designs is the HDL and target architecture knowledge required to get the best out of the FPGAs, and the time taken to explore the possibilities and trade-offs in implementation. This is where high-level synthesis tools provide a solution. Model-based design with next-generation high-level synthesis techniques makes this easy. Constraint-driven design allows the user to set the desired clock frequency, relying on the tool to model the silicon and tailor the generated HDL to the target device family and speed grade. The optimizations applied in generating the implementation include automated resource sharing. The tool analyzes the clock rate in relation to the sample rate, the number of channels, and the interpolation/decimation factor, and generates efficient HDL with automated pipelining to meet the desired clock rate, enabling timing closure at high clock rates of 400-500 MHz.

The advantage of using a tool is increased productivity. The complexity of the FPGA is hidden from the user, to avoid such non-algorithmic implementation questions as: How much pipelining is required to hit the required system clock rate? How much time sharing can be exploited to save FPGA area and resources? Iterations to explore architectures are started with the push of a button, with each closing timing automatically.

Direct RF is an excellent fit for the capabilities of FPGAs. The design presented here is effectively RF up-conversion. The input is 8 QAM channels at 16 MSPS with 16-bit I/Os, which are up-converted digitally in the design by 256x to 4096 MSPS. The RF carrier frequency is set to 860 MHz, the RF sample frequency to 4096 MHz, and the FPGA operating frequency to 256 MHz. The digital up-converter (DUC) interfaces with a MAX5881A 4.3-GSPS DAC, and the whole system complies with the DOCSIS 3.0 DRFI specifications. This example initially targets a Stratix III 260E FPGA.

The ADCs and DACs have multiple ~1-GSPS LVDS lines to support higher data rates. LVDS SERDES blocks act as multiplexers, taking the internal ~256-MHz FPGA processing up to ~1 GSPS. Parallelism and signal processing techniques are used to manage the high data rates. The datapath can be drawn as shown in figure 2. The interpolate-by-16 filters are implemented as a sequence of four interpolate-by-2 filters. At each stage, the amount of information doubles. At the second set of filters, interpolating from 256 to 4096 MHz, a single wire starts out carrying a new data sample every clock cycle. The number of physical wires must then double at each interpolate-by-2 filter. This is handled naturally by tools such as DSPB-AB with a vector representation.
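The rate arithmetic above can be checked with a short sketch. The function name is hypothetical, and the calculation simply assumes that each interpolate-by-2 stage doubles the sample rate and that the number of parallel wires is the sample rate divided by the clock rate, rounded up.

```python
def lanes_per_stage(input_rate_msps, clock_mhz, total_interpolation):
    """Return (sample rate in MSPS, parallel wires) after each interpolate-by-2 stage."""
    stages = []
    rate = input_rate_msps
    factor = 1
    while factor < total_interpolation:
        rate *= 2
        factor *= 2
        lanes = max(1, -(-rate // clock_mhz))  # ceiling division on integers
        stages.append((rate, lanes))
    return stages

# Overall up-conversion quoted in the text: 16 MSPS x 256 = 4096 MSPS.
print(16 * 256)

# Final interpolate-by-16 section on a 256 MHz clock.
for rate, lanes in lanes_per_stage(256, 256, 16):
    print(f"{rate} MSPS needs {lanes} samples per clock")
```

Starting from one sample per clock at 256 MSPS, the four stages need 2, 4, 8, and 16 samples per clock cycle, which is the doubling of physical wires described above.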

[Figure 2 diagram: 16-bit I and Q inputs from baseband, NCOs at fc = -21 MHz and fc = 21 MHz, sample rate rising from fs = 256 MHz to fs = 4096 MHz, RF output to the DAC.]

Figure 2: Direct RF datapath

Without this vector representation, the design methodology would require complex control, with the user having to manage the reordering of odd and even phases of half band filters shown in figure 3:

[Figure 3 diagram: interpolate-by-2 filter stages H, D, and G on a 256 MHz clock, taking the sample rate from 256 Msps through 512 Msps and 1024 Msps to 2048 Msps, with the D and G filters drawn as multiple separate blocks.]

Figure 3: Super-sample multiphase filter chain without vector support

Furthermore, the whole design flow can be controlled from a single environment in Simulink, including compiling, programming a development board, and test and verification in hardware. Figure 5 below shows a synthesizable testbench for the direct RF design programmed and controlled from within the Simulink design environment.

Using a tool such as DSPB-AB significantly simplifies high-speed design, reducing it to a representation like that shown in figure 4. The tool automatically duplicates the necessary hardware to generate parallel outputs, and handles the polyphase reordering.
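[Figure 4 diagram: a single chain of interpolate-by-2 filters H, D, and G on a 256 MHz clock, with 256 Msps at the input followed by 512 Msps (2), 1024 Msps (4), and 2048 Msps (8).]

Figure 4: Super-sample multiphase filter chain with vectors and data rates greater than the clock rate natively supported.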
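The polyphase idea behind such a filter chain can be sketched in a few lines. The sketch below assumes a generic FIR interpolate-by-2 filter with made-up integer taps (not the half-band coefficients of the design described here) and hypothetical function names. It shows that two sub-filters running at the input rate, one on the even taps and one on the odd taps, produce two output samples per input sample, and that these match the conventional zero-stuff-then-filter result.

```python
def fir(x, h):
    """Direct-form FIR with zero initial state: y[n] = sum over k of h[k] * x[n - k]."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def interpolate_by_2_reference(x, h):
    """Insert a zero after every sample, then filter at the output rate."""
    stuffed = []
    for sample in x:
        stuffed.extend([sample, 0])
    return fir(stuffed, h)

def interpolate_by_2_polyphase(x, h):
    """Run two sub-filters at the input rate; each step yields an (even, odd) output pair."""
    even_phase = fir(x, h[0::2])  # taps h[0], h[2], ...
    odd_phase = fir(x, h[1::2])   # taps h[1], h[3], ...
    return list(zip(even_phase, odd_phase))

taps = [1, 2, 3, 4, 5]           # hypothetical coefficients
samples = [7, -3, 0, 2, 9, -1]   # hypothetical input data
pairs = interpolate_by_2_polyphase(samples, taps)
flattened = [value for pair in pairs for value in pair]
print(flattened == interpolate_by_2_reference(samples, taps))
```

Each `(even, odd)` pair corresponds to a vector of width 2 produced in one clock cycle. Chaining such stages doubles the vector width each time, which is the reordering bookkeeping that the vector representation hides from the designer.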

The structure produced for each FIR filter is optimized for the incoming data rate and clock rate, and according to the target device characteristics. Changing the target device requires changing a single parameter and then re-running the simulation. Differences in fabric speed, for example, will change the amount of pipelining the tool needs to apply to achieve the desired clock frequency. A tool like DSPB-AB can use knowledge of device timing data to determine the level of pipelining required, and linear-program-solving techniques to choose the locations of pipeline insertion so as to reduce overall register use.

An experienced designer may be able to achieve the same result by hand for a particular target device and clock speed. But suppose they want to explore implementation possibilities. What if they want to try a different clock speed? All pipelining would have to be recalculated, as would the TDM of hardware components. What if they wanted to retarget to a different device family? This would involve all of the above, and perhaps some fundamental architectural choices in the implementation would also need to change, depending on the characteristics of the target hardware, such as DSP block and bulk memory features. Such design space exploration is prohibitively slow using conventional design techniques, but with high-level synthesis tools it can be achieved push-button.

Table I shows the resource usage when the design was retargeted to two different device families. No design changes were necessary to retarget; all optimization, technology mapping, and pipelining changes were implemented by the tool.
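To make the dependence of pipelining on fabric speed and clock target concrete, the following sketch estimates how many registers must be inserted into a combinational path so that each segment fits in one clock period. It is a deliberately idealized model with a hypothetical function name: it assumes the path can be split evenly and ignores register setup and routing overhead. The path delays are invented values standing in for a slower and a faster fabric, not device data.

```python
import math

def pipeline_registers(path_delay_ns, clock_mhz):
    """Registers needed so that no segment of the path exceeds one clock period."""
    period_ns = 1000.0 / clock_mhz
    return max(0, math.ceil(path_delay_ns / period_ns) - 1)

# Hypothetical delays for the same logic on two fabrics (illustrative only).
for name, delay_ns in [("slower fabric", 9.0), ("faster fabric", 6.5)]:
    for clock_mhz in (256, 400):
        stages = pipeline_registers(delay_ns, clock_mhz)
        print(f"{name}, {clock_mhz} MHz: {stages} pipeline registers")
```

Changing either the device or the clock target changes the answer for every path in the design, which is why recalculating pipelining by hand is slow and why automating it makes retargeting a single-parameter change.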

Figure 5: The Simulink testbench for the direct RF design can be synthesized and programmed onto a board from within Simulink, and the output data brought back and analyzed

TABLE I. DIRECT RF DESIGN RESOURCES

Direct RF Results

Device                       Stratix III EP3SL150F1152C4   Stratix IV EP4SGX230KF40C2
DSP block 18-bit elements    202 (53 %)                    202 (16 %)
Combinational ALUTs          25,417 (22 %)                 26,639 (15 %)
Memory ALUTs                 5,020 (9 %)                   8,076 (9 %)
Dedicated logic registers    38,426 (34 %)                 48,746 (27 %)
Total block memory bits      1,688,896 (30 %)              2,711,552 (19 %)

VI. BEAM-FORMING EXAMPLE

Beam-forming is a method in which an antenna array is used and the antenna beam is steered using digital signal processing techniques, rather than by physically moving a single antenna. It can be applied to both transmit and receive antennas. The number of antennas can vary from two to eight, as commonly used in today's mobile communication systems, to several hundred in modern radar systems. The example in this paper is based on an 8x8 antenna beam-former.

The heart of the algorithm is the computation of a sum of complex vector multiplies. In previous-generation Simulink synthesis tools, a schematic approach to implementing this complex vector multiplication might see each complex multiply implemented as, for example, four multipliers, an addition, and a subtraction, with the real and imaginary components handled explicitly. Each of these would then have to be replicated across the array of constant complex multiplies to be performed: in this fully parallel example, an 8 by 8 grid of complex multiplies. This can be greatly simplified by supporting vector and complex types within the schematic design, as shown in figure 6.
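The arithmetic described above can be written out as a short sketch. The function names and the sample and weight values are hypothetical. The first function spells out one complex multiply as four real multiplications, one subtraction, and one addition. The explicit beam-former replicates it across the grid, and the second version uses Python's built-in complex type to show how much shorter the same computation becomes once complex and vector types are available.

```python
def complex_multiply(ar, ai, br, bi):
    """(ar + j*ai) * (br + j*bi) with four multipliers, one subtractor, one adder."""
    real = ar * br - ai * bi
    imag = ar * bi + ai * br
    return real, imag

def beamform_explicit(samples, weights):
    """samples: N (re, im) pairs; weights: M rows of N (re, im) pairs."""
    beams = []
    for row in weights:
        acc_re, acc_im = 0, 0
        for (wr, wi), (xr, xi) in zip(row, samples):
            pr, pi = complex_multiply(wr, wi, xr, xi)
            acc_re += pr
            acc_im += pi
        beams.append((acc_re, acc_im))
    return beams

def beamform_complex(samples, weights):
    """Same computation using complex values and one vector multiply per beam."""
    return [sum(w * x for w, x in zip(row, samples)) for row in weights]

# Made-up 8-antenna samples and an 8x8 grid of made-up integer coefficients.
samples = [(n + 1, -n) for n in range(8)]
weights = [[(m - n, m + n) for n in range(8)] for m in range(8)]
explicit = beamform_explicit(samples, weights)
compact = beamform_complex([complex(re, im) for re, im in samples],
                           [[complex(re, im) for re, im in row] for row in weights])
print(explicit[0], compact[0])
print("real multipliers if fully parallel:", 8 * 8 * 4)
```

A fully parallel 8 by 8 grid built this way uses 64 complex multiplies, or 256 real multipliers, before any constant-multiplier optimization or folding is applied.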

Figure 6: Fully parallel beam-forming design

With complex support, each four-input subsystem for a complex multiply becomes a single multiply taking complex input types, and each pair of constants representing a complex coefficient becomes a single constant. With vector support, each column of multipliers and constants becomes a single multiplier and a single constant taking vector input types. Part of the design represented in DSPB-AB is shown in figure 7.

[Figure 7 annotations: complex vector data signal; complex data signal; complex vector coefficients; complex vector multiply; folded region; Valid, Channel, and Data signals enable synchronization without cycle counting.]

Figure 7: Fully parallel beam-forming layout detail showing vector and complex data types and operations, and automatic folding

This vastly simplifies design specification, and also allows for top-level parameterization that acts by varying vector widths, where previously manual design editing by replication with some variation was required. There is also the side benefit of increased simulation speed.

Of course, a fully parallel implementation may not be required, because the data rate is likely to be less than the system clock rate. In this case an HDL writer would have to calculate the time-division multiplexing (TDM), that is, the resource sharing needed for full utilization, and insert the required multiplexing and pipelining manually. The pipelining here is not algorithmic, but specific to the TDM. It is also in addition to the pipelining required to achieve the required fMAX. At this stage the HDL designer has to mix the algorithm with device implementation details to achieve an optimum design. A change in parameters, whether data rate, clock rate, or target device, requires disentangling these and then redesigning. IP cores from FPGA vendors and third parties, and the IP-level blocks in their design tools, often apply TDM automatically for optimal resource usage.
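The TDM calculation an HDL writer would otherwise do by hand can be sketched as follows. The function name and the 10 MSPS sample rate are assumptions made for illustration, since the source does not state the beam-former sample rate. The model is idealized: it ignores the multiplexing and pipelining overhead discussed in this section, so it is not intended to reproduce the measured results.

```python
import math

def folded_multipliers(parallel_multipliers, clock_mhz, sample_rate_msps):
    """Return (folding factor, physical multipliers) for an ideal time-shared design."""
    # Each physical multiplier can be reused once per clock cycle within a sample period.
    folding = max(1, clock_mhz // sample_rate_msps)
    return folding, math.ceil(parallel_multipliers / folding)

PARALLEL = 8 * 8 * 4  # 64 complex multiplies at four real multipliers each
for clock_mhz in (80, 240):
    folding, physical = folded_multipliers(PARALLEL, clock_mhz, 10)
    print(f"{clock_mhz} MHz: fold by {folding}, {physical} multipliers")
```

Raising the clock rate while holding the sample rate constant increases the folding factor and reduces the multiplier count, at the cost of the extra multiplexing logic that the idealized model leaves out.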

Next-generation high-level synthesis tools such as DSPB-AB can also TDM general user-designed subsystems built from primitive blocks such as multipliers and adders. These are folded down by the tool using knowledge of the ratio of sample rate to clock rate per channel and the number of channels.

To optimize for folding, the tool must know the user's desired trade-off value for resource types. For example, multipliers can be implemented as hard DSP blocks or in soft logic, and the data flow graph may be arranged so that one is traded off against the other. The trade-off value, effectively the cost of a DSP block in terms of logic elements (LEs) in the FPGA, is set by the user with threshold parameters. More folding requires more multiplexing logic, so there might come a stage where folding further to reduce multipliers results in an increase in logic above the user-set threshold. At this point the tool has to decide whether to share the multiplier resource (and hence reduce the multiplier count) and pay the cost in multiplexing and pipelining logic. This is one area where such top-level thresholds come into consideration, because they specify a cost, or objective, function for resource optimization.

Another case is the more direct trade-off of whether to implement a multiply as a hard DSP block or in soft logic. For multiplications by constants, soft logic is invariably better, especially with the canonical signed digit representation of constants. Using this representation, a constant multiply can be expressed as a minimal set of shifts (which are effectively free in terms of FPGA resources), additions, and subtractions.

In this beam-forming example, the tool uses two trade-offs as its optimization objective function: DSP sharing against multiplexing logic, and soft logic against hard DSP block implementation. These are the sorts of optimizations a designer writing in HDL might implement once, but they would take a long time to iterate, for example if the designer wanted to keep the sample rate the same and vary the clock rate. Within DSPB-AB, folding varies push-button in line with the clock-to-sample-rate ratio, and the ability to write scripts that loop over designs while varying parameters makes the flow very attractive and highly productive for design space exploration.

In this example, three test cases were generated, based on different input/output sequences:

1. Fully parallel inputs/outputs
2. Real and imaginary interleaved inputs/outputs on separate buses
3. Real and imaginary interleaved inputs/outputs on a single bus

Two system clock frequencies were used to investigate additional savings from over-clocking the system.
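The canonical signed digit representation mentioned above can be illustrated with a short sketch. The function names are hypothetical, and the code uses the standard non-adjacent form recoding for positive integer constants. It shows the principle only, and is not a description of how any particular tool implements constant multipliers.

```python
def to_csd(constant):
    """Signed digits (-1, 0, +1) of a positive integer, least significant first.

    Uses non-adjacent form, so no two neighboring digits are both nonzero.
    """
    digits = []
    n = constant
    while n > 0:
        if n & 1:
            digit = 2 - (n & 3)  # +1 when n % 4 == 1, -1 when n % 4 == 3
            n -= digit
        else:
            digit = 0
        digits.append(digit)
        n >>= 1
    return digits

def csd_multiply(x, constant):
    """Multiply x by a positive constant using only shifts, additions, and subtractions."""
    acc = 0
    for shift, digit in enumerate(to_csd(constant)):
        if digit == 1:
            acc += x << shift
        elif digit == -1:
            acc -= x << shift
    return acc

print(to_csd(7))            # 7 = 8 - 1, so one subtraction instead of two additions
print(csd_multiply(5, 7))
```

For a constant such as 7, plain binary needs three nonzero terms (4 + 2 + 1) while the signed digit form needs two (8 - 1). Because shifts are free in hardware, fewer nonzero digits means fewer adders or subtractors in soft logic.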
Of course, when the flow is driven from MATLAB, it is a simple matter to sweep over a whole range of parameters, and much more design space exploration would be possible. All three options use a comparable amount of resources, shown in Table II.

TABLE II. FOLDED BEAMFORMING DESIGN RESOURCES

8x8 Beamforming Resource Count

System clock   Actual FMAX   LE/Reg pairs   18x18 multipliers   Memory bits
80 MHz         117 MHz       21.5K          35                  13.6K
240 MHz        248 MHz       17K            15                  14.5K

VII. CONCLUSION

Next-generation model-based design tools provide a highly productive design methodology. Building a system from blocks that are behavioral in nature (specifying what to do, not when to do it), the user is free to focus on the signal flow representation. This results in a textbook-like schematic that captures the algorithm unpolluted by device implementation details. Such a schematic is much easier to debug and modify than one in which implementation details such as pipelining have to be built into the model. With pipelining automated, timing closure is as easy as pushing a button: specify the desired fMAX, and the tools insert the right amount of pipelining while maintaining the accuracy of the datapath algorithm. By automatically technology-mapping the schematic to device-specific features, the tool gives high-quality results from generic input. The user can now design once and retarget to different speed grades and families, or simply modify the design to expand the number of channels. The problem of exploring the possibilities to make the best use of FPGAs for DSP applications now has a simple, and highly productive, solution.

ACKNOWLEDGMENT

Thanks to Volker Mauer for much of the design work highlighted in this paper, and to Steve Perry, Daniel Elphick, and Simon Finn for the work on the DSP Builder Advanced Blockset used in these examples.

101 Innovation Drive San Jose, CA 95134 www.altera.com

Copyright 2010 Altera Corporation. All rights reserved.