You are on page 1of 14

An Independent Evaluation of:

The AutoESL AutoPilot High-Level Synthesis Tool

By the staff of
Berkeley Design Technology, Inc.

O VERVIEW
Interest in high-level synthesis tools for FPGAs is intensifying as FPGAs and their applications grow larger
and more complex. Prospective users want to understand how well high-level synthesis tools work, both
in terms of usability and quality of results. To meet this need, BDTI launched the BDTI High-Level
Synthesis Tool Certification Program™. This program evaluates high-level synthesis tools used to
implement demanding digital signal processing applications on FPGAs.

This white paper presents detailed results of BDTI’s analysis of the AutoESL AutoPilot high-level synthesis
tool used in conjunction with Xilinx RTL tools to target a Xilinx Spartan-3A DSP 3400 FPGA. We discuss
the ease of use, productivity, and quality of results obtained using AutoPilot compared to implementing the
same application on a DSP processor. We also compare AutoPilot quality of results vs. traditional RTL
FPGA design.

Our findings will be surprising to many—and may indicate a major shift on the horizon for FPGA and DSP
processor users.

Contents 2009. AutoPilot is based on tools and techniques


Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 developed at UCLA and licensed exclusively to
About BDTI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 AutoESL.
Design Using AutoPilot and Xilinx ISE . . . . . . . . 2 AutoPilot takes as its input a C, C++ or SystemC
The BDTI High-Level Synthesis Tool Certification description of functionality at a high level of abstrac-
Program™ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 tion. It then generates a device-specific Verilog or
Evaluation Workloads . . . . . . . . . . . . . . . . . . . . . . 4 VHDL register-transfer-level (RTL) description of a
Description of Metrics . . . . . . . . . . . . . . . . . . . . . . 5 hardware implementation targeting an FPGA (for
Description of Platforms . . . . . . . . . . . . . . . . . . . . 5 example, from Altera or Xilinx) or ASIC. This elim-
Implementation, Certification Process . . . . . . . . . 5 inates the time-consuming and error-prone step of
Quality of Results Metrics . . . . . . . . . . . . . . . . . . . 6 manually creating the RTL implementation. Accord-
Usability Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 7 ing to AutoESL, AutoPilot also generates a cycle-
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 accurate SystemC simulation model for the synthe-
The BDTI Optical Flow Workload™ . . . . . . . . . 11 sized results.
The BDTI DQPSK Receiver Workload™ . . . . . . 12 High-level synthesis tools, such as AutoPilot, that
Usability Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 13 target FPGAs are typically of interest to two classes
of prospective users: current FPGA users who want
Introduction to improve their productivity and gain design porta-
AutoESL Design Technologies, Inc., (“AutoESL”) bility, and processor users who are considering
is a high-level synthesis tool company founded in switching to an FPGA to achieve superior perfor-
2006 and headquartered in Cupertino, California. mance or cost/performance for computationally
The company unveiled its AutoPilot high-level syn- demanding applications. There are good reasons for
thesis tool at the Design Automation Conference in considering such a switch—BDTI’s report, FPGAs for

© 2010 BDTI (www.BDTI.com). All rights reserved.


DSP, includes benchmark results showing that land, California. Founded in 1991, BDTI is widely
FPGAs can achieve 100X higher performance and known for its highly regarded, independent perfor-
30X better cost-performance than DSP processors in mance analysis of processing platforms for embedded
some highly parallel DSP applications. Yet FPGAs applications. Its six suites of processor benchmarks
are not widely used as processing engines in these have been licensed for use with nearly 100 processing
applications, primarily due to development chal- platforms, from MCUs to FPGAs. Further informa-
lenges—using traditional design techniques, it takes tion about the company and benchmark results are
much longer to develop an application on a FPGA available at www.BDTI.com.
than on a DSP processor. In addition, RTL FPGA
design requires different skills than those required for Design Using AutoPilot and Xilinx ISE
DSP processor software development, and most DSP Figure 1 shows the design flow used in BDTI’s
engineers don’t have the necessary RTL hardware evaluation of the AutoESL AutoPilot high-level syn-
design expertise. thesis tool. AutoPilot is used to compile C code and
High-level synthesis tools have the potential to
overcome these challenges and bring the performance Reference C code

of FPGAs to a much wider range of users, but such


tools are often met with skepticism. In part, this skep- Manual restructuring of Standard SW Tools
ticism is due to the common notion that software C code (e.g. MS Visual Studio)

cannot produce results that are as good as what a


skilled engineer can produce. And in part, the skepti- Implementation C code
cism is due to the fact that high-level synthesis tools
have been around for a long time but have never
achieved widespread use. AutoPilot AutoCC
Opposing this skepticism is a growing body of
anecdotal evidence suggesting that some modern
high-level synthesis tools, such as AutoPilot, are very RTL ModelSim

effective, both in terms of usability and quality of


results. Given this conflicting information, how is a
prospective user to judge whether a high-level synthe- Xilinx ISE/EDK Tool

sis tool is worth considering? Design representation


In 2009, BDTI created the BDTI High-Level Syn-
Bitstream, Netlist Tool-generated
thesis Tool Certification Program to fill this gap, data enables code
tuning
with the goal of providing objective, credible data and
analysis to enable potential users of high-level synthe-
FIGURE 1. Design Flow Using the AutoPilot High-Level
sis tools for FPGAs to quickly understand the capa-
Synthesis Tool with Xilinx “ISE” RTL Tools
bilities and limitations of these tools.
BDTI has used this methodology to evaluate
AutoESL’s AutoPilot tool used in conjunction with generate an RTL implementation, and Xilinx’s RTL
Xilinx’s ISE and EDK tool chain targeting a Xilinx tools (which include the ISE tool chain and the
Spartan-3A DSP 3400 FPGA. This white paper sum- Embedded Developer’s Kit, or EDK) are used to
marizes the results of BDTI’s evaluation, and is based transform that RTL implementation into a complete
on BDTI’s in-depth, independent assessment of Auto- FPGA implementation in the form of a bitstream for
Pilot augmented with the results of detailed inter- programming a specific FPGA. AutoPilot offers a sig-
views of AutoPilot users conducted by BDTI. It nificant advantage over hand-coded RTL in terms of
includes several results that will be surprising to retargetability; the tool enables users to migrate from
many, and that may indicate that the time for high- one FPGA to another (or even to an ASIC design)
level synthesis tools for FPGAs has finally arrived. without manually rewriting their RTL.
For this project, BDTI could have limited the eval-
About BDTI uation to the use of AutoPilot alone, ignoring the
Berkeley Design Technology, Inc. (BDTI) is an RTL-to-bitstream portion of the design flow. We
independent technology analysis, consulting, and believe, however, that it is important for potential
engineering services company headquartered in Oak- users to understand how difficult (or easy) it is to get

Page 2 © 2010 BDTI (www.BDTI.com). All rights reserved.


from the high-level application description all the The reference C code may also need to be modi-
way to an FPGA implementation—which requires fied to take advantage of the FPGA’s ability to imple-
the Xilinx RTL tools in addition to the high-level ment arbitrary-precision data paths and other
synthesis tool. For this reason, BDTI evaluated the features. AutoPilot supports arbitrary-width data
entire implementation process shown in Figure 1— types and also automates synthesis of a variety of
not just the C-to-RTL portion, but also the Xilinx interfaces between modules, including interfaces
RTL tool chain. between various buses and third-party intellectual
When starting with a C language description of property modules such as the Xilinx multi-port mem-
required functionality, the first step in implementing ory controller (MPMC).
an application on any hardware target is often to The AutoPilot tool chain includes AutoCC,
restructure the input reference C code to represent an which compiles the C code for execution on the
architecture that is suitable for hardware implementa- workstation and handles FPGA-oriented modifica-
tion. On a DSP, for example, it may be appropriate tions such as arbitrary-width data types. AutoCC
to rearrange the application’s control flow so that doesn’t generate RTL; instead, it provides a quick
intermediate data always fits in cache. On an FPGA, check of the functionality of the FPGA-oriented C
restructuring often provides a more parallel, FPGA- code. This intermediate verification capability pro-
friendly representation of the application. vides a significant advantage over verifying hand-
Restructuring can sometimes enable improve- written RTL, since RTL simulation is orders of mag-
ments of several orders of magnitude in speed and/or nitude slower. In our evaluation, for example, the C
resource utilization. If the original C source code is simulation typically required 15-30 seconds, while
not appropriate for the hardware target, restructur- RTL simulation time varied from hours to days,
ing is required in order to obtain an efficient imple- depending on the workload under consideration.
mentation. For example, a video application may pass However, C simulation does not simulate the full
entire video frames of intermediate data between hardware features (such as pipelining and scheduling),
algorithm blocks. A straightforward C implementa- so RTL simulation is still needed for hardware verifi-
tion of the application may require large frame buff- cation.
ers that do not fit in an FPGA, but typically the code The functionally verified C code can then be opti-
can be restructured to enable much smaller buffers. mized for better performance or efficiency. This step
By appropriately streaming the dataflow between is primarily accomplished using AutoPilot synthesis
algorithm blocks, the application may be able to directives that can be either embedded in the C code
buffer just a few scan lines—or even just a few pixels— as pragmas or placed in AutoPilot scripts as com-
rather than buffering entire frames. mands. Because AutoPilot generates RTL very
In some cases, applications may require the use of quickly, the user can explore a variety of design strat-
external memory (for example, if a video application egies that would be prohibitively time-consuming to
must store entire frames and can’t be restructured to explore using RTL.
use scan lines or pixels). In this case, the restructuring AutoPilot compiles the optimized C code and
process requires the addition of code that supports generates an RTL implementation by converting
the use of an external memory interface. each function call in the C code into an RTL module.
The effort required to restructure the application For the moderately complex workloads used in this
depends on the specifics of the application and the project, that process typically took less than 30 sec-
hardware being targeted, as well as the capabilities of onds.
the tools used. In addition to generating RTL, AutoPilot also
AutoPilot, like other current high-level synthesis generates reports that estimate the FPGA resource
tools, does not handle restructuring automatically. utilization, latency, and throughput of the RTL
Instead, the restructuring is typically done by hand. implementation. The reports include a breakdown
In fact, the restructuring can be done entirely inde- by individual functions and loops in the C source
pendently of AutoPilot (in our evaluation, for exam- code, allowing users to pinpoint areas for improve-
ple, we used Microsoft Visual Studio for restructuring ment and fine-tune synthesis directives or C code in
and verifying the C code). Compared to hand-written response.
RTL, where restructuring and language translation Once the user is satisfied with the performance
are performed as a single combined step, restructur- and efficiency estimated by AutoPilot, the process
ing entirely in C is easier and less error-prone. switches over to more traditional RTL tools. An

© 2010 BDTI (www.BDTI.com). All rights reserved. Page 3


RTL simulator such as Mentor Graphics’ ModelSim based approach (for the BDTI DQPSK Receiver
can be used to functionally verify the AutoPilot-gen- Workload), or on a DSP processor using its associated
erated RTL. Alternatively, AutoPilot can generate a development tools (for the BDTI Optical Flow
cycle-accurate SystemC RTL model with all the nec- Workload). In this manner, BDTI is able to compare
essary test bench related files and scripts based on the the quality of results and productivity levels associ-
C-language reference test bench which then can be ated with using various design flows.
used for a faster functional verification than that The two workloads have been chosen to be
achieved with ModelSim. broadly representative of the types of embedded com-
Finally, Xilinx’s RTL tools (ISE and EDK) are puting applications that electronic system designers
used to take the RTL generated by AutoPilot as implement using FPGAs, and as such, they are inher-
input, perform synthesis and place-and-route tasks, ently well suited for FPGAs. This is an important
report the exact resource utilization of the implemen- point to keep in mind. There are many important
tation and alert the user to any timing closure issues. applications (such as high-definition audio codecs)
The user may then further tune the C source code that don’t require the computational performance
and repeat the RTL generation if needed. levels or data rates of the workloads used here, and
The transition point from AutoPilot to the Xilinx that require much more complex algorithms. Such
tools is smoothed, to some extent, by AutoPilot’s applications may yield very different results than
ability to automatically generate scripts and con- those reported in this analysis.
straints for running synthesis and place-and-route
steps, which would be quite time-consuming if done Evaluation Workloads
manually. AutoPilot does not, however, provide a The two workloads used in BDTI’s evaluation are
unified “cockpit” from which the entire C-to-FPGA the BDTI Optical Flow Workload™ and the BDTI
implementation process can be executed. For exam- DQPSK Receiver Workload™.
ple, AutoPilot does not generate scripts to transfer The term “optical flow” (or “optic flow”) refers to
RTL and netlist outputs to appropriate folders for the a class of video processing algorithms that analyze the
Xilinx EDK tools; this must be done manually. It’s a motion of objects and object features (such as edges)
trivial process once the user understands exactly what within a scene. The BDTI Optical Flow Workload
needs to be done—but understanding what needs to operates on a 720p resolution (1280×720 progressive
be done requires a high level of familiarity with the scan) input video sequence and produces a series of
Xilinx tools, including their complex directory struc- two-dimensional matrices characterizing the appar-
ture requirements. AutoPilot also does not provide a ent vertical and horizontal motion within the
script to download the bitstream onto the FPGA; sequence. In designing this workload, BDTI has
this requires working with the Xilinx tools. In other increased the control complexity relative to what is
words, while AutoPilot eliminates the labor-inten- often used in this class of optical flow algorithms in
sive (and error-prone) process of hand-coding in order to ensure that the workload provides a suffi-
RTL, it does not entirely insulate the user from hav- ciently challenging test case for the tools. More specif-
ing to learn and use the RTL tools. ically, BDTI incorporated dynamic, data-dependent
decision making and array indexing into the optical
The BDTI High-Level Synthesis Tool flow application.
Certification Program™ There are two Operating Points associated with
BDTI evaluates high-level synthesis tools (includ- the BDTI Optical Flow Workload, each of which
ing the underlying RTL tools) using two well-defined uses the same algorithm but is optimized for a differ-
sample applications, or “workloads” (described ent metric.
briefly below, and in more detail in Appendix A). • Operating Point 1 is a fixed workload defined as
These applications (described in the next section) are processing video with 720p resolution (1280×720
representative of typical digital signal processing progressive scan) at 60 frames per second. The
applications for FPGAs. The two applications are objective for Operating Point 1 is to achieve the
implemented using several approaches. First, a given required throughput while minimizing resource
workload is implemented on the target FPGA using utilization. Resource utilization refers to the per-
the high-level synthesis tool in conjunction with the centage of total FPGA or processor resources
Xilinx RTL tools. The same workload is then imple- required to implement the workload.
mented on the same FPGA using a traditional RTL-

Page 4 © 2010 BDTI (www.BDTI.com). All rights reserved.


• Operating Point 2 is defined as the maximum Description of Platforms
throughput capability of the workload imple- For this evaluation, the target FPGA was the Xil-
mentation on the target device (measured in inx Spartan-3A DSP 3400 (XC3SD3400A). For the
frames per second) for 720p resolution (1280×720 BDTI Optical Flow Workload, the Xilinx Xtrem-
progressive scan). The objective for Operating eDSP Video Starter Kit (which is based on the
Point 2 is to maximize the throughput (measured XC3SD3400A) was used. Spartan-3A DSPs are based
in frames per second) using all available device on Xilinx’s low-cost Spartan-3A family, but have a
resources. number of enhancements to accelerate digital signal
The secondary workload is the BDTI DQPSK processing. For example, Spartan-3A DSP chips have
Receiver Workload. This workload is a wireless com- double the block RAM (“BRAM”) memory of other
munications receiver baseband application that Spartan devices and incorporate hard-wired DSP data
includes classical communications blocks found in paths, called “DSP48A slices.” Each DSP48A slice
many types of wireless receivers. It is a fixed work- contains an 18x18 multiplier with pre-adders and an
load with a single Operating Point defined as process- accumulator, among other features. The
ing an input stream of complex, modulated data at XC3SD3400A includes 126 DSP48A slices that can be
18.75 Msamples/second with the receiver chain clocked at up to 250 MHz, and roughly 54,000 logic
clocked at 75 MHz. The receiver produces a demodu- cells. Xilinx RTL tools, including the ISE and EDK
lated output bitstream of 4.6875 Mbits/second. The tool suites, were used with AutoPilot. (ISE and EDK
objective for this workload is to minimize the FPGA version 10.1.03 (lin64)).
resource utilization needed to achieve the specified The target DSP processor was the Texas Instru-
throughput. ments TMS320DM6437. The TMS320DM6437 is a
Additional details on the two workloads are pro- video-oriented processor that includes a 600 MHz
vided in Appendix A. Texas Instruments TMS320C64x+ DSP core along
with video hardware accelerators. (The hardware
Description of Metrics accelerators do not support the BDTI Optical Flow
Two categories of metric are used in this evalua- Workload, and therefore were not used in the DSP
tion: processor implementation of the BDTI Optical Flow
• Quality of results metrics assess the perfor- Workload.) The evaluation used the Texas Instru-
mance and efficiency of the workload implemen- ments DM6437 Digital Video Development Environ-
tation. For the BDTI Optical Flow Workload, ment combined with the Texas Instruments Code
quality of results metrics are reported for the Composer Studio tools suite (version V3.3.82.13,
AutoESL-Xilinx implementation and for the Code Generation Tools version 6.1.9).
DSP processor implementation. For the BDTI We should note here that, as is typical for high-
DQPSK Receiver Workload, quality of results level synthesis tools, AutoPilot costs considerably
metrics are reported for the AutoESL-Xilinx flow more than DSP processor software development
and for a traditional FPGA implementation using tools. Tool cost is not reflected in the cost-perfor-
a hand-written RTL design. mance results presented later in this paper. A funda-
mental question is whether the higher tool cost is
• Usability metrics assess the productivity and justified by higher quality of results and/or higher
ease of use associated with the AutoESL-Xilinx productivity enabled by AutoPilot. We believe that
design flow, and are based on the BDTI Optical the results and analysis presented in this paper will
Flow Workload. These metrics compare the pro- prove valuable to prospective AutoPilot users in
ductivity and ease of use associated with using the answering that question.
AutoPilot and Xilinx tools targeting an FPGA
relative to using a DSP processor with its associ- Implementation, Certification Process
ated software development tool chain.
The work of implementing the two workloads on
Usability metrics are evaluated qualitatively the two chips was distributed between AutoESL, Xil-
based on nine aspects of tool use, including out- inx, and BDTI based on the chip and tool chain used.
of-the-box experience, ease of use, completeness AutoESL implemented both workloads using the
of tool capabilities, efficiency of overall design AutoPilot and Xilinx tools and submitted resource
methodology, and quality of documentation and utilization results to BDTI for independent verifica-
support. tion and certification. In parallel, BDTI’s engineers

© 2010 BDTI (www.BDTI.com). All rights reserved. Page 5


independently implemented portions of the BDTI an order of magnitude higher computational power.
Optical Flow Workload using the AutoPilot and Xil- For applications that can make good use of this horse-
inx tools to gain first-hand insight into the usability power (such as the BDTI Optical Flow Workload),
of the tool chain. BDTI implemented the BDTI Opti- the FPGA’s performance advantage is compelling.
cal Flow Workload on the DSP processor, and Xilinx In contrast to Operating Point 1 of the BDTI
implemented the hand-coded RTL FPGA version of Optical Flow Workload, which specifies a fixed
the BDTI DQPSK Receiver workload (which was throughput requirement, the goal of Operating Point
then verified and certified by BDTI). 2 is to achieve the highest throughput possible, using
In addition, BDTI interviewed a number of Auto- all of the available chip resources. Table 2 presents
Pilot customers about their experiences in using the these results in terms of the maximum frames per sec-
tool and the quality of results they have attained. ond supported by each single chip and the associated
These interviews were used to augment (and sanity- cost per frame per second.
check) the results obtained using the workload imple-
TABLE 2. Quality of Results for BDTI Optical Flow
mentations. Workload Operating Point 2: Maximum
Throughput (1280×720 Progressive Scan)
Quality of Results Metrics
CERTIFIED TM

In this section we present the Quality of Results Maxi-


Chip Cost
findings of our evaluation. The first results we mum
Unit per FPS
present are for the BDTI Optical Flow Workload. Frames
Platform Cost (Lower
per
Table 1 shows results for Operating Point 1, which (Qty. is
Second
evaluates the percentage of on-chip hardware 10K) Better)
(FPS)
resources consumed by a 60 fps Optical Flow Work-
load at 720p resolution. AutoESL AutoPilot
As shown in Table 1, the FPGA implementation plus Xilinx RTL
tools targeting the
using the AutoESL-Xilinx tool chain utilized 39% of $26.65 183 fps $0.14
Xilinx
the FPGA resources to implement the workload. The XC3SD3400A
DSP processor was unable to implement this work- FPGA
load; a minimum of 12 ‘DM6437 chips would be
needed to achieve 60 fps operation on the BDTI Opti- Texas Instruments
cal Flow Workload. software develop-
ment tools targeting $21.25 5.1 fps $4.20
TABLE 1. Quality of Results for BDTI Optical Flow TMS320DM6437
Workload Operating Point 1: Fixed Throughput DSP processor
(1280×720 Progressive Scan, 60 fps)
CERTIFIED TM

Here again, the FPGA implementation created


Chip Unit
Chip Resource with AutoPilot achieves much higher performance
Platform Cost
Utilization than the DSP processor, and also much better cost-
(Qty. 10K)
performance.
AutoESL AutoPilot Prospective users are often interested in under-
plus Xilinx RTL tools standing the capacity of synthesis tools. The Optical
$26.65 39%
targeting the Xilinx
Flow Operating Point 2 represents the largest design
XC3SD3400A FPGA
evaluated by BDTI, and used 76% of the Xilinx
N/A XC3SD3400A FPGA. AutoPilot had no trouble han-
Texas Instruments soft-
(a minimum of dling a design of this size. For this workload, the ref-
ware development
12 DSPs would erence C code contained 559 functional lines of code.
tools targeting $21.25
be required to The restructured and optimized reference C code
TMS320DM6437 DSP
meet this oper-
processor contained 1,604 lines of code, which then generated
ating point)
38,222 lines of Verilog RTL or 35,849 lines of VHDL
RTL. AutoPilot synthesized the C description and
These results illustrate the difference in processing generated the RTL in less than 30 seconds.
horsepower available on the Spartan-3A DSP versus
The next question our evaluation attempted to
the TI ‘DM6437 chip—for a 25% higher chip price,
answer was: How efficient is an AutoPilot-based
the FPGA used with AutoPilot provides more than FPGA implementation relative to an implementation

Page 6 © 2010 BDTI (www.BDTI.com). All rights reserved.


created using hand-written RTL code? For this part From the results presented here and the user inter-
of the evaluation we used the BDTI DQPSK Receiver views conducted by BDTI, it is clear that AutoESL
Workload (the Optical Flow workload would have has done an excellent job in overcoming this prob-
been prohibitively time-consuming to hand code in lem, at least for the class of workloads used in this
RTL). analysis.
Table 3 shows the efficiency of the implementa-
tion using the AutoESL-Xilinx tool chain versus the Usability Metrics
hand-written RTL implementation. In this section we present the usability metrics for
TABLE 3. Quality of Results for DQPSK Receiver
the BDTI High-Level Synthesis Tool Certification
Workload: Fixed Throughput (18.75 Msamples/ Program. The usability metrics, presented in Table 4
Second Input Data with a 75 MHz Clock Speed)
CERTIFIED
and Table 5, provide an assessment of the productiv-
ity and ease of use of the high-level synthesis tool
TM

Chip Resource flow compared with the DSP processor tool chain.
Platform Utilization For each usability metric described below, BDTI
(Lower is Better) assigns a score of Excellent, Very Good, Good, Fair,
AutoESL AutoPilot plus or Poor. For the AutoESL-Xilinx flow, the first score
Xilinx RTL tools listed is the score for the complete flow. Beneath
5.6%
targeting the Xilinx those scores, in parenthesis, separate scores are pro-
XC3SD3400A FPGA vided for AutoPilot and for the Xilinx RTL tool
Hand-written RTL code chain.
using Xilinx RTL tools In assigning these scores, BDTI considers the over-
5.9% all design process for a complete project—starting
targeting the Xilinx
XC3SD3400A FPGA with a C language application specification and end-
ing with a real-time implementation on the target
In this case, the hand-written RTL implementa- processing platform (either an FPGA or DSP proces-
tion was created by an experienced FPGA designer, sor). Detailed descriptions of each usability metric
and made use of Xilinx Coregen IP blocks where are provided in Appendix B.
applicable. As shown in Table 3, AutoPilot was able In general, AutoPilot was quite straightforward to
to achieve a level of efficiency comparable to that of install and use. In contrast, the Xilinx RTL tools were
hand-coded RTL on this workload. Achieving this difficult to install and complicated to use, particularly
result required modifications to the C language for a novice FPGA user. The net result is that, as
description of the workload; however, the extent of shown in Table 4 and Table 5, the AutoPilot plus Xil-
these modifications was modest. inx tool chain has productivity ratings similar to
The similarity of AutoPilot and hand-written those of the DSP processor flow.
RTL results is probably not accidental; AutoESL was Although we did not directly compare the ease of
provided with the resource utilization for the hand- use of AutoPilot vs. hand-written RTL code, the
coded RTL result at the outset of the implementation AutoPilot customers we interviewed estimated that
process, and likely used this as a target in optimizing using AutoPilot cut their design times by roughly
its implementation. It is important to note, however, 50% compared with using hand-written RTL code. In
that such information is not required for effective use addition some saw a more significant reduction in
of AutoPilot, and that AutoESL was not provided time required for verification. For example, some cus-
with the hand-coded design. The results shown in tomers said that because much of their simulation
Table 3 are consistent with results reported by Auto- could be done at the C level rather than at the RTL
Pilot customers interviewed by BDTI. These users level, it was much faster and easier. And some said
generally reported that the tool produced results that that they expected to be able to reduce total design
were similar in efficiency to hand-coded RTL. Fur- time even further.
thermore, they said that using a high-level synthesis The quality of documentation is a weak spot for
tool enabled them to easily explore alternative archi- AutoPilot; there is very little of it. This is likely
tectures, which often led to more efficient implemen- attributable to the newness of the tool.
tations.
Historically, poor quality of results has been one
of the biggest pitfalls of high-level synthesis tools.

© 2010 BDTI (www.BDTI.com). All rights reserved. Page 7


CERTIFIED TM

TABLE 4. Usability metrics (1 of 2)

Quality of
Out-of-Box Completeness of
Ease of Use Documentation
Experience Capabilities
and Support

Combined AutoESL
AutoPilot plus Xilinx Fair Good Good Good
RTL tools rating1

(AutoESL AutoPilot rating (Very Good/Poor) (Very Good/Fair) (Good/Good) (Fair/Very Good)
/ Xilinx rating)
Texas Instruments soft-
ware development tools Good Very Good Very Good Very Good
rating2
1.AutoESL AutoPilot plus Xilinx RTL tools targeting a Xilinx XC3SD3400A FPGA
2.Texas Instruments software development tools targeting a TMS320DM6437 DSP processor

CERTIFIED TM

TABLE 5. Usability metrics (2 of 2)

Efficiency of Design Methodology


Design and Extent of
Design and
Implementation Modifications
Learning to Use Implementation Platform
(Final Required to
the Tool (First Infrastructure
Optimized Reference Code
Compiling Development
Version)
Version)

Combined
AutoESL Auto-
Pilot plus Xil-
Very Good Very Good Good Good Good
inx RTL tools
rating1
(Very Good/NA2) (Very Good/NA) (Good/Good) (Good/Good) (Good/NA)
(AutoESL Auto-
Pilot rating / Xil-
inx rating)
Texas Instru-
ments software NA (assuming
Excellent Good Good Fair
development already familiar)
tools rating3
1.AutoESL AutoPilot plus Xilinx RTL tools targeting a Xilinx XC3SD3400A FPGA
2.“NA” = not applicable
3.Texas Instruments software development tools targeting a TMS320DM6437 DSP processor

Page 8 © 2010 BDTI (www.BDTI.com). All rights reserved.


The process of developing the platform infrastruc- DSP processor and FPGA tools were used by engi-
ture was similar for the FPGA and DSP in this neers already familiar with them, BDTI did not assign
project. In general, DSP processors come well a score for learning to use these tools.
equipped with peripherals for implementing DSP Perhaps the most significant usability weakness of
applications, but provide limited flexibility in periph- the AutoPilot-based tool flow is not AutoPilot
erals and interfaces. FPGAs, in contrast, provide itself—it’s the Xilinx RTL tools. We had assumed that
excellent flexibility in peripherals and interfaces, but AutoPilot would largely abstract the user from the
may not, by default, include needed peripheral process of going from RTL to the FPGA bitstream,
blocks. and, as mentioned earlier, that turned out to be a false
As mentioned earlier, obtaining an efficient imple- assumption. Generating the final FPGA implementa-
mentation of the BDTI Optical Flow Workload tion requires the user to create scripts that run some
using AutoPilot required modifications to the C parts of the Xilinx tools, and more importantly, can
code. This was also true for the DSP processor. The require the user to manually integrate and run vari-
extent of modifications required to the C code was ous RTL blocks together.
moderate for the AutoESL-Xilinx flow and high for Though it is relatively straightforward to get good
the DSP processor flow. Optimizing the BDTI Opti- results from AutoPilot, the remaining task of getting
cal Flow Workload for the DSP processor required from RTL to bitstream using the Xilinx RTL tool
extensive code reorganization relative to the refer- chain was time-consuming and not a process that
ence code. This reorganization process is not only most DSP software engineers would be equipped to
time-consuming but also requires a very high level of handle without assistance from an FPGA expert. It’s
expertise. clear that, while AutoPilot can produce efficient
Part of the reason the DSP processor required so results and significant improvements in productivity,
much reorganization is that the BDTI Optical Flow it does not eliminate the requirement for an FPGA
Workload used in this analysis is well suited for expert as part of the team. Along these same lines, for
implementation on FPGAs and not well suited for current FPGA users, AutoPilot accelerates several
implementation on DSP processors with caches (such significant portions of the design process, but does
as the TMS320DM6437 chip used in this analysis). not address all important aspects of the process.
The amount of reorganizing required will vary by
application. Conclusions
Overall, considering the entire design flow we BDTI’s earlier benchmarking of FPGAs and DSP
expect that, in general, the level of effort for imple- processors showed large performance and cost-per-
menting many kinds of applications on a Xilinx formance advantages for FPGAs on some applica-
FPGA using AutoPilot and the Xilinx RTL tools will tions when the FPGA implementations were created
be similar to that of implementing the same applica- using traditional RTL design techniques. The new
tions on a DSP processor using software development analysis presented here shows that FPGAs can
tools. achieve similar performance and cost performance
As shown in Table 6, two different skill sets are advantages when used with the AutoESL AutoPilot
required for the FPGA implementation: skills associ- high-level synthesis tool. In addition, we found that
ated with AutoPilot, and skills associated with the AutoPilot can achieve quality of results equivalent to
Xilinx RTL tools. In comparison, for DSP processor hand-written RTL code. We were surprised and
application development, an engineer with special- impressed by the quality of results that AutoPilot was
ized DSP algorithm and programming skills is able to produce, given that this has been a historic
required. A typical DSP software engineer with an weakness for high-level synthesis tools in general.
awareness of hardware architecture fundamentals FPGA designs done using traditional hand-writ-
(e.g., pipelining, latency) can learn to effectively use ten RTL coding typically take much more effort than
AutoPilot. Although a learning curve on the order of the equivalent application implemented in software
several weeks to a few months is required to become on a DSP processor. Therefore, perhaps the most sur-
proficient using AutoPilot (depending on the back- prising outcome of this project was that it took
ground of the engineer), no other specialized exper- roughly the same effort to implement the evaluation
tise is required. Since BDTI engineers learned to use workload on the FPGA using the AutoPilot plus Xil-
AutoPilot from the ground up, we include a Learning inx RTL tools as it took on the DSP processor. This
to Use the Tool usability metric. However, because the is a significant breakthrough, and one with the poten-

© 2010 BDTI (www.BDTI.com). All rights reserved. Page 9


TABLE 6. BDTI High-Level Synthesis Tool Certification Program Results Skills Required

Platform Required Skill Set


• Application expertise
AutoESL AutoPilot high-level
• C programming
synthesis tool
• Hardware architecture fundamentals
• Algorithmic optimization and restructuring
• FPGA architecture details
• RTL design
Xilinx RTL tools
• RTL tools
• Devices and interfaces
• Application expertise
• C and assembly programming
TI DSP tools • Processor chip architecture details
• Algorithmic optimization and restructuring
• Devices and drivers

tial to have a major impact on the design of high-per-


formance embedded computing applications. Given
the FPGA’s advantages in speed and cost-perfor-
mance, we expect that the availability of AutoPilot
will significantly change the trade-offs between DSPs
and FPGAs for certain types of applications.
Overall, our analysis indicates that the AutoESL-
Xilinx tool chain used with the Spartan-3A DSP 3400
FPGA yields much better performance and cost/per-
formance than the TI DSP processor on the work-
load we considered, while offering similar
development effort. And when compared with hand-
written RTL, AutoPilot was able to deliver equiva-
lent results. In exchange for these advantages, users
will pay much more for the tool chain and will need
FPGA expertise on the development team.
We expect that many system designers will be
happy to make that trade-off.

Page 10 © 2010 BDTI (www.BDTI.com). All rights reserved.


The block diagram shown in Figure 2 character-
APPENDIX A: Workload Details izes the implementation of the BDTI Optical Flow
Workload in the C language reference implementa-
The BDTI Optical Flow Workload™ tion included with the BDTI specification package.
A block diagram of the BDTI Optical Flow However, it is not required that final optimized
Workload is shown in Figure 2. implementations of the workload maintain this struc-
The BDTI Optical Flow Workload is a video pro- ture. Blocks may be merged or restructured to
cessing application suitable for implementation on an improve the efficiency of the final implementation,
FPGA or high-performance processor. The term provided all acceptance criteria are met (e.g., all test
“optical flow” (or “optic flow”) refers to a class of vectors pass).
video processing algorithms that analyze the motion Note that, at the level of this block diagram, the
of objects and object features (such as edges) within a workload is characterized by a feed-forward chain of
scene. signal processing blocks. The feed-forward nature of
The BDTI Optical Flow workload operates on a the application enables straightforward pipelining of
720p resolution (1280×720 progressive scan) input workload implementations to increase throughput.
video sequence and produces a series of two-dimen- The recommended data widths at the input and out-
sional matrices characterizing the apparent vertical put of each block are specified by BDTI.
and horizontal motion within the sequence. As men- A brief description of each of the five main func-
tioned in the main text of this white paper, there are tional blocks in the workload follows.
two Operating Points associated with the BDTI
Optical Flow Workload, each of which uses the same Gradient Calculation
algorithm: The gradient calculation block computes lumi-
• Operating Point 1 is a fixed workload defined as nance gradients of the incoming video sequence in the
processing video with 720p resolution (1280×720 horizontal, vertical, and time dimensions. It is imple-
progressive scan) at 60 frames per second. The mented as a one dimensional FIR filter applied to the
objective for Operating Point 1 is to achieve the video sequence in each of the respective dimensions
required throughput while minimizing resource to produce three 1280×720 gradient matrices.
utilization. Resource utilization refers to the per-
centage of total FPGA or processor resources Gradient Weighting
required to implement the workload. The gradient weighting block smoothes the gradi-
• Operating Point 2 is defined as the maximum ent matrices computed in the gradient calculation
throughput capability of the workload imple- block using a separable two-dimensional FIR filter.
mentation on the target device (measured in The separable filter is implemented in the reference
frames per second) for 720p resolution (1280×720 code using one-dimensional horizontal and vertical
progressive scan). The objective for Operating FIR filters applied to each of the three gradient matri-
Point 2 is to maximize the throughput (measured ces.
in frames per second) using all available device
resources. Outer Product
The outer product calculation is implemented as
Quality of results scores for Operating Points 1
an element-wise product of each of the weighted gra-
and 2 are based on results obtained when using a
dient matrices as shown below:
BDTI-proprietary video clip.

Pxx Txx
Gx Gx'
Image
Sequence 4. Vx
Pyy Tyy
1. 2. 3. Tensor 5.
Y Gradient Gradient Outer Calculation Velocity
Gy Gy' Pxy Txy
Calculation Weighting Product And Calculation
Normalization
Pxz Txz Vy
Gz Gz'
Pyz Tyz

FIGURE 2. Block Diagram of the BDTI Optical Flow Workload

© 2010 BDTI (www.BDTI.com). All rights reserved. Page 11


Pxx = Gx' * Gx' Receiver Workload is a wireless communications
Pyy = Gy' * Gy' application suitable for implementation on an FPGA
Pxy = Gx' * Gy' or a high performance digital signal processor. The
Pxz = Gx' * Gz' workload includes classic communications blocks
Pyz = Gy' * Gz' that can be found in many wireless receivers in vari-
ous forms and complexities.
Where Gx', Gy', and Gz' are each 1280×720 gradi- The BDTI DQPSK Receiver Workload is a fixed
ent matrices. workload with a single Operating Point defined as
processing an input stream of complex modulated
Tensor Calculation and Normalization data at 18.75 Msamples/second with the receiver
The tensor calculation is a two-dimensional, sepa- chain clocked at 75 MHz. The corresponding
rable FIR filter. It is implemented in the reference DQPSK demodulated output bitstream is
code using one-dimensional horizontal and vertical 4.6875 Mbits/second. The objective for this work-
FIR filters, applied to each outer product. load is to minimize the FPGA resource utilization
Tensor normalization manages data word widths needed to achieve the specified throughput. Resource
by scaling the five tensors at each pixel position. The utilization refers to the percentage of total processing
tensors at each pixel position are normalized based on engine resources required to implement the work-
the largest tensor magnitude within a small neighbor- load.
hood of pixels. The size of the neighborhood is data The high-level block diagram shown in Figure 3
dependent, but this block is designed to have characterizes the implementation of the workload in
bounded computational requirements, guaranteeing the C language reference implementation included
that a real-time implementation is feasible. with BDTI’s specification package. Blocks may be
merged or restructured to improve the efficiency of
Velocity Calculation the final implementation, provided all acceptance cri-
The velocity calculation computes the horizontal teria are met (e.g., all test vectors pass). The recom-
and vertical components of velocity at each pixel mended data widths at the input and output of each
position, and is implemented as follows: block are specified by BDTI.
velocityX = (Tyz*Txy-Txz*Tyy) / (Txx*Tyy- A brief description of each of the five main func-
Txy*Txy) tional blocks in the workload follows.
velocityY = (Txz*Txy-Tyz*Txx) / (Txx*Tyy-
Txy*Txy) Matched Filter
The matched filter block implements a square
Where each T in the above equations is a 1280×720 root raised cosine (SQRC) FIR filter with selectable
matrix output from the tensor calculation. roll-off factor. The input to the block is a complex
DQPSK modulated signal at four times the symbol
The BDTI DQPSK Receiver Workload™ rate. The output of the block is a complex filtered sig-
A block diagram of the BDTI DQPSK Receiver nal at four times the symbol rate.
Workload is shown in Figure 3. The BDTI DQPSK

fo u n d _ w o rd s
1. 2. 3. 4. 5. 6.
M a tc h e d C a r rie r T im in g DQ PSK V ite rb i D e fr a m e r
F ilte r R ecovery R ecov ery D em od D ecode r d e fra m e d _ d a ta
Q

c o e ff_ in d s o f t_ v a lu e c o r r_ th re s h o ld fr a m e _ s ize

C o n tro l R e g is te rs
FIGURE 3. Block Diagram of the BDTI DQPSK Receiver Workload

Page 12 © 2010 BDTI (www.BDTI.com). All rights reserved.


Carrier Recovery
The carrier recovery block implements a modified APPENDIX B: Usability Metrics Details
Costas Loop to estimate, correct, and track errors
between the transmitter and receiver carrier frequen- Usability Metrics
cies. The usability metrics shown in Table 4 are
The input to this block is the complex output of defined as follows:
the SQRC at four times the symbol rate. The output
is a complex sample corrected by the estimated phase Required Skills
error at four times the symbol rate. In addition to assigning scores for the usability
metrics listed below, BDTI also identifies the skills
Timing Recovery required to effectively work with each tool chain. No
The timing recovery block implements a modified score is provided for this metric.
Mueller and Mueller algorithm for symbol timing
recovery. The timing error is calculated using the cur- Out-of-the-Box Experience
rent received and detected (sliced) symbols, and previ- This includes all activities from unpacking the box
ously received and detected symbols. The block also to getting everything set up and installed so that the
includes an interpolator that interpolates the received user can start the design process. The assessment
data value at the estimated sampling time using the includes items such as: clarity of documentation,
newly received data and the previously sampled data. smoothness of the installation process, the time
The input to this block is a phase corrected com- required to perform the installation, and the helpful-
plex sample at four times the symbol rate from the ness of tutorials or demo applications. Note: for the
carrier recovery block and the output is a complex Xilinx tools, the Out-of-the-Box Experience was
sample at the estimated sampling time at symbol rate. assessed by DSP software engineers rather than an
experienced FPGA designer.
DQPSK Demodulator
The DQPSK demodulator accepts a complex sym- Ease of Use
bol and differentially decodes its phase relative to the This is an assessment of how easy it is to use the
previously received complex symbol. It then slices features provided by the tool chain. It is not meant to
this phase into corresponding decoded bits. identify missing features (this is addressed as part of
The input to the DQPSK demodulator is a com- the “completeness of capabilities” metrics). The
plex value and the output consists of two soft values assessment includes items such as the intuitiveness
corresponding to the two decoded bits. and user-friendliness of the user interface, responsive-
ness (i.e., whether the tool chain was slow to com-
Viterbi Decoder plete actions), reliability (did the tools crash or hang?)
The Viterbi decoder accepts a continuous stream and clarity of on-line help.
of soft decisions from the slicer and outputs a stream
of decoded binary bits. Completeness of Capabilities
This is an assessment of the extent to which the
Deframer tool chain includes all capabilities necessary to enable
The deframer is a correlator that operates on a user to efficiently complete the implementation of
every output bit searching for an “access code.” The the workloads.
access code is a predefined sequence of bits indicating
the start of the payload. Quality of Documentation and Support
Throughout the certification process BDTI
assesses the documentation supplied with the tool
chain. This includes the quality of the getting started
guide, tutorials, and the ease of finding answers to
specific questions (including technical support).

Efficiency of the Overall Design Methodology


This is primarily an assessment of user productiv-
ity. It includes assessing the extent to which a user of

© 2010 BDTI (www.BDTI.com). All rights reserved. Page 13


the high-level tool is abstracted from the underlying Extent of Modification to the Original Reference
architecture (RTL for FPGA implementations and Code
the DSP processor core and chip architecture for DSP This is an assessment of how closely the code used
processor implementations). Since this is a broad cat- to generate the final optimized BDTI Optical Flow
egory, it is broken up into the following sub-catego- Workload implementation matches the original C
ries: language reference code provided by BDTI. This is
Learning to Use the Tool: not a simple count of the number of lines of code that
• For FPGA designs: Learning to efficiently use the have been changed, but rather reflects the complexity
high-level synthesis tool. As mentioned above, no and effort involved in making the necessary changes.
score is provided for learning the Xilinx RTL Typically changes are made for one of the following
tools because an experienced BDTI FPGA engi- reasons:
neer worked with them. • Structural changes: Changes to the overall data
• For DSP designs: As mentioned above, no score is flow and code structure required to map the appli-
provided for learning the DSP tools because expe- cation the underlying device architecture
rienced BDTI DSP engineers worked with them. • Changes required because of tool restrictions or
First Compiling Version: limitations
• The effort required to create an initial functional • Timing and resource optimizations on individual
implementation of the application. For the FPGA blocks
this includes only the HLS tool—use of the RTL • Interfacing to peripherals and other external mod-
tools to support integration into the FPGA is not ules
included. The first compiling version is not
BDTI considers factors such as: the number of
expected to be optimized in terms of performance
or resource utilization, but rather an initial imple- changes, the extent to which the tools automate
mentation based on which the optimization pro- implementing these changes, the level of difficulty for
cess can begin. the developer in incorporating the changes and the
level of difficulty in debugging and testing the
Final Optimized Version:
changes.
• The effort required to take the application from
the first compiling version to a final optimized
implementation (not including completion of
interfaces required to run on an actual chip). For
the FPGA, this does not include final integration
with the video and memory interfaces, but does
include adding C language code required to inter-
act with external memory. For the DSP proces-
sor, this includes testing via file I/O, but not
integration with external video ports.
Platform Infrastructure Development:
• This category includes integration of platform
components that must be incorporated into a
design so that it will run on a physical chip (i.e., a
DSP processor or FPGA). This includes, but is
not limited to:
• For FPGA implementations: Complete inte-
gration of the memory controller, external
memory and video I/O.
• For DSP processor implementations: Installa-
tion and configuration of drivers and libraries,
and interfacing to video I/O.

Page 14 © 2010 BDTI (www.BDTI.com). All rights reserved.

You might also like