
Designing Parameterized Signal Processing IPs for High Level Synthesis in a Model Based Design Environment

Shahzad Ahmad Butt
Electronics and Telecommunication Department
Politecnico di Torino, Italy
Corso Duca Degli Abruzzi, 24
Shahzad.Butt@polito.it

Luciano Lavagno
Electronics and Telecommunication Department
Politecnico di Torino
Corso Duca Degli Abruzzi, 24
Luciano.Lavagno@polito.it

ABSTRACT
Model based hardware/software synthesis can lead to fast and efficient embedded system
implementations, by enabling quick design space exploration. High level hardware modeling and
implementation can be accelerated by using functionally verified parameterized models that are
optimized for high level hardware synthesis. Such models can be designed so that they can be easily
integrated with a high level modeling environment, such as Simulink, and at the same time provide
ample flexibility to perform design space exploration when mapped to hardware. During signal
processing hardware design, the focus is mostly on the architectural representation (data
parallelism, pipelining, memory access, etc.) to meet throughput requirements and on data path
modeling to analyze the effects of quantization. In this paper we present our experience of
modeling an FFT block that can be integrated with the Simulink model based design environment for
simulation and verification, and later can be used to perform architectural design space
exploration and hardware implementation with optimal data path selection. A key advantage of our
model is that the very same bit-accurate C code is used for simulation and for high-level
synthesis, because it has been written with both aspects in mind (while for software implementation
either our code or the code provided by MathWorks can be used equally well). To prove the
feasibility of our proposed approach we synthesized our FFT for two DSP applications with very
different performance and cost requirements, namely a frequency domain audio detector and a GPS
acquisition algorithm, and compared it with existing manual implementations.

Categories and Subject Descriptors
B.5.2 [Register-Transfer-Level Implementation]: Design Aids – automatic synthesis, hardware
description languages, simulation, verification, optimization
B.5.1 [Register-Transfer-Level Implementation]: Design – data-path design

General Terms
Algorithms, Design, Verification.

Keywords
Model-based design, Model-based high level synthesis, Synthesis from Simulink models,
Parameterized IPs, Design reuse, Audio detector, GPS acquisition, C/C++ Hardware IP description,
FFT.

Permission to make digital or hard copies of all or part of this work for personal or classroom use
is granted without fee provided that copies are not made or distributed for profit or commercial
advantage and that copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
CODES+ISSS’12, October 7–12, 2012, Tampere, Finland.
Copyright 2012 ACM 978-1-4503-1426-8/12/09...$15.00.

1. INTRODUCTION
Model-based design environments, such as Simulink, are becoming more widespread as they expand
their capabilities of synthesizing efficient hardware and software from abstract models. One of the
major concerns of model-based design is to let the algorithm designer work mostly at an abstract
level without caring about the low level implementation details. This is because the area,
performance and power optimizations that can be achieved at this level outweigh by far anything
that can be obtained at the micro-architectural level and below, and also result in improved
productivity and faster time to market. The latter gains are achieved mostly by reusing already
modeled and verified macro blocks, and by easing the verification task by allowing the use of a
single golden reference model throughout the design flow. However, implementation efficiency must
not be forgotten if one wants to use model-based design in practice.

Commercially available model-based design tools allow automatic generation of software and hardware
from a single model. The automatically generated software can be optimized by using knowledge of
the target processor architecture and its memory hierarchy. But when it comes to model-based
hardware design and implementation (which essentially entails the generation of a cycle accurate
RTL model from very abstract block level models that have no notion of clock cycles), manual
re-implementation is still often required. This paper attempts to avoid this expensive step, by
using a model-based design paradigm in which the code used to simulate a block in Simulink is
optimized for HW synthesis (and good for SW synthesis as well), as opposed to being tuned only for
SW synthesis.

Simulink provides a rich set of modeling components which range from simple blocks like addition
and subtraction, to complex macro blocks like frequency domain transforms, digital filters, etc.
The computationally complex blocks are normally represented in Simulink as S-functions, whose body
is written in C or Matlab. S-functions can be easily integrated in the Simulink environment because
they implement the core functionality of the block plus a wrapper around it to satisfy the
semantics of the simulation environment. Real Time Workshop (RTW) is a MathWorks tool that
translates a Simulink model composed of S-functions and simple blocks to C code, meant for use on
micro-controllers and DSPs. MathWorks also provides a tool, called HDL Coder, which can translate
Simulink models to RTL for hardware implementation. However, it cannot handle all blocks, and even
the latest version, which can perform a limited amount
of micro-architectural design exploration (e.g. by changing the latency and pipelining level of a
model), does so by modifying the source model, which partially defeats the purpose of model-based
design. Recent work [3][9] has shown that Simulink models can be used for efficient
hardware/software synthesis, design space exploration and rapid prototyping when they are used as
input for high level hardware synthesis tools. Most of these tools take as input a functional
specification written in some variant of C/C++/SystemC. The interface between the Simulink model
and high level synthesis was thus obtained in that work by using the automatic C code generation
capability provided by RTW.

From our experience we observed that the C code generated by RTW can be used efficiently for
hardware synthesis if it involves mostly simple blocks. But when it comes to using complex macro
blocks like the Fast Fourier Transform (FFT), the Discrete Cosine Transform (DCT), etc., the
software-oriented C code that is generated limits the hardware design space that can be explored.
This is because the SW-oriented C code used to model such blocks has a signal flow representation
that inherently limits, as we will argue below, the kind of micro-architectures that can be
explored. In this paper we propose the modeling of such complex macro-blocks still using plain C,
which is essential for smooth integration with Simulink verification and SW generation, but using a
code structure that lends itself to better HW design space exploration as a parameterized
high-level Intellectual Property (IP) block. Figure-1 illustrates in detail the integration of the
FFT IP in the Simulink model-based design flow for high level hardware synthesis.

Figure 1. Complete Model-based design flow enhanced by FFT S-function IP.

2. RELATED WORK
Simulink models have been used for hardware synthesis and hardware/software co-design in the past.
For example, [8] presents a Simulink image processing blockset for hardware/software codesign,
using Xilinx System Generator at the backend for hardware generation. In [2] an ESL design tool is
presented that can do automatic design space exploration starting from behavioral SystemC models.
[1] presents a design flow for mapping a Simulink model to a full Multi-Processor System-on-Chip.
The design flow allows processor and task design space exploration at various abstraction levels,
but does not provide any support for mapping part of the model to dedicated hardware.

In [3] the authors discuss how to configure RTW for generating C code from Simulink models and also
how to wrap the automatically generated code into a SystemC wrapper that can be used for hardware
implementation. They focus on how to tune RTW for generating code that is suited for high level
synthesis and on how to obtain different points in the design space without requiring a deep
understanding of the automatically generated SystemC code. On the other hand, [4] discusses the
design of a register based FFT IP that can be parameterized and used for high level synthesis. The
authors focused on representing the design using advanced C++ templates and some specific data type
libraries that can be used to describe bit accurate hardware. However, these models cannot be used
in Simulink, which cannot use templatized C++ and would have difficulties incorporating external
libraries within S-functions. In contrast, our parameterized IP blocks are modeled in plain C, to
be compatible with Simulink S-function modeling. In [10] the authors compare a SW
and a HW implementation of several applications (2-body JPEG and AES), identifying from where
hardware actually extracts parallelism and execution speed. We exploit this when creating the IP
block. [11] defines a modeling style that allows for efficient HW/SW tradeoffs. We use Simulink as
a well-known industrial implementation of a similar modeling style, based on a dataflow paradigm.
[12] presents a tool named Compaan that automatically transforms a nested loop program written as
Matlab Script or C code into a process network specification. It can extract parallelism from
Matlab and C code, but requires code rewriting to fit the Compaan modeling style (affine array
indices within nested loops without control), while in our case we exploit designer-provided
top-level parallelism among Simulink blocks.

Simulink HDL Coder [5] generates synthesizable RTL code from Simulink models. However, the
generated RTL code has a close 1-to-1 correspondence with the Simulink model, hence different
micro-architectures in hardware require very different Simulink models, thus defeating the
separation between functionality and architecture that is essential for true model-based design.
Moreover, most architectural trade-offs like resource sharing, pipelining and exposing more
parallelism are tedious to model in Simulink. System Generator for DSP [6] is another model-based
hardware design tool that can generate synthesizable RTL for Xilinx FPGAs from Simulink models. It
adds to Simulink specialized block sets, i.e. libraries of components that are used for RTL
modeling, thus resulting in a flow that is very similar to HDL Coder. Due to the strict 1-to-1
mapping between the Simulink model and the final RTL representation, it is difficult to represent
macro blocks like FFTs as parameterizable IPs that can be mapped to different hardware
architectures depending on the application requirements. Altera also offers a tool named DSP
Builder [13] that has more or less the same capabilities and limitations as Xilinx System
Generator.

3. Our IP Modeling strategy
Modeling IP in plain C that still enables HW design space exploration through high level synthesis
can be challenging. Even though high level synthesis can vary several micro-architectural
parameters such as architectural parallelism, loop pipelining, resource sharing and so on, several
of these options are available only when the appropriate modeling style is used in C. For example
in DSP applications the datapath width is decided based on the results of several bit-accurate
simulations (e.g. using the fixed point optimization capabilities of Simulink). However,
representing bit-accurate types in plain C can be tricky. Moreover, DSP applications derive most of
their performance from both architectural level and fine-grained parallelism and pipelining.

We developed a modeling strategy based on a representation which can be easily used to explore
different micro-architectures. It also accurately models different datapath bit widths and
arithmetic overflow/saturation modes in a single C model, which is both compatible with the
S-function modeling style and amenable to efficient HW synthesis. This strategy is illustrated in
this paper for the sake of explanation with an FFT block, but is applicable to a much more general
class of complex Simulink blocks.

3.1 FFT Signal Flow Graph Representations
The Fast Fourier Transform (FFT) is widely used to calculate a Discrete Fourier Transform and its
inverse efficiently. The FFT exploits the symmetry of the calculation and the re-use of already
performed calculations to reduce the implementation complexity. FFT calculations have many signal
flow graph representations which can prove beneficial for targeting different application domains
with different performance requirements under different constraints.

Figure 2. Signal flow graph for radix-2, 8-point in place FFT computations

Figure-2 shows the signal flow graph (SFG) for computing an FFT with 8 samples. The SFG represents
the fully unrolled computations and data dependencies (and thus the full available parallelism)
implied by the C code structure used by RTW for a software-oriented FFT implementation. In this SFG
each node represents a complex operator and each arc represents a complex value. It is called
radix-2 FFT since its basic unit, called butterfly and marked by the dotted box at the top left of
the figure, consumes two input samples to produce two output samples. Constants marked as WN0,
WN1...WN3 are complex exponentials which are known as twiddle factors. Inputs x(0),x(1),…x(7) are
the complex time domain samples of the signal to be transformed and outputs X(0),X(1),…X(7) are the
complex values of the frequency spectrum of the signal. Each butterfly represents the
multiplication of twiddle factors with input samples and then one addition and subtraction to
calculate outputs.
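As an illustration of the in-place radix-2 structure described in Section 3.1, the following plain C sketch computes an 8-point FFT with a bit-reversal step followed by three butterfly stages. It is a generic textbook decimation-in-time loop nest in floating point, not the code actually generated by RTW; the names fft8_in_place, bit_reverse, TW_RE and TW_IM, and the fixed length of 8, are assumptions made only for this example.

```c
#include <assert.h>

#define FFT_N     8   /* hypothetical fixed length, matching the 8-point example */
#define FFT_LOG2N 3   /* log2(FFT_N) */

/* Twiddle factors W_N^k = exp(-j*2*pi*k/N), k = 0..N/2-1, pre-computed for N = 8. */
static const double TW_RE[FFT_N / 2] = { 1.0,  0.70710678118654752,  0.0, -0.70710678118654752 };
static const double TW_IM[FFT_N / 2] = { 0.0, -0.70710678118654752, -1.0, -0.70710678118654752 };

/* Reverse the lowest `bits` bits of v. */
static unsigned bit_reverse(unsigned v, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (v & 1u);
        v >>= 1;
    }
    return r;
}

/* In-place radix-2 FFT: the result overwrites re[] and im[]. */
void fft8_in_place(double re[FFT_N], double im[FFT_N])
{
    assert(FFT_N == (1 << FFT_LOG2N));

    /* Reorder the inputs so that the outputs come out in natural order. */
    for (unsigned i = 0; i < FFT_N; i++) {
        unsigned j = bit_reverse(i, FFT_LOG2N);
        if (j > i) {
            double t = re[i]; re[i] = re[j]; re[j] = t;
            t = im[i]; im[i] = im[j]; im[j] = t;
        }
    }

    /* One iteration of the outer loop per stage (per column of the SFG). */
    for (unsigned span = 1; span < FFT_N; span <<= 1) {
        unsigned tw_step = FFT_N / (2 * span);
        for (unsigned start = 0; start < FFT_N; start += 2 * span) {
            for (unsigned k = 0; k < span; k++) {
                unsigned top = start + k;
                unsigned bot = top + span;
                unsigned t = k * tw_step;
                /* Twiddle multiplication on the lower input. */
                double prod_re = re[bot] * TW_RE[t] - im[bot] * TW_IM[t];
                double prod_im = re[bot] * TW_IM[t] + im[bot] * TW_RE[t];
                /* One subtraction and one addition, written back in place. */
                re[bot] = re[top] - prod_re;
                im[bot] = im[top] - prod_im;
                re[top] += prod_re;
                im[top] += prod_im;
            }
        }
    }
}
```

Each butterfly reads and writes the same two array positions, and the distance between them (span) changes from one stage to the next, so the access pattern differs in every stage.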
The signal flow graph in Figure-2 is called in-place FFT because every butterfly can write outputs
to the same memory from where it has read the inputs. Such a representation is useful for
implementing a resource shared FFT with relatively low throughput requirements targeting low power
applications and with limited on chip memory size and bandwidth. But this kind of signal flow graph
is not well suited when throughput requirements are high and either a pipelined implementation or a
fully unrolled register based (rather than memory-based) implementation is required. For example,
let us assume that in order to increase throughput we unroll the inner loop that performs
butterflies in a stage (a column of Figure-2), and that stage inputs are mapped to registers. After
performing a butterfly computation, the inputs for the next butterflies mapped to the same
multiplier/adder/subtractor resources will come from signal flow graph positions that are different
from the first stage, which in hardware will imply high multiplexing cost and hence will not be
efficient. Similarly, some tools and memory architectures may not efficiently support pipelining of
loops in which computations read and write from the same memory, due to the need to use
multi-ported memories. This, on the other hand, would be easy in software, for which the graph in
Figure-2 works best.

Figure 3. The Signal flow graph for radix-2, 8-point FFT computations

Figure-3 shows another signal flow graph for FFT, which also can be represented in C in the form of
nested loops, but is much more flexible than the one in Figure-2 to derive many possible
implementations using high level synthesis. In particular, it can be mapped to a register-based
unrolled implementation.

The advantage of such an FFT representation with respect to Figure-2 is that the interconnection
network between the stages is the same for all the stages, which results in less multiplexing cost
when a stage is partially or fully unrolled and subsequent stages are implemented by iteration.
Even when a memory-based implementation with more aggressive resource sharing and lower throughput
is required, the signal flow graph in Figure-3 is still more flexible because one can always map
inputs and outputs of a butterfly to two different memories, while still allowing partial unrolling
depending on the memory read/write bandwidth. The signal flow graph in Figure-3 can even be mapped
to a single on-chip memory implementation by utilizing the memory merging capability offered by
high level synthesis tools, which allows one to map two different memories of different length and
width (arrays in C) to a single aggregate memory. In our FFT IP we used the SFG in Figure-3 because
it can offer broader design space exploration as compared to Figure-2, which corresponds to the
default software implementation from Simulink RTW.

3.2 FFT Butterfly Operations
A more detailed representation of the signal flow graph of a radix-2 butterfly is shown in
Figure-4, where values and operators are on real, rather than complex, numbers.

Figure 4. Radix-2 butterfly signal flow graph

Here X0r, X0i and X1r, X1i represent inputs, Y0r, Y0i and Y1r, Y1i represent outputs and WNr, WNi
represent twiddle factors. Superscripts ‘r’ and ‘i’ identify the real and imaginary parts
respectively. For hardware implementation this local signal flow graph must be represented as a
fixed point datapath with precise bit widths. Based on our design flow requirements, the same
representation must be usable for high level synthesis and for simulation in Simulink. In the next
section we discuss how we
modeled fixed point operators with different arithmetic modes to satisfy all these requirements.

3.3 Modeling Arithmetic Operators for HLS
Fixed point operators take two inputs and produce an output, each with a given bit width, location
of the decimal point, and rounding and overflow mode. To correctly perform arithmetic operations
some pre- and post-processing steps such as decimal point alignment, rounding and overflow
management are necessary. We divided our C-based implementation of fixed point arithmetic into
three steps:
• core operation (including alignment),
• rounding,
• overflow management.
Alignment is included in the core operation because it depends on the operation (e.g. for addition
or subtraction inputs need to be aligned, while for multiplication they do not).

Rounding, overflow management and core operation are modeled as generic C functions. For example,
the overflow manager function (implementing wrapping and saturation) is called when any addition,
multiplication or subtraction operation requires an output to be stored in fewer bits than the full
bit width of the core operation result, which can be computed from the input operands. The rounding
function on the other hand (implementing truncation, ceil, floor and round) is called when the
number of bits for the output fractional part is reduced. Finally, functions that are used to model
the core operation take as input the operands with bit width and decimal point information and
produce an output with the desired bit width and decimal point. They call rounding and overflow
automatically as needed.

All these functions are automatically inlined during hardware synthesis, and since the precision,
point position, rounding and overflow selection arguments are constant, the high-level synthesis
tool can perform efficient bit width inference for all needed hardware resources.

Note that this method is only applicable if the total bit width for each input or output (including
temporary outputs before overflow and rounding management) is less than or equal to the maximum
integer size supported by the compiler where Simulink and the high-level synthesis tool are
executed.

Figure 5. Flow chart for addition operation with masked sign extension

Normally high level synthesis tools use data types, like sc_fixed or ac_fixed, which use C++
templates to make arithmetic expression representation and automated conversions and casting easier
and more natural to handle. However, such sophisticated mechanisms are not available in the plain C
which is required by our methodology, since we do not want to use different models for verification
and hardware synthesis, which would require one to re-verify the HW-oriented models to be used by
the high-level synthesis tool. However, we observed that if masking coupled with sign extension is
applied at appropriate places in the datapath, then a high level synthesis tool can identify and
optimize the data widths of the allocated resources even with our plain C representation of
fixed-point data types. Moreover, we are
advocating this style only for frequently re-used IP blocks, where hardware implementation
flexibility and efficiency are more important than ease of modeling within the blocks. Figure-5
represents the flow chart of a fixed point addition operation.

Note that for an FFT, butterflies in the same stage have the same datapath and produce output with
the same fixed point representation. But butterflies in different stages can have different bit
widths, which can be handled either by increasing the width by one at each stage, or by appropriate
rounding. We used the latter technique, since it lends itself to better resource sharing among
stages, and is commonly used in practice. The fixed point parameters passed to the various
arithmetic operators can thus be derived from input bit width, output bit width and length of FFT.

3.4 C model of the FFT IP
In this section we discuss how we represented the FFT as a parameterized C IP block that can be
used for verification with Simulink and for high level synthesis using a C or SystemC-based tool.

Simulink integration is performed by implementing and using several C functions defined by the
S-function APIs that allow the Simulink engine to interact with the block. S-functions can model
discrete, continuous and hybrid blocks and be written in a variety of languages, such as C and
Matlab (but not C++ or SystemC). Here our focus is on modeling discrete blocks in C. Figure-6
illustrates how the Simulink engine interacts with the blocks, both S-function based and native.

Initialization is carried out at the start of simulation by calling a function provided by each
block to initialize state or global variables. Then simulation enters a loop in which it calls two
other block functions to calculate outputs and update states. Simulink provides a unified access to
inputs, outputs and other parameters of an S-function block through a structure called SimStruct.
The members of the structure can be accessed for reading and writing through a set of functions
provided by Simulink.

Figure 6. Interaction of Simulink simulation engine with S-function blocks

Our FFT model is written in plain C, split in different files as shown in Figure-7, in order to
make it more understandable and ready for easy encapsulation in an S-function wrapper or SystemC
wrapper.
• The “params.h” file defines all the constants to model the fixed point datapath, the parameters
  of the FFT computation, such as for example the FFT length, radix, as well as bit width and point
  position for inputs and outputs. It also allows one to switch between fixed point and floating
  point in order to ease fixed point conversion.
• The “globals.h” file defines all the global variables, such as the buffer memories, which should
  be defined as public members when encapsulated in SystemC.
• The “const_members.h” file defines all constant data, such as the twiddle table defining all the
  complex twiddle factors that are required to perform butterflies in different stages. Since these
  differ for every FFT length, the twiddle table must be pre-computed and saved as a C text file
  used to initialize a constant array for any given parameterization of the IP.
• The “hls_types.h” file conditionally defines basic data types as floating point or integer, as
  selected in “params.h”.
• The “private_funs.c” and “private_funs.h” files define and declare all the other functions
  required to implement the FFT, including the top level core wrapper function that iterates over
  butterflies to implement the FFT execution.

Figure 7. IP C code organization in different files

3.5 SystemC wrapper for the FFT IP
When I/O signal names and bit widths and the communication protocol are selected for the FFT IP,
then a SystemC wrapper can be generated for high level synthesis. This wrapper serves two purposes:
it defines the synthesizable RTL interface of the block and it integrates the plain C code in a
class structure, where global variables appear as public members and all constants appear as
constant public static members. It also defines the SystemC thread behavior by appropriately
calling the top-level function defined in “private_funs.c”, as shown in Figure-8. I/O interfaces
are implemented as two separate cycle accurate functions that inherit the I/O data types from
“hls_types.h”. For the FFT IP we used a simple streaming I/O protocol with a handshake signal to
initiate data transfers. Note that generation of this boilerplate code can be easily automated with
a script.
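The split into “params.h”, “hls_types.h” and “private_funs.c”, together with the masked sign extension of Section 3.3, can be condensed into a single-file plain C sketch. This is only an illustration under stated assumptions, not the authors' code: the macro names USE_FIXED_POINT, DATA_BITS and SATURATE, the type data_t and the functions mask_sext, handle_overflow and dp_add are all hypothetical, the 14-bit width is an example value, and rounding and operand alignment are left out because both operands are assumed to share one format.

```c
#include <assert.h>
#include <stdint.h>

/* --- would live in "params.h" (names and values are hypothetical) --- */
#define USE_FIXED_POINT 1   /* 0 selects the floating point reference model */
#define DATA_BITS       14  /* total datapath width, sign bit included      */
#define SATURATE        1   /* 1: saturate on overflow, 0: wrap around      */

/* --- would live in "hls_types.h" --- */
#if USE_FIXED_POINT
typedef int32_t data_t;     /* container type; DATA_BITS must stay below 32 */
#else
typedef double data_t;
#endif

/* --- would live in "private_funs.c" --- */
#if USE_FIXED_POINT
/* Keep the low `bits` bits of v and sign-extend from bit (bits - 1).
 * With a constant `bits`, a synthesis tool can see that only `bits`
 * bits of the result carry information. */
static int32_t mask_sext(int32_t v, int bits)
{
    assert(bits > 1 && bits < 32);
    uint32_t sign = (uint32_t)1 << (bits - 1);
    uint32_t low = (uint32_t)v & ((sign << 1) - 1u);
    return (int32_t)(low ^ sign) - (int32_t)sign;
}

/* Overflow manager: saturate or wrap v into a signed `bits`-bit range. */
static int32_t handle_overflow(int32_t v, int bits)
{
    int32_t max = (int32_t)(((uint32_t)1 << (bits - 1)) - 1u);
    int32_t min = -max - 1;
    if (SATURATE) {
        if (v > max) return max;
        if (v < min) return min;
        return v;
    }
    return mask_sext(v, bits);
}
#endif

/* Addition of two operands that share the same fixed point format. */
data_t dp_add(data_t a, data_t b)
{
#if USE_FIXED_POINT
    return handle_overflow(a + b, DATA_BITS);
#else
    return a + b;
#endif
}
```

Because every width and mode is a compile-time constant, the same source can be compiled for simulation and handed to a synthesis tool; setting USE_FIXED_POINT to 0 turns dp_add into a plain floating point addition.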
Figure 8. SystemC wrapper for IP

3.6 S-function wrapper for the FFT IP
The S-function wrapper can be generated in two ways. First it can be written manually, by following
all the rules described by MathWorks. As an alternative, Simulink provides a graphical environment,
called S-function builder, that can be configured to include all the header and source files to
generate the S-function wrapper automatically. In this case only a few lines of manually written
code are required to describe how the top level function that implements the FFT kernel accesses
the input and output (the FFT block has no state to be carried across executions).

4. Use Cases
To demonstrate the effectiveness of our design methodology for IP blocks to be used in a
model-based hardware design environment based on high level synthesis, we considered two very
different use cases for our FFT example. One is an FFT processor for frequency domain audio
detection and the other is an FFT processor for a GPS acquisition front end. Both use cases were
synthesized using UMC-90nm technology libraries.

Figure 9. Sound triggered multi-modal wireless surveillance network

4.1 Frequency Domain Audio Detector
In sound-triggered wireless security camera applications, a front end audio detector is employed at
the start of the alarm processing chain. Still images and video streams are very expensive to
collect and process, hence the video cameras are only turned on when the low-power sound sub-system
detects an event of interest, as illustrated in Figure-9.

A frequency domain audio detector is considered in this case, as shown in Figure-10. It consists of
two main blocks: “power spectral density estimation” (PSD) and “threshold estimation and signal
detection” that we fully modeled in Simulink.

Figure 10. Front audio detection algorithm block diagram

After code generation from Simulink and profiling, the FFT within PSD estimation is identified as
the power and performance bottleneck, and thus selected for HW implementation. We then compared the
results of high level synthesis when performed on our IP and when performed on the C code
automatically generated by RTW.

In this use case, due to the relatively low frequency of the audio input, performance requirements
are low, namely one 256-sample FFT with 14 bits of precision every 4 msec. Hence the goal is
essentially area and power optimization.

Table-1 compares the implementation results obtained after high level synthesis and RTL level power
estimation of our IP block and of the default RTW C code implementation, also encapsulated in
SystemC. Our (hardware-oriented) IP consumes 45 percent less area and has essentially the same
throughput as the (software-oriented) RTW code. The power consumption is also lower in our case.

One important reason for the area difference is the better bit-width optimizations that are enabled
by our bit-accurate representation of the individual butterflies. Both the data path and the
memories were trimmed to the exact 14 bit width required by the usage scenario, instead of the
default 16 bits used by the RTW implementation (which is limited to C data types).

Table 1. RTW FFT vs. FFT IP for audio detector application

4.2 FFT-based GPS Acquisition
A GPS receiver must be able to capture and demodulate signals transmitted by at least four GPS
satellites. Every satellite convolves its signal with a code that spreads its power over a
relatively large spectrum. When a GPS receiver is turned on, its first task is to identify which
satellites are visible at that time and
place. This means sampling the received signal and figuring out which codes have been used by the
visible satellites for spreading and what is the approximate phase of the (known) spreading
sequence at that time. Because of the relative movement of satellites with respect to the receiver,
there is also a Doppler frequency shift which the receiver is required to estimate.

There are many different techniques for acquisition, including parallel search techniques based on
FFT. The FFT-based GPS acquisition algorithm block diagram is shown in Figure-11. The first block
in the chain performs two tasks. First it down converts data to baseband, and then it averages data
samples to have a lower sampling rate at the output. The next block performs the frequency domain
transformation using the FFT algorithm. Then the data is multiplied with the transform of the
spreading code. This is followed by an inverse FFT and the peak search in the time domain.

Figure 11. Core GPS acquisition algorithm

We modeled this algorithm in the Simulink environment, verifying it with both the native Simulink
FFT block and the S-function FFT IP developed by us. Then we synthesized both FFTs (our IP and the
RTW version) under the very stringent timing constraints required by this usage scenario, namely
one 1024-sample FFT with 4 bits of precision every msec.

Note that the throughput is 16 times higher than in the frequency domain audio detector case (1024
samples per msec versus 256 samples per 4 msec), but the datapath requires only about 1/4 of the
bits of the previous case (4 versus 14).

Table-2 compares the results of these two implementations. Our IP is much better in terms of area
than the RTW version, with essentially the same throughput. Power consumption is also smaller in
our case.

Table 2. RTW FFT vs. FFT IP for GPS acquisition application

We also compare our implementation with a manually optimized RTL that was specifically designed for
FPGA implementation in [7], while our IP model was not specifically tuned for FPGAs but only for
generic HW implementation. The results are reported in Table-3, showing that our implementation is
comparable with a hand optimized RTL. The SRAM requirements are exactly the same, while we are
about 20% worse in terms of area.

Table 3. Hand optimized RTL implementation vs. FFT IP for GPS acquisition application

5. Conclusions and Future Work
In this paper we demonstrated how one can design a parameterized bit-true IP, coded in plain C.
This IP can be integrated with Simulink for modeling, simulation and verification and then can be
easily used for efficient high level synthesis by wrapping it into SystemC. To prove the
effectiveness of our approach we also compared our synthesis results with other implementations
that are based on hand optimized RTL or on C code generated automatically from Simulink models.

In the future we will expand the capabilities of our FFT IP by exploiting other signal flow graph
structures, including for example implementing the FFT stages in pipeline, in order to further
improve the throughput for high-performance real time streaming applications.

6. REFERENCES
[1] Kai Huang; Sang-il Han; Popovici, K.; Brisolara, L.; Guerin, X.; Lei Li; Xiaolang Yan; Soo-Ik
    Chae; Carro, L.; Jerraya, A.A., "Simulink-Based MPSoC Design Flow: Case Study of Motion-JPEG
    and H.264," Design Automation Conference, 2007. DAC '07. 44th ACM/IEEE, pp.39-42, 4-8 June 2007
[2] Haubelt, C.; Schlichter, T.; Keinert, J.; Meredith, M., "SystemCoDesigner: Automatic design
    space exploration and rapid prototyping from behavioral models," Design Automation Conference,
    2008. DAC 2008. 45th ACM/IEEE, pp.580-585, 8-13 June 2008
[3] Butt, S.A.; Lavagno, L., "Model-based rapid prototyping of multirate digital signal processing
    algorithms," NORCHIP, 2011, pp.1-4, 14-15 Nov. 2011
[4] Takach, A., "Creating C++ IP for High Performance Hardware Implementations of FFTs," DesignCon
    2010
[5] Simulink HDL Coder - Generate HDL code from Simulink models and MATLAB code.
    http://www.mathworks.com/products/slhdlcoder
[6] System Generator for DSP http://www.xilinx.com/tools/sysgen.htm
[7] Molino, Andrea; Girau, Gianmarco; Nicola, Mario; Fantino, Maurizio; Pini, Marco, "Evaluation of
    a FFT-Based Acquisition in Real Time Hardware and Software GNSS Receivers," Spread Spectrum
    Techniques and Applications, 2008. ISSSTA '08. IEEE 10th International Symposium on, pp.32-36,
    25-28 Aug. 2008
[8] A.; Suardiaz, J.; Cuenca, S.; Grediaga, A., "Novel Simulink blockset for image processing
    codesign," Electrotechnical Conference, 2006. MELECON 2006. IEEE Mediterranean, pp.117-120,
    16-19 May 2006
[9] Sayyah, P.; Butt, S.A.; Lavagno, L., "Simulink-based hardware/software trade-off analysis
    technique," Applied Electrical Engineering and Computing Technologies (AEECT), 2011 IEEE Jordan
    Conference on, pp.1-7, 6-8 Dec. 2011
[10] Scott, S. and Alessandro, F. September 2008. Where’s the Beef? Why FPGAs Are So Fast.
    Technical Report Microsoft Research Center Redmond.
[11] John, R. W. and Greg, St. 2010. Elastic computing: a framework for transparent, portable, and
    adaptive multi-core heterogeneous computing. SIGPLAN Not. 45, 4 (April 2010), 115-124.
    DOI=10.1145/1755951.1755906 http://doi.acm.org/10.1145/1755951.1755906
[12] Bart, K., Edwin, R. and Ed Deprettere. 2000. Compaan: deriving process networks from Matlab
    for embedded signal processing architectures. In Proceedings of the eighth international
    workshop on Hardware/software codesign (CODES '00). ACM, New York, NY, USA, 13-17.
    DOI=10.1145/334012.334015 http://doi.acm.org/10.1145/334012.334015
[13] http://www.altera.com/products/software/products/dsp/dsp-builder.html
