A Low-Power Reconfigurable DSP System

Marlene Wan, Martin Benes, Arthur Abnous, Jan Rabaey
EECS Department, University of California, Berkeley
Abstract
Reconfigurable architectures have emerged as a promising implementation platform that provides
flexibility, high performance and low power for future wireless embedded devices. We discuss in
detail a reconfigurable architecture template and a set of software tools that perform automatic
mapping and performance prediction from algorithm to architecture. We present results on
digital signal processing and wireless communication algorithms to show the effectiveness of the
system in achieving energy efficiency.
1. Motivation and Background
Future wireless multimedia computing devices are required to adapt their functionality to the
changing parameters of the communication link available at a given time (e.g., bandwidth, error
rates, protocols). Therefore, these devices have to be flexible enough to accommodate various
multimedia services (e.g., different video decompression schemes) and communication
capabilities (e.g., cellular GSM, PCS, pico-cellular). At the same time, low power consumption
will continue to be the predominant design challenge of wireless systems. Reconfigurable
architectures have emerged as a promising implementation platform that provides flexibility,
high performance [ref] and low power [ref] [ref] for future wireless embedded devices. In
[Abnous96], a reconfigurable architecture template is proposed to meet both the flexibility and
the low-power requirements. In this paper, we introduce a realization of this architecture
template (in particular, its model of computation and basic processing elements for data-flow
computations) and supporting software to assist direct implementations on the architecture. The
shaded box in Figure 1 shows the scope of this paper. The energy efficiency of the proposed
realization is then demonstrated by mapping wireless communication and signal processing
algorithms to the architecture.
[Figure 1 diagram; labels: Algorithm, Kernel* Computation, Hardware Components, Architecture
Description, Mapping, Estimation, Optimization, "To architecture selection", Reconfigurable
Architecture Implementation]
*Kernel: computationally intensive operations, often corresponding to dataflow computations in nested loops.
Figure 1.
2. Architecture Description
The basic idea behind the proposed architecture is illustrated in Figure 1. Control-flow
computation is performed on the microprocessor and dataflow computation is performed on the
satellites. The architecture template fixes the communication scheme between satellites as well
as the interface between the microprocessor and the satellites. Communication between
satellites is data-flow driven, and each satellite follows strict execution (i.e., an operation
starts only when all of its input data are ready). Dedicated links are established between satellites.
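The strict execution rule above can be illustrated with a minimal sketch. The `Satellite` class, its port-indexed FIFOs and the adder example are hypothetical names chosen for illustration; they are not part of the architecture described here.

```python
from collections import deque


class Satellite:
    """Hypothetical model of a satellite with strict (all-inputs-ready) firing."""

    def __init__(self, name, num_inputs, operation):
        self.name = name
        self.operation = operation
        # One FIFO per dedicated input link (an assumption of this sketch).
        self.inputs = [deque() for _ in range(num_inputs)]

    def push(self, port, value):
        """Deliver one datum on the given input link."""
        self.inputs[port].append(value)

    def can_fire(self):
        # Strict execution: every input link must hold at least one datum.
        return all(self.inputs)

    def fire(self):
        """Run the operation once if all inputs are ready, else return None."""
        if not self.can_fire():
            return None
        operands = [fifo.popleft() for fifo in self.inputs]
        return self.operation(*operands)


adder = Satellite("add", 2, lambda a, b: a + b)
adder.push(0, 3)
first_attempt = adder.fire()   # only one operand present, so nothing fires
adder.push(1, 4)
second_attempt = adder.fire()  # both operands present, so the satellite fires
```

In this sketch the first call returns `None` because one input link is still empty, while the second call consumes one datum from each link and produces the sum.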
In the current realization of the architecture, the satellites are medium to fine grain according to
the definition of [Bart]. The functionality of the satellites is divided into three categories: source,
computation and memory. To support adaptive computations without reconfiguration, such as
changing the vector length or the number of taps for the computation satellites, a minimum-overhead
mechanism for passing data structures (scalar, vector and matrix) has been developed. Each
computation satellite needs to be configured for the data structures it consumes and produces
(vectors to scalar for a MAC, for example). The source satellites generate tokens indicating the
end of a data structure along with the corresponding data.
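As a rough illustration of how end-of-structure tokens let a computation satellite handle different vector lengths without reconfiguration, consider the following sketch. The `END_OF_VECTOR` marker and the generator-based `vector_source` and `mac_satellite` functions are assumptions made for this example, not the actual token encoding.

```python
END_OF_VECTOR = object()  # hypothetical marker for the end of a data structure


def vector_source(vector):
    """Source satellite: emit each element, then a single end-of-vector token."""
    for element in vector:
        yield element
    yield END_OF_VECTOR


def mac_satellite(stream_a, stream_b):
    """MAC satellite: consume two vector streams, produce one scalar per vector pair."""
    accumulator = 0
    for a, b in zip(stream_a, stream_b):
        if a is END_OF_VECTOR or b is END_OF_VECTOR:
            # The token, not a configured length, ends the accumulation.
            yield accumulator
            accumulator = 0
        else:
            accumulator += a * b


# The same MAC code handles vectors of length 3 and length 5.
short = list(mac_satellite(vector_source([1, 2, 3]), vector_source([4, 5, 6])))
longer = list(mac_satellite(vector_source([1] * 5), vector_source([2] * 5)))
```

The MAC here is only told that it consumes vectors and produces a scalar; the vector length is carried by the token stream itself.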
Besides the dedicated links between satellites, the architecture uses data steering elements,
which fall into three categories: static (data goes in a fixed direction between reconfiguration
periods), statically scheduled (data goes in directions instructed by programs configured at
reconfiguration time) and dynamically determined (data carries its own direction). The first two
are supported by the current realization of the architecture template.
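The difference between the two supported steering categories can be sketched as follows. The function names, the port numbering and the choice to replay the schedule cyclically are assumptions of this example; the text does not specify how a steering program is encoded.

```python
from itertools import cycle


def static_steer(data, output_port):
    """Static steering: every datum goes to the one configured port."""
    return [(output_port, datum) for datum in data]


def scheduled_steer(data, schedule):
    """Statically scheduled steering: a port program fixed at configuration
    time is replayed (cyclically, by assumption) regardless of data values."""
    return [(port, datum) for port, datum in zip(cycle(schedule), data)]


static_routes = static_steer([10, 11, 12], output_port=0)
scheduled_routes = scheduled_steer([10, 11, 12, 13], schedule=[0, 1])
```

In both cases the direction is fixed before the data arrives. A dynamically determined element would instead read the direction from the datum itself.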
The current implementation of the data-driven computation scheme uses globally asynchronous,
locally synchronous clocking. An address generator, an inport (with data from the
microprocessor) and an FPGA can serve as sources. The interconnect is reconfigurable [Zhang].
3. The Software Tools
To supply fast implementation feedback to the user, tools have been developed to support
application-specific simulation and direct-mapped synthesis from a high-level language to the
satellites.
3.A. Simulation Tool
Based on the realization of the architecture template, a simulation environment has been
developed to provide application-specific simulators in a style similar to [Bart].
Since computation is mapped to clusters of satellites, an object-oriented intermediate form based
on the concept of modules (heterogeneous satellites) and queues (links between satellites) was
created. A mapped kernel is constructed by building a netlist using the module and queue library
(Figure 1). To facilitate verification and performance feedback, wrappers are placed
around all modules and queues so that modules can be modeled as concurrent processes and queues
as synchronized objects. Energy and time stamps are also associated with each module and queue
so that performance data can be collected. An application-specific simulator is automatically
instantiated once a netlist is specified.
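The module-and-queue intermediate form can be approximated by the sketch below. It uses Python's standard `threading` and `queue` modules as a stand-in for the thread library used by the actual tool. The `Link` and `Module` classes, the `STOP` sentinel and all energy and cycle figures are hypothetical and chosen only to make the bookkeeping visible.

```python
import queue
import threading

STOP = object()  # hypothetical end-of-simulation sentinel


class Link:
    """Queue wrapper: a synchronized object that accumulates transfer energy."""

    def __init__(self, energy_per_transfer):
        self._fifo = queue.Queue()
        self.energy_per_transfer = energy_per_transfer
        self.energy = 0.0

    def put(self, item):
        if item is not STOP:  # the sentinel is not real data, so it costs nothing
            self.energy += self.energy_per_transfer
        self._fifo.put(item)

    def get(self):
        return self._fifo.get()  # blocks until an item is available


class Module(threading.Thread):
    """Module wrapper: a concurrent process carrying energy and time stamps."""

    def __init__(self, func, inp, out, energy_per_op, cycles_per_op):
        super().__init__()
        self.func, self.inp, self.out = func, inp, out
        self.energy_per_op, self.cycles_per_op = energy_per_op, cycles_per_op
        self.energy = 0.0
        self.time = 0

    def run(self):
        while True:
            item = self.inp.get()
            if item is STOP:
                self.out.put(STOP)  # forward the sentinel downstream and exit
                return
            self.out.put(self.func(item))
            self.energy += self.energy_per_op
            self.time += self.cycles_per_op


def simulate(samples):
    """Build a two-module netlist, run it, and collect outputs and statistics."""
    links = [Link(0.5) for _ in range(3)]
    modules = [
        Module(lambda x: 2 * x, links[0], links[1], energy_per_op=2.0, cycles_per_op=2),
        Module(lambda x: x + 1, links[1], links[2], energy_per_op=1.0, cycles_per_op=1),
    ]
    for module in modules:
        module.start()
    for sample in samples:
        links[0].put(sample)
    links[0].put(STOP)
    for module in modules:
        module.join()
    outputs = []
    while True:
        item = links[2].get()
        if item is STOP:
            break
        outputs.append(item)
    total_energy = sum(m.energy for m in modules) + sum(l.energy for l in links)
    return outputs, total_energy, [m.time for m in modules]


outputs, total_energy, module_times = simulate([1, 2, 3])
```

Each link has a single producer in this netlist, so the energy counters need no extra locking. A netlist with shared links would have to protect them.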
Currently, the intermediate form is implemented in C++ with the Solaris thread
library [26] (other common thread libraries can be switched in easily). Common satellite
processors (such as the MAC/multiply processor, ALU processor, memory and address generator)
have been incorporated into our module library.
3.B. Synthesis Tool
To ease the process of manually mapping algorithms to the architecture, a synthesis tool is
provided to translate an algorithm (specified in a subset of C) into a direct-mapped
implementation on the architecture. The output is the computation specified in the intermediate
form, from which the kernel performance and energy can then be collected dynamically. For
algorithms whose loops have constant loop lengths, energy and performance information is also
analyzed statically to avoid the overhead of simulation.
The algorithm is compiled to the SUIF intermediate form and then converted to a hierarchical
Control Data Flow Graph (CDFG [Hyper]). The current conversion from SUIF to CDFG exposes all
scalar dependencies but preserves all RAW, WAR and WAW dependencies in array accesses. The
current mapping allocates arrays of the same name to a particular memory and each operation
node in the CDFG to a hardware unit.
The data steering elements and address generators are generated from the nested loops.
Static performance estimation is also performed for loops with known loop lengths.
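For loops with known loop lengths, a static estimate can in principle be obtained by multiplying the iteration count by per-iteration costs. The sketch below assumes a hypothetical cost table in arbitrary units, and assumes that the direct-mapped units run in parallel so that one iteration takes as long as its slowest unit. Neither the numbers nor the cycle model come from the actual tool.

```python
from math import prod

# Hypothetical per-operation costs: (energy in arbitrary units, cycles).
COSTS = {"mac": (4, 1), "mem_read": (2, 1), "addr_gen": (1, 1)}


def static_estimate(trip_counts, ops_per_iteration, costs=COSTS):
    """Estimate energy and cycles for a loop nest with constant trip counts.

    trip_counts: loop lengths from outermost to innermost loop.
    ops_per_iteration: mapping from operation name to its count in the loop body.
    """
    iterations = prod(trip_counts)
    energy_per_iteration = sum(
        costs[op][0] * count for op, count in ops_per_iteration.items()
    )
    # Assumption: units operate in parallel, so the slowest unit sets the pace.
    cycles_per_iteration = max(costs[op][1] for op in ops_per_iteration)
    return iterations * energy_per_iteration, iterations * cycles_per_iteration


# Example: a 16-tap FIR filter over 64 output samples.
fir_energy, fir_cycles = static_estimate(
    trip_counts=(64, 16),
    ops_per_iteration={"mac": 1, "mem_read": 2, "addr_gen": 2},
)
```

Because the trip counts are constants, the estimate needs no simulation run. Loops with data-dependent bounds would still have to be measured dynamically.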
3.C. Orthogonalization as an example

4. Case Studies
All satellite modules have been characterized. The interconnect is characterized in [Zhang98].
A preliminary overhead for the steering elements is added. The resulting estimates expose the
low-energy features of the system, allow architecture selection and serve as the basis of
future optimizations.
4.A. Multiuser Detection Channel Estimator
Synthesis and performance are determined statically and verified dynamically using the simulator.
Architecture    Power (mW)
TMS320C54x      460 * [ref][ref]
Pleiades        18.04
ASIC            3 [ref]
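The relative power figures implied by the table can be worked out directly. The snippet below only restates the three table values and divides them; the conditions under which each figure was obtained are not given here, so the ratios are indicative only.

```python
# Power figures in mW, copied from the table above.
power_mw = {"TMS320C54x": 460.0, "Pleiades": 18.04, "ASIC": 3.0}

# How many times more power the DSP draws than the reconfigurable system.
dsp_over_pleiades = power_mw["TMS320C54x"] / power_mw["Pleiades"]

# How many times more power the reconfigurable system draws than the ASIC.
pleiades_over_asic = power_mw["Pleiades"] / power_mw["ASIC"]
```

On these numbers the reconfigurable implementation draws roughly 25 times less power than the programmable DSP and roughly 6 times more than the ASIC.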
4.B. VSELP Speech CODEC
All kernels have been synthesized and simulated, and their performance has been gathered. The
kernels are Dot_product, FIR, IIR, VectorSumScalarMul, Compute_Code and Covariance_Matrix_Compute.
5. Conclusion
We have presented a low-power reconfigurable multiprocessor system. Future work will include
software-level transformations (loop transformations and parallelism), implementation
optimization and more application mappings in the wireless communication domain.
6. References
• G. R. Goslin, “A Guide to Using Field Programmable Gate Arrays for Application Specific
Digital Signal Processing Performance,” Proceedings of SPIE, vol. 2914, pp. 321-331.
• Abnous et al., “Evaluation of a Low-Power Reconfigurable DSP Architecture,” Proceedings
of the Reconfigurable Architecture Workshop, Orlando, Florida, USA, March 1998.
• M. Goel and N. R. Shanbhag, “Low-Power Reconfigurable Signal Processing via Dynamic
Algorithm Transformations (DAT),” Proceedings of Asilomar Conference on Signals,
Systems and Computers, Pacific Grove, CA, November 1998.
• Gerson and M. Jasiuk, “Vector Sum Excited Linear Prediction (VSELP) Speech Coding at
8 Kbps,” Proceedings of the International Conference on Acoustics, Speech, and Signal
Processing, pp. 461-464, April 1990.
• K. Ueda et al., “Multimedia Complex on a Chip,” ISSCC Digest of Technical Papers, pp.
28-29, February 1993.
