EFFICIENT DESIGN OF APPLICATION SPECIFIC DSP CORES
USING FPGAs
Sanjay Attri, Electronics & Commn Engg. Deptt., Technical Teachers' Training Institute, Sector-26,
Chandigarh, India. Tel: +91-0172-794349, E-mail: sattri@yahoo.com
B.S. Sohi, Electronics & Commn Engg. Deptt., Technical Teachers' Training Institute, Sector-26,
Chandigarh, India. Tel: +91-0172-791349, E-mail: bssohi@yahoo.com
Y.C. Chopra, Electronics & Commn Engg. Deptt., BBSB Engineering College, Fatehgarh Sahib,
Punjab, India. Tel: +91-0172-791349, E-mail: ycchopra@yahoo.com
0-7803-6677-8/01/$10.00 ©2001 IEEE.

Abstract

This paper focuses on the design of digital signal processing (DSP) cores for compact and efficient implementations of real time DSP applications on field programmable gate arrays (FPGAs) using distributed arithmetic. The resulting serial distributed arithmetic (SDA) and parallel distributed arithmetic (PDA) designs are implemented. As an example, the implementation of a 16-tap, 8-bit finite impulse response (FIR) filter on a Xilinx XC4000E FPGA using SDA and PDA techniques has been examined. An analysis of the performance comparison is described. The results show that PDA designs with a digit size of 2 bits are more efficient in the area-time product parameter than SDA implementations.

1. Introduction

The increasing need to digitally process analog information signals, like audio and video, is causing a major shift in DSP implementation techniques. Since DSP is the mathematical manipulation of these digitized information signals, specialized circuitry is required for efficient signal processing. Moreover, in this era of rapid system design the goals to be met are a reduction in design and verification time, a reduction in the cost of the design and its life cycle, and the capability to insert new innovations into the design. The increasing complexity of modern integrated circuits, coupled with ever decreasing time to market, has pushed design complexity and correctness to such a level that it has necessitated the use of pre-verified functions. A pre-defined and pre-verified complex functional block that is integrated into the designer's logic is called a core. The rapid trend towards sub-micron technologies has brought forth the new concept of System-Level-Integration (SLI). This new approach makes use of a core to save development time while focusing engineering time and energy on those parts of the design that add value and differentiate the product.

2. Core-based design

Core-based designs have several benefits, such as:
• shorter design-cycle times,
• reduced risk, and
• improved performance through higher levels of integration.

Designs based on cores ultimately lead to shorter time-to-market, lower production costs, and improved system profit margins [1].

The advent of large, fast FPGAs with dedicated arithmetic capabilities has created an opportunity to perform flexible, re-programmable DSP functions in dedicated logic, rather than in DSP processors [2], [3]. The advantages of this approach are multiple:
• A higher speed with respect to the general-purpose DSP solution can be obtained.
• Re-configurable systems can be realized using the re-programmability of the FPGAs.
• DSP and ASIC based systems can be fast prototyped, different design options can be emulated, and long simulations can be avoided.
• Off-chip interconnections and external components like FIFOs or RAMs can be integrated in the embedded applications.
• Hard-wired DSP cores can be simplified and optimized for a given application.

FPGAs can implement a DSP function using one of several techniques, depending on the performance required. These techniques can be used to optimize the implementation of many different types of data processing or MAC-based (multiply-accumulate) functions. In areas where the speed of conventional bit-parallel circuits is not needed, techniques based on distributed arithmetic (DA) can be used [4]. Parallel Distributed Arithmetic (PDA) techniques are used to achieve the fastest sample rates, while lower rates can be sustained with Serial Distributed Arithmetic (SDA) techniques that use fewer FPGA resources (i.e. configurable logic blocks (CLBs)). In addition, a serial stream of data matches better with the structure of an FPGA. Thus, in an actual implementation, the speed of a fully serial circuit is not N times lower than that of the equivalent N-bit parallel approach [5]. The primary design concern is the performance, or sample rate, of the filter: the design must work at the desired sample rate without over-consuming the resources. The designs thus obtained can be verified and built into a core for use with various DSP systems.

3. FIR filters

This paper takes up the design of an FIR filter, which is a fit case for being built into a core. A filter is used to remove unwanted portions from a stream of data. FIR filters are common components in many DSP systems and are used to perform signal pre-conditioning, anti-aliasing, band selection, interpolation, low-pass filtering etc. The advantages of the FIR filter include guaranteed stability for all realizable filter coefficient values, absence of overflow oscillations and the ability to implement filters with linear phase response. There are several basic structures of FIR filters such as the canonical, pipelined and inverted forms. In FIR filter applications, arithmetic elements for operations such as addition, multiplication and delay are commonly required [6], [7]. These arithmetic circuits can be designed and implemented using common sub-circuit building blocks.

3.1 16 tap, 8 bit FIR filter design

The response of a K tap FIR filter can be expressed as the following sum of products:

$$ y(n) = \sum_{k=1}^{K} A_k \, x_k(n) $$

where y(n) is the response at time n, x_k(n) is the k-th prior input data at time n and the A_k are the coefficients of the filter. Each term, when expanded, involves only one bit of the input data with all the bits of the coefficients. This allows constructing a look-up table that can be addressed by the same bit of all input variables. This look-up table holds the additive combinations. Figure 1 gives the data flow diagram for a 16-tap, 8-bit FIR filter that is based on distributed arithmetic [8]. The filter consists of the following seven major components:
• a parallel to serial converter,
• a RAM-based shift register,
• a serial adder,
• a Look-Up Table (LUT),
• a complementing register,
• an adder and
• a scaling accumulator.

[Figure 1. Data flow diagram of a 16 Tap, 8 bit FIR Filter]

An 8-bit data sample is loaded into the parallel to serial converter (PSC) at the sample rate. The PSC generates a serial output stream that is supplied to the RAM-based shift registers at the bit clock rate. The bit clock rate is determined by

$$ \text{bit clock rate} = (n + 1) \times \text{sample rate} $$

where (n+1) represents the number of data bits per sample plus an overflow bit.

FIR filters consume a large number of registers (N bits × T taps). This is particularly demanding in any FPGA architecture because the logic in front of the flip-flops is wasted. The XC4000E on-chip RAM significantly reduces the cost of data storage. Using this feature, up to 64 bits of data can be stored in a single CLB that otherwise could only store two bits in its flip-flops. This situation is exploited by using a look-up table (LUT) based approach. By doing this an SDA based MAC gets reduced to the one shown in Figure 2.

This RAM-based shift register approach, rather than a more traditional cascade of data registers, significantly reduces the overall size of the FIR filter. The outputs of the registered serial adders are presented to the lookup tables. Since the coefficients are symmetric, the sums generated by the adders can be multiplied by the same coefficients. Since all possible partial products are pre-computed, the outputs of the serial adders are used to address the lookup tables to generate the appropriate multiplication results. The outputs of the registered lookup tables are summed, except for the sign result that is complemented before being summed. The registered summation is then fed into a scaling accumulator. Figure 2 suggests that the number of MACs for a 16 Tap, 8 Bit FIR Filter should be four. However, the number of MACs reduces by a factor of two if we consider the filter to be symmetrical [9].

[Figure 2. LUT based SDA for a four product MAC]

In this case a 16 Tap, 8 Bit FIR Filter core is designed. The core is then embedded into a Xilinx XC4000E series FPGA chip using the Synopsys FPGA Express software tool [10]. The core is developed to operate at a sampling rate of 5.44 MHz and a clock frequency of 49 MHz. The look-up tables contain coefficients corresponding to a low pass filter which has a cut-off frequency of 2.2 MHz. FPGA Express was used to create two types of implementations. One implementation was done for speed optimization and it gave an estimated clock speed of 53.79 MHz. The other implementation was for silicon area
optimization and it gave an estimated clock speed of 33.48 MHz. These results show the trade-off between silicon area and speed of the implementation. Thus a filter design implemented in an FPGA with SDA gives a significant amount of performance in a modest number of CLBs, i.e. 68. SDA uses the smallest number of CLBs while processing all data samples (TAPS) in parallel.

Parallel Distributed Arithmetic

Parallel Distributed Arithmetic (PDA) is used to increase the overall performance of Serial Distributed Arithmetic. With PDA, the number of bits being processed during each clock cycle is increased. For this, the data words of size W bits are partitioned into digits of size N bits (the digit-size, N, is a divisor of the word-size, W) and are processed serially one digit at a time with the least significant digit first [11]. A complete word is processed in W/N clock cycles and consecutive words follow each other continuously. The time W/N is called the sample period. The digit-serial operators are cascaded following the data-flow algorithm in a pipelined fashion. Hence, a set of PDA architectures can be designed by using different digit-sizes.

Note that increasing the number of bits sampled has a significant effect on the number of CLBs used for the design. Therefore, the number of parallel bits sampled should be increased only to meet the required performance. Increasing the number of bits processed from 1 bit, in the case of SDA, to a 2-bit PDA results in half the number of processing clock cycles. Hence, 2-Bit PDA results in twice the throughput. With 2-bit PDA, the serial shift registers, referenced in the discussion on SDA, are each replaced with two similar 1-bit shift registers at half the bit depth. The two parallel shift registers are split, such that one stores the even bits and the other stores the odd bits. The 2-bit parallel data samples require twice the number of LUTs. There is also the addition of a 1-bit scaling adder, required to add the two partial sums that result from each of the two parallel sample bits. The scaling accumulator's input bus is expanded to accommodate the larger partial sum and the final scaling accumulator is changed from a 1-bit to a 2-bit shift for scaling. A four product MAC with a digit-size of two is shown in Figure 3.

[Figure 3. LUT based four product MAC for PDA of digit size two]

These changes in MAC structure essentially double the resources required compared to that of the SDA design. Thus, it is possible to choose the digit-size that best suits the speed of the application while minimizing the cost in terms of area [12].

The performance for the 16-Tap FIR filter example, implemented with a PDA algorithm having a digit size of 2, resulted in a sample rate of 10.88 MHz, in 130 CLBs. Thus a filter design implemented in an FPGA with 2-Bit PDA results in twice the performance and about twice the number of CLBs compared to the
same function with SDA. 2-Bit PDA uses more CLBs than SDA, while still processing all data samples (TAPS) in parallel, at twice the SDA data sample rate. The number of bits being processed during each clock cycle can be increased until a PDA implementation with digit size n is reached, for n-bit data samples. When the design is an n-Bit PDA, the sample data rate is at a maximum.

With PDA, each additional parallel bit requires an additional level of scaling (by powers of 2) and summation for each partial product pair of bits. The LUTs for SDA and PDA can always be the same for any given 4-MAC block, regardless of the number of bits in the sample data. This is true for PDA only if common bit-weighted sample inputs are used to address the LUT.

It may be necessary to tune a filter in a system, or even have multiple filter settings. Here the SRAM technology of the XC4000E can be exploited by reconfiguring the parameterized part.

The changes to the filter lie in the coefficients, with the actual structure of the design remaining unchanged. These coefficients are stored as partial products. If it is desired to change the filter characteristics, this can be achieved by simply altering the VHDL file that contains the coefficients for the desired filter.

Conclusion

A study of SDA and PDA FIR filters with programmable coefficients is presented. The design methodology using each of these structures is detailed and finally the results of their implementation in a Xilinx XC4000E FPGA are given. The results of 2-bit PDA are more efficient in the area-time product parameter. The throughput achieved lets the filters be used in applications where the sample rate range goes up to 6 MHz using the bit-serial approach optimized for speed and more than 50 MHz with the bit-parallel one. All filters have been automatically implemented using Synopsys Workview Office tools. Further, hand optimisation is known to yield still better results in most of these cases.

In order to tune a filter in a system, or even have multiple filter settings, the SRAM technology of the XC4000E can be exploited by reconfiguring the parameterized part.

References

1. Tiong Jiu Ding, John V. McCanny, Fellow, IEEE, and Yi Hu, "Rapid Design of Application Specific FFT cores", IEEE Signal Processing Transactions, Vol 47, No 5, pp 1371-1390, May 1999.
2. Altera Corporation, USA, Conference paper, "Improving Fixed-point DSP Processor System Performance with PLDs as a DSP Processor".
3. Xilinx Inc., "The Programmable Logic Data Book", 1999.
4. S.A. White, "Applications of distributed arithmetic to digital signal processing", IEEE ASSP Magazine, Vol. 6, no. 3, pp 4-19, July 1989.
5. Atmel Corporation, USA, Application Note, "FPGA-Based Signal Processing Using Bit Serial Digital Signal Processing", September 1999.
6. Actel Corporation, USA, Application note, "Designing FIR Filters with Actel FPGAs".
7. Mintzer, L., "FIR Filters with Xilinx FPGA", FPGA92 ACM/SIGDA workshop on FPGAs, pp 129-134.
8. Gregory Ray Goslin, Program manager, Xilinx Corporation, Application notes, "Using Xilinx FPGAs to design Custom Digital Signal Processing", Nov. 2000.
9. Lucent Technologies, Application note, "Parameterized FIR Filters In ORCA Field Programmable Gate Arrays", September 1996.
10. User's Manual, Workview Office Software tool to design customized digital circuits on FPGAs.
11. Javier Valls, Marcos M. Peiro, Trini Sansaloni, Eduardo Boemo, "A Study About FPGA-Based Digital Filters", Proc. 1998 IEEE SIPS, pp. 191-201, Boston, Oct. 1998.
12. Jean-Michel Raczinski, Stephane Sladek, Luc Chevalier, "Filter Implementation on SYNTHUP", Proceedings of the 2nd COST G-6 Workshop on Digital Audio Effects (DAFx99), NTNU, Trondheim, December 9-11, 1999.
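
The look-up-table formulation of Section 3.1 and the digit-serial processing described under Parallel Distributed Arithmetic can be illustrated with a short software model. The Python sketch below is not the authors' VHDL core; it is a minimal behavioural model under stated assumptions. The function names (`build_lut`, `da_fir_output`), the grouping of taps into 4-input LUTs, and the use of left shifts in place of the hardware scaling accumulator's right shifts are all choices made for illustration. The symmetric pre-adder and its extra overflow bit are omitted, so the model counts W/N clocks per sample rather than n + 1.

```python
def build_lut(coeffs):
    """Return the table of all additive combinations of `coeffs`.

    Entry `addr` holds the sum of the coefficients whose bit is set in `addr`.
    """
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << len(coeffs))
    ]


def da_fir_output(samples, coeffs, width=8, digit=1, group=4):
    """Compute one FIR output sum(A_k * x_k) by distributed arithmetic.

    samples : the K most recent inputs, signed integers that fit in `width` bits
    coeffs  : the K integer coefficients
    digit   : bits processed per clock (1 models SDA, 2 models 2-bit PDA)
    group   : taps per LUT (hypothetical choice; 4 matches a four product MAC)
    Returns (result, clocks_used).
    """
    assert len(samples) == len(coeffs) and width % digit == 0
    mask = (1 << width) - 1
    words = [s & mask for s in samples]  # two's-complement bit patterns
    luts = [build_lut(coeffs[g:g + group]) for g in range(0, len(coeffs), group)]

    acc = 0
    clocks = 0
    for base in range(0, width, digit):      # least significant digit first
        digit_sum = 0
        for j in range(digit):               # bits handled in the same clock
            bit = base + j
            partial = 0
            for g, lut in enumerate(luts):
                taps = words[g * group:(g + 1) * group]
                # The same bit position of every tap forms the LUT address.
                addr = sum(((w >> bit) & 1) << i for i, w in enumerate(taps))
                partial += lut[addr]
            if bit == width - 1:
                partial = -partial           # sign bit has negative weight
            digit_sum += partial << j        # scaling adder inside the digit
        acc += digit_sum << base             # scaling accumulator
        clocks += 1
    return acc, clocks


if __name__ == "__main__":
    half = [3, -5, 12, 20, -7, 9, 31, 40]
    coeffs = half + half[::-1]               # symmetric 16-tap example values
    samples = [17, -128, 127, -1, 0, 64, -64, 5, 99, -99, 1, 2, -3, 4, -5, 6]
    direct = sum(a * x for a, x in zip(coeffs, samples))
    print(direct, da_fir_output(samples, coeffs, digit=1))
    print(direct, da_fir_output(samples, coeffs, digit=2))
```

With `digit=1` the model takes eight clocks per 8-bit sample and with `digit=2` it takes four, which mirrors the halving of processing clock cycles described for 2-bit PDA; both settings read the same LUT contents and return the same value as the direct sum of products.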
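
The reported figures (68 CLBs at 5.44 MHz for SDA, 130 CLBs at 10.88 MHz for 2-bit PDA, and a 49 MHz clock for 8-bit samples) can be checked with a few lines of arithmetic. The paper does not spell out how the area-time product is computed, so the sketch below assumes area is the CLB count and time is the sample period (the reciprocal of the sample rate); the function names are illustrative only.

```python
def bit_clock_rate(sample_rate_mhz, data_bits):
    """Bit clock = (n + 1) * sample rate: n data bits plus one overflow bit."""
    return (data_bits + 1) * sample_rate_mhz


def area_time_product(clbs, sample_rate_mhz):
    """Assumed metric: CLB count times sample period, in CLB-microseconds."""
    return clbs / sample_rate_mhz


if __name__ == "__main__":
    # Figures quoted in the paper for the 16-tap, 8-bit filter on the XC4000E.
    sda = area_time_product(68, 5.44)
    pda = area_time_product(130, 10.88)
    print(f"SDA bit clock : {bit_clock_rate(5.44, 8):.2f} MHz")
    print(f"SDA area-time : {sda:.2f} CLB-us")
    print(f"PDA area-time : {pda:.2f} CLB-us")
    print("2-bit PDA is lower" if pda < sda else "SDA is lower")
```

Under this assumed definition the 2-bit PDA design comes out slightly lower (about 11.95 against 12.50 CLB-microseconds), which agrees with the paper's statement that 2-bit PDA is more efficient in the area-time product, and the bit clock of 48.96 MHz matches the quoted 49 MHz.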
