
CHAPTER 4

RECONFIGURABLE FIR FILTER ARCHITECTURE USING OPTIMIZED MULTIPLIERS

In mobile communication systems and multimedia applications, the need for efficient reconfigurable digital finite impulse response (FIR) filters has been increasing tremendously because of their advantages of less area, low cost, low power and high speed of operation. This chapter presents a near-optimum, low-complexity, reconfigurable digital FIR filter architecture based on computation sharing multipliers (CSHM), the constant shift method (CSM) and a modified binary based common sub-expression elimination (BCSE) method for filter coefficients of different word lengths. The CSHM identifies common computation steps and reuses them for different multiplications. The proposed reconfigurable FIR filter architecture reduces the adder cost and operates at high speed for low-complexity reconfigurable filtering applications such as channelization, channel equalization, matched filtering, pulse shaping, video convolution functions, signal preconditioning, and various other communication applications. The proposed architecture has been implemented and tested on a Virtex 2 xc2vp2-6fg256 field-programmable gate array (FPGA) with filter coefficient precisions of 8 bits, 12 bits, and 16 bits. The proposed reconfigurable FIR filter architecture, which uses a dynamically reconfigurable multiplier block, offers good area and speed improvement compared to existing reconfigurable FIR filter implementations.

4.1 Introduction

The finite impulse response (FIR) digital filter is a fundamental element of digital signal processing (DSP) systems. Generally, digital FIR filters are used in mobile communication systems and multimedia applications such as channelization, channel equalization, matched filtering, pulse shaping, video convolution functions, signal preconditioning and various communication applications because of their absolute stability and linear phase properties. The disadvantage of digital FIR filters is that they involve a lot of computation to process a signal. The implementation cost and power consumption are also high because of this computational complexity. The complexity of an FIR filter is dictated by the complexity of its coefficient multipliers. The multipliers are the most expensive blocks in terms of area, delay, and power in an FIR filter structure. As shifts are less expensive in terms of hardware implementation, the design problem can be defined as the minimization of the number of addition/subtraction operations needed to implement the coefficient multiplications. The complexity of the multiplier block (MB) in an FIR filter is reduced if it is implemented with shift-and-add operations that share common sub-expressions.

Software defined radio (SDR) technology is a very active field in the area of FIR filter research which focuses on reconfigurable realizations. Low-complexity FIR filter coefficient multipliers are obtained using the common sub-expression elimination (CSE) method based on canonical signed digit (CSD) coefficients [27]. The aim of the CSE method is to search for instances of identical expressions that are present in the CSD representation of the coefficients and to eliminate these redundant multiplications. The CSD representation uses three symbols in each power-of-two position: -1, 0 and 1. The value represented is the sum of the positional coefficients times the corresponding powers of 2. In [70], a modification of the method in [27] is proposed for identifying and eliminating the redundant computations. In [72], the technique discussed in [27] is modified to minimize the logic depth (LD) and thereby improve the speed of operation. LD is defined as the number of adder-steps in a maximal path of decomposed multiplications. Rather than CSD, the binary based common sub-expression elimination (BCSE) method is used in [44] to improve the reduction of adders and thus achieve a lower-complexity FIR filter compared to [27, 70, 72]. In order to reduce the complexity of the filter, the filter coefficients in [96] are encoded using a pseudo random floating point method; however, it is limited to filter lengths less than 40. The methods in [27, 70, 72, 96] are only suitable for fixed logic filters where the coefficients are fixed. In chapter 3, we proposed a reconfigurable field programmable gate array (FPGA) based hardware accelerator, a balanced shared memory architecture for FIR filtering and time variant discrete Fourier transform (TVDFT), which provides reconfigurability. In this chapter, we propose some modifications to that architecture and additionally propose a low-complexity multiplier.

Several reconfigurable FIR filters have been proposed by researchers and are discussed in detail in [9, 11, 18, 45, 62, 69, 105]. The reconfigurable MB in [62, 69] consists of a pre-computer, select units and a final adder. The pre-computer performs the multiplication of alphabets with the input x. In [69], the method proposed in [62] is modified, and efficient circuit level techniques such as a new carry select adder and a conditional capture flip-flop are used to enhance its power and performance. The architectures in [9, 11, 18, 62, 69, 93, 105] are appropriate only for relatively lower order filters and are not suitable for channel filters in communication receivers. In [45], common DSP operations such as filtering and matrix multiplication are identified and expressed as vector scaling operations, and ripple carry adders are used. The presence of multiplexers gives the option of adaptive computing for the method proposed in [45]. The BCSE method proposed in [43] provides improved adder reductions leading to low-complexity FIR filters. Reconfigurability of the FIR filters is not considered in [43], though. The concept of reconfigurable multiplier blocks (ReMB) is proposed in [18]. The ReMB generates all the coefficient products and a multiplexer selects the required ones depending on the input. The redundancy can be reduced by pushing the multiplexer deep into the MB design. The resulting specialized multiplier design can be more efficient in terms of area and computational complexity compared to the general purpose multiplier plus the coefficient store [18]. But the area, power and speed of the ReMB depend on the filter length, making it inappropriate for higher order FIR filters.
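The CSD representation described above can be illustrated with a minimal sketch. The function name `to_csd` is hypothetical and the routine uses the standard non-adjacent-form recoding for non-negative integers; it is not taken from the cited methods, and only shows why CSD uses the digits -1, 0 and 1.

```python
def to_csd(value):
    """Return the CSD digits of a non-negative integer, least significant first."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    digits = []
    while value:
        if value & 1:
            # +1 when value % 4 == 1, -1 when value % 4 == 3
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def from_digits(digits):
    """Sum of positional digits times the corresponding power of two."""
    return sum(d * 2 ** i for i, d in enumerate(digits))


print(to_csd(7))  # 7 = 8 - 1, so only two nonzero digits are needed
```

Each nonzero digit costs one addition or subtraction in a shift-and-add multiplier, which is why the number of nonzero digits matters for the adder count.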

The proposed reconfigurable FIR filter architecture (processing element architecture) is shown in figure 4.1. This architecture is a modification of the reconfigurable FPGA based hardware accelerator for embedded DSP that we proposed in chapter 3. It is useful for any CSE method with appropriate modifications. Simple number decomposition strategies are identified to perform vector scaling. The idea is to pre-compute the values 0x, 2x, 4x, 6x, 8x, 10x, 12x and 14x, where x is the input signal, and then reuse these pre-computations efficiently using multiplexers. In this chapter, the bit length of the input signal x is taken as 4 bits and the coefficients are taken as 8-bit, 12-bit and 16-bit values. The presence of multiplexers gives rise to the option of adaptive computing. Thus computation sharing multipliers (CSHM) can be used to realize an efficient, low-complexity FIR filter design. The multiplications in the CSHM FIR filtering operation are viewed as a combination of add and shift operations over common computation results. The proposed method also uses the constant shift method (CSM) to achieve faster shifting. The CSHM, CSM, and modified BCSE methods achieve high performance, low area, low power and high speed in the FIR filter implementation. The proposed low-complexity reconfigurable FIR filter architecture exploits a high level of concurrency using pipelining and parallel processing. For both lower and higher order filters, the architecture offers the best trade-off between speed and area.

This chapter is organized as follows. In Section 4.2, the modified BCSE method is presented. Section 4.3 presents the proposed reconfigurable FIR filter architecture with programmable coefficients. Section 4.4 shows the synthesis results. The synthesis and design results of the proposed architecture are compared with existing architectures and are discussed in Section 4.5. Finally, conclusions are drawn in Section 4.6.

4.2 Modified BCSE Method

This section presents the modified BCSE algorithm, a technique which eliminates redundant binary common sub-expressions (BCSs) that occur within the coefficients. Computational complexity is reduced by reusing the most common binary bit patterns present in the coefficients. In general, an n-bit binary number can represent $2^n - (n+1)$ BCSs [45]. For example, a 4-bit binary representation can form 11 BCSs, which are [0 0 1 1], [0 1 0 1], [0 1 1 0], [0 1 1 1], [1 0 0 1], [1 0 1 0], [1 0 1 1], [1 1 0 0], [1 1 0 1], [1 1 1 0] and [1 1 1 1]. Note that other patterns such as [0 0 0 1], [0 0 1 0], [0 1 0 0], and [1 0 0 0] do not require any adders for implementation. These BCSs can be expressed as:

[0 0 1 1] = $x_1 = x + 2^1 x$
[0 1 0 1] = $x_2 = x + 2^2 x$
[0 1 1 0] = $x_3 = 2^1 x + 2^2 x$
[0 1 1 1] = $x_4 = x + 2^1 x + 2^2 x$
[1 0 0 1] = $x_5 = x + 2^3 x$
[1 0 1 0] = $x_6 = 2^1 x + 2^3 x$
[1 0 1 1] = $x_7 = x + 2^1 x + 2^3 x$
[1 1 0 0] = $x_8 = 2^2 x + 2^3 x$
[1 1 0 1] = $x_9 = x + 2^2 x + 2^3 x$
[1 1 1 0] = $x_{10} = 2^1 x + 2^2 x + 2^3 x$
[1 1 1 1] = $x_{11} = x + 2^1 x + 2^2 x + 2^3 x$
where x is the input signal.
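The count formula and the eleven expressions above can be checked with a small enumeration. This is a minimal sketch with hypothetical function names: it lists every n-bit pattern with at least two nonzero bits, and tallies how many adders each pattern would need if it were built on its own, with no sharing between patterns.

```python
from itertools import product


def binary_common_subexpressions(n):
    """Return all n-bit patterns (MSB first) that contain at least two 1s."""
    return [bits for bits in product((0, 1), repeat=n) if sum(bits) >= 2]


def direct_adder_count(patterns):
    """A pattern with k nonzero bits needs k - 1 adders when built on its own."""
    return sum(sum(bits) - 1 for bits in patterns)


bcs4 = binary_common_subexpressions(4)
print(len(bcs4))                 # number of 4-bit BCSs
print(direct_adder_count(bcs4))  # adders for an unshared realization
```

The patterns with a single nonzero bit are excluded because they are pure shifts of x.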

A straightforward realization of the above BCSs would require seventeen adders. However, $x_3$ can be obtained from $x_1$ by a left shift operation (without using any extra adders): $x_3 = 2^1 x + 2^2 x = 2^1 (x + 2^1 x) = 2^1 x_1$. Similarly, $x_8$ can be obtained from $x_1$ by a double left shift operation (without using any extra adders): $x_8 = 2^2 x + 2^3 x = 2^2 (x + 2^1 x) = 2^2 x_1$. $x_6$ can be obtained from $x_2$ by a left shift operation: $x_6 = 2^1 x + 2^3 x = 2^1 (x + 2^2 x) = 2^1 x_2$. $x_{10}$ can be obtained from $x_4$ by a left shift operation: $x_{10} = 2^1 x + 2^2 x + 2^3 x = 2^1 (x + 2^1 x + 2^2 x) = 2^1 x_4$.

Binary horizontal common sub-expression elimination (BHCSE) is used in the proposed architecture. In general, an n-bit binary number can represent $2^{n-1} - 1$ BHCSs. A 4-bit binary representation can form seven BHCSs, which are [0 0 1 1], [0 1 0 1], [0 1 1 1], [1 1 1 1], [1 1 0 1], [1 0 1 1] and [1 0 0 1]. It can be noted that the common sub-expressions (CSs) [0 0 1 1], [1 1 0 0] and [0 1 1 0] can be implemented using [1 1]; the CSs [0 1 0 1] and [1 0 1 0] can be implemented using [1 0 1]; and the CSs [0 1 1 1] and [1 1 1 0] can be obtained using [1 1 1] with simple shift operations.

Also, $x_4$ can be obtained from $x_2$ using an adder: $x_4 = 2^1 x + x_2$. $x_7$ can be obtained from $x_5$ using an adder: $x_7 = 2^1 x + x_5$. $x_9$ can be obtained from $x_5$ using an adder: $x_9 = 2^2 x + x_5$. $x_{11}$ can be obtained from $x_4$ using an adder: $x_{11} = 2^3 x + x_4$. Thus only seven adders are needed to realize all the 4-bit BCSs (three for $x_1$, $x_2$ and $x_5$, and four for $x_4$, $x_7$, $x_9$ and $x_{11}$) rather than seventeen adders. The binary representation based BCSE therefore offers a better reduction of the adders/subtractors needed to implement the coefficient multipliers.
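The reuse pattern described in this section can be sketched as follows. The function name and the adder-counting helper are hypothetical; the sketch builds $x_1$ to $x_{11}$ with shifts and additions only, and counts each addition, under the assumption that shifts are free (hardwired).

```python
def bcs_with_shared_adders(x):
    """Build x1..x11 from input x, counting every addition that is used."""
    adders = 0

    def add(a, b):
        nonlocal adders
        adders += 1
        return a + b

    v = {}
    v[1] = add(x, x << 1)      # x1  = x + 2x        [0 0 1 1]
    v[2] = add(x, x << 2)      # x2  = x + 4x        [0 1 0 1]
    v[5] = add(x, x << 3)      # x5  = x + 8x        [1 0 0 1]
    v[3] = v[1] << 1           # x3  = 2 * x1        (shift only)
    v[8] = v[1] << 2           # x8  = 4 * x1        (shift only)
    v[6] = v[2] << 1           # x6  = 2 * x2        (shift only)
    v[4] = add(x << 1, v[2])   # x4  = 2x + x2
    v[10] = v[4] << 1          # x10 = 2 * x4        (shift only)
    v[7] = add(x << 1, v[5])   # x7  = 2x + x5
    v[9] = add(x << 2, v[5])   # x9  = 4x + x5
    v[11] = add(x << 3, v[4])  # x11 = 8x + x4
    return v, adders


values, adder_count = bcs_with_shared_adders(1)
print([values[i] for i in range(1, 12)], adder_count)
```

With x = 1 the values are simply the integer weights of the eleven bit patterns, which makes the result easy to compare against the list of BCSs.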

4.3 Proposed FIR Filter Architecture

The proposed reconfigurable FIR filter is based on the reconfigurable processing element architecture shown in figure 4.1. The processing element (PE) is capable of performing three computing functions: addition (including carry), multiplication and negation, with two inputs and one output. Other arithmetic operations are performed as a sequence of additions. Each PE is equipped with three shift registers to support two serial inputs and one serial output for two independent paths. These shift registers are used as single-word cache memories functioning as serial/parallel and parallel/serial translators in the communication path to the RAM memory and buffer. This scheme leads to a simple interconnection network, enabling a small chip area and the use of a high speed clock. We consider two PEs and one RAM memory block. The architecture provides two control signals, s1 and s2, one for each PE. If the control signal of a PE (s1 or s2) is set to '1', that PE performs multiplication; otherwise it performs addition to realize accumulation of the multiplier output. When s1 = 1 and s2 = 0, PE1 performs multiplication while PE2 accumulates the multiplier output and writes the result to the output buffer. The multiplier output of PE1 is sent to PE2 via a shift register. The output of the RAM block is fed back to PE2 via a shift register for further accumulation of the multiplier output. The accumulated output of PE2 is sent to the RAM block as well as to the final output. (The structure of the proposed reconfigurable PE is nearly symmetrical and allows interchanging the roles of PE1 and PE2. When s1 = 0 and s2 = 1, PE2 performs multiplication while PE1 accumulates the multiplier output and writes the result to the output buffer. The multiplier output of PE2 is sent to PE1 via a shift register. The output of the RAM block is fed back to PE1 via a shift register for further accumulation of the multiplier output. The accumulated output of PE1 is sent to the RAM block as well as to the final output.) In the discussions that follow, we consider the case when s1 = 1 and s2 = 0. For this case, PE1 represents the MB of the reconfigurable FIR filter and PE2 accumulates the result.
The main steps in the proposed FIR filter architecture are:
1) Load data into input registers
   R11 <- IN1, R21 <- IN2;

2) Perform calculations
   PE1 <- R11 * R21;

3) Write the results into output registers
   R31 <- PE1;

4) Write the result from output registers to memory
   RAM[0] <- R31;

where IN1 is the input signal x(n) and IN2 is the filter coefficient h(n).
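The register-transfer steps and the role of the two control signals can be mimicked with a small behavioral model. This is only a sketch under simplifying assumptions: the function names are hypothetical, values are ordinary Python integers rather than bit-serial words, and the shift registers and output buffer are reduced to local variables.

```python
def processing_element(control, a, b):
    """One PE: multiplies when its control signal is 1, otherwise adds."""
    return a * b if control == 1 else a + b


def fir_output(samples, coeffs, s1=1, s2=0):
    """Accumulate sum(h * x) over paired samples and coefficients.

    (s1, s2) = (1, 0): PE1 multiplies and PE2 accumulates.
    (s1, s2) = (0, 1): the roles are interchanged; the result is the same.
    """
    if (s1, s2) not in ((1, 0), (0, 1)):
        raise ValueError("exactly one PE must be in multiply mode")
    ram = [0]                                       # single RAM word used as accumulator
    for x_n, h_n in zip(samples, coeffs):
        r1, r2 = x_n, h_n                           # step 1: load input registers
        r3 = processing_element(1, r1, r2)          # steps 2-3: multiplying PE -> output register
        ram[0] = processing_element(0, ram[0], r3)  # accumulating PE, result written to RAM
    return ram[0]


print(fir_output([1, 2, 3], [4, 5, 6]))
```

Because the structure is symmetric, swapping the control signals changes only which physical PE does which job, not the computed value.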
Figure 4.1: Proposed reconfigurable FIR filter architecture module (a) Block diagram (b) RTL Schematic

Figure 4.2: Proposed reconfigurable FIR filter architecture module in parallel and cascade form

The FIR filter architecture can be realized using either the architecture module shown in figure 4.1 or that shown in figure 4.2. In the approach shown in figure 4.1, all the partial products are generated by multiplying the coefficients with the input signal (h[n] * x[n]) and the results are accumulated. In the approach shown in figure 4.2, multiples of the proposed module may be used in a parallel or cascade configuration. The former option (using a single architecture module) is chosen when area and power consumption are the key constraints, while the latter option (using a parallel or cascade configuration) is chosen for high speed applications. In figure 4.2, blue lines indicate the parallel configuration, while green lines indicate the cascade configuration. Brown lines indicate common connections for both parallel and cascade configurations.

Figure 4.3: Architecture of the proposed CSM PE1 (MB) for 8-bit coefficients (a) Block diagram (b) RTL Schematic

The basic architecture of PE1 (MB) is shown in figure 4.3. PE1 (MB) is realized with the help of a Shift-and-Add unit, a multiplexer unit, an adder and shifter unit, and a final adder unit. PE1 (MB) uses the CSM. In the CSM, the filter coefficients are partitioned into fixed groups and hence PE1 (MB) requires only constant shifters. The proposed architecture module with the new PE1 (MB) offers a good balance in terms of chip area, power consumption, computational throughput and flexibility. PE2 is realized using a high performance carry select adder with dual transition skewed logic (DTSL) to reduce power consumption.

4.3.1 Low-Complexity Reconfigurable PE1 Architecture (Multiplier Block)

The functions of the different blocks of PE1 (MB) are explained below.

Shift-and-Add Unit: One of the efficient ways to reduce the complexity of the multiplication operation is to realize multiplication using Shift-and-Add operations. In contrast to the conventional Shift-and-Add unit used in previously proposed reconfigurable filter architectures [27, 70], we use a modified BCSE based Shift-and-Add unit in our proposed CSM architecture. In the Shift-and-Add unit, all the even 4-bit BCSs are realized. The proposed architecture of the Shift-and-Add unit is shown in figure 4.4. This architecture is used to realize the 4-bit even BCSs of the input signal, namely [0000], [0010], [0100], [0110], [1000], [1010], [1100] and [1110]. In figure 4.4, "x<<k" represents the input x shifted left by k units. A straightforward realization of these BCSs would require 5 adders. These BCSs can be expressed as:

[0 1 1 0] = $x_3 = 2^1 x + 2^2 x$
[1 0 1 0] = $x_6 = 2^1 x + 2^3 x$
[1 1 0 0] = $x_8 = 2^2 x + 2^3 x$
[1 1 1 0] = $x_{10} = 2^1 x + 2^2 x + 2^3 x$
where x is the input signal.
Figure 4.4: Architecture of Shift-and-Add unit

Note that other BCSs such as [0 0 1 0], [0 1 0 0], and [1 0 0 0] do not require any adders for implementation. Moreover, $x_8$ can be obtained from $x_3$ without using any adders as follows:

$x_8 = 2^2 x + 2^3 x = 2^1 (2^1 x + 2^2 x) = 2^1 x_3$ (4.23)

Also, $x_{10}$ can be obtained from $x_8$ as follows:

$x_{10} = 2^1 x + 2^2 x + 2^3 x = 2^1 x + x_8$

Thus the 4-bit even BCSs 0x, 2x, 4x, 6x, 8x, 10x, 12x and 14x of a 4-bit number are generated using only three adders. The easiest way to obtain the BCSs is to know the shift amounts in advance. All these eight BCSs (0x, 2x, 4x, 6x, 8x, 10x, 12x, 14x) are then fed to 8:1 multiplexers. In contrast to the conventional Shift-and-Add unit, the number of adders/subtractors needed to implement the proposed Shift-and-Add unit is significantly reduced.
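The three-adder realization of the even BCSs can be sketched in a few lines. The function name is hypothetical, and the model works on plain integers with shifts standing in for hardwired wiring.

```python
def shift_and_add_unit(x):
    """Return [0x, 2x, 4x, 6x, 8x, 10x, 12x, 14x] using three additions."""
    x3 = (x << 1) + (x << 2)   # adder 1: 6x  = 2x + 4x
    x6 = (x << 1) + (x << 3)   # adder 2: 10x = 2x + 8x
    x8 = x3 << 1               # 12x = 2 * x3, shift only (equation 4.23)
    x10 = (x << 1) + x8        # adder 3: 14x = 2x + x8
    return [0, x << 1, x << 2, x3, x << 3, x6, x8, x10]


print(shift_and_add_unit(5))
```

The list is ordered so that entry k holds 2k times x, which lets a 3-bit value index it directly.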
Multiplexer Unit: The multiplexer unit selects the appropriate output from the Shift-and-Add unit. The inputs to the multiplexer unit are the 8 outputs (0x, 2x, 4x, 6x, 8x, 10x, 12x, 14x) of the Shift-and-Add unit, and hence 8:1 multiplexers are employed in this architecture. The values of the filter coefficients are stored in a look up table (LUT). These coefficients are partitioned into groups of four bits. The three most significant bits of each 4-bit group are used as select signals to the corresponding multiplexer. The number of multiplexers in the multiplexer unit depends on the number of groups obtained after partitioning the filter coefficient into fixed groups of 4 bits. In the proposed MB of PE1, the four most significant bits (MSB) of the coefficient (fixed group) are given to multiplexer1, and the next lower 4 bits of the filter coefficient are given to multiplexer2. This procedure is continued until the four least significant bits (LSB) are given to the last multiplexer.
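The partitioning of a coefficient into 4-bit groups can be illustrated as below. This is a sketch with a hypothetical function name; the coefficient is treated as an unsigned integer whose word length is a multiple of four, and the sign bit and binary point are ignored. Each group yields the 3-bit select value and, separately, the remaining least significant bit of the group, which is not used by the select lines.

```python
def mux_selects(coefficient, word_length):
    """Split a coefficient into 4-bit groups, most significant group first.

    Returns a list of (select, lsb) pairs, where select is the value of the
    three MS bits of the group and lsb is the remaining bit of the group.
    """
    if word_length % 4 != 0:
        raise ValueError("word length must be a multiple of 4 in this sketch")
    pairs = []
    for shift in range(word_length - 4, -1, -4):
        group = (coefficient >> shift) & 0xF
        pairs.append((group >> 1, group & 1))
    return pairs


print(mux_selects(0b11101010, 8))  # groups 1110 and 1010
```

A 16-bit coefficient yields four pairs, one per multiplexer.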

Adder and Shifter Unit: The intermediate results obtained from the multiplexer unit are given to the adder and shifter unit. If the LSB of the fixed group is '1', the adder adds x(n) to the intermediate result; otherwise the intermediate result is passed through without change. The shifter performs a constant shift, with the shift count dependent on the position of the 4-bit group. As an example, for the case of an 8-bit coefficient, the shifter shifts the intermediate result of the MSB group by 4 bits to achieve the required weighting of $2^4$. This can be illustrated by the following expression.

$(2^1 x + 2^3 x) + 2^4 (2^1 x + 2^2 x + 2^3 x)$ (4.25)

Final Adder Unit: The accumulation of all the intermediate results from the adder and shifter unit is handled in this unit.

$r = 2^1 x + 2^3 x + 2^4 (2^1 x + 2^2 x + 2^3 x)$ (4.26)

The proposed CSM architecture performs addition based on BCSs instead of bitwise addition. The BCS based Shift-and-Add unit offers a good reduction in the number of addition operations. Filter specifications change for different communication standards and the corresponding filter coefficients are different. The same CSM architecture may be used for these different FIR filter specifications. The only change is in the LUT contents, which are loaded with the coefficient set corresponding to the filter specifications of the required communication standard. Reconfigurability in this architecture is thereby achieved through modifiable filter coefficients stored in the LUT.

Figure 4.5: Architecture of the proposed CSM PE1 (MB) for 16-bit coefficients

4.3.2 CSM architecture for 8 bit coefficients

In the proposed CSM architecture (MB) for 8-bit coefficients, the values of the coefficients are stored in the LUT. The number of multiplexers required is n/4, where n is the word-length of the filter coefficients. For example, consider the 8-bit coefficient h = "0.11111111". This coefficient requires the maximum number of additions and shifts since all bits in the coefficient are nonzero. In this example, n = 8, and therefore the number of multiplexers required is two. The output y = h * x is expressed as:

$y = x + 2^1 x + 2^2 x + 2^3 x + 2^4 x + 2^5 x + 2^6 x + 2^7 x$ (4.27)

By partitioning into groups of 4 bits from the MSB, we obtain:

$y = (x + 2^1 x + 2^2 x + 2^3 x) + 2^4 (x + 2^1 x + 2^2 x + 2^3 x)$ (4.28)

Note that the even part of the expression $x + 2^1 x + 2^2 x + 2^3 x$ can be obtained from the Shift-and-Add unit as follows:

[1 1 1 0] = $x_{10} = 2^1 x + 2^2 x + 2^3 x$ (4.29)

Since the intermediate sums shown for each sub group in equation (4.28) are odd, we add x(n) to the intermediate result, namely, $x_{10} + x$. The shifter shifts the intermediate result of the MSB group by 4 bits to achieve the required weighting of $2^4$. Since the number of shifts is always constant irrespective of the value of the coefficients, programmable shifters are not required and these shifts can be hardwired.
The following example shows the computation sequence in the CSM architecture for an 8-bit even coefficient (h = "0.11101010"). This coefficient requires two additions. In this example, n = 8, and therefore the number of multiplexers required is two. The output y = h * x is expressed as:

$y = 2^1 x + 2^3 x + 2^5 x + 2^6 x + 2^7 x$ (4.30)

By partitioning into groups of 4 bits from the MSB, we obtain:

$y = (2^1 x + 2^3 x) + 2^4 (2^1 x + 2^2 x + 2^3 x)$ (4.31)

The expressions $2^1 x + 2^3 x$ and $2^1 x + 2^2 x + 2^3 x$ can be obtained from the Shift-and-Add unit as follows:

[1 0 1 0] = $x_6 = 2^1 x + 2^3 x$
[1 1 1 0] = $x_{10} = 2^1 x + 2^2 x + 2^3 x$

Two 8:1 multiplexers are used for the two 4-bit groups of the 8-bit filter coefficient in order to obtain the intermediate sums shown for each sub group in equation (4.31). Since the intermediate sums are even, the shifter directly shifts the intermediate result of the MSB group by 4 bits to achieve the required weighting of $2^4$.

The general steps of computation in the CSM architecture are as follows:

Step 1: Get the input x
Step 2: Store the coefficients in the LUT
Step 3: Get a coefficient from the LUT, partition it into fixed groups of 4 bits and use the three MS bits of each group as select lines to the corresponding multiplexer
Step 4: If the LSB of a 4-bit group is '1', add x(n) to the output of the multiplexer unit; otherwise do nothing
Step 5: Perform the final constant shift on each output of the multiplexer unit, with the shift count dependent on the position of the 4-bit group
Step 6: Perform the addition of the intermediate sums using the final adder unit
Step 7: Store the final result, h*x, in the register and pass it to PE2 for accumulation
Step 8: Advance to the next coefficient in the LUT; if the entry is null, go to step 1, else go to step 3
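Steps 3 to 6 above can be sketched end to end for a single coefficient. The function name `csm_multiply` is hypothetical, the coefficient is treated as an unsigned integer magnitude (sign handling and binary-point scaling are left out), and the pre-computed even multiples are formed with the three additions of the Shift-and-Add unit.

```python
def csm_multiply(x, coefficient, word_length=8):
    """Multiply x by an unsigned coefficient using the constant shift method."""
    if word_length % 4 != 0:
        raise ValueError("word length must be a multiple of 4 in this sketch")
    # Shift-and-Add unit: even multiples 0x .. 14x with three additions.
    x3 = (x << 1) + (x << 2)
    x6 = (x << 1) + (x << 3)
    x8 = x3 << 1
    x10 = (x << 1) + x8
    even = [0, x << 1, x << 2, x3, x << 3, x6, x8, x10]

    result = 0
    for g in range(word_length // 4):      # g = 0 is the least significant group
        group = (coefficient >> (4 * g)) & 0xF
        partial = even[group >> 1]         # step 3: three MS bits select the multiple
        if group & 1:                      # step 4: odd group, add x
            partial += x
        result += partial << (4 * g)       # step 5: constant shift; step 6: final add
    return result


print(csm_multiply(3, 0b11101010))  # compare with 3 * 0b11101010
```

Because the shift applied to group g is always 4g, no programmable shifter is needed; only the multiplexer selects change with the coefficient.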

4.3.3 CSM architecture for 16 bit coefficients

The CSM architecture of PE1 (MB) for a coefficient word-length of 16 bits is shown in figure 4.5. The 16-bit values of the filter coefficients are stored in the LUT in sign-magnitude form with the MSB reserved for the sign bit. The first bit after the sign bit is used to represent the integer part of the coefficient and the remaining 16 bits are used to represent the fractional part of the coefficient. Thus, each 16-bit coefficient is stored as an 18-bit value in the LUT. Each 16-bit coefficient is stored in a single row of the LUT. Exploiting coefficient symmetry, only half the number of coefficients needs to be stored in the LUT. The coefficient values corresponding to $2^0$ to $2^{15}$ are partitioned into groups of 4 bits and the three MS bits of each group are connected to the select signals of the respective multiplexers (multiplexer1 through multiplexer4). For the case of an odd value in the fixed group of the coefficient, x(n) is added to the intermediate result from the multiplexer. For the case of an even value in the fixed group of the coefficient, the intermediate result from the multiplexer is passed through the adder without change. Let $r_1$, $r_2$, $r_3$ and $r_4$ denote the outputs of Adder1 through Adder4, respectively, as shown in figure 4.5. The shifts indicated between stages of adders in figure 4.5 are obtained as follows:

$y = [r_1 + 2^4 r_2 + 2^8 r_3 + 2^{12} r_4]$ (4.35)

$y = [(r_1 + 2^4 r_2) + 2^8 (r_3 + 2^4 r_4)]$ (4.36)

Substituting $(r_1 + 2^4 r_2)$ and $(r_3 + 2^4 r_4)$ by $r_5$ and $r_6$, respectively, we get

$y = [r_5 + 2^8 (r_6)]$ (4.37)

By substituting $(r_5 + 2^8 (r_6))$ by $r_7$,

$y = r_7$ (4.38)

The expressions from (4.35) to (4.38) are represented in figure 4.5.
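The grouping in equations (4.35) to (4.38) amounts to a two-level adder tree with hardwired shifts. A minimal sketch, with a hypothetical function name and plain integers for the adder outputs:

```python
def final_adder_tree(r1, r2, r3, r4):
    """Combine four group results with three additions and constant shifts."""
    r5 = r1 + (r2 << 4)   # first level: r5 = r1 + 2^4 * r2
    r6 = r3 + (r4 << 4)   # first level: r6 = r3 + 2^4 * r4
    r7 = r5 + (r6 << 8)   # second level: r7 = r5 + 2^8 * r6
    return r7


def flat_sum(r1, r2, r3, r4):
    """Direct evaluation of equation (4.35), used only for comparison."""
    return r1 + (r2 << 4) + (r3 << 8) + (r4 << 12)


print(final_adder_tree(1, 2, 3, 4), flat_sum(1, 2, 3, 4))
```

Both forms use three additions; the tree form shortens the longest adder path from three additions in sequence to two.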

4.3.4 High Performance PE2 Architecture (Accumulation Block)

The design of PE2 employs a carry select adder (CSA) using DTSL in order to achieve high performance and low power consumption. Skewed logic gates act as static CMOS logic, but the PMOS and NMOS transistors are sized differently to accelerate the transitions. To speed up the low to high transition, the sizes of the NMOS transistors are reduced while the PMOS transistors are enlarged. To speed up the high to low transition, the sizes of the PMOS transistors are reduced while the NMOS transistors are enlarged. Figure 4.6(a) shows a DTSL circuit that achieves high performance by using a dual data path technique, in which the top data path has a different skew direction from the bottom data path. The top data path is for a fast rising transition input and the bottom data path is for a fast falling transition input. The arrows shown within each block represent the direction of the transition skew. A combiner is used to detect the earliest transition between the two paths and latch the result for transfer to the next stage.
Figure 4.6: Block diagram of (a) DTSL (b) Carry-select adder using DTSL [34]

Proper skewing is achieved by preferential sizing of the pull-up and pull-down transistors in static CMOS circuits [34, 69, 102]. The CSA using DTSL is shown in figure 4.6(b). It consists of two carry propagation paths implemented with DTSL logic for generating the SUM, and control logic. As shown in figure 4.6(b), the carry propagation logic of each block of the carry select adder has two data paths: one has '0' as its CARRY input and the other has '1' as its CARRY input. The control logic consists of a transmission gate, a switching transistor, and CMOS control gates. The carry propagation logic has the same topology as a general CSA except that the transistor sizes are adjusted to speed up transitions.
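The carry-select principle (two candidate sums per block, one for each possible carry input, with the real carry choosing between them) can be shown with a purely functional model. This sketch says nothing about DTSL transistor sizing or timing; the function name, the 16-bit width and the 4-bit block size are illustrative assumptions.

```python
def carry_select_add(a, b, width=16, block=4):
    """Add two unsigned integers block by block; return (sum, carry_out)."""
    if width % block != 0:
        raise ValueError("width must be a multiple of the block size")
    mask = (1 << block) - 1
    carry = 0
    total = 0
    for pos in range(0, width, block):
        a_blk = (a >> pos) & mask
        b_blk = (b >> pos) & mask
        sum0 = a_blk + b_blk        # data path with CARRY input '0'
        sum1 = a_blk + b_blk + 1    # data path with CARRY input '1'
        chosen = sum1 if carry else sum0   # selection by the incoming carry
        total |= (chosen & mask) << pos
        carry = chosen >> block
    return total, carry


print(carry_select_add(0xFFFF, 1))
```

In hardware both candidate sums are computed in parallel, so only the selection step waits for the incoming carry; the loop above models the result, not that parallelism.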

4.4 Experimental Results

In this section, the synthesis and design results of the proposed FIR filter architecture using the low-complexity reconfigurable MB are presented. Xilinx ISE 9.1i is used for synthesis. The target FPGA is Xilinx's Virtex-II family 2vp2fg256-6. Table 4.1 shows the synthesis results of the proposed reconfigurable MB CSM architecture with a coefficient word-length of 8 bits. The effect of the filter coefficient word length on the MB has also been analyzed for 8 bits, 12 bits and 16 bits. The results are shown in Table 4.2, Table 4.3, and figure 4.7. It is noted that when the precision of the filter coefficients increases, the area consumed on the FPGA increases while the speed of operation decreases. Thus, by choosing the appropriate filter coefficient word-length, it is possible to obtain reduced area as well as increased speed for the FIR filter architecture.

Table 4.1: Synthesis result for the proposed CSM architecture for a coefficient word-length of 8 bits

2vp2fg256-6                    Area used by proposed PE1 (MB)    Utilization
Number of Slices               24 out of 1408                    1%
Number of Slice Flip Flops     14 out of 2816                    0%
Number of 4 input LUTs         22 out of 2816                    0%
Number of bonded IOBs          18 out of 140                     12%
Number of GCLKs                1 out of 16                       6%

Table 4.2: Synthesis results for the proposed CSM architecture with various coefficient word lengths

Table 4.3: Timing summary and power analysis for the proposed CSM architecture with various coefficient word lengths

Table 4.4: Comparison of performance for the proposed implementation and the existing reconfigurable implementation

Table 4.5: Comparison of performance for the proposed implementation and the existing reconfigurable implementation (Macro statistics)
Figure 4.7: Synthesis results for the proposed reconfigurable FIR filter using CSM MB with various coefficient word lengths (chart title: Multiplier Block (PE1) with Different Coefficient Word Lengths; horizontal axis: Device Utilization Summary; vertical axis: utilization in percent)

4.5 Discussions

The proposed FIR filter architecture using the reconfigurable MB is compared with the existing FIR filter architectures [32, 45, 49, 54, 69, 104]. Table 4.4 and Table 4.5 compare the performance of the proposed FIR filter architecture using the reconfigurable MB against existing architectures in terms of basic design metrics. Paper [49] presents a reconfigurable PE architecture that consists of PEs, memories, an interconnection network, and control elements, with the PE based on bit serial arithmetic. As discussed earlier, the existing shared memory architecture [49] offers a good balance in terms of chip area, power consumption and flexibility; but the MB in that reconfigurable FIR filter design approach does not address the task of reducing the complexity of the filter implementation. The main difference between the proposed reconfigurable FIR filter architecture and the existing architectures [44, 45, 62, 69] is the use of the reconfigurable FPGA based hardware accelerator with balanced shared memory architecture for FIR filtering. The proposed reconfigurable architecture uses a BCS based Shift-and-Add unit and hardwired shifts. The approach in [62, 69] requires nine adders for the pre-computers, employing a special CSA. This is in comparison with only three adders required by our 4-bit BCS based Shift-and-Add unit. Thus, the proposed reconfigurable FIR filter using the reconfigurable CSM MB offers a reduction of adders over the architectures in [62, 69]. Another major difference is that [62, 69] employ two programmable shifters. These programmable shifters reduce the overall speed of operation of the resulting filters, especially for higher order channel filter applications in wireless communication receivers. In the proposed reconfigurable CSM MB (PE1), all the shifts are constant and hence can be hardwired. This results in better speed of operation compared to the methods in [62, 69]. This can be further clarified using the example below.

For a 16-bit coefficient, the proposed reconfigurable CSM MB architecture (with the 4-bit BCS based Shift-and-Add unit) requires four 8:1 multiplexers (equivalent to sixteen 2:1 multiplexers), three adders for the Shift-and-Add unit and three adders for the final adder unit. Note that programmable shifters are not required in the CSM. On the other hand, the approach in [62, 69] requires four 8:1 multiplexers (main multiplexers) and four 4:1 multiplexers for the programmable shifters, if these are not implemented using power consuming barrel shifters. These are equivalent to twenty four 2:1 multiplexers. It also requires 12 adders (nine adders for the pre-computers and three adders for the final adder unit) and eight programmable shifters.

The approach in [45] requires five 8:1 multiplexers and one 2:1 multiplexer (equivalent to twenty one 2:1 multiplexers) and eight adders (three adders for the Shift-and-Add unit and five adders for the final adder unit). For a quantitative comparison, let us consider a 16-bit coefficient with an 8-bit quantized input signal. The proposed reconfigurable CSM MB approach and the approaches in [45, 62, 69] have been implemented on a Virtex 2 xc2vp2-6fg256 FPGA, and the complexities of the Shift-and-Add unit, multiplexer unit and final adder unit, as well as the over-all complexities, have been compared. The over-all complexity is the sum of the complexities associated with the Shift-and-Add unit, the multiplexer unit, the final adder unit, and the LUTs used for storing the coefficients. The over-all complexity of the proposed reconfigurable CSM MB (PE1) is less than that of the approaches in [45, 62, 69].
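The multiplexer and adder counts quoted for the 16-bit case can be tallied as follows. The conversion factors (an 8:1 multiplexer counted as four 2:1 multiplexers and a 4:1 multiplexer as two) are not stated explicitly in the text; they are inferred here from the quoted totals of sixteen, twenty four and twenty one, and the dictionary and function names are hypothetical.

```python
# 2:1-multiplexer equivalents, inferred from the totals quoted in the text.
MUX2_EQUIVALENT = {8: 4, 4: 2, 2: 1}

# Counts for a 16-bit coefficient, taken from the discussion above.
DESIGNS = {
    "proposed CSM MB": {"muxes": {8: 4}, "adders": 3 + 3},
    "[62, 69]": {"muxes": {8: 4, 4: 4}, "adders": 9 + 3},
    "[45]": {"muxes": {8: 5, 2: 1}, "adders": 3 + 5},
}


def mux2_count(muxes):
    """Total number of equivalent 2:1 multiplexers for a {size: count} map."""
    return sum(MUX2_EQUIVALENT[size] * count for size, count in muxes.items())


for name, design in DESIGNS.items():
    print(name, mux2_count(design["muxes"]), design["adders"])
```

The tally covers only multiplexers and adders; the eight programmable shifters of [62, 69] and the coefficient LUTs are outside this sketch.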

A DA based approach has been suggested in [104], where memory space is reduced at the cost of additional adders. A bit-level super systolic FIR filter using a bit-serial semi-systolic multiplier, with positioning and floor planning better suited to FPGA implementation, is discussed in [32]. These approaches require more area compared to the proposed architecture.

The approach in [54] presents the design optimization of one and two dimensional fully pipelined computing structures for area, delay, and power efficient implementation of FIR filters by systolic decomposition of distributed arithmetic (DA) based inner product computation. The systolic decomposition scheme is found to offer a flexible choice of the address length of the LUT for DA-based computation in order to decide on a suitable area-time tradeoff. It is observed that by using smaller address lengths for the DA based computing units it is possible to reduce the memory size, but this leads to an increase in adder complexity and latency. In FIR filtering, one of the convolving sequences is derived from the input samples while the other sequence is derived from the fixed impulse response coefficients of the filter. This behavior of the FIR filter makes it possible to use a DA-based technique for memory-based realization. It yields faster output compared with multiplier-accumulator based designs because it stores pre-computed partial results in memory elements that can be read out and accumulated to obtain the desired result. The memory requirement of a DA based implementation of FIR filters, however, increases exponentially with the filter order. An LUT multiplier based approach is proposed in [55], where a subset of the pre-computed partial products (the odd products) is stored in memory elements. Memory-based structures are well-suited for many DSP algorithms which involve multiplication with a fixed set of coefficients. As mentioned above, memory requirements are influenced by the filter order. The proposed CSM PE1 (MB) shown in figure 4.3 may be modified to use a memory based LUT approach similar to [55] by replacing the Shift-and-Add unit with an LUT.
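The memory versus adder trade-off of DA described above can be made concrete with a rough cost model. This is a generic textbook-style estimate, not the cost model of [54]: an N-tap filter is assumed to be split into units that each use a LUT with a k-bit address, every LUT is counted at its full size of 2^k words, and combining the unit outputs is assumed to need one adder fewer than the number of units. The function name is hypothetical.

```python
import math


def da_lut_cost(taps, address_length):
    """Return (lut_words, combining_adders) for a partitioned DA filter."""
    if taps < 1 or address_length < 1:
        raise ValueError("taps and address_length must be positive")
    units = math.ceil(taps / address_length)
    lut_words = units * (1 << address_length)   # each unit stores 2^k partial sums
    combining_adders = units - 1                # adders that merge the unit outputs
    return lut_words, combining_adders


for k in (16, 8, 4):
    print(k, da_lut_cost(16, k))
```

With a single LUT the word count grows as 2^N, which is the exponential growth with filter order noted above; shorter address lengths shrink the memory but add combining adders.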

4.6 Conclusion

In this chapter, a reconfigurable FIR filter architecture using a low-complexity multiplier based on CSHM, CSM, and modified BCSE is proposed. The CSHM algorithm reduces the redundant computation in the FIR filtering operation. The CSM ensures constant shifts and hence faster shifting. The modified BCSE method reduces the computational complexity by reusing the most common binary bit patterns present in the coefficients. The proposed architecture ensures low complexity as well as reconfigurability, which are the two key requirements for FIR filters in the channelizers of SDRs. This architecture is implemented on a Virtex 2 xc2vp2-6fg256 FPGA and synthesized with coefficient precisions of 8 bits, 12 bits and 16 bits. Results of this architecture are compared against several reconfigurable FIR filter architectures. The results show a significant reduction in either arithmetic operations or the hardware necessary to implement those operations, combined with satisfactory runtimes. Comparison with related work based on the available data shows that the proposed method yields comparatively better results in FIR filter optimization.
