13 - Chapter 4 PDF
4.1 Introduction
Software defined radio (SDR) technology is a very active field in the area of
FIR filter research, which focuses on reconfigurable realizations. Low-complexity
FIR filter coefficient multipliers are obtained using the common sub-expression
elimination (CSE) method based on canonical signed digit (CSD) coefficients [27].
The aim of the CSE method is to search for instances of identical expressions that are
present in the CSD representation of the coefficients and eliminate these redundant
multiplications. The CSD representation uses three symbols in each power-of-two
position: -1, 0 and 1; the value represented is the sum of the positional coefficients
times the corresponding powers of 2. In [70], a modification of the method in [27] is
proposed for identifying and eliminating the redundant computations. In [72], the
technique discussed in [27] is modified to minimize the logic depth (LD) and thereby
improve the speed of operation. LD is defined as the number of adder-steps in a
maximal path of decomposed multiplications. Rather than CSD, the binary based
common sub-expression elimination (BCSE) method is used in [44] to further reduce
the number of adders and thus achieve a lower-complexity FIR filter compared to
[27, 70, 72]. In [96], the filter coefficients are encoded using a pseudo random
floating point method in order to reduce the complexity of the filter; however, it is
limited to filter lengths less than 40. The methods in [27, 70, 72, 96] are only
suitable for fixed logic filters where the coefficients are fixed. In chapter 3, we
proposed a reconfigurable field programmable gate array (FPGA) based hardware
accelerator, a balanced shared memory architecture for FIR filtering and the time
variant discrete Fourier transform (TVDFT), which provides reconfigurability. In this
chapter, we propose some modifications to the above and additionally propose a
low-complexity multiplier.
Several reconfigurable FIR filters have been proposed by researchers and are
discussed in detail in [9, 11, 18, 45, 62, 69, 105]. The reconfigurable multiplier
block (MB) of [62, 69] consists of a pre-computer, select units and a final adder. The
pre-computer performs the multiplication of alphabets with the input x. In [69], the
method proposed in [62] is modified, and efficient circuit level techniques such as a
new carry select adder and a conditional capture flip-flop are used to enhance its
power and performance. The architectures in [9, 11, 18, 62, 69, 93, 105] are
appropriate only for relatively lower order filters and are not suitable for channel
filters in communication receivers. In [45], common DSP operations such as filtering
and matrix multiplication are identified and expressed as vector scaling operations.
Ripple carry adders are used in [45], and the presence of multiplexers gives the
option of adaptive computing. The BCSE method proposed in [43] provides improved
adder reductions leading to low-complexity FIR filters. However, reconfigurability of
the FIR filters is not considered in [43]. The concept of reconfigurable multiplier
blocks (ReMB) is proposed in [18]. The ReMB generates all the coefficient products and
a multiplexer selects the required ones depending on the input. The redundancy can be
reduced by pushing the multiplexer deep into the MB design. The resulting
specialized multiplier design can be more efficient in terms of area and computational
complexity compared to the general purpose multiplier plus the coefficient store [18].
But the ReMB has its area, power and speed dependent on the filter length, making
it inappropriate for higher order FIR filters.
This section presents the modified BCSE algorithm. This is a technique which
eliminates redundant binary common sub-expressions (BCSs) that occur within the
coefficients. Computational complexity is reduced by reusing the most common
binary bit patterns present in the coefficients. In general, an n-bit binary number can
represent 2^n - (n+1) BCSs [45]. For example, a 4-bit binary representation can form
11 BCSs, which are [0 0 1 1], [0 1 0 1], [0 1 1 0], [0 1 1 1], [1 0 0 1], [1 0 1 0],
[1 0 1 1], [1 1 0 0], [1 1 0 1], [1 1 1 0] and [1 1 1 1]. Note that other BCSs such as
[0 0 0 1], [0 0 1 0], [0 1 0 0], and [1 0 0 0] do not require any adders for
implementation. These BCSs can be expressed as:
[0 0 1 1] = x_1 = x + 2^1 x
[0 1 0 1] = x_2 = x + 2^2 x
[0 1 1 0] = x_3 = 2^1 x + 2^2 x
[0 1 1 1] = x_4 = x + 2^1 x + 2^2 x
[1 0 0 1] = x_5 = x + 2^3 x
[1 0 1 0] = x_6 = 2^1 x + 2^3 x
[1 0 1 1] = x_7 = x + 2^1 x + 2^3 x
[1 1 0 0] = x_8 = 2^2 x + 2^3 x
[1 1 0 1] = x_9 = x + 2^2 x + 2^3 x
[1 1 1 0] = x_10 = 2^1 x + 2^2 x + 2^3 x
[1 1 1 1] = x_11 = x + 2^1 x + 2^2 x + 2^3 x
where x is the input signal.
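The count 2^n - (n+1) can be checked by enumeration. The following is a minimal Python sketch, assuming that a BCS is any n-bit pattern with at least two nonzero bits (the all-zero pattern and the n single-bit patterns need no adder and are excluded). The function name is hypothetical and not taken from the cited work.

```python
from itertools import product


def binary_common_subexpressions(n):
    """Return all n-bit patterns (MSB first) that have at least two nonzero bits.

    Assumption: a pattern counts as a BCS only if it needs at least one adder,
    which excludes the all-zero pattern and the n single-bit patterns.
    """
    return [bits for bits in product((0, 1), repeat=n) if sum(bits) >= 2]


bcs4 = binary_common_subexpressions(4)
# Patterns come out in increasing binary order, i.e. the same order as x_1 ... x_11.
for index, bits in enumerate(bcs4, start=1):
    print(f"x_{index}: {list(bits)}")
print(len(bcs4) == 2**4 - (4 + 1))
```

For n = 4 the enumeration yields the 11 patterns listed in the text, starting with [0 0 1 1] and ending with [1 1 1 1].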
2) Perform calculations:
PE1 <- R1 * R2;
where IN1 is the input signal x(n) and IN2 represents the filter coefficient h(n).
Figure 4.1: Proposed reconfigurable FIR filter architecture module: (a) block diagram, (b) RTL schematic
Figure 4.2: Proposed reconfigurable FIR filter architecture module in parallel and cascade form
The FIR filter architecture can be realized using either the architecture module
shown in figure 3.1 or that shown in figure 4.1. In the approach shown in figure 4.1,
all the partial products are generated by multiplying the coefficients with the input
signal (h[n] * x[n]) and accumulating the results. In the approach shown in figure 4.2,
multiples of the proposed modules may be used in a parallel or cascade configuration.
The former option (using a single architecture module) is chosen when area and power
consumption are the key constraints, while the latter option (using a parallel or
cascade configuration) is chosen for high speed applications. In figure 4.2, blue lines
indicate the parallel configuration, while green lines indicate the cascade
configuration.
Figure 4.3: Architecture of the proposed CSM PE1 (MB) for 8-bit coefficients: (a) block diagram, (b) RTL schematic
Brown lines indicate common connections for both parallel and cascade
configurations. The basic architecture of the PE1 (MB) is shown in figure 4.3. PE1
(MB) is realized with the help of a Shift-and-Add unit, a multiplexer unit, an adder
and shifter unit, and a final adder unit. The PE1 (MB) uses CSM. In the CSM, the filter
coefficients are partitioned into fixed groups and hence the PE1 (MB) requires only
constant shifters. The proposed architecture module with the new PE1 (MB) offers a good
balance in terms of chip area, power consumption, computational throughput and
flexibility. PE2 is realized using a high performance carry select adder with dual
transition skewed logic (DTSL) to reduce power consumption.
The functions of the different blocks of the PE1 (MB) are explained below.
[0 1 1 0] = x_3 = 2^1 x + 2^2 x
[1 0 1 0] = x_6 = 2^1 x + 2^3 x
[1 1 0 0] = x_8 = 2^2 x + 2^3 x
[1 1 1 0] = x_10 = 2^1 x + 2^2 x + 2^3 x
where x is the input signal.
Figure 4.4: Architecture of Shift-and-Add unit
Note that other BCSs such as [0 0 1 0], [0 1 0 0], and [1 0 0 0] do not require any
adders for implementation. Moreover, x_8 can be obtained from x_3 without using any
adders as follows:
x_8 = 2^2 x + 2^3 x = 2^1 (2^1 x + 2^2 x) = 2^1 x_3 (4.23)
Thus the 4-bit even BCSs 0x, 2x, 4x, 6x, 8x, 10x, 12x, 14x of a 4-bit number are
generated using only three adders. The easiest way to obtain the BCSs is to know the
amount of shift in advance. All these eight BCSs (0x, 2x, 4x, 6x, 8x, 10x, 12x, 14x)
are then fed to 8:1 multiplexers. In contrast to a conventional Shift-and-Add unit, the
number of adders/subtractors needed to implement the proposed Shift-and-Add unit is
significantly reduced.
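The three-adder claim can be illustrated with a short Python sketch. It models the Shift-and-Add unit on integers, with left shifts standing in for hardwired wiring. The particular assignment of the three adders (6x, 10x and 14x) is an assumption made for illustration; the text only states that three adders suffice and that 12x is a shifted copy of 6x. The function name is hypothetical.

```python
def shift_and_add_unit(x):
    """Return [0x, 2x, 4x, 6x, 8x, 10x, 12x, 14x] using three additions."""
    x2 = x << 1         # 2x: shift only
    x4 = x << 2         # 4x: shift only
    x8 = x << 3         # 8x: shift only
    x6 = x2 + x4        # adder 1: x_3 in the text
    x10 = x2 + x8       # adder 2: x_6 in the text
    x12 = x6 << 1       # shifted copy of x_3, no adder (equation 4.23)
    x14 = x12 + x2      # adder 3: x_10 in the text (assumed adder assignment)
    return [0, x2, x4, x6, x8, x10, x12, x14]


print(shift_and_add_unit(3))
```

The list index equals the value of the three most significant bits of a 4-bit group, which is what lets an 8:1 multiplexer pick the right multiple directly.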
Multiplexer Unit: The multiplexer unit selects the appropriate output from the
Shift-and-Add unit. The inputs to the multiplexer unit are the 8 outputs (0x, 2x, 4x,
6x, 8x, 10x, 12x, 14x) of the Shift-and-Add unit, and hence 8:1 multiplexers are
employed in this architecture. The values of the filter coefficients are stored in a
look up table (LUT). These coefficients are partitioned into groups of four bits. The
three most significant bits of each 4-bit group are used as select signals to the
corresponding multiplexer. The number of multiplexers in the multiplexer unit
depends on the number of groups obtained after partitioning the filter coefficient
into fixed groups of 4 bits. In the proposed MB of PE1, the four most significant bits
(MSB) of the coefficient (fixed group) are given to multiplexer1, and the next lower
4 bits of the filter coefficient are given to multiplexer2. This procedure is continued
until the four least significant bits (LSB), which are given to the last multiplexer.
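The partitioning and select-signal generation described above can be sketched in Python as follows. The sketch assumes the coefficient magnitude is held as an unsigned n-bit integer with n a multiple of 4; the function names are hypothetical.

```python
def split_into_groups(h, n):
    """Split an n-bit magnitude h into 4-bit groups, most significant group first."""
    if n % 4 != 0 or not 0 <= h < (1 << n):
        raise ValueError("h must be an n-bit value and n a multiple of 4")
    return [(h >> shift) & 0xF for shift in range(n - 4, -1, -4)]


def mux_select(group):
    """Upper three bits of a 4-bit group: the select lines of its 8:1 multiplexer."""
    return group >> 1


def group_lsb(group):
    """Lowest bit of a 4-bit group: tells the adder whether to add x(n)."""
    return group & 1


groups = split_into_groups(0b11101010, 8)
print(groups, [mux_select(g) for g in groups], [group_lsb(g) for g in groups])
```

For the 8-bit pattern 1110 1010, the two groups give select values 7 and 5, so the multiplexers pick 14x and 10x, and both group LSBs are 0.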
Adder and Shifter Unit: The intermediate results obtained from the multiplexer unit
are given to the adder and shifter unit. If the LS bit of the fixed group is '1', the
adder adds x(n) to the intermediate result; otherwise the intermediate result is passed
through without change. The shifter performs a constant shift whose count depends on
the position of the 4-bit group. As an example, for an 8-bit coefficient, the shifter
shifts the intermediate result for the MSB group by 4 bits to achieve the required
weighting of 2^4. This can be illustrated by the following expression.
(2^1 x + 2^3 x), 2^4 (2^1 x + 2^2 x + 2^3 x) (4.25)
Final Adder Unit: The accumulation of all the intermediate results from the adder and
shifter unit is handled in this unit.
y = 2^1 x + 2^3 x + 2^4 (2^1 x + 2^2 x + 2^3 x) (4.26)
Figure 4.5: Architecture of the proposed CSM PE1 (MB) for 16-bit coefficients
In the proposed CSM architecture (MB) for 8-bit coefficients, the values of the
coefficients are stored in the LUT. The number of multiplexer units required is n/4,
where n is the word-length of the filter coefficients. For example, consider the 8-bit
coefficient h = "0.11111111". This coefficient requires the maximum number of
additions and shifts since all bits in the coefficient are nonzero.
In this example, n = 8, and therefore the number of multiplexers required is two.
The output y = h * x is expressed as:
y = x + 2^1 x + 2^2 x + 2^3 x + 2^4 x + 2^5 x + 2^6 x + 2^7 x (4.27)
Note that the expression x + 2^1 x + 2^2 x + 2^3 x can be obtained from the
Shift-and-Add unit as follows:
[1 1 1 0] = x_10 = 2^1 x + 2^2 x + 2^3 x (4.29)
Since the intermediate sums shown for each sub group in equation (4.28) are odd, we
add x(n) to the intermediate result, namely x_10 + x. The shifter shifts the
intermediate result for the MSB group by 4 bits to achieve the required weighting of
2^4. Since the number of shifts is always constant irrespective of the value of the
coefficients, programmable shifters are not required and these shifts can be hardwired.
The following example shows the computation sequence in the CSM architecture for an
8-bit even coefficient (h = "0.11101010"). This coefficient requires two additions.
In this example, n = 8, and therefore the number of multiplexers required is two.
The output y = h * x is expressed as:
y = 2^1 x + 2^3 x + 2^5 x + 2^6 x + 2^7 x
The expressions 2^1 x + 2^3 x and 2^1 x + 2^2 x + 2^3 x can be obtained from the
Shift-and-Add unit as follows:
[1 0 1 0] = x_6 = 2^1 x + 2^3 x
[1 1 1 0] = x_10 = 2^1 x + 2^2 x + 2^3 x
Two 8:1 multiplexers are used for the two 4-bit groups of the 8-bit filter coefficient
in order to obtain the intermediate sums shown for each sub group in equation (4.31).
Since the intermediate sums are even, the shifter directly shifts the intermediate
result for the MSB group by 4 bits to achieve the required weighting of 2^4.
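The two 8-bit examples can be reproduced with a small behavioural model of the multiplier block. This is a sketch only: it treats the coefficient bits after the binary point as an unsigned 8-bit integer (so the 2^-8 scaling of the fractional coefficient is left out), it replaces the Shift-and-Add unit by a list of even multiples, and the function name is hypothetical.

```python
def csm_multiply_8bit(h, x):
    """Multiply x by an 8-bit coefficient magnitude h using two 4-bit groups."""
    # Stand-in for the Shift-and-Add unit outputs 0x, 2x, ..., 14x.
    even_multiples = [2 * k * x for k in range(8)]

    results = []
    for group in ((h >> 4) & 0xF, h & 0xF):      # MS group first, then LS group
        r = even_multiples[group >> 1]           # 8:1 multiplexer, 3-bit select
        if group & 1:                            # odd group: add x(n)
            r += x
        results.append(r)

    r_ms, r_ls = results
    return r_ls + (r_ms << 4)                    # hardwired shift by 4, final adder


print(csm_multiply_8bit(0b11111111, 1))  # all-ones coefficient
print(csm_multiply_8bit(0b11101010, 1))  # even coefficient from the second example
```

With x = 1 the two calls give 255 and 234, matching the sums of powers of two in the two worked examples (1 + 2 + ... + 128 and 2 + 8 + 32 + 64 + 128).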
The CSM architecture of PE1 (MB) for a coefficient word-length of 16 bits is
shown in figure 4.5. The 16-bit values of the filter coefficients are stored in the LUT
in sign-magnitude form, with the MSB reserved for the sign bit. The first bit after the
sign bit represents the integer part of the coefficient and the remaining 16 bits
represent the fractional part. Thus, each 16-bit coefficient is stored as an 18-bit
value in a single row of the LUT. Exploiting coefficient symmetry, only half the
number of coefficients needs to be stored in the LUT. The coefficient values
corresponding to 2^0 to 2^15 are partitioned into groups of 4 bits, and the three MS
bits of each group are connected to the select signals of the respective multiplexers
(multiplexer1 through multiplexer4). For an odd value in the fixed group, the
intermediate result from the multiplexer is added to x(n). For an even value in the
fixed group, the intermediate result from the multiplexer is passed through the adder
without change. Let r_1, r_2, r_3 and r_4 denote the outputs of Adder1 through Adder4,
respectively, as shown in figure 4.5. The shifts indicated between stages of adders in
figure 4.5 are obtained as follows:
y = r_1 + 2^4 r_2 + 2^8 r_3 + 2^12 r_4 (4.35)
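Equation (4.35) can be checked numerically with the Python sketch below. It makes several assumptions: r_1 is taken to be the result for the least significant 4-bit group because it carries unit weight in (4.35); the sign bit and binary point are ignored; and the two-stage grouping in final_adder_tree is only one possible way of arranging constant shifts between adder stages, not necessarily the arrangement of figure 4.5. All function names are hypothetical.

```python
def group_results(h, x, n=16):
    """Return [r_1, r_2, ...]: unweighted per-group results, LS group first."""
    even_multiples = [2 * k * x for k in range(8)]   # stand-in for Shift-and-Add unit
    results = []
    for shift in range(0, n, 4):
        group = (h >> shift) & 0xF
        r = even_multiples[group >> 1]               # 8:1 multiplexer
        if group & 1:                                # odd group: add x(n)
            r += x
        results.append(r)
    return results


def final_adder_flat(r):
    """Direct form of equation (4.35)."""
    r1, r2, r3, r4 = r
    return r1 + (r2 << 4) + (r3 << 8) + (r4 << 12)


def final_adder_tree(r):
    """Two-stage form (assumed): shifts of 4 inside each pair, then 8 between pairs."""
    r1, r2, r3, r4 = r
    return (r1 + (r2 << 4)) + ((r3 + (r4 << 4)) << 8)


r = group_results(0xA5C3, 1)
print(r, final_adder_flat(r), final_adder_tree(r))
```

Both forms use only constant shifts, so either can be hardwired; the tree form shortens the adder chain from three sequential additions to two levels.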
The design of PE2 employs a carry select adder (CSA) using DTSL in order to
achieve high performance and low power consumption. Skewed logic gates act as
static CMOS logic, but the PMOS and NMOS transistors are sized differently to
accelerate one of the transitions. To speed up the low-to-high transition, the sizes of
the NMOS transistors are reduced while the PMOS transistors are enlarged. To speed up
the high-to-low transition, the sizes of the PMOS transistors are reduced while the
NMOS transistors are enlarged. Figure 4.6(a) shows a DTSL circuit that achieves high
performance by using a dual data path in which the top data path has a different skew
direction from the bottom data path. The top data path is for a fast rising transition
input and the bottom data path is for a fast falling transition input. The arrows shown
within each block represent the direction of transition skew. A combiner is used to
detect the earliest transition between the two paths and latch the result for transfer
to the next stage.
Figure 4.6: Block diagram of (a) DTSL (b) Carry-select adder using DTSL [34]
In this section, the synthesis and design results of the proposed FIR filter
architecture using the low-complexity reconfigurable MB are presented. Xilinx ISE 9.1i
is used for synthesis. The target FPGA is Xilinx's Virtex-II family 2vp2fg256-6.
Table 4.1 shows the synthesis results of the proposed reconfigurable MB CSM
architecture that has a coefficient word-length of 8 bits. The effect of the MB for
various filter coefficient word lengths of 8 bits, 12 bits and 16 bits has also been
analyzed. The results are shown in Table 4.2, Table 4.3, and figure 4.7. It is noted that
as the precision of the filter coefficients increases, the area consumed on the FPGA
increases while the speed of operation decreases. Thus, the filter coefficient
word-length sets a trade-off: a shorter word-length yields reduced area and increased
speed for the FIR filter architecture.
Table 4.1: Synthesis result for the proposed CSM architecture for a coefficient word-length of 8 bits
2vp2fg256-6                   Area used by proposed PE1 (MB)   Utilization
Number of Slices              24 out of 1408                   1%
Number of Slice Flip Flops    14 out of 2816                   0%
Number of 4 input LUTs        22 out of 2816                   0%
Number of bonded IOBs         18 out of 140                    12%
Number of GCLKs               1 out of 16                      6%
Table 4.2: Synthesis results for the proposed CSM architecture with various coefficient word lengths
Table 4.3: Timing summary and power analysis for the proposed CSM architecture with various coefficient word lengths
Table 4.4: Comparison of performance for the proposed implementation and the existing reconfigurable implementation
Table 4.5: Comparison of performance for the proposed implementation and the existing reconfigurable implementation (Macro statistics)
Figure 4.7: Synthesis results for the proposed reconfigurable FIR filter using CSM MB with
various coefficient word lengths (chart title: Multiplier Block (PE1) with Different
Coefficient Word Lengths; axis label: Device Utilization Summary)
4.5 Discussions
The approach in [45] requires five 8:1 multiplexers and one 2:1 multiplexer
(equivalent to twenty one 2:1 multiplexers) and eight adders (three adders for the
Shift-and-Add unit and five adders for the final adder unit). For a quantitative
comparison, let us consider a 16-bit coefficient with an 8-bit quantized input signal.
The proposed reconfigurable CSM MB approach and the approaches in [45, 62, 69] have
been implemented on a Virtex 2 xc2vp2-6fg256 FPGA, and the complexities of the
Shift-and-Add unit, multiplexer unit and final adder unit, as well as the over-all
complexities, have been compared. The over-all complexity is the sum of the
complexities associated with the Shift-and-Add unit, multiplexer unit and final adder
unit, and that of the LUTs used for storing the coefficients. The over-all complexity
of the proposed reconfigurable CSM MB (PE1) is less than that of the approaches in
[45, 62, 69].
The approach in [54] presents the design optimization of one and two dimensional
fully pipelined computing structures for area, delay, and power efficient
implementation of FIR filters by systolic decomposition of distributed arithmetic (DA)
based inner product computation. The systolic decomposition scheme is found to
offer a flexible choice of the address length of the LUT for DA-based computation to
decide on a suitable area-time tradeoff. It is observed that using smaller address
lengths for the DA-based computing units makes it possible to reduce the memory size,
but leads to an increase in adder complexity and latency. In FIR filtering, one of the
convolving sequences is derived from the input samples while the other sequence is
derived from the fixed impulse response coefficients of the filter. This behavior of
the FIR filter makes it possible to use a DA-based technique for memory-based
realization. It yields faster output compared with multiplier accumulator based
designs because it stores pre-computed partial results in memory elements that can be
read out and accumulated to obtain the desired result. The memory requirement of a
DA-based implementation for FIR filters, however, increases exponentially with the
filter order. A LUT multiplier based approach is proposed in [55], where a subset of
pre-computed partial products (the odd products) is stored in memory elements.
Memory-based structures are well-suited for many DSP algorithms that involve
multiplication with a fixed set of coefficients. As mentioned above, memory
requirements are influenced by the filter order. The proposed CSM PE1 (MB) shown in
figure 4.3 may be modified to use a memory based LUT approach similar to [55] by
replacing the Shift-and-Add unit with a LUT.
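The exponential memory growth of DA can be seen in a textbook-style Python sketch of a DA inner product. This is not the systolic design of [54] or the LUT multiplier of [55]; it assumes integer coefficients, two's-complement input samples of a fixed bit width, and a single undivided LUT. Function names are hypothetical.

```python
def build_da_lut(coeffs):
    """LUT[a] holds the sum of coeffs[k] for every k whose bit is set in address a."""
    taps = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (address >> k) & 1)
        for address in range(1 << taps)        # 2**taps entries
    ]


def da_inner_product(lut, samples, bits):
    """Inner product of the coefficients with two's-complement samples, bit-serially."""
    acc = 0
    for b in range(bits):
        # Address = bit b of every sample, one address bit per tap.
        address = 0
        for k, s in enumerate(samples):
            address |= ((s >> b) & 1) << k
        term = lut[address] << b
        # The sign bit of a two's-complement sample carries negative weight.
        acc += -term if b == bits - 1 else term
    return acc


coeffs = [3, -1, 4, 2]
samples = [5, -7, 0, -128]
lut = build_da_lut(coeffs)
print(len(lut), da_inner_product(lut, samples, bits=8))
```

A filter with N taps needs 2^N LUT entries in this undivided form, which is why splitting the address into shorter segments trades memory for extra adders, as described for [54].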
4.6 Conclusion