366 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 2, FEBRUARY 2012

ACKNOWLEDGMENT

The authors would like to thank National Chip Implementation Center (CIC), Taiwan for technical support in simulations. The authors would also like to thank Y.-R. Cho and S.-W. Chen for their assistance in simulations and layouts.

Area-Efficient Parallel FIR Digital Filter Structures for Symmetric Convolutions Based on Fast FIR Algorithm

Yu-Chi Tsao and Ken Choi

Abstract: Based on fast finite-impulse response (FIR) algorithms (FFAs), this paper proposes new parallel FIR filter structures, which are beneficial to symmetric coefficients in terms of the hardware cost, under the condition that the number of taps is a multiple of 2 or 3. The proposed parallel FIR structures exploit the inherent nature of symmetric coefficients, reducing by half the number of multipliers in the subfilter section at the expense of additional adders in preprocessing and postprocessing blocks. Exchanging multipliers for adders is advantageous because adders weigh less than multipliers in terms of silicon area; in addition, the overhead from the additional adders in preprocessing and postprocessing blocks stays fixed and does not increase along with the length of the FIR filter, whereas the number of reduced multipliers increases along with the length of the FIR filter. For example, for a four-parallel 72-tap filter, the proposed structure saves 27 multipliers at the expense of 11 adders, whereas for a four-parallel 576-tap filter, the proposed structure saves 216 multipliers at the expense of 11 adders still. Overall, the proposed parallel FIR structures can lead to significant hardware savings for symmetric convolutions compared with the existing FFA parallel FIR filter, especially when the length of the filter is large.

Index Terms: Digital signal processing (DSP), fast finite-impulse response (FIR) algorithms (FFAs), parallel FIR, symmetric convolution, very large scale integration (VLSI).
I. INTRODUCTION

Due to the explosive growth of multimedia applications, the demand for high-performance and low-power digital signal processing (DSP) keeps increasing. Finite-impulse response (FIR) digital filters are among the most widely used fundamental components in DSP systems, ranging from wireless communications to video and image processing. Some applications, such as video processing, need the FIR filter to operate at high frequencies, whereas others, such as multiple-input multiple-output (MIMO) systems used in cellular wireless communication, require high throughput with a low-power circuit. Furthermore, when narrow transition-band characteristics are required, a much higher filter order is unavoidable. For example, a 576-tap digital filter is used in a video ghost canceller for broadcast television, which reduces the effect of multipath signal echoes. On the other hand, parallel processing and pipelining are two techniques used in DSP applications, both of which can be exploited to reduce the power consumption. Pipelining shortens the critical path by interleaving pipelining latches along the datapath, at the price of increasing the number of latches and the system latency, whereas parallel processing increases the sampling rate by replicating hardware so that multiple inputs can be processed in parallel and multiple outputs are generated at the same time, at the expense of increased area. Both techniques can reduce the power consumption by lowering the supply voltage when the sampling speed does not increase. In this paper, parallel processing in the digital FIR filter is discussed. Because its hardware implementation cost increases linearly with the block size $L$, the parallel processing technique loses its advantage in practical implementation. There have been a few papers

Manuscript received July 30, 2010; revised September 20, 2010, October 22,
2010; accepted November 20, 2010. Date of publication December 30, 2010;
date of current version January 18, 2012.
The authors are with Department of Electrical and Computer Engineering,
Illinois Institute of Technology, Chicago, IL 60616 USA (e-mail: ytsao@iit.edu;
kchoi@ece.iit.edu).
Digital Object Identifier 10.1109/TVLSI.2010.2095892

1063-8210/$26.00 2010 IEEE



proposing ways to reduce the complexity of the parallel FIR filter in the past [1]–[9]. In [1]–[4], polyphase decomposition is mainly manipulated, where the small-sized parallel FIR filter structures are derived first and then the larger block-sized ones can be constructed by cascading or iterating small-sized parallel FIR filtering blocks. Fast FIR algorithms (FFAs) introduced in [1]–[3] show that an $L$-parallel filter can be implemented using approximately $(2L - 1)$ subfilter blocks, each of which is of length $N/L$. FFA structures successfully break the constraint that the hardware implementation cost of a parallel FIR filter increases linearly with the block size $L$: they reduce the required number of multipliers from $L \times N$ to $(2N - N/L)$. In [5]–[9], the fast linear convolution is utilized to develop the small-sized filtering structures and then a long convolution is decomposed into several short convolutions, i.e., larger block-sized filtering structures can be constructed through iterations of the small-sized filtering structures.

However, in both categories of methods, the symmetry of coefficients has not yet been taken into consideration in the design of structures for symmetric convolutions, although it can lead to a significant saving in hardware cost. In this paper, we provide new parallel FIR filter structures based on FFA consisting of advantageous polyphase decompositions, which can reduce the number of multiplications in the subfilter section by exploiting the inherent nature of the symmetric coefficients, compared to the existing FFA fast parallel FIR filter structure. This paper is organized as follows. A brief introduction of FFAs is given in Section II. In Section III, the proposed parallel FIR filter structures are presented. Section IV investigates the complexity and comparisons. In Section V, the description of hardware implementation and the experimental results are shown. Section VI gives the conclusion.

II. FAST FIR ALGORITHM (FFA)

Consider an $N$-tap FIR filter which can be expressed in the general form as

$$y(n) = \sum_{i=0}^{N-1} h(i)\,x(n-i), \quad n = 0, 1, 2, \ldots, \infty \tag{1}$$

where $\{x(n)\}$ is an infinite-length input sequence and $\{h(i)\}$ are the length-$N$ FIR filter coefficients. Then, the traditional $L$-parallel FIR filter can be derived using polyphase decomposition as [3]

$$\sum_{p=0}^{L-1} Y_p(z^L)\,z^{-p} = \sum_{q=0}^{L-1} X_q(z^L)\,z^{-q} \sum_{r=0}^{L-1} H_r(z^L)\,z^{-r} \tag{2}$$

where $X_q = \sum_{k=0}^{\infty} z^{-k} x(Lk+q)$, $H_r = \sum_{k=0}^{(N/L)-1} z^{-k} h(Lk+r)$, and $Y_p = \sum_{k=0}^{\infty} z^{-k} y(Lk+p)$, for $p, q, r = 0, 1, 2, \ldots, L-1$. This FIR filtering equation shows that the traditional FIR filter will require $L^2$ FIR subfilter blocks of length $N/L$ for implementation.

A. 2 × 2 FFA ($L = 2$)

According to (2), a two-parallel FIR filter can be expressed as

$$Y_0 + z^{-1}Y_1 = (H_0 + z^{-1}H_1)(X_0 + z^{-1}X_1) = H_0X_0 + z^{-1}(H_0X_1 + H_1X_0) + z^{-2}H_1X_1 \tag{3}$$

implying that

$$\begin{aligned} Y_0 &= H_0X_0 + z^{-2}H_1X_1 \\ Y_1 &= H_0X_1 + H_1X_0. \end{aligned} \tag{4}$$

Equation (4) shows the traditional two-parallel filter structure, which will require four length-$N/2$ FIR subfilter blocks, two postprocessing adders, and in total $2N$ multipliers and $2N - 2$ adders. However, (4) can be written as

$$\begin{aligned} Y_0 &= H_0X_0 + z^{-2}H_1X_1 \\ Y_1 &= (H_0 + H_1)(X_0 + X_1) - H_0X_0 - H_1X_1. \end{aligned} \tag{5}$$

The implementation of (5) will require three FIR subfilter blocks of length $N/2$, one preprocessing and three postprocessing adders, and $3N/2$ multipliers and $3(N/2 - 1) + 4$ adders, which reduces the hardware cost by approximately one fourth compared with the traditional two-parallel filter from (4). The two-parallel ($L = 2$) FIR filter implementation using FFA obtained from (5) is shown in Fig. 1.

Fig. 1. Two-parallel FIR filter implementation using FFA.

B. 3 × 3 FFA ($L = 3$)

By a similar approach, a three-parallel FIR filter using FFA can be expressed as

$$\begin{aligned} Y_0 &= H_0X_0 - z^{-3}H_2X_2 + z^{-3}\left[(H_1 + H_2)(X_1 + X_2) - H_1X_1\right] \\ Y_1 &= \left[(H_0 + H_1)(X_0 + X_1) - H_1X_1\right] - \left(H_0X_0 - z^{-3}H_2X_2\right) \\ Y_2 &= \left[(H_0 + H_1 + H_2)(X_0 + X_1 + X_2)\right] - \left[(H_0 + H_1)(X_0 + X_1) - H_1X_1\right] - \left[(H_1 + H_2)(X_1 + X_2) - H_1X_1\right]. \end{aligned} \tag{6}$$

The hardware implementation of (6) requires six length-$N/3$ FIR subfilter blocks, three preprocessing and seven postprocessing adders, and $2N$ multipliers and $2N + 4$ adders, which reduces the hardware cost by approximately one third compared with the traditional three-parallel filter. The implementation obtained from (6) is shown in Fig. 2.

Fig. 2. Three-parallel FIR filter implementation using FFA.

III. PROPOSED FFA STRUCTURES FOR SYMMETRIC CONVOLUTIONS

To utilize the symmetry of coefficients, the main idea behind the proposed structures is intuitive: manipulate the polyphase decomposition to obtain as many subfilter blocks with symmetric coefficients as possible, so that half the multiplications in each such subfilter block can be reused for all of its taps, just as a set of symmetric coefficients requires only half the filter length in multiplications in a single FIR filter. Therefore, for an $N$-tap $L$-parallel FIR filter, the total
amount of saved multipliers would be the number of subfilter blocks that contain symmetric coefficients times half the number of multiplications in a single subfilter block, $N/(2L)$.

Fig. 3. Proposed two-parallel FIR filter implementation.

Fig. 4. Subfilter block implementation with symmetric coefficients.
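The savings rule stated above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original design flow: the function name `saved_multipliers` and its argument names are hypothetical, the count of additional symmetric subfilter blocks is supplied by the caller (for example, three for the four-parallel case discussed in Section III-C), and only the case of an even subfilter length $N/L$ is handled.

```python
def saved_multipliers(num_taps: int, parallelism: int, extra_symmetric_blocks: int) -> int:
    """Multipliers saved = (extra symmetric subfilter blocks) * N / (2L).

    Hypothetical helper for illustration; assumes N is a multiple of L and
    that the subfilter length N/L is even.
    """
    if num_taps % parallelism != 0:
        raise ValueError("number of taps must be a multiple of the block size L")
    subfilter_len = num_taps // parallelism
    if subfilter_len % 2 != 0:
        raise ValueError("this sketch only covers an even subfilter length N/L")
    # Each symmetric subfilter block needs only half of its N/L multipliers.
    return extra_symmetric_blocks * (subfilter_len // 2)


if __name__ == "__main__":
    # Four-parallel examples quoted in the abstract (three extra symmetric blocks).
    print(saved_multipliers(72, 4, 3))   # 27
    print(saved_multipliers(576, 4, 3))  # 216
```

The two printed values match the 27 and 216 saved multipliers quoted in the abstract for the four-parallel 72-tap and 576-tap filters.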

A. 2 × 2 Proposed FFA ($L = 2$)

From (4), a two-parallel FIR filter can also be written as

$$\begin{aligned} Y_0 &= \tfrac{1}{2}\left[(H_0 + H_1)(X_0 + X_1) + (H_0 - H_1)(X_0 - X_1)\right] - H_1X_1 + z^{-2}H_1X_1 \\ Y_1 &= \tfrac{1}{2}\left[(H_0 + H_1)(X_0 + X_1) - (H_0 - H_1)(X_0 - X_1)\right]. \end{aligned} \tag{7}$$

Fig. 5. Proposed three-parallel FIR filter implementation.

For a set of even symmetric coefficients, (7) yields one more subfilter block containing symmetric coefficients than (5), the existing FFA parallel FIR filter. Fig. 3 shows the implementation of the proposed two-parallel FIR filter based on (7).
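As a sanity check, both two-parallel forms, (5) and (7), can be compared numerically against a direct convolution. The sketch below is illustrative only: all function names are hypothetical, sequences are plain Python lists of integers so that the factor 1/2 is exact, and the block delay $z^{-2}$ is modeled as a shift by one polyphase sample. It also confirms that, for a symmetric even-length filter, $H_0 + H_1$ is symmetric and $H_0 - H_1$ is antisymmetric.

```python
import random


def conv(a, b):
    """Full linear convolution of two coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def add(a, b):
    return [p + q for p, q in zip(a, b)]


def sub(a, b):
    return [p - q for p, q in zip(a, b)]


def delayed_sum(a, b):
    """a + z^-2 * b in polyphase form: b is shifted by one block before adding."""
    return add(a + [0], [0] + b)


def split(seq):
    """Even- and odd-indexed polyphase components."""
    return seq[0::2], seq[1::2]


def existing_ffa(h, x):
    """Two-parallel outputs (Y0, Y1) following eq. (5)."""
    (h0, h1), (x0, x1) = split(h), split(x)
    p00, p11 = conv(h0, x0), conv(h1, x1)
    pss = conv(add(h0, h1), add(x0, x1))
    return delayed_sum(p00, p11), sub(sub(pss, p00), p11)


def proposed_ffa(h, x):
    """Two-parallel outputs (Y0, Y1) following eq. (7)."""
    (h0, h1), (x0, x1) = split(h), split(x)
    pss = conv(add(h0, h1), add(x0, x1))
    pdd = conv(sub(h0, h1), sub(x0, x1))
    p11 = conv(h1, x1)
    half_sum = [(s + d) // 2 for s, d in zip(pss, pdd)]   # H0*X0 + H1*X1
    half_diff = [(s - d) // 2 for s, d in zip(pss, pdd)]  # H0*X1 + H1*X0
    return delayed_sum(sub(half_sum, p11), p11), half_diff


if __name__ == "__main__":
    random.seed(0)
    half = [random.randint(-9, 9) for _ in range(12)]
    h = half + half[::-1]                       # 24-tap symmetric filter
    x = [random.randint(-9, 9) for _ in range(16)]
    y = conv(h, x)                              # reference output
    for structure in (existing_ffa, proposed_ffa):
        y0, y1 = structure(h, x)
        assert y0 == y[0::2] and y1 == y[1::2]
    h0, h1 = split(h)
    s, d = add(h0, h1), sub(h0, h1)
    assert s == s[::-1]                         # H0 + H1 is symmetric
    assert d == [-v for v in d[::-1]]           # H0 - H1 is antisymmetric
    print("eq. (5) and eq. (7) both match direct convolution")
```

Integer test data is a deliberate choice here: the bracketed sums in (7) are always even, so integer halving introduces no rounding and the comparison with the reference can be exact.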
An example is demonstrated here for a clearer perspective.

Example 1: Consider a 24-tap FIR filter with a set of symmetric coefficients

$$\{h(0), h(1), h(2), h(3), h(4), h(5), h(6), h(7), h(8), h(9), \ldots, h(23)\}$$

where $h(0) = h(23)$, $h(1) = h(22)$, $h(2) = h(21)$, $h(3) = h(20)$, $h(4) = h(19)$, $h(5) = h(18)$, $\ldots$, $h(11) = h(12)$. Applying these coefficients to the proposed two-parallel FIR filter structure, the top two subfilter blocks will be

$$H_0 \pm H_1 = \{h(0) \pm h(1),\; h(2) \pm h(3),\; h(4) \pm h(5),\; h(6) \pm h(7),\; \ldots,\; h(18) \pm h(19),\; h(20) \pm h(21),\; h(22) \pm h(23)\}$$

where

$$\begin{aligned} h(0) \pm h(1) &= \pm\left(h(22) \pm h(23)\right) \\ h(2) \pm h(3) &= \pm\left(h(20) \pm h(21)\right) \\ h(4) \pm h(5) &= \pm\left(h(18) \pm h(19)\right) \\ h(6) \pm h(7) &= \pm\left(h(16) \pm h(17)\right) \;\ldots \end{aligned} \tag{8}$$

As can be seen from the example above, two of the three subfilter blocks of the proposed two-parallel FIR filter structure, $H_0 - H_1$ and $H_0 + H_1$, now have symmetric coefficients, as shown in (8), which means that each of these subfilter blocks can be realized as in Fig. 4, with only half the number of multipliers required. Each multiplier output serves two taps. Note that the transposed direct-form FIR filter is employed. Compared to the existing FFA two-parallel FIR filter structure, the proposed FFA structure leads to one more subfilter block which contains symmetric coefficients. However, this comes at the price of more adders in the preprocessing and postprocessing blocks. In this case, two additional adders are required for $L = 2$.

Fig. 6. Comparison of subfilter blocks between existing FFA and the proposed FFA three-parallel FIR structures.

Fig. 7. Comparison of subfilter blocks between existing FFA and the proposed FFA four-parallel FIR structures.
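The multiplier sharing in a subfilter block with symmetric coefficients, as in Example 1, can be illustrated in software. The sketch below is an assumption-laden illustration rather than a model of Fig. 4: it uses a direct-form arrangement in which the two input samples sharing a coefficient are added (or subtracted) before the multiplication, whereas the paper employs the transposed direct form. The function names `fir_direct` and `fir_folded` are hypothetical.

```python
def fir_direct(h, x):
    """Reference FIR filter: one multiplication per tap per output sample."""
    return [
        sum(h[i] * x[n - i] for i in range(len(h)) if 0 <= n - i < len(x))
        for n in range(len(x) + len(h) - 1)
    ]


def fir_folded(h_half, x, sign=1):
    """FIR filter for h = h_half followed by sign * reversed(h_half).

    sign=+1 gives symmetric coefficients (like H0 + H1 in Example 1),
    sign=-1 gives antisymmetric coefficients (like H0 - H1).
    Only len(h_half) multiplications are used per output sample.
    """
    n_taps = 2 * len(h_half)

    def sample(m):
        # Input sample with zero padding outside the recorded range.
        return x[m] if 0 <= m < len(x) else 0

    out = []
    for n in range(len(x) + n_taps - 1):
        acc = 0
        for i, c in enumerate(h_half):
            # The two taps sharing coefficient c are combined before multiplying.
            acc += c * (sample(n - i) + sign * sample(n - (n_taps - 1 - i)))
        out.append(acc)
    return out


if __name__ == "__main__":
    h_half = [3, -1, 4, 1, 5, 9]
    x = [2, 7, 1, 8, 2, 8, 1, 8]
    symmetric = h_half + h_half[::-1]
    antisymmetric = h_half + [-c for c in h_half[::-1]]
    assert fir_folded(h_half, x) == fir_direct(symmetric, x)
    assert fir_folded(h_half, x, sign=-1) == fir_direct(antisymmetric, x)
    print("folded filters match the direct 12-tap filters with 6 multiplications each")
```

In both cases a 12-tap subfilter is computed with six coefficient multiplications per output, which is the halving that the proposed structures rely on.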

Fig. 8. Proposed four-parallel FIR filter implementation.

B. 3 × 3 Proposed FFA ($L = 3$)

With a similar approach, from (6), a three-parallel FIR filter can also be written as

$$\begin{aligned} Y_0 &= \tfrac{1}{2}\left[(H_0 + H_1)(X_0 + X_1) + (H_0 - H_1)(X_0 - X_1)\right] - H_1X_1 \\ &\quad + z^{-3}\Big\{(H_0 + H_1 + H_2)(X_0 + X_1 + X_2) - (H_0 + H_2)(X_0 + X_2) \\ &\qquad - \tfrac{1}{2}\left[(H_0 + H_1)(X_0 + X_1) - (H_0 - H_1)(X_0 - X_1)\right] - H_1X_1\Big\} \\ Y_1 &= \tfrac{1}{2}\left[(H_0 + H_1)(X_0 + X_1) - (H_0 - H_1)(X_0 - X_1)\right] \\ &\quad + z^{-3}\Big\{\tfrac{1}{2}\left[(H_0 + H_2)(X_0 + X_2) + (H_0 - H_2)(X_0 - X_2)\right] \\ &\qquad - \tfrac{1}{2}\left[(H_0 + H_1)(X_0 + X_1) + (H_0 - H_1)(X_0 - X_1)\right] + H_1X_1\Big\} \\ Y_2 &= \tfrac{1}{2}\left[(H_0 + H_2)(X_0 + X_2) - (H_0 - H_2)(X_0 - X_2)\right] + H_1X_1. \end{aligned} \tag{9}$$

Fig. 5 shows the implementation of the proposed three-parallel FIR filter. When the number of symmetric coefficients $N$ is a multiple of 3, the proposed three-parallel FIR filter structure presented in (9) enables four subfilter blocks with symmetric coefficients in total, whereas the existing FFA parallel FIR filter structure has only two out of six subfilter blocks. A comparison is shown in Fig. 6, where the shaded blocks stand for the subfilter blocks which contain symmetric coefficients. Therefore, for an $N$-tap three-parallel FIR filter, the proposed structure can save $N/3$ multipliers compared with the existing FFA structure. However, again, the proposed three-parallel FIR structure also brings an overhead of seven additional adders in preprocessing and postprocessing blocks.

C. Proposed Cascading FFA

The proposed cascading process for the larger block-sized proposed parallel FIR filter is similar to that introduced in [1]. However, a small modification is adopted here for lower hardware consumption. As we can see, the proposed parallel FIR structure enables the reuse of multipliers in parts of the subfilter blocks, but it also brings more adder cost in preprocessing and postprocessing blocks. When cascading the proposed FFA parallel FIR structures for a larger parallel block factor $L$, the increase in adders can become larger. Therefore, rather than applying the proposed FFA FIR filter structure to all the decomposed subfilter blocks, the existing FFA structures, which have more compact operations in preprocessing and postprocessing blocks, are employed for those subfilter blocks that contain no symmetric coefficients, whereas the proposed FIR filter structures are still applied to the rest of the subfilter blocks with symmetric coefficients. An illustration of the proposed cascading process for a four-parallel FIR filter ($L = 4$) is shown in Fig. 7, and the realization is shown in Fig. 8. From Fig. 7, it is clear that the proposed four-parallel FIR structure yields three more subfilter blocks containing symmetric coefficients than the existing FFA one, which means $3N/8$ multipliers can be saved for an $N$-tap FIR filter, at the price of 11 additional adders in preprocessing and postprocessing blocks. By this cascading approach, parallel FIR filter structures with larger block factor $L$ can be realized. The proposed six-parallel FIR filter will result in six more symmetric subfilter blocks than the existing FFA, equivalently $N/2$ multipliers saved for an $N$-tap FIR filter, at the expense of an additional 32 adders. Also, the proposed eight-parallel FIR filter will lead to seven more symmetric subfilter blocks than the existing FFA, equivalently $7N/16$ multipliers saved for an $N$-tap filter, with the overhead of an additional 54 adders.

IV. COMPLEXITY ANALYSIS AND COMPARISON

When an $L$-parallel FIR filter comes with a set of symmetric coefficients of length $N$, the number of required multipliers for the proposed parallel FIR filter structures is provided by (10) and (11).
Case 1: When $N / \prod_{i=1}^{r} L_i$ is even,

$$M = \frac{N}{\prod_{i=1}^{r} L_i}\left(\prod_{i=1}^{r} M_i - \frac{S}{2}\right). \tag{10}$$

Case 2: When $N / \prod_{i=1}^{r} L_i$ is odd,

$$M = \frac{N}{\prod_{i=1}^{r} L_i}\prod_{i=1}^{r} M_i - \frac{S}{2}\left(\frac{N}{\prod_{i=1}^{r} L_i} - 1\right). \tag{11}$$

$L_i$ is the small parallel block size, such as in the (2 × 2) or (3 × 3) FFA. $r$ is the number of FFAs used. $M_i$ is the number of subfilter blocks resulting from the $i$-th FFA. $S$ is the number of subfilter blocks containing symmetric coefficients. The number of required adders in the subfilter section is given by

$$A_{\mathrm{sub}} = \prod_{i=1}^{r} M_i \left(\frac{N}{\prod_{i=1}^{r} L_i} - 1\right). \tag{12}$$

A comparison between the proposed and the existing FFA structures for even symmetric coefficients of different lengths under different levels of parallelism is summarized in Table I. Also, a comparison between the proposed structures and other structures for a 144-tap FIR filter with parallel block sizes 4 and 8 is shown in Table II.

TABLE I
COMPARISON OF PROPOSED AND THE EXISTING FFA STRUCTURES: NUMBER OF REQUIRED MULTIPLIERS (M.), REDUCED MULTIPLIERS (R.M.), NUMBER OF REQUIRED ADDERS IN SUBFILTER SECTION (SUB.), NUMBER OF REQUIRED ADDERS IN PRE/POSTPROCESSING BLOCKS (PRE/POST.), AND NUMBER OF THE INCREASED ADDERS (I.A.)

TABLE II
COMPARISON OF STRUCTURES FOR A 144-TAP FIR FILTER: NUMBER OF REQUIRED MULTIPLIERS (M.), NUMBER OF REQUIRED ADDERS (A.), NUMBER OF REQUIRED DELAY ELEMENTS (D.)

TABLE III
COMPARISON OF AREA

TABLE IV
COMPARISON OF POWER

TABLE V
COMPARISON OF CRITICAL PATH DELAY

V. IMPLEMENTATION AND EXPERIMENTAL RESULT

The proposed FFA structures and the existing FFA structures are implemented in Verilog HDL with filter lengths of 24 and 72 and word lengths of 16 and 32 bits, respectively. Two sets of ideal low-pass FIR filter symmetric coefficients of length 24 and 72 are generated by MATLAB using the Remez exchange algorithm. The maximum absolute difference

(MAD) algorithm introduced in [1], [2] is used for coefficient quantization. The subfilter is based on the canonical signed digit (CSD) structure, and carry-save adders are used. Tables III, IV, and V show the results of area, power, and critical path delay, synthesized by Design Compiler [10] with 45-nm technology.

VI. CONCLUSION

In this paper, we have presented new parallel FIR filter structures, which are beneficial to symmetric convolutions when the number of taps is a multiple of 2 or 3. Multipliers account for the major portion of hardware consumption in the parallel FIR filter implementation. The proposed new structure exploits the nature of even symmetric coefficients and saves a significant number of multipliers at the expense of additional adders. Since multipliers outweigh adders in hardware cost, it is profitable to exchange multipliers for adders. Moreover, the number of increased adders stays fixed when the length of the FIR filter becomes large, whereas the number of reduced multipliers increases along with the length of the FIR filter. Consequently, the longer the FIR filter is, the more the proposed structures can save over the existing FFA structures with respect to the hardware cost. Overall, in this paper, we have provided new parallel FIR structures consisting of advantageous polyphase decompositions that deal with symmetric convolutions better than the existing FFA structures in terms of hardware consumption.

REFERENCES

[1] D. A. Parker and K. K. Parhi, "Low-area/power parallel FIR digital filter implementations," J. VLSI Signal Process. Syst., vol. 17, no. 1, pp. 75–92, 1997.
[2] J. G. Chung and K. K. Parhi, "Frequency-spectrum-based low-area low-power parallel FIR filter design," EURASIP J. Appl. Signal Process., vol. 2002, no. 9, pp. 444–453, 2002.
[3] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York: Wiley, 1999.
[4] Z.-J. Mou and P. Duhamel, "Short-length FIR filters and their use in fast nonrecursive filtering," IEEE Trans. Signal Process., vol. 39, no. 6, pp. 1322–1332, Jun. 1991.
[5] J. I. Acha, "Computational structures for fast implementation of L-path and L-block digital filters," IEEE Trans. Circuit Syst., vol. 36, no. 6, pp. 805–812, Jun. 1989.
[6] C. Cheng and K. K. Parhi, "Hardware efficient fast parallel FIR filter structures based on iterated short convolution," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 8, pp. 1492–1500, Aug. 2004.
[7] C. Cheng and K. K. Parhi, "Further complexity reduction of parallel FIR filters," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS 2005), Kobe, Japan, May 2005.
[8] C. Cheng and K. K. Parhi, "Low-cost parallel FIR structures with 2-stage parallelism," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 54, no. 2, pp. 280–290, Feb. 2007.
[9] I.-S. Lin and S. K. Mitra, "Overlapped block digital filtering," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 43, no. 8, pp. 586–596, Aug. 1996.
[10] Design Compiler User Guide, ver. B-2008.09, Synopsys Inc., Sep. 2008.

Low-Power and Area-Efficient Carry Select Adder

B. Ramkumar and Harish M Kittur

Abstract: Carry Select Adder (CSLA) is one of the fastest adders used in many data-processing processors to perform fast arithmetic functions. From the structure of the CSLA, it is clear that there is scope for reducing the area and power consumption in the CSLA. This work uses a simple and efficient gate-level modification to significantly reduce the area and power of the CSLA. Based on this modification, 8-, 16-, 32-, and 64-b square-root CSLA (SQRT CSLA) architectures have been developed and compared with the regular SQRT CSLA architecture. The proposed design has reduced area and power as compared with the regular SQRT CSLA, with only a slight increase in the delay. This work evaluates the performance of the proposed designs in terms of delay, area, power, and their products by hand with logical effort and through custom design and layout in 0.18-μm CMOS process technology. The results analysis shows that the proposed CSLA structure is better than the regular SQRT CSLA.

Index Terms: Application-specific integrated circuit (ASIC), area-efficient, CSLA, low power.

I. INTRODUCTION

Design of area- and power-efficient high-speed data path logic systems is one of the most substantial areas of research in VLSI system design. In digital adders, the speed of addition is limited by the time required to propagate a carry through the adder. The sum for each bit position in an elementary adder is generated sequentially only after the previous bit position has been summed and a carry propagated into the next position.

The CSLA is used in many computational systems to alleviate the problem of carry propagation delay by independently generating multiple carries and then selecting a carry to generate the sum [1]. However, the CSLA is not area efficient because it uses multiple pairs of Ripple Carry Adders (RCA) to generate the partial sum and carry by considering carry input Cin = 0 and Cin = 1; the final sum and carry are then selected by the multiplexers (mux).

The basic idea of this work is to use a Binary to Excess-1 Converter (BEC) instead of the RCA with Cin = 1 in the regular CSLA to achieve lower area and power consumption [2]–[4]. The main advantage of this BEC logic comes from the smaller number of logic gates than the n-bit Full Adder (FA) structure. The details of the BEC logic are discussed in Section III.

This brief is structured as follows. Section II deals with the delay and area evaluation methodology of the basic adder blocks. Section III presents the detailed structure and the function of the BEC logic. The SQRT CSLA has been chosen for comparison with the proposed design as it has a more balanced delay, and requires lower power and area [5], [6]. The delay and area evaluation methodology of the regular and modified SQRT CSLA are presented in Sections IV and V, respectively. The ASIC implementation details and results are analyzed in Section VI. Finally, the work is concluded in Section VII.

Manuscript received May 12, 2010; revised October 28, 2010; accepted De-
cember 15, 2010. Date of publication January 24, 2011; date of current version
January 18, 2012.
The authors are with the School of Electronics Engineering, VIT University,
Vellore 632 014, India (e-mail: ramkumar.b@vit.ac.in; kittur@vit.ac.in).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2010.2101621

1063-8210/$26.00 2011 IEEE
