Research Article

efficient FIR filters

Received on 19th February 2017
Revised 1st July 2017
Accepted on 13th August 2017
E-First on 4th December 2017
doi: 10.1049/iet-cds.2017.0058
www.ietdl.org

Ahmed Liacha (1,2), Abdelkrim K. Oudjida (1), Farid Ferguene (2), Mohammed Bakiri (1), Mohamed L. Berrandjia (1)

(1) Centre de Développement des Technologies Avancées, CDTA, System Architecture and Multimedia Division, Algiers, Algeria
(2) Université des Sciences et de la Technologie Houari Boumediene, USTHB, LRPE Laboratory, Algiers, Algeria

E-mail: a_oudjida@hotmail.com

Abstract: In a recent work, we have introduced a new multiple constant multiplication (MCM) algorithm, denoted as RADIX-2r. The latter exhibits the best results in speed and power, compared with the most prominent algorithms. In this paper, the area aspect of RADIX-2r is more specifically investigated. RADIX-2r is compared against area-efficient algorithms, notably the cumulative benefit heuristic (Hcub), known for its lowest adder cost. A number of benchmark FIR filters of growing complexity served for comparison. The results showed that RADIX-2r is better than Hcub in area, especially for high-order filters where the saving ranges from 1.50% up to 3.46%. This advantage is analytically proved and experimentally confirmed using a 65 nm CMOS technology. Area efficiency is achieved along with important savings in speed and power, ranging from 6.37% up to 38.01% and from 9.30% up to 25.85%, respectively. When MCM blocks are implemented alone, the savings are higher: 10.18%, 47.24%, and 41.27% in area, speed, and power, respectively. Most importantly, we prove that MCM heuristics using a similar addition pattern (A-operation with the same shift spans) as Hcub yield excessive bit-adder overhead in MCM problems of high complexity. As such, they are not competitive with RADIX-2r in high-order filters.

1 Background and motivation

The hardware complexity of FIR filters is dominated by multiple constant multiplication (MCM). The latter is an arithmetic operation that multiplies a set of fixed-point constants C_0, C_1, C_2, …, C_{M−1} with the same fixed-point variable X. To be efficiently implemented, i.e. rapid, compact, and low power, MCM must avoid costly multipliers. The hardware alternative is multiplierless, using only additions, subtractions, and left shifts. We assume that addition and subtraction have the same area/speed cost, and that the shift is costless since it can be realised without any gates, i.e. just by hardwiring. Therefore, the MCM problem is defined as the process of finding the minimum number of additions, and/or the minimum number of cascaded additions forming the critical path. The computational complexity of MCM is conjectured to be NP-hard [1]. Because the solution space to explore is so huge, optimal solutions require excessive runtime and become impractical even for MCM operations of medium complexity [1, 2]. Only MCM heuristics can respond in a reasonable amount of time, producing, however, suboptimal solutions.

Based on the radix-2^r arithmetic, we developed in a previous work a fully predictable heuristic for MCM, called RADIX-2r (see Table 2 in [3]). We obtained the only analytic bounds known so far for MCM in adder cost (Upb), average (Avg), and adder depth (Ath). The bounds have been extended to FIR filters to deal with the different bit lengths of the coefficients (see Table 3 in [3]). Additionally, RADIX-2r shows a sublinear runtime complexity, which means that it has no limitation regarding the size of the problem to be solved (see Table 7 in [3]). We have also demonstrated that the best MCM heuristics cannot compete with RADIX-2r in adder depth (see Section 3 in [3]). This explains the superiority of the speed/power results of RADIX-2r over the best MCM heuristics (see Tables 8 and 9 in [3]), whether they belong to the category of common subexpression elimination (CSE) or directed acyclic graphs (DAG).

In [4], the authors proposed a CSE method using the binary representation of the coefficients as opposed to the commonly used canonical signed digit (CSD) representation. It was denoted BSE, where the 'B' stands for binary. BSE was dedicated to high-order filters with the objective of achieving a lower adder cost without much increase (near optimal) in adder depth. The authors claimed that BSE was better in adder cost than the best CSE [5, 6] and DAG [7, 8] heuristics, including Hcub [8]. However, the work in [4] was severely criticised by Chang and Faust [9], who revealed inconsistency in the chain of arguments and several discrepancies in the quantitative figures used to justify the merit of BSE. Chang and Faust [9] recalculated the adder cost and adder depth for Hcub and the two best CSE heuristics, namely the contention resolution algorithm (CRA) [5] and the non-recursive signed common subexpression elimination (NR-SCSE) [6]. They showed experimentally that the adder cost of Hcub is very close to the lower bound defined in [10]. They came to the conclusion that there is very little room for other algorithms to achieve a lower adder cost than Hcub. They also claimed that the adder depth of Hcub decreases with increasing filter length.

Hcub, as one of the outstanding MCM heuristics in quality and runtime, still has some challenging competitors. DIFFAG [11] outperforms Hcub only in MCM problems with many small constants, as it essentially relies on the difference between constants, while H3 and H4 [12] beat Hcub except in the case of few large constants. In fact, H3 and H4 are modifications to Hcub which enable better exploitation of redundancy within constants [12]. Results in [11, 12] show that DIFFAG and H3 achieve slight savings on average over Hcub. H4 performs better than H3 but at the cost of an excessive runtime. Contrary to DIFFAG, H3, and H4, which are advantageous only for a narrow spectrum of MCM problem sizes, Hcub is fairly balanced [12] with a much larger span of performances, and is therefore adaptable to a vast range of applications. Owing to this, we focus more particularly on Hcub for the benchmarking comparisons.

Contrary to the claims of Chang and Faust [9], in this work we provide evidence that Hcub is not as unbeatable in area as it might appear. The reason is that the Hcub solution requires a set of adders with longer bit lengths. For high-order filters, the total number of bit adders outweighs the benefit of the lower adder cost. Besides, we prove that Hcub is not near optimal in cost for constant bit

© The Institution of Engineering and Technology 2017

length >20 bits. We also prove that Hcub adder depth increases with increasing filter length.

In addition to the RADIX-2r advantages cited earlier, the central point of this paper is to demonstrate the superiority of RADIX-2r over the existing heuristics in speed, power, and particularly area. For this purpose, a number of benchmark FIR filters of different complexities have been implemented in 65 nm complementary metal oxide semiconductor (CMOS). We also introduce the exact analytic bound of RADIX-2r at bit level, which is a very strong result in MCM.

This paper is organised as follows. In Section 2, a new lower bound in adder depth for RADIX-2r is defined. Section 3 compares RADIX-2r to state-of-the-art algorithms through a number of filters of different complexities. We introduce in Section 4 the bit-level version of RADIX-2r and show its efficiency over Hcub in the number of bit adders. The implementation results are presented and discussed in Section 5. Finally, Section 6 provides some concluding remarks and suggestions for future work.

2 RADIX-2r: a lower bound for adder depth

A non-negative N-bit constant C is expressed in RADIX-2r as

$$C = \sum_{j=0}^{\lceil (N+1)/r \rceil - 1} \left( c_{rj-1} + 2^{0} c_{rj} + 2^{1} c_{rj+1} + 2^{2} c_{rj+2} + \cdots + 2^{r-2} c_{rj+r-2} - 2^{r-1} c_{rj+r-1} \right) \times 2^{rj} = \sum_{j=0}^{\lceil (N+1)/r \rceil - 1} Q_j \times 2^{rj}, \qquad (1)$$

where $c_{-1} = c_N = 0$ and $r \in \mathbb{N}^*$. In (1), the 2's complement representation of C is split into $\lceil (N+1)/r \rceil$ slices ($Q_j$), each of r + 1 bit length [3]. Each pair of two contiguous slices has one overlapping bit. A digit set $DS(2^r)$ corresponds to (1), such that

$$Q_j \in DS(2^r) = \{-2^{r-1}, -2^{r-1}+1, \ldots, -1, 0, 1, \ldots, 2^{r-1}-1, 2^{r-1}\}.$$

The sign of the $Q_j$ term is given by the $c_{rj+r-1}$ bit, and $Q_j = 2^{k_j} \times m_j$, with $k_j \in \{0, 1, 2, \ldots, r-1\}$ and $m_j \in OM(2^r) \cup \{0, 1\}$, where $OM(2^r) = \{3, 5, 7, \ldots, 2^{r-1}-1\}$. $OM(2^r)$ is the set of odd positive digits in RADIX-2r recoding, with $|OM(2^r)| = 2^{r-2} - 1$.

Oudjida et al. [13, 14] proved an upper limit for adder depth equal to

$$\lceil \log_2 \lceil (N+1)/r \rceil \rceil + r - 2, \qquad (2)$$

where […] the same formula (2) holds for MCM but with […] r in (2) prevails over the term log as r increases (see Table 5 in [3]). Without increasing the adder cost, limit (2) can be significantly reduced using the following recurrence relation: $m_j = 2^p \pm d$, where $p \le r - 1$ and $0 < d < 2^{r-3}$. The construction process of the non-trivial partial products (PPs) is illustrated by Fig. 1 for r = 6.

Fig. 1 Sequential order of computation of the entire set of non-trivial PPs needed by RADIX-2r with r = 6 (RADIX-2^6). For RADIX-2^6, a maximum of 2^(6−2) − 1 = 15 additions are necessary, carried out in ⌈6/2⌉ − 1 = 2 steps in the worst case

Theorem 1: In RADIX-2r, the computation of the entire set of non-trivial PPs $3 \times X, 5 \times X, \ldots, (2^{r-1} - 1) \times X$ needs a maximum adder depth of $\lceil r/2 \rceil - 1$.

Proof: Proof by induction is used. Aided by Fig. 1, for r = 3, $OM(2^3) = \{3\}$, which requires $\lceil 3/2 \rceil - 1 = 1$ step. Now assume that for $OM(2^r)$, a maximum adder depth of $\lceil r/2 \rceil - 1$ is needed. Let us determine the adder depth for $OM(2^{r+1})$:

$$
\begin{aligned}
OM(2^{r+1}) &= \{3, 5, 7, \ldots, 2^{r-1}-1,\; 2^{r-1}+1, \ldots, 2^{(r+1)-1}-1\} \\
&= OM(2^r) \cup \{2^{(r+1)-2}+1, \ldots, 2^{(r+1)-1}-1\} \\
&= OM(2^r) \cup \{(2^{(r+1)-2}+1), (2^{(r+1)-2}+3), \ldots, (2^{(r+1)-2}+(2^{(r+1)-3}-1)), \\
&\qquad (2^{(r+1)-1}-(2^{(r+1)-3}-1)), \ldots, (2^{(r+1)-1}-3), (2^{(r+1)-1}-1)\}
\end{aligned}
$$

Note that the greatest term $2^{(r+1)-3} - 1 = 2^{r-2} - 1 \in OM(2^{r-1})$. Every new element is thus of the form $2^p \pm d$ with $d \in OM(2^{r-1}) \cup \{1\}$, so it needs one adder step more than $OM(2^{r-1})$, i.e. $(\lceil (r-1)/2 \rceil - 1) + 1 = \lceil (r+1)/2 \rceil - 1$. Since the starting value corresponds to r = 3 (Fig. 1), the adder step increases by one unit for each subsequent odd value of r. The even values of r require as many adder steps as r − 1. Thus, whether r is odd or even, the term $\lceil (r+1)/2 \rceil - 1$ holds as the maximum adder depth for $OM(2^{r+1})$.

Thus, in RADIX-2r, the upper limit in adder depth becomes

$$Ath_{//} = \lceil \log_2 \lceil (N+1)/r \rceil \rceil + \lceil r/2 \rceil - 1 \qquad (5)$$

and

$$Ath_{\cdots} = \lceil (N+1)/r \rceil + \lceil r/2 \rceil - 2 \qquad (6)$$

for parallel (binary tree) and serial implementations of the PPs, respectively.

Since each new non-trivial PP requires only one addition (recurrence relation), the maximum adder cost is the number of non-trivial PPs: $N_{om} = |OM(2^r)| = 2^{r-2} - 1$. Hence, the upper bound does not change. It is the same as in [3]:

$$Upb = M \times \lceil (N+1)/r \rceil + 2^{r-2} - 1 - M. \qquad (7)$$
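The recoding in (1) and the depth claim of Theorem 1 can be tried out with a short Python sketch. The function names (`radix_2r_digits`, `split_digit`, `odd_multiple_depths`) are hypothetical, and the depth search is a plain brute force over m = 2^p ± d that reuses already built smaller odd multiples. It is not the authors' construction and only checks that the bound ⌈r/2⌉ − 1 is respected. The input 10,599 with N = 14 and r = 4 is an assumption chosen so that the digits match the decomposition of 10,599 × X that the paper quotes from [13].

```python
from math import ceil


def radix_2r_digits(c: int, n_bits: int, r: int) -> list[int]:
    """Return the digits Q_j of equation (1) for a non-negative n_bits-bit constant c."""
    def bit(i: int) -> int:
        # c_{-1} = 0 and every bit at or above position n_bits is 0
        return (c >> i) & 1 if 0 <= i < n_bits else 0

    digits = []
    for j in range(ceil((n_bits + 1) / r)):
        base = r * j
        q = bit(base - 1)                                   # overlapping bit c_{rj-1}
        q += sum(bit(base + i) << i for i in range(r - 1))  # 2^0 c_{rj} ... 2^{r-2} c_{rj+r-2}
        q -= bit(base + r - 1) << (r - 1)                   # sign bit c_{rj+r-1}
        digits.append(q)
    return digits


def split_digit(q: int) -> tuple[int, int, int]:
    """Write a digit as sign * 2^k * m with m odd (or all zeros for q = 0)."""
    if q == 0:
        return 0, 0, 0
    sign, m, k = (1 if q > 0 else -1), abs(q), 0
    while m % 2 == 0:
        m //= 2
        k += 1
    return sign, k, m


def odd_multiple_depths(r: int) -> dict[int, int]:
    """Adder depth of each m in OM(2^r), building m = 2^p +/- d from smaller odd multiples."""
    depth = {1: 0}  # 1 * X is the input itself and costs no addition
    for m in range(3, 2 ** (r - 1), 2):
        candidates = []
        for p in range(1, r):
            d = abs(m - 2 ** p)
            if d in depth:
                candidates.append(depth[d] + 1)
        depth[m] = min(candidates)
    del depth[1]
    return depth


digits = radix_2r_digits(10599, 14, 4)
print(digits)                                            # [7, 6, -7, 3]
print(sum(q << (4 * j) for j, q in enumerate(digits)))   # 10599
print([split_digit(q) for q in digits])
print(max(odd_multiple_depths(6).values()))              # 2
```

The digit 6 splits into 2^1 × 3, so only the odd multiples 3X and 7X are needed for this constant, which is what the shared odd-multiple set OM(2^r) is meant to exploit.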

IET Circuits Devices Syst., 2018, Vol. 12 Iss. 1, pp. 1-11

Table 1 RADIX-2r: maximum number of additions (Upb) for a number of M non-negative constants with the same bit size N

N   M        10    50   100   200   300   400   500   600   700   800   900  1000
12  r         5     7     7     7     7     7     7     7     7     7     7     7
    Upb      27    81   131   231   331   431   531   631   731   831   931  1031
    Lbc      12    52   102   202   302   402   502   602   702   802   902  1002
16  r         6     6     6     9     9     9     9     9     9     9     9     9
    Upb      35   115   215   327   427   527   627   727   827   927  1027  1127
    Lbc      13    53   103   203   303   403   503   603   703   803   903  1003
20  r         6     7     7     7     7     7    11    11    11    11    11    11
    Upb      45   131   231   431   631   831  1011  1111  1211  1311  1411  1511
    Lbc      13    53   103   203   303   403   503   603   703   803   903  1003
24  r         5     7     7     9     9     9     9     9     9     9     9     9
    Upb      47   181   331   527   727   927  1127  1327  1527  1727  1927  2127
    Lbc      13    53   103   203   303   403   503   603   703   803   903  1003
32  r         6     7     9     9     9    11    11    11    11    11    11    11
    Upb      65   231   427   727  1027  1311  1511  1711  1911  2111  2311  2511
    Lbc      14    54   104   204   304   404   504   604   704   804   904  1004

$Upb = M \times \lceil (N+1)/r \rceil + 2^{r-2} - 1 - M$ with $r = 2W(\sqrt{M(N+1)}\,\log 2)/\log 2$; $Lbc = \lceil \log_2((N+1)/2) \rceil + M - 1$; W, Lambert function; $\lceil \cdot \rceil$, ceiling function.
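The entries of Table 1 can be spot-checked with a few lines of Python. This is a minimal sketch: `upb` and `lbc` are hypothetical helper names that transcribe the two footnote formulas, and r is passed in explicitly instead of being derived from the Lambert-function expression.

```python
from math import ceil, log2


def upb(m: int, n: int, r: int) -> int:
    """Upper bound on additions: M * ceil((N+1)/r) + 2^(r-2) - 1 - M."""
    slices = -(-(n + 1) // r)  # integer ceiling of (N+1)/r
    return m * slices + 2 ** (r - 2) - 1 - m


def lbc(m: int, n: int) -> int:
    """Lower bound on adder cost: ceil(log2((N+1)/2)) + M - 1."""
    return ceil(log2((n + 1) / 2)) + m - 1


# (N, M, r) triples consistent with Table 1
for n, m, r in [(12, 10, 5), (16, 200, 9), (24, 100, 7), (32, 1000, 11)]:
    print(f"N={n:2d} M={m:4d} r={r:2d}  Upb={upb(m, n, r):4d}  Lbc={lbc(m, n):4d}")
```

For large M the gap between Upb and Lbc is dominated by the term M × (⌈(N+1)/r⌉ − 1), which is why larger values of r pay off as M grows despite the 2^(r−2) − 1 odd multiples they require.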

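Table 2 RADIX-2r: maximum serial (Ath…) and parallel (Ath//) adder depth for a number of M non-negative constants with the same bit size N

N   M         10    50   100   200   300   400   500   600   700   800   900  1000
12  Ath…       4     4     4     4     4     4     4     4     4     4     4     4
    Ath//      4     4     4     4     4     4     4     4     4     4     4     4
    Lbd        3     3     3     3     3     3     3     3     3     3     3     3
16  Ath…       4     4     4     5     5     5     5     5     5     5     5     5
    Ath//      4     4     4     5     5     5     5     5     5     5     5     5
    Lbd        4     4     4     4     4     4     4     4     4     4     4     4
20  Ath…       5     5     5     5     5     5     6     6     6     6     6     6
    Ath//      4     5     5     5     5     5     6     6     6     6     6     6
    Lbd        4     4     4     4     4     4     4     4     4     4     4     4
24  Ath…       6     6     6     6     6     6     6     6     6     6     6     6
    Ath//      5     5     5     6     6     6     6     6     6     6     6     6
    Lbd        4     4     4     4     4     4     4     4     4     4     4     4
32  Ath…       7     7     7     7     7     7     7     7     7     7     7     7
    Ath//      5     6     6     6     6     7     7     7     7     7     7     7
    Lbd        5     5     5     5     5     5     5     5     5     5     5     5

$Ath_{\cdots} = \lceil (N+1)/r \rceil + \lceil r/2 \rceil - 2$; $Ath_{//} = \lceil \log_2 \lceil (N+1)/r \rceil \rceil + \lceil r/2 \rceil - 1$ with $r = 2W(\sqrt{M(N+1)}\,\log 2)/\log 2$; $Lbd = \lceil \log_2((N+1)/2) \rceil$.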
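The depth bounds (5) and (6) and the lower bound Lbd are easy to evaluate directly. The sketch below uses hypothetical helper names and takes r as an input; it also checks the observation that setting r = 2 in the parallel bound reproduces Lbd.

```python
from math import ceil, log2


def slices(n: int, r: int) -> int:
    """Number of radix-2^r digits: ceil((N+1)/r)."""
    return -(-(n + 1) // r)


def ath_parallel(n: int, r: int) -> int:
    """Equation (5): ceil(log2(ceil((N+1)/r))) + ceil(r/2) - 1."""
    return ceil(log2(slices(n, r))) + ceil(r / 2) - 1


def ath_serial(n: int, r: int) -> int:
    """Equation (6): ceil((N+1)/r) + ceil(r/2) - 2."""
    return slices(n, r) + ceil(r / 2) - 2


def lbd(n: int) -> int:
    """Lower bound on adder depth: ceil(log2((N+1)/2))."""
    return ceil(log2((n + 1) / 2))


for n, r in [(16, 6), (32, 6), (32, 11)]:
    print(f"N={n} r={r}: parallel={ath_parallel(n, r)} serial={ath_serial(n, r)} Lbd={lbd(n)}")

# With r = 2 the parallel bound coincides with the lower bound.
print(all(ath_parallel(n, 2) == lbd(n) for n in (12, 16, 20, 24, 32)))
```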

On the other side, Gustafsson [10] proved lower bounds for adder cost and depth, which are $Lbc = \lceil \log_2((N+1)/2) \rceil + M - 1$ and $Lbd = \lceil \log_2((N+1)/2) \rceil$, respectively. To show how far (5) and (6) are from the lower bound (Lbd), we first computed the values of r that give optimal Upb for N = 12–32 and M = 10–1000 (Table 1), which is the typical variation range when considering filters, though N does not generally exceed 24 bits [15]. Afterwards, we applied the values of r to (5) and (6) and collected the results in Table 2. Note that Ath… and Ath// are at most two steps greater than Lbd. Observe that for filter applications (N ≤ 24), Ath… needs at most one step more than Ath//. The gap between Ath// and Ath… becomes significant only for applications where N is high and M is low, such as, for instance, in the case N ≥ 32 and M ≤ 10. Notice also that starting from M = 200, Ath… and Ath// have the same value. More importantly, note that both Ath… and Ath// can exhibit optimal values as in the case N = 16 and M ≤ 200.

Significant conclusion: RADIX-2r leads to a near-optimal solution in adder depth without sacrificing the adder cost. Therefore, there is very little room for other heuristics to compete with it in adder depth. To the best of our knowledge, there exists no MCM heuristic for the time being that indicates how far from the minimum (Lbd) its solution is. The motivation for lower adder depth is not only speed, but power consumption as well. The reason is that shorter paths reduce the number of glitches, which are the dominant factor in power consumption [3, 16, 17]. □

3 Benchmarking of the best MCM heuristics

In what follows, RADIX-2r is compared with the best MCM algorithms through a set of FIR filters taken from different sources. To highlight the complexity of the filters, we opted for the following nomenclature: FIR_L_M_N, where L is the number of coefficients (H set) of the filter, M is the number of unique positive odd integer coefficients (Hmin set) to be solved in MCM, and N is the maximum bit length of the coefficients of the Hmin set. The products M × N and L × N serve to estimate the bit-level complexity (area) of the MCM block and the whole filter (MCM block plus the adder structure), respectively. Note that we deal with the transposed form of FIR filters.

The first set of filters is taken from [18]. The filter specifications are given in Table 2 of [18]. For easy identification, the number of the filter (x) is added in the nomenclature (FIRx_L_M_N). The Hmin sets of the eight filters have been kindly provided by Aksoy. We have also considered the filter that served for benchmarking in [3]. The RADIX-2r solutions are given by the online version available in [19]. To cope with


Table 3 Adder cost and adder depth of some low-/medium-order FIR filters

Filter               Lower bounds [10]  CSD [20]  RADIX-2r                       RAG-n [7]     Hcub [8]      NAIAD [21]    SIREN [21]
                     Cost  Depth(a)     Cost      Cost  Depth                    Cost  Depth   Cost  Depth   Cost  Depth   Cost  Depth
                           Max   Avg                    Max        Avg                 Max           Max           Max           Max
                                                        //   ⋯     //    ⋯
FIR_25_13_12 [3]     14    3     2.46   44        22    3    4     2.46  2.61    18    10      16    7       18    3       16    7
FIR4_30_14_13 [18]   14    3     2.64   53        23    3    3     2.64  2.64    23    5       18    7       18    7       17    8
FIR3_30_14_13 [18]   14    3     2.42   51        27    3    4     2.78  2.92    24    9       19    7       18    6       17    11
FIR1_40_19_12 [18]   20    3     2.31   65        33    3    4     2.57  2.63    24    10      23    7       22    9       22    10
FIR2_40_19_13 [18]   20    3     2.47   65        36    3    3     2.68  2.73    27    6       24    6       23    7       23    7
FIR7_40_19_14 [18]   20    3     2.63   72        35    3    4     2.84  2.94    28    7       24    7       23    12      23    10
FIR6_60_29_14 [18]   29    3     2.31   95        47    4    4     2.65  2.68    36    10      32    10      32    5       31    11
FIR8_80_36_14 [18]   37    3     2.36   118       51    4    4     2.66  2.72    40    5       38    5       37    6       37    6
FIR5_80_39_15 [18]   40    3     2.61   151       64    4    4     2.79  2.97    44    9       42    8       41    10      41    10
total                208   27    22.21  714       338   30   34    24.07 24.84   264   71      236   64      232   65      227   80
normalisation        1     1     1      3.43      1.62  1.11 1.25  1.08  1.11    1.26  2.62    1.13  2.37    1.11  2.41    1.09  2.96

The lower bounds in cost and depth are given by $\lceil \log_2((\min N_i + 1)/2) \rceil + M - 1$ and $\lceil \log_2((\max N_i + 1)/2) \rceil$, respectively.
(a) The same values apply to CSD. //, parallel implementation (binary tree); ⋯, serial implementation in cascaded adders; $\lceil \cdot \rceil$, ceiling function.
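The cost × depth trade-off used to rank the heuristics can be recomputed from the totals row of Table 3. The following is a minimal sketch: the dictionary layout and the name `tradeoff` are illustrative choices, and the RADIX-2r entry uses the parallel maximum depth.

```python
# Totals row of Table 3: (adder cost, maximum adder depth)
LOWER_BOUND = (208, 27)
TOTALS = {
    "CSD": (714, 27),        # depth equals the lower bound (table footnote)
    "RADIX-2r": (338, 30),   # parallel implementation
    "RAG-n": (264, 71),
    "Hcub": (236, 64),
    "NAIAD": (232, 65),
    "SIREN": (227, 80),
}


def tradeoff(cost: int, depth: int) -> float:
    """Normalised cost times normalised depth, both relative to the lower bounds."""
    return (cost / LOWER_BOUND[0]) * (depth / LOWER_BOUND[1])


ranking = sorted(TOTALS, key=lambda name: tradeoff(*TOTALS[name]))
for name in ranking:
    print(f"{name:9s} {tradeoff(*TOTALS[name]):.2f}")
```

Because the ratios are not rounded before they are multiplied, the printed products can differ from the quoted figures in the second decimal place, and the two nearly tied entries (Hcub and NAIAD) may swap places.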

various speed/area constraints, the latter allows three options: the 'adder cost' option reduces the adder cost according to the value of r_opt that optimises Upb (see the r1 formula in Table 3 of [3]); the 'adder depth' option reduces the adder depth by selecting the best solution given by r_opt, r_opt ± 1, and r_opt ± 2; and the trade-off option for a fixed value of r varying from 2 up to r_opt + 3. The different MCM solutions for the nine filters are reported in Table 3. The filters are arranged according to the increasing value of M × N.

On the one hand, note that RADIX-2r provides a near-optimal solution in adder depth for both parallel and serial implementations. The maximum depth does not exceed 4, as stated in Table 2 for 12 ≤ N ≤ 15, which is the variation range of N_i for the nine filters. While the CSD solution [20] achieves optimality in depth, it requires an excessive number of additions. We have already demonstrated that CSD is largely superseded by RADIX-2r in cost (see Table 5 in [3]). CSD is included because it is a standard technique which is still employed in designing the vast majority of small/medium SCM/MCM blocks, as well as for a number of merits already discussed in [3].

On the other hand, SIREN induces the lowest adder cost but also the highest adder depth. It is a well-known fact in MCM that decreasing the cost increases the depth, and vice versa. To see which heuristic yields the best trade-off, we calculated the cost × depth product of the normalised values with respect to the lower bounds. The lower the product, the better the trade-off. The results are reported in increasing order as follows: 1.79, 2.67, 2.67, 3.22, 3.30, and 3.43, for RADIX-2r, Hcub, NAIAD, SIREN, RAGn, and CSD, respectively. Thus, it is clear that RADIX-2r is by far the most balanced heuristic. Furthermore, lower depths can be attained by using the 'adder depth' option of the online version of RADIX-2r [19]. Optimal parallel depth (three steps) is achieved for FIR5, FIR6, and FIR8 with an increased cost of 87, 57, and 67, respectively. The normalised cost × depth gives 1.86 × 1 = 1.86, which is still the best trade-off. Another way to get the minimum depth is to realise that replacing r by 2 in the Ath// formula (5) gives $\lceil \log_2((N+1)/2) \rceil$, which corresponds to Gustafsson's lower bound (Lbd) [10]. Therefore, the idea is to reduce the cost first using the 'adder cost' option in [19], then look for the few coefficients whose Ath// exceeds Lbd and recode them in RADIX-2r with r = 2. The new costs are 74, 51, and 56 for FIR5, FIR6, and FIR8, respectively. The normalised cost × depth yields 1.71 × 1 = 1.71, which is the lowest calculated so far. Note that this solution is not yet integrated in [19].

To handle high-order filters in a reasonable amount of time, an MCM heuristic must exhibit a low runtime complexity, commonly expressed in M and N (see Table 7 in [3]). It is well known that the performance of CSE algorithms depends on the initial recoding of the coefficients, such as binary, signed digit (SD), minimum signed digit (MSD), CSD etc. Contrary to CSE, DAG algorithms make no assumption on the representation format. As a matter of fact, they offer a higher capability in optimising the cost as they explore a larger solution space. However, such flexibility requires a substantial runtime, which makes DAG less attractive for high-order filters. Although RAGn [7] belongs to the DAG category, it has a low runtime complexity. The reason is that RAGn relies on a precomputed lookup table of optimal single constant decompositions, currently limited to 19 bits [8]. This upper limit disqualifies RAGn when considering filters with bit lengths larger than 19. Moreover, it was shown in [8] that Hcub requires up to 20% fewer additions on average than RAGn. Hcub is a DAG algorithm that shows a somewhat reasonable time complexity, which makes it a potential candidate for high-order filters. SIREN and NAIAD, as DAG algorithms, are unable to handle filters with more than 20 and 160 coefficients, respectively, due to their exponential complexity [21]. Besides, they both use Hcub as the bounding heuristic in their optimisation process.

Concerning the two CSE algorithms CRA [5] and NR-SCSE [6], their analytic computation complexities are unknown. However, Xu et al. [5] reported the CPU time consumed by CRA and NR-SCSE for a number of constants (M) varying from 10 to 80 with a fixed constant size (N) of 11 bits. Both algorithms show a linear time increase with a very low slope regarding M. However, it is a well-established fact that MCM algorithms are much more sensitive to a variation of N than of M.

Consequently, for high-order filters, RADIX-2r is compared with Hcub, CRA, NR-SCSE, and the direct CSD, though the latter is not an effective MCM heuristic.

For comparison, we consider five high-order filters employed in the filter bank channeliser of digital advanced mobile phone systems (D-AMPS). The filter specifications are mentioned in [9] and their respective coefficients are taken from [15]. These five filters are symmetrical and have the highest number of taps (up to 695) and the largest bit length (up to 24) that we can find in [15] for benchmarking. The different MCM solutions are reported in Table 4.

The cost-oriented and depth-oriented values of RADIX-2r are obtained by running the online version [19] with the 'adder cost' and 'adder depth' options, respectively. The normalised cost × depth products of the five heuristics are ranked in ascending order as follows: 1.66, 2.13, 2.18, 2.85, and 4.64, for RADIX-2r, CRA, NR-SCSE, Hcub, and CSD, respectively. The value 1.39 × 1.20 = 1.66 corresponds to the cost-oriented option with a parallel implementation, whereas the depth-oriented option gives 1.54 × 1.10 = 1.69. The solution marked with footnote a in Table 4 yields a cost × depth product of 1.43 × 1 = 1.43, which is the lowest value that RADIX-2r can deliver for the set of five filters. The number of coefficients requiring RADIX-2r recoding with r = 2 is <1% in most cases.


Table 4 Adder cost and adder depth of some high-order FIR filters

Filter               Lower bounds [10]  CSD [20]      RADIX-2r cost oriented  RADIX-2r depth oriented        NR-SCSE [6]   CRA [5]       Hcub [8]
                     Cost      Depth    Cost   Depth  Cost    Depth           Cost       Depth               Cost   Depth  Cost   Depth  Cost   Depth
                                                              //      ⋯                  //       ⋯
FIR_279_140_24 [9]   141       4        729    4      239     4       5       239        4        5          355    4      346    4      158    26
FIR_418_208_22 [9]   208       4        1008   4      310     5       5       389/327(a) 4/4(a)   5/8(a)     474    4      466    4      212    9
FIR_516_256_24 [9]   256       4        1212   4      362     5       6       362/373(a) 5/4(a)   6/8(a)     575    4      562    4      259    9
FIR_631_313_23 [9]   313       4        1394   4      403     5       5       523/424(a) 4/4(a)   5/8(a)     647    4      632    4      315    6
FIR_695_345_24 [9]   345       4        1525   4      444     5       5       444/453(a) 5/4(a)   5/8(a)     706    4      693    4      348    6
total                1263      20       5868   20     1758    24      26      1957/1816  22/20    26/37      2757   20     2699   20     1292   56
normalisation        1         1        4.64   1      1.39    1.20    1.30    1.54/1.43  1.10/1   1.30/1.85  2.18   1      2.13   1      1.02   2.80

The lower bounds in cost and depth are given by $\lceil \log_2((\min N_i + 1)/2) \rceil + M - 1$ and $\lceil \log_2((\max N_i + 1)/2) \rceil$, respectively.
(a) Values are obtained by reducing the cost first using the 'adder cost' option in [19], then looking for coefficients whose Ath// exceeds Lbd and recoding them in RADIX-2r with r = 2.
//, parallel implementation (binary tree); ⋯, serial implementation in cascaded adders; $\lceil \cdot \rceil$, ceiling function.

Table 5

MCM           Lower bound [10] RADIX-2r             Hcub [8]
              Cost    Depth    Cost(a)   Depth(b)   Cost     Depth
MCM_27_12     29      3        42.53     3.10       29.02    8.36
MCM_28_12     30      3        43.71     3.10       30.02    8.18
MCM_29_12     31      3        44.22     3.10       30.82    8.72
MCM_30_12     32      3        45.39     3.11       31.68    8.58
MCM_32_12     33      3        47.86     3.11       33.46    7.78
MCM_96_16     100     4        164.94    4.00       104.72   17.02
MCM_102_16    104     4        172.14    4.00       109.82   17.56
MCM_105_16    107     4        174.28    4.00       112.20   17.22
MCM_105_16    107     4        174.28    4.00       112.20   17.22
MCM_109_16    111     4        175.18    4.00       115.58   17.48
MCM_135_20    138     4        289.16    4.00       178.86   22.42
MCM_187_20    191     4        387.34    4.08       231.40   24.06
MCM_234_20    238     4        469.10    4.12       275.78   24.42
MCM_243_20    244     4        482.54    4.28       285.14   25.54
MCM_271_20    274     4        523.74    4.80       310.40   25.74
MCM_140_24    146     4        355.94    5.00       243.32   26.44
MCM_208_24    217     4        495.30    5.00       315.64   28.87
MCM_256_24    263     4        594.38    5.00       377.18   32.39
MCM_313_24    318     4        713.55    5.06       441.48   32.65
MCM_345_24    353     4        779.27    5.46       473.93   32.97

(a) Values obtained by running the 'adder cost' option of [19].
(b) Serial depth.
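One way to read the table of randomly generated MCM cases (Table 5) is to look at how far the Hcub average cost sits above the lower bound for each bit length. The sketch below simply averages the ratio Hcub cost / lower-bound cost per value of N, using the figures of the table; the variable names and the choice of a plain arithmetic mean are illustrative assumptions, not part of the paper's method.

```python
from collections import defaultdict
from statistics import mean

# (N, lower-bound cost, Hcub average cost), one tuple per table row
ROWS = [
    (12, 29, 29.02), (12, 30, 30.02), (12, 31, 30.82), (12, 32, 31.68), (12, 33, 33.46),
    (16, 100, 104.72), (16, 104, 109.82), (16, 107, 112.20), (16, 107, 112.20), (16, 111, 115.58),
    (20, 138, 178.86), (20, 191, 231.40), (20, 238, 275.78), (20, 244, 285.14), (20, 274, 310.40),
    (24, 146, 243.32), (24, 217, 315.64), (24, 263, 377.18), (24, 318, 441.48), (24, 353, 473.93),
]

ratios_by_n = defaultdict(list)
for n, lower_bound, hcub_cost in ROWS:
    ratios_by_n[n].append(hcub_cost / lower_bound)

mean_ratio = {n: mean(values) for n, values in ratios_by_n.items()}
for n in sorted(mean_ratio):
    print(f"N={n}: Hcub cost / lower bound = {mean_ratio[n]:.3f}")
```

The mean ratio stays close to 1 for N = 12 and N = 16 and grows markedly for N = 20 and N = 24, which is the trend the text relies on when it states that Hcub loses near optimality in cost for N ≥ 20.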

For instance, in FIR_695_345_24, only two coefficients (169,127 and 10,415,657) among 345 need a special recoding with r = 2, whereas all the other coefficients (343) are recoded with r = 8 to minimise the adder cost (see the r1 formula in Table 3 of [3]). The serial implementation induces a product of 1.39 × 1.30 = 1.80. In any case, RADIX-2r is by far the most balanced solution.

CRA and NR-SCSE are two unbalanced heuristics, as they achieve optimality in depth at the price of an excessive increase in cost. Besides, there are no formal proofs in [5, 6] that guarantee optimality for any given coefficient set. CRA performs better in cost than NR-SCSE. RADIX-2r guarantees optimal solutions in depth with far fewer additions than CRA. For instance, in FIR_279_140_24, a saving of 31% over CRA is reached in cost. The saving is even higher (35%) in the case of near optimality such as in FIR_516_256_24.

Contrary to CRA and NR-SCSE, CSD ensures depth optimality for any coefficient set, as was analytically proved in [10]. However, CSD generates a prohibitive number of additions. For FIR_279_140_24, RADIX-2r provides a 67% saving over CSD.

Likewise, Hcub is also an unbalanced heuristic. It achieves near optimality in cost at the expense of an important increase in depth (up to 26 steps). In addition, disparate depth values are produced, suggesting that the depth decreases with an increasing filter length as conjectured (no proof) in [9]. Based on a limited number of filter cases, the authors in [9, 22] concluded that Hcub is near optimal in cost for any MCM complexity. To verify these claims, we have calculated the average cost and depth for 50 randomly generated MCM cases. The bit length varies from 12 to 24, whereas the number of constants (M) corresponds to the cardinality of the Hmin set of the D-AMPS filters used in [9]. We adopt the following nomenclature: MCM_M_N, where M is the number of constants and N is the maximum bit size of the constants. We have used the readily available source code of Hcub (version 14 January 2009) with the following command: synth -r 50 140 -b 24 -seed 1 -v 3. The same randomly generated constant sets are run with RADIX-2r for comparison (Table 5).

Note that contrary to the claims of [9], Hcub adder depth increases as the number of constants increases. With the exception of FIR_279_140_24 (Table 4), whose depth (26 steps) is very close to the average (26.44 steps), the four remaining filters are very special cases upon which no conclusion could be drawn regarding the adder depth, as mistakenly assumed by Chang and Faust [9]. As for the adder cost, while Hcub preserves near optimality for N ≤ 16, it loses it for N ≥ 20, contrary to the declarations of [9, 22].


Fig. 2 CSEA applied to RADIX-2r with Xb = 8 and r = 3
(a) Q_{j+1} = ±1 and Q_j = ±2^2, (b) Q_{j+1} = ±2^2 and Q_j = ±1

However, when considering the filters of Table 4, Hcub produces costs which are more or less near optimal for 22 ≤ N ≤ 24. The question is whether this is a pure coincidence, because a sample of only five filters is not representative, or whether it is due to the fact that in practice, the coefficient sets of digital filters are not randomly distributed. The non-uniformity of the coefficient values was mentioned in [23] and demonstrated for 200 filters with 9 bit quantised coefficients in [24]. In general, it can be verified by the fact that all filters have a min(1, 1/x) envelope due to the sin(x)/x function, which means that the coefficients are distributed near zero more frequently [15]. This question remains, however, an open research problem.

Significant conclusion: Using standard metrics (adder level), we have shown by comparison that RADIX-2r and Hcub are the best algorithms in depth and cost, respectively. However, the ultimate objective is to see how both algorithms perform in an actual circuit implementation (bit level).

4 Bit-level version of RADIX-2r

Bit-level heuristics yield better area results than adder-level heuristics, but at the expense of an excessive computational effort. This disqualifies them from tackling MCM problems of high complexity. In [25], an accurate bit-level model is used in the objective function of an exact MCM algorithm to estimate the area of each operation. Although better results are achieved, the algorithm requires an exponential runtime $O(2^{M \times N})$. Similarly, in [26], a detailed bit-level model is incorporated into the RAGn [7] algorithm to pursue a double objective: minimise the number of adders as well as the number of bit adders within each adder block. The area is made smaller but with a much higher runtime, knowing that RAGn complexity alone is $O(M^2 \times N^3 \times \log(M \times N))$. We introduce hereafter the bit-level version of RADIX-2r and show that it requires no more than $O(M \times N/r)$ execution time.

We consider the general MCM case $\{C_0, C_1, C_2, \ldots, C_{M-1}\} \times X$ where $C_i$ is a non-negative constant and X is represented in 2's complement format. However, in 2's complement arithmetic the sign of all operands needs to be extended to the bit length of the result before any operation. This leads to an important overhead in

[…]

left-shift span between successive PPs varies from r to 2r − 1 positions. By using the appropriate SE for each operation in the RADIX-2r solution, there will never be an overflow and the output word length is just enough to represent all possible outputs.

We define the following conditions for our bit-level model. The RADIX-2r solution of the MCM block is mapped on carry-propagate adders/subtractors (CPA/S) where the total area is linear to the number of adders/subtractors. The latter are, respectively, formed of a serial connection of full adders (FAs) and full subtractors (FSs). We assume that FA and FS have the same area/speed cost, which is a realistic approximation. From now on, we refer to both as FAs. Finally, CSEA is employed to reduce the SE overhead since the common variable X is in 2's complement format.

Unlike CSE and DAG algorithms, in RADIX-2r, the recoding of each constant is performed regardless of the others, but all constants share the same odd-multiple set $OM(2^r)$. This important feature also makes it possible to determine the bit-level complexity. To facilitate the understanding of the demonstration, we go through an illustrative SCM example (10,599 × X) taken from [13].

The recoding of 10,599 × X in RADIX-2r and Hcub gives:

$P_{RADIX} = X_1 + X_0 \times 2^5 - X_1 \times 2^8 + X_0 \times 2^{12}$, with $X_0 = 3 \times X = X \times 2 + X$ and $X_1 = 7 \times X = X \times 2^3 - X$.

$P_{Hcub} = X_2 \times 2^3 - X$, with $X_0 = 257 \times X = X \times 2^8 + X$, $X_1 = 265 \times X = X_0 + X \times 2^3$, and $X_2 = 1325 \times X = X_1 \times 2^2 + X_1$.

The product C × X requires a maximum of $Xb + \lceil \log_2 C \rceil$ bits. The implementation of $P_{RADIX}$ and $P_{Hcub}$ is depicted in Fig. 3 for Xb = 8. Although $P_{Hcub}$ requires four additions and $P_{RADIX}$ five, the total number of FAs in $P_{Hcub}$ is greater than in $P_{RADIX}$. $P_{Hcub}$ consumes 61 FAs (Fig. 3c), while $P_{RADIX}$ needs 50 and 57 FAs in the serial (Fig. 3b) and parallel (Fig. 3a) implementations, respectively. The reason why serial $P_{RADIX}$ consumes fewer FAs is the successive left shifts performed in the PP array. In parallel $P_{RADIX}$, a lower depth is obtained (three steps instead of four) by altering the regularity of the PP array (tree structure). This results in a substantial increase (14%) in the number of FAs.

speed, power, and area. To reduce the sign extension (SE) overhead To understand why Hcub consumes a high number of FAs, we

in MCM, many approaches have been proposed, notably the MCM need to determine the analytic bounds of FA overhead (Δ) in Hcub

block partitioning [27] and positive offset [28] methods which and RADIX-2r. The term Δ is the number of extra FAs that must be

exhibit the best results in speed and area, respectively. The concatenated to the basic Xb FAs to form a complete adder (Δ + Xb

conventional SE approach (CSEA) used traditionally in variable FAs). We use the A-operation introduced in [8] as a unified

multiplication (Y × X) offers rather a good compromise in MCM formalism of MCM. In fact, any MCM solution is simply a

[27], especially when both area and power are a concern. collection of A-operations linked together [12]. To highlight the bit

Comparatively with [27, 28], CSEA is very easy to be used: the SE level (Xb) in the A-operation, the Ci × X product is expressed as

is performed locally between successive PPs, assuring that at each

stage, the partial sum contains the sum of the sign bits of previous

PPs. In addition, CSEA is fully predictable, making possible the Ci × X = A u × X, v × X = 2l1 × u × X ± 2l2 × v × X . (8)

determination of the exact analytic bound of RADIX-2r at bit level.

Based on (1), two adjacent PPs Equation (8) admits an infinite number of solutions {u, v, l1, l2}

⋯ + 2r j + 1 × Q j + 1 × X + 2r j × Q j × X + ⋯ require an SE span that satisfy Ci × X. To explore the solution space in a reasonable

varying from 1 to 2r − 1 bits, corresponding to Q j + 1 = ± 1 with amount of time and maintain a finite size of adders, most MCM

Q j = ± 2r − 1, and Q j + 1 = ± 2r − 1 with Q j = ± 1, respectively. algorithms constrain A u × X, v × X ≤ 2Nmax + 1 × X, where

These two extreme configurations are exemplified in Fig. 2 for Xb Nmax = log2Cimax is the bit width of the largest constant (Cimax) in

an MCM operation [8, 29, 30]. Therefore, the maximum number of

= 8 and r = 3, where Xb is the bit size of X. Note that in RADIX-2r

FAs in Ci × X is Xb + Nmax + 1 (Δmax = Nmax + 1). Few exceptions,

with CSEA, the average SE span is r. Likewise, in RADIX-2r, the

6 IET Circuits Devices Syst., 2018, Vol. 12 Iss. 1, pp. 1-11

© The Institution of Engineering and Technology 2017

Fig. 3 Bit-level implementation of 10,599 × X using

(a) RADIX-2r parallel, (b) RADIX-2r serial, (c) Hcub

and $2^{N_{max}+2} \times X$ in H4 [12] and DIFFAG [11]. With an increased bound, H4 performs better than H3 in reducing the average number of adders, but at the cost of a significant run time [12] and a higher Δ ($\Delta_{max} = N_{max} + 2$).

Constraining $C_i \times X \le 2^{N_{max}+1} \times X$ in Hcub implies that $1 \le 2^{l_1} \times u,\ 2^{l_2} \times v \le 2^{N_{max}+1}$ for all u and v ($l_1$ and $l_2$ depend on u and v). This means that a smaller fundamental can be derived from bigger ones via a subtraction: $C_i \times X = 2^{l_1} \times u \times X - 2^{l_2} \times v \times X$. In this case, the number of FAs can be greater than the effective bit length of $C_i \times X$. In MCM, this construction process is even exaggerated since Hcub considers the impact of each possible intermediate fundamental on all target fundamentals to be implemented and chooses the one that yields the best cumulative benefit [22], even if smaller fundamentals are made of bigger ones.

In RADIX-2r recoding, the largest value of $m_j$ in $OM(2^r)$ is $2^{r-1} - 1$. Hence, the A-operation between successive PPs

$$\cdots \pm 2^{r(j+1) + k_{j+1}} \times m_{j+1} \times X \pm 2^{rj + k_j} \times m_j \times X \pm \cdots \qquad (9)$$

consumes at most $X_b + \lceil \log_2(2^{r-1} - 1) \rceil = X_b + r - 1$ FAs. Note that $\Delta_{max} = N_{max} + 1$ in Hcub is much bigger than in RADIX-2r ($\Delta_{max} = r - 1$). The reason is that $N_{max}$ is a function of log, while r is a function of $W(\log)$ [see (3) and (4) for SCM and MCM, respectively]. Note that W(x) < log(x) for any x > 2. Thus, the saving in Δ of RADIX-2r over Hcub is much more significant for large constants.

Based on experimental results, many papers [17, 25, 31] insisted on the fact that a smaller number of additions does not necessarily lead to a smaller number of FAs. However, to the best of our knowledge, none of them provided an in-depth analysis or a formal proof of the problem as we did above. This is a strong result that will set new directions for MCM.

In the example of Fig. 3 ($C_{max}$ = 10,599), $N_{max} = \lceil \log_2 10{,}599 \rceil = 14$, while (3) gives r = 4. For $X_b = 8$, the longest adders include up to 8 + 14 + 1 = 23 and 8 + 4 − 1 = 11 FAs for Hcub and RADIX-2r, respectively. In Fig. 3c, the longest adder comprises 22 FAs, while in Fig. 3b, it does not exceed 11 FAs. By comparison, the 22 FAs actually account for two adder blocks. The 22 FAs come from the subtraction of 1 × X from 10,600 × X to generate the final result 10,599 × X. Note that 10,599 × X < 10,600 × X; however, much greater fundamentals up to $2^{14+1} \times X$ = 32,768 × X can be used, yielding up to 23 FAs. Note that $\sum_{i=1}^{5} \Delta_i = 50 - 5 \times 8 = 10$ FAs in Fig. 3b, while it is $\sum_{i=1}^{4} \Delta_i = 61 - 4 \times 8 = 29$ FAs in Fig. 3c. To normalise the adder cost (Ncost), we divide the total number of FAs by $X_b$. It gives Ncost = 50/8 = 6.25 and Ncost = 61/8 = 7.62 for Figs. 3b and c, respectively. Thus, Hcub uses more than one normalised adder (7.62 − 6.25 = 1.37) more than the serial RADIX-2r solution.

Aided by Figs. 1 and 3, we determine hereafter the maximum number of FAs needed by C × X for a serial implementation in RADIX-2r.

In RADIX-2r, the PPs require $\lceil (N+1)/r \rceil - 1$ additions. Each addition consumes at most $X_b + r - 1$ FAs. Hence, the maximum number of FAs needed by the PP array is

$$FA_{pp} = \left(\lceil (N+1)/r \rceil - 1\right) \times (X_b + r - 1). \qquad (10)$$

It can be easily proved that the odd-multiple tree (Fig. 1) consumes at most $X_b \times 2^{r-3}$ and $X_b \times (2^{r-3} - 1) + (r-2) \times 2^{r-3} - 1$ FAs in the addition and subtraction parts, respectively. Thus, the maximum number of FAs required by the odd-multiple tree is

$$FA_{om} = X_b \times (2^{r-2} - 1) + (r-2) \times 2^{r-3} - 1. \qquad (11)$$

Hence, C × X needs no more than $FA_{pp} + FA_{om}$. For a 14 bit constant like 10,599, (3) gives r = 4. With $X_b = 8$, $FA_{pp} + FA_{om}$ yields 33 + 27 = 60 FAs. Note that 10,599 × X needs 50 FAs.

The SCM formulas are straightforwardly extended to MCM as follows. In the case of M non-negative constants with the same bit size N, the upper bound in FAs is $M \times FA_{pp} + FA_{om}$, with r given by (4). However, in the case of M constants with different bit sizes $N_i$, such as in the FIR filters, the upper limit is equal to $FA_{pp} + FA_{om}$, where

$$FA_{pp} = \sum_{i=0}^{M-1} \left(\lceil (N_i+1)/r \rceil - 1\right) \times (X_b + r - 1), \qquad (12)$$

$FA_{om}$ is given by (11), and

$$r = 2 \times W\!\left(\sum_{i=0}^{M-1} (N_i + 1) \times \log 2\right)/\log 2. \qquad (13)$$

Let us take the smallest filter FIR_25_13_12 as an example. Equation (13) gives r = 5. With $X_b = 16$, and replacing r by 5 in (12) and (11), we have a maximum of 340 + 123 = 463 FAs. The RADIX-2r solution [19] for this filter gives 22 additions. Therefore, we are sure that the MCM block comprising 22 additions can be implemented with at most 463 FAs (upper bound). However, the serial implementation requires only 393 FAs (Table 6). Note that RADIX-2r induces the lowest $\sum_{i=1}^{22} \Delta_i$ (41


Table 6 MCM solutions for fir_25_13_12: comparison in FAs

| Algorithm | Cost | FAs | $\sum_{i=1}^{Cost} \Delta_i$ | Total FAs | Ncost | Ncost − cost |
|---|---|---|---|---|---|---|
| Hcub [8] | 16 | 256 | 103 | 359 | 22.43 | 6.43 |
| Hcub+ [8] (a) | 18 | 288 | 93 | 381 | 23.81 | 5.81 |
| RAG-n [26] | 18 | 288 | 98 | 386 | 24.12 | 6.12 |
| BHM [26] | 20 | 320 | 118 | 438 | 27.37 | 7.37 |
| Pasko [26] | 23 | 368 | 136 | 504 | 31.50 | 8.50 |
| C1 [26] | 19 | 304 | 104 | 408 | 25.50 | 6.50 |
| DA-MST [26] | 19 | 304 | 110 | 414 | 25.87 | 6.87 |
| RFAG-n [26] | 17 | 272 | 105 | 377 | 23.56 | 6.56 |
| RADIX-2r | 22 | 352 | 41 | 393 | 24.56 | 2.56 |

(a) Hcub+ is the Hcub version with minimal adder depth. $\sum_{i=1}^{Cost} \Delta_i$ is the total FA overhead. Ncost is the normalised cost (number of FAs divided by $X_b = 16$).
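The columns of Table 6 are linked by simple arithmetic, which the short Python sketch below reproduces for three of the rows. The function name `fa_summary` is a hypothetical helper introduced only for this illustration; the inputs (adder cost and total overhead) are taken from the table, and $X_b = 16$ comes from its footnote.

```python
XB = 16  # input word length X_b, as stated in the Table 6 footnote


def fa_summary(cost, overhead, xb=XB):
    """Return (base FAs, total FAs, Ncost, Ncost - cost) for one MCM solution.

    cost     -- number of adders/subtractors in the solution
    overhead -- sum of the extra FAs (sum of Delta_i) over all adders
    """
    base = cost * xb            # X_b FAs per adder before any overhead
    total = base + overhead     # total number of FAs
    ncost = total / xb          # normalised cost
    return base, total, ncost, ncost - cost


# (cost, overhead) pairs copied from Table 6
rows = {"Hcub": (16, 103), "Pasko": (23, 136), "RADIX-2r": (22, 41)}
for name, (cost, overhead) in rows.items():
    base, total, ncost, gap = fa_summary(cost, overhead)
    print(f"{name:9s} FAs={base} total={total} Ncost={ncost:.4f} gap={gap:.4f}")
```

The unrounded values (for example 22.4375 for Hcub) agree with the table once cut to two decimals.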

| Filter | r (13) | FAom (11) | FApp (12) | FAom + FApp (max. FAs) | Adder cost |
|---|---|---|---|---|---|
| FIR_279_140_24 [9] | 8 | 1199 | 4623 | 5822 | 239 |
| FIR_418_208_22 [9] | 8 | 1199 | 5957 | 7156 | 310 |
| FIR_516_256_24 [9] | 8 | 1199 | 7268 | 8467 | 362 |
| FIR_631_313_23 [9] | 8 | 1199 | 8303 | 9502 | 403 |
| FIR_695_345_24 [9] | 8 | 1199 | 9200 | 10,399 | 444 |
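The bounds (10)-(12) and the longest-adder estimates of Section 4 can be evaluated with a few lines of Python. This is only a sketch: the function names are hypothetical, and the grouping of terms in (10) and (11) is inferred from the numerical values quoted in the text and in the table above (33, 27, 123, and 1199), since the printed formulas lost their brackets.

```python
from math import ceil


def fa_pp(bit_sizes, r, xb):
    """Bound (12) on the FAs of the PP array; with a single constant it is (10)."""
    return sum((ceil((n + 1) / r) - 1) * (xb + r - 1) for n in bit_sizes)


def fa_om(r, xb):
    """Bound (11) on the FAs of the odd-multiple tree (assumes r >= 3)."""
    return xb * (2 ** (r - 2) - 1) + (r - 2) * 2 ** (r - 3) - 1


def longest_adder_hcub(xb, n_max):
    """Longest adder under the Hcub bound: X_b + Delta_max, Delta_max = N_max + 1."""
    return xb + n_max + 1


def longest_adder_radix(xb, r):
    """Longest adder in RADIX-2r: X_b + Delta_max, Delta_max = r - 1."""
    return xb + r - 1


# SCM example 10,599 x X: N = 14, r = 4, X_b = 8
print(fa_pp([14], 4, 8), fa_om(4, 8))                         # PP array and odd-multiple tree
print(longest_adder_hcub(8, 14), longest_adder_radix(8, 4))   # Hcub versus RADIX-2r
# Odd-multiple tree for the filters of the table above: r = 8, X_b = 16
print(fa_om(8, 16))
```

Evaluating (12) for a complete filter would additionally need the bit size of every coefficient, which is not listed here, so only the single-constant case is exercised.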

FAs), and that some heuristics with lower adder cost (BHM, C1, and DA-MST) yield more FAs than RADIX-2r. As RADIX-2r induces a very low $\sum_{i=1}^{Cost} \Delta_i$, the number of adders (Cost), which serves as an abstraction for the logic cost, is more credible in RADIX-2r than in other algorithms. When the cost is compared with the normalised cost (Ncost), RADIX-2r exhibits by far the lowest gap (2.56).

To provide an idea of the variation scope of consumption, we calculated the maximum number of FAs required by the high-order filters (Table 7). It is important to note that r = 8 is the value that ensures the optimum (minimum) for both $FA_{pp} + FA_{om}$ and the adder cost. This is a very strong result as there exists no MCM heuristic for the time being that is predictable at the bit level (FAs) or even at adder level.

Significant conclusion: On the one hand, we have proved in Section 2 that for the typical variation range of the constant bit size (12–24) in filtering, Ath… ≤ Ath / / + 1 (Table 2). On the other hand, we have also shown that parallel implementation yields a significant increase in FAs. Furthermore, the serial structure is highly regular, leading therefore to a shorter routing (reduced delay) and more compact area. Consequently, in RADIX-2r, the serial implementation stands as the best option for the design of FIR filters.

5 Experimental results

We have shown in our recent paper [3] using 180 nm CMOS technology that RADIX-2r gives the best results in speed and power, compared with the most prominent MCM heuristics. The reason is its optimality in adder depth. Hereafter, RADIX-2r is compared with Hcub, which is one of the best heuristics in area. Based on the conclusion above, only the serial implementation in RADIX-2r is considered.

All filters of Tables 3 and 4 were coded in Verilog according to the bit description thoroughly depicted in Fig. 3. Note that we consider the whole architecture of the transposed form of FIR filter, i.e. the MCM block plus adder structure. The latter is exactly the same in both implementations for each filter. The RADIX-2r recoding is generated using the online tool [19] with the ‘adder cost’ option. The input data word length ($X_b$) is fixed to 16 bits for all filters. The generated filters are mapped to the UMC 65 nm standard-cell library using Cadence RTL compiler. The synthesis tool was constrained to a relaxed constraint of 50 ns. The place and route was performed using Cadence SoC Encounter (version EDI 14). Dynamic power consumption was evaluated with 2000 random input samples at 25 MHz frequency. The post-layout results in speed, power, and area are reported in Table 8.

As a practical illustration of how RADIX-2r succeeds in beating Hcub in area in spite of its higher number of additions, we have implemented the two small illustrative examples (10,599 × X) of Figs. 3b and c. They consume five and four adder blocks with a total of 50 and 61 FAs (18% saving over Hcub), respectively. Note that both examples exhibit the same adder depth (four steps). Results in Table 9 confirm the relevance of the bit-level (FAs) model over the adder-level model in predicting the area occupation.

In low-order filters, Hcub is slightly better than RADIX-2r. However, in this case, the gain margin in area is insignificant. In medium-order filters, the opposite is rather true, with small improvements in area over Hcub. In high-order filters, RADIX-2r yields better results than Hcub in all cases, with an area saving ranging from 1.5 up to 3.46%. Although the cost for all filters in Hcub is very close to the lower bound (Tables 3 and 4), the saving in area increases as $\Delta_{max}$ in Hcub increases. This is the experimental proof that FA overhead outweighs the cost in Hcub (the solution employs fewer adder blocks, but with a much larger size).

If we take FIR_279_140_24 as an example (Table 4), the RADIX-2r and Hcub solutions for the MCM blocks give 239 and 158 adders, respectively. Although RADIX-2r consumes 81 adders more than Hcub, the 158 adders of Hcub require more FAs than their counterparts in RADIX-2r. The reason is that $\Delta_{max}$ = 24 + 1 = 25 FAs in Hcub, whereas it is only $\Delta_{max}$ = 8 − 1 = 7 FAs in RADIX-2r. Therefore, the longest adder in Hcub consumes up to 16 + 25 = 41 FAs, while it does not exceed 16 + 7 = 23 FAs in RADIX-2r. Based on (11)–(13), we are sure that the MCM block in RADIX-2r consumes fewer than 4623 + 1199 = 5822 FAs, while it needs more than 5822 FAs in Hcub since RADIX-2r achieves 2.51% saving in area over Hcub (Table 7). The adder depths in RADIX-2r and Hcub are 5 and 26 steps, respectively. This explains the high savings in speed and power, which are 38.01 and 25.85%, respectively.

We also compared RADIX-2r with the Hcub version with minimal depth (Hcub+), which is available at www.spiral.com. Hcub+ guarantees a minimal depth but with a substantial increase in cost and runtime, depending on how far the Hcub solution is from the lower bound (Table 10). Note that the runtime complexity of Hcub+ is unknown. Compared with Hcub+, RADIX-2r is better in


Table 8 RADIX-2r versus Hcub: post-layout implementation results in 65 nm CMOS of a number of FIR filters

| Filter | Delay (a), ns: RADIX-2r | Delay (a), ns: Hcub [8] | Power (b), mW: RADIX-2r | Power (b), mW: Hcub [8] | Area (c), µm2: RADIX-2r | Area (c), µm2: Hcub [8] | Delay saving, % | Power saving, % | Area saving, % |
|---|---|---|---|---|---|---|---|---|---|
| Low-order filters | | | | | | | | | |
| FIR_25_13_12 [3] | 12.208 | 14.300 | 0.6106 | 0.6850 | 14,954.76 | 14,718.60 | 14.62 | 10.86 | −1.60 |
| FIR4_30_14_13 [18] | 11.960 | 12.777 | 0.7510 | 0.8058 | 17,652.60 | 17,628.12 | 6.39 | 6.80 | −0.13 |
| FIR3_30_14_13 [18] | 13.501 | 17.308 | 0.6973 | 0.7936 | 16,632.72 | 16,315.92 | 21.99 | 12.13 | −1.94 |
| FIR1_40_19_12 [18] | 13.567 | 15.694 | 1.0216 | 1.1632 | 24,260.76 | 24,189.84 | 13.55 | 12.17 | −0.29 |
| FIR2_40_19_13 [18] | 13.615 | 15.987 | 1.0199 | 1.2254 | 25,520.40 | 25,209.72 | 14.83 | 16.77 | −1.23 |
| FIR7_40_19_14 [18] | 13.308 | 14.065 | 1.0432 | 1.1709 | 25,040.52 | 25,540.20 | 5.38 | 10.90 | 1.95 |
| average | 13.026 | 15.021 | 0.8572 | 0.9739 | 20,676.96 | 20,600.40 | 12.79 | 11.60 | −0.54 |
| Medium-order filters | | | | | | | | | |
| FIR6_60_29_14 [18] | 14.182 | 17.578 | 1.5471 | 1.8441 | 37,182.24 | 37,601.28 | 19.31 | 16.10 | 1.11 |
| FIR8_80_36_14 [18] | 14.882 | 14.944 | 1.9851 | 2.2344 | 47,372.76 | 47,295.72 | 0.41 | 11.15 | −0.16 |
| FIR5_80_39_15 [18] | 14.732 | 18.257 | 2.1998 | 2.6870 | 50,850.72 | 51,538.68 | 19.30 | 18.13 | 1.33 |
| average | 14.598 | 16.926 | 1.9106 | 2.2551 | 45,135.24 | 45,478.56 | 13.00 | 15.12 | 0.76 |
| High-order filters | | | | | | | | | |
| FIR_279_140_24 [9] | 19.838 | 32.005 | 11.4926 | 15.5008 | 217,743.12 | 223,366.68 | 38.01 | 25.85 | 2.51 |
| FIR_418_208_22 [9] | 20.884 | 24.678 | 16.3238 | 20.0545 | 315,236.88 | 320,047.20 | 15.37 | 18.60 | 1.50 |
| FIR_516_256_24 [9] | 19.010 | 22.503 | 21.4947 | 27.5732 | 381,966.12 | 393,904.44 | 15.52 | 22.04 | 3.03 |
| FIR_631_313_23 [9] | 18.747 | 21.433 | 25.6591 | 30.2111 | 460,180.80 | 473,360.04 | 12.53 | 15.06 | 2.78 |
| FIR_695_345_24 [9] | 20.569 | 21.969 | 29.9247 | 32.9955 | 505,337.04 | 523,501.20 | 6.37 | 9.30 | 3.46 |
| average | 19.809 | 24.517 | 20.9789 | 25.2670 | 376,092.79 | 386,835.91 | 17.56 | 18.17 | 2.65 |

(a) Minimum clock period.
(b) Total dynamic power dissipation.
(c) Total area.
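The saving columns of Table 8 appear to be relative differences with respect to the Hcub value. The sketch below recomputes them for FIR_279_140_24 under that assumption; `saving_percent` is a hypothetical helper name, not something defined in the paper.

```python
def saving_percent(reference, proposed):
    """Relative saving of `proposed` over `reference`, in per cent."""
    return 100.0 * (reference - proposed) / reference


# FIR_279_140_24 row of Table 8: (Hcub value, RADIX-2r value)
measurements = {
    "delay": (32.005, 19.838),         # ns
    "power": (15.5008, 11.4926),       # mW
    "area": (223_366.68, 217_743.12),  # um^2
}
for name, (hcub, radix) in measurements.items():
    print(f"{name}: {saving_percent(hcub, radix):.3f} %")
```

The recomputed values agree with the printed 38.01, 25.85, and 2.51% to within 0.01 percentage points, the small differences being compatible with the table cutting the figures to two decimals. A negative result, as in several low-order rows, means that Hcub is the smaller of the two.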

Table 9 RADIX-2r versus Hcub: post-layout implementation results in 65 nm CMOS of the 10,599 × X SCM example

| Delay (a), ns: RADIX-2r | Delay (a), ns: Hcub [8] | Power (b), mW: RADIX-2r | Power (b), mW: Hcub [8] | Area (c), µm2: RADIX-2r | Area (c), µm2: Hcub [8] |
|---|---|---|---|---|---|
| 6.265 | 6.121 | 0.0129 | 0.0155 | 524.160 | 560.360 |

(a) Minimum clock period.
(b) Total dynamic power dissipation.
(c) Total area.
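The two shift-and-add recodings of 10,599 × X that Table 9 compares can be checked numerically. The Python sketch below mirrors the decompositions $P_{RADIX}$ and $P_{Hcub}$ given in Section 4, using integer shifts in place of hardware wiring; the function names are hypothetical and the code says nothing about bit widths or FA counts.

```python
def radix_2r_10599(x):
    """P_RADIX: five additions/subtractions, odd multiples 3X and 7X."""
    x0 = (x << 1) + x                                  # X0 = 3X
    x1 = (x << 3) - x                                  # X1 = 7X
    return x1 + (x0 << 5) - (x1 << 8) + (x0 << 12)     # 7 + 96 - 1792 + 12288


def hcub_10599(x):
    """P_Hcub: four additions/subtractions through 257X, 265X and 1325X."""
    x0 = (x << 8) + x                                  # X0 = 257X
    x1 = x0 + (x << 3)                                 # X1 = 265X
    x2 = (x1 << 2) + x1                                # X2 = 1325X
    return (x2 << 3) - x                               # 10,600X - X


# All 8 bit two's complement inputs, matching X_b = 8 in Fig. 3
for x in range(-128, 128):
    assert radix_2r_10599(x) == hcub_10599(x) == 10599 * x

# FA saving of the serial RADIX-2r version over Hcub (50 versus 61 FAs)
print(round(100 * (61 - 50) / 61, 2))
```

Both functions return the same product for every input, while the last line reproduces the 18% FA saving quoted in Section 5.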

Table 10 Comparison between Hcub and the Hcub version with minimum depth (Hcub+)

| Filter | Hcub depth | Hcub cost | Hcub+ depth | Hcub+ cost | Increase in cost, % |
|---|---|---|---|---|---|
| FIR_279_140_24 [9] | 26 | 158 | 4 | 194 | 22.78 |
| FIR_418_208_22 [9] | 9 | 212 | 4 | 252 | 18.86 |
| FIR_516_256_24 [9] | 9 | 259 | 4 | 301 | 16.21 |
| FIR_631_313_23 [9] | 6 | 315 | 4 | 335 | 6.34 |
| FIR_695_345_24 [9] | 6 | 348 | 4 | 365 | 4.88 |

power and area (Table 11), while they are almost equal in speed. The savings in area are not important because the whole filter architecture (MCM block plus adder structure) is considered. The hardware complexity of the adder structure is important as adders and registers grow monotonically along the taps to hold the correct precision of the partial sums [32]. It even exceeds the complexity of the MCM block, especially in symmetrical filters, as is the case of all FIR filters of Tables 8 and 11. However, when only the MCM block is implemented, an area improvement of 10.18% over Hcub and 12.74% over Hcub+ is obtained (Table 12). We deliberately chose FIR_279_140_24 as a typical example because its depth (26 steps) is very close to the average (26.44 steps) as shown in Table 5.

It was statistically proved in [29] that constraining (8) to the minimum depth will more likely increase the cost. Observe that even though the Hcub+ solution employs 36 adder blocks more than Hcub (a rise of ≃23%), the increase in area occupation is not significant (≃3%). The reason is that $\sum_{i=1}^{194} \Delta_i$ in Hcub+ is much smaller than in Hcub ($\sum_{i=1}^{158} \Delta_i$). Therefore, when considering speed, power, and area all together, Hcub+ is more interesting than Hcub except in the runtime.

Very recently, a new graph-based heuristic has been proposed in [33]. It uses the hypergraph concept to enable a new iterative minimum arborescence (IMA) formulation of the MCM problem. Note that the computational complexities of the IMA and Hcub algorithms are $O(M^3 \times N^2 + M^4 \times N)$ and $O(M^4 \times N^5 \times \log(M \times N) + M^3 \times N^6)$, respectively. With a lower runtime, IMA performs slightly better than Hcub in low-/medium-order filters. However, since IMA uses the same bound $A(u \times X, v \times X) \le 2^{N_{max}+1} \times X$ as Hcub (Fig. 6 in [33]), $\sum_{i=1}^{Cost} \Delta_i$ is very important. Therefore, IMA cannot be as area efficient as RADIX-2r in MCM problems of high complexity.

While the conventional metric of adder depth was proved to be a reliable measure of the critical path, a more accurate delay model has been introduced in [34]. The latter is based on a bit-level propagation of signals for a fine-grained analysis of the critical path of MCM blocks. It shows that the delays of the shift-add


Table 11 RADIX-2r versus Hcub+: post-layout implementation results in 65 nm CMOS of D-AMPS filters

| Filter | Delay (a), ns: RADIX-2r | Delay (a), ns: Hcub+ [8] | Power (b), mW: RADIX-2r | Power (b), mW: Hcub+ [8] | Area (c), µm2: RADIX-2r | Area (c), µm2: Hcub+ [8] | Delay saving, % | Power saving, % | Area saving, % |
|---|---|---|---|---|---|---|---|---|---|
| FIR_279_140_24 [9] | 19.838 | 19.870 | 11.4926 | 12.0888 | 217,743.12 | 225,187.56 | 0.16 | 4.93 | 3.30 |
| FIR_418_208_22 [9] | 20.884 | 21.418 | 16.3238 | 18.5316 | 315,236.88 | 323,322.84 | 2.49 | 11.91 | 2.50 |
| FIR_516_256_24 [9] | 19.010 | 19.829 | 21.4947 | 23.3619 | 381,966.12 | 395,147.44 | 4.13 | 7.99 | 3.33 |
| FIR_631_313_23 [9] | 18.747 | 20.075 | 25.6591 | 29.3194 | 460,180.80 | 473,906.28 | 6.61 | 12.48 | 2.89 |
| FIR_695_345_24 [9] | 20.569 | 20.726 | 29.9247 | 33.0765 | 505,337.04 | 524,262.36 | 0.76 | 9.52 | 3.60 |

(a) Minimum clock period.
(b) Total dynamic power dissipation.
(c) Total area.

Table 12 RADIX-2r versus Hcub and Hcub+: post-layout implementation results in 65 nm CMOS of FIR_279_140_24

| Algorithm | Delay (a), ns | Power (b), mW | Area (c), µm2 | Delay saving, % | Power saving, % | Area saving, % |
|---|---|---|---|---|---|---|
| Hcub [8] | 32.64 | 9.9868 | 92,669.76 | 47.24 | 41.27 | 10.18 |
| Hcub+ [8] | 17.79 | 7.2880 | 95,387.40 | 3.20 | 19.53 | 12.74 |
| RADIX-2r | 17.22 | 5.8646 | 83,232.72 | - | - | - |

(a) Minimum clock period.
(b) Total dynamic power dissipation.
(c) Total area.

network estimated at bit level are shorter than the delays at adder level. Similarly, based on the bit-level description of RADIX-2r, a more precise delay metric (bit depth) of the critical path can be derived.

A more accurate bit-level model of RADIX-2r is possible by eliminating the subtraction cell FS. The reason is that most standard cell libraries do not include the FS cell [35]. Instead, the operation A − B is implemented in 2's complement as A + B̄ + 1, requiring three elementary cells: FA, half adder (HA), and inverter (INV). While this is more precise, it considerably complicates the analytic model (10)–(13).

While the whole work presented so far deals with exact computing, we have recently developed a new SCM algorithm also based on RADIX-2r arithmetic but with a variable radix [36]. The latter is suitable for approximate computing as it allows saving power in error-resilient applications.

6 Conclusion and future work

We have proved that RADIX-2r is one of the leading MCM heuristics. With a simple recoding and insignificant computational effort, RADIX-2r achieves the best results in speed, power, and area, especially in MCM blocks of high complexity. This has been clearly demonstrated through the circuit implementation of a number of D-AMPS filters. The proven near optimality in adder depth along with area superiority over Hcub makes RADIX-2r a powerful heuristic that is very hard to compete with in VLSI applications. We have also shown that optimality in adder depth is always guaranteed in RADIX-2r by recoding the very few coefficients that cause a problem with r = 2. In general, their number does not exceed 1% of the total number of coefficients, as shown in the case of D-AMPS filters. Besides the unique analytic bounds known so far for MCM at adder level, a new finely grained bound at bit level has been introduced. This is another unprecedented result that opens new research perspectives in MCM.

While the introduced bit-level model applies to a CPA implementation of the MCM block, we are currently exploring the possibility of extending it to a carry-save (CSA) realisation, required in high-speed applications.

7 Acknowledgments

This work was supported by ‘Centre de Développement des Technologies Avancées’, CDTA, Algiers, Algeria.

8 References

[1] Thong, J., Nicolici, N.: ‘An optimal and practical approach to single constant multiplication’, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 2011, 30, (9), pp. 1373–1386
[2] Aksoy, L., da Costa, E., Flores, P.: ‘Exact and approximate algorithms for the optimisation of area and delay in multiple constant multiplication’, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 2008, 27, (6), pp. 1013–1026
[3] Oudjida, A.K., Liacha, A., Bakiri, M., et al.: ‘Multiple constant multiplication algorithm for high speed and low power design’, IEEE Trans. Circuits Syst. II Exp. Brief., 2016, 63, (2), pp. 176–180
[4] Mahesh, R., Vinod, A.P.: ‘A new common subexpression elimination algorithm for realizing low-complexity higher order digital filters’, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 2008, 27, (2), pp. 217–229
[5] Xu, F., Chang, C.H., Jong, C.C.: ‘Contention resolution algorithm for common subexpression elimination in digital filter design’, IEEE Trans. Circuits Syst. II Exp. Brief, 2005, 52, (10), pp. 695–700
[6] Martinez-Peiro, M., Boemo, E.I., Wanhammar, L.: ‘Design of high-speed multiplierless filters using a nonrecursive signed common subexpression algorithm’, IEEE Trans. Circuits Syst. II Analog Digit. Signal Process., 2002, 49, (3), pp. 196–203
[7] Dempster, A.G., Macleod, M.D.: ‘Use of minimum-adder multiplier blocks in FIR digital filters’, IEEE Trans. Circuits Syst. II Analog Digit. Signal Process., 1995, 42, (9), pp. 569–577
[8] Voronenko, Y., Püschel, M.: ‘Multiplierless multiple constant multiplication’, ACM Trans. Algorithms, 2007, 3, (2), pp. 1–38
[9] Chang, C.H., Faust, M.: ‘On “a new common subexpression elimination algorithm for realizing low-complexity higher order digital filters”’, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 2010, 29, (5), pp. 844–848
[10] Gustafsson, O.: ‘Lower bounds for constant multiplication problems’, IEEE Trans. Circuits Syst. II Exp. Brief, 2007, 54, (11), pp. 974–978
[11] Gustafsson, O.: ‘A difference based adder graph heuristic for multiple constant multiplication problems’. Proc. Int. IEEE (ISCAS) Conf. Circuits and Systems, New Orleans, USA, May 2007, pp. 1097–1100
[12] Thong, J., Nicolici, N.: ‘Combined optimal and heuristic approaches for multiple constant multiplication’. Proc. Int. Conf. Computer Design (ICCD), Amsterdam, Netherlands, October 2010, pp. 266–273
[13] Oudjida, A.K., Chaillet, N.: ‘Radix-2r arithmetic for multiplication by a constant’, IEEE Trans. Circuits Syst. II Exp. Brief, 2014, 61, (5), pp. 349–353
[14] Oudjida, A.K., Chaillet, N., Berrandjia, M.L.: ‘Radix-2r arithmetic for multiplication by a constant: further results and improvements’, IEEE Trans. Circuits Syst. II Exp. Brief, 2015, 62, (4), pp. 372–376
[15] Nanyang Technological University, Singapore: ‘FIRsuite Suite of constant coefficient FIR filters’, http://www.firsuite.net, November 2009
[16] Fraust, M., Gustafsson, O., Chip-Hong, C.: ‘Reconfigurable multiple constant multiplication using minimum adder depth’. Proc. IEEE ASILOMAR Conf. Signals, Systems, and Computers, CA, USA, November 2010, pp. 1293–1301
[17] Johansson, K., Gustafsson, O., DeBrunner, L.S., et al.: ‘Minimum adder depth multiple constant multiplication algorithm for low power FIR filters’. Proc. Int. IEEE (ISCAS) Conf. Circuits and Systems, Rio de Janeiro, Brazil, May 2011, pp. 1439–1442
[18] Aksoy, L., Günes, E.O., Flores, P.: ‘Search algorithms for the multiple constant multiplications problem: exact and approximate’, Microprocess. Microsyst., 2010, 34, (5), pp. 151–162
[19] Oudjida, A.K., Berrandjia, M.L., Liacha, A.: ‘RADIX-2r MCM, ver. 2.0.2’. Available at http://www.cdta.dz/products/mcm, June 2016


[20] Avizienis, A.: ‘Signed-digit number representation for fast parallel arithmetic’, IRE Trans. Electron. Comput., 1961, EC-10, (3), pp. 389–400
[21] Aksoy, L., Flores, P., Monteiro, J.: ‘Exact and approximate algorithms for the filter design optimization problem’, IEEE Trans. Signal Process., 2015, 63, (1), pp. 142–154
[22] Aksoy, L., Gunes, E., Flores, P.: ‘An exact breadth-first search algorithm for the multiple constant multiplications problem’. Proc. Int. 26th Edition of IEEE NORCHIP Conf., Tallinn, Estonia, November 2008, pp. 41–44
[23] Samueli, H.: ‘An improved search algorithm for the design of multiplierless FIR filters with powers-of-two coefficients’, IEEE Trans. Circuits Syst., 1989, 36, (7), pp. 1044–1047
[24] Dimitrov, V.S., Eskritt, J., Imbert, L., et al.: ‘The use of the multi-dimensional logarithmic number system in DSP applications’. Proc. Int. (ARITH) Conf. Computer Arithmetic, Vail, CO, USA, June 2001, pp. 247–254
[25] Aksoy, L., Costa, E., Flores, P., et al.: ‘Optimization of area in digital FIR filters using gate-level metrics’. Proc. Int. (DAC) Conf. Design Automation, San Diego, CA, USA, June 2007, pp. 420–423
[26] Johansson, K., Gustafsson, O., Wanhammar, L.: ‘A detailed complexity model for multiple constant multiplication and an algorithm to minimize the complexity’. Proc. IEEE European (ECCTD) Conf. Circuit Theory and Design, Cork, Ireland, September 2005, vol. 3, pp. 465–468
[27] Lou, X., Yu, Y.J., Meher, P.K.: ‘New approach to the reduction of sign-extension overhead for efficient implementation of multiple constant multiplications’, IEEE Trans. Circuits Syst. I: Reg. Papers, 2015, 62, (11), pp. 2695–2705
[28] Huang, R., Chang, C.H., Faust, M., et al.: ‘Sign-extension avoidance and word-length optimization by positive-offset representation for FIR filter design’, IEEE Trans. Circuits Syst. II Exp. Brief, 2011, 58, (12), pp. 916–920
[29] Faust, M., Chang, C.H.: ‘Minimal logic depth adder tree optimization for multiple constant multiplication’. Proc. Int. IEEE (ISCAS) Conf. Circuits and Systems, Paris, France, June 2010, pp. 457–460
[30] Kumm, M., Zipf, P., Faust, M., et al.: ‘Pipelined adder graph optimization for high speed multiple constant multiplication’. Proc. Int. IEEE (ISCAS) Conf. Circuits and Systems, Seoul, South Korea, May 2012, pp. 49–52
[31] Ye, W.B., Yu, Y.J.: ‘Bit-level multiplierless FIR filter optimization incorporating sparce filter technique’, IEEE Trans. Circuits Syst. I Reg. Papers, 2014, 61, (11), pp. 3206–3215
[32] Faust, M., Chang, C.H.: ‘Optimization of structural adders in fixed coefficient transposed direct form FIR filters’. Proc. Int. IEEE (ISCAS) Conf. Circuits and Systems, Taiwan, May 2009, pp. 2185–2188
[33] Feng, F., Chen, J., Chang, C.H.: ‘Hypergraph based minimum arborescence algorithm for the optimization and reoptimization of multiple constant multiplications’, IEEE Trans. Circuits Syst. I, 2016, doi: 10.1109/TCSI.2015.2512742
[34] Lou, X., Yu, Y.J., Meher, P.K.: ‘Fine-grained critical path analysis and optimization for area-time efficient realization of multiple constant multiplications’, IEEE Trans. Circuits Syst. I Reg. Papers, 2014, 62, (3), pp. 863–872
[35] Copyright © UMC Ver. B02: ‘UMC 65 nm Data Book’, December 2011
[36] Liacha, A., Oudjida, A.K., Ferguene, F., et al.: ‘A variable Radix-2r algorithm for single constant multiplication’. Proc. Int. IEEE (NEWCAS) Conf. New Circuits and Systems, Strasbourg, France, June 2017, pp. 1–4, doi: 10.1109/NEWCAS.2017.8010156

- Ganga Action Plan ReportUploaded bySwahum Mukherjee
- Rolul intensitati in dezvoltarea capacitatii de efoer (En)Uploaded byRaul Nemes
- 1-WilburnUploaded byTanmay Moharana
- ESP Training - 1 AL & SLB Introduction - 19 PgsUploaded byjoo123456789
- GENETIC ENGINEERING.pdfUploaded byHomero Pixar
- Comparisons for Astm & JisUploaded bySandi Aslan
- Ethernet Cable Color Coding Standard PDFUploaded byTom
- OPLLUploaded byTeofilus Kristianto
- arduino based motion trigger cameraUploaded byAbdul Hafiz
- WKB ApproximationUploaded byTanzid Sultan
- The One Lillian DeWatersUploaded byBurmese2
- 34-p1286Uploaded byDavid H. Butar Butar
- e Conservation Magazine n° 13, February 2010Uploaded bySARA
- Image Monitoring In-situ of Alpha Particles by CR-39 DetectorUploaded bySEP-Publisher
- Zahara Group BrochureUploaded byask.abhijit
- Principles of Environmental LawUploaded byEllaine Quimson
- Muther Vol 2Uploaded byancuta
- Hoods, Ductwork and StacksUploaded byShiyamraj Thamodharan
- lebaneseseismicnorms-160314114256Uploaded byKamal Halawi
- PAO Retinopathy of Prematurity Guidelines (2013)Uploaded bySirias_black
- Spray Drying of FoodsUploaded bygombossandor