IET Circuits, Devices & Systems

Research Article

Design of high-speed, low-power, and area-efficient FIR filters

ISSN 1751-858X
Received on 19th February 2017
Revised 1st July 2017
Accepted on 13th August 2017
E-First on 4th December 2017
doi: 10.1049/iet-cds.2017.0058
www.ietdl.org

Ahmed Liacha1,2, Abdelkrim K. Oudjida1, Farid Ferguene2, Mohammed Bakiri1, Mohamed L. Berrandjia1
1Centre de Développement des Technologies Avancées, CDTA, System Architecture and Multimedia Division, Algiers, Algeria
2Université des Sciences et de la Technologie Houari Boumediene, USTHB, LRPE Laboratory, Algiers, Algeria
E-mail: a_oudjida@hotmail.com

Abstract: In a recent work, we introduced a new multiple constant multiplication (MCM) algorithm, denoted RADIX-2^r, which exhibits the best results in speed and power compared with the most prominent algorithms. In this paper, the area aspect of RADIX-2^r is investigated more specifically. RADIX-2^r is compared against area-efficient algorithms, notably the cumulative benefit heuristic (Hcub), known for its lowest adder cost. A number of benchmark FIR filters of growing complexity served for comparison. The results show that RADIX-2^r is better than Hcub in area, especially for high-order filters, where the saving ranges from 1.50% up to 3.46%. This advantage is analytically proved and experimentally confirmed using a 65 nm CMOS technology. Area efficiency is achieved along with important savings in speed and power, ranging from 6.37% up to 38.01% and from 9.30% up to 25.85%, respectively. When MCM blocks are implemented alone, the savings are higher: 10.18%, 47.24%, and 41.27% in area, speed, and power, respectively. Most importantly, we prove that MCM heuristics using a similar addition pattern (A-operation with the same shift spans) as Hcub yield excessive bit-adder overhead in MCM problems of high complexity. As such, they are not competitive with RADIX-2^r in high-order filters.

1 Background and motivation

The hardware complexity of FIR filters is dominated by multiple constant multiplication (MCM). The latter is an arithmetic operation that multiplies a set of fixed-point constants $C_0, C_1, C_2, \ldots, C_{M-1}$ with the same fixed-point variable X. To be efficiently implemented, i.e. rapid, compact, and low power, MCM must avoid costly multipliers. The hardware alternative is multiplierless, using only additions, subtractions, and left shifts. We assume that addition and subtraction have the same area/speed cost, and that the shift is costless since it can be realised without any gates, i.e. just by using hardwiring. Therefore, the MCM problem is defined as the process of finding the minimum number of additions, and/or the minimum number of cascaded additions forming the critical path. The computational complexity of MCM is conjectured to be NP hard [1]. Because the solution space to explore is so huge, optimal solutions require excessive runtime and become impractical even for MCM operations of medium complexity [1, 2]. Only MCM heuristics can react in a reasonable amount of time, producing, however, suboptimal solutions.

Based on the Radix-2^r arithmetic, we developed in a previous work a fully predictable heuristic for MCM, called RADIX-2^r (see Table 2 in [3]). We obtained the only analytic bounds known so far for MCM in adder cost (Upb), average (Avg), and adder depth (Ath). The bounds have been extended to FIR filters to deal with the different bit lengths of the coefficients (see Table 3 in [3]). Additionally, RADIX-2^r shows a sublinear runtime complexity, which means that it has no limitation regarding the size of the problem to be solved (see Table 7 in [3]). We have also demonstrated that the best MCM heuristics cannot compete with RADIX-2^r in adder depth (see Section 3 in [3]). This explains the superiority of the speed/power results of RADIX-2^r over the best MCM heuristics (see Tables 8 and 9 in [3]), whether they belong to the category of common subexpression elimination (CSE) or directed acyclic graphs (DAG).

In [4], the authors proposed a CSE method using the binary representation of the coefficients as opposed to the commonly used canonical signed digit (CSD) representation. It was denoted BSE, where the 'B' stands for binary. BSE was dedicated to high-order filters with the objective of achieving a lower adder cost without much increase (near optimal) in adder depth. The authors claimed that BSE was better in adder cost than the best CSE [5, 6] and DAG [7, 8] heuristics, including Hcub [8]. However, the work in [4] was severely criticised by Chang and Faust [9], who revealed inconsistency in the chain of arguments and several discrepancies in the quantitative figures used to justify the merit of BSE. Chang and Faust [9] recalculated the adder cost and adder depth for Hcub and the two best CSE heuristics, namely the contention resolution algorithm (CRA) [5] and the non-recursive signed common subexpression elimination (NR-SCSE) [6]. They showed experimentally that the adder cost of Hcub is very close to the lower bound defined in [10]. They came to the conclusion that there is very little room for other algorithms to achieve a lower adder cost than Hcub. They also claimed that the adder depth of Hcub decreases with increasing filter length.

Hcub, as one of the outstanding MCM heuristics in quality and runtime, still has some challenging competitors. DIFFAG [11] outperforms Hcub only in MCM problems with many small constants, as it essentially relies on the difference between constants, while H3 and H4 [12] beat Hcub except in the case of few large constants. In fact, H3 and H4 are modifications to Hcub which enable better exploitation of redundancy within constants [12]. Results in [11, 12] show that DIFFAG and H3 achieve slight savings on average over Hcub. H4 performs better than H3 but at the cost of an excessive runtime. Contrary to DIFFAG, H3, and H4, which are advantageous only for a narrow spectrum of MCM problem sizes, Hcub is fairly balanced [12] with a much larger span of performances, and is therefore adaptable to a vast range of applications. Owing to this, we focus more particularly on Hcub for the benchmarking comparisons.

Contrary to the claims of Chang and Faust [9], in this work we provide evidence that Hcub is not as unbeatable in area as it might appear. The reason is that the Hcub solution requires a set of adders with longer bit lengths. For high-order filters, the total number of bit adders outweighs the benefit of the lower adder cost. Besides, we prove that Hcub is not near optimal in cost for constant bit

IET Circuits Devices Syst., 2018, Vol. 12 Iss. 1, pp. 1-11 1


© The Institution of Engineering and Technology 2017
length >20 bits. We also prove that Hcub adder depth increases with increasing filter length.

In addition to the RADIX-2^r advantages cited earlier, the central point of this paper is to demonstrate the superiority of RADIX-2^r over the existing heuristics in speed, power, and particularly area. For this purpose, a number of benchmark FIR filters of different complexities have been implemented in 65 nm complementary metal oxide semiconductor (CMOS). We also introduce the exact analytic bound of RADIX-2^r at bit level, which is a very strong result in MCM.

This paper is organised as follows. In Section 2, a new lower bound in adder depth for RADIX-2^r is defined. Section 3 compares RADIX-2^r to state-of-the-art algorithms through a number of filters of different complexities. We introduce in Section 4 the bit-level version of RADIX-2^r and show its efficiency over Hcub in the number of bit adders. The implementation results are presented and discussed in Section 5. Finally, Section 6 provides some concluding remarks and suggestions for future work.

2 RADIX-2r: a lower bound for adder depth

A non-negative N-bit constant C is expressed in RADIX-2^r as

$$C = \sum_{j=0}^{\lceil (N+1)/r \rceil - 1} \left( c_{rj-1} + 2^{0} c_{rj} + 2^{1} c_{rj+1} + 2^{2} c_{rj+2} + \cdots + 2^{r-2} c_{rj+r-2} - 2^{r-1} c_{rj+r-1} \right) \times 2^{rj} = \sum_{j=0}^{\lceil (N+1)/r \rceil - 1} Q_j \times 2^{rj}, \tag{1}$$

where $c_{-1} = c_N = 0$ and $r \in \mathbb{N}^*$. In (1), the 2's complement representation of C is split into $\lceil (N+1)/r \rceil$ slices ($Q_j$), each of r + 1 bit length [3]. Each pair of two contiguous slices has one overlapping bit. A digit set $DS(2^r)$ corresponds to (1), such that $Q_j \in DS(2^r) = \{-2^{r-1}, -2^{r-1}+1, \ldots, -1, 0, 1, \ldots, 2^{r-1}-1, 2^{r-1}\}$.

The sign of the $Q_j$ term is given by the $c_{rj+r-1}$ bit, and $Q_j = 2^{k_j} \times m_j$, with $k_j \in \{0, 1, 2, \ldots, r-1\}$ and $m_j \in OM(2^r) \cup \{0, 1\}$, where $OM(2^r) = \{3, 5, 7, \ldots, 2^{r-1}-1\}$. $OM(2^r)$ is the set of odd positive digits in RADIX-2^r recoding, with $|OM(2^r)| = 2^{r-2} - 1$.

Oudjida et al. [13, 14] proved an upper limit for adder depth equal to

$$\lceil \log_2 \lceil (N+1)/r \rceil \rceil + r - 2, \tag{2}$$

where

$$r = 2 \cdot W\!\left(\sqrt{(N+1) \cdot \log(2)}\right) / \log(2), \tag{3}$$

W is the Lambert function, and $\lceil\ \rceil$ is the ceiling function. Note that the same formula (2) holds for MCM but with

$$r = 2 \cdot W\!\left(\sqrt{M \cdot (N+1) \cdot \log(2)}\right) / \log(2), \tag{4}$$

where M is the number of constants. We also showed that the term r in (2) prevails over the log term as r increases (see Table 5 in [3]). Without increasing the adder cost, limit (2) can be significantly reduced using the following recurrence relation: $m_j = 2^p \pm d$, where $p \le r - 1$ and $0 < d < 2^{r-3}$. The construction process of the non-trivial partial products (PPs) is illustrated by Fig. 1 for r = 6.

Fig. 1 Sequential order of computation of the entire set of non-trivial PPs needed by RADIX-2^6. For RADIX-2^6, a maximum of $2^{6-2} - 1 = 15$ additions are necessary, carried out in $\lceil 6/2 \rceil - 1 = 2$ steps in the worst case

Theorem 1: In RADIX-2^r, the computation of the entire set of non-trivial PPs $\{3 \times X, 5 \times X, \ldots, (2^{r-1} - 1) \times X\}$ needs a maximum adder depth of $\lceil r/2 \rceil - 1$.

Proof: Proof by induction is used. Aided by Fig. 1, for r = 3, $OM(2^3) = \{3\}$, which requires $\lceil 3/2 \rceil - 1 = 1$ step. Now assume that for $OM(2^r)$, a maximum adder depth of $\lceil r/2 \rceil - 1$ is needed. Let us determine the adder depth for $OM(2^{r+1})$:

$$\begin{aligned}
OM(2^{r+1}) &= \{3, 5, 7, \ldots, 2^{r-1} - 1, 2^{r-1} + 1, \ldots, 2^{(r+1)-1} - 1\} \\
&= OM(2^r) \cup \{2^{(r+1)-2} + 1, \ldots, 2^{(r+1)-1} - 1\} \\
&= OM(2^r) \cup \{(2^{(r+1)-2} + 1), (2^{(r+1)-2} + 3), \ldots, (2^{(r+1)-2} + (2^{(r+1)-3} - 1)), \\
&\qquad (2^{(r+1)-1} - (2^{(r+1)-3} - 1)), \ldots, (2^{(r+1)-1} - 3), (2^{(r+1)-1} - 1)\}
\end{aligned}$$

Note that the greatest term $2^{(r+1)-3} - 1 = 2^{r-2} - 1 \in OM(2^{r-1})$. Since the starting value corresponds to r = 3 (Fig. 1), the adder step increases by one unit for each subsequent odd value of r. The even values of r require as many adder steps as r − 1. Thus, whether r is odd or even, the term $\lceil (r+1)/2 \rceil - 1$ holds as the maximum adder depth for $OM(2^{r+1})$.

Thus, in RADIX-2^r, the upper limit in adder depth becomes

$$Ath_{//} = \lceil \log_2 \lceil (N+1)/r \rceil \rceil + \lceil r/2 \rceil - 1 \tag{5}$$

and

$$Ath_{\cdots} = \lceil (N+1)/r \rceil + \lceil r/2 \rceil - 2 \tag{6}$$

for parallel (binary tree) and serial implementations of the PPs, respectively.

Since each new non-trivial PP requires only one addition (recurrence relation), the maximum adder cost is the number of non-trivial PPs: $N_{om} = |OM(2^r)| = 2^{r-2} - 1$. Hence, the upper bound does not change. It is the same as in [3]:

$$Upb = M \times \lceil (N+1)/r \rceil + 2^{r-2} - 1 - M. \tag{7}$$
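The recurrence $m_j = 2^p \pm d$ and the depth bound of Theorem 1 can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function name `odd_multiple_depths` is hypothetical, the construction proceeds level by level (all multiples reachable with one more addition are added at each step), and the operand d = 1 (the variable X itself) is assumed to be always available so that the base case r = 3 works.

```python
from math import ceil

def odd_multiple_depths(r):
    """Adder depth of each odd multiple in OM(2^r), built as m = 2^p +/- d."""
    targets = set(range(3, 2 ** (r - 1), 2))   # OM(2^r) = {3, 5, ..., 2^(r-1) - 1}
    depth = {1: 0}                             # X itself needs no addition
    step = 0
    while targets - set(depth):
        step += 1
        # Operands already built that satisfy 0 < d < 2^(r-3);
        # d = 1 is always allowed (assumption for the base case r = 3).
        operands = [d for d in depth if d == 1 or d < 2 ** (r - 3)]
        new = {}
        for d in operands:
            for p in range(1, r):              # p <= r - 1
                for m in (2 ** p + d, 2 ** p - d):
                    if m in targets and m not in depth:
                        new[m] = step
        if not new:
            raise ValueError("recurrence cannot reach every odd multiple")
        depth.update(new)
    return {m: depth[m] for m in sorted(targets)}

if __name__ == "__main__":
    for r in range(3, 11):
        depths = odd_multiple_depths(r)
        print(r, len(depths), max(depths.values()), ceil(r / 2) - 1)
```

Each printed line lists r, the number of non-trivial PPs, the largest depth found by the sketch, and the bound $\lceil r/2 \rceil - 1$, so the last two columns can be compared directly.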
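Table 1 RADIX-2^r: maximum number of additions (Upb) for a number of M non-negative constants with the same bit size N

| N | | M = 10 | 50 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | r | 5 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 |
| | Upb | 27 | 81 | 131 | 231 | 331 | 431 | 531 | 631 | 731 | 831 | 931 | 1031 |
| | Lbc | 12 | 52 | 102 | 202 | 302 | 402 | 502 | 602 | 702 | 802 | 902 | 1002 |
| 16 | r | 6 | 6 | 6 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 |
| | Upb | 35 | 115 | 215 | 327 | 427 | 527 | 627 | 727 | 827 | 927 | 1027 | 1127 |
| | Lbc | 13 | 53 | 103 | 203 | 303 | 403 | 503 | 603 | 703 | 803 | 903 | 1003 |
| 20 | r | 6 | 7 | 7 | 7 | 7 | 7 | 11 | 11 | 11 | 11 | 11 | 11 |
| | Upb | 45 | 131 | 231 | 431 | 631 | 831 | 1011 | 1111 | 1211 | 1311 | 1411 | 1511 |
| | Lbc | 13 | 53 | 103 | 203 | 303 | 403 | 503 | 603 | 703 | 803 | 903 | 1003 |
| 24 | r | 5 | 7 | 7 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 |
| | Upb | 47 | 181 | 331 | 527 | 727 | 927 | 1127 | 1327 | 1527 | 1727 | 1927 | 2127 |
| | Lbc | 13 | 53 | 103 | 203 | 303 | 403 | 503 | 603 | 703 | 803 | 903 | 1003 |
| 32 | r | 6 | 7 | 9 | 9 | 9 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
| | Upb | 65 | 231 | 427 | 727 | 1027 | 1311 | 1511 | 1711 | 1911 | 2111 | 2311 | 2511 |
| | Lbc | 14 | 54 | 104 | 204 | 304 | 404 | 504 | 604 | 704 | 804 | 904 | 1004 |

$Upb = M \times \lceil (N+1)/r \rceil + 2^{r-2} - 1 - M$ with $r = 2W\!\left(\sqrt{M (N+1) \log 2}\right)/\log 2$; $Lbc = \lceil \log_2 \lceil (N+1)/2 \rceil \rceil + M - 1$; W, Lambert function; $\lceil\ \rceil$, ceiling function.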

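Table 2 RADIX-2^r: maximum serial (Ath…) and parallel (Ath//) adder depth for a number of M non-negative constants with the same bit size N

| N | | M = 10 | 50 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | Ath… | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| | Ath// | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| | Lbd | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| 16 | Ath… | 4 | 4 | 4 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| | Ath// | 4 | 4 | 4 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| | Lbd | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 20 | Ath… | 5 | 5 | 5 | 5 | 5 | 5 | 6 | 6 | 6 | 6 | 6 | 6 |
| | Ath// | 4 | 5 | 5 | 5 | 5 | 5 | 6 | 6 | 6 | 6 | 6 | 6 |
| | Lbd | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 24 | Ath… | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| | Ath// | 5 | 5 | 5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| | Lbd | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 32 | Ath… | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 |
| | Ath// | 5 | 6 | 6 | 6 | 6 | 7 | 7 | 7 | 7 | 7 | 7 | 7 |
| | Lbd | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |

$Ath_{\cdots} = \lceil (N+1)/r \rceil + \lceil r/2 \rceil - 2$; $Ath_{//} = \lceil \log_2 \lceil (N+1)/r \rceil \rceil + \lceil r/2 \rceil - 1$ with $r = 2W\!\left(\sqrt{M (N+1) \log 2}\right)/\log 2$; $Lbd = \lceil \log_2 \lceil (N+1)/2 \rceil \rceil$.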
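The entries of Tables 1 and 2 follow directly from the closed-form bounds. The sketch below is illustrative only: the function names are hypothetical, and the value of r is found here by an exhaustive search over integers for the smallest Upb, which is an assumption of this sketch rather than the Lambert-function expression quoted in the table notes.

```python
from math import ceil, log2

def upb(n_bits, m, r):
    """Upper bound on adder cost: M*ceil((N+1)/r) + 2^(r-2) - 1 - M."""
    return m * ceil((n_bits + 1) / r) + 2 ** (r - 2) - 1 - m

def best_r(n_bits, m):
    """Integer r in [2, N+1] giving the smallest Upb (smallest r wins ties)."""
    return min(range(2, n_bits + 2), key=lambda r: (upb(n_bits, m, r), r))

def depth_parallel(n_bits, r):
    """Ath//: binary-tree summation of the partial products."""
    return ceil(log2(ceil((n_bits + 1) / r))) + ceil(r / 2) - 1

def depth_serial(n_bits, r):
    """Ath...: cascaded summation of the partial products."""
    return ceil((n_bits + 1) / r) + ceil(r / 2) - 2

def lower_bounds(n_bits, m):
    """(Lbc, Lbd): lower bounds on adder cost and adder depth from [10]."""
    lbd = ceil(log2(ceil((n_bits + 1) / 2)))
    return lbd + m - 1, lbd

if __name__ == "__main__":
    for n_bits, m in [(12, 10), (20, 500), (32, 10)]:
        r = best_r(n_bits, m)
        print(n_bits, m, r, upb(n_bits, m, r),
              depth_serial(n_bits, r), depth_parallel(n_bits, r),
              lower_bounds(n_bits, m))
```

For a given (N, M) pair, the printed tuple can be compared with the corresponding column of Tables 1 and 2.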

On the other side, Gustafsson [10] proved lower bounds for adder cost and depth, which are $Lbc = \lceil \log_2 \lceil (N+1)/2 \rceil \rceil + M - 1$ and $Lbd = \lceil \log_2 \lceil (N+1)/2 \rceil \rceil$, respectively. To show how far (5) and (6) are from the lower bound (Lbd), we first computed the values of r that give optimal Upb for N = 12–32 and M = 10–1000 (Table 1), which is the typical variation range when considering filters, though N does not generally exceed 24 bits [15]. Afterwards, we applied the values of r to (5) and (6) and collected the results in Table 2. Note that Ath… and Ath// are at most two steps greater than Lbd. Observe that for filter applications (N ≤ 24), Ath… needs at most one step more than Ath//. The gap between Ath// and Ath… becomes significant only for applications where N is high and M is low, such as, for instance, in the case N ≥ 32 and M ≤ 10. Notice also that starting from M = 200, Ath… and Ath// have the same value. More importantly, note that both Ath… and Ath// can exhibit optimal values as in the case N = 16 and M ≤ 200.

Significant conclusion: RADIX-2^r leads to a near-optimal solution in adder depth without sacrificing the adder cost. Therefore, there is very little room for other heuristics to compete with it in adder depth. To the best of our knowledge, there exists no MCM heuristic for the time being that indicates how far it is from the minimum (Lbd). The motivation for lower adder depth is not only for speed, but for power consumption as well. The reason is that shorter paths reduce the number of glitches, which are the dominant factor in power consumption [3, 16, 17]. □

3 Benchmarking of the best MCM heuristics

In what follows, RADIX-2^r is compared against the best MCM algorithms through a set of FIR filters taken from different sources. To highlight the complexity of the filters, we opted for the following nomenclature: FIR_L_M_N, where L is the number of coefficients (H set) of the filter, M is the number of unique positive odd integer coefficients (Hmin set) to be solved in MCM, and N is the maximum bit length of the coefficients of the Hmin set. The products M × N and L × N serve to estimate the bit-level complexity (area) of the MCM block and the whole filter (MCM block plus the adder structure), respectively. Note that we deal with the transposed form of FIR filters.

The first set of filters is taken from [18]. The filter specifications are given in Table 2 of [18]. For easy identification, the number of the filter (x) is added in the nomenclature (FIRx_L_M_N). The Hmin sets of the eight filters have been kindly provided by Aksoy. We have also considered the filter that served for benchmarking in [3]. The RADIX-2^r solutions are given by the online version available in [19]. To cope with

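Table 3 Adder cost and adder depth of some low-/medium-order FIR filters

| Filter | Lower bounds [10]: cost | Lower bounds [10]: depth^a, max | Lower bounds [10]: depth^a, avg | CSD [20]: cost | RADIX-2^r: cost | RADIX-2^r: max depth, // | RADIX-2^r: max depth, ⋯ | RADIX-2^r: avg depth, // | RADIX-2^r: avg depth, ⋯ | RAG-n [7]: cost | RAG-n [7]: max depth | Hcub [8]: cost | Hcub [8]: max depth | NAIAD [21]: cost | NAIAD [21]: max depth | SIREN [21]: cost | SIREN [21]: max depth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FIR_25_13_12 [3] | 14 | 3 | 2.46 | 44 | 22 | 3 | 4 | 2.46 | 2.61 | 18 | 10 | 16 | 7 | 18 | 3 | 16 | 7 |
| FIR4_30_14_13 [18] | 14 | 3 | 2.64 | 53 | 23 | 3 | 3 | 2.64 | 2.64 | 23 | 5 | 18 | 7 | 18 | 7 | 17 | 8 |
| FIR3_30_14_13 [18] | 14 | 3 | 2.42 | 51 | 27 | 3 | 4 | 2.78 | 2.92 | 24 | 9 | 19 | 7 | 18 | 6 | 17 | 11 |
| FIR1_40_19_12 [18] | 20 | 3 | 2.31 | 65 | 33 | 3 | 4 | 2.57 | 2.63 | 24 | 10 | 23 | 7 | 22 | 9 | 22 | 10 |
| FIR2_40_19_13 [18] | 20 | 3 | 2.47 | 65 | 36 | 3 | 3 | 2.68 | 2.73 | 27 | 6 | 24 | 6 | 23 | 7 | 23 | 7 |
| FIR7_40_19_14 [18] | 20 | 3 | 2.63 | 72 | 35 | 3 | 4 | 2.84 | 2.94 | 28 | 7 | 24 | 7 | 23 | 12 | 23 | 10 |
| FIR6_60_29_14 [18] | 29 | 3 | 2.31 | 95 | 47 | 4 | 4 | 2.65 | 2.68 | 36 | 10 | 32 | 10 | 32 | 5 | 31 | 11 |
| FIR8_80_36_14 [18] | 37 | 3 | 2.36 | 118 | 51 | 4 | 4 | 2.66 | 2.72 | 40 | 5 | 38 | 5 | 37 | 6 | 37 | 6 |
| FIR5_80_39_15 [18] | 40 | 3 | 2.61 | 151 | 64 | 4 | 4 | 2.79 | 2.97 | 44 | 9 | 42 | 8 | 41 | 10 | 41 | 10 |
| total | 208 | 27 | 22.21 | 714 | 338 | 30 | 34 | 24.07 | 24.84 | 264 | 71 | 236 | 64 | 232 | 65 | 227 | 80 |
| normalisation | 1 | 1 | 1 | 3.43 | 1.62 | 1.11 | 1.25 | 1.08 | 1.11 | 1.26 | 2.62 | 1.13 | 2.37 | 1.11 | 2.41 | 1.09 | 2.96 |

The lower bounds in cost and depth are given by $\lceil \log_2 \lceil (\min(N_i)+1)/2 \rceil \rceil + M - 1$ and $\lceil \log_2 \lceil (\max(N_i)+1)/2 \rceil \rceil$, respectively.
^a The same values apply to CSD. //, parallel implementation (binary tree); ⋯, serial implementation in cascaded adders; $\lceil\ \rceil$, ceiling function.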
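The cost × depth trade-off used in this section to rank the heuristics can be recomputed from the normalisation row of Table 3. The following sketch is only an illustration: the dictionary and function names are hypothetical, and the products are truncated (not rounded) to two decimals, an assumption that happens to reproduce the figures quoted in the text.

```python
from decimal import Decimal, ROUND_DOWN

# Normalised (cost, maximum depth) pairs copied from the last row of Table 3.
# RADIX-2^r uses the parallel (//) depth; CSD shares the lower-bound depth.
normalised = {
    "RADIX-2^r": ("1.62", "1.11"),
    "Hcub": ("1.13", "2.37"),
    "NAIAD": ("1.11", "2.41"),
    "SIREN": ("1.09", "2.96"),
    "RAG-n": ("1.26", "2.62"),
    "CSD": ("3.43", "1"),
}

def trade_off(cost, depth):
    """Normalised cost x depth product, truncated to two decimals."""
    product = Decimal(cost) * Decimal(depth)
    return product.quantize(Decimal("0.01"), rounding=ROUND_DOWN)

products = {name: trade_off(c, d) for name, (c, d) in normalised.items()}
ranking = sorted(products, key=products.get)   # lower product = better trade-off

if __name__ == "__main__":
    for name in ranking:
        print(f"{name:10s} {products[name]}")
```

Exact decimal arithmetic is used instead of floats so that the truncation step is not affected by binary rounding.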

various speed/area constraints, the latter allows three options: the 'adder cost' option reduces the adder cost according to the value of $r_{opt}$ that optimises Upb (see the $r_1$ formula in Table 3 of [3]); the 'adder depth' option reduces the adder depth by selecting the best solution given by $r_{opt}$, $r_{opt} \pm 1$, and $r_{opt} \pm 2$; and the trade-off option works with a fixed value of r varying from 2 up to $r_{opt} + 3$. The different MCM solutions for the nine filters are reported in Table 3. The filters are arranged according to the increasing value of M × N.

On the one hand, note that RADIX-2^r provides a near-optimal solution in adder depth for both parallel and serial implementations. The maximum depth does not exceed 4, as stated in Table 2 for 12 ≤ N ≤ 15, which is the variation range of $N_i$ for the nine filters. While the CSD solution [20] achieves optimality in depth, it requires an exaggerated number of additions. We have already demonstrated that CSD is largely superseded by RADIX-2^r in cost (see Table 5 in [3]). CSD is included because it is a standard technique which is still being employed in designing the vast majority of small/medium SCM/MCM blocks, as well as for a number of merits already discussed in [3].

On the other hand, SIREN induces the lowest adder cost but also the highest adder depth. It is a well-known fact in MCM that decreasing the cost increases the depth, and vice versa. To see which heuristic yields the best trade-off, we calculated the cost × depth product of the normalised values with respect to the lower bounds. The lower the product, the better the trade-off. The results are reported in increasing order as follows: 1.79, 2.67, 2.67, 3.22, 3.30, and 3.43, for RADIX-2^r, Hcub, NAIAD, SIREN, RAGn, and CSD, respectively. Thus, it is clear that RADIX-2^r is by far the most balanced heuristic. Furthermore, lower depths can be attained by using the 'adder depth' option of the online version of RADIX-2^r [19]. Optimal parallel depth (three steps) is achieved for FIR5, FIR6, and FIR8 with an increased cost of 87, 57, and 67, respectively. The normalised cost × depth gives 1.86 × 1 = 1.86, which is still the best trade-off. Another way to get the minimum depth is to realise that replacing r by 2 in the Ath// formula (5) gives $\lceil \log_2 \lceil (N+1)/2 \rceil \rceil$, which corresponds to Gustafsson's lower bound (Lbd) [10]. Therefore, the idea is to reduce the cost first using the 'adder cost' option in [19], then look for the few coefficients whose Ath// exceeds Lbd and recode them in RADIX-2^r with r = 2. The new costs are 74, 51, and 56 for FIR5, FIR6, and FIR8, respectively. The normalised cost × depth yields 1.71 × 1 = 1.71, which is the lowest calculated so far. Note that this solution is not yet integrated in [19].

To handle high-order filters in a reasonable amount of time, an MCM heuristic must exhibit a low runtime complexity, commonly expressed in M and N (see Table 7 in [3]). It is well known that the performance of CSE algorithms depends on the initial recoding of the coefficients, such as binary, signed digit (SD), minimum signed digit (MSD), CSD etc. Contrary to CSE, DAG algorithms make no assumption on the representation format. As a matter of fact, they offer a higher capability in optimising the cost as they explore a larger solution space. However, such flexibility requires a substantial runtime, which makes DAG less attractive for high-order filters. Although RAGn [7] belongs to the DAG category, it has a low runtime complexity. The reason is that RAGn relies on a precomputed lookup table of optimal single constant decompositions, currently limited to 19 bits [8]. This upper limit disqualifies RAGn when considering filters with bit lengths larger than 19. Moreover, it was proved in [8] that Hcub requires up to 20% fewer additions on average than RAGn. Hcub is a DAG algorithm that shows a reasonable time complexity, which makes it a potential candidate for high-order filters. SIREN and NAIAD, as DAG algorithms, are unable to handle filters with, respectively, >20 and 160 coefficients due to their exponential complexity [21]. Besides, they both use Hcub as the bounding heuristic in their optimisation process.

Concerning the two CSE algorithms CRA [5] and NR-SCSE [6], their analytic computation complexities are unknown. However, Xu et al. [5] reported the CPU time consumed by CRA and NR-SCSE for a number of constants (M) varying from 10 to 80 with a fixed constant size (N) of 11 bits. Both algorithms show a linear time increase with a very low slope regarding M. However, it is a well-established fact that MCM algorithms are much more sensitive to a variation of N than of M.

Consequently, for high-order filters, RADIX-2^r is compared with Hcub, CRA, NR-SCSE, and the direct CSD, though the latter is not an effective MCM heuristic.

For comparison, we consider five high-order filters employed in the filter bank channeliser of digital advanced mobile phone systems (D-AMPS). The filter specifications are mentioned in [9] and their respective coefficients are taken from [15]. These five filters are symmetrical and have the highest number of taps (up to 695) and the largest bit length (up to 24) that we can find in [15] for benchmarking. The different MCM solutions are reported in Table 4.

The cost-oriented and depth-oriented values of RADIX-2^r are obtained by running the online version [19] with the 'adder cost' and 'adder depth' options, respectively. The normalised cost × depth products of the five heuristics are ranked in ascending order as follows: 1.66, 2.13, 2.18, 2.85, and 4.64, for RADIX-2^r, CRA, NR-SCSE, Hcub, and CSD, respectively. The value 1.39 × 1.20 = 1.66 corresponds to the cost-oriented option with a parallel implementation, whereas the depth-oriented option gives 1.54 × 1.10 = 1.69. The '*' marked solution yields a cost × depth product of 1.43 × 1 = 1.43, which is the lowest value that RADIX-2^r can deliver for the set of five filters. The number of coefficients requiring RADIX-2^r recoding with r = 2 is <1% in most cases.
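Table 4 Adder cost and adder depth of some high-order FIR filters

| Filter | Lower bounds [10]: cost | Lower bounds [10]: depth | CSD [20]: cost | CSD [20]: depth | RADIX-2^r cost oriented: cost | RADIX-2^r cost oriented: depth, // | RADIX-2^r cost oriented: depth, ⋯ | RADIX-2^r depth oriented: cost | RADIX-2^r depth oriented: depth, // | RADIX-2^r depth oriented: depth, ⋯ | NR-SCSE [6]: cost | NR-SCSE [6]: depth | CRA [5]: cost | CRA [5]: depth | Hcub [8]: cost | Hcub [8]: depth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FIR_279_140_24 [9] | 141 | 4 | 729 | 4 | 239 | 4 | 5 | 239 | 4 | 5 | 355 | 4 | 346 | 4 | 158 | 26 |
| FIR_418_208_22 [9] | 208 | 4 | 1008 | 4 | 310 | 5 | 5 | 389/327^a | 4/4^a | 5/8^a | 474 | 4 | 466 | 4 | 212 | 9 |
| FIR_516_256_24 [9] | 256 | 4 | 1212 | 4 | 362 | 5 | 6 | 362/373^a | 5/4^a | 6/8^a | 575 | 4 | 562 | 4 | 259 | 9 |
| FIR_631_313_23 [9] | 313 | 4 | 1394 | 4 | 403 | 5 | 5 | 523/424^a | 4/4^a | 5/8^a | 647 | 4 | 632 | 4 | 315 | 6 |
| FIR_695_345_24 [9] | 345 | 4 | 1525 | 4 | 444 | 5 | 5 | 444/453^a | 5/4^a | 5/8^a | 706 | 4 | 693 | 4 | 348 | 6 |
| total | 1263 | 20 | 5868 | 20 | 1758 | 24 | 26 | 1957/1816 | 22/20 | 26/37 | 2757 | 20 | 2699 | 20 | 1292 | 56 |
| normalisation | 1 | 1 | 4.64 | 1 | 1.39 | 1.20 | 1.30 | 1.54/1.43 | 1.10/1 | 1.30/1.85 | 2.18 | 1 | 2.13 | 1 | 1.02 | 2.80 |

The lower bounds in cost and depth are given by $\lceil \log_2 \lceil (\min(N_i)+1)/2 \rceil \rceil + M - 1$ and $\lceil \log_2 \lceil (\max(N_i)+1)/2 \rceil \rceil$, respectively.
^a Values are obtained by reducing the cost first using the 'adder cost' option in [19], then looking for coefficients whose Ath// exceeds Lbd and recoding them in RADIX-2^r with r = 2.
//, parallel implementation (binary tree); ⋯, serial implementation in cascaded adders; $\lceil\ \rceil$, ceiling function.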

Table 5 Average cost and depth of randomly generated MCM cases

| MCM | Lower bound [10]: cost | Lower bound [10]: depth | RADIX-2^r: cost^a | RADIX-2^r: depth^b | Hcub [8]: cost | Hcub [8]: depth |
|---|---|---|---|---|---|---|
| MCM_27_12 | 29 | 3 | 42.53 | 3.10 | 29.02 | 8.36 |
| MCM_28_12 | 30 | 3 | 43.71 | 3.10 | 30.02 | 8.18 |
| MCM_29_12 | 31 | 3 | 44.22 | 3.10 | 30.82 | 8.72 |
| MCM_30_12 | 32 | 3 | 45.39 | 3.11 | 31.68 | 8.58 |
| MCM_32_12 | 33 | 3 | 47.86 | 3.11 | 33.46 | 7.78 |
| MCM_96_16 | 100 | 4 | 164.94 | 4.00 | 104.72 | 17.02 |
| MCM_102_16 | 104 | 4 | 172.14 | 4.00 | 109.82 | 17.56 |
| MCM_105_16 | 107 | 4 | 174.28 | 4.00 | 112.20 | 17.22 |
| MCM_105_16 | 107 | 4 | 174.28 | 4.00 | 112.20 | 17.22 |
| MCM_109_16 | 111 | 4 | 175.18 | 4.00 | 115.58 | 17.48 |
| MCM_135_20 | 138 | 4 | 289.16 | 4.00 | 178.86 | 22.42 |
| MCM_187_20 | 191 | 4 | 387.34 | 4.08 | 231.40 | 24.06 |
| MCM_234_20 | 238 | 4 | 469.10 | 4.12 | 275.78 | 24.42 |
| MCM_243_20 | 244 | 4 | 482.54 | 4.28 | 285.14 | 25.54 |
| MCM_271_20 | 274 | 4 | 523.74 | 4.80 | 310.40 | 25.74 |
| MCM_140_24 | 146 | 4 | 355.94 | 5.00 | 243.32 | 26.44 |
| MCM_208_24 | 217 | 4 | 495.30 | 5.00 | 315.64 | 28.87 |
| MCM_256_24 | 263 | 4 | 594.38 | 5.00 | 377.18 | 32.39 |
| MCM_313_24 | 318 | 4 | 713.55 | 5.06 | 441.48 | 32.65 |
| MCM_345_24 | 353 | 4 | 779.27 | 5.46 | 473.93 | 32.97 |

^a Values obtained by running the 'adder cost' option of [19].
^b Serial depth.
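How far the average Hcub cost drifts from the lower bound as the bit length grows can be read off Table 5 with a few lines of arithmetic. The sketch below is illustrative only: it copies one row per bit length from Table 5, and the names `rows` and `cost_overhead` are hypothetical.

```python
# (lower-bound cost, Hcub average cost, Hcub average depth), one row per
# bit length N, copied from Table 5.
rows = {
    "MCM_27_12": (29, 29.02, 8.36),
    "MCM_96_16": (100, 104.72, 17.02),
    "MCM_135_20": (138, 178.86, 22.42),
    "MCM_140_24": (146, 243.32, 26.44),
}

def cost_overhead(lower_bound, cost):
    """Excess of the average adder cost over the lower bound, in percent."""
    return 100.0 * (cost - lower_bound) / lower_bound

overheads = {name: cost_overhead(lb, cost) for name, (lb, cost, _) in rows.items()}
depths = [depth for _, _, depth in rows.values()]

if __name__ == "__main__":
    for name, value in overheads.items():
        print(f"{name:12s} Hcub cost overhead over lower bound: {value:6.2f}%")
```

For these four rows the overhead stays below 5% up to N = 16 and exceeds 25% from N = 20, while the average Hcub depth grows monotonically.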

For instance, in FIR_695_345_24, only two coefficients (169,127 and 10,415,657) among 345 need a special recoding with r = 2, whereas all the other coefficients (343) are recoded with r = 8 to minimise the adder cost (see the $r_1$ formula in Table 3 of [3]). The serial implementation induces a product of 1.39 × 1.30 = 1.80. In any case, RADIX-2^r is by far the most balanced solution.

CRA and NR-SCSE are two unbalanced heuristics, as they achieve optimality in depth to the detriment of an excessive increase in cost. Besides, there are no formal proofs in [5, 6] that guarantee optimality for any given coefficient set. CRA performs better in cost than NR-SCSE. RADIX-2^r guarantees optimal solutions in depth with far fewer additions than CRA. For instance, in FIR_279_140_24, a saving of 31% over CRA is reached in cost. The saving is even higher (35%) in the case of near optimality, such as in FIR_516_256_24.

Contrary to CRA and NR-SCSE, CSD ensures depth optimality for any coefficient set, as was analytically proved in [10]. However, CSD generates a prohibitive number of additions. For FIR_279_140_24, RADIX-2^r provides a 67% saving over CSD.

Likewise, Hcub is also an unbalanced heuristic. It achieves near optimality in cost at the expense of an important increase in depth (up to 26 steps). In addition, disparate depth values are produced, suggesting that the depth decreases with an increasing filter length, as conjectured (no proof) in [9]. Based on a limited number of filter cases, the authors in [9, 22] concluded that Hcub is near optimal in cost for any MCM complexity. To verify these claims, we have calculated the average cost and depth for 50 randomly generated MCM cases. The bit length varies from 12 to 24, whereas the number of constants (M) corresponds to the cardinality of the Hmin set of the D-AMPS filters used in [9]. We adopt the following nomenclature: MCM_M_N, where M is the number of constants and N is the maximum bit size of the constant. We have used the readily available source code of Hcub (version 14 January 2009) with the following command: synth -r 50 140 -b 24 -seed 1 -v 3. The same randomly generated constant sets are run with RADIX-2^r for comparison (Table 5).

Note that, contrary to the claims of [9], Hcub adder depth increases as the number of constants increases. With the exception of FIR_279_140_24 (Table 4), whose depth (26 steps) is very close to the average (26.44 steps), the four remaining filters are very special cases upon which no conclusion could be drawn regarding the adder depth, as mistakenly assumed by Chang and Faust [9]. As for the adder cost, while Hcub preserves the near optimality for N ≤ 16, it loses it for N ≥ 20, contrary to the declarations of [9, 22].

Fig. 2 CSEA applied to RADIX-2^r with $X_b = 8$ and r = 3
(a) $Q_{j+1} = \pm 1$ and $Q_j = \pm 2^2$, (b) $Q_{j+1} = \pm 2^2$ and $Q_j = \pm 1$

However, when considering the filters of Table 4, Hcub produces costs which are more or less near optimal for 22 ≤ N ≤ 24. The question is whether this is a pure coincidence because a sample of only five filters is not representative, or whether it is due to the fact that, in practice, the coefficient sets of digital filters are not randomly distributed. The non-uniformity of the coefficient values was mentioned in [23] and demonstrated for 200 filters with 9 bit quantised coefficients in [24]. In general, it can be verified by the fact that all filters have a min(1, 1/x) envelope due to the sin(x)/x function, which means that the coefficients are distributed near zero more frequently [15]. This question remains, however, an open research problem.

Significant conclusion: Using standard metrics (adder level), we have shown by comparison that RADIX-2^r and Hcub are the best algorithms in depth and cost, respectively. However, the ultimate objective is to see how both algorithms perform in an actual circuit implementation (bit level).

4 Bit-level version of RADIX-2r

Bit-level heuristics yield better area results than adder-level heuristics, but at the expense of an excessive computational effort. This disqualifies them from tackling MCM problems of high complexity. In [25], an accurate bit-level model is used in the objective function of an exact MCM algorithm to estimate the area of each operation. Although better results are achieved, the algorithm requires an exponential runtime $O(2^{M \times N})$. Similarly, in [26], a detailed bit-level model is incorporated into the RAGn [7] algorithm to pursue a double objective: minimise the number of adders as well as the number of bit adders within each adder block. The area is made smaller but with a much higher runtime, knowing that RAGn complexity alone is $O(M^2 \times N^3 \times \log(M \times N))$. We introduce hereafter the bit-level version of RADIX-2^r and show that it requires no more than $O(M \times N/r)$ execution time.

We consider the general MCM case $\{C_0, C_1, C_2, \ldots, C_{M-1}\} \times X$, where $C_i$ is a non-negative constant and X is represented in 2's complement format. However, in 2's complement arithmetic the sign of all operands needs to be extended to the bit length of the result before any operation. This leads to an important overhead in speed, power, and area. To reduce the sign extension (SE) overhead in MCM, many approaches have been proposed, notably the MCM block partitioning [27] and positive offset [28] methods, which exhibit the best results in speed and area, respectively. The conventional SE approach (CSEA) used traditionally in variable multiplication (Y × X) offers a good compromise in MCM [27], especially when both area and power are a concern. Compared with [27, 28], CSEA is very easy to use: the SE is performed locally between successive PPs, assuring that at each stage, the partial sum contains the sum of the sign bits of previous PPs. In addition, CSEA is fully predictable, making possible the determination of the exact analytic bound of RADIX-2^r at bit level.

Based on (1), two adjacent PPs $\cdots + 2^{r(j+1)} \times Q_{j+1} \times X + 2^{rj} \times Q_j \times X + \cdots$ require an SE span

left-shift span between successive PPs varies from r to 2r − 1 positions. By using the appropriate SE for each operation in the RADIX-2^r solution, there will never be an overflow and the output word length is just enough to represent all possible outputs.

We define the following conditions for our bit-level model. The RADIX-2^r solution of the MCM block is mapped on carry-propagate-adders/subtractors (CPA/S), where the total area is linear in the number of adders/subtractors. The latter are, respectively, formed of a serial connection of full adders (FAs) and full subtractors (FSs). We assume that FA and FS have the same area/speed cost, which is a realistic approximation. From now on, we refer to both as FAs. Finally, CSEA is employed to reduce the SE overhead since the common variable X is in 2's complement format.

Unlike CSE and DAG algorithms, in RADIX-2^r, the recoding of each constant is performed regardless of the others, but all constants share the same odd-multiple set $OM(2^r)$. This important feature also makes possible the determination of the bit-level complexity. To facilitate the understanding of the demonstration, we go through an illustrative SCM example (10,599 × X) taken from [13].

The recoding of 10,599 × X in RADIX-2^r and Hcub gives: $P_{RADIX} = X_1 + X_0 \times 2^5 - X_1 \times 2^8 + X_0 \times 2^{12}$, with $X_0 = 3 \times X = X \times 2 + X$ and $X_1 = 7 \times X = X \times 2^3 - X$. $P_{Hcub} = X_2 \times 2^3 - X$, with $X_0 = 257 \times X = X \times 2^8 + X$, $X_1 = 265 \times X = X_0 + X \times 2^3$, and $X_2 = 1325 \times X = X_1 \times 2^2 + X_1$.

The product C × X requires a maximum of $X_b + \lceil \log_2 C \rceil$ bits. The implementation of $P_{RADIX}$ and $P_{Hcub}$ is depicted in Fig. 3 for $X_b = 8$. Although $P_{Hcub}$ requires four additions and $P_{RADIX}$ five, the total number of FAs in $P_{Hcub}$ is greater than in $P_{RADIX}$. $P_{Hcub}$ consumes 61 FAs (Fig. 3c), while $P_{RADIX}$ needs 50 and 57 FAs in the serial (Fig. 3b) and parallel (Fig. 3a) implementations, respectively. Serial $P_{RADIX}$ consumes fewer FAs because of the successive left shifts performed in the PP array. In parallel $P_{RADIX}$, a lower depth is obtained (three steps instead of four) by altering the regularity of the PP array (tree structure). This results in a substantial increase (14%) in the number of FAs.

To understand why Hcub consumes a high number of FAs, we need to determine the analytic bounds of FA overhead (Δ) in Hcub and RADIX-2^r. The term Δ is the number of extra FAs that must be concatenated to the basic $X_b$ FAs to form a complete adder (Δ + $X_b$ FAs). We use the A-operation introduced in [8] as a unified formalism of MCM. In fact, any MCM solution is simply a collection of A-operations linked together [12]. To highlight the bit level ($X_b$) in the A-operation, the $C_i \times X$ product is expressed as

$$C_i \times X = A(u \times X, v \times X) = 2^{l_1} \times u \times X \pm 2^{l_2} \times v \times X. \tag{8}$$

Equation (8) admits an infinite number of solutions $\{u, v, l_1, l_2\}$ that satisfy $C_i \times X$. To explore the solution space in a reasonable
varying from 1 to 2r − 1 bits, corresponding to Q j + 1 = ± 1 with amount of time and maintain a finite size of adders, most MCM
Q j = ± 2r − 1, and Q j + 1 = ± 2r − 1 with Q j = ± 1, respectively. algorithms constrain A u × X, v × X ≤ 2Nmax + 1 × X, where
These two extreme configurations are exemplified in Fig. 2 for Xb  Nmax = log2Cimax is the bit width of the largest constant (Cimax) in
an MCM operation [8, 29, 30]. Therefore, the maximum number of
= 8 and r = 3, where Xb is the bit size of X. Note that in RADIX-2r
FAs in Ci × X is Xb + Nmax + 1 (Δmax = Nmax + 1). Few exceptions,
with CSEA, the average SE span is r. Likewise, in RADIX-2r, the
however, consider different bounds for (8): $2^{N_{max}} \times X$ in H3 [12], and $2^{N_{max}+2} \times X$ in H4 [12] and DIFFAG [11]. With an increased bound, H4 performs better than H3 in reducing the average number of adders, but at the cost of a significant run time [12] and a higher $\Delta$ ($\Delta_{max} = N_{max} + 2$).

Fig. 3 Bit-level implementation of 10,599 × X using (a) RADIX-2^r parallel, (b) RADIX-2^r serial, (c) Hcub

Constraining $C_i \times X \le 2^{N_{max}+1} \times X$ in Hcub implies that $1 \le 2^{l_1} \times u,\ 2^{l_2} \times v \le 2^{N_{max}+1}$ for all $u$ and $v$ ($l_1$ and $l_2$ depend on $u$ and $v$). This means that a smaller fundamental can be derived from bigger ones via a subtraction: $C_i \times X = 2^{l_1} \times u \times X - 2^{l_2} \times v \times X$. In this case, the number of FAs can be greater than the effective bit length of $C_i \times X$. In MCM, this construction process is even exaggerated, since Hcub considers the impact of each possible intermediate fundamental on all target fundamentals to be implemented and chooses the one that yields the best cumulative benefit [22], even if smaller fundamentals are made of bigger ones.

In RADIX-2^r recoding, the largest value of $m_j$ in $OM(2^r)$ is $2^{r-1} - 1$. Hence, the A-operation between successive PPs

$$\cdots \pm 2^{r(j+1)+k_{j+1}} \times m_{j+1} \times X \pm 2^{rj+k_j} \times m_j \times X \pm \cdots \qquad (9)$$

consumes at most $X_b + \lceil \log_2(2^{r-1} - 1) \rceil = X_b + r - 1$ FAs. Note that $\Delta_{max} = N_{max} + 1$ in Hcub is much bigger than in RADIX-2^r ($\Delta_{max} = r - 1$). The reason is that $N_{max}$ is a function of $\log$, while $r$ is a function of $W(\log)$ [see (3) and (4) for SCM and MCM, respectively]. Note that $W(x) < \log(x)$ for any non-negative value of $x > 2$. Thus, the saving in $\Delta$ of RADIX-2^r over Hcub becomes much more significant for large constants.

Based on experimental results, many papers [17, 25, 31] insisted on the fact that fewer additions do not necessarily lead to fewer FAs. However, to the best of our knowledge, none of them provided an in-depth analysis or a formal proof of the problem as we did above. This is a strong result that will set new directions for MCM.

In the example of Fig. 3 ($C_{max} = 10{,}599$), $N_{max} = \lceil \log_2 10{,}599 \rceil = 14$ while $r = 2W\!\left(\sqrt{(\lceil \log_2 10{,}599 \rceil + 1)\log 2}\right)/\log 2 = 4$. For $X_b = 8$, the longest adders include up to $8 + 14 + 1 = 23$ and $8 + 4 - 1 = 11$ FAs for Hcub and RADIX-2^r, respectively. In Fig. 3c, the longest adder comprises 22 FAs, while in Fig. 3b, it does not exceed 11 FAs. By comparison, these 22 FAs actually amount to two adder blocks. The 22 FAs come from the subtraction of $1 \times X$ from $10{,}600 \times X$ to generate the final result $10{,}599 \times X$. Note that $10{,}599 \times X < 10{,}600 \times X$; however, much greater fundamentals up to $2^{14+1} \times X = 32{,}768 \times X$ can be used, yielding up to 23 FAs. Note that $\sum_{i=1}^{5} \Delta_i = 50 - 5 \times 8 = 10$ FAs in Fig. 3b, while it is $\sum_{i=1}^{4} \Delta_i = 61 - 4 \times 8 = 29$ FAs in Fig. 3c. To normalise the adder cost ($N_{cost}$), we divide the total number of FAs by $X_b$. It gives $N_{cost} = 50/8 = 6.25$ and $N_{cost} = 61/8 = 7.62$ for Figs. 3b and c, respectively. Thus, Hcub uses over one normalised adder more ($7.62 - 6.25 = 1.37$) than the serial RADIX-2^r solution.

Aided by Figs. 1 and 3, we determine hereafter the maximum number of FAs needed by $C \times X$ for a serial implementation in RADIX-2^r.

In RADIX-2^r, the PPs require $\lceil (N+1)/r \rceil - 1$ additions. Each addition consumes at most $X_b + r - 1$ FAs. Hence, the maximum number of FAs needed by the PP array is

$$FA_{pp} = \left(\lceil (N+1)/r \rceil - 1\right) \times (X_b + r - 1). \qquad (10)$$

It can be easily proved that the odd-multiple tree (Fig. 1) consumes at most $X_b \times 2^{r-3}$ and $X_b \times (2^{r-3} - 1) + (r - 2) \times 2^{r-3} - 1$ FAs in the addition and subtraction parts, respectively. Thus, the maximum number of FAs required by the odd-multiple tree is

$$FA_{om} = X_b \times (2^{r-2} - 1) + (r - 2) \times 2^{r-3} - 1. \qquad (11)$$

Hence, $C \times X$ needs no more than $FA_{pp} + FA_{om}$. For a 14 bit constant like 10,599, (3) gives $r = 4$. With $X_b = 8$, $FA_{pp} + FA_{om}$ yields $33 + 27 = 60$ FAs. Note that $10{,}599 \times X$ needs 50 FAs.

The SCM formulas are straightforwardly extended to MCM as follows. In the case of $M$ non-negative constants with the same bit size $N$, the upper bound in FAs is $M \times FA_{pp} + FA_{om}$, with $r$ given by (4). However, in the case of $M$ constants with different bit sizes $N_i$, such as in FIR filters, the upper limit is equal to $FA_{pp} + FA_{om}$, where

$$FA_{pp} = \sum_{i=0}^{M-1} \left(\lceil (N_i+1)/r \rceil - 1\right) \times (X_b + r - 1), \qquad (12)$$

$FA_{om}$ is given by (11), and

$$r = 2 \times W\!\left(\sqrt{\sum_{i=0}^{M-1} (N_i + 1) \times \log 2}\right)/\log 2. \qquad (13)$$

Let us take the smallest filter FIR_25_13_12 as an example. Equation (13) gives $r = 5$. With $X_b = 16$, and replacing $r$ by 5 in (12) and (11), we have a maximum of $340 + 123 = 463$ FAs. The RADIX-2^r solution [19] for this filter gives 22 additions. Therefore, we are sure that the MCM block comprising 22 additions can be implemented with at most 463 FAs (upper bound). However, the serial implementation requires only 393 FAs (Table 6). Note that RADIX-2^r induces the lowest $\sum_{i=1}^{22} \Delta_i$ (41

Table 6 MCM solutions for FIR_25_13_12: comparison in FAs

| Algorithm | Cost | FAs (Cost × Xb) | ∑_{i=1}^{Cost} Δi | Total FAs | Ncost | Ncost − cost |
|---|---|---|---|---|---|---|
| Hcub [8] | 16 | 256 | 103 | 359 | 22.43 | 6.43 |
| Hcub+ [8]^a | 18 | 288 | 93 | 381 | 23.81 | 5.81 |
| RAG-n [26] | 18 | 288 | 98 | 386 | 24.12 | 6.12 |
| BHM [26] | 20 | 320 | 118 | 438 | 27.37 | 7.37 |
| Pasko [26] | 23 | 368 | 136 | 504 | 31.50 | 8.50 |
| C1 [26] | 19 | 304 | 104 | 408 | 25.50 | 6.50 |
| DA-MST [26] | 19 | 304 | 110 | 414 | 25.87 | 6.87 |
| RFAG-n [26] | 17 | 272 | 105 | 377 | 23.56 | 6.56 |
| RADIX-2^r | 22 | 352 | 41 | 393 | 24.56 | 2.56 |

^a Hcub+ is the Hcub version with minimal adder depth. ∑_{i=1}^{Cost} Δi is the total FA overhead. Ncost is the normalised cost (number of FAs divided by Xb = 16).
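The derived columns of Table 6 follow from the adder cost and the total FA overhead alone. The short Python sketch below recomputes them under that reading; the names `TABLE6` and `fa_metrics` are illustrative and not part of the original work, and the figures are copied from the table. The table values appear to be truncated to two decimals, so the printed values may differ in the last digit.

```python
# Illustrative helper (hypothetical names); data copied from Table 6, Xb = 16.
XB = 16  # input word length in bits

# algorithm -> (adder cost, total FA overhead, i.e. the sum of Delta_i)
TABLE6 = {
    "Hcub": (16, 103),
    "Hcub+": (18, 93),
    "RAG-n": (18, 98),
    "BHM": (20, 118),
    "Pasko": (23, 136),
    "C1": (19, 104),
    "DA-MST": (19, 110),
    "RFAG-n": (17, 105),
    "RADIX-2^r": (22, 41),
}


def fa_metrics(cost, overhead, xb=XB):
    """Return (total FAs, normalised cost, normalised cost minus adder cost)."""
    total = cost * xb + overhead  # xb basic FAs per adder plus the overhead
    ncost = total / xb            # normalised cost N_cost
    return total, ncost, ncost - cost


if __name__ == "__main__":
    for name, (cost, overhead) in TABLE6.items():
        total, ncost, gap = fa_metrics(cost, overhead)
        print(f"{name:10s} total={total:4d}  Ncost={ncost:6.2f}  gap={gap:5.2f}")
```

The gap column reduces to the total overhead divided by Xb, which is why the algorithm with the smallest overhead also shows the smallest gap.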

Table 7 Maximum number of FAs with Xb = 16

| Filter | r (13) | FAom (11) | FApp (12) | FAom + FApp (max. FAs) | Adder cost |
|---|---|---|---|---|---|
| FIR_279_140_24 [9] | 8 | 1199 | 4623 | 5822 | 239 |
| FIR_418_208_22 [9] | 8 | 1199 | 5957 | 7156 | 310 |
| FIR_516_256_24 [9] | 8 | 1199 | 7268 | 8467 | 362 |
| FIR_631_313_23 [9] | 8 | 1199 | 8303 | 9502 | 403 |
| FIR_695_345_24 [9] | 8 | 1199 | 9200 | 10,399 | 444 |
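The bounds (10) and (11) can be checked numerically against the figures quoted in the text and in Table 7. The Python sketch below is a minimal illustration under the reading that (10) uses a ceiling on (N + 1)/r; the function names `fa_pp`, `fa_om`, `longest_adder_hcub` and `longest_adder_radix` are hypothetical and introduced here only for illustration.

```python
def fa_pp(n_bits, xb, r):
    """Eq. (10): FA bound of the partial-product array for one N-bit constant."""
    additions = -(-(n_bits + 1) // r) - 1  # ceil((N + 1) / r) - 1
    return additions * (xb + r - 1)


def fa_om(xb, r):
    """Eq. (11): FA bound of the odd-multiple tree (assumes r >= 3)."""
    return xb * (2 ** (r - 2) - 1) + (r - 2) * 2 ** (r - 3) - 1


def longest_adder_hcub(xb, n_max):
    """Longest adder under the Hcub bound: Xb + Delta_max, Delta_max = Nmax + 1."""
    return xb + n_max + 1


def longest_adder_radix(xb, r):
    """Longest adder in RADIX-2^r: Xb + Delta_max, Delta_max = r - 1."""
    return xb + r - 1


if __name__ == "__main__":
    # SCM example 10,599 x X: N = 14, r = 4, Xb = 8
    print("SCM bound:", fa_pp(14, 8, 4), "+", fa_om(8, 4))
    # Odd-multiple tree for the Table 7 filters: r = 8, Xb = 16
    print("FAom (r = 8, Xb = 16):", fa_om(16, 8))
    # Each FApp entry of Table 7 should be a whole number of (Xb + r - 1)-bit adders
    for fapp in (4623, 5957, 7268, 8303, 9200):
        print(fapp, "FAs ->", fapp // (16 + 8 - 1), "PP additions")
```

Since the per-coefficient bit sizes are not listed here, the sketch only confirms that every FApp entry of Table 7 is a multiple of Xb + r − 1 = 23, as (12) requires.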

FAs), and that some heuristics with lower adder cost (BHM, C1, and DA-MST) yield more FAs than RADIX-2^r. As RADIX-2^r induces a very low $\sum_{i=1}^{Cost} \Delta_i$, the number of adders (Cost), which serves as an abstraction for the logic cost, is more credible in RADIX-2^r than in other algorithms. When the cost is compared with the normalised cost ($N_{cost}$), RADIX-2^r exhibits by far the lowest gap (2.56).

To provide an idea of the variation scope of consumption, we calculated the maximum number of FAs required by the high-order filters (Table 7). It is important to note that $r = 8$ is the value that ensures the optimum (minimum) for both $FA_{pp} + FA_{om}$ and the adder cost. This is a very strong result, as there exists no MCM heuristic for the time being that is predictable at the bit level (FAs) or even at adder level.

Significant conclusion: On the one hand, we have proved in Section 2 that for the typical variation range of the constant bit size (12–24) in filtering, Ath… ≤ Ath / / + 1 (Table 2). On the other hand, we have also shown that the parallel implementation yields a significant increase in FAs. Furthermore, the serial structure is highly regular, leading therefore to a shorter routing (reduced delay) and a more compact area. Consequently, in RADIX-2^r, the serial implementation stands as the best option for the design of FIR filters.

5 Experimental results

We have shown in our recent paper [3], using 180 nm CMOS technology, that RADIX-2^r gives the best results in speed and power compared with the most prominent MCM heuristics. The reason is its optimality in adder depth. Hereafter, RADIX-2^r is compared with Hcub, which is one of the best heuristics in area. Based on the conclusion above, only the serial implementation in RADIX-2^r is considered.

All filters of Tables 3 and 4 were coded in Verilog according to the bit description thoroughly depicted in Fig. 3. Note that we consider the whole architecture of the transposed form of the FIR filter, i.e. the MCM block plus the adder structure. The latter is exactly the same in both implementations for each filter. The RADIX-2^r recoding is generated using the online tool [19] with the ‘adder cost’ option. The input data word length ($X_b$) is fixed to 16 bits for all filters. The generated filters are mapped to the UMC 65 nm standard-cell library using Cadence RTL Compiler. The synthesis tool was given a relaxed timing constraint of 50 ns. The place and route was performed using Cadence SoC Encounter (version EDI 14). Dynamic power consumption was evaluated with 2000 random input samples at a frequency of 25 MHz. The post-layout results in speed, power, and area are reported in Table 8.

As a practical illustration of how RADIX-2^r succeeds in beating Hcub in area in spite of its higher number of additions, we have implemented the two small illustrative examples ($10{,}599 \times X$) of Figs. 3b and c. They consume five and four adder blocks with a total of 50 and 61 FAs (18% saving over Hcub), respectively. Note that both examples exhibit the same adder depth (four steps). Results in Table 9 confirm the relevance of the bit-level (FAs) model over the adder-level model in predicting the area occupation.

In low-order filters, Hcub is slightly better than RADIX-2^r. However, in this case, the gain margin in area is insignificant. In medium-order filters, the opposite is rather true, with small improvements in area over Hcub. In high-order filters, RADIX-2^r yields better results than Hcub in all cases, with an area saving ranging from 1.5 up to 3.46%. Although the cost for all filters in Hcub is very close to the lower bound (Tables 3 and 4), the saving in area increases as $\Delta_{max}$ in Hcub increases. This is the experimental proof that the FA overhead outweighs the cost in Hcub (the solution employs fewer adder blocks, but with a much larger size).

If we take FIR_279_140_24 as an example (Table 4), the RADIX-2^r and Hcub solutions for the MCM blocks give 239 and 158 adders, respectively. Although RADIX-2^r consumes 81 adders more than Hcub, the 158 adders of Hcub require more FAs than their counterparts in RADIX-2^r. The reason is that $\Delta_{max} = 24 + 1 = 25$ FAs in Hcub, whereas it is only $\Delta_{max} = 8 - 1 = 7$ FAs in RADIX-2^r. Therefore, the longest adder in Hcub consumes up to $16 + 25 = 41$ FAs, while it does not exceed $16 + 7 = 23$ FAs in RADIX-2^r. Based on (11)–(13), we are sure that the MCM block in RADIX-2^r consumes fewer than $4623 + 1199 = 5822$ FAs (Table 7), while Hcub needs many more than 5822 FAs, since RADIX-2^r achieves a 2.51% saving in area over Hcub (Table 8). The adder depths in RADIX-2^r and Hcub are 5 and 26 steps, respectively. This explains the high savings in speed and power, which are 38.01 and 25.85%, respectively.

We also compared RADIX-2^r with the Hcub version with minimal depth (Hcub+), which is available at www.spiral.com. Hcub+ guarantees a minimal depth but with a substantial increase in cost and runtime, depending on how far the Hcub solution is from the lower bound (Table 10). Note that the runtime complexity of Hcub+ is unknown. Compared with Hcub+, RADIX-2^r is better in
Table 8 RADIX-2^r versus Hcub: post-layout implementation results in 65 nm CMOS of a number of FIR filters

| Filter | Delay^a RADIX-2^r, ns | Delay^a Hcub [8], ns | Power^b RADIX-2^r, mW | Power^b Hcub [8], mW | Area^c RADIX-2^r, µm^2 | Area^c Hcub [8], µm^2 | Delay saving, % | Power saving, % | Area saving, % |
|---|---|---|---|---|---|---|---|---|---|
| Low-order filters | | | | | | | | | |
| FIR_25_13_12 [3] | 12.208 | 14.300 | 0.6106 | 0.6850 | 14,954.76 | 14,718.60 | 14.62 | 10.86 | −1.60 |
| FIR4_30_14_13 [18] | 11.960 | 12.777 | 0.7510 | 0.8058 | 17,652.60 | 17,628.12 | 6.39 | 6.80 | −0.13 |
| FIR3_30_14_13 [18] | 13.501 | 17.308 | 0.6973 | 0.7936 | 16,632.72 | 16,315.92 | 21.99 | 12.13 | −1.94 |
| FIR1_40_19_12 [18] | 13.567 | 15.694 | 1.0216 | 1.1632 | 24,260.76 | 24,189.84 | 13.55 | 12.17 | −0.29 |
| FIR2_40_19_13 [18] | 13.615 | 15.987 | 1.0199 | 1.2254 | 25,520.40 | 25,209.72 | 14.83 | 16.77 | −1.23 |
| FIR7_40_19_14 [18] | 13.308 | 14.065 | 1.0432 | 1.1709 | 25,040.52 | 25,540.20 | 5.38 | 10.90 | 1.95 |
| average | 13.026 | 15.021 | 0.8572 | 0.9739 | 20,676.96 | 20,600.40 | 12.79 | 11.60 | −0.54 |
| Medium-order filters | | | | | | | | | |
| FIR6_60_29_14 [18] | 14.182 | 17.578 | 1.5471 | 1.8441 | 37,182.24 | 37,601.28 | 19.31 | 16.10 | 1.11 |
| FIR8_80_36_14 [18] | 14.882 | 14.944 | 1.9851 | 2.2344 | 47,372.76 | 47,295.72 | 0.41 | 11.15 | −0.16 |
| FIR5_80_39_15 [18] | 14.732 | 18.257 | 2.1998 | 2.6870 | 50,850.72 | 51,538.68 | 19.30 | 18.13 | 1.33 |
| average | 14.598 | 16.926 | 1.9106 | 2.2551 | 45,135.24 | 45,478.56 | 13.00 | 15.12 | 0.76 |
| High-order filters | | | | | | | | | |
| FIR_279_140_24 [9] | 19.838 | 32.005 | 11.4926 | 15.5008 | 217,743.12 | 223,366.68 | 38.01 | 25.85 | 2.51 |
| FIR_418_208_22 [9] | 20.884 | 24.678 | 16.3238 | 20.0545 | 315,236.88 | 320,047.20 | 15.37 | 18.60 | 1.50 |
| FIR_516_256_24 [9] | 19.010 | 22.503 | 21.4947 | 27.5732 | 381,966.12 | 393,904.44 | 15.52 | 22.04 | 3.03 |
| FIR_631_313_23 [9] | 18.747 | 21.433 | 25.6591 | 30.2111 | 460,180.80 | 473,360.04 | 12.53 | 15.06 | 2.78 |
| FIR_695_345_24 [9] | 20.569 | 21.969 | 29.9247 | 32.9955 | 505,337.04 | 523,501.20 | 6.37 | 9.30 | 3.46 |
| average | 19.809 | 24.517 | 20.9789 | 25.2670 | 376,092.79 | 386,835.91 | 17.56 | 18.17 | 2.65 |

^a Minimum clock period.
^b Total dynamic power dissipation.
^c Total area.
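The saving columns of Table 8 are consistent with a relative difference taken with respect to the Hcub figure. The following Python sketch recomputes them for one row under that assumption; `saving_percent` is a hypothetical helper name, and the published values appear to be truncated rather than rounded, so the last digit may differ.

```python
def saving_percent(radix, hcub):
    """Relative saving of RADIX-2^r over Hcub, in per cent of the Hcub figure."""
    return 100.0 * (hcub - radix) / hcub


if __name__ == "__main__":
    # FIR_279_140_24 row of Table 8: (RADIX-2^r, Hcub) pairs
    row = {
        "delay, ns": (19.838, 32.005),
        "power, mW": (11.4926, 15.5008),
        "area, um^2": (217743.12, 223366.68),
    }
    for metric, (radix, hcub) in row.items():
        print(f"{metric:12s} saving = {saving_percent(radix, hcub):.2f} %")
```

A negative result, as in several low-order rows of the table, means that Hcub is the better of the two for that metric.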

Table 9 RADIX-2^r versus Hcub: post-layout implementation results in 65 nm CMOS of the 10,599 × X SCM example

| Delay^a RADIX-2^r, ns | Delay^a Hcub [8], ns | Power^b RADIX-2^r, mW | Power^b Hcub [8], mW | Area^c RADIX-2^r, µm^2 | Area^c Hcub [8], µm^2 |
|---|---|---|---|---|---|
| 6.265 | 6.121 | 0.0129 | 0.0155 | 524.160 | 560.360 |

^a Minimum clock period.
^b Total dynamic power dissipation.
^c Total area.
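The two shift-and-add recodings of 10,599 × X that Table 9 compares can be checked in software. The Python sketch below mirrors the RADIX-2^r and Hcub decompositions given in Section 4 using integer shifts; the function names are illustrative only, and the code models the arithmetic, not the bit-level adder widths of Fig. 3.

```python
def radix_2r_10599(x):
    """RADIX-2^r recoding of 10,599 * x: five additions/subtractions."""
    x0 = (x << 1) + x  # 3X
    x1 = (x << 3) - x  # 7X
    return x1 + (x0 << 5) - (x1 << 8) + (x0 << 12)


def hcub_10599(x):
    """Hcub recoding of 10,599 * x: four additions/subtractions."""
    x0 = (x << 8) + x    # 257X
    x1 = x0 + (x << 3)   # 265X
    x2 = (x1 << 2) + x1  # 1325X
    return (x2 << 3) - x  # 10,600X - X


if __name__ == "__main__":
    # Exhaustive check over the 8 bit 2's complement input range (Xb = 8)
    ok = all(
        radix_2r_10599(x) == hcub_10599(x) == 10599 * x
        for x in range(-128, 128)
    )
    print("both recodings match 10,599 * X:", ok)
```

Both functions compute the same product; what differs in hardware is the width of the adders each intermediate term requires.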

Table 10 Comparison between Hcub and the Hcub version with minimum depth (Hcub+)

| Filter | Hcub depth | Hcub cost | Hcub+ depth | Hcub+ cost | Increase in cost, % |
|---|---|---|---|---|---|
| FIR_279_140_24 [9] | 26 | 158 | 4 | 194 | 22.78 |
| FIR_418_208_22 [9] | 9 | 212 | 4 | 252 | 18.86 |
| FIR_516_256_24 [9] | 9 | 259 | 4 | 301 | 16.21 |
| FIR_631_313_23 [9] | 6 | 315 | 4 | 335 | 6.34 |
| FIR_695_345_24 [9] | 6 | 348 | 4 | 365 | 4.88 |

power and area (Table 11), while they are almost equal in speed. The savings in area are modest because the whole filter architecture (MCM block plus adder structure) is considered. The hardware complexity of the adder structure is significant, as adders and registers grow monotonically along the taps to hold the correct precision of the partial sums [32]. It even exceeds the complexity of the MCM block, especially in symmetrical filters, as is the case for all FIR filters of Tables 8 and 11. However, when only the MCM block is implemented, a saving of more than 10% in area over Hcub and Hcub+ is obtained (Table 12). We deliberately chose FIR_279_140_24 as a typical example because its depth (26 steps) is very close to the average (26.44 steps), as shown in Table 5.

It was statistically proved in [29] that constraining (8) to the minimum depth will more likely increase the cost. Observe that even though the Hcub+ solution employs 36 adder blocks more than Hcub (a rise of ≃23%), the increase in area occupation is not significant (≃3%). The reason is that $\sum_{i=1}^{194} \Delta_i$ in Hcub+ is much smaller than $\sum_{i=1}^{158} \Delta_i$ in Hcub. Therefore, when considering speed, power, and area all together, Hcub+ is more interesting than Hcub except in the runtime.

Very recently, a new graph-based heuristic has been proposed in [33]. It uses the hypergraph concept to enable a new iterative minimum arborescence (IMA) formulation of the MCM problem. Note that the computational complexities of the IMA and Hcub algorithms are $O(M^3 \times N^2 + M^4 \times N)$ and $O(M^4 \times N^5 \times \log(M \times N) + M^3 \times N^6)$, respectively. With a lower runtime, IMA performs slightly better than Hcub in low-/medium-order filters. However, since IMA uses the same bound $A(u \times X, v \times X) \le 2^{N_{max}+1} \times X$ as Hcub (Fig. 6 in [33]), $\sum_{i=1}^{Cost} \Delta_i$ is very large. Therefore, IMA cannot be as area efficient as RADIX-2^r in MCM problems of high complexity.

While the conventional metric of adder depth was proved to be a reliable measure of the critical path, a more accurate delay model has been introduced in [34]. The latter is based on a bit-level propagation of signals for a fine-grained analysis of the critical path of MCM blocks. It shows that the delays of the shift-add

Table 11 RADIX-2^r versus Hcub+: post-layout implementation results in 65 nm CMOS of D-AMPS filters

| Filter | Delay^a RADIX-2^r, ns | Delay^a Hcub+ [8], ns | Power^b RADIX-2^r, mW | Power^b Hcub+ [8], mW | Area^c RADIX-2^r, µm^2 | Area^c Hcub+ [8], µm^2 | Delay saving, % | Power saving, % | Area saving, % |
|---|---|---|---|---|---|---|---|---|---|
| FIR_279_140_24 [9] | 19.838 | 19.870 | 11.4926 | 12.0888 | 217,743.12 | 225,187.56 | 0.16 | 4.93 | 3.30 |
| FIR_418_208_22 [9] | 20.884 | 21.418 | 16.3238 | 18.5316 | 315,236.88 | 323,322.84 | 2.49 | 11.91 | 2.50 |
| FIR_516_256_24 [9] | 19.010 | 19.829 | 21.4947 | 23.3619 | 381,966.12 | 395,147.44 | 4.13 | 7.99 | 3.33 |
| FIR_631_313_23 [9] | 18.747 | 20.075 | 25.6591 | 29.3194 | 460,180.80 | 473,906.28 | 6.61 | 12.48 | 2.89 |
| FIR_695_345_24 [9] | 20.569 | 20.726 | 29.9247 | 33.0765 | 505,337.04 | 524,262.36 | 0.76 | 9.52 | 3.60 |

^a Minimum clock period.
^b Total dynamic power dissipation.
^c Total area.

Table 12 RADIX-2^r versus Hcub and Hcub+: post-layout implementation results in 65 nm CMOS of the FIR_279_140_24 MCM block

| Algorithm | Delay^a, ns | Power^b, mW | Area^c, µm^2 | Delay saving, % | Power saving, % | Area saving, % |
|---|---|---|---|---|---|---|
| Hcub [8] | 32.64 | 9.9868 | 92,669.76 | 47.24 | 41.27 | 10.18 |
| Hcub+ [8] | 17.79 | 7.2880 | 95,387.40 | 3.20 | 19.53 | 12.74 |
| RADIX-2^r | 17.22 | 5.8646 | 83,232.72 | - | - | - |

^a Minimum clock period.
^b Total dynamic power dissipation.
^c Total area.

network estimated at bit level are shorter than the delays at adder level. Similarly, based on the bit-level description of RADIX-2^r, a more precise delay metric (bit depth) of the critical path can be derived.

A more accurate bit-level model of RADIX-2^r is possible by eliminating the subtraction cell FS. The reason is that most standard cell libraries do not include the FS cell [35]. Instead, the operation $A - B$ is implemented in 2's complement as $A + \bar{B} + 1$, requiring three elementary cells: FA, half adder (HA), and inverter (INV). While this is more precise, it considerably complicates the analytic model (10)–(13).

While the whole work presented so far deals with exact computing, we have recently developed a new SCM algorithm also based on RADIX-2^r arithmetic but with a variable radix [36]. The latter is suitable for approximate computing as it allows saving power in error-resilient applications.

6 Conclusion and future work

We have proved that RADIX-2^r is one of the leading MCM heuristics. With a simple recoding and insignificant computational effort, RADIX-2^r achieves the best results in speed, power, and area, especially in MCM blocks of high complexity. This has been clearly demonstrated through the circuit implementation of a number of D-AMPS filters. The proven near optimality in adder depth, along with the area superiority over Hcub, makes RADIX-2^r a powerful heuristic that is very hard to compete with in VLSI applications. We have also shown that optimality in adder depth is always guaranteed in RADIX-2^r by recoding the very few coefficients that cause a problem with r = 2. In general, their number does not exceed 1% of the total number of coefficients, as shown in the case of D-AMPS filters. Besides the unique analytic bounds known so far for MCM at adder level, a new finely grained bound at bit level has been introduced. This is another unprecedented result that opens new research perspectives in MCM.

While the introduced bit-level model applies to a CPA implementation of the MCM block, we are currently exploring the possibility of extending it to a carry-save (CSA) realisation, required in high-speed applications.

7 Acknowledgments

This work was supported by ‘Centre de Développement des Technologies Avancées’, CDTA, Algiers, Algeria.

8 References

[1] Thong, J., Nicolici, N.: ‘An optimal and practical approach to single constant multiplication’, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 2011, 30, (9), pp. 1373–1386
[2] Aksoy, L., da Costa, E., Flores, P.: ‘Exact and approximate algorithms for the optimisation of area and delay in multiple constant multiplication’, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 2008, 27, (6), pp. 1013–1026
[3] Oudjida, A.K., Liacha, A., Bakiri, M., et al.: ‘Multiple constant multiplication algorithm for high speed and low power design’, IEEE Trans. Circuits Syst. II Exp. Brief, 2016, 63, (2), pp. 176–180
[4] Mahesh, R., Vinod, A.P.: ‘A new common subexpression elimination algorithm for realizing low-complexity higher order digital filters’, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 2008, 27, (2), pp. 217–229
[5] Xu, F., Chang, C.H., Jong, C.C.: ‘Contention resolution algorithm for common subexpression elimination in digital filter design’, IEEE Trans. Circuits Syst. II Exp. Brief, 2005, 52, (10), pp. 695–700
[6] Martinez-Peiro, M., Boemo, E.I., Wanhammar, L.: ‘Design of high-speed multiplierless filters using a nonrecursive signed common subexpression algorithm’, IEEE Trans. Circuits Syst. II Analog Digit. Signal Process., 2002, 49, (3), pp. 196–203
[7] Dempster, A.G., Macleod, M.D.: ‘Use of minimum-adder multiplier blocks in FIR digital filters’, IEEE Trans. Circuits Syst. II Analog Digit. Signal Process., 1995, 42, (9), pp. 569–577
[8] Voronenko, Y., Püschel, M.: ‘Multiplierless multiple constant multiplication’, ACM Trans. Algorithms, 2007, 3, (2), pp. 1–38
[9] Chang, C.H., Faust, M.: ‘On “a new common subexpression elimination algorithm for realizing low-complexity higher order digital filters”’, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 2010, 29, (5), pp. 844–848
[10] Gustafsson, O.: ‘Lower bounds for constant multiplication problems’, IEEE Trans. Circuits Syst. II Exp. Brief, 2007, 54, (11), pp. 974–978
[11] Gustafsson, O.: ‘A difference based adder graph heuristic for multiple constant multiplication problems’. Proc. Int. IEEE (ISCAS) Conf. Circuits and Systems, New Orleans, USA, May 2007, pp. 1097–1100
[12] Thong, J., Nicolici, N.: ‘Combined optimal and heuristic approaches for multiple constant multiplication’. Proc. Int. Conf. Computer Design (ICCD), Amsterdam, Netherlands, October 2010, pp. 266–273
[13] Oudjida, A.K., Chaillet, N.: ‘Radix-2^r arithmetic for multiplication by a constant’, IEEE Trans. Circuits Syst. II Exp. Brief, 2014, 61, (5), pp. 349–353
[14] Oudjida, A.K., Chaillet, N., Berrandjia, M.L.: ‘Radix-2^r arithmetic for multiplication by a constant: further results and improvements’, IEEE Trans. Circuits Syst. II Exp. Brief, 2015, 62, (4), pp. 372–376
[15] Nanyang Technological University, Singapore: ‘FIRsuite Suite of constant coefficient FIR filters’, http://www.firsuite.net, November 2009
[16] Faust, M., Gustafsson, O., Chang, C.H.: ‘Reconfigurable multiple constant multiplication using minimum adder depth’. Proc. IEEE ASILOMAR Conf. Signals, Systems, and Computers, CA, USA, November 2010, pp. 1293–1301
[17] Johansson, K., Gustafsson, O., DeBrunner, L.S., et al.: ‘Minimum adder depth multiple constant multiplication algorithm for low power FIR filters’. Proc. Int. IEEE (ISCAS) Conf. Circuits and Systems, Rio de Janeiro, Brazil, May 2011, pp. 1439–1442
[18] Aksoy, L., Günes, E.O., Flores, P.: ‘Search algorithms for the multiple constant multiplications problem: exact and approximate’, Microprocess. Microsyst., 2010, 34, (5), pp. 151–162
[19] Oudjida, A.K., Berrandjia, M.L., Liacha, A.: ‘RADIX-2^r MCM, ver. 2.0.2’. Available at http://www.cdta.dz/products/mcm, June 2016

[20] Avizienis, A.: ‘Signed-digit number representation for fast parallel arithmetic’, IRE Trans. Electron. Comput., 1961, EC-10, (3), pp. 389–400
[21] Aksoy, L., Flores, P., Monteiro, J.: ‘Exact and approximate algorithms for the filter design optimization problem’, IEEE Trans. Signal Process., 2015, 63, (1), pp. 142–154
[22] Aksoy, L., Gunes, E., Flores, P.: ‘An exact breadth-first search algorithm for the multiple constant multiplications problem’. Proc. Int. 26th Edition of IEEE NORCHIP Conf., Tallinn, Estonia, November 2008, pp. 41–44
[23] Samueli, H.: ‘An improved search algorithm for the design of multiplierless FIR filters with powers-of-two coefficients’, IEEE Trans. Circuits Syst., 1989, 36, (7), pp. 1044–1047
[24] Dimitrov, V.S., Eskritt, J., Imbert, L., et al.: ‘The use of the multi-dimensional logarithmic number system in DSP applications’. Proc. Int. (ARITH) Conf. Computer Arithmetic, Vail, CO, USA, June 2001, pp. 247–254
[25] Aksoy, L., Costa, E., Flores, P., et al.: ‘Optimization of area in digital FIR filters using gate-level metrics’. Proc. Int. (DAC) Conf. Design Automation, San Diego, CA, USA, June 2007, pp. 420–423
[26] Johansson, K., Gustafsson, O., Wanhammar, L.: ‘A detailed complexity model for multiple constant multiplication and an algorithm to minimize the complexity’. Proc. IEEE European (ECCTD) Conf. Circuit Theory and Design, Cork, Ireland, September 2005, vol. 3, pp. 465–468
[27] Lou, X., Yu, Y.J., Meher, P.K.: ‘New approach to the reduction of sign-extension overhead for efficient implementation of multiple constant multiplications’, IEEE Trans. Circuits Syst. I: Reg. Papers, 2015, 62, (11), pp. 2695–2705
[28] Huang, R., Chang, C.H., Faust, M., et al.: ‘Sign-extension avoidance and word-length optimization by positive-offset representation for FIR filter design’, IEEE Trans. Circuits Syst. II Exp. Brief, 2011, 58, (12), pp. 916–920
[29] Faust, M., Chang, C.H.: ‘Minimal logic depth adder tree optimization for multiple constant multiplication’. Proc. Int. IEEE (ISCAS) Conf. Circuits and Systems, Paris, France, June 2010, pp. 457–460
[30] Kumm, M., Zipf, P., Faust, M., et al.: ‘Pipelined adder graph optimization for high speed multiple constant multiplication’. Proc. Int. IEEE (ISCAS) Conf. Circuits and Systems, Seoul, South Korea, May 2012, pp. 49–52
[31] Ye, W.B., Yu, Y.J.: ‘Bit-level multiplierless FIR filter optimization incorporating sparse filter technique’, IEEE Trans. Circuits Syst. I Reg. Papers, 2014, 61, (11), pp. 3206–3215
[32] Faust, M., Chang, C.H.: ‘Optimization of structural adders in fixed coefficient transposed direct form FIR filters’. Proc. Int. IEEE (ISCAS) Conf. Circuits and Systems, Taiwan, May 2009, pp. 2185–2188
[33] Feng, F., Chen, J., Chang, C.H.: ‘Hypergraph based minimum arborescence algorithm for the optimization and reoptimization of multiple constant multiplications’, IEEE Trans. Circuits Syst. I, 2016, doi: 10.1109/TCSI.2015.2512742
[34] Lou, X., Yu, Y.J., Meher, P.K.: ‘Fine-grained critical path analysis and optimization for area-time efficient realization of multiple constant multiplications’, IEEE Trans. Circuits Syst. I Reg. Papers, 2014, 62, (3), pp. 863–872
[35] Copyright © UMC Ver. B02: ‘UMC 65 nm Data Book’, December 2011
[36] Liacha, A., Oudjida, A.K., Ferguene, F., et al.: ‘A variable Radix-2^r algorithm for single constant multiplication’. Proc. Int. IEEE (NEWCAS) Conf. New Circuits and Systems, Strasbourg, France, June 2017, pp. 1–4, doi: 10.1109/NEWCAS.2017.8010156
