DeBAM Decoder-Based Approximate Multiplier For Low Power Applications

174 IEEE EMBEDDED SYSTEMS LETTERS, VOL. 13, NO.
4, DECEMBER 2021
DeBAM: Decoder-Based Approximate Multiplier

for Low Power Applications
Suresh Nambi , Member, IEEE, U. Anil Kumar , Kavya Radhakrishnan ,
Mythreye Venkatesan, and Syed Ershad Ahmed
Abstract—Approximate computing is a promising method for number of PP rows generated by designing an approximate
designing power-efficient computing systems. Many image and decoder logic block without utilizing an internal accumulation
compression algorithms are inherently error-tolerant and can stage. Thus, our proposed design occupies lower die-area and
allow errors up to a specific limit. In such algorithms, savings in consumes less power while resulting in fewer erroneous results
power can be achieved by approximating the data path units, such in comparison to the existing approximate multipliers.
as a multiplier. This letter presents a novel decoder logic-based
The remainder of this letter is arranged as follows.
multiplier design with the intent to reduce the partial products
generated. Thus, leading to a reduction in the hardware com- Preliminary information about the existing approximate
plexity and power consumption while maintaining a low error multiplier architectures is provided in Section II. An outline of
rate. Our proposed design in an 8-bit format which achieves the proposed decoder-based approximate multiplier (DeBAM)
40.96% and 22.30% power reduction compared to the accurate architecture is presented in Section III. Comprehensive error
and approximate multipliers. Comprehensive simulations are car- analysis and hardware synthesis of the proposed and exist-
ried out on image sharpening and compression algorithms to ing techniques are provided in Section IV. The proposed and
prove that the proposed design obtains a better quality-effort existing multiplier architectures are evaluated using the image
tradeoff than the existing multipliers. sharpening and compression algorithms to quantify and com-
Index Terms—Approximate computing, approximate pare their performance as explained in Section V. Finally,
multiplier, image compression, image sharpening, low-power. conclusions are drawn in Section VI.
I. I NTRODUCTION
II. P REVIOUS W ORK
HE KEY design objective of any modern-day digital
T system is to remain power-efficient with an improvement
in performance. Approximate computing [1] has emerged as
To reduce the computation complexity, various approxi-
mate multipliers have been proposed based on methods 1) and
a promising technique capitalizing on the inherently error- 2). Based on 1), a broken-array multiplier was suggested by
resilient image and video processing applications to tolerate Jiang et al. [3] by pruning the carry-save adders that were used
some loss of accuracy. The fundamental arithmetic modules at accumulation stage. This method achieved saving in area
in these applications are multipliers; numerous techniques by eliminating either horizontal partial-product rows or verti-
to analyze and evaluate approximate multipliers have been cal partial-product columns. An approximate 4-2 compressor
extensively studied [2], [3]. with an intent to achieve a low error rate was proposed by
Approximate multipliers have been primarily developed using Yang et al. [8]. This method tries to exploit the probabilistic
two methods: 1) approximation at partial product reduction characteristic of AND gate output which is used in the PPG
(PPR) stage and 2) approximation at partial product generation stage at the PPR stage. Ha and Lee [9] modified the compressor
(PPG) stage [3]. In 1), the least significant portion of the partial proposed by Yang and improves the accuracy by adding a simple
product (PP) rows are truncated and the error is compensated error correction circuit in the PPR stage. To achieve low area
using a correction function or larger input approximate com- and low power, Venkatachalam and Ko [12] proposed an effi-
pressors are used to accumulate the PP rows [6]–[8], [9]–[12]. cient compressor. Yi et al. [6] proposed the compressor-based
This method results in a minor reduction of the resource uti- approximate multiplier with an intent to reduce the error rate.
lization because the number of PP rows remains unchanged. On Based on 2), Bhardwaj et al. [7] presented an approximate
the other hand, in 2), the number of PP rows generated could wallace tree multiplier in mode 4 (AWTM-4) with a carry-in
be drastically reduced by employing an approximate circuit. prediction for building recursive blocks. Accurate multiplier
However, existing designs based on 2) aim to improve latency modules are used for MSB computation while in-exact blocks
and reduce error in the MSB section of the output by splitting were used for LSB portion. Approximate inputs were gener-
the PPG into smaller inaccurate sub-blocks [7], utilizing an ated using a fast bit selection scheme in Hashemi et al. [11]
accumulation stage within each sub-block. Unlike some exist- for error resilient applications to achieve low-power.
ing designs based on 2), in this letter, we aim to reduce the
III. D ECODER -BASED A PPROXIMATE M ULTIPLIER
Manuscript received September 24, 2020; accepted December 8, 2020. Date
of publication December 16, 2020; date of current version November 24, 2021. The proposed DeBAM aims to utilize a block configuration
This manuscript was recommended for publication by S. Venkatramani. that encapsulates a unique 2-bit decoder logic circuit, used to
(Suresh Nambi and U. Anil Kumar contributed equally to this work.) simplify the generation of PPs. The novelty of this letter is the
(Corresponding author: U. Anil Kumar.) method employed and the circuit used to generate PPs. The
The authors are with the Department of Electrical and Electronics
Engineering, BITS Pilani, Hyderabad 500078, India (e-mail:
PPs generated by the decoder logic are compressed to two
anilkumaruppugundur@gmail.com). rows using accumulator and final reduction is achieved using
Digital Object Identifier 10.1109/LES.2020.3045165 two-operand adder. For the sake of simplicity, the proposed
1943-0671
c 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: COLLEGE OF ENGINEERING - Pune. Downloaded on December 10,2023 at 16:04:21 UTC from IEEE Xplore. Restrictions apply.
NAMBI et al.: DeBAM FOR LOW POWER APPLICATIONS 175
Fig. 1. (a) Block diagram of DeBAM architecture. (b) DeBAM technique illustrated using a numerical example.
TABLE I
D ECODER L OGIC T RUTH TABLE Fig. 1(b) illustrates the DeBAM with the help of a numerical
example. Based on the input select bits, the three decoder
blocks generate three rows of PP while the remaining two
rows are generated using AND gate logic. These rows are
reduced to the final product using a reduction structure. It can
be observed that the result obtained using DeBAM is almost
equal to an accurate result.
unsigned multiplier is explained using two 8-bit operands and A. Gate-Level Implementation of Decoder Logic Block
the same can be extended to a higher number of bits as illus- This section presents the circuit-level implementation of the
trated later. Consider the multiplication of two numbers A and proposed decoder based logic for an 8 ∗ 8 multiplier as shown
B each of 8-bit. All the bits of A (multiplier) can be mul- in Fig. 2. When ax ax + 1 is “00,” the decoder logic block
tiplied with B (multiplicand) in a conventional manner using generates zero as the output. For the case “01,” the value of
AND gate to generate PPs. However, it will result in 8 PP rows b7 − b0 form the PP while for “10” the resulting PP is (b7 − b0)
with each consisting of 8 bits. Reducing the number of PP rows left-shifted by “1.” It is easy to see that the combinational circuit
will result in simple hardware at the PPR stage. To accomplish for implementing these three cases requires minimal number of
this, we propose DeBAM architecture illustrated in Fig. 1(a), gates as it involves either conditional transfer of input bits or
which uses novel decoder logic blocks to generate approxi- zero. The final case where ax ax + 1 is “11” can be evaluated
mate PP for the lower significant multiplier bits (a0a1, a2a3, as the sum of output PPG for the case ax ax + 1 is 01 and 10.
a4a5) while the most significant a7 & a6 bits generate exact The adder logic, in this case, has been implemented using a
PPs. The multiplier A, that forms the selection input to decoder simple OR gate (solid OR gate in Fig. 2). This choice is the
logic, is divided into three groups (a0a1, a2a3, a4a5), each key element in reducing the hardware and associated delay.
of 2-bit to generate approximate PP while corresponding to
a7 & a6 exact PP are generated using conventional AND gate
logic, thereby limiting the error to the least significant portion. B. DeBAM (N, M) Design for Higher Bit Multiplication
Further, choosing two bits for select logic restricts the possi- The proposed multiplier was extended for higher bit multi-
ble bit combinations of each pair to four unique cases. The plication using two parameters N and M in the parameterized
functionality of the decoder logic block is based on the select Verilog HDL code. N denotes the total number of bits
inputs (ax ax + 1) as illustrated in Table I. Since the select being multiplied and M denotes the total number of accurate
input to the decoder is 2-bits there can be four possible com- blocks, thus the earlier discussed design would be denoted
binations as shown in the table. For example, if the multiplier as DeBAM(8, 2). The total number of decoder logic blocks
bits are “01” then the decoder block passes the multiplicand in the parameterized design would be defined by the value
(B) as output. The only case when approximate PP are gen- (N − M)/2 which is equivalent to the number of PPs rows
erated by respective decoder blocks is when the select inputs reduced. To further optimize our design, we share the output
ax ax + 1 are “11.” To obtain PP for this case the multiplicand of solid OR gates across decoder logic blocks resulting in a
B and left-shifted B (B<<1) has to be added. This addition is hardware reduction. This is possible as the multiplicand input
carried out using OR logic. It can be noted that the output of to the decoder logic block remains constant across all decoder
each decoder logic block is arranged based on the positional logic blocks. The above design extension provides flexibil-
weight of the multiplier bits. The accuracy of the proposed ity to the users to choose a suitable tradeoff between better
multiplier is enhanced by generating the exact PP using MSB accuracy and reduction in die-area based on the applications
bits (a7a6) and multiplicand (B) using traditional AND logic. requirement, for higher bit widths.
Based on the select inputs to the proposed decoder logic blocks
(multiplicand as other input) approximate PP is generated while IV. E XPERIMENTAL R ESULTS AND E RROR M ETRICS
the most significant (a7a6) bits generate exact PPs as shown
in Fig. 1(a). The PPG stage is followed by an accumulation A. Error Analysis of Approximate Multipliers
stage, carried out using 3:2 carry-save adder [4], that reduces The proposed DeBAM was tested by taking 65 536 and
the 5 PP rows into two. The generation of minimal PP leads 1 million (random) cases for 8-bit and 16-bit multiplier,
to minimization in the number of adders in the accumulation respectively, and compared with other existing approximate
stage, leading to a significant reduction in area and power. The multipliers [6]–[9], [11], [12]. The performance metrics, such
two rows obtained after accumulation stage are subsequently as error rate, mean relative error distance (MRED), and nor-
reduced to the final result (p15-p0) using a ripple carry adder. malized MRED (NMED) [5] of all the 8-bit and 16-bit
176 IEEE EMBEDDED SYSTEMS LETTERS, VOL. 13, NO. 4, DECEMBER 2021
TABLE II
E RROR A NALYSIS OF VARIOUS 8-B IT AND 16-B IT A PPROXIMATE M ULTIPLIERS
respectively, less than Yi et al. [6] design. Although DeBAM

(8, 2) has a 13.46% increase in delay than Ha and Lee [9], it
has 45.2% improved accuracy. Finally, the proposed DeBAM
(8, 2) achieves up to 30.13% area and up to 22.3% power
improvements compared to the existing designs and 31.34%
in area and 40.96% in power compared to accurate multiplier.
DeBAM (16, M) consumes less power compared to the
existing designs except Multiplier1, Xilin, and DRUM4.
Similarly, DeBAM (16, M) occupies less area compared to
the existing designs except Multiplier1 and DRUM4. Designs
Fig. 2. Gate-level implementation of decoder logic block. which consume less power and area compared to DeBAM
(16, M) suffer from lower accuracy. In short, it is evident that
multipliers, including DeBAM (N, M) are computed and DeBAM (N, M) has better computation quality-effort tradeoff
detailed in Table II. It can be observed from the table that than all other designs.
the DeBAM shows a lower error rate than all the existing
multipliers. The only inaccuracy introduced in the PPG is V. B ENCH M ARKING A PPLICATIONS
when both the two input bits to the decoder logic are 1 s As discussed in the preceding section, the DeBAM (N, M)
while for all the other cases accurate PPs are generated. This architecture offers significant benefits in terms of accuracy,
lower rate of inaccuracy in the PPG decreases the error rate, area, and power. In this section, DeBAM (8, 2) is referred to
thereby decreasing it to a value of 45.39% for DeBAM (8, 2), as DeBAM and evaluated using image sharpening [10] and
which is much lower than all the other approximate multipliers JPEG image compression (JIC) [14] applications.
taken into consideration. The NMED and MRED of DeBAM The image sharpening algorithm accepts an image (P),
(N, N/4) are lower in comparison to most other multipliers processes it using 5 ∗ 5 kernels to produce a high-quality
as we tradeoff the number of accurate bits in the MSB by image. Since the computation process in image sharpening
incorporating a large number of decoder logic blocks for an involves a lot of multiplication operations, this application is
improvement in resource utilization and latency. However, due suitable to demonstrate the performance of the approximate
to the parameterized nature of our design we can improve multiplier. All the exact multiplication operations are replaced
our NMED and MRED as in the case of DeBAM(16, 6) and with DeBAM while the other operations, such as addition
DeBAM (16, 8) by increasing the number of accurate units. and division, are carried out using accurate modules. The
performance of the multiplier is evaluated using the quality
metrics, mean square error (MSE), and peak signal-to-noise
B. Hardware Synthesis Results ratio (PSNR) [4] as listed in Table IV.
For fair analysis, all the 8-bit and 16-bit multipliers [6]–[9], The PSNR values of the proposed DeBAM corresponding
[11], [12] schemes, including the proposed one, have been mod- Pirate and Peppers images is 50.91 and 50.98 dB, respectively.
eled structurally using Verilog hardware description language The proposed multiplier achieves better PSNR value compared
(HDL). The functional simulation of exact and approximate with all other multipliers due to its low error rate despite low
multiplier designs has been performed using the Cadence Incisive NMED and MRED values as we encounter few inputs that
Unified Simulator v6.1 while hardware synthesis has been car- result in an error and consequently only a handful of cases
ried out at TSMC 180-nm process node (slow-normal library) result in a large deviation from the accurate value as indicated
using Cadence RTL compiler v7.1. Hardware analysis has been by NMED and MRED. The DeBAM architecture achieves bet-
performed to compute metrics, such as area, delay, and power. ter PSNR in addition to savings in area and power compared
For 8-bit, it can be observed from Table III that the area to Ha and Lee [9] design mentioned in this letter. Fig. 3(a)–(i)
occupied and power consumed by DeBAM (8, 2) is 6.4% and shows the Peppers images obtained by using exact, DeBAM,
10%, respectively, less than Ha and Lee [9] design. Similarly, and existing multipliers. It can be observed that the obtained
17.7%, 11.8%, and 9.2% improvement in the area, delay, and images of exact multiplier and DeBAM look almost identical.
power consumption compared to AWTM-4 is achieved. The As another example, we consider the JIC [14] to demon-
area occupied by the DeBAM (8, 2) is 30.13% less compared strate the proposed multiplier advantage. It is well known
to the DRUM4 [11]. Though DRUM4 has lower power con- that JIC is usually used to save transmission bandwidth or
sumption it suffers from a high error rate. The area occupied storage space. Matrix multiplication in JIC makes it a good
and power consumed by DeBAM (8, 2) is 14.9% and 8.3%, application for evaluating approximate multipliers. Table IV
NAMBI et al.: DeBAM FOR LOW POWER APPLICATIONS 177
TABLE III
S YNTHESIS R ESULTS OF VARIOUS 8-B IT AND 16-B IT A PPROXIMATE M ULTIPLIERS
TABLE IV
PSNR OF S HARPENED AND JIC I MAGES P ROCESSED U SING VARIOUS 8-B IT A PPROXIMATE M ULTIPLIERS
VI. C ONCLUSION
Approximate computing offers potential benefits in terms
of area, delay, and power. This letter attempts to leverage the
inherent error-resilient characteristic in the image processing
applications. The proposed DeBAM (N, M) architecture reduces
the number of PPs generated, as a result, also simplifies the PP
accumulation stage. Error and hardware results prove the energy
and area efficacy of the proposed multiplier compared to the
existing designs. Finally, the proposed method achieves a better
computation quality-effort tradeoff and the same is demonstrated
on the image sharpening and compression algorithms.
R EFERENCES
[1] S. Mittal, “A survey of techniques for approximate computing,” ACM
Fig. 3. Output images of exact, DeBAM, and existing architectures after Comput. Surveys, vol. 48, no. 4, p. 62, 2016.
image sharpening. (a) Exact. (b) DeBAM. (c) Multiplier1. (d) Multiplier2. [2] S. Narayanamoorthy, H. A. Moghaddam, Z. Liu, T. Park, and N. S.Kim,
(e) Minho. (f) Yang. (g) Xilin. (h) DRUM4. (i) AWTM-4. “Energy-efficient approximate multiplication for digital signal process-
ing and classification applications,” IEEE Trans. Very Large Scale Integr.
(VLSI) Syst., vol. 23, no.6, pp. 1180–1184, 2014.
[3] H. Jiang, C. Liu, L. Liu, F. Lombardi, and J. Han, “A review, classifi-
cation, and comparative evaluation of approximate arithmetic circuits,”
ACM J. Emerg. Technol. Comput. Syst., vol. 13, no. 4, p. 60, 2017.
[4] B. Parhami, Computer Arithmetic, vol. 20. Oxford, U.K.: Oxford Univ.
Press, 2010.
[5] J. Liang, J. Han, and F. Lombardi, “New metrics for the reliability of
approximate and probabilistic adders,” IEEE Trans. Comput., vol. 62,
no. 9, pp. 1760–1771, Sep. 2013.
[6] X. Yi, H. Pei, Z. Zhang, H. Zhou, and Y. He, “Design of an energy-
efficient approximate compressor for error-resilient multiplications,” in
Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2019, pp. 1–5.
[7] K. Bhardwaj, P. S. Mane, and J. Henkel, “Power-and area-efficient
approximate wallace tree multiplier for error-resilient systems,” in Proc.
15th Int. Symp. Qual. Electron. Design, 2014, pp. 263–269.
Fig. 4. Comparison between PDP and PSNR of various 8-bit approximate [8] Z. Yang, J. Han, and F. Lombardi, “Approximate compressors for error
multipliers. resilient multiplier design,” in Proc. IEEE Int. Symp. DFTS, 2015,
pp. 183–186.
[9] M. Ha and S. Lee, “Multipliers with approximate 4–2 compressors and
error recovery modules,” IEEE Embedded Syst. Lett., vol. 10, no. 1,
presents the performance comparison of the DeBAM with pp. 6–9, Mar. 2018.
[10] M. S. Lau, K.-V. Ling, and Y.-C. Chu, “Energy-aware probabilistic
the existing multipliers with respect to the JIC with a qual- multiplier: Design and analysis,” in Proc. Int. Conf. Compilers Architect.
ity factor (QF) of 70. It is clear from the table that the Synth. Embedded Syst., 2009, pp. 281–290.
PSNR of the proposed multiplier is better than the existing [11] S. Hashemi, R. I. Bahar, and S. Reda, “DRUM: A dynamic range unbi-
ased multiplier for approximate applications,” in Proc. IEEE/ACM Int.
multipliers [6]–[9], [11], [12]. Conf. Comput. Aided Design (ICCAD), 2015, pp. 418–425.
Fig. 4 shows the comparison of power-delay-product [12] S. Venkatachalam and S.-B. Ko, “Design of power and area efficient
approximate multipliers,” IEEE Trans. Very Large Scale Integr. (VLSI)
(PDP) and PSNR results of 8×8 proposed and approximate Syst., vol. 25, no. 5, pp. 1782–1786, May 2017.
multipliers. From Fig. 4, DeBAM has higher PSNR value [13] J. Yang, G. Zhu, and Y.-Q. Shi, “Analyzing the effect of JPEG compres-
compared to multipliers that have similar PDP. Thus, it can sion on local variance of image intensity,” IEEE Trans. Image Process,
vol. 25, no. 6, pp. 2647–2656, Jun. 2016.
be concluded that the DeBAM multiplier has an advantage in [14] N. Rathore, “JPEG image compression,” Int. J. Eng. Res. Appl., vol. 4,
a wide range of image processing applications. no. 3, pp. 435–440, 2014.

DeBAM Decoder-Based Approximate Multiplier For Low Power Applications

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

DeBAM Decoder-Based Approximate Multiplier For Low Power Applications

Uploaded by

Copyright:

Available Formats

174 IEEE EMBEDDED SYSTEMS LETTERS, VOL. 13, NO.

DeBAM: Decoder-Based Approximate Multiplier

respectively, less than Yi et al. [6] design. Although DeBAM

You might also like