Professional Documents
Culture Documents
TABLE I
C OMPARISON R ESULTS IN T ERMS OF PSNR AND MED
with results being essentially the same as those obtained for unsigned
numbers.
The above bit truncation scheme can be easily implemented at
y are equal to A [k−1:0] − m and B[k−1:0] − m, respectively, being circuit level in an n-bit RCA as shown in Fig. 2. No special cells
A [k−1:0] (B[k−1:0] ) the numerical value chosen to truncate the k are needed, as conventional full adders are used for all bit positions.
LSBs of the operand A (B). The accurate portion of A and B (i.e., the OR and AND gates are employed as input buffers driving the full
first n−k bits) can assume 2h values, i.e., 0, 2k , 2k ·2, . . . , 2k ·(2h −1). adders that receive kmax LSBs of the operands, being kmax the
Considering that 22n distinct additions can be performed on A and B, maximum number of LSBs that can be truncated. The control signals
the sum of the squared errors (sse) that is introduced in the output Ctrkmax−1 . . . Ctr0 allow dynamic adaptation of k (with k < kmax ),
sum by the truncation in the k LSBS of A and B is and hence of the energy-quality tradeoff at run-time. If signals Ctri
22n −1 (with 0 ≤ i < k) are set to 1, the inputs TAi and TBi of the ith full
sse = |S j − Ŝ j |2 . (1) adder are, respectively, 0 and 1, thus making its carry output equal
to Ci = 0. Once the truncation is applied, switching activity and
j =0
dynamic energy are suppressed in all k full adders associated with
Since each possible output error value across all possible input the LSBs, regardless of the constant value used in the k LSBs.
combinations occurs 22·h times,1 and assuming that the k-bit LSB In general, the above results need to be modified when consider-
subwords in the operands are uniformly distributed, (1) can be ing different operations, such as subtraction. Indeed, the difference
rewritten as A − B is computed by first evaluating the 2s complement of the
⎛ ⎞
y+m
x+m subtrahend B, and then by adding it to the minuend A. For the
sse = 22h × ⎝ (i + j )2 ⎠ . (2) subtraction operation, the output error E diff is the difference between
i=x j =y the errors E A and E B due to the truncation of A and B, respectively.
Consequently, E diff can be reduced by ensuring that E A and E B have
By definition, the PSNR is given by (3), where MAX is the maximum
the same sign, instead of being opposite as in the addition operation.
value of the sum [i.e., 2 × (2n − 1) for n-bit unsigned operands]
More precisely, the quality of the output of an approximate subtractor
MAX is maximized by truncating k LSBs of the two operands by setting
PSNR = 20 log . (3)
sse A i = Bi (with 0 ≤ i < k), by simple extension of the above results.
22n
It is worth noting that the traditional zeroing bit truncation scheme
By equating the derivative of (2) and (3) with respect to x and automatically satisfies such a criterion, as it sets A i = Bi = 0 for
y to zero, the minimum value of sse and the maximum PSNR 0 ≤ i < k. Hence, the bit truncation strategy in [15] is preferable
are readily found to occur when x + y = −m. All (x, y) pairs when performing subtractions, not additions. When multiplication is
satisfying the latter condition lead to the minimum error and hence considered, the energy of an approximate multiplier was found to
best quality. From the definition of x and y, it follows that the strongly depend on the constant values chosen to truncate the k LSBs
optimum condition is equivalent to: A [k−1:0] = m − B[k−1:0] . Since of the inputs. In other words, two different truncation schemes can
the binary representation of m is a k-bit sequence of 1s, the optimum lead to the same accuracy but to different energy consumptions, which
condition translates into the requirement that the ith LSBs of A makes the analysis much more complex. This will be analyzed in the
and B(with i < k) are statically set to complementary values, i.e., future work.
A i = B̄i .
As an example, the choice in Fig. 1(b) has x = −m (i.e., k LSBs
in A set to 0...0) and y = 0 (i.e., k LSBs in B set to 1...1), and hence III. E NERGY-Q UALITY T RADEOFF AND S IMULATION R ESULTS
satisfies the above condition, thus maximizing quality for a given k
(i.e., energy). The above bit truncation approach for addition was applied to a
On the contrary, the traditional bit truncation strategy in [15] 16-bit approximate RCA with k ranging from 0 to 8. The adder
leads to x = y = −m, which is clearly non-optimal in terms of structured as in Fig. 2 was synthesized with Cadence RTL Compiler,
quality. Detailed results are reported in Table I, which shows that the using a commercial 28-nm FDSOI standard cell library with regular-
quality improvement achieved by the proposed approach over [15] V th transistors. Several adder configurations were examined, each
ranges between 6.7 and 8.5 dB. A comparison in terms of the MED of which was excited with 10 000 consecutive random additions.
quality metric [10] is also reported in Table I. The proposed approach Cadence UltraSim circuit simulations were run to derive the average
reduces MED by 60%–67% over [15] at same k, thus confirming its energy consumption per addition. During simulations, the number of
effectiveness. Similar analysis can be repeated for signed operands, truncated LSBs was dynamically configured by properly setting k
and the Ctri signals. The energy contributions of the RCA and its
1 Indeed, the number of possible pairs of truncated input operands (A, B) input buffers were split by using two separate 1-V supply voltages.
having the same error (err A , err B ) is 22·h . The 2-GHz operating frequency was targeted, and the adder outputs
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
TABLE II
C OMPARISON R ESULTS
conventional zeroing bit truncation schemes, the proposed approach [8] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, “Low-power dig-
sets the inputs associated with the truncated digits equal to proper ital signal processing using approximate adders,” IEEE Trans. Comput.-
Aided Design Integr. Circuits Syst., vol. 32, no. 1, pp. 124–137,
nonzero constant values that minimize the error on the output. The
Jan. 2013.
proposed technique is able to retain the dynamic energy-quality con- [9] H. R. Mahdiani, A. Ahmadi, S. M. Fakhraie, and C. Lucas, “Bio-inspired
figurability of traditional reconfigurable approximate adders, while imprecise computational blocks for efficient VLSI implementation of
achieving more favorable energy-quality tradeoff. The proposed soft-computing applications,” IEEE Trans. Circuits Syst. I, Reg. Papers,
scheme increases the output quality by up to 8.5 dB in terms of vol. 57, no. 4, pp. 850–862, Apr. 2010.
[10] H. A. F. Almurib, T. N. Kumar, and F. Lombardi, “Inexact designs for
PSNR, and 67% in terms of MED, at no energy penalty, compared approximate low power addition by cell replacement,” in Proc. IEEE
to traditional zeroing bit truncation schemes. DATE, Mar. 2016, pp. 660–665.
When voltage scaling is applied, a 64% energy reduction has [11] J. Miao, K. He, A. Gerstlauer, and M. Orshansky, “Modeling and
been observed compared to the approximate adder in [18], which synthesis of quality-energy optimal approximate adders,” in Proc.
IEEE/ACM ICCAD, Nov. 2012, pp. 728–735.
specifically aims to extract energy savings from voltage scaling. More [12] F. Frustaci, D. Blaauw, D. Sylvester, and M. Alioto, “Approximate
substantial benefits have been shown compared to existing approx- SRAMs with dynamic energy-quality management,” IEEE Trans. Very
imate designs that do not allow dynamic energy-quality tradeoff. Large Scale Integr. (VLSI) Syst., vol. 24, no. 6, pp. 2128–2141, Jun. 2016.
Finally, the benefits of the proposed scheme have been quantified [13] S. Hashemi, R. I. Bahar, and S. Reda, “DRUM: A dynamic
in a DCT engine, and confirmed to be essentially the same as an range unbiased multiplier for approximate applications,” in Proc.
IEEE/ACM ICCAD, Nov. 2015, pp. 418–425.
individual adder. [14] M. Imani, D. Peroni, and T. Rosing, “CFPU: Configurable floating point
multiplier for energy-efficient computing,” in Proc. IEEE/ACM DAC,
R EFERENCES Jun. 2017, pp. 1–6.
[15] J. Park, J. H. Choi, and K. Roy, “Dynamic bit-width adaptation in DCT:
[1] J. Han and M. Orshansky, “Approximate computing: An emerg- An approach to trade off image quality and computation energy,” IEEE
ing paradigm for energy-efficient design,” in Proc. ETS, May 2013, Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 5, pp. 787–793,
pp. 1–6. May 2010.
[2] M. Alioto, Ed., Enabling the Internet of Things: From Integrated Circuits [16] A. Dalloo, A. Najafi, and A. Garcia-Ortiz, “Systematic design of an
to Integrated Systems. Cham, Switzerland: Springer, 2017. approximate adder: The optimized lower part constant-OR Adder,”
[3] M. Alioto, “Energy-quality scalable adaptive VLSI circuits and sys- IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 8,
tems beyond approximate computing,” in Proc. IEEE DATE, Lausanne, pp. 1595–1599, Aug. 2018.
Switzerland, Mar. 2017, pp. 127–132. [17] A. B. Kahng and S. Kang, “Accuracy-configurable adder for approximate
[4] M. Alioto, V. De, and A. Marongiu, “Guest editorial energy-quality scal- arithmetic designs,” in Proc. DAC, Jul. 2012, pp. 820–825.
able circuits and systems for sensing and computing: From approximate [18] R. Ye, T. Wang, F. Yuan, R. Kumar, and Q. Xu, “On reconfiguration-
to communication-inspired and learning-based,” IEEE J. Trans. Emerg. oriented approximate adder design and its application,” in Proc.
Sel. Topics Circuits Syst., vol. 8, no. 3, pp. 361–368, Sep. 2018. IEEE/ACM ICCAD, Nov. 2013, pp. 48–54.
[5] D. Shin and S. K. Gupta, “Approximate logic synthesis for error tolerant [19] D. Gong, Y. He, and Z. Cao, “New cost-effective VLSI implementation
applications,” in Proc. IEEE DATE, Mar. 2010, pp. 957–960. of a 2-D discrete cosine transform and its inverse,” IEEE Trans. Circuits
[6] F. Frustaci, M. Khayatzadeh, D. Blaauw, D. Sylvester, and M. Alioto, Syst. Video Technol., vol. 14, no. 4, pp. 405–415, Apr. 2004.
“SRAM for error-tolerant applications with dynamic energy-quality [20] Public-Domain Test Images. [Online]. Available: http://homepages.cae.
management in 28 nm CMOS,” IEEE J. Solid-State Circuits, vol. 50, wisc.edu/~ece533/images
no. 5, pp. 1310–1323, May 2015. [21] Public-Domain Video Database for Testing Change Detection Algo-
[7] N. Zhu, W. L. Goh, W. Zhang, K. S. Yeo, and Z. H. Kong, “Design of rithms. [Online]. Available: http://www.changedetection.net/
low-power high-speed truncation-error-tolerant adder and its application [22] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image
in digital signal processing,” IEEE Trans. Very Large Scale Integr. (VLSI) quality assessment: From error visibility to structural similarity,” IEEE
Syst., vol. 18, no. 8, pp. 1225–1229, Aug. 2010. Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.