
EE650P Endsem Presentation

Approximate Reverse Carry Propagate Adder for Energy-Efficient DSP Applications
Himanshu Tiwari(T21137)
Supervisor: Prof. Srinivasu Bodapati
Assistant Professor
School of Computing and Electrical Engineering

Abstract—Adders are a fundamental component of many image processing and DSP applications. In this work we use reverse carry propagate full adders (RCPFAs), which resolve timing violations through reverse carry propagation. The results indicate that employing the proposed RCPAs in hybrid adders may provide, on average, 27%, 6%, and 31% improvements in delay, energy, and energy-delay product while providing higher levels of accuracy. At the end of the project we need to implement matrix multiplication using a systolic array for image processing applications.
Multipliers are extensively used in DSP applications. Approximations in the multiplier can be used in DSP blocks that tolerate some error. In this report I propose a 4-2 compressor built from the RCPFA adders given in the paper.
Further, using the compressor in an 8-bit multiplier, we obtained a saving of 3.70% in LUT area, a 7.14% reduction in power, and a 7.92% reduction in delay, with a small compromise in accuracy.
Index Terms—RCPFA, DSP, approximate computing, 4-2 compressor, image processing.

I. INTRODUCTION

Sacrificing computing accuracy is one way to improve calculation speed and power. This approximate computing method can be used in applications where certain errors are acceptable. Adder blocks, which make up the majority of the arithmetic units in DSP systems, require a significant amount of energy and frequently generate hot spots on the chip. This motivates the approximate computing approach. The first approach to approximate computing is based on a hybrid structure with two different parts: exact adders are used for the more significant bits, and approximate adders are used for the less significant bits.

In this end-semester thesis report, an approximate 4-2 compressor is proposed from the RCPFA adder and embedded in an 8-bit multiplier. The multiplier is a basic building block in digital circuits and in computer arithmetic, and approximate arithmetic blocks are used to improve the performance of these circuits. Furthermore, 4-2 compressors are commonly used in parallel multipliers to speed up the compression of partial products. The number of outputs of the approximate 4-2 compressor is decreased to one in this project, resulting in even greater energy efficiency. However, this energy efficiency comes with an accuracy trade-off. The design uses approximate arithmetic blocks to improve the performance of the circuit.

Fig. 1. 4-2 compressor (inputs A1 to A4 and CIN feed two cascaded full adders; outputs are CARRY, SUM, and COUT).

Multiplication in digital systems consists of two parts:
1. Partial product generation (PPG)
2. Partial product accumulation (PPA)
For multipliers of small bit width, PPG is done with simple AND gates, so approximation cannot be applied there. PPA, however, needs arrays of full adders, and this is where approximation can be applied. This report uses a fixed-width multiplier, whose output has the same width as its inputs; this is achieved by truncation. The truncation used here is actual truncation, where the LSBs (the last 8 bits) are discarded after the partial product accumulation. The truncated LSBs have little effect on the accuracy of the final output, so the resources allocated to the LSB side of the partial product accumulation, i.e. the compressors and adders that produce the LSBs, are approximate.

II. LITERATURE REVIEW

Why are approximate adders needed in DSP applications?
• Among the exact adders, the RCA has the lowest power and area, but it cannot solve the delay problem and has a larger delay.
• Approximate adders improve delay, power, and area at the cost of some accuracy.
• When these adders are used to implement the DSP blocks,

The equation below is the full adder equation, in which all three inputs have the same weight and, among the outputs, the carry output has twice the weight of the sum.

$$2C_{i+1} + S_i = A_i + B_i + C_i \qquad (1)$$

If we now move $C_i$ to the left side and $C_{i+1}$ to the right side, we get the equation below, which shows that the carry input now has twice the weight it has in the normal full adder equation. Moreover, the direction of the carry also changes: the carry propagates from the MSB to the LSB. Because of this change in the direction of carry propagation, the error caused by timing violations is reduced. The equation for the RCPFA finally becomes:

$$S_i - C_i = A_i + B_i - 2C_{i+1} \qquad (2)$$

Why do we need a forecast signal in the proposed design?
• The output of the proposed RCPFA is in error when the right side of (2) becomes -2 or 2, because this would require $S_i - C_i$ to be -2 or 2, while the weights of the output signals only allow the values -1, 0, and 1.
• When the right side of (2) becomes 0, two pairs are possible for $(S_i, C_i)$: (0,0) and (1,1).
• One way to select between these two solutions is to use an auxiliary signal ($F_i$) generated from the inputs of the (i-1)th bit position.

In addition to the intrinsic error of the RCPA, an incomplete carry propagation causes some error, as in the conventional RCA. As mentioned before, the advantage of the RCPA is that the value of the error is in the direction of decrease.

In [2], a novel error-tolerant adder (ETA) achieves tremendous improvements in both power consumption and speed performance. The error-tolerant adder trades a certain amount of accuracy for significant power saving. These adders are used in communication systems. In the ETA, carry propagation is curtailed, so a great improvement in speed and power can be achieved.

III. IMPLEMENTATION OF RCPFA ADDER

In the thesis work, adders were implemented using the three different approximate approaches given in the paper, and the proposed 4-2 compressor was implemented as well. The adders are 8-bit and 16-bit adders in which the LSB half is built from approximate adders (RCPFA 1, RCPFA 2, RCPFA 3) and the MSB half from exact FAs. Below are the Xilinx Vivado results for the number of LUTs used to implement each design and the dynamic power of each design:

TABLE I
POWER REPORT FOR THE IMPLEMENTED DESIGN

           Power (mW)   LUTs
RCPFA 1    0.006        6
RCPFA 2    0.004        6
RCPFA 3    0.004        4

IV. INTERNAL STRUCTURE OF RCPFA

Fig. 2. Four-bit RCPFA adder using complete approximation (four RCPFA cells with inputs A3 B3 to A0 B0, forecast signals F4 to F1, carries C4 to C0, and sums S3 to S0; the critical path is marked).

The Boolean relations between the inputs for obtaining the sum and carry are:

$$S_i = \bar{C}_{i+1}F_i + \bar{C}_{i+1}A_i + \bar{C}_{i+1}B_i + A_iB_iF_i$$
$$C_i = C_{i+1}F_i + C_{i+1}\bar{A}_i + C_{i+1}\bar{B}_i + \bar{A}_i\bar{B}_iF_i$$

Simplifying the above equations as follows results in an improved gate-level structure for implementing the RCPFA:

$$S_i = F_i\,(\bar{C}_{i+1} + A_iB_i) + \bar{C}_{i+1}(A_i + B_i) = F_i\bar{X}_i + Y_i$$
$$C_i = F_i\,(C_{i+1} + \bar{A}_i\bar{B}_i) + C_{i+1}(\bar{A}_i + \bar{B}_i) = F_i\bar{Y}_i + X_i$$

where $X_i = C_{i+1}(\bar{A}_i + \bar{B}_i)$ and $Y_i = \bar{C}_{i+1}(A_i + B_i)$, with $(A_i + B_i)$ denoting $A_i$ OR $B_i$.

Fig. 3. Circuit of RCPFA 1 (gate-level schematic built from OAI21 and AOI21 gates with inputs A_i, B_i, C_{i+1}, F_i and outputs S_i, C_i, F_{i+1}).

RCPFA 1 has the largest number of gates; the gate count reduces in RCPFA 2 and RCPFA 3 respectively. RCPFA 1 uses both OAI21 (an OR-AND-invert gate with a 2-input OR stage and one direct input) and AOI21 (an AND-OR-invert gate with a 2-input AND stage and one direct input) gates, whereas the RCPFA 2 and RCPFA 3 configurations are obtained by removing the AOI21 and OAI21 parts respectively.

Bit truncation is a conventional approximate computing method; with this approach we have to select the effective number of bits (ENOB) for the DSP application.
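The behaviour of a single RCPFA cell can be checked by brute force against relation (2). The sketch below is illustrative only: the complement (overbar) placement in the Boolean equations is reconstructed from relation (2), and it assumes that in the tie case both outputs follow the forecast signal, so `rcpfa` and `error_cases` are hypothetical helper names rather than part of the reported design.

```python
from itertools import product

def rcpfa(a, b, c_next, f):
    """One RCPFA cell; complement placement is an assumption (see lead-in)."""
    nc, na, nb = 1 - c_next, 1 - a, 1 - b
    s = (nc & f) | (nc & a) | (nc & b) | (a & b & f)
    c = (c_next & f) | (c_next & na) | (c_next & nb) | (na & nb & f)
    return s, c

def error_cases():
    """List the inputs where S - C differs from A + B - 2*C_next."""
    bad = []
    for a, b, c_next, f in product((0, 1), repeat=4):
        s, c = rcpfa(a, b, c_next, f)
        target = a + b - 2 * c_next
        if s - c != target:
            bad.append(((a, b, c_next, f), target, s - c))
    return bad

for inputs, target, got in error_cases():
    print("A,B,C_next,F =", inputs, "target", target, "output", got)
```

Under these assumptions the only mismatches are the inputs whose target value is -2 or 2, which the cell replaces with -1 or 1.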
To determine the effective number of bits (ENOB) of the approximate adders, two simulations are needed:
1. First, simulate 16-bit adders whose inexact part is 3 to 12 bits wide and built from RCPFA adders.
2. Second, simulate 16-bit adders with the 3 to 12 LSBs truncated.
We find that the mean relative error distance (MRED) of the proposed approximate adders (RCPAs) with an 8-bit inexact part (MRED = 0.005) is equal to that of 4-bit truncation, showing that four of the eight inexact bits remain effective, whereas other approximate adders have an ENOB of only 3.5 bits. The RCPFA adders are therefore advantageous, as they provide more effective bits.

Fig. 4. Circuit of RCPFA 2 (gate-level schematic with one OAI21 gate; inputs A_i, B_i, C_{i+1}, F_i; outputs S_i, C_i, F_{i+1}).

Design parameters of the RCPFAs and of the hybrid adders based on them: the carry is predicted using the forecast signal ($F_k$) at the junction of the approximate part and the exact part. $F_k$ serves as the carry input of the exact part (which is used for the MSB portion), and the same $F_k$ is the $C_{i+1}$ input of the first approximate FA seen from the left. The critical path therefore starts from the joining point. When the approximate part is small, the delay path of the exact part dominates and hence determines the critical path.

Fig. 5. Circuit of RCPFA 3 (gate-level schematic with one AOI21 gate; inputs A_i, B_i, C_{i+1}, F_i; outputs S_i, C_i, F_{i+1}).

V. 4-2 COMPRESSOR

What do we mean by a 4-2 compressor?
A full adder converts 3 input bits into 2 output bits, so we can say that the adder compresses 3 inputs into 2 outputs. In the same way, I have proposed a compressor comprising two full adders in which 4 inputs are converted into 1 output. The compressor gives the best results because it has less carry propagation than full adders. Compressors are used in multipliers.

Exact 4-2 compressor

$$\text{sum} = y_1 \oplus y_2 \oplus y_3 \oplus y_4 \oplus c_{in}$$
$$c_{out} = (y_1 \oplus y_2)\,y_3 + \overline{(y_1 \oplus y_2)}\,y_1$$
$$\text{carry} = (y_1 \oplus y_2 \oplus y_3 \oplus y_4)\,c_{in} + \overline{(y_1 \oplus y_2 \oplus y_3 \oplus y_4)}\,y_4$$

Above are the equations of the exact 4-2 compressor, whose outputs are sum, cout, and carry. Carry and cout have the same weight, which is twice the weight of the sum output.

Approximate 4-2 compressor
Of the three outputs, cout and carry move to the next bit position, so they are more important than sum. Each compressor, which occupies one binary bit position, generates an ED; the combined ED of a compressor chain is a multi-bit binary number.
Error distance (ED) is the difference between the approximate result and the exact result, whereas error rate (ER) is the ratio of the number of erroneous results to the total number of results. Every approximate compressor in a multiplier generates its own ED. Several approximate compressors connected back to back in cascade are called a compressor chain.

$$ED = a * b - a \times b$$
$$ER = \frac{1}{2^{2N}} \sum_{t=1}^{2^{2N}} D_t, \qquad D_t = \begin{cases} 0, & \text{if } a * b = a \times b \\ 1, & \text{if } a * b \neq a \times b \end{cases}$$

Here $a * b$ is the approximate product and $a \times b$ the exact product of two N-bit operands.

VI. PROPOSED DESIGN OF 4-2 COMPRESSOR

In high-speed parallel multipliers, 4-2 compressors are employed to speed up the compression of the partial products. The conventional way to implement a 4-2 compressor is to cascade two full adders. In this section, I approximate the 4-2 compressor to obtain less area, lower power consumption, and less delay. In the approximation, the number of outputs is reduced to 1, compared with the three outputs of the accurate 4-2 compressor.

Approximate 4-2 compressor using RCPFA
This compressor uses two RCPFA 3 adders along with one XOR gate. It has 4 main inputs, namely A, B, C, and D. There are also secondary inputs and outputs: the reverse carry input and output and the forecast signal input and output of each RCPFA adder. The XOR gate combines the internal sum and the C input into one input bit of the second RCPFA. The inputs given to the compressors are the generated partial products.
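The error distance (ED) and error rate (ER) defined in Section V can be evaluated exhaustively for small operand widths. The sketch below is only an illustration of the two metrics: `approx_mul` is a hypothetical stand-in that clears the low bits of the product, not the compressor-based multiplier proposed in this report.

```python
def approx_mul(a, b, drop_bits=2):
    """Hypothetical approximate multiplier: clears the low bits of a*b."""
    return ((a * b) >> drop_bits) << drop_bits

def error_metrics(n_bits, mul=approx_mul):
    """Return (ER, most negative ED) over all pairs of n_bits-wide operands."""
    total = 1 << (2 * n_bits)          # 2^(2N) operand pairs
    wrong = 0
    worst = 0
    for a in range(1 << n_bits):
        for b in range(1 << n_bits):
            ed = mul(a, b) - a * b     # ED = approximate - exact
            if ed != 0:
                wrong += 1             # D_t = 1 for this pair
            worst = min(worst, ed)
    return wrong / total, worst

er, worst_ed = error_metrics(4)
print("ER =", er, "worst ED =", worst_ed)
```

Passing a bit-accurate model of the proposed multiplier as `mul` would give the ER and ED of the actual design.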
The probability of getting 1 as a partial product is 25%.

Fig. 6. Proposed design of approximate 4-2 compressor (two RCPFA-3 cells and one XOR gate; main inputs A, B, C, D; forecast signals Fin1/Fout1 and Fin2/Fout2; reverse carries Cin1/Cout1 and Cin2/Cout2; output Sum).

Fig. 7 shows the partial product accumulation for the 8-bit multiplier. It uses the exact 4-2 compressor, the proposed approximate 4-2 compressor, full adders, half adders, and 2-input OR gates. The first eight bits (MSBs) are calculated exactly using the exact compressor, while the last eight bits are computed with approximate compression. The approximation is performed by a compressor chain of four compressors in the approximate compression part.

Fig. 7. Partial product accumulation of the proposed approximate multiplier using RCPFA adder (legend: exact 4-2 compressor, approximate 4-2 compressor, full adder, half adder, 2-input OR gate; the exact compression part and the approximate compression part with its compressor chain are marked).

In the second stage of compression, we again use the exact compressor and the adders to obtain the sixteen-bit output of the multiplier.

TABLE II
TRUTH TABLE OF EXACT 4-2 COMPRESSOR

Cin  X4  X3  X2  X1    Cout  Carry  Sum
0    0   0   0   0     0     0      0
0    0   0   0   1     0     0      1
0    0   0   1   0     0     0      1
0    0   0   1   1     1     0      0
0    0   1   0   0     0     0      1
0    0   1   0   1     1     0      0
0    0   1   1   0     1     0      0
0    0   1   1   1     1     0      1
0    1   0   0   0     0     0      1
0    1   0   0   1     0     1      0
0    1   0   1   0     0     1      0
0    1   0   1   1     1     0      1
0    1   1   0   0     0     1      0
0    1   1   0   1     1     0      1
0    1   1   1   0     1     0      1
0    1   1   1   1     1     1      0
1    0   0   0   0     0     0      1
1    0   0   0   1     0     1      0
1    0   0   1   0     0     1      0
1    0   0   1   1     1     0      1
1    0   1   0   0     0     1      0
1    0   1   0   1     1     0      1
1    0   1   1   0     1     0      1
1    0   1   1   1     1     1      0
1    1   0   0   0     0     1      0
1    1   0   0   1     0     1      1
1    1   0   1   0     0     1      1
1    1   0   1   1     1     1      0
1    1   1   0   0     0     1      1
1    1   1   0   1     1     1      0
1    1   1   1   0     1     1      0
1    1   1   1   1     1     1      1

Table II is the truth table of the exact compressor, which has 5 inputs and 3 outputs. If all 5 inputs are 1, the output can represent this by setting all three output bits to 1. So the exact compressor has no internal error caused by a shortage of output ports. An error occurs, however, when we reduce the number of outputs to 1.

Fig. 8. Basic block of approximate 4-2 compressor (inputs A, B; forecast signals Fin1/Fout1 and Fin2/Fout2; reverse carries Cin1/Cout1 and Cin2/Cout2; output Sum).
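The exact 4-2 compressor equations of Section V can be checked against Table II by enumerating all 32 input combinations. The following is a minimal sketch; the function names are illustrative, and the inputs y1 to y4 of the equations are taken to be X1 to X4 of the table.

```python
from itertools import product

def exact_compressor(x1, x2, x3, x4, cin):
    """Exact 4-2 compressor; returns (cout, carry, sum)."""
    p = x1 ^ x2
    cout = (p & x3) | ((1 - p) & x1)       # carry of the first full adder
    q = p ^ x3 ^ x4
    carry = (q & cin) | ((1 - q) & x4)     # carry of the second full adder
    s = q ^ cin
    return cout, carry, s

def truth_table():
    """Rows in the column order of Table II: Cin X4 X3 X2 X1."""
    rows = []
    for cin, x4, x3, x2, x1 in product((0, 1), repeat=5):
        rows.append(((cin, x4, x3, x2, x1),
                     exact_compressor(x1, x2, x3, x4, cin)))
    return rows

for inputs, (cout, carry, s) in truth_table():
    # cout and carry both carry twice the weight of sum
    assert sum(inputs) == s + 2 * (cout + carry)
    print(*inputs, "|", cout, carry, s)
```

Every row satisfies sum + 2(cout + carry) = number of 1s at the inputs, which is why three outputs are enough to represent all five inputs without error.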
Fig. 9. (a) Image 1, (b) Image 2.

TABLE III
TRUTH TABLE OF APPROXIMATE 4-2 COMPRESSOR

A  B  C  D    Sum   ED   Prob
0  0  0  0    0     0    81/256
0  0  0  1    1     0    27/256
0  0  1  0    0     -1   27/256
0  0  1  1    0     -2   9/256
0  1  0  0    0     -1   27/256
0  1  0  1    0     -2   9/256
0  1  1  0    0     -2   9/256
0  1  1  1    1     -2   3/256
1  0  0  0    0     -1   27/256
1  0  0  1    0     -2   9/256
1  0  1  0    0     -2   9/256
1  0  1  1    0     -3   3/256
1  1  0  0    0     -2   9/256
1  1  0  1    0     -3   3/256
1  1  1  0    0     -3   3/256
1  1  1  1    1     -3   1/256
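The Prob column of Table III follows from treating the four inputs as independent partial-product bits that are each 1 with probability 1/4. The sketch below recomputes the column under that independence assumption; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

P_ONE = Fraction(1, 4)   # probability that one partial-product bit is 1

def input_probability(bits):
    """Probability of one (A, B, C, D) pattern, inputs assumed independent."""
    p = Fraction(1)
    for bit in bits:
        p *= P_ONE if bit else 1 - P_ONE
    return p

def probability_by_ones():
    """Total probability of all patterns that contain k ones, for k = 0..4."""
    totals = {}
    for bits in product((0, 1), repeat=4):
        k = sum(bits)
        totals[k] = totals.get(k, 0) + input_probability(bits)
    return totals

totals = probability_by_ones()
print(totals)
print("at most two 1s:", totals[0] + totals[1] + totals[2])
```

Patterns with at most two 1s account for 243/256 of all cases, so the rows with the largest error distance are also the rarest.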
Table III is the truth table for the proposed design. It contains some error for the Sum and Cout. The error distance is the distance between the accurate and the approximate result of the compressor.
From Table III, the input sets containing four, three, or two 0s together occur with probability 243/256, so these cases make up the large majority of all inputs. In these cases the ED is 0, -1, or -2. When the input contains three or four 1s, however, the ED is -2 or -3, which cannot be compensated. For these input sets the output deviates strongly, which is a consequence of the approximation.

Below is the formula for calculating the PSNR value. The PSNR of the exactly multiplied image is ∞, since the MSE tends to zero for exact compression. For approximate compression the MSE is finite, so we get a finite PSNR.
In the formula, R (written $MAX_I$) is the signal power and MSE is the noise power. The image is considered as the signal. An 8-bit multiplier has a maximum intensity of 255, so we take this as the R value.

$$PSNR = 10 \cdot \log_{10}\left(\frac{MAX_I^2}{MSE}\right) = 20 \cdot \log_{10}\left(\frac{MAX_I}{\sqrt{MSE}}\right) = 20 \cdot \log_{10}(MAX_I) - 10 \cdot \log_{10}(MSE)$$

PSNR is commonly used to assess how well an image is reconstructed after going through the compression process.
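The PSNR formula can be sketched in a few lines. The report computes PSNR in MATLAB; the Python version below is only an illustration on flat lists of pixel values, with `mse` and `psnr` as hypothetical helper names and 255 as the peak value for 8-bit data.

```python
import math

def mse(exact, approx):
    """Mean squared error between two equally long pixel sequences."""
    return sum((e - a) ** 2 for e, a in zip(exact, approx)) / len(exact)

def psnr(exact, approx, max_i=255):
    """PSNR in dB; infinite when the two sequences are identical."""
    err = mse(exact, approx)
    if err == 0:
        return math.inf
    return 10 * math.log10(max_i ** 2 / err)

exact_pixels = [10, 20, 30, 40]
approx_pixels = [12, 18, 30, 40]    # two pixels off by 2, so MSE = 2
print(psnr(exact_pixels, exact_pixels))
print(round(psnr(exact_pixels, approx_pixels), 2))
```

The zero-MSE branch mirrors the statement that the exactly multiplied image has infinite PSNR.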

VII. RESULTS OF PROPOSED DESIGNS

Both the approximate and the exact multiplier were designed in the Xilinx Vivado tool using Verilog HDL. The target device is xc7z010clg400-1. The proposed design uses 78 LUTs (look-up tables), whereas the exact design uses 81. The delay of the exact multiplier is 10.833 ns, whereas the delay of the proposed design is 9.975 ns. The exact design consumes 14 mW, whereas the proposed design consumes 13 mW.
PSNR (peak signal-to-noise ratio) and SSIM (structural similarity index) are used as parameters to tell how well the image multiplication has taken place. Ideally, the PSNR should lie in the range 30 dB to 50 dB for a bit depth of 8 bits. For a 16-bit depth the PSNR is between 60 dB and 80 dB. The SSIM should be from 0.97 to 0.99 for the approximate image multiplication.

Fig. 9. Schematic obtained from the Xilinx Vivado tool for the multiplier.

The schematic of the multiplier uses 4 proposed compressors, 9 exact compressors, 17 full adders, and 2 half adders.
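The relative savings quoted for the multiplier can be recomputed from the synthesis figures. The sketch below uses the exact and approximate values as listed in Table IV; the dictionary keys and the helper name are illustrative.

```python
def improvement_percent(exact, approx):
    """Relative reduction of the approximate design versus the exact one."""
    return 100.0 * (exact - approx) / exact

# (exact, approximate) values as listed in Table IV
table_iv = {
    "power_mW": (14, 13),
    "luts": (81, 78),
    "delay_ns": (10.833, 9.975),
}

savings = {name: round(improvement_percent(e, a), 2)
           for name, (e, a) in table_iv.items()}
print(savings)
```

These values reproduce the 7.14% power, 3.70% LUT, and 7.92% delay improvements reported for the proposed multiplier.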
Fig. 10. Simulation result of the 8-bit multiplier made from the proposed 4-2 compressor using RCPFA adder.

The simulation results are shown in Fig. 10. With 128 and 192 as the two inputs, exact multiplication gives 24576, whereas the proposed multiplier outputs 24704. The difference is only 128. When processing images, the human eye cannot detect this difference.

TABLE IV
COMPARISON OF AREA, POWER AND DELAY FOR THE IMPLEMENTED DESIGN OF MULTIPLIER AND EXACT DESIGN

                     Power (mW)   LUTs    Delay (ns)
Exact multiplier     14           81      10.833
Approx multiplier    13           78      9.975
Improvement          7.14%        3.70%   7.92%

Fig. 11. Exact image (left) and approximate image (right).

Fig. 12 shows the structural similarity of the approximate image compared with the exact image. The white portions show where the approximate and exact images match, whereas the black parts show where they differ.

Fig. 12. SSIM of the approximate multiplied image.

VIII. PROCEDURE FOR IMAGE MULTIPLICATION

In this project, the images to be multiplied are given to MATLAB to convert them to hex codes. Each image is split into three planes (red, green, and blue) of 256x256 pixels. Each pixel of the corresponding colour is multiplied, the result is written to a hex file, and these text files are then used by MATLAB to obtain the multiplied image. Depending on the multiplier used in the multiplication process, we get a different multiplied image. PSNR and SSIM are calculated in MATLAB. PSNR is commonly used to assess how well the image is reconstructed after going through the compression process.

TABLE V
PSNR AND SSIM OF THE APPROXIMATE IMAGE MULTIPLICATION

                               PSNR (dB)   SSIM
Approx image multiplication    37.68       0.96422

IX. CONCLUSION

The RCPFAs studied in this work propagate the carry from the most significant to the least significant bits. Reverse carry propagation offers higher stability against delay variation. The effectiveness of the approximate FAs and of the hybrid adders built from them has been examined. The results showed that, on average, using the RCPFAs in hybrid adders gives improvements of 27%, 6%, and 31% in delay, energy, and EDP.
Further, using the proposed compressor in the 8-bit multiplier gives a saving of 3.70% in LUT area, a 7.14% reduction in power, and a 7.92% reduction in delay, with a small compromise in accuracy.
With the proposed multiplier used for image multiplication, we obtain an SSIM of 0.96422 and a PSNR of 37.687.

REFERENCES

[1] M. Pashaeifar, M. Kamal, A. Afzali-Kusha, and M. Pedram, "Approximate Reverse Carry Propagate Adder for Energy-Efficient DSP Applications," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 11, pp. 2530–2541, Nov. 2018.

[2] N. Zhu, W. L. Goh, W. Zhang, K. S. Yeo, and Z. H. Kong, "Design of low-power high-speed truncation-error-tolerant adder and its application in digital signal processing," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 8, pp. 1225–1229, Aug. 2010.

[3] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, "Low-power digital signal processing using approximate adders," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 32, no. 1, pp. 124–137, Jan. 2013.

[4] Z. Yang, A. Jain, J. Liang, J. Han, and F. Lombardi, "Approximate XOR/XNOR-based adders for inexact computing," in Proc. 13th IEEE Int. Conf. Nanotechnol. (NANO), Aug. 2013, pp. 690–693.

[5] H. T. Bui, Y. Wang, and Y. Jiang, "Design and analysis of low-power 10-transistor full adders using novel XOR-XNOR gates," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 49, no. 1, pp. 25–30, Jan. 2002.

[6] H. A. F. Almurib, T. N. Kumar, and F. Lombardi, "Inexact designs for approximate low power addition by cell replacement," in Proc. Design, Autom. Test Eur. (DATE), Mar. 2016, pp. 660–665.

[7] O. Akbari, M. Kamal, A. Afzali-Kusha, and M. Pedram, "Dual-quality 4:2 Compressors for utilizing in dynamic accuracy configurable multipliers," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 4, pp. 1352–1361, Apr. 2017.

[8] L. B. Soares, E. Costa, and S. Bampi, "Approximate adder synthesis for area- and energy-efficient FIR filters in CMOS VLSI," in Proc. 13th IEEE Int. New Circuits Syst. (NEWCAS) Conf., Jun. 2015, pp. 1–4.

[9] M. Shafique, W. Ahmad, R. Hafiz, and J. Henkel, "A low latency generic accuracy configurable adder," in Proc. ACM/EDAC/IEEE Design Autom. Conf. (DAC), Jun. 2015, pp. 1–6.

[10] M. Shafique, W. Ahmad, R. Hafiz, and J. Henkel, "A low latency generic accuracy configurable adder," in Proc. ACM/EDAC/IEEE Design Autom. Conf. (DAC), Jun. 2015, pp. 1–6.

[11] S. Venkatachalam and S.-B. Ko, "Design of Power and Area Efficient Approximate Multipliers," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 5, pp. 1782–1786, May 2017.

[12] D. Esposito, A. G. M. Strollo, E. Napoli, D. De Caro, and N. Petra, "Approximate Multipliers Based on New Approximate Compressors," IEEE Trans. Circuits and Syst.: Reg. Papers, vol. 65, no. 12, pp. 4169–4182, Dec. 2018.

[13] X. Yi, H. Pei, Z. Zhang, H. Zhou, and Y. He, "Design of an Energy-Efficient Approximate Compressor for Error-Resilient Multiplications," 2019 IEEE Int. Symp. Circuits and Syst. (ISCAS), Sapporo, Japan, 2019, pp. 1–5.

[14] M. Ha and S. Lee, "Multipliers with Approximate 4-2 Compressors and Error Recovery Modules," IEEE Embedded Systems Letters, vol. 10, no. 1, pp. 6–9, Mar. 2018.
