Modified Operand Decomposition Multiplication for High Performance Parallel Multipliers
†
ADWC-HCT, Abu Dhabi, UAE
‡
Electronics Research Institute, Cairo,
by MONASH UNIVERSITY on 10/01/16. For personal use only.
1. Introduction
Digital signal processors (DSPs) in hand-held and portable battery-operated devices
are increasingly in demand and require efficient, low-power multiplication, since
multipliers are at the heart of most arithmetic operations in DSPs. Reducing
the power dissipation and area of multipliers is therefore key to meeting the overall
power budget¹ and reliability requirements of portable devices. Considerable research effort
has been directed at developing efficient, low-power multiplier designs, a problem that
can be tackled at different levels of the design hierarchy. Algorithmic-level approaches
that modify the multiplication's operands can give a significant overall power
*This paper was recommended by Regional Editor Piero Malcovati.
Z. Abid, D. A. El-Dib & R. Mudassir
reduced power dissipation by almost half, but it imposed large area (31%) and delay
(12%) overheads. Other proposed architectures for low-power multipliers achieved
the reduction of the switching activity through operand decomposition,⁵,⁶ where both
the multiplier and the multiplicand are decomposed so as to reduce
the total number of transitions. As a result, switching activity is reduced, achieving
lower power dissipation. Fayed and Bayoumi⁵ achieved lower power consumption by
dividing the multiplication circuit into clusters of smaller multipliers. By applying
clock gating and pre-processing the input pattern,
clusters that produce zero can be disabled, saving the switching power
consumed by these clusters. The approach is, however, limited to small multipliers
or to applications whose inputs naturally have a high probability of containing several
consecutive zeros. The amount of power saved by this approach is input dependent:
the higher the probability of zero nibbles in the input data, the greater the
power saving. On the other hand, the work proposed by Ito et al.⁶
overcomes these drawbacks; it decomposes both multiplicand and multiplier in such a
way that the total number of ones in the partial products is reduced. As a result, logic
transitions are reduced, achieving lower power dissipation. However, this technique,
called Multiplication Algorithm for Switching Activity Reduction (MASAR),
generates the partial products using AND/OR gates and needs inversion of partial sums
and partial carry outputs. We propose a modified MASAR (M-MASAR). First,
M-MASAR generates the partial products using NAND gates instead of AND/OR
gates; studies have shown that a NAND-based partial product tree reduces the
dynamic power consumption.⁷ Second, we propose a new low-power (4:2) compressor
capable of handling negative inputs, thus reducing the number of required inversions.
This consequently reduces the area and switching activity. Third, M-MASAR
applies a Carry Select Adder (CSA) instead of a Carry Propagate Adder (CPA). Our
modification neither imposes a delay overhead nor is input dependent. M-MASAR
outperforms MASAR in area, delay and power consumption.
The paper is organized as follows. Section 2 details the original MASAR
algorithm and our proposed modifications to its architecture and implementation.
Section 3 presents the simulation results along with a comparison
to the MASAR design. Section 4 concludes the paper.
Modified Operand Decomposition Multiplication Algorithm
2.1. Step 1
J CIRCUIT SYST COMP 2016.25. Downloaded from www.worldscientific.com
The operands X and Y are decomposed into four numbers A, B, C and D to reduce
the number of ones in partial products. Ito et al. assumed the following:
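$$X = (B + D), \qquad Y = (B + C). \tag{3}$$
$$X \times Y = (B + D) \times (B + C) = (C \times D) + B \times (B + C + D). \tag{4}$$
Assuming that $A = -(B + C + D)$, then one can easily reach the following:
$$X \times Y = (C \times D) - (A \times B), \tag{5}$$
where
$$A = [1\; a_{n-1}\; a_{n-2} \ldots a_1\; (a_0 + 1)], \tag{6}$$
$$B = [b_{n-1}\; b_{n-2} \ldots b_1\; b_0], \tag{7}$$
$$c_i = \bar{x}_i \wedge y_i, \qquad d_i = x_i \wedge \bar{y}_i, \tag{11}$$
for $(n-1) \ge i \ge 0$.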
Note that $a_i$, $b_i$, $c_i$ and $d_i$ are all generated by ANDing $x_i$ and $y_i$ or their
inversions. Thus, exactly one of them is "1" and the rest are "0"s, as
seen in Table 1.
Therefore, $a_i + b_i + c_i + d_i = 1$ for each $i$. Moreover, the probabilities of each of
them being equal to 1, namely Pr($a_i$), Pr($b_i$), Pr($c_i$) and Pr($d_i$), are all equal to 1/4,
whereas Pr($x_i$) and Pr($y_i$) are equal to 1/2.
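$x_i$   $y_i$   |   $a_i = \bar{x}_i \wedge \bar{y}_i$   |   $b_i = x_i \wedge y_i$   |   $c_i = \bar{x}_i \wedge y_i$   |   $d_i = x_i \wedge \bar{y}_i$
0       0       |   1                                    |   0                        |   0                              |   0
0       1       |   0                                    |   0                        |   1                              |   0
1       0       |   0                                    |   0                        |   0                              |   1
1       1       |   0                                    |   1                        |   0                              |   0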
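The decomposition in Step 1 can be checked numerically. The following is a minimal Python sketch, not the authors' implementation: the function name `decompose` and the use of plain integers for the operands are assumptions made for illustration. It builds A, B, C and D bit by bit as in Table 1, interprets A as the $(n+1)$-bit two's-complement word $[1\; a_{n-1} \ldots a_1\; (a_0 + 1)]$, and relies on the identities $X = B + D$, $Y = B + C$ and $X \times Y = C \times D - A \times B$.

```python
def decompose(x, y, n):
    """Split n-bit unsigned operands x and y into (A, B, C, D).

    Hypothetical helper for illustration only.
    b_i = x_i AND y_i, c_i = (NOT x_i) AND y_i, d_i = x_i AND (NOT y_i),
    a_i = (NOT x_i) AND (NOT y_i).
    """
    mask = (1 << n) - 1
    b = x & y
    c = ~x & y & mask
    d = x & ~y & mask
    a_bits = ~x & ~y & mask
    # A = [1 a_{n-1} ... a_1 (a_0 + 1)] read as a two's-complement number:
    # the leading 1 has weight -2^n, and 1 is added at the least significant bit.
    a = a_bits + 1 - (1 << n)
    return a, b, c, d


# Example with 4-bit operands x = 10 and y = 6.
a, b, c, d = decompose(0b1010, 0b0110, 4)
print(a, b, c, d)          # -14 2 4 8
print(c * d - a * b)       # 60, which equals 10 * 6
```

Because exactly one of $a_i$, $b_i$, $c_i$, $d_i$ is set in every bit position, $B + C + D$ equals $(2^n - 1)$ minus the unsigned value of the $a_i$ bits, which is why the two's-complement reading of A gives $-(B + C + D)$.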
2.2. Step 2
Partial products (PPs) to reach the final product are generated for both $C \times D$ and
$A \times B$. This is twice as many PPs as are needed for a regular multiplication $X \times Y$.
Similarly,
$$q_{ij} = \overline{\overline{c_i \wedge d_j} \wedge \overline{c_j \wedge d_i}} \tag{17}$$
for $(n-1) \ge i > j \ge 0$.
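As an illustration of the NAND-only generation of the Q partial products, the sketch below assumes that the two same-weight terms combined in each $q_{ij}$ are $c_i \wedge d_j$ and $c_j \wedge d_i$, with $i > j$. The helper names (`nand`, `bit`, `q_sum`) are hypothetical. Since $c_i$ and $d_i$ are never both 1, the two terms are mutually exclusive and the diagonal terms $c_i \wedge d_i$ vanish, so three NAND gates per pair are enough to reproduce $C \times D$.

```python
def nand(a, b):
    """2-input NAND on single bits."""
    return 1 - (a & b)


def bit(value, i):
    """Return bit i of a non-negative integer."""
    return (value >> i) & 1


def q_sum(c, d, n):
    """Sum the NAND-generated partial products q_ij * 2^(i+j) for i > j.

    Illustrative sketch: each q_ij uses three NAND gates, and there are
    n(n-1)/2 pairs (i, j), so Q alone needs 3n(n-1)/2 NAND gates.
    """
    total = 0
    for i in range(n):
        for j in range(i):
            q_ij = nand(nand(bit(c, i), bit(d, j)),
                        nand(bit(c, j), bit(d, i)))
            total += q_ij << (i + j)
    return total


# Usage: derive C and D from 8-bit operands and compare with C * D.
x, y, n = 0b10110100, 0b01101001, 8
mask = (1 << n) - 1
c = ~x & y & mask
d = x & ~y & mask
print(q_sum(c, d, n) == c * d)   # True
```

The same structure applied to the bits of A and B would account for the other $3n(n-1)/2$ NAND gates, giving the $3n(n-1)$ total quoted in the text.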
Therefore, in M-MASAR, the partial products are generated using
$3n(n-1)$ NAND gates. Since a 2-input NAND gate requires 4 transistors while
each AND/OR gate requires 6 transistors in complementary CMOS logic,
the proposed scheme reduces the transistor count by
$[2n(n-1) \times 6 + n(n-1) \times 6] - 3n(n-1) \times 4$. For $n = 8$, this is equal to a reduction of
$[2 \times 8 \times 7 \times 6 + 8 \times 7 \times 6] - 3 \times 8 \times 7 \times 4 = 336$ out of a total of 1008 transistors. This is a 33%
reduction in the transistors required to form the partial products and hence an
equivalent reduction in area. Table 2 gives the transistor count needed to
generate all the partial products in M-MASAR versus that needed for MASAR
for three different values of $n$ ($n = 8$, 16 and 32).
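The transistor-count expression above can be evaluated directly. The short Python sketch below only restates the formula from the text (6 transistors per AND/OR gate, 4 per 2-input NAND gate); the function name `pp_transistors` is an assumption for illustration, and the printed values are computed from the formula rather than copied from Table 2.

```python
def pp_transistors(n):
    """Transistors needed to generate the partial products for n-bit operands.

    Returns (masar, m_masar, saved), following the counts stated in the text:
    MASAR uses 2n(n-1) AND gates and n(n-1) OR gates at 6 transistors each,
    M-MASAR uses 3n(n-1) NAND gates at 4 transistors each.
    """
    pairs = n * (n - 1)
    masar = 2 * pairs * 6 + pairs * 6
    m_masar = 3 * pairs * 4
    return masar, m_masar, masar - m_masar


for n in (8, 16, 32):
    masar, m_masar, saved = pp_transistors(n)
    print(n, masar, m_masar, saved)
# 8 1008 672 336
# 16 4320 2880 1440
# 32 17856 11904 5952
```

Both counts scale with $n(n-1)$, so the relative saving stays at one third for every operand width.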
2.3. Step 3
After decomposition and partial multiplication, the compressed partial products
for each of P and Q need to be summed, and the final partial sum and carry from
both P and Q have to be combined. This is all done using (2:2) counters, (3:2)
counters and (4:2) compressors, as shown in Fig. 2. Since $-P + Q$ is required instead
of $P + Q$, the inputs from P should be subtracted rather than added. Finally, MASAR
uses carry propagate adders at the output stage.
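A standard way to subtract in hardware is two's-complement negation: invert every bit of the subtrahend and add one. The Python sketch below illustrates that general identity for $-P + Q$ on a fixed word width. It is not taken from the paper, and the function name and the 16-bit width are assumptions; it is only meant to show why a conventional realization needs inversions on the P path.

```python
def subtract_via_inversion(q, p, width=16):
    """Compute (q - p) modulo 2**width using only inversion and addition.

    Illustrative sketch: -p is formed as (NOT p) + 1 in two's complement.
    """
    mask = (1 << width) - 1
    inverted_p = ~p & mask          # bitwise inversion of every bit of p
    return (q + inverted_p + 1) & mask


# Usage: for q >= p the result is the ordinary difference.
print(subtract_via_inversion(1000, 58))   # 942
```

When the result is negative, the returned word is its two's-complement encoding on `width` bits.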
Table 3. Truth table for Type-1 and Type-2 adders along with their logic functions.
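To illustrate how a full-adder cell can absorb negatively weighted inputs without inverters, the sketch below uses the generalized full-adder formulation common in two's-complement array multipliers. This is an assumption: the exact Type-1 and Type-2 definitions used in this paper may differ, and the function name `signed_full_adder` is hypothetical. The same logic satisfies two arithmetic readings, one with a single negatively weighted input and one with two.

```python
def signed_full_adder(x, y, z):
    """Full-adder-like cell returning (carry, s) for single-bit inputs.

    Assumed formulation (not confirmed by the paper):
      s     = x XOR y XOR z
      carry = majority(x, y, NOT z)
    Reading 1: x + y - z == 2*carry - s   (one negatively weighted input, z)
    Reading 2: z - x - y == -2*carry + s  (two negatively weighted inputs, x and y)
    """
    s = x ^ y ^ z
    nz = 1 - z
    carry = (x & y) | (x & nz) | (y & nz)
    return carry, s


# Exhaustive truth table with both arithmetic readings checked.
for x in (0, 1):
    for y in (0, 1):
        for z in (0, 1):
            carry, s = signed_full_adder(x, y, z)
            assert x + y - z == 2 * carry - s
            assert z - x - y == -2 * carry + s
            print(x, y, z, carry, s)
```

The second reading is simply the negation of the first, which is why changing the weights assigned to the inputs and outputs does not require extra inverters in this formulation.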
Both types of adders, Type-1 and Type-2, have the flexibility to deal with negatively
weighted inputs without using inverters, provided they are stacked appropriately. The
proposed (4:2) compressor, illustrated in Fig. 7, uses only 28 transistors and does not
need inversion of inputs from P, as it can accommodate two negatively weighted
inputs. Thus, the proposed compressor reduces the number of inverters of a traditional
(4:2) compressor by 50%. At the last stage of the multiplier, a carry select adder
that can handle the negatively weighted carry output
signals from the last (4:2) compressor stage is required.¹¹ A sample of the last stage is shown
in Fig. 8. The concept of the carry select adder is to compute alternative results in
parallel and subsequently select the correct result with single- or multiple-stage
hierarchical techniques. In carry select adders both sum and carry bits are calculated
for the two possible values of the carry input bit.
Thus, the carry select adder trades increased area for higher speed. Once the carry-in
is delivered, the correct computation is chosen
(using a MUX) to produce the desired output. Therefore, instead of waiting for the
carry-in before calculating the sum, the adder outputs the correct sum as soon as the carry-in arrives.
The time taken to compute the sum is thus hidden, which results in an improvement
in speed. In Fig. 8, the sum of one (4:2) compressor is applied to the positive
input of the carry select adder in the same column, whereas the generated carry of the
compressor is applied to the negative input of the carry select adder in the next column.
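The carry-select principle described above can be sketched in a few lines of Python. This is a behavioural model of a conventional carry select adder on unsigned operands, not the negatively weighted variant of Fig. 8; the function names, the 16-bit width and the 4-bit block size are assumptions chosen for illustration.

```python
def ripple_add(a_bits, b_bits, carry_in):
    """Add two little-endian bit lists; return (sum_bits, carry_out)."""
    sum_bits = []
    carry = carry_in
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (a & carry) | (b & carry)
    return sum_bits, carry


def carry_select_add(a, b, width=16, block=4):
    """Behavioural carry select adder; returns (sum mod 2**width, carry_out)."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    result, carry = [], 0
    for start in range(0, width, block):
        blk_a = a_bits[start:start + block]
        blk_b = b_bits[start:start + block]
        # Both candidate results are computed without knowing the carry-in
        # (in hardware these two additions run in parallel).
        sum0, carry0 = ripple_add(blk_a, blk_b, 0)
        sum1, carry1 = ripple_add(blk_a, blk_b, 1)
        # The incoming carry acts as the MUX select signal.
        if carry:
            result.extend(sum1)
            carry = carry1
        else:
            result.extend(sum0)
            carry = carry0
    value = sum(bit << i for i, bit in enumerate(result))
    return value, carry


print(carry_select_add(40000, 30000))   # (4464, 1)
```

Each block duplicates its adder, which is the area cost mentioned in the text; in exchange, the delay per block after the carry-in arrives is only that of the multiplexer.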
3. Simulation Results
4. Conclusion
We have modified the implementation of a low-power multiplication algorithm for
high-performance circuits to further reduce its power dissipation, area and delay. The
implementation results show that the proposed multiplier has 20% lower power
consumption and 10% lower delay while requiring fewer transistors than
the original MASAR multiplier design⁶ for an 8-bit implementation. These
advantages are largely due to two factors. First, the proposed scheme
generates and compresses the partial products using $3n(n-1)$ NAND gates. Second,
our proposed low-power (4:2) compressor reduces the number of inverters
at the output stage by 50%. Thus, the proposed architecture can be applied to
digital circuits and systems where low power consumption and high speed are critical
design constraints. For higher-order multipliers, the delay improvement and power
consumption reductions are expected to be even higher. This is simply because the
area of the partial products grows quadratically with the number of multiplier bits,
whereas the output stage grows linearly.
Acknowledgement
This research was conducted at UWO, London, Ontario. The authors would like to
thank NSERC-Canada and UWO for their financial support.
References
1. A. P. Chandrakasan, S. Sheng and R. W. Brodersen, Low-power CMOS digital design,
IEICE Trans. Electron. 75 (1992) 371.
2. P.-M. Seidel, Dynamic operand modification for reduced power multiplication, in Conf.
Record of the 36th Asilomar Conf. Signals, Systems and Computers, 2002, Vol. 1, IEEE
2002, pp. 52–56.
3. T. Ahn and K. Choi, Dynamic operand interchange for reduced power, Electron. Lett. 33
(1997) 2118.
4. M. Fujino and V. G. Moshnyaga, Dynamic operand transformation for low-power
multiplier-accumulator design, in Proc. 2003 Int. Symp. Circuits and Systems, 2003.
ISCAS'03, Vol. 5, IEEE 2003, pp. V–345.
5. A. A. Fayed and M. A. Bayoumi, A novel architecture for low-power design of parallel
multipliers, in Proc. IEEE Computer Society Workshop on VLSI, 2001, IEEE 2001,
pp. 149–154.
6. M. Ito, D. Chinnery and K. Keutzer, Low power multiplication algorithm for switching
activity reduction through operand decomposition, in Proc. 21st Int. Conf. Computer
Design, 2003, IEEE 2003, pp. 21–26.
7. M. DeRenzo, M. J. Irwin and N. Vijaykrishnan, Designing leakage aware multipliers, in
Proc. 17th Int. Conf. VLSI Design, 2004, IEEE 2004, pp. 654–657.
8. A. M. Shams and M. A. Bayoumi, A novel high-performance CMOS 1-bit full-adder cell,
IEEE Trans. Circuits Syst. II, Analog Digital Signal Process. 47 (2000) 478–481.