
Journal of Circuits, Systems, and Computers

Vol. 25, No. 12 (2016) 1650149 (12 pages)


© World Scientific Publishing Company
DOI: 10.1142/S0218126616501498

Modified Operand Decomposition Multiplication for High Performance Parallel Multipliers*

Z. Abid†,¶, Dalia A. El-Dib‡,|| and Rizwan Mudassir§,**


J CIRCUIT SYST COMP 2016.25. Downloaded from www.worldscientific.com


ADWC-HCT, Abu Dhabi, UAE

Electronics Research Institute, Cairo,
by MONASH UNIVERSITY on 10/01/16. For personal use only.

Egypt and Electrical and Computer Engineering,


Dalhousie University, Halifax, NS, Canada
§
Canadian Nuclear Laboratories, Canada

zeabid@uwo.ca
||
dafeldib@alumni.uwaterloo.ca
**mudassir_r@yahoo.com

Received 16 September 2015


Accepted 5 June 2016
Published 19 July 2016

A low power operand decomposition multiplication architecture implementation is modified to
further reduce its power dissipation and delay. First, the multiplier's implementation is
modified to generate the partial products using NAND gates instead of AND and OR gates in
order to reduce the number of transistors (area utilized) and the delay. Then, new
types of adders and (4:2) compressors that accept negatively weighted bits are used to reduce
the number of inverters. As a result, the multiplier architecture requires significantly
fewer transistors. These modifications result in 20% and 36% reductions in power
consumption and energy delay product (EDP), respectively.

Keywords: Low power; operand decomposition multiplication; parallel multiplier.

1. Introduction
Digital Signal Processors (DSPs) in hand-held and portable battery-operated devices
are increasingly needed and require efficient, low power multiplication operations,
as multipliers are at the heart of most arithmetic operations in DSPs. Thus, reducing
the power dissipation and area utilization of multipliers is key to satisfying the overall
power budget [1] and reliability concerns of portable devices. Considerable research effort
has been directed at developing efficient and low power multiplier designs, which
can be tackled at different levels of the design hierarchy. Algorithmic-level approaches
that modify the multiplication's operands can give a significant overall power
*This paper was recommended by Regional Editor Piero Malcovati.


reduction [2–6]. Interchanging the multiplier's operands [2] to reduce transitions between
the current operands and the previous operands reduced the expected switching
activity at the multiplier inputs by up to 25%, but no exact figures for the power
reduction, delay and hardware overhead of the multiplier were given. Another dynamic
operand interchanging approach [3] reduced the power consumption
by up to 29.6% by interchanging the two operands of the multiplier based on the
operand's sign bit, but it imposed a minor delay and some hardware overhead.
In another approach [4], the transition activity of the multiplication is reduced by comparing the
current operand values to the previous ones and using either their original form or their
two's complement form, depending on the computed Hamming distance. This reduced
power dissipation by almost half, but it imposed large area (31%) and delay
(12%) overheads. Other proposed architectures for low power multipliers achieved
the reduction of the switching activity through operand decomposition [5, 6], where both
multiplier and multiplicand are decomposed in such a way that the total number of
transitions is reduced. As a result, switching activity is reduced, achieving
lower power dissipation. Fayed and Bayoumi [5] achieved lower power consumption by
dividing the multiplication circuit into clusters of smaller multipliers. By applying
clock gating techniques and pre-processing operations on the input pattern, clusters
that produce zero can be disabled, which saves the switching power
consumed by these clusters. The approach is, however, limited to small multipliers
or to applications whose inputs naturally have a high probability of containing several
consecutive zeros. The amount of power saved by this approach is input dependent:
the higher the probability of zero nibbles in the input data, the more power
is saved, and vice versa. On the other hand, the work proposed by Ito et al. [6]
overcomes these drawbacks; it decomposes both multiplicand and multiplier in such a
way that the total number of ones in the partial products is reduced. As a result, logic
transitions are reduced, achieving lower power dissipation. However, this technique,
called Multiplication Algorithm for Switching Activity Reduction (MASAR), generates
the partial products using AND/OR gates and needs inversion of partial sums
and partial carry outputs. We propose a modified MASAR (M-MASAR). First, M-MASAR
generates the partial products using NAND gates instead of AND/OR
gates; studies have shown that a NAND-based partial product tree reduces the
dynamic power consumption [7]. Second, we propose a new low-power (4:2) compressor
capable of handling negative inputs, thus reducing the number of required inversions.
This consequently reduces the area and switching activity. Third, M-MASAR
applies a Carry Select Adder (CSA) instead of a Carry Propagate Adder (CPA). Our
modification neither imposes a delay overhead nor is input dependent. M-MASAR
outperforms MASAR in area, delay and power consumption.
The organization of the paper is as follows: Section 2 details the original MASAR
algorithm and our proposed modifications to its architecture/implementation.
Section 3 presents the simulation results along with a comparison
to the MASAR design. Section 4 concludes the paper.


2. Operand Decomposition Multiplication and its Modification


In the MASAR algorithm, unsigned n-bit multiplication of two
operands X and Y is assumed. Consequently, their product has 2n bits.

$$X = [x_{n-1}\; x_{n-2}\; \ldots\; x_1\; x_0], \tag{1}$$

$$Y = [y_{n-1}\; y_{n-2}\; \ldots\; y_1\; y_0]. \tag{2}$$

The product is obtained in several steps.

2.1. Step 1
The operands X and Y are decomposed into four numbers A, B, C and D to reduce
the number of ones in partial products. Ito et al. assumed the following:
$$X = (B + D), \qquad Y = (B + C). \tag{3}$$

Thus, $X \times Y$ can be expressed as

$$X \times Y = (B \times B) + (B \times C) + (D \times B) + (D \times C) = (C \times D) + B \times (B + C + D). \tag{4}$$

Assuming that $A = -(B + C + D)$, one can easily reach the following:

$$X \times Y = (C \times D) - (A \times B), \tag{5}$$

where
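$$A = [1\; a_{n-1}\; a_{n-2}\; \ldots\; a_1\; (a_0 + 1)], \tag{6}$$

$$B = [b_{n-1}\; b_{n-2}\; \ldots\; b_1\; b_0], \tag{7}$$

$$C = [c_{n-1}\; c_{n-2}\; \ldots\; c_1\; c_0], \tag{8}$$

$$D = [d_{n-1}\; d_{n-2}\; \ldots\; d_1\; d_0], \tag{9}$$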

and Ito et al. [6] defined the individual bits as

$$a_i = \bar{x}_i \wedge \bar{y}_i, \qquad b_i = x_i \wedge y_i, \tag{10}$$

$$c_i = \bar{x}_i \wedge y_i, \qquad d_i = x_i \wedge \bar{y}_i, \tag{11}$$

for $(n-1) \ge i \ge 0$.
It is noted that $a_i$, $b_i$, $c_i$ and $d_i$ are all generated by ANDing $x_i$ and $y_i$ or their
inversions. Thus, exactly one of them ($a_i$, $b_i$, $c_i$ and $d_i$) is "1" and the rest are "0"s, as
seen in Table 1.
Therefore, $a_i + b_i + c_i + d_i = 1$ for each i. Moreover, for independent and uniformly
distributed input bits, the probabilities Pr($a_i$), Pr($b_i$), Pr($c_i$) and Pr($d_i$) of each of
them being equal to 1 are all 1/4,
whereas Pr($x_i$) and Pr($y_i$) are equal to 1/2.
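
The decomposition of Step 1 can be checked numerically. The following is a minimal Python sketch, not the authors' implementation: the function names `decompose` and `check` and the width n = 4 are illustrative assumptions. It also assumes that the leading 1 of A in Eq. (6) is a sign bit of weight -2^n, so that A equals -(B + C + D); under that reading it confirms Eqs. (3) and (5) for every operand pair.

```python
def decompose(x: int, y: int, n: int):
    """Split n-bit unsigned x, y into (A, B, C, D), following Eqs. (6)-(11).

    Hypothetical helper for illustration only.
    """
    mask = (1 << n) - 1
    a_bits = ~x & ~y & mask      # a_i = NOT x_i AND NOT y_i
    b = x & y                    # b_i = x_i AND y_i
    c = ~x & y & mask            # c_i = NOT x_i AND y_i
    d = x & ~y & mask            # d_i = x_i AND NOT y_i
    # Eq. (6), assumed reading: leading 1 has weight -2^n, and 1 is added at bit 0.
    a = a_bits + 1 - (1 << n)
    return a, b, c, d


def check(n: int) -> int:
    """Exhaustively verify Eqs. (3) and (5); return the number of pairs tested."""
    tested = 0
    for x in range(1 << n):
        for y in range(1 << n):
            a, b, c, d = decompose(x, y, n)
            assert x == b + d and y == b + c   # Eq. (3)
            assert a == -(b + c + d)           # A = -(B + C + D)
            assert x * y == c * d - a * b      # Eq. (5)
            tested += 1
    return tested


if __name__ == "__main__":
    print(check(4))  # number of operand pairs checked
```

For example, `decompose(0b1010, 0b0110, 4)` gives A = -14, B = 2, C = 4 and D = 8, and indeed 4 × 8 - (-14) × 2 = 60 = 10 × 6.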


Table 1. Logic table of ($a_i$, $b_i$, $c_i$ and $d_i$) as a function of ($x_i$, $y_i$ and their inversions).

$x_i$  $y_i$    $a_i$                              $b_i$                  $c_i$                        $d_i$
                $\bar{x}_i \wedge \bar{y}_i$       $x_i \wedge y_i$       $\bar{x}_i \wedge y_i$       $x_i \wedge \bar{y}_i$
0      0        1                                  0                      0                            0
0      1        0                                  0                      1                            0
1      0        0                                  0                      0                            1
1      1        0                                  1                      0                            0

2.2. Step 2
Partial products (PPs) leading to the final product are generated for both $C \times D$ and
$A \times B$. This is twice as many PPs as are needed for a regular multiplication $X \times Y$.
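$$P = A \times B = [p_{n-1,n-2} \ldots p_{i,j} \ldots p_{1,0}], \tag{12}$$

$$Q = C \times D = [q_{n-1,n-2} \ldots q_{i,j} \ldots q_{1,0}]. \tag{13}$$
The partial products can be expressed as follows using ANDs and ORs.
$$p_{ij} = (a_i \wedge b_j) \vee (b_i \wedge a_j), \tag{14}$$
$$q_{ij} = (c_i \wedge d_j) \vee (d_i \wedge c_j). \tag{15}$$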
Thus, every partial product ($q_{ij}$ or $p_{ij}$) requires two AND gates and one OR gate to
be implemented in hardware. Since at most one of $a_i$ and $b_i$ is "1" and at most one of $c_i$ and
$d_i$ is "1", $q_{ii} = p_{ii} = 0$ for $(n-1) \ge i \ge 0$. Moreover, because of the special relation
between a, b, c and d, many of the partial products turn out to be zero and can be
removed from the hardware, thereby reducing (compressing) the number of partial
products significantly. The probability that a remaining partial product equals
"1" is reduced by half (1/8 instead of 1/4) compared to a conventional multiplier's
partial products, thus reducing the switching activity. The nth position of A is
always 1 and the 0th position of A is $(a_0 + 1)$ instead of $a_0$. In order to accommodate
the complementary terms generated by the most significant and the least
significant bits of A, a simple modification to the logic is required [6]. The resulting
partial product scheme is shown in Fig. 1, where the regular partial products for each
multiplication, $(A \times B)$ and $(C \times D)$, are shown on the left and the shrunk version
(after compression) is shown on the right. The reduced partial products on the right
are generated by using $[2n(n-1)]$ AND gates and $[n(n-1)]$ OR gates [6].
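
The pairing of partial products in Step 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the circuit of Ref. 6: the helper names `bit`, `paired_pp`, `pp_sum` and `check` are hypothetical, $q_{ij}$ is taken as the symmetric counterpart of $p_{ij}$, that is $(c_i \wedge d_j) \vee (d_i \wedge c_j)$, and the special handling of the most and least significant bits of A is left out (only the raw bits $a_{n-1} \ldots a_0$ are used). The sketch checks that summing the paired terms over i > j reproduces the full product, and it measures the probability that one such term equals 1 for uniformly distributed inputs.

```python
from fractions import Fraction


def bit(v: int, i: int) -> int:
    """Return bit i of the non-negative integer v."""
    return (v >> i) & 1


def paired_pp(u: int, v: int, i: int, j: int) -> int:
    """Paired partial product (u_i AND v_j) OR (v_i AND u_j), cf. Eq. (14)."""
    return (bit(u, i) & bit(v, j)) | (bit(v, i) & bit(u, j))


def pp_sum(u: int, v: int, n: int) -> int:
    """Sum the paired partial products over i > j with weight 2^(i+j)."""
    return sum(paired_pp(u, v, i, j) << (i + j)
               for i in range(n) for j in range(i))


def check(n: int) -> Fraction:
    """Verify the pairing for all inputs; return Pr(q_{1,0} = 1). Needs n >= 2."""
    mask = (1 << n) - 1
    ones = 0
    for x in range(1 << n):
        for y in range(1 << n):
            a_bits, b = ~x & ~y & mask, x & y
            c, d = ~x & y & mask, x & ~y & mask
            # Diagonal terms vanish, so the paired sum equals the full product.
            assert pp_sum(c, d, n) == c * d
            assert pp_sum(a_bits, b, n) == a_bits * b
            ones += paired_pp(c, d, 1, 0)
    return Fraction(ones, 4 ** n)


if __name__ == "__main__":
    print(check(4))  # probability that q_{1,0} equals 1
```

The OR in each pair can replace an addition because the two AND terms are never 1 at the same time: $c_i$ and $d_i$ (likewise $a_i$ and $b_i$) are mutually exclusive.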

2.2.1. Modification to Step 2


In our work, the partial product implementation is modified to allow the generation
of $p_{ij}$ and $q_{ij}$ using NAND gates instead of OR/AND gates, as described by the
following expressions:
$$p_{ij} = (a_i \wedge b_j) \vee (b_i \wedge a_j) = \overline{\overline{a_i \wedge b_j} \wedge \overline{b_i \wedge a_j}}. \tag{16}$$

Fig. 1. Partial product reduction by MASAR (see Ref. 6).

Similarly,

$$q_{ij} = \overline{\overline{c_i \wedge d_j} \wedge \overline{d_i \wedge c_j}} \tag{17}$$
for $(n-1) \ge i > j \ge 0$.
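
The NAND-only form follows from De Morgan's law, and the equivalence is small enough to check exhaustively. The sketch below is illustrative only; the function names `nand`, `pp_and_or`, `pp_nand` and `forms_agree` are assumptions, not names from the paper.

```python
from itertools import product


def nand(p: int, q: int) -> int:
    """Two-input NAND on single bits."""
    return 1 - (p & q)


def pp_and_or(ai: int, bj: int, bi: int, aj: int) -> int:
    """AND/OR form of Eq. (14): two AND gates feeding one OR gate."""
    return (ai & bj) | (bi & aj)


def pp_nand(ai: int, bj: int, bi: int, aj: int) -> int:
    """NAND-only form of Eq. (16): three two-input NAND gates."""
    return nand(nand(ai, bj), nand(bi, aj))


def forms_agree() -> bool:
    """Compare both forms over all 16 input combinations."""
    return all(pp_and_or(*v) == pp_nand(*v)
               for v in product((0, 1), repeat=4))


if __name__ == "__main__":
    print(forms_agree())
```

The gate count per partial product stays at three; the saving comes from each gate being a NAND rather than an AND or OR.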
Therefore, in the M-MASAR, the generation of partial products is achieved by
$[3n(n-1)]$ NAND gates. Since a 2-input NAND gate requires 4 transistors while
each AND/OR gate requires 6 transistors in complementary CMOS logic topology,
the proposed scheme reduces the transistor count by $[2n(n-1) \times 6 + n(n-1) \times 6 - 3 \times n(n-1) \times 4] = 6n(n-1)$.
For n = 8, this is equal to a reduction of $[2 \times 8 \times 7 \times 6 + 8 \times 7 \times 6 - 3 \times 8 \times 7 \times 4] = 336$
out of a total of 1008 transistors. This is a 33% reduction
in the transistors required to form the partial products and hence an
equivalent reduction in area utilization. Table 2 gives the transistor count needed to
generate all the partial products in M-MASAR versus those needed for MASAR
using three different values for n (n = 8, 16 and 32).
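
The transistor counts can be recomputed from the gate counts and per-gate transistor figures quoted in the text. The following sketch is only a restatement of that arithmetic; the constant and function names are illustrative assumptions.

```python
NAND2_TRANSISTORS = 4   # two-input NAND in complementary CMOS (figure from the text)
AND_OR_TRANSISTORS = 6  # two-input AND or OR gate (figure from the text)


def masar_pp_transistors(n: int) -> int:
    """MASAR: 2n(n-1) AND gates plus n(n-1) OR gates."""
    return (2 * n * (n - 1) + n * (n - 1)) * AND_OR_TRANSISTORS


def m_masar_pp_transistors(n: int) -> int:
    """M-MASAR: 3n(n-1) two-input NAND gates."""
    return 3 * n * (n - 1) * NAND2_TRANSISTORS


if __name__ == "__main__":
    for n in (8, 16, 32):
        before = masar_pp_transistors(n)
        after = m_masar_pp_transistors(n)
        saving = 100 * (before - after) / before
        print(f"n={n}: {before} -> {after} transistors ({saving:.1f}% fewer)")
```

Because the gate count is the same in both designs and only the per-gate cost changes from 6 to 4 transistors, the relative saving in this model is one third for every operand width.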

2.3. Step 3
After decomposition and partial multiplication, the compressed partial products
for each of P and Q need to be summed up, and the final partial sum and carry from
both P and Q have to be combined. This is all done using (2:2) counters, (3:2)
counters and (4:2) compressors, as shown in Fig. 2. Since $-P + Q$ is required instead
of $P + Q$, the inputs from P should be subtracted and not added. Finally, MASAR
uses carry propagate adders at the output stage.

Table 2. Number of transistors needed to generate the multiplier partial products and their corresponding area reductions.

Multiplier type/width    8-bit    16-bit    32-bit
MASAR                    1008     4320      17856
M-MASAR                  672      2880      11904
Area reduction           33.3%    33.3%     33.3%

2.3.1. Modification to Step 3


Figure 3 shows the block diagram of the proposed multiplier. The positive partial
products in Fig. 3 correspond to the values $(C \times D + B \times 2^n)$ and the negative partial
products correspond to the values $(A \times B + B \times 2^n)$. All regular (3:2) counters
(full adders) used in the proposed multiplier design are implemented using the low
power full adder circuit [8] shown in Fig. 4.


Fig. 2. MASAR 8-bit tree multiplier (see Ref. 6).

Fig. 3. The proposed M-MASAR 8-bit tree multiplier.


This full adder performs the regular operation $(X + Y + Z = S + 2C)$. Instead
of using inverters and regular (4:2) compressors [6], a new low-power (4:2) compressor
capable of handling negative inputs is proposed to reduce the number of inverters.
Low power and area efficient Type-1 [9] and Type-2 [10] adders (shown
in Figs. 5 and 6, respectively) are used to build the new (4:2) compressor.
The Type-1 adder performs the operation $(X + Y - Z = -S + 2C)$; that is, it
adds two inputs and subtracts the third. As the equation indicates, the output sum of the Type-1
adder is negatively weighted. Thus, it must be subtracted in subsequent blocks instead of being
added in order to obtain correct binary results. The Type-1 adder is also called a plus plus
minus (PPM) adder.

Fig. 4. Regular (3,2) Counter (full adder) (see Ref. 8).

Fig. 5. Type-1 Adder (see Ref. 9).


Fig. 6. Type-2 Adder (see Ref. 10).

On the other hand, the Type-2 adder implements the operation $(X - Y - Z = +S - 2C)$,
subtracting two inputs and adding only one; its output carry
must be subtracted in subsequent stages.
The truth table for both types of adders, Type-1 and Type-2, is listed in Table 3.
The logic function for C in both cases is put in a form suitable for implementation using the
balanced method, where both outputs are generated in a balanced way. For clarity,
the following shows how the logic function of C for Type-1 was reached.
$$C = \bar{X}Y\bar{Z} + X\bar{Y}\bar{Z} + XY\bar{Z} + XYZ,$$
$$C = Y(\bar{X}\bar{Z} + XZ) + X\bar{Z}.$$
Applying the rule $A\bar{B} = A(A \oplus B)$, it is clear that
$$C = Y\,\overline{(X \oplus Z)} + X(X \oplus Z).$$
Similarly, using the rule $A\bar{B} = A(A \oplus B)$, C for Type-2 can be reached.

Table 3. Truth table for Type-1 and Type-2 adders along with their logic functions.

             Type-1: −S + 2C                     Type-2: S − 2C
X  Y  Z      S   C   X + Y − Z (decimal)         S   C   X − Y − Z (decimal)
0  0  0      0   0    0                          0   0    0
0  0  1      1   0   −1                          1   1   −1
0  1  0      1   1    1                          1   1   −1
0  1  1      0   0    0                          0   1   −2
1  0  0      1   1    1                          1   0    1
1  0  1      0   0    0                          0   0    0
1  1  0      0   1    2                          0   0    0
1  1  1      1   1    1                          1   1   −1

S logic function:   Type-1: $S = X \oplus Y \oplus Z$;   Type-2: $S = X \oplus Y \oplus Z$
C logic function:   Type-1: $C = Y\,\overline{(X \oplus Z)} + X(X \oplus Z)$;   Type-2: $C = Z\,\overline{(X \oplus Y)} + Y(X \oplus Y)$
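
The two adder definitions can be checked against their arithmetic identities with a short behavioural model. This is a sketch at the logic level only, assuming S = X xor Y xor Z for both types and the C functions listed in Table 3; the function names `type1`, `type2` and `verify` are illustrative, and nothing here models the transistor-level circuits of Figs. 5 and 6.

```python
from itertools import product


def type1(x: int, y: int, z: int):
    """Type-1 (PPM) adder: X + Y - Z = -S + 2C."""
    s = x ^ y ^ z
    xz = x ^ z
    c = (y & (1 - xz)) | (x & xz)   # C = Y * NOT(X xor Z) + X * (X xor Z)
    return s, c


def type2(x: int, y: int, z: int):
    """Type-2 adder: X - Y - Z = S - 2C."""
    s = x ^ y ^ z
    xy = x ^ y
    c = (z & (1 - xy)) | (y & xy)   # C = Z * NOT(X xor Y) + Y * (X xor Y)
    return s, c


def verify() -> bool:
    """Check both arithmetic identities over all eight input combinations."""
    for x, y, z in product((0, 1), repeat=3):
        s1, c1 = type1(x, y, z)
        s2, c2 = type2(x, y, z)
        if x + y - z != -s1 + 2 * c1:
            return False
        if x - y - z != s2 - 2 * c2:
            return False
    return True


if __name__ == "__main__":
    print(verify())
```

Both adders share the same sum logic as a regular full adder; only the carry function and the sign given to each output differ.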


Fig. 7. Proposed (4:2) compressor.

Both types of adders, Type-1 and Type-2, have the flexibility to deal with negatively
weighted inputs without using inverters if they are stacked appropriately. The
proposed (4:2) compressor, illustrated in Fig. 7, uses only 28 transistors and does not
need inversion of inputs from P, as it can accommodate two negatively weighted
inputs. Thus, the proposed compressor reduces the number of inverters of a traditional
(4:2) compressor by 50%. At the last stage of the multiplier, it is necessary to
design a carry select adder that can handle the negatively weighted carry output
signals from the last (4:2) compressor stage [11]. A sample of the last stage is shown
in Fig. 8. The concept of the carry select adder is to compute alternative results in
parallel and subsequently select the correct result with single or multiple stage

Fig. 8. Proposed carry select adder.


hierarchical techniques. In carry select adders, both sum and carry bits are calculated
for the two possible values of the carry input bit.
Thus, the carry select adder trades a larger area for higher speed.
Once the carry-in is delivered, the correct computation is chosen
(using a MUX) to produce the desired output. Therefore, instead of waiting for the
carry-in before calculating the sum, the sum is output as soon as the carry-in arrives.
The time taken to compute the sum is thus hidden, which results in a good improvement
in speed. In Fig. 8, the sum of one (4:2) compressor is applied to the positive
input of the carry select adder in the same column, whereas the generated carry of the
compressor is applied to the negative input of the carry select adder in the next column.
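
The carry select principle described above can be illustrated with a behavioural sketch. It models a conventional unsigned carry select adder only: the negatively weighted inputs of the proposed adder in Fig. 8 are not modelled, and the function names, the block size of 4 bits and the sequential loop are assumptions made for illustration.

```python
def ripple_block(a: int, b: int, cin: int, width: int):
    """Add two width-bit blocks with a given carry-in; return (sum, carry_out)."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a: int, b: int, n: int, block: int = 4):
    """Behavioural carry select addition of two n-bit unsigned numbers."""
    result, carry = 0, 0
    for lo in range(0, n, block):
        width = min(block, n - lo)
        block_mask = (1 << width) - 1
        xa, xb = (a >> lo) & block_mask, (b >> lo) & block_mask
        # Both alternatives are prepared before the real carry-in is known.
        sum0, cout0 = ripple_block(xa, xb, 0, width)
        sum1, cout1 = ripple_block(xa, xb, 1, width)
        # The incoming carry acts as the MUX select signal.
        s, carry = (sum1, cout1) if carry else (sum0, cout0)
        result |= s << lo
    return result, carry


if __name__ == "__main__":
    total, cout = carry_select_add(200, 100, 8)
    print(total, cout)
```

In hardware the two alternatives of every block are evaluated concurrently, so only the MUX delay per block lies on the carry path; the Python loop evaluates them one after another and only mirrors the selection logic.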

3. Simulation Results

In this section, we compare the proposed M-MASAR multiplier to the MASAR [6] in
terms of power consumption and time delay. Two (8 × 8)-bit multipliers are implemented
to test the performance of the proposed multiplier. Carry select adders are
adopted as the final carry propagate adder in the multipliers for our evaluation.
The (8 × 8)-bit width is chosen for simplicity and to reduce the simulation time and
circuit complexity. Another important reason is that Ito et al. have shown through
simulation that their algorithm is effective for both array and tree multipliers of
32 bits or larger width. However, 8-bit MASAR multipliers caused an 8% increase in
total delay and negligible power savings compared to 8-bit array or tree multipliers.
Our modification of the 8-bit MASAR multiplier implementation makes it feasible
even for 8-bit multiplication, as follows.
We have simulated both multipliers (MASAR and M-MASAR) with SPECTRE
using the Normal device models of the TSMC n-well 0.18 µm CMOS technology. The
circuits are simulated at 1.8 V, 125 MHz with randomly generated test vectors.
The average power dissipation is then calculated. The simulated values for the time
delay, average power dissipation and EDP are presented in Table 4.
These results show that our proposed reduction in transistor count, the use of the
proposed (4:2) compressor, and the use of the carry select adder have a substantial effect on
the multiplier performance. The proposed 8-bit M-MASAR achieves reductions of
20% in power consumption, 10% in delay, and 36% in Energy Delay Product (EDP).
In addition, a 33% reduction in the number of transistors needed to generate the
partial products is achieved.

Table 4. Simulation results of 8-bit multipliers.

Multiplier type    Delay (ns)    Power (mW)    EDP (10^-21 J·s)
MASAR              1.996         2.143         8.54
M-MASAR            1.794         1.701         5.47
Reduction          10%           20%           36%
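
The percentage figures can be recomputed from the delay and power rows of Table 4. The sketch below assumes that the EDP is the energy (power × delay) multiplied by the delay, which reproduces the tabulated EDP values when power is in mW and delay in ns; the function and constant names are illustrative.

```python
def reduction_pct(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


def edp(power_mw: float, delay_ns: float) -> float:
    """Energy-delay product (P * t) * t, in units of 1e-21 J*s (mW * ns^2)."""
    return power_mw * delay_ns ** 2


# Delay (ns) and power (mW) rows of Table 4.
MASAR = (1.996, 2.143)
M_MASAR = (1.794, 1.701)

if __name__ == "__main__":
    d0, p0 = MASAR
    d1, p1 = M_MASAR
    print(f"delay reduction: {reduction_pct(d0, d1):.1f}%")
    print(f"power reduction: {reduction_pct(p0, p1):.1f}%")
    print(f"EDP: {edp(p0, d0):.2f} -> {edp(p1, d1):.2f}")
    print(f"EDP reduction: {reduction_pct(edp(p0, d0), edp(p1, d1)):.1f}%")
```

Under this assumption the delay reduction comes out at about 10%, the power reduction at about 20% and the EDP reduction at about 36%, matching the figures quoted in the text.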


4. Conclusion
We have modified the implementation of a low power multiplication algorithm for
high performance circuits to further reduce its power dissipation, area and delay. The
implementation results show that, for an 8-bit implementation, the proposed multiplier
has lower power consumption (20%) and lower delay (10%) while requiring fewer
transistors than the original MASAR multiplier design [6]. These
advantages are largely due to two reasons. First, the proposed scheme
generates and compresses the partial products using $[3n \times (n-1)]$ NAND gates. Second,
our proposed low-power (4:2) compressor reduces the number of inverters
at the output stage by 50%. Thus, the proposed architecture can be applied to
digital circuits and systems where low power consumption and high speed are critical
design constraints. For higher order multipliers, the delay improvement and power
consumption reductions are expected to be even higher. This is simply because the
area of the partial products grows quadratically with the number of multiplier bits,
whereas the output stage grows linearly.

Acknowledgement
This research was conducted at UWO, London, Ontario. The authors would like to
thank NSERC-Canada and UWO for their financial support.

References
1. A. P. Chandrakasan, S. Sheng and R. W. Brodersen, Low-power CMOS digital design,
IEICE Trans. Electron. 75 (1992) 371.
2. P.-M. Seidel, Dynamic operand modification for reduced power multiplication, in Conf.
Record of the 36th Asilomar Conf. Signals, Systems and Computers, 2002, Vol. 1, IEEE
2002, pp. 52–56.
3. T. Ahn and K. Choi, Dynamic operand interchange for reduced power, Electron. Lett. 33
(1997) 2118.
4. M. Fujino and V. G. Moshnyaga, Dynamic operand transformation for low-power mul-
tiplier-accumulator design, in Proc. 2003 Int. Symp. Circuits and Systems, 2003.
ISCAS'03, Vol. 5 IEEE 2003, pp. V–345.
5. A. A. Fayed and M. A. Bayoumi, A novel architecture for low-power design of parallel
multipliers, in Proc. IEEE Computer Society Workshop on VLSI, 2001, IEEE 2001,
pp. 149–154.
6. M. Ito, D. Chinnery and K. Keutzer, Low power multiplication algorithm for switching
activity reduction through operand decomposition, in Proc. 21st Int. Computer Conf.
Design 2003, IEEE 2003, pp. 21–26.
7. M. DeRenzo, M. J. Irwin and N. Vijaykrishnan, Designing leakage aware multipliers, in
Proc. 17th Int. Conf. VLSI Design, 2004, IEEE 2004, pp. 654–657.
8. A. M. Shams and M. A. Bayoumi, A novel high-performance CMOS 1-bit full-adder cell,
IEEE Trans. Circuits Syst. II, Analog Digital Signal Proces. 47 (2000) 478–481.


9. R. Mudassir, H. El-Razouk, Z. Abid and W. Wang, New designs of 14-transistor PPM
adder, in Canadian Conf. Electrical and Computer Engineering, 2005, IEEE 2005,
pp. 1739–1742.
10. R. Mudassir and Z. Abid, New parallel multipliers based on low power adders, in
Canadian Conf. Electrical and Computer Engineering, 2005, IEEE 2005, pp. 694–697.
11. H. R. Srinivas, K. K. Parhi and L. A. Montalvo, Radix 2 division with over-redundant
quotient selection, IEEE Trans. Comput. 46 (1997) 85–92.