A High-Speed Radix-4 Multiplexer-Based Array Multiplier: Dimitris Bekiaris Kiamal Z. Pekmestzi Chris Papachristou

A High-Speed Radix-4 Multiplexer-Based Array Multiplier

Dimitris Bekiaris
National Technical University of Athens
Iroon Polytechneiou 9, Zografou
Athens, Greece
+30 2107721800
mpekiaris@microlab.ntua.gr

Kiamal Z. Pekmestzi
National Technical University of Athens
Iroon Polytechneiou 9, Zografou
Athens, Greece
Tel +30 2107722500
pekmes@microlab.ntua.gr

Chris Papachristou
Case Western Reserve University
Cleveland, Ohio 44106, USA
cap2@case.edu
ABSTRACT
This paper presents a new radix-4 multiplexer-based array multiplier, based on a multiplication scheme shown in a previous work, where 4-to-1 multiplexers are used for the computation of partial products. In the proposed design, the number of rows of the array is halved compared to the initial multiplexer-based scheme, as two bits of each operand are processed at each step. The proposed scheme is compared to the Modified-Booth array multiplier and to the initial multiplexer-based array scheme. The compared designs are coded in VHDL and synthesized using the TSMC 0.13μm technology library. For bit-lengths greater than 16, the synthesis results show an 11-22% improvement in critical time delay compared to the Modified-Booth array, at the expense of an area overhead of 3.8-16%. Compared to the initial multiplexer-based scheme, the proposed design is faster at every bit-length examined and requires less area for bit-lengths greater than 32.

Categories and Subject Descriptors
B.7.m [Integrated Circuits]: Miscellaneous-VLSI-Standard cells-Algorithms implemented in hardware

General Terms
Algorithms, Design, Performance

Keywords
Array Multiplier, Multiplexer-Based, Modified Booth, Radix-4 Multiplier.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
GLSVLSI’08, May 4–6, 2008, Orlando, Florida, USA.
Copyright 2008 ACM 978-1-59593-999-9/08/05...$5.00

1. INTRODUCTION
Multipliers are among the most critical components for the efficient implementation of Digital Signal Processing (DSP) and cryptographic algorithms. Consequently, numerous design techniques have been presented in the literature, targeting fast multiplication schemes with reduced area and low power dissipation.

In general, carry-save array multipliers are the most appropriate for combining high performance with regular interconnection patterns, which permit an efficient VLSI realization. Conventional iterative array schemes with canonic interconnections are presented in [1] and [2], where the critical time of the operation depends linearly on the bit-length of both operands. Faster implementations are presented in [3], [4], and [9], using the Wallace tree structure for the addition of partial products, but these schemes produce a non-regular layout. In [5], the Modified-Booth encoding approach is introduced, reducing the critical time to half compared to bit-parallel array schemes. Recent Modified-Booth array implementations are shown in [11] and [12]. The drawback of these designs is the final carry-propagate adder, which imposes a significant delay as the bit-length increases.

In [6], the presented algorithm introduces a novel multiplication technique, where 4-to-1 multiplexers are used for the computation of partial products. The proposed design results in an array with canonic interconnections, reduced hardware complexity and high-speed operation, compared to the circuits given in [8]. Based on this concept, a radix-4 array multiplier is also suggested in [7]. This scheme is strongly based on [6], but its addition of partial products is better suited to FPGA-based platforms than to a technology-dependent ASIC core.

This paper introduces a new radix-4 array multiplier, where two bits of each operand are processed at each step. The proposed scheme is based on the approach adopted in [6] and produces a circuit with a regular structure and better performance compared to the Modified-Booth array multiplier.

In the next section, the radix-4 multiplexer-based multiplication algorithm is described. In Section 3, the structure of the proposed multiplier is presented. Synthesis results from the Synopsys Design Compiler tool and a comparison with other array schemes are given in Section 4. Finally, conclusions and hints for future work are given in Section 5.

2. THE RADIX-4 MULTIPLICATION ALGORITHM
In [6], the proposed algorithm is based on the truncation of the most significant bits of the operands X and Y, according to the following equations:

$$X = 2^{n-1}x_{n-1} + X_{n-1} \quad\text{and}\quad Y = 2^{n-1}y_{n-1} + Y_{n-1} \qquad (1)$$

Now, the product of X and Y is written:

$$P = 2^{2n-2}x_{n-1}y_{n-1} + 2^{n-1}\{x_{n-1}Y_{n-1} + y_{n-1}X_{n-1}\} + X_{n-1}Y_{n-1} \qquad (2)$$

We define $P_i = X_iY_i$. So, $P_i$ can be computed as follows:

$$P_i = 2^{2i-2}x_{i-1}y_{i-1} + 2^{i-1}\{x_{i-1}Y_{i-1} + y_{i-1}X_{i-1}\} + P_{i-1} \qquad (3)$$

Applying Equation 3 recursively from $P_n = P$ down to $P_0 = 0$, and noting that $X_0 = Y_0 = 0$, the final product P is:

$$P = \sum_{i=0}^{n-1} x_iy_i2^{2i} + \sum_{i=1}^{n-1}\{x_iY_i + y_iX_i\}2^{i} \qquad (4)$$
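As an illustration of Equation 4, the following Python sketch evaluates the radix-2 multiplexer-based product in software. A small function stands in for the 4-to-1 multiplexer that selects $0$, $X_i$, $Y_i$ or $S_i = X_i + Y_i$ under the control of $x_i$ and $y_i$, where $X_i$ and $Y_i$ are the values of the i least significant bits of X and Y. The helper names (`bit`, `low`, `z_mux`, `mux_multiply_radix2`) are hypothetical; this is a behavioral check of the arithmetic for unsigned n-bit operands, not a model of the hardware array of [6].

```python
def bit(v: int, i: int) -> int:
    """Return bit i of the non-negative integer v."""
    return (v >> i) & 1


def low(v: int, i: int) -> int:
    """Return the value of the i least significant bits of v (X_i in the text)."""
    return v & ((1 << i) - 1)


def z_mux(xi: int, yi: int, Xi: int, Yi: int) -> int:
    """4-to-1 multiplexer computing Z_i = x_i*Y_i + y_i*X_i.

    Select lines (x_i, y_i): 00 -> 0, 01 -> X_i, 10 -> Y_i, 11 -> S_i = X_i + Y_i.
    """
    return (0, Xi, Yi, Xi + Yi)[2 * xi + yi]


def mux_multiply_radix2(X: int, Y: int, n: int) -> int:
    """Evaluate Equation 4 for unsigned n-bit operands X and Y."""
    P = 0
    for i in range(n):
        xi, yi = bit(X, i), bit(Y, i)
        P += (xi & yi) << (2 * i)                      # x_i y_i 2^(2i)
        P += z_mux(xi, yi, low(X, i), low(Y, i)) << i  # Z_i 2^i (Z_0 = 0)
    return P


if __name__ == "__main__":
    # Exhaustive check against ordinary multiplication for 6-bit operands.
    assert all(mux_multiply_radix2(a, b, 6) == a * b
               for a in range(64) for b in range(64))
    print(mux_multiply_radix2(13, 11, 4))  # 143
```

Besides the selection itself, the only additional operand the multiplexer needs is $S_i = X_i + Y_i$, used when both control bits are 1.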
The numbers $X_i$ and $Y_i$ are the numbers formed by the i least significant bits of X and Y, respectively. The terms in braces in Equation 4 comprise the main part of the algorithm and can be named as:

$$Z_i = x_iY_i + y_iX_i \qquad (5)$$

The values of the terms $Z_i$ depend on the variables $x_i$ and $y_i$, as shown in Table 1. These terms are computed by 4-to-1 multiplexers, which are controlled by $x_i$ and $y_i$.

Table 1. The values of $Z_i$
$x_i$   $y_i$   $Z_i$
0       0       0
1       0       $Y_i$
0       1       $X_i$
1       1       $S_i = X_i + Y_i$

This technique halves the number of full-adders of an array multiplier and folds the parallelogram-shaped array into a triangular scheme.

In the radix-4 approach of this algorithm, the operands are decomposed according to the relations:

$$X = 2^{n-1}x_{n-1} + 2^{n-2}x_{n-2} + X_{n-2} = 2^{n-2}\mathbf{x}_{n-2} + X_{n-2} \qquad (6)$$

$$Y = 2^{n-1}y_{n-1} + 2^{n-2}y_{n-2} + Y_{n-2} = 2^{n-2}\mathbf{y}_{n-2} + Y_{n-2} \qquad (7)$$

where the bold-typed terms $\mathbf{x}_i$ and $\mathbf{y}_i$ are the 2-bit digits:

$$\mathbf{x}_i = 2x_{i+1} + x_i \quad\text{and}\quad \mathbf{y}_i = 2y_{i+1} + y_i \qquad (8)$$

Thus, the operands produced at each multiplication step can be written iteratively, according to the following relations:

$$X_i = 2^{i-2}\mathbf{x}_{i-2} + X_{i-2} \quad\text{and}\quad Y_i = 2^{i-2}\mathbf{y}_{i-2} + Y_{i-2} \qquad (9)$$

Based on Equations 6, 7 and 9, the product of X and Y can be written as follows:

$$P_n = 2^{2n-4}\mathbf{x}_{n-2}\mathbf{y}_{n-2} + 2^{n-2}\{\mathbf{x}_{n-2}Y_{n-2} + \mathbf{y}_{n-2}X_{n-2}\} + P_{n-2} \qquad (10)$$

The product $P_i$ can be computed by the following equation:

$$P_i = 2^{2i-4}\mathbf{x}_{i-2}\mathbf{y}_{i-2} + 2^{i-2}\{\mathbf{x}_{i-2}Y_{i-2} + \mathbf{y}_{i-2}X_{i-2}\} + P_{i-2} \qquad (11)$$

Since the index i is reduced by 2 at each step, it is replaced by a new index k according to the relation i = 2k. Now, the final product can be computed as follows:

$$P = \sum_{k=0}^{(n/2)-1}\mathbf{x}_{2k}\mathbf{y}_{2k}2^{4k} + \sum_{k=1}^{(n/2)-1}\{\mathbf{x}_{2k}Y_{2k} + \mathbf{y}_{2k}X_{2k}\}2^{2k} \qquad (12)$$

where k denotes the multiplication step. The first sum of Equation 12 computes $x_{2k}y_{2k}$ and $x_{2k+1}y_{2k+1}$, along with the sum $\{x_{2k+1}y_{2k} + x_{2k}y_{2k+1}\}$, since by Equation 8, $\mathbf{x}_{2k}\mathbf{y}_{2k} = 4x_{2k+1}y_{2k+1} + 2\{x_{2k+1}y_{2k} + x_{2k}y_{2k+1}\} + x_{2k}y_{2k}$.

The main computation concerns the addition of the terms in braces in Equation 12, which, expanded by Equation 8, are:

$$Z_{2k} = 2\{x_{2k+1}Y_{2k} + y_{2k+1}X_{2k}\} + x_{2k}Y_{2k} + y_{2k}X_{2k} \qquad (13)$$

The first two terms are computed by 4-to-1 multiplexers controlled by $x_{2k+1}$ and $y_{2k+1}$, while the last two terms are computed by 4-to-1 multiplexers controlled by $x_{2k}$ and $y_{2k}$. According to the previous relations, the final product is:

$$P = \sum_{k=0}^{(n/2)-1}\mathbf{x}_{2k}\mathbf{y}_{2k}2^{4k} + \sum_{k=1}^{(n/2)-1}Z_{2k}2^{2k} \qquad (14)$$

3. THE PROPOSED MULTIPLIER
In the proposed radix-4 design, the terms $x_{2k}y_{2k}$ and $x_{2k+1}y_{2k+1}$ are computed in Cell I and Cell III, shown in Figures 1 and 3. The term $\{x_{2k+1}y_{2k} + x_{2k}y_{2k+1}\}$ is computed by Cell II, shown in Figure 2. At the same time, Cell I produces the sums $s_{2k}$ and $s_{2k+1}$, using a 2-bit carry-lookahead adder. The carry output $c_{2k+2}$ is driven to the carry-lookahead adder of Cell I in the next row. The terms $Z_{2k}$ are mapped mainly to Cell IV and, at the boundary of the array, to Cell II. The circuit of the proposed radix-4 multiplier is shown in Figure 5 for an 8x8-bit unsigned multiplication example.

Figure 1. The structure of Cell I

Figure 2. The structure of Cell II
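The radix-4 decomposition can be checked in the same way. The following Python sketch evaluates Equation 14, forming each $Z_{2k}$ of Equation 13 from two 4-to-1 multiplexer outputs: one controlled by $(x_{2k+1}, y_{2k+1})$, one by $(x_{2k}, y_{2k})$, both fed with $X_{2k}$ and $Y_{2k}$, with the first output weighted by 2 (a one-position shift). The helper names are hypothetical, unsigned operands and an even n are assumed, and only the algebra is verified; the cell-level organization of Figure 5 is not modeled.

```python
def bit(v: int, i: int) -> int:
    """Return bit i of v."""
    return (v >> i) & 1


def low(v: int, i: int) -> int:
    """Value of the i least significant bits of v (X_i, Y_i in the text)."""
    return v & ((1 << i) - 1)


def digit(v: int, i: int) -> int:
    """2-bit digit of Equation 8: 2*v_(i+1) + v_i."""
    return (v >> i) & 3


def z_mux(x_sel: int, y_sel: int, Xi: int, Yi: int) -> int:
    """4-to-1 multiplexer returning x_sel*Yi + y_sel*Xi."""
    return (0, Xi, Yi, Xi + Yi)[2 * x_sel + y_sel]


def z2k(X: int, Y: int, k: int) -> int:
    """Equation 13: Z_2k built from two multiplexer outputs."""
    i = 2 * k
    Xi, Yi = low(X, i), low(Y, i)
    upper = z_mux(bit(X, i + 1), bit(Y, i + 1), Xi, Yi)  # x_(2k+1) Y_2k + y_(2k+1) X_2k
    lower = z_mux(bit(X, i), bit(Y, i), Xi, Yi)          # x_2k Y_2k + y_2k X_2k
    return (upper << 1) + lower


def mux_multiply_radix4(X: int, Y: int, n: int) -> int:
    """Evaluate Equation 14 for unsigned n-bit operands (n even)."""
    if n % 2:
        raise ValueError("n must be even")
    P = 0
    for k in range(n // 2):
        P += (digit(X, 2 * k) * digit(Y, 2 * k)) << (4 * k)  # x_2k y_2k 2^(4k)
        P += z2k(X, Y, k) << (2 * k)                          # Z_2k 2^(2k), Z_0 = 0
    return P


if __name__ == "__main__":
    # 8x8-bit case, as in Figure 5: exhaustive comparison with a * b.
    assert all(mux_multiply_radix4(a, b, 8) == a * b
               for a in range(256) for b in range(256))
```

Only n/2 loop iterations are needed, mirroring the halved number of rows compared to the radix-2 scheme.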
Figure 3. The structure of Cell III

Figure 4. The structure of Cell IV
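Section 3 states that Cell I produces the sums $s_{2k}$ and $s_{2k+1}$ with a 2-bit carry-lookahead adder and passes the carry $c_{2k+2}$ to Cell I of the next row. Assuming, as our reading of Table 1 suggests, that these are bits of $S = X + Y$ needed by the multiplexers, the following Python sketch shows a generic 2-bit carry-lookahead adder and how chaining it row by row yields the full sum. The function names are hypothetical, and the gate-level structure of the actual cell (Figure 1) is not reproduced.

```python
def cla2(a1: int, a0: int, b1: int, b0: int, c0: int) -> tuple[int, int, int]:
    """Generic 2-bit carry-lookahead adder on single bits.

    Returns (s1, s0, c2) with 2*s1 + s0 + 4*c2 == (2*a1 + a0) + (2*b1 + b0) + c0.
    """
    g0, p0 = a0 & b0, a0 ^ b0               # generate / propagate of bit 0
    g1, p1 = a1 & b1, a1 ^ b1               # generate / propagate of bit 1
    c1 = g0 | (p0 & c0)                     # carry into bit 1
    c2 = g1 | (p1 & g0) | (p1 & p0 & c0)    # carry-out formed directly (look-ahead)
    return p1 ^ c1, p0 ^ c0, c2


def row_sums(X: int, Y: int, n: int) -> int:
    """Chain one cla2 per row k (n even), passing c_(2k+2) to row k+1.

    Hypothetical behavioral model; returns X + Y assembled from the row outputs.
    """
    s, c = 0, 0                              # c starts as c_0 = 0
    for k in range(n // 2):
        i = 2 * k
        s1, s0, c = cla2((X >> (i + 1)) & 1, (X >> i) & 1,
                         (Y >> (i + 1)) & 1, (Y >> i) & 1, c)
        s |= (s1 << (i + 1)) | (s0 << i)     # place s_(2k+1) and s_2k
    return s | (c << n)                      # final carry is the top bit of X + Y
```

In hardware, the point of the look-ahead form is that `c2` depends only on the operand bits and `c0`, so it does not wait for `c1`; in the Python model both forms give the same values.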
Figure 5. An 8x8-bit example of the proposed radix-4 multiplexer-based multiplier for unsigned numbers
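At the right edge of each row in Figure 5, full adders (FA) and 4-bit carry-lookahead adders (CLA-4) convert the low-order sum and carry bits into the final product bits $p_0$ to $p_{15}$. The following Python sketch shows the two generic building blocks involved: a bit-parallel carry-save (3:2) adder, made of one full adder per bit position, and a 4-bit carry-lookahead adder. The function names are hypothetical, and the exact wiring of the figure is not reproduced.

```python
def csa(a: int, b: int, c: int) -> tuple[int, int]:
    """Bit-parallel 3:2 carry-save adder: one full adder per bit position.

    Returns (sum_word, carry_word) with sum_word + carry_word == a + b + c.
    """
    sum_word = a ^ b ^ c                               # full-adder sum bits
    carry_word = ((a & b) | (a & c) | (b & c)) << 1    # majority bits, weighted x2
    return sum_word, carry_word


def cla4(a: int, b: int, c0: int = 0) -> tuple[int, int]:
    """4-bit carry-lookahead adder for 0 <= a, b < 16; returns (sum4, carry_out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]    # generate bits
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]  # propagate bits
    # Every carry is a flat sum of products of g, p and c0 (no rippling).
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
          | (p[2] & p[1] & p[0] & c0))
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    carries = (c0, c1, c2, c3)
    sum4 = sum((p[i] ^ carries[i]) << i for i in range(4))
    return sum4, c4
```

For example, `csa(5, 6, 7)` reduces three operands to `(4, 14)`, and `cla4(4, 14)` returns `(2, 1)`, that is, 18 = 1·16 + 2. The carry-save step defers all carry propagation, so only the final carry-lookahead stage resolves carries.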
The four least significant bits of the sums and carries produced at each level are added using a 4-bit carry-save adder and a 4-bit carry-lookahead adder, which converts the result into binary form. Also, the leftmost cell of each row is simplified to Cell II, since the most significant bit of the digits $\mathbf{x}_{n-1}$ and $\mathbf{y}_{n-1}$ is always zero. The presented design has a regular structure and canonic interconnections. For larger bit-lengths, additional rows with a similar structure need to be added.

4. IMPLEMENTATION AND SYNTHESIS RESULTS
We synthesized our designs targeting the minimum critical times, while attempting to limit the area overhead. The three compared circuits are implemented using the TSMC 0.13μm technology, and the results are derived from the Synopsys Design Compiler synthesis tool. Table 2 shows that the proposed circuit is significantly faster than the Modified-Booth array for every bit-length n > 16, by 11-22%, at the expense of an area overhead of 3.8-16%, as shown in Table 3. However, this additional area percentage decreases as the bit-length grows beyond 32. The area results of Table 3 are given in μm² and are based on the critical times of Table 2; they also include the input and output registers that the synthesis tool needs for the computation of critical time.

As can be seen in Table 2 and Figure 6, the proposed multiplier has a significantly shorter critical time than the multiplexer-based multiplier. Also, Table 3 and Figure 7 demonstrate the superiority of the proposed scheme over the multiplexer-based multiplier in terms of area, for bit-lengths greater than 32.
Table 2. Critical Time Comparisons (ns)
n     Proposed Multiplier   Mux-Based [6]   Modified Booth
8     2.51                  2.69            2.21
16    3.59                  4.36            3.51
24    4.41                  5.98            4.90
32    5.31                  8.77            6.02
48    6.97                  11.2            8.74
54    7.70                  12.37           9.24
64    8.88                  14.51           10.88

Table 3. Area Comparisons (μm²)
n     Proposed Multiplier   Mux-Based [6]   Modified Booth
8     8690.17               8628.74         8704.52
16    27137.30              26055.02        22728.57
24    55221.93              55437.89        42855.18
32    86295.81              81535.71        72070.10
48    167877.62             176129.29       144354.48
54    201791.56             223348.12       183670.60
64    260315.65             301777.50       250683.58
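The percentages quoted in Section 4 are derived from Tables 2 and 3. As a sketch of how such figures can be reproduced, the following Python code computes, for each bit-length, the critical-time reduction and the area overhead of the proposed multiplier relative to the Modified-Booth array. Using the Modified-Booth values as the reference (denominator) is our assumption, because the baseline is not stated, so individual values may differ slightly from the quoted ranges. The names `DELAY_NS`, `AREA_UM2`, `delay_reduction` and `area_overhead` are hypothetical.

```python
# Values transcribed from Table 2 (critical time, ns) and Table 3 (area, um^2).
# Tuple order: (proposed, mux_based_6, modified_booth).
DELAY_NS = {
    8: (2.51, 2.69, 2.21), 16: (3.59, 4.36, 3.51), 24: (4.41, 5.98, 4.90),
    32: (5.31, 8.77, 6.02), 48: (6.97, 11.2, 8.74), 54: (7.70, 12.37, 9.24),
    64: (8.88, 14.51, 10.88),
}
AREA_UM2 = {
    8: (8690.17, 8628.74, 8704.52), 16: (27137.30, 26055.02, 22728.57),
    24: (55221.93, 55437.89, 42855.18), 32: (86295.81, 81535.71, 72070.10),
    48: (167877.62, 176129.29, 144354.48), 54: (201791.56, 223348.12, 183670.60),
    64: (260315.65, 301777.50, 250683.58),
}


def delay_reduction(n: int) -> float:
    """Critical-time reduction of the proposed design vs. Modified Booth, in %."""
    proposed, _, booth = DELAY_NS[n]
    return 100.0 * (booth - proposed) / booth


def area_overhead(n: int) -> float:
    """Extra area of the proposed design vs. Modified Booth, in %."""
    proposed, _, booth = AREA_UM2[n]
    return 100.0 * (proposed - booth) / booth


if __name__ == "__main__":
    for n in sorted(DELAY_NS):
        print(f"n={n:2d}  delay reduction {delay_reduction(n):6.1f} %"
              f"  area overhead {area_overhead(n):6.1f} %")
```

A negative delay reduction (as for n = 8 and n = 16) means the proposed design is slower than the Modified-Booth array at that bit-length.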
Figure 6. Critical Time Diagram (critical time in ns versus bit-length n, for the Proposed, Mux-Based [6] and Modified-Booth multipliers)

Figure 7. Area Diagram (area in μm² versus bit-length n, for the Proposed, Mux-Based [6] and Modified-Booth multipliers)
5. CONCLUSION
In this paper, a new radix-4 array multiplier based on 4-to-1 multiplexers is presented. The proposed circuit is faster than the radix-2 multiplexer-based multiplier and, for bit-lengths greater than 16, faster than the Modified-Booth array scheme. The proposed scheme can be further improved by redesigning the 4-to-1 multiplexers with fewer transistors, in order to substantially reduce the circuit's area and further increase its performance, especially compared to the Modified-Booth array. Also, the presented circuit can be extended to implement 2's complement multiplication.

6. REFERENCES
[1] Y. Oowaki et al., “A Sub-10-ns 16x16 Multiplier Using 0.6-μm CMOS Technology”, IEEE Journal of Solid-State Circuits, vol. 22, no. 5, October 1987.
[2] R. Sharma et al., “A 6.75-ns 16x16-bit Multiplier in Single-Level-Metal CMOS Technology”, IEEE Journal of Solid-State Circuits, vol. 24, no. 4, August 1989.
[3] C. Wallace, “A Suggestion for a Fast Multiplier”, IEEE Transactions on Electronic Computers, vol. 13, pp. 14-17, February 1964.
[4] Niichi Itoh et al., “A 600-MHz 54x54-bit Multiplier with Rectangular-Styled Wallace Tree”, IEEE Journal of Solid-State Circuits, vol. 36, no. 2, February 2001.
[5] O. L. MacSorley, “High-Speed Arithmetic in Binary Computers”, Proceedings of the IRE, vol. 49, pp. 67-71, January 1961.
[6] Kiamal Z. Pekmestzi, “Multiplexer-Based Array Multipliers”, IEEE Transactions on Computers, vol. 48, no. 1, January 1999.
[7] Osama Al-Khaleel et al., “A Large Scale Adaptable Multiplier for Cryptographic Applications”, Proceedings of the First NASA/ESA Conference on Adaptive Hardware and Systems (AHS), 2006.
[8] S. Nakamura and K.-Y. Chu, “A Single-Chip Parallel Multiplier by MOS Technology”, IEEE Transactions on Computers, vol. 37, no. 3, pp. 274-282, March 1988.
[9] Ki-seon Cho et al., “54x54-bit Radix-4 Multiplier Based on Modified-Booth Algorithm”, Proceedings of GLSVLSI’03, April 28-29, 2003, Washington, USA.
[10] Zhijun Huang and Milos Ercegovac, “High-Performance Low-Power Left-to-Right Array Multiplier Design”, IEEE Transactions on Computers, vol. 54, no. 3, March 2005.
[11] Wen-Chang Yeh and Chein-Wei Jen, “High-Speed Booth-Encoded Parallel Multiplier Design”, IEEE Transactions on Computers, vol. 49, no. 7, pp. 692-701, July 2000.
[12] Leonardo L. de Oliveira et al., “Array Hybrid Multiplier versus Modified-Booth Multiplier: Comparing Area and Power Consumption of Layout Implementations of Signed Radix-4 Architectures”, Proceedings of the 47th MWSCAS, July 25-28, 2004, Hiroshima, Japan.
