ABSTRACT
This paper presents a new radix-4 multiplexer-based array multiplier, based on a multiplication scheme shown in a previous work, where 4-to-1 multiplexers are used for the computation of partial products. In the proposed design, the number of rows of the array is halved compared to the initial multiplexer-based scheme, as two bits from both operands are processed at each step. The proposed scheme is compared to the Modified-Booth array multiplier and to the initial multiplexer-based array scheme. The compared designs are coded in VHDL and synthesized using the TSMC 0.13μm technology library. The synthesis results show an 11-22% improvement in critical time compared to the Modified-Booth array, at the expense of an area overhead of 3.8-16%. Compared to the initial multiplexer-based scheme, there is a significant improvement in terms of area and critical time.

Categories and Subject Descriptors
B.7.m [Integrated Circuits]: Miscellaneous-VLSI-Standard cells-Algorithms implemented in hardware

General Terms
Algorithms, Design, Performance

... Wallace tree structure for the addition of partial products, but these schemes produce a non-regular layout. In [5], the Modified-Booth encoding approach is introduced, which halves the critical time compared to bit-parallel array schemes. Recent Modified-Booth array implementations are shown in [11] and [12]. The drawback of these designs is the final carry-propagate adder, which imposes a significant delay as the bit-length increases.

The algorithm presented in [6] introduces a novel multiplication technique, where 4-to-1 multiplexers are used for the computation of partial products. That design results in an array with canonic interconnections, reduced hardware complexity and high-speed operation, compared to the circuits given in [8]. Based on this concept, a radix-4 array multiplier is also suggested in [7]. That scheme is strongly based on [6], but its addition of partial products is better suited to FPGA-based platforms than to a technology-dependent ASIC core.

This paper introduces a new radix-4 array multiplier, where two bits of both operands are processed at each step. The proposed scheme is based on the approach adopted in [6] and produces a circuit with regular structure and better performance, compared to the Modified-Booth array multiplier.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
GLSVLSI’08, May 4–6, 2008, Orlando, Florida, USA.
Copyright 2008 ACM 978-1-59593-999-9/08/05...$5.00

We define $P_i = X_i Y_i$. So, $P_i$ can be computed as follows:

$$P_i = 2^{2i-2} x_{i-1} y_{i-1} + 2^{i-1} \{ x_{i-1} Y_{i-1} + y_{i-1} X_{i-1} \} + P_{i-1} \quad (3)$$

Based on the above equation, the final product $P$ is:

$$P = \sum_{i=0}^{n-1} x_i y_i 2^{2i} + \sum_{i=1}^{n-1} \{ x_i Y_i + y_i X_i \} 2^{i} \quad (4)$$
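The recurrence of Equation 3 and the closed form of Equation 4 can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes that $X_i$ and $Y_i$ denote the numbers formed by the $i$ least significant bits of the unsigned operands (so $X_0 = Y_0 = 0$ and $P_0 = 0$), and the helper names `bit`, `low_bits`, `product_by_recurrence` and `product_by_sum` are hypothetical.

```python
def bit(value, i):
    """Return bit i (0 = least significant) of a non-negative integer."""
    return (value >> i) & 1


def low_bits(value, i):
    """Return the number formed by the i least significant bits (X_i in the text)."""
    return value & ((1 << i) - 1)


def product_by_recurrence(X, Y, n):
    """Iterate Equation 3 from P_0 = 0 up to P_n (assumed indexing, see lead-in)."""
    P = 0  # P_0 = X_0 * Y_0 = 0
    for i in range(1, n + 1):
        x, y = bit(X, i - 1), bit(Y, i - 1)
        # Bracketed term: x_{i-1} * Y_{i-1} + y_{i-1} * X_{i-1}
        bracket = x * low_bits(Y, i - 1) + y * low_bits(X, i - 1)
        P = ((x * y) << (2 * i - 2)) + (bracket << (i - 1)) + P
    return P


def product_by_sum(X, Y, n):
    """Evaluate the two sums of Equation 4 directly."""
    diagonal = sum((bit(X, i) * bit(Y, i)) << (2 * i) for i in range(n))
    cross = sum(
        (bit(X, i) * low_bits(Y, i) + bit(Y, i) * low_bits(X, i)) << i
        for i in range(1, n)
    )
    return diagonal + cross


if __name__ == "__main__":
    n = 6
    # Exhaustive check over all n-bit unsigned operand pairs.
    for X in range(1 << n):
        for Y in range(1 << n):
            assert product_by_recurrence(X, Y, n) == X * Y
            assert product_by_sum(X, Y, n) == X * Y
    print(f"Equations 3 and 4 reproduce X * Y for all {n}-bit operands")
```

The second sum starts at $i = 1$ because the bracketed term vanishes for $i = 0$, where $X_0 = Y_0 = 0$.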
The numbers $X_i$ and $Y_i$ are the numbers formed by the $i$ least significant bits of $X$ and $Y$ respectively. The terms in brackets in Equation 4 comprise the main part of the algorithm and can be named as:

$$Z_i = x_i Y_i + y_i X_i \quad (5)$$

The values of the terms $Z_i$ depend on the variables $x_i$ and $y_i$, as shown in Table 1. These terms are computed by 4-to-1 multiplexers, which are controlled by $x_i$ and $y_i$.

Table 1
x_i   y_i   Z_i
0     0     0
1     0     Y_i
0     1     X_i
1     1     S_i = X_i + Y_i

This technique halves the number of full-adders of an array multiplier and leads to the folding of the parallel shape into a triangular scheme.

In the radix-4 approach of this algorithm, the operands are decomposed according to the relations:

$$X = 2^{n-1} x_{n-1} + 2^{n-2} x_{n-2} + X_{n-2} = 2^{n-2} \mathbf{x}_{n-2} + X_{n-2} \quad (6)$$

$$Y = 2^{n-1} y_{n-1} + 2^{n-2} y_{n-2} + Y_{n-2} = 2^{n-2} \mathbf{y}_{n-2} + Y_{n-2} \quad (7)$$

where the bold-typed terms $\mathbf{x}_i$ and $\mathbf{y}_i$ are the 2-bit digits:

$$\mathbf{x}_i = 2 x_{i+1} + x_i \quad \text{and} \quad \mathbf{y}_i = 2 y_{i+1} + y_i \quad (8)$$

$$X_i = 2^{i-2} \mathbf{x}_{i-2} + X_{i-2} \quad \text{and} \quad Y_i = 2^{i-2} \mathbf{y}_{i-2} + Y_{i-2} \quad (9)$$

Based on Equations 6, 7 and 9, the product of $X$ and $Y$ can be written as follows:

$$P_n = 2^{2n-4} \mathbf{x}_{n-2} \mathbf{y}_{n-2} + 2^{n-2} \{ \mathbf{x}_{n-2} Y_{n-2} + \mathbf{y}_{n-2} X_{n-2} \} + P_{n-2} \quad (10)$$

The product $P_i$ can be computed by the next equation:

$$P_i = 2^{2i-4} \mathbf{x}_{i-2} \mathbf{y}_{i-2} + 2^{i-2} \{ \mathbf{x}_{i-2} Y_{i-2} + \mathbf{y}_{i-2} X_{i-2} \} + P_{i-2} \quad (11)$$

The main computation concerns the addition of the terms:

$$\mathbf{Z}_{2k} = 2 \{ x_{2k+1} Y_{2k} + y_{2k+1} X_{2k} \} + x_{2k} Y_{2k} + y_{2k} X_{2k} \quad (13)$$

The first two terms are computed by 4-to-1 multiplexers controlled by $x_{2k+1}$ and $y_{2k+1}$, while the last two terms are computed by 4-to-1 multiplexers controlled by $x_{2k}$ and $y_{2k}$. According to the previous relations, the final product is:

$$P = \sum_{k=0}^{(n/2)-1} \mathbf{x}_{2k} \mathbf{y}_{2k} 2^{4k} + \sum_{k=1}^{(n/2)-1} \mathbf{Z}_{2k} 2^{2k} \quad (14)$$

... are computed in Cell I and Cell III, shown in Figures 1 and 3. The term $\{ x_{2k+1} y_{2k} + x_{2k} y_{2k+1} \}$ is computed by Cell II, shown in Figure 2. At the same time, Cell I produces the sums $s_{2k}$ and $s_{2k+1}$, using a 2-bit carry-lookahead adder. The carry output $c_{2k+2}$ is driven to the carry-lookahead adder of Cell I in the next row. The terms $\mathbf{Z}_{2k}$ are mapped mainly to Cell IV and, at the boundary of the array, to Cell II. The circuit of the proposed radix-4 multiplier is shown in Figure 5 for an 8x8-bit unsigned multiplication example.

Figure 1. The structure of Cell I

Figure 2. The structure of Cell II
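The radix-4 decomposition of Equations 13 and 14 can be illustrated with a short behavioural model. This is a sketch under stated assumptions rather than a description of the actual cells: operands are unsigned with an even bit-length $n$, $X_{2k}$ and $Y_{2k}$ are taken as the $2k$ least significant bits of the operands, and the function names `mux4` and `radix4_mux_product` are hypothetical. The multiplexer follows the selection rule of Table 1 (0, $Y$, $X$ or $X + Y$).

```python
def mux4(sel_x, sel_y, X_low, Y_low):
    """Behavioural 4-to-1 multiplexer: returns sel_x * Y_low + sel_y * X_low."""
    if sel_x and sel_y:
        return X_low + Y_low  # the precomputed sum S = X + Y
    if sel_x:
        return Y_low
    if sel_y:
        return X_low
    return 0


def radix4_mux_product(X, Y, n):
    """Evaluate Equation 14 for unsigned n-bit operands (n even)."""
    if n % 2 != 0:
        raise ValueError("n must be even for the radix-4 decomposition")
    P = 0
    for k in range(n // 2):
        # Bits 2k and 2k+1 of each operand form one radix-4 digit.
        x0, x1 = (X >> (2 * k)) & 1, (X >> (2 * k + 1)) & 1
        y0, y1 = (Y >> (2 * k)) & 1, (Y >> (2 * k + 1)) & 1
        X_low = X & ((1 << (2 * k)) - 1)  # X_{2k}
        Y_low = Y & ((1 << (2 * k)) - 1)  # Y_{2k}
        digit_x = 2 * x1 + x0
        digit_y = 2 * y1 + y0
        # Equation 13: one multiplexer pair per bit of the digit.
        Z = 2 * mux4(x1, y1, X_low, Y_low) + mux4(x0, y0, X_low, Y_low)
        P += (digit_x * digit_y) << (4 * k)
        P += Z << (2 * k)  # Z is 0 for k = 0 because X_0 = Y_0 = 0
    return P


if __name__ == "__main__":
    n = 8
    # Exhaustive check against ordinary integer multiplication.
    for X in range(1 << n):
        for Y in range(1 << n):
            assert radix4_mux_product(X, Y, n) == X * Y
    print(f"Equation 14 reproduces X * Y for all {n}-bit operands")
```

The loop runs $n/2$ times, which mirrors the halving of the number of array rows relative to the radix-2 scheme; the model says nothing about the carry-save structure or timing of the hardware.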
Figure 5. An 8x8-bit example of the proposed radix-4 multiplexer-based multiplier for unsigned numbers
The four least significant bits of sums and carries produced at each level are added using a 4-bit carry-save adder and a 4-input carry-lookahead adder, which transforms the result into binary form. Also, the leftmost cell of each row is simplified to Cell II, given that the most significant bit of the digits $x_{n-1}$ and $y_{n-1}$ is always zero. The presented design yields a regular structure and canonic interconnections. For larger bit-lengths, additional rows with similar structure need to be added.

4. IMPLEMENTATION AND SYNTHESIS RESULTS
We synthesized our designs targeting the minimum critical times, while attempting to limit the area overhead. The three compared circuits are implemented using the TSMC 0.13μm technology and the results are obtained with the Synopsys Design Compiler synthesis tool. Table 2 shows that the proposed circuit is significantly faster than the Modified-Booth array for any bit-length n > 16, by 11-22%, at the expense of an area overhead of 3.8-16%, as shown in Table 3. However, this additional percentage of area is reduced as the bit-length grows beyond 32. The area results of Table 3 are given in μm2, based on the critical times of Table 2, and they also include the input and output registers needed for the computation of critical time by the synthesis tool.

As can be seen in Table 2 and Figure 6, the proposed multiplier has a significantly shorter critical time than the multiplexer-based multiplier. Also, Table 3 and Figure 7 demonstrate the superiority of the proposed scheme over the multiplexer-based multiplier in terms of area, for bit-lengths greater than 32.
Table 2. Critical Time Comparisons (ns)

n     Proposed Multiplier   Mux-Based [6]   Modified-Booth
8     2.51                  2.69            2.21
16    3.59                  4.36            3.51
24    4.41                  5.98            4.90
32    5.31                  8.77            6.02
48    6.97                  11.2            8.74
54    7.70                  12.37           9.24
64    8.88                  14.51           10.88

Table 3. Area Comparisons in μm2

n     Proposed Multiplier   Mux-Based [6]   Modified-Booth
8     8690.17               8628.74         8704.52
16    27137.30              26055.02        22728.57
24    55221.93              55437.89        42855.18
32    86295.81              81535.71        72070.10
48    167877.62             176129.29       144354.48
54    201791.56             223348.12       183670.60
64    260315.65             301777.50       250683.58
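The tabulated values can be turned into relative figures with a few lines of arithmetic. The sketch below transcribes the Proposed and Modified-Booth columns of Tables 2 and 3 and computes a delay gain and an area overhead relative to the Modified-Booth design. The paper does not state how its percentages were derived, so the definitions used here (differences normalised by the Modified-Booth value) are an assumption, and the function names are hypothetical; the results need not coincide with the ranges quoted in the text.

```python
# Critical time in ns from Table 2: n -> (proposed, modified_booth)
CRITICAL_TIME_NS = {
    8: (2.51, 2.21), 16: (3.59, 3.51), 24: (4.41, 4.90), 32: (5.31, 6.02),
    48: (6.97, 8.74), 54: (7.70, 9.24), 64: (8.88, 10.88),
}

# Area in um^2 from Table 3: n -> (proposed, modified_booth)
AREA_UM2 = {
    8: (8690.17, 8704.52), 16: (27137.30, 22728.57), 24: (55221.93, 42855.18),
    32: (86295.81, 72070.10), 48: (167877.62, 144354.48),
    54: (201791.56, 183670.60), 64: (260315.65, 250683.58),
}


def delay_gain_percent(n):
    """Reduction in critical time versus Modified-Booth (assumed definition)."""
    proposed, booth = CRITICAL_TIME_NS[n]
    return 100.0 * (booth - proposed) / booth


def area_overhead_percent(n):
    """Extra area versus Modified-Booth (assumed definition)."""
    proposed, booth = AREA_UM2[n]
    return 100.0 * (proposed - booth) / booth


if __name__ == "__main__":
    for n in sorted(CRITICAL_TIME_NS):
        print(f"n={n:2d}  delay gain {delay_gain_percent(n):6.1f}%  "
              f"area overhead {area_overhead_percent(n):6.1f}%")
```

A negative delay gain means the proposed design is slower than the Modified-Booth array at that bit-length, which is the case for the two smallest sizes in Table 2.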
[Figure 6: critical time versus bit-length n (8 to 64) for the Proposed, Mux-Based [6] and Modified-Booth multipliers]

[Figure 7: area (μm2) versus bit-length n (8 to 64) for the Proposed, Mux-Based [6] and Modified-Booth multipliers]