The Efficient Implementation of An Array Multiplier

Guoping Wang
Indiana University Purdue University, Fort Wayne
wang@engr.ipfw.edu
James Shield
University of Oklahoma
Abstract

Multiplication is one of the basic and critical operations in
computation. Efficient implementations of multipliers are required in
many applications. In this paper, a new implementation of the array
multiplier for unsigned numbers is proposed which significantly reduces
the silicon area compared to a recently published array multiplier, with
no penalty in speed or power. The proposed scheme is applicable to VLSI
and FPGA implementations, and it can be easily extended to signed-number
computations.

1 Introduction

Multiplication is one of the most critical operations in many
computational systems. For VLSI implementations of multipliers,
array-based multipliers [1][2] and tree-based multipliers [3][4] are
well known and often used. Among tree-based multipliers, the Wallace
tree [3] and the Dadda scheme [4] focus on decreasing the depth of
partial-product processing. Fast multipliers using redundant binary
number representations with an addition tree [5][6] have also been
proposed to obtain a more regular and modular structure than the Wallace
and Dadda trees. The Booth encoding technique [7][8] is also used to
reduce the number of partial products. However, tree-based multipliers
are less regular and modular because of the interconnections between the
stages of the addition tree; array-based multipliers, by contrast, are
used more often in VLSI implementations because of their regular layout.
Recently a MUX-based multiplier was proposed in [2]. However, a direct
implementation of this algorithm leads to inefficient use of silicon
area. A modified structure of [2] with reduced power is discussed in
[9]. In this paper, an improved array multiplier is proposed which
reduces the size of the completed design by 40-50% without a speed or
power penalty. The rest of the paper is organized as follows. In
Section 2, array multipliers are briefly reviewed and their
architectures and implementations are discussed. In Section 3, the
proposed array multiplier structure is presented. Simulation results are
shown in Section 4. Section 5 summarizes the paper.

2 Array-Based Multiplication

An array multiplier was first proposed in [1]. It has good repeatability
of unit cells and is very regular in its structure. It uses only short
wires that connect each full adder to horizontally, vertically, or
diagonally adjacent full adders. Thus, it results in a very simple and
efficient layout in VLSI implementation. Figure 1 shows a design of a
5 × 5 unsigned multiplier.

Figure 1. Design of a 5 × 5 Array Multiplier

A MUX-based unsigned multiplier scheme was proposed in [2] which permits
high-speed operation with a regular structure. At each step, one bit of
the multiplier and one bit of the multiplicand are processed.
Figure 2. Pekmestzi Multiplier and Cell I, II used in the Multiplier

As a result, the algorithm is symmetric: the multiplier and multiplicand
can be interchanged. The details are explained in the following.

Consider two positive N-bit integers X and Y given by

$$ X = x_{N-1} x_{N-2} \cdots x_0 = \sum_{j=0}^{N-1} x_j 2^j \tag{1} $$

$$ Y = y_{N-1} y_{N-2} \cdots y_0 = \sum_{j=0}^{N-1} y_j 2^j \tag{2} $$

where N is the bit length of the multiplier and multiplicand. Define

$$ X_{N-1} = x_{N-2} x_{N-3} \cdots x_0 = \sum_{j=0}^{N-2} x_j 2^j, \qquad
   Y_{N-1} = y_{N-2} y_{N-3} \cdots y_0 = \sum_{j=0}^{N-2} y_j 2^j \tag{3} $$

These (N-1)-bit numbers represent the original N-bit numbers without the
most significant bit. More generally, X_j and Y_j denote the numbers
formed by the j least significant bits of X and Y. Therefore, the
product P of X and Y is represented by:

$$ P = X \cdot Y
     = \sum_{j=0}^{N-1} x_j y_j 2^{2j} + \sum_{j=1}^{N-1} \{ x_j Y_j + X_j y_j \} 2^j
     = \sum_{j=0}^{N-1} x_j y_j 2^{2j} + \sum_{j=1}^{N-1} Z_j 2^j \tag{4} $$

Table 1 indicates the result Z_j depending on the inputs x_j and y_j.

Table 1. Selection of Summation Z_j

  x_j   y_j   Z_j
  0     0     0
  0     1     X_j
  1     0     Y_j
  1     1     S_j = X_j + Y_j

Figure 2 is a diagram of the cellular array proposed by Kiamal
Pekmestzi [2]. Cell I and Cell II in Figure 2 are used in the internal
architecture. The preceding structure is implemented to compute both the
partial products and the resulting summation, and its output forms the
final result of the 8-bit multiplication.

3 Proposed Array Multiplier

3.1 Improved Array Multiplier

The array multiplier implementation discussed above routes many of the
same signals to the same cells, and many of the components are
unnecessarily duplicated, causing increased design size and possible
delays in execution. Figure 3 is a diagram of the
proposed multiplier (4-bit multiplication). This diagram does not
display all interconnections; however, one can see the simplicity in the
updated design. This design is a more direct implementation of the
product formula from the previously derived multiplication:

$$ P = \sum_{j=0}^{N-1} x_j y_j 2^{2j} + \sum_{j=0}^{N-1} \{ x_j Y_j + y_j X_j \} 2^j \tag{5} $$

In this proposed implementation, the multiplexers from Cell I in
Figure 2 are removed from the array and brought to the front of the
process as a form of preprocessing. The AND gates in the partial
products are the result of Equation (5). The example in Figure 3 clearly
indicates that Table 1 can be implemented as a 4-to-1 multiplexer with
x_j and y_j as the selection bits for the inputs 0, X_j, Y_j and S_j,
resulting in Z_j as the single output. This example also displays how
the partial products are related to the original multiplier,
multiplicand, and sum.

Figure 3. 4-bit Multiplier Example (X = 1001, Y = 1100, S = 0101, P = 01101100)

Figure 4 is a block diagram of the 4-to-1 multiplexer implementation of
Table 1. Different methodologies were investigated to determine the
optimum multiplexer for this function. Since S_j is not a single bit,
but rather a bit-vector of increasing width formed by the addition of
X_j and Y_j, only the jth bit needs to be calculated. As seen in the
example in Figure 4, when j=2, S2 is a 3-bit number (101) which is the
3-bit sum of X3 and Y3. The first version of this multiplexer simply
pulled the appropriate values from a pre-calculated S_j sum, much like
the example; therefore, for each N-bit multiplication, an N-bit sum
would be calculated and stored in a register. The multiplexer would then
copy the appropriate value from the sum (in the case of x_j and y_j
both equal to 1) to the corresponding partial product register.

A second approach for this multiplexer was based on calculating the
first two least significant bits of the sum in the first multiplexer
rather than in a separate register, then passing the result to the next
multiplexer, j+1. The next most significant bit of the sum would then be
calculated and appended to the two sum bits passed into the
multiplexer, so each multiplexer would only need to calculate a single
sum bit. The new sum, now one bit larger, would be passed to the next
multiplexer for use. In the example in Figure 4, the 2-bit sum (01)
would be calculated in the multiplexer for j=1, passed to the next
multiplexer (j=2), and concatenated with the next significant bit
(corresponding to X_2 and Y_2) to obtain S_2 = 101.

The third form of the 4-to-1 multiplexer, the one chosen for the
proposed multiplier, calculates the necessary sum within each
multiplexer. Therefore S_{0-1}, S_{0-2}, ..., S_{0-(N-1)} are calculated
separately and in parallel, corresponding to multiplexers
j = 1, 2, ..., N-1. The internal equations for S_j are as follows:

$$ \begin{aligned}
S_0 &= X_0 \oplus Y_0 \\
S_1 &= X_1 \oplus Y_1 \oplus S_0 \\
&\;\;\vdots \\
S_{N-1} &= X_{N-1} \oplus Y_{N-1} \oplus X_{N-2} \oplus Y_{N-2}
\end{aligned} \tag{6} $$

The S_i, X_j and Y_j signals are calculated in the traditional manner
for a 4-to-1 multiplexer.
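
The preprocessing described in this section can be checked with a short
functional sketch. It is not the authors' VHDL. It assumes, as the
definition in Equation (3) suggests, that X_j and Y_j are the numbers
formed by the j least significant bits of X and Y, and it computes S_j
by ordinary integer addition rather than through the bit-level
equations (6). The function names (low_bits, mux_z, product_from_mux,
shared_sum_matches) are illustrative only. The sketch applies the
Table 1 selection for every j, accumulates the AND terms and the
multiplexer outputs with the weights of Equation (4), and also checks
one reading of the pre-calculated-sum approach: when x_j = y_j = 1, S_j
coincides with the low j+1 bits of the single sum X + Y.

```python
def low_bits(value, j):
    """Return the number formed by the j least significant bits of value."""
    return value & ((1 << j) - 1)


def mux_z(x, y, j):
    """Table 1: select Z_j from 0, X_j, Y_j or S_j using bits x_j and y_j."""
    xj = (x >> j) & 1
    yj = (y >> j) & 1
    X_j, Y_j = low_bits(x, j), low_bits(y, j)
    if xj and yj:
        return X_j + Y_j      # S_j; assumption: a plain integer sum
    if yj:
        return X_j            # x_j = 0, y_j = 1
    if xj:
        return Y_j            # x_j = 1, y_j = 0
    return 0


def product_from_mux(x, y, n):
    """Equation (4): AND terms at weight 2^(2j) plus Z_j at weight 2^j."""
    and_terms = sum((((x >> j) & (y >> j)) & 1) << (2 * j) for j in range(n))
    mux_terms = sum(mux_z(x, y, j) << j for j in range(1, n))
    return and_terms + mux_terms


def shared_sum_matches(x, y, n):
    """Check that S_j equals the low j+1 bits of X + Y when x_j = y_j = 1."""
    total = x + y
    for j in range(1, n):
        if (x >> j) & (y >> j) & 1:
            if mux_z(x, y, j) != low_bits(total, j + 1):
                return False
    return True


if __name__ == "__main__":
    # Operands of the 4-bit example in Figure 3: X = 1001, Y = 1100.
    print(format(product_from_mux(0b1001, 0b1100, 4), "08b"))
```

For the operands of Figure 3 the sketch prints 01101100, the product row
shown in that figure. Because each call to mux_z depends only on X, Y
and j, the N-1 selections are independent of one another, which is the
property the parallel multiplexer design relies on.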
Figure 4. Specialized 4-to-1 Multiplexer

Table 2 is a summary of the three methods used in this paper for the
multiplexer design. This table indicates the benefit of the parallel
calculation of the sum within each multiplexer. Because each multiplexer
is based upon the input data (X and Y) and each output Z from all
multiplexers is independent, the partial products generated from all N-1
multiplexers are processed simultaneously. This parallel processing of
the partial products is the greatest cost saving of the proposed
multiplier.

Table 2. 4-to-1 MUX Design Comparisons

  Multiplexer Design   Size of MUX (N)   Benefit
  Sum in register      8                 Easy to implement
  Pass-sum             8                 Easily scalable
  Parallel Calc        8                 Fast and easily scalable

The final factor in the generation of the partial product is the
presence of the AND signals. These signals are present in the first
summation in Equation (5), repeated below as (7):

$$ P = \sum_{j=0}^{N-1} x_j y_j 2^{2j} + \sum_{j=0}^{N-1} \{ x_j Y_j + y_j X_j \} 2^j \tag{7} $$

The first summation is implemented using ANDs and the second summation
is implemented using the 4-to-1 multiplexer described above. These are
the only two functions required to compute the partial products,
regardless of the size of the operand N.

3.2 Minimizing Partial Products

Because the generated partial products are both non-symmetrical and
irregular, a direct implementation of Wallace trees is not practical for
the minimization of the partial products. However, an adapted Wallace
tree format is proposed. This new form is based on carry-save adders
just as in Wallace trees. A Wallace tree implementation provides a
slightly faster and smaller implementation; however, this paper does not
investigate the design and implementation of Wallace trees for irregular
partial products.

Looking at the 8-bit example, the number of partial products for a given
column is not uniform from one column to its neighboring columns.
Therefore, each column must be treated independently from its neighbors.
Only the carries from each column will be related. Figure 5 is an
example of how adders can be combined for a given column to add the
partial products.

Figure 5. Partial Product Reduction for 8-bit Multiplier

In this example, only full adders are needed; however, if only two
inputs exist to be added, then half adders are needed. It can also be
seen in this example how the adders are chained together by the carries
from the previous columns and the output (sum) of the adders in the same
column.

4 Simulation Results

To test and verify the performance of the proposed multiplier, VHDL code
was written and simulated using a VHDL simulator for both the proposed
multiplier and published array multipliers. As can be seen in Sections 2
and 3, this design can be easily adapted to any size multiplier. The
design consists of N-1 multiplexers, AND gates, and adders (half and
full adders). Because this design has only four basic components tied
with interconnect signals, it is very suitable for VLSI implementation;
however, design complexity increases as N approaches large values. This
is due to the large number of adders needed to compress the partial
products with a height of (N/2+1).

Table 3 and Table 4 compare the two implementations and give the
simulation results. The tables compare the recently published array
multiplier, implemented with 8-bit inputs in a Xilinx Virtex FPGA, to
the improved design with the same input and device requirements. The
proposed 8-bit design does not offer significant speed improvement
(0.5 ns faster); however, the reduction in size by nearly a factor of
two is a great improvement (43% smaller).
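
The column-wise reduction of Section 3.2 can be illustrated with a small
bit-level sketch. It is a model built on stated assumptions rather than
the authors' VHDL. The bit placement (the AND term x_j y_j in column 2j,
and the j+1 bits of Z_j in columns j to 2j) follows from Equation (4)
with S_j computed by ordinary addition. The reduction strategy (a full
adder for three bits at a time, a half adder when only two remain, every
carry pushed into the next column) is one simple choice and not
necessarily the arrangement of Figure 5. All function names are
hypothetical.

```python
def partial_product_columns(x, y, n):
    """Place every partial-product bit in the column of its binary weight."""
    cols = [[] for _ in range(2 * n)]
    for j in range(n):
        xj, yj = (x >> j) & 1, (y >> j) & 1
        cols[2 * j].append(xj & yj)                  # AND term, weight 2^(2j)
        if j == 0:
            continue
        mask = (1 << j) - 1
        X_j, Y_j = x & mask, y & mask                # j low-order bits
        z = (X_j if yj else 0) + (Y_j if xj else 0)  # Table 1 selection
        for b in range(j + 1):                       # Z_j occupies j+1 bits
            cols[j + b].append((z >> b) & 1)
    return cols


def full_adder(a, b, c):
    """Add three bits; return (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def half_adder(a, b):
    """Add two bits; return (sum, carry)."""
    return a ^ b, a & b


def reduce_columns(cols):
    """Compress each column to one bit, chaining carries to the next column."""
    cols = [list(c) for c in cols] + [[]]            # spare column for a final carry
    result, full, half = 0, 0, 0
    for k in range(len(cols) - 1):
        bits = cols[k]
        while len(bits) > 1:
            if len(bits) >= 3:
                s, carry = full_adder(bits.pop(), bits.pop(), bits.pop())
                full += 1
            else:
                s, carry = half_adder(bits.pop(), bits.pop())
                half += 1
            bits.append(s)                           # sum stays in this column
            cols[k + 1].append(carry)                # carry moves to the next column
        if bits:
            result |= bits[0] << k
    return result, full, half


if __name__ == "__main__":
    columns = partial_product_columns(0b1001, 0b1100, 4)
    product, full_count, half_count = reduce_columns(columns)
    print(format(product, "08b"), full_count, half_count)
```

Under this placement the tallest column of an 8-bit multiplier holds
five bits, which is consistent with the height of N/2+1 mentioned in
Section 4. The adder counts returned by reduce_columns depend on the
reduction order chosen here and should not be read as the gate counts of
the proposed design.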
Table 3. Multiplier Comparison

                               8-bit Multiplier          16-bit Multiplier
  Design                       # of gates   Max delay    # of gates   Max delay
  Pekmestzi Array Multiplier   1413         57.2 ns      2684         108.8 ns
  Improved Array Multiplier    801          56.7 ns      1410         92.1 ns

Table 4. Multiplier Comparison

                               32-bit Multiplier         64-bit Multiplier
  Design                       # of gates   Max delay    # of gates   Max delay
  Pekmestzi Array Multiplier   5803         282.9 ns     14256        508.1 ns
  Improved Array Multiplier    3220         281.7 ns     6129         413.0 ns

5 Summary

A new improved array multiplier has been shown to be both smaller and
faster than the currently published array multipliers. The cost saving
of the proposed implementation comes from a slightly different
implementation of the multiplexer. Where Pekmestzi designed for a
strictly array-based multiplier, the proposed multiplier aims for both
speed and minimal size, and it is also scalable to other multiplier
sizes. The proposed array multiplier is applicable to VLSI and FPGA
implementation. Current research is extending the unsigned multiplier to
signed numbers.

References:

[1] S. D. Pezaris, "A 40-ns 17-bit by 17-bit array multiplier," IEEE
    Transactions on Computers, vol. 20, pp. 442-447, April 1971.
[2] K. Z. Pekmestzi, "Multiplexer-based array multipliers," IEEE
    Transactions on Computers, vol. 48, no. 1, pp. 15-23, Jan. 1999.
[3] C. Wallace, "A suggestion for a fast multiplier," IEEE Transactions
    on Electronic Computers, vol. 13, pp. 14-17, 1964.
[4] L. Dadda, "Some schemes for parallel multipliers," Alta Frequenza,
    vol. 34, pp. 349-356, March 1965.
[5] N. Takagi, H. Yasuura, and S. Yajima, "High-speed VLSI
    multiplication algorithm with a redundant binary addition tree,"
    IEEE Transactions on Computers, vol. 34, no. 99, pp. 789-796,
    Sept. 1985.
[6] H. Makino, Y. Nakase, H. Suzuki, H. Morinaka, H. Shinohara, and
    K. Mashiko, "An 8.8-ns 54x54-bit multiplier with high speed
    redundant binary architecture," IEEE Journal of Solid-State
    Circuits, vol. 31, no. 6, pp. 773-783, June 1996.
[7] A. Booth, "A signed binary multiplication technique," Quarterly
    Journal of Mechanics and Applied Mathematics, vol. 4, pp. 236-240,
    1951.
[8] L. MacSorley, "High speed arithmetic in binary computers," Proc.
    IRE, vol. 49, Jan. 1961.
[9] Y. Wang, Y. Jiang and E. Sha, "On area-efficient low power array
    multipliers," in 8th IEEE International Conference on Electronics,
    Circuits and Systems, 2001, vol. 3, pp. 1429-1432, Sept. 2001.
