
IEEE TRANSACTIONS ON COMPUTERS, VOL. 48, NO. 6, JUNE 1999

Brief Contributions

Parallel Multiplication Using Fast Sorting Networks
Paul D. Fiore, Student Member, IEEE

Abstract: A recent paper describes the use of Svoboda's binary counter in the
construction of fast parallel multipliers. The resulting approach was shown to be
faster than the conventional Dadda multiplier when the wordlength $N$ was small.
Unfortunately, the growth in the number of gates of that method was $O(N^3)$ and
the speed was $O(N)$. In this paper, Batcher's bitonic sorting network and other
efficient networks replace the Svoboda counter. The asymptotic growth rate in
gates of these new methods is $O(N^2 \log^2 N)$, and the speed is $O(\log^2 N)$.

Index Terms: Parallel multiplier, partial product reduction, Dadda's counter,
4:2 compressor, bitonic sorting network.

INTRODUCTION

Fast parallel multipliers are important for high speed signal
processing systems implemented in custom hardware, and much
research effort has been devoted to techniques for their construction.
In this paper, we apply sorting methods to the problem of
fast, unsigned multiplication.
The methods due to Dadda [1] and Wallace [2], as well as the
newer compressor methods [3], [4], [5], all perform the multiplication
in three steps: A diamond-shaped bit-product matrix is
formed, the bit-product matrix is reduced to two rows, and, finally,
the two rows are summed by a fast-carry adder to form the final
product. These methods use arrays of full-adder cells to perform
the second step. The number of rows in the initial bit-product
matrix is iteratively reduced by a 3:2 ratio for Wallace's and one of
Dadda's schemes, or by a 4:2 ratio for the compressor methods,
yielding, in general, an $O(\log N)$ time to reduce the matrix to two rows.
A different approach was taken by Drerup and Swartzlander
[6], [7]. They use a parallel layout of a method for counting binary
bits due to Svoboda [8]. This method counts the number of ones in
a column of bits by first sorting all the bits, propagating all the ones
to the bottom of the list and all the zeros to the top. The number of
ones in the column is now determined by detecting the location of
an adjacent one and zero. Fig. 1 shows a circuit for detecting the
transition point and converting this into an unsigned binary count
value. In [6], [7], this is known as a Sort-Detect approach.
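
The Sort-Detect idea can be illustrated in software. The following is a minimal sketch, not the circuit of Fig. 1: Python's built-in `sorted` stands in for the hardware sorting network, and the function name `sort_detect_count` and the virtual padding bits are assumptions made for illustration.

```python
def sort_detect_count(column):
    """Count the ones in a column of bits by sorting, then finding the 0/1 boundary."""
    n = len(column)
    # Stand-in for the sorting network: zeros rise to the top, ones sink to the bottom.
    s = sorted(column)
    # Assumed padding: a virtual 0 above and a virtual 1 below guarantee that
    # exactly one adjacent (0, 1) pair exists, even for all-zero or all-one columns.
    padded = [0] + s + [1]
    for i in range(n + 1):
        if padded[i] == 0 and padded[i + 1] == 1:
            # i real zeros sit above the transition, so n - i ones sit below it.
            return n - i
    raise ValueError("column must contain only 0s and 1s")


print(sort_detect_count([1, 0, 1, 1]))  # 3
```

In hardware, the position of the transition would be encoded directly into a binary count rather than found by a loop.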
Instead of explicitly counting the number of ones, the authors
improve the method by noting that every other bit in the sorted
output represents a carry propagation into the next column of the
bit-product array for the next iteration of reduction. The remaining
odd bits represent the possibility that the number of ones in the
current column was an odd number. A single sum bit can be
propagated to the current column of the next iteration of the
reduction by forming a type of odd-parity of the sorted column
bits. This sort-only method thus reduces a column of $N$ bits into
a column of $N/2$ bits and another column of one bit, nearly halving
the number of bits in the bit-product matrix for the next iteration.
Moreover, the $N/2$ carry bits are already sorted, allowing for a
simplification of all the remaining reduction iterations. The authors
present a constant-time column reordering structure to combine
the sum bit from one column compression with the ordered
column of carries from an adjacent column compression.
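
A small software model may help make the sort-only reduction concrete. This is a sketch under assumptions: `sorted` again stands in for the sorting network, the name `sort_only_reduce` is hypothetical, and the particular AND/NOT/OR form of the odd-parity detector is one plausible reading of "a type of odd-parity", not necessarily the circuit used in [6], [7].

```python
def sort_only_reduce(column):
    """Split a column of bits into sorted carry bits and a single sum bit."""
    s = sorted(column, reverse=True)   # ones first: s[0] >= s[1] >= ...
    # Every second sorted bit is 1 exactly when at least 2, 4, 6, ... ones are
    # present, so these bits hold floor(count / 2) ones and are already sorted.
    carries = s[1::2]
    # Assumed odd-parity detector: the count is odd when some bit in an
    # odd-numbered position (1st, 3rd, ...) is 1 while its successor is 0 or absent.
    padded = s + [0]
    sum_bit = 0
    for i in range(0, len(s), 2):
        sum_bit |= padded[i] & (1 - padded[i + 1])
    return carries, sum_bit


carries, sum_bit = sort_only_reduce([1, 0, 1, 0, 1])
print(carries, sum_bit)  # [1, 0] 1
```

The carries have weight two relative to the current column, so the identity `2 * sum(carries) + sum_bit == sum(column)` holds for every input.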
Unfortunately, this straightforward parallel implementation of
Svoboda's counter results in a multiplier structure with a size of
$O(N^3)$ and a delay of $O(N)$. This ultimately limits the utility of the
method to small wordlengths. The Svoboda sorting method
iteratively sorts the column by comparing adjacent bit pairs and
by swapping them so that the smaller bit value propagates to the
top of the pair. At worst, it would take $N-1$ levels of comparisons
and swaps for a bit to propagate the length of the column, resulting
in the delay previously stated. The size claim can be understood by
noting that there are $2N-1$ columns in the first multiplier stage,
with column heights $1, \ldots, N-1, N, N-1, \ldots, 1$. Since the area of a
Svoboda counter for a column height of $k$ is $O(k^2)$, the total area is
$O\bigl(2(1^2 + \cdots + (N-1)^2) + N^2\bigr) = O(N^3)$.
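
The cubic growth can be checked numerically. The sketch below assumes a unit cost of exactly $k^2$ per column of height $k$ (the text only states $O(k^2)$); the function name `svoboda_total_area` is hypothetical.

```python
def svoboda_total_area(n):
    """Sum k^2 over the 2n-1 column heights 1, ..., n-1, n, n-1, ..., 1."""
    heights = list(range(1, n)) + [n] + list(range(n - 1, 0, -1))
    assert len(heights) == 2 * n - 1
    return sum(k * k for k in heights)


for n in (4, 8, 16, 32):
    total = svoboda_total_area(n)
    # Closed form of 2*(1^2 + ... + (n-1)^2) + n^2 is n*(2n^2 + 1)/3.
    assert total == n * (2 * n * n + 1) // 3
    print(n, total, round(total / n**3, 3))
```

Under this cost model the ratio of the total to $N^3$ approaches 2/3, consistent with the $O(N^3)$ size claim.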

FAST SORTING NETWORKS

If we examine this method closely, we see that it is merely a
parallel implementation of a bubble sort. We can vastly improve
the overall speed of this step by using a more efficient sorting
technique. A readily available method, known as a bitonic
sorting network, is due to Batcher and is described in [9], [10]. A
sequence is bitonic if it increases then decreases or decreases then
increases (monotonic sequences are considered bitonic as well).
Bitonic sorting networks are constructed by recursively merging
sorted lists of increasing length. First, $N$ length-one sorted
lists are pairwise merged into $N/2$ length-two sorted lists. These
are, in turn, pairwise merged into $N/4$ length-four sorted lists. The
process continues until two length-$N/2$ sorted lists are merged into
the final sorted list. The parallel merging networks themselves are
constructed in a recursive fashion from simpler structures [9]. The
number of levels of comparison-exchange operations in the bitonic
network is $\log_2 N\,(1 + \log_2 N)/2$.
Fig. 2 shows the basic comparison-exchange element for sorting
two bit values (Fig. 2a) and a graphical shorthand representation
(Fig. 2b). Fig. 3 shows the construction of an $N = 16$ element sorting
network composed of these simple comparison-exchange elements.
Notice that there are $\log_2 16\,(1 + \log_2 16)/2 = 10$ levels,
each consisting of $N/2$ elements. Elements in the same level can
execute in parallel, so the delay of this network is 10 gate delays.
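
The level count can be reproduced with a short generator for a bitonic sorter. This is a sketch of the standard textbook construction, not necessarily the exact wiring of Fig. 3; the names `bitonic_network` and `apply_network` are hypothetical, and modelling the one-bit comparison-exchange as an AND/OR pair is an assumption about the circuit of Fig. 2.

```python
from itertools import product


def bitonic_network(n):
    """Levels of a bitonic sorter for n = 2**m inputs.

    Each level is a list of (lo, hi) index pairs; after the comparator acts,
    the smaller value sits at index lo and the larger at index hi.
    """
    levels = []
    k = 2
    while k <= n:              # merge sorted runs of length k/2 into runs of length k
        j = k // 2
        while j >= 1:          # compare elements j apart
            level = []
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    # Bit k of the index selects ascending or descending order.
                    level.append((i, partner) if i & k == 0 else (partner, i))
            levels.append(level)
            j //= 2
        k *= 2
    return levels


def apply_network(levels, bits):
    """Run a column of 0/1 values through the network."""
    v = list(bits)
    for level in levels:
        for lo, hi in level:
            # One-bit comparison-exchange: min is AND, max is OR.
            v[lo], v[hi] = v[lo] & v[hi], v[lo] | v[hi]
    return v


levels = bitonic_network(16)
print(len(levels), {len(level) for level in levels})  # 10 {8}

# Exhaustive check on all 0/1 inputs for a smaller network.
net8 = bitonic_network(8)
assert all(apply_network(net8, b) == sorted(b) for b in product((0, 1), repeat=8))
```

Because the inputs here are single bits, checking every 0/1 pattern is a complete test of the network.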
The author is with the Signal Processing Center of Sanders, a Lockheed
Martin Company, Nashua, NH 03061. E-mail: pfiore@sanders.com.
Manuscript received 7 Mar. 1997; revised 29 Aug. 1997.
For information on obtaining reprints of this article, please send e-mail to:
tc@computer.org, and reference IEEECS Log Number 104114.
0018-9340/99/$10.00 © 1999 IEEE

Fig. 1. Parallel counter based on a sorting network and transition detection.

Fig. 2. (a) Comparison-exchange circuit for one bit numbers. (b) Graphical
shorthand notation.

The layout in Fig. 3 has each comparison-exchange being
performed between elements that are a constant distance apart
within each level. For networks with $N$ a power of 2, there is a
simple pattern for determining this sequence of distances, which is
$1;\ 2, 1;\ 4, 2, 1;\ 8, 4, 2, 1;\ \ldots;\ N/2, N/4, \ldots, 4, 2, 1$ (the up/down pattern
for the arrows is a little more complicated). The last $\log_2 N$ levels in
the sequence are identical to the radix-2 $N$-point decimation-in-frequency
Fast Fourier Transform (FFT) interconnect pattern. Also,
level 7 in Fig. 3 is representative of a half-cleaner [9], which we
will refer to later in the paper.
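
The distance pattern can be generated mechanically. The following sketch assumes $N$ is a power of two; the function name `comparator_distances` is hypothetical.

```python
def comparator_distances(n):
    """Distance between compared elements at each level, for n a power of two."""
    distances = []
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            distances.append(j)
            j //= 2
        k *= 2
    return distances


d = comparator_distances(16)
print(d)         # [1, 2, 1, 4, 2, 1, 8, 4, 2, 1]
print(d[-4:])    # [8, 4, 2, 1]
print(d[7 - 1])  # 8
```

The last four entries are the FFT-like butterfly distances, and the seventh entry is $N/2 = 8$, the half-cleaner distance for $N = 16$.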
Knuth [10] also describes sorting networks derived by different
methods. He presents both minimum-comparison and minimum-time
networks. He also presents Stone's perfect shuffle layout of
the bitonic sorter [11]. To preserve the perfect shuffle interconnect
pattern, Stone had to include extra stages where no operations
were occurring. The total number of stages in this network is
$\log_2^2 N$. Fig. 4 shows a perfect shuffle bitonic sorting network for
$N = 8$. The network outputs will be arranged such that
$y_1, \ldots, y_8$ are in descending order.

MULTIPLIER CONSTRUCTION

Table 1 shows a comparison of the asymptotic delays of multipliers
constructed by the different networks. For Dadda's method and
the 4:2 compressor method, the delays of a full-adder and a
compressor are assumed to be two and three XOR delays, respectively
[4]. As shown in the table, the asymptotic delay of the bitonic
methods grows faster than that of either the 4:2 compressor or Dadda's
method, thus ultimately limiting the bitonic methods to small
wordlengths. Table 2 shows a comparison of the asymptotic sizes of the
different networks. The size for Dadda's method comes from [3].
This table again limits the bitonic method to small wordlengths.
Figs. 5 and 6 show the column reduction sequences for an $8 \times 8$
multiplication using both Dadda's method and the Bitonic-Detect
method. The figures show that the Bitonic-Detect method reduces
the column height in fewer stages (but the execution time of the
upper stages grows with increasing size).
Table 3 shows the speeds of the various methods for different
operand sizes. Also included in the table is the Three Dimensional
Minimization (TDM) method [5], which takes into account the
unequal delays in a full-adder. This table assumes that the delay
through an XOR gate is equivalent to three basic gate delays, as
was done in [6], [7]. However, in many cases, this is considered
excessive and unfairly penalizes all the full-adder based methods.
Because a two-input XOR gate may be realized by the Boolean
equation $A\bar{B} + \bar{A}B$, a more reasonable choice is to assume an XOR
gate requires two gate delays, assuming that the complementary
forms of the input signals are available. Generally, complementary
signals can be generated in parallel, so no extra gate delays are
incurred. However, there is now an increased fanout before the
circuits generating the complementary signals, so, for some logic
families, the speed of an XOR gate will be slightly greater than two
gate delays.

Fig. 3. Bitonic sorting network for $N = 16$. The network is composed of 10 levels of parallel comparison-exchange circuits shown in Fig. 2.

Fig. 4. Perfect shuffle bitonic sorting network for $N = 8$. The network is composed of six levels of parallel comparison-exchange circuits shown in Fig. 2 and three levels of reordering.
TABLE 1
Asymptotic Multiplier Speeds

TABLE 2
Asymptotic Multiplier Sizes

Table 4 shows the speed of the various methods assuming two
gate delays per XOR gate. The 4:2 compressor and TDM methods
clearly outperform the sorting methods under these circumstances.
Hybrid combinations of sorting, detecting, and standard
compression trees can yield improved performance in some cases,
even when an XOR gate is equivalent to two basic gate delays. For
example, $N = 15$ bits can be optimally sorted in nine gate delays
and detected in two more gate delays, resulting in a column height
of four bits. These bits can be compressed by a 4:2 compressor in
six more gate delays, for a total of 17 gate delays. This is one gate
delay better than a pure 4:2 compression tree (although not as fast
as the TDM approach).
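
The arithmetic of this example can be retraced in a few lines. The sketch below uses the stated figures (nine delays to sort, two to detect, three XORs per 4:2 compressor, two gate delays per XOR). The idealized model in which every 4:2 level halves the row count, and the name `compressor_tree_delay`, are assumptions made for illustration; they happen to reproduce the one-delay gap quoted in the text.

```python
import math

XOR_DELAY = 2                      # gate delays per XOR, as in the Table 4 model
COMPRESSOR_DELAY = 3 * XOR_DELAY   # a 4:2 compressor is taken as three XORs [4]


def compressor_tree_delay(rows):
    """Gate delays for an idealized 4:2 tree that halves the rows at each level."""
    levels = 0
    while rows > 2:
        rows = math.ceil(rows / 2)
        levels += 1
    return levels * COMPRESSOR_DELAY


sort_delay, detect_delay = 9, 2    # figures quoted in the text for N = 15
hybrid = sort_delay + detect_delay + compressor_tree_delay(4)
pure = compressor_tree_delay(15)
print(hybrid, pure)  # 17 18
```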


Fig. 5. Dadda 8 by 8 column reduction. Delay is two XOR delays per stage.

VARIATIONS

TABLE 3
Number of Gate Delays for Multiplication Methods (XOR = 3 Gate Delays)

TABLE 4
Number of Gate Delays for Multiplication Methods (XOR = 2 Gate Delays)

Fig. 6. Bitonic-detect 8 by 8 column reduction. Sorts of size 8, 4, and 3 are performed (6, 3, and 2 gate delays, respectively). Detection in each stage is two gate delays.

An interesting variation of the bitonic sorting approach is to take
advantage of the properties of half-cleaners [9] operating on a
bitonic input sequence. Recall from Section 2 that a half-cleaner
consists of parallel stages of the comparison-exchange circuit
shown in Fig. 2.
The output of a half-cleaner has three properties: 1) It consists
of two half-length bitonic sequences, 2) every element in the top
sequence is at least as small as every element in the bottom
sequence, and 3) at least one of the output halves is clean,
meaning it consists of either all zeros or all ones. We do not know
ahead of time which sequence will be clean. If we detect which half
is clean, and if the length of the input sequence is a power of two,
then the sum of the bits in the clean half is another power of two
(or zero). This can be represented by one bit, shifted a known
number of places. The other, potentially unclean, half of the output
can be presented to a half-cleaner which is one-half the size of the
first half-cleaner. The outputs of this second stage will again have
the half-clean property, and a single bit may be extracted. The
weight of this bit is one-half the weight of the bit extracted from the
first stage. This process may be continued until only two bits
remain, which are left unprocessed. Thus, this procedure compresses
a length-$2^k$ bitonic sequence of bits into $k+1$ weighted bits
(two bits in the lowest-weight column, one bit in each of the
remaining columns).
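
The extraction procedure can be modelled in software as follows. This is a sketch under assumptions, not the circuit of Fig. 7: the names `half_cleaner` and `extract_weighted_bits` are hypothetical, and the detection rule (OR over the top half, then a two-way selection) is one reading of the description above.

```python
def half_cleaner(seq):
    """Compare element i with element i + n/2; minima go to the top half."""
    half = len(seq) // 2
    top = [min(seq[i], seq[i + half]) for i in range(half)]     # AND for bits
    bottom = [max(seq[i], seq[i + half]) for i in range(half)]  # OR for bits
    return top, bottom


def extract_weighted_bits(bitonic):
    """Compress a length-2**k bitonic 0/1 sequence into (weight, bit) pairs."""
    seq = list(bitonic)
    out = []
    while len(seq) > 2:
        top, bottom = half_cleaner(seq)
        top_has_one = int(any(top))   # OR over the top half
        # If the top half holds a one, every bottom element is at least that
        # large, so the bottom half is all ones and is worth len(bottom).
        # Otherwise the top half is all zeros and contributes nothing.
        out.append((len(top), top_has_one))
        seq = top if top_has_one else bottom   # keep the possibly unclean half
    out.extend((1, b) for b in seq)   # the last two bits pass through unprocessed
    return out


bits = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # bitonic, eight ones
pairs = extract_weighted_bits(bits)
print(pairs)                          # [(8, 0), (4, 1), (2, 1), (1, 1), (1, 1)]
print(sum(w * b for w, b in pairs))   # 8
```

For a length-16 input ($k = 4$) the result holds $k + 1 = 5$ weighted bits, two of them in the lowest-weight column.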
Fig. 7 shows a circuit that implements one stage of bit
extraction. Referring to the bitonic sorting network in Fig. 3, the
input to level 7 is in fact a bitonic sequence. Thus, only eight
elements are retained for input to level 8 and four elements for
level 9. Level 10 is eliminated. Levels 1 through 6 convert the
unsorted input into the length-16 bitonic sequence and, therefore,
must be considered for any speed calculations.

Fig. 7. Bit extraction circuit based on the half-cleaner. The half-cleaner converts
two bitonic sequences into one bitonic sequence and one clean sequence. The
OR gate detects which sequence is clean, extracts the single weighted bit, and
causes the 2-1 MUX to propagate the single remaining bitonic sequence to the next stage.
By extracting single bits corresponding to clean halves, we can
reduce the required hardware in the network. Unfortunately, using
simple gate-delay models, the circuit in Fig. 7 appears to be much
slower than a single stage of the conventional bitonic sorting
network (no examination of pass-transistor logic approaches has
yet been made). Thus, this hybrid approach allows one to reduce
size at the expense of speed.
Another interesting variation occurs if we consider an $N$-element
sorting network as a method for converting $2^N$ possible input
patterns into only $N+1$ possible output patterns. If we construct a
network with fewer stages than are necessary to fully sort a
column of bits, we can save hardware, at the expense of an increase
in the number of possible output patterns. The simple detector
approach of Fig. 1 cannot be used because there may be more than
one zero/one transition in the partially sorted output sequence.
However, we expect that the partial sorting network will have
done some good by reducing the number of possible patterns in
the output sequence. For example, using only $\log_2 N$ stages
arranged in the FFT butterfly fashion requires significantly fewer
stages than a complete bitonic sorting network. The number of
possible output patterns increases from 9 to 20 (for $N = 8$), from
17 to 168 (for $N = 16$), and from 33 to 7,581 (for $N = 32$). Given
the patterns, one can use standard Boolean logic minimization
software to derive the detector portion of the circuit. Other patterns
of partial sorting networks could be examined to find which
combinations of half-cleaner distances and Boolean logic
minimization give the best performance.
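
Pattern counts of this kind can be explored by brute-force enumeration for small $N$. The sketch below is illustrative only: the names `butterfly_stages` and `count_output_patterns` are hypothetical, and it assumes every comparator sends the smaller bit to the lower index, which may differ from the arrow directions of the network the reported counts were obtained from. No particular count is claimed for it.

```python
from itertools import product


def butterfly_stages(bits, distances):
    """Apply comparison-exchange stages; each stage pairs elements d apart."""
    v = list(bits)
    for d in distances:
        for i in range(len(v)):
            if i & d == 0:   # i is the upper element of its pair (d a power of two)
                v[i], v[i + d] = min(v[i], v[i + d]), max(v[i], v[i + d])
    return tuple(v)


def count_output_patterns(n, distances):
    """Number of distinct outputs over all 2**n input bit patterns."""
    return len({butterfly_stages(b, distances) for b in product((0, 1), repeat=n)})


# log2(N) butterfly stages for N = 8, with distances N/2, N/4, ..., 1.
print(count_output_patterns(8, [4, 2, 1]))
```

Since each stage preserves the number of ones, any such network yields at least $N + 1$ distinct output patterns. The enumeration visits $2^N$ inputs, so it is practical only for modest $N$.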

CONCLUSIONS

We have shown how bitonic sorting networks can be applied to
fast parallel counter and multiplier designs. In technologies where
an XOR gate is significantly slower than AND or OR gates, the
Bitonic approach will outperform Dadda's method, the 4:2
compressor and the TDM method over a wide range of operand
sizes. With faster XOR gates, the 4:2 compressor and TDM are
favored. We have treated the unsigned multiplication case; the
extension to signed multiplication is straightforward.
We also considered two variations of hybrid networks. Each
attempted to reduce the hardware size or delay of the network by
sacrificing the exact sorting capability.


ACKNOWLEDGMENTS
This work was supported by Sanders internal funding.

REFERENCES
[1] L. Dadda, "Some Schemes for Parallel Multipliers," Alta Frequenza, vol. 34,
pp. 349-356, May 1965.
[2] C.S. Wallace, "A Suggestion for a Fast Multiplier," IEEE Trans. Electronic
Computers, vol. 13, pp. 14-17, Feb. 1964.
[3] Z. Wang, G.A. Jullien, and W.C. Miller, "A New Design Technique for
Column Compression Multipliers," IEEE Trans. Computers, vol. 44, no. 8,
pp. 962-970, Aug. 1995.
[4] D. Villeger and V. Oklobdzija, "Analysis of Booth Encoding Efficiency in
Parallel Multipliers Using Compressors for Reduction of Partial Products,"
Proc. Asilomar Conf. Signals, Systems, and Computers, vol. 1, pp. 781-784, 1993.
[5] V.G. Oklobdzija, D. Villeger, and S.S. Liu, "A Method for Speed Optimized
Partial Product Reduction and Generation of Fast Parallel Multipliers
Using an Algorithmic Approach," IEEE Trans. Computers, vol. 45, no. 3,
pp. 294-306, Mar. 1996.
[6] B.C. Drerup and E.E. Swartzlander Jr., "Fast Multiplier Bit-Product Matrix
Reduction Using Bit-Ordering and Parity Generation," Proc. Asilomar Conf.
Signals, Systems, and Computers, pp. 356-360, 1992.
[7] B.C. Drerup and E.E. Swartzlander Jr., "Fast Multiplier Bit-Product Matrix
Reduction Using Bit-Ordering and Parity Generation," J. VLSI Signal
Processing, vol. 7, pp. 249-257, 1994.
[8] A. Svoboda, "Adder with Distributed Control," IEEE Trans. Computers,
vol. 19, pp. 749-752, 1970.
[9] T.H. Cormen, C.E. Leiserson, and R.L. Rivest, Introduction to Algorithms.
McGraw-Hill, 1990.
[10] D.E. Knuth, The Art of Computer Programming, Volume 3: Sorting and
Searching. Addison-Wesley, 1973.
[11] H.S. Stone, "Parallel Processing with the Perfect Shuffle," IEEE Trans.
Computers, vol. 20, no. 2, 1971.
