You are on page 1of 14

1

Speeding-up Booth Encoded Multipliers by Reducing the Size of Partial Product Array
INTERNAL REPORT DAUIN/DELEN-POLITECNICO DI TORINO AND UNIVERSITY DE SANTIAGO DE COMPOSTELA, 2009 Fabrizio Lamberti, Member, IEEE, Nikolaos Andrikos, Student Member, IEEE, Elisardo Antelo, Member, IEEE,, Paolo Montuschi, Senior Member, IEEE

Abstract Twos complement multipliers are important for a wide range of applications. In this paper we present a technique to reduce by one row the maximum height of the partial product array generated by a radix-4 Modied Boot Encoded multiplier, without any increase in the delay of the partial product generation stage. The proposed method can be extended to higher radices encodings, as well as any size m n multiplications. This reduction may allow a faster compression of the partial product array and regular layouts. This technique is of particular interest in all multiplier designs, but especially in short bit-width twos complement multipliers for high-performance embedded cores. We evaluate the proposed idea by providing rst a gate equivalent study, and then through synthesis results. The comparisons with some other possible solutions, show that the proposed approach is highly efcient in terms of both area and delay. Index Terms Multiplication, Modied Booth Encoding, Partial Product Array.

case the multiplier must be highly optimized to t within the required cycle time and power budgets. Another relevant application for short bit-width multipliers is the design of SIMD units supporting different data formats [3]. In this case, short bit-width multipliers often play the role to be basic building blocks. Twos complement multipliers of moderate bit-width (less than 32 bits) are also being used massively in FPGAs. All of the above translates into a high interest and motivation by the industry, to design highly performing short or moderate bit-width twos complement multipliers. The basic algorithm for multiplication is based on the well known paper and pencil approach [1] and passes through three main phases: i) partial product (PP) generation, ii) PP reduction and iii) nal (carry propagated) addition. During PP generation a set of rows is generated, where each one is the result of the product of one bit of the multiplier by the multiplicand. For example, if we consider the multiplication X Y with both X and Y on n bits and of the form xn1 . . . x0 and yn1 . . . y0 , then the ith row is, in general, a proper left shifting of yi X , i.e. either a string of all zeros when yi = 0, or the multiplicand X itself when yi = 1. In this case, the number of PP rows generated during the rst phase is clearly n. Modied Booth Encoding (MBE) [4], [5], [6] is a technique which has been introduced to reduce the number of PP rows, still keeping both simple and fast enough the generation process of each row. One of the most commonly used schemes is radix-4 MBE, for a number of reasons, being the most important that it allows to reduce the size of the partial product array by almost half [4], and it is very simple to generate the multiples of the multiplicand. More specically, the classical twos complement n n bit multiplier using the radix-4 MBE scheme, generates a PP array with maximum height of n +1 rows, being each row 2 one of the following possible values: all zeros, X, 2X [1]. The PP reduction is the process of adding all PP rows by using a compression tree [4], [5]. Since the knowledge of intermediate addition values is not important, the outcome

I. I NTRODUCTION

N multimedia, 3D graphics and signal processing applications, performance, in most cases, strongly depends on the effectiveness of the hardware used for computing multiplications, since multiplication is, besides addition, massively used in these environments. The high interest in this application eld is witnessed by the large amount of algorithms and implementations of the multiplication operation, which have been proposed in the literature (for a representative set of references see [1]). More specically, short bit-width (8-16 bits) twos complement multipliers with single-cycle throughput and latency have emerged to be very important building blocks for high-performance embedded processors and DSP execution cores [2]. In this
Fabrizio Lamberti is with Dipartimento di Automatica e Informatica, Politecnico di Torino, Torino, Italy. E-mail: fabrizio.lamberti@polito.it Nikolaos Andrikos is with Dipartimento di Elettronica, Politecnico di Torino, Torino, Italy. E-mail: nikolaos.andrikos@polito.it Elisardo Antelo is with Department of Electronic and Computer Engineering, University of Santiago de Compostela, Santiago de Compostela, Spain. E-mail: elisardo@dec.usc.es Paolo Montuschi is with Dipartimento di Automatica e Informatica, Politecnico di Torino, Torino, Italy. E-mail: paolo.montuschi@polito.it

of this phase is a result represented in redundant carry save form, i.e. as two rows, which allows for much faster implementations. The nal (carry propagated) addition has the task to sum these two rows and to present the nal result in non redundant form, i.e. as a single row. In this work we introduce an idea to overlap to some extent the PP generation and the PP reduction. Specically our aim is to produce a PP array of maximum height of n to be 2 then reduced by the compressor tree stage. This, as we will see for the common case of values n which are power of 2, can lead to an implementation where the delay of the compressor tree is reduced by one XOR2 gate. Since we are focusing on small values of n and fast single cycle units, this reduction might be important. A similar study aimed at the reduction of maximum height to n but using a different approach has recently presented 2 interesting results in [7] and previously, by the same authors, in [8]. In Section VI we evaluate and compare the proposed approach with the technique in [7]. The paper is organized as follows: in Section II the multiplication algorithm based on MBE is briey reviewed and analyzed. In Section III we describe related works. In Section IV we present our scheme to reduce the maximum height of the partial product array by one unit during the generation of the PP rows. In Section V we shortly discuss possible extensions to the proposed technique. Finally, in Section VI we evaluate our proposal and compare with a previous scheme using both theoretical analysis and synthesis tools. II. M ODIFIED B OOTH R ECODED M ULTIPLIERS The standard paper and pencil method to compute the product of a multiplicand X and a multiplier Y , is to produce the partial product array by generating one row for each bit of the multiplier Y . This methodology produces n rows, where n is the size of the multiplier. An effective method for reducing the number of partial products to be accumulated is the Modied Booth Encoding (MBE) [4], [5]. In general, a radix-B = 2b MBE leads to a reduction of the number of rows to about n/b while, on the other hand, introduces the need to generate all the multiples of the multiplicand X, at least from B/2 X to B/2 X . As mentioned above, radix-4 MBE is particularly of interest since for radix-4 it is easy to create the multiples of the multiplicand 0, X, 2X . In particular, 2X can be simply obtained by single left shiftings of the corresponding terms X . It is clear that the MBE can be extended to higher radices [9], but the advantage to get a higher reduction in the number of rows, is paid with the need to generate more multiples of X . In this paper we focus our attention to radix-4 MBE, although the proposed method can be easily extended to any radix-B MBE.

y2i+1 0 0 0 0 1 1 1 1

y2i 0 0 1 1 0 0 1 1

y2i1 0 1 0 1 0 1 0 1

Generated partial products 0X 1X 1X 2X (2) X (1) X (1) X 0X

TABLE I M ODIFIED B OOTH E NCODING (R ADIX -4)

(a) MBE signals generation

(b) Partial product generation

Fig. 1. Gate level diagram for partial product generation using MBE (adapted from [8]).

From an operational point of view, it is well known that the radix-4 MBE scheme consists in scanning the multiplier operand with a three-bit window and a stride of two bits (radix-4). For each group of three bits (y2i+1 , y2i , y2i1 ), only one partial product row is generated according to the encoding in Table I. A possible implementation of the radix4 MBE and the corresponding partial product generation is shown in Fig. 1, which comes from a small adaption on the (b) part of Figure 12 in [8]. For each partial product row, Fig. 1(a) produces the one, two and neg signals. These signals are then exploited by the logic in Fig. 1(b), along with the appropriate bits of the multiplicand, in order to generate the whole partial product array. Other alternatives for the implementation of the recoding and partial product generation can be found in [10] [11] among others. As introduced previously, the use of radix-4 MBE allows the (theoretical) reduction of the PP rows to n , with the 2 possibility for each row to host a multiple of yi X , with yi {0, 1, 2}. While it is straightforward to generate the positive terms 0, X and 2X at least through a left shift of X , some attention is required to generate the terms X and 2X which, as observed in Table I, can arise for three congurations of the y2i+1 , y2i and y2i1 bits. To avoid computing negative encodings, i.e. X and 2X , the twos complement of the multiplicand is generally used. From a mathematical point

Fig. 2.

Application of the sign extension prevention measure on the partial product array of a 8 8 multiplier.

of view, the use of twos complement requires to extend the sign up to the left-most part of each partial product row, with the consequence of an extra area overhead. Thus, a number of strategies for preventing sign extension have been developed. For instance, one scheme [1] relies on the observation that pp = pp 1 = pp + 1 + 2 4. The resulting array after some simplications is shown in Fig. 2. Besides sign extension, twos complement has another major characteristic, i.e. to require a neg signal (e.g. neg0 , neg1 , neg2 and neg3 in Fig. 2) to be added in the LSB position of each partial product row for generating the twos complement, as needed. Thus, although for a n n multiplier only n partial products are generated, 2 the maximum height of the partial product array is n + 1. 2 When 4-to-2 compressors are used, which is a widely used option because of the high regularity of the resultant circuit layout for n power of 2, the reduction of the extra row may require an additional delay of two XOR2 gates. By properly connecting partial product rows and using a Wallace reduction tree [12], the extra delay can be further reduced to one XOR2 [13] [14]. However, the reduction still requires additional hardware, roughly a row of n half adders. This issue is of special interest when n is a power of 2, which is by far a very common case, and the multipliers critical path has to t within the clock period of a high performance processor. For instance, in the design presented in [2], for n = 16 the maximum column height of the partial product array is 9, with an equivalent delay for the reduction of six XOR2 gates [13] [14]. For a maximum height of the partial product array of 8, the delay of the reduction tree would be reduced by one XOR2 gate [13] [14]. Alternatively, with a maximum height of 8, it would be possible to use 4 to 2 adders, with a delay of the reduction tree of six XOR2 gates, but with a very regular layout. III. R ELATED W ORK Some approaches have been proposed aimed to add the n n 2 + 1 rows, possibly in the same time as the 2 rows. The solution presented in [11] is based on the use of different types of counters, that is, it operates at the level of the PP reduction phase. Kang and Gaudiot propose a different approach in [7] that manages to achieve the goal of eliminating the extra row before the PP reduction phase. This approach is based on computing the twos complement of the last partial product

Fig. 3.

Twos complement computation (n = 8) [7].

(a) 3-5 decoder

(b) 5-1 (with zero) selector

Fig. 4. Gate level diagram for the generation of twos complement partial product rows [7].

in a logarithmic time complexity, thus eliminating the need for the last neg signal. A special tree structure (basically an incrementer implemented as a prex tree [15]) is used in order to produce the twos complement (Fig. 3), by decoding the MBE signals through a 3-5 decoder (Fig. 4(a)). Finally, a row of 5-1 selectors is used (Fig. 4(b)) to produce the last partial product row directly in twos complement, without the need for the neg signal. The goal is to produce the twos complement in parallel to the computation of the partial products of the other rows with maximum overlap. In such a case it is expected to have no or small time penalization in the critical path. The architecture in [7], [15] is a logarithmic version of the linear method presented in [16] and [17]. With respect to [16], [17], the approach in [7] is more general, and shows better adaptability to any

word size. An example of the partial product array produced using the above method is depicted in Fig. 5. In this work, we present a technique that also aims at producing only n rows, but by relying on a different 2 approach than [7]. IV. BASIC I DEA The proposed approach is general and, for sake of clarity, will be explained through the practical case of 8 8 multiplication (as in the previous gures). As shortly outlined in the previous sections, the main goal of our approach is to produce a partial product array of a maximum heigh of n 2 rows, without introducing any additional delay. Let us consider, as the starting point, the form of the simplied array as reported in Fig. 2, for all the partial product rows except the rst one. As depicted in Fig. 6(a), the rst row is temporarily considered as being split into two sub-rows, the rst one containing the partial product bits (from right to left) from pp00 to pp80 and the second one with two bits set at one in positions 9 and 8. Then, the bit neg3 related to the fourth partial product row, is moved to become a part of the second sub-row. The key point of this graphical transformation is that the second sub-row containing also the bit neg3 , can now be easily added to the rst sub-row, with a constant short carry propagation of 3 positions (further denoted as 3-bits addition), a value which is easily shown to be general, i.e. independent of the length of the operands, for square multipliers. In fact, with reference to the notation of Fig. 6, we have that qq90 qq90 qq80 qq70 qq60 = 0 0 pp80 pp70 pp60 + 0 1 1 0 neg3 . Let us denote now with ss90 qq80 qq70 qq60 the result of the 3-bit sum pp80 pp70 pp60 + 0 0 neg3 . We observe that, by denition of the operands involved, it is impossible to have ss90 = qq80 = 1. Then, the remaining sum of the two bits set at one in positions 9 and 8, leads to the result depicted in Fig. 6(b), and it is easy to show that qq90 = ss90 OR pp80 where ss90 = pp80 AN D pp70 AN D pp60 AN D neg3 as it comes out from the previous 3-bit sum. Note that, because of the particular operands involved, the overall process requires the propagation of the carry over 3 bits positions, i.e. it is very similar as a 3-bit addition. This is also observed by the implementation shown in Fig. 7. Now, if (as we expect and later prove in Section VI-D) the generation of the rst row cascaded by the computation of the bits qq90 qq90 qq80 qq70 qq60 is done in parallel with the generation of the other rows, then the goal to reduce the number of rows by one with no delay penalization has been successfully accomplished. In order to achieve this, we must simplify and differentiate the generation of the rst row with respect to the other rows. We observe that the Booth recoding for the rst row is computed more easily than for the other rows, because the y1 bit used by the MBE is always equal to zero.

Fig. 7. Gate level diagram of the proposed method for adding the last neg in the rst row.

(a) MBE signals generation Fig. 8.

(b) Partial product generation

Gate level diagram for rst row partial product generation.

In order to have a preliminary analysis which is possibly independent of technological details, we refer to the circuits in the following gures:

Fig. 1, slightly adapted from gure 12 of [8], for the partial product generation using MBE; Fig. 7, obtained through manual synthesis aimed at modularity and area reduction without compromising the delay, for the addition of the last neg bit to the most signicant three bits of the rst row. Fig. 8, obtained by simplifying Fig. 1 (since in the rst row it is y2i1 = 0), for the partial product generation of the rst row only using MBE; Fig. 9, obtained through manual synthesis of a combination of the two parts of Fig. 8 aimed at decreasing the delay of Fig. 8 with no or very small area increase, for the partial product generation of the rst row only using MBE.

In particular, we observe that, by direct comparison of Fig.s 1 and 8, the generation of the MBE signals for the rst row is simpler, and allows for the theoretical saving the delay of one NAND3 gate. In addition, the implementation which is reported in Fig. 9 has a smaller delay, but could require a small amount of additional area than the two parts of Fig. 8. As we see in the following, this issue hardly has any signicant impact on the overall design, since this extra hardware is used only for the 3 most signicant bits of the rst row, and not for all the other n n/2 3 bits of the partial products. The high level description of the proposed idea is as follows: 1) generation of most signicant three bit weights of the rst row, plus addition of the last neg bit. Possible

Fig. 5.

Partial product array by applying the twos complement computation method in [7] to the last row.

(a) Basic idea

(b) Resulting array Fig. 6. Partial product array after adding the last neg bit to the rst row. Description Standard implementation of the MBE: signals one, two, and neg generated rst, and then used to produce the partial product array (Fig. 1). Revisited implementation of the MBE for the rst row: logic for the generation of the partial product row is simplied as the y1 bit is always equal to zero (Fig. 8). Generation of the rst partial product row and fast 3-bit carry propagate addition (Fig. 6). Motivation The delay to generate a generic partial product row (other than the rst one) in a standard nn multiplier represents the upper bound for any design aimed at removing the last neg signal (either working on the rst or last row). The delay to produce the rst row constitutes the lower bound for any scheme trying to get rid of the MBE negative encoding by incorporating its effect in the shortest path (i.e., in the rst row, as in the proposed method). The aim is to reduce the number of partial product rows from n/2 + 1 to n/2 , thus getting rid of the effect of MBE negative encoding. By having fewer partial product rows, the next reduction hardware can be smaller in size and faster in speed. The delay will be higher than the one of the rst row for a standard multiplier, but it should be lower than any of the other PP rows, thus not inducing any time penalty. Similarily as the proposed method, also this scheme avoids the extra partial product row, and its delay has to be within the delay requirements of the other PP rows. The above goal is achieved by replacing partial product generation on the last row with partial product selection of the multiplicands twos complement, thus eliminating the need for the last neg signal. With respect to similar techniques (e.g. [16], [17]), this approach is more general and it shows a better adaptability to any word size.

Implementation Standard multiplier (Any row)

Standard multiplier (First row)

Proposed method

Twos complement

Direct computation of the last partial product row in twos complement performed in parallel to the production of the partial products of the other rows (using the designs in Fig.s 3 and 4, as in [7]).

TABLE II D ESIGNS FOR THE G ENERATION OF THE PARTIAL P RODUCT ROWS C ONSIDERED IN THE E VALUATION

a (m + 3)-bit carry propagation (i.e., a (m + 3)-bit addition) in the sum qqm+1,0 qqm+1,0 qqm,0 . . . qqn2,0 = 0 0 ppm,0 . . . ppn2,0 + 0 1 1 . . . 0 negn/21 . Therefore, for rectangular multipliers, the proposed approach can be applied with the cost of a (m + 3)-bit addition. Although we have explicitly focused our attention to radix4 MBE, the proposed method can be easily extended to any radix-B MBE. It is easily observed, by redrawing the equivalent of Fig. 6(a) for another radix-B MBE, that the neg signal of the last row can be included in the rst row by using a simple 3-bit adder for a n n multiplier and by using a (m + 3)-bit adder for the more general case of a (n + m ) n adder. The use of a combined multiplier with accumulation is also common. In this case, the proposed approach can also be used and does not loose the benet to reduce by one the number of rows to be added. Although this reduction does not necessarily lead to a reduction in the computation time for some values of n, it still remains interesting and useful since it is very reasonable to expect some savings and more regularity in the compression tree. We have not evaluated such potential reductions, because their impact could be strongly dependent on the value of n and on the strategy which has been identied to design the compression tree. VI. E VALUATION AND C OMPARISONS A. Foreword In this section, the proposed method based on the addition of the last neg signal to the rst row is rst evaluated. The designed architecture is then compared with an implementation based on the computation of the twos complement of the last row (referred to as Twos complement method) using the designs for the 3-5 decoders, 5-1 selectors and twos complement tree in [7]. We explicitly evaluate the most common case of a n n multiplier, although we have shown in Section V that the proposed approach can also be extended to m n rectangular multipliers. More specically, in the following we analyze (from a theoretical point of view) the proposed architecture and other alternatives, in different frameworks. The theoretical analysis is useful because (possibly) independently of the technology, it provides rst and rough estimates of area and delay. We also provide a theoretical proof that under some reasonable conditions, there is an overlapping between the rst row generation plus addition of the last neg bit, and the generation of the other rows. While studying the framework of possible implementations, we have focused our attention on the following issues: area occupancy and modular design, since it is reasonable to expect that they lead to a possibly small multiplier with regular layout. Clearly, other choices are also possible, starting from implementations more focused on delay rather than on area attention, but they have not been considered

Fig. 9. Combined MBE signals and partial product generation for the rst row (improved for speed).

implementations can use a replication of three times the circuit of Fig. 9 (each for the three most signicant bits of the rst row), cascaded by the circuit of Fig. 7 to add the neg signal; 2) parallel generation of the other bits of the rst row. Possible implementations can use instances of the circuitry depicted in Fig. 8, for each bit of the rst row, except for the most signicant three; 3) parallel generation of the bits of the other rows. Possible implementations can use the circuitry of Fig. 1, replicated for each bit of the other rows. All items 1 to 3 are independent and therefore can be executed in parallel. Clearly if, as assumed and expected, item 1 is not the bottleneck (i.e. the critical path), then the implementation of the proposed idea has reached the goal not to introduce time penalties. Later, in Section VI we will present some results on both the theoretical and practical issues coming from the implementation of the proposed idea. V. P OTENTIAL E XTENSIONS There exists a number of potential extensions to the proposed method. Among all, in this section we provide some highlights on the following:

m n rectangular multiplier;

higher radix Modied Booth Encoding (e.g. radix-8); multipliers with fused accumulation.

From the discussion of Section IV it is clear that all of the above well applies to n n square multipliers which by far are very common. However, in some cases, the use of a m n rectangular multiplier could be preferred. We show that the proposed approach also ts for rectangular multipliers. With no loss of generality we assume m n, i.e. m = n + m with m 0, since it leads to the smaller number of rows; for simplicity, and also with no loss of generality, in the following we assume that both m and n are even. Now, we have seen in Fig. 6(a) that for m = 0, then the last neg bit, i.e. negn/21 belongs to the same column as the rst row partial product ppn2,0 . We observe that, the rst partial product row has bits up to ppm,0 , and therefore, in order to include in the rst row also the contribution of negn/21 , due to the particular nature of operands it is necessary to perform

with a detailed theoretical analysis in this paper. Specically, in the following we present the results of:

Gate INV NAND2 NAND3 NAND4 NOR2 XNOR2 XOR2

Normalized area cost 0.5 1.0 1.5 2.0 1.0 3.0 3.5

Normalized gate delay 1.0 1.4 1.8 2.2 1.4 3.2 4.2

Function 1-input inverter 2-input NAND 3-input NAND 4-input NAND 2-input NOR 2-input XNOR 2-input XOR

the theoretical analysis based on the concept of equivalent gates from Gajskis analysis [18]; the theoretical analysis based on delay and area costs for elementary gates in an industrial standard cell library (STMicroelectronics 130nm HCMOS technology); the theoretical proof that the proposed approach, in the version minimizing area, very likely overlaps the generation of the rst row with the generation of the other rows.

TABLE III N ORMALIZED G ATE D ELAY AND A REA C OST ( MODEL BY [18])

Then, in a different subsection the theoretical analysis results are validated by logic synthesis and technology mapping to an industrial standard cell library. To perform the different analyses, we focused on the generation of a n height partial product array, and we 2 considered implementations capable of avoiding one additional row for the last neg signal, by either incorporating it in the rst row (Proposed method), or by removing the need for an extra row by twos complementing the last row (Twos complement approach in [7]). For sake of reference, we also include in our analysis the study of two additional designs: the generation of a generic PP row in a standard multiplier, without considering the effect of negative encoding (later referred to as Standard Multiplier (Any row)), and the generation of the rst PP row in a standard multiplier, again without considering the effect of negative encoding (later referred to as Standard Multiplier (First row)). The characteristics of the implementations which have been considered and the related motivations are summarized in Table II. In our evaluations we focused the attention on the rst phase of the multiplication algorithm, i.e. partial product generation. The contribution of wires was neglected; nevertheless, with regards to the critical path delay, the theoretical analysis was carried out by either considering a hypothetical innite fanout (i.e., no buffering required), and a fanout equal to 4 inverters (corresponding to the actual load of elementary gates used in the evaluation). Thus, the architectures based on the latter logic model require buffering stages that have been modeled as twolevel networks of INV gates with a load of 4. Large load signals requiring buffering together with corresponding buffer stages location and size are summarized below. a) Recoded signals produced by the MBE logic to control the PP generation (Fig.s 1 and 8): one buffer stage for the one, two, neg signals of any of the n/2 PP rows, each spanning n + 1 bits. b) Recoded signals generated by the 3-5 decoder to control the 5-1 selection logic producing the last PP row directly in twos complement (Fig. 4, Twos Complement implementation of [7]): one buffer stage for the four selection signals of the last PP row, each spanning n + 1

bits. c) Conversion signals in the twos complement tree (Fig. 3, Twos Complement implementation of [7]): log2 (n/2) 1 buffer stages, each spanning a maximum of n/2 + 1 bits (e.g., in a 16 16 multiplier, two buffer stages spanning ve and nine conversion signals, respectively). Based on the labeling above, all the architectures considered in the evaluation should require (a) one buffer stage in the PP generation stage, while the Twos Complement implementation should require (b) an additional buffer stage to store the signals in input to the row of 5-1 selectors, and (c) log2 (n/2)1 buffer levels in the twos complement tree (only the latter being in the critical path).

B. Theoretical analysis: equivalent gates of [18] In order to have a fair comparison, we started with a theoretical analysis on the normalized gate delays and area costs adopted in [7] (based on Gajskis analysis [18]). Table III reports area cost (normalized to the area of a 2input NAND gate) and relative delays (normalized to one inverter delay) for various cell types. 1) Critical path delay estimation: We started the analysis by estimating the critical path of the designs of Table II for n = 8, 16, 32, and 64, both neglecting and considering buffering contribution (modeled as two-level networks of INV gates). We then extended the analysis to area usage. Results concerning delays and area costs are summarized in Tables IV and V, respectively. It is worth remarking that Tables IV and V only refer to values of delay and area related to the rst phase of PP generation including, where necessary, the reduction by one line. Moreover, in the overall results, a buffer-free conguration was considered. As it will be shown in the following, a buffer free conguration corresponds to the lower bound for the critical path delay of the selected implementations, while considering buffers would result in critical path delays strongly dependent on design and technology choices (especially for large values of n, where the delay of the buffering stages could even become comparable to the delay to generate the considered

Technique / No. of buffers Standard multiplier (Any row) / 1 buffer Standard multiplier (First row, delay) / 1 buffer Proposed method / 1 buffer Twos Complement [7] / [log(n/2) 1] buffers

Part Generating MBE signals Generating partial products Total Gen. MBE & PPs (combined) Total Generating partial products Adding the last neg Total 5-1 selector Twos complement tree Total

nn 4.2 6.0 10.2 4.6 4.6 4.6 5.0 9.6 3.6 3.2 + log2 n 1.4 6.8 + log2 n 1.4

88 4.2 6.0 10.2 4.6 4.6 4.6 5.0 9.6 3.6 7.4 11.0

16 16 4.2 6.0 10.2 4.6 4.6 4.6 5.0 9.6 3.6 8.8 12.4

32 32 4.2 6.0 10.2 4.6 4.6 4.6 5.0 9.6 3.6 10.2 13.8

64 64 4.2 6.0 10.2 4.6 4.6 4.6 5.0 9.6 3.6 11.6 15.2

TABLE IV E STIMATED C RITICAL PATH D ELAY FOR THE PARTIAL P RODUCT G ENERATION IN A B UFFER - FREE C ONFIGURATION (N ORMALIZED TO ONE INV D ELAY, [18])

Technique Standard multiplier (Any row) Standard multiplier (First row, area)

Part Generating MBE signals Generating partial products Total Generating MBE signals Generating partial products Total Generating MBE signals Generating partial products Adding the last neg Total Generating MBE signals 3-5 decoder 5-1 selector Twos complement tree Total

nn 12 6.5 (n + 1) 6.5 n + 18.5 1.5 6.5 (n + 1) 6.5 n + 8 1.5 6.5 (n + 1) 13.5 6.5 n + 21.5 12 5.5 6 (n + 1) 3 log2 n n/4 + 11 n/4 0.75 n log2 n +8.75 n + 23.5

88 12 58.5 70.5 1.5 58.5 60 1.5 58.5 13.5 73.5 12 5.5 54 40 111.5

16 16 12 110.5 122.5 1.5 110.5 112 1.5 110.5 13.5 125.5 12 5.5 102 92 211.5

32 32 12 214.5 226.5 1.5 214.5 216 1.5 214.5 13.5 229.5 12 5.5 198 208 423.5

64 64 12 422.5 434.5 1.5 422.5 424 1.5 422.5 13.5 437.5 12 5.5 390 464 871.5

Proposed method

Twos Complement [7]

TABLE V E STIMATED H ARDWARE C OST FOR THE C ONSIDERED I MPLEMENTATIONS (N ORMALIZED TO A 2- INPUT NAND G ATE , M ODEL BY [18])

row). Nonetheless, the required number of buffer is indicated. According to the gate level diagram in Fig. 1, in a buffer free conguration, the delay to generate a generic partial product row in a standard multiplier (Standard multiplier (Any row) implementation) can be computed as the delay to produce the MBE signals (Fig. 1(a)), plus the delay to generate the partial product bits (Fig. 1(b)). Based on the normalized gate delays in Table III, the generation of a generic partial product row takes 10.2 tIN V , since it takes 4.2 tIN V to produce the onei , twoi , and negi signals, and 6.0 tIN V to generate the ppij bits. When considering the logic required for buffering the recoded signals of the MBE generation phase, an additional delay has to be included, corresponding to two INV gates. The delay to generate the rst partial product row (Standard multiplier (First row) implementation) can be evalu-

ated by either considering Fig. 8 or Fig. 9. Given the fact that we are focusing on estimating the critical path, we focused the attention on the scheme in Fig. 9. In this case, by combining MBE signals and partial product generation into a design improved for speed (and by also neglecting buffering delay), producing the rst row takes 4.6 tIN V for any size of n. When considering the delay of the buffer stage, the critical path for the rst partial product row increases by the delay of a two-level network of INV gates. We then considered the delay of the twos complementation approach of [7] (Twos Complement method). According to the design presented in [7], in a buffer-free architecture, the critical path entails the delay to nd the twos complement of the multiplicand (Fig. 4(b)) plus the delay to select the correct value for the last partial product row by means of a 5-1 selector (Fig. 3). The 3-5 decoder (Fig. 4(a)) does not have to be considered since it is in parallel with

the twos complementation step. The delay of the twos complementation varies with the size of the multiplicand. Based on results summarized in [7], for a n n multiplier this delay can be estimated in 3.2 + log2 n 1.4 tIN V . The selection of the correct partial product row has a constant delay of 3.6 tIN V . Thus, generating the last partial product row by incorporating the effect of the last neg signal for a 8 8, 16 16, 32 32, and 64 64 multiplier takes 11.0, 12.4, 13.8, and 15.2 tIN V respectively. On the other hand, when buffering is considered, an additional delay of 2 [log2 (n/2) 1] tIN V corresponding to log2 (n/2) 1 buffer stages in the twos complement tree has to be accounted. We nally evaluated the critical path of the approach presented in this paper (Proposed method implementation). In this case, we focused on the generation of the rst partial product row through the devised 3-bit carry propagate addition scheme. According to the design in Fig. 9, the delay for generating the rst partial product row takes 4.6 tIN V . The addition of the last neg signal using the circuit in Fig. 7 has a constant delay of 5.0 tIN V . Thus, when neglecting buffer contribution, the proposed method has a delay of 9.6 tIN V , independent of the size of n. Otherwise, when taking into account the time required by the buffering stage, an additional delay of 2 tIN V has to be considered. We observe that the Twos Complement approach has a delay that is longer than the delay of the generation of the standard partial product rows, getting even longer as the size n of the multiplier increases. On the other hand, according to theoretical estimations, we can see that the delay for generating the rst row in the proposed method is estimated to be lower than the delay for generating the standard rows. This means that the extra row is eliminated without any penalty on the total critical path. 2) Area cost estimation: After analyzing the critical path, we computed the area estimates associated to the four designs in Table II. Results normalized to the area of a two-input NAND gate can be seen in Table V. The area cost for generating a generic partial product row using the standard approach (Standard multiplier (Any row) implementation) can be estimated by considering Fig. 1. According to this design, for a nn multiplier, MBE signals generation requires 12 NAND2 gates (Fig. 1(a)), while partial product generation has an area cost of 6.5 NAND2 for each of the n + 1 bits of the considered row (Fig. 1(b)). The area cost to generate the rst partial product row Standard multiplier (First row) implementation can be estimated by either analyzing the designs in Fig. 8 or Fig. 9. In particular, when the design in Fig. 8 is considered, the area consumption can be estimated as 1.5 NAND2 for the generation of the MBE signals, plus 6.5 NAND2 gates for producing each of the n+1 partial product bits. On the other

hand, when the scheme improved for speed is considered, the computation of the area cost has to take into account that the gate level design in Fig. 9 has to be applied only for the three most signicant bits of the partial product row. Thus, the area consumption can be computed as 1.5 NAND2 gates for the generation of the MBE signals for the least signicant part of the rst row, plus 6.5 NAND2 gates for producing each of the n2 least signicant partial product bits, plus 8 NAND2 gates for any of the three most signicant bits of the partial product row. With respect to previous design, the circuit in Fig. 9 has an additional consumption of 4.5 NAND2 gates. The area consumption for the technique presented in this work (Proposed method) can be easily derived from the cost estimated for the rst row, by adding the area required by the circuit for performing the 3-bit addition (Fig. 7). According to Table V, this circuit has a constant area cost of 13.5 NAND2 gates, independent on the size of the multiplier. Lastly, the area requirements of the Twos Complement implementation are given by the area cost of the 3-5 decoder, 5-1 selector, and twos complement tree blocks. According to the designs in Fig.s 4(a) and 4(b), the 3-5 decoding logic consumes 5.5 NAND2 gates, while the 5-1 selector has an equivalent area consumption of 6 NAND2 gates for any of the n + 1 bits. For estimating the area cost of the twos complementation logic in Fig. 3 we considered a design made up of log2 n n NOR2 or NAND2 gates, 2 (log2 n1) n inverters (0.5 NAND2 gates) and n XNOR2 2 gates (3 NAND2 gates each). This results in a total area cost of 3 log2 n n + 11 n for the twos complement tree. 4 4 From results summarized in Table V, it can be observed that the proposed method does not introduce any signicant area overhead with respect to the standard generation of a partial product row. On the other hand, the Twos Complement technique requires additional hardware which increases with the size of the multiplier. C. Theoretical analysis: gates based on logic synthesis In order to check for the validity of the model we decide to evaluate the above estimates using a modern cell library. The library used is a 130nm HCMOS standard cell library from STMicroelectronics [19] (later selected also for obtaining overall synthesis results). Data concerning area and delay for elementary cells used in this work (as well as in [7]) normalized to the area of a 2-input NAND gate and to the delay of one inverter are reported in Table VI. The area was extracted from the library documentation, while for the delay we used timing simulation with a load of 4 inverters. It is worth observing that results may vary depending on specic parameters selected for the synthesis such as logic implementation, optimization strategies and target libraries. We see that there are some notable differences between the previously used model of Table III; for instance, in

10

Technique / No. of buffers Standard multiplier (Any row) / 1 buffer Standard multiplier (First row, delay) / 1 buffer Proposed method / 1 buffer Twos Complement [7] / [log(n/2) 1] buffers

Part Generating MBE signals Generating partial products Total Gen. MBE & PPs (combined) Total Generating partial products Adding the last neg Total 5-1 selector Twos complement tree Total

88 4.29 5.48 9.77 4.84 4.84 4.84 4.66 9.50 4.10 7.79 11.89

16 16 4.29 5.48 9.77 4.84 4.84 4.84 4.66 9.50 4.10 9.16 13.26

32 32 4.29 5.48 9.77 4.84 4.84 4.84 4.66 9.50 4.10 11.00 15.10

64 64 4.29 5.48 9.77 4.84 4.84 4.84 4.66 9.50 4.10 12.37 16.47

TABLE VII E STIMATED C RITICAL PATH D ELAY FOR THE PARTIAL P RODUCT G ENERATION IN A B UFFER - FREE C ONFIGURATION (N ORMALIZED TO ONE INV D ELAY, STM ICROELECTRONICS 130 MM HCMOS T ECHNOLOGY )

Technique Standard multiplier (Any row) Standard multiplier (First row, area)

Part Generating MBE signals Generating partial products Total Generating MBE signals Generating partial products Total Generating MBE signals Generating partial products Adding the last neg Total Generating MBE signals 3-5 decoder 5-1 selector Twos complement tree Total

88 11.26 51.77 63.03 1.75 51.77 53.52 1.75 51.77 10.0 63.52 11.26 6.26 47.23 34.02 97.77

16 16 11.26 97.78 109.04 1.75 97.78 99.53 1.75 97.78 10.0 109.53 11.26 6.26 89.22 82.05 188.79

32 32 11.26 189.81 201.07 1.75 189.81 191.56 1.75 189.81 10.0 201.56 11.26 6.26 173.19 192.12 382.83

64 64 11.26 373.88 385.14 1.75 373.88 375.63 1.75 373.88 10.0 385.63 11.26 6.26 341.13 440.3 798.95

Proposed method

Twos Complement [7]

TABLE VIII E STIMATED H ARDWARE C OST FOR THE C ONSIDERED I MPLEMENTATIONS (N ORMALIZED TO A 2- INPUT NAND GATE , STM ICROELECTRONICS 130 MM HCMOS T ECHNOLOGY )

the rst model a XNOR2 gate is modeled as one XOR2 plus one inverter, while in the library model they have the same area and delay. These discrepancies could mean that we may need to update the model to better reect modern technologies. With the above information, we performed the same type of estimations as for the previous model. Results obtained on the considered implementations for critical path and area cost are illustrated in Table VII and VIII, respectively. We observe that results are, in general, comparable to the previous model, which means that the model presented in [18] is still quite valid, although we should note that we avoided using XNOR2 gates in the circuits presented in this work, in order to avoid the above discrepancies.

D. Theoretical Analysis: proof of overlapping between rst line and other rows generations. In this section we compute and compare the theoretical delay to generate the rst row (including the addition of the last neg signal) vs. the delay to generate the other rows. We will see that, under some very reasonable conditions, the computation of the rst row is overlapped by the computation of the other rows. As shortly said in the previous sections, we have considered designs focused on modularity and area requirements as primary issues. We remember that the key point is that if we wish to have the generation of the rst row (including neg assimilation) running in parallel with the computation of the other rows, it is necessary to have the generation and the assimilation of the most signicant bit weights of the rst row with the neg signal, which is done fast enough, without compromising

11

Gate INV NAND2 NAND3 NAND4 NOR2 XNOR2 XOR2

Normalized area cost 0.75 1.00 1.25 1.25 1.00 2.00 2.00

Normalized gate delay 1.00 1.37 1.92 2.73 1.84 2.74 3.17

Function 1-input inverter 2-input NAND 3-input NAND 4-input NAND 2-input NOR 2-input XNOR 2-input XOR

circuitry depicted in Fig. 8; 4) generation of the bits of the other rows, using the circuitry of Fig. 1. All of the above is related to an overall implementation which is modular, regular and has attention to the area occupancy, possibly without compromising overall performances. While looking for the critical path, we must consider the three parallel paths:

TABLE VI G ATE D ELAY AND A REA C OST N ORMALIZED TO THE D ELAY OF ONE INV G ATE WITH A L OAD OF 4 AND TO THE A REA OF A 2- INPUT NAND G ATE (STM ICROELECTRONICS 130 MM HCMOS T ECHNOLOGY )

the regularity and area increase of the whole multiplier. We remind that, as it is seen from Fig. 6, the generation of the rst row is different from the generation of the other rows, basically for two reasons:

Path A: points 1 and 2 of the previous list, i.e. generation of the most signicant three bit weights of the rst row (delay of Fig. 9), plus the addition to these bits of the neg signal, (delay of Fig. 7); Path B: point 3, i.e. generation of the other bits of the rst row (delay of Fig. 8); Path C: point 4, i.e. generation of the bits of the other rows (delay of Fig. 1).

the rst row needs to assimilate the last neg signal, an operation which requires an addition over the most signicant three bit weights; the rst row can take advantage of the easier Booth recoding of the rst row, because the y1 bit used by the MBE is always equal to zero (Section IV).

We get P ath A = (IN V + 2 N AN D3) + (N AN D3 + max(N AN D2 + IN V, XN OR2)). Now, also supported by the results provided in Table III, we assume that XN OR2 N AN D2 + IN V , and therefore P ath A = (IN V + 2 N AN D3) + (N AN D3 + XN OR2)) and P ath B = (IN V + N AN D2) + (2 N AN D2 + XN OR2). Finally, P ath C = max(XOR2, IN V + N AN D3 +
N AN D2, max(N AN D2, IN V ) + N OR2) + (max(2 N AN D2, IN V ) + XN OR2), which becomes after some trivial simplications, P ath C = (IN V + N AN D3 + N AN D2) + (2 N AN D2 + XN OR2) = IN V + 3N AN D2 + N AN D3 + XN OR2

As seen before, in Fig. 8 we have a possible implementation to generate the rst row, which takes into account the easier generation of the rst row. We have already seen that when we combine the two parts of Fig. 8 we get Fig. 9 which is faster than Fig. 8, at a possibly slightly larger area cost, certainly very marginal with respect to the global area of all the other bits of partial products coming from the other rows. We have done some rough analyses and found that a good tradeoff is to have the generation of the rst bits of the rst row which is carried out by the circuit of Fig. 9, then followed by the cascaded addition provided by Fig. 7. For this reason, although it is possible, we have decided not to merge the circuits of Fig.s 9 and 7, in order not to slightly reduce the regularity of this small portion of the multiplier. Certainly, for some implementations this is still a feasible option to be explored in cases when the generation of the rst row becomes the bottleneck with respect to the generation of the other rows. With these premises, our analysis will consider the following: 1) generation of most signicant three bit weights of the rst row, determined according to the very small and regular circuitry of Fig. 9; 2) addition to these bits of the neg signal, carried out by the very small and regular circuitry of Fig. 7; 3) generation of the other bits of the rst row, using the

Now, when we compare Path A vs. Path B, we get that since N AN D3 N AN D2, it follows that P ath A P ath B . This implies that in the rst row, the critical path is in the generation of the most signicant 3 bit weights plus the addition of the neg signal. Now we compute P ath C P ath A. After some simple passages, we get
(CA) = P ath CP ath A = 3N AN D22N AN D3

If (C A) is non negative, then it follows that Path A is overlapped by Path C and therefore the proposed method of generating and adding the extra neg signal to the rst row, does not introduce any additional delay to be possibly overlapped during the second phase of tree reduction, as it was done in [7]. We see that when the delays are these reported in Tables III and VI, then (C A) > 0 and therefore Path A is overlapped by Path C, resulting therefore transparent, i.e. not introducing additional delay. Clearly, the analysis above does not consider the delays of the wiring, although having the circuits of Fig.s 9, 7, 8 and 1 similar characteristics, it is not too much unreasonable to expect that it also can hold when wire contributions are taken into account.

12

E. Implementation Results

VII. C ONCLUSIONS Twos complement n n bit multipliers using radix-4 Modied Booth Encoding produce n partial products 2 but due the sign handling, the partial product array has a maximum height of n + 1. We presented a scheme that 2 produces a partial product array with a maximum height of n 2 , without introducing extra delay in the partial product generation stage. With the extra hardware of a (short) 3bit addition, and the simpler generation of the rst partial product, we have been allowed to achieve a delay for the proposed scheme within the bound of the delay of a standard partial product generation. The outcome of the above is that the reduction of the maximum height of the partial product array by one unit, may simplify the partial product reduction tree, in terms of delay and regularity of the layout. This is of special interest for all multipliers but especially for short bit-width multipliers for high performance embedded cores, where short but fast bit-width multiplications could be common operations. We have also compared our approach with a recent proposal with the same aim, considering results after synthesis using a widely used industrial simulation tool and a modern industrial technology library, and concluded that our approach can improve both the performance and area requirements of square multiplier designs. The proposed approach also applies with minor modications to rectangular and to general radix-B Modied Booth Encoding multipliers.

In order to further check the validity of our estimations in an implementation technology, we implemented the designs in Table II through logic synthesis and technology mapping to an industrial standard cell library. Specically, for the logic synthesis we used Synopsys Design Compiler and the designs were mapped to a HCMOS 130nm industrial library from STMicroelectronics. To perform the evaluation, we obtained the area-delay space for the sole generation of the partial product row of interest (i.e. the rst row in the proposed approach, the last row in the implementation presented in [7]). In order to support the comparison, the area-delay space for the generation of the partial product rows using standard MBE implementations was also evaluated, by considering the rst row and the other rows of the partial product array separately. The results, obtained for n = 8, 16 and 32, are depicted in Figure 10. The delays are shown both in absolute units (ns) and normalized to the delay of an inverter with a fanout of four (68 ps for the technology used under worst-case conditions). Accordingly, the area is presented both in absolute units (m2 ) and normalized to equivalent gates using the area of a 2-input NAND gate (4.39 m2 for the technology used). We obtained several design points (using different target delays) for each approach, and the minimum delay shown is the fastest design that the tool was capable to synthesize. We observe that the Proposed method implementation produces a curve in the delay-area graph bounded by the curve for the generation of a standard partial product (upper bound) and by the curve for the standard generation of the rst partial product (lower bound) for the three values of n considered. Moreover the minimum delay achievable is very similar to the case of the generation of a standard partial product for n = 8, 16 (with our approach it is about 0.5-0.7 FO4 higher), and is even less for n = 32 due to the predominant effect of the higher loading of the control signals. Therefore our scheme does not introduce any additional delay in the partial product generation stage for target delays higher than about 5 FO4. The curve for our scheme gets closer to the curve corresponding to the standard generation of the st partial product as n increases. This is due to the fact that as n increases, the short addition of the leading part achieves more overlap with the generation of the rest of the partial product (with higher input load capacitance as n increases). The Twos Complement scheme achieves minimum delays between 7 and 10 FO4, at the cost of requiring more than four times the area at this point, compared to the Proposed method approach. Most importantly, its delay is much higher than the one of any standard row.

R EFERENCES
[1] M. D. Ercegovac and T. Lang, Digital Arithmetic. Los Altos, CA, USA: Morgan Kaufmann Publishers, 2003. [2] S. K. Hsu, S. K. Mathew, M. A. Anders, B. R. Zeydel, V. G. Oklobdzija, R. K. Krishnamurthy, and S. Y. Borkar, A 110gops/w 16-bit multiplier and recongurable pla loop in 90-nm cmos, IEEE Journal of Solid State Circuits, vol. 41, pp. 256264, Jan. 2006. [3] M. S. Schmookler, M. Putrino, A. Mather, J. Tyler, H. V. Nguyen, C. Roth, M. Sharma, M. N. Pham, and J. Lent, A low-power, high-speed implementation of a powerpc(tm) microprocessor vector extension, Proceedings of the 14th IEEE Symposium on Computer Arithmetic, p. 12, 1999. [4] A. D. Booth, A signed multiplication technique, Quarterly J. Mech. Appl. Math., vol. 4, 1951. [5] L. Dadda, Some schemes for parallel multipliers, Alta Frequenza, vol. 34, May 1965. [6] O. L. MacSorley, High speed arithmetic in binary computers, Proceedings IRE, vol. 49, pp. 6791, Jan. 1961. [7] J.-Y. Kang and J.-L. Gaudiot, A simple high-speed multiplier design, IEEE Transactions on Computers, vol. 55, no. 10, pp. 12531258, Oct. 2006. [8] , A fast and well-structured multiplier, Proceedings of Euromicro Symposium on Digital System Design, pp. 508 515, Sept. 2004. [9] E. M. Schwarz, R. M. A. III, and L. J. Sigal, A radix-8 cmos s/390 multiplier, Proceedings of the 13th IEEE Symposium on Computer Arithmetic, p. 2, 1997.

13

(a) Size of 8

(b) Size of 16

(c) Size of 32 Fig. 10. Post-synthesis results of product generation for different sizes.

14

[10] W.-C. Yeh and C.-W. Jen, High-speed booth encoded parallel multiplier design, IEEE Transactions on Computers, vol. 49, no. 7, pp. 692701, Jul. 2000. [11] Z. Huang and M. D. Ercegovac, High-performance low-power left-to-right array multiplier design, IEEE Transactions on Computers, vol. 54, no. 3, pp. 272283, Mar. 2005. [12] C. S. Wallace, A suggestion for a fast multiplier, IEEE Transactions on Electronic Computers, vol. EC-13, no. 1, pp. 1417, Feb. 1964. [13] V. G. Oklobdzija, D. Villeger, and S. S. Liu, A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach, IEEE Transactions on Computers, vol. 45, no. 3, pp. 294306, Mar. 1996. [14] P. F. Stelling, C. U. Martel, V. G. Oklobdzija, and R. Ravi, Optimal circuits for parallel multipliers, IEEE Transactions on Computers, vol. 47, no. 3, pp. 273285, Mar. 1998. [15] J.-Y. Kang and J.-L. Gaudiot, A logarithmic time method for twos complementation, Proceedings of the Int. Conference on Computational Science, pp. 212219, 2005. [16] K. Hwang, Computer Arithmetic Principles, Architectures, and Design. New York: Wiley, 1979. [17] R. Hashemian and C. P. Chen, A new parallel technique for design of decrement/increment and twos complement circuits, Proceedings of the 34th Midwest Symposium on Circuits and Systems, vol. 2, pp. 887890, 1991. [18] D. Gajski, Principles of Digital Design. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1997. [19] STMicroelectronics, 130nm HCMOS9 Cell Library, http://www. st.com/stonline/products/technologies/soc/evol.htm.

You might also like