
International Journal of Electronic Engineering Research, Volume 1, Number 1 (2009), pp. 53-61, Research India Publications


A 16×16 MUX Based Multiplier Design Using Optimized Static CMOS Logic Style
Abhijit Asati* and Chandrashekhar** * Lecturer, Electrical & Electronics Engineering Group, BITS, Pilani, India ** Director, Central Electronics Engineering Research Institute, Pilani, India

Abstract

The simple VLSI implementation of array multipliers makes them preferable for smaller operand sizes, in spite of their linear time complexity. In general, array multipliers have poor space complexity, O(n^2): they require approximately n^2 cells to produce a product, so as the operand size grows the circuit consumes large area and power. In this paper we present a MUX based 16×16 unsigned multiplier circuit, which utilizes an efficient partial product generation and partial product addition technique. The time and space complexity of such a multiplier is much better than that of simpler array multiplier techniques. The multiplier has been designed using optimized static CMOS logic cells to provide the best area, power and delay performance. The multiplier circuit is implemented using conventional CMOS logic in the 0.6 µm N-well CMOS process (SCN_SUBM, lambda = 0.3) of MOSIS, and simulated after parasitic extraction. The simulation results show a large reduction in propagation delay and average power compared to the tree multiplier implementation of [3].

Keywords: MUX based, array, Wallace tree, Booth encoding, partial product, complexity, operand size

In digital signal processor implementations, such as standard digital signal processors and ASIC digital signal processors, the multiplier is a fundamental building block. The performance of signal processing algorithms such as frequency domain filtering (FIR and IIR), frequency-time transformations (FFT) and correlation depends on the performance of the multiplier implementation. In most real-time DSP tasks, the multiplier block must operate at high speed while consuming little layout area and power. The multiplication algorithms differ in the means of



partial product generation and partial product addition [1]. Array multipliers have linear time complexity, i.e. O(n), so their delay degrades for larger operand sizes. Array multipliers also have poor space complexity, O(n^2): they require approximately n^2 cells to produce a product, so as the operand size grows the circuit consumes large area and power [2], [4], [5]. The number of partial product rows can be reduced by a factor of n using radix-m Booth encoding (where m = 2^n). With Booth radix-4 encoding (m = 4 = 2^2) the partial product rows are halved [3], so the number of logic cells required to generate the partial products is reduced to n^2/2 [2]. Furthermore, Wallace tree accumulation reduces the ripple effect and so produces the product in far less time: the time complexity falls to O(log n). However, it requires larger gate and routing area than a regular array and is hence regarded as unsuitable for VLSI implementation [2]. The hardware reduction of the Booth encoding scheme can be combined with accelerated Wallace tree accumulation of the partial products to obtain the reduced time complexity of O(log n), which is well suited to multipliers with large operand sizes [2], [3]. As discussed earlier, for smaller operand sizes the tree based architectures may have smaller gate delay but consume more silicon area due to increased routing and encoding overheads; array multipliers, on the other hand, have larger gate delay but shorter routing length. MUX based array multipliers give a faster and more compact implementation due to efficient partial product generation and efficient partial product addition. In this paper we present an implementation of a 16×16 multiplier design using the MUX based array technique and static CMOS logic cells. These static CMOS logic cells provide the best area, power and delay performance, as described in [6].
The VLSI implementation of the multiplier circuit is done in the 0.6 µm N-well CMOS process (SCN_SUBM, lambda = 0.3) of MOSIS, using conventional CMOS logic. Simulation results are compared with a faster Booth encoded Wallace tree multiplier implementation, as in [3]. Section II discusses the conventional static CMOS logic design style, Section III explains the MUX based multiplier algorithm, Section IV illustrates the multiplication logic, and Section V describes the schematics of the 4×4 multiplier and the 16×16 multiplier. Physical implementation and results are described in Section VI. Section VII concludes the paper.

Conventional static CMOS Logic Design style

A static logic gate generates its output corresponding to the applied input voltages after a certain time delay, and it preserves its output level (or state) as long as the power supply is provided. In steady state each gate output is connected to either Vdd or Gnd through a low-resistance path, so for a static input the output levels are preserved. The operation of dynamic logic circuits, in contrast, relies on temporary storage of signal values on the capacitance of dynamic circuit nodes. The conventional static logic style offers a versatile implementation of logic functions based on the static or steady state behavior of simple CMOS structures. It is most suitable and widely accepted for many VLSI circuit implementations due to important properties such as high speed, low power, large noise margins, no logic degradation and validity of the logic design



style at scaled down technologies. A logic gate with a fan-in of n requires 2n devices (n N-type + n P-type). Two logic blocks, the N-block and the P-block, form a CMOS gate. The topology of the N-block is the dual of that of the P-block. Since both blocks have an equal number of transistors, the transistor count may increase. The channel widths of series connected n-channel MOS transistors (NMOS) or p-channel MOS transistors (PMOS) have to be increased to obtain a reasonable conducting current to drive capacitive loads. The increase in PMOS size results in a significant area overhead and an increased gate input capacitance, which may lead to high dynamic power dissipation. The higher gate input capacitance also loads the previous stage and thereby increases the delay. The ratio of PMOS to NMOS transistor widths should be chosen optimally to achieve a good noise margin, higher speed and lower power consumption, as described in [7], [9]. The short-circuit currents of a static CMOS gate can be minimized by sizing the transistors for equal rise and fall times. The schematics of the 1-bit full adder, 2-input AND, 3-input AND and 2-input MUX cells implemented in the conventional static CMOS logic design style are shown in Figure 1. The full adder cell, designed using the principle of symmetry, has 28 transistors, as described in [6], [8]. The 28-transistor version performs considerably better than the 40-transistor version [6]. A 32-bit adder designed using complementary CMOS has a power-delay product of less than half that of the CPL version [6]. The 2-input AND cell, 3-input AND cell, 2-input MUX and other cells also provide a better power-delay product.





Figure 1: Schematics using the conventional static CMOS logic design style of (a) the complex full adder cell using the principle of symmetry, (b) the 2-input AND gate, (c) the 3-input AND gate, (d) the 2-input MUX.
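
The device count and the full adder logic described above can be checked with a short sketch. It assumes that the 28-transistor cell follows the widely used "mirror adder" equations, in which the sum reuses the inverted carry; the source does not spell out the equations, so this is an illustration rather than the authors' netlist. The function names are hypothetical.

```python
from itertools import product


def static_cmos_device_count(fan_in: int) -> int:
    """A static CMOS gate with fan-in n uses n NMOS + n PMOS devices."""
    return 2 * fan_in


def mirror_full_adder(a: int, b: int, c: int) -> tuple:
    """Return (sum, carry) using carry-reuse equations (assumed formulation)."""
    carry = (a & b) | (c & (a | b))
    # The sum stage reuses the inverted carry instead of computing a 3-input XOR.
    total = (a & b & c) | ((a | b | c) & (1 - carry))
    return total, carry


# Exhaustive check against ordinary arithmetic for all 8 input combinations.
for a, b, c in product((0, 1), repeat=3):
    s, cout = mirror_full_adder(a, b, c)
    assert 2 * cout + s == a + b + c

print(static_cmos_device_count(3))  # devices in a 3-input static gate
```

The exhaustive loop only confirms that the assumed equations behave as a full adder; it says nothing about transistor sizing or delay.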



MUX based multiplier algorithm

This is an unsigned multiplication algorithm in which one bit of the multiplier and one bit of the multiplicand are processed in parallel. The algorithm is symmetric, i.e., the multiplier and multiplicand can be interchanged. In this algorithm the progressively computed sum of the two operands is a useful quantity that is used in the computation of certain partial products. The different quantities are computed one bit at each step of the algorithm, and the appropriate quantity is then selected in the next step, if required. The parallel implementation of this algorithm yields an iterative array. Compared to an implementation based on the modified Booth's algorithm, it consumes the same amount of circuitry but yields faster multiplication. This multiplexer-based architecture computes the partial sums of the two operands in parallel, which simplifies tasks such as compression and accumulation. It also performs favorably in processing speed compared to other regular array architectures. The multiplication logic can be explained using equations (1) to (5).

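Let

$$X = x_{n-1} x_{n-2} \cdots x_0, \qquad Y = y_{n-1} y_{n-2} \cdots y_0, \qquad P = XY \tag{1}$$

X_j and Y_j are the binary numbers obtained by truncating X and Y respectively below the (j+1)th bit, i.e. to their j least significant bits:

$$X_j = x_{j-1} x_{j-2} \cdots x_0, \qquad Y_j = y_{j-1} y_{j-2} \cdots y_0, \qquad \text{for } 0 < j < n+1 \tag{2}$$

With $P_j = X_j Y_j$ and $X_0 = Y_0 = 0 = P_0$:

$$X = X_n = X_{n-1} + 2^{n-1} x_{n-1}, \qquad Y = Y_n = Y_{n-1} + 2^{n-1} y_{n-1}$$

$$P_n = X_n Y_n = \left(X_{n-1} + 2^{n-1} x_{n-1}\right)\left(Y_{n-1} + 2^{n-1} y_{n-1}\right) = 2^{2n-2} x_{n-1} y_{n-1} + 2^{n-1}\left(x_{n-1} Y_{n-1} + X_{n-1} y_{n-1}\right) + X_{n-1} Y_{n-1} \tag{3}$$

Since $X_{n-1} Y_{n-1} = P_{n-1}$, applying the expansion recursively gives

$$P = \sum_{j=0}^{n-1} x_j y_j 2^{2j} + \sum_{j=0}^{n-1} \left(x_j Y_j + X_j y_j\right) 2^{j} + P_0 = \sum_{j=0}^{n-1} x_j y_j 2^{2j} + \sum_{j=0}^{n-1} Z_j 2^{j} \tag{4}$$

where $Z_j = x_j Y_j + X_j y_j$, that is,

$$Z_j = \begin{cases} 0 & \text{if } x_j = 0,\ y_j = 0 \\ X_j & \text{if } x_j = 0,\ y_j = 1 \\ Y_j & \text{if } x_j = 1,\ y_j = 0 \\ X_j + Y_j & \text{if } x_j = 1,\ y_j = 1 \end{cases} \tag{5}$$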
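
The decomposition of the product into AND terms x_j y_j 2^(2j) and selected terms Z_j 2^j can be checked numerically. The following is a minimal behavioral sketch, not the authors' circuit: the function name `mux_multiply` is hypothetical, and Python integers stand in for the bit vectors. The tuple lookup plays the role of the 4:1 selection of Z_j by the bit pair (x_j, y_j).

```python
def mux_multiply(x: int, y: int, n: int) -> int:
    """Multiply two unsigned n-bit integers as sum(x_j y_j 2^(2j)) + sum(Z_j 2^j)."""
    if not (0 <= x < (1 << n) and 0 <= y < (1 << n)):
        raise ValueError("operands must fit in n bits")
    result = 0
    for j in range(n):
        xj = (x >> j) & 1
        yj = (y >> j) & 1
        low_mask = (1 << j) - 1
        X_j = x & low_mask  # the j least significant bits of x
        Y_j = y & low_mask  # the j least significant bits of y
        # Selection of Z_j: 00 -> 0, 01 -> X_j, 10 -> Y_j, 11 -> X_j + Y_j
        z_j = (0, X_j, Y_j, X_j + Y_j)[(xj << 1) | yj]
        result += (xj & yj) << (2 * j)  # AND term x_j y_j 2^(2j)
        result += z_j << j              # selected term Z_j 2^j
    return result


# Exhaustive check for 4-bit operands, plus one 16-bit spot check.
assert all(mux_multiply(a, b, 4) == a * b for a in range(16) for b in range(16))
assert mux_multiply(0xFFFF, 0xFFFF, 16) == 0xFFFF * 0xFFFF
print(mux_multiply(7, 3, 4))
```

The sketch models only the arithmetic identity; it does not capture how the hardware shares the progressively computed sum X_j + Y_j between rows.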

Illustration of the Multiplication Logic

Example 1 shows the multiplication process for two 4-bit binary numbers using the MUX-based approach. The number of rows remains the same, but the number of partial product bits to be compressed in any column is now restricted to 3, which makes compression much faster and easier. If the carry bits C1, C2, C3 shown in Example 1 are handled separately, only 2 bits remain to be added in any column. Two columns can then be added simultaneously using a 2-bit CLA, which also accepts the carry input C1, C2 or C3 of the corresponding column (this is possible because these carries occur in alternate columns). Thus the first step of the algorithm generates the partial product rows and the second step adds these partial products together with compression, which makes it faster than other regular array multipliers. It produces its output in time T = (n+1) FA_2CLA, where FA_2CLA is the delay of a 2-bit CLA adder, with a timing overhead of one 4:1 MUX delay, while a regular array multiplier takes an approximate delay of T = (2n) FA, with FA the delay of one full adder. The large area overhead is due to the routing needed between the MUXes.
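
The two delay expressions can be compared with a small sketch. The cell delays used here are hypothetical normalized values chosen only for illustration; they are not measurements from this design, and the function names are likewise assumptions.

```python
def mux_array_delay(n: int, t_cla2: float, t_mux4: float) -> float:
    """T = (n + 1) * t_cla2, plus one 4:1 MUX delay of overhead."""
    return (n + 1) * t_cla2 + t_mux4


def regular_array_delay(n: int, t_fa: float) -> float:
    """T = 2n * t_fa for a regular array multiplier."""
    return 2 * n * t_fa


# Hypothetical normalized cell delays (assumptions, not values from the paper).
T_FA, T_CLA2, T_MUX4 = 1.0, 1.5, 1.0

for n in (4, 8, 16):
    print(n, mux_array_delay(n, T_CLA2, T_MUX4), regular_array_delay(n, T_FA))
```

Under these expressions the comparison depends on the ratio of the 2-bit CLA delay to the full adder delay, so the operand size at which the MUX based array becomes faster should be read off for the actual cell library.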
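Example 1: The terms X0Y0, X1Y1, X2Y2 and X3Y3, at the positions shown below, have to be added with the appropriate terms selected by 4:1 MUXes, based on the select lines given in the column headings. Let X = X3X2X1X0 = 0111 = (+7)10 and Y = Y3Y2Y1Y0 = 0011 = (+3)10. The symbolic entries describe the operation performed by the algorithm; the numeric values show its application to the selected inputs X and Y, with the selected bit after the arrow. Working of the MUX: select lines 00/01/10/11 correspond to inputs I1/I2/I3/I4. The table has one row per product bit column.

| Column | AND term | MUX row, select X1Y1 = 11 | MUX row, select X2Y2 = 10 | MUX row, select X3Y3 = 00 | Product bit |
|---|---|---|---|---|---|
| P7 | | | | | 0 |
| P6 | X3Y3 = 0 | | | 0/0/0/C3 = 0/0/0/1 → 0 | 0 |
| P5 | | | | 0/X2/Y2/S2 = 0/1/0/0 → 0 | 0 |
| P4 | X2Y2 = 0 | | 0/0/0/C2 = 0/0/0/1 → 0 | 0/X1/Y1/S1 = 0/1/1/1 → 0 | 1 |
| P3 | | | 0/X1/Y1/S1 = 0/1/1/1 → 1 | 0/X0/Y0/S0 = 0/1/1/0 → 0 | 0 |
| P2 | X1Y1 = 1 | 0/0/0/C1 = 0/0/0/1 → 1 | 0/X0/Y0/S0 = 0/1/1/0 → 1 | | 1 |
| P1 | | 0/X0/Y0/S0 = 0/1/1/0 → 0 | | | 0 |
| P0 | X0Y0 = 1 | | | | 1 |

P = P7P6P5P4P3P2P1P0 = 00010101 = (21)10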
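
The rows of Example 1 can be regenerated bit by bit with a small sketch. It assumes, as the example states, that each 4:1 MUX receives the inputs 0/X/Y/S (or 0/0/0/C for the carry position) and is selected by the bit pair XjYj. The helper names `mux4`, `bit` and `partial_product_rows` are hypothetical.

```python
def mux4(sel: int, i1: int, i2: int, i3: int, i4: int) -> int:
    """4:1 MUX: select codes 00/01/10/11 pick I1/I2/I3/I4."""
    return (i1, i2, i3, i4)[sel]


def bit(value: int, k: int) -> int:
    """Return bit k of value."""
    return (value >> k) & 1


def partial_product_rows(x: int, y: int, n: int) -> list:
    """Return [AND-term row, MUX row for j=1, ..., MUX row for j=n-1] as integers."""
    and_row = sum((bit(x, j) & bit(y, j)) << (2 * j) for j in range(n))
    rows = [and_row]
    for j in range(1, n):
        sel = (bit(x, j) << 1) | bit(y, j)
        low_mask = (1 << j) - 1
        s = (x & low_mask) + (y & low_mask)  # running sum X_j + Y_j
        row = 0
        for k in range(j):
            # Sum position: inputs 0 / X_k / Y_k / S_k, placed at weight 2^(j+k)
            row |= mux4(sel, 0, bit(x, k), bit(y, k), bit(s, k)) << (j + k)
        # Carry position: inputs 0 / 0 / 0 / C_j, placed at weight 2^(2j)
        row |= mux4(sel, 0, 0, 0, bit(s, j)) << (2 * j)
        rows.append(row)
    return rows


rows = partial_product_rows(0b0111, 0b0011, 4)
for r in rows:
    print(format(r, "08b"))
print(sum(rows))
```

For X = 7 and Y = 3 the four rows sum to 21, in agreement with the product bits P7 to P0 of the example.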



Schematic 4×4 multiplier and 16×16 multiplier

The logic explained in Example 1 can be shown through a schematic that uses 4:1 multiplexers and AND gates, as shown in Figure 2. The multiplexers choose Z_j for the Z_j 2^j terms (refer to equation 5), while the AND gates produce the x_j y_j 2^(2j) terms. The complete logic structure to accumulate the partial product terms utilizes Cell-I and Cell-II, which are shown in Figure 3 [2]. A similar technique can be used in the design of the 16×16 multiplier.
[Figure 2 schematic: AND2 gates generate the terms 2^0 X0Y0, 2^2 X1Y1, 2^4 X2Y2 and 2^6 X3Y3; 4:1 MUXes with data inputs 0/X/Y/S (and 0/0/0/C for the carries C1, C2, C3), selected by XjYj, generate the terms Z_j 2^j.]

Figure 2: Logic for MUX based multiplier implementation.

[Figure 3 schematic: cells built around a full adder (FA) and a 4:1 MUX; labeled signals include Sin, Cin, Cj, Xj, Yj, Sout, Cout and Cj+1.]

Figure 3: Cell-I and Cell-II used in MUX-based multiplier implementation.

Figure 4: Photomicrograph of a 16×16 MUX based multiplier.

Physical implementation and Result

The layout of the 16×16 MUX based unsigned multiplier circuit shown in Figure 4 is implemented in the 0.6 µm N-well CMOS process (SCN_SUBM, lambda = 0.3) of MOSIS, using conventional CMOS logic. A schematic library of 7 functional cells is defined for the static CMOS design style, comprising the 1-bit full adder, 2-input AND, 3-input AND, 2-input MUX, 2-input XOR, 2-input OR and 3-input OR functions. Corresponding to the schematic library, physical libraries were designed in the conventional CMOS logic design style using the design principles of [7], [8], [9], [10]. Three versions of each physical library were developed by sizing the W/L ratio of the NMOS transistor to 3, 5 and 7 respectively (W/L values smaller than 3 were also tried but not considered further, as they resulted in parasitic-dominated slower speeds due to the weak drive of the transistors and were not good candidates for high performance). The layout assemblies for



the 16-bit multiplier were carried out using these cell libraries and the automatic place and route tool L-Edit (SPR) from M/s Tanner Research Inc. The physical library with a W/L ratio of 3 for the NMOS transistor gave the smallest average switching energy-delay product. The generated layouts were simulated after parasitic extraction using the circuit simulator ELDO SPICE. The supply voltage VDD is kept at 3.3 V. Table 1 compares important parameters, namely propagation delay and power dissipation at a 20 MHz data rate, with the tree based implementation of [3]. Table 2 shows the maximum power, leakage power, transistor count, core area, total routing length and number of vias.

Table 1

| Algorithm (technology) | VDD (V) | Propagation delay (ns) | Average power (mW) |
|---|---|---|---|
| Proposed (0.6 µm) | 3.3 | 14.15 | 22.05 |
| BEWM ref [3] (1.25 µm) | 5 | 60 | 100 |

Table 2

| Algorithm (technology) | Maximum power (mW) | Leakage power (nW) | Transistor count | Core area (mm^2) | Total routing length (mm) | Number of vias |
|---|---|---|---|---|---|---|
| Proposed (0.6 µm) | 623.46 | 53.34 | 10168 | 23.76 | 1386.71 | 3452 |

Comparing the two multiplier architectures shows that, for the proposed MUX based array multiplier, the propagation delay is about 0.235 times that of the reference (14.15 ns against 60 ns) and the average power consumption is about 0.22 times that of the reference (22.05 mW against 100 mW). The maximum instantaneous power, leakage power, transistor count, core area, total routing length and number of vias are also given so that the VLSI implementation characteristics can be judged.
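
The quoted ratios follow directly from the Table 1 entries, as the short calculation below illustrates. The dictionary keys are hypothetical names; the numbers are those reported in Table 1, and the two designs differ in technology and supply voltage as listed there.

```python
# Values copied from Table 1.
proposed = {"delay_ns": 14.15, "avg_power_mw": 22.05}
bewm_ref3 = {"delay_ns": 60.0, "avg_power_mw": 100.0}

# Ratio of the proposed design to the reference implementation of [3].
delay_ratio = proposed["delay_ns"] / bewm_ref3["delay_ns"]
power_ratio = proposed["avg_power_mw"] / bewm_ref3["avg_power_mw"]

print(f"delay ratio: {delay_ratio:.4f}")
print(f"power ratio: {power_ratio:.4f}")
```

Both ratios are close to one quarter, which is the figure used in the concluding remarks.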



This paper presents a 16-bit MUX based unsigned multiplier implementation using an optimized static CMOS logic style. The multiplier algorithm performs efficient partial product generation and addition, which makes its time and space complexity better than those of other array multipliers. The simulation results, compared with a faster tree multiplier implementation, show that both the propagation delay and the average switching power are reduced to approximately 1/4.

[1] A. Hesham, Technology scaling effects on multipliers, IEEE Transactions on Computers, Vol. 47, No. 11, pp. 1201-1215, November 1998.
[2] Z. Kiamal, Multiplexer-based array multipliers, IEEE Transactions on Computers, Vol. 48, No. 1, pp. 15-23, January 1999.
[3] F. Jalil, M×N Booth encoded multiplier generator using optimized Wallace trees, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 1, No. 2, pp. 120-125, June 1993.
[4] V. Chanramouli, Self-timed design in GaAs: case study on a high-speed, parallel multiplier, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 4, No. 1, pp. 146-149, March 1996.
[5] P. Kornerup, A systolic, linear-array multiplier for a class of right-shift algorithms, IEEE Transactions on Computers, Vol. 43, No. 8, pp. 892-898, August 1994.
[6] Reto Zimmermann and Wolfgang Fichtner, Low-Power Logic Styles: CMOS Versus Pass-Transistor Logic, IEEE Journal of Solid-State Circuits, Vol. 32, No. 7, pp. 1079-1090, July 1997.
[7] Mohab Anis, Mohamed Allam and Mohamed Elmasry, Impact of Technology Scaling on CMOS Logic Styles, IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing, Vol. 49, No. 8, pp. 577-587, August 2002.
[8] S. M. Kang and Yusuf Leblebici, CMOS Digital Integrated Circuits: Analysis and Design, Third Edition, McGraw-Hill, 2003.
[9] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Addison-Wesley, 1994.
[10] Jan M. Rabaey, Anantha Chandrakasan and Borivoje Nikolic, Digital Integrated Circuits, Second Edition, Prentice-Hall of India Private Limited, 2004.