
Counter Based Wallace Design Implemented Using Symmetric Bit Stacking Compression Technique
Kaushika Sree A
Master of Engineering, VLSI Design
Department of ECE
Government College of Technology
Coimbatore, India
sreegct95@gmail.com

Saraniya O
Assistant Professor
Department of ECE
Government College of Technology
Coimbatore, India
saranya@gct.ac.in

Abstract – Multipliers play an important role in today’s digital signal processing, image processing and various other applications. They are an essential part of an arithmetic logic unit for performing filtering and convolution operations. The binary multiplication of integers and floating point numbers results in partial products that must be added to produce the final product. The addition of these partial products dominates the latency and power consumption of the multiplier, which influences the performance of the processor. Hence, in order to improve the performance of the processor, the partial product addition in the multiplier circuit must be fast and should consume less power and area. This can be achieved by using a compressor circuit for partial product addition. The existing compressor circuits are designed using XOR gates in the critical path of partial product addition, which increases the latency of the compressor. This paper proposes a novel method of designing a compressor/counter circuit using the symmetric bit stacking method. The symmetric bit stacking method is built on the three-bit stacking circuit, which groups the “1” bits in the input together. The 6:3 counter circuit is designed by merging 3-bit stacking circuits. By doing so, we can eliminate the XOR gate delay in the critical path of the partial product addition, which results in reduced latency of the circuit. For 64-bit and 128-bit multipliers this compressor circuit is very effective in improving the performance of the multiplier circuit.

I INTRODUCTION

The increased level of integration brought by modern VLSI and ULSI has made possible the integration of many components that were once considered very complex. The multiplication operation is present in many parts of digital systems and digital computers, notably in signal processing, graphics and scientific calculations. Multiplication is a basic arithmetic operation, important in applications like digital signal processing which rely on efficient implementation of generic arithmetic logic units (ALU) and floating point units to execute dedicated operations like convolution and filtering. The speed and power efficiency of a multiplier circuit are of critical importance to the overall performance of the microprocessor.

The binary multiplication of integers or fixed-point numbers results in partial products that must be added to produce the final product. Many methods have been presented to optimize the performance of the partial product summation, such as the well-known row compression technique in the Wallace tree or Dadda tree, or the improved architecture in the column compression technique.

In order to combine the partial products efficiently, column compression is commonly used. These methods involve using full adders functioning as counters to reduce groups of 3 bits of the same weight to 2 bits of different weight in parallel using a carry-save adder tree. Through several layers of reduction, the number of summands is reduced to two, which are then added using a conventional adder circuit.

To achieve higher efficiency, larger numbers of bits of equal weight can be considered. The basic method when dealing with larger numbers of bits is the same: bits in one column are counted, producing fewer bits of different weights. For example, a 7:3 counter circuit accepts 7 bits of equal weight and counts the number of “1” bits. This count is then output using 3 bits of increasing weight. The 7:3 and 6:3 counter circuits can be constructed using full and half adders.

Much of the delay in these counter circuits is due to the chains of XOR gates on the critical path. Therefore, many faster parallel counter architectures have been presented. A parallel 7:3 counter was presented and used to design a high speed counter-based Wallace tree multiplier. Additionally, some counter designs use multiplexers to reduce the number of XOR gates.

II RELATED WORKS

The existing compression methods involve using full adders functioning as counters to reduce groups of 3 bits of the same weight to 2 bits of different weight in parallel using a carry-save adder tree. Through several layers of reduction, the number of summands is reduced to two, which are then added using a conventional adder circuit.

To achieve higher efficiency, a larger number of bits of equal weight can be considered. The basic method when dealing with a larger number of bits is the same: the bits in one column are counted, producing fewer bits of different weights. For example, a 7:3 counter circuit accepts 7 bits of equal weight and counts the number of “1” bits. This count is then output using 3 bits of increasing weight. The 7:3 and 6:3 counter circuits can be constructed using full and half adders, as shown in Fig. 2.1.



Fig 2.1 A 7:3 counter and a 6:3 counter built from full and half adders

Much of the delay in these counter circuits is due to the chains of XOR gates on the critical path. Therefore, many faster parallel counter architectures have been presented. Some counter designs use multiplexers to reduce the number of XOR gates on the critical path, and some of these muxes can be implemented with transmission gate logic to produce an even faster design.
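As an illustration of the adder-based counters discussed in this section, the following Python sketch models a 7:3 and a 6:3 counter at the bit level. The wiring is an assumption: it is one common arrangement of full and half adders and is not claimed to reproduce Fig. 2.1, and the function names are hypothetical. Note that every sum output passes through XOR gates, which is the delay source described above.

```python
def half_adder(a, b):
    """Return (sum, carry) for two bits of equal weight."""
    return a ^ b, a & b


def full_adder(a, b, c):
    """Return (sum, carry); the sum output sits behind two chained XORs."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def counter_7_3_adders(x):
    """Count the ones among 7 equal-weight bits; returns (C2, C1, S).

    Assumed wiring: two full adders on the inputs, a third on their
    sums plus the last input, and a fourth on the three carries.
    """
    s0, c0 = full_adder(x[0], x[1], x[2])
    s1, c1 = full_adder(x[3], x[4], x[5])
    s, c2 = full_adder(s0, s1, x[6])        # weight-1 output S
    c1_out, c2_out = full_adder(c0, c1, c2)  # weight-2 and weight-4 outputs
    return c2_out, c1_out, s


def counter_6_3_adders(x):
    """Same idea for 6 inputs; a half adder replaces the third full adder."""
    s0, c0 = full_adder(x[0], x[1], x[2])
    s1, c1 = full_adder(x[3], x[4], x[5])
    s, c2 = half_adder(s0, s1)
    c1_out, c2_out = full_adder(c0, c1, c2)
    return c2_out, c1_out, s


# Five of the seven inputs are one, so the output is binary 101.
print(counter_7_3_adders((1, 1, 0, 1, 0, 1, 1)))  # (1, 0, 1)
```

In this arrangement the S output of the 7:3 counter depends on four XOR gates in series, which is the kind of XOR chain the faster counter designs try to avoid.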
III WALLACE MULTIPLIER
A Wallace tree is an efficient hardware implementation of a digital circuit that multiplies two integers. The benefit of the Wallace tree is that there are only O(log n) reduction layers, and each layer has O(1) propagation delay. As making the partial products is O(1) and the final addition is O(log n), the multiplication is only O(log n), not much slower than addition (however, much more expensive in gate count). Multipliers based on the Wallace reduction tree provide an area-efficient strategy for high speed multiplication. This paper proposed a reduced-area Wallace multiplier without compromising on the speed of the original Wallace multiplier. Synthesis results show that the proposed multiplier has the lowest area as compared to other tree-based multipliers. The speed of the proposed and reference multipliers is almost the same.

IV PROPOSED METHOD
The proposed approach allows the fast and easy implementation of large counter-based Wallace (CBW) multipliers on FPGA as well as in ASIC. The proposed algorithm uses high speed 7:3 and 6:3 counters, built with the symmetric bit stacking method, in the implementation of the CBW multiplier. It eliminates the full adders and half adders, which account for the major part of the critical path delay. The CBW requires half the stages to perform the tree reduction as compared to the existing Wallace multipliers.
The proposed Wallace tree design uses a 6:3 counter which is realized by first stacking all of the input bits such that the “1” bits are grouped together. After stacking the input bits, this stack can be converted into a binary count to output the count of the six input bits. Small 3-bit stacking circuits are first used to form 3-bit stacks. These 3-bit stacks are then combined to make a 6-bit stack using a symmetric technique that adds one extra layer of logic.

Fig 4.1 Flowchart of Proposed method

4.1 THREE BIT STACKING CIRCUIT:
Given the inputs X0, X1 and X2, a 3-bit stacker circuit has three outputs Y0, Y1 and Y2 such that the number of “1” bits in the outputs is the same as the number of “1” bits in the inputs, but the “1” bits are grouped together to the left, followed by the “0” bits. With + denoting logical OR and juxtaposition denoting logical AND, the outputs are formed by
$Y_0 = X_0 + X_1 + X_2$  (4.1)
$Y_1 = X_0 X_1 + X_1 X_2 + X_2 X_0$  (4.2)
$Y_2 = X_0 X_1 X_2$  (4.3)
The first output is “1” if at least one of the inputs is one, the second output is “1” if at least two of the inputs are one, and the last output is “1” only if all three inputs are one. The Y1 output is a majority function and can be implemented using one complex CMOS gate. The 3-bit stacker circuit is shown in Fig. 4.2.

Fig. 4.2 Three-bit stacker circuit

4.2 MERGING STACKS:
We wish to form a 6-bit stacking circuit using the 3-bit stacking circuits. Given six inputs X0, ..., X5, we first divide them into two groups of 3 bits which are stacked

using 3-bit stacking circuits. Let X0, X1, and X2 be stacked into signals named H0, H1, and H2, and X3, X4, and X5 be stacked into I0, I1, and I2. First, we reverse the outputs of the first stacker and consider the six bits H2, H1, H0, I0, I1, and I2. We notice that within these six bits, there is a train of “1” bits surrounded by “0” bits. To form a proper stack, this train of “1” bits must start from the leftmost bit.
In order to form the proper 6-bit stack, two more 3-bit vectors are formed, called J0, J1, J2 and K0, K1, K2. The idea is to fill the J vector with ones first, before filling the K vector. So we let
$J_0 = H_2 + I_0$  (4.4)
$J_1 = H_1 + I_1$  (4.5)
$J_2 = H_0 + I_2$  (4.6)
In this way, the first three “1” bits of the train are guaranteed to fill the J bits, although they may not be properly stacked. Now, to ensure no bits are counted twice, the K bits are formed using the same inputs but with AND gates instead:
$K_0 = H_2 I_0$  (4.7)
$K_1 = H_1 I_1$  (4.8)
$K_2 = H_0 I_2$  (4.9)
If the train of “1”s is no more than three places long, then all of the K bits will be “0”, as the AND gate inputs are three positions apart. If the train is longer than three places, then some of the AND gates will have both inputs as “1”. The number of AND gates with this property is three less than the length of the train of “1”s.
We notice that J0 J1 J2 and K0 K1 K2 together still contain the same number of “1” bits as the input, but now the J bits are filled with ones before any of the K bits. We must now stack J0 J1 J2 and K0 K1 K2 using two more 3-bit stacking circuits. The outputs of these two circuits can then be concatenated to form the stack outputs Y5, ..., Y0.
An example of this process is shown for an input vector containing four “1” bits in Fig. 4.3. In this example, first the H and I vectors are formed by stacking groups of three input bits. Then, the H vector is reversed, forming a continuous train of four “1” bits surrounded by zero bits. Corresponding bits are OR-ed to form the J vector, which is full of “1” bits. Corresponding bits are AND-ed to form the K vector, which finds exactly one overlap. Then, the J and K vectors are restacked to form the final 6-bit stack.

Fig 4.3 Six Bit Stacking Example

4.3 CONVERTING BIT STACK TO BINARY NUMBER:
In order to implement a 6:3 counter circuit, the 6-bit stack described in Section 4.2 must be converted to a binary number. For a faster and more efficient count, we can use the intermediate values H, I, and K to quickly compute each output bit without needing the bottom layer of stackers. We call the output bits C2, C1, and S, where C2 C1 S is the binary representation of the number of “1” input bits.
4.3.1 COMPUTATION OF S:
To compute S, we note that we can easily determine the parity of the outputs from the first layer of 3-bit stackers. Even parity occurs in H if zero or two “1” bits appear in X0, X1 and X2. Thus He and Ie, which indicate even parity in the H and I bits, are given by
$H_e = \overline{H_0} + H_1 \overline{H_2}$  (4.10)
$I_e = \overline{I_0} + I_1 \overline{I_2}$  (4.11)
As S indicates odd parity over all of the input bits, and because the sum of two numbers with different parities is odd, we can compute S as follows:
$S = H_e \oplus I_e$  (4.12)
Although this does incur one XOR gate delay, it is not on the critical path.
4.3.2 COMPUTATION OF C1:
To compute C1, we note that C1 = 1 when the count is 2, 3 or 6. Therefore, there are two cases.
First, we need to check whether at least two but no more than three inputs are set to 1. We can use the intermediate H, I, and K vectors for this. To check for at least two inputs, we need to see a stack of length two from either top-level stacker, or two stacks of length one, which yields $H_1 H_0 + I_1 I_0 + H_0 I_0$. To check that no more than three inputs are set, we simply need to make sure that none of the K bits are set, as the K vector is only set when more than three inputs are “1”. This gives $\overline{K_0 + K_1 + K_2}$.
Second, we need to check whether all six inputs are set to “1”. We can check this by checking that all three of both the H and I bits are set. As these are bit stacks, we simply need to check the rightmost bit in each stack for this case, which yields
$C_1 = (H_1 H_0 + I_1 I_0 + H_0 I_0)\,\overline{(K_0 + K_1 + K_2)} + H_2 I_2$  (4.13)
4.3.3 COMPUTATION OF C2:
We can easily calculate C2, as it should be set whenever at least 4 bits are set:
$C_2 = K_0 + K_1 + K_2$  (4.14)
Using Equations 4.12, 4.13 and 4.14, the final 6:3 counter circuit can be constructed as shown in Fig 4.4.

Fig 4.4 A 6:3 counter based on symmetric stacking
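The stacking and counting steps of Sections 4.1 to 4.3 can be checked with a short bit-level Python model. This is a minimal sketch, not the circuit itself: the function names are hypothetical, and the complemented terms in Equations 4.10, 4.11 and 4.13 are taken from the accompanying prose ("zero or two ones" for even parity, "none of the K bits are set" for a count of at most three). The example input is an assumed vector with four "1" bits and is not necessarily the one drawn in Fig. 4.3.

```python
from itertools import product


def stack3(x0, x1, x2):
    """3-bit stacker, Eqs. 4.1-4.3: ones are grouped to the left."""
    y0 = x0 | x1 | x2                        # at least one input set
    y1 = (x0 & x1) | (x1 & x2) | (x2 & x0)   # majority: at least two set
    y2 = x0 & x1 & x2                        # all three set
    return y0, y1, y2


def merge(h, i):
    """Eqs. 4.4-4.9: OR and AND of the reversed H stack with the I stack."""
    j = (h[2] | i[0], h[1] | i[1], h[0] | i[2])
    k = (h[2] & i[0], h[1] & i[1], h[0] & i[2])
    return j, k


def stack6(x):
    """6-bit stacker: two stackers, a merge layer, then two more stackers."""
    h, i = stack3(*x[:3]), stack3(*x[3:])
    j, k = merge(h, i)
    return stack3(*j) + stack3(*k)


def counter_6_3_stacked(x):
    """6:3 counter from H, I and K only (Eqs. 4.10-4.14); returns (C2, C1, S)."""
    h, i = stack3(*x[:3]), stack3(*x[3:])
    _, k = merge(h, i)
    he = (h[0] ^ 1) | (h[1] & (h[2] ^ 1))    # even parity in H
    ie = (i[0] ^ 1) | (i[1] & (i[2] ^ 1))    # even parity in I
    s = he ^ ie                              # odd overall parity
    any_k = k[0] | k[1] | k[2]               # set when four or more inputs are one
    two_or_more = (h[1] & h[0]) | (i[1] & i[0]) | (h[0] & i[0])
    c1 = (two_or_more & (any_k ^ 1)) | (h[2] & i[2])
    return any_k, c1, s


x = (1, 0, 1, 0, 1, 1)             # assumed example with four ones
print(stack6(x))                   # (1, 1, 1, 1, 0, 0)
print(counter_6_3_stacked(x))      # (1, 0, 0), i.e. a count of 4

# Exhaustive check over all 64 input vectors.
for bits in product((0, 1), repeat=6):
    n = sum(bits)
    assert stack6(bits) == (1,) * n + (0,) * (6 - n)
    c2, c1, s = counter_6_3_stacked(bits)
    assert 4 * c2 + 2 * c1 + s == n
```

For the example input, H and I are both 110, the J vector becomes 111 and the K vector has a single overlap bit, which matches the behaviour described for a train of four "1" bits.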

4.4 6:3 COUNTER SIMULATION:
The proposed 6:3 counter design was built as a standard CMOS design and simulated using Spectre, using the ON Semiconductor C5 0.5-µm process (formerly AMI06). For comparison, a 6:3 counter design was implemented using standard CMOS full adders as in Fig. 2.1. The parallel counter design was converted to a 6:3 counter and simulated as well. It has a critical path delay of three XOR gates plus two basic gates. The mux-based counter design was also simulated. It has a critical path delay of one XOR gate plus three multiplexers. Two of the muxes on the critical path can be implemented with transmission gate logic, which is slightly faster. The proposed 6:3 counter has no XOR gates or muxes on its critical path. It has a critical path delay of seven basic gates.

4.5 COUNTER BASED WALLACE DESIGN:
A multiplier circuit with 6-bit inputs is designed using the 6:3 compressor circuit. Multiplier circuits of different sizes were constructed using different internal counters. No new multiplier design is proposed; rather, existing architectures are simulated with different internal counters. For reference, a standard Wallace tree was implemented for each size. Then, the counter-based Wallace (CBW) tree, which achieves the fewest reduction phases, was used. The internal 6:3 counters used for this CBW multiplier were varied.
First, each multiplier bit is multiplied with each multiplicand bit to produce the partial products. The partial products are then accumulated to form the final product. For this accumulation of partial products, the designed bit-stacking-based 6:3 counter is used. The outputs obtained from the 6:3 counters are further accumulated using half adder and full adder circuits to form the final product. Standard CMOS implementations were used for the full and half adders. An example of a CBW multiplier reduction tree that uses up to 6:3 counters for 6-bit inputs is shown in Fig. 4.5.

Fig 4.5 Wallace tree multiplication Example

Here, the first product bit P0 is obtained directly from pp00, and the second product bit P1 is obtained using a half adder that adds pp10 and pp01. The product bits P2, P3, P4, P5, P6, P7, and P8 are obtained using the bit-stacking-based 6:3 counter, and its outputs and the product bit P9 are resolved with the help of full adder circuits. The final product bit P10 is obtained using a half adder circuit. Hence, the number of XOR gate delays is reduced by using the bit-stacking 6:3 counter design.

V RESULTS AND DISCUSSION
The counter based Wallace tree design implemented using the bit stacking compression technique is simulated using ISE Design Suite 14.7. The inputs are forced in the ISim simulator and the outputs are observed.
5.1 THREE BIT STACKING OUTPUT:
The “1” bits in the input are grouped to the left side in the output.

Fig 5.1 Three bit stacking output
5.2 MERGING STACK CIRCUIT OUTPUT:
The six-bit stack is formed by combining the 3-bit stacking circuits. The first layer of the output groups the “1” bits to the centre with the “0” bits outwards, represented by the bits H2, H1, H0, I0, I1 and I2. In the second layer, the “1” bits are grouped to the left side, represented by the bits y0, y1, y2, y3, y4, and y5.

Fig 5.2 Merge Stacking output

5.3 6:3 COUNTER/COMPRESSOR OUTPUT:
The bit-stacking output is converted into a binary number that represents the number of “1” bits in the input, using the output bits C2, C1, and S.

Fig 5.3 6:3 Counter/ Compressor output
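The partial product accumulation of Section 4.5, whose simulated outputs are reported in this section, can be sketched behaviourally in Python. This is a sketch under stated assumptions rather than the reduction tree of Fig. 4.5: the grouping policy (take up to six bits of a column at a time, pass one or two leftover bits through) is chosen here for simplicity, a plain population count stands in for the counter hardware, and all function names are hypothetical.

```python
def partial_product_columns(a, b, n=6):
    """Column w collects every partial product bit a_i AND b_j with i + j == w."""
    cols = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    return cols


def reduce_columns(cols):
    """Apply counters column by column until every column holds at most 2 bits.

    Groups of 4 to 6 bits model a 6:3 counter (unused inputs tied to 0),
    a group of 3 models a full adder. Returns the reduced columns and
    the number of reduction stages used.
    """
    stages = 0
    while max(len(c) for c in cols) > 2:
        nxt = [[] for _ in range(len(cols) + 2)]
        for w, bits in enumerate(cols):
            pos = 0
            while len(bits) - pos >= 3:
                group = bits[pos:pos + 6]
                pos += len(group)
                count = sum(group)            # stand-in for the counter circuit
                nxt[w].append(count & 1)               # S, same weight
                nxt[w + 1].append((count >> 1) & 1)    # C1, next weight
                if len(group) > 3:
                    nxt[w + 2].append((count >> 2) & 1)  # C2, two weights up
            nxt[w].extend(bits[pos:])         # one or two leftover bits pass through
        cols = nxt
        stages += 1
    return cols, stages


def cbw_multiply(a, b, n=6):
    """Multiply two n-bit unsigned integers via column compression."""
    cols, _ = reduce_columns(partial_product_columns(a, b, n))
    # The remaining two rows are summed, standing in for the final adder.
    return sum(sum(bits) << w for w, bits in enumerate(cols))


print(cbw_multiply(45, 27) == 45 * 27)  # True
```

Each counter replaces at least three bits with fewer bits, so the total number of bits shrinks at every stage and the loop terminates once all columns are down to two rows.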

5.4 COUNTER BASED WALLACE DESIGN OUTPUT:
A 6 bit multiplier is multiplied with a 6 bit multiplicand and the product of 10 bit is obtained. The partial product accumulation is done with the help of the bit-stacking 6:3 compressor circuit.

Fig 5.4 Counter based Wallace tree Multiplier Output

5.5 COUNTER BASED WALLACE TREE RTL SCHEMATIC:

Fig 5.5 CBW RTL SCHEMATIC

5.6 CBW AREA:

Fig 5.6 CBW Area

5.7 CBW POWER:

Fig 5.7 CBW POWER

5.8 CBW DELAY:

Fig 5.8 CBW DELAY

5.9 PERFORMANCE ANALYSIS & INFERENCE:
The parameters such as number of transistors, delay and power are estimated using Cadence software and compared with the predetermined values of other counter based multipliers.

Design               Latency      Avg. power   Transistors/Cells
CMOS full adders     2.9 ns       124 µW       102
Parallel counters    2.2 ns       181 µW       158
Mux-based            1.8 ns       158 µW       112
Proposed             273.50 ps    26.16 µW     111

Table 5.1 Performance Analysis table

From the Performance Analysis table above, it is inferred that the delay of the proposed method (273.50 ps) is reduced compared to the other compression methods (1.8 ns to 2.9 ns). The average power of the proposed method (26.16 µW) is also greatly reduced compared to the existing methods (124 µW to 181 µW). The number of transistors used in the proposed system is lower than in the multiplexer-based and parallel counters, but higher than in the CMOS full adder based compressor. There is always a trade-off between area, power and delay. The increase in the number of transistors does not affect the overall performance of the multiplier circuit. Hence the proposed system is better in terms of power and delay reduction.

VI CONCLUSION AND FUTURE WORK

In this project, a Wallace multiplier design implemented using a binary counter based on a novel symmetric bit stacking approach is proposed. It is shown that this counting method can be used to implement 6:3 and 7:3 counters, which can be used in any binary multiplier circuit to add the partial products. The 6:3 counters implemented with this bit stacking technique achieve higher speed than other higher order counter designs while reducing power consumption. This is due to the lack of

XOR gates and multiplexers on the critical path of product calculation. The 64-bit and 128-bit counter-based Wallace tree multipliers built using the proposed 6:3 counters outperform both the standard Wallace tree implementation and multipliers built using existing 7:3 counters.
This Wallace tree multiplier designed using the bit stacking compression technique can further be used for designing filters for image processing and digital signal processing, to achieve improved performance in terms of speed and power consumption. It can also be used in the arithmetic and logic unit of digital systems.

REFERENCE

[1] C. Fritz and A. T. Fam, “Fast Binary Counter Based on Symmetric Stacking,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 10, pp. 2971–2975, July 2017.
[2] L. Dadda, “Some schemes for parallel multipliers,” Alta Freq., vol. 34, pp. 349–356, May 1965.
[3] Z. Wang, G. A. Jullien, and W. C. Miller, “A new design technique for column compression multipliers,” IEEE Trans. Comput., vol. 44, no. 8, pp. 962–970, Aug. 1995.
[4] M. Mehta, V. Parmar, and E. Swartzlander, “High-speed multiplier design using multi-input counter and compressor circuits,” in Proc. 10th IEEE Symp. Comput. Arithmetic, Jun. 1991, pp. 43–50.
[5] S. Asif and Y. Kong, “Design of an algorithmic Wallace multiplier using high speed counters,” in Proc. IEEE Comput. Eng. Syst. (ICCES), Dec. 2015, pp. 133–138.
[6] S. Veeramachaneni, L. Avinash, M. Krishna, and M. B. Srinivas, “Novel architectures for efficient (m, n) parallel counters,” in Proc. 17th ACM Great Lakes Symp. VLSI, 2007, pp. 188–191.
[7] S. Veeramachaneni, K. M. Krishna, L. Avinash, S. R. Puppala, and M. B. Srinivas, “Novel architectures for high-speed and low-power 3-2, 4-2 and 5-2 compressors,” in Proc. 20th Int. Conf. VLSI Design Held Jointly 6th Int. Conf. Embedded Syst. (VLSID), Jan. 2007, pp. 324–329.
[8] V. G. Oklobdzija, D. Villeger, and S. S. Liu, “A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach,” IEEE Trans. Comput., vol. 45, no. 3, pp. 294–306, Mar. 1996.
[9] S. Asif and Y. Kong, “Analysis of different architectures of counter based Wallace multipliers,” in Proc. 10th Int. Conf. Comput. Eng. Syst. (ICCES), Dec. 2015, pp. 139–144.
[10] J. Gu and C.-H. Chang, “Low voltage, low power (5:2) compressor cell for fast arithmetic circuits,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), vol. 2, Apr. 2003, pp. 661–664.
