Abstract – Multipliers play an important role in today’s digital signal processing, image processing and various other applications. They are an essential part of an arithmetic logic unit for performing filtering and convolution operations. The binary multiplication of integers and floating-point numbers results in partial products that must be added to produce the final product. The addition of these partial products dominates the latency and power consumption of the multiplier, which influences the performance of the processor. Hence, to improve the performance of the processor, the partial product addition in the multiplier circuit must be fast and should consume little power and area. This can be achieved by using a compressor circuit for partial product addition. Existing compressor circuits are designed with XOR gates on the critical path of partial product addition, which increases the latency of the compressor. This paper proposes a novel method of designing a compressor/counter circuit using the symmetric bit stacking method. The symmetric bit stacking method is built from a three-bit stacking circuit, which groups the “1” bits in the input together. The 6:3 counter circuit is designed by merging 3-bit stacking circuits. Doing so eliminates the XOR gate delay on the critical path of the partial product addition, which reduces the latency of the circuit. For 64-bit and 128-bit multipliers, this compressor circuit is very effective in improving the performance of the multiplier circuit.

I INTRODUCTION

The increased level of integration brought by modern VLSI and ULSI has made it possible to integrate many components that were once considered very complex. The multiplication operation is present in many parts of digital systems and digital computers, notably in signal processing, graphics and scientific calculations. Multiplication is a basic arithmetic operation that is important in applications like digital signal processing, which rely on efficient implementation of generic arithmetic logic units (ALU) and floating point units to execute dedicated operations like convolution and filtering. The speed and power efficiency of a multiplier circuit are of critical importance to the overall performance of the microprocessor.

The binary multiplication of integers or fixed-point numbers results in partial products that must be added to produce the final product. Many methods have been presented to optimize the performance of the partial product summation, such as the well-known row compression technique in the Wallace tree or Dadda tree, or the improved architecture in the column compression technique.

In order to combine the partial products efficiently, column compression is commonly used. These methods involve using full adders functioning as counters to reduce groups of 3 bits of the same weight to 2 bits of different weight in parallel using a carry-save adder tree. Through several layers of reduction, the number of summands is reduced to two, which are then added using a conventional adder circuit.

To achieve higher efficiency, larger numbers of bits of equal weight can be considered. The basic method when dealing with larger numbers of bits is the same: bits in one column are counted, producing fewer bits of different weights. For example, a 7:3 counter circuit accepts 7 bits of equal weight and counts the number of “1” bits. This count is then output using 3 bits of increasing weight. The 7:3 and 6:3 counter circuits can be constructed using full and half adders.

Much of the delay in these counter circuits is due to the chains of XOR gates on the critical path. Therefore, many faster parallel counter architectures have been presented. A parallel 7:3 counter was presented and used to design a high speed counter-based Wallace tree multiplier. Additionally, some counter designs use multiplexers to reduce the number of XOR gates.

II RELATED WORKS

The existing compression methods use full adders functioning as counters to reduce groups of 3 bits of the same weight to 2 bits of different weight in parallel using a carry-save adder tree. Through several layers of reduction, the number of summands is reduced to two, which are then added using a conventional adder circuit.

To achieve higher efficiency, larger numbers of bits of equal weight can be considered. The basic method when dealing with larger numbers of bits is the same: the bits in one column are counted, producing fewer bits of different weights. For example, a 7:3 counter circuit accepts 7 bits of equal weight and counts the number of “1” bits. This count is then output using 3 bits of increasing weight. The 7:3 and 6:3 counter circuits can be constructed using full and half adders, as shown in Fig. 2.1.
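As a rough illustration of a counter built from full and half adders, the sketch below models one possible 6:3 arrangement in Python. The wiring (two full adders, one half adder, then a final full adder) and all function names are assumptions made for this example; the circuit in Fig. 2.1 may be arranged differently.

```python
def full_adder(a, b, c):
    """3:2 counter: three equal-weight bits -> (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def half_adder(a, b):
    """2:2 counter: two equal-weight bits -> (sum, carry)."""
    return a ^ b, a & b


def counter_6_3_adders(x):
    """Count the ones in six equal-weight bits; returns (C2, C1, S).

    Assumed wiring, for illustration only.
    """
    s0, c0 = full_adder(x[0], x[1], x[2])   # first group of three inputs
    s1, c1 = full_adder(x[3], x[4], x[5])   # second group of three inputs
    s, c2 = half_adder(s0, s1)              # weight-1 bits -> S and a weight-2 carry
    out1, out2 = full_adder(c0, c1, c2)     # weight-2 bits -> C1 (weight 2), C2 (weight 4)
    return out2, out1, s


# Four ones among the six inputs give the binary count 100.
print(counter_6_3_adders((1, 0, 1, 1, 1, 0)))   # (1, 0, 0)
```

In this model the S output passes through three XOR operations in series (two inside a full adder, one in the half adder), which is the kind of XOR chain that the faster counter designs try to avoid.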
IV PROPOSED METHOD

The proposed approach allows the fast and easy implementation of large counter-based Wallace (CBW) multipliers on FPGA as well as in ASIC. The proposed algorithm uses high-speed 7:3 and 6:3 counters, built with the symmetric bit stacking method, in the implementation of the CBW multiplier. It eliminates the full adders and half adders, which account for most of the critical path delay. The CBW multiplier requires half as many stages to perform the tree reduction as existing Wallace multipliers.

The proposed Wallace tree design uses a 6:3 counter, which is realized by first stacking all of the input bits such that the “1” bits are grouped together. After stacking the input bits, this stack can be converted into a binary count of the six input bits. Small 3-bit stacking circuits are first used to form 3-bit stacks. These 3-bit stacks are then combined to make a 6-bit stack using a symmetric technique that adds one extra layer of logic.

Fig. 4.2 Three-bit stacker circuit

4.2 MERGING STACKS:

We wish to form a 6-bit stacking circuit using the 3-bit stacking circuits. Given six inputs X0,..., X5, we first divide them into two groups of 3 bits, which are stacked using 3-bit stacking circuits. Let X0, X1, and X2 be stacked into signals named H0, H1, and H2, and X3, X4, and X5 be stacked into I0, I1, and I2. First, we reverse the outputs of the first stacker and consider the six bits H2, H1, H0, I0, I1, and I2. We notice that within these six bits, there is a train of “1” bits surrounded by “0” bits. To form a proper stack, this train of “1” bits must start from the leftmost bit.

In order to form the proper 6-bit stack, two more 3-bit vectors are formed, called J0, J1, J2 and K0, K1, K2. The idea is to fill the J vector with ones first, before filling the K vector. So we let

$J_0 = H_2 + I_0$ (4.4)
$J_1 = H_1 + I_1$ (4.5)
$J_2 = H_0 + I_2$ (4.6)

In this way, the first three “1” bits of the train are guaranteed to fill into the J bits, although they may not be properly stacked. Now, to ensure no bits are counted twice, the K bits are formed using the same inputs but with AND gates instead:

$K_0 = H_2 I_0$ (4.7)
$K_1 = H_1 I_1$ (4.8)
$K_2 = H_0 I_2$ (4.9)

If the train of “1”s is no more than three places long, then all of the K bits will be “0”, as the AND gate inputs are three positions apart. If the train is more than three places long, then some of the AND gates will have both inputs as “1”. The number of AND gates with this property is three less than the length of the train of “1”s.

We notice that J0 J1 J2 and K0 K1 K2 together still contain the same number of “1” bits as the input, but now the J bits are filled with ones before any of the K bits. We must now stack J0 J1 J2 and K0 K1 K2 using two more 3-bit stacking circuits. The outputs of these two circuits can then be concatenated to form the stack outputs Y5,..., Y0.

An example of this process is shown for an input vector containing four “1” bits in Fig. 4.3. In this example, first the H and I vectors are formed by stacking groups of three input bits. Then, the H vector is reversed, forming a continuous train of four “1” bits surrounded by zero bits. Corresponding bits are OR-ed to form the J vector, which is full of “1” bits. Corresponding bits are AND-ed to form the K vector, which finds exactly one overlap. Then, the J and K vectors are restacked to form the final 6-bit stack.

Fig. 4.3 Six-bit stacking example

4.3 CONVERTING BIT STACK TO BINARY NUMBER:

In order to implement a 6:3 counter circuit, the 6-bit stack described in section 4.2 must be converted to a binary number. For a faster and more efficient count, we can use the intermediate values H, I, and K to quickly compute each output bit without needing the bottom layer of stackers. We call the output bits C2, C1, and S, where C2 C1 S is the binary representation of the number of “1” input bits.

4.3.1 COMPUTATION OF S:

To compute S, we note that we can easily determine the parity of the outputs from the first layer of 3-bit stackers. Even parity occurs in H if zero or two “1” bits appear in X0, X1 and X2. Thus He and Ie, which indicate even parity in the H and I bits, are given by

$H_e = \overline{H_0} + H_1 \overline{H_2}$ (4.10)
$I_e = \overline{I_0} + I_1 \overline{I_2}$ (4.11)

As S indicates odd parity over all of the input bits, and because the sum of two numbers with different parities is odd, we can compute S as follows:

$S = H_e \oplus I_e$ (4.12)

Although this does incur one XOR gate delay, it is not on the critical path.

4.3.2 COMPUTATION OF C1:

To compute C1, we note that C1 = 1 when the count is 2, 3 or 6. Therefore, there are two cases.

First, we need to check whether at least two but no more than three inputs are set to 1. We can use the intermediate H, I, and K vectors for this. To check for at least two inputs, we need to see a stack of length two from either top-level stacker, or two stacks of length one, which yields $H_1 H_0 + I_1 I_0 + H_0 I_0$. To check that no more than three inputs are set, we simply need to make sure that none of the K bits are set, as the K vector is only set when more than three inputs are “1”. This gives $\overline{K_0 + K_1 + K_2}$.

Second, we need to check whether all six inputs are set to “1”. We can do this by checking that all three bits of both H and I are set. As these are bit stacks, we simply need to check the rightmost bit of each stack for this case, which yields

$C_1 = (H_1 H_0 + I_1 I_0 + H_0 I_0)\,\overline{(K_0 + K_1 + K_2)} + H_2 I_2$ (4.13)

4.3.3 COMPUTATION OF C2:

C2 is easy to calculate, as it should be set whenever at least 4 bits are set:

$C_2 = K_0 + K_1 + K_2$ (4.14)

Using equations 4.12, 4.13 and 4.14, the final 6:3 counter circuit can be constructed as shown in Fig. 4.4.

Fig. 4.4 A 6:3 counter based on symmetric stacking
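The stacking and counting steps of this section can be checked with a small bit-level model. The Python sketch below is only an illustration. The 3-bit stacker equations (at least one, at least two, all three inputs set) are an assumption, since they are not given in the text above. The function names are hypothetical, and NOT is modelled as `1 - b`. Index 0 of each returned stack holds the first stacked “1”.

```python
def inv(b):
    """Logical NOT of a single bit."""
    return 1 - b


def stack3(x0, x1, x2):
    """3-bit stacker (assumed equations): same number of ones, grouped first."""
    y0 = x0 | x1 | x2                        # at least one input is 1
    y1 = (x0 & x1) | (x0 & x2) | (x1 & x2)   # at least two inputs are 1
    y2 = x0 & x1 & x2                        # all three inputs are 1
    return y0, y1, y2


def stack6(x):
    """6-bit stack formed by merging two 3-bit stacks (eqs. 4.4 to 4.9)."""
    h0, h1, h2 = stack3(x[0], x[1], x[2])
    i0, i1, i2 = stack3(x[3], x[4], x[5])
    j = (h2 | i0, h1 | i1, h0 | i2)   # J fills first: reversed H OR-ed with I
    k = (h2 & i0, h1 & i1, h0 & i2)   # K holds the ones beyond the first three
    return stack3(*j) + stack3(*k)    # restack J and K, then concatenate


def counter_6_3(x):
    """Return (C2, C1, S) from the H, I and K values (eqs. 4.10 to 4.14)."""
    h0, h1, h2 = stack3(x[0], x[1], x[2])
    i0, i1, i2 = stack3(x[3], x[4], x[5])
    k_any = (h2 & i0) | (h1 & i1) | (h0 & i2)    # K0 + K1 + K2
    h_even = inv(h0) | (h1 & inv(h2))            # eq. 4.10
    i_even = inv(i0) | (i1 & inv(i2))            # eq. 4.11
    s = h_even ^ i_even                          # eq. 4.12
    at_least_two = (h1 & h0) | (i1 & i0) | (h0 & i0)
    c1 = (at_least_two & inv(k_any)) | (h2 & i2)  # eq. 4.13
    c2 = k_any                                    # eq. 4.14
    return c2, c1, s


# An input with four ones, similar in spirit to the Fig. 4.3 example.
x = (1, 0, 1, 1, 1, 0)
print(stack6(x))       # (1, 1, 1, 1, 0, 0)
print(counter_6_3(x))  # (1, 0, 0), i.e. a count of four
```

Looping over all 64 input combinations and comparing `4*C2 + 2*C1 + S` with the number of ones is a quick way to confirm that equations 4.10 to 4.14 agree with a plain count.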
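To show where a 6:3 counter sits in the partial product reduction of a counter-based Wallace multiplier, the following Python sketch multiplies two unsigned integers by repeatedly counting the bits in each column. It is a behavioural model under stated assumptions, not the reduction schedule of the proposed design. The counter is modelled by simply counting ones, columns are split into groups of up to six bits, a group of three is treated like a full adder, and groups of fewer than three bits are passed through. All names are hypothetical.

```python
from collections import defaultdict


def count_column(bits):
    """Behavioural counter: up to six equal-weight bits -> (C2, C1, S)."""
    n = sum(bits)
    return (n >> 2) & 1, (n >> 1) & 1, n & 1


def multiply_by_column_compression(a, b, width=6):
    """Multiply two unsigned `width`-bit integers using column counters."""
    # columns[w] holds every bit of weight 2**w that still has to be added
    columns = defaultdict(list)
    for i in range(width):
        for j in range(width):
            columns[i + j].append((a >> i) & (b >> j) & 1)  # partial product bit

    # Reduce until every column holds at most two bits (two summands remain)
    while any(len(col) > 2 for col in columns.values()):
        reduced = defaultdict(list)
        for w, col in columns.items():
            for start in range(0, len(col), 6):
                group = col[start:start + 6]
                if len(group) < 3:             # too small to compress
                    reduced[w].extend(group)
                    continue
                c2, c1, s = count_column(group)
                reduced[w].append(s)           # weight 2**w
                reduced[w + 1].append(c1)      # weight 2**(w + 1)
                if len(group) > 3:             # C2 can only be 1 for 4+ bits
                    reduced[w + 2].append(c2)  # weight 2**(w + 2)
        columns = reduced

    # Final addition of the two remaining rows (a conventional adder in hardware)
    return sum(bit << w for w, col in columns.items() for bit in col)


print(multiply_by_column_compression(45, 27))  # 1215
```

Each counter replaces its input bits with output bits of the same total weight, so the weighted sum of all columns is unchanged by a reduction layer. That invariant is why the final two-row addition yields the product.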
4.4 6:3 COUNTER SIMULATION:

The proposed 6:3 counter design was built as a standard CMOS design and simulated using Spectre, using the ON Semiconductor C5 0.5-μm process (formerly AMI06). For comparison, a 6:3 counter design was implemented using standard CMOS full adders as in Fig. 1.1. The parallel counter design was converted to a 6:3 counter and simulated as well. It has a critical path delay of three XOR gate delays plus two basic gate delays. The mux-based counter design was also simulated. It has a critical path delay of one XOR gate delay plus three multiplexer delays. Two of the muxes on the critical path can be implemented with transmission gate logic, which is slightly faster. The proposed 6:3 counter has no XOR gates or muxes on its critical path. It has a critical path delay of seven basic gates.

V RESULTS AND DISCUSSION

The counter-based Wallace tree design implemented using the bit stacking compression technique is simulated using ISE Design Suite 14.7. The inputs are forced in the ISim simulator and the outputs are observed.

5.1 THREE BIT STACKING OUTPUT:

The “1” bits in the input are grouped to the left side in the output.
5.4 COUNTER BASED WALLACE DESIGN OUTPUT:

A 6-bit multiplicand is multiplied by a 6-bit multiplier and a 10-bit product is obtained. The partial product accumulation is done with the help of the bit-stacking 6:3 compressor circuit.

5.8 CBW DELAY:
XOR gates and multiplexers on the critical path of product calculation. The 64-bit and 128-bit counter-based Wallace tree multipliers built using the proposed 6:3 counters outperform both the standard Wallace tree implementation as well as multipliers built using existing 7:3 counters.

This Wallace tree multiplier designed using the bit stacking compression technique can be further used for designing filters for image processing and digital signal processing to achieve improved performance in terms of speed and power consumption. It can also be used in the arithmetic and logic unit of digital systems.