Professional Documents
Culture Documents
Introduction: Multiplication is a very common and important operation Proposed circuit: A block diagram of the proposed 5-2 compressor is
in general purpose microprocessors and digital signal processors. Its shown in Fig. 2a. It consists of six XOR and three MUX blocks,
speed is often a determining factor of how fast the processors can whose generalised schematics are shown in Figs. 2b and c, respectively.
operate [1]. A fast multiplication process usually uses an array multiplier This compressor is characterised by (4)–(7), similar to the conventional
and consists of three stages: partial product generation, partial product one. The functionality of the full adder was decomposed into two XOR
reduction and final carry-propagating addition [2]. In large multipliers, gates and a 2-1 MUX and these basic cells were rearranged to achieve a
the second stage usually contributes the most in delay, power and smaller critical path. Consequently, this structure requires four gate
area, therefore improvement in this field is deemed essential. delays to produce the Sum and Carry outputs.
The partial product reduction can be efficiently implemented with
compressor trees, which commonly use 4-2 and 5-2 compressors as X1 X2 X3 X4 Cin1 X5 Cin2
X
building blocks. Improving the compressor cell directly affects the
overall performance of the multiplier. A 5-2 compressor was initially XOR XOR XOR Y
used in [2] for the design of a high performance 16 × 16-bit multiplier,
X
however specific transistor-level implementation was not provided. XY+XY
Several follow-up publications proposed new designs for the 5-2 com- MUX XOR
Y
pressor with different structures and performance trade-offs [3–8]. X
Cout1
In this Letter, we propose a new low-power and high-performance
CMOS 5-2 compressor with a low transistor count. A comparative b
MUX XOR
study is also conducted to evaluate the new circuit against prior works. S
Cout2 VDD
A
5-2 Compressor functionality: The 5-2 compressor receives seven XOR MUX S SA+SB
inputs of equal weight, including five primary inputs X1–5 and two
B
inputs Cin1-2 that receive their values from the neighboring compressor
of 1 binary bit order lower. It generates an output Sum of the same Sum Carry
S
weight as the inputs and three outputs Cout1-2, Carry weighted 1 a c
binary bit order higher. Its functionality is described by the following
binary arithmetic equation: Fig. 2 Architecture of the proposed 5-2 compressor
a Block diagram
X1 + X2 + X3 + X4 + X5 + Cin1 + Cin2 = b Schematic of 2-input XOR cell
(1) c Schematic of 2-1 MUX cell
Sum + 2 × (Cout1 + Cout2 + Carry)
Fig. 1 shows a conventional implementation of the 5-2 compressor using The complete transistor-level schematic of the proposed 5-2 compres-
a cascade of three full adders. In its general form, a full adder receives sor is shown in Fig. 3. It uses full-swing transmission gate logic and
three inputs (a, b, c) and generates two outputs (SFA, CFA), which can be static inverters so that no threshold drops occur in intermediate nodes
expressed by the following Boolean equations: and output nodes are fully restored. These traits make it suitable for
voltage/process scaling and cascading. Furthermore, unlike previous
SFA = a ⊕ b ⊕ c (2) designs, this one completely avoids the use of complex static CMOS
gates and dual rail XOR-XNOR logic, which require more transistors
for their implementation. As a result, it comprises a total of 58 transis-
CFA = (a ⊕ b) · c + (a ⊕ b)′ · a (3) tors, which is the lowest reported for a 5-2 compressor, to the best of the
authors’ knowledge. Fig. 4 illustrates a layout implementation of the
Consecutively, by applying (2) and (3) to the 5-2 compressor of proposed circuit using a 32 nm CMOS n-well process. It utilises three
Fig. 1, we obtain the following set of Boolean equations: metal levels and yields a total area of 5.6 μm2.
Sum = X1 ⊕ X2 ⊕ X3 ⊕ X4 ⊕ X5 ⊕ Cin1 ⊕ Cin2 (4) Simulation setup: The proposed 5-2 compressor was implemented on
transistor level along with several designs from the relevant literature
Cout1 = (X1 ⊕ X2 ) · X3 + (X1 ⊕ X2 )′ · X1 (5) [3–8], for the sake of comparison. In case that an examined work pro-
posed several new designs, we chose the one that was deemed more effi-
cient by its authors. A 32 nm predictive technology model was used for
Cout2 = (X1 ⊕ X2 ⊕ X3 ⊕ X4 ) · Cin1 the simulations, incorporating high-k/metal gate and stress effect [9].
(6) Post-layout simulations were also performed for our circuit, including
+ (X1 ⊕ X2 ⊕ X3 ⊕ X4 )′ · X4
the parasitic capacitances extracted from the layout.
The same sizing methodology was used for all circuits to ensure fair
and unbiased comparison. All transistors have the minimum length (Ln,
Carry = (X1 ⊕ X2 ⊕ X3 ⊕ X4 ⊕ X5 ⊕ Cin1 ) · Cin2 Lp) of 30 nm. Regarding transistor width, transmission gates use equally
(7)
+ (X1 ⊕ X2 ⊕ X3 ⊕ X4 ⊕ X5 ⊕ Cin1 )′ · X5 sized transistors (Wn = Wp = 60 nm), whereas in every other case, k
stacked nMOS (pMOS) have Wn = k × 60 nm (Wp = k × 120 nm). In
This conventional implementation is not very efficient, since it has a case of [3, 6], we added some inverters (along with proper modifi-
critical path of six gate delays for the Sum and Carry outputs. Several cations) to avoid outputs driven by pass transistors. Therefore, their tran-
alternative architectures have been proposed, using fewer logic stages sistor count is slightly higher than originally reported.
and in some cases producing output equations that differ from (4)–(7). The evaluation procedure was realised with BSIM4-level SPICE
Nonetheless, all designs must abide by (1) for valid operation. simulations, which were used to calculate the maximum delay and