Abstract - Adders are basic building blocks of any processor or data path application. The design of high-performance processing units requires high-speed adders with low power consumption. The carry select adder (CSA) is known to be one of the fastest adders used in many data processing applications. In this paper, we present a new CSA architecture using the Manchester carry chain (MCC) in multioutput domino CMOS logic. It employs novel MCC blocks in a hierarchical approach to the design of the CSA. The proposed design is validated by implementing 16-bit and 32-bit adder circuits in a standard 45nm CMOS process technology. The proposed designs are evaluated in terms of delay, power consumption and hardware overhead. The results are analyzed and compared with existing fast adder architectures to demonstrate the efficiency of the design. The simulation results show that the proposed architecture achieves a twofold advantage in terms of power-delay product (PDP) and hardware overhead.
Keywords - Carry select adder (CSA), Manchester carry chain (MCC), multioutput domino logic, PDP, speed.
I. INTRODUCTION
As the demand for high-performance processors grows, there is a constant need to enhance the performance of data path units. Addition is the most commonly used arithmetic operation, and the performance of a VLSI processor is strongly affected by the performance of its resident adder. A high-performance adder with low power consumption and minimum area plays an indispensable role in a large portion of hardware circuits. Several types of adders have been designed for adding two binary numbers, for example ripple-carry, carry-skip and carry-select adders (CSA). The major speed restriction in conventional adder circuits, such as the ripple carry adder (RCA) and carry save adder, arises from the long computation time required for generating the outputs. CSA and carry look-ahead architectures have been suggested to reduce the large carry propagation delay of adders. The main advantage of the CSA is that it alleviates the problem of carry propagation delay. This is realized by a parallel structure built from multiple pairs of RCAs, with the final sum and carry output chosen by multiplexers [2]. The main difference between a CSA and an RCA is that in an RCA the carry has to ripple through all full-adders, whereas in a CSA the carry has to pass through a single multiplexer.
The CSA computes faster than the RCA but requires more area. A modified CSA using one RCA and an add-one circuit is presented in [8]. For implementing high-bit-width adders, a square-root (SQRT) CSA, in which CSAs of increasing size are connected in series to provide a parallel path for carry propagation, is presented in [9]. A CSA design using a binary to excess-1 converter (BEC) is presented in [10]. The BEC-based CSA involves fewer logic resources than the conventional CSA, but it has marginally higher delay. A high-speed CSA design using Brent-Kung (BK) and BEC is presented in [5]. An area-delay-power efficient CSA, which eliminates the redundant logic operations present in the conventional CSA and schedules the carry select before the calculation of the final sum, is proposed in [13].
There are different types of high-speed adder architectures available in the literature [3-7]. Among them, tree structures like parallel-prefix adders (PPA) are developed for faster computation. PPAs are a distinct class of adders that operate on generate and propagate signals. The Kogge-Stone adder is a PPA form of the carry look-ahead adder. It generates the carry signals in O(log n) time, and is widely considered the fastest adder design [11]. The Brent-Kung (BK) adder [1] is a parallel-prefix adder with a logarithmic architecture that gives an optimal number of stages from input to all outputs; it is presented in [5]. Several variants of the carry look-ahead equations, like the Ling adder, are presented in [6].
In this paper, we present a methodology for designing a CSA using the Manchester carry chain (MCC) in multioutput domino CMOS. We focus on a CSA architecture that uses domino MCC blocks in a hierarchical approach for wide N-bit adders. The proposed design shows better PDP than existing designs with minimum hardware overhead.
The remainder of the paper is organized as follows. Section II discusses related work. Section III presents the proposed CSA design using the MCC structure. Finally, Section IV provides simulation results and Section V concludes.
II. RELATED WORK
A. Conventional Carry Select Adder
The block diagram of the conventional CSA is shown in Fig. 1. The CSA uses a dual RCA structure to precalculate sum and carry values for both possible carry-in values, cin=0 and cin=1, before the actual carry is available. In Fig. 1, the upper RCA structure calculates the result for cin=0 and the lower RCA structure calculates the result for cin=1. Multiplexer selects
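The dual-RCA selection described in this subsection can be sketched behaviorally in Python. This is a minimal illustration rather than the circuit of Fig. 1: the fixed block size of 4, the 16-bit default width and all function names are assumptions introduced here, and only the logic function is modeled, not timing or area.

```python
def ripple_carry_add(a_bits, b_bits, cin):
    """Add two little-endian bit lists with a chain of full adders."""
    sum_bits = []
    carry = cin
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sum_bits, carry


def to_bits(value, width):
    """Little-endian bit list of an unsigned integer."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def carry_select_add(a, b, width=16, block=4, cin=0):
    """Carry-select addition; block size 4 is an illustrative assumption."""
    a_bits, b_bits = to_bits(a, width), to_bits(b, width)
    sum_bits = []
    carry = cin
    for lo in range(0, width, block):
        a_blk, b_blk = a_bits[lo:lo + block], b_bits[lo:lo + block]
        # Both candidates are computed before the real carry-in is known.
        sum0, cout0 = ripple_carry_add(a_blk, b_blk, 0)
        sum1, cout1 = ripple_carry_add(a_blk, b_blk, 1)
        # 2:1 multiplexers pick the candidate that matches the actual carry-in.
        sum_bits += sum1 if carry else sum0
        carry = cout1 if carry else cout0
    return from_bits(sum_bits), carry


if __name__ == "__main__":
    print(carry_select_add(1234, 4321))
```

In hardware the two candidate results of every block are produced in parallel, so only the multiplexer selection sits on the block-to-block carry path; the sequential loop above reproduces the result, not that parallelism.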
Fig. 5: Domino Generate and Propagate, panels (a) and (b)
area-delay-power efficient. Here we use the same MCC for both cin=0 and cin=1, and the design follows a hierarchical approach for wide N-bit adders. The clock diagrams and initial carry-in signals used for the proposed CSA are shown in Fig. 3. The first level Li (i=1) is partitioned into 'P' blocks of MCC structure, where 'S' is the maximum block size for the MSB bits. The MCC structure of one block of size 4 at the first level is shown in Fig. 4.
In phase 1, for the MCC at the first level L1, when clk1 is low the PMOS transistors precharge the chain to vdd. When clk1 is high, the NMOS evaluation transistor turns ON and the chain computes the carry-outs using equation (7) for initial cin = '0'. Transmission gates are used to hold these computed intermediate carry-outs at active high ntclock and active low
Fig. 6: Proposed CSA 16-bit Adder (Level-I blocks P0:2/G0:2, P3:6/G3:6, P7:10/G7:10, P11:15/G11:15; block carries C0A-C0D and C1A-C1D; clocks Clk1 and Clk2; actual Cin)
In level Li+1 (level 2), we use a P-bit MCC structure where the MSB carry-outs computed for each of the 'P' blocks with initial
cin = '0' at the first level are fed as generate signals. As explained earlier, the transmission gates used at the first level Li (i=1) hold these carry-outs. The MSB carry-outs computed with initial cin = '1' for each block at the first level are fed directly as propagate signals.
Fig. 7 (block diagram of the proposed 32-bit adder, per the text): Level-I blocks P0:3/G0:3, P4:8/G4:8, P9:13/G9:13, P14:19/G14:19, P20:25/G20:25, P26:31/G26:31; each block is evaluated with C-1 = 0 and C-1 = 1, giving block carries C0A-C0F and C1A-C1F; clock Clk2
From Fig. 3, when clk2 is low the PMOS transistors of the MCC at level Li+1 (level 2) precharge the chain to vdd; when clk2 goes high, the actual carry-in is fed, the NMOS evaluation transistor turns ON and the required final carry-out is computed. In parallel, during the same active-high phase of clk2, the sum is computed using $s_i = p_i \oplus c_{i-1}$, where $p_i$ is the propagate signal generated from the input bits and $c_{i-1}$ is the carry-in; both are available from the previous discussion.
If N is the number of bits of an adder, we formulate the maximum block size for the MSB bits from the following equation:
$$ S = \sqrt[L]{N} \tag{9} $$
In the above equation L always starts at 2. L is incremented when S exceeds a multiple of 8: if S > 8, L is incremented by one, which redefines the block size. For example, if N=32 then $S = \sqrt[2]{32} \approx 5.65$, so the maximum block size at the MSB in level 1 can be restricted to 6. In contrast, for N=128, $S = \sqrt[2]{128} \approx 11.3$, which is greater than 8, so with L incremented, $S = \sqrt[3]{128} \approx 5$.
The block diagrams of the 16-bit and 32-bit adders using the proposed CSA are shown in Fig. 6 and Fig. 7 respectively. The 16-bit adder is designed by cascading four proposed MCC blocks with block sizes 3, 4, 4, 5 from LSB to MSB in the first level, and it uses a 4-bit MCC block in the second level, as shown in Fig. 6. The 32-bit adder is designed by cascading six proposed MCC blocks with block sizes 4, 5, 5, 6, 6, 6 from LSB to MSB in the first level, and it uses a 6-bit MCC block in the second level, as shown in Fig. 7. All the block sizes chosen in the first level are optimized for low delay. The LSB block size is chosen to be the smallest because, in worst-case operation, the carry path of pass transistors has to be discharged at the LSB position; it also speeds up the next level, since the LSB input for the next level is then available immediately.
The speed improvement can be obtained by parallel carry generation (for anticipated cin=0 and cin=1) at first level
using the MCC structure. The basic implementation of the CSA using parallel MCC blocks at the first level makes each block independent of the others. In other words, the carry generation (for cin=0 and cin=1) of one block does not affect the others. Here we choose variable block sizes at the first level such that the delay is minimum. Unlike the conventional CSA, we do not use extra hardware; instead we re-use the MCC structure (for cin=0 and cin=1) effectively, using adaptive clocking in the multi-output domino MCC. This has a threefold advantage in terms of speed, power consumption and hardware overhead.
A disadvantage of the proposed CSA design is the need for clock signals at level Li and level Li+1 for pre-charge and evaluation, and for two control clocks for the transmission gates used at level Li. However, the generation of these signals is a one-time cost that does not increase with circuit complexity. From Fig. 3 it can be noted that ntclk is the inverted version of ptclk. The clk2 signal used at level Li+1 is the ptclk of Li delayed by half of the clock period (the clock period of ptclk). Thus the overhead of clock signal generation is low. Moreover, because each level in the proposed CSA design is fed with a dedicated clock signal, the clock distribution cost is also very small. The skew-hardened dynamic design methodology [14] can be adopted to handle clock skew.
IV. SIMULATION RESULTS
In order to verify the performance of the proposed CSA, we have designed the proposed CSA, the conventional CSA (CONV CSA), the Advanced CSA [10], the CSA [13], the Kogge-Stone adder (KSA) [10] and the Ling adder [10] for bit-widths 16 and 32 in a standard 45nm CMOS technology (Vdd=1V) and simulated them using SPECTRE. The pre-layout comparison results for delay, power consumption, power-delay product and transistor count for the 16-bit and 32-bit adders are summarized in Table I. Simulations are performed using minimum-size transistors for low power for both the proposed and the standard adder circuits. The carry signals simulated waveform of a 32 bit adder with respect to clk2
Adder [10] for different bit widths. This increase in power level in the proposed CSA is due to the power contributed by the clock signals.
We can find from Fig. 8 and Fig. 9 that, relative to the proposed CSA design, the excess PDP and excess hardware overhead (transistor count) are 115% and 19% for the CONV CSA; 97% and 11% for the Advanced CSA [10]; 73% and 7% for the CSA [13]; 8% and 15% for the KSA [10]; and 65% and 2% for the Ling adder [10], on average, for different bit-widths.
Fig. 8: Power Delay Product Comparison (PDP in pJ versus adder width n for Conv CSA, Adv CSA, Kogge Stone, Ling Adder, CSA [13] and Proposed CSA)
Fig. 9 (caption not recovered): Transistor count versus adder width n for Conv CSA, Adv CSA, Kogge Stone, Ling Adder, CSA [13] and Proposed CSA
TABLE I: Experimental Results and Comparisons for 16-Bit and 32-Bit Adders
Design             Width  Delay (ps)  Power Consumption (uW)  Transistor Count  PDP (pJ)  EPDP (%)  ETC (%)
CONV CSA           16     714.816     34.325                  497               0.0245    131.13    20.92
CONV CSA           32     1138.162    50.295                  954               0.0572    106.49    17.63
Advanced CSA [10]  16     748.321     29.429                  462               0.022     107.54    11.86
Advanced CSA [10]  32     1197.492    45.083                  897               0.0539    94.58     10.6
CSA [13]           16     737.64      27.08                   447               0.0199    87.7      8.24
CSA [13]           32     1162.91     38.76                   848               0.045     62.45     4.56
Kogge Stone [10]   16     518.617     22.76                   481               0.0118    11.32     16.46
Kogge Stone [10]   32     802.341     35.942                  921               0.0288    4.04      13.56
Ling Adder [10]    16     987.32      20.026                  421               0.0197    85.8      1.94
Ling Adder [10]    32     1310.532    31.443                  828               0.0412    48.78     2.1
Proposed CSA       16     473.965     22.563                  413               0.0106    -         -
Proposed CSA       32     774.736     35.798                  811               0.0277    -         -
CONV: conventional; PDP: power x delay (power-delay product); EPDP: excess PDP over the proposed CSA design; ETC: excess transistor count over the proposed CSA design.
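The derived columns of Table I can be recomputed from the raw ones. The Python sketch below does this for the 32-bit rows, assuming PDP = delay x power and that EPDP and ETC are relative excesses over the proposed design, as the table footnote states. The function and variable names are hypothetical. The recomputed percentages should land close to the tabulated ones; small differences are expected because Table I lists PDP rounded to a few digits.

```python
# Delay (ps), power (uW) and transistor count, copied from the 32-bit rows of Table I.
ROWS_32 = {
    "CONV CSA": (1138.162, 50.295, 954),
    "Advanced CSA [10]": (1197.492, 45.083, 897),
    "CSA [13]": (1162.91, 38.76, 848),
    "Kogge Stone [10]": (802.341, 35.942, 921),
    "Ling Adder [10]": (1310.532, 31.443, 828),
    "Proposed CSA": (774.736, 35.798, 811),
}


def pdp_pj(delay_ps, power_uw):
    """Power-delay product in pJ: 1 ps x 1 uW = 1e-18 J = 1e-6 pJ."""
    return delay_ps * power_uw * 1e-6


def excess_percent(value, reference):
    """Relative excess of value over reference, in percent."""
    return (value - reference) / reference * 100.0


def compare(rows, reference="Proposed CSA"):
    """Return {design: (PDP in pJ, EPDP in %, ETC in %)} against the reference design."""
    ref_delay, ref_power, ref_count = rows[reference]
    ref_pdp = pdp_pj(ref_delay, ref_power)
    report = {}
    for name, (delay, power, count) in rows.items():
        pdp = pdp_pj(delay, power)
        report[name] = (
            pdp,
            excess_percent(pdp, ref_pdp),
            excess_percent(count, ref_count),
        )
    return report


if __name__ == "__main__":
    for name, (pdp, epdp, etc) in compare(ROWS_32).items():
        print(f"{name:18s} PDP={pdp:.4f} pJ  EPDP={epdp:6.2f}%  ETC={etc:5.2f}%")
```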
Fig. 10: Carry Signals Simulated Waveform for Proposed 32 bit adder with respect to clk2
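Waveforms such as those in Fig. 10 can be cross-checked against a behavioral model of the two-level scheme of Section III. The Python sketch below is one such model under stated assumptions, not the authors' circuit: it takes the standard definitions p_i = a_i XOR b_i, g_i = a_i AND b_i and the recurrence c_i = g_i OR (p_i AND c_{i-1}), because the carry equations referenced in the text are not reproduced in this section. All function names are hypothetical, and clocking, precharge and the transmission gates are not modeled. It also evaluates the block-size rule of equation (9).

```python
def max_block_size(n_bits, limit=8):
    """Equation (9): S = N**(1/L), with L starting at 2 and incremented while S > limit."""
    level = 2
    size = n_bits ** (1.0 / level)
    while size > limit:
        level += 1
        size = n_bits ** (1.0 / level)
    return level, size


def mcc_chain(p, g, cin):
    """Evaluate c_i = g_i OR (p_i AND c_{i-1}); return carries into each bit and the carry-out."""
    carries_in = []
    carry = cin
    for p_i, g_i in zip(p, g):
        carries_in.append(carry)
        carry = g_i | (p_i & carry)
    return carries_in, carry


def hierarchical_csa(a, b, block_sizes, cin=0):
    """Two-level carry-select addition; block_sizes are listed from LSB to MSB."""
    width = sum(block_sizes)
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]

    # Level 1: every block evaluates its chain for cin=0 and cin=1, independently.
    blocks = []
    lo = 0
    for size in block_sizes:
        p_blk, g_blk = p[lo:lo + size], g[lo:lo + size]
        blocks.append((p_blk, mcc_chain(p_blk, g_blk, 0), mcc_chain(p_blk, g_blk, 1)))
        lo += size

    # Level 2: cin=0 carry-outs act as generate, cin=1 carry-outs as propagate.
    level2_g = [c0_out for _, (_, c0_out), _ in blocks]
    level2_p = [c1_out for _, _, (_, c1_out) in blocks]
    block_cins, cout = mcc_chain(level2_p, level2_g, cin)

    # Sum: s_i = p_i XOR c_{i-1}, with c_{i-1} taken from the chain matching the block carry-in.
    total, bit = 0, 0
    for (p_blk, (c0_in, _), (c1_in, _)), blk_cin in zip(blocks, block_cins):
        chosen = c1_in if blk_cin else c0_in
        for p_i, c_prev in zip(p_blk, chosen):
            total |= (p_i ^ c_prev) << bit
            bit += 1
    return total, cout


if __name__ == "__main__":
    print(max_block_size(32), max_block_size(128))
    print(hierarchical_csa(0x12345678, 0x0FEDCBA9, [4, 5, 5, 6, 6, 6]))
```

The block-size lists [3, 4, 4, 5] and [4, 5, 5, 6, 6, 6] are the first-level partitions quoted in the text for the 16-bit and 32-bit adders. The second level is a chain with one position per first-level block, which corresponds to the 4-bit and 6-bit second-level MCC blocks described there.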