
2017 IEEE Computer Society Annual Symposium on VLSI

High Speed Power Efficient Carry Select Adder Design


Raghava Katreepalli Themistoklis Haniotakis
Dept. of Electrical and Computer Engineering, Dept. of Electrical and Computer Engineering,
Southern Illinois University Carbondale, USA Southern Illinois University Carbondale, USA
raghava.katreepalli@siu.edu haniotak@siu.edu

2159-3477/17 $31.00 © 2017 IEEE. DOI 10.1109/ISVLSI.2017.16

Abstract - Adders are basic building blocks of any processor or data path application. The design of high-performance processing units requires high-speed adders with low power consumption. The carry select adder (CSA) is known to be one of the fastest adders used in many data processing applications. In this paper, we present a new CSA architecture using the Manchester carry chain (MCC) in multioutput domino CMOS logic. It employs novel MCC blocks in a hierarchical approach to the design of the CSA. The proposed design is validated by implementing 16-bit and 32-bit adder circuits in a standard 45nm CMOS process technology. The proposed designs are evaluated in terms of delay, power consumption and hardware overhead, and the results are compared with existing fast adder architectures. The simulation results show that the proposed architecture achieves a two-fold advantage in terms of power-delay product (PDP) and hardware overhead.

Keywords - Carry select adder (CSA), Manchester carry chain (MCC), multioutput domino logic, PDP, speed.

I. INTRODUCTION

As the requirement for high-performance processors grows, there is a constant need to enhance the performance of data path units. Addition is the most commonly used arithmetic operation, and the performance of a VLSI processor is strongly affected by the performance of its resident adder. A high-performance adder with low power consumption and minimum area plays an indispensable role in a large portion of hardware circuits. Several types of adders have been designed for adding two binary numbers, for example the ripple-carry, carry-skip and carry-select adder (CSA). The major speed restriction in conventional adder circuits, such as the ripple carry adder (RCA) and the carry save adder, arises from the long computation time required for generating the outputs. CSA and carry look-ahead architectures have been suggested to reduce the large carry propagation delay of adders. The main advantage of the CSA is that it alleviates the problem of carry propagation delay. This is realized by a parallel structure built from multiple pairs of RCAs, with the final sum and carry output chosen by multiplexers [2]. The main difference between a CSA and an RCA is that in an RCA the carry has to ripple through all full-adders, whereas in a CSA the carry has to pass through a single multiplexer.

The CSA computes faster than the RCA but requires more area. A modified CSA using one RCA and an add-one circuit is presented in [8]. For implementing wide adders, a square-root (SQRT) CSA, in which CSAs of increasing size are connected in series to provide a parallel path for carry propagation, is presented in [9]. A CSA design using a binary to excess-1 converter (BEC) is presented in [10]. The BEC-based CSA involves fewer logic resources than the conventional CSA, but it has marginally higher delay. A high-speed CSA design using Brent-Kung (BK) and BEC is presented in [5]. An area-delay-power efficient CSA, which eliminates the redundant logic operations present in the conventional CSA and schedules the carry select before the calculation of the final sum, is proposed in [13].

There are different types of high-speed adder architectures available in the literature [3-7]. Among them, tree structures like parallel-prefix adders (PPA) are developed for faster computation. PPAs are a distinct class of adders that work on generate and propagate signals. The Kogge-Stone adder is a PPA form of the carry look-ahead adder. It generates the carry signals in O(log n) time, and is widely considered the fastest adder design [11]. The Brent-Kung (BK) adder [1] is a parallel-prefix adder with a logarithmic architecture that gives an optimal number of stages from the inputs to all outputs; a BK-based design is presented in [5]. Several variants of the carry look-ahead equations, like the Ling adder, are presented in [6].

In this paper, we present a methodology for designing a CSA using the Manchester carry chain (MCC) in multioutput domino CMOS. The design focuses on a CSA architecture using domino MCC in a hierarchical approach for wide N-bit adders. The proposed design shows better PDP than existing designs with minimum hardware overhead.

The remainder of the paper is organized as follows. Section II discusses related work. Section III presents the proposed CSA design using the MCC structure. Finally, Section IV provides simulation results and Section V concludes.

II. RELATED WORK

A. Conventional Carry Select Adder

The block diagram of the conventional CSA is shown in Fig. 1. The CSA uses a dual RCA structure to precalculate the sum and carry values for both possible cin values, cin=0 and cin=1, before the actual carry is available. In Fig. 1, the upper RCA structure calculates the result for cin=0 and the lower RCA structure calculates the result for cin=1. The multiplexer selects
the result of the upper RCA if the actual carry is 0, or the result of the lower RCA if the actual carry is 1. Because the results for both possible values of cin are calculated in advance, the CSA has less delay than the RCA.

Fig. 1: Conventional Carry Select Adder

B. Advanced CSA

To implement a CSA with efficient area and low power, the RCA structure with cin=1 is replaced with a BEC. In this implementation the number of XOR gates is reduced. The final output is selected by a multiplexer based on the actual cin. The number of logic gates used in the BEC design is less than the number used in the conventional CSA [10], but the disadvantage of this design is that its delay is higher than that of the conventional CSA.

C. Kogge-Stone Adder

The parallel prefix adder is one of the most frequently used architectures. It works on generate and propagate signals and offers a good compromise among power, delay and area. The Kogge-Stone adder is a parallel prefix adder with low depth and high node count, with minimum fan-out at each node. It generates the carry in O(log n) time, where n is the number of bits per input, and its operation is defined according to [1]. It is the fastest adder with focus on design time and is the common choice for high-performance adders in industry.

D. Ling Adder

Ling [12] proposed a simplified form of the carry look-ahead equations that relies on adjacent bit pairs [6]. An XOR gate is replaced with an OR gate, i.e., the term P_i is replaced by t_i as in the equation

$P_i = A_i \oplus B_i \Rightarrow t_i = A_i + B_i$    (1)

which results in a design with lower overhead and improved performance. The adder propagates H_i instead of C_i, where H_i is defined in [6] as H_i = C_i + C_{i-1}. The Ling carries H_i can be computed much faster than the corresponding carries C_i, since they rely on a simpler Boolean function [6]. For example, C_4 and H_4 are given in equations (2) and (3), and the sum is given in equation (4), according to [6].

$C_4 = G_4 + P_4 G_3 + P_4 P_3 G_2 + P_4 P_3 P_2 G_1 + P_4 P_3 P_2 P_1 G_0$    (2)

$H_4 = G_4 + G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0$    (3)

$s_4 = P_4 \oplus (t_3 \cdot H_3)$    (4)

Since using a multiplexer instead of an XOR gate to compute the sum adds no delay, the resulting adder becomes faster [10].

E. MCC Adder

MCC structures are used to speed up binary addition. Let $A = a_{k-1} a_{k-2} \ldots a_1 a_0$ and $B = b_{k-1} b_{k-2} \ldots b_1 b_0$ be two binary numbers. For each bit position, the generate and propagate signals and the carry signals are computed from the following equations

$G_i = a_i \cdot b_i$    (5)

$P_i = a_i \oplus b_i$    (6)

$C_i = G_i + P_i \cdot C_{i-1}$    (7)

Further, equation (7) can be expanded as below

$C_i = G_i + P_i G_{i-1} + P_i P_{i-1} G_{i-2} + \ldots + P_i P_{i-1} \ldots P_0 \, c_{in}$    (8)

A conventional 8-bit MCC cell is illustrated in Fig. 2. The carry chain is implemented by eight cascaded pass transistors driven by the propagate signals P_0 ... P_7. The MCC [1] generates all the carries computed according to equation (8) in parallel, using an iterative shared transistor structure. The sum bits of the adder are computed as s_i = P_i XOR c_{i-1}, which is implemented by a static CMOS XOR gate.

Fig. 2: 8-Bit Manchester Carry Chain Cell

III. PROPOSED CARRY SELECT ADDER (CSA) DESIGN

The MCC circuit has a smaller transistor count than the CSA architecture, and the smallest among all CLA circuits, because of its pass-transistor configuration; it also has low power consumption [1]. For wide N-bit adders, however, where the carry chain is long, the carry propagation delay becomes large due to the RC delay of the pass transistors, which affects speed.

In the following, we propose a methodology for implementing a CSA using the Manchester carry chain (MCC) in multioutput domino CMOS in 'L' levels using adaptive clocking which is

area-delay-power efficient. Here we use the same MCC for both cin=0 and cin=1, and the design focuses on a hierarchical approach for wide N-bit adders. The clock diagrams and initial carry-in signals used for the proposed CSA are shown in Fig. 3. The first level L_i (i=1) is partitioned into 'P' blocks of MCC structure, where 'S' is the maximum block size for the MSB bits. The MCC structure of one block of size 4 at the first level is shown in Fig. 4.

Fig. 3: Clock Diagram for Proposed CSA

Fig. 4: Proposed 4-bit MCC Block

In phase 1, for the MCC at the first level L_1, the nodes are precharged to Vdd through the PMOS transistors when clk1 is low. When clk1 is high, the NMOS evaluation transistor turns ON and the carry-outs are computed using equation (7) for initial cin = '0'. Transmission gates are used to hold these computed intermediate carry-outs at active-high ntclk and active-low ptclk, where ntclk and ptclk are the clock signals that activate the transmission gates, as shown in Fig. 3.

In phase 2 (the second active-high phase of clk1), the intermediate carry-outs for initial cin = '1' are computed at the first level L_1. These computed carry-out values do not override the values held by the transmission gates, because the transmission gates are inactive, which is evident from ntclk and ptclk in Fig. 3. In level L_1 the generate and propagate signals are computed by equations (5) and (6), where a_i and b_i are the input bits. Fig. 5 (a) and (b) show the implementation of the generate and propagate signals in domino CMOS logic.

Fig. 5: Domino Generate and Propagate, panels (a) and (b)

Fig. 6: Proposed CSA 16-bit Adder (Level-I blocks over bits 0:2, 3:6, 7:10 and 11:15, each evaluated for C-1 = 0 and C-1 = 1; Level-II MCC clocked by Clk2 and driven by the actual Cin)

In level L_{i+1} (L=2), we use a P-bit MCC structure where the MSB carry-outs computed for each of the 'P' blocks with initial
cin = '0' at the first level are fed as generate signals. As explained earlier, the transmission gates used at the first level L_i (i=1) hold these carry-outs. The MSB carry-outs computed with initial cin = '1' for each block at the first level are fed directly as propagate signals.

As shown in Fig. 3, when clk2 is low, the MCC nodes at level L_{i+1} (L=2) are precharged to Vdd through the PMOS transistors. When clk2 goes high, the actual carry-in is fed, the NMOS evaluation transistor turns ON, and the required final carry-out is computed. In parallel, during the same active-high phase of clk2, the sum is computed using s_i = p_i XOR c_{i-1}, where p_i is the propagate signal generated from the input bits and c_{i-1} is the carry-in, both of which are already available from the previous steps.

If N is the number of bits of an adder, we formulate the maximum block size for the MSB bits from the following equation

$S = \sqrt[L]{N}$    (9)

In the above equation L always starts at 2. L is incremented when S exceeds each multiple of 8: if S > 8, then L is incremented by one, which redefines the block size. For example, if N=32 then $S = \sqrt[2]{32} \approx 5.65$, so the maximum block size at the MSB in level 1 can be restricted to 6. In contrast, for N=128, $S = \sqrt[2]{128} \approx 11.3$, which is greater than 8, so with L incremented, $S = \sqrt[3]{128} \approx 5$.

The block diagrams of the 16-bit and 32-bit adders using the proposed CSA are shown in Fig. 6 and Fig. 7 respectively. The 16-bit adder is designed by cascading four proposed MCC blocks with block sizes 3, 4, 4, 5 from LSB to MSB in the first level, and it uses a 4-bit MCC block in the second level, as shown in Fig. 6. The 32-bit adder is designed by cascading six proposed MCC blocks with block sizes 4, 5, 5, 6, 6, 6 from LSB to MSB in the first level, and it uses a 6-bit MCC block in the second level, as shown in Fig. 7. All the block sizes chosen in the first level are optimized for low delay. The block size at the LSB is chosen to be the minimum because, in worst-case operation, a carry path of pass transistors has to be discharged from the LSB position; this also speeds up the circuit in the next level, as the LSB for the next level is immediately available.

Fig. 7: Proposed CSA 32-Bit Adder (Level-I blocks over bits 0:3, 4:8, 9:13, 14:19, 20:25 and 26:31, each evaluated for C-1 = 0 and C-1 = 1; Level-II MCC clocked by Clk2 and driven by the actual Cin)

The speed improvement can be obtained by parallel carry generation (for anticipated cin=0 and cin=1) at the first level
using the MCC structure. Implementing the CSA with parallel MCC blocks at the first level makes each block independent of the others. In other words, the generation of the carries (for cin=0 and cin=1) of one block does not affect the other blocks. Here we choose variable block sizes at the first level such that the delay is minimum. Unlike the conventional CSA, we do not use extra hardware; instead we re-use the MCC structure (for cin=0 and cin=1) in an effective way using adaptive clocking in multi-output domino MCC. This has a three-fold advantage in terms of speed, power consumption and hardware overhead.

A disadvantage of the proposed CSA design is the need for a clock signal at level L_i and at level L_{i+1} for pre-charge and evaluation, and for two control clocks for the transmission gates used at level L_i. However, the generation of these signals is a one-time cost that does not increase with circuit complexity.

From Fig. 3 it can be noted that ntclk is the inverted version of ptclk. The clk2 signal used at level L_{i+1} is the ptclk of L_i delayed by half of the clock period (the clock period of ptclk). Thus the overhead of clock signal generation is small. Moreover, in the proposed CSA design, where each level is fed with a dedicated clock signal, the clock distribution cost is also very small. The skew-hardened dynamic design methodology of [14] can be adopted to handle clock skew.

IV. SIMULATION RESULTS

In order to verify the performance of the proposed CSA, we have designed the proposed CSA, the conventional CSA (CONV CSA), the Advanced CSA [10], the CSA of [13], the Kogge-Stone adder (KSA) [10] and the Ling adder [10] for bit-widths 16 and 32 in a standard 45nm CMOS technology (Vdd=1V) and simulated them using SPECTRE. The pre-layout comparison results for delay, power consumption, power-delay product and transistor count for the 16-bit and 32-bit adders are summarized in Table I. Simulations are performed using minimum-size transistors for low power, for both the proposed and the standard adder circuits. The simulated carry-signal waveform of a 32-bit adder using the proposed CSA, with respect to clk2, is shown in Fig. 10.

Table I shows substantial gains for the proposed CSA design. According to these simulations the proposed CSA involves significantly less hardware overhead and less delay, and consumes less power than existing designs. The computational delay of the proposed CSA for 16 bits and 32 bits is 473.965ps and 774.736ps respectively. The delay reduction of the proposed CSA is 45% relative to the CONV CSA; 52% relative to the Advanced CSA [10]; 48% relative to the CSA [13]; 8% relative to the KSA [10]; and 60% relative to the Ling adder [10], on average, for different bit widths.

The power consumption of the proposed CSA for 16 bits and 32 bits is 22.563 µW and 35.798 µW respectively (Table I), where 14% of the circuit power is contributed by the clocks. The power consumption of the proposed CSA is reduced by 40% relative to the CONV CSA; 25% relative to the Advanced CSA [10]; 11% relative to the CSA [13]; and 0.5% relative to the KSA [10], on average, for different bit widths. The power consumption of the proposed CSA is higher by 17% on average than that of the Ling adder [10] for different bit widths. This increase in power in the proposed CSA is due to the power contributed by the clock signals.

Fig. 8 and Fig. 9 show that the proposed CSA design offers savings of 115% in PDP and 19% in hardware overhead (transistor count) relative to the CONV CSA; 97% in PDP and 11% in hardware overhead relative to the Advanced CSA [10]; 73% in PDP and 7% in hardware overhead relative to the CSA [13]; 8% in PDP and 15% in hardware overhead relative to the KSA [10]; and 65% in PDP and 2% in hardware overhead relative to the Ling adder [10], on average, for different bit-widths.

Fig. 8: Power Delay Product Comparison (PDP in pJ versus adder width n, for Conv CSA, Adv CSA, Kogge Stone, Ling Adder, CSA [13] and Proposed CSA)

Fig. 9: Transistor Count Comparison (transistor count versus adder width n, for Conv CSA, Adv CSA, Kogge Stone, Ling Adder, CSA [13] and Proposed CSA)

V. CONCLUSIONS

In this paper, a new efficient hierarchical Manchester carry-chain based CSA has been presented. The proposed design technique has been applied to the implementation of 16-bit and 32-bit adders. The proposed CSA design involves significantly less delay, power-delay product and hardware overhead when compared with the other adder structures. Hence, the proposed CSA design can be used for power-delay efficient devices with minimum hardware overhead. Simulation results validate the efficiency of the proposed CSA design.
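The two-level scheme of Section III can be illustrated with a minimal behavioral sketch in Python. It is not the authors' domino circuit and models no timing, clocking or transistor behavior; it only checks the logic under the assumption that each first-level block evaluates the carry recurrence of equation (7) once for cin=0 and once for cin=1, and that the second level combines the block MSB carry-outs as generate (cin=0 result) and propagate (cin=1 result) signals. The names `block_carries` and `carry_select_add` are hypothetical.

```python
def block_carries(a_bits, b_bits, cin):
    """Carries of one block from C_i = G_i + P_i * C_(i-1), LSB first."""
    carries = []
    carry = cin
    for a, b in zip(a_bits, b_bits):
        generate = a & b           # G_i = a_i . b_i      (equation 5)
        propagate = a ^ b          # P_i = a_i XOR b_i    (equation 6)
        carry = generate | (propagate & carry)          # (equation 7)
        carries.append(carry)
    return carries


def carry_select_add(a, b, cin, block_sizes):
    """Add a + b + cin; block_sizes lists first-level block widths, LSB first."""
    width = sum(block_sizes)
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]

    # Level 1: every block computes its carries for both anticipated
    # carry-ins, independently of all other blocks.
    level1 = []
    low = 0
    for size in block_sizes:
        high = low + size
        c0 = block_carries(a_bits[low:high], b_bits[low:high], 0)
        c1 = block_carries(a_bits[low:high], b_bits[low:high], 1)
        level1.append((low, c0, c1))
        low = high

    # Level 2: the actual carry-in picks one carry set per block.
    total = 0
    carry = cin
    for low, c0, c1 in level1:
        chosen = c1 if carry else c0
        carry_in = carry
        for offset, carry_out in enumerate(chosen):
            i = low + offset
            sum_bit = (a_bits[i] ^ b_bits[i]) ^ carry_in   # s_i = P_i XOR c_(i-1)
            total |= sum_bit << i
            carry_in = carry_out
        # Block MSB carry-out for cin=0 acts as generate, for cin=1 as propagate.
        carry = c0[-1] | (c1[-1] & carry)
    return total, carry


if __name__ == "__main__":
    # Block sizes taken from the 16-bit design described in Section III.
    print(carry_select_add(12345, 54321, 1, [3, 4, 4, 5]))
```

The function returns the sum truncated to the adder width together with the final carry-out, so `total + (carry << width)` should equal `a + b + cin` for any inputs that fit the width.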

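The block-size rule of equation (9) can be worked through numerically. The sketch below assumes one reading of the rule stated in Section III: L starts at 2 and is incremented by one for as long as S = N^(1/L) exceeds 8. The function name `max_block_size` and the `limit` parameter are hypothetical, and the returned S is left unrounded because the paper rounds it differently in its two examples.

```python
def max_block_size(n_bits, limit=8):
    """Return (L, S) with S = n_bits ** (1 / L), raising L while S > limit."""
    level = 2                          # L always starts at 2
    size = n_bits ** (1.0 / level)
    while size > limit:                # assumed reading: increment L while S > 8
        level += 1
        size = n_bits ** (1.0 / level)
    return level, size


if __name__ == "__main__":
    for n in (16, 32, 128):
        level, size = max_block_size(n)
        print(f"N={n}: L={level}, S={size:.2f}")
```

For N=32 this gives L=2 with S of about 5.66, and for N=128 it gives L=3 with S of about 5.04, in line with the two examples given for equation (9).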
TABLE I: Experimental Results and Comparisons for 16-Bit and 32-Bit Adders

Design              Width  Delay (ps)  Power (µW)  Transistor Count  PDP (pJ)  EPDP (%)  ETC (%)
CONV CSA            16     714.816     34.325      497               0.0245    131.13    20.92
CONV CSA            32     1138.162    50.295      954               0.0572    106.49    17.63
Advanced CSA [10]   16     748.321     29.429      462               0.022     107.54    11.86
Advanced CSA [10]   32     1197.492    45.083      897               0.0539    94.58     10.6
CSA [13]            16     737.64      27.08       447               0.0199    87.7      8.24
CSA [13]            32     1162.91     38.76       848               0.045     62.45     4.56
Kogge Stone [10]    16     518.617     22.76       481               0.0118    11.32     16.46
Kogge Stone [10]    32     802.341     35.942      921               0.0288    4.04      13.56
Ling Adder [10]     16     987.32      20.026      421               0.0197    85.8      1.94
Ling Adder [10]     32     1310.532    31.443      828               0.0412    48.78     2.1
Proposed CSA        16     473.965     22.563      413               0.0106    -         -
Proposed CSA        32     774.736     35.798      811               0.0277    -         -

CONV: conventional. PDP: power x delay (power-delay product). EPDP: excess PDP over the proposed CSA design. ETC: excess transistor count over the proposed CSA design.
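The derived columns of Table I follow from the definitions in its footnote. The short Python sketch below shows the arithmetic under the assumption that EPDP and ETC are relative excesses over the proposed design, expressed in percent; the helper names `pdp_pj` and `excess_percent` are hypothetical. With delay in ps and power in µW, the product is in units of 1e-18 J, i.e. 1e-6 pJ.

```python
def pdp_pj(delay_ps, power_uw):
    """Power-delay product in pJ: ps * uW = 1e-18 J = 1e-6 pJ."""
    return delay_ps * power_uw * 1e-6


def excess_percent(value, reference):
    """Relative excess of value over reference, in percent."""
    return (value - reference) / reference * 100.0


if __name__ == "__main__":
    # 16-bit CONV CSA row of Table I.
    print(round(pdp_pj(714.816, 34.325), 4))
    # EPDP from the tabulated PDP values (CONV CSA vs proposed, 16-bit).
    print(round(excess_percent(0.0245, 0.0106), 2))
    # ETC from the transistor counts (Advanced CSA [10] vs proposed, 16-bit).
    print(round(excess_percent(462, 413), 2))
```

Recomputed values can differ from the table in the last digit, since the tabulated PDP entries are themselves rounded.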

Fig. 10: Carry Signals Simulated Waveform for Proposed 32 bit adder with respect to clk2

REFERENCES

[1] N. Weste and D. Harris, CMOS VLSI Design. Reading, MA: Addison Wesley, 2004.
[2] B. Ramkumar and H. M. Kittur, "Low-power and area-efficient carry select adder," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, pp. 371-375, Feb. 2012.
[3] P. Devi, A. Girdhar and B. Singh, "Improved Carry Select Adder with Reduced Area and Low Power Consumption," International Journal of Computer Applications, vol. 3, no. 4, June 2010.
[4] B. Ramkumar, H. M. Kittur and P. M. Kannan, "ASIC implementation of modified faster carry save adder," Eur. J. Sci. Res., vol. 42, no. 1, pp. 53-58, 2010.
[5] P. Saxena, "Design of Low Power and High Speed Carry Select Adder Using Brent Kung Adder," 2015 International Conference on VLSI Systems, Architecture, Technology and Applications (VLSI-SATA).
[6] G. Dimitrakopoulos and D. Nikolos, "High-speed parallel-prefix VLSI Ling adders," IEEE Transactions on Computers, vol. 54, no. 2, pp. 225-231, Feb. 2005.
[7] N. Anand, G. Joseph, J. S. Raj and P. Jayakrishnan, "Implementation of adder structure with fast carry network for high speed processor," 2013 International Conference on Green Computing, Communication and Conservation of Energy (ICGCE), Chennai, 2013, pp. 188-190.
[8] Y. Kim and L.-S. Kim, "64-bit Carry-Select Adder with reduced area," Electron. Lett., vol. 37, no. 10, pp. 614-615, 2001.
[9] O. J. Bedrij, "Carry-Select Adder," IRE Transactions on Electronic Computers, vol. EC-11, no. 3, pp. 340-346, June 1962.
[10] M. O. V. Pavan Kumar and M. Kiran, "Design of optimal fast adder," 2013 International Conference on Advanced Computing and Communication Systems, Coimbatore, 2013, pp. 1-4.
[11] P. M. Kogge and H. S. Stone, "A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations," IEEE Trans. Computers, vol. 22, no. 8, pp. 786-792, Aug. 1973.
[12] H. Ling, "High-Speed Binary Adder," IBM J. R & D, vol. 25, pp. 156-166, May 1981.
[13] B. K. Mohanty and S. K. Patel, "Area-Delay-Power Efficient Carry-Select Adder," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 61, no. 6, pp. 418-422, June 2014.
[14] D. Harris, Skew-Tolerant Circuit Design, Morgan-Kaufmann, 2001.
