You are on page 1of 8

Fast Area-Efficient VLSI Adders

Tackdon Han and David A. Carlson!


Department of Electrical and Computer Engineering '
University of Massachusetts, Amherst, MA 01003

Abstract
In the methodology offinding a better VLSI circuit, we sugg est
In this paper, we study area-time tradeoffs in VLSI for prefix the following steps.
computation using graph representations of this problem. Since
a Find a lower bound on the chosen metric.

the pro lem is inti mately related to binary addition, the results
we obtam lead to the design of area-time efficient VLSI adders. b Devise algori thms that are close to the lower bound or

This is a major goal of our work: to des ign fiery low latency addi­ analyze known algorithms with respect t o the chosen
metric.
tion circuitry that is also area efficient. To this end, we present
a new graph representation for pre fix computation tha t leads to c By investigating t he trade-off behavior between area and
the design of a fast, area-efficient binary adder. Th e new graph is time complexity, find the optimal range of area-time com­
a combination of previously known graph representations for pre­ plexity.
fix compu t ation , and its area is close to known lower bounds on d Discover corresponding layouts that are optimal in area­
the VLSI area of parallel prefix graphs . Using it, we are able to time complexity for given values of comput ation time
design VLSI adders having area A = O(nlogn) whose delay time (including minimum computation time ) .
is t he lowest possible value, i. e. the fastest possible area-efficient
VLSI adder. Brent and Kung [BK82] found a regular layout for a graph rep­
resentation of prefix computation and used it to perform binary
1 Introduction
addition. We apply the above steps on a combination of their al­
gorithm and an algorithm for solving linear recurrences developed
Addi tion circuitry is an impor tant part of any computer archi tec­
by K ogge and Stone [KS73]. We then subs ti tute actual logic for
ture, since it is the major component of an ALU, fioating point
the no des and wires for the edges of our graphs in order to im­
pro ces sor , or a special purp o se chip ([Gr8S], [FWB5], [ST8S], etc.).
plement VLSI circuitry for binary additio n that is both fast and
Since the performance of the adder can domina t e the p erforman ce
· area-efficient.
of the architecture, the design of low latency circu it ry has been the
This paper is organized as follows. In t he next section, we
subject of a large amount of research. In this paper, we co ntinu e .
denve upper bounds on area for a graph representation of our
these studies with the desig n of a fast area-efficient VLSI binary
new prefix algorithm. In s ect ion 3, we design VLSI circuitry for
adder. We start by deriving a new "hybrid" algorithm for prefix
bin ary addition based on our hybrid prefix algorithm. In se ction
co mputation (a combination of previously known algorithms for
4, we compare our model and circuitry with o th ers in terms of
prefix computation which are e ither the best case in area or time
area and delay time performance. We conclude this paper with
complexity ) , and investigate the trade-off behavior between area
suggestions for future research in sec t ion 5.
and time of this algorithm. Since graph representations of pre­
fix algorithms map dir ectly into circuitry for binary addition, our
hy brid prefi x algo r i th m leads to the design of fast and area effi­ 2 Graph Representations of Algorithms for
cient VLSI adders. One of the results of this research is that we
are able to design the fastest possible circuitry occu pying a given
Prefix Computation
amount of area for b ot h prefix comp uta t ion and binary addition.
A par allel prefix circuit is a combinational circuit that takes n in­
In VLSI design, there are three impor t ant factors that affect
puts :l:J, :1:2," ':1:n and produces the n outputs :1:1,:1:10 :1)2," ',:1:10
the performance of a design . The first is area complexity, the
z2 o· • 'OZn where 0 represents an arbitrary associati v e binary oper­
second is time/delay performance, and the third is regularity of
ation. Parallel prefix computation has been used for many appli­
interconnection. Usually, a pattern rich in conne c t ion s occupies a
cations, for instance, in regular layouts of parallel adders [BK82J,
lot of area , while a pattern with limited connections takes a long
generati�� consecutiv� terms of a linear recurrence [GLPG82], and
time to compute. Because the cost of area is very expensive, every
fast addi t Ion of two bmary numbers [Of63]. Therefore, it is of in­
effort should be made to find the optimal area complexity for a
terest to find area-time efficient VLSI implementations of prefix
given value of time performance.
computation. We are interested here in the design of the fastest
Area in VLSI is expressed as the total amou n t of silicon used.
possible prefix circuitry occup y ing a given amount of VLSI area.
Interconnection of wires can consume most of the area of a VLSI
This section is organized as follows. In section 2.1, we review
circuit. Therefore, in this paper , we take the complexity of inter­
past work on the subject of prefix computation. Section 2.2 de­
connection into consideration rathe r than just a count of "active
s crib es a new algorithm for prefix computation that forms a basis
elements', "gates', or "registers".
for the rest of the paper. In sections 2.2 and 2.3, we derive two
different VLSI layouts for the graph representation of our new
lpresent address:
Supercomputing Re.earch Center 4S80 Forbes Blvd.
algorithm and discuss their properties. We obtain upper b01ll\ds
Lanham, MD 20706

49
CH2419-0/87/000010049$Ol.OO © 1987 IEEE

Authorized licensed use limited to: Uskudar Universitesi. Downloaded on February 01,2023 at 08:09:20 UTC from IEEE Xplore. Restrictions apply.
area, low time; B-K: low area, high time), and we can see that
on area forgiven value of delay time for the layouts, and from
a
these u pp er bounds, we find the range of dela y time for which area only a constant factor reduction in delay time from 2 log n - 1 to
is 0 (n log n). In section 2.4, we discuss properties of the graph's logn r esults in a significant in crease in area from O(nlogn) to
layout when its nodes are interpreted to implement the circuitry O(n2). It becomes of interest to discover upper and lower bounds
of a bin ar y adder. on are a- t im e tradeoffs for prefix computation when delay time is
restricted to lie in the range between log n and 2 log n - 1.
Carlson and Sugla. [e585] obtained lower bounds on the area
2.1 Review of Past Work
of a prefix graph for delay time between log n and 2 log n - 1.
Past work on p arallel algorithms for pr efix computation is as fol­ Among other things, the low er bounds are matched by the B-K
lows. Ladner and F is cher [LF80] gave a general co nstruct i on for and the "[{-S graphs at the extremes ofthe range. Thus, it is likely
parallel prefix circuits when gates have unbounded fan-out and that a proper combination of the B-K and K-S graphs will be both
Fich [Fi83] p rovided new lower and upp er bounds for circuits with area and time efficient.
gates having bounded fan-out. Brent and Kung [BK82] suggested By combining the B-K and K-S graphs as shown in Figure 2.3,
a regular layout for co mputing prefixes. Their algorithm is suit­ we obtain a new hybrid prefix grap h that achiev es intermediate
ab le for implementation in VLSI, and when it is combined with values of area and time. The graph is easily observed to compute
pip elining , time O(1ogn) and area O(n) can be achieved. In the prefixes correctly. We divide it into three separate blocks, the
early 70's, Kogge and Stone [KS73) discovered the technique of outer of which ar e multiple B-K graphs and t he inner of which
recUIsive doubling and gave an algorithm for solving a larg e class is the K-S gra ph . A straightforward VLSI layout of the graph
of recurrence problems on parallel computers su ch as the llliac follows from FigUIe 2.3, which is drawn with the correct number
IV. T heir algorithm can be applied to compute prefixes in paral­ of vertical tracks between each stage.
lel and a graph representation can be developed which leads to a If we define k as the extra depth of the hybrid prefix graph
p os sible VLSI layo)1t. In papers by Carlson and Sugla [CS85] and and n a8 the number of inputs, then the area of the embedded
Sugla [Su85], a lower bound on the VLSI area of a prefix graph K-S graph will be n2/2". To see this, consider a hybrid pre­
was obtained using the minimum bi s ect ion width (MBW) of the fix graph of extra depth k comp osed of an embedded K-S graph
graph. Th omp s on [Th79,80] pi oneered this appr oa ch to deriving on n/2" inputs surrounded by B-K graphs each with 2" inputs.
VLSI lower bounds. In [CS85] and [SuS5], t he lower bound on the The embedded K-S graph has n/2" inputs and thus requires n/2h
area of a parallel prefix graph is A. =: !l(n222{loS. n-T) + n) where vertical tracks to lay out in VLSI. However, since its inputs are
l o g n :$ T :$ 2 log n. They indicated the possibility of de signing spread out with 2" tracks between each pair, it requires n hori­
circuitry with minimum valu es of both area and time. zontal tracks. The area of the embedded B-K gra phs in this model
The work of Ladner and Fi s cher , and Fi ch , while interesting, will be 2kn (n horizontal tr ac ks and 2k vertical tracks). There­
uses gate count rat h er than area as a complexity measur e. This fore the total VLSI area of this layout of the hybrid prefix graph
ignores the wiring area complexity needed to interconnect circuit will be A =: 2kn + n2 /2k. In the above expression for area, it
elements. Brent and Kung [BK82] use area as a complexity mea­ can be shown that A = O(nlogn) when k is the following range:
sure, however, their techniques do not produce minimum depth log n - log log n � Ie :5 log n - 1. Thus , a small a d ditive factor
parallel prefix circuits. In the follo wing section, we use meth o ds in del a y time performance is gained over the B-K graph with no
similar to those emplo yed by Brent and Kung, and Carlson and d egradation in area consumption.
Sugla to find the minimum dep th of an area efficient prefix circui t . When the above upper bounds are compared with Carlson
The result immediately su g g est s area-efficient VLSI imp lement a­ and Sugla's (CS85] lower bound of A = !l(nl/22"), it can be
tions of high-speed binary addition. seen that the lay,out doe s not achieve o p timalit y. This is mainly
due to that area wasted in the middle K-S grap h , who s e input s
2.2 The New Hybrid Prefix Algorithm, a Graph and s ubs e quent nodes are spread by 21. tracks. H ow ever, this
Representation, and a VLSI Layout. layout for the entire graph does have some desirable prop erties ,
including a regular connection pattern, and all inputs and outputs
In this subsection, we discuss results on area-t i me tradeoffs for on the boundaries. As we will see later in this paper, efficient area
prefix computation. Our di s cussions lead to a graph representa· utilization can be made for small values of n by fo lding nodes of
tion of a new algorit hm for c omput ing prefixes, which will form the K-S graph into unused tracks.
the basis of the fast area-efficient VLSI adder des ign s presented
late r in this paper . We obtain a straightforward VLSI layout of
2.3 An Area Optimal Layout for the Hybrid Prefix
the graph, and upper bound the area of the layout when it is
Graph
cons truct e d to have a given value of delay time. From this upper
bound, we find the range of delay time for which area is efficient, In subsection 2.2, we derived a VLSI layout for prefix graphs
i.e. A O(nlogn).
= th at achieves intermediate values of area and time p erformance .
The Brent-Kung ( B - K ) prefix graph [BKB2] can be laid out The layout has some desirable properties, es p ecially for small to
in area O ( n log n) with T 2log n - 1, as shown in FigUIe 2.1.
=: intermediate values of n, and we will use it as a basis for designing
A generalization of the Kogge-Stone (K-S) linear recurrence algo­ fast area- effici ent VLSI adders later in this paper. For l arge values
rithm [KS73] computes prefixes and has a graph representation of n h ow e v er , area is wasted in the middle of the gra ph that cannot
that can be laid out in area !l( n2) with T log n (see Figure
=: be used for the placement of other components.
2.2). The lar ge area is mainly due to the lar g e number of vertical In this subsection, we p re sent an alternative layou t for the
t r acks required t o embed wires in the upp er stages of the graph. hybrid prefix graph that is more suitable for l arge valu es of n. The
The last stage requires n/2 vertical tracks, the stage before n/4, layout achieves ar e a consumption that co m e s within a constant
etc., which adds to a total of n - 1 vertical trac k s. FigUIe 2.2 factor of the lo wer bound of Carlson and Sugla(CS85], however it
illustrates this, eventhough some lines do not follow gri d tra ck s . loses the desirable pr op ert y of having all inputs and o utputs on
C omparing these grap h representations for prefix computa­ the boundary.
tion, we find extremes in area and time performance (K-S: high This asymptotically optimal layout is shown in Figure 2.4. It is

50

Authorized licensed use limited to: Uskudar Universitesi. Downloaded on February 01,2023 at 08:09:20 UTC from IEEE Xplore. Restrictions apply.
When reducing the depth of th e hybrid prefix graph , the total
a result mainly of l a ying out the emb edded K-S graph in minimwn
area. Inputs to the K-S graph are placed so that consecut ive area of the processing elements increases (more are required ) and

inputs are separa t ed


by at most a constant distance of 3 tr acks interconnection layer area increases (t he embedded K-S graph has
more inputs), vice versa with increasing the depth of the hyb r id
(some separat ion is needed to route cOIUlecting wires b etween the
K-S and the B-K graphs). Below the K-S graph , subsets of B-K .
prefix grap h When we take t he constant factors a and,8 into con­
sidera.tion, we find that the following facts are worthy of further
graphs are laid out on top of each ot h er . Each subset is formed
so that the number of inputs / outputs to the subset matches the exposition. For small values of n, t he area of processing elements
number of inputs /outputs of the K-S gr aph . In this way, the d ominat es the total area, and for large values of n, the area of
interconnection layers is the domin an t factor. Furthermore, when
horizontal space spanned by th eB-K graphs is the same as that
of the K-S graph . n :s; 64, the total area with k == 2 or 3 is almost identical to the
area with k == 1 du e to th e effect of the constant facto r s a and (3.
Using the same steps as in sub se ct ion 2.2, we now derive an
upper bound on the VLSI area of the layout The embedded K-S . When n is large, i.e. n > 64, extra depth k = log n -log log n or
graph 0«n/2k)2), since its inputs are a constant
occupies area k == 1/ 2 ( log n -loglogn ) should be used to obtain area-efficient
circuitr y.
distance from ea ch other, i.e.0(n/2/c) horizontal tracks are used.
Each B-K graph is laid out in area 0(lc2/c), so that the total area In addition to these fa ct s, if we are allowed to use more than
one m etal layer, or a folding method (explained further in the
of theB-K graphs in the layout is O(kn). Thus, the total area of
the lay out is A O(lm + (n/ 2/c ) 2 ) . Alternatively, the nwnber of
==
next section) , the total area may be even further reduced. When
horizont al tracks used by the l ayout is 0(n/2/c), and the number
using more than one metal layer, the area can be expres sed as
of vertical tracks used is 0(n/2k) by the K-S graph plus O(k22/C)
follows:
by the multipleB-K graphs. This yields the same expression for
Area Area of the processing elements
.
==

the area of the layout


1 n2 n
From the above expression for area, it can b e shown that A == +(2len + m{ 2" or ( " )2})
O(n log n) , when k is in the following range: ! (log n - log log n) :s; 2
k :s; log n - 1. Thus, a further improvement in delay time can be where m is the nwnber of metal layer s . With the fo l ding method'
made while maintaining area O(nlogn). the total area will be
Our upper bound of A 2kn + (n/2")2 mat ches the lower
==

bound of Carlson and Sugla [CS8S] to within a constant factor Area == 1/ f( Area of the processing elements )
for almost all values of extra depth k. If an H-t ree like la yout +(Area of interconnections )
is used for each B-K graph, optimalit y is achieved for all values
of Ie. We also find that T == 3/2(logn) - 1/2(loglogn - 0(1»
where f is the number of folds. T herefore with either or both of
the above two me thods , we can decrease the depth by at least
is the lowest value of delay time for which area O(nlogn) can
simultan eousl y be achieved. In Table 2.1, we compare the area one unit and still achieve the same area cons1lIIlption. We note
that reducing the depth oUhe hybrid prefix graph to its absolute
and time performance of the layouts presented in this and the
minimum (Ie 0, K-S case) results in the des i gn of area inefficient
previ ous subsection for different values of area, time , and extra
==

VLSI circuitry, b ecause the folding method cannot be applied and


.
depth We leave it as an open qu estion whether the lower bound
it is the worst case for the number of the transistor gat es (see Table
of Carlson and Su gla [eS8S] can be met under the more r estrictive
condition that all inputs /outputs appear at the b o undary of the
2.2 for a comparison of upper bounds on the nwnber of gates) .
graph's layout.
To summarize, for small values of n (n :s; 64), we have found
that t h e case when extr a depth k is equal to 1 lea.ds to VLSI
circuitry for binary a ddition that is b oth fast and area effici en t
2.4 Accounting for the Area of Nodes of the Graph
when compared to other VLSI designs of carry look-ahead adders.
When a prefix graph is used as a ba sis for designing binary addi­ Speed is obtained by using a small value of k, and area-efficiency is
tion circuitry in VLSI, each node of the graph re present s a set of obtained through the use of a fol din g method. The actual desig n
logic equations that must be computed/implemented in the un­ is explained in more detail in the fo llowing s ections .
derlying te chnology. Thus, each node can be t h ought of as a "leaf
cell" or "proc e ssing element" and will expand from being a point
3 Detailed VLSI D e s ign of a Fast Area­
in the gri d to occupy a fixed amount of area in the layout . In this
subsection, we determine the effect that th e area of the nodes has Efficient Adder
on the area of the entire layout.
As was mentioned in the previous subsection, our new hybrid
When the area of nodes/processing elements is account ed for,
prefix algorithm can be applied to the problem of carry look-ahead
th e area of addition circuitry derived from a hybri d p r e fix g r aph
addition. By using this new method, we are able to overcome
fan be expressed as follows:
the main difficulties of implementing a standard carry look-ahead
A rea Area of Processing Elemen ts adde r in VLSI: interconnectionirregularity and fan-in, fan-out
limitations. Our method improves the performance of a standard
:=

+Area of IntercoIUlection Layers


carry look-ahead adder in following ways.
== a,B(Depth Times Wid th of Layout)
i Simple leaf cells are used, each having one gate level delay.
+Area of Interconnection Layers
ii The overall design is area-efficient, since it is b ased on the
n n�
a,Bn( Ie + log n ) + (2kn + {( 2" ) 1 or ) n ew prefix algorithm and employs a folding technique.
2"'}
iii The circuitry is fast, since it is based on a near minimum
Here, a (,8) is the r atio of the
height (wi dt h) of one processing ele­ dept h prefix algo rithm .
ment to one grid line of the intercoIUlection layer. In our designs , iv Fan-in, fan-out to leaf cells is bounded by 2 logical wires (4
both a an d ,8 are approximat ely 4.
physical wires ) .

51

Authorized licensed use limited to: Uskudar Universitesi. Downloaded on February 01,2023 at 08:09:20 UTC from IEEE Xplore. Restrictions apply.
v Interconnection wires have reasonable lengths (linear in the then take posi ti ve input; produc e negative output
number of bits b eing added) . .else take negative input; pr oduce p osit ive output

This section is or g anized as follows . First we introduce the general


formulation of carry look-ahead addition. Then,we explain out
This is explained in Figure 3.180. We now explain the internal
meth o d of implementing a carry look-ahead adder based on the circuitry of each leaf cell. We use paired subscripts i, k in our
new hybrid p refix algorithm. We follow this wit h a met hod for descriptions, where i d en o t e s the input ( output ) number and k
further reducing the area of our layouts, a.nd a discus sion of how the level number.
our circuitry compares with other designs.
i) Pggen cell
This cell produce s the initial P and 9 signals ( carry propagation
S.l The Carry Look-Ahead Adder. and generation signals) . Comparing our cell with an exclusive-or
scheme, the size ca.n be reduced in half and delay time of only one
Let a..-l,···, ao a.nd bn-l,···, bo be n·bit binary numbers with
gate level can be achieved.
sum 8n, . . ',80' The following equations can be used to c o mpute
the 8,'S:
'Pi,l == , a;( + bi)
C-I 0 ,gi,1 == ,(a;' bi)
c, ( ai' b;) + (ltj + h,). <:;-1
ii ) Black cell
8i (,ei)' (ai + bi + ei-Il + ( ai' bi . Ci-I) A black c ell implements the 0 o perator on P and 9 signals . We
use two different types of black cells: bp. ca (positive input, nega­
where i = 0,1' . . , n - 1 tive output )and bn.ca (neg at ive input, positive output). This is
explained in Figure 3.1b. Each cell is used alternatively in even
Note the alternative equat i on for the ith sum bit, which is usually
and odd levels ( see Figure 3.1a ) .
expressed as Si = ai $ hi Ell Ci-l'
bn.ca ( negative input )
It is well known that computing the ei's quickly is the key to
high. spe e d addition, and that they can be determined using the gj,2k == ,«( 'Pi,21o-1) + (,gi,2k-l)) (,gj,2"-1))

following scheme.
Pi.2k ,« -'Pi,2k-l) + ("'Pi.21o-1))
(1)
where gi = a, .bi hi and Pi = ai + bp.ca ( positive input )
Here, we do not us e exclusive·ors in the e quat ions [BK82], so th at
it is not neces s ary to invert the a's and b's to produce the p's and
-,g;,21o+1 == -'(Pi,21o' gi,21e + 9i,21o)
g's. ,pj,21o+1 '(Pi.2k . Pi,2")
The recurrence relat io n in equat ion (1) can be applie d repeat­ iii) Whit e cell
edly to obtain the following set of carry equations in terms of gi, This cell is a simple inverter; notation is as follows.

Pi and C_l'
Pi,lo == 'Pi.k-l
i-1 i
9k,I "'g;,k-l
«II Pr.)· e-l
=

ei == gi + (L:( II Pk)' g;) + )


;=0 k=j+l k=O iv ) Sum cell
The e qu at ion for the ith sum bi t in t erms of the P, 9 variables and
Also, when the operator 0 is defined on ordered pairs (p, g) by
negative logic is:
(Pi, gil 0 (Pj, gil (Pi' Pj,Pj . gi + gj). It is easily verified that
=

(Ph gl) 0 a (Pi, gil


• • • (PI' . . Pi, ei). Thus, computing the carries
:=

can be performed using any valid algorithm for p refix computation

where 0 is implemented as given above. U sing this equation allows us to avoid the use of exclusive ors,
which would require the input variables and their complements
S.2 VLSI Implementation (ltj, ,ai, b i , ,bi) to be provided to the sum bi t circuitry. Taking
int o account that carries produced by the c arr y generation cir­
The hybrid prefix algorithm is straightforward to convert into cir­
cuitry alternate b etween being positive and negative, we design
cuitry for binary addition using techniques similar to those em­
two types of sum cells. One takes its inputs "naturally" and the
plo ye d by Brent and Kung [BK82]. We use the same terminology
other must invert its two carry inputs.
as [BK82]. Figure 3.1 diagrams a carry look - ahe ad adder derived
from the hybrid prefix algorithm. With this graph, we can reduce sumce1l1 F(Ci, ""Ci-I, "'Pi,l, ,gi,l)
the d epth of the prefix algorithm from 7 (B-K case) to 5 (our way)
with n=16. We use only the standard layers of interconnection
sumee1l2 F( lei, Ci-l, "'pi,l, ,gi,l )
a vailab le in NMOS VLSI (metal, silicides, polysilicon). If the de­
We note the following advantages of using leaf cells designed as
signer is allow e d to use more than 2 metal interconnection layers,
given above:
then a further reduc t ion in depth can be made while maintaining
the given area. We follow 4J.Lm NMOS topological de s i gn rules. Black cells are simple.
There are 4 cUfferent types of leaf cells in our circuitry: pg gen ,
ii P ggen cells are s i mple .
black, white, and sum. All are illustrated in Figure 3.1b. As
iii Sum cells are half of a full a dder cell.
shown in [NI8S], we can employ the following strategy to eliminate
the use of inverter gates in each leaf cell. iv All cells combine speed (minimum number of gate delays)
with area efficient layouts.
If (level is even number)

52

Authorized licensed use limited to: Uskudar Universitesi. Downloaded on February 01,2023 at 08:09:20 UTC from IEEE Xplore. Restrictions apply.
In Figure 3.2, we show the leaf cell layouts. The le af cells for the other designs, and even is competitive with the area of a regular
sum circuitry contain super buffers in order to more easily drive ripple-carry adder for small values of n. When n � 64, the area
a large capacitance output pad. Since all leaf cells have good values of our circuitry and of a ripple carry adder are compared in
layouts, the overall design is competitive in area with a standard Table 4.1. The area of our circuitry is only a factor of 2 or 3 higher
ripp le carry adder . Area increases in our design only because of than that of the ripple carry adder. For wire capacitance effects,
the extra proc es sing elements and interconnection area required the s p eed of the circui t ry is dependent on the length of the wires
by basing the design on a hybrid prefix graph. To reduce the (their capacitance) and the resistance oflong wires. In Table 4.2,
charge time of the capacitance being driven by the leaf cells, we we comp ar e for each design the total interconnection wire lengt h

used minimum sized pull up structures (2)' x 2>.). from inputs to outputs. From this table, we can conclude that it
takes almost same amount of time for each design to charge and
3.3 A Densely Packed Layout Using a Folding Method. pas s signals along the interconnection wires. Therefore we can
ignore the delay time due to wire capacitance in comparing these
Our c arr y look-ahead adder b ased on th e hybrid prefix algorithm desig ns ( b ut we do have to consider this wire capacitance when
can be packe d densely by using a f oldi ng method. With this measuring delay time in an actual implementation).
metho d , we can reduce in half the area of the overall de si gn for We now compare our circuitry with the carry skip adder of
reasonable sized adders (n � 64). Equivalently we could decrease [OB85], and the carry look-ahead adder of [Cav84] usin g the
the depth by one unit and still achieve the same ar ea consumption method of (OE85]. They associate a time unit t with one gate
by employing our folding method. For large sized adders, the propagation delay time (fan-in and fan-out bounded by at least
folding method is not as effective in reducing the depth, because 4). When we comp are our leaf cells with their single gate, OUl'S
the interconnection area. becomes dominant. is quite simple (
half the size; fan-in and fan-out bounded hy 2).
Our folding scheme is shown in Figure 3.3. It works by placing Therefore, the propagation delay time of any of our leaf cells is
two levels of the h ybrid prefix graph into one level of the layout, at most !t. As discussed in section 2.3, in our designs we have
since space is available to emb ed leaf cells. It should be ob vious fixed the value of ext ra depth to k = 1 when n � 64. The delay
from the figure that very efficient area consumption is realized, time comparison of our designs with (OB85], (Cav84] is shown in
especially for circuitry designed from a hybrid prefix grap h with Table 4.3.
extra depth k = 1. Next we compare our circuitry with the designs of [BK82]
In Fi gure 3.4, we int roduce two ways of arranging the direction [NIB5] and [NIRB6] using the method outlined in [NIBS]. Here,
of input and output to our carry look-ahead adder circuitry. In we simply count the worst-case number of gates along any path
Figure 3.4a, the inputs and outputs go through in the same direc­ from inputs to outputs to measure the delay time. The fan-in
tion (inputs and outputs bo th at the bottom). In Figure 3.4b, the and fan-out restrictions are almost the same in all these designs
outputs come out the t op , the opposite side of the inputs, which and thus are ignored. As we discussed previously, extra depth
come in the bottom. This gives the designer some fiexibility in k = 1 is used in our designs with n :$ 64. For large values of
attaching the circuitry to di fferent styles of bus structures, or n (n > 64), we use the value k = logn -loglogn. The delay
different arrangements of I/O p ads . Other desi gns of carry look­ time comparison between our design and [BK82], [NI8S], [NIR86]
ahead adders in VLSI [NIBS] [NIRB6] have ignored the placement is shown in Table 4.4. Using our hybrid prefix algorithm, delay
of input s and outputs, instead assuming that external inputs and time is the lowe s t possible value with optimal area 0 (n log n).
out p ut s are available wherever they are required. Such assump­ Furthermore, by adding various strategies (folding, simple leaf
tions are impractical in a typical ALU where I/O and bussing cells and multiple metal layers ), our designs are better than those
conventions are generally adapted in advance of the design of the of [BK82], [NI85], [NIR86] with respect to delay time, and almost
circuitry. We conclude this section by giving a diagram of the as area-efficient as a ripple carry adder.
entire chip including I/O pa.ds ( Figure 3.5).
5 Conclusion
4 Comparison with Other VLSI Adder De­
We have presented a new prefix algorithm and have applied it to
signs obtain a fast area-efficient VLSI implementation of a binary adder.
\Vith the other techniques presented here (folding, multiple metal
In this section, we compare our VLSI circuitry for carry look­
layers and the design of simple leaf cells), the area of this hybrid
ahead addition with other designs. It is not accurate to compare
adder becomes competitive with the area of a ripple carry adder.
each design by counting only the number of g ates because of the
Also, the time performance of this adder is better than any other
difrerences in fan-in and fan-out, types of gates, and wiring capac­
reported in the open literature.
itan(c effeds. AI:counting for these differences using the method
The design strategy used here can be applied t o various tech­
of [OBSS], we compare our circuitry with the carry skip adder of
nologies including ECL, TTL, CMOS, NMOS, etc. Since speed
[OB85] and the commercial carry look-ahead adder presented in
and power dissipation depend on the technology being used, and
[Cav84] for which fan-in and fan-out are bounded by at least 4
the lamda scale factor b e c omes smaller in a new technology, it may
logical wires. Then, with the same fan-in and fan-out restriction,
he desirable to extend our fan-out 2 algorithm to fan-out f (f � 2)
we compare out drcuitry with [BK82] [NI85] [NIR86].
due to the reduced fan-out capacitances being driven [OA83J. Our
First, some gerwral comparisonH arf' i.n order. Although the
current investigations are concentrating on this topic.
tim€' p�.,rc>rma.nce of our circuitry is proportional to O(logn), ii
still compares favorably to a standard carry look-ahead adder im­
plemented in VLSI. This is because of the large fan-out of a carry
look-ahead adder, and the fact that fan-out proportional to n
require. delay time O(log n) to drive as a capacitive load under
realistic models of a VLSI cirCl�it.
The area of our circuitry is very efficient when compared with

Authorized licensed use limited to: Uskudar Universitesi. Downloaded on February 01,2023 at 08:09:20 UTC from IEEE Xplore. Restrictions apply.
·4 ·8 - 2 • 6
Outputs 1 0 2 0 " · 16

B-K part

Inputs 2 • . •
16
Fig 2.1 Kogge-Stone Prelix graph

Outputs 102 0 • • · 16

1
Fig 2.4 Area Optimal Layout (K=2, n=16)

8 l(J 8 1 r Levels

Inputs 1 2 16
Fig 2.2 Brent·Kung Prelix Graph

Levels Tracks

T T
k B·K Graph (inverted tree) k
3
+ +
K-S Graph n/ 2 k
log n 2
+
B-K Graph (binary tree) k
....L
r-- n inputs -

a) Block Diagram of HPC


a ) A 16 bit CLA by using IIPC
Outputs 1
k= I t �6 6�)b U �; U
� 6 � � ��
1/ or
1
: :: : � : :: '
o 0
log n 4 :
= K-S 1 1
.; bp.ca bn .ca white.ca

B·K

Inputs 16 p g. ca sum.ca
b) Hybrid Prelix Graph of k=1 and n=16
Fig 2.3 Hybrid Prefix G raph and Block Diagram b) nolation5 of leaf cell
Fig 3 . 1 Hybrid Prefix Adder

54

Authorized licensed use limited to: Uskudar Universitesi. Downloaded on February 01,2023 at 08:09:20 UTC from IEEE Xplore. Restrictions apply.
pi g2 p2 gl pi god

0.) b p.ca. b) b o.co.

inl in2 g od
c) pgge n.ca
dJ white.c..
cO
Fig 3.4 Two different styles of HPA (16 bit)
cO cl cl
(Different arrangements of I/O pads)

g nd p .um vcld vdd sum g p


g g nd
cJ . u m l.c .. f) . u m2.c ..

Fig 3.2 Layo ut of edc h lea.f cell.

Fig 3.5 Layout of a 1 6·bit Hybrid Prefix Adder


Fig 3.3 Densely Packed Layout using folding method (n=16)

55

Authorized licensed use limited to: Uskudar Universitesi. Downloaded on February 01,2023 at 08:09:20 UTC from IEEE Xplore. Restrictions apply.
k -[extra depthT T Ttimel Are..
References
K-S k 0 lo g n O(n' )
cas.
[ B K 82J R.P. Brent and H.T. Kung, "A Regular Layout for Parallel
HPC � (Iog n - log log n) :S k :S log n O(n log n) Adders,' IEEE Tran.actions on Compute r., Vol. 0-3 1 , No.
(.ec 2.2) lo g n - 1 f log log n S, March 1982.
HPC log n log log n :S k :S 2 10g n log log n O(n log n)
(.ee 2.8) log n [ C ay 84 ] J.J.F. CavlIJlagh, Digital Co mp uter Arithmetic DeBign and

B-K k - Iog n 1 2 1c g n 1 O(n log n) Implem entation, McGraw-Hill, New York, NY, 1984.
c... e [ CS 85] D.A. C arlson and B. Sughl., "On the Area Requirements of
Very Fa.st VLSI Adders,' Technical Report, Dept. of Elec­
T..ble 2.1 Comparison �e..-Tim. Complexity
(or each c�' of graph. trical and Computer Engineering, Univ. of Ma.ssachusetts,
Amherst, MA, 1985.
number of g.ates delay time k
1 [Fi 8S] F . Fich, "New Bo u nds for Parallel Prefix C ircuit s, ' Proceed­
B-K 2.5 .. 2 log .. log n 1
K-S n log n log n 0 ing. 15th A CM Sympo Biu m on Th e ory 01 Computing, pp .
Fich 2.5 n + Sv'riTog n log n ! Iog n � Iog log n !-(Io g n - log log nl 100-109, April 1983.
lTIog It - log log nl
Fan drian to IIJld B.Y. Woo, "VLSI Flo at ing Point Proces­
HPC 2.5 n + vnTog n log n -llog n --nog log lt
[FW 851 J.
sors,' Proceeding. 7th Sympo.ium on Comp ute r A rithmetic,
pp. 9S-100, June 1985.
Table 2.2 Camporuon of Upper bound. by counting tb. number of gat••
(fan-out hounded by 2) IG LPG 8 2 ] A . G . Greenberg, R.E. Ladner, M. S. Paterson and Z . Galil,
"Efficient Parallel Al gorithms for Linear Recurrence Com­
Area of Area of
put .. tiont Information Proctllin g Lettero, Vol. 1 5 , No. 1 ,
Ripple Carry Adder Hybrid Pre fix Adder
pp . SI-85, August 1982.
n - 1 6 200 X 700,\2 350 X 700XT
n - 32 200 X 1400'\ " 500 X 14005;2 IGr 85J T. Gross, "Floating-Point Arithmetic on a Reduced Instruction­
n - 64 200 X 2 100,\� 650 X 2 1 00,\2 Set Processor, "Proceeding. 7th Sympo.ium on Computer
Arithmetic, pp. 86-92 , June 198 5 .
IKS 13] P.M. Kogse a n d H.S. Stone, "A Par.JJ.el Al gorithm for the
Table 4.1 Are a of Hybrid Prefix Adder and RCA
Efficient Solution of a General Cla.ss of Recurrence Equ....
tions," IEEE Tran.action. on Comp ute .. , Vol. C-22, No. 8 ,
Tot al intercon nectio n
pp. 783-791, August 1973.
wire length
RCA 0 ILF aO] R.E. L adner and M.J. Fischer, "Par.JJ.e l Prefix Comput....

K-S n tion.· Journal 01 the A CM, Vol. 27, No. 4, pp. 831-8:18,

B-K n October 1980.


HPA n [N! 85] T.F. Ngai and M.J. Irwin, "Regular Are.... Time Efficient
CSA n Carry- Look ..head Adders,' Proceeding' 7th SlImpo,ium on
N-I Vn Computer Arithmetic, pp . 9-15, J Uly 1985.

[ Nffi 86J T.F. N g ai, M.J. Irwin and S. Rawat, " Regul ar Are .... Time
Efficient C arry-Lookahead Adders ," Journal 01 Paraliel and
Table 4.2 Total wire length from input to output for each c as e
Distributed Computing, pp. 92-105, 1986.
n CLA CSA HPA lOA a 8 ] S . Ong and D.E. Atkins, "A Comparison of ALU Structures
16 6t St 3.St For VLSI Technology,' Proceeding. 6th SlImpo,ium on Co m­
32 8t 7t 4t puter A rith m e tic , pp. 10-16, 1983.
48 lOt 9t V.G. Oklobdzija and E.R. Barnes, "Some Optim ..l Schemes
lOB 8 5 J
64 lOt lOt 4.5t forALU Implementation in VLSI Technolo gy," Proceeding.
7th SI/mpo . iu m o n Co mpute r A rithmetic, pp. 2- 8 , June
wh ere CLA : Carry Look-Ahead Adder
1985.
(Fan-in , Fan-out bounded by 4)
CSA : Carry Skip Adder 10f 8 S ] Y. Orman, "On the Algorithmetic Complexity of Discrete
(Fan-in, Fan-out bou n ded by 4) Functions ," CI/bernetie . and Control Theorll, So viet Physic.
HPA : Hybrid Prefix Adder Dokladll, Vol. 7 , No . 7 , pp. 589- 5 9 1 , 196 3 .
(Fan-in , Fan-o u t bounded by 2) [ST 8 5 ] S .P. Smith a n d H . C . Torng , " D es i gn o f a
Fast Inner Product
Processor," Proceeding, 7th Sympo6ium o n Comp uter A rith­
Table 4 . 3 Comp arison in T using the method of lOB 85J
metic, pp. 38-48 , June 1985 .

! S U 85] B . Sugla, "Parallel C omput ation with Limited Resources,'


n RCA B-K N- I BPA Ph.D. Dissertation, Dept. of Elec tri cal and C o m p u t er Engi­
8 18 15 7 neering, Univ. of Massachuset ts, Amherst , MA , May 1985 .
9 20 8
IT h 791 C . D . Th o m p so n , "Area-Time Tro.deofl's
in VLSI," Proceed­
16 34 19 8 ing. 1 1 th A CM SIImp o 6 ium on Theor" 01 Co mp ut in g, pp.
27 56 12 8 1 - 8 8 , April 1 979.
32 66 9
ITh 80J C . D . Thompson, " A Complexity Theory for VL SI," Ph. D .
64 130 27 10
Dissert ation , Carnegie-Mellon University, D e p t . o f Com­
108 218 16
puter Science, 1980.
256 514 35 17
432 866 20
512 39 19

Tab le 4.4 Comparison i., T usi n g the m e t h o d of IN! 851

56

Authorized licensed use limited to: Uskudar Universitesi. Downloaded on February 01,2023 at 08:09:20 UTC from IEEE Xplore. Restrictions apply.

You might also like