
UNIT IV DESIGNING ARITHMETIC BUILDING BLOCKS

Data path circuits, Architectures for ripple carry adders, carry look ahead adders, High speed adders, accumulators, Multipliers, dividers, Barrel shifters, speed and area tradeoff
A Generic Digital Processor

[Figure: block diagram of a generic digital processor with MEMORY, INPUT-OUTPUT, CONTROL, and DATAPATH blocks]

Adders, multipliers, and shifters are often used in the datapaths of microprocessors and signal processors.
• The datapath is the core of the processor, where all the computations are performed.
• The other blocks in the processor are support units that either store the results produced by the datapath or help to determine what will happen in the next cycle.
• A typical datapath consists of an interconnection of basic combinational functions, such as arithmetic operators (addition, multiplication, comparison, and shift) or logic functions (AND, OR, and XOR).
• The design of the datapath depends on the set of applications it has to serve.
Building Blocks for Digital Architectures

Arithmetic unit
- Bit-sliced datapath (adder, multiplier, shifter, comparator, etc.)
Memory
- RAM, ROM, buffers, shift registers
Control
- Finite state machine (PLA, random logic)
- Counters
Interconnect
- Switches
- Arbiters
- Bus
Bit-Sliced Design
[Figure: bit-sliced datapath. Identical bit slices (Bit 0 to Bit 3) each pass Data-In through a register, adder, shifter, and multiplexer to Data-Out under shared Control signals.]
Tile identical processing elements.
• Datapaths are often arranged in a bit-sliced organization.
• Instead of operating on single-bit digital signals, the data in a processor are arranged in a word-based fashion.
• Typical microprocessor datapaths are 32 or 64 bits wide, while dedicated signal processing datapaths, such as those in DSL modems, magnetic disk drives, or compact-disc players, are of arbitrary width, typically 5 to 24 bits.
• For instance, a 32-bit processor operates on data words that are 32 bits wide. This is reflected in the organization of the datapath.
• Since the same operation frequently has to be performed on each bit of the data word, the datapath consists of 32 bit slices, each operating on a single bit; hence the term bit-sliced.
• Bit slices are either identical or share a similar structure for all bits.
• The datapath designer can therefore concentrate on the design of a single slice that is repeated 32 times.
Adder
• Addition is the most commonly used arithmetic operation. It is often the speed-limiting element as well.
• Careful optimization of the adder is therefore of the utmost importance.
This optimization is done at the logic or the circuit level:
Logic-level optimization: Boolean functions are rearranged so that a faster or smaller circuit is obtained (e.g., the carry look ahead adder).
Circuit-level optimization: transistor sizes and circuit topologies are manipulated to optimize speed.
Full-Adder

[Figure: full-adder block with inputs A, B, and Cin and outputs Sum and Cout]
The Binary Adder

$S = A \oplus B \oplus C_i = A\bar{B}\bar{C_i} + \bar{A}B\bar{C_i} + \bar{A}\bar{B}C_i + ABC_i$

$C_o = AB + BC_i + AC_i$
S and Co can also be defined as functions of the intermediate signals G (Generate), D (Delete), and P (Propagate).
• When G = 1 (D = 1), a carry bit will be generated (deleted) at Co independent of Ci, while P = 1 means that an incoming carry will propagate to Co.
The expressions for these signals can be derived from the truth table:
$G = AB$
$D = \bar{A}\bar{B}$
$P = A \oplus B$
Rewriting S and Co as functions of P and G (or D):
$C_o(G, P) = G + PC_i$
$S(G, P) = P \oplus C_i$
• G and P are functions of A and B only and do not depend on Ci.
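The generate/propagate formulation above can be checked exhaustively against the direct sum and carry equations. The following is a minimal sketch; the function names are illustrative and not part of the original slides.

```python
from itertools import product

def full_adder(a, b, ci):
    """Reference full adder: S = A xor B xor Ci, Co = AB + BCi + ACi."""
    s = a ^ b ^ ci
    co = (a & b) | (b & ci) | (a & ci)
    return s, co

def full_adder_gp(a, b, ci):
    """Same cell written with G, P, D: Co = G + P*Ci, S = P xor Ci."""
    g = a & b                # generate
    p = a ^ b                # propagate
    d = (1 - a) & (1 - b)    # delete
    co = g | (p & ci)
    s = p ^ ci
    return s, co, (g, p, d)

def check_all():
    """Compare both formulations on all eight input combinations."""
    for a, b, ci in product((0, 1), repeat=3):
        s, co = full_adder(a, b, ci)
        s2, co2, (g, p, d) = full_adder_gp(a, b, ci)
        if (s, co) != (s2, co2):
            return False
        if g + p + d != 1:   # exactly one of G, P, D is active
            return False
    return True
```

Because P is defined as the XOR of A and B, exactly one of G, P, and D is active for any input pair, which the check also confirms.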
Delay for 4 bit ripple carry adder

• The delay is proportional to the number of bits N in the input words and is approximated by
$t_{adder} = (N-1)t_{carry} + t_{sum}$
• 1. The propagation delay of the ripple carry adder is linearly proportional to N. This property plays an important role when designing adders for wide datapaths (N = 16 ... 128).
• 2. When designing the full-adder cell for a ripple carry adder, it is more important to optimize tcarry than tsum.
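A behavioral sketch of the ripple carry adder and its delay estimate may help here. The bit lists are little-endian (index 0 is the least significant bit), and the unit values for t_carry and t_sum are assumptions chosen only for illustration.

```python
def to_bits(value, n):
    """Little-endian list of the n low-order bits of value."""
    return [(value >> k) & 1 for k in range(n)]

def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << k for k, bit in enumerate(bits))

def ripple_carry_add(a_bits, b_bits, c_in=0):
    """Chain full adders; each stage waits for the previous carry."""
    carry = c_in
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sum_bits, carry

def ripple_delay(n, t_carry=1.0, t_sum=1.0):
    """t_adder = (N - 1) * t_carry + t_sum, as in the formula above."""
    return (n - 1) * t_carry + t_sum
```

With unit delays, a 4-bit adder needs 4 time units and a 32-bit adder needs 32, which shows the linear growth with N.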
Inversion Property

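[Figure: two equivalent full-adder (FA) cells, one with true inputs A, B, Ci and outputs S, Co, the other with all inputs and outputs inverted]

$\bar{S}(A, B, C_i) = S(\bar{A}, \bar{B}, \bar{C_i})$

$\bar{C_o}(A, B, C_i) = C_o(\bar{A}, \bar{B}, \bar{C_i})$

Inverting all inputs to a full adder results in inverted values for all outputs.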
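The inversion property can be confirmed by brute force over the eight input combinations. This is a small sketch with illustrative function names.

```python
from itertools import product

def full_adder(a, b, ci):
    """S = A xor B xor Ci, Co = AB + BCi + ACi."""
    return a ^ b ^ ci, (a & b) | (b & ci) | (a & ci)

def inversion_property_holds():
    """Check that inverting all inputs inverts both outputs."""
    for a, b, ci in product((0, 1), repeat=3):
        s, co = full_adder(a, b, ci)
        s_inv, co_inv = full_adder(1 - a, 1 - b, 1 - ci)
        if s_inv != 1 - s or co_inv != 1 - co:
            return False
    return True
```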
Complementary Static CMOS Full Adder
[Figure: complementary static CMOS full-adder schematic, 28 transistors, with inputs A, B, Ci, internal node X, and outputs Co and S]

$C_o = AB + BC_i + AC_i$
$S = ABC_i + \bar{C_o}(A + B + C_i)$
• The circuit occupies a large area and is slow.
• Tall pMOS transistor stacks are present in both the carry and sum generation circuits.
• The intrinsic load capacitance of the Co signal is large: it consists of two diffusion and six gate capacitances, plus the wiring capacitance.
• The signal propagates through two inverting stages in the carry-generation circuit. Given the small load (fan-out) at the output of the carry chain, two logic stages are too many and lead to extra delay.
• The sum generation requires one extra logic stage.
Minimize Critical Path by Reducing Inverting Stages

[Figure: four-bit ripple adder built from alternating even and odd FA cells, with inputs A0 to A3 and B0 to B3, carries Ci,0 and Co,0 to Co,3, and outputs S0 to S3]
Exploit Inversion Property


• The first gate of the carry-generation circuit is designed with the Ci signal on the smaller pMOS stack, lowering its logical effort to 2.
• The NMOS and PMOS transistors connected to Ci are placed as close as possible to the output of the gate, following the general rule that transistors on the critical path should be placed closest to the output.
• In stage k of the adder, signals Ak and Bk are available and stable long before Ci,k arrives after rippling through the previous stages.
• The capacitances of the internal nodes in the transistor chain are therefore precharged or discharged in advance.
• On the arrival of Ci,k, only the capacitance of node X has to be (dis)charged.
• Putting the Ci,k transistors closer to VDD and GND would require (dis)charging not only the capacitance of node X, but also the internal capacitances.
• The speed of the circuit can be improved further by reducing the number of inverting stages in the carry path, which is possible by exploiting the inversion property.
• This property allows an inverter to be eliminated from the carry chain.
A Better Structure: The Mirror Adder

[Figure: mirror adder schematic, 24 transistors, with Kill, "0"-Propagate, "1"-Propagate, and Generate paths driving Co and S from inputs A, B, and Ci]
• The carry-inverting gate is eliminated, and the PDN and PUN networks of the gate are not dual.
• Instead, they implement the propagate/generate/delete function.
• When either D or G is high, the inverted carry output is set to VDD or GND, respectively.
• When the conditions for a propagate are valid (P is 1), the incoming carry is propagated (in inverted format) to the carry output. This results in a considerable reduction in both area and delay.
Mirror Adder
Stick Diagram
[Figure: stick diagram of the mirror adder between the VDD and GND rails, with inputs A, B, Ci and output Co]
The Mirror Adder
• This full-adder cell requires only 24 transistors.
• The NMOS and PMOS chains are completely symmetrical. This guarantees identical rising and falling transitions if the NMOS and PMOS devices are properly sized.
• A maximum of two series transistors can be observed in the carry-generation circuitry.
• When laying out the cell, the most critical issue is the minimization of the capacitance at node Co. The reduction of the diffusion capacitances is particularly important.
• The capacitance at node Co is composed of four diffusion capacitances, two internal gate capacitances, and six gate capacitances in the connecting adder cell.
• The transistors connected to Ci are placed closest to the output.
• Only the transistors in the carry stage have to be optimized for speed. All transistors in the sum stage can be of minimal size.
Transmission Gate Full Adder
[Figure: transmission-gate full adder with a Setup stage producing P, a Sum Generation stage producing S, and a Carry Generation stage producing Co]
• A full adder based on this approach uses 24 transistors.
• It is based on the propagate-generate model:
$C_o(G, P) = G + PC_i$
$S(G, P) = P \oplus C_i$
• The propagate signal, which is the XOR of inputs A and B, is used to select the true or complementary value of the input carry as the new sum output.
• Based on the propagate signal, the output carry is set either to the input carry or to one of the inputs A or B.
• It has similar delays for both the sum and carry outputs.
Manchester Carry Chain adder

[Figure: Manchester carry gates with propagate (Pi), generate (Gi), and delete (Di) signals connecting Ci to Co]
• The propagate path passes Ci to the Co output if the propagate signal (Ai ⊕ Bi) is true.
• If the propagate condition is not satisfied, the output is either pulled low by the Di signal or pulled up by Gi.
• In the dynamic implementation, the transitions in the circuit are monotonic, so the transmission gates can be replaced by NMOS-only pass transistors.
• Precharging the output eliminates the need for the kill signal.
Manchester Carry Chain in dynamic logic

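[Figure: four-bit Manchester carry chain in dynamic logic, with pass transistors controlled by P0 to P3, pull-down transistors controlled by G0 to G3, carry input Ci,0, and carry nodes C0 to C3]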
• A Manchester carry chain adder uses a cascade of pass transistors to implement the carry chain.
• During the precharge phase (φ = 0), all intermediate nodes of the pass-transistor carry chain are precharged to VDD.
• During evaluation, node Ak is discharged when there is an incoming carry and the propagate signal Pk is high, or when the generate signal for stage k (Gk) is high.
• The worst-case delay of the carry chain is modeled by a linearized RC network.
Manchester Carry Chain
• Increasing the transistor width reduces the time constant, but it also loads the gates in the previous stage. The transistor size is therefore limited by the input loading capacitance.
• Unfortunately, the distributed RC nature of the carry chain results in a propagation delay that is quadratic in the number of bits N.
• To avoid this, it is necessary to insert signal-buffering inverters.
• Adding the inverters makes the overall propagation delay a linear function of N, as is the case with ripple carry adders.
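The quadratic growth and the effect of buffering can be illustrated with the Elmore approximation for a chain of identical RC sections. This is only a sketch: the 0.69 factor, the identical R and C per section, the fixed buffer delay t_buf, and the segment length m are assumptions introduced here, not values from the slides.

```python
def elmore_chain_delay(n, r=1.0, c=1.0):
    """Elmore-style estimate for n identical RC sections in series.

    Node i sees the resistance of the i sections in front of it, so the
    total is 0.69 * C * R * (1 + 2 + ... + n), which grows as n squared.
    """
    return 0.69 * sum(c * (i * r) for i in range(1, n + 1))

def buffered_chain_delay(n, m, r=1.0, c=1.0, t_buf=1.0):
    """Same chain split into segments of m sections, each followed by a buffer.

    Assumes n is a multiple of m and a fixed, hypothetical buffer delay t_buf.
    """
    segments = n // m
    return segments * (elmore_chain_delay(m, r, c) + t_buf)
```

Under these assumptions a 16-section unbuffered chain costs 0.69 x 136 time units, while four buffered segments of four sections cost 4 x (0.69 x 10 + 1), so the buffered delay grows linearly with the number of segments.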
The Binary Adder : Logic Design Considerations

• The ripple carry adder is practical only for additions with a relatively small word length.
• Most desktop computers use a word length of 32 bits, while servers require 64; very fast computers, such as mainframes, supercomputers, or multimedia processors, require word lengths of up to 128 bits.
• The linear dependence of the adder speed on the number of bits makes ripple adders rather impractical at these word lengths.
Carry-Bypass Adder
Also called Carry-Skip
[Figure: four-bit adder shown twice. Top: plain chain of four FA cells with P and G inputs and carries Ci,0 and Co,0 to Co,3. Bottom: the same chain with a bypass signal BP = P0 P1 P2 P3 controlling a multiplexer that produces Co,3.]
Idea: If (P0 and P1 and P2 and P3 = 1) then Co,3 = Ci,0, else "kill" or "generate".
• Suppose the values of Ak and Bk (k = 0, 1, 2, 3) are such that all propagate signals Pk (k = 0, 1, 2, 3) are high.
• An incoming carry Ci,0 = 1 then propagates through the complete adder chain and causes an outgoing carry Co,3 = 1.
• If (P0 P1 P2 P3 = 1) then Co,3 = Ci,0; otherwise either DELETE or GENERATE occurred.
• When BP = P0 P1 P2 P3 = 1, the incoming carry is forwarded immediately to the next block through the bypass transistor Mb; hence the name carry-bypass or carry-skip adder.
• Consider an N-bit carry-skip adder.
• The N-bit adder is divided into N/M stages, where each stage consists of M bits.
• For example, a 16-bit adder can be built from four stages of 4 bits each.
• The total delay can be derived from the figure below:
$t_{adder} = t_{setup} + Mt_{carry} + (N/M - 1)t_{bypass} + (M-1)t_{carry} + t_{sum}$
• where
• tsetup: time taken to compute the G and P signals.
• tcarry: time taken by the carry to propagate through a single bit.
• tbypass: delay of a single bypass multiplexer.
• tsum: time to generate the sum of the final stage.

Carry-Bypass Adder (cont.)

[Figure: 16-bit carry-bypass adder built from four 4-bit stages (bits 0–3, 4–7, 8–11, 12–15), each with Setup, Carry propagation, and Sum blocks; the critical path is annotated with tsetup, tbypass, and tsum, and each stage is M bits wide]

$t_{adder} = t_{setup} + Mt_{carry} + (N/M - 1)t_{bypass} + (M-1)t_{carry} + t_{sum}$


• The worst-case delay occurs when a carry is generated at the first bit, ripples through the first block, skips around (N/M) - 2 bypass multiplexers, and is finally consumed at the last bit without generating an output carry.
• The graph of delay versus N shows that the ripple carry adder is best for small values of N.
• As N increases, the bypass adder has a smaller propagation delay than the ripple carry adder.
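The two delay formulas can be compared numerically. The sketch below assumes unit values for every component delay, which is an illustrative choice and not a figure from the slides.

```python
def ripple_delay(n, t_carry=1, t_sum=1):
    """t_adder = (N - 1) * t_carry + t_sum for a ripple carry adder."""
    return (n - 1) * t_carry + t_sum

def bypass_delay(n, m, t_setup=1, t_carry=1, t_bypass=1, t_sum=1):
    """Carry-bypass delay for N bits in stages of M bits (N a multiple of M).

    t_adder = t_setup + M*t_carry + (N/M - 1)*t_bypass + (M - 1)*t_carry + t_sum
    """
    stages = n // m
    return (t_setup + m * t_carry + (stages - 1) * t_bypass
            + (m - 1) * t_carry + t_sum)

def compare(widths=(4, 8, 16, 32, 64), m=4):
    """Return (N, ripple delay, bypass delay) for each word length."""
    return [(n, ripple_delay(n), bypass_delay(n, m)) for n in widths]
```

With these unit delays and M = 4, the ripple adder is faster at N = 4 (4 versus 9), while the bypass adder wins at N = 16 (12 versus 16) and beyond, consistent with the crossover described above.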
Carry Ripple versus Carry Bypass

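[Figure: propagation delay tp versus number of bits N for the ripple adder and the bypass adder; the curves cross at N of about 4 to 8]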
• In a ripple carry adder, every full-adder cell has to wait for the incoming carry before an outgoing carry can be generated.
• This dependency can be removed by computing the result for both possible values of the incoming carry in advance.
• Once the real value of the incoming carry is known, the correct result is easily selected with a simple multiplexer stage.
• This implementation is appropriately called the carry-select adder.
• Consider the block of adders that adds bits k to k+3.
• Instead of waiting for the arrival of the output carry of bit k-1, both the 0 and the 1 possibilities are analyzed, so two carry paths are implemented.
• When Co,k-1 finally settles, either the result of the 0 path or the 1 path is selected by the multiplexer, which can be performed with a minimal delay.
• The hardware overhead of the carry-select adder is restricted to an additional carry path and a multiplexer, and equals about 30% with respect to a ripple carry structure.
Carry Select Adder: Critical Path
[Figure: 16-bit linear carry-select adder built from four 4-bit stages (bits 0–3, 4–7, 8–11, 12–15). Each stage has a Setup block, a 0-carry and a 1-carry propagation path, a Multiplexer, and a Sum Generation block; the multiplexers pass Ci,0 on as Co,3, Co,7, Co,11, and Co,15, and the outputs are S0–3, S4–7, S8–11, and S12–15.]
• The propagation delay of the linear carry-select adder is
$t_{add} = t_{setup} + Mt_{carry} + (N/M)t_{mux} + t_{sum}$
• where tsetup, tsum, and tmux are fixed delays, and N and M represent the total number of bits and the number of bits per stage, respectively.
• tcarry is the delay of the carry through a single full-adder cell.
• The carry delay through a single block is proportional to the length of that stage and equals M tcarry.
• The propagation delay of the adder is still proportional to N.
• The reason is that the block-select signal, which selects between the 0 and 1 solutions, still has to ripple through all stages in the worst case.
• The carry-select adder can be optimized further to reduce the delay.
• Consider a 16-bit linear select adder with M = 4, and assume tsetup = tcarry = tmux = tsum = 1.
• From the delay equation, tp = 1 + 4 + (16/4) + 1 = 10.
• In the linear select adder, each stage has the same number of bits, so the delay of the carry path is the same in all stages.
• The last adder stage, however, has to wait for the incoming carry, which has to cross three multiplexers.
• There is therefore a mismatch in the arrival times of the signals.
• This mismatch can be eliminated by progressively adding more bits to the subsequent stages of the adder. The result is known as the square-root carry-select adder.
• For example, the first stage can have 2 bits, the second 3 bits, the third 4 bits, and so on.
• This is shown in the figure.
• This reduces the delay to tp = 9.
• The same propagation delay is valid for a 20-bit adder, but an extra stage is added.
• The delay analysis can be done by assuming an N-bit adder containing P stages.
• If the first stage contains M bits and each subsequent stage has one more bit than the previous one, then
$N = M + (M+1) + (M+2) + \cdots + (M+P-1) = MP + \frac{P(P-1)}{2}$
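The worked numbers for the linear and square-root select adders can be reproduced with a short delay model. The sketch below assumes unit delays, a linear-select delay of t_setup + M t_carry + (N/M) t_mux + t_sum (the form consistent with the worked example tp = 1 + 4 + 16/4 + 1), and a square-root arrangement in which the first stage has M bits and each later stage has one bit more. The function names are illustrative.

```python
def linear_select_delay(n, m, t_setup=1, t_carry=1, t_mux=1, t_sum=1):
    """Linear carry-select: every stage has M bits, N/M multiplexers in series."""
    return t_setup + m * t_carry + (n // m) * t_mux + t_sum

def sqrt_select_stages(n, m=2):
    """Smallest P with M + (M+1) + ... + (M+P-1) >= N."""
    covered = 0
    stages = 0
    while covered < n:
        covered += m + stages   # the next stage is one bit wider than the last
        stages += 1
    return stages

def sqrt_select_delay(n, m=2, t_setup=1, t_carry=1, t_mux=1, t_sum=1):
    """Square-root carry-select: first-stage carry delay plus P multiplexers."""
    p = sqrt_select_stages(n, m)
    return t_setup + m * t_carry + p * t_mux + t_sum
```

Under these assumptions the model gives 10 unit delays for the 16-bit linear adder and 9 for both the 16-bit and the 20-bit square-root adders, since stages of 2, 3, 4, 5, and 6 bits cover up to 20 bits.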
Adder Delays - Comparison
[Figure: adder delay tp (in unit delays, 0 to 50) versus number of bits N (0 to 60) for the ripple adder, the linear select adder, and the square root select adder]
The Carry Lookahead Adder
• Monolithic carry-lookahead adder:
• When designing even faster adders, it is essential to get around the rippling effect of the carry that is still present in one form or another in both the carry-bypass and carry-select adders.
• Equation 4.15 is used to implement an N-bit adder. For every bit, the carry and sum outputs are independent of the previous bits.
• The ripple effect has thus been effectively eliminated, and the addition time should be independent of the number of bits.
• A general block diagram of the carry lookahead adder is shown in figure 4.21. It has some hidden dependencies.
• A schematic mirror implementation of a four-bit lookahead adder is shown in figure 4.22. The real delay increases at least linearly with the number of bits.
• The circuit exploits the self-duality and recursivity of the carry lookahead equation to build a mirror structure, as shown in figure 4.8.
• The large fan-in of the circuit makes it prohibitively slow for larger values of N.
• Implementing it with simpler gates requires multiple logic levels. In both cases, the propagation delay increases.
• The fan-out on some of the signals tends to grow excessively, slowing down the adder even more.
• The implementation area grows progressively with N.
• The lookahead structure is therefore useful only for small values of N (N ≤ 4).
LookAhead - Basic Idea
[Figure: N-bit lookahead adder; bit k receives Ak, Bk, and Ci,k, computes Pk, and produces Sk, for k = 0 to N-1]

$C_{o,k} = f(A_k, B_k, C_{o,k-1}) = G_k + P_k C_{o,k-1}$
Look-Ahead: Topology
C o k = G k + Pk G k – 1 + Pk – 1 Co  k – 2 
Expanding Lookahead equations:

All the way:C o k = G k + Pk  Gk – 1 + P k – 1  + P1  G0 + P0 Ci  0   


VDD

G2

G1

G0

Ci,0
Co,3

P0

P1

P2

P3
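The fully expanded lookahead equation can be evaluated bit by bit without feeding one carry into the next. The following is a behavioral sketch with illustrative names; it models only the logic, not the fan-in or delay of a real gate.

```python
def lookahead_carries(a_bits, b_bits, c_in=0):
    """Carry out of every bit, each computed directly from G, P, and c_in.

    Co,k = Gk + Pk*Gk-1 + Pk*Pk-1*Gk-2 + ... + Pk*...*P0*Ci,0
    Bit lists are little-endian (index 0 is the least significant bit).
    """
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a ^ b for a, b in zip(a_bits, b_bits)]
    carries = []
    for k in range(len(g)):
        prefix = 1          # running product Pk * Pk-1 * ...
        co = 0
        for j in range(k, -1, -1):
            co |= prefix & g[j]
            prefix &= p[j]
        co |= prefix & c_in
        carries.append(co)
    return carries, p

def lookahead_add(a, b, n, c_in=0):
    """Add two n-bit integers; returns (sum modulo 2**n, carry out)."""
    a_bits = [(a >> k) & 1 for k in range(n)]
    b_bits = [(b >> k) & 1 for k in range(n)]
    carries, p = lookahead_carries(a_bits, b_bits, c_in)
    total = 0
    for k in range(n):
        carry_in = c_in if k == 0 else carries[k - 1]
        total |= (p[k] ^ carry_in) << k
    return total, carries[-1]
```

The inner loop makes the cost visible: the carry for bit k contains k + 2 product terms, which is why the monolithic structure is limited to small N.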
Logarithmic Look-Ahead Adder
• The N-bit monolithic carry look ahead adder has N+1 parallel branches and up to N+1 transistors in the stack.
• This structure makes the circuit slow and increases its area.
• To build fast adders, the logarithmic look ahead adder is used instead.
• It is implemented by decomposing the carry propagation into subgroups of the N bits and combining them in a tree-like structure.
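One way to picture the tree-like decomposition is a parallel-prefix adder. The sketch below uses the Kogge-Stone arrangement, which is one particular prefix tree chosen here for illustration; the slides do not name a specific topology. Group generate and propagate pairs are merged over spans that double at every level, so the number of levels grows as log2 N.

```python
def kogge_stone_add(a, b, n, c_in=0):
    """Add two n-bit integers with a Kogge-Stone style prefix tree.

    Returns (sum modulo 2**n, carry out, number of prefix levels).
    """
    g = [(a >> k) & (b >> k) & 1 for k in range(n)]
    p = [((a >> k) ^ (b >> k)) & 1 for k in range(n)]
    group_g = g[:]
    group_p = p[:]
    # Fold the carry-in into bit 0 so group_g[0] is the carry out of bit 0.
    group_g[0] = g[0] | (p[0] & c_in)
    dist = 1
    levels = 0
    while dist < n:
        next_g = group_g[:]
        next_p = group_p[:]
        for k in range(dist, n):
            # Merge the group ending at k with the group ending at k - dist.
            next_g[k] = group_g[k] | (group_p[k] & group_g[k - dist])
            next_p[k] = group_p[k] & group_p[k - dist]
        group_g, group_p = next_g, next_p
        dist *= 2
        levels += 1
    total = 0
    for k in range(n):
        carry_in = c_in if k == 0 else group_g[k - 1]
        total |= (p[k] ^ carry_in) << k
    return total, group_g[n - 1], levels
```

For a 16-bit addition this sketch needs four prefix levels, compared with a carry that ripples through 16 cells in the ripple carry adder.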
Multipliers
The Binary Multiplication

$Z = X \times Y = \sum_{k=0}^{M+N-1} Z_k 2^k = \left(\sum_{i=0}^{M-1} X_i 2^i\right)\left(\sum_{j=0}^{N-1} Y_j 2^j\right) = \sum_{i=0}^{M-1}\left(\sum_{j=0}^{N-1} X_i Y_j 2^{i+j}\right)$

with

$X = \sum_{i=0}^{M-1} X_i 2^i$

$Y = \sum_{j=0}^{N-1} Y_j 2^j$
The Binary Multiplication

          1 0 1 0 1 0      Multiplicand
    x         1 0 1 1      Multiplier
          1 0 1 0 1 0
        1 0 1 0 1 0
      0 0 0 0 0 0          Partial products
  + 1 0 1 0 1 0
    1 1 1 0 0 1 1 1 0      Result
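The worked example above follows the usual shift-and-add scheme: each multiplier bit selects either the shifted multiplicand or zero as a partial product. A minimal sketch, with illustrative function names:

```python
def partial_products(x, y, n_bits):
    """One partial product per multiplier bit: X shifted left by j if Yj is 1, else 0."""
    return [(x << j) if (y >> j) & 1 else 0 for j in range(n_bits)]

def multiply(x, y, n_bits):
    """Sum the partial products to obtain X times Y."""
    return sum(partial_products(x, y, n_bits))
```

For the example, `partial_products(0b101010, 0b1011, 4)` gives 42, 84, 0, and 336, which add up to 462, that is 111001110 in binary.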
The Array Multiplier
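[Figure: 4x4 array multiplier. Each row combines X3 to X0 with one multiplier bit (Y0 to Y3); three rows of adders (HA FA FA HA, then FA FA FA HA twice) accumulate the partial products into the product bits Z0 to Z7.]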
The MxN Array Multiplier: Critical Path
[Figure: adder array (rows HA FA FA HA, FA FA FA HA, FA FA FA HA) with Critical Path 1 and Critical Path 2 marked; the two paths share part of their route]
Carry-Save Multiplier
[Figure: carry-save multiplier with adder rows HA HA HA HA, HA FA FA FA, and HA FA FA FA, followed by a final row (HA FA FA HA) that forms the Vector Merging Adder]


• In the array multiplier, the carry bits are passed along each row, which adds delay to the critical path.
• The multiplier output does not change when the output carry bits are passed diagonally downward to the next row instead, and this reduces the delay.
• Since the carry bits are not added immediately but are saved for the next adder stage, the structure is called a carry-save multiplier.
• An extra adder, the vector-merging adder, is needed to generate the final product.
• For this final stage, a fast carry look ahead adder is used.
• This increases the area compared to the array multiplier.
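The carry-save idea can be sketched at the word level: three operands are reduced to a sum word and a carry word without propagating any carry, and only the final vector-merging step performs a real addition. The function names below are illustrative, and Python's integer addition stands in for the fast final adder.

```python
def carry_save_add(x, y, z):
    """Reduce three words to (sum_word, carry_word) with x + y + z == sum + carry.

    Every bit position acts as an independent full adder; the carry
    outputs are shifted left by one position and saved, not propagated.
    """
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (y & z) | (x & z)) << 1
    return sum_word, carry_word

def carry_save_multiply(x, y, n_bits):
    """Accumulate partial products in carry-save form, then merge once."""
    sum_word, carry_word = 0, 0
    for j in range(n_bits):
        partial = (x << j) if (y >> j) & 1 else 0
        sum_word, carry_word = carry_save_add(sum_word, carry_word, partial)
    return sum_word + carry_word   # vector-merging adder
```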
Multiplier Floorplan
[Figure: multiplier floorplan. X3 to X0 and Y0 to Y3 run through rows of HA multiplier cells, FA multiplier cells, and vector-merging cells, each with carry (C) and sum (S) outputs, producing Z0 to Z7.]
X and Y signals are broadcast through the complete array.
Wallace tree multiplier
• To improve the speed and to reduce the number of adders in the multiplier, a tree structure called the Wallace tree multiplier is used.
• Consider, for example, a 4x4 multiplication.
• There are four rows of partial products, and each row is 4 bits long.
• The number of adders can be reduced by observing that only column 3 has to add four bits.
• The partial products are rearranged in a tree-like fashion, shown in the figure, to visually illustrate the varying depth.
• A circle covering three bits represents a full adder, which has three inputs and produces two outputs.
• The sum output stays in the same column, and the carry output is moved to the next column.
• The full adder is therefore called a 3-2 compressor.
• A circle covering two bits represents a half adder.
• In the first stage, two half adders are used, in column 3 and column 4.
• The sum stays in the same column and the carry is moved to the next column, as shown in fig. (c).
• In the second stage, three full adders and one half adder are used.
• Only three full adders and three half adders are used for the reduction process, compared with six FAs and six HAs in the carry-save multiplier.
• The final stage can use any fast adder for the addition.
• This hardware structure reduces the number of adders and can be used for large multiplications.
• The propagation delay is proportional to $\log_{3/2} N$.
• However, the structure is irregular, which makes the layout inefficient.
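The column-wise reduction with 3-2 compressors can be sketched as follows. This is a simple greedy reduction written for illustration: it applies a full adder to every group of three bits in a column and a half adder to a leftover pair, so the adder counts it reports need not match the hand-optimized tree described above. Python's integer addition stands in for the final fast adder.

```python
def wallace_multiply(x, y, n_bits):
    """Multiply two n-bit integers by column compression.

    Returns (product, full_adders_used, half_adders_used).
    """
    # columns[k] holds the bits of weight 2**k that still have to be added.
    columns = [[] for _ in range(2 * n_bits)]
    for i in range(n_bits):
        for j in range(n_bits):
            columns[i + j].append(((x >> i) & 1) & ((y >> j) & 1))
    full_adders = 0
    half_adders = 0
    while max(len(col) for col in columns) > 2:
        reduced = [[] for _ in range(len(columns) + 1)]
        for k, col in enumerate(columns):
            idx = 0
            while len(col) - idx >= 3:        # 3-2 compressor (full adder)
                a, b, c = col[idx:idx + 3]
                reduced[k].append(a ^ b ^ c)
                reduced[k + 1].append((a & b) | (b & c) | (a & c))
                full_adders += 1
                idx += 3
            if len(col) - idx == 2:           # half adder on a leftover pair
                a, b = col[idx:idx + 2]
                reduced[k].append(a ^ b)
                reduced[k + 1].append(a & b)
                half_adders += 1
            elif len(col) - idx == 1:         # a single bit passes through
                reduced[k].append(col[idx])
        columns = reduced
    # At most two bits remain per column: add the two rows with a final adder.
    product = sum(bit << k for k, col in enumerate(columns) for bit in col)
    return product, full_adders, half_adders
```

Each full adder removes one bit from the array while preserving the weighted sum, which is why the loop terminates and the final result equals the product.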
Wallace-Tree Multiplier

[Figure: Wallace tree for a 4x4 multiplier as dot diagrams over bit positions 6 to 0: (a) partial products, (b) first stage, (c) second stage, (d) final adder; circles mark FA and HA groupings]
Wallace-Tree Multiplier

[Figure: Wallace tree multiplier. The partial products x_i y_j feed a first stage of two HAs, a second stage of FAs, and a final adder that produces z7 to z0.]
Wallace-Tree Multiplier
[Figure: Wallace tree built from FA cells that compress inputs y0 to y5 into a carry (C) and sum (S) output pair, with carries Ci-1 entering from and Ci leaving to neighboring bit slices]
Shifters
The Binary Shifter
[Figure: bit slice i of a binary shifter; control signals Right, nop, and Left connect inputs Ai and Ai-1 to outputs Bi and Bi-1]
The Barrel Shifter
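[Figure: barrel shifter with inputs A0 to A3, outputs B0 to B3, and control signals Sh0 to Sh3; the legend distinguishes data wires from control wires]
Area Dominated by Wiring
[Figure: layout of a 4x4 barrel shifter with inputs A0 to A3, control signals Sh0 to Sh3, and output buffers]
$\text{Width}_{barrel} \approx 2 p_m M$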
Logarithmic Shifter
[Figure: logarithmic shifter with inputs A0 to A3, outputs B0 to B3, and three stages controlled by Sh1, Sh2, and Sh4, each with its complement]
• While the barrel shifter implements the whole shifter as a single array of pass transistors, the logarithmic shifter uses a staged approach.
• The total shift is decomposed into shifts over powers of two.
• A shifter with a maximum shift width of M consists of log2 M stages, where the ith stage either shifts over 2^i bits or passes the data unchanged.
• The figure above shows a shifter with a maximum shift value of seven bits.
• For instance, to shift over five bits, the first stage is set to shift mode, the second to pass mode, and the last stage again to shift mode.
• The control word for this shifter is already encoded, so no separate decoder is required.
• The speed of the logarithmic shifter depends on the shift width in a logarithmic way, since an M-bit shifter requires log2 M stages.
• The series connection of pass transistors slows the shifter down for larger shift values. A careful introduction of intermediate buffers is therefore necessary.
• The barrel shifter is appropriate for smaller shifters.
• For larger shift values, the logarithmic shifter becomes more effective in terms of both area and speed.
• The logarithmic shifter is also easily parameterized, allowing for automatic generation.
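The staged operation of the logarithmic shifter can be modeled in a few lines. The sketch assumes a right shift, an 8-bit data word, and three stages (shifts of 1, 2, and 4); these parameters and the function names are illustrative choices, not taken from the slides.

```python
def stage_modes(shift, n_stages=3):
    """Mode of each stage, first stage first: bit i of the shift amount selects stage i."""
    return ["shift" if (shift >> i) & 1 else "pass" for i in range(n_stages)]

def log_shift_right(value, shift, width=8, n_stages=3):
    """Right shift built from stages that shift by 2**i or pass the data."""
    value &= (1 << width) - 1
    for i in range(n_stages):
        if (shift >> i) & 1:
            value >>= (1 << i)   # stage i in shift mode
        # otherwise stage i passes the data unchanged
    return value
```

A shift over five bits (binary 101) sets the stages to shift, pass, shift, and the binary shift amount is used directly as the control word, so no decoder is needed.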
0-7 bit Logarithmic Shifter

[Figure: layout of a 0-7 bit logarithmic shifter with inputs A0 to A3 and outputs Out0 to Out3]
Speed and Area trade off
• There is a trade-off between speed and area in digital arithmetic circuits.
• Depending on the product specifications, either area or speed is the dominating factor.
• The designer needs a good understanding of all design constraints to make the product a success.
• The analysis of the adder and multiplier circuits shows that the propagation delay of the ripple carry adder is proportional to the number of bits N, while its area is small.
• Circuit optimization is done to reduce the delay. The delay and area are slightly reduced in the Manchester adder and the bypass adder.
• Other adder structures use logic optimization to increase the performance.
• The delays of the carry-select adder and the carry look ahead adder depend on the number of bits in a square-root and a logarithmic fashion, respectively.
• However, the area increases in the carry look ahead adder.
• The designer should determine the critical path of the circuit and optimize that path.
• The area of the circuit is not determined by the number of transistors alone. The wiring, the contacts, and the number of vias also have an impact on the size.
 The comparison plot drawn for the delay area
Accumulator:
Configuration of Accumulator Cells
• The configuration that drives the CUT inputs when A[i] = 1 is required: Set[i] = 1 and Reset[i] = 0, and hence A[i] = 1 and B[i] = 0. The output is then equal to 1, and Cin is transferred to Cout.
• The configuration that drives the CUT inputs when A[i] = 0 is required: Set[i] = 0 and Reset[i] = 1, and hence A[i] = 0 and B[i] = 1. The output is then equal to 0, and Cin is transferred to Cout.
• The configuration that drives the CUT inputs when A[i] = “_” is required: Set[i] = 0 and Reset[i] = 0. The D input of the flip-flop of register B is driven by either 1 or 0, depending on the value that will be added to the accumulator inputs in order to generate satisfactorily random patterns at the inputs of the CUT.
