[Figure: processor building blocks: memory, input-output, control, datapath]
Arithmetic unit
- Bit-sliced datapath (adder, multiplier, shifter, comparator, etc.)
Memory
- RAM, ROM, Buffers, Shift registers
Control
- Finite state machine (PLA, random logic)
- Counters
Interconnect
- Switches
- Arbiters
- Bus
Bit-Sliced Design
[Figure: bit-sliced datapath. Bit slices 0 to 3, each containing a register, adder, shifter, and multiplexer between Data-In and Data-Out, with shared control lines. Tile identical processing elements.]
Datapaths are often arranged in a bit-sliced organization.
Instead of operating on single-bit digital signals, the data in a processor are arranged in a word-based fashion.
Typical microprocessor datapaths are 32 or 64 bits wide, while dedicated signal-processing datapaths, such as those in DSL modems, magnetic disk drives, or compact-disc players, are of arbitrary width, typically 5 to 24 bits.
For instance, a 32-bit processor operates on data words that are 32 bits wide.
This is reflected in the organization of the datapath.
Since the same operation frequently has to be performed on each bit of the data word, the datapath consists of 32 bit slices, each operating on a single bit, hence the term bit-sliced.
Bit slices are either identical or share a similar structure for all bits.
The datapath designer can concentrate on the design of a single slice that is repeated 32 times.
Adder
•Addition is the most commonly used arithmetic operation. It is often the speed-limiting element as well.
•Careful optimization of the adder is of the utmost importance.
This optimization is done at the logic or circuit level:
Logic-level optimization: Boolean functions are rearranged so that a faster or smaller circuit is obtained (e.g., the carry-lookahead adder).
Circuit-level optimization: transistor sizes and circuit topologies are manipulated to optimize speed.
Full-Adder
[Figure: full-adder symbol with inputs A and B and output Sum]
The Binary Adder
S = A ⊕ B ⊕ Ci
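[Figure: a full adder (FA) with inputs A, B, Ci and outputs S, Co, next to the same cell with all inputs and outputs inverted]
$\overline{S}(A, B, C_i) = S(\overline{A}, \overline{B}, \overline{C_i})$
$\overline{C_o}(A, B, C_i) = C_o(\overline{A}, \overline{B}, \overline{C_i})$
Inverting all inputs to a full adder results in inverted values for all outputs.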
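The inversion property can be checked exhaustively with a small behavioral model. The sketch below is only an illustration: the function names are made up for this example, and the carry is modeled as the majority function Co = AB + BCi + ACi.

```python
from itertools import product


def full_adder(a, b, ci):
    """Behavioral one-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ ci
    co = (a & b) | (b & ci) | (a & ci)
    return s, co


def inversion_property_holds():
    """Return True if inverting all inputs inverts both outputs."""
    for a, b, ci in product((0, 1), repeat=3):
        s, co = full_adder(a, b, ci)
        s_inv, co_inv = full_adder(1 - a, 1 - b, 1 - ci)
        if (s_inv, co_inv) != (1 - s, 1 - co):
            return False
    return True
```

Calling inversion_property_holds() walks all eight input combinations, so it serves as a complete check of the property for this model.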
Complementary Static CMOS Full Adder
[Figure: complementary static CMOS full-adder schematic with inputs A, B, Ci, internal node X, and output Co; 28 transistors]
$C_o = AB + BC_i + AC_i$
$S = ABC_i + \overline{C_o}(A + B + C_i)$
This circuit consumes a large area and is slow.
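The sum expression reuses the carry, which enters in inverted form: S = ABCi + (not Co)(A + B + Ci). The following sketch, with hypothetical function names, confirms that this form agrees with the XOR definition of the sum for every input combination.

```python
from itertools import product


def carry(a, b, ci):
    """Carry output: Co = AB + BCi + ACi."""
    return (a & b) | (b & ci) | (a & ci)


def sum_from_carry(a, b, ci):
    """Sum built from the inverted carry: S = ABCi + (not Co)(A + B + Ci)."""
    co = carry(a, b, ci)
    return (a & b & ci) | ((1 - co) & (a | b | ci))


def matches_xor_form():
    """Compare the shared-carry form against S = A xor B xor Ci."""
    return all(
        sum_from_carry(a, b, ci) == a ^ b ^ ci
        for a, b, ci in product((0, 1), repeat=3)
    )
```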
[Figure: four-bit adder chain with inputs A0 B0 to A3 B3 and outputs S0 to S3]
[Figure: mirror adder schematic with kill, generate, "0"-propagate, and "1"-propagate paths for inputs A, B, Ci and outputs Co, S; 24 transistors]
The carry-inverting gate is eliminated, and the PDN and PUN networks of the gate are not dual.
Instead, they form the propagate/generate/delete function: when either D or G is high, $\overline{C_o}$ is set to VDD or GND, respectively.
When the conditions for a propagate are valid (P is 1), the incoming carry is propagated (in inverted format) to $\overline{C_o}$.
This results in a considerable reduction in both area and delay.
Mirror Adder
[Figure: stick diagram of the mirror adder between the VDD and GND rails, with signal columns A, B, Ci, B, A, Ci, Co, Ci, A, B]
The Mirror Adder
This full adder cell requires only 24 transistors.
The NMOS and PMOS chains are completely symmetrical.
This guarantees identical rising and falling transitions if the
NMOS and PMOS devices are properly sized.
A maximum of two series transistors can be observed in the
carry-generation circuitry.
When laying out the cell, the most critical issue is the
minimization of the capacitance at node Co.
The reduction of the diffusion capacitances is particularly
important.
The capacitance at node Co is composed of four diffusion capacitances, two internal gate capacitances, and six gate capacitances in the connecting adder cell.
The transistors connected to Ci are placed closest to the output.
Only the transistors in the carry stage have to be optimized for speed.
All transistors in the sum stage can be minimal size.
Transmission Gate Full Adder
[Figure: transmission-gate full adder schematic with setup, sum-generation, and carry-generation sections]
A full adder based on this approach uses 24 transistors.
It is based on the propagate-generate model:
Co(G, P) = G + P Ci
S(G, P) = P ⊕ Ci
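A minimal behavioral sketch of the propagate-generate model follows. It assumes the usual definitions G = AB, P = A ⊕ B, and D = (not A)(not B). The function names are illustrative only.

```python
from itertools import product


def pgd(a, b):
    """Return (propagate, generate, delete) for one bit position.

    Assumed definitions: P = A xor B, G = A and B, D = (not A) and (not B).
    """
    p = a ^ b
    g = a & b
    d = (1 - a) & (1 - b)
    return p, g, d


def full_adder_pg(a, b, ci):
    """Full adder using Co = G + P*Ci and S = P xor Ci."""
    p, g, _ = pgd(a, b)
    co = g | (p & ci)
    s = p ^ ci
    return s, co


def agrees_with_arithmetic():
    """Check the model against integer addition for all inputs."""
    return all(
        full_adder_pg(a, b, ci) == ((a + b + ci) & 1, (a + b + ci) >> 1)
        for a, b, ci in product((0, 1), repeat=3)
    )
```

Exactly one of P, G, and D is high for any input pair, which is what lets the carry output be driven by a single path at a time.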
[Figure: Manchester carry gates, static and dynamic versions, with signals Pi, Gi, Di, Ci, and Co]
The propagate path is unchanged, and it passes Ci to the Co output if the propagate signal (Ai ⊕ Bi) is true.
If the propagate condition is not satisfied, the output is either pulled low by the Di signal or pulled up by Gi.
In the dynamic implementation, the transitions in the circuit are monotonic, so the transmission gates can be replaced by NMOS-only pass transistors.
Precharging the output eliminates the need for the kill signal.
Manchester Carry Chain in dynamic logic
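[Figure: four-bit Manchester carry chain in dynamic logic with carry input Ci,0, propagate signals P0 to P3, generate signals G0 to G3, and carry nodes C0 to C3]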
A Manchester carry-chain adder uses a cascade of pass transistors to implement the carry chain.
During the precharge phase (φ = 0), all intermediate nodes of the pass-transistor carry chain are precharged to VDD.
During evaluation, the Ak node is discharged when there is an incoming carry and the propagate signal Pk is high, or when the generate signal for stage k (Gk) is high.
The worst-case delay of the carry chain is modeled by the linearized RC network.
Manchester Carry Chain
Increasing the transistor width reduces the time constant, but it loads the gates in the previous stage.
Therefore, the transistor size is limited by the input loading capacitance.
Unfortunately, the distributed RC nature of the carry chain results in a propagation delay that is quadratic in the number of bits N.
To avoid this, it is necessary to insert signal-buffering inverters.
Adding the inverters makes the overall propagation delay a linear function of N, as is the case with ripple-carry adders.
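The quadratic growth can be illustrated with a first-order Elmore delay estimate of an RC ladder. This is a sketch under stated assumptions, not a circuit-accurate model: every stage is assumed to have the same resistance r and node capacitance c, the factor 0.69 is the usual ln 2 for a first-order RC response, and the buffer delay t_buf and the function names are hypothetical.

```python
def elmore_delay(resistances, capacitances):
    """Elmore delay at the far end of an RC ladder.

    resistances[i] is the series resistance of segment i and
    capacitances[i] is the capacitance of the node that follows it.
    """
    total = 0.0
    upstream_r = 0.0
    for r, c in zip(resistances, capacitances):
        upstream_r += r          # resistance between the source and this node
        total += upstream_r * c  # each node sees all upstream resistance
    return total


def chain_delay(n_bits, r=1.0, c=1.0):
    """Estimated delay of an unbuffered N-stage chain: 0.69 * RC * N(N+1)/2."""
    return 0.69 * elmore_delay([r] * n_bits, [c] * n_bits)


def buffered_chain_delay(n_bits, group, t_buf, r=1.0, c=1.0):
    """Delay with a buffer after every `group` stages (group must divide n_bits)."""
    return (n_bits // group) * (chain_delay(group, r, c) + t_buf)
```

With r = c = 1, doubling an unbuffered chain from 4 to 8 stages raises the estimate from 6.9 to 24.84. With a buffer after every 4 stages, the delay grows in equal steps per group, that is, linearly in N.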
The Binary Adder : Logic Design Considerations
[Figure: four-bit carry-bypass block. Full adders (FA) with propagate and generate signals P0 G0 to P3 G3, carries Ci,0, Co,0, Co,1, Co,2, and a multiplexer that selects Co,3 under control of the bypass signal BP = P0 P1 P2 P3]
Idea: if (P0 and P1 and P2 and P3 = 1) then Co,3 = C0, else "kill" or "generate".
Suppose the values of Ak and Bk (k = 0, 1, 2, 3) are such that all propagate signals Pk (k = 0, 1, 2, 3) are high.
The incoming carry is then passed directly to the block output through the bypass.
[Figure: propagation delay tp versus number of bits N for the ripple adder and the bypass adder; the crossover is marked at N = 4..8]
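The bypass idea can be sketched as a behavioral model of one four-bit block. The code is illustrative only: the function name and the LSB-first bit lists are assumptions, and timing is not modeled. The multiplexer changes only how quickly the carry arrives, not its value.

```python
def bypass_block(a_bits, b_bits, c_in):
    """One carry-bypass block; bit lists are LSB first.

    Returns (sum_bits, c_out, bypassed).
    """
    p = [a ^ b for a, b in zip(a_bits, b_bits)]  # propagate signals
    g = [a & b for a, b in zip(a_bits, b_bits)]  # generate signals

    # Ripple path through the full adders of the block.
    carry = c_in
    sum_bits = []
    for pk, gk in zip(p, g):
        sum_bits.append(pk ^ carry)
        carry = gk | (pk & carry)

    # Bypass signal BP = P0 P1 P2 P3 drives the output multiplexer.
    bypassed = all(p)
    c_out = c_in if bypassed else carry
    return sum_bits, c_out, bypassed
```

For example, bypass_block([1, 0, 1, 0], [0, 1, 0, 1], 1) adds 5 + 10 + 1: every propagate signal is high, so the carry-in is forwarded directly and the sum bits are all zero.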
In a ripple-carry adder, every full-adder cell has to wait for the incoming carry before an outgoing carry can be generated.
One way around this dependency is to evaluate the result for both possible values of the incoming carry in advance.
Once the real value of the incoming carry is known, the correct result is easily selected with a simple multiplexer stage.
This implementation is appropriately called the carry-select adder.
Consider the block of adders that adds bits k to k+3.
Instead of waiting for the arrival of the output carry of bit k-1, both the 0 and 1 possibilities are analyzed, and two carry paths are implemented.
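A behavioral sketch of the carry-select scheme is given below. The block size of 4, the 16-bit width, and the function names are assumptions for illustration. In hardware the two ripple paths of a block run in parallel, whereas here they are simply computed one after the other.

```python
def ripple_block(a_bits, b_bits, c_in):
    """Ripple-carry addition of two LSB-first bit lists."""
    carry = c_in
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sum_bits, carry


def carry_select_add(a, b, width=16, block=4):
    """Add two unsigned integers block by block with carry selection."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    carry = 0
    result = 0
    for start in range(0, width, block):
        xa = a_bits[start:start + block]
        xb = b_bits[start:start + block]
        # Two carry paths: one assumes carry-in 0, the other carry-in 1.
        sums0, c0 = ripple_block(xa, xb, 0)
        sums1, c1 = ripple_block(xa, xb, 1)
        # Multiplexer stage: pick the result matching the real carry.
        sums, carry = (sums1, c1) if carry else (sums0, c0)
        for i, s in enumerate(sums):
            result |= s << (start + i)
    return result | (carry << width)
```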
[Figure: propagation delay tp (in unit delays, 0 to 40) versus number of bits N (0 to 60) for the ripple adder, the linear carry-select adder, and the square-root carry-select adder]
The Carry Lookahead Adder
Monolithic carry-lookahead adder:
When designing even faster adders, it is essential to get around the rippling effect of the carry that is still present in one form or another in both the carry-bypass and carry-select adders.
Equation 4.15 is used to implement an N-bit adder. For every bit, the carry and sum outputs are independent of the previous bits.
The ripple effect has thus been effectively eliminated, and the addition time should be independent of the number of bits.
A general block diagram for the carry-lookahead adder is shown in figure 4.21. This high-level model has some hidden dependencies: the real delay increases at least linearly with the number of bits.
A schematic mirror implementation of a four-bit lookahead is shown in figure 4.22.
The circuit exploits the self-duality and recursivity of the carry-lookahead equation to build a mirror structure as shown in figure 4.8.
The large fan-in of the circuit makes it prohibitively slow for larger values of N.
Implementing it with simpler gates requires multiple logic levels. In both cases, the propagation delay increases.
The fanout on some of the signals tends to grow excessively, slowing down the adder even more.
The implementation area grows progressively with N.
For small values of N (≤ 4), the lookahead structure is useful.
LookAhead - Basic Idea
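[Figure: N-bit lookahead adder block diagram with inputs A0, B0 to AN-1, BN-1, carries Ci,0 to Ci,N-1, propagate signals P0 to PN-1, and sum outputs S0 to SN-1]
$C_{o,k} = f(A_k, B_k, C_{o,k-1}) = G_k + P_k C_{o,k-1}$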
Look-Ahead: Topology
Expanding Lookahead equations:
$C_{o,k} = G_k + P_k (G_{k-1} + P_{k-1} C_{o,k-2})$
[Figure: four-bit lookahead gate producing Co,3, with inputs G0 to G2, P0 to P3, and Ci,0 labelled]
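Expanding the recursion all the way gives each carry directly in terms of the generate and propagate signals and the carry input. The sketch below evaluates that fully expanded form for every bit and can be compared with a ripple computation. The function names and the LSB-first bit lists are assumptions made for the example.

```python
def lookahead_carries(a_bits, b_bits, c_in):
    """Carry out of every bit, computed from the expanded equation.

    Co,k = Gk + Pk*Gk-1 + Pk*Pk-1*Gk-2 + ... + Pk*...*P0*Ci,0
    """
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a ^ b for a, b in zip(a_bits, b_bits)]
    carries = []
    for k in range(len(g)):
        co = 0
        prefix = 1  # running product Pk * Pk-1 * ... of the terms seen so far
        for j in range(k, -1, -1):
            co |= prefix & g[j]
            prefix &= p[j]
        co |= prefix & c_in  # last term: all propagates high and a carry in
        carries.append(co)
    return carries


def ripple_carries(a_bits, b_bits, c_in):
    """Reference: the same carries from the recursive form Gk + Pk*Co,k-1."""
    carries = []
    carry = c_in
    for a, b in zip(a_bits, b_bits):
        carry = (a & b) | ((a ^ b) & carry)
        carries.append(carry)
    return carries
```

The inner loop grows with k, which mirrors the growing fan-in of the expanded gate for the higher bits.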
Logarithmic Look-Ahead Adder
The N-bit monolithic carry-lookahead adder has N+1 parallel branches and N+1 transistors in the stack.
This structure makes the circuit slow and increases the area.
In order to build fast adders, the logarithmic lookahead adder is used.
It is implemented by decomposing the carry propagation into subgroups of the N bits and combining them in a tree-like structure.
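One way to picture the tree-like structure is as a parallel-prefix computation over (G, P) pairs. The sketch below uses a Kogge-Stone-style arrangement, which is only one possible tree and is not named in the text. The combining operator dot, the function names, and the way the carry input is folded in as an extra generate term are all assumptions of this example.

```python
def dot(high, low):
    """Combine the (G, P) pair of a group with that of the group below it."""
    g_hi, p_hi = high
    g_lo, p_lo = low
    return g_hi | (p_hi & g_lo), p_hi & p_lo


def prefix_carries(a_bits, b_bits, c_in=0):
    """Carry out of every bit using a logarithmic number of combining levels."""
    # Position 0 models the carry input as a pure generate term.
    gp = [(c_in, 0)] + [(a & b, a ^ b) for a, b in zip(a_bits, b_bits)]
    span = 1
    while span < len(gp):
        # At each level every position absorbs the group `span` places below.
        gp = [
            dot(gp[i], gp[i - span]) if i >= span else gp[i]
            for i in range(len(gp))
        ]
        span *= 2
    # After the last level, entry i holds the group generate of bits 0..i.
    return [g for g, _ in gp[1:]]
```

The number of passes through the while loop grows with the logarithm of the word length, while the carries themselves are the same as those of a ripple adder.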
Multipliers
The Binary Multiplication
$Z = X \times Y = \sum_{k=0}^{M+N-1} Z_k 2^k = \left( \sum_{i=0}^{M-1} X_i 2^i \right) \left( \sum_{j=0}^{N-1} Y_j 2^j \right) = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} X_i Y_j 2^{i+j}$
with
$X = \sum_{i=0}^{M-1} X_i 2^i$
$Y = \sum_{j=0}^{N-1} Y_j 2^j$
The Binary Multiplication
          1 0 1 0 1 0    Multiplicand
    x         1 0 1 1    Multiplier
          1 0 1 0 1 0
        1 0 1 0 1 0
      0 0 0 0 0 0        Partial products
    1 0 1 0 1 0
    1 1 1 0 0 1 1 1 0    Result
The Array Multiplier
[Figure: 4x4 array multiplier. Rows of partial products X3 X2 X1 X0 gated by Y0 to Y3 are summed by rows of half adders (HA) and full adders (FA) to produce the product bits up to Z7]
The MxN Array Multiplier: Critical Path
[Figure: array of HA and FA cells with Critical Path 1 and Critical Path 2 marked]
[Figure: multiplier array built from HA multiplier cells, FA multiplier cells, and vector-merging cells, with carry (C) and sum (S) outputs per cell, inputs Y0 to Y3, and outputs Z0 to Z7]
X and Y signals are broadcast through the complete array.
Wallace tree multiplier
To improve the speed and to reduce the number of adders in multiplication, a tree structure called the Wallace tree multiplier is used.
Consider, for example, a 4x4 multiplication.
There are four rows of partial products, and each row is 4 bits long.
The number of adders can be reduced by observing that only column 3 has to add four bits.
The partial products are rearranged in a tree-like fashion, shown in the figure, to visually illustrate its varying depth.
A circle covering three bits represents a full adder, which has three inputs and produces two outputs.
The sum output stays in the same column, and the carry output is moved to the next column.
The full adder is therefore called a 3-2 compressor.
A circle covering two bits represents a half adder.
In the first stage, two half adders are used, in column 3 and column 4.
The sum is located in the same column and the carry is moved to the next column, as shown in fig. (c).
In the second stage, three full adders and one half adder are used.
Only three full adders and three half adders are used for the reduction process, compared with six FAs and six HAs in the carry-save multiplier.
The final stage can use any fast adder for the addition.
This hardware structure saves adders, and it can be used for large multiplications.
The propagation delay is proportional to $\log_{3/2} N$.
However, the structure is irregular, which makes the layout inefficient.
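The column-wise reduction can be sketched as follows. This is a simple greedy variant written for illustration, not the hand-optimized tree described above: it applies a full adder to every group of three bits and a half adder to every leftover pair in each stage, so its adder counts will generally differ from the three FAs and three HAs quoted in the text. The function name and the return format are assumptions, and the final addition is done with ordinary integer arithmetic in place of a fast adder.

```python
def wallace_multiply(x, y, m=4, n=4):
    """Multiply by reducing partial-product columns with 3-2 compressors.

    Returns (product, full_adder_count, half_adder_count, stages).
    """
    # columns[k] holds all partial-product bits of weight 2**k.
    columns = [[] for _ in range(m + n)]
    for i in range(m):
        for j in range(n):
            columns[i + j].append((x >> i) & (y >> j) & 1)

    fa_count = ha_count = stages = 0
    while max(len(col) for col in columns) > 2:
        reduced = [[] for _ in range(len(columns) + 1)]
        for k, col in enumerate(columns):
            bits = list(col)
            while len(bits) >= 3:  # full adder: sum stays, carry moves up
                a, b, c = bits.pop(), bits.pop(), bits.pop()
                reduced[k].append(a ^ b ^ c)
                reduced[k + 1].append((a & b) | (b & c) | (a & c))
                fa_count += 1
            if len(bits) == 2:  # half adder on a leftover pair
                a, b = bits.pop(), bits.pop()
                reduced[k].append(a ^ b)
                reduced[k + 1].append(a & b)
                ha_count += 1
            reduced[k].extend(bits)
        columns = reduced
        stages += 1

    # Final stage: at most two bits per column remain to be added.
    product = sum(bit << k for k, col in enumerate(columns) for bit in col)
    return product, fa_count, ha_count, stages
```

Each compressor preserves the weighted sum of the bits it consumes, which is why the product is unchanged regardless of how the adders are placed.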
Wallace-Tree Multiplier
[Figure: partial-product reduction in four panels (a) to (d), with full adders (FA) and half adders (HA) marked]
[Figure: Wallace-tree multiplier structure with a first stage of HAs, a second stage of FAs, and a final adder producing z7 to z0]
[Figure: tree of full adders (FA) reducing inputs y0 to y5 to one carry (C) and one sum (S) output, with carries Ci-1 entering and Ci leaving each adder]
Shifters
The Binary Shifter
[Figure: binary shifter bit-slice i with control signals Right, nop, and Left, inputs Ai and Ai-1, and outputs Bi and Bi-1]
The Barrel Shifter
[Figure: 4-bit barrel shifter with inputs A0 to A3, outputs B0 to B3, and shift control lines such as Sh1 and Sh3]
Whereas the barrel shifter implements the whole shifter as a single array of pass transistors, the logarithmic shifter uses a staged approach.
The total shift is decomposed into shifts over powers of two.
A shifter with a maximum shift width of M consists of log2 M stages, where the i-th stage either shifts over 2^i or passes the data unchanged.
The figure shows a shifter with a maximum shift value of seven bits.
For instance, to shift over five bits, the first stage is set to shift mode, the second to pass mode, and the last stage again to shift.
The control word for this shifter is already encoded, and no separate decoder is required.
The speed of the logarithmic shifter depends on the shift width in a logarithmic way, since an M-bit shifter requires log2 M stages.
The series connection of pass transistors slows the shifter down for larger shift values.
A careful introduction of intermediate buffers is therefore necessary.
The barrel shifter is appropriate for smaller shifters.
For larger shift values, the logarithmic shifter becomes more effective, in terms of both area and speed.
The logarithmic shifter is easily parameterized, allowing for automatic generation.
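The staged operation can be sketched behaviorally as below. The example assumes a right shift with zeros shifted in, three stages (shifts of 1, 2, and 4), and an 8-bit word. These choices and the function name are illustrative only.

```python
def log_shift_right(value, shift, stages=3, width=8):
    """Shift `value` right by `shift` using stages of 1, 2, 4, ... bits.

    Returns the shifted value and the mode of each stage.
    """
    mask = (1 << width) - 1
    value &= mask
    modes = []
    for i in range(stages):
        # Bit i of the shift amount directly controls stage i,
        # so no separate decoder is needed.
        if (shift >> i) & 1:
            value >>= 1 << i
            modes.append("shift")
        else:
            modes.append("pass")
    return value, modes


shifted, modes = log_shift_right(0b10110000, 5)
```

For a shift of five (binary 101), modes is ["shift", "pass", "shift"], matching the example in the text, and shifted is 0b101.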
0-7 bit Logarithmic Shifter
[Figure: 0-7 bit logarithmic shifter with inputs A0 to A3 and outputs Out0 to Out3]
Speed and Area trade off
There is a trade-off between speed and area in digital arithmetic circuits.
Depending on the product specifications, the dominating factor (area or speed) is determined.
The designer should have a good understanding of all design constraints to make the product a success.
From the analysis of the adder and multiplier circuits, the propagation delay of the ripple-carry adder is proportional to the number of bits N, and its area is small.
Circuit optimization is done to reduce the delay. The delay and area are slightly reduced in the Manchester adder and the bypass adder.
Other adder structures use logic optimization to increase the performance.
The delays of the carry-select adder and the carry-lookahead adder grow with the square root and the logarithm of the number of bits, respectively.
However, the area is increased in the carry-lookahead adder.
The designer should determine the critical path of the circuit, and the optimization can be done for that path.
The area of the circuit is not only determined by the number of transistors. The wiring, the contacts, and the number of vias also have an impact on the size.
The comparison plot drawn for the delay area
Accumulator:
Configuration of Accumulator Cells
When A[i] = 1 is required, Set[i] = 1 and Reset[i] = 0, and hence A[i] = 1 and B[i] = 0. The output is then equal to 1, and Cin is transferred to Cout.
When A[i] = “_” is required (the configuration that drives the CUT inputs), Set[i] = 0 and Reset[i] = 0. The D input of the flip-flop of register B is driven by either 1 or 0, depending on the value that will be added to the accumulator inputs in order to generate satisfactorily random patterns at the inputs of the CUT.