# VLSI Arithmetic

Prof. Vojin G. Oklobdzija
University of California

http://www.ece.ucdavis.edu/acsel

Oklobdzija 2004 Computer Arithmetic 2
Introduction

• Digital computer arithmetic belongs to computer architecture; however, it is also an aspect of logic design.
• The objective of computer arithmetic is to develop algorithms that use the available hardware in the most efficient way.
• Ultimately, speed, power and chip area are the most frequently used measures, which creates a strong link between the algorithms and the implementation technology.
Basic Operations

• Multiplication
• Division

• Evaluation of Functions
• Multi-Media
Full adder, the building block of most arithmetic circuits:

The sum and carry outputs are described as:

$$c_{i+1} = a_i b_i \bar{c}_i + a_i \bar{b}_i c_i + \bar{a}_i b_i c_i + a_i b_i c_i = a_i b_i + a_i c_i + b_i c_i$$

$$s_i = \bar{a}_i \bar{b}_i c_i + \bar{a}_i b_i \bar{c}_i + a_i \bar{b}_i \bar{c}_i + a_i b_i c_i$$

[Figure: full-adder block with inputs a_i, b_i, C_in and outputs s_i, C_out]
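Full-adder truth table (inputs c_i, a_i, b_i; outputs s_i, c_{i+1}):

| c_i | a_i | b_i | s_i | c_{i+1} |
|-----|-----|-----|-----|---------|
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 | 0 |
| 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 |

Rows with a_i ≠ b_i are the Propagate cases; rows with a_i = b_i = 1 are the Generate cases.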
The full adder operation is defined by the equations:

$$s_i = \bar{a}_i \bar{b}_i c_i + \bar{a}_i b_i \bar{c}_i + a_i \bar{b}_i \bar{c}_i + a_i b_i c_i = a_i \oplus b_i \oplus c_i = p_i \oplus c_i$$

$$c_{i+1} = \bar{a}_i b_i c_i + a_i \bar{b}_i c_i + a_i b_i = g_i + p_i c_i$$

implemented as shown, with
Carry-Propagate: $p_i = a_i \oplus b_i$
and Carry-Generate: $g_i = a_i \cdot b_i$

[Figure: full adder built from p_i and g_i, with inputs a_i, b_i, c_in and outputs s_i, c_out]
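The propagate/generate form of the full adder can be checked against ordinary addition. The following is a minimal sketch; the function names `full_adder` and `matches_truth_table` are illustrative and not part of the slides.

```python
from itertools import product


def full_adder(a, b, c):
    """One-bit full adder expressed with propagate and generate signals."""
    p = a ^ b            # carry-propagate: p = a XOR b
    g = a & b            # carry-generate:  g = a AND b
    s = p ^ c            # sum:       s = p XOR c
    c_out = g | (p & c)  # carry-out: c_out = g + p*c
    return s, c_out


def matches_truth_table():
    """Compare the equations with integer addition for all eight input rows."""
    for a, b, c in product((0, 1), repeat=3):
        total = a + b + c
        if full_adder(a, b, c) != (total & 1, total >> 1):
            return False
    return True
```

Calling `matches_truth_table()` exercises every row of the truth table, so a mistake in either equation would show up immediately.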
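$$c_{i+1} = g_i + p_i c_i$$

implemented more efficiently, because a MUX is faster, with

$$p_i = a_i \oplus b_i, \qquad g_i = a_i \cdot b_i$$

[Figure: multiplexer-based full adder with inputs a_i, b_i, c_in and outputs s_i, c_out; the multiplexer has data inputs 0 and 1 and a select input s]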
[Figure: 4-bit ripple-carry adder built from four full-adder (FA) cells, with inputs A_0..A_3, B_0..B_3 and C_{i,0}, sum outputs S_0..S_3, and carries C_{o,0}..C_{o,3}, where C_{o,0} = C_{i,1}]

Worst-case delay is linear in the number of bits:

$$t_d = O(N), \qquad t_{adder} \approx (N-1)\,t_{carry} + t_{sum}$$

Goal: make the fastest possible carry-path circuit.
From Rabaey
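The linear dependence on N follows directly from the way the carry passes from cell to cell. A minimal sketch of a ripple-carry adder and of the delay estimate above is given below; the helper names and the unit delays used in the example are assumptions made for illustration.

```python
def to_bits(x, n):
    """Return the n least significant bits of x, LSB first."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_carry_add(a, b, n, c_in=0):
    """Add two n-bit numbers; the carry ripples through one cell per bit."""
    carry = c_in
    sum_bits = []
    for ai, bi in zip(to_bits(a, n), to_bits(b, n)):
        p, g = ai ^ bi, ai & bi
        sum_bits.append(p ^ carry)
        carry = g | (p & carry)
    return from_bits(sum_bits), carry


def rca_delay(n, t_carry, t_sum):
    """Worst-case delay estimate: (N - 1) * t_carry + t_sum."""
    return (n - 1) * t_carry + t_sum
```

For example, `ripple_carry_add(11, 6, 4)` returns the 4-bit sum 1 with carry-out 1 (11 + 6 = 17), and `rca_delay(32, 1.0, 1.0)` gives 32 unit delays.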
Inversion Property

[Figure: a full adder (FA) with inverted inputs A, B, C_i and inverted outputs S, C_o is equivalent to the plain FA cell]

$$\bar{S}(A, B, C_i) = S(\bar{A}, \bar{B}, \bar{C}_i)$$

$$\bar{C}_o(A, B, C_i) = C_o(\bar{A}, \bar{B}, \bar{C}_i)$$

From Rabaey
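The inversion property can be confirmed by exhaustive enumeration. This is a small sketch with illustrative function names; it only checks the two identities stated above.

```python
from itertools import product


def sum_bit(a, b, c):
    return a ^ b ^ c


def carry_bit(a, b, c):
    return (a & b) | (a & c) | (b & c)


def inversion_property_holds():
    """Inverting all inputs of a full adder inverts both outputs."""
    for a, b, c in product((0, 1), repeat=3):
        na, nb, nc = 1 - a, 1 - b, 1 - c
        if 1 - sum_bit(a, b, c) != sum_bit(na, nb, nc):
            return False
        if 1 - carry_bit(a, b, c) != carry_bit(na, nb, nc):
            return False
    return True
```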
Minimize Critical Path by Reducing Inverting Stages

[Figure: 4-bit adder built from FA' cells, alternating odd and even cells, with inputs A_0..A_3, B_0..B_3, C_{i,0}, sum outputs S_0..S_3 and carries C_{o,0}..C_{o,3}]

Exploit the inversion property.
Note: two different types of cells are needed (odd cell and even cell).
From Rabaey
Carry chain of an RCA implemented using a multiplexer from the standard cell library:

[Figure: three adder stages (bits i, i+1, i+2) with inputs a, b, sum outputs s_i, s_{i+1}, s_{i+2}, and the carry chain c_in, c_i, c_{i+1}, c_out; the critical path is marked]

Oklobdzija, ISCAS'88
Manchester Carry-Chain
Realization of the Carry Path
• Simple and very popular scheme for implementing the carry signal path

[Figure: Manchester carry-chain stage between Carry in and Carry out, with a propagate device, a generate device, and a predischarge & kill device connected to V_dd]

Original Design
T. Kilburn, D. B. G. Edwards, D. Aspinall, "Parallel Addition in Digital Computers: A New Fast 'Carry' Circuit", Proceedings of IEE, Vol. 106, pt. B, p. 464, September 1959.
Manchester Carry Chain (CMOS)
[Figure: Manchester carry chain with propagate signals P_0..P_4, generate signals G_0..G_4, carry input C_{i,0} and supply V_DD]

Kilburn, et al, IEE Proc, 1959.
• Implement P with pass-transistors
• Implement G with a pull-up, kill (delete) with a pull-down
• Use dynamic logic to reduce the complexity and speed up the circuit
Pass-Transistor Realization in DPL
[Figure: DPL pass-transistor full adder. XOR/XNOR, multiplexer and buffer stages produce S and its complement; OR/NOR, AND/NAND, multiplexer and buffer stages produce C_O and its complement; inputs A, B, C are supplied in true and complemented form]
MacSorley, Proc IRE 1/61
Lehman, Burla, IRE Trans on Comp, 12/61
[Figure: two versions of a 4-bit adder block built from FA cells with propagate and generate signals, carry input C_{i,0} and carries C_{o,0}..C_{o,3}; in the second version a multiplexer controlled by BP = P_0 P_1 P_2 P_3 selects the block carry-out]

Idea: If (P0 and P1 and P2 and P3 = 1) then C_{o3} = C_0, else "kill" or "generate".
Bypass
From Rabaey
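The bypass idea can be sketched as follows: when all four propagate signals are 1, the block carry-out equals the block carry-in and the ripple chain is skipped. The function names are illustrative, and the block size of 4 follows the figure.

```python
def ripple_block(p, g, c_in):
    """Carry-out of a block computed by rippling through every bit."""
    carry = c_in
    for pi, gi in zip(p, g):
        carry = gi | (pi & carry)
    return carry


def bypass_block(p, g, c_in):
    """Carry-out of a 4-bit block with a bypass multiplexer."""
    bp = p[0] & p[1] & p[2] & p[3]   # BP = P0 P1 P2 P3
    return c_in if bp else ripple_block(p, g, c_in)


def bypass_matches_ripple():
    """The bypass never changes the logical result, only the delay."""
    for a in range(16):
        for b in range(16):
            p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
            g = [(a >> i) & (b >> i) & 1 for i in range(4)]
            for c_in in (0, 1):
                if bypass_block(p, g, c_in) != ripple_block(p, g, c_in):
                    return False
    return True
```

The check passes because BP = 1 implies that every g_i is 0, so the rippled carry would equal the carry-in anyway.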
N bits, k bits/group, r = N/k groups

[Figure: carry-skip adder divided into r groups of k bits each; every group produces a group propagate signal (P_0 .. P_{r-1}) that lets the incoming carry skip the group through an AND-OR stage; inputs a_0, b_0 .. a_{N-1}, b_{N-1}, sum outputs S_0 .. S_{N-1}, carries C_in and C_out]

Critical path delay: Δ = 2(k-1) + (N/k - 2)

$$t_d = 2(k-1)\,t_{RCA} + \left(\frac{N}{k} - 2\right) t_{SKIP}$$

The carry ripples through the first group, skips the r - 2 middle groups, and ripples through the last group. For N = 32 and k = 4 this gives 2·3 + (8 - 2) = 12Δ.

[Plot: delay t_p versus group size k for a given N, with k = 4..8 marked]
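A short sketch of the carry-skip delay estimate, assuming the critical path ripples through the first and last groups and skips the N/k - 2 groups in between. Unit delays for t_RCA and t_SKIP are assumptions, as are the function names.

```python
def carry_skip_delay(n, k, t_rca=1.0, t_skip=1.0):
    """Delay of an n-bit carry-skip adder with fixed groups of k bits."""
    return 2 * (k - 1) * t_rca + (n / k - 2) * t_skip


def best_group_size(n, t_rca=1.0, t_skip=1.0):
    """Group size (dividing n, at least two groups) with the smallest delay."""
    candidates = [k for k in range(1, n + 1) if n % k == 0 and n // k >= 2]
    return min(candidates, key=lambda k: carry_skip_delay(n, k, t_rca, t_skip))
```

With unit delays, `carry_skip_delay(32, 4)` evaluates to 12.0 and `best_group_size(32)` returns 4; other ratios of t_RCA to t_SKIP shift the optimum.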
(Oklobdzija, Barnes: IBM 1985)
Carry-chain of a 32-bit Variable Block Adder
(Oklobdzija, Barnes: IBM 1985)

[Figure: N-bit adder divided into groups G_0 .. G_m with group propagate signals P_0 .. P_m; inputs a_0, b_0 .. a_{N-1}, b_{N-1}, sum outputs S_0 .. S_{N-1}; the carry signal path from C_in to C_out is shown either rippling within a group or skipping over groups]
Carry-chain of a 32-bit Variable Block Adder
(Oklobdzija, Barnes: IBM 1985)

[Figure: carry chain with blocks labeled 1, 1, 3, 3, 4, 4, 5, 5 and 6 bits (32 bits in total); Δ = 9]

Any-point-to-any-point delay = 9Δ, as compared to 12Δ for the CSKA.
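The comparison of 9Δ against 12Δ can be reproduced with a simple delay model. The sketch below assumes one Δ per bit rippled and one Δ per block skipped, and it assumes that the block sizes from the figure are arranged symmetrically as 1, 3, 4, 5, 6, 5, 4, 3, 1; both the model and the ordering are assumptions, not statements from the slides.

```python
def worst_case_carry_delay(block_sizes):
    """Longest carry path: ripple to the end of block i, skip the blocks
    in between, then ripple into block j (one unit per ripple or skip)."""
    worst = max(size - 1 for size in block_sizes)  # path inside one block
    count = len(block_sizes)
    for i in range(count):
        for j in range(i + 1, count):
            delay = (block_sizes[i] - 1) + (j - i - 1) + (block_sizes[j] - 1)
            worst = max(worst, delay)
    return worst


FIXED_BLOCKS = [4] * 8                           # 32-bit CSKA, k = 4
VARIABLE_BLOCKS = [1, 3, 4, 5, 6, 5, 4, 3, 1]    # assumed VBA ordering
```

Under these assumptions `worst_case_carry_delay(FIXED_BLOCKS)` is 12 and `worst_case_carry_delay(VARIABLE_BLOCKS)` is 9, matching the figures quoted above.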
Carry-chain block size determination for a
(Oklobdzija, Barnes: IBM 1985)

Delay Calculation for Variable Block Adder
(Oklobdzija, Barnes: IBM 1985)

[Figure: 4-bit block with propagate signals P_0..P_3, generate signals G_0..G_3, carry-in C_{i,0}, bypass signal BP and carry-out C_{o,3}]

Delay model:

Variable Group Length
Oklobdzija, Barnes, Arith'85

$$t_d = c_1 + c_2\sqrt{N + c_3}$$
Carry-chain of a 32-bit Variable Block Adder
(Oklobdzija, Barnes: IBM 1985)
Variable Block Lengths
• No closed form solution for delay
• It is a dynamic programming problem
(Oklobdzija, Barnes: IBM 1985)

[Plot: delay (axis from 0 to 16) versus adder size N (axis from 4 to 60) for three adders: VBA, multi-level VBA, and CLA]
VLSI Arithmetic
Lecture 4


Review
Lecture 3
[Plot: delay (axis from 0 to 16) versus adder size N (axis from 4 to 60) for VBA, multi-level VBA, and CLA; the curves are annotated "Square Root Dependency" (VBA) and "Log Dependency" (CLA)]
Circuit Issues
• Adder speed cannot be estimated based on:
– logic gates in the critical path
– number of transistors in the path
– logic levels in the path
• Estimating adder speed is much more complex, and many of the "fast" schemes
Fan-Out Dependency
Fan-In Dependency
This looks like "Logical Effort" (1985)
(Oklobdzija, Barnes: IBM 1985)
(Weinberger and Smith, 1958)
Ref: A. Weinberger and J. L. Smith, “A Logic for High-Speed Addition”,
National Bureau of Standards, Circ. 591, p.3-12, 1958.
ARITH-13: Presenting Achievement Award to Arnold Weinberger of IBM (who
$$c_{i+1} = g_i + p_i c_i$$

$$p_i = a_i \oplus b_i, \qquad g_i = a_i \cdot b_i$$

[Figure: multiplexer-based full adder with inputs a_i, b_i, c_in and outputs s_i, c_out]
[Figure: four adder stages (bits i to i+3), each forming g and p from a and b, with carries C_i, C_{i+1}, C_{i+2}, C_{i+3} and C_{i+4}]

$$c_{i+1} = a_i b_i + \bar{a}_i b_i c_i + a_i \bar{b}_i c_i = g_i + p_i c_i$$

$$c_{i+2} = g_{i+1} + p_{i+1} c_{i+1} = g_{i+1} + p_{i+1}(g_i + p_i c_i) = g_{i+1} + p_{i+1} g_i + p_{i+1} p_i c_i$$

$$c_{i+3} = g_{i+2} + p_{i+2} c_{i+2} = g_{i+2} + p_{i+2}(g_{i+1} + p_{i+1} g_i + p_{i+1} p_i c_i) = g_{i+2} + p_{i+2} g_{i+1} + p_{i+2} p_{i+1} g_i + p_{i+2} p_{i+1} p_i c_i$$

$$c_{i+4} = g_{i+3} + p_{i+3} c_{i+3} = \underbrace{g_{i+3} + p_{i+3} g_{i+2} + p_{i+3} p_{i+2} g_{i+1} + p_{i+3} p_{i+2} p_{i+1} g_i}_{G_j} + \underbrace{p_{i+3} p_{i+2} p_{i+1} p_i}_{P_j}\, c_i$$
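The expanded carry equations compute all four carries directly from g, p and the incoming carry. The sketch below evaluates them and compares the result with a rippled carry chain; the function names are illustrative.

```python
def lookahead_carries(g, p, c0):
    """Carries c1..c4 of a 4-bit block from the expanded equations."""
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
          | (p[2] & p[1] & p[0] & c0))
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0])
          | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]


def ripple_carries(g, p, c0):
    """The same carries computed one after another."""
    carries, carry = [], c0
    for gi, pi in zip(g, p):
        carry = gi | (pi & carry)
        carries.append(carry)
    return carries


def lookahead_matches_ripple():
    for a in range(16):
        for b in range(16):
            g = [(a >> i) & (b >> i) & 1 for i in range(4)]
            p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
            for c0 in (0, 1):
                if lookahead_carries(g, p, c0) != ripple_carries(g, p, c0):
                    return False
    return True
```

Each lookahead carry is a two-level AND-OR expression, which is why the carry delay no longer grows with the bit position inside the block.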
$$G_j = g_{i+3} + p_{i+3} g_{i+2} + p_{i+3} p_{i+2} g_{i+1} + p_{i+3} p_{i+2} p_{i+1} g_i$$

$$P_j = p_{i+3} p_{i+2} p_{i+1} p_i$$

$$c_{4(j+1)} = G_j + P_j c_j$$

One gate delay Δ to calculate p, g.
One Δ to calculate P and two for G.
Three gate delays to calculate C_{4(j+1)}.
Compare that to 8Δ in the RCA!

[Figure: P, G group. Four bit cells (inputs a_i, b_i .. a_{i+3}, b_{i+3}) produce g and p for a group block that takes C_j (C_in) and outputs G_j, P_j and the carries C_{4j+1}, C_{4j+2}, C_{4j+3}, C_{4(j+1)}]
(Weinberger and Smith)

$$G^*_j = G_{i+3} + P_{i+3} G_{i+2} + P_{i+3} P_{i+2} G_{i+1} + P_{i+3} P_{i+2} P_{i+1} G_i$$

$$P^*_j = P_{i+3} P_{i+2} P_{i+1} P_i$$

$$c_{4(j+1)} = G^*_k + P^*_k c_{4j}$$

[Figure: second-level block combining the group signals G_j, P_j .. G_{j+3}, P_{j+3} and the carry C_{4j} into G*, P* and the carries C_{4j+1}, C_{4j+2}, C_{4j+3}, C_{4(j+1)}]

C_16 will take a total of 5Δ vs. 32Δ for the RCA!
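The two-level grouping can be illustrated by computing C_16 of a 16-bit addition from four group (G, P) pairs that are combined once more into (G*, P*). This is a sketch under the assumption of 4-bit groups at both levels; the function names are hypothetical.

```python
def group_gp(g, p):
    """Combine four (g, p) pairs, LSB first, into one group (G, P) pair."""
    big_g = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
             | (p[3] & p[2] & p[1] & g[0]))
    big_p = p[3] & p[2] & p[1] & p[0]
    return big_g, big_p


def carry_16(a, b, c_in):
    """C16 of a 16-bit addition using two levels of 4-bit lookahead."""
    g = [(a >> i) & (b >> i) & 1 for i in range(16)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(16)]
    # First level: one (G, P) pair per 4-bit group.
    groups = [group_gp(g[4 * j:4 * j + 4], p[4 * j:4 * j + 4])
              for j in range(4)]
    # Second level: the same equations applied to the group signals.
    g_star, p_star = group_gp([gg for gg, _ in groups],
                              [pp for _, pp in groups])
    return g_star | (p_star & c_in)
```

The same `group_gp` function serves both levels because the group equations have the same form as the bit-level ones.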
[Figure: 32-bit carry-lookahead adder with carries C_in, C_4, C_8, C_12, C_16, C_20, C_24, C_28 and C_out, annotated as follows]
• Bit cells (inputs a_i, b_i) generating: g_i, p_i, and sum S_i
• 4-bit blocks generating: G_i, P_i, and C_in
• 4-bit blocks generating: G*_i, P*_i, and C_in for the 4-bit blocks
• Group producing final carry C_out and C_16

Critical path delay = 1Δ (for g_i, p_i) + 2×2Δ (for G, P) + 3×2Δ (for C_in) + 1 XOR-Δ (for Sum) = approx. 12Δ of delay

(Weinberger and Smith: original derivation, 1958)
Motorola: CLA
Implementation Example
A. Naini, D. Bearden and W. Anderson, “A 4.5nS 96b CMOS
Proceedings of the IEEE Custom Integrated Circuits
Conference, May 3-6, 1992.
Critical path in Motorola's 64-bit CLA
Critical path: A, B - G_0 - G_{3:0} - G_{15:0} - G_{47:0} - C_48 - C_60 - C_63 - S_63

[Figure: 64-bit CLA built from PG blocks and carry blocks. The blocks form group signals such as G_{3:0}, P_{3:0}, G_{7:4}, P_{7:4}, G_{15:0}, P_{15:0}, G_{31:0}, P_{31:0}, G_{47:0}, P_{47:0}, G_{63:48}, P_{63:48} and G_{63:0}, P_{63:0}, and the carries C_0, C_4, C_8, C_12, C_16, C_32, C_48, C_52, C_56, C_60 .. C_64. Delays annotated along the critical path: 1.05 ns, 1.7 ns, 2.0 ns, 2.35 ns, 2.7 ns, 3.75 ns, 4.8 ns]
Motorola's 64-bit CLA
Conventional PG block:
• carry ripples locally
• 5 transistors in the path
• no better situation here!

Basically, this is MCC performance with carry-skip. One should not expect any better results than with the VBA.

Motorola's 64-bit CLA
Modified PG block:
Intermediate propagate signals P_{i:0} are generated to speed up C_3; the critical path still resembles the MCC.
Motorola's 64-bit CLA

[Figure: the critical path A, B - G_0 - G_{3:0} - G_{15:0} - G_{47:0} - C_48 - C_60 - C_63 - S_63 shown again with two sets of delay annotations: 1.05, 1.7, 2.0, 2.35, 2.7, 3.75, 4.8 ns and 1.8, 2.2, 2.9, 3.2, 3.55, 3.9 ns]
Delay Optimized CLA
B. Lee, V. G. Oklobdzija
Journal of VLSI Signal Processing, Vol.3, No.4, October 1991
Delay Optimized CLA: Lee-Oklobdzija '91
(a.) Fixed groups and levels
(b.) Variable-sized groups, fixed levels
(c.) Variable-sized groups and fixed levels
(d.) Variable-sized groups and levels

Two-Levels of Logic Implementation of the Carry Block

Three-Levels of Logic Implementation of the Carry Block (restricted fan-in)

Delay Optimized CLA: Lee-Oklobdzija '91
Delay: Two-level BCLA
Delay: Three-level BCLA
(a.) 2-level BCLA Δ = 8.5 ns (b.) 3-level BCLA Δ = 8.9 ns

IBM Journal of Research and Development, Vol. 25, No. 3, 1981.

Used in: IBM 3033, IBM 168, Amdahl V6, HP etc.
Ling's Derivations

| a_i | b_i | p_i | g_i | t_i |
|-----|-----|-----|-----|-----|
| 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 1 |

[Figure: adder cell with inputs a_i, b_i, c_i and outputs s_i, c_{i+1}]

With $g_i = a_i b_i$ and $C_{i+1} = g_i + p_i \cdot C_i$, define:

$$H_{i+1} = C_{i+1} + C_i$$

g_i implies C_{i+1}, which implies H_{i+1}; thus $g_i = g_i H_{i+1}$.

$$p_i H_{i+1} = p_i C_{i+1} + p_i C_i = p_i g_i + p_i p_i C_i + p_i C_i = p_i C_i$$

$$p_i C_i = p_i H_{i+1}$$

$$C_{i+1} = g_i + p_i \cdot C_i = g_i H_{i+1} + p_i \cdot C_i = g_i H_{i+1} + p_i \cdot H_{i+1} = t_i \cdot H_{i+1}$$

$$C_{i+1} = t_i \cdot H_{i+1}$$

Ling's Derivations

From:

$$H_{i+1} = C_{i+1} + C_i$$

and

$$C_{i+1} = g_i + p_i \cdot C_i$$

$$H_{i+1} = C_{i+1} + C_i = g_i + p_i C_i + C_i = g_i + C_i$$

$$H_{i+1} = g_i + t_{i-1} H_i \qquad \text{(fundamental expansion)}$$

because:

$$C_{i+1} = t_i \cdot H_{i+1}$$

Now we need to derive the Sum equation.
Variation of CLA:
Ling, IBM J. Res. Dev, 5/81

$$C_{i+1} = g_i + p_i \cdot C_i$$

$$p_i = a_i \oplus b_i, \qquad S_i = p_i \oplus C_i, \qquad g_i = a_i \cdot b_i$$

Ling's equations:

$$H_{i+1} = g_i + t_{i-1} \cdot H_i$$

$$S_i = (t_i \oplus H_{i+1}) + g_i\, t_{i-1} H_i$$

$$t_i = a_i + b_i, \qquad g_i = a_i \cdot b_i$$
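Ling's recurrence and sum equation can be checked against ordinary addition. The sketch below assumes the boundary values t_{-1} = 1 and H_0 = C_in, so that t_{-1} H_0 equals the carry-in; these boundary values and the function name are assumptions made for the example.

```python
def ling_add(a, b, n, c_in=0):
    """n-bit addition using H_{i+1} = g_i + t_{i-1} H_i,
    S_i = (t_i XOR H_{i+1}) + g_i t_{i-1} H_i and C_{i+1} = t_i H_{i+1}."""
    h_prev, t_prev = c_in, 1   # assumed boundary: t_{-1} = 1, H_0 = C_in
    total, carry = 0, c_in
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g, t = ai & bi, ai | bi
        h_next = g | (t_prev & h_prev)                     # H_{i+1}
        s = (t ^ h_next) | (g & t_prev & h_prev)           # S_i
        carry = t & h_next                                 # C_{i+1}
        total |= s << i
        h_prev, t_prev = h_next, t
    return total, carry
```

For instance, `ling_add(11, 6, 4)` yields the 4-bit sum 1 with a carry-out of 1, the same result as conventional addition of 11 and 6.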
Variation of CLA:

$$C_{i+1} = g_i + g_i C_i + p_i \cdot C_i = g_i + (g_i + p_i) \cdot C_i$$

$$C_{i+1} = g_i + t_i \cdot C_i$$

Ling's equation:

$$H_{i+1} = g_i + t_{i-1} \cdot H_i$$

See: Doran, IEEE Trans on Comp. Vol 37, No.9 Sept. 1988.
Ling uses a different transfer function. Four of those functions have the desired properties (Ling's is one of them).

Conventional (fan-in of 5):

$$C_4 = g_3 + t_3 g_2 + t_3 t_2 g_1 + t_3 t_2 t_1 g_0 + t_3 t_2 t_1 t_0 C_{in}$$

Ling (fan-in of 4):

$$H_4 = g_3 + t_2 g_2 + t_2 t_1 g_1 + t_2 t_1 t_0 g_0 + t_2 t_1 t_0 t_{-1} C_{in}$$

$$H_4 = g_3 + g_2 + t_2 g_1 + t_2 t_1 g_0 + t_2 t_1 t_0 C_{in}$$
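The relation between the conventional C_4 and Ling's H_4 can be verified by enumeration: C_4 = t_3 H_4, and C_4 equals the real carry-out of a 4-bit addition. A minimal sketch with illustrative function names:

```python
def c4_conventional(g, t, c_in):
    """C4 from the conventional expansion (largest product has 5 inputs)."""
    return (g[3] | (t[3] & g[2]) | (t[3] & t[2] & g[1])
            | (t[3] & t[2] & t[1] & g[0])
            | (t[3] & t[2] & t[1] & t[0] & c_in))


def h4_ling(g, t, c_in):
    """H4 from Ling's expansion (largest product has 4 inputs)."""
    return (g[3] | g[2] | (t[2] & g[1]) | (t[2] & t[1] & g[0])
            | (t[2] & t[1] & t[0] & c_in))


def ling_matches_conventional():
    for a in range(16):
        for b in range(16):
            g = [(a >> i) & (b >> i) & 1 for i in range(4)]
            t = [((a >> i) | (b >> i)) & 1 for i in range(4)]
            for c_in in (0, 1):
                c4 = c4_conventional(g, t, c_in)
                if c4 != (t[3] & h4_ling(g, t, c_in)):
                    return False
                if c4 != (a + b + c_in) >> 4:
                    return False
    return True
```

The final AND with t_3 is deferred to the sum stage, which is what shortens the carry expression by one input.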

• H_16 contains 8 terms, as compared to G_16, which contains 15.
• H_16 can be implemented with one level of logic (in ECL), while G_16 cannot.

OR, of special importance when ECL technology is used)

VLSI Arithmetic
Lecture 5


Review
Lecture 4
IBM Journal of Research and Development, Vol. 25, No. 3, 1981.

Used in: IBM 3033, IBM S370/168, Amdahl V6, HP etc.
• H_16 contains 8 terms, as compared to G_16, which contains 15.
• H_16 can be implemented with one level of logic (in ECL), while G_16 cannot (with 8-way wire-OR).

special importance when ECL technology is used - his IBM limitation was fan-in of 4 and wire-OR of 8)

Ling: Weinberger Notes
• 32-bit adder used in: IBM 3033, IBM S370/Model 168, Amdahl V6.
• Implements 32-bit addition in 3 levels of logic
• Implements 32-bit AGEN (B + Index + Disp) in 4 levels of logic (rather than 6)
• 5 levels of logic for the 64-bit adder used in an HP processor
Implementation of Ling's

(S. Naffziger, "A Subnanosecond 64-b Adder", ISSCC '96)

$$H_4 = g_3 + g_2 + t_2 g_1 + t_2 t_1 g_0$$

$$C_{i+1} = t_i \cdot H_{i+1}$$

$$C_{16} = p_{15} H_{16} = p_{15}\,(g_{15} + g_{11} + t_{11} g_7 + t_{11} t_7 g_0)$$

[Figures: circuit diagrams from S. Naffziger, ISSCC'96]
[Figure: clocked (CK) circuit schematic with inputs A0..A3 and B0..B3, generate and propagate signals G0..G4, P1, P2, P4, long-carry signals LCH and LCL, conditional carries C0H, C0L, C1H, C1L, and outputs SumH and SumL]

LCS4 – Critical G Path

[Figure: critical generate path of the 64-bit adder, passing through G_3 (4b), G_4 with P_4 (12b, 16b), C_15 and C_47 (32b) to the sum bits S_48 .. S_63]
LCS4 – Logical Effort Delay
Prefix-4 Ling/Conditional-Sum (Dynamic - Long Carry Path)

| Stage | Branch | LE | Parasitic |
|-------|--------|----|-----------|
| dg3# (dg3) | 4.0 | 0.98 | 2.97 |
| g4 (NAND2) | 2.0 | 1.11 | 1.84 |
| C15# (GG4) | 1.0 | 1.01 | 1.80 |
| C15 (INV) | 1.0 | 1.00 | 1.00 |
| C47# (LC) | 3.0 | 1.03 | 3.32 |
| C47 (INV) | 1.0 | 1.00 | 1.00 |
| C47#b (INV) | 1.0 | 1.00 | 1.00 |
| C47b (INV) | 1.0 | 1.00 | 1.00 |
| S63# (SUM) | 16.0 | 0.86 | 1.36 |
| S63 (INV) | 1.0 | 1.00 | 1.00 |

| Total Branch | Total LE | Path Effort | f_o,opt | Effort Delay (ps) | Parasitic Delay (ps) | Total Delay (ps) | Total Delay (FO4) |
|---|---|---|---|---|---|---|---|
| 3.84E+02 | 9.73E-01 | 3.74E+02 | 1.81 | 70 | 66 | 136 | 7.2 |
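The summary values can be recomputed from the per-stage entries. The sketch below assumes that the path effort is the product of the total branching and the total logical effort, and that the optimal stage effort is the N-th root of the path effort for N stages; the variable and function names are illustrative.

```python
import math

# (stage, branch, logical effort, parasitic) taken from the stage list
STAGES = [
    ("dg3# (dg3)", 4.0, 0.98, 2.97),
    ("g4 (NAND2)", 2.0, 1.11, 1.84),
    ("C15# (GG4)", 1.0, 1.01, 1.80),
    ("C15 (INV)", 1.0, 1.00, 1.00),
    ("C47# (LC)", 3.0, 1.03, 3.32),
    ("C47 (INV)", 1.0, 1.00, 1.00),
    ("C47#b (INV)", 1.0, 1.00, 1.00),
    ("C47b (INV)", 1.0, 1.00, 1.00),
    ("S63# (SUM)", 16.0, 0.86, 1.36),
    ("S63 (INV)", 1.0, 1.00, 1.00),
]


def path_summary(stages):
    """Return (total branch, total LE, path effort, optimal stage effort)."""
    total_branch = math.prod(branch for _, branch, _, _ in stages)
    total_le = math.prod(le for _, _, le, _ in stages)
    path_effort = total_branch * total_le
    stage_effort = path_effort ** (1.0 / len(stages))
    return total_branch, total_le, path_effort, stage_effort
```

`path_summary(STAGES)` gives a total branching of 384, a total logical effort of about 0.973, a path effort of about 374 and an optimal stage effort of about 1.81, in line with the summary values.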
Results:

• 0.5 µm technology
• Speed: 0.930 ns
• Nominal process, 80 °C, V = 3.3 V

See: S. Naffziger, "A Subnanosecond 64-b Adder", ISSCC '96
from: Ercegovac-Lang

The following recurrence operation is defined:

$$(g, p) \circ (g', p') = (g + p g',\; p p')$$

such that:

$$(G_i, P_i) = \begin{cases} (g_0, p_0) & i = 0 \\ (g_i, p_i) \circ (G_{i-1}, P_{i-1}) & 1 \le i \le n \end{cases}$$

$$c_{i+1} = G_i \quad \text{for } i = 0, 1, \ldots, n$$

With a carry-in:

$$c_1 = g_0 + p_0 c_{in}, \qquad (g_{-1}, p_{-1}) = (c_{in}, c_{in})$$

This operation is associative, but not commutative.
It can also span a range of bits (overlapping and adjacent).
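The recurrence operator is small enough to test exhaustively. The following sketch implements it on (g, p) tuples, checks associativity and idempotency over all input combinations, and derives the carries serially for c_in = 0; the function names are illustrative.

```python
from itertools import product


def prefix_op(left, right):
    """(g, p) o (g', p') = (g + p g', p p')."""
    g, p = left
    g2, p2 = right
    return (g | (p & g2), p & p2)


def prefix_carries(a, b, n):
    """Carries c_1..c_n for c_in = 0, with c_{i+1} = G_i."""
    gp = [((a >> i) & (b >> i) & 1, ((a >> i) ^ (b >> i)) & 1)
          for i in range(n)]
    acc = gp[0]
    carries = [acc[0]]
    for i in range(1, n):
        acc = prefix_op(gp[i], acc)   # (G_i, P_i) = (g_i, p_i) o (G_{i-1}, P_{i-1})
        carries.append(acc[0])
    return carries


PAIRS = list(product((0, 1), repeat=2))

IS_ASSOCIATIVE = all(
    prefix_op(x, prefix_op(y, z)) == prefix_op(prefix_op(x, y), z)
    for x in PAIRS for y in PAIRS for z in PAIRS)

IS_IDEMPOTENT = all(prefix_op(x, x) == x for x in PAIRS)
```

The operator is not commutative: `prefix_op((1, 0), (0, 0))` is `(1, 0)`, while `prefix_op((0, 0), (1, 0))` is `(0, 0)`.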
from: Ercegovac-Lang
Parallel Prefix Adders: variety of possibilities
from: Ercegovac-Lang
M. Lehman, "A Comparative Study of Propagation Speed-up Circuits in Binary Arithmetic Units", IFIP Congress, Munich, Germany, 1962.
Parallel Prefix Adders: S. Knowles 1999
operation is associative: h > i ≥ j ≥ k
operation is idempotent: h > i ≥ j ≥ k
produces carry: c_in = 0

Exploits associativity, but not idempotency.
Produces minimal logical depth.

Two wires at each level. Uniform, fan-in of two.
combined with the long wires (in the last stages)
(16,8,4,2,1)

Exploits idempotency to limit the fan-out to 1. Dramatic increase in wires. The wire span remains the same as

Buffers needed in both cases: K-S, L-F
• Set the fan-out to one
• Avoids explosion of wires (as in K-S)
• Makes no sense in CMOS:
– fan-out = 1 limit is arbitrary and extreme
– much of the capacitive load is due to wire
(anyway)
• It is more efficient to insert buffers in L-F
than to use B-K scheme
• Is a hybrid synthesis of L-F and K-S
• Trades increase in logic depth for a
reduction in fan-out:
– effectively a higher-radix variant of K-S.
– others do it similarly by serializing the prefix
computation at the higher fan-out nodes.
• Others similarly trade logical depth for a reduction of fan-out and wire.
Parallel Prefix Adders: variety of possibilities
from: Knowles
bounded by L-F and K-S at ends

Parallel Prefix Adders: variety of possibilities
Knowles 1999
The following rules are used:

• Lateral wires at the j-th level span 2^j bits
• Lateral fan-out at the j-th level is a power of 2 up to 2^j
• Lateral fan-out at the j-th level cannot exceed that at the (j+1)-th level.
Parallel Prefix Adders: variety of possibilities
Knowles 1999
• The number of minimal-depth graphs of this type is given in:

• At 4 bits there are only K-S and L-F; beyond that there are several new possibilities.
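The rules on lateral fan-out can be turned into a small enumeration. The sketch below assumes that a topology is described by its lateral fan-out per level, with level j = 0 listed first, so the adder written [4,4,2,2,1] becomes `[1, 2, 2, 4, 4]`; the function names and this list orientation are assumptions.

```python
from itertools import product


def is_valid_fanout(fanouts):
    """Check the fan-out rules; fanouts[j] is the lateral fan-out at level j."""
    for j, f in enumerate(fanouts):
        is_power_of_two = f >= 1 and (f & (f - 1)) == 0
        if not is_power_of_two or f > 2 ** j:
            return False
    # Fan-out at level j cannot exceed that at level j + 1.
    return all(fanouts[j] <= fanouts[j + 1] for j in range(len(fanouts) - 1))


def count_topologies(levels):
    """Number of fan-out vectors that satisfy the rules for 2**levels bits."""
    choices = [[2 ** e for e in range(j + 1)] for j in range(levels)]
    return sum(1 for combo in product(*choices) if is_valid_fanout(list(combo)))
```

For 4 bits (two levels) the count is 2, corresponding to K-S `[1, 1]` and L-F `[1, 2]`; for 8 bits (three levels) the same rules allow 5 vectors.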
Parallel Prefix Adders: variety of possibilities
Example of a new 32-bit adder [4,4,2,2,1]
Knowles 1999
Parallel Prefix Adders: variety of possibilities
Knowles 1999
• Delay is given in terms of FO4 inverter delay, worst case (the nominal case is 40-50% faster)
• K-S is the fastest
• K-S adders are wire limited (requiring 80% more area)
• The difference between the examined schemes is less than 15%

Parallel Prefix Adders: variety of possibilities
Knowles 1999
Conclusion
• Irregular, hybrid schemes are possible
• The speed-up of 15% is achieved at the cost of large wiring, hence area and power
• Circuits close in speed to K-S are available at significantly lower wiring cost
VLSI Arithmetic
Lecture 6


Review
Lecture 5
[Figure: 16-bit Kogge-Stone and Han-Carlson prefix trees, each producing the carries C_1 .. C_15 and C_out, with group signals (G_1, P_1) .. (G_4, P_4)]

Kogge-Stone:
• log(bits) carry stages
• Extra wiring

Han-Carlson:
• log(bits) + 1 carry stages
• Reduced wiring and gates
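The log(bits) stage count of the Kogge-Stone arrangement can be seen in a behavioural sketch: at each stage every bit position combines its (g, p) pair with the one 1, 2, 4, ... positions below it. The code models only the logic with c_in = 0, not wiring or fan-out, and the function names are illustrative.

```python
def prefix_op(left, right):
    """(g, p) o (g', p') = (g + p g', p p')."""
    g, p = left
    g2, p2 = right
    return (g | (p & g2), p & p2)


def kogge_stone_carries(a, b, n):
    """Return (carries c_1..c_n, number of prefix stages) for c_in = 0."""
    gp = [((a >> i) & (b >> i) & 1, ((a >> i) ^ (b >> i)) & 1)
          for i in range(n)]
    distance, stages = 1, 0
    while distance < n:
        # Every position i >= distance combines with position i - distance.
        gp = [prefix_op(gp[i], gp[i - distance]) if i >= distance else gp[i]
              for i in range(n)]
        distance *= 2
        stages += 1
    return [g for g, _ in gp], stages
```

For 16 bits the loop runs 4 times, and for 4 bits it runs twice, matching log2 of the word length.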
Possibilities for Further Research
• The logical depth is important (Knowles was right)
• The fan-out is less important than fan-in (Knowles was wrong):
– It is possible to examine a variety of topologies with restricted and varied fan-in.
• Driving strength and Logical Effort rules were overlooked, or at least neglected:
– It is possible to create a number of topologies taking LE rules into account.
– It is further possible to combine the rules with
of two different rules governing "dynamic" and "static".
• It is still possible to produce a better adder!
Logic", IRE Transactions on Electronic Computers, EC-9, p.226-231, 1960.
from: Ercegovac-Lang
Conditional
from: Ercegovac-Lang
O. J. Bedrij, "Carry-Select Adder", IRE Transactions on Electronic Computers, June 1962, p.340-34
from: Ercegovac-Lang
C_in = 0 and C_in = 1.