
Advanced Computer Systems Architecture
Course Teacher: Dr.-Ing. Shehzad Hasan
CIS, NED University

Lecture # 4

Fall Semester 2015 CS-506 ACSA 1


Recap Lecture – 3
• Computer Arithmetic
– Adders
• Carry Save Adder
– Multipliers
• Right and Left Shift Algorithms
• Signed Multiplication (2’s Complement)
– Positive Multiplier
– Negative Multiplier
• Booth’s Recoding for Radix-2
• Higher Radix Multiplication



Possible Designs Radix-4 Multiplication
The multiple generation part of a radix-4 multiplier based on replacing 3a with 4a (carry into the next higher radix-4 multiplier digit) and –a.

[Figure: the multiplier register is shifted 2 bits per cycle. Bits x_{i+1} x_i and a carry flip-flop c drive a 4-input mux that selects 0, a, 2a, or –a (select codes 00, 01, 10, 11) and sends it to the adder.]

Digit handled in each cycle: (x_{i+1} x_i + c) mod 4
Mux control: (x_{i+1} ⊕ x_i c, x_i ⊕ c)
Carry FF: set if x_{i+1} = x_i = 1 or if x_{i+1} = c = 1

x_{i+1}  x_i  c     Mux control    Set carry
-------  ---  ---   -----------    ---------
   0      0    0       0 0             0
   0      0    1       0 1             0
   0      1    0       0 1             0
   0      1    1       1 0             0
   1      0    0       1 0             0
   1      0    1       1 1             1
   1      1    0       1 1             1
   1      1    1       0 0             1

An extra cycle may be needed at the end because of the carry.
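The table can be checked with a short Python sketch. It is an illustration only, not part of the original slides: the names `MULTIPLE`, `recode_digit`, and `radix4_multiply` are hypothetical, the multiplier is assumed unsigned with an even bit count k, and the mux-control and carry equations are the ones read off the truth table.

```python
# Mux select code -> multiple of a, as in the figure: 00 -> 0, 01 -> a, 10 -> 2a, 11 -> -a
MULTIPLE = {0b00: 0, 0b01: 1, 0b10: 2, 0b11: -1}


def recode_digit(x_hi, x_lo, c):
    """Return (mux_control, carry_out) for multiplier bits x_{i+1}, x_i and carry c."""
    mux = ((x_hi ^ (x_lo & c)) << 1) | (x_lo ^ c)
    carry = x_hi & (x_lo | c)
    return mux, carry


def radix4_multiply(a, x, k):
    """Multiply a by the unsigned k-bit multiplier x (k even), two bits per cycle."""
    p, c = 0, 0
    for i in range(0, k, 2):
        x_lo = (x >> i) & 1
        x_hi = (x >> (i + 1)) & 1
        mux, c = recode_digit(x_hi, x_lo, c)
        p += (MULTIPLE[mux] * a) << i
    # Extra cycle: a carry left over from the top digit contributes a * 2^k.
    p += (c * a) << k
    return p


print(radix4_multiply(6, 0b1011, 4))  # 66
```

Every digit value 3 is handled as 4 – 1, so only the multiples 0, a, 2a, and –a are ever needed; the final `p += (c * a) << k` line plays the role of the extra cycle.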



Modified Booth’s Recoding
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Context                Recoded radix-2 digits   Radix-4 digit
x_{i+1} x_i x_{i–1}    y_{i+1}  y_i             z_{i/2}         Explanation
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
   0     0     0          0      0                 0            No string of 1s in sight
   0     0     1          0      1                 1            End of string of 1s in x
   0     1     0          1     -1                 1            Isolated 1 in x
   0     1     1          1      0                 2            End of string of 1s in x
   1     0     0         -1      0                -2            Beginning of string of 1s in x
   1     0     1         -1      1                -1            End a string, begin new string
   1     1     0          0     -1                -1            Beginning of string of 1s in x
   1     1     1          0      0                 0            Continuation of string of 1s
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––

Example
     1  0  0  1  1  1  0  1  1  0  1  0  1  1  1  0    Operand x
(1) -1  0  1  0  0 -1  1  0 -1  1 -1  1  0  0 -1  0    Recoded version y
(1)    -2     2    -1     2    -1    -1     0    -2    Radix-4 version z
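A minimal Python sketch of the recoding, for checking the example. The function name `booth_radix4_digits` and the `signed` flag are hypothetical and not from the slides, and the parenthesized leading (1) is assumed to be the extra digit needed when x is read as unsigned. Each digit is computed as z_{i/2} = –2 x_{i+1} + x_i + x_{i–1}, which reproduces every row of the table.

```python
def booth_radix4_digits(x, k, signed=True):
    """Radix-4 Booth digits of the k-bit pattern x (k even), least significant first."""
    digits = []
    prev = 0  # x_{-1} = 0
    for i in range(0, k, 2):
        x_lo = (x >> i) & 1
        x_hi = (x >> (i + 1)) & 1
        digits.append(-2 * x_hi + x_lo + prev)
        prev = x_hi
    if not signed:
        # Unsigned operand: one more digit, equal to the top bit x_{k-1}.
        digits.append(prev)
    return digits


x = 0b1001110110101110  # operand x from the example
z = booth_radix4_digits(x, 16, signed=False)
print(z[::-1])  # [1, -2, 2, -1, 2, -1, -1, 0, -2]
```

Summing `d * 4**i` over the digits gives back x in the unsigned case, and x – 2^16 when the same pattern is read as a 2's-complement number without the extra digit.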



Multiplication via Modified Booth’s Recoding
================================
a            0 1 1 0
x            1 0 1 0
z            -1  -2          Radix-4
================================
p(0)         0 0 0 0 0 0
+z_0 a       1 1 0 1 0 0
–––––––––––––––––––––––––––––––––
4p(1)        1 1 0 1 0 0
p(1)         1 1 1 1 0 1 0 0
+z_1 a       1 1 1 0 1 0
–––––––––––––––––––––––––––––––––
4p(2)        1 1 0 1 1 1 0 0
p(2)         1 1 0 1 1 1 0 0
================================

––––––––––––––––––––––––––––––
x_{i+1}  x_i  x_{i–1}   z_{i/2}
––––––––––––––––––––––––––––––
   0      0      0         0
   0      0      1         1
   0      1      0         1
   0      1      1         2
   1      0      0        -2
   1      0      1        -1
   1      1      0        -1
   1      1      1         0
––––––––––––––––––––––––––––––
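The example can be reproduced with a table-driven Python sketch. The names `Z_TABLE` and `booth_multiply` are hypothetical; the multiplier is assumed to be a k-bit 2's-complement pattern with k even, and Python's unbounded integers stand in for the sign-extended partial product register.

```python
# Three-bit context x_{i+1} x_i x_{i-1} -> radix-4 digit z_{i/2}, as in the recoding table.
Z_TABLE = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
           0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}


def booth_multiply(a, x, k):
    """Multiply signed a by the k-bit two's-complement multiplier pattern x (k even)."""
    p = 0
    x_ext = x << 1  # append x_{-1} = 0 below the least significant bit
    for i in range(0, k, 2):
        context = (x_ext >> i) & 0b111  # bits x_{i+1} x_i x_{i-1}
        p += (Z_TABLE[context] * a) << i
    return p


print(booth_multiply(6, 0b1010, 4))  # -36
```

With a = 0110 (6) and x = 1010 (–6), the two contexts are 100 and 101, giving the digits z_0 = –2 and z_1 = –1 shown above, and the product –36 is the 8-bit pattern 1101 1100 from the last line of the example.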



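Hardware Implementation of Radix-4 Booth’s Recoding
[Figure: the multiplier register (with an extra bit labeled "Init. 0") is shifted 2 bits per cycle. Bits x_{i+1}, x_i, x_{i–1} feed the recoding logic, which produces the signals neg, two, and non0. A mux with Enable and Select inputs chooses among 0, a, or 2a, formed from the k-bit multiplicand and the sign of a, and outputs k+1 bits. The add/subtract control and the selected multiple deliver z_{i/2} a to the adder input.]

---- Encoding ----
Digit   neg   two   non0
 –2      1     1     1
 –1      1     0     1
  0      0     0     0
  1      0     0     1
  2      0     1     1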



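Multiple Designs
[Figure: three multiplier organizations side by side.
 Basic binary: next multiple → adder → partial product.
 High-radix or partial tree: several multiples → small CSA tree → adder → partial product.
 Full tree: all multiples → full CSA tree → adder.
 Arrows labeled "Speed up" and "Economize" relate the designs.]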



Wallace and Dadda Trees
• Wallace – Combine partial products as soon as possible
• Fastest possible design since typically smaller CPA at end
• Dadda – Maintain critical path length (tree depth) but combine as late as possible
• Simpler tree but wider CPA at end

–––––––––––––––––––––––
 h   n(h)      h   n(h)
–––––––––––––––––––––––
 0     2       7    28
 1     3       8    42
 2     4       9    63
 3     6      10    94
 4     9      11   141
 5    13      12   211
 6    19      13   316
–––––––––––––––––––––––
n(h): Maximum number of inputs for h levels
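The n(h) column can be regenerated with a few lines of Python. This sketch assumes the recurrence n(0) = 2, n(h) = ⌊3·n(h–1)/2⌋, which the slide does not state but which reproduces every entry of the table; the function names `max_inputs` and `levels_needed` are hypothetical.

```python
def max_inputs(h):
    """Maximum number of operands an h-level CSA tree can reduce to two."""
    n = 2
    for _ in range(h):
        n = (3 * n) // 2  # each level turns groups of 3 rows into 2
    return n


def levels_needed(operands):
    """Smallest h such that n(h) >= operands."""
    h = 0
    while max_inputs(h) < operands:
        h += 1
    return h


print([max_inputs(h) for h in range(14)])
print(levels_needed(7))  # 4
```

For seven operands the answer is four levels, since n(3) = 6 is too small and n(4) = 9 is enough.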



Wallace Tree
Addition of seven 6-bit numbers in dot notation. Reduce the number of operands at the earliest possible opportunity.

[Figure: dot-notation reduction, top to bottom.]
Level 1: 12 FAs
Level 2: 6 FAs
Level 3: 6 FAs
Level 4: 4 FAs + 1 HA
Final: 7-bit adder

Total cost = 7-bit adder + 28 FAs + 1 HA



Dadda Tree
Addition of seven 6-bit numbers in dot notation. Postpone the reduction to the extent possible without causing added delay.

h    n(h)
2     4
3     6
4     9
5    13
6    19

[Figure: dot-notation reduction, top to bottom.]
Level 1: 6 FAs
Level 2: 11 FAs
Level 3: 7 FAs
Level 4: 4 FAs + 1 HA
Final: 7-bit adder

Total cost = 7-bit adder + 28 FAs + 1 HA



4 × 4 Example

• 16 AND gates used to form x_i a_j terms (dots)

Column heights: 1 2 3 4 3 2 1



Wallace Example
Column heights: 1 2 3 4 3 2 1

• 5 FAs, 3 HAs, 4-bit CPA



Dadda Examples
First design, column heights 1 2 3 4 3 2 1:
• 3 FAs, 3 HAs, 6-bit CPA

Second design, column heights 1 2 3 4 3 2 1:
• 4 FAs, 2 HAs, 6-bit CPA



Multipliers using CSA tree
CSA trees are quite irregular, causing some difficulties in VLSI realization.

Due to the gradual dropping out of some of the result bits, CSA widths do not vary much as we go down the tree levels.

[Figure: CSA tree for seven inputs with bit ranges [0, 6], [1, 7], [2, 8], [3, 9], [4, 10], [5, 11], [6, 12], reduced through several 7-bit CSAs and a 10-bit CSA to a 10-bit CPA. Result bits 3, 2, 1, 0 drop out along the way; the CPA output is labeled [4, 13] with the note "Ignore". The index pair [i, j] means that bit positions from i up to j are involved.]



Array Multipliers
[Figure: 5 × 5 array multiplier. The product terms a_j x_i (a_4 ... a_0 with x_0 through x_4) enter rows of adder cells, with 0 inputs on the first row and at the final adder. Product bits p_0 through p_4 emerge one per row, and p_5 through p_9 come from the last row.]

A basic array multiplier uses a one-sided CSA tree and a ripple-carry adder.
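A word-level Python sketch of the idea: one carry-save adder per multiplier bit, chained one after another (a one-sided tree), followed by a ripple-carry adder. It models the arithmetic only, not the cell layout or the early dropping out of low-order product bits, and the names `carry_save_add`, `ripple_carry_add`, and `array_multiply` are hypothetical.

```python
def carry_save_add(x, y, z):
    """3-to-2 reduction: a row of full adders working on whole words."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def ripple_carry_add(x, y, width):
    """Bit-serial carry-propagate addition, one full adder per position."""
    result, carry = 0, 0
    for i in range(width):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        result |= (xi ^ yi ^ carry) << i
        carry = (xi & yi) | (xi & carry) | (yi & carry)
    return result


def array_multiply(a, x, k):
    """Unsigned k-bit by k-bit multiply with a linear chain of CSAs."""
    s, c = 0, 0
    for i in range(k):
        partial = (a << i) if (x >> i) & 1 else 0  # row of AND gates: a * x_i * 2^i
        s, c = carry_save_add(s, c, partial)
    return ripple_carry_add(s, c, 2 * k)


print(array_multiply(21, 27, 5))  # 567
```

The chain has k CSA levels, which is why this organization has the longest CSA delay for a given k, while every level is identical and only talks to its neighbor.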



Array Multipliers
• Slowest possible CSA tree
• Slowest possible Carry Propagate Adder
But
• Very regular in structure
• Uses only short wires that go from one FA to a
horizontally, vertically, or diagonally adjacent FA
• Thus it has very simple and efficient layout in VLSI
• Can also be easily and effectively pipelined



Array Multiplier Modified Full-Adder Cells
[Figure: 5 × 5 array of modified full-adder (FA) cells with inputs a_4 ... a_0 and x_0 ... x_4 and outputs p_0 ... p_9.]



Pipelined Array Multiplier
With latches after every FA level, the maximum throughput is achieved.
Latches may be inserted after every h FA levels for an intermediate design.

Example: 3-stage pipeline

[Figure: pipelined 5 × 5 array multiplier using latched FA blocks, with inputs a_4 ... a_0 and x_0 ... x_4 and outputs p_9 ... p_0. Legend: latched FA with AND gate, FA, latch. The small shaded boxes are latches.]



DIVIDERS



Division
Notation for division algorithms:
z   Dividend (2k bits)    z_{2k–1} z_{2k–2} . . . z_3 z_2 z_1 z_0
d   Divisor (k bits)      d_{k–1} d_{k–2} . . . d_1 d_0
q   Quotient (k bits)     q_{k–1} q_{k–2} . . . q_1 q_0
s   Remainder (k bits)    s_{k–1} s_{k–2} . . . s_1 s_0

where $s = z - (d \times q)$
or $z = (d \times q) + s$

Division in dot notation (k = 4):
q              Quotient
d              Divisor
z              Dividend
–q_3 d 2^3
–q_2 d 2^2     Subtracted
–q_1 d 2^1     bit-matrix
–q_0 d 2^0
s              Remainder



Division Complexity
Division is more complex than multiplication:
• Two unknowns: ‘q’ and ‘s’.
• Need for quotient digit selection or estimation: for radix 2 this is easy,
because each digit is either 0 or 1, but for higher radices it is more difficult.
• In multiplication, the product of two k-bit numbers is always representable in 2k bits,
whereas if a 2k-bit number is divided by a k-bit number, the quotient could need
more than k bits.
• Overflow possibility: for unsigned integers, the high-order k bits of z must be
strictly less than d, because
$q < 2^k$ and $s < d$
$z < (2^k - 1)d + d = 2^k d$

This overflow check also detects the divide-by-zero condition.
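A small Python sketch of the overflow test, assuming unsigned operands with a 2k-bit dividend and a k-bit divisor. The function name `division_overflows` is hypothetical.

```python
def division_overflows(z, d, k):
    """True when the quotient of z / d would not fit in k bits (or d is 0)."""
    z_high = z >> k  # high-order k bits of the 2k-bit dividend
    return z_high >= d


print(division_overflows(117, 10, 4))  # False: (0111)two < (1010)two
print(division_overflows(200, 10, 4))  # True: 200 / 10 = 20 needs 5 bits
print(division_overflows(117, 0, 4))   # True: divide by zero is caught as well
```

Comparing only the high-order half of z against d is enough because z < 2^k d holds exactly when the high-order k bits of z are less than d; with d = 0 the comparison is always true, which is how divide-by-zero is detected.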



Division Complexity
Recent Intel Microarchitectures
                              Sandy Bridge                 Nehalem/Westmere
Instruction                   Latency   Cycles to wait     Latency   Cycles to wait
---------------------------   -------   --------------     -------   --------------
Load / Store                  1         0.33               1         0.33
Integer Add                   1         0.33               1         0.33
Integer Multiply (32/64)      3/4       1                  3/10      1
Integer Divide (32/64)        25/90     17                 21/80     13
Double/Single FP Add          3         1                  3         1
Double/Single FP Multiply     5         1                  5         1
Double/Single FP Divide       22/14     22/14              24/16     20/12

Latency: The number of clock cycles that are required for the execution core to complete
the execution of all of the μops that form an instruction.
Cycles to wait: The number of clock cycles required to wait before the issue ports are free to
accept the same instruction again
Source: Intel® 64 and IA-32 Architectures Optimization Reference Manual



Division Recurrence
[Figure: division in dot notation, showing quotient q, divisor d, dividend z, the subtracted bit-matrix –q_3 d 2^3 through –q_0 d 2^0, and remainder s.]

Division with left shifts (there is no corresponding right-shift algorithm):

$s^{(j)} = 2s^{(j-1)} - q_{k-j}(2^k d)$, with $s^{(0)} = z$ and $s^{(k)} = 2^k s$

The term $2s^{(j-1)}$ is the shift; the term $q_{k-j}(2^k d)$ is the subtract.

Integer division is characterized by $z = d \times q + s$

$2^{-2k} z = (2^{-k} d) \times (2^{-k} q) + 2^{-2k} s$
$z_{frac} = d_{frac} \times q_{frac} + 2^{-k} s_{frac}$

No-overflow condition for fractions is: $z_{frac} < d_{frac}$

Divide fractions like integers; adjust the remainder
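The recurrence can be run directly in Python. This is a sketch under the assumptions of unsigned operands and no overflow; the function name `divide_by_recurrence` is hypothetical, and each quotient digit is chosen by comparing 2s(j–1) with 2^k d.

```python
def divide_by_recurrence(z, d, k):
    """Unsigned division of a 2k-bit z by a k-bit d using
    s(j) = 2 s(j-1) - q_{k-j} (2^k d), with s(0) = z and s(k) = 2^k s."""
    if (z >> k) >= d:
        raise ValueError("overflow or divide by zero")
    s = z
    q = 0
    dk = d << k  # 2^k d, the divisor aligned with the upper half of z
    for j in range(1, k + 1):
        shifted = 2 * s
        q_digit = 1 if shifted >= dk else 0  # quotient digit q_{k-j}
        s = shifted - q_digit * dk
        q = (q << 1) | q_digit
    return q, s >> k  # s(k) = 2^k s, so drop the k low-order zeros


print(divide_by_recurrence(117, 10, 4))  # (11, 7)
```

For z = 117 and d = 10 the partial remainders are 74, 148, 136, and 112, giving quotient digits 1, 0, 1, 1 and a final remainder of 112 / 16 = 7.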



Decimal Division Example
Integer division                              Fractional division
====================================          ====================================
117  z            0111 0101                   z_frac       .0111 0101
÷10  2^4 d        1010                        d_frac       .1010
====================================          ====================================
     s(0)         0111 0101                   s(0)         .0111 0101
     2s(0)      0 1110 101                    2s(0)       0.1110 101
     –q_3 2^4 d   1010        {q_3 = 1}       –q_{–1} d    .1010        {q_{–1} = 1}
     ––––––––––––––––––––––––––––––––         ––––––––––––––––––––––––––––––––
     s(1)         0100 101                    s(1)         .0100 101
     2s(1)      0 1001 01                     2s(1)       0.1001 01
     –q_2 2^4 d   0000        {q_2 = 0}       –q_{–2} d    .0000        {q_{–2} = 0}
     ––––––––––––––––––––––––––––––––         ––––––––––––––––––––––––––––––––
     s(2)         1001 01                     s(2)         .1001 01
     2s(2)      1 0010 1                      2s(2)       1.0010 1
     –q_1 2^4 d   1010        {q_1 = 1}       –q_{–3} d    .1010        {q_{–3} = 1}
     ––––––––––––––––––––––––––––––––         ––––––––––––––––––––––––––––––––
     s(3)         1000 1                      s(3)         .1000 1
     2s(3)      1 0001                        2s(3)       1.0001
     –q_0 2^4 d   1010        {q_0 = 1}       –q_{–4} d    .1010        {q_{–4} = 1}
     ––––––––––––––––––––––––––––––––         ––––––––––––––––––––––––––––––––
     s(4)         0111                        s(4)         .0111
  7  s            0111                        s_frac      0.0000 0111
 11  q            1011                        q_frac       .1011
====================================          ====================================



Programmed Division
{Using left shifts, divide unsigned 2k-bit dividend,
 z_high|z_low, storing the k-bit quotient and remainder.
 Registers: R0 holds 0         Rc for counter
            Rd for divisor     Rs for z_high & remainder
            Rq for z_low & quotient}
{Load operands into registers Rd, Rs, and Rq}
  div:     load    Rd with divisor
           load    Rs with z_high
           load    Rq with z_low
{Check for exceptions}
           branch  d_by_0 if Rd = R0
           branch  d_ovfl if Rs ≥ Rd    {no overflow only if z_high < d}
{Initialize counter}
           load    k into Rc
{Begin division loop}
  d_loop:  shift   Rq left 1            {zero to LSB, MSB to carry}
           rotate  Rs left 1            {carry to LSB, MSB to carry}
           skip    if carry = 1         {skip the next instruction}
           branch  no_sub if Rs < Rd
           sub     Rd from Rs
           incr    Rq                   {set quotient digit to 1}
  no_sub:  decr    Rc                   {decrement counter by 1}
           branch  d_loop if Rc ≠ 0
{Store the quotient and remainder}
           store   Rq into quotient
           store   Rs into remainder
  d_by_0:  ...
  d_ovfl:  ...
  d_done:  ...
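The register-level behavior of the program above can be simulated in Python. This is a sketch, not the author's code: the function name `programmed_divide` is hypothetical, registers are modeled as k-bit integers with an explicit carry flag, and the overflow test is taken as Rs ≥ Rd so that it matches the requirement that the high-order k bits of z be strictly less than d.

```python
def programmed_divide(z_high, z_low, d, k):
    """Simulate the shift/rotate/subtract loop on k-bit registers Rs, Rq, Rd."""
    mask = (1 << k) - 1
    rd, rs, rq = d & mask, z_high & mask, z_low & mask
    if rd == 0:
        raise ZeroDivisionError("d_by_0")
    if rs >= rd:
        raise OverflowError("d_ovfl")
    rc = k
    while rc != 0:
        # shift Rq left 1: zero to LSB, MSB to carry
        carry = (rq >> (k - 1)) & 1
        rq = (rq << 1) & mask
        # rotate Rs left 1: carry to LSB, MSB to carry
        msb = (rs >> (k - 1)) & 1
        rs = ((rs << 1) | carry) & mask
        carry = msb
        # subtract when a 1 was shifted out of Rs, or when Rs >= Rd
        if carry == 1 or rs >= rd:
            rs = (rs - rd) & mask
            rq += 1  # set quotient digit to 1
        rc -= 1
    return rq, rs  # quotient, remainder


print(programmed_divide(0b0111, 0b0101, 0b1010, 4))  # (11, 7)
```

The carry test matters: in the third iteration of 117 ÷ 10 the register Rs holds 0010 after the rotate, yet the true partial remainder is 1 0010 (18), so the subtraction must go ahead even though Rs < Rd.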



Time Complexity of Programmed Division
• Assume k-bit words
• k iterations of the main loop
• 6-8 instructions per iteration, depending on the quotient bit
• Thus, 6k + 8 to 8k + 8 machine instructions, including operand loads and result store
• k = 32 implies 200+ instructions on average. This is too slow for many modern applications!
• Microprogrammed division would be somewhat better; it will be covered in the next lecture.



Programmed Division – Task 1
• 8086 division
• Write a program for division using shift and
subtract operation.



Restoring Unsigned Division
=======================
z              0111 0101
2^4 d        0 1010          No overflow, because
–2^4 d       1 0110          (0111)two < (1010)two
=======================
s(0)         0 0111 0101
2s(0)        0 1110 101
+(–2^4 d)    1 0110
––––––––––––––––––––––––
s(1)         0 0100 101      Positive, so set q_3 = 1
2s(1)        0 1001 01
+(–2^4 d)    1 0110
––––––––––––––––––––––––
s(2)         1 1111 01       Negative, so set q_2 = 0
s(2) = 2s(1) 0 1001 01       and restore
2s(2)        1 0010 1
+(–2^4 d)    1 0110
––––––––––––––––––––––––
s(3)         0 1000 1        Positive, so set q_1 = 1
2s(3)        1 0001
+(–2^4 d)    1 0110
––––––––––––––––––––––––
s(4)         0 0111          Positive, so set q_0 = 1
s              0111
q              1011
=======================
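A Python sketch of the restoring scheme in the example, using a (k+1)-bit adder and the 2's complement of d. The function name `restoring_divide` is hypothetical, and "restoring" is modeled by simply not storing the trial difference when it comes out negative; the sign is read from the adder's carry-out.

```python
def restoring_divide(z, d, k):
    """Unsigned restoring division of a 2k-bit z by a k-bit d."""
    if d == 0 or (z >> k) >= d:
        raise ValueError("overflow or divide by zero")
    low_mask = (1 << k) - 1
    wide_mask = (1 << (k + 1)) - 1
    neg_d = (-d) & wide_mask          # (k+1)-bit two's complement of d
    upper, lower = z >> k, z & low_mask
    q = 0
    for _ in range(k):
        # Shift the partial remainder left by one bit: this forms 2s(j-1).
        upper = (upper << 1) | (lower >> (k - 1))
        lower = (lower << 1) & low_mask
        # Trial subtraction, done as an addition of -d on k+1 bits.
        total = upper + neg_d
        c_out = total >> (k + 1)
        if c_out:                     # difference is not negative: keep it
            upper = total & wide_mask
            q = (q << 1) | 1
        else:                         # negative: restore by keeping the old value
            q = q << 1
    return q, upper                   # quotient, remainder


print(restoring_divide(0b01110101, 0b1010, 4))  # (11, 7)
```

In the trace for 117 ÷ 10 the carry-out is 1, 0, 1, 1 over the four cycles, which is exactly the quotient 1011; the second cycle is the one where the example restores the partial remainder.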

Restoring Hardware Dividers
In 2’s-complement arithmetic, adding a negative value to a positive value produces c_out = 1 if the result is positive.

