Advanced Computer Systems Architecture Lect-4

Advanced Computer Systems Architecture
Course Teacher: Dr.-Ing. Shehzad Hasan
CIS, NED University
Lecture # 4
Fall Semester 2015 CS-506 ACSA 1

Recap Lecture – 3
• Computer Arithmetic
– Adders
• Carry Save Adder
– Multipliers
• Right and Left Shift Algorithms
• Signed Multiplication (2’s Complement)
– Positive Multiplier
– Negative Multiplier
• Booth’s Recoding for Radix-2
• Higher Radix Multiplication

Possible Designs Radix-4 Multiplication
[Figure: the multiple generation part of a radix-4 multiplier based on
replacing 3a with 4a (carry into next higher radix-4 multiplier digit) and -a.
The multiplier is shifted 2 bits at a time; bits x_{i+1} x_i and the carry
flip-flop c control a mux whose output goes to the adder.]

Mux inputs:   00 -> 0    01 -> a    10 -> 2a    11 -> -a
Mux control:  high bit = x_{i+1} XOR (x_i AND c), low bit = x_i XOR c
Carry FF:     set if x_{i+1} = x_i = 1 or if x_{i+1} = c = 1

x_{i+1}  x_i   c    Mux control   Set carry
-------  ---  ---   -----------   ---------
   0      0    0        0 0           0
   0      0    1        0 1           0
   0      1    0        0 1           0
   0      1    1        1 0           0
   1      0    0        1 0           0
   1      0    1        1 1           1
   1      1    0        1 1           1
   1      1    1        0 0           1

An extra cycle may be needed at the end because of the carry.
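
The truth table above can be checked with a short Python sketch. The function names `select_multiple` and `multiply_radix4` are illustrative only and do not come from the lecture; the model assumes an unsigned multiplier with an even number of bits.

```python
def select_multiple(x_hi, x_lo, c):
    """Return (multiple, carry_out) for bits x_{i+1}, x_i and carry-in c.

    multiple is 0, 1, 2 or -1, standing for 0, a, 2a or -a.
    """
    digit = 2 * x_hi + x_lo + c      # radix-4 digit plus carry: 0..4
    if digit >= 3:
        return digit - 4, 1          # 3 -> -a with carry, 4 -> 0 with carry
    return digit, 0


def multiply_radix4(a, x, k):
    """Multiply a by the k-bit unsigned x (k even), two bits per step."""
    product, c = 0, 0
    for i in range(0, k, 2):
        m, c = select_multiple((x >> (i + 1)) & 1, (x >> i) & 1, c)
        product += (m * a) << i
    # Extra cycle at the end if the carry flip-flop is still set.
    return product + ((c * a) << k)
```

The final line models the extra cycle mentioned on the slide: a carry left over from the most significant digit contributes one more multiple of a.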

Modified Booth’s Recoding
----------------------------------------------------------------------------
x_{i+1}  x_i  x_{i-1}   y_{i+1}  y_i   z_{i/2}   Explanation
----------------------------------------------------------------------------
   0      0      0         0      0       0      No string of 1s in sight
   0      0      1         0      1       1      End of string of 1s in x
   0      1      0         1     -1       1      Isolated 1 in x
   0      1      1         1      0       2      End of string of 1s in x
   1      0      0        -1      0      -2      Beginning of string of 1s in x
   1      0      1        -1      1      -1      End a string, begin new string
   1      1      0         0     -1      -1      Beginning of string of 1s in x
   1      1      1         0      0       0      Continuation of string of 1s in x
----------------------------------------------------------------------------
x_{i+1} x_i: radix-2 digits; x_{i-1}: context; y_{i+1} y_i: recoded radix-2
digits; z_{i/2}: radix-4 digit

Example
     1  0  0  1  1  1  0  1  1  0  1  0  1  1  1  0    Operand x
(1) -1  0  1  0  0 -1  1  0 -1  1 -1  1  0  0 -1  0    Recoded version y
(1)    -2     2    -1     2    -1    -1     0    -2    Radix-4 version z
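
As a minimal sketch, each row of the recoding table follows from z_{i/2} = -2 x_{i+1} + x_i + x_{i-1}. The function name `booth_radix4_digits` is hypothetical, and the operand is assumed to be a k-bit 2's-complement number with k even.

```python
def booth_radix4_digits(x, k):
    """Recode the k-bit 2's-complement pattern x (k even) into radix-4
    digits in [-2, 2], least-significant digit first."""
    digits = []
    prev = 0                              # x_{-1} is taken as 0
    for i in range(0, k, 2):
        lo = (x >> i) & 1                 # x_i
        hi = (x >> (i + 1)) & 1           # x_{i+1}
        digits.append(-2 * hi + lo + prev)
        prev = hi                         # context bit for the next digit
    return digits


# Operand from the slide's example, read most-significant digit first.
x = 0b1001110110101110
z = booth_radix4_digits(x, 16)[::-1]
```

For the operand in the example, `z` reproduces the radix-4 version shown on the slide, without the leading (1) that applies only when x is read as unsigned.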

Multiplication via Modified Booth’s Recoding
================================
a                0 1 1 0
x                1 0 1 0
z                 -1  -2          Radix-4
================================
p(0)         0 0 0 0 0 0
+z0 a        1 1 0 1 0 0
--------------------------------
4p(1)        1 1 0 1 0 0
p(1)         1 1 1 1 0 1 0 0
+z1 a        1 1 1 0 1 0
--------------------------------
4p(2)        1 1 0 1 1 1 0 0
p(2)         1 1 0 1 1 1 0 0
================================

------------------------------
x_{i+1}  x_i  x_{i-1}   z_{i/2}
------------------------------
   0      0      0         0
   0      0      1         1
   0      1      0         1
   0      1      1         2
   1      0      0        -2
   1      0      1        -1
   1      1      0        -1
   1      1      1         0
------------------------------
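
The worked example can be reproduced with the following sketch, which accumulates z_{i/2} · a · 4^{i/2} for each recoded digit. The name `booth_multiply` is an assumption for illustration; x is taken as a k-bit 2's-complement pattern with k even, and a as an ordinary Python integer.

```python
def booth_multiply(a, x, k):
    """Multiply a by the k-bit 2's-complement pattern x (k even)
    using radix-4 modified Booth digits."""
    product, prev = 0, 0
    for i in range(0, k, 2):
        lo = (x >> i) & 1                  # x_i
        hi = (x >> (i + 1)) & 1            # x_{i+1}
        z = -2 * hi + lo + prev            # recoded digit in [-2, 2]
        product += (z * a) << i            # add z * a at weight 4^(i/2)
        prev = hi
    return product


p = booth_multiply(0b0110, 0b1010, 4)      # a = 6, x = -6
```

Here `p` is -36, whose 8-bit 2's-complement pattern is 1101 1100, the same bit string as p(2) in the example.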

Hardware Implementation of Radix-4
Booth’s Recoding
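[Figure: radix-4 Booth recoding hardware. The multiplier register is shifted
2 bits at a time, with the x_{i-1} position initialized to 0. The recoding
logic reads x_{i+1}, x_i, x_{i-1} and produces the signals neg, two, and non0.
A mux selects a or 2a from the multiplicand (select = two, enable = non0),
giving 0, a, or 2a; neg drives the add/subtract control. The (k+1)-bit value
z_{i/2} a goes to the adder input.]

Encoding
Digit   neg   two   non0
-----   ---   ---   ----
 -2      1     1     1
 -1      1     0     1
  0      0     0     0
  1      0     0     1
  2      0     1     1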

Multiple Designs
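[Figure: high-radix or partial-tree multipliers as intermediate designs
between the basic binary multiplier and the full-tree multiplier]
  Basic binary:                next multiple -> adder -> partial product
  High-radix or partial tree:  several multiples -> small CSA tree -> adder -> partial product
  Full tree:                   all multiples -> full CSA tree -> adder
Moving from the basic binary design toward the partial tree speeds up
multiplication; moving from the full tree toward the partial tree economizes
on hardware.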

Wallace and Dadda Trees
• Wallace: combine partial products as soon as possible
• Fastest possible design, since it typically needs a smaller CPA at the end
• Dadda: maintain the critical path length (tree depth) but combine as late as possible
• Simpler tree but wider CPA at the end

n(h): maximum number of inputs for h levels
----------------------------
 h   n(h)       h   n(h)
----------------------------
 0     2        7    28
 1     3        8    42
 2     4        9    63
 3     6       10    94
 4     9       11   141
 5    13       12   211
 6    19       13   316
----------------------------
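
The n(h) table is consistent with the recurrence n(0) = 2, n(h) = floor(3 n(h-1) / 2), since each level of 3-to-2 carry-save adders turns every three operands into two. A minimal sketch follows; the helper names `max_inputs` and `levels_needed` are illustrative, not from the lecture.

```python
def max_inputs(h):
    """Maximum number of operands an h-level CSA tree can reduce to two."""
    n = 2
    for _ in range(h):
        n = n * 3 // 2       # one more level accepts 3 inputs per 2 outputs
    return n


def levels_needed(n_operands):
    """Smallest h such that n(h) >= n_operands."""
    h = 0
    while max_inputs(h) < n_operands:
        h += 1
    return h
```

For example, `levels_needed(7)` is 4, which matches the four CSA levels used for seven operands in the Wallace and Dadda trees of this lecture.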

Wallace Tree
Addition of seven 6-bit numbers in dot notation
Reduce the number of operands at the earliest possible opportunity.
[Figure: dot-notation reduction in four CSA levels]
  Level 1: 12 FAs
  Level 2: 6 FAs
  Level 3: 6 FAs
  Level 4: 4 FAs + 1 HA
  Final:   7-bit adder
Total cost = 7-bit adder + 28 FAs + 1 HA

Dadda Tree
Addition of seven 6-bit numbers in dot notation
Postpone the reduction to the extent possible without causing added delay.
[Figure: dot-notation reduction in four CSA levels]
  Level 1: 6 FAs
  Level 2: 11 FAs
  Level 3: 7 FAs
  Level 4: 4 FAs + 1 HA
  Final:   7-bit adder
Total cost = 7-bit adder + 28 FAs + 1 HA

 h   n(h)
 2     4
 3     6
 4     9
 5    13
 6    19

4 × 4 Example
• 16 AND gates used to form x_i a_j terms (dots)
• Column heights: 1 2 3 4 3 2 1

Wallace Example
• Column heights: 1 2 3 4 3 2 1
• 5 FAs, 3 HAs, 4-bit CPA

Dadda Examples
• Column heights (both designs): 1 2 3 4 3 2 1
• Design 1: 3 FAs, 3 HAs, 6-bit CPA
• Design 2: 4 FAs, 2 HAs, 6-bit CPA
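
The first set of Dadda counts for the 4 × 4 case (3 FAs, 3 HAs, 6-bit CPA) can be reproduced by counting dots per column. The sketch below uses one common statement of Dadda's rule, assumed here rather than taken from the slides: reduce each column only as far as the next target height in 2, 3, 4, 6, ..., using an HA when one dot too many remains and an FA otherwise. All function names are hypothetical.

```python
def column_heights(k):
    """Dot-column heights of a k x k partial-product matrix, LSB column first."""
    return [min(i, 2 * k - 2 - i) + 1 for i in range(2 * k - 1)]


def dadda_targets(max_height):
    """Target heights 2, 3, 4, 6, ... below max_height, largest first."""
    seq = [2]
    while seq[-1] * 3 // 2 < max_height:
        seq.append(seq[-1] * 3 // 2)
    return seq[::-1]


def dadda_reduce(heights):
    """Return (full_adders, half_adders, cpa_width) for the reduction."""
    heights = list(heights) + [0]          # spare column for a final carry
    fas = has = 0
    for target in dadda_targets(max(heights)):
        carries = 0                        # carry dots arriving from column i-1
        for i in range(len(heights)):
            h = heights[i] + carries
            carries = 0
            while h > target:
                if h == target + 1:
                    has += 1               # HA: two dots in, one sum dot stays
                    h -= 1
                else:
                    fas += 1               # FA: three dots in, one sum dot stays
                    h -= 2
                carries += 1               # each adder sends one carry dot onward
            heights[i] = h
    cpa_width = sum(1 for h in heights if h == 2)   # columns still holding two dots
    return fas, has, cpa_width
```

Calling `dadda_reduce(column_heights(4))` gives `(3, 3, 6)`. The second design on the slide (4 FAs, 2 HAs) is a different allocation and is not produced by this rule.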

Multipliers using CSA tree
[Figure: CSA tree for seven 7-bit partial products occupying bit positions
[0, 6], [1, 7], [2, 8], [3, 9], [4, 10], [5, 11], and [6, 12]. They are
reduced by 7-bit CSAs and a 10-bit CSA, followed by a 10-bit CPA that produces
bits [4, 13]; bits 3, 2, 1, 0 drop out at earlier levels.]
• The index pair [i, j] means that bit positions from i up to j are involved.
• CSA trees are quite irregular, causing some difficulties in VLSI realization.
• Due to the gradual dropping out of some of the result bits, CSA widths do
  not vary much as we go down the tree levels.

Array Multipliers
[Figure: 5 × 5 array multiplier. Row i adds the terms a4 xi, a3 xi, a2 xi,
a1 xi, a0 xi to the running sum; rows x0 to x4 deliver the product bits p0 to
p4, and the adder at the bottom produces p5 to p9.]
A basic array multiplier uses a one-sided CSA tree and a ripple-carry adder.

Array Multipliers
• Slowest possible CSA tree
• Slowest possible carry-propagate adder
But
• Very regular in structure
• Uses only short wires that go from one FA to a horizontally, vertically, or
  diagonally adjacent FA
• Thus it has a very simple and efficient layout in VLSI
• Can also be easily and effectively pipelined
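
A behavioral sketch of the one-sided CSA structure: each row of AND terms is absorbed into a running (sum, carry) pair by a row of full adders, and a single carry-propagate addition finishes the product. The name `array_multiply` is illustrative; the model works on whole bit vectors rather than individual FA cells, and Python's `+` stands in for the ripple-carry adder.

```python
def array_multiply(a, x, k):
    """Multiply unsigned k-bit a and x by carry-save accumulation of rows."""
    s, c = 0, 0                                    # sum and carry vectors
    for i in range(k):
        row = (a << i) if (x >> i) & 1 else 0      # AND gates: a * x_i at weight 2^i
        # One row of full adders: bitwise sum, and majority shifted left as carry.
        s, c = s ^ c ^ row, ((s & c) | (s & row) | (c & row)) << 1
    return s + c                                   # final carry-propagate adder
```

The invariant is that `s + c` always equals the sum of the rows absorbed so far, which is why carries never need to ripple inside the loop.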

Array Multiplier Modified Full-Adder Cells
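[Figure: 5 × 5 array multiplier built from modified full-adder cells, each
combining an FA with the AND gate that forms its product term. Inputs a4 to a0
enter along the top and x0 to x4 along the side; outputs are p0 to p9.]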

Pipelined Array Multiplier
[Figure: pipelined 5 × 5 array multiplier using latched FA blocks, with inputs
a4 to a0 and x0 to x4 and outputs p9 to p0. Legend: latched FA with AND gate,
FA, latch. The small shaded boxes are latches.]
• With latches after every FA level, the maximum throughput is achieved.
• Latches may be inserted after every h FA levels for an intermediate design.
• Example: 3-stage pipeline

DIVIDERS

Division
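Notation for division algorithms:
z   Dividend (2k bits)    z_{2k-1} z_{2k-2} . . . z_3 z_2 z_1 z_0
d   Divisor (k bits)      d_{k-1} d_{k-2} . . . d_1 d_0
q   Quotient (k bits)     q_{k-1} q_{k-2} . . . q_1 q_0
s   Remainder (k bits)    s_{k-1} s_{k-2} . . . s_1 s_0
where $s = z - (d \times q)$, or equivalently $z = (d \times q) + s$

[Figure: division in dot notation for k = 4. The rows $-q_3 d 2^3$,
$-q_2 d 2^2$, $-q_1 d 2^1$, $-q_0 d 2^0$ form the subtracted bit-matrix that
is removed from the dividend z, leaving the remainder s.]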

Division Complexity
Division is more complex than multiplication:
• Two unknowns: the quotient q and the remainder s.
• Need for quotient digit selection or estimation: for radix 2 it is easy,
  since each digit is either 0 or 1, but for higher radices it is more difficult.
• In multiplication, the product of two k-bit numbers is always representable
  in 2k bits, whereas if a 2k-bit number is divided by a k-bit number, the
  quotient may need more than k bits.
• Overflow possibility: for unsigned integers, $q < 2^k$ and $s < d$, so
      $z < (2^k - 1)d + d = 2^k d$
  That is, the high-order k bits of z must be strictly less than d.
  This overflow check also detects the divide-by-zero condition.

Division Complexity
Recent Intel Microarchitectures
                             Sandy Bridge                Nehalem/Westmere
Instruction                  Latency   Cycles to wait    Latency   Cycles to wait
-------------------------    -------   --------------    -------   --------------
Load / Store                 1         0.33              1         0.33
Integer Add                  1         0.33              1         0.33
Integer Multiply (32/64)     3/4       1                 3/10      1
Integer Divide (32/64)       25/90     17                21/80     13
Double/Single FP Add         3         1                 3         1
Double/Single FP Multiply    5         1                 5         1
Double/Single FP Divide      22/14     22/14             24/16     20/12

Latency: the number of clock cycles required for the execution core to
complete the execution of all of the μops that form an instruction.
Cycles to wait: the number of clock cycles required to wait before the issue
ports are free to accept the same instruction again.
Source: Intel® 64 and IA-32 Architectures Optimization Reference Manual

Division Recurrence
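[Figure: division in dot notation for k = 4, with the rows $-q_3 d 2^3$,
$-q_2 d 2^2$, $-q_1 d 2^1$, $-q_0 d 2^0$ forming the subtracted bit-matrix
between the dividend z and the remainder s.]

Division with left shifts (there is no corresponding right-shift algorithm):
    $s^{(j)} = 2s^{(j-1)} - q_{k-j}(2^k d)$, with $s^{(0)} = z$ and $s^{(k)} = 2^k s$
The factor 2 is the shift; the term $q_{k-j}(2^k d)$ is the subtraction.

Integer division is characterized by $z = d \times q + s$.
Multiplying both sides by $2^{-2k}$:
    $2^{-2k} z = (2^{-k} d) \times (2^{-k} q) + 2^{-2k} s$
    $z_{frac} = d_{frac} \times q_{frac} + 2^{-k} s_{frac}$
Divide fractions like integers; adjust the remainder.
The no-overflow condition for fractions is $z_{frac} < d_{frac}$.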
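
The integer form of the recurrence can be written out directly. In this sketch the name `shift_subtract_divide` is hypothetical, operands are assumed unsigned, and the quotient digit is chosen by comparing the shifted partial remainder with 2^k d.

```python
def shift_subtract_divide(z, d, k):
    """Divide the 2k-bit unsigned z by the k-bit unsigned d using
    s(j) = 2 s(j-1) - q_{k-j} (2^k d), with s(0) = z."""
    if (z >> k) >= d:
        # High-order k bits of z must be strictly less than d;
        # this also catches d == 0.
        raise OverflowError("quotient does not fit in k bits")
    s, q = z, 0
    for _ in range(k):
        s <<= 1                              # shift: 2 s(j-1)
        bit = 1 if s >= (d << k) else 0      # quotient digit q_{k-j}
        s -= bit * (d << k)                  # subtract q_{k-j} * 2^k d
        q = (q << 1) | bit
    return q, s >> k                         # s(k) = 2^k s
```

With z = 117, d = 10, and k = 4, the function returns quotient 11 and remainder 7.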

Decimal Division Example
Integer division: 117 ÷ 10
================================
z             0111 0101
2^4 d         1010
================================
s(0)          0111 0101
2s(0)       0 1110 101
-q_3 2^4 d    1010          {q_3 = 1}
--------------------------------
s(1)          0100 101
2s(1)       0 1001 01
-q_2 2^4 d    0000          {q_2 = 0}
--------------------------------
s(2)          1001 01
2s(2)       1 0010 1
-q_1 2^4 d    1010          {q_1 = 1}
--------------------------------
s(3)          1000 1
2s(3)       1 0001
-q_0 2^4 d    1010          {q_0 = 1}
--------------------------------
s(4)          0111
s             0111          (7)
q             1011          (11)
================================

Fractional division
================================
z_frac        .0111 0101
d_frac        .1010
================================
s(0)          .0111 0101
2s(0)        0.1110 101
-q_{-1} d     .1010         {q_{-1} = 1}
--------------------------------
s(1)          .0100 101
2s(1)        0.1001 01
-q_{-2} d     .0000         {q_{-2} = 0}
--------------------------------
s(2)          .1001 01
2s(2)        1.0010 1
-q_{-3} d     .1010         {q_{-3} = 1}
--------------------------------
s(3)          .1000 1
2s(3)        1.0001
-q_{-4} d     .1010         {q_{-4} = 1}
--------------------------------
s(4)          .0111
s_frac       0.0000 0111
q_frac        .1011
================================

Programmed Division
{Using left shifts, divide unsigned 2k-bit dividend,
 z_high|z_low, storing the k-bit quotient and remainder.
 Registers: R0 holds 0      Rc for counter
            Rd for divisor  Rs for z_high & remainder
            Rq for z_low & quotient}
{Load operands into registers Rd, Rs, and Rq}
 div:     load   Rd with divisor
          load   Rs with z_high
          load   Rq with z_low
{Check for exceptions}
          branch d_by_0 if Rd = R0
          branch d_ovfl if Rs ≥ Rd     {no overflow only if z_high < d}
{Initialize counter}
          load   k into Rc
{Begin division loop}
 d_loop:  shift  Rq left 1             {zero to LSB, MSB to carry}
          rotate Rs left 1             {carry to LSB, MSB to carry}
          skip   if carry = 1
          branch no_sub if Rs < Rd
          sub    Rd from Rs
          incr   Rq                    {set quotient digit to 1}
 no_sub:  decr   Rc                    {decrement counter by 1}
          branch d_loop if Rc ≠ 0
{Store the quotient and remainder}
          store  Rq into quotient
          store  Rs into remainder
 d_by_0:  ...
 d_ovfl:  ...
 d_done:  ...
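
The register-level routine above can be emulated to see how Rq holds the remaining dividend bits and the growing quotient at the same time. This is a sketch under stated assumptions: `programmed_divide` and its parameter names are hypothetical, registers are modelled as k-bit integers, and the overflow test uses z_high ≥ d, the condition stated on the Division Complexity slide.

```python
def programmed_divide(z_high, z_low, d, k):
    """Emulate the shift/rotate/subtract loop on k-bit registers."""
    mask = (1 << k) - 1
    rd, rs, rq = d, z_high, z_low
    if rd == 0:
        raise ZeroDivisionError("d_by_0")
    if rs >= rd:
        raise OverflowError("d_ovfl")
    for _ in range(k):
        carry = rq >> (k - 1)               # shift Rq left: MSB to carry
        rq = (rq << 1) & mask               # zero enters the LSB
        msb = rs >> (k - 1)                 # rotate Rs left through carry
        rs = ((rs << 1) & mask) | carry     # old carry enters the LSB
        carry = msb                         # old MSB becomes the carry
        if carry == 1 or rs >= rd:          # skip / branch no_sub logic
            rs = (rs - rd) & mask           # sub Rd from Rs (k-bit wraparound)
            rq += 1                         # set quotient digit to 1
    return rq, rs                           # quotient, remainder
```

When the carry is 1, the true partial remainder is 2^k + Rs, which always exceeds the divisor, so the subtraction is performed without a comparison and the k-bit wraparound yields the correct difference.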

Time Complexity of Programmed Division
• Assume k-bit words
• k iterations of the main loop
• 6 to 8 instructions per iteration, depending on the quotient bit
• Thus, 6k + 8 to 8k + 8 machine instructions, including operand loads and
  result store
• k = 32 implies 200+ instructions on average. This is too slow for many
  modern applications!
• Microprogrammed division would be somewhat better; it will be covered in
  the next lecture.

Programmed Division – Task 1
• 8086 division
• Write a program for division using shift and subtract operations.

Restoring Unsigned Division
================================
z                0111 0101
2^4 d          0 1010           No overflow, because
-2^4 d         1 0110           (0111)two < (1010)two
================================
s(0)           0 0111 0101
2s(0)          0 1110 101
+(-2^4 d)      1 0110
--------------------------------
s(1)           0 0100 101       Positive, so set q_3 = 1
2s(1)          0 1001 01
+(-2^4 d)      1 0110
--------------------------------
s(2)           1 1111 01        Negative, so set q_2 = 0
s(2) = 2s(1)   0 1001 01        and restore
2s(2)          1 0010 1
+(-2^4 d)      1 0110
--------------------------------
s(3)           0 1000 1         Positive, so set q_1 = 1
2s(3)          1 0001
+(-2^4 d)      1 0110
--------------------------------
s(4)           0 0111           Positive, so set q_0 = 1
s                0111
q                1011
================================

Restoring Hardware Dividers
In 2’s-complement arithmetic, adding a negative value to a positive value
produces cout = 1 if the result is nonnegative. In a restoring divider, this
carry-out tells whether the trial subtraction succeeded.
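
The restoring scheme can be sketched with the trial subtraction performed as a (k+1)-bit addition of the 2's complement of d, taking the quotient digit from the adder's carry-out. The function name `restoring_divide` and the split into `upper` and `lower` halves are assumptions made for illustration; operands are unsigned.

```python
def restoring_divide(z, d, k):
    """Unsigned restoring division; each quotient bit is the carry-out
    of the trial subtraction upper + (-d) in k+1 bits."""
    if (z >> k) >= d:
        raise OverflowError("z_high must be strictly less than d")
    width = k + 1
    wmask = (1 << width) - 1
    kmask = (1 << k) - 1
    neg_d = (-d) & wmask                     # 2's complement of d in k+1 bits
    upper, lower, q = z >> k, z & kmask, 0
    for _ in range(k):
        # Shift the partial remainder left by one position.
        upper = (upper << 1) | (lower >> (k - 1))
        lower = (lower << 1) & kmask
        total = upper + neg_d                # trial subtraction as an addition
        cout = total >> width                # 1 exactly when upper - d >= 0
        if cout:
            upper = total & wmask            # keep the difference
        # Otherwise restore: leave upper as it was before the subtraction.
        q = (q << 1) | cout
    return q, upper
```

For z = 117, d = 10, k = 4, `neg_d` is 1 0110 and the successive carry-outs are 1, 0, 1, 1, matching q = 1011 in the example.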
