
## ECE 366 Computer Architecture

Instructor: Shantanu Dutt, Department of Electrical and Computer Engineering, University of Illinois at Chicago

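[Figure: full adder FA$_i$ with inputs $X_i$, $Y_i$ and carry-in $C_i$, producing the sum $S_i$ and the carry-out $C_{i+1}$.]

$S_i = X_i \oplus Y_i \oplus C_i$ and $C_{i+1} = X_i Y_i + X_i C_i + Y_i C_i$, where $\oplus$ denotes XOR and $+$ denotes OR.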
© Shantanu Dutt, UIC

[Figure: 8-bit ripple-carry adder (RCA). Full adders FA0 to FA7 are chained from right to left; FA$i$ takes $x_i$, $y_i$ and the carry $c_i$ and produces $S_i$ and $c_{i+1}$, with $c_{out}$ leaving FA7.]

Problem: The delay is $2n$ gate delays for an $n$-bit RCA, since each FA adds a 2-gate delay to the carry chain. Thus, if each gate has a delay of 2 ns, the delay for a 32-bit RCA is $2 \times 32 = 64$ gate delays, i.e., 128 ns.
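The ripple-carry behaviour can be sketched in a few lines of Python. This is a minimal illustration, not the slides' circuit: the function names `full_adder` and `ripple_carry_add` are hypothetical, and the delay figure simply charges 2 gate delays per full adder, as assumed above.

```python
def full_adder(x, y, c):
    """One-bit full adder: returns (sum bit, carry out)."""
    s = x ^ y ^ c
    c_out = (x & y) | (x & c) | (y & c)
    return s, c_out


def ripple_carry_add(x, y, n, c0=0):
    """Add two n-bit unsigned integers one bit at a time.

    Returns (n-bit sum, carry out, gate delays); the delay model
    charges 2 gate delays per full adder on the carry chain.
    """
    carry, total = c0, 0
    for i in range(n):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry, 2 * n


print(ripple_carry_add(6, 12, 4))
```

The carry of stage $i$ is only known after stage $i-1$ has finished, which is why the loop is inherently sequential.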

- Overflow occurs when the result of the operation does not fit in the representation being used.
- For example, if the 4-bit unsigned numbers $6 = 0110_2$ and $12 = 1100_2$ are added, the sum (18) overflows since its binary equivalent $10010_2$ does not fit in 4 bits.
- Overflow is detected for unsigned addition when the carry out of the final full adder is 1.
- For 2's complement representation of signed numbers, overflow occurs when the carry into the MSB (most significant bit), which is also the sign bit, is different from the carry out of that bit.
- When such an overflow occurs, the carry out of the MSB is the sign bit of the sign-extended, $(n+1)$-bit representation of the sum.
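Both overflow conditions can be checked with a short Python sketch. The helper name `add_with_flags` is hypothetical; it returns the $n$-bit sum, the carry out of the MSB (the unsigned overflow flag) and the 2's complement overflow flag, computed as "carry into MSB differs from carry out of MSB".

```python
def add_with_flags(x, y, n):
    """Add two n-bit patterns; return (sum bits, carry out, signed overflow)."""
    mask = (1 << n) - 1
    x &= mask
    y &= mask
    low_mask = mask >> 1                      # all bits below the MSB
    carry_into_msb = ((x & low_mask) + (y & low_mask)) >> (n - 1)
    total = x + y
    carry_out = total >> n                    # unsigned overflow flag
    signed_overflow = carry_into_msb != carry_out
    return total & mask, carry_out, signed_overflow


print(add_with_flags(0b0110, 0b1100, 4))   # 6 + 12 unsigned, or 6 + (-4) signed
print(add_with_flags(0b0111, 0b0001, 4))   # 7 + 1 overflows as signed
```

The first call overflows as an unsigned addition but not as a signed one; the second does the opposite.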

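[Figure: modified full adder FA$_i$ with inputs $X_i$, $Y_i$ and carry-in $C_i$ coming from the carry-generation logic; it outputs the sum $S_i$ and sends $P_i$ and $G_i$ to the carry-generation logic.]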

$G_i$ is the generate bit, which is 1 only if a carry out is to be generated irrespective of the input carry. This is the case only when $X_i = Y_i = 1$, i.e., $G_i = X_i Y_i$.

$P_i$ is the propagate bit, which is 1 only if the output carry is to be the same as the input carry, i.e., $C_{i+1} = C_i$. This is the case only when either $X_i$ or $Y_i$ (but not both) is 1, i.e., $P_i = X_i \oplus Y_i$.

Thus the carry out of the $i$th stage can be expressed as:

$$C_{i+1} = G_i + P_i C_i$$

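Consider a 4-bit adder:

$$\begin{aligned}
C_1 &= G_0 + P_0 C_0 \\
C_2 &= G_1 + P_1 C_1 = G_1 + P_1 G_0 + P_1 P_0 C_0 \\
C_3 &= G_2 + P_2 C_2 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0 \\
C_4 &= G_3 + P_3 C_3 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0
\end{aligned}$$

Note that $P_i$ and $G_i$ can be generated in constant time (specifically, 1 gate delay) by each modified full adder (MFA).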
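The 4-bit carry-lookahead equations can be written out directly. The sketch below is illustrative only: `cla4` is a hypothetical name, and the expanded sum-of-products carries are the standard lookahead expansion of $C_{i+1} = G_i + P_i C_i$.

```python
def cla4(x, y, c0=0):
    """4-bit carry-lookahead addition of x and y (each 0..15).

    Every carry is a two-level AND-OR function of the G/P bits and c0,
    so no carry has to wait for the previous stage.
    """
    xb = [(x >> i) & 1 for i in range(4)]
    yb = [(y >> i) & 1 for i in range(4)]
    g = [a & b for a, b in zip(xb, yb)]   # generate bits
    p = [a ^ b for a, b in zip(xb, yb)]   # propagate bits
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    carries = [c0, c1, c2, c3]
    s = sum((p[i] ^ carries[i]) << i for i in range(4))
    return s, c4


print(cla4(9, 7))
```

The widest product term in `c4` has five inputs, which is where the 5-input AND and OR gates of a 4-bit CLA come from.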

[Figure: 4-bit CLA adder. Modified full adders FA 0 to FA 3 take $X_i$, $Y_i$ and $C_i$, output $S_i$, and send $P_i$, $G_i$ to the carry generation logic, which takes $C_0$ and produces $C_1$ to $C_4$.]

In this CLA the $C_i$'s use 2-level logic and thus have a delay of 3 gate delays (1 for each $P_i$, $G_i$ and 2 for the $C_i$'s), as opposed to $2 \times 4 = 8$ gate delays for a 4-bit RCA.

Disadvantage: For a 16-bit adder, for example, we cannot go on generating the $C_i$'s in this manner, since the hardware becomes excessive and messy. A 16-bit adder can instead be partitioned into four groups, each a 4-bit CLA adder, with the inter-group carries rippling through the four groups:

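[Figure: 16-bit group-CLA adder built from four 4-bit CLA blocks. Each block adds a 4-bit slice such as $X_{7-4}$, $Y_{7-4}$ or $X_{3-0}$, $Y_{3-0}$ and produces the matching sum slice such as $S_{3-0}$; the carry enters at $C_0$ and ripples from block to block.]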

Such a 16-bit adder has a delay of 12 gate delays (= 24 ns for a gate delay of 2 ns), as opposed to a delay of $2 \times 16 = 32$ gate delays (= 64 ns) in a 16-bit RCA.

In general, for an $n$-bit group-CLA adder with 4-bit CLA cells, the delay is $3n/4$ gate delays, as opposed to $2n$ gate delays for the ripple-carry adder.
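The two delay formulas are easy to tabulate. The sketch below assumes the simple model used here (2 gate delays per FA in an RCA, 3 gate delays per 4-bit CLA cell, 2 ns per gate); the function names are hypothetical.

```python
def rca_gate_delays(n):
    """Ripple-carry adder: 2 gate delays per full adder."""
    return 2 * n


def group_cla_gate_delays(n, cell_bits=4, cell_delay=3):
    """Group-CLA adder: carries ripple through n / cell_bits CLA cells."""
    groups = -(-n // cell_bits)          # ceiling division
    return cell_delay * groups


GATE_DELAY_NS = 2                        # assumed gate delay
for n in (4, 16, 32, 64):
    print(n,
          rca_gate_delays(n) * GATE_DELAY_NS,
          group_cla_gate_delays(n) * GATE_DELAY_NS)
```

Both delays still grow linearly in $n$; the group-CLA adder only improves the constant factor.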

A note of caution: $k$-input gates have greater delays than 2-input gates for $k > 2$. We had assumed that for a 4-bit CLA the 5-input OR and AND gates have the same delay as a 2-input gate. The delay of a 5-input gate will be greater because:
[Figure: transistor-level (MOS) schematics of (a) a 2-i/p OR gate with inputs A and B, delay $= 2t_s + 3RC$, and (b) a 5-i/p gate computing A+B+C+D+E.]

Let $t_s$ be the max. of the switching-on and the switching-off time of a MOS transistor, i.e., the max. of the time for charge to collect at the channel or the time for the charge to be removed from the channel. Let $R$ be the resistance of a channel, and $C$ the input gate capacitance of a transistor.

Then a 2-i/p AND/OR gate has a delay of $2t_s + 3RC$, a 5-i/p gate has a larger delay, and a 5-i/p AND/OR ckt. formed of cascaded 2-i/p gates has a larger delay still.

Thus, while a 5-i/p AND/OR gate will be slower than a 2-i/p AND/OR gate, it will be faster than a cascaded implementation of a 5-i/p AND/OR ckt. The latter is primarily because more transistors switch on in parallel in a 5-i/p AND/OR gate than in a 5-i/p AND/OR ckt.

In a 4-bit CLA adder, we are essentially replacing a series of cascaded 2-i/p (multi-level) logic by multiple-input 2-level logic. Note that the delay of an $n$-bit group-CLA adder with 4-bit cells will therefore be somewhat more than $3n/4$ 2-i/p gate delays.

Can we do better than a delay that is linear in the number of bits $n$? Yes, by using a carry-select adder ($O(\sqrt{n})$ time), OR by using a parallel prefix circuit to generate all carries in $O(\log n)$ time.

Carry-Select adder:
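[Figure: carry-select adder. The operand bits up to $x_{n-1}$, $y_{n-1}$ are split into groups; each group above the lowest is added twice, once with carry-in 0 and once with carry-in 1, and multiplexers (Mux) pick the correct sums and carry-out once the group's true carry-in is known. The figure also annotates the delays in (2-i/p) gate delays.]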
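The selection idea behind a carry-select adder can be sketched as follows. This is a behavioural model under stated assumptions, not the slides' circuit: `carry_select_add` and `group_add` are hypothetical names, the group width is a free parameter, and each group adder is modelled with plain integer addition.

```python
def group_add(x, y, width, c_in):
    """Stand-in for one group adder: returns (sum bits, carry out)."""
    total = x + y + c_in
    return total & ((1 << width) - 1), total >> width


def carry_select_add(x, y, n, group):
    """Add two n-bit unsigned integers using groups of `group` bits."""
    result, carry = 0, 0
    for lo in range(0, n, group):
        width = min(group, n - lo)
        gmask = (1 << width) - 1
        xs, ys = (x >> lo) & gmask, (y >> lo) & gmask
        if lo == 0:
            s, carry = group_add(xs, ys, width, 0)
        else:
            # In hardware both candidates are computed in parallel,
            # before the true carry-in of the group is known.
            s0, c0 = group_add(xs, ys, width, 0)
            s1, c1 = group_add(xs, ys, width, 1)
            s, carry = (s1, c1) if carry else (s0, c0)   # the Mux
        result |= s << lo
    return result, carry


print(carry_select_add(0xBEEF, 0x1234, 16, 4))
```

Only the multiplexer selection is on the critical path between groups, which is what lets the group additions overlap in time.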

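## The parallel-prefix CLA adder

For the $i$th full adder, define the symbols:

- $k$: kill incoming carry (when $X_i = Y_i = 0$)
- $p$: propagate incoming carry (when $X_i \neq Y_i$)
- $g$: generate a carry (when $X_i = Y_i = 1$)

We can encode each of these symbols as a pair of bits and call this pair of bits $q_i$. Each FA can produce $q_i$ in constant time. As a matter of fact, $q_i$ is formed from $G_i$ and $P_i$, the generate and propagate bits of a FA discussed earlier.

Define the operator $\circ$, with the left operand $x$ belonging to the more significant position, as:

| $x \circ y$ | $y = k$ | $y = p$ | $y = g$ |
|---|---|---|---|
| $x = k$ | $k$ | $k$ | $k$ |
| $x = p$ | $k$ | $p$ | $g$ |
| $x = g$ | $g$ | $g$ | $g$ |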

Note that $\circ$ is an associative operator, i.e., $(x \circ y) \circ z = x \circ (y \circ z)$. Also, $\circ$ is not a commutative operator, i.e., in general $x \circ y \neq y \circ x$.
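Associativity and non-commutativity of the kill/propagate/generate operator can be checked exhaustively. The sketch assumes the convention that the left operand belongs to the more significant stage, so the result is the left status unless that status is "propagate"; the name `compose` is hypothetical.

```python
from itertools import product

STATUSES = "kpg"   # kill, propagate, generate


def compose(x, y):
    """x o y, where x is the status of the more significant stage."""
    return y if x == "p" else x


associative = all(
    compose(compose(x, y), z) == compose(x, compose(y, z))
    for x, y, z in product(STATUSES, repeat=3)
)
non_commuting = [
    (x, y) for x, y in product(STATUSES, repeat=2)
    if compose(x, y) != compose(y, x)
]
print(associative, non_commuting)
```

Associativity is what allows the combinations to be regrouped into a tree; the operand order still has to be preserved because the operator does not commute.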

## The parallel-prefix CLA adder (contd.)

Define

$$r_i = q_i \circ q_{i-1} \circ \cdots \circ q_0$$

$r_i$ is the $i$th prefix of the associative computation $q_{n-1} \circ q_{n-2} \circ \cdots \circ q_0$.

If we can compute each $r_i$ quickly, then we can obtain the carry-in $c_i$ of the $i$th FA as follows:

- If $r_{i-1} = k$ then $c_i = 0$
- else if $r_{i-1} = p$ then $c_i = c_0$
- else if $r_{i-1} = g$ then $c_i = 1$

The $c_i$'s can be computed in constant time after the $r_i$'s are available.
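The step from prefixes to carries can be sketched as below. The prefixes are computed here one after another with `itertools.accumulate`, purely to show the decoding; the names `status`, `compose` and `carries` are hypothetical, and the left operand of `compose` is assumed to be the more significant stage.

```python
from itertools import accumulate


def status(xb, yb):
    """Carry status of one bit position: generate, propagate or kill."""
    if xb & yb:
        return "g"
    return "p" if xb | yb else "k"


def compose(hi, lo):
    """hi o lo: the more significant status wins unless it is 'p'."""
    return lo if hi == "p" else hi


def carries(x, y, n, c0=0):
    """Return [c_0, c_1, ..., c_n] for the n-bit addition x + y + c0."""
    q = [status((x >> i) & 1, (y >> i) & 1) for i in range(n)]
    # r[i] = q[i] o q[i-1] o ... o q[0]
    r = list(accumulate(q, lambda acc, qi: compose(qi, acc)))
    decode = {"k": 0, "g": 1, "p": c0}
    return [c0] + [decode[ri] for ri in r]


print(carries(0b0110, 0b1100, 4))
```

Once the carries are known, each sum bit is just $x_i \oplus y_i \oplus c_i$.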

## The parallel-prefix CLA adder (contd.)

Computing the $r_i$'s quickly: Define $r_{i,j}$, for $i \geq j$, as

$$r_{i,j} = q_i \circ q_{i-1} \circ \cdots \circ q_j$$

Thus, $r_i = r_{i,0}$. Since $\circ$ is associative, we have

$$r_{i,j} = r_{i,k} \circ r_{k-1,j}, \quad i \geq k > j.$$

## The parallel-prefix CLA adder (contd.)

We can use the above property to form 1-level $r_{i,j}$'s by combining 2 adjacent $q_i$'s, then 2-level $r_{i,j}$'s by combining two adjacent 1-level $r_{i,j}$'s, etc. This yields a tree-structured circuit with a $\circ$ logic at every node; this ckt. gives us only those $r_i$'s for which $i = 2^m - 1$ for some $m$.

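## To obtain all $r_i$'s, the tree has to be augmented as shown below.

[Figure: augmented parallel-prefix tree for $n = 8$. Going up, the leaves $q_0, \ldots, q_7$ are combined into $r_{1,0}$, $r_{3,2}$, $r_{5,4}$, $r_{7,6}$, then into $r_{3,0}$ and $r_{7,4}$, then into $r_{7,0}$; coming down, the remaining prefixes such as $r_{2,0}$, $r_{4,0}$, $r_{5,0}$ and $r_{6,0}$ are formed. Legend: each node applies $\circ$ to its two inputs.]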

This circuit is called a parallel prefix circuit, and can be used to obtain the prefixes of any associative operation (like AND, OR, addition, multiplication, etc.).

- The delay is $O(\log n)$ $\circ$-logic steps: $\log n$ steps to go up the tree and $\log n$ steps to come down.
- Extra hardware used: $O(n)$ $\circ$-logic units.
- VLSI area reqd.: the height of the tree is $O(\log n)$ and the width of the tree is $O(n)$.
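The up-then-down structure of a parallel prefix computation can be mimicked with a short recursion. This is a software sketch, not the slides' circuit: `prefix_scan`, `compose` and `prefix_add` are hypothetical names, the operator takes the more significant operand first, and the recursion depth (about $\log_2 n$) plays the role of the tree height.

```python
def prefix_scan(q, op):
    """Return [q0, q1 o q0, ..., q_{n-1} o ... o q0] via a tree recursion."""
    n = len(q)
    if n == 1:
        return list(q)
    # Going up: combine adjacent pairs (all pairs are independent).
    pairs = [op(q[i + 1], q[i]) for i in range(0, n - 1, 2)]
    up = prefix_scan(pairs, op)          # prefixes ending at positions 1, 3, 5, ...
    # Coming down: fill in the even positions from their odd neighbours.
    r = [q[0]] + [None] * (n - 1)
    for i in range(1, n):
        r[i] = up[i // 2] if i % 2 else op(q[i], up[i // 2 - 1])
    return r


def compose(hi, lo):
    """Kill/propagate/generate operator; hi is the more significant stage."""
    return lo if hi == "p" else hi


def prefix_add(x, y, n, c0=0):
    """n-bit addition whose carries come from the prefix scan."""
    bits = [((x >> i) & 1, (y >> i) & 1) for i in range(n)]
    q = ["g" if a & b else "p" if a ^ b else "k" for a, b in bits]
    r = prefix_scan(q, compose)
    c = [c0] + [{"k": 0, "g": 1, "p": c0}[s] for s in r]
    total = sum((a ^ b ^ c[i]) << i for i, (a, b) in enumerate(bits))
    return total, c[n]


print(prefix_add(45, 27, 6))
```

Because `prefix_scan` only relies on associativity, the same function works for any associative operation, for instance running sums with `lambda hi, lo: lo + hi`.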

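## The parallel-prefix CLA adder (contd.)

Another VLSI implementation of a parallel prefix tree:

[Figure: alternative layout of the parallel prefix tree, with inputs $q_0, q_1, q_2, \ldots, q_{n-1}$, intermediate values of the form $r_{i+1,i}$, and outputs up to $r_{n-1}$.]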

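[Figure: 16-bit parallel-prefix CLA adder. Each FA takes $X_i$, $Y_i$ for $i = 0, \ldots, 15$ and sends its 2-bit $q_i$ to the prefix circuit, which also takes $C_{in}$ and returns the carries $C_1, C_2, \ldots, C_{15}$ and $C_{out}$; each FA then outputs its sum bit $S$.]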

## COMPUTER ARITHMETIC: SUBTRACTION

Subtraction can be done using an adder, since $X - Y = X + (-Y)$. This means that $Y$, which we assume is in 2's complement notation, has to be negated. A 2's complement number is negated by complementing it and adding a 1, i.e., $-Y = \overline{Y} + 1$. The augmentation to an adder to perform subtraction is shown below:

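[Figure: $n$-bit adder augmented for subtraction. The $n$-bit input $Y$ is complemented before entering the adder and $C_{in}$ is set to 1, so that the adder computes $X + \overline{Y} + 1$; the outputs are the $n$-bit result and $C_{out}$.]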
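Subtraction through an adder can be sketched in a few lines. The helper names `subtract` and `to_signed` are hypothetical; the adder itself is modelled with Python integer addition on $n$-bit patterns.

```python
def subtract(x, y, n):
    """Compute x - y on n-bit patterns as x + complement(y) + 1.

    Returns (n-bit result pattern, carry out of the adder).
    """
    mask = (1 << n) - 1
    total = (x & mask) + (~y & mask) + 1   # carry-in of 1 completes the negation
    return total & mask, total >> n


def to_signed(v, n):
    """Interpret an n-bit pattern as a 2's complement value."""
    return v - (1 << n) if v >> (n - 1) else v


diff, c_out = subtract(3, 5, 4)
print(diff, c_out, to_signed(diff, 4))
```

The same adder therefore serves both operations; only the complementing of $Y$ and the carry-in differ.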

## COMPUTER ARITHMETIC: MULTIPLIERS

Serial Multiplication

Add-and-shift (A&S) multiplication:

[Figure: manual add-and-shift multiplication example.]

If the additions are done one at a time, we obtain a sequence of partial products $P_0, P_1, \ldots, P_n$.

Each partial product $P_{i+1}$ is obtained as

$$P_{i+1} = P_i + x_i 2^i Y, \quad P_0 = 0.$$

Thus

$$P_n = \sum_{i=0}^{n-1} x_i 2^i Y = X \cdot Y,$$

where $X = x_{n-1} \ldots x_1 x_0$ is the multiplier and $Y$ the multiplicand.

## A&S multiplication (contd.):

The same effect as shifting the multiplicand left ($2^i Y$) can be achieved by keeping the multiplicand fixed at the left-most position and shifting the partial product right.

[Figure: the manual example redone by shifting the partial product right.]

In this case, the partial products obtained are:

$$P'_{i+1} = \left(P'_i + x_i 2^n Y\right) 2^{-1}, \quad P'_0 = 0, \quad \text{so that } P'_i = 2^{n-i} P_i.$$

However, $P_n$ is the same:

$$P'_n = 2^{n-n} P_n = P_n.$$

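## A&S multiplication (contd.): Hardware:

[Figure: add-and-shift multiplication hardware for unsigned numbers. A 16-bit adder adds the 16-bit accumulator AC and the 16-bit multiplicand register M (multiplicand $Y$), gated by an AND with the LSB of Q; register Q holds the 16-bit multiplier $X$. The adder's $C_{out}$ is held in a 1-bit register and is shifted into AC when the $C_{out}$-AC-Q combination is shifted right.]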

## Algorithm:

Initialize AC = 0; Q = Multiplier; M = Multiplicand. Do the following steps $n$ times:

1. If LSB of Q is 1 then AC = AC + M else AC = AC;
2. Shift the $C_{out}$-AC-Q register combination right by 1 bit.

Final product is in the AC-Q register.

NOTE: Overflows are tolerated in the additions, since $C_{out}$ is shifted into AC when right shifting.
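The register-level algorithm can be simulated directly. The sketch below is an illustration with a hypothetical function name; `ac`, `q` and `m` model the AC, Q and M registers as $n$-bit integers, and `c_out` models the carry flip-flop.

```python
def add_and_shift_multiply(x, y, n):
    """Unsigned n-bit by n-bit multiplication; returns the 2n-bit product."""
    mask = (1 << n) - 1
    ac, q, m = 0, x & mask, y & mask
    for _ in range(n):
        c_out = 0
        if q & 1:                                # LSB of Q selects the add
            total = ac + m
            ac, c_out = total & mask, total >> n
        # Shift the C_out-AC-Q combination right by one bit.
        q = (q >> 1) | ((ac & 1) << (n - 1))
        ac = (ac >> 1) | (c_out << (n - 1))
    return (ac << n) | q


print(add_and_shift_multiply(15, 15, 4))
```

Shifting `c_out` into the top of AC is what makes a temporary overflow of the $n$-bit addition harmless.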

## A&S 2's complement multiplication

Assumption: Both $X$ and $Y$ are in their 2's complement representation.

Method 1:

1. If the multiplier $X$ is -ve, get its 2's complement so that it becomes positive, i.e., $X \leftarrow -X$.
2. If the multiplicand $Y$ is -ve, get its 2's complement so that it becomes positive, i.e., $Y \leftarrow -Y$.
3. Multiply $X \times Y$.
4. If exactly one of $X$ and $Y$ was negative, get the product's 2's complement so that it becomes negative, i.e., $XY \leftarrow -XY$.

Disadvantage: Preprocessing and postprocessing can take up to 4 clock cycles (ccs).
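Method 1 amounts to a sign fix-up around an unsigned multiplier. In the sketch below the function names are hypothetical, and the unsigned multiplication is replaced by Python's `*` as a stand-in for the add-and-shift hardware.

```python
def signed_multiply_method1(x, y, n):
    """Multiply two n-bit 2's complement patterns; return the 2n-bit pattern."""
    mask = (1 << n) - 1
    sign_bit = 1 << (n - 1)
    neg_x, neg_y = bool(x & sign_bit), bool(y & sign_bit)
    if neg_x:
        x = (~x + 1) & mask          # preprocessing: negate the multiplier
    if neg_y:
        y = (~y + 1) & mask          # preprocessing: negate the multiplicand
    p = x * y                        # stand-in for the unsigned A&S multiplier
    if neg_x != neg_y:
        p = (~p + 1) & ((1 << 2 * n) - 1)   # postprocessing: negate the product
    return p


def to_signed(v, bits):
    """Interpret a bit pattern of the given width as a 2's complement value."""
    return v - (1 << bits) if v >> (bits - 1) else v


print(to_signed(signed_multiply_method1(0b1101, 0b0101, 4), 8))   # (-3) * 5
```

The up to three negations are the pre- and postprocessing steps that cost the extra clock cycles.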

## A&S 2's complement multiplication (contd.):

Method 2: When the multiplier $X$ is +ve, perform add-and-shift, taking care to do the following when each partial product is shifted right:

1. When there is no overflow in the addition (recall the condition for overflow for 2's complement addition), an arithmetic right shift of register AC.Q is performed without shifting $C_{out}$ into the MSB of AC.
2. Arithmetic right shift: the MSB is sign-extended, i.e., if the MSB (sign bit) is 1, a 1 is shifted into the MSB of AC, otherwise a 0 is shifted in.
3. If there is an overflow, then as in the unsigned case, shift $C_{out}$ into the MSB of AC when shifting AC.Q right. This works because in this case the $(n+1)$-bit output of the adder, where $C_{out}$ is the MSB, is the exact 2's complement representation of the sum. Check this by sign extending the inputs to $n+1$ bits and computing the sum: the output will be the same as for the $n$-bit inputs, but without overflow out of the $(n+1)$th bit.

## A&S 2's complement multiplication (contd.):

Method 2 (contd.): If the multiplier $X$ is negative, perform the first $n-1$ additions as explained above, and then subtract $Y$ (at the weight $2^{n-1}$ of the sign bit) as the final step. This works because the value of a 2's complement number is given by

$$X = -x_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} x_i 2^i.$$

When $X$ is negative, $x_{n-1}$ is 1 and thus

$$X \cdot Y = \sum_{i=0}^{n-2} x_i 2^i Y - 2^{n-1} Y.$$

Note that multiplication is performed on the basis:

$$X \cdot Y = -|X| \cdot Y \quad \text{for negative } X.$$

The magnitude of a negative $X$ is given by

$$|X| = -X = 2^{n-1} - \sum_{i=0}^{n-2} x_i 2^i.$$

Thus

$$X \cdot Y = -|X| \cdot Y = \sum_{i=0}^{n-2} x_i 2^i Y - 2^{n-1} Y,$$

which is what we are doing.
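Method 2 can be simulated at register level as follows. This is a sketch under stated assumptions, with hypothetical names: `add_n` models the $n$-bit adder together with its overflow detector, the final subtraction is done as AC + complement(M) + 1, and the bit shifted into the MSB of AC is $C_{out}$ on overflow and the sign bit of AC otherwise.

```python
def add_n(a, b, c_in, n):
    """n-bit add; returns (sum bits, carry out, 2's complement overflow)."""
    mask = (1 << n) - 1
    low_mask = mask >> 1
    carry_into_msb = ((a & low_mask) + (b & low_mask) + c_in) >> (n - 1)
    total = a + b + c_in
    c_out = total >> n
    return total & mask, c_out, carry_into_msb != c_out


def signed_multiply_method2(x, y, n):
    """Multiply two n-bit 2's complement patterns; return the 2n-bit pattern."""
    mask = (1 << n) - 1
    ac, q, m = 0, x & mask, y & mask
    for i in range(n):
        c_out, ovfl = 0, False
        if q & 1:
            if i == n - 1:                       # sign bit of X: subtract M
                ac, c_out, ovfl = add_n(ac, ~m & mask, 1, n)
            else:
                ac, c_out, ovfl = add_n(ac, m, 0, n)
        shift_in = c_out if ovfl else ac >> (n - 1)
        q = (q >> 1) | ((ac & 1) << (n - 1))
        ac = (ac >> 1) | (shift_in << (n - 1))
    return (ac << n) | q


def to_signed(v, bits):
    """Interpret a bit pattern of the given width as a 2's complement value."""
    return v - (1 << bits) if v >> (bits - 1) else v


print(to_signed(signed_multiply_method2(0b1101, 0b0101, 4), 8))   # (-3) * 5
```

Unlike Method 1, no separate negation passes are needed; the sign handling is folded into the shift and into the last step.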

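## A&S 2's complement multiplication (contd.): Method 2 (contd.): Hardware:

[Figure: hardware for 2's complement add-and-shift multiplication. As in the unsigned case, a 16-bit adder adds the accumulator AC and the multiplicand register M (multiplicand $Y$), gated by the LSB of register Q (multiplier $X$). An overflow detector compares $C_{15}$ with $C_{out}$, and a logic block selects the bit shifted into AC: if ovfl then output = $C_{out}$ else output = AC[0].]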

## Speeding Up Serial Multiplication: Booth's Algorithm

Idea: Consider the following substring of $X$, where the run of four 1s occupies bits $i$ to $i+3$:

$$0\,1\,1\,1\,1\,0 \;=\; 2^{i+3} + 2^{i+2} + 2^{i+1} + 2^{i} \;=\; 2^{i+4} - 2^{i}$$

Thus instead of adding and shifting 4 times (corresponding to the string of 4 1s in 011110), we can subtract ($-2^i Y$) when we see the 1st 1 coming after a 0 in $X$, i.e., we detect the 2-bit substring 10 in the last 2 bits of the current $X$, just shift 4 times and then add ($+2^{i+4} Y$) when we see that the current string of 1s in $X$ has ended, i.e., we detect the 2-bit substring 01 in the last 2 bits of the current $X$. This saves us two adds.

Thus when the multiplier contains long (greater than length 2) strings of 1s, Booth's multiplication is faster.

## Booth's Algorithm (contd.)

Booth's multiplication also takes care of a negative multiplier $X$ automatically. Consider, for example, the following $X$ with $n = 6$:

$$X = 111100 = 1000000 - 000100 = 2^6 - 2^2$$

Since the multiplication algorithm contains $n$ steps ($n = 6$ in the above example), the 1000000 is ignored and we end up subtracting 4 times the multiplicand: exactly the right answer!

## Booth's Algorithm (contd.)

Booth's algorithm is described by the following table for iteration $i$:

| Bit $x_i$ (current) | Bit $x_{i-1}$ (prev.) | Explanation | Action | Recoded digit |
|---|---|---|---|---|
| 1 | 0 | Beginning of a run of 1s | Subtract $Y$ | $-1$ |
| 1 | 1 | Middle of a run of 1s | Only shift | 0 |
| 0 | 1 | End of a run of 1s | Add $Y$ | 1 |
| 0 | 0 | Middle of a run of 0s | Only shift | 0 |

Note:

(1) For unsigned multiplication, we need to pad the multiplier with mythical 0s on both sides (the padding to the right of the LSB is required to start off the process).

(2) For 2's complement multiplication, we need to pad the multiplier with a mythical 0 only to the right of its LSB. This works because, for a negative 2's complement multiplier, the last run of 1s is 11...1, where the leftmost 1 is the sign bit (bit $n-1$). Suppose the rightmost 1 of this run is bit $j$; then the value of this sequence is $-2^{n-1} + \sum_{i=j}^{n-2} 2^i = -2^j$, which is exactly the value we allotted to this sequence when we subtracted $2^j Y$ at the $j$th bit position at the beginning of this last run of 1s. Further, if $V$ is the value we allotted to the rest of the multiplier before the last run of 1s, then the final value we give to the multiplier is $V - 2^j$, which is its correct value in 2's complement: $-2^{n-1} + \sum_{i=j}^{n-2} 2^i + V = V - 2^j$.

(3) The recoded digit $x_{i-1} - x_i$ is the $i$th digit of the Booth recoding of $X$.

(4) A recoded digit of 0 means no arithmetic operation, 1 means add $Y$, and $-1$ means subtract $Y$.

Hardware: Exercise
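Booth recoding and the resulting multiplication can be sketched as follows. The names `booth_recode` and `booth_multiply` are hypothetical; the multiplier is passed as an $n$-bit 2's complement bit pattern, the multiplicand as an ordinary Python integer, and the shifting of registers is replaced by explicit weights $2^i$.

```python
def booth_recode(x, n):
    """Booth digits, LSB first: digit i is x[i-1] - x[i], with x[-1] = 0."""
    digits, prev = [], 0
    for i in range(n):
        cur = (x >> i) & 1
        digits.append(prev - cur)   # 10 -> -1, 01 -> +1, 00 and 11 -> 0
        prev = cur
    return digits


def booth_multiply(x, y, n):
    """Multiply the n-bit 2's complement pattern x by the integer y."""
    product = 0
    for i, d in enumerate(booth_recode(x, n)):
        product += d * y * (1 << i)  # add, subtract or skip at weight 2^i
    return product


print(booth_recode(0b011110, 6))
print(booth_multiply(0b111100, 5, 6))   # (-4) * 5
```

The first call shows a run of four 1s collapsing into one subtraction and one addition; the second shows a negative multiplier being handled without any special case.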

## Booth's Algorithm (contd.)

Problem: When the multiplier contains long strings (say, of length $k$) of alternating 1s and 0s ($1010 \ldots 10$), then we perform $k$ additions and subtractions using Booth's algorithm, compared to only $k/2$ additions using regular add-and-shift.

Solution: Look at 3 consecutive bits of the multiplier instead of 2 to decide what to do. This is called the Modified Booth's Algorithm (MBA). This will enable us to treat isolated 1s and 0s differently from runs of 1s and 0s.

## Modified Booth's Algorithm

Thus when we see a 010, we add $2^i Y$ corresponding to the isolated 1. This is correct, since in BA we would have subtracted $2^i Y$ on detecting 10 and added $2^{i+1} Y$ on subsequently detecting 01. Thus, assuming the 1 is the $i$th bit, we would have added $2^{i+1} Y - 2^i Y = 2^i Y$; we are doing the RHS in MBA.

When we see a 101 and this isolated 0 is not following an isolated 1, then we subtract $2^i Y$ corresponding to the isolated 0. This is correct, since in BA we would have added $2^i Y$ on detecting 01 and then subtracted $2^{i+1} Y$ on detecting 10. This is equivalent to adding $2^i Y - 2^{i+1} Y = -2^i Y$; again, we are doing the RHS in MBA.

## Modified Booth's Algorithm (contd.)

After detecting an isolated 1 (010), it should be noted as such so that after shifting right, we don't misinterpret the bit pattern as ending a run of 1s. For example, consider two bit patterns showing 4 consecutive bits of $X$, where the 3 bits currently observed are marked:

$$0\,\underbrace{0\,1\,0} \qquad\qquad 0\,\underbrace{0\,1\,1}$$

The 1st has an isolated 1 and the 2nd a run of 1s. For the 1st case we have added $Y$ corresponding to the 1, and in the second case, we do not do anything as we are in the middle of a run of 1s (as in BA). After a right shift, we have the patterns

$$\underbrace{0\,0\,1}\,0 \qquad\qquad \underbrace{0\,0\,1}\,1$$

which are identical. In the first case, we need to have noted that the 1 corresponds to an isolated 1, so that we do not do anything. In the second case, this means end of a run of 1s, and so we need to add $Y$ (as in BA). These cases are distinguished by setting a latch to 0 when an isolated 1 is spotted, and to 1 if a run of 1s is spotted.

Thus actually the least significant of the 3 bits that we observe should be the latch and not the previous bit of $X$. Except when distinguishing between an isolated 1 (0) and the end of a run of 1s (0s), the latch will be the same as the previously observed bit of $X$. Thus after a right shift, the above 3-bit patterns, with the latch as the rightmost bit, will be

$$0\,0\,0 \qquad\qquad 0\,0\,1$$

## Modified Booth's Algorithm (contd.)

Similarly, we need to distinguish between an isolated 0 and the end of a run of 0s. In the former case, the latch is set to 1, and to 0 in the latter case. Again, consider two bit patterns showing 4 consecutive bits of $X$, where the 3 bits currently observed are marked:

$$1\,\underbrace{1\,0\,1} \qquad\qquad 1\,\underbrace{1\,0\,0}$$

The 1st has an isolated 0 in its 2nd bit (from the right), for which we subtracted $Y$, and the 2nd pattern has a 0 in its 2nd bit that ends a run of 0s. Thus after a right shift, the above 3-bit patterns, with the latch as the rightmost bit, will not be identical, but will be

$$1\,1\,1 \qquad\qquad 1\,1\,0$$

In the first case, we correctly do nothing corresponding to the middle bit (we already subtracted $Y$ corresponding to the isolated 0), and in the second we subtract $Y$, since the middle bit begins a run of 1s which has not yet been accounted for.

## Modified Booth's Algorithm (contd.)

The rightmost bit in the 3 bits that we are looking at is actually the latch, which is initialized to 0. The second bit is $x_i$ (in the $i$th iteration), and the leftmost bit is $x_{i+1}$. Note that except in the isolated 1/0 case, the latch equals $x_{i-1}$ (as in BA); otherwise it equals $\overline{x_{i-1}}$.

## Modified Booth's Algorithm (contd.)

We thus have the following Modified Booth's Algorithm described for iteration $i$:

| Bit $x_{i+1}$ (next) | Bit $x_i$ (current) | Latch | Explanation | Recoded digit | New latch |
|---|---|---|---|---|---|
| 0 | 0 | 0 | Middle of a run of 0s | 0 | 0 |
| 0 | 1 | 0 | Isolated 1 | 1 | 0 |
| 1 | 0 | 0 | Isolated 0 following an isolated 1 OR Middle of a run of 0s | 0 | 0 |
| 1 | 1 | 0 | Begins a run of 1s | $-1$ | 1 |
| 0 | 0 | 1 | Begins a run of 0s | 1 | 0 |
| 0 | 1 | 1 | Middle of a run of 1s | 0 | 1 |
| 1 | 0 | 1 | Isolated 0 following a run of 1s | $-1$ | 1 |
| 1 | 1 | 1 | Middle of a run of 1s | 0 | 1 |

NOTE:

(1) The multiplier needs to be padded by mythical 0s on both sides for unsigned and 2's complement multiplication.

(2) The recoded digit is the $i$th digit of the Modified Booth Recoding of $X$: 0 means only shift, 1 means add $Y$, and $-1$ means subtract $Y$.

(3) This signed-digit encoding has $2n/3$ 0s on the average, as opposed to $n/2$ in the regular binary code. Thus fewer arithmetic operations are required on the average using MBA for multiplication.
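The three-bit rule with a latch can be sketched as a table lookup. This is a minimal illustration restricted to unsigned multipliers, where zero padding on the left is unambiguous; the names `MBA_TABLE` and `mba_recode` are hypothetical, and one extra iteration is run so that the left padding is consumed.

```python
# (next bit, current bit, latch) -> (recoded digit, new latch)
MBA_TABLE = {
    (0, 0, 0): (0, 0),    # middle of a run of 0s
    (0, 1, 0): (1, 0),    # isolated 1
    (1, 0, 0): (0, 0),    # isolated 0 after an isolated 1, or run of 0s
    (1, 1, 0): (-1, 1),   # begins a run of 1s
    (0, 0, 1): (1, 0),    # begins a run of 0s
    (0, 1, 1): (0, 1),    # middle of a run of 1s
    (1, 0, 1): (-1, 1),   # isolated 0 following a run of 1s
    (1, 1, 1): (0, 1),    # middle of a run of 1s
}


def mba_recode(x, n):
    """Signed digits (LSB first) of the unsigned n-bit value x, 0 <= x < 2**n."""
    digits, latch = [], 0
    for i in range(n + 1):            # the extra step handles the left padding
        x_cur = (x >> i) & 1
        x_next = (x >> (i + 1)) & 1
        digit, latch = MBA_TABLE[(x_next, x_cur, latch)]
        digits.append(digit)
    return digits


print(mba_recode(0b0101, 4))   # alternating bits: only plain additions
print(mba_recode(0b0111, 4))   # a run of 1s: one subtraction, one addition
```

In every row of the table the digit plus twice the new latch equals the current bit plus the old latch, which is why the recoded digits always add up to the original value.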