Lecture # 10

Dr. Rehan Hafiz

<rehan.hafiz@seecs.edu.pk>

Course Website for ADSD Fall 2011

http://lms.nust.edu.pk/
Acknowledgement: Material from the following sources has been consulted/used in these slides:
1. [CIL] Advanced Digital Design with the Verilog HDL, M. D. Ciletti
2. [SHO] Digital Design of Signal Processing System, Dr. Shoab A. Khan
3. [STV] Advanced FPGA Design, Steve Kilts
4. Ercegovac's book: "Digital Arithmetic", 2004
5. Dr. Shoab A. Khan's CASE Lectures on Advanced Digital System Design
Material/slides from this lecture CAN be used with the following citation: Dr. Rehan Hafiz: Advanced Digital System Design 2010

Lectures: Tuesday @ 5:30-6:20 pm, Friday @ 6:30-7:20 pm
Contact: By appointment/Email
Office: VISpro Lab above SEECS Library

Lecture Overview

Last Lecture

• Signed/Unsigned Number Representation
• Sign Extension, Truncation, Fixed Point Addition

This Lecture


Logic Equations: Half Adder (HA)
C = x • y
S = x ⊕ y

Full adder delays:
tc = txor + tand + tor
ts = 2txor
Critical path: max(tc, ts)
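The logic equations and delay terms above can be checked with a small bit-level model. This is a minimal Python sketch, not the slides' own code: it assumes the full adder is built from two half adders and an OR gate, which is the structure that matches tc = txor + tand + tor and ts = 2txor. The function names are illustrative.

```python
def half_adder(x, y):
    """Return (carry, sum) for two input bits: C = x AND y, S = x XOR y."""
    return x & y, x ^ y


def full_adder(x, y, cin):
    """Full adder assumed to be two half adders plus an OR gate."""
    c1, s1 = half_adder(x, y)    # first XOR (sum) and AND (carry)
    c2, s = half_adder(s1, cin)  # second XOR gives the sum: 2 XOR delays
    return c1 | c2, s            # carry path: XOR -> AND -> OR


# Exhaustive check against ordinary integer addition.
for x in (0, 1):
    for y in (0, 1):
        for cin in (0, 1):
            cout, s = full_adder(x, y, cin)
            assert 2 * cout + s == x + y + cin
```

The longest carry path in this model goes through one XOR, one AND and the OR, while the sum passes through two XOR gates, which is where the two delay expressions come from.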


Delay:

assign {cout, sum} = a + b + c_in;
[Figure: 6-bit ripple carry adder. Six full adders (FA) are chained from cin to cout; stage i takes a[i] and b[i] and the carry from the previous stage, and produces S[i] and the carry to the next stage, for i = 0 to 5.]

RCA Characteristics
• Implements the conventional way of adding two numbers
• Slowest parallel adder, but takes minimum area
• N full adders are required to add two N-bit operands
• Delay grows linearly with word length: O(N)

[Figure: Carry delays for a 4-bit RCA]
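The linear carry chain of the RCA can be sketched as a loop in which each stage waits for the carry of the previous one. This is a behavioural Python model for illustration only; the function names and the default width of 6 bits are assumptions, not part of the slides.

```python
def full_adder(a, b, cin):
    """One full-adder stage; returns (cout, sum)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return cout, s


def ripple_carry_add(a, b, cin=0, n=6):
    """Add two n-bit unsigned integers bit by bit; returns (cout, sum)."""
    carry = cin
    total = 0
    for i in range(n):  # LSB first: stage i needs the carry from stage i-1
        carry, s = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return carry, total


print(ripple_carry_add(0b101101, 0b010011))  # 45 + 19 = 64 -> (1, 0) in 6 bits
```

The loop runs n times and each iteration depends on the previous carry, which mirrors the O(N) delay of the hardware chain.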

Optimization…..
So how can we optimize for
• Throughput
• Area
• Timing/Latency

Remember -- High Throughput Pipelining using the Delay Transfer Theorem

Remember -- Area Efficient/Reusing Resources

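[Figure: bit-serial adder. N-bit shift registers A and B (with a load input, clocked by clk) feed a single full adder (FA) one bit at a time; a flip-flop (FF) holds the carry between clock cycles; the 1-bit sum is shifted into shift register C (Reg C).]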
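The resource-sharing idea can be sketched as a cycle-by-cycle model of a bit-serial adder: one full adder is reused for N clock cycles, with a flip-flop carrying the carry bit between cycles. This is a Python sketch under the assumption that bits are shifted out LSB first and the sum is shifted into the MSB end of register C; the variable names are illustrative.

```python
def bit_serial_add(a, b, n):
    """Add two n-bit unsigned integers using one full adder over n clock cycles."""
    reg_a, reg_b = a, b  # shift registers A and B, loaded in parallel
    reg_c = 0            # shift register C collects the sum bits
    carry_ff = 0         # flip-flop holding the carry between cycles
    for _ in range(n):   # one iteration per clock edge
        x, y = reg_a & 1, reg_b & 1
        s = x ^ y ^ carry_ff
        carry_ff = (x & y) | (carry_ff & (x ^ y))
        reg_a >>= 1
        reg_b >>= 1
        reg_c = (reg_c >> 1) | (s << (n - 1))  # new sum bit enters at the MSB end
    return carry_ff, reg_c


print(bit_serial_add(5, 6, 4))   # (0, 11)
print(bit_serial_add(15, 1, 4))  # (1, 0)
```

Area is reduced to a single full adder and a flip-flop, at the cost of N clock cycles per addition.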

• Increases the latency
• Do we really need to wait for carries?
  • Pre-compute carries
  • OR we can at least start processing the data

Some observations - RCA
The ripple-carry adder introduces too much delay into a system. The longest path through the adder is from the inputs of the least significant full adder to the outputs of the most significant full adder. However, the process of summing the inputs at each bit position is relatively fast (a small two-level circuit suffices).

• Generate all incoming carries in advance
• Idea: a carry is either generated or propagated
• The carry at the ith location depends on the carry and inputs at the (i-1)th location, and not on the previous sum

Pi = ai ^ bi (propagate)
Gi = ai bi (generate)

Sum and Cout can be re-expressed in terms of generate/propagate (^ = xor):
Ci+1 = Gi + Pi Ci
Si = Ci ^ Pi
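The generate/propagate formulation can be checked with a few lines of Python. This is an illustrative sketch: `pg_add` is a hypothetical helper name, and the carries are still evaluated one after another here; only the formulation changes, not yet the delay.

```python
def pg_add(a, b, c0=0, n=4):
    """Return (carries, sum), where carries[i] is Ci from Ci+1 = Gi + Pi Ci."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]  # Pi = ai ^ bi
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]  # Gi = ai bi
    c = [c0]
    for i in range(n):
        c.append(g[i] | (p[i] & c[i]))                 # Ci+1 = Gi + Pi Ci
    s = sum((c[i] ^ p[i]) << i for i in range(n))      # Si = Ci ^ Pi
    return c, s


carries, s = pg_add(0b1011, 0b0110)
print(carries, s)  # [0, 0, 1, 1, 1] 1  (11 + 6 = 17 = carry-out 1, sum 0001)
```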

Parallel Look Ahead Generation of all carries
Ci+1 = Gi + Pi Ci
Si = Ci ^ Pi

Look-ahead carries:
C1 = G0 + P0C0
C2 = G1 + P1C1 = G1 + P1(G0 + P0C0) = G1 + P1G0 + P1P0C0
C3 = G2 + P2G1 + P2P1G0 + P2P1P0C0
C4 = G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0C0
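The expanded look-ahead equations can be written out directly, so that every carry depends only on the G, P and C0 signals and never on another carry. The following is a Python sketch of a 4-bit CLA for checking the equations; `cla4` is an illustrative name, and the model says nothing about actual gate timing.

```python
def cla4(a, b, c0=0):
    """4-bit carry look-ahead addition; returns (c4, sum)."""
    p0, p1, p2, p3 = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
    g0, g1, g2, g3 = [((a >> i) & (b >> i)) & 1 for i in range(4)]
    # Each carry is a two-level AND-OR function of G, P and c0 only.
    c1 = g0 | p0 & c0
    c2 = g1 | p1 & g0 | p1 & p0 & c0
    c3 = g2 | p2 & g1 | p2 & p1 & g0 | p2 & p1 & p0 & c0
    c4 = g3 | p3 & g2 | p3 & p2 & g1 | p3 & p2 & p1 & g0 | p3 & p2 & p1 & p0 & c0
    # Si = Ci ^ Pi, all sum bits available in parallel.
    s = (c0 ^ p0) | (c1 ^ p1) << 1 | (c2 ^ p2) << 2 | (c3 ^ p3) << 3
    return c4, s


# Exhaustive check against integer addition.
for a in range(16):
    for b in range(16):
        for c0 in (0, 1):
            c4, s = cla4(a, b, c0)
            assert c4 * 16 + s == a + b + c0
```

In Python `&` binds more tightly than `|`, so each line reads as a sum of products, just like the equations.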

1 gate delay:
Pi = ai ^ bi
Gi = ai bi

2 gate delays (with c0 = 0):
c1 = G0
c2 = G1 + P1c1
c3 = G2 + P2G1 + P2P1c1
c4 = G3 + P3G2 + P3P2G1 + P3P2P1c1

2 gate delays for a full adder

$ Please correct the gate notations.
* Gate delays assume 1 gate delay for an XOR gate.

Final Result
• Each of the carry equations can be implemented with two-level logic
• All inputs are now directly derived from data inputs and not from intermediate carries
• This allows computation of all sum outputs to proceed in parallel

• Maximum gate delay for the carry generation is only 3. The full adders introduce two more gate delays. The worst-case path is 5 gate delays (until the final sum bit is generated).

In general, the maximum fan-in/out of any gate in an n-bit CLA is n. Thus, the maximum fan-in of any gate in a 16-bit CLA is 16.

Fan IN/OUT Effects

Fundamentals of digital logic with Verilog design By Stephen D Brown

CLA
As n increases, fan-in/fan-out becomes an issue.
Options:
• Ripple the carry across blocks (groups) of CLA adders of limited size
• Or we may again pre-compute, in parallel, the group carry of each block

• A 16-bit GCLA is composed of four 4-bit CLAs, with additional logic that generates the carries between the four-bit groups.
GG0 = G3 + P3G2 + P3P2G1 + P3P2P1G0
GP0 = P3P2P1P0
c4 = GG0 + GP0c0

• No carries are required to generate Group G and Group P; we just need the G and P signals, which are available after a single XOR gate delay.
• Total delay = 3 gate delays for GG/GP.
• To generate the carries, just use Group G and Group P, with 2 more gate delays.

c8 = GG1 + GP1c4 = GG1 + GP1GG0 + GP1GP0c0
c12 = GG2 + GP2c8 = GG2 + GP2GG1 + GP2GP1GG0 + GP2GP1GP0c0
c16 = GG3 + GP3c12 = GG3 + GP3GG2 + GP3GP2GG1 + GP3GP2GP1GG0 + GP3GP2GP1GP0c0

• The first form of each equation, written in terms of the previous group carry (red on the slide), gives a ripple-based group CLA. The fully expanded form (black on the slide) gives a CLA-based GCLA.

• Each CLA has a longest path of 5 gate delays.

• In the GCLL section, GG and GP signals are generated in 3 gate delays; carry signals are generated in 2 more gate delays, resulting in 5 gate delays to generate the carry out of each GCLA group and 10 gate delays on the worst-case path (which ends at s15, not c16).
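The group equations can be exercised with a behavioural model of a 16-bit GCLA. This is a Python sketch with illustrative names (`group_pg`, `gcla16`). It uses the expanded, CLA-based form for the group carries. As a simplification, the carries inside each 4-bit group are evaluated with the recurrence Ci+1 = Gi + Pi Ci, not with the flattened two-level equations a real CLA block would use.

```python
def group_pg(a4, b4):
    """Return (GG, GP, p, g) for one 4-bit group."""
    p = [((a4 >> i) ^ (b4 >> i)) & 1 for i in range(4)]
    g = [((a4 >> i) & (b4 >> i)) & 1 for i in range(4)]
    gg = g[3] | p[3] & g[2] | p[3] & p[2] & g[1] | p[3] & p[2] & p[1] & g[0]
    gp = p[3] & p[2] & p[1] & p[0]
    return gg, gp, p, g


def gcla16(a, b, c0=0):
    """16-bit group carry look-ahead addition; returns (c16, sum)."""
    groups = [group_pg((a >> 4 * k) & 0xF, (b >> 4 * k) & 0xF) for k in range(4)]
    (gg0, gp0), (gg1, gp1), (gg2, gp2), (gg3, gp3) = [grp[:2] for grp in groups]
    # Group carries in expanded form: functions of GG, GP and c0 only.
    c4 = gg0 | gp0 & c0
    c8 = gg1 | gp1 & gg0 | gp1 & gp0 & c0
    c12 = gg2 | gp2 & gg1 | gp2 & gp1 & gg0 | gp2 & gp1 & gp0 & c0
    c16 = (gg3 | gp3 & gg2 | gp3 & gp2 & gg1 | gp3 & gp2 & gp1 & gg0
           | gp3 & gp2 & gp1 & gp0 & c0)
    total = 0
    for k, cin in enumerate((c0, c4, c8, c12)):
        _, _, p, g = groups[k]
        c = cin
        for i in range(4):             # simplification: recurrence inside the group
            total |= (c ^ p[i]) << (4 * k + i)
            c = g[i] | p[i] & c
    return c16, total


print(gcla16(0x1234, 0x4321))  # (0, 21845), i.e. sum = 0x5555
print(gcla16(0xFFFF, 0x0001))  # (1, 0)
```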

FAN in / FAN out

In general, the maximum fan-in of any gate in an n-bit CLA is n. Thus, the maximum fan-in of any gate in a 16-bit CLA is 16. In comparison, the maximum fan-in for a 16-bit GCLA is five (for generating c16). The fan-outs for both cases are the same as the fan-ins.

Carry Select Adder (CSA)

• Partition the adder into K groups
• Two values of the sum, for cin = 0 and cin = 1, are precomputed for each adder group
• The actual sum is selected with a 2-to-1 MUX driven by the carry of the previous group
• Allows computation of the possible results in parallel
• Requires internal carry propagation within each block, e.g. ripple

• Three partitions have been made, of 4 bits each
• The outputs of each 4-bit adder block are ready simultaneously, including the Cout of the first adder

[Figure: 12-bit carry select adder with three 4-bit groups. Each group has two 4-bit adders, one with Cin = 0 (outputs C0, S0) and one with Cin = 1 (outputs C1, S1). A 2-to-1 mux selects the group carry and a 4-bit 2-to-1 mux selects the group sum, both controlled by the incoming carry. Outputs: SUM[3-0], SUM[7-4], SUM[11-8] and Cout[3], Cout[7], Cout[11].]
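The selection scheme can be sketched in Python as follows. This is a behavioural model under stated assumptions: uniform groups, a ripple adder inside each group, and illustrative function and parameter names (`ripple_add`, `carry_select_add`, `group_bits`, `groups`). Both candidate results are computed for every group before the incoming carry is consulted.

```python
def ripple_add(a, b, cin, n):
    """n-bit ripple addition used inside each group; returns (cout, sum)."""
    carry, total = cin, 0
    for i in range(n):
        x, y = (a >> i) & 1, (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return carry, total


def carry_select_add(a, b, cin=0, group_bits=4, groups=3):
    """Carry select addition with uniform groups; returns (cout, sum)."""
    mask = (1 << group_bits) - 1
    carry, total = cin, 0
    for k in range(groups):
        ak = (a >> (k * group_bits)) & mask
        bk = (b >> (k * group_bits)) & mask
        c0, s0 = ripple_add(ak, bk, 0, group_bits)  # precomputed for cin = 0
        c1, s1 = ripple_add(ak, bk, 1, group_bits)  # precomputed for cin = 1
        carry, s = (c1, s1) if carry else (c0, s0)  # the 2-to-1 muxes
        total |= s << (k * group_bits)
    return carry, total


print(carry_select_add(0x123, 0x456))  # (0, 1401), i.e. sum = 0x579
print(carry_select_add(0xFFF, 0x001))  # (1, 0)
```

In hardware the two adders of every group run at the same time; the loop here only models the order in which the muxes settle.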

Non Uniform Group Carry Select Adder

Delay: Approx. 5RCA Delay + 2-to-1 Mux Delay

CSA: Example

Operands split into groups of 5, 4 and 3 bits (MSB to LSB), with carry in = 0:

a             = 11111 0111 100
b             = 11111 1010 011
sum (cin = 0) = 11110 0001 111   (group carry-outs: 1, 1, 0)
sum (cin = 1) = 11111 0010 000   (group carry-outs: 1, 1, 1)
selected sum  = 11111 0001 111   (cout = 1)
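The example can be reproduced with a short Python trace. This is a sketch under one assumption: the 12-bit operands are split into groups of 3, 4 and 5 bits counted from the LSB, which is consistent with the cin = 0 and cin = 1 rows of the example. The function name and the use of integer `+` as a stand-in for the two group adders are illustrative choices.

```python
def carry_select_trace(a, b, sizes, cin=0):
    """sizes lists group widths from LSB to MSB.

    Returns one row per group: (sum for cin=0, sum for cin=1, selected sum, carry out).
    """
    rows, carry, shift = [], cin, 0
    for n in sizes:
        mask = (1 << n) - 1
        ak, bk = (a >> shift) & mask, (b >> shift) & mask
        t0, t1 = ak + bk, ak + bk + 1        # stand-in for the two group adders
        c0, s0 = t0 >> n, t0 & mask
        c1, s1 = t1 >> n, t1 & mask
        carry, sel = (c1, s1) if carry else (c0, s0)
        rows.append((format(s0, f"0{n}b"), format(s1, f"0{n}b"),
                     format(sel, f"0{n}b"), carry))
        shift += n
    return rows


rows = carry_select_trace(0b111110111100, 0b111111010011, sizes=(3, 4, 5))
for row in rows:
    print(row)
# ('111', '000', '111', 0)
# ('0001', '0010', '0001', 1)
# ('11110', '11111', '11111', 1)
```

Reading the selected column from the last row to the first gives 11111 0001 111 with a final carry out of 1, the same as ordinary addition of the two operands.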


If we keep on reducing the number of bits per adder group, we reach the conditional sum adder.