Professional Documents
Culture Documents
Design Tradeoffs
vs.
C
tCLK=1/fCLK C discharges and
then recharges:
dVOUT dV
I =C ⇒ P = C OUT VOUT
Power dissipated dt dt
to discharge C: Power dissipated to recharge C:
tCLK /2 tCLK
PNFET = fCLK ∫ 0
iNFET VOUT dt PPFET = fCLK ∫ i
tCLK /2 PFET
VOUT dt
tCLK /2 dVOUT tCLK dVOUT
= fCLK C ∫
0
−C
dt
VOUT dt = fCLK ∫ tCLK /2
C
dt
VOUT dt
0 VDD
= fCLK C ∫ −VOUT dVOUT = fCLK C ∫ VOUT dVOUT
VDD 0
2
2
VDD VDD
= fCLK C = fCLK C
2 2
6.004 Computation Structures L08: Design Tradeoffs, Slide #4
CMOS Dynamic Power Dissipation
VIN moves from VOUT moves from
L to H to L H to L to H
VIN VOUT
C
tCLK=1/fCLK C discharges and
then recharges:
Power dissipated
= f C VDD2 per node “Back of the envelope”: trends
= f N C VDD2 per chip f ~ 1GHz = 1e9 cycles/sec
where N ~ 1e8 changing nodes/cycle
f = frequency of charge/discharge C ~ 1fF = 1e-15 farads/node
V ~ 1V
N = number of changing nodes/chip
⇒ 100 Watts
What if we could
eliminate unnecessary
transitions? When the boole
output of a CMOS gate ALUFN[3:0]
doesn’t change, the gate
doesn’t dissipate much ALU[31:0]
power! shift
ALUFN[1:0]
Z
V
cmp
N
ALUFN[2:1]
ALUFN[5:4]
DQ shift
ALUFN[1:0]
ALUFN[5:4] == 11 G Must computation
consume energy?
Z See §6.5 of Notes
Variations: Dynamically V
cmp
adjust tCLK or VDD (either N
ALUFN[2:1]
overall or in specific
regions) to accommodate ALUFN[5:4]
workload.
Sn-1 Sn-2 S2 S1 S0
CI to CO CIN-1 to SN-1
Θ(N) is read “order N” and tells us that the latency of our adder
grows in proportion to the number of bits in the operands.
6.004 Computation Structures L08: Design Tradeoffs, Slide #9
Performance/Cost Analysis
“Order Of” notation:
"g(n) is of order f(n)" g(n) = Θ(f(n))
g(n)=Θ(f(n)) if there exist C2 ≥ C1 > 0
such that for all but finitely many
integral n ≥ 0
C1 ⋅ f (n) ≤ g(n) ≤ C2 ⋅ f (n)
S[31:16] S[15:0]
S[4:0]
COUT SUM COUT SUM COUT SUM
Select input is
COUT,S[31:21] S[20:12] S[11:5]
heavily loaded,
so buffer for
Design goal: have these
speed.
two sets of signals arrive
simultaneously at each
carry-select mux
H A B L A B
CO FA CI CO FA CI
GH PH P/G generation G P S G P S
GL
GP PL
GHL PHL
GH PH
GL
1st level of GP PL
lookahead GHL PHL
Hierarchical building block
GH PH GL GH PH GL GH PH GL GH PH GL
GP GP GP GP
GHL PHL PL GHL PHL PL GHL PHL PL GHL PHL PL
Θ(log N) GP
GHL PHL PL
GP
GHL PHL PL
G7-0 P7-0
This will let us to quickly compute the carry-ins for each FA!
G6 c G4 c G2 c G0 c
GL H GL H GL H GL H
cL cL cL cL
P6 PL C
cin
P4 PL C
cin
P2 PL C
cin
P0 PL C
cin
C6 C4 C2 C0
G5-4 c G1-0 cH
GL H GL
Θ(log N) P cL C cL
5-4 PL C
cin
P1-0 PL cin Notice that the
C4 C6 = G5-4+P5-4 C4 C0 inputs on the
G3-0 c left of each C
GL H blocks are the
cL
P3-0 PL C
cin same as the
inputs on the
C4 = G3-0+P3-0 C0 right of each
C0 corresponding
GP block.
6.004 Computation Structures L08: Design Tradeoffs, Slide #17
8-bit CLA (complete)
A7 B7 A6 B6 A5 B5 A4 B4 A3 B3 A2 B2 A1 B1 A0 B0
A B A B A B A B A B A B A B A B
CO FA CI CO FA CI CO FA CI CO FA CI CO FA CI CO FA CI CO FA CI CO FA CI
G P S G P S G P S G P S G P S G P S G P S G P S
GH PH Cj GH PH Cj GH PH Cj GH PH Cj
GL GL GL GL
GP/CPL GP/CPL GP/CPL GP/CPL
C C C C
GHL PHL Cii GHL PHL Cii GHL PHL Cii GHL PHL Cii
GH PH Cj GH PH Cj
GL GL
GP/CPL GP/CPL
C C
GHL PHL Cii GHL PHL Cii
GH PH Cj
tPD = Θ(log N)
GL
GP/CPL
C
GHL PHL Cii
C0 AB
Notice that we don’t
need the carry-out
GH PH CH output of the adder any
GH PH cH more.
GL GL
GP PL + GL
PL C
ci
cL = GP/C PL
CL CI
GHL PHL GHL PHL Cin
CO
G PS
To learn more, look up Kogge-Stone adders on Wikipedia.
Easy part: forming partial products (just an AND gate since BI is either 0 or 1)
Hard part: adding M N-bit partial products
FA
Combinational Multiplier
a3 a2 a1 a0
CI b0
CO a3 a2 a1 a0
b1
S
AB HA FA FA HA
HA
a3 a2 a1 a0
b2
CO S
FA FA FA HA
a3 a2 a1 a0
b3
FA FA FA HA
z7 z6 z5 z4 z3 z2 z1 z0
Latency = Θ(N)
Throughput = Θ(1/N)
Hardware = Θ(N2)
6.004 Computation Structures L08: Design Tradeoffs, Slide #21
2’s Complement Multiplication
Step 1: two’s complement operands so high Step 3: add the ones to the partial products
order bit is –2N-1. Must sign extend partial and propagate the carries. All the sign
products and subtract the last one extension bits go away!
Step 2: don’t want all those extra additions, so Step 4: finish computing the constants…
add a carefully chosen constant, remembering to
subtract it at the end. Convert subtraction into add
of (complement + 1).
X3Y0 X2Y0 X1Y0 X0Y0
X3Y0 X3Y0 X3Y0 X3Y0 X3Y0 X2Y0 X1Y0 X0Y0 + X3Y1 X2Y1 X1Y1 X0Y1
+ 1 + X3Y2 X2Y2 X1Y2 X0Y2
+ X3Y1 X3Y1 X3Y1 X3Y1 X2Y1 X1Y1 X0Y1 + X3Y3 X2Y3 X1Y3 X0Y3
+ 1 + 1 1
+ X3Y2 X3Y2 X3Y2 X2Y2 X1Y2 X0Y2
+ 1
+ X3Y3 X3Y3 X2Y3 X1Y3 X0Y3 –B = ~B + 1 Result: multiplying 2’s complement operands
+ 1
takes just about same amount of hardware as
+ 1
- 1 1 1 1 multiplying unsigned operands!
6.004 Computation Structures L08: Design Tradeoffs, Slide #22
2’s Complement Multiplier
a3 a2 a1 a0
b0
a3 a2 a1 a0
1 b1
FA FA FA HA
a3 a2 a1 a0
b2
FA FA FA HA
a3 a2 a1 a0
1 b3
HA FA FA FA HA
z7 z6 z5 z4 z3 z2 z1 z0
HA FA FA HA
a3 a2 a1 a0
b2
FA FA FA HA
a3 a2 a1 a0
b3
FA FA FA HA
z7 z6 z5 z4 z3 z2 z1 z0
FA FA FA
HA HA HA
FA HA
Latency = Θ(N)
HA
Throughput = Θ(1)
Hardware = Θ(N2)
z7 z6 z5 z4 z3 z2 z1 z0
SN-1…S0
Init: P←0, load A&B
SN LSB
Repeat M times {
P B NC A P ← P + (BLSB==1 ? A : 0)
M bits
N shift SN,P,B right one bit
Carry-save 1 N }
+ xN