
Chapter 4

Low-Power VLSI Design

Jin-Fu Li
Advanced Reliable Systems (ARES) Lab.
Department of Electrical Engineering
National Central University
Jhongli, Taiwan
Outline

Introduction
Low-Power Gate-Level Design
Low-Power Architecture-Level Design
Algorithmic-Level Power Reduction
RTL Techniques for Optimizing Power

National Central University EE4012VLSI Design 2


Introduction
Most SoC design teams now regard power as one of their top design concerns
Why low-power design?
Battery lifetime (especially for portable devices)
Reliability
Power consumption
Peak power
Average power


Overview of Power Consumption
Average power consumption
Dynamic power consumption
Short-circuit power consumption
Leakage power consumption
Static power consumption

Dynamic power dissipation during switching

[Figure: output load of a CMOS gate, made up of Cdrain, Cinterconnect, and Cinput]


Overview of Power Consumption
Generic representation of a CMOS logic gate for switching power calculation

[Figure: pMOS and nMOS networks driven by inputs VA and VB, with the output Vout loaded by Cdrain, Cinterconnect, and Cinput]

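\[
P_{avg} = \frac{1}{T}\left[\int_{0}^{T/2} V_{out}\left(-C_{load}\frac{dV_{out}}{dt}\right)dt + \int_{T/2}^{T}\left(V_{DD}-V_{out}\right)\left(C_{load}\frac{dV_{out}}{dt}\right)dt\right]
\]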
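
The two integrals can be evaluated by changing the integration variable from t to Vout. The sketch below assumes an idealized full-swing output that falls from VDD to 0 during the first half period and rises from 0 to VDD during the second; that waveform is an assumption made here for illustration.

```latex
\begin{aligned}
\int_{0}^{T/2} V_{out}\left(-C_{load}\frac{dV_{out}}{dt}\right)dt
  &= -C_{load}\int_{V_{DD}}^{0} V_{out}\,dV_{out}
   = \tfrac{1}{2}\,C_{load}V_{DD}^{2} \\
\int_{T/2}^{T}\left(V_{DD}-V_{out}\right)\left(C_{load}\frac{dV_{out}}{dt}\right)dt
  &= C_{load}\int_{0}^{V_{DD}}\left(V_{DD}-V_{out}\right)dV_{out}
   = \tfrac{1}{2}\,C_{load}V_{DD}^{2} \\
P_{avg} &= \frac{1}{T}\,C_{load}V_{DD}^{2}
\end{aligned}
```

Each half period dissipates half of the total, which is why the result does not depend on the transistor sizes or on the shape of the transition.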



Overview of Power Consumption
The average power consumption can be expressed as
\[
P_{avg} = \frac{1}{T}\,C_{load}V_{DD}^{2} = C_{load}V_{DD}^{2}f_{CLK}
\]
The node transition rate can be slower than the clock rate. To better represent this behavior, a node transition factor ($\alpha_T$) should be introduced:
\[
P_{avg} = \alpha_T\,C_{load}V_{DD}^{2}f_{CLK}
\]
The switching power expressions above are derived by taking into account only the output node load capacitance.



Overview of Power Consumption

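[Figure: a CMOS gate with inputs VA and VB, an internal node Vinternal with capacitance Cinternal, and an output node Vout with capacitance Cload]

The generalized expression for the average power dissipation can be rewritten as
\[
P_{avg} = \left(\sum_{i=1}^{\#\,\text{of nodes}} \alpha_{Ti}\,C_i\,V_i\right)V_{DD}\,f_{CLK}
\]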
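
As a rough numerical illustration of the generalized expression, the sketch below sums the per-node terms in Python. The function name `average_switching_power`, the node tuples, and every capacitance, voltage, and frequency value are illustrative assumptions, not data from the lecture.

```python
def average_switching_power(nodes, vdd, f_clk):
    """Evaluate P_avg = (sum_i alpha_Ti * C_i * V_i) * V_DD * f_CLK.

    nodes: iterable of (alpha_t, capacitance_farads, voltage_swing_volts).
    """
    switched_charge = sum(alpha * cap * swing for alpha, cap, swing in nodes)
    return switched_charge * vdd * f_clk


# Hypothetical gate: an output node with full swing and an internal node
# with reduced swing (all values are made up for illustration).
example_nodes = [
    (0.5, 20e-15, 1.2),  # output node
    (0.1, 5e-15, 0.9),   # internal node
]
p_avg = average_switching_power(example_nodes, vdd=1.2, f_clk=100e6)
print(f"{p_avg * 1e6:.3f} uW")
```

With a single node that swings the full VDD, the sum collapses to the single-node form with one transition factor, one load capacitance, and VDD squared.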
Gate-Level Design: Technology Mapping
The objective of logic minimization is to reduce the Boolean function.
For low-power design, the signal switching activity is minimized by restructuring a logic circuit.
The power minimization is constrained by the delay; however, the area may increase.
During this phase of logic minimization, the function to be minimized is
\[
\sum_{i} P_i\,(1 - P_i)\,C_i
\]
where $P_i$ is the signal probability of node i and $C_i$ is its capacitance.


Gate-Level Design: Technology Mapping
The first step in technology mapping is to decompose each logic function into two-input gates.
The objective of this decomposition is to minimize the total power dissipation by reducing the total switching activity.

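[Figure: two decompositions of a four-input gate with input signal probabilities A = 0.2, B = 0.2, C = 0.5, D = 0.5]
Chain decomposition, ((A, B), C), D: node switching activities 0.0384, 0.0196, 0.0099
Tree decomposition, (A, B), (C, D): node switching activities 0.0384, 0.1875, 0.0099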
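
The node activities in the figure are consistent with two-input AND gates, independent inputs, and an activity of P(1 - P) per node. The Python sketch below reproduces them under those assumptions, with every node capacitance taken as 1; the function names are hypothetical.

```python
def activity(p):
    """Switching activity P(1 - P) of a node with signal probability p."""
    return p * (1.0 - p)


def chain_activity(probs):
    """Total activity of the chain ((x0 AND x1) AND x2) AND ..."""
    acc = probs[0]
    total = 0.0
    for p in probs[1:]:
        acc = acc * p          # AND of independent signals
        total += activity(acc)
    return total


def tree_activity(probs):
    """Total activity of a pairwise (balanced) AND decomposition."""
    level = list(probs)
    total = 0.0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            out = level[i] * level[i + 1]
            total += activity(out)
            nxt.append(out)
        if len(level) % 2:     # odd signal passes through to the next level
            nxt.append(level[-1])
        level = nxt
    return total


probs = [0.2, 0.2, 0.5, 0.5]   # A, B, C, D from the figure
chain_total = round(chain_activity(probs), 4)
tree_total = round(tree_activity(probs), 4)
print(chain_total, tree_total)  # 0.0679 0.2358
```

For these input probabilities the chain has the lower total activity, because the tree creates a node (C AND D) with probability 0.25 and therefore a high activity of 0.1875.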
Gate-Level Design: Phase Assignment

[Figure: two versions of a circuit with signals A, B, and C, each with its high-activity node marked]


Gate-Level Design: Pin Swapping

[Figure: inputs a, b, c, and d connected to the pins of a gate in two different orders, with a "Switching activity" label]


Gate-Level Design: Glitching Power
Glitches: spurious transitions due to imbalanced path delays
A design with more balanced delay paths has fewer glitches, and thus less power dissipation
Note that there are no glitches in dynamic CMOS logic

[Figure: two panels showing signals A, B, C, D, and E]


Gate-Level Design: Glitching Power
A chain structure has more glitches
A tree structure has fewer glitches

[Figure: chain structure and tree structure combining inputs A, B, C, and D]


Gate-Level Design: Precomputation

[Figure: original circuit, with register R1 feeding combinational logic and register R2; and the same circuit with added precomputation logic]


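Gate-Level Design: Precomputation

[Figure: precomputation-based n-bit comparator. The MSBs A<n-1> and B<n-1> go through register R1 to a 1-bit comparator; the remaining bits A<n-2:0> and B<n-2:0> go through registers R2 and R3 to an (n-1)-bit comparator. Precomputation logic generates the Enable signal; the figure also shows register R4 and the output F.]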
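
A behavioral sketch of the comparator idea in Python follows. It assumes that the output is A > B and that the low-order registers are enabled only when the two MSBs are equal; neither detail is stated on the slide, and the function name is hypothetical.

```python
def compare_with_precomputation(a, b, n_bits):
    """Return (a > b, low_order_registers_loaded) for n_bits-wide operands."""
    msb_shift = n_bits - 1
    msb_a = (a >> msb_shift) & 1
    msb_b = (b >> msb_shift) & 1
    if msb_a != msb_b:
        # The MSBs alone decide the result, so the low-order registers
        # keep their old contents and the (n-1)-bit comparator stays quiet.
        return msb_a > msb_b, False
    low_mask = (1 << msb_shift) - 1
    return (a & low_mask) > (b & low_mask), True


n_bits = 4
pairs = [(a, b) for a in range(1 << n_bits) for b in range(1 << n_bits)]
results = [compare_with_precomputation(a, b, n_bits) for a, b in pairs]
loads = sum(1 for _, loaded in results if loaded)
print(loads, "of", len(pairs), "input pairs load the low-order registers")
```

For uniformly distributed inputs the MSBs differ half of the time, so in this model the low-order registers and the wide comparator are active for only half of the input pairs.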



Gate-Level Design: Gating Clock

[Figure: a chain of four D flip-flops with a gated clock clk, which fails DFT rule checking; and the same chain with an added control pin T to solve the DFT violation problem]


Gate-Level Design: Input Gating

[Figure: input gating, showing blocks f1 and f2, a "+" block, and the clk and select signals]


Clock-Gating in Low-Power Flip-Flop

[Figure: D flip-flop with data input D, clock input CK, and output Q]

Source: Prof. V. D. Agrawal



Reduced-Power Shift Register

[Figure: input D feeding two chains of four D flip-flops clocked by CK(f/2), with a multiplexer driving Output]

Flip-flops are operated at full voltage and half the clock frequency.
Source: Prof. V. D. Agrawal
Power Consumption of Shift Register

16-bit shift register, 2μ CMOS

\[
P = C\,V_{DD}^{2}\,f / n
\]

Degree of parallelism   Freq (MHz)   Power (μW)
1                       33.0         1535
2                       16.5         887
4                       8.25         738

[Figure: normalized power (0.0 to 1.0) versus degree of parallelism, n (1, 2, 4)]

C. Piguet, "Circuit and Logic Level Design," pages 103-133 in W. Nebel and J. Mermet (ed.), Low Power Design in Deep Submicron Electronics, Springer, 1997.
Source: Prof. V. D. Agrawal
Architecture-Level Design: Parallelism

[Figure: reference design, in which the 16-bit inputs A and B are registered at fref and feed a 16x16 multiplier with a 32-bit output; and parallel design, in which two 16x16 multipliers with input registers clocked at fref/2 feed a MUX whose 32-bit output is registered at fref]

Assume that, with the same 16x16 multiplier, the power supply can be reduced from Vref to Vref/1.83.
\[
P_{parallel} = 2.2\,C_{ref}\left(\frac{V_{ref}}{1.83}\right)^{2}\frac{f_{ref}}{2} \approx 0.33\,P_{ref}
\]
Architecture-Level Design: Pipelining
Because the hardware between the pipeline stages is reduced, the reference voltage Vref can be lowered to Vnew while maintaining the same worst-case delay. For example, let a 50 MHz multiplier be broken into two equal parts as shown below. The clock rate of the pipeline stages can remain at 50 MHz when the voltage Vnew is equal to Vref/1.83.

[Figure: the 32-bit input (A, B) passes through REG, a half multiplier, REG, and a second half multiplier to a 32-bit output; the registers are clocked at fref]

\[
P_{pipeline} = 1.2\,C_{ref}\left(\frac{V_{ref}}{1.83}\right)^{2} f_{ref} \approx 0.36\,P_{ref}
\]
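
The scaling factors quoted for the parallel and pipelined multipliers follow from P proportional to C V^2 f. A minimal Python check, using only the factors given in the text (the function name is hypothetical):

```python
def relative_power(cap_factor, voltage_divisor, freq_factor):
    """Return P/Pref when C, V, and f are scaled relative to the reference."""
    return cap_factor * (1.0 / voltage_divisor) ** 2 * freq_factor


# Parallel: 2.2x capacitance, Vref/1.83, each multiplier clocked at fref/2.
parallel_ratio = relative_power(2.2, 1.83, 0.5)
# Pipelined: 1.2x capacitance, Vref/1.83, full clock rate fref.
pipeline_ratio = relative_power(1.2, 1.83, 1.0)

print(round(parallel_ratio, 2), round(pipeline_ratio, 2))  # 0.33 0.36
```

In both cases the saving comes from the quadratic voltage term, which outweighs the added capacitance of the extra registers and routing.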
Architecture-Level Design: Retiming
Retiming is a transformation technique used to change the locations of delay elements in a circuit without affecting the input/output characteristics of the circuit.

Two versions of an IIR filter.

[Figure: the IIR filter before and after retiming, with input x(n), output y(n), coefficients a and b, delay elements D and 2D, internal signals w(n), w1(n), and w2(n), and node labels (1) and (2)]


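Architecture-Level Design: Retiming
Retiming for pipeline design

[Figure: two register placements for blocks C1, C2, and C3, both clocked at fref]
First version:  REG, C1 (6 ns), C2 (2 ns), REG, C3 (4 ns)
Second version: REG, C1 (6 ns), REG, C2 (2 ns), C3 (4 ns)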
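
The benefit of moving the register can be seen by comparing the longest register-to-register path in the two arrangements. A minimal Python sketch using the block delays from the figure (the helper name and the list layout are assumptions made here):

```python
def critical_path_ns(stages):
    """Longest combinational delay between registers.

    stages: list of stages, each a list of block delays in nanoseconds.
    """
    return max(sum(stage) for stage in stages)


first_version = [[6, 2], [4]]    # REG, C1, C2, REG, C3
second_version = [[6], [2, 4]]   # REG, C1, REG, C2, C3

print(critical_path_ns(first_version), critical_path_ns(second_version))  # 8 6
```

The shorter critical path can be spent either on a faster clock or, for low power, on a lower supply voltage at the same fref.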



Architecture-Level Design: Retiming

Clock cycle is 4 gate delays

Clock cycle is 2 gate delays



Architecture-Level Design: Power Management

[Figure: two arrangements of blocks C1 and C2 with the control signals C1_FREEZE and C2_FREEZE]


Architecture-Level Design: Bus Segmentation

Avoid the sharing of resources
  Reduce the switched capacitance
For example: a global system bus
  A single shared bus is connected to all modules. This structure results in a large bus capacitance due to
    The large number of drivers and receivers sharing the same bus
    The parasitic capacitance of the long bus line
  A segmented bus structure
    Switched capacitance during each bus access is significantly reduced
    Overall routing area may be increased


Architecture-Level Design: Bus Segmentation

[Figure: a single shared bus with capacitance Cbus, and a segmented bus in which a bus interface joins segments labeled Cbus1]


Algorithmic-Level Design: Activity Reduction

Minimizing the switching activity at a high level is one way to reduce the power dissipation of digital processors.
One method to minimize signal switching at the algorithmic level is to use an appropriate coding for the signals rather than straight binary code.
The table below compares the 3-bit representations of the binary and Gray codes.

Binary Code   Gray Code   Decimal Equivalent
000           000         0
001           001         1
010           011         2
011           010         3
100           110         4
101           111         5
110           101         6
111           100         7
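
The Gray column of the table can be generated from the binary value with the standard reflected-binary conversion, g = b XOR (b >> 1). A short Python sketch (the function names are chosen here for illustration):

```python
def binary_to_gray(b):
    """Reflected-binary Gray code of the non-negative integer b."""
    return b ^ (b >> 1)


def gray_to_binary(g):
    """Inverse conversion: XOR together all right shifts of g."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b


gray_column = [format(binary_to_gray(i), "03b") for i in range(8)]
print(gray_column)
```

Consecutive entries of `gray_column` differ in exactly one bit, which is the property that reduces switching when values change sequentially.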
State Encoding for a Counter
Two-bit binary counter:
State sequence, 00 → 01 → 10 → 11 → 00
Six bit transitions in four clock cycles
6/4 = 1.5 transitions per clock
Two-bit Gray-code counter:
State sequence, 00 → 01 → 11 → 10 → 00
Four bit transitions in four clock cycles
4/4 = 1.0 transition per clock
Gray-code counter is more power efficient.
G. K. Yeap, Practical Low Power Digital VLSI Design, Boston: Kluwer Academic Publishers (now Springer), 1998.
Source: Prof. V. D. Agrawal
Binary Counter: Original Encoding

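Present state   Next state
a  b            A  B
0  0            0  1
0  1            1  0
1  0            1  1
1  1            0  0

\[
A = \bar{a}\,b + a\,\bar{b} \qquad B = \bar{a}\,\bar{b} + a\,\bar{b}
\]

[Figure: logic circuit with present-state signals a and b, next-state signals A and B, and CK and CLR inputs]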
Source: Prof. V. D. Agrawal
Binary Counter: Gray Encoding

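Present state   Next state
a  b            A  B
0  0            0  1
0  1            1  1
1  0            0  0
1  1            1  0

\[
A = \bar{a}\,b + a\,b \qquad B = \bar{a}\,\bar{b} + \bar{a}\,b
\]

[Figure: logic circuit with present-state signals a and b, next-state signals A and B, and CK and CLR inputs]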
Source: Prof. V. D. Agrawal
Three-Bit Counters

Binary                      Gray-code
State   No. of toggles      State   No. of toggles
000     -                   000     -
001     1                   001     1
010     2                   011     1
011     1                   010     1
100     3                   110     1
101     1                   111     1
110     2                   101     1
111     1                   100     1
000     3                   000     1
Av. transitions/clock = 1.75        Av. transitions/clock = 1
Source: Prof. V. D. Agrawal
N-Bit Counter: Toggles in Counting Cycle

Binary counter: $T(\text{binary}) = 2(2^{N} - 1)$
Gray-code counter: $T(\text{gray}) = 2^{N}$
$T(\text{gray})/T(\text{binary}) = 2^{N-1}/(2^{N} - 1) \rightarrow 0.5$

Bits   T(binary)   T(gray)   T(gray)/T(binary)
1      2           2         1.0
2      6           4         0.6667
3      14          8         0.5714
4      30          16        0.5333
5      62          32        0.5161
6      126         64        0.5079
∞      -           -         0.5000

Source: Prof. V. D. Agrawal
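
The closed-form toggle counts can be checked by stepping an N-bit counter through one full counting cycle and counting the bit flips. A brief Python sketch (the function names are hypothetical):

```python
def binary_to_gray(b):
    """Reflected-binary Gray code of b."""
    return b ^ (b >> 1)


def toggles_per_cycle(n_bits, encode=lambda state: state):
    """Bit toggles over one full cycle of an n_bits counter, wrap-around included."""
    n_states = 1 << n_bits
    total = 0
    for state in range(n_states):
        cur = encode(state)
        nxt = encode((state + 1) % n_states)
        total += bin(cur ^ nxt).count("1")  # Hamming distance
    return total


for n in range(1, 7):
    t_bin = toggles_per_cycle(n)
    t_gray = toggles_per_cycle(n, binary_to_gray)
    print(n, t_bin, t_gray, round(t_gray / t_bin, 4))
```

The printed rows match the table for 1 to 6 bits, and the ratio approaches 0.5 as N grows.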


FSM State Encoding
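[Figure: a three-state FSM drawn with two state encodings, (11, 00, 01) on the left and (01, 00, 11) on the right. Transition probabilities, based on PI statistics, label the edges: self-loops 0.6, 0.6, and 0.9; state-to-state edges 0.3, 0.4, 0.1, and 0.1.]

Expected number of state-bit transitions:

Left encoding:  2(0.3+0.4) + 1(0.1+0.1) = 1.6
Right encoding: 1(0.3+0.4+0.1) + 2(0.1) = 1.0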


State encoding can be selected using a power-based cost function.

Source: Prof. V. D. Agrawal
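
A power-based cost function of this kind can be sketched in Python as the sum, over all edges, of the transition probability times the Hamming distance between the two state codes. The edge list below is an assumption: the state names S0 to S2 and the edge directions are reconstructed from the coefficients in the two sums above, not read from the original figure.

```python
def expected_bit_transitions(edges, encoding):
    """Sum of probability * Hamming distance between state codes."""
    return sum(
        prob * bin(encoding[src] ^ encoding[dst]).count("1")
        for src, dst, prob in edges
    )


# Hypothetical edge list; self-loops are omitted because they flip no bits.
edges = [
    ("S0", "S1", 0.3),
    ("S1", "S0", 0.4),
    ("S0", "S2", 0.1),
    ("S2", "S1", 0.1),
]
left_encoding = {"S0": 0b11, "S1": 0b00, "S2": 0b01}
right_encoding = {"S0": 0b01, "S1": 0b00, "S2": 0b11}

left_cost = round(expected_bit_transitions(edges, left_encoding), 2)
right_cost = round(expected_bit_transitions(edges, right_encoding), 2)
print(left_cost, right_cost)  # 1.6 1.0
```

An encoding search would evaluate this cost for each candidate assignment and keep the one with the smallest value.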


FSM: Clock-Gating
Moore machine: Outputs depend only on the state variables.
If a state has a self-loop in the state transition graph (STG), then the clock can be stopped whenever a self-loop is to be executed.

[Figure: STG fragment with states Si, Sj, and Sk; edges labeled Xi/Zk and Xj/Zk, and a self-loop Xk/Zk on Sk]

Clock can be stopped when the (Xk, Sk) combination occurs.

Source: Prof. V. D. Agrawal


Clock-Gating in Moore FSM

[Figure: Moore FSM with primary inputs PI, combinational logic, primary outputs PO, and flip-flops; clock activation logic and a latch gate the clock CK]

L. Benini and G. De Micheli, Dynamic Power Management, Boston: Springer, 1998.
Source: Prof. V. D. Agrawal
Bus Encoding for Reduced Power
Example: Four-bit bus
0000 → 1110 has three transitions.
If bits of second pattern are inverted, then 0000 → 0001 will have only one transition.
Bit-inversion encoding for N-bit bus:

[Figure: number of bit transitions after inversion encoding (0 to N, with N/2 marked) versus number of bit transitions (0, N/2, N)]
Source: Prof. V. D. Agrawal
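
A minimal Python sketch of bit-inversion encoding follows. It assumes the usual decision rule, inverting the word when more than N/2 bus lines would otherwise toggle, and a separate polarity bit that tells the receiver whether to invert back; the function names are hypothetical.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert_encode(prev_bus, data, n_bits):
    """Return (word_to_drive, polarity_bit) for an n_bits-wide bus."""
    mask = (1 << n_bits) - 1
    if hamming(prev_bus, data) > n_bits / 2:
        return (~data) & mask, 1   # send the inverted word
    return data & mask, 0


def bus_invert_decode(bus, polarity, n_bits):
    """Recover the original data word at the receiver."""
    mask = (1 << n_bits) - 1
    return (~bus) & mask if polarity else bus


sent, polarity = bus_invert_encode(0b0000, 0b1110, 4)
print(format(sent, "04b"), polarity)  # 0001 1
```

With this rule no transfer toggles more than N/2 data lines; the polarity line adds its own transitions, which this sketch does not count.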
Bus-Inversion Encoding Logic

[Figure: sent data passes through a bus register to the received data; polarity decision logic generates the polarity bit]

M. Stan and W. Burleson, "Bus-Invert Coding for Low Power I/O," IEEE Trans. VLSI Systems, vol. 3, no. 1, pp. 49-58, March 1995.
Source: Prof. V. D. Agrawal
RTL-Level Design: Signal Gating

Simple Decoder

module decoder (a, sel);
  input  [1:0] a;
  output [3:0] sel;
  reg    [3:0] sel;
  // One-hot decode of a; sel changes whenever a changes.
  always @(a) begin
    case (a)
      2'b00: sel = 4'b0001;
      2'b01: sel = 4'b0010;
      2'b10: sel = 4'b0100;
      2'b11: sel = 4'b1000;
    endcase
  end
endmodule

Decoder with enable

module decoder (en, a, sel);
  input        en;
  input  [1:0] a;
  output [3:0] sel;
  reg    [3:0] sel;
  // While en is low, sel is held at 0 and changes on a do not propagate.
  always @(en or a) begin
    case ({en, a})
      3'b100: sel = 4'b0001;
      3'b101: sel = 4'b0010;
      3'b110: sel = 4'b0100;
      3'b111: sel = 4'b1000;
      default: sel = 4'b0000;
    endcase
  end
endmodule



RTL-Level Design: Datapath Reordering

[Figure: initial and reordered datapaths, each built from two multiplexers and an A<B comparator, with the stable and glitchy signals marked]


RTL-Level Design: Memory Partition

[Figure: a memory partitioned into two 128x32 banks. Labeled signals: din (32 bits), addr, write, noe, and dout on each bank; the 8-bit addr[7:0], addr[7:1], and addr0; a register (d, q, clk) with pre_addr; and a MUX driving the 32-bit dout]


RTL-Level Design: Memory Partition

Application-driven memory partition

[Figure: an ARM core connected by Data, Addr, and R/W lines to a 64K-byte memory, with a plot of reads versus address range marked at 28K, 4K, 32K, and 64K]


RTL-Level Design: Memory Partition

A power-optimal partitioned memory organization

[Figure: an ARM core and a decoder driving three memory blocks, each with R/W, Addr, CS, and Data connections]
