You are on page 1of 44

Chapter 4

Low--Power VLSI Design


Low

Jin-Fu Li
Advanced Reliable Systems
y
((ARES)) Lab.
Department of Electrical Engineering
National Central University
Jhongli, Taiwan

Outline

Introduction
Low-Power Gate-Level Design
Low-Power Architecture-Level Design
Algorithmic-Level Power Reduction
RTL Techniques
T h i
for
f Optimizing
O i i i Power
P

National Central University

EE4012VLSI Design

Introduction

Most SOC design teams now regard power as one


g concerns
of their top design

Why low-power design?


Battery lifetime (especially for portable devices)
Reliability

Power consumption
Peak power
p
Average power

National Central University

EE4012VLSI Design

Overview of Power Consumption

Average power consumption


Dynamic
y
p
power consumption
p
Short-circuit power consumption
Leakage power consumption
Static power consumption

D
Dynamic
i power di
dissipation
i ti d
during
i switching
it hi

Cinput
Cdrain
National Central University

interconnect

EE4012VLSI Design

Cinput

Overview of Power Consumption

Generic representation of a CMOS logic gate for


gp
power calculation
switching
VA
VB

pMOS
network
Vout

VA
VB

Pavg

nMOS
network

C drain Cint erconnect Cinput

T
dVout
dVout
1 T /2
[ Vout (Cload
)dt (VDD Vout )(Cload
)dt ]
T /2
T 0
dt
dt

National Central University

EE4012VLSI Design

Overview of Power Consumption

The average power consumption can be expressed


as
Pavg

1
2
2

C load V DD
C load V DD
f CLK
T

The node transition rate can be slower than the


clock rate. To better represent this behavior, a
node transition factor
f
( T ) should be introduced
2
Pavg T C load V DD
f CLK

The switching power expressed above are derived


by taking into account the output node load
capacitance

National Central University

EE4012VLSI Design

Overview of Power Consumption


VA

VA

Vinternal

VB

VB

Cinternal

Vinternal
Vout

VA

Cload

VB

Vout

The generalized expression for the average power dissipation


can be rewritten as
Pavg
National Central University

# ofnodes

Ti C iV i V DD f CLK
i 1

EE4012VLSI Design

Gate--Level Design Technology Mapping


Gate

The objective of logic minimization is to reduce the


boolean function.

For low-power design, the signal switching activity


is minimized by restructuring a logic circuit

The power minimization is constrained by the


delay, however, the area may increase.

During this phase of logic minimization, the


function to be minimized is

P i (1 P i ) C

National Central University

EE4012VLSI Design

Gate--Level Design Technology Mapping


Gate

The first step in technology mapping is to decompose


each logic function into two-input gates

The objective of this decomposition is to minimizing the


total power dissipation by reducing the total switching
activity
ti it
A 0.2

0.0384

B 0.2

A
B
C
D

0.0196

0.0099

C 0.5
D 0
0.5
5
A 0.2

0.0384

B 0.2
C 0.5
D 0
0.5
5
National Central University

0.0099
0.1875
EE4012VLSI Design

Gate--Level Design Phase Assignment


Gate

High activity node

High activity node

National Central University

EE4012VLSI Design

10

Gate--Level Design Pin Swapping


Gate
a

a
Switchin
ng activity

Switching activityy

c
b
a

d
c
b
a

National Central University

b
c
d

a
b
c
d

EE4012VLSI Design

11

Gate--Level Design Glitching Power


Gate

Glitches
spurious transitions due to imbalanced path delays

A design has more balanced delay paths


has fewer g
glitches,, and thus has less power
p
dissipation
p

Note that there will be no glitches in a dynamic CMOS


logic
g
A

A
B

D
E

National Central University

C
D
E

EE4012VLSI Design

12

Gate--Level Design Glitching Power


Gate

A chain structure has more glitches


A tree structure has fewer glitches
A
B

Chain structure

C
D
A

Tree structure

B
C
D
National Central University

EE4012VLSI Design

13

Gate--Level Design Precomputation


Gate

REG
R1

REG
R1

Combinational Logic

Combinational Logic

REG
R2

REG
R2

Precomputation
Logic
g
National Central University

EE4012VLSI Design

14

Gate--Level Design Precomputation


Gate
A<n-1>
A<n
1>
B<n-1>

REG
R1

A<n-2:0>

REG
R2
Enable

Precomputation logic
B<n-2:0>

National Central University

1-bit Comparator
(MSB)

(n-1)-bit
Comparator

REG
R4

REG
R3

EE4012VLSI Design

15

Gate--Level Design Gating Clock


Gate
D Q

D Q

D Q

D Q

Fail DFT rule


checking

clk

T
D Q

D Q

D Q

D Q

Add control pin


to solve DFT
violation
problem

clk

National Central University

EE4012VLSI Design

16

Gate--Level Design Input Gating


Gate
f1

clk

+
select
l t
f2

National Central University

EE4012VLSI Design

17

Clock--Gating in LowClock
Low-Power Flip
Flip--Flop

CK

Source: Prof. V. D. Agrawal


National Central University

EE4012VLSI Design

18

Reduced--Power Shift Register


Reduced
D

Q
multiiplexer

Output

CK(f/2)

Flip-flops are operated at full voltage and half the clock frequency.
Source: Prof. V. D. Agrawal
National Central University

EE4012VLSI Design

19

Power Consumption of Shift Register


16-bit shift register, 2 CMOS
Freq
F
(MHz)

Power
P
(W)

33 0
33.0

1535

16.5

887

8 25
8.25

738

10
1.0
No
ormalize
ed power

Deg. Of
D
parallelism

C. Piguet, Circuit and Logic Level


Design pages 103-133
Design,
103 133 in W
W. Nebel
and J. Mermet (ed.), Low Power
Design in Deep Submicron
Electronics Springer,
Electronics,
Springer 1997
1997.

P = CVDD2f/n

05
0.5
0.25

00
0.0

2
4
Degree of parallelism, n
Source: Prof. V. D. Agrawal

National Central University

EE4012VLSI Design

20

Architecture--Level Design Parallelism


Architecture
A

16

R
16x16
multiplier

fref
16

16

16

32

fref/2

16x16
multiplier

32

M 32
U
X

fref/2

fref
Assume that With the same 16x16
multiplier, the power supply can
be reduced from Vref to Vref/1.83.

Vreff

Pparallel 2.2Cref (
)
1.83

f reff
2

R
fref/2

0.33Pref

16
B

fref

16x16
multiplier
p

32

R
16
fref/2

National Central University

EE4012VLSI Design

21

Architecture--Level Design Pipelining


Architecture
The hardware between the pipeline stages is reduced then
the reference voltage Vref can be reduced to Vnew to maintain
the same worst case delay.
delay For example,
example let a 50MHz
multiplier is broken into two equal parts as shown below. The
delay between the pipeline stages can be remained at 50MHz
when the voltage Vnew is equal to Vref/1.83
/1 83

(A ,B)

32

REG

Half
multiplier

REG

Half
multiplier

32

fref

Ppipeline 1 .2 C ref (
National Central University

V ref
1 .83

) f ref 0 .36 Pref


2

EE4012VLSI Design

22

Architecture--Level Design Retiming


Architecture
Retiming is a transformation technique used to change the
locations of delay elements in a circuit without affecting the
input/output characteristics of the circuit
circuit.
Two versions of an IIR filter.
(1)

(1)
x(n)
( )

y(n))
y(
D

w(n)

(1)

x(n)
( )
D

2D
(1)

(2)
b

retiming

D
w1(n)
w2(n)

(2)

National Central University

y(n))
y(

EE4012VLSI Design

a
(2) 2D
b
(2)

23

Architecture--Level Design Retiming


Architecture
Retiming for pipeline design

REG

C1
((6ns))

C2
((2ns))

REG

REG

C2
(2ns)

C3
(4ns)

fref

REG

C1
(6ns)

C3
(4ns)

fref

National Central University

EE4012VLSI Design

24

Architecture--Level Design Retiming


Architecture

Clock cycle is 4 gate delays

Clock cycle is 2 gate delays


National Central University

EE4012VLSI Design

25

Architecture--Level Design
Architecture

Power Management

C2
C1
C1_FREEZE
C2_FREEZE

C2
C1
C1_FREEZE
FREEZE
C2_FREEZE

National Central University

EE4012VLSI Design

26

Architecture--Level Design
Architecture

Bus Segmentation

Avoid the sharing of resources


Reduce the switched capacitance

For example: a global system bus


A single shared bus is connected to all modules, this
structure results in a large bus capacitance due to

The large number of drivers and receivers sharing the same


bus
The parasitic capacitance of the long bus line

A segmented
g
bus structure
Switched capacitance during each bus access is
significantly reduced
Overall routing area may be increased

National Central University

EE4012VLSI Design

27

Architecture--Level Design
Architecture

Bus Segmentation

Cbus

Interface

Bus

Cbus1

Cbus1
National Central University

EE4012VLSI Design

28

Algorithmic--Level Design
Algorithmic

factivity Reduction

Minimization the switching activity, at high level, is one way to


reduce the power dissipation of digital processors.
One method to minimize the switching signals
signals, at the algorithmic
level, is to use an appropriate coding for the signals rather than
straight binary code.
The table shown below shows a comparison of 3-bit representation
of the binary and Gray codes.
Binary Code

Gray Code

000
001
010
011
100
101
110
111

000
001
011
010
110
111
101
100

National Central University

Decimal Equivalent

EE4012VLSI Design

0
1
2
3
4
5
6
7
29

State Encoding for a Counter

Two-bit binary counter:

Two-bit Gray-code counter

Gray-code
Gray
code counter is more power efficient.

State sequence,
q
, 00 01 10 11 00
Six bit transitions in four clock cycles
6/4 = 1.5 transitions pper clock
State
St t sequence, 00 01 11 10 00
Four bit transitions in four clock cycles
4/4 = 11.0
0 ttransition
iti per clock
l k

G. K. Yeap, Practical Low Power Digital VLSI Design, Boston:


Kluwer Academic Publishers (now Springer)
Springer), 1998
1998.
Source: Prof. V. D. Agrawal
National Central University

EE4012VLSI Design

30

Binary Counter: Original Encoding

Present
state

a
Next state

A = ab + ab
B = ab + ab

A
B

CK
CLR
Source: Prof. V. D. Agrawal

National Central University

EE4012VLSI Design

31

Binary Counter: Gray Encoding


Present
state

Next state

A = ab + ab
B = ab + ab

a
A
B

CK
CLR
Source: Prof. V. D. Agrawal

National Central University

EE4012VLSI Design

32

Three--Bit Counters
Three
Binary

Gray-code

State

gg
No. of toggles

State

No. of toggles
gg

000

000

001

001

010

011

011

010

100

110

101

111

110

101

111

100

000

000

Av. Transitions/clock = 1.75

Av. Transitions/clock = 1
Source: Prof. V. D. Agrawal

National Central University

EE4012VLSI Design

33

N-Bit Counter: Toggles in Counting Cycle

Binary counter: T(binary) = 2(2N 1)


Gray code counter: T(gray) = 2N
Gray-code
T(gray)/T(binary) = 2N-1/(2N 1) 0.5
Bits

T(binary)

T(gray)

T(gray)/T(binary)

1.0

0.6667

14

0.5714

30

16

0.5333

62

32

0.5161

126

64

0.5079

0.5000
Source: Prof. V. D. Agrawal

National Central University

EE4012VLSI Design

34

FSM State Encoding


Transition
probability
based on
PI statistics

06
0.6

06
0.6

11

0.3

0.4
00
0.6

0.1

0.3

01
0.1
01

01
0.4

09
0.9

00
0.6

0.1

01
0.1
11

09
0.9

Expected number of state-bit transitions:

2(0.3+0.4) + 1(0.1+0.1) = 1.6

1(0.3+0.4+0.1) + 2(0.1) = 1.0

State encoding can be selected using a power-based cost function.


Source: Prof. V. D. Agrawal
National Central University

EE4012VLSI Design

35

FSM: ClockClock-Gating

Moore machine: Outputs depend only on the


state variables.
If a state has a self-loop in the state transition

graph (STG), then clock can be stopped


whenever a self-loop is to be executed.
Xi/Zk
Si
Sk
Sj

Xk/Zk
C oc ca
Clock
can be stopped
when (Xk, Sk) combination
occurs.

Xj/Zk

Source: Prof. V. D. Agrawal


National Central University

EE4012VLSI Design

36

Clock--Gating in Moore FSM


Clock

PI

Flip-fflops

Combinational
logic

Clock
activation
logic
CK

Latch

PO

L. Benini and G
L
G. De Micheli
Micheli,
Dynamic Power Management,
Boston: Springer, 1998.
Source: Prof. V. D. Agrawal

National Central University

EE4012VLSI Design

37

Bus Encoding for Reduced Power

Example: Four bit bus

Bit-inversion encoding for N-bit bus:


Number of bit transitio
ons
after iinversion
n encod
ding

0000 1110 has three transitions.


If bits of second pattern are inverted
inverted, then 0000
0001 will have only one transition.

N/2

N/2
N b off bi
Number
bit transitions
ii

Source: Prof. V. D. Agrawal

National Central University

EE4012VLSI Design

38

Sen
nt data

R
Receive
ed data

Bus--Inversion Encoding Logic


Bus

Polarity
decision
logic

Bus register
Polarity bit

M. Stan and W. Burleson, Bus-Invert


Coding for Low Power I/O, IEEE
Trans. VLSI Systems, vol. 3, no. 1, pp.
49-58, March 1995.
Source: Prof. V. D. Agrawal

National Central University

EE4012VLSI Design

39

RTL--Level Design
RTL

Decoder with enable

Simple Decoder
module decoder (a, sel);
input [1:0[ a;
ouput [3:0] sel;
reg [3:0]
[3 0] sel;
l
always @(a) begin
case (a)
2b00: sel=4b0001;
2b01: sel=4b0010;
2b10: sel=4b0100;
2b11: sel=4b1000;
endcase
end
endmodule

National Central University

Signal Gating

module decoder (en,a, sel);


input
en;
input [1:0[ a;
ouput [3:0]
[3 0] sel;
l
reg [3:0] sel;
always @({en,a}) begin
case ({en,a})
3b100: sel=4b0001;
3b101: sel=4b0010;
3b110: sel=4b0100;
3b111: sel=4b1000;
default: sel=4b0000;
endcase
d
end
endmodule

EE4012VLSI Design

40

RTL--Level Design
RTL

Datapath Reordering

Initial

A<B

Reordered

stable

Mux

M
Mux

glitchy
glitchy
lit h
A<B

Mux

Mux

stable

National Central University

EE4012VLSI Design

41

RTL--Level Design
RTL

Memory Partition

128x32
din
addr
write

dout

32

noe
pre addr
pre_addr

[ ]
q addr[7:0]

M 32
U
dout
X

addr[7:1]
dd [7 1]
noe

clk
write
addr
din
addr0

National Central University

EE4012VLSI Design

dout

32

128x32

42

RTL--Level Design
RTL

Memory Partition

Application-driven memory partition

64K bytes

Reads

Data
ARM
Core

Addr
R/W
28K

4K

32K
64K

National Central University

EE4012VLSI Design

Addr
Range

43

RTL--Level Design
RTL

Memory Partition

A power-optimal partitioned memory organization

Decoder

Data
Addr
R/W
CS

EE4012VLSI Design

Data
Addr
R/W
CS

National Central University

Data
Addr
R/W
CS

ARM
Core

44

You might also like