Synthesis For Low Power: A New VLSI Design Paradigm For DSM: Ajit Pal

Synthesis for Low Power:
A New VLSI Design Paradigm for DSM
Ajit Pal
Professor
Department of Computer Science and
Engineering
Indian Institute of Technology Kharagpur
INDIA -721302
Outline
• Why Low Power?
• Sources of power dissipation
– Dynamic Power
– Static Power
• Degrees of Freedom
• Low Power Techniques
– Supply Voltage Scaling Techniques
– Minimizing switching capacitances
– Leakage Power reduction Techniques
Ajit Pal IIT Kharagpur
Why Low-power?
• Until recently performance has been
synonymous with circuit speed or
processing power, e.g. MIPS or MFLOPS.
• Implementation involved an area-time
tradeoff. Power consumption = k·A·f,
where k = 0.063 W/(cm²·MHz), A is the area
in cm² and f is the frequency in MHz.
• Power consumption was of secondary
concern.

Why Low-power?
• Contemporary high-performance
processors consume heavy power
• Cost associated with packaging and
cooling such devices is prohibitive
• Low-power methodology to be used to
reduce cost of packaging and cooling
Processor Power
Processor      Clock (MHz)  Technology (μm)  Vdd (Volt)  Peak Power (Watt)
Ultra Sparc    167          0.45             3.3         30
Intel Pentium  200          0.50             3.3         26
Alpha 21064    200          0.50             3.3         30
Alpha 21164    300          0.45             3.3         50
Alpha 21264    667          0.35             2.0         72
Alpha 21364    1000         0.25             1.5         100

Why Low-power?
• Emergence of portable computing and
communication equipment, such as laptops,
palmtops, cell-phones, etc. The growth rate of
such portable equipment is very high.
• As these devices are battery operated,
battery life is of primary concern.
Unfortunately, battery technology has not
kept up with the energy requirement of
portable equipment.
• Commercial success of these products
depends on weight, cost and battery life.
• Low power design methodology is very
important to make them commercially viable.
Why Low-power?
• Reliability is closely related to power
dissipation: every 10ºC rise in temperature
roughly doubles the failure rate
[Figure: Onset temperatures of various failure mechanisms (thermal runaway, gate dielectric, junction diffusion, electromigration diffusion, electrical parameter shift, package related failure, silicon interconnect fatigue); axis: °C above normal operating temperature, 0 to 300]

Why Low-power?
• According to an estimate of the U.S.
Environmental Protection Agency (EPA),
80% of the power consumption by office
equipment is due to computing equipment,
and a large part of it comes from unused equipment
• Power is dissipated mostly in the form of
heat. Cooling techniques, such as air conditioning (AC),
transfer the heat to the environment.
• To reduce the adverse effect on the environment,
efforts such as EPA’s Energy Star program,
leading to power management standards for
desktops and laptops, have emerged.
• Drive towards Green PC
Sources of Power Dissipation
• CMOS has emerged as the technology of
choice for low power applications and is likely
to remain so in the near future.
• The sources of power dissipation in CMOS
circuits
– Dynamic Power
– Static Power

Dynamic Power Dissipation
• Dynamic Power
– Switching power
– Short-circuit power
– Glitching power
• Static Power
– Due to reverse-biased junction diode
currents when the transistors are off
– Due to sub-threshold leakage current

Dynamic Power Dissipations
• Switching Power
Due to the charging and discharging of load
and parasitic capacitors
\[ P_{dynamic} = \alpha_L \cdot C_L \cdot V_{DD}^2 \cdot f + \sum_i \alpha_i \cdot C_i \cdot V_{DD} \cdot (V_{DD} - V_T) \]
[Figure: CMOS gate with pull-up and pull-down networks driving load C_L. With the pull-up network ON and the pull-down network OFF, the charging current flows from Vdd into C_L; with the pull-up network OFF and the pull-down network ON, the discharging current flows from C_L to ground]
Switching Power
• During transition of the output from 0 to Vdd, the
energy drawn from the power supply is given by
\[ E_{0 \to 1} = \int_0^T p(t)\,dt = \int_0^T V_{dd} \cdot i(t)\,dt, \qquad i(t) = C_L \frac{dV_0}{dt} \]
Substituting this we get
\[ E_{0 \to 1} = V_{dd} \int_0^{V_{dd}} C_L\,dV_0 = C_L V_{dd}^2 \]
If a square wave of repetition frequency f (1/T)
is applied at the input then the power dissipated per
unit time is given by
\[ P_d = \frac{1}{T} \cdot C_L V_{dd}^2 = C_L V_{dd}^2 f \]
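As a quick numerical check of the relation P_d = C_L·Vdd²·f, the following minimal sketch evaluates the switching energy and power. The load (50 fF), supply (3.3 V) and frequency (100 MHz) are assumed example values, not figures from the slides, and the `activity` parameter is an illustrative addition for nodes that do not toggle every cycle.

```python
def switching_energy(c_load, vdd):
    """Energy drawn from the supply for one 0 -> Vdd output transition (joules)."""
    return c_load * vdd ** 2


def switching_power(c_load, vdd, freq, activity=1.0):
    """Average switching power in watts; `activity` is a hypothetical scaling factor."""
    return activity * switching_energy(c_load, vdd) * freq


# Assumed example values: 50 fF load, 3.3 V supply, 100 MHz input square wave.
p_full = switching_power(50e-15, 3.3, 100e6)
p_half = switching_power(50e-15, 3.3 / 2, 100e6)
print(f"P at 3.30 V = {p_full * 1e6:.2f} uW")
print(f"P at 1.65 V = {p_half * 1e6:.2f} uW (ratio {p_half / p_full:.2f})")
```

Halving the supply voltage in this sketch cuts the switching power to one quarter, which is the quadratic dependence on Vdd.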
Dynamic Power Dissipation
• Short Circuit Power Dissipation
As the input changes slowly, power dissipation takes
place even when there is no load or parasitic capacitor.
The current responsible is known as the short circuit
current, I_SC.
\[ I_{sc} = \frac{1}{12} \cdot \frac{k \tau f}{V_{DD}} \cdot (V_{DD} - V_T)^3 \]
Note that the short circuit power dissipation is greatly
affected by power supply scaling and is also
proportional to the frequency and the rise/fall time
of the input signal.

Short Circuit Power Dissipation
\[ I_{mean} = 2 \times \frac{1}{T} \left[ \int_{t_1}^{t_2} i(t)\,dt + \int_{t_2}^{t_3} i(t)\,dt \right] \]
Because of symmetry we may write
\[ I_{mean} = \frac{4}{T} \left[ \int_{t_1}^{t_2} i(t)\,dt \right] \]
For the nMOS transistor operating in the saturation region
\[ I_{mean} = \frac{4}{T} \left[ \int_{t_1}^{t_2} \frac{\beta}{2} \left( V_{in}(t) - V_t \right)^2 dt \right] \]

Short Circuit Power Dissipation
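\[ I_{mean} = \frac{2\beta}{T} \int_{\frac{V_t}{V_{dd}}\tau}^{\frac{\tau}{2}} \left( \frac{V_{dd}}{\tau}\,t - V_t \right)^2 dt \]
This results in
\[ I_{mean} = \frac{\beta}{12 V_{dd}} (V_{dd} - 2V_t)^3 \cdot \frac{\tau}{T} \]
Short circuit power is given by
\[ P_{sc} = V_{dd} \cdot I_{mean} = \frac{\beta}{12} (V_{dd} - 2V_t)^3 \cdot \tau \cdot f \]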
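The closed-form result P_sc = (β/12)·(Vdd − 2Vt)³·τ·f can be evaluated directly. The sketch below is only an illustration: the device values (β, Vt, τ, f) are assumptions, and the guard returning zero for Vdd ≤ 2Vt reflects the fact that the pull-up and pull-down transistors then never conduct simultaneously, so the cubic expression no longer applies.

```python
def short_circuit_power(beta, vdd, vt, tau, freq):
    """P_sc = (beta / 12) * (Vdd - 2*Vt)**3 * tau * f, in watts.

    beta: transistor gain factor (A/V^2), tau: input rise/fall time (s).
    Returns 0.0 when Vdd <= 2*Vt (no simultaneous conduction).
    """
    overdrive = vdd - 2.0 * vt
    if overdrive <= 0.0:
        return 0.0
    return beta / 12.0 * overdrive ** 3 * tau * freq


# Assumed example values: beta = 100 uA/V^2, Vt = 0.7 V, tau = 1 ns, f = 100 MHz.
for vdd in (3.3, 2.5, 1.8, 1.2):
    p = short_circuit_power(100e-6, vdd, 0.7, 1e-9, 100e6)
    print(f"Vdd = {vdd:.1f} V -> P_sc = {p * 1e6:.3f} uW")
```

The loop shows how strongly supply scaling reduces the short-circuit component, while the linear dependence on τ and f is visible from the formula itself.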
Glitching Power
Output waveform showing glitch at output O2

Leakage Current Mechanisms of
Deep-submicrometer Transistors

Static Power Dissipation
I1 = Reverse-bias p-n junction diode leakage current
I2 = Band-to-band tunneling current
I3 = Subthreshold leakage current
I4 = Gate Oxide tunneling current
I5 = Gate current due to hot-carrier injection
I6 = Channel punch-through
I7 = Gate induced drain-leakage current

Reverse Biased Leakage
[Figure: Cross-section of an n-well CMOS inverter on a p-type substrate with the nMOS ON and the pMOS OFF (Vout = "0"), showing the drain leakage and reverse leakage currents through the reverse-biased junctions]

p-n Junction Reverse-Biased Current
nMOS inverter and its physical structure

The current for one diode is given by
\[ I_{rdlc} = A J_s \left( e^{\frac{q V_d}{nKT}} - 1 \right) \]
Where,
Js = reverse saturation current density,
Vd = diode voltage,
n = emission coefficient of the diode (sometimes equal to 1),
q = charge of an electron (1.602 × 10^-19 C),
K = Boltzmann constant (1.38 × 10^-23 J/K),
T = temperature in K
Then the total static power dissipation due to diode
leakage current for 1 million transistors is given by:
\[ P = V_{dd} \sum_{i=1}^{10^6} I_{di} \approx 0.01\ \mu W \]

Band-to-Band Tunneling Current
High electric field across a reverse-biased p-n
junction causes significant current known as
BTBT current, which dominates the p-n junction
leakage current
Band-to-Band Tunneling Current
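The tunneling current density is given by
\[ J_{b-b} = A \frac{E V_{app}}{E_g^{1/2}} \exp\left( -B \frac{E_g^{3/2}}{E} \right) \]
Where,
\[ A = \frac{\sqrt{2 m^*}\, q^3}{4 \pi^3 \hbar^2} \quad \text{and} \quad B = \frac{4 \sqrt{2 m^*}}{3 q \hbar} \]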

Sub-threshold Leakage Current
• Static power due to sub-threshold leakage
current:
\[ I_{sub} = A\, e^{\frac{q}{n'kT} \left( V_G - V_S - V_{th0} - \delta' V_S + \eta V_{DS} \right)} \left( 1 - e^{-\frac{q V_{DS}}{kT}} \right) \]
• This current increases drastically with
temperature
• It also increases as the threshold voltage is
scaled down along with the power supply
voltage for better performance.
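To make the exponential sensitivity concrete, the sketch below evaluates the sub-threshold expression for an OFF transistor (V_G = V_S = 0). All parameter values are assumptions chosen for illustration: the prefactor `a`, the slope factor `n`, `delta` and `eta` are hypothetical, and the prefactor is held constant even though in a real device it also varies with temperature.

```python
import math

Q = 1.602e-19  # charge of an electron (C)
K = 1.38e-23   # Boltzmann constant (J/K)


def subthreshold_current(vg, vs, vds, vth0, temp_k,
                         a=1e-6, n=1.5, delta=0.0, eta=0.05):
    """Evaluate I_sub = A*exp(q/(n*k*T) * (...)) * (1 - exp(-q*Vds/(k*T)))."""
    v_thermal = K * temp_k / Q  # kT/q in volts
    exponent = (vg - vs - vth0 - delta * vs + eta * vds) / (n * v_thermal)
    return a * math.exp(exponent) * (1.0 - math.exp(-vds / v_thermal))


# OFF transistor, Vds = 1 V, at 300 K for two threshold voltages.
i_vth_04 = subthreshold_current(0.0, 0.0, 1.0, 0.4, 300.0)
i_vth_03 = subthreshold_current(0.0, 0.0, 1.0, 0.3, 300.0)
i_hot = subthreshold_current(0.0, 0.0, 1.0, 0.4, 400.0)
print(f"Vth 0.4 V -> 0.3 V: leakage x {i_vth_03 / i_vth_04:.1f}")
print(f"300 K -> 400 K:     leakage x {i_hot / i_vth_04:.1f}")
```

Under these assumed parameters, a 100 mV reduction in threshold voltage raises the leakage by more than an order of magnitude.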

Subthreshold Leakage Current
• Various mechanisms that affect the
subthreshold leakage current are:
Drain induced Barrier Lowering
Body effect
Narrow-width effect
Effect of channel length and Vth Roll off
Effect of temperature

Contributions of Various Power
Dissipations
0% 100%
Switching power 80%-90%
Leakage power 10%-30%
Short-circuit power 0%-5%

Why Leakage Power is an Issue?
• In stand-by applications the leakage component
becomes a significant percentage of total power
• Leakage current approaches 10% of total power
in deep sub-micron technology
[Figure: Active power and stand-by power in watts (log scale, 10^-6 to 10^2) versus technology generation (1.0, 0.8, 0.6, 0.5, 0.35, 0.25, 0.18 μm)]

Why Leakage Power is an Issue?
Leakage power is becoming a large
component of total power dissipation
Degrees of freedom
• Three degrees of freedom inherent in the
low-power design space:
– Supply voltage
– Physical capacitance
– Switching activity
• Optimizing power consumption invariably
involves reducing one or more of these three
parameters
• The supply voltage has the most
dominant effect on power dissipation
because of the quadratic relationship
Low-Power Design Methodology
• Low-power design methodologies are to be
applied throughout the design process from
system-level to layout-level, gradually refining or
detailing the abstract specification or model of
the design.
• Starting with the system specification the
following steps are performed to get the layout:
System Specification => System-level Design =>
Behavioral Description => High-level Synthesis =>
Structural RTL Description => Logic Synthesis =>
Logic-level netlist => Layout Synthesis =>
Layout
Power Reduction at Different Levels
• Power optimization approaches at the high level
are significant since research results indicate that
higher levels of abstraction have greater potential
for power reductions.

Supply Voltage Scaling
• Device feature size scaling
• Architectural level approaches
– Parallelism
– Pipelining
• Voltage scaling using high-level
transformations
• Dynamic voltage scaling

Supply Voltage Scaling for Low Power
• A factor of two reduction in supply voltage yields
a factor of four decrease in energy
• Theoretical lower limit of supply voltage for
CMOS circuits is 0.2V
• Unfortunately, as the supply voltage is lowered,
delay increases, leading to a dramatic reduction in
performance
[Figure: Normalized energy versus Vdd and normalized delay versus Vdd, for Vdd from 0 to 5 V]
Device Feature Size Scaling
[Figure: MOS transistor cross-section (gate, oxide, source, drain, substrate doping) with the scaled dimensions numbered 1 to 6]
1. t_ox' = t_ox / S
2. N_D' = N_D × S
3. L' = L / S
4. X_j' = X_j / S
5. W' = W / S
6. N_A' = N_A × S

Constant Field Scaling

Constant Voltage Scaling

Architecture-Level Approaches:
Parallelism
• Parallel processing can be an important
technique for reducing power consumption
in CMOS circuits.
• Key approach is to trade area for power
while maintaining the same throughput.
• In simple terms, if the supply voltage is
reduced by half, the power is reduced to
one-fourth and performance is lowered by
half.
• The loss in performance can be
compensated by parallel processing.
Parallelism
• Example: Two 16-bit registers (latches A and B)
supply two operands to a 16-bit adder.
Delay of the critical path of the adder is 10
nsec. Operating frequency = 100 MHz
[Figure: Latches A and B, clocked at f_ref, feeding a 16-bit adder]
• The estimated dynamic power of the circuit is
\[ P_{ref} = C_{ref} \cdot V_{ref}^2 \cdot f_{ref} \]

Parallelism
• Here the adder has been duplicated, but the
input registers are clocked at half the
frequency, f_ref/2. This helps to reduce the
supply voltage such that the critical path
delay is not more than 20 nsec.
[Figure: Two 16-bit adders, each fed by its own pair of input latches clocked at f_ref/2, with a MUX selecting between the two adder outputs at f_ref]
• The estimated dynamic power is
\[ P_{par} = 2.2\,C_{ref} \cdot \left( \frac{V_{ref}}{2} \right)^2 \times \frac{f_{ref}}{2} \]

Pipelining
In this realization, instead of 16-bit addition,
8-bit addition is performed in each stage. The
critical path delay through the 8-bit adder stage
is about half that of the 16-bit adder stage.
Therefore, the 8-bit adder will operate at a clock
frequency of 100 MHz with a reduced power supply
voltage of Vref/2.
[Figure: Two-stage pipelined adder clocked at f_ref: an 8-bit adder for a0-7 and b0-7 (carry-in 0) producing s0-7, and an 8-bit adder for a8-15 and b8-15 producing s8-15, separated by latches, giving s0-15]
• Estimated power is:
\[ P_{pipe} = C_{pipe} \cdot V_{pipe}^2 \cdot f_{pipe} = (1.15\,C_{ref}) \cdot \left( \frac{V_{ref}}{2} \right)^2 \cdot f_{ref} = 0.2875\,P_{ref} \approx 0.29\,P_{ref} \]

Pipelining and Parallelism
• Here, more than one parallel structure is
used and each structure is pipelined.
Both power supply and frequency of
operation are reduced to achieve substantial
overall reduction in power dissipation.
[Figure: Two pipelined 16-bit adders, each built from two 8-bit adder stages separated by latches and clocked at f_ref/2, with a MUX selecting between their s0-15 outputs at f_ref]
Estimated power
\[ P_{parpipe} = (2.5\,C_{ref}) \cdot (0.4\,V_{ref})^2 \cdot \left( \frac{f_{ref}}{2} \right) \]
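The three architectural estimates all follow the same pattern, P = C·V²·f, with the capacitance, voltage and frequency each expressed as a multiple of the reference design. The short sketch below tabulates the resulting power ratios using the scaling factors quoted on the slides; the function and dictionary names are illustrative only.

```python
def relative_power(cap_factor, volt_factor, freq_factor):
    """Power relative to the reference design, from P = C * V^2 * f."""
    return cap_factor * volt_factor ** 2 * freq_factor


# (capacitance, voltage, frequency) as multiples of C_ref, V_ref, f_ref.
architectures = {
    "reference": (1.0, 1.0, 1.0),
    "parallel": (2.2, 0.5, 0.5),
    "pipelined": (1.15, 0.5, 1.0),
    "parallel + pipelined": (2.5, 0.4, 0.5),
}

ratios = {name: relative_power(*f) for name, f in architectures.items()}
for name, ratio in ratios.items():
    print(f"{name:22s} P / P_ref = {ratio:.4f}")
```

In every case the quadratic voltage term outweighs the capacitance overhead of the extra hardware, which is why trading area for power pays off.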
Dynamic Voltage Scaling (DVS)
[Figure: Processor voltage (0 to 2.5 V) versus CPU clock frequency (74, 103, 133, 162, 192, 221), showing the destructive, operational and non-functional regions]
[Figure: Normalized energy versus workload (r) from 0.1 to 1, comparing no voltage scaling with ideal DVS]
[Figure caption fragment: The DAG after unrolling and using distributivity and constant propagation]
Dynamic Voltage Scaling (DVS)
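[Figure: Basic scheme of DVS. Tasks with arrival rates λ1, λ2, ..., λn enter a task queue through a MUX; a workload monitor derives the workload r and drives a DC/DC converter, supplied from Vfixed, that provides V(r) and f(r) to the variable voltage processor with service rate μ(r)]
[Figure: DC/DC converter to be used in DVS, built from a comparator with a reference voltage, pulse-width modulation, a switch S, and a low-pass filter producing V0 across load RL]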
Minimizing Switched Capacitance
• Hardware Software Tradeoff

• By choice of suitable data representation
(Bus Encoding)
• Two’s complement Vs Sign Magnitude
• Architectural optimization
• Logic styles

Hardware Software Tradeoff
• Same functionality can be either realized by
hardware or by software or by a combination
of both.
• Hardware-based approach:
– Faster
– Costlier
– Consumes more power
• Software-based approach:
– Cheaper
– Slower
– Consumes less power

Superscalar Versus VLIW approach
Conventional superscalar out-of-order CPUs use
hardware to create and dispatch micro-ops that can
be executed in parallel.

Superscalar Versus VLIW approach
A compiler generates long instructions having
multiple operations meant for different
functional units

Transmeta’s Crusoe Processor
• Long instruction word, called molecule,
can be 64 or 128 bits long
• A molecule can contain up to 4 RISC-like
instructions, called atoms
• All atoms get executed in parallel
• Molecules are executed in order

Code Morphing Software
The Code Morphing software mediates between
x86 software and the Crusoe processor

Code Morphing Software
• It is fundamentally a dynamic translation
system
– A program that compiles instructions for
one instruction set architecture (ISA) into
instructions for another ISA
– Here, x86 code is compiled into VLIW code
• Code Morphing s/w insulates x86 programs
from the h/w engine’s native instruction set
– The native instruction set can be changed
arbitrarily without affecting any x86 software at all
– Only Code Morphing s/w needs to be ported

Comparison of Heat Dissipation
A Pentium III processor plays a
DVD at 105°C
A Crusoe processor model TM5400
plays a DVD at 48°C

Bus Encoding
• Communicating data bits in an appropriately
coded form can reduce the switching activity
• Goals of coding
– Remove undesired correlation among information
bits, or
– Introduce controlled correlation
• Coding for reduced switching activity falls
under the second category
– Introducing sample to sample correlation such that
total number of bit transitions is reduced

One Hot Coding
• Two chips are connected using m = 2^n wires, and
an n-bit data word is encoded by placing ‘1’ on
the i-th wire, where 0 <= i <= 2^n − 1 is the binary value
corresponding to the bit pattern, and ‘0’ on the
remaining m−1 wires.
• Guarantees precisely one 0→1 and one 1→0 bit
transition when a different data word is sent
• Number of wires required increases
exponentially with the word size of the data
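A minimal sketch of one-hot bus coding, assuming the bus state is modelled as an integer whose bit i stands for wire i (the function names are illustrative). It shows the fixed cost of two transitions per change of data word and the exponential growth in wire count.

```python
def one_hot_encode(value, n_bits):
    """Return the 2**n_bits-wire bus state with only wire `value` set to 1."""
    if not 0 <= value < 2 ** n_bits:
        raise ValueError("value does not fit in n_bits")
    return 1 << value


def transitions(prev_state, next_state):
    """Number of wires that toggle between two bus states."""
    return bin(prev_state ^ next_state).count("1")


n_bits = 3
a = one_hot_encode(3, n_bits)
b = one_hot_encode(5, n_bits)
print(f"wires needed: {2 ** n_bits}")
print(f"transitions 3 -> 5: {transitions(a, b)}")
print(f"transitions 5 -> 5: {transitions(b, b)}")
```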

Gray Coding Vs Binary Coding
• A Gray code sequence is a set of numbers in
which adjacent numbers differ in only one bit

Gray Coding Vs Binary Coding
• It is useful when the data is sequential
and highly correlated, like instruction
addresses
• The number of bit transitions is limited to 1 for
sequential data
• For random data, the numbers of transitions
for binary and Gray code are
approximately equal
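The difference between binary and Gray coding on a purely sequential address stream can be counted directly. The sketch below uses the standard binary-to-Gray conversion (value XOR value shifted right by one); the 8-bit address range is an assumed example.

```python
def to_gray(value):
    """Convert a binary number to its reflected Gray code."""
    return value ^ (value >> 1)


def count_transitions(words):
    """Total number of bit flips over consecutive words of a sequence."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


addresses = list(range(256))  # assumed example: 8-bit sequential addresses
binary_total = count_transitions(addresses)
gray_total = count_transitions([to_gray(a) for a in addresses])
print(f"binary: {binary_total} transitions, {binary_total / 255:.2f} per step")
print(f"gray:   {gray_total} transitions, {gray_total / 255:.2f} per step")
```

For consecutive addresses the Gray-coded stream flips exactly one bit per step, while the binary stream averages close to two.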

Comparison of the temporal activity
Bit transitions per instruction executed

Temporal Transition Activity Comparison for Instruction Addresses
(bit transitions per instruction executed)
Benchmark   Binary Coded  Gray Coded
Chat        2.43          1.54
Browse      2.51          1.64
Boyer       2.76          2.09
Nand        2.42          1.57
Semigroup   2.68          1.99
Circuit     2.33          1.47
Reducer     2.57          1.71
Qsort       2.64          1.33
Fastqueens  2.46          1.03

Temporal Transition Activity Comparison for Data Addresses
(bit transitions per instruction executed)
Benchmark   Binary Coded  Gray Coded
Chat        1.32          1.2
Browse      1.32          1.4
Boyer       1.76          1.72
Nand        1.25          1.16
Semigroup   1.38          1.34
Circuit     1.33          1.18
Reducer     1.47          1.4
Qsort       1.32          1.25
Fastqueens  0.85          0.91
Bus Inversion Coding
• It is a redundant coding scheme where m = n+1
• If the i-th data word is Si, then either Si or ~Si is
transmitted, depending on which would result
in fewer bit transitions
• An extra bit P encodes the polarity of the data
word
• The coding technique works better for smaller
values of n
– For n=2, switching activity reduction is 25%
– For n=32, switching activity reduction is 11%

• For larger value of n, the n-bit data bus
can be divided into smaller groups and
each group is coded independently by
associating polarity bit with each of
the group
• When a 32-bit bus is divided into 4
groups of 8-bits, it gives 18.3%
reduction in switching activity as
opposed to 11%
Predicted Reduction in Switching Activity
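The bus-inversion rule can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the bus and the polarity line P start at zero, and it counts the toggle of P itself when choosing between Si and ~Si.

```python
def popcount(x):
    return bin(x).count("1")


def bus_invert_encode(words, n_bits):
    """Return a list of (bus_lines, polarity) pairs for the given data words."""
    mask = (1 << n_bits) - 1
    prev_lines, prev_p = 0, 0  # assumed initial bus state
    encoded = []
    for word in words:
        inverted = ~word & mask
        plain_cost = popcount(prev_lines ^ word) + (prev_p ^ 0)
        inverted_cost = popcount(prev_lines ^ inverted) + (prev_p ^ 1)
        if inverted_cost < plain_cost:
            prev_lines, prev_p = inverted, 1
        else:
            prev_lines, prev_p = word, 0
        encoded.append((prev_lines, prev_p))
    return encoded


def bus_invert_decode(encoded, n_bits):
    """Recover the data words: invert the bus lines whenever P is 1."""
    mask = (1 << n_bits) - 1
    return [(~lines & mask) if p else lines for lines, p in encoded]


words = [0x00, 0xFF, 0x0F, 0xF0]
encoded = bus_invert_encode(words, 8)
print(encoded)
print(bus_invert_decode(encoded, 8) == words)
```

For this word sequence the unencoded bus would toggle 20 lines in total, whereas the encoded bus, including the polarity line, toggles 6.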

T0 Encoding
• Gray coding provides an asymptotic best
performance of a single transition for each address
generated when infinite streams of consecutive
addresses are considered.
• However, the code is optimum only in the class of
irredundant codes, i.e. codes that employ exactly n-bit
patterns to encode a maximum of 2^n words.
• By adding some redundancy to the code, better
performance can be achieved by adopting the T0
encoding scheme, which requires a redundant line INC.
• The T0 code provides a zero-transition property for
infinite streams of consecutive addresses.
Encoding:
\[ (B(t), INC(t)) = \begin{cases} (B(t-1),\ 1) & \text{if } t > 0 \text{ and } b(t) = b(t-1) + S \\ (b(t),\ 0) & \text{otherwise} \end{cases} \]
Decoding:
\[ b(t) = \begin{cases} b(t-1) + S & \text{if } INC = 1 \text{ and } t > 0 \\ B(t) & \text{if } INC = 0 \end{cases} \]
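The T0 encoding and decoding rules translate almost line for line into code. The sketch below assumes that b(t) is the address to be sent, B(t) is the value placed on the bus lines, and S is the fixed stride between consecutive addresses; these readings of the symbols, and the function names, are assumptions made for illustration.

```python
def t0_encode(addresses, stride=1):
    """Return (bus_value, inc) pairs; the bus is frozen while addresses are consecutive."""
    encoded = []
    prev_addr = None
    prev_bus = None
    for addr in addresses:
        if prev_addr is not None and addr == prev_addr + stride:
            bus, inc = prev_bus, 1  # B(t) = B(t-1), INC = 1
        else:
            bus, inc = addr, 0      # B(t) = b(t), INC = 0
        encoded.append((bus, inc))
        prev_addr, prev_bus = addr, bus
    return encoded


def t0_decode(encoded, stride=1):
    """Rebuild the address stream from (bus_value, inc) pairs."""
    decoded = []
    for bus, inc in encoded:
        if inc == 1 and decoded:
            decoded.append(decoded[-1] + stride)
        else:
            decoded.append(bus)
    return decoded


stream = [100, 101, 102, 103, 200, 201, 50]
encoded = t0_encode(stream)
print(encoded)
print(t0_decode(encoded) == stream)
```

During a run of consecutive addresses the bus lines do not change at all; only the INC line toggles at the start and end of the run.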
Reducing Glitching Activity
• Static designs can exhibit spurious
transitions due to finite propagation delays
from one logic block to the next
• “Extra” transitions can be minimized by
– Balancing all signal paths
– Reducing logic depth
Figure: Reducing the glitching activity

Logic Styles for
High Performance and Low Power
• Potential Logic Styles
– Static CMOS Logic
– Dynamic CMOS Logic
– Pass-Transistor Logic (PTL)
• Experimental Results
• Conclusions

Static CMOS Logic
[Figure: Static CMOS gate with a pull-up network between VDD and output f, a pull-down network between f and VSS, and load CL]
• Advantages
– Ease of fabrication
– Good noise margin
– Robust
– Lower switching activity
– Good input/output decoupling
– No charge sharing problem
– Availability of matured logic synthesis tools and techniques
• Disadvantages
– Larger number of transistors (larger chip area and delay)
– Spurious transitions (glitch) due to finite propagation delays, leading to extra power dissipation and incorrect operation
– Short circuit power dissipation
– Weak output driving capability
– Large number of standard cells requiring substantial engineering effort for technology mapping
Dynamic CMOS Logic
[Figure: Dynamic CMOS gate with clocked precharge and evaluation transistors (clock φ) around a pull-down network driving CL. During precharge (φ=0) the output f is H; during evaluation (φ=1) f is H or L]
• Advantages
– Combines the advantages of low power of static CMOS and low chip area of pseudo-nMOS
– Reduced number of transistors compared to static CMOS (n+2 versus 2n)
– Faster than static CMOS logic
– No short circuit power dissipation
– No spurious transition and glitching power dissipation
• Disadvantages
– Higher switching activity
– Not as robust as static CMOS logic
– Clock skew problem in cascaded realization
– Suffers from charge sharing problem
– Matured synthesis tools are not available
Pass-Transistor Logic
• Advantages
– Lower area due to smaller number of transistors and smaller input loads
– Ratio-less PTL allows minimum dimension transistors and hence makes area-efficient circuit realization
– No short circuit current, leading to lower power dissipation
• Disadvantages
– Increased delay due to long chain of pass-transistors
– Multi-threshold voltage drop
– Dual-rail logic to provide all signals in complementary form
– There is possibility of sneak path
Problems in PTL Synthesis
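• Multi-threshold voltage drop
[Figure: Cascaded pass transistors driven from Vdd; the signal level falls from Vdd to Vdd-Vt, Vdd-2Vt, ..., Vdd-mVt]
• Sneak path
[Figure: Output node f driven through pass transistors from both logic 1 and logic 0, and from other pass logic]
• Long chain of pass transistors
[Figure: Chain of n pass transistors T1, T2, T3, ..., Tn driving load CL]
\[ T = 0.69 \cdot \frac{n(n+1)}{2} \cdot R \cdot C_L \]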
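The delay expression for a chain of n pass transistors grows quadratically with n, which is why long chains are a problem in PTL synthesis. The sketch below evaluates it for a few chain lengths; the per-transistor resistance (10 kΩ) and capacitance (10 fF) are assumed example values.

```python
def pass_chain_delay(n, r, c_load):
    """T = 0.69 * n*(n+1)/2 * R * C_L for a chain of n pass transistors."""
    return 0.69 * r * c_load * n * (n + 1) / 2


# Assumed example values: R = 10 kOhm, C_L = 10 fF.
for n in (1, 2, 4, 8):
    delay = pass_chain_delay(n, 10e3, 10e-15)
    print(f"n = {n}: T = {delay * 1e12:.1f} ps")
```

Doubling the chain length from 4 to 8 transistors raises the delay by a factor of 3.6, so splitting a long chain and inserting a buffer can reduce total delay.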
Experimental Results
• Static CMOS circuits have been realized using
the Berkeley SIS tool (script.rugged to optimize the netlist
and technology mapping with 44-2.genlib and the option
of minimum area)
• A large number of benchmark circuits are realized
using the three logic styles with C/C++ programming
on a Sun system
• Requirements of area are approximated with the
number of transistors
• Estimation models for calculating delay and switching
power dissipation for the circuits with three different
logic styles have been proposed and their accuracies
are verified with Spice and Design Analyzer in
Cadence
• MOSFET parameters are used from 0.18 μm process
technology
            Static CMOS circuits        Dynamic CMOS circuits       PTL circuits
Benchmark   Area     Delay   Power      Area     Delay   Power      Area     Delay   Power
            (#Tr.)   (ns)    (mW)       (#Tr.)   (ns)    (mW)       (#Tr.)   (ns)    (mW)
C432        692      3.32    122        581      1.82    88         546      2.08    91
C499        1880     2.23    367        1506     1.69    248        1428     1.62    167
C880        1412     2.21    293        1249     1.79    166        988      1.18    267
C1355       1880     2.61    400        1603     1.51    280        1203     1.04    379
C1908       1756     2.91    367        1689     1.80    251        1088     1.57    298
C2670       1804     2.94    493        1584     1.74    395        1010     1.54    449
C3540       4214     4.57    409        2815     2.63    314        2782     2.58    294
C5315       7058     3.65    830        5970     2.69    515        5364     1.62    778
C6288       11222    11.84   409        8716     4.64    504        6060     4.69    445
C7552       8214     2.99    1604       7328     1.98    1173       5682     1.66    1328
Average % reduction compared
to static CMOS circuits                 -16%     -37%    -25%       -33%     -47%    -17%
(Area is given as the number of transistors, #Tr.)

Comparison of area in terms of the # of Transistors
[Figure: Bar chart of #Transistor (0 to 12000) for static CMOS, dynamic CMOS and PTL realizations of benchmarks C432 to C7552]
Comparison of Delay
[Figure: Bar chart of delay in ns (0 to 14) for static CMOS, dynamic CMOS and PTL realizations of benchmarks C432 to C7552]
Comparison of Energy requirement
[Figure: Bar chart of switching energy in fJ (0 to 6000) for static CMOS, dynamic CMOS and PTL realizations of benchmarks C432 to C7552]

Leakage Power Limits Vt Scaling

Threshold Voltage Scaling
• Scale down the threshold voltage
– As Vth is reduced, the subthreshold
leakage current increases leading to
increase in power dissipation
[Figure: Normalized delay and normalized subthreshold leakage current Isub (log scale) versus threshold voltage Vth, 0.0 to 0.8 V]
Threshold Voltage (VT) Scaling
Scale down the threshold voltage for low
voltage low power circuits to increase
performance
VT ↓ = Delay ↓ + Ileakage ↑
Low -VT : Provides high performance
VT ↑ = Delay ↑ + Ileakage ↓
High -VT : Reduces subthreshold leakage
0.2VDD ≤ VT ≤ 0.5VDD
Threshold Voltage Scaling
• Fabrication of multiple threshold voltages:
•Multiple channel doping
•Multiple Oxide thickness
•Multiple channel length
•Multiple body bias
•Various Approaches:
•Variable-threshold-voltage CMOS
(VTCMOS) approach
•Multi-threshold-voltage CMOS
(MTCMOS) approach
•Dual-Vt assignment approach

VTCMOS Approach
[Figure: Cross-section of a typical n-well CMOS structure on a p substrate]

Multi-threshold-voltage CMOS (MTCMOS)
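• MTCMOS (Multi-threshold CMOS)
(S. Mutoh et al. 1996)
[Figure: MTCMOS circuit scheme. A circuit with low-Vth transistors sits between the virtual supply line VDDV and the virtual ground line GNDV, which are connected to the supply rails through transistors Q1 and Q2 controlled by the sleep signal SL]
Q1, Q2 = Sleep control transistors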
MTCMOS Performance
• Simulation results
[Figure: Normalized delay and normalized energy versus supply voltage (0.5 to 2.0 V) for conventional CMOS (full H-Vth), conventional CMOS (full L-Vth) and MTCMOS]

Advantages and Limitations of MTCMOS
• MTCMOS can be easily implemented using
existing circuits
• MTCMOS reduces only the standby power
• Large inserted MOSFETs will increase
area and delay
• Extra Vth memory circuit is needed to
maintain the data in the standby mode

Dual-Vth Assignment
Reference: L. Wei, Z. Chen, M. Johnson, K. Roy, and V. De,
Design and Optimization of Low Voltage High Performance
Dual Threshold CMOS Circuits, IEEE/ACM Proc. of DAC-1998

Dual Threshold CMOS Technology
•Low-Vth transistors in critical path for high
performance
•Some high-Vth transistors in non-critical paths
to reduce leakage
[Figure sequence: Example circuit with gates a to h, marking nodes in the critical path (low-Vth) and nodes with high-Vth. Successive panels show:
– Darker gates on the critical path
– HighVt = 0.25 assigned to all gates in the off-critical path
– HighVt = 0.396 assigned to some gates in the off-critical path
– HighVt = 0.46 assigned to some gates in the off-critical path]

Dual-Vth Assignment Problem
• However, not all the transistors in
non-critical paths can be assigned
a high-Vth
• How to selectively assign dual Vth to
achieve the best leakage saving under a
performance constraint?
[Figure: Standby leakage power versus Vth2 (V), 0.15 to 0.55 V, for Vdd = 1V, Leff = 0.32u, Wpeff = 10.5u, Wneff = 3u, Tox = 9.8nm]
Solution of Dual-Vth Assignment (Contd.)
Represent the circuit as a DAG, from primary inputs (PI)
to primary outputs (PO), where each node represents a gate
and each edge represents a connection
1. Initialize all nodes with low-Vth.
2. Compute the critical path(s)
3. Using BFS traversal, assign high-Vth to a node
such that it does not alter the critical path
4. Optimal high-Vth calculation
Repeat the assignment with different high-Vth values
(0.2Vdd < high-Vth < 0.5Vdd, Vdd = 1V) and select the one for
which the maximum number of node assignments, and hence
minimum leakage power, is possible
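Steps 1 to 3 can be sketched as a greedy pass over the gate-level DAG. This is only an illustration under stated assumptions, not the published algorithm: each gate is given one hypothetical delay for low-Vth and one for high-Vth, the breadth-first visit is done with Kahn's algorithm and a FIFO queue, and a high-Vth assignment is kept only if the longest path does not exceed the all-low-Vth critical delay. Step 4 would repeat this for several high-Vth values.

```python
from collections import deque


def topo_order(fanout):
    """Breadth-first topological order of the gates, starting at the primary inputs."""
    indegree = {g: 0 for g in fanout}
    for outs in fanout.values():
        for o in outs:
            indegree[o] += 1
    queue = deque(g for g, d in indegree.items() if d == 0)
    order = []
    while queue:
        g = queue.popleft()
        order.append(g)
        for o in fanout[g]:
            indegree[o] -= 1
            if indegree[o] == 0:
                queue.append(o)
    return order


def circuit_delay(fanout, gate_delay):
    """Longest-path delay through the DAG."""
    arrival = dict(gate_delay)
    for g in topo_order(fanout):
        for o in fanout[g]:
            arrival[o] = max(arrival[o], arrival[g] + gate_delay[o])
    return max(arrival.values())


def assign_high_vth(fanout, d_low, d_high):
    vth = {g: "low" for g in fanout}              # step 1
    target = circuit_delay(fanout, d_low)         # step 2
    for g in topo_order(fanout):                  # step 3
        vth[g] = "high"
        delays = {x: d_high[x] if vth[x] == "high" else d_low[x] for x in fanout}
        if circuit_delay(fanout, delays) > target + 1e-12:
            vth[g] = "low"                        # would lengthen the critical path
    return vth


# Hypothetical five-gate circuit and delays (arbitrary units).
fanout = {"a": ["c"], "b": ["c", "d"], "c": ["e"], "d": ["e"], "e": []}
d_low = {"a": 1.0, "b": 1.0, "c": 2.0, "d": 1.0, "e": 1.0}
d_high = {g: d + 0.5 for g, d in d_low.items()}
result = assign_high_vth(fanout, d_low, d_high)
print(result)
```

In this small example only gate d lies entirely off the critical paths with enough slack, so it is the only gate that receives the high threshold.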
Optimal Dual-Vth Assignment: Another Approach
• N. Tripathy, A. Bhosle, D. Samanta and A. Pal,
“Optimal Assignment of High Threshold
Voltage for Synthesizing Dual Threshold CMOS
Circuits”, Proc. VLSI Design 2001, pp. 227-232,
Bangalore, January 2001.

Delay-Constrained Dual-VT
Static CMOS Circuits
Critical path
Assigned with high-VT transistors

Delay-Constrained Dual-VT Assignment
Model the circuit as a DAG, from primary inputs (PI)
to primary outputs (PO), where each node represents a
gate and each edge represents a connection
Algorithm
1. Assume low-VT < high-VT < 0.5VDD
2. Initialize all nodes with high-VT
3. Compute the critical path(s)
4. Using DFS traversal, assign low-VT
to a node on the critical path
5. Go to Step 3 until all the nodes on
the critical path are assigned with
low-VT
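The loop in steps 2 to 5 can be sketched as follows. This is an illustrative simplification rather than the published method: per-gate delays for low-VT and high-VT are hypothetical inputs, the critical path is recovered by back-tracing the latest-arriving predecessor instead of an explicit DFS, and all high-VT gates found on the current critical path are switched to low-VT in one iteration.

```python
def critical_path(fanin, order, delay):
    """Return (delay, gates) of the longest path; `order` must be topological."""
    arrival, pred = {}, {}
    for g in order:
        best = max(fanin[g], key=lambda p: arrival[p], default=None)
        arrival[g] = delay[g] + (arrival[best] if best is not None else 0.0)
        pred[g] = best
    g = max(order, key=lambda x: arrival[x])
    total = arrival[g]
    path = []
    while g is not None:
        path.append(g)
        g = pred[g]
    return total, path[::-1]


def assign_low_on_critical(fanin, order, d_low, d_high):
    vth = {g: "high" for g in order}                      # step 2
    while True:
        delay = {g: d_high[g] if vth[g] == "high" else d_low[g] for g in order}
        _, path = critical_path(fanin, order, delay)      # step 3
        high_on_path = [g for g in path if vth[g] == "high"]
        if not high_on_path:                              # step 5 stop condition
            return vth
        for g in high_on_path:                            # step 4
            vth[g] = "low"


# Hypothetical five-gate circuit and delays (arbitrary units).
fanin = {"a": [], "b": [], "c": ["a", "b"], "d": ["b"], "e": ["c", "d"]}
order = ["a", "b", "c", "d", "e"]
d_low = {"a": 1.0, "b": 1.0, "c": 2.0, "d": 1.0, "e": 1.0}
d_high = {g: d + 0.5 for g, d in d_low.items()}
result = assign_low_on_critical(fanin, order, d_low, d_high)
final_delay, _ = critical_path(
    fanin, order, {g: d_high[g] if result[g] == "high" else d_low[g] for g in order})
print(result, final_delay)
```

Each iteration converts at least one gate to low-VT, so the loop terminates; gates that never appear on a critical path keep the high threshold.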
Delay-Constrained Dual-VT Assignment
Repeat the assignment with different high-VT values
(0.2VDD < high-VT < 0.5VDD) to find the one for which
the maximum number of node assignments, and hence
minimum leakage power, is possible
[Figure: Leakage power (μW) versus VT in volt (0.15 to 0.55), with the optimal high-VT marked at the minimum]
Comparison of our results with [Wei+99]
            With approach [Wei+99]                             Our approach
Benchmark   #Transistor  %Redn in   %Redn in  CPU       #Transistor  %Redn in   %Redn in  CPU
                         standby    total     time (s)               standby    total     time (s)
                         leakage    power                            leakage    power
                         power                                       power
C432        278          59.65      16.52     20        348          87.35      26.17     36
C499        604          51.09      7.18      118       796          64.45      12.68     174
C880        1126         84.87      14.07     55        1208         88.65      19.41     89
C1355       1232         49.36      8.51      198       1346         59.95      13.15     346
C1908       1430         76.21      15.46     225       1684         83.45      21.75     412
C2670       2736         81.24      19.27     269       3092         92.96      24.80     485
C3540       3430         85.60      21.43     301       3698         90.42      32.29     541
C5315       5432         83.12      18.44     342       5516         89.69      31.05     619
C6288       5768         43.38      19.89     564       8950         83.69      45.42     890
C7552       7102         76.41      20.35     387       7786         87.65      22.36     609
Average                  69.01%     16.11%                           82.82%     24.92%

ISCAS Benchmarks Results
[Figure: Leakage power in μW (0 to 250) for single low Vth versus dual Vth on the ISCAS benchmarks]
Average 85.84% savings in leakage power
compared to 60.91% of the earlier result
Comparison
More reduction in leakage power
(Average 25% more reduction in leakage power)
[Figure: Leakage power in μW (0 to 250) for single low Vth, dual Vth (old) and dual Vth (new)]

Comparison
Larger number of transistor assignments
[Figure: Number of transistors assigned (0 to 10000) with the old approach and the new approach]

Comparison
Higher time complexity
[Figure: CPU time in sec. (0 to 2000) for the old approach and the new approach]
List of publications
1. Debasis Samanta, Ajit Pal, Synthesis of Low Power High
Performance Dual-VT PTL Circuits, Proc. 17th International
Conference on VLSI Design, 2004, pp.85-90, Mumbai, January
2004
2. D. Samanta, Ajit Pal, Logic Styles for High Performance and Low
Power, Proceedings of the 12th International Workshop on Logic
and Synthesis, 2003 (IWLS-2003), pp. 355-362, May, 2003
3. D. Samanta, M. C. Dharmadeep, and Ajit Pal, Synthesis of High
Performance Low Power PTL Circuits, Proc. ASP-DAC 2003,
Kitakyusyu, Japan, pp. 209-212, January 2003.
4. D. Samanta, and A. Pal, Synthesis of Dual-VT Dynamic CMOS
Circuits, Proc. VLSI Design 2003, New Delhi, India, pp. 121-128,
January 2003.
5. D. Samanta, N. Sinha, and A. Pal, Synthesis of High Performance
Low Power Dynamic CMOS Circuits, Proc. ASP-DAC/VLSI Design
2002, Bangalore, India, pp. 99-104, January 2002.
6. D. Samanta, and A. Pal, Optimal Dual-VT Assignment for Low-
Voltage Energy-Constraint CMOS Circuits, Proc. ASP-DAC/VLSI
Design 2002, Bangalore, India, pp. 193-198, January 2002.
7. N. Tripathi, A. Bhosle, D. Samanta, and Ajit Pal, Optimal
Assignment of High-VT for Synthesizing Dual-VT CMOS Circuits,
Proc. VLSI Design 2001, Bangalore, India, pp. 227-232, January
2001.
Thanks!
