
Micro transductors ’08

Low Power VLSI Design 2
Dr.-Ing. Frank Sill
Department of Electrical Engineering, Federal University of Minas Gerais,

Av. Antônio Carlos 6627, CEP: 31270-010, Belo Horizonte (MG), Brazil

franksill@ufmg.br http://www.cpdee.ufmg.br/~frank/

Agenda
• Recap
• Power reduction on
  - Gate level
  - Architecture level
  - Algorithm level
  - System level

Copyright Sill, 2008

Micro transductors ’08, Low Power 2

Recap: Problems of Power Dissipation

Continuously increasing performance demands

Increasing power dissipation of technical devices
Today: power dissipation is a main problem

High power dissipation leads to:
• Reduced time of operation
• Higher weight (batteries)
• Reduced mobility
• Greater cooling effort
• Increasing operational costs
• Reduced reliability


Recap: Consumption in CMOS
• Voltage (Volt, V): water pressure (bar)
• Current (Ampere, A): water quantity per second (liter/s)
• Energy: amount of water
[Figure: load capacitance CL switching between 1 and 0]
Energy consumption is proportional to capacitive load!

Recap: Energy and Power
• Power is the height of the curve
• Energy is the area under the curve
[Figure: power (Watts) over time for Approach 1 and Approach 2]
Energy = Power * time for calculation = Power * Delay

Recap: Power Equations in CMOS
$P = \alpha f C_L V_{DD}^2 + V_{DD} I_{peak} (P_{01} + P_{10}) + V_{DD} I_{leak}$
• Dynamic power, $\alpha f C_L V_{DD}^2$ (≈ 40-70 % today and decreasing relatively)
• Short-circuit power, $V_{DD} I_{peak} (P_{01} + P_{10})$ (≈ 10 % today and decreasing absolutely)
• Leakage power, $V_{DD} I_{leak}$ (≈ 20–50 % today and increasing)
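The three terms of the power equation can be evaluated separately. The following is a minimal sketch; the function name `cmos_power` and all numeric values are hypothetical and chosen only for illustration, not taken from the slides.

```python
def cmos_power(alpha, f, c_load, vdd, i_peak, p01, p10, i_leak):
    """Return (dynamic, short_circuit, leakage) power in watts.

    Follows P = alpha*f*CL*VDD^2 + VDD*Ipeak*(P01 + P10) + VDD*Ileak.
    """
    dynamic = alpha * f * c_load * vdd ** 2
    short_circuit = vdd * i_peak * (p01 + p10)
    leakage = vdd * i_leak
    return dynamic, short_circuit, leakage


# Hypothetical operating point (assumed values, not measured data)
params = dict(alpha=0.1, f=100e6, c_load=1e-9,
              i_peak=1e-3, p01=0.5, p10=0.5, i_leak=5e-3)
dyn, sc, leak = cmos_power(vdd=1.2, **params)
dyn_half, _, leak_half = cmos_power(vdd=0.6, **params)

print(f"dynamic={dyn:.4f} W, short-circuit={sc:.4f} W, leakage={leak:.4f} W")
print(f"dynamic at VDD/2: {dyn_half:.4f} W")
```

Halving VDD quarters the dynamic term but only halves the leakage term in this simple model, which is why the dynamic term is the usual first target of voltage scaling.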

Recap: Levels of Optimization

Level          Savings    Speed      Error
System         > 70 %     Seconds    > 50 %
Algorithm      40-70 %    Minute     25-50 %
Architecture   25-40 %    Minutes    15-30 %
Gate           15-25 %    Hour       10-20 %
Transistor     10-15 %    Hours      5-10 %

(after Massoud Pedram)

Recap: Logic Restructuring
• Logic restructuring: changing the topology of a logic network to reduce transitions
• AND: P01 = P0 * P1 = (1 - PA PB) * PA PB
• Chain implementation (all input probabilities 0.5): W = A AND B, X = W AND C, F = X AND D
  - P01(W) = (1 - 0.25) * 0.25 = 3/16
  - P01(X) = 7/64 = 0.109
  - P01(F) = 15/256
• Tree implementation (all input probabilities 0.5): Y = A AND B, Z = C AND D, F = Y AND Z
  - P01(Y) = 3/16
  - P01(Z) = 3/16 = 0.188
  - P01(F) = 15/256
• Chain implementation has a lower overall switching activity than tree implementation for random inputs
• BUT: Ignores glitching effects
Source: Timmermann, 2007
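The chain versus tree comparison can be checked numerically. This is a small sketch under the slide's assumptions (independent inputs, signal probability 0.5, no glitching); the helper names `and_p1` and `p01` are made up for this example.

```python
from fractions import Fraction


def and_p1(pa, pb):
    """Probability that an AND gate output is 1 for independent inputs."""
    return pa * pb


def p01(p1):
    """0 -> 1 transition probability of a node with signal probability p1."""
    return (1 - p1) * p1


half = Fraction(1, 2)

# Chain: W = A*B, X = W*C, F = X*D
w = and_p1(half, half)
x = and_p1(w, half)
f_chain = and_p1(x, half)
chain = [p01(w), p01(x), p01(f_chain)]

# Tree: Y = A*B, Z = C*D, F = Y*Z
y = and_p1(half, half)
z = and_p1(half, half)
f_tree = and_p1(y, z)
tree = [p01(y), p01(z), p01(f_tree)]

print("chain:", chain, "total", sum(chain))
print("tree: ", tree, "total", sum(tree))
```

Summing the per-node activities gives 91/256 for the chain and 111/256 for the tree, which matches the statement that the chain switches less for random inputs.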

Recap: Input Ordering
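AND: P01 = (1 - PA PB) * PA PB
• Signal probabilities: A = 0.5, B = 0.2, C = 0.1
• Ordering 1: X = A AND B, F = X AND C: P01(X) = (1 - 0.5 x 0.2) * (0.5 x 0.2) = 0.09
• Ordering 2: X = B AND C, F = X AND A: P01(X) = (1 - 0.2 x 0.1) * (0.2 x 0.1) = 0.0196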

Beneficial: postponing introduction of signals with a high transition rate (signals with signal probability close to 0.5)
Source: Irwin, 2000
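The effect of input ordering on the internal node X can be reproduced with a few lines. This is an illustrative sketch; `p01_and` is a hypothetical helper and the inputs are assumed to be independent, as on the slide.

```python
def p01_and(pa, pb):
    """0 -> 1 transition probability at the output of a 2-input AND gate."""
    p1 = pa * pb
    return (1 - p1) * p1


prob = {"A": 0.5, "B": 0.2, "C": 0.1}

# A (probability close to 0.5) introduced first: X = A AND B
x_a_first = p01_and(prob["A"], prob["B"])
# A postponed to the last gate: X = B AND C
x_a_last = p01_and(prob["B"], prob["C"])

print(f"P01(X), A first: {x_a_first:.4f}")
print(f"P01(X), A last:  {x_a_last:.4f}")
```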


Recap: Glitching
[Figure: gate network with inputs A, B, C, internal node X and output Z; waveforms of ABC, X and Z for the input change ABC = 101 to 000, assuming unit delay]
Source: Irwin, 2000

Design Layer: Gate Level
• Basic elements:
  - Logic gates
  - Sequential elements (flipflops, latches)
• Behavior of elements is described in libraries

Dynamic Power and Device Size
• Device sizing (= changing gate width)
  - Affects input capacitance Cin
  - Affects load capacitance Cload
  - Affects dynamic power consumption Pdyn
• Optimal fanout factor f for Pdyn is smaller than for performance (especially for large loads)
  - e.g., for Cload = 20, Cin = 1: fcircuit = 20, fopt_energy = 3.53, fopt_performance = 4.47
• Low power: avoid oversizing (f too big) beyond the optimal
[Figure: normalized energy versus fanout f for fcircuit = 1, 2, 5, 10, 20]
Source: Nikolic, UCB

VDD versus Delay and Power
[Figure: relative delay td and relative Pdyn versus supply voltage VDD (0.8 to 2.4)]
• Delay (td) and dynamic power consumption (Pdyn) are functions of VDD

Multiple VDD
• Main ideas:
  - Use of different supply voltages within the same design
  - High VDD for critical parts (high performance needed)
  - Low VDD for non-critical parts (only low performance demands)
• At design phase:
  - Determine critical path(s) (see the slide after next)
  - High VDD for gates on those paths
  - Lower VDD on the other gates (in non-critical paths)
  - For low VDD: prefer gates that drive large capacitances (yields the largest energy benefits)
• Usually two different VDD (but more are possible)

Multiple VDD cont’d
• Level converters:
  - Necessary when a module at the lower supply drives a gate at the higher supply (step-up)
  - If a gate supplied with VDDL drives a gate supplied with VDDH, then the PMOS never turns off
  - Possible implementation:
    - Cross-coupled PMOS transistors
    - NMOS transistors operate on reduced supply
  - No level converters needed for a step-down change in voltage
  - Reducing the overhead:
    - Conversions at register boundaries
    - Embedding of the level conversion inside the flipflop
[Figure: level converter circuit with VDDH, VDDL, Vin and Vout]

Data Paths
• Data propagate through different data paths between registers (flipflops, FF)
• Paths mostly differ in propagation delay times
• Frequency of the clock signal (CLK) depends on the path with the longest delay: the critical path
[Figure: flipflops (FF) connected by several paths, all clocked by CLK]

Data Paths: Slack
[Figure: gates G1 and G2 with inputs A, B, C and output Y; timeline marking when all inputs of G1 have arrived, the delay of G1, when G1 is ready with evaluation, when all inputs of G2 have arrived, and the resulting slack for G1]

Multiple VDD in Data Paths
• Minimum energy consumption when all logic paths are critical (same delay)
• Possible algorithm: clustered voltage scaling
  - Each path starts with VDDH and switches to VDDL (blue gates) when slack is available
  - Level conversion in flipflops at end of paths
[Figure: data paths with gates connected to VDDH and gates connected to VDDL]

Design Layer: Architecture Level
• Also known as register transfer level (RTL)
• Base elements:
  - Register structures
  - Arithmetic logic units (ALU)
  - Memory elements
• Only behavior is described (no inner structure)

Clock Gating
• Most popular method for power reduction of clock signals and functional units
  - Gate off clock to idle functional units
  - Logic for generation of disable signal necessary
    - Higher complexity of control logic
    - Higher power consumption
    - Timing is critical to avoid clock glitches at the OR gate output
    - Additional gate delay on clock signal
[Figure: clock and disable signals feed an OR gate that clocks the register (Reg) in front of the functional unit]
Source: Irwin, 2000

Clock Gating cont’d
• Clock gating in a low-power flip-flop
[Figure: flip-flop with data input D, output Q and clock CLK]
Source: Agarwal, 2007

Clock Gating cont’d
• Clock gating based on the state in finite-state machines (FSM)
[Figure: primary inputs (PI) and flip-flops feed the combinational logic, which drives the primary outputs (PO); clock activation logic and a latch gate CLK]
Source: L. Benini and G. De Micheli, Dynamic Power Management. Boston: Springer, 1998.

Clock Gating: Example
• MPEG4 decoder
  - 90% of flipflops clock-gated
  - 70% power reduction by clock gating
  - Without clock gating: 30.6 mW
  - With clock gating: 8.5 mW
[Figure: bar chart of power (mW) per block: VDE, DEU, MIF, DSP/HIF, 896Kb SRAM]
Source: M. Ohashi, Matsushita, 2002

Recap: VDD versus Delay and Power
[Figure: relative delay td and relative Pdyn versus supply voltage VDD (0.8 to 2.4)]
• Dynamic power can be traded for delay

A Reference Datapath
[Figure: Input, Register, Combinational logic (Cref), Register, Output; registers clocked by CLK]
• Supply voltage = Vref
• Total capacitance switched per cycle = Cref
• Clock frequency = fclk
• Power consumption: $P_{ref} = C_{ref} V_{ref}^2 f_{clk}$
Source: Agarwal, 2007

Parallel Architecture
• Each copy processes every Nth input and operates at reduced voltage
• Supply voltage: VN ≤ Vref
• N = degree of parallelism
[Figure: input register (fclk) feeds N copies of the combinational logic (Comb. Logic Copy 1 to Copy N), each followed by a register clocked at fclk/N; an N-to-1 multiplexer feeds the output register (fclk); a multiphase clock generator with mux control is driven by CK]
Source: Agarwal, 2007

Pipelined Architecture
• Reduces the propagation time of a block by factor N
  - Voltage can be reduced at constant clock frequency
• Constant throughput
[Figure: a block of area A between registers clocked by CLK is split into N stages of area A/N, separated by registers; timing of Data against CLK illustrates the functionality]

Parallel Architecture: Example
• Reference datapath (for example)
  - Critical path delay: Tadder + Tcomparator (= 25 ns), so fref = 40 MHz
  - Total capacitance being switched = Cref
  - VDD = Vref = 5 V
  - Power for reference datapath: $P_{ref} = C_{ref} V_{ref}^2 f_{ref}$
[Figure: datapath with inputs A and B]
Source: Irwin, 2000

Parallel Architecture: Example cont’d
• The clock rate can be reduced by half with the same throughput: fpar = fref / 2
• Vpar = Vref / 1.7, Cpar = 2.15 Cref
• $P_{par} = (2.15\, C_{ref}) (V_{ref}/1.7)^2 (f_{ref}/2) = 0.36\, P_{ref}$
• Area = 1476 x 1219 µm²
Source: Irwin, 2000

Pipelined Architecture: Example
• fpipe = fref, Cpipe = 1.1 Cref, Vpipe = Vref / 1.7
• Voltage can be dropped while maintaining the original throughput
• $P_{pipe} = C_{pipe} V_{pipe}^2 f_{pipe} = (1.1\, C_{ref}) (V_{ref}/1.7)^2 f_{ref} = 0.37\, P_{ref}$
Source: Irwin, 2000
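The parallel and pipelined examples both scale the reference power by a capacitance factor, a squared voltage factor and a frequency factor. The sketch below evaluates that product; the function name `relative_power` is hypothetical. With the voltage divisor taken as exactly 1.7, the results come out near 0.37 and 0.38, slightly above the 0.36 and 0.37 quoted on the slides, which presumably used a rounded supply voltage.

```python
def relative_power(c_factor, v_divisor, f_factor):
    """P / Pref for C = c_factor*Cref, V = Vref/v_divisor, f = f_factor*fref."""
    return c_factor * (1.0 / v_divisor) ** 2 * f_factor


# Parallel datapath: 2.15x capacitance, Vref/1.7, half the clock rate
p_par = relative_power(c_factor=2.15, v_divisor=1.7, f_factor=0.5)

# Pipelined datapath: 1.1x capacitance, Vref/1.7, same clock rate
p_pipe = relative_power(c_factor=1.1, v_divisor=1.7, f_factor=1.0)

print(f"Ppar  / Pref = {p_par:.3f}")
print(f"Ppipe / Pref = {p_pipe:.3f}")
```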

Approximate Trend

                 N-parallel proc.          N-stage pipeline proc.
Capacitance      N*Cref                    Cref
Voltage          Vref/N                    Vref/N
Frequency        fref/N                    fref
Dynamic power    Cref Vref^2 fref / N^2    Cref Vref^2 fref / N^2
Chip area        N times                   10-20% increase

Source: G. K. Yeap, Practical Low Power Digital VLSI Design. Boston: Kluwer Academic Publishers, 1998.

Guarded Evaluation
• Reduction of switching activity by adding latches at inputs
  - Latch preserves previous value of inputs to suppress activity
  - Could also use AND gates to mask inputs to zero (forced zero)
[Figure: multiplier with inputs A and B and output C, shown without and with latches at its inputs that are controlled by a condition signal]

Precomputation
• Identify logical conditions at inputs that are invariant to the output
  - Since those inputs don’t affect the output, disable input transitions
• Trade area for energy
[Figure: precomputed inputs go to register R1, gated inputs to register R2; both feed the combinational logic f(X) that produces the outputs; precomputation logic g(X) generates the load disable signal]
Source: Irwin, 2000

Precomputation: Design Issues
• Design steps
  1. Selection of precomputation architecture
  2. Determination of precomputed and gated inputs (register R1 should be much smaller than R2)
  3. Search for a good implementation of g(X)
  4. Evaluation of potential energy savings based on input statistics (if savings are not sufficient, go to step 2 or 3 and try again)
• Also works for multiple-output functions, where g(X) is the product of gj(X) over all j
Source: Irwin, 2000

Precomputation: Example
• Binary comparator
[Figure: n-bit binary value comparator computing A > B; the most significant bits An and Bn are held in R1, the remaining bits An-1, Bn-1 down to A1, B1 in R2; the load disable signal is derived from comparing An and Bn (label: An = Bn)]
• Can achieve up to 75% power reduction with 3% area overhead and 1 to 5 additional gate delays in the worst-case path
Source: Irwin, 2000
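The idea behind the comparator example can be modelled in software: when the most significant bits differ, A > B is already decided and the lower bits need not be loaded. The following is a behavioral sketch only, not the circuit from the slide; `greater_than` and its second return value (whether R2 would have to load) are assumptions made for this illustration, and operands are treated as unsigned.

```python
def greater_than(a, b, n):
    """Unsigned n-bit A > B with precomputation on the most significant bit.

    Returns (result, r2_loaded): r2_loaded is False when the MSBs alone
    decide the comparison, so the lower-bit register could stay disabled.
    """
    msb_a = (a >> (n - 1)) & 1
    msb_b = (b >> (n - 1)) & 1
    if msb_a != msb_b:
        # MSBs differ: the result is known without the lower bits
        return msb_a > msb_b, False
    # MSBs equal: the lower n-1 bits decide
    mask = (1 << (n - 1)) - 1
    return (a & mask) > (b & mask), True


# Exhaustive check for 4-bit operands
n = 4
pairs = [(a, b) for a in range(1 << n) for b in range(1 << n)]
results = [greater_than(a, b, n) for a, b in pairs]
r2_load_fraction = sum(loaded for _, loaded in results) / len(results)
print(f"R2 has to load in {r2_load_fraction:.0%} of all input pairs")
```

For uniformly distributed inputs the lower-bit register is needed in only half of the cases; the actual power savings depend on the input statistics, as noted in the design steps.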

Adder Design
• Various algorithms exist to implement an integer adder
  - Ripple, select, skip (x2), look-ahead, conditional-sum
  - Each with its own characteristics of timing and power consumption
[Figure: full-adder (FA) block diagrams of ripple carry, carry select, variable/fixed width carry skip, and carry look-ahead adders]
Source: Mendelson, Intel

Adder Design

Adder                        Energy (pJ)   Delay (ns)
Ripple Carry                 117           54.27
Constant Width Carry Skip    109           28.38
Variable Width Carry Skip    126           21.84
Carry Lookahead              171           17.13
Carry Select                 216           19.56
Conditional Sum              304           20.05

• Adders differ in energy and delay
• Different adders for different applications
• Also true for other units (multiplier, counter, …)
Source: Callaway, Swartzlander, “Estimating the power consumption of CMOS adders”, Proceedings, 11th Symposium on Computer Arithmetic, 1993.

Bus Power
• Buses are a significant source of power dissipation
  - 50% of dynamic power for interconnect switching (Magen, SLIP 04)
  - MIT Raw processor’s on-chip network consumes 36% of total chip power (Wang et al., 2003)
• Caused by:
  - High switching activities
  - Large capacitive loading
[Figure: bus drivers (Ain, Bin, Cin, Din) and bus receivers (Wout, Xout, Yout, Zout) attached to a shared bus]
Source: Irwin, 2000

Bus Power Reduction
• For an n-bit bus: $P_{bus} = n \cdot \alpha f_{clk} C_{load} V_{DD}^2$
• Alternative bus structures
  - Segmented buses (lower Cload)
  - Charge recovery buses
  - Bus multiplexing (lower fclk possible)
• Minimizing bus traffic (n)
  - Code compression
  - Instruction loop buffers
• Minimization of bit switching activity (fclk) by data encoding
• Minimize voltage swing (VDD^2) using differential signaling
Source: Irwin, 2000

Reducing Shared Resources
• Shared resources incur switching overhead
• Local bus structures reduce overhead
[Figure: global bus architecture versus local bus architecture]
Source: Irwin, 2000

Reducing Shared Resources cont’d
• Bus segmentation
  - Another way to reduce shared buses
  - Control of bus segments by controller blocks (B)
[Figure: shared bus versus segmented bus with controller blocks B]
Source: Evgeny Bolotin – Jan 2004

Design Layer: Algorithm Level
• Base elements:
  - Functions
  - Procedures
  - Processes
  - Control structures
• Description of design behavior

Coding styles
• Use processor-specific instruction style:
  - Variable types
  - Function call style
  - Conditionalized instructions (for ARM)
• Follow general guidelines for software coding
  - Use table look-up instead of conditionals
  - Make local copies of global variables so that they can be assigned to registers
  - Avoid multiple memory look-ups with pointer chains

Source-code Transformations
• Minimize power-consuming activity:
  - Computation
      before: A*B+A*C
      after:  A*(B+C)
  - Communication
      before: for (c = 1..N) { receive (A); B = c*A }
      after:  receive (A); for (c = 1..N) B = c*A
  - Storage
      before: for (c = 1..N) B[c] = A[c]*D[c]
              for (c = 1..N) F[c] = B[c]-1
      after:  for (c = 1..N) F[c] = A[c]*D[c]-1
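The computation and storage transformations can be written out as runnable code to confirm that they preserve the result. This is an illustrative sketch; the function names and test data are made up, and Python is used only to show equivalence, not to measure power.

```python
def two_pass(a, d):
    """Storage-heavy version: the intermediate array B is written and read back."""
    b = [x * y for x, y in zip(a, d)]
    return [v - 1 for v in b]


def fused(a, d):
    """Transformed version: one loop, no intermediate array."""
    return [x * y - 1 for x, y in zip(a, d)]


def factored(a, b, c):
    """A*B + A*C rewritten as A*(B+C): one multiplication instead of two."""
    return a * (b + c)


# Hypothetical input data
a_vals = [1, 2, 3, 4]
d_vals = [5, 6, 7, 8]
print(two_pass(a_vals, d_vals), fused(a_vals, d_vals))
print(3 * 4 + 3 * 5, factored(3, 4, 5))
```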

Datapath Energy Consumption
[Figure: switched capacitance (nF, 0 to 14000) for bubble.c, heap.c and quick.c, broken down into Functional Unit, Pipeline Registers, Register File and Others]
• Algorithms can differ in power dissipation
Source: Irwin, 2000

Adaptive Dynamic Voltage Scaling (DVS)
• Slow down processor to fill idle time
• More delay allows a lower operational voltage
[Figure: alternating Active and Idle phases at 3.3 V versus continuously Active operation at 2.4 V]
• Runtime scheduler determines processor speed and selects appropriate voltage
• Transition delay for frequency changes: ~150 µs
• Potential to realize 10x energy savings

Adaptive DVS: Example
• Task with 100 ms deadline, requires 50 ms CPU time at full speed
  - Normal system gives 50 ms computation, 50 ms idle/stopped time
  - Half speed/voltage system gives 100 ms computation, 0 ms idle
  - Same number of CPU cycles but: $E = C (V_{DD}/2)^2 = E_{ref} / 4$
• Dynamic voltage scaling adapts voltage to workload
[Figure: speed over time for tasks T1 and T2: task and idle phases at full speed versus the same work at lower speed and lower energy]
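The energy argument of the DVS example can be restated in a few lines. The sketch below assumes, as the slide does, that the supply voltage scales in proportion to the speed; the function names, the capacitance value and the 3.3 V reference are hypothetical.

```python
def required_speed(cpu_time_full_speed_ms, deadline_ms):
    """Fraction of full speed that just meets the deadline."""
    return cpu_time_full_speed_ms / deadline_ms


def task_energy(cycles, c_eff, vdd):
    """Dynamic energy for a fixed number of cycles: cycles * C * VDD^2."""
    return cycles * c_eff * vdd ** 2


speed = required_speed(cpu_time_full_speed_ms=50, deadline_ms=100)

# Assumed values for illustration only
cycles = 5_000_000
c_eff = 1e-9
vdd_ref = 3.3

e_ref = task_energy(cycles, c_eff, vdd_ref)
e_dvs = task_energy(cycles, c_eff, vdd_ref * speed)  # voltage scaled with speed

print(f"speed = {speed:.2f} of full speed")
print(f"E_dvs / E_ref = {e_dvs / e_ref:.2f}")
```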

Design Layer: System Level
• Basic elements:
  - Complex modules
  - Processors
  - Calculation and control units
  - Sensors
[Figure: system with blocks MEM, MEM, ALU, MP3]

Dynamic Power Management
• Systems are:
  - Designed to deliver peak performance, but …
  - Not needing peak performance most of the time
• Components are idle sometimes
• Dynamic power management (DPM):
  - Puts idle components in low-power non-operational states
• Power manager:
  - Observes and controls the system
  - Power consumption of power manager is negligible

Processor Sleep Modes
• Software power control (power management)
  - DOZE: most units stopped except on-chip cache memory (cache coherency)
  - NAP: cache also turned off, PLL still on, time out or external interrupt to resume
  - SLEEP: PLL off, external interrupt to resume
• Deeper sleep mode consumes less power
• Deeper sleep mode requires more latency to resume

Processor Sleep Modes: Example
• PowerPC sleep modes

Mode                   66 MHz    80 MHz
No power mgmt          2.18 W    2.54 W
Dynamic power mgmt     1.89 W    2.20 W
DOZE                   307 mW    366 mW
NAP                    113 mW    135 mW
SLEEP                  89 mW     105 mW
SLEEP without PLL      18 mW     19 mW
SLEEP without clock    2 mW      2 mW

• 10 cycles to wake up from SLEEP
• 100 µs to wake up from SLEEP+
Source: Irwin, 2000

Transmeta LongRun
• Applies adaptive DVS
• LongRun policies:
  - Detection of different workload scenarios
  - Based on runtime performance information
• After detection, corresponding adaptation of:
  - Processor supply voltage
  - Processor frequency
• Clock frequency always within limits required by supply voltage to avoid clock skew problems
• Use of hard-coded core frequency/voltage operating points
• Best trade-off between performance and power possible

Transmeta LongRun cont’d
[Figure: % of max power consumption (0 to 100) versus frequency (300 to 1000 MHz), with a typical operating region and a peak performance region]

Frequency    Voltage
300 MHz      0.80 V
433 MHz      0.87 V
533 MHz      0.95 V
667 MHz      1.05 V
800 MHz      1.15 V
900 MHz      1.25 V
1000 MHz     1.30 V

Source: Transmeta
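A rough feel for the operating points can be obtained by scaling dynamic power with f * V^2. The sketch below is an estimate under that assumption only: leakage and other effects are ignored, so the percentages are not expected to match the plotted curve exactly. The function name is hypothetical; the frequency/voltage pairs are the ones listed on the slide.

```python
# (frequency in MHz, supply voltage in V) operating points from the slide
OPERATING_POINTS = [
    (300, 0.80), (433, 0.87), (533, 0.95), (667, 1.05),
    (800, 1.15), (900, 1.25), (1000, 1.30),
]


def relative_dynamic_power(points):
    """Dynamic power of each point relative to the fastest one, assuming P ~ f * V^2."""
    f_max, v_max = points[-1]
    reference = f_max * v_max ** 2
    return [(f, v, f * v ** 2 / reference) for f, v in points]


relative = relative_dynamic_power(OPERATING_POINTS)
for f, v, rel in relative:
    print(f"{f:5d} MHz @ {v:.2f} V: {rel:6.1%} of peak dynamic power")
```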

Transmeta LongRun: Example
Source: Transmeta

Battery aware design
• Non-linear effects influence the lifetime of batteries
• “Rate Capacity”
  - If discharge currents are higher than allowed, the real capacity drops below the nominal capacity
• “Battery Recovery”
  - Pulsed discharge increases nominal capacity
  - Based on recovery times (as long as there is no rate capacity effect)
[Figure: capacity (mAh, 200 to 1000) versus discharge current (mA), marking the 1000 mAh standard capacity and the 125 mA rated current; discharge current (mA) and available charge over time with idle times]
Source: Timmermann, 2007

Battery aware design cont’d
• Diffusion model from Rakhmatov, Vrudhula et al.
  - Analytically very sound but computationally intensive
  - Cannot be used for online scheduling decisions
[Figure: distribution of electro-active species for a fully charged battery, after a recent discharge, after recovery, and fully discharged]

Battery aware design: Example 1
• Performance of a bipolar lead-acid battery subjected to six current impulses
  - Pulse length = 3 ms, rest period = 22 ms
[Figure: current and battery voltage over time]
Source: LaFollette, “Design and performance of high specific power, pulsed discharge, bipolar lead acid batteries”, 10th Annual Battery Conference on Applications and Advances, Long Beach, pp. 43–47, January 1995.

Battery aware design: Example 2
[Figure: current (mA) over time for discharge profile A and discharge profile B]

Profile   Aver. current [mA]   Battery lifetime [ms]   Specif. energy [Wh/kg]
A         123.8                357053                  15.12
B         124.2                536484                  18.58

• Minimum average current ≠ maximum battery lifetime
Source: Timmermann, 2007

Backup

FSM: Clock-Gating
• Moore machine: outputs depend only on the state variables
  - If a state has a self-loop in the state transition graph (STG), then the clock can be stopped whenever a self-loop is to be executed
[Figure: STG with states Si, Sj, Sk and transitions labeled Xi/Zk, Xj/Zk, Xk/Zk; Sk has a self-loop]
• Clock can be stopped when the (Xk, Sk) combination occurs
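Finding the (state, input) combinations for which the clock can be stopped amounts to listing the self-loops of the state transition graph. The sketch below uses a made-up transition table: the exact arrows of the slide's figure are not recoverable, so apart from the Sk self-loop on Xk the transitions are assumptions, and the helper names are hypothetical.

```python
def self_loops(transitions):
    """Return the (state, input) pairs whose next state equals the current state."""
    return {(s, x) for (s, x), nxt in transitions.items() if nxt == s}


def clock_enable(state, x, transitions):
    """Clock is needed only if the transition actually changes the state."""
    return (state, x) not in self_loops(transitions)


# Hypothetical STG; only the self-loop (Sk, Xk) is taken from the slide
STG = {
    ("Si", "Xi"): "Sk",
    ("Sj", "Xj"): "Sk",
    ("Sk", "Xk"): "Sk",  # self-loop: clock can be stopped here
    ("Sk", "Xi"): "Si",
}

print("gateable combinations:", self_loops(STG))
print("clock needed for (Sk, Xk)?", clock_enable("Sk", "Xk", STG))
```

Because the outputs of a Moore machine depend only on the state, holding the state registers during a self-loop leaves the outputs unchanged.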

Trend: Interconnects
• Propagation delays of global wires will be a multiple of the clock cycle
• Example (very optimistic): 6–10 clock cycles in 50 nm technology [Benini, 2002]
Source: Tenhunen, 2005

Bus Multiplexing
• Number of bus transitions per cycle = 2 (1 + 1/2 + 1/4 + ...) = 4
Source: Irwin, 2000

Resource Sharing and Activity II

Bus Multiplexing
• Sharing of long data buses with time multiplexing
• Example:
  - S1 uses even cycles
  - S2 uses odd cycles
[Figure: dedicated buses from S1 to D1 and from S2 to D2 versus one shared bus between S1, S2 and D1, D2]
Source: Irwin, 2000

Correlated Data Streams
• For a shared (multiplexed) bus, advantages of data correlation are lost (bus carries samples from two uncorrelated data streams)
  - Bus sharing should not be used for positively correlated data streams
  - Bus sharing may prove advantageous in a negatively correlated data stream (where successive samples switch sign bits): more random switching
[Figure: bit switching probabilities (0 to 1) versus bit position (14 = MSB down to 0 = LSB) for muxed and dedicated buses]
Source: Irwin, 2000

Disadvantages of Bus Multiplexing
• If the data bus is shared, advantages of data correlation are lost (bus carries samples from two uncorrelated data streams)
• Bus sharing should not be used for positively correlated data streams
• Bus sharing may prove advantageous in a negatively correlated data stream (where successive samples switch sign bits): more random switching
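The loss of correlation on a shared bus can be illustrated by counting bit transitions. The sketch below compares two dedicated 16-bit buses with one time-multiplexed bus; the two sample streams (slowly increasing positive values and their negatives in two's complement) are made up for this illustration, and the helper names are hypothetical.

```python
def hamming(a, b, width=16):
    """Number of differing bits between two words on a width-bit bus."""
    mask = (1 << width) - 1
    return bin((a ^ b) & mask).count("1")


def transitions(words, width=16):
    """Total bit transitions when the words are sent one after another."""
    return sum(hamming(x, y, width) for x, y in zip(words, words[1:]))


# Two positively correlated streams (assumed data): successive samples differ little
s1 = list(range(100, 108))
s2 = [-v for v in s1]

dedicated = transitions(s1) + transitions(s2)

# Time multiplexing: S1 on even cycles, S2 on odd cycles
muxed_words = [w for pair in zip(s1, s2) for w in pair]
muxed = transitions(muxed_words)

print(f"dedicated buses: {dedicated} bit transitions")
print(f"shared bus:      {muxed} bit transitions")
```

On the shared bus the sign bits toggle on almost every cycle, so the low switching activity of each individual stream is lost.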

Adaptive DVS cont’d
• Implementation
[Figure: a workload filter observes the FIFO input buffer and sets the power-speed control knob of the variable power-speed system]