Sleepy stack: a New Approach to Low Power VLSI Logic and Memory

Ph.D. Dissertation Defense by Jun Cheol Park Advisor: Vincent J. Mooney III School of Electrical and Computer Engineering Georgia Institute of Technology June 2005
© 2005 Georgia Institute of Technology

Outline
Introduction Related Work Sleepy stack Sleepy stack logic circuits Sleepy stack SRAM Low-power pipelined cache (LPPC) Sleepy stack pipelined SRAM Conclusion
© 2005 Georgia Institute of Technology 2

Power consumption
Power consumption of VLSI is a fundamental problem of mobile devices as well high-performance computers
Limited operation (battery life) Heat Operation cost

Power = dynamic + static
Dynamic power more than 90% of total power (0.18u tech. and above)

Dynamic power reduction:
Technology scaling Frequency scaling Voltage scaling IBM PowerPC 970*

*N. Rohrer et al., “PowerPC 970 in 130nm and 90nm Technologies," IEEE International Solid-State circuits Conference, vol 1, pp. 68-69, February 2004.

© 2005 Georgia Institute of Technology 3

80E+05 4. sleep transistor) lose logic state thus requires long wake-up time *Assume chip area ½ processor logic Users can use a cell phone without waiting and ½ memory..07u tech.09E-01 leakage power matches to dynamic power at 0.23E-01 3.83E-02 2.57E-04 7.200min) Leakage reduction technique can potentially increase battery life 34X State-saving Prior ultra-low-leakage techniques (e.99E-02 1. 32KB SRAM Total 1.74E-03 8.50E-03 1.Motivation Ultra-low leakage power A cell phone calling plan with 500min (per month 43.34E-01 3.56E+04 Power consumption scenario for 0.49E+05 1.04E-01 5. processor © 2005 Georgia Institute of Technology 4 . no on-chip memory.71E-01 3.26E+03 2.82E-01 7. Our approach (sleepy stack) Energy (J) Energy (J) Active (W) Leakage (W) (Month) (Month) 1.10E-02 2. long wake-up time Emergency calling situation Best prior work (forced stack) Active (W) Leakeage (W) Proc.30E+05 2.39E+03 6.g.07u tech.

edu/~ptm. [Online].eecs.berkeley.00E-10 0.00E-09 1.00E-08 1. Available http://www-device.00E-06 1. © 2005 Georgia Institute of Technology 5 .18u 0.Leakage power 1.10u 0.13u 0.07u Experimental result 4-bit adder* Gate Source Gate-oxide Leakage current Drain Gate-oxide leakage Gate tunneling due to thin oxide High-k dielectric could be a solution n+ P-substrate Subthreshold Leakage current n+ NFET *Berkeley Predictive Technology Model (BPTM).00E-07 1.00E-05 1.00E-04 Dynamic Power Leakage Power Leakage power became important as the feature size shrinks Subthreshold leakage Scaling down of Vth: Leakage increases exponentially as Vth decreases Short-channel effect: channel controlled by drain Our research focus 1.

Contribution Design of novel ultra-low leakage sleepy stack which savies state Design of sleepy stack SRAM cell which provides new pareto points in ultra-low leakage power consumption Design of low-power pipelined cache and find optimal number of pipeline stages in the given architecture Design of sleepy stack pipelined SRAM for lowleakage © 2005 Georgia Institute of Technology 6 .

Outline Introduction Related Work Sleepy stack Sleepy stack logic circuits Sleepy stack SRAM Low-power pipelined cache (LPPC) Sleepy stack pipelined SRAM Conclusion © 2005 Georgia Institute of Technology 7 .

e..Low-leakage CMOS VLSI Sleep transistor Multi-threshold-voltage CMOS (MTCMOS) [Mutoh95] Loses state and incurs high wake-up cost Zigzag [Min03] Non-floating by asserting a predefined input during sleep mode Only retain predefined state Stack [Narendra01] Stack effect: when two or more stacked transistors turned off together. leakage power is suppressed Forcing stack Cannot utilize high-Vth without huge delay penalties (>6.8X) Logic Network Sleep transistor © 2005 Georgia Institute of Technology 8 .g.2X) (we will show for less. 2.

2.g. e.2X) (we will show for less.Low-leakage CMOS VLSI Sleep transistor Multi-threshold-voltage CMOS (MTCMOS) [Mutoh95] Loses state and incurs high wake-up cost Zigzag [Min03] Non-floating by asserting a predefined input during sleep mode 0 Only retain predefined state Stack [Narendra01] Stack effect: when two or more stacked transistors turned off together..8X) S Pullup Network A Pullup Network A Pullup Network A 1 Pulldown Network B S’ 0 Pulldown Network B Pulldown Network B S’ 1 Zigzag © 2005 Georgia Institute of Technology 9 . leakage power is suppressed Forcing stack Cannot utilize high-Vth without huge delay penalties (>6.

8X) 2 4 2 2 1 1 Forcing stack © 2005 Georgia Institute of Technology 10 . leakage power is suppressed Forcing stack Cannot utilize high-Vth without huge delay penalties (>6.2X) (we will show for less.Low-leakage CMOS VLSI Sleep transistor Multi-threshold-voltage CMOS (MTCMOS) [Mutoh95] Loses state and incurs high wake-up cost Zigzag [Min03] Non-floating by asserting a predefined input during sleep mode Only retain predefined state Stack [Narendra01] Stack effect: when two or more stacked transistors turned off together.g. e.. 2.

so does not need to restore original state (unlike the sleep transistor technique and the zigzag technique) Larger leakage power saving (forced stack) © 2005 Georgia Institute of Technology 11 .Low-leakage CMOS VLSI comparison summary Sleepy stack approaches for logic circuits Saves exact state.

K.Low-leakage SRAM Auto-Backgate-Controlled Multi Threshold CMOS (ABC-MTCMOS) [Nii98] Reverse source-body bias during sleep mode Slow transition and large dynamic power to charge n-wells Gated-Vdd [Powell00](Prof. Roy) Isolate SRAM cells using sleep transistor Loses state during sleep mode Drowsy cache [Flautner02] Scaling Vdd dynamically Smaller leakage reduction (<86%) (we will show 3 orders magnitude reduction) Vdd Gate Source Drain n-well High-Vdd p+ p+ p-substrate ABC-MTCMOS © 2005 Georgia Institute of Technology 12 .

com . K. Roy) VGND Isolate SRAM cells using sleep Gated-VDD control transistor bitline’ bitline Loses state during sleep mode Drowsy cache [Flautner02] Gated-VDD Scaling Vdd dynamically Smaller leakage reduction *Intel introduces 65-nm sleep transistor SRAM (<86%) (we will show 3 orders from Intel.Low-leakage SRAM Auto-Backgate-Controlled Multi Threshold CMOS (ABC-MTCMOS) [Nii98] VDD Reverse source-body bias during sleep mode wordline Slow transition and large dynamic power to charge n-wells Gated-Vdd [Powell00](Prof. “65-nm process technology extends the benefit of Moore’s law” magnitude reduction) © 2005 Georgia Institute of Technology 13 .

K. Roy) Isolate SRAM cells using sleep transistor Loses state during sleep mode Drowsy cache [Flautner02] Scaling Vdd dynamically Smaller leakage reduction (<86%) (we will show 3 orders magnitude reduction) LowVolt VDDH VDDL LowVolt’ wordline P2 N3 N2 P1 N4 bit N1 bit’ Drowsy cache © 2005 Georgia Institute of Technology 14 .Low-leakage SRAM Auto-Backgate-Controlled Multi Threshold CMOS (ABC-MTCMOS) [Nii98] Reverse source-body bias during sleep mode Slow transition and large dynamic power to charge n-wells Gated-Vdd [Powell00](Prof.

Low-power pipelined cache Breaking down a cache into multiple segment cache can be pipelined to increase performance [Chappell91][Agarwal03] No power reduction addressed Pipelining caches to offset performance degradation and pipelined cache [Gunadi04] Saving power by enabling necessary subbank Only dynamic power is addresses (0.18u technology) © 2005 Georgia Institute of Technology 15 .

Low-leakage SRAM and low power cache comparison Sleepy stack SRAM cell No need to charge n-well (ABC-MTCMOS) State-saving (gated-Vdd) Larger leakage power savings (drowsy cache) No prior work found that uses a pipelined cache to reduce dynamic power by scaling Vdd or reducing static power © 2005 Georgia Institute of Technology 16 .

Outline Introduction Related Work Sleepy stack Sleepy stack logic circuits Sleepy stack SRAM Low-power pipelined cache (LPPC) Sleepy stack pipelined SRAM Conclusion © 2005 Georgia Institute of Technology 17 .

g. cell phone © 2005 Georgia Institute of Technology 18 .. e.Introduction of sleepy stack New state-saving ultra low-leakage technique Combination of the sleep transistor and forced stack technique Applicable to generic VLSI structures as well as SRAM Target application requires long standby with fast response.

5 W/L=1. Break down a transistor similar to the forced stack technique Then add sleep transistors © 2005 Georgia Institute of Technology 19 .5 Conventional CMOS inverter Sleepy stack stack inverter inverter First.5 S’ W/L=1.Sleepy stack structure S W/L=3 W/L=6 W/L=3 W/L=3 W/L=3 W/L=1.

5 Stack effect On High-Vth Off S’=1 S’=0 W/L=1.2X) © 2005 Georgia Institute of Technology 20 . sleep transistors are off.5 W/L=1. sleep transistors are on.Sleepy stack operation On S=0 Stack effect Off S=1 W/L=3 W/L=3 W/L=3 Low-Vth W/L=1.5 Active mode Sleep mode During active mode. which is not used in the forced stack technique due to the dramatic delay increase (>6. then reduced resistance increases current while reducing delay During sleep mode. stacked transistors suppress leakage current while saving state Can apply high-Vth.

Outline Introduction Related Work Sleepy stack Sleepy stack logic circuits Sleepy stack SRAM Low-power pipelined cache (LPPC) Sleepy stack pipelined SRAM Conclusion © 2005 Georgia Institute of Technology 21 .

sleep*.Experimental methodology Five techniques are compared base case. CL=3Cinv 4:1 multiplexer 5 stages standard cell gates Cin=11Cinv. forced stack. CL=3Cinv 4-bit adder 4 full adders complex gate Cin=18.5Cinv. CL=3Cinv © 2005 Georgia Institute of Technology 22 . zigzag* and sleepy stack* (*single.and dual-Vth applied) Three benchmark circuits are used (CL=3Cinv) A chain of 4 inverters 4 inverter chain Cin=3Cinv.

18µ.8V 0. Available http://www.18µ layout Dynamic power.0V 0.cadence.eecs. Available http://www-device.10µ 1. [Online]. VDD 0. **Berkeley Predictive Technology Model (BPTM).8V *NC State University Cadence Tool Information.edu/~ptm.13µ. 0.18µ Area estimation Tech.3V 0. . 0.07µ 0.10µ.Experimental methodology Area estimated by scaling down 0.berkeley. © 2005 Georgia Institute of Technology 23 [Online]. 0.13µ 1.edu.18µ 1.ncsu. static power and worst-case delay of each benchmark circuit measured Layout (Cadence Virtuoso) Scaling down Schematics from layout HSPICE (Synopsys HSPICE) Power and delay estimation BPTM** 0.07µ NCSU Cadence design kit* TSMC 0.

13u Berkeley 0.10u Berkeley 0.E-15 1.5E-10 2.07u 1.0E-11 0.E-09 1.5E-10 1.2V and 0.E-08 1.E-16 St ac k Sl ee p* Zi gZ ag Sl * ee py St ac k* St ac k Sl ee p Zi gZ ag ca se * Dual-Vth applied (0.E-07 1.E-11 1.A chain of 4 inverters (a) Static power (W) 1.0E-10 5.E-04 1.E-13 1.E-07 St ac k St ac k Sl ee p* Zi gZ ag Sl * ee py St ac k* Sl ee p* Zi gZ ag Sl ee * py St ac k* St ac k Sl ee p Sl ee p Zi gZ ag ca se Ba se 1.4V) (b) Dynamic power (W) 1.E-12 1.E-14 1.18u Berkeley 0.18u Berkeley 0.E-06 Ba se (c) Propagation delay (s) 3.0E-10 2.0E+00 Zi gZ ag Sl ee py St ac k Sl ee py 100 (d) Area (µ2) 10 1 Sl ee p* Zi gZ ag Sl ee * py St ac k* St ac k Sl ee p St ac k Zi gZ ag ca se ca se Ba se Ba se Sl ee py Sl ee py © 2005 Georgia Institute of Technology 24 .0E-10 1.E-10 1.E-05 TSMC 0.

A chain of 4 inverters Compared mainly to forced stack (best prior leakage technique while saving state) Compared to forced stack.45E-10 1.67E-10 1. A chain of 4 inverters Base case Stack Sleep ZigZag Sleepy Stack Sleep (dual Vth) ZigZag (dual Vth) Sleepy Stack (dual Vth) Propagation delay (s) 7.45E-09 1.69E-10 1.12E-12 4.13E-10 1.09E-06 Area (µ2) 5.07E-12 4.05E-11 2.39 9.69E-09 4.03 © 2005 Georgia Institute of Technology 25 . respectively 0.46E-06 1. sleepy stack with dual-Vth achieves 215X reduction in leakage power with 6% decrease in delay Sleepy stack is 73% and 51% larger than base case and forced stack.23 5.34E-06 1.39E-06 1.67 7.97 10.99E-10 Static Power (W) 1.15E-10 1.03 10.39E-06 1.56E-12 Dynamic Power (W) 1.39 9.08E-06 1.11E-10 1.67 7.34E-06 1.07u tech.57E-08 9.25E-06 1.96E-09 1.81E-10 2.

0E-10 6.18u Berkeley 0.4V) (b) Dynamic power (W) 1.0E-10 3.E-09 1.0E-10 2.0E-10 0.E-12 St ac k Sl ee p* Zi gZ ag Sl * ee py St ac k* St ac k Sl ee p Zi gZ ag ca se * Dual-Vth applied (0.10u Berkeley 0.2V and 0.E-05 1.0E-10 4.E-06 St ac k St ac k Sl ee p* Zi gZ ag Sl * ee py St ac k* Sl ee p* Zi gZ ag Sl ee * py St ac k* St ac k Sl ee p Sl ee p Zi gZ ag Zi gZ ag ca se Ba se Ba se (c) Propagation delay (s) 8.E-04 Berkeley 0.0E-10 5.0E+00 St ac k Sl ee py 1000 (d) Area (µ2) 100 10 Sl ee p* Zi gZ ag Sl ee * py St ac k* St ac k Sl ee p Zi gZ ag St ac k ca se ca se Sl ee py Ba se Sl ee py Ba se Sl ee py © 2005 Georgia Institute of Technology 26 .0E-10 7.E-08 1.E-10 1.07u 1.0E-10 1.E-06 TSMC 0.13u Berkeley 0.E-11 1.4:1 Multiplexer (a) Static power (W) 1.18u 1.E-07 1.

36 125.17E-10 3.14E-06 2.54E-06 2.41E-11 3.28E-10 4.11 74.33 74.07u tech.33 © 2005 Georgia Institute of Technology 27 .15E-06 2.35E-10 2.18E-06 2.87E-10 3.36E-08 1.59E-06 2.39E-10 4.4:1 Multiplexer Compared to forced stack.11 74.57E-08 6.84E-10 Static Power (W) 8.36 125.09E-06 Area (µ2) 50.10E-06 2. sleepy stack with dual-Vth achieves 202X reduction in leakage power with 7% increase in delay Sleepy stack is 150% and 118% larger than base case and forced stack.65E-08 1.49E-06 2. respectively 0.09E-08 2.46E-09 1.40 74. 4:1 multiplexer Base case Stack Sleep ZigZag Sleepy Stack Sleep (dual Vth) ZigZag (dual Vth) Sleepy Stack (dual Vth) Propagation delay (s) 1.52E-10 1.99E-10 2.20E-11 Dynamic Power (W) 2.62E-11 3.17 57.

E-06 1.07u 1.10u Berkeley 0.E-05 1.2E-09 1.E-04 (a) Static power (W) 1.6E-09 1.E-08 1.4V) (b) Dynamic power (W) 1.E-07 1.0E-10 4.E-06 St ac k St ac k Sl ee p* Zi gZ ag Sl * ee py St ac k* Sl ee p* Zi gZ ag Sl * ee py St ac k* Sl ee p* Zi gZ ag Sl ee * py St ac k* St ac k Sl ee p Sl ee p Zi gZ ag Zi gZ ag St ac k Sl ee p St ac k ca se Ba se Ba se ca se (c) Propagation delay (s) 1.18u Berkeley 0.4E-09 1.E-10 1.18u Berkeley 0.0E-10 0.4-bit adder 1.0E-10 2.E-12 * Dual-Vth applied (0.0E-10 6.0E+00 St ac k Sl ee py (d) Area (µ2) 1000 100 10 Sl ee p* Zi gZ ag Sl ee * py St ac k* St ac k Sl ee p Zi gZ ag St ac k Zi gZ ag ca se Ba se ca se Sl ee py Ba se Sl ee py Sl ee py © 2005 Georgia Institute of Technology 28 .E-09 1.E-03 TSMC 0.2V and 0.E-11 1.13u Berkeley 0.8E-09 1.0E-09 8.

23E-11 2.09E-09 1.88 © 2005 Georgia Institute of Technology 29 .48E-10 7.03E-06 8.23E-09 Static Power (W) 8.4-bit adder Compared to forced stack.24E-08 9.56E-11 Dynamic Power (W) 8.87E-08 6.62 65.41E-06 8.65E-10 7.88 30.43E-10 1.44E-06 7.07E-08 2.94 27.63E-06 9.16E-09 5.96 30.07u tech 4-bit adder Base case Stack Sleep ZigZag Sleepy Stack Sleep (dual Vth) ZigZag (dual Vth) Sleepy Stack (dual Vth) Propagation delay (s) 3. sleepy stack with dual-Vth achieves 190X reduction in leakage power with 6% increase in delay Sleepy stack is 187% and 113% larger than base case and forced stack.24E-10 5.26E-06 Area (µ2) 22.24E-10 8.76E-10 1.77E-09 1.81E-06 7.53E-06 7.19E-11 3.62 65.70E-06 9.94 30. respectively 0.94 27.

00E-10 0. 48 0. 44 0. varied Vth 3. 4 0.00E-10 1. 5 Vth (b) Static power (W) 1. 3 0. 26 0. 2 0. 38 © 2005 Georgia Institute Technology 30 Results from 4 inverters using 0. 46 0. 34 0.00E-10 1. 28 0.50E-10 1.07uof technology 0. 42 0. low Vth only Sleepy stack. 44 0.Sleepy stack Vth variation (a) Delay (sec) 4.00E-09 1.4V while matching delay to the forced stack 215X leakage power reduction of the sleepy stack technique 3. 38 0. 46 0. low Vth only Impact of Vth by comparing the sleepy stack and the forced stack Vth of the sleepy stack can be increased up to 0. 22 0. 22 0.50E-10 Sleepy stack. varied Vth 1. 5 0. 24 0. 2 0. 36 0.00E-12 0. 4 0.00E-10 2.00E-10 Forced stack. 32 0. 32 0.00E-11 1. 26 0.50E-10 2. 24 0. 42 0. 36 0. 28 0.00E-08 Forced stack. 3 0. 34 0. 48 Vth .

5% faster but leakage power is 430X larger 80 70 60 50 40 30 20 1 1.5 5 xWidth (c) Static power (W) Forced stack.5 4 4. varied w idth Sleepy stack. fixed w idth 1.5 4 4.5 3 3.00E-07 1.5 2 (b) Delay (sec) Forced stack. fixed w idth 2.4V Between 2X~2.80E-10 1. sleepy stack (sleep and paralleled transistor) Vth=0.60E-10 1 1.Forced stack transistor width variation Impact of the forced stack transistor width by comparing the sleepy stack and the forced stack using similar area Forced stack Vth=0.5 3 3. varied w ith Sleepy stack.5 3 3.00E-10 1.00E-12 1 1.5X transistor width of the forced stack matches area with the sleepy stack Force stack is 1.5 2 (a) Area (u2) Forced stack.70E-10 1.10E-10 2.5 5xWidth © 2005 Georgia Institute Technology 31 Results from 4 inverters using 0.2V.5 2 2. varied w idth Sleepy stack.00E-09 1.5 5 xWidth 2.00E-10 1.00E-08 1.07uof technology .20E-10 2. fixed w idth 2.90E-10 1.5 4 4.00E-11 1.

Outline Introduction Related Work Sleepy stack Sleepy stack logic circuits Sleepy stack SRAM Low-power pipelined cache (LPPC) Sleepy stack pipelined SRAM Conclusion © 2005 Georgia Institute of Technology 32 .

Sleepy stack SRAM cell Sleepy stack technique achieves ultra-low leakage power while saving state Apply the sleepy stack technique to SRAM cell design Large leakage power saving expected in cache State-saving 6-T SRAM cell is based on coupled inverters SRAM cell leakage paths Cell leakage Bitline leakage © 2005 Georgia Institute of Technology 33 .

WL sleepy stack Area. PD sleepy stack PU. WL sleepy stack PU. PD. delay and leakage power tradeoffs © 2005 Georgia Institute of Technology 34 .Sleepy stack SRAM cell Sleepy stack SRAM cell PD sleepy stack PD.

PD stack PU. WL high-Vth PD stack PD. PD. WL Stack applied to PU. PD. WL Sleepy stack applied to PU. PD.18u layout*(0. forced stack. PD High-Vth applied to PU. PD. WL Stack applied to PD Stack applied to PD. WL stack PU.Experimental methodology Base case and three techniques are compared High-Vth technique. WL High-Vth applied to PU.5xVth and 2.18µ layout Area of 0.07u/0.18u) Power and read time using HSPICE targeting 0. WL high-Vth PU. WL Sleepy stack applied to PD Sleepy stack applied to PD. PD high-Vth PU. WL stack PD sleepy stack PD. PD. PD. and sleepy stack Technique Conventional 6T SRAM High-Vth applied to PD High-Vth applied to PD.0xVth 25oC and 110oC Case1 Case2 Case3 Case4 Case5 Case6 Case7 Case8 Case9 Case10 Case11 Case12 Case13 Low-Vth Std PD high-Vth PD. PD Sleepy stack applied to PU.07µ 1. WL sleepy stack PU. WL sleepy stack © 2005 Georgia Institute of Technology 35 . PD sleepy stack PU. PD Stack applied to PU. WL 64x64 bit SRAM array designed Area estimated by scaling down 0.

respectively © 2005 Georgia Institute of Technology 36 PU. PD. PD stack PD. WL high-Vth PD.0E+01 5. PD.0E+01 1.0E+01 3. WL stack PU.5E+01 3. WL sleepy stack PU. WL high-Vth PD sleepy stack Low-Vth Std PD high-Vth PD stack 0. WL stack PU. PD.5E+01 1. PD high-Vth PD. PD sleepy stack PU. WL sleepy stack PU. PD. PD.0E+00 Unit=µ2 PU.0E+01 2. WL sleepy stack is 113% and 83% larger than base case and PU. WL forced stack.0E+00 .Area 4.5E+01 2.

PD high-Vth PD. PD.5E-10 1.2E-10 1. 110C 2xVth.1E-10 PU. WL stack PD.4E-10 1. PD sleepy stack PU. PD. WL sleepy stack PU.0E-10 Low-Vth Std Unit=sec 1xVth. WL sleepy stack PU.8E-10 1.Cell read time 1.5xVth.3E-10 1. 110C 1.7E-10 1. WL high-Vth PD.6E-10 1. WL high-Vth PD sleepy stack PD high-Vth PD stack . PD stack 1. WL stack PU. PD. 110C Delay: High-Vth < sleepy stack < forced stack © 2005 Georgia Institute of Technology 37 PU.

WL high-Vth PD sleepy stack PD. WL sleepy stack PU. WL high-Vth PD. 110C 2xVth.0E-02 1. PD.0E-03 1xVth.0E-04 1. WL sleepy stack PU. WL stack PU.Leakage power 1. the worst case. 110C Unit=W PU. PD. PD. PD stack At 110oC. 110C 1. WL stack PD high-Vth PD stack . PD high-Vth PD.0E-05 1.5xVth.0E-06 Low-Vth Std 1. leakage power: forced stack > high-Vth 2xVth > sleepy stack 2xVth © 2005 Georgia Institute of Technology 38 PU. PD sleepy stack PU.

Tradeoffs
Technique Case1 Case2 Case6 Case10* Case10 Case4 Case8 Case12* Case12 Case3 Case7 Case11* Case11 Case5 Case9 Case13* Case13

1.5xVth at 110oC

Normalized Normalized Normalized Leakage Delay (sec) Area (u2) leakage power delay area power (W) Low-Vth Std 1.254E-03 1.05E-10 17.21 1.000 1.000 1.000 PD high-Vth 7.159E-04 1.07E-10 17.21 0.571 1.020 1.000 PD stack 7.071E-04 1.41E-10 16.22 0.564 1.345 0.942 PD sleepy stack* 6.744E-04 1.15E-10 25.17 0.538 1.102 1.463 PD sleepy stack 6.621E-04 1.32E-10 22.91 0.528 1.263 1.331 PU, PD high-Vth 5.042E-04 1.07E-10 17.21 0.402 1.020 1.000 PU, PD stack 4.952E-04 1.40E-10 15.37 0.395 1.341 0.893 PU, PD sleepy stack* 4.532E-04 1.15E-10 31.30 0.362 1.103 1.818 PU, PD sleepy stack 4.430E-04 1.35E-10 29.03 0.353 1.287 1.687 PD, WL high-Vth 3.203E-04 1.17E-10 17.21 0.256 1.117 1.000 PD, WL stack 3.202E-04 1.76E-10 19.96 0.255 1.682 1.159 PD, WL sleepy stack* 2.721E-04 1.16E-10 34.40 0.217 1.111 1.998 PD, WL sleepy stack 2.451E-04 1.50E-10 29.87 0.196 1.435 1.735 PU, PD, WL high-Vth 1.074E-04 1.16E-10 17.21 0.086 1.110 1.000 PU, PD, WL stack 1.043E-04 1.75E-10 19.96 0.083 1.678 1.159 PU, PD, WL sleepy stack* 4.308E-05 1.16E-10 41.12 0.034 1.112 2.389 PU, PD, WL sleepy stack 2.093E-05 1.52E-10 36.61 0.017 1.450 2.127

Sleepy stack delay is matched to Case5 (“*” means delay matched to Case5=best prior work) Sleepy stack SRAM provides new pareto points (blue rows) Case13 achieves 5.13X leakage reduction (with 32% delay increase), alternatively Case13* achieves 2.49X leakage reduction compared to Case5 (while matching delay to Case5)
© 2005 Georgia Institute of Technology 39

Tradeoffs
Technique Case1 Case6 Case2 Case10 Case10* Case8 Case4 Case12* Case12 Case7 Case3 Case11* Case11 Case9 Case5 Case13* Case13 Low-Vth Std PD stack PD high-Vth PD sleepy stack PD sleepy stack* PU, PD stack PU, PD high-Vth PU, PD sleepy stack* PU, PD sleepy stack PD, WL stack PD, WL high-Vth PD, WL sleepy stack* PD, WL sleepy stack PU, PD, WL stack PU, PD, WL high-Vth PU, PD, WL sleepy stack* PU, PD, WL sleepy stack

2.0xVth at 110oC
Static (W) 1.25E-03 7.07E-04 6.65E-04 6.51E-04 6.51E-04 4.95E-04 4.42E-04 4.31E-04 4.31E-04 3.20E-04 2.33E-04 2.29E-04 2.28E-04 1.04E-04 8.19E-06 3.62E-06 2.95E-06 Delay (sec) Area (u2) 1.05E-10 1.41E-10 1.11E-10 1.31E-10 1.31E-10 1.40E-10 1.10E-10 1.33E-10 1.38E-10 1.76E-10 1.32E-10 1.30E-10 1.62E-10 1.75E-10 1.32E-10 1.32E-10 1.57E-10 17.21 16.22 17.21 22.91 22.91 15.37 17.21 29.48 29.03 19.96 17.21 32.28 29.87 19.96 17.21 38.78 36.61 Normalized Normalized Normalized leakage delay area 1.000 1.000 1.000 0.564 1.345 0.942 0.530 1.061 1.000 0.519 1.254 1.331 0.519 1.254 1.331 0.395 1.341 0.893 0.352 1.048 1.000 0.344 1.270 1.713 0.344 1.319 1.687 0.255 1.682 1.159 0.186 1.262 1.000 0.183 1.239 1.876 0.182 1.546 1.735 0.083 1.678 1.159 0.007 1.259 1.000 0.003 1.265 2.253 0.002 1.504 2.127

Sleepy stack delay is matched to Case5 (“*” means delay matched to Case5=best prior work)
Sleepy stack SRAM provides new pareto points (blue rows)

Case13 achieves 2.77X leakage reduction (with 19% delay increase over Case5), alternatively Case13* achieves 2.26X leakage reduction compared to Case5 (while matching delay to Case5)
© 2005 Georgia Institute of Technology 40

Static noise margin

Technique Case1 Case10 Case11 Case12 Case13 Low-Vth Std PD sleepy stack PD, WL sleepy stack PU, PD sleepy stack PU, PD, WL sleepy stack

Static noise margin (V) Active mode Sleep mode 0.299 N/A 3.167 0.362 0.324 0.363 0.299 0.384 0.299 0.384

Measure noise immunity using static noise margin (SNM) SNM of the sleepy stack is similar or better than the base case
© 2005 Georgia Institute of Technology 41

Outline Introduction Related Work Sleepy stack Sleepy stack logic circuits Sleepy stack SRAM Low-power pipelined cache (LPPC) Sleepy stack pipelined SRAM Conclusion © 2005 Georgia Institute of Technology 42 .

= 0.T.25V. CL=1pF f = 233Mhz. = 1sec E = ½*1pF(1.5sec = 0. = 1sec E = ½*1pF*(2.T.25)2 x 466 Mhz x 0.128mJ *Energy saving = 78. CL=1pF f = 466Mhz.3 ns delay VDD = 2.15ns VDD = 1.589mJ (c) Low-power pipelined cache cycle time = 4.3ns delay/2 slack delay/2 slack (b) Pipelined cache for high-performance cycle time = 2.Low-power pipelined cache (LPPC) (a) Base case cycle time = 4.T. E. CL=1pF f = 233Mhz. E.25 V.3% delay/2 delay/2 VDD = 2.589mJ © 2005 Georgia Institute of Technology 43 .25)2 x 233 Mhz x 1sec = 0.25 V. E.5sec E = ½*1pFX(2.25)2 x 233 Mhz x 1sec = 0.

no. pp. B. 11." IEEE Journal of Solid-State Circuits. and R. © 2005 Georgia Institute of Technology 44 . 26. Allan.Low-power pipelined cache (LPPC) Extra slack by splitting cache stages Generic pipelined cache increases clock frequency by reducing delay* Dynamic power reduction by lowering Vdd of caches Optimal pipeline depth to a given architecture VDD (Non-Cache) VDD (Cache) IF1 IF IF2 ID EX ID MEM EX MEM1 WB MEM2 WB I-cache1 I-cache I-cache2 D-cache D-cache1 D-cache2 A pipeline with low-power pipelined caches non-pipelined caches *T. vol. R. Schuster. 1991. Chappell. 3. "A 2-ns Cycle. 1577-1585. S. Klepner. S.8-ns Access 512-kb CMOS ECL SRAM with a Fully Pipelined Architecture. Franch. Chappell. Joshi. J.

308-309.. **K. © 2005 Georgia Institute of Technology 45 . February.sun.Pipelining techniques Latched-pipelining Place latches in-between stages Typically used for pipelined processor Easy to implement When applied to the cache pipelining." IEEE International Solid-State Circuits Conference. pp. Ishibashi et al. February 1995. the delay of each pipeline stage could be different Wave-pipelining Use existing gates as a virtual storage element Even distribution of delay is potentially possible Complex timing analysis is required Used for industry SRAM design UltraSPARC-IV uses wave-pipelined SRAM (90nm tech)* Hitachi designs 300-Mhz 4-Mbit wave-pipelined CMOS SRAM** *UntraSPARC IV Processor Architecture Overview. "A 300 MHz 4-Mb Wave-pipeline CMOS SRAM Using a Multi-Phase PLL. http://www.com.

com/wrl/people/jouppi/CACTI.and 4-stage pipelined cache CACTI cache structure Tag array & sense amp Decoder Data array & sense amp Pipeline segmentation for latch pipelined cache Wave pipelined cache Cycle time using wave variable in CACTI Comparator Output driver *Reinman. 3. and Jouppi..Cache delay model Modify CACTI* to measure cycle time of pipelined cache with variable Vdd Latch pipelined cache Divide CACTI cache model into four segments Merge adjacent segments to form 2-.compaq.0: An integrated cache timing and power model. CACTI 2.html . © 2005 Georgia Institute of Technology 46 [Online]. G. Available http://www. N.research.

2 1.Cache cycle time Measure cycle time while varying Vdd and pipeline depth with 0. 95 2.25V 3 0. 7 2.65V 3 1.25µ tech. 2 2. 7 0. 7 1. . 7 1.05V 2. 95 1.15V 2. 45 2. 95 2. 45 2. 95 1. 2 2.75V 1. 2 Voltage (V) Latch pipelined cache Wave pipelined cache © 2005 Georgia Institute of Technology 47 1. 2 Voltage (V) 7 0. Cycle time is maximum delay of stages 1 Stage 3 Stages 6 2 Stages 4 Stages 6 1 Stage 3 Stages 2 Stages 4 Stages 5 5 Cycle time (ns) 4 Cycle time (ns) 4 1.25V 2 2 1 1 0 0 0. 2 1. 45 0. 95 3. 95 3. 7 2. 45 1.

” Proceedings of the International Symposium on Computer Architecture. **The Simplescalar-Arm power modeling project. 83-94. “Wattch: A Framework for Architectural Level Power Analysis and Optimizations. June 2000. [Online]. Available http://www.edu/~jringenb/power © 2005 Georgia Institute of Technology 48 .. pp. Brooks et al.Experimental Setup Evaluate targeting embedded processor Simplescalar/ARM+Wattch* for performance and power estimation Modify Simplescalar/ARM+Wattch to simulate a variable stage pipelined cache processor Michigan ARM-like Verilog processor model (MARS**) for the power estimation of buffers between broken (non-cache) pipeline stages Expand MARS pipeline and measure power consumption using synthesis based power measurement method Benchmark Program (C/C++) Binary Translation (GCC) MARS Verilog Model Functional Simulation (VCS) Toggle Rate Generation Datapath Power (Power Compiler) Synthesize Verilog Model (Design Compiler) CACTI Delay Simplescalar/ARM +Wattch Processor Power Processor Energy * D.eecs.umich.

1.25V 2.05V 0.75V © 2005 Georgia Institute of Technology 49 .Architecture configuration and benchmarks Simplescalar/ARM+Wattch configuration is modeled after Intel StrongARM Branch target buffer used to hide branch delay Compiler optimization used to hide load delay 7 benchmarks targeting embedded system domain Execution type Branch predictor L1 I-cache L1 D-cache L2 cache Memory bus width Memory latency Clock speed Vdd (Core) Vdd (Cache) In-order 128 entry BTB 32KB 4-way 32KB 4-way None 4-byte 12 cycles 233Mhz 2.25V.

74 13.91 1.700 6.600.10 3.241 0.913 15.681 9.38 24.34 0.19 24.53 2.65 4.552 13.98 1.199.04 6.14 0.735 28.90 2.50 17.94 6.398.409.668.173.00 12.969.801 26.048.65 2.40 1.Execution cycles 1-stage cache Benchmark DIJKSTRA DJPEG GSM MPEG2DEC QSORT SHA STRINGSEARCH Average cycles 100.734.32 13.68 21.413 4.859.57 23.03 6.014.280.94 3.34 5.12 9.55 100.11 5.078.533.747 10.606 21.889 15.27 10.T.638 10.87 2.11 25. increases as the pipelined cache deepens due to pipelining penalties (branch misprediction.49 31.13 2.37 13.08 1.74 0.08 6.380.84 7.275 7.00 18.31 7.02 0.190 17.26 141.356.46 11.091.322 7.461.14 3.881.23 1.522.14% © 2005 Georgia Institute of Technology 50 .57 0.74 3.52 1.248 6.88 4.206.29 0.25 33.916 1.403 2.42 3.38 0.489 3.68 10.399 14.06 0.904 3.111.925 2-stage pipelined cache Increase (%) Total Icache Dcache cycles 109.437.01 93.369 11.11 29.74 2.34 18.41 1.74 E.50 97.59 9.41 1.77 4.741.379 11.20 3-stage pipelined cache 4-stage pipelined cache Increase (%) Increase (%) Total Icache Dcache Total Icache Dcache cycles cycles 127.243.40 6.58 6.46 1.84 13.13 8.147 30. load delay) 2-stage pipelined cache increases execution time by 4.413.724 90.27 11.488.41 14.569 4.822 10.28 3.787.

1 0.Normalized processor power consumption 1-stage 1.9 0.7 0.8 0.0 0.55% of power © 2005 Georgia Institute of Technology 51 ST RI NG SE DI J M .3 0.2 0.0 PE G2 DE C QS OR T KS T DJ P AR CH RA EG GS M SH A 2-stage 3-stage 4-stage Saving 2-stage pipelined cache processor saves 23.4 0.5 0.6 0.

5 0.9 0.Normalized cache power consumption 1-stage 1.8 0.2 0.7 0.0 PE G2 DE C QS OR T KS T DJ P AR CH RA EG GS M SH A 2-stage 3-stage 4-stage Saving 2-stage pipelined caches save about 70% of cache power © 2005 Georgia Institute of Technology 52 ST RI NG SE DI J M .4 0.0 0.3 0.6 0.1 0.

2 0.7 0.5 0.Normalized energy consumption 1-stage 1.8 0.0 0.4 0.3 0.6 0.1 0.43% of energy ST RI NG SE DI J M © 2005 Georgia Institute of Technology 53 AR CH RA EG GS M SH A .0 2-stage 3-stage 4-stage Saving PE G2 DE C QS OR T KS T DJ P 2-stage pipelined cache processor saves 20.9 0.

Outline Introduction Related Work Sleepy stack Sleepy stack logic circuits Sleepy stack SRAM Low-power pipelined cache (LPPC) Sleepy stack pipelined SRAM Conclusion © 2005 Georgia Institute of Technology 54 .

i. 7-stage pipeline IF1 IF2 ID EX MEM1 MEM2 WB I-cache1 I-cache2 D-cache1 D-cache2 Sleepy stack SRAM Sleepy stack low-power pipelined caches © 2005 Georgia Institute of Technology 55 ..Sleepy stack pipelined SRAM Combine the sleepy technique and low-power pipelined cache (LPPC) Leakage reduction while maintaining performance Use 2-stage LPPC.e.

7V Dynamic power. leakage power.07µ technology (Vdd=1. and delay are measured using HSPICE Measured parameters are fed into Simplescaler/ARM to measure process performance © 2005 Georgia Institute of Technology 56 .0V) Sleepy stack is applied to SRAM cell Pre-decoder and row-decoder except global wordline drivers Low-voltage pipelined SRAM with Vdd=0.Methodology Model the base case 32KB SRAM with 4 subblocks targeting 0.

00E-03 6.80E-09 1.00E-02 0.00E-09 8.00E-02 3.00E-10 4.00E-03 1.00E-10 SRAM SRAM SRAM subblock Rise/fall time 5.00E+00 0.00E-03 Delay Decoder 1.00E-02 4.00E-02 2.SRAM performance Active power Decoder 8.00E-03 2.00E-10 1.40E-09 1.00E-02 7.20E-09 1.60E-09 1.00E-02 Static power Decoder 6.00E+00 Basecase Lowvoltage SRAM Sleepy stack SRAM Basecase Lowvoltage SRAM Sleepy stack SRAM Basecase Lowvoltage SRAM Sleepy stack SRAM Active power increases 36% (sleepy stack) and decreases 58% (lowvoltage SRAM) Sleepy stack SRAM achieves 17X leakage reduction (low-voltage SRAM 3X) Delay increases 33% (low-voltage SRAM 66%) (before pipelining) Estimated area overhead of sleepy stack is less that 2X © 2005 Georgia Institute of Technology 57 .00E-02 3.00E-03 5.00E-10 2.00E+00 0.00E-03 4.00E-02 6.

00E-02 4.8 0.6 Base case Sleepy stack 0.00E-02 5.2 0 G ST R D A J G PE SM G _E NC Q S M PE OR T G 2D EC ST 1 R IN G SH SE A AR C Av H er ag e IJ K C JP E 7.2 1 0.00E-02 6.4 0.00E-02 0.00E+00 Active power I-cache D-cache Basecase Lowvoltage SRAM Sleepy stack SRAM Average 4% execution cycle increase with same cycle time (33% of delay increase before pipelining) Active power of sleepy stack pipelined SRAM increase 31% (low-voltage SRAM active power decreases 60%) When sleep mode is 3 times longer than active mode.00E-02 2.Processor performance Normalized execution cycles 1.00E-02 3. the sleepy stack pipelined cache is effective to save energy © 2005 Georgia Institute of Technology 58 D .00E-02 1.

Conclusion and contribution Sleepy stack structure achieves dramatic leakage power reduction (4-inverters. 215X over forced stack) while saving state with some delay and area overhead Sleepy stack SRAM cell provides new pareto points in ultra-low leakage power consumption (2.77X over highVth with 19% delay increase or 2.26X without delay increase) Low-power pipelined cache reduces cache power by lowering cache supply voltage (2-stage pipelined cache 20% of energy with 4% delay increase) Sleepy stack pipelined SRAM achieves 17X leakage reduction with small execution cycle (4%) increase and less than 2X estimate area increase © 2005 Georgia Institute of Technology 59 .

“Combining Data Remapping and Voltage/Frequency Scaling of Second Level Memory for Energy Reduction in Embedded Systems.” Technical Report GIT-CC-04-05. K. Pereira. Georgia Institute of Technology.gatech.” Technical Report GIT-CC-03-33. C. pp. “Golay and Wavelet Error Control Codes in VLSI. pp. C.edu/tech_reports/index. 563-564. “Golay and Wavelet Error Control Codes in VLSI. J. Kluwer Academic/Plenum Publishers. December 2003. 148-158. Balasundaram. J. [Online] Available http://www. V. J. 1019-1024. Pereira. A.” Proceedings of the International Workshop on Power and Timing Modeling. May 2002. [Online] Available http://www. Park and V. J. Georgia Institute of Technology. Optimization and Simulation (PATMOS’04). November 2003. Balasundaram. J. “Sleepy Stack Reduction in Leakage Power.04. V. [4] A. pp. © 2005 Georgia Institute of Technology 60 . Mooney. Mooney and S. Mooney.” Proceedings of the Asia and South Pacific Design Automation Conference (ASPDAC'04). J.03. [2] P. C. Pfeiffenberger. Park.” Microelectronics Journal. January 2004.html [5] J.html. J. Mooney.gatech. “Some Layouts Using the Sleepy Stack Approach. C. 211-224.cc. C. Srinivasan. Park. A.edu/tech_reports/index. Park and V.Publications [1] J. Mooney and P. J. September 2004. June 2004. pp. Pfeiffenberger.cc. Park and V. 34(11). [3] A.

Korkmaz. V. J. C. V. C. N. Diril. pp. October 2002. Lee. November 2002. 225-230. September 2002. K. J. K. Dhillon. USA and Robert Graybill. Mooney. VA. W. Mooney. V. W. University of Pittsburgh. J. Puttaswamy. pp. Choi. K. P. 67-73. DARPA/ITO. A. Park. pp. Puttaswamy. Chatterjee. 57-62. Choi. Park and V. [8] S. A. [9] K. pp. Palem and K. K. Ellervee. [10] J. Park and V. Arlington. Mooney.Publications [6] J.” in the book Power Aware Computing. published by Kluwer Academic/Plenum Publishers. J.” Proceedings of the International Symposium on System Synthesis (ISSS'02). Srinivasan. “System Level Power-Performance Trade-Offs in Embedded Systems Using Voltage and Frequency Scaling of Off-Chip Buses and Memory. C.” Proceedings of the International Workshop on Embedded System Codesign (ESCODES'02). C. 211-224. P. Ellervee. S. K. J. Mooney. J. Park. edited by Rami Melhem. J. Systems and Computers (ASILOMAR'02). “Energy Minimization of a Pipelined Processor using a Low Voltage Pipelined Cache. L. J. May 2002. W.” Conference Record of the 36th Asilomar Conference on Signals. “Pareto Points in SRAM Design Using the Sleepy Stack Approach. Chakrapani. F.” IFIP International Conference on Very Large Scale Integration (IFIP VLSISOC'05). [7] K. Palem and W. Choi. PA. Wong. K. Park. C. USA. “Power-Performance Trade-Offs in Second Level Memory Used by an ARM-Like RISC Architecture. Chatterjee and P. Mooney. “Combining Data Remapping and Voltage/Frequency Scaling of Second Level Memory for Energy Reduction in Embedded Systems. © 2005 Georgia Institute of Technology 61 . Y. U. 2005. K.