
Architectural-Level Low-Power

Design

Naehyuck Chang
Dept. of EECS/CSE
Seoul National University
naehyuck@snu.ac.kr

Precomputation-based optimization

 If a subset of the inputs has a special relation to the output

 Function Z is then known to be either 1 or 0
 f1 = 1 -> Z = 1 / f2 = 1 -> Z = 0
 Precompute the result one cycle earlier

Embedded Low-Power Laboratory


Precomputation-based optimization

 Benefit
 If either f1 = 1 or f2 = 1, the LE (Load Enable) of R2 is low
 No change at the output of R2
 Internal switching activity of Comb will decrease
 Trade-off
 f1 and f2 will generate additional switching activities
 f1 and f2 should be simple
 Power reduction
 Depending on the quality of f1 and f2



Precomputation-based optimization

 Example: n-bit comparator


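 f1 = A(n-1)B(n-1)’
 f2 = A(n-1)’B(n-1)
 f1 + f2 = A(n-1) XOR B(n-1)
 LE = A(n-1) XNOR B(n-1)
 Uniform signal probability
 LE is low in 50% of the cycles: up to about 50% power saving
 A(n-2) and B(n-2) can also be fed to R1
 Take care: more precomputation inputs make f1 and f2 more complex and add switching activity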
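
The comparator example can be checked with a minimal Python sketch. It assumes the comparator computes Z = (A > B) on unsigned n-bit inputs; the function names `precompute` and `disabled_fraction` are hypothetical and not from the slides. It enumerates all input pairs, confirms that f1 and f2 predict Z correctly, and counts how often LE of R2 would be low.

```python
from itertools import product

def precompute(a, b, n):
    """Return (f1, f2) computed from the most significant bits of a and b."""
    a_msb = (a >> (n - 1)) & 1
    b_msb = (b >> (n - 1)) & 1
    f1 = a_msb & (1 - b_msb)   # A(n-1) B(n-1)'  -> A > B, so Z = 1
    f2 = (1 - a_msb) & b_msb   # A(n-1)' B(n-1)  -> A < B, so Z = 0
    return f1, f2

def disabled_fraction(n):
    """Fraction of input pairs for which LE of R2 is low (f1 or f2 is 1)."""
    disabled = 0
    for a, b in product(range(1 << n), repeat=2):
        f1, f2 = precompute(a, b, n)
        if f1:
            assert a > b          # the prediction Z = 1 must hold
        if f2:
            assert not (a > b)    # the prediction Z = 0 must hold
        disabled += f1 | f2
    return disabled / (1 << (2 * n))
```

With uniformly distributed inputs the most significant bits differ for half of all pairs, which is where the 50% figure comes from; real input statistics may give a different fraction.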



Precomputation-based optimization

 What about two or more cycles in advance?


 Consider two cycle precomputation
 Example: adder-comparator circuit
 Latency: 2 cycles
 f1 = A(n-1)B(n-1)C(n-1)’D(n-1)’
 f2 = A(n-1)’B(n-1)’C(n-1)D(n-1)
 Power saving: 12.5% (= 2/16)
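
The 2/16 figure can be reproduced with a small Python sketch. It assumes the adder-comparator computes Z = (A + B > C + D) on unsigned n-bit operands, which the slides do not state explicitly; `precompute2` and `saving_fraction` are hypothetical names.

```python
from itertools import product

def msb(x, n):
    """Most significant bit of an n-bit value."""
    return (x >> (n - 1)) & 1

def precompute2(a, b, c, d, n):
    """Return (f1, f2) from the MSBs of the four operands."""
    am, bm, cm, dm = msb(a, n), msb(b, n), msb(c, n), msb(d, n)
    f1 = am & bm & (1 - cm) & (1 - dm)   # A+B >= 2^n > C+D, so Z = 1
    f2 = (1 - am) & (1 - bm) & cm & dm   # A+B < 2^n <= C+D, so Z = 0
    return f1, f2

def saving_fraction(n):
    """Fraction of input combinations where the result is precomputed."""
    hits = 0
    total = 0
    for a, b, c, d in product(range(1 << n), repeat=4):
        z = int(a + b > c + d)
        f1, f2 = precompute2(a, b, c, d, n)
        if f1:
            assert z == 1
        if f2:
            assert z == 0
        hits += f1 | f2
        total += 1
    return hits / total
```

Only 2 of the 16 MSB combinations decide the output early, so the achievable saving is much smaller than for the plain comparator.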



Precomputation-based optimization

 For general case


 Shannon’s expansion
 Z = x_j · Z_{x_j} + x_j’ · Z_{x_j’}
 Z_{x_j} and Z_{x_j’} are the cofactors of Z w.r.t. x_j
 Any logic expression can be represented in this form
 Selection of x_j is very important
 No precomputation logic is required
 Inputs are duplicated for both registers
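
Shannon's expansion can be illustrated with a short Python sketch. The helper names (`cofactor`, `shannon`, `equivalent`) and the majority function used as the example are assumptions for illustration only. The sketch builds both cofactors with respect to one input and checks that recombining them reproduces the original function.

```python
from itertools import product

def cofactor(f, j, value):
    """Return f with input j fixed to value (0 or 1)."""
    def g(*xs):
        xs = list(xs)
        xs[j] = value
        return f(*xs)
    return g

def shannon(f, j):
    """Rebuild f as x_j * f(x_j = 1) + x_j' * f(x_j = 0)."""
    pos = cofactor(f, j, 1)
    neg = cofactor(f, j, 0)
    def z(*xs):
        xj = xs[j]
        return (xj & pos(*xs)) | ((1 - xj) & neg(*xs))
    return z

def equivalent(f, g, n):
    """Exhaustively compare two n-input Boolean functions."""
    return all(f(*xs) == g(*xs) for xs in product((0, 1), repeat=n))

def majority(a, b, c):
    """Example function: 1 when at least two inputs are 1."""
    return (a & b) | (b & c) | (a & c)
```

In hardware terms, x_j selects which of the two cofactor blocks needs fresh inputs in a given cycle, so the register feeding the other block can keep its old value.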



Idleness (I)

 In any digital circuit, some components are idle at any given time


 Dynamic power management:
 Monitor components
 Shut components down
 Assumption
 The power management scheme and idleness detection consume a small fraction of the overall power



Precomputation-based optimization

 Gated clock versus precomputation


 Both techniques can be extended to cover the same classes of circuits
 The two techniques exploit different types of idleness
 External idleness:
 Precomputation
 Performs poorly when many signals are always observable
 Internal idleness:
 Gated clocks
 Performs poorly when outputs change at each cycle



Precomputation-based optimization

 Integrated architecture for pipelined design


 Gated clocks and pre-computation can be used jointly
 Exploit both kinds of idleness
 Use two idleness observers
 Gate input registers with two independent signals



Precomputation-based optimization

 Integrated architecture



Precomputation-based optimization

 Impact of the integrated architecture


 Precomputation and gated clocks are efficient for pipeline design
 Precomputation is not efficient for sequential circuits
 Integrated scheme performs better for sequential circuits
 These techniques are not effective for some circuits



Low-Power Interconnect

 Interconnect heavily affects power consumption


 Since it is the medium of most electrical activity
 Efforts to improve chip performance are resulting in smaller chips
with more transistors and more densely packed wires carrying
larger currents
 The wires in a chip often use materials with poor thermal
conductivity



Low-Power Interconnect

 Bus encoding
 Presence of heavily loaded global communication paths
 High capacitances
 May be several orders of magnitude higher than an internal node
 One transition can dissipate as much power as several hundred
internal transitions
 Miller multiplication makes this worse



Low-Power Interconnect

 Traditional bus encoding


 Attack the “α” portion of P = αCV²f
 Encode Data



Low-Power Interconnect

 Encoding schemes
 Redundancy
 Space
 Time
 Both
 Level or transition signaling
 DC power or switching power
 Switching power reduction
 Maximum
 Average
 Information Carried
 Data or address
 Random or sequential
 Compressed
Low-Power Interconnect

 Bus invert coding


 One extra “Invert” line
 Compute the Hamming distance, H
 If H > N/2, invert data, invert = 1
 If H <= N/2, normal data, invert = 0
 Reduces the maximum switching by ½
 Two codewords exist for each information symbol
 Pick codeword with lowest activity factor
 Gray coding
 For incremental counting sequences
 T0 encoding
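
The bus invert rule above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it assumes the bus starts at all zeros, measures the Hamming distance against the data lines only (ignoring the invert line), and uses the hypothetical names `bus_invert_encode`, `bus_invert_decode`, and `transitions`.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def bus_invert_encode(words, n_bits):
    """Encode words as (bus_value, invert) pairs using bus invert coding."""
    mask = (1 << n_bits) - 1
    prev = 0                      # assumed initial state of the data lines
    encoded = []
    for w in words:
        w &= mask
        if hamming(prev, w) > n_bits / 2:
            sent, invert = (~w) & mask, 1   # H > N/2: send inverted data
        else:
            sent, invert = w, 0             # H <= N/2: send data unchanged
        encoded.append((sent, invert))
        prev = sent
    return encoded

def bus_invert_decode(encoded, n_bits):
    """Recover the original words from (bus_value, invert) pairs."""
    mask = (1 << n_bits) - 1
    return [(~sent) & mask if invert else sent for sent, invert in encoded]

def transitions(values, start=0):
    """Total number of bit transitions when values are driven in order."""
    total, prev = 0, start
    for v in values:
        total += hamming(prev, v)
        prev = v
    return total
```

For the worst-case pattern 0x00, 0xFF, 0x00, 0xFF on an 8-bit bus, the unencoded bus toggles 24 data-line bits, while the encoded data lines never toggle and only the invert line switches.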



Low-Power Interconnect

 Limited weight coding (N-LW)


 Use transition signaling scheme
 Minimize number of ones
 Limit number of transitions
 Reduce switching power
 An N-LW code has at most N ones
 Can use arbitrarily large number of extra lines
 Extreme case: one-hot encoding with 2^n - 1 lines, where n is the number of bits to be encoded
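
The trade-off between weight limit and number of lines can be worked out with a short Python sketch. It assumes a code on M lines may use every pattern with at most N ones, so it offers C(M,0) + ... + C(M,N) codewords; `codewords` and `min_lines` are hypothetical helper names.

```python
from math import comb

def codewords(lines, max_ones):
    """Number of patterns on `lines` wires with at most `max_ones` ones."""
    return sum(comb(lines, k) for k in range(max_ones + 1))

def min_lines(n_bits, max_ones):
    """Smallest number of lines whose limited-weight code covers 2^n_bits symbols."""
    lines = 1
    while codewords(lines, max_ones) < (1 << n_bits):
        lines += 1
    return lines
```

For 4 data bits, a limit of one transition per transfer needs 15 lines (the all-zero word plus 15 one-hot words), a limit of two needs 5 lines, and no limit needs the original 4.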



Low-Power Interconnect

 On-chip bus encoding: crosstalk


 As chips become smaller, there arise additional sources of power
consumption
 One of these sources is crosstalk
 Spurious activity on a wire that is caused by activity in neighboring
wires
 As well as increasing delays and impairing circuit integrity, crosstalk
can increase power consumption
 Solutions
 Inserting shield wire between adjacent bus wires
 Area overhead

[Figure: crosstalk between adjacent bus wires switching 0->1 and 1->0, with a shield wire (0->0) inserted between them]



Low-Power Interconnect

 Low Swing Buses


 A bus can transmit the same information but at a lower voltage
 This is implemented with differential signaling
 A signal is split into two signals of opposite polarity bounded by a
smaller voltage range
 Side-effects
 Pros.
 More immune to crosstalk and electromagnetic radiation effects
 Cons.
 Increased hardware at the encoder and decoder



Low-Power Interconnect

 Bus Segmentation
 Splitting a bus into multiple segments connected by links that
regulate the traffic between adjacent segments
 Disadvantage
 Multiple clock cycles are needed to reach other segments

Activated Paths



Low-Power Interconnect

 Adiabatic Buses
 Reducing the total capacitance
 These circuits reuse existing electrical charge to avoid creating
new charge
 Recycling the charge for wires about to be asserted
 In a traditional bus, when a wire becomes deasserted, its previous
charge is wasted



Low-Power Interconnect

 NOC (Network On Chip)


 Previously explained techniques can be adopted



Power Analysis: ARM920T



Breakdown Of Power Consumption



Power Consumption Variations

[Figure: power consumption variations for tech: 0.8, BlockSize: 64b, Banks: 1, with ports: 1, 2, and 4]



Low-Power Memories (1)

 The techniques that will be introduced are not confined to any specific type of RAM or ROM
 Rather, they are high-level architectural principles that apply across the spectrum of memories to the extent that the required technology is available
 Two approaches
 Reducing the energy dissipation in a memory access
 Reducing the number of memory accesses



Low-Power Memories (2)

 Splitting Memories into Smaller Sub-Systems


 Activating only the needed memory circuits in each access
 One way to expose these circuits is to partition memories into
smaller, independently accessible components
 Main Memory
 Multiple banks
 Compiler and OS
 Clustering data into a minimal set of banks, allowing the other banks
to power down and thereby save energy



Low-Power Memories (3)

 Splitting Memories into Smaller Sub-Systems


 Case Study: Banked Cache



Low-Power Memories (4)

 Splitting Memories into Smaller Sub-Systems


 Case Study: Partitioned Power-Aware Instruction Cache
 Motivation: the most recently accessed page is likely to be accessed again



Low-Power Memories (5)

 Splitting Memories into Smaller Sub-Systems


 Case Study: Partitioned Power-Aware Instruction Cache



Low-Power Memories (6)

 Splitting Memories into Smaller Sub-Systems


 Case Study: Partitioned Power-Aware Instruction Cache



Low-Power Memories (7)

 Augmenting the Memory Hierarchy with Specialized Cache


Structure
 L0 cache (Filter Cache)
 New cache between the processor and L1 cache
 Large enough to store an application’s working set and filter out
many memory references



Low-Power Memories (8)

 Augmenting the Memory Hierarchy with Specialized Cache


Structure
 Filter TLB (Translation Lookaside Buffer)
 Filter TLB: filters main TLB accesses, low power consumption
 Main TLB (fully associative type): high performance, but high power consumption
 Together they achieve low power consumption and high performance
[Figure: filter TLB and main TLB connected through a MUX; labels: required PTE, missed PTE]
Low-Power Memories (9)

 Augmenting the Memory Hierarchy with Specialized Cache


Structure
 Scratch Pad Memory
 The data they contain is determined ahead of time and remains fixed while programs execute
 These memories are less volatile than conventional caches and can
be accessed within a cycle
 They are ideal for specialized embedded systems whose applications
are often hand tuned



Low-Power Memories (10)

 Augmenting the Memory Hierarchy with Specialized Cache


Structure
 Scratch Pad Memory
 Case Study: ARM TCM



Low-Power Memories (11)

 Augmenting the Memory Hierarchy with Specialized Cache


Structure
 Trace Cache
 Instead of storing instructions in their compiled order, a trace cache
stores traces of instructions in their executed order
 If an instruction sequence is already in the trace cache, then it need
not be fetched from the instruction cache but can be decoded directly
from the trace cache
 This saves power by reducing the number of instruction cache
accesses



Low-Power Memories (12)

 Augmenting the Memory Hierarchy with Specialized Cache


Structure
 Dynamic Direction Prediction-Based Trace Cache
 Using branch prediction to decide where to fetch instructions from
 If the branch predictor predicts the next trace with high confidence
and that trace is in the trace cache, then the instructions are fetched
from the trace cache rather than the instruction cache



Low-Power Memories (13)

 Augmenting the Memory Hierarchy with Specialized Cache


Structure
 Selective Trace Cache
 Identifies frequently executed, or hot, traces and brings these traces into the trace cache
 The compiler inserts hints after every branch instruction, indicating
whether it belongs to a hot trace
 At runtime, these hints cause instructions to be fetched from either
the trace cache or the instruction cache



Outline

 Circuit and Logic Level Techniques


 Low-Power Interconnect
 Low-Power Memories
 Low-Power Caches
 Adaptive Instruction Queue
 Algorithms for Reconfiguring Multiple Structures
 Indeterminism and Anomalies in Real Systems



Low-Power Caches

 Basic Structure



Adaptive Caches (1)

 Dynamically Resizable Instruction (DRI) Cache


 Storage elements can be selectively activated based on the
application workload
 It can deactivate its individual sets on demand by gating their
supply voltages
 To decide what sets to activate at any given time, the cache uses
a hardware profiler that monitors the application’s cache-miss
patterns
 Whenever the cache misses exceed a threshold, the DRI cache
activates previously deactivated sets
 Limitation
 Memory cells lose data and need more time to be reactivated for their
next use



Adaptive Caches (2)

 Dynamically Resizable Instruction (DRI) Cache



Adaptive Caches (3)

 Cache Decay
 Activates/deactivates individual cache lines
 If it has not been accessed for a pre-determined amount of time
 A cache line is placed in the sleep mode
 If it is re-accessed
 A cache line is re-activated
 Please note that the data is lost
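
A minimal Python model of cache decay, assuming a direct-mapped cache indexed by `addr % num_lines`, a fixed decay time in cycles, and the full address stored as the tag; the class name `DecayCache` and its methods are hypothetical. It shows that a decayed line loses its contents, so the next access to it is a miss.

```python
class DecayCache:
    """Direct-mapped cache whose idle lines are put to sleep and lose data."""

    def __init__(self, num_lines, decay_time):
        self.decay_time = decay_time
        self.tags = [None] * num_lines       # None: line is asleep or empty
        self.last_access = [0] * num_lines

    def tick(self, now):
        """Put every line that has been idle for decay_time cycles to sleep."""
        for i, tag in enumerate(self.tags):
            if tag is not None and now - self.last_access[i] >= self.decay_time:
                self.tags[i] = None          # data is lost in sleep mode

    def access(self, addr, now):
        """Access addr at cycle `now`; return True on a hit."""
        self.tick(now)
        i = addr % len(self.tags)
        hit = self.tags[i] == addr
        self.tags[i] = addr                  # (re)activate and fill the line
        self.last_access[i] = now
        return hit

    def active_lines(self):
        """Number of lines currently powered."""
        return sum(tag is not None for tag in self.tags)
```

A short decay time saves more leakage but turns more would-be hits into misses, which is the parameter sensitivity discussed for decay-based schemes.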



Adaptive Caches (4)

 DRI and Cache Decay


 Disadvantages
 Control mechanisms are highly dependent on arbitrary parameters
 Application-specific
 DRI
 Miss bound is chosen based on the typical miss rate of application
 DRI may require cache profiling
 Cache Line Decay
 Uses a different parameter, decay time



Adaptive Caches (5)

 Drowsy Cache
 Data is retained
 Wakeup penalty is 1 cycle in the 70 nm technology
 Details are explained in Module 6: Low-Power CMOS RAM Circuits



Adaptive Caches (6)

 Dead Block Elimination


 Powering down cache lines containing basic blocks that have
reached their final use
 Compiler-directed technique
 Example
 B1 is a dead block after B1 is executed

B1 B2 B3 B4



Adaptive Caches (7)

 Gated Precharging
 In a 100-cycle window,
 >95% of cache accesses reuse <30% of subarrays
 Most accesses are temporally localized in a small number of subarrays
 Decay counter per subarray
 Threshold value to decide “when” to precharge
 Algorithm
 If count < threshold: precharge
 If count > threshold: no precharge
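
The per-subarray decay counter can be sketched in Python. This is an illustration under assumptions not stated on the slide: the counter counts cycles since the last access, saturates at the threshold, treats count equal to the threshold as "no precharge", and starts cold; the class name `GatedPrecharge` is hypothetical.

```python
class GatedPrecharge:
    """One saturating decay counter per cache subarray."""

    def __init__(self, num_subarrays, threshold):
        self.threshold = threshold
        # Start cold: every counter saturated, so nothing is precharged.
        self.count = [threshold] * num_subarrays

    def precharged(self, i):
        """A subarray stays precharged while its counter is below the threshold."""
        return self.count[i] < self.threshold

    def cycle(self, accessed=None):
        """Advance one cycle; return True if the accessed subarray was not precharged."""
        stalled = accessed is not None and not self.precharged(accessed)
        for i in range(len(self.count)):
            self.count[i] = min(self.count[i] + 1, self.threshold)
        if accessed is not None:
            self.count[accessed] = 0         # recently used: keep it precharged
        return stalled

    def precharged_count(self):
        """Number of subarrays currently kept precharged."""
        return sum(self.precharged(i) for i in range(len(self.count)))
```

Because most accesses go to a few recently used subarrays, a small threshold keeps only those precharged while rarely delaying an access.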



Adaptive Caches (8)

 6-T SRAM cell and precharge logic



Adaptive Caches (9)

 Gated Precharging
 Can be combined with predecoding

[Figure: pipeline stages Fetch, Rename, Issue, Reg Read, Execution, DCache, Commit, with predecoding and bitline precharging]



Outline

 Circuit and Logic Level Techniques


 Low-Power Interconnect
 Low-Power Memories
 Low-Power Caches
 Adaptive Instruction Queue
 Algorithms for Reconfiguring Multiple Structures
 Indeterminism and Anomalies in Real Systems



Adaptive Instruction Queues (1)

 A 32-entry queue consisting of four equal size partitions that can be independently activated
 Each partition consists of wakeup logic that decides when
instructions are ready to execute and readout logic that
dispatches ready instructions into the pipeline
 At any time, only the partitions that contain the currently
executing instructions are activated
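
Which partitions need to be powered can be sketched in a few lines of Python. The sketch assumes a queue of 32 slots split into four partitions of eight slots and represents occupancy as a collection of slot indices; `active_partitions` is a hypothetical name.

```python
def active_partitions(occupied_slots, queue_size=32, num_partitions=4):
    """Return one flag per partition: True if it holds at least one instruction."""
    part_size = queue_size // num_partitions
    active = [False] * num_partitions
    for slot in occupied_slots:
        active[slot // part_size] = True
    return active
```

For example, instructions in slots 0, 1, and 9 keep only the first two partitions active, so the wakeup and readout logic of the other two can stay off.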



Adaptive Instruction Queues (2)

 Heuristic algorithms
 Measuring IPC to determine wakeup
 Separating INT IPC from FP IPC
 The youngest part of the issue queue contributes very little to the overall IPC because instructions in this part are often committed late



References

 Basic References
 V. Venkatachalam and M. Franz, “Power Reduction Techniques for Microprocessor Systems”, ACM Computing Surveys, 2005.
 Advanced References
 N. Jouppi, Cacti, available at www.hpl.hp.com/personal/Norman_Jouppi.cacti4.html
 C.-H. Kim, et al., “PP-Cache: A Partitioned Power-aware Instruction Cache Architecture”, Microprocessors and Microsystems, 2006.
 J.-H. Lee, et al., “A Selective Filter-Bank TLB System”, ISLPED 2003.
 ARM Corp., TCM (Tightly Coupled Memory), available at www.arm.com
 Footnotes are omitted due to the complexity

