Architectural-Level Low-Power
Design
Naehyuck Chang
Dept. of EECS/CSE
Seoul National University
naehyuck@snu.ac.kr
Precomputation-based optimization
 If a subset of the inputs has a special relation with the output

 Function Z evaluates to either 1 or 0
 f1 = 1 → Z = 1; f2 = 1 → Z = 0
 Compute the result one cycle earlier
Embedded Low-Power Laboratory

 Benefit
 If either f1 = 1 or f2 = 1, the LE (Load Enable) of R2 is low
 No change at the output of R2
 Internal switching activity of Comb decreases
 Trade-off
 f1 and f2 generate additional switching activity
 f1 and f2 should be simple
 Power reduction
 Depends on the quality of f1 and f2

 Example: n-bit comparator

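 f1 = A(n-1)B(n-1)’
 f2 = A(n-1)’B(n-1)
 LE = (f1 XOR f2)’ = A(n-1) XNOR B(n-1)
 With uniform signal probability, LE is low in 50% of the cycles
 About 50% power saving
 A(n-2) and B(n-2) can also be fed to R1
 This should be done carefully, since larger f1 and f2 add their own switching activity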
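
The comparator example can be checked with a short simulation. The sketch below is illustrative only: it assumes the comparator computes Z = (A > B), and the names `precompute` and `disabled_fraction` are made up for this example. It enumerates all input pairs, confirms that f1 and f2 predict Z correctly whenever they fire, and reports how often the load enable of R2 would be low.

```python
from itertools import product

def precompute(a, b, n):
    """Return (f1, f2) computed from the most significant bits only."""
    a_msb = (a >> (n - 1)) & 1
    b_msb = (b >> (n - 1)) & 1
    f1 = a_msb & (1 - b_msb)   # A(n-1) B(n-1)': Z is known to be 1
    f2 = (1 - a_msb) & b_msb   # A(n-1)' B(n-1): Z is known to be 0
    return f1, f2

def disabled_fraction(n):
    """Fraction of input pairs for which LE of R2 is low (f1 or f2 is 1)."""
    disabled = 0
    total = 0
    for a, b in product(range(2 ** n), repeat=2):
        f1, f2 = precompute(a, b, n)
        if f1 or f2:
            # The precomputed result must agree with the full comparator.
            assert (a > b) == bool(f1)
            disabled += 1
        total += 1
    return disabled / total

print(disabled_fraction(4))  # 0.5 for uniformly distributed inputs
```

The 0.5 figure is the fraction of cycles in which R2 is not loaded, which is an upper bound on the saving in Comb; the extra switching in f1 and f2 is not modeled here.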

 What about two or more cycles in advance?

 Consider two-cycle precomputation
 Example: adder-comparator circuit
 Latency: 2 cycles
 f1 = A(n-1)B(n-1)C(n-1)’D(n-1)’
 f2 = A(n-1)’B(n-1)’C(n-1)D(n-1)
 Power saving: 12.5% (= 2/16)
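
The 2/16 figure can be reproduced by enumeration. This sketch assumes the adder-comparator computes Z = (A + B) > (C + D) on unsigned n-bit operands; that interpretation and the function names are assumptions for illustration, not taken from the slides.

```python
from itertools import product

def msb(x, n):
    """Most significant bit of an n-bit value."""
    return (x >> (n - 1)) & 1

def precomputed_fraction(n):
    """Fraction of inputs for which f1 or f2 decides Z from the MSBs alone."""
    hits = 0
    total = 0
    for a, b, c, d in product(range(2 ** n), repeat=4):
        am, bm, cm, dm = (msb(v, n) for v in (a, b, c, d))
        f1 = am & bm & (1 - cm) & (1 - dm)   # A(n-1)B(n-1)C(n-1)'D(n-1)'
        f2 = (1 - am) & (1 - bm) & cm & dm   # A(n-1)'B(n-1)'C(n-1)D(n-1)
        z = (a + b) > (c + d)
        if f1:
            assert z          # f1 = 1 implies Z = 1
        if f2:
            assert not z      # f2 = 1 implies Z = 0
        hits += f1 | f2
        total += 1
    return hits / total

print(precomputed_fraction(3))  # 0.125, i.e. 2 of the 16 MSB combinations
```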

 For general case

 Shannon’s expansion
 Z = x_j · Z_{x_j} + x_j’ · Z_{x_j’}
 Z_{x_j} and Z_{x_j’} are the cofactors of Z w.r.t. x_j
 Any logic expression can be represented as shown above
 Selection of x_j is very important
 No precomputation logic is required
 Inputs are duplicated for both registers
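
Shannon's expansion can be verified on a small function. The sketch below uses a 3-input majority function as an arbitrary example; the helper names `cofactor` and `shannon` are hypothetical. It builds both cofactors with respect to a chosen input and checks that recombining them reproduces the original function on every input vector.

```python
from itertools import product

def cofactor(f, j, value):
    """Return the cofactor of f with input j fixed to value (0 or 1)."""
    def g(*xs):
        xs = list(xs)
        xs[j] = value
        return f(*xs)
    return g

def shannon(f, j):
    """Rebuild f as x_j * f_{x_j} + x_j' * f_{x_j'}."""
    pos = cofactor(f, j, 1)
    neg = cofactor(f, j, 0)
    return lambda *xs: (xs[j] & pos(*xs)) | ((1 - xs[j]) & neg(*xs))

def majority(a, b, c):
    """Example function: 1 when at least two inputs are 1."""
    return (a & b) | (b & c) | (a & c)

rebuilt = shannon(majority, 0)
equivalent = all(rebuilt(*xs) == majority(*xs)
                 for xs in product((0, 1), repeat=3))
print(equivalent)  # True
```

In a precomputation setting, x_j would select which cofactor's register is loaded in a given cycle, so only one of the two cofactor blocks sees new inputs.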

Idleness (I)
 In any digital circuit, some components are idle at any given time

 Dynamic power management:
 Monitor components
 Shut components down
 Assumption
 The power management scheme and idleness detection consume a small fraction of the overall power

 Gated clock versus precomputation

 Both techniques can be extended to cover the same classes of circuits
 The two techniques exploit different types of idleness
 External idleness:
 Precomputation
 Performs poorly when many signals are always observable
 Internal idleness:
 Gated clocks
 Performs poorly when outputs change at each cycle

 Integrated architecture for pipelined design

 Gated clocks and pre-computation can be used jointly
 Exploit both kinds of idleness
 Use two idleness observers
 Gate input registers with two independent signals

 Integrated architecture

 Impact of the integrated architecture

 Precomputation and gated clocks are efficient for pipeline design
 Precomputation is not efficient for sequential circuits
 Integrated scheme performs better for sequential circuits
 These techniques are not effective for some circuits

Low-Power Interconnect
 Interconnect heavily affects power consumption, since it is the medium of most electrical activity
 Efforts to improve chip performance are resulting in smaller chips with more transistors and more densely packed wires carrying larger currents
 The wires in a chip often use materials with poor thermal conductivity

 Bus encoding
 Presence of heavily loaded global communication paths
 High capacitances
 May be several orders of magnitude higher than that of an internal node
 One transition can dissipate as much power as several hundred internal transitions
 Miller multiplication makes this worse

 Traditional bus encoding

 Attack the “α” portion of P = αCV²f
 Encode Data

 Encoding schemes can be classified by
 Redundancy: space, time, or both
 Signaling: level or transition
 Power targeted: DC power or switching power
 Switching power reduction: maximum or average
 Information carried: data or address, random or sequential, compressed
 Bus invert coding

 One extra “Invert” line
 Compute the Hamming distance, H, between the current bus value and the next data word
 If H > N/2, invert data, invert = 1
 If H <= N/2, normal data, invert = 0
 Reduces the maximum switching by ½
 Two codewords exist for each information symbol
 Pick codeword with lowest activity factor
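
A minimal sketch of the bus-invert rule follows. It assumes the Hamming distance is taken over the N data lines only and that the bus starts at all zeros; the function names are illustrative. The encoder sends either the word or its complement together with the invert bit, and the decoder undoes the inversion.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def bus_invert_encode(words, n):
    """Encode n-bit words; returns a list of (bus_value, invert) pairs."""
    mask = (1 << n) - 1
    bus = 0                      # assumed initial state of the data lines
    encoded = []
    for w in words:
        if hamming(bus, w) > n / 2:
            bus, invert = (~w) & mask, 1   # send the complement
        else:
            bus, invert = w, 0             # send the word unchanged
        encoded.append((bus, invert))
    return encoded

def bus_invert_decode(encoded, n):
    """Recover the original words from (bus_value, invert) pairs."""
    mask = (1 << n) - 1
    return [(~v) & mask if inv else v for v, inv in encoded]

words = [0x00, 0xFF, 0x0F, 0xF0]
encoded = bus_invert_encode(words, 8)
print(encoded)  # [(0, 0), (0, 1), (15, 0), (15, 1)]
```

With this rule, no more than N/2 data lines toggle in any transfer; the invert line itself may add one more transition.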
 Gray coding
 For incremental counting sequences
 T0 encoding
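
The benefit of Gray coding on an incrementing sequence can be illustrated by counting bus transitions. The sketch below uses the standard binary-reflected Gray code; the 16-address sequence is an arbitrary example.

```python
def to_gray(x):
    """Binary-reflected Gray code of a non-negative integer."""
    return x ^ (x >> 1)

def transitions(sequence):
    """Total number of bit flips between consecutive bus values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(sequence, sequence[1:]))

addresses = list(range(16))                      # sequential address stream
binary_flips = transitions(addresses)
gray_flips = transitions([to_gray(a) for a in addresses])
print(binary_flips, gray_flips)  # 26 15
```

Each increment flips exactly one line under Gray coding, so the advantage holds only while the stream stays sequential.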

 Limited weight coding (N-LW)

 Use transition signaling scheme
 Minimize number of ones
 Limit number of transitions
 Reduce switching power
 An N-LW code has at most N ones
 Can use an arbitrarily large number of extra lines
 Extreme case: one-hot encoding with 2^n − 1 lines, where n is the number of bits to be encoded
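
The extreme case can be sketched as a 1-limited-weight code combined with transition signaling. The mapping below (value 0 to the all-zero codeword, value v to a single 1 on line v − 1) is one possible assignment chosen for illustration, and the function names are hypothetical.

```python
def one_hot_codeword(value, n):
    """1-LW codeword on 2**n - 1 lines; value 0 maps to all zeros."""
    assert 0 <= value < 2 ** n
    return 0 if value == 0 else 1 << (value - 1)

def transmit(values, n):
    """Transition signaling: each 1 in the codeword toggles that bus line."""
    bus = 0
    states = []
    for v in values:
        bus ^= one_hot_codeword(v, n)
        states.append(bus)
    return states

def receive(states):
    """Recover values by XOR-ing consecutive bus states."""
    prev = 0
    values = []
    for s in states:
        codeword = prev ^ s
        values.append(0 if codeword == 0 else codeword.bit_length())
        prev = s
    return values

sent = [3, 0, 3, 1]
states = transmit(sent, 2)
print(receive(states) == sent)  # True
```

Every transfer toggles at most one of the 2^n − 1 lines, which is the switching reduction bought with the extra wires.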

 On-chip bus encoding: crosstalk

 As chips become smaller, there arise additional sources of power consumption
 One of these sources is crosstalk
 Spurious activity on a wire that is caused by activity in neighboring wires
 As well as increasing delays and impairing circuit integrity, crosstalk can increase power consumption
 Solutions
 Inserting shield wire between adjacent bus wires
 Area overhead
[Figure: crosstalk between adjacent bus wires making 0->1 and 1->0 transitions, with a shield wire held at 0->0]

 Low Swing Buses

 A bus can transmit the same information but at a lower voltage
 This is implemented with differential signaling
 A signal is split into two signals of opposite polarity bounded by a smaller voltage range
 Side-effects
 Pros.
 Immune to crosstalk and electromagnetic radiation effects
 Cons.
 Increased hardware at the encoder and decoder

 Bus Segmentation
 Splitting a bus into multiple segments connected by links that regulate the traffic between adjacent segments
 Disadvantage
 Multiple clock cycles are needed to reach other segments
[Figure: activated paths]

 Adiabatic Buses
 Reducing the total capacitance
 These circuits reuse existing electrical charge to avoid creating new charge
 Recycling the charge for wires about to be asserted
 In a traditional bus, when a wire becomes deasserted, its previous charge is wasted

 NOC (Network On Chip)

 Previously explained techniques can be adopted

Power Analysis: ARM920T

Breakdown Of Power Consumption

Power Consumption Variations
tech: 0.8 BlockSize: 64b Banks: 1 ports: 1


Low-Power Memories (1)
 The techniques that will be introduced are not confined to any specific type of RAM or ROM
 Rather, they are high-level architectural principles that apply across the spectrum of memories to the extent that the required technology is available
 Two approaches
 Reducing the energy dissipation in a memory access
 Reducing the number of memory accesses

 Splitting Memories into Smaller Sub-Systems

 Activating only the needed memory circuits in each access
 One way to expose these circuits is to partition memories into smaller, independently accessible components
 Main Memory
 Multiple banks
 Compiler and OS
 Clustering data into a minimal set of banks, allowing the other banks to power down and thereby save energy


 Case Study: Banked Cache


 Case Study: Partitioned Power-Aware Instruction Cache
 Motivation: the most recently accessed page is likely to be accessed again





 Augmenting the Memory Hierarchy with Specialized Cache Structure
 L0 cache (Filter Cache)
 New cache between the processor and L1 cache
 Large enough to store an application’s working set and filter out many memory references
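
The filtering effect can be illustrated with a toy model. The sketch below assumes a tiny fully associative L0 with LRU replacement and 16-byte lines; this organization, the class name `FilterCache`, and the address trace are all assumptions made for the example. It counts how many references still reach the L1 cache.

```python
from collections import OrderedDict

class FilterCache:
    """Toy L0 cache placed in front of L1 (assumed fully associative, LRU)."""

    def __init__(self, num_lines, line_size=16):
        self.num_lines = num_lines
        self.line_size = line_size
        self.lines = OrderedDict()   # line address -> present, in LRU order
        self.l0_hits = 0
        self.l1_accesses = 0

    def access(self, address):
        """Return True on an L0 hit; otherwise forward the access to L1."""
        line = address // self.line_size
        if line in self.lines:
            self.lines.move_to_end(line)      # mark as most recently used
            self.l0_hits += 1
            return True
        self.l1_accesses += 1                 # L0 miss: L1 must be accessed
        self.lines[line] = True
        if len(self.lines) > self.num_lines:
            self.lines.popitem(last=False)    # evict least recently used
        return False

fc = FilterCache(num_lines=4)
for _ in range(10):                           # a loop re-fetching 64 bytes
    for address in range(0, 64, 4):
        fc.access(address)
print(fc.l0_hits, fc.l1_accesses)  # 156 4
```

In this trace only the first pass touches L1, so the larger and more power-hungry cache is accessed 4 times instead of 160.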


 Filter TLB (Translation Lookaside Buffer)
 The filter TLB filters accesses to the main TLB and has low power consumption
 The main TLB (fully associative type) gives high performance but has high power consumption
 Together they achieve low power consumption and high performance
[Figure: filter TLB, main TLB, and MUX; labels: required PTE, missed PTE]

 Scratch Pad Memory
 The data they contain is determined ahead of time and remains fixed while programs execute
 These memories are less volatile than conventional caches and can be accessed within a cycle
 They are ideal for specialized embedded systems whose applications are often hand tuned


 Scratch Pad Memory
 Case Study: ARM TCM


 Trace Cache
 Instead of storing instructions in their compiled order, a trace cache stores traces of instructions in their executed order
 If an instruction sequence is already in the trace cache, then it need not be fetched from the instruction cache but can be decoded directly from the trace cache
 This saves power by reducing the number of instruction cache accesses


 Dynamic Direction Prediction-Based Trace Cache
 Using branch prediction to decide where to fetch instructions from
 If the branch predictor predicts the next trace with high confidence and that trace is in the trace cache, then the instructions are fetched from the trace cache rather than the instruction cache


 Selective Trace Cache
 Identify frequently executed, or hot, traces and bring these traces into the trace cache
 The compiler inserts hints after every branch instruction, indicating whether it belongs to a hot trace
 At runtime, these hints cause instructions to be fetched from either the trace cache or the instruction cache

Outline
 Circuit and Logic Level Techniques

 Low-Power Interconnect
 Low-Power Memories
 Low-Power Caches
 Adaptive Instruction Queue
 Algorithms for Reconfiguring Multiple Structures
 Indeterminism and Anomalies in Real Systems

Low-Power Caches
 Basic Structure

Adaptive Caches (1)
 Dynamically Resizable Instruction (DRI) Cache

 Storage elements can be selectively activated based on the application workload
 It can deactivate its individual sets on demand by gating their supply voltages
 To decide what sets to activate at any given time, the cache uses a hardware profiler that monitors the application’s cache-miss patterns
 Whenever the cache misses exceed a threshold, the DRI cache activates previously deactivated sets
 Limitation
 Memory cells lose data and need more time to be reactivated for their next use
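
The resizing decision can be sketched as a per-interval controller. The slide only states the upsizing rule (misses above a threshold reactivate sets); the doubling and halving steps, the downsizing when misses stay below the bound, and all names below are assumptions made for illustration.

```python
class DRICacheSizer:
    """Toy controller that picks the number of active sets each interval."""

    def __init__(self, total_sets, min_sets, miss_bound):
        self.total_sets = total_sets
        self.min_sets = min_sets
        self.miss_bound = miss_bound
        self.active_sets = total_sets   # start with the whole cache powered

    def end_of_interval(self, misses):
        """Resize based on the misses counted in the interval just ended."""
        if misses > self.miss_bound:
            # Too many misses: reactivate previously deactivated sets.
            self.active_sets = min(self.total_sets, self.active_sets * 2)
        else:
            # Assumed policy: gate off half of the active sets.
            self.active_sets = max(self.min_sets, self.active_sets // 2)
        return self.active_sets

sizer = DRICacheSizer(total_sets=64, min_sets=8, miss_bound=10)
sizes = [sizer.end_of_interval(m) for m in (2, 3, 50, 1)]
print(sizes)  # [32, 16, 32, 16]
```

The sketch ignores the limitation noted above: a reactivated set comes back empty, so each upsizing is followed by extra misses.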

Adaptive Caches (2)
 Dynamically Resizable Instruction (DRI) Cache

Adaptive Caches (3)
 Cache Decay
 Activates/deactivates individual cache lines
 If a cache line has not been accessed for a predetermined amount of time, it is placed in sleep mode
 If it is accessed again, the cache line is reactivated
 Note that the data in a sleeping line is lost
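
A minimal model of the decay policy is sketched below. It tracks only which lines are powered and when each was last accessed; the class name, the cycle-based interface, and the `decay_interval` parameter are assumptions for illustration.

```python
class DecayCache:
    """Toy cache-decay model: idle lines are put to sleep and lose data."""

    def __init__(self, decay_interval):
        self.decay_interval = decay_interval
        self.last_access = {}   # tag -> cycle of last access (powered lines)

    def tick(self, now):
        """Put to sleep every line idle for at least decay_interval cycles."""
        expired = [tag for tag, last in self.last_access.items()
                   if now - last >= self.decay_interval]
        for tag in expired:
            del self.last_access[tag]      # data is lost with the power

    def access(self, tag, now):
        """Return True on a hit; a decayed line misses and is reactivated."""
        self.tick(now)
        hit = tag in self.last_access
        self.last_access[tag] = now
        return hit

cache = DecayCache(decay_interval=4)
results = [cache.access("A", t) for t in (0, 2, 10, 11)]
print(results)  # [False, True, False, True]
```

The third access misses only because the line decayed, which is the extra cost traded for the leakage saved while it slept.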

Adaptive Caches (4)
 DRI and Cache Decay

 Disadvantages
 Control mechanisms are highly dependent on arbitrary parameters
 Application-specific
 DRI
 The miss bound is chosen based on the typical miss rate of the application
 DRI may require cache profiling
 Cache Line Decay
 Uses a different parameter, decay time

Adaptive Caches (5)
 Drowsy Cache
 Data is retained
 Wakeup penalty is 1 cycle in the 70 nm technology
 Details are explained in Module 6: Low-Power CMOS RAM Circuits

Adaptive Caches (6)
 Dead Block Elimination

 Powering down cache lines containing basic blocks that have reached their final use
 Compiler-directed technique
 Example
 B1 is a dead block after B1 is executed
[Figure: basic blocks B1, B2, B3, B4]

Adaptive Caches (7)
 Gated Precharging
 In a 100-cycle window,
 >95% of cache accesses reuse <30% of subarrays
 Most accesses are temporally localized in a small number of subarrays
 Decay counter per subarray
 Threshold value to decide “when” to precharge
 Algorithm
 If count < threshold: precharge
 If count > threshold: no precharge
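
The counter rule can be sketched as follows. The slide leaves the case count = threshold open; this sketch treats it as "no precharge", starts with all subarrays gated, and saturates the counters at the threshold. Those choices and the names used are assumptions.

```python
class GatedPrecharge:
    """Toy per-subarray decay counters controlling bitline precharge."""

    def __init__(self, num_subarrays, threshold):
        self.threshold = threshold
        # Counters hold cycles since last access, saturated at threshold.
        self.count = [threshold] * num_subarrays

    def precharged(self, index):
        """A subarray stays precharged while its count is below threshold."""
        return self.count[index] < self.threshold

    def cycle(self, accessed=None):
        """Advance one cycle; return whether the accessed subarray was ready."""
        ready = None if accessed is None else self.precharged(accessed)
        self.count = [min(c + 1, self.threshold) for c in self.count]
        if accessed is not None:
            self.count[accessed] = 0       # recent access: keep precharging
        return ready

gp = GatedPrecharge(num_subarrays=4, threshold=3)
first = gp.cycle(accessed=0)    # subarray 0 was gated, so it is not ready
second = gp.cycle(accessed=0)   # now recently used, so it is ready
gp.cycle()
gp.cycle()
still_on = gp.precharged(0)     # count is 2, below the threshold
gp.cycle()
gated = not gp.precharged(0)    # count reached the threshold
print(first, second, still_on, gated)  # False True True True
```

An access to a gated subarray (the `False` case) would pay a precharge delay, so the threshold trades leakage savings against that penalty.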

Adaptive Caches (8)
 6-T SRAM cell and precharge logic

Adaptive Caches (9)
 Gated Precharging
 Can be combined with predecoding
[Figure: pipeline stages Fetch, Rename, Issue, Reg Read, Execution, DCache, Commit, with bitline predecoding and precharging]


Adaptive Instruction Queues (1)
 A 32-bit queue consisting of four equal size partitions that can be independently activated
 Each partition consists of wakeup logic that decides when instructions are ready to execute and readout logic that dispatches ready instructions into the pipeline
 At any time, only the partitions that contain the currently executing instructions are activated

Adaptive Instruction Queues (2)
 Heuristic algorithms
 Measuring IPC to determine wakeup
 Separating INT IPC from FP IPC
 The youngest part of the issue queue contributes very little to the overall IPC in that instructions in this part are often committed late

References
 Basic References
 V. Venkatachalam and M. Franz, “Power reduction techniques for microprocessor systems”, ACM Computing Surveys, 2005.
 Advanced References
 N. Jouppi, Cacti, available at www.hpl.hp.com/personal/Norman_Jouppi.cacti4.html
 C.-H. Kim, et al., “PP-Cache: A Partitioned Power-aware Instruction Cache Architecture”, Microprocessors and Microsystems, 2006.
 J.-H. Lee, et al., “A Selective Filter-Bank TLB System”, ISLPED 2003.
 ARM corp., TCM (Tightly Coupled Memory), available at www.arm.com
 Footnotes are omitted due to the complexity
