Design
Naehyuck Chang
Dept. of EECS/CSE
Seoul National University
naehyuck@snu.ac.kr
Precomputation-based optimization
Benefit
When either f1 = 1 or f2 = 1, the LE (Load Enable) of R2 will be low
No change at the output of R2
Internal switching activity of Comb will decrease
Trade-off
f1 and f2 will generate additional switching activity
f1 and f2 should be simple
Power reduction
Depending on the quality of f1 and f2
Integrated architecture
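As a software sketch, the classic n-bit comparator example illustrates the idea: the MSBs act as the precomputation functions f1 and f2, and whenever either fires, R2's load enable stays low and the main combinational block does not switch. The class and counter names below are illustrative, not the slide's circuit:

```python
# Hypothetical behavioral model of precomputation-based gating for an
# n-bit magnitude comparator (A > B). Names and counters are illustrative.

class PrecomputedComparator:
    """When the MSBs of A and B differ, f1/f2 resolve the result and the
    load enable (LE) of register R2 stays low, so the main combinational
    comparator (Comb) does not switch."""

    def __init__(self, width=8):
        self.width = width
        self.r2_loads = 0      # proxy for switching activity in Comb + R2
        self.precomputed = 0   # cycles resolved by f1/f2 alone

    def compare(self, a, b):
        msb = self.width - 1
        a_msb = (a >> msb) & 1
        b_msb = (b >> msb) & 1
        f1 = a_msb and not b_msb   # result is surely 1 (A > B)
        f2 = b_msb and not a_msb   # result is surely 0 (A <= B)
        if f1 or f2:
            self.precomputed += 1  # LE low: R2 holds, Comb inputs frozen
            return bool(f1)
        self.r2_loads += 1         # LE high: full comparison needed
        return a > b
```

On uniformly random inputs roughly half of the comparisons resolve from the MSBs alone, which is why simple f1/f2 functions can already pay off.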
Bus encoding
Presence of heavily loaded global communication paths
High capacitances
May be several orders of magnitude higher than an internal node
One transition can dissipate as much power as several hundred internal transitions
Miller multiplication makes this worse
Encoding schemes
Redundancy
Space
Time
Both
Level or transition signaling
DC power or switching power
Switching power reduction
Maximum
Average
Information Carried
Data or address
Random or sequential
Compressed
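One well-known space-redundant scheme in this family is bus-invert coding, which caps the worst-case number of toggles at the cost of one extra line. A behavioral sketch (the 8-bit width and class names are illustrative, not from the slides):

```python
# Illustrative sketch of bus-invert coding, a space-redundant encoding
# that bounds worst-case switching; an example, not the slides' scheme.

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

class BusInvertEncoder:
    """n data lines plus one redundant 'invert' line. If transmitting the
    raw value would toggle more than n/2 lines, send the complement and
    assert the invert line instead."""

    def __init__(self, n=8):
        self.n = n
        self.mask = (1 << n) - 1
        self.lines = 0        # current state of the data lines
        self.invert = 0       # current state of the invert line
        self.transitions = 0  # total line toggles, a proxy for energy

    def send(self, value):
        value &= self.mask
        if hamming(value, self.lines) > self.n // 2:
            value ^= self.mask          # drive the complement instead
            new_invert = 1
        else:
            new_invert = 0
        self.transitions += hamming(value, self.lines)
        self.transitions += self.invert ^ new_invert
        self.lines, self.invert = value, new_invert
        return value, new_invert

def decode(lines, invert, n=8):
    """Receiver side: undo the inversion using the redundant line."""
    return lines ^ ((1 << n) - 1) if invert else lines
```

For example, sending 0xFF onto an all-zero 8-bit bus costs 8 toggles raw, but only 1 toggle (the invert line) when encoded.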
Embedded Low-Power Laboratory
Low-Power Interconnect
[Figure: shield wire insertion — a grounded shield wire (held at 0->0) placed between signal wires switching in opposite directions (0->1 and 1->0) suppresses crosstalk]
Bus Segmentation
Splitting a bus into multiple segments connected by links that regulate the traffic between adjacent segments
Disadvantage
Transfers to other segments take multiple clock cycles
Activated Paths
Adiabatic Buses
Reducing the total capacitance
These circuits reuse existing electrical charge to avoid creating new charge
Recycling the charge for wires about to be asserted
In a traditional bus, when a wire becomes deasserted, its previous charge is wasted
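A toy energy model makes the saving concrete. Assuming ideal switches and equal wire capacitances C, briefly shorting a falling wire to a rising one lets both settle at V/2, so the supply only has to top the rising wire up from V/2 to V:

```python
# Toy supply-energy model for charge recycling on a bus (illustrative,
# ideal switches, equal wire capacitances). Energy drawn from a supply
# at V to charge a capacitance C from v0 up to V is C * V * (V - v0).

def conventional_energy(C, V, n_pairs):
    """Each rising wire is charged from 0 to V; the falling wire's
    previous charge is dumped to ground and wasted."""
    return n_pairs * C * V * (V - 0.0)

def recycled_energy(C, V, n_pairs):
    """Before switching, each falling wire is shorted to a rising wire:
    both settle at V/2, so the supply only tops the riser up from V/2."""
    return n_pairs * C * V * (V - V / 2)
```

Under these idealized assumptions the supply energy per opposite-transition pair is halved; real adiabatic drivers recover less because of switch resistance and timing overhead.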
Filter TLB
Filters main TLB accesses: low power consumption
Main TLB (fully associative type)
High performance, but high power consumption
Missed PTEs from the main TLB are selected through a MUX and fill the filter TLB
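The two-level organization can be modeled behaviorally as below; the filter size and the FIFO replacement policy are assumptions for illustration, not the paper's exact design:

```python
# Hypothetical behavioral model of a filter TLB in front of a fully
# associative main TLB; entry count and FIFO policy are assumptions.
from collections import OrderedDict

class FilterTLB:
    """A tiny, low-power filter TLB services most lookups; the large,
    power-hungry main TLB is probed only on a filter miss, and the
    missed PTE is copied back into the filter."""

    def __init__(self, filter_entries=4):
        self.capacity = filter_entries
        self.filter = OrderedDict()   # vpn -> pte, small and cheap
        self.main = {}                # vpn -> pte, fully associative
        self.filter_hits = 0
        self.main_probes = 0          # proxy for high-power accesses

    def install(self, vpn, pte):
        self.main[vpn] = pte

    def lookup(self, vpn):
        if vpn in self.filter:
            self.filter_hits += 1     # main TLB stays idle
            return self.filter[vpn]
        self.main_probes += 1         # costly main TLB access
        pte = self.main.get(vpn)      # None here would model a page walk
        if pte is not None:
            if len(self.filter) >= self.capacity:
                self.filter.popitem(last=False)  # FIFO eviction
            self.filter[vpn] = pte    # fill filter with the missed PTE
        return pte
```

With good locality, repeated lookups to the same pages hit in the filter and the main TLB is rarely activated.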
Low-Power Memories (9)
Basic Structure
Cache Decay
Activates/deactivates individual cache lines
If it has not been accessed for a pre-determined amount of time
A cache line is placed in the sleep mode
If it is re-accessed
A cache line is re-activated
Please note that the data is lost
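A minimal behavioral sketch of cache decay, with an assumed line count and decay interval, shows the activate/deactivate mechanism and the data loss on decay:

```python
# Minimal sketch of cache decay: per-line idle counters put lines into
# sleep mode (losing the data) after a fixed interval. Parameters are
# illustrative, not from the original proposal.

class DecayCache:
    def __init__(self, n_lines=4, decay_interval=100):
        self.decay_interval = decay_interval
        self.data = [None] * n_lines
        self.awake = [False] * n_lines
        self.idle = [0] * n_lines     # cycles since last access

    def tick(self, cycles=1):
        """Advance time; lines idle past the interval are gated off."""
        for i in range(len(self.idle)):
            if self.awake[i]:
                self.idle[i] += cycles
                if self.idle[i] >= self.decay_interval:
                    self.awake[i] = False  # sleep mode: leakage saved,
                    self.data[i] = None    # but the data is lost

    def access(self, i, value=None):
        """Re-accessing a sleeping line re-activates it; if the line
        decayed, the lost data must be refetched (a miss)."""
        self.idle[i] = 0
        self.awake[i] = True
        if value is not None:
            self.data[i] = value
        return self.data[i]
```

Contrast with the drowsy cache below: there the line drops to a low-retention voltage instead of being gated off, so the data survives at the cost of a one-cycle wakeup.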
Drowsy Cache
Data is retained
Wakeup penalty is 1 cycle in the 70 nm technology
Details are explained in Module 6: Low-Power CMOS RAM Circuits
Gated Precharging
In 100-cycle window,
>95% of cache accesses reuse <30% of subarrays
Most accesses temporally localized in small # of subarrays
Decay counter per subarray
Threshold value to decide “when” to precharge
Algorithm
If count < threshold
Precharge
If count >= threshold
No precharge
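The per-subarray decay counter and threshold test above can be sketched as follows; the counter reset on access and the cycle-based aging are assumed bookkeeping, not the exact hardware:

```python
# Sketch of gated precharging with one decay counter per subarray;
# the subarray count and threshold value are illustrative.

class GatedPrecharge:
    def __init__(self, n_subarrays=4, threshold=100):
        self.threshold = threshold
        self.count = [0] * n_subarrays  # cycles since last access
        self.precharges = 0             # proxy for precharge energy

    def cycle(self, accessed=None):
        """One clock: reset the accessed subarray's counter, age the
        rest, and precharge only subarrays still below the threshold."""
        for i in range(len(self.count)):
            if i == accessed:
                self.count[i] = 0
            else:
                self.count[i] += 1
            if self.count[i] < self.threshold:
                self.precharges += 1    # bitlines precharged as usual
            # else: precharging gated off for this cold subarray
```

Since >95% of accesses fall in a small set of hot subarrays, the cold subarrays quickly cross the threshold and stop burning precharge energy.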
Gated Precharging
Can be combined with predecoding
[Figure: pipeline stages (Fetch, Rename, Issue, Execution, DCache, Commit) with register read, bitline precharging, and predecoding shown as the gated activities]
Heuristic algorithms
Measuring IPC to determine wakeup
Separating INT IPC from FP IPC
The youngest part of the issue queue contributes very little to the overall IPC, because instructions in this part are often committed late
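A hedged sketch of such an IPC-guided heuristic: the youngest partition of the issue queue is gated off when its measured IPC contribution falls below a threshold, and partitions are re-enabled otherwise. The threshold and the grow-back policy are assumptions for illustration, not the exact published algorithm:

```python
# Illustrative IPC-guided issue-queue resizing; threshold, window, and
# partition count are assumptions, not a published heuristic's values.

class AdaptiveIssueQueue:
    def __init__(self, n_parts=4, shrink_thresh=0.02):
        self.n_parts = n_parts
        self.active = n_parts        # currently powered partitions
        self.shrink_thresh = shrink_thresh

    def adjust(self, ipc_per_part):
        """ipc_per_part[i] = IPC contributed by partition i over the
        last window, oldest partition first. The caller would measure
        this separately for the INT and FP streams."""
        youngest = self.active - 1
        total = sum(ipc_per_part[:self.active]) or 1e-9
        if (ipc_per_part[youngest] / total < self.shrink_thresh
                and self.active > 1):
            self.active -= 1         # gate off the youngest partition
        elif self.active < self.n_parts:
            self.active += 1         # wake a partition back up
        return self.active
```

When the youngest partition's share of IPC is negligible for several windows in a row, the queue shrinks step by step; a rise in its contribution grows it back.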
Basic References
V. Venkatachalam and M. Franz, “Power Reduction Techniques for Microprocessor Systems”, ACM Computing Surveys, 2005.
Advanced References
N. Jouppi, CACTI, available at www.hpl.hp.com/personal/Norman_Jouppi.cacti4.html
C.-H. Kim, et al., “PP-Cache: A Partitioned Power-aware Instruction Cache Architecture”, Microprocessors and Microsystems, 2006.
J.-H. Lee, et al., “A Selective Filter-Bank TLB System”, ISLPED 2003.
ARM Corp., TCM (Tightly Coupled Memory), available at www.arm.com
Footnotes are omitted due to the complexity