5th Edition
Chapter 5
Large and Fast: Exploiting Memory Hierarchy
5.1 Introduction
Principle of Locality
Spatial locality
Memory hierarchy
Store everything on disk
Copy recently accessed (and nearby) items from disk to smaller DRAM memory
Main memory
Fast
Slow
Magnetic disk
Memory Technology
Ideal memory
DRAM Technology
DRAM Generations
Row buffer
Synchronous DRAM
DRAM banking
Flash Storage
Flash Types
Disk Storage
Sector ID
Data (512 bytes, 4096 bytes proposed)
Error correcting code (ECC)
Given
Cache Memory
How do we know if the data is present?
Where do we look?
#Blocks is a power of 2
Use low-order block address bits
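As a minimal sketch of the rule above, assuming a direct-mapped cache whose block count is a power of 2: the index is the block address modulo the number of blocks, which is the same as keeping its low-order bits, and the remaining high-order bits form the tag. The function names `cache_index` and `cache_tag` are illustrative and do not come from the slides.

```python
def cache_index(block_addr: int, num_blocks: int) -> int:
    """Return the direct-mapped cache index for a block address."""
    if num_blocks <= 0 or num_blocks & (num_blocks - 1):
        raise ValueError("num_blocks must be a power of 2")
    # For a power-of-2 block count, modulo equals masking the low-order bits.
    return block_addr & (num_blocks - 1)


def cache_tag(block_addr: int, num_blocks: int) -> int:
    """Return the high-order bits left over after removing the index bits."""
    index_bits = num_blocks.bit_length() - 1
    return block_addr >> index_bits


# Word address 22 is 10110 in binary: index 110, tag 10 in an 8-block cache.
print(format(cache_index(22, 8), "03b"), format(cache_tag(22, 8), "02b"))
```

Masking instead of dividing is what makes the power-of-2 restriction attractive in hardware: the index is simply a slice of the address wires.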
Cache Example
Index   Tag   Data
000
001
010
011
100
101
110
111
Cache Example
Word addr   Binary addr   Hit/miss   Cache block
22          10 110        Miss       110
Index   Tag   Data
000
001
010
011
100
101
110     10    Mem[10110]
111
Cache Example
Word addr   Binary addr   Hit/miss   Cache block
26          11 010        Miss       010
Index   Tag   Data
000
001
010     11    Mem[11010]
011
100
101
110     10    Mem[10110]
111
Cache Example
Word addr   Binary addr   Hit/miss   Cache block
22          10 110        Hit        110
26          11 010        Hit        010
Index   Tag   Data
000
001
010     11    Mem[11010]
011
100
101
110     10    Mem[10110]
111
Cache Example
Word addr   Binary addr   Hit/miss   Cache block
16          10 000        Miss       000
3           00 011        Miss       011
16          10 000        Hit        000
Index   Tag   Data
000     10    Mem[10000]
001
010     11    Mem[11010]
011     00    Mem[00011]
100
101
110     10    Mem[10110]
111
Cache Example
Word addr   Binary addr   Hit/miss   Cache block
18          10 010        Miss       010
Index   Tag   Data
000     10    Mem[10000]
001
010     10    Mem[10010]
011     00    Mem[00011]
100
101
110     10    Mem[10110]
111
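The sequence of accesses in the cache example slides (word addresses 22, 26, 22, 26, 16, 3, 16, 18) can be replayed with a short simulation. This is a sketch that assumes an 8-block direct-mapped cache with one word per block and an initially empty cache; the function name `simulate_direct_mapped` is illustrative.

```python
def simulate_direct_mapped(word_addrs, num_blocks=8):
    """Replay word addresses through a direct-mapped cache, 1 word per block."""
    tags = [None] * num_blocks            # None marks an invalid (empty) entry
    index_bits = num_blocks.bit_length() - 1
    results = []
    for addr in word_addrs:
        index = addr % num_blocks         # low-order bits select the block
        tag = addr >> index_bits          # high-order bits identify the word
        if tags[index] == tag:
            results.append((addr, index, "hit"))
        else:
            tags[index] = tag             # fill or replace the entry on a miss
            results.append((addr, index, "miss"))
    return results


trace = [22, 26, 22, 26, 16, 3, 16, 18]
for addr, index, outcome in simulate_direct_mapped(trace):
    print(f"{addr:2d} -> block {index:03b}: {outcome}")
```

The final access to 18 misses because it maps to block 010, which still holds word 26 under a different tag, so the entry is replaced.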
Address Subdivision
Depending on block size
for offset; 4-byte here
64 blocks, 16 bytes/block
Field    Bits     Width
Tag      31-10    22 bits
Index    9-4      6 bits
Offset   3-0      4 bits
Block address = Tag and Index fields (bits 31-4)
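The field widths above (22-bit tag, 6-bit index, 4-bit offset) translate directly into shifts and masks. The sketch below assumes 32-bit byte addresses; the address 1200 is an arbitrary value chosen for illustration, and `split_address` is a hypothetical helper name.

```python
OFFSET_BITS = 4   # 16 bytes/block
INDEX_BITS = 6    # 64 blocks


def split_address(addr: int):
    """Split a byte address into (tag, index, offset) fields."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


# 1200 // 16 = 75 is the block address; 75 mod 64 = 11 is the cache index.
print(split_address(1200))  # (1, 11, 0)
```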
Cache Misses
FIGURE 5.11 Miss rate versus block size. Note that the miss rate actually goes up
if the block size is too large relative to the cache size. Each line represents a cache
of different size. (This figure is independent of associativity, discussed soon.)
Unfortunately, SPEC CPU2000 traces would take too long if block size were
included, so this data is based on SPEC92.
Copyright 2014 Elsevier Inc. All rights reserved.
For write-back
12-stage pipeline
Instruction and data access on each cycle
Each 16KB: 256 blocks × 16 words/block
D-cache: write-through or write-back
I-cache: 0.4%
D-cache: 11.4%
Weighted average: 3.2%
(64 bytes)
(4 bytes)
$$\text{Memory stall cycles} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty}$$
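The three factors above (instructions per program, misses per instruction, miss penalty) multiply to give memory stall cycles; dividing by the instruction count gives the stall cycles added to CPI. The sketch below uses assumed, purely illustrative numbers (base CPI of 2, 100-cycle miss penalty, 2% I-cache and 4% D-cache miss rates, loads and stores making up 36% of instructions); none of these values survive in the slide text.

```python
def stall_cycles_per_instruction(misses_per_instruction: float,
                                 miss_penalty: float) -> float:
    """Memory stall cycles contributed to CPI by one source of misses."""
    return misses_per_instruction * miss_penalty


# All values below are assumptions chosen for illustration.
base_cpi = 2.0
miss_penalty = 100
icache_miss_rate = 0.02      # every instruction is fetched once
dcache_miss_rate = 0.04
loads_stores_fraction = 0.36

icache_stalls = stall_cycles_per_instruction(icache_miss_rate, miss_penalty)
dcache_stalls = stall_cycles_per_instruction(
    loads_stores_fraction * dcache_miss_rate, miss_penalty)
effective_cpi = base_cpi + icache_stalls + dcache_stalls
print(round(effective_cpi, 2))
```

With these assumed inputs, memory stalls contribute more cycles per instruction than the base CPI itself, which is why miss rate and miss penalty dominate performance once the core gets faster.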
Given
Example
Performance Summary
Fully associative
Fixed location
Spectrum of Associativity
Associativity Example
Direct mapped (cache index = last 2 bits of block address)
Block address   Cache index   Hit/miss   Cache content after access
                                         0        1        2        3
0 (00000)       0             miss       Mem[0]
8 (01000)       0             miss       Mem[8]
0 (00000)       0             miss       Mem[0]
6 (00110)       2             miss       Mem[0]            Mem[6]
8 (01000)       0             miss       Mem[8]            Mem[6]
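Associativity Example
2-way set associative (cache index = last bit of block address)
Block address   Cache index   Hit/miss   Set 0 content after access
0 (00000)       0             miss       Mem[0]
8 (01000)       0             miss       Mem[0]   Mem[8]
0 (00000)       0             hit        Mem[0]   Mem[8]
6 (00110)       0             miss       Mem[0]   Mem[6]
8 (01000)       0             miss       Mem[8]   Mem[6]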
Fully associative
Block address   Hit/miss   Cache content after access
0 (00000)       miss       Mem[0]
8 (01000)       miss       Mem[0]   Mem[8]
0 (00000)       hit        Mem[0]   Mem[8]
6 (00110)       miss       Mem[0]   Mem[8]   Mem[6]
8 (01000)       hit        Mem[0]   Mem[8]   Mem[6]
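The three associativity examples use the same block address sequence (0, 8, 0, 6, 8) on a 4-block cache. The sketch below replays that sequence for 1, 2, and 4 ways, assuming least-recently-used replacement within each set; the function name `simulate` is illustrative.

```python
from collections import OrderedDict


def simulate(block_addrs, num_blocks, ways):
    """Return hit/miss outcomes for a set-associative cache with LRU replacement."""
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    outcomes = []
    for block in block_addrs:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)        # mark as most recently used
            outcomes.append("hit")
        else:
            if len(lines) == ways:
                lines.popitem(last=False)   # evict the least recently used block
            lines[block] = True
            outcomes.append("miss")
    return outcomes


trace = [0, 8, 0, 6, 8]
for ways in (1, 2, 4):
    print(f"{ways}-way:", simulate(trace, num_blocks=4, ways=ways))
```

Direct mapped (1 way) misses on every access because blocks 0 and 8 keep evicting each other; with 4 ways (fully associative for this size) the repeated accesses to 0 and 8 both hit.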
1-way: 10.3%
2-way: 8.6%
4-way: 8.3%
8-way: 8.1%
4-way set-associative
Cache size = 4 × 256 × 4 = 4 KB
Higher associativity
Reduces conflict misses
Increases hit time, increases power consumption
Introduction
Miss rate
Compulsory
First reference to a block
Capacity
Introduction
Misses by Type
Conflict
General locality
behavior
Fully-asso.
Capacity
Replacement Policy
Random
LRU
Replacement
Given
Example (cont.)
Misses depend on memory access patterns
Algorithm behavior
Compiler optimization for memory access
C, A, and B arrays; C = A B
older accesses
new accesses
Unoptimized
Blocked
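The difference between the unoptimized and blocked access patterns can be seen in the loop structure. The sketch below is illustrative only: it uses plain Python lists, a hypothetical block size of 2, and small integer matrices, so it shows the reordering of accesses rather than any real cache speedup.

```python
def matmul_unoptimized(a, b, n):
    """Classic triple loop: each C element sweeps a full row of A and column of B."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, n, block=2):
    """Work on block x block submatrices so recently used data is reused sooner."""
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            for kk in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, n)):
                        acc = c[i][j]
                        for k in range(kk, min(kk + block, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


n = 4
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i * j + 1 for j in range(n)] for i in range(n)]
print(matmul_blocked(a, b, n) == matmul_unoptimized(a, b, n))
```

Both functions compute the same product; the blocked version only changes the order in which elements are touched, which is what lets a real cache keep the working set resident.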
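Dependability
Service accomplishment: service delivered as specified
Service interruption: deviation from specified service
Failure: transition from service accomplishment to service interruption
Restoration: transition from service interruption back to service accomplishment
Fault: failure of a component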
Dependability Measures
Decoding SEC
SEC/DED Code
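A single-error-correcting (SEC) Hamming code can be sketched as follows. The sketch assumes even parity, 1-based bit positions with parity bits at the power-of-2 positions, and an 8-bit data word chosen for illustration; the function names are hypothetical. Decoding XORs together the positions of all 1 bits: a zero result means no single-bit error was detected, and a nonzero result is the position of the flipped bit.

```python
def hamming_encode(data_bits):
    """Return the SEC code word as a list of bits (list index 0 is position 1)."""
    r = 0
    while (1 << r) < len(data_bits) + r + 1:
        r += 1                              # number of parity bits needed
    n = len(data_bits) + r
    code = [0] * (n + 1)                    # code[0] unused; positions are 1-based
    data = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):                 # not a power of 2: a data position
            code[pos] = next(data)
    for i in range(r):
        p = 1 << i
        # Parity bit p gives even parity over every position with bit p set.
        code[p] = sum(code[pos] for pos in range(1, n + 1) if pos & p) % 2
    return code[1:]


def hamming_correct(code_bits):
    """Return (corrected code word, error position or 0 if none detected)."""
    syndrome = 0
    for pos, bit in enumerate(code_bits, start=1):
        if bit:
            syndrome ^= pos
    fixed = list(code_bits)
    if 0 < syndrome <= len(fixed):
        fixed[syndrome - 1] ^= 1            # flip the bit the syndrome points at
    return fixed, syndrome


word = hamming_encode([1, 0, 0, 1, 1, 0, 1, 0])
damaged = list(word)
damaged[9] ^= 1                             # corrupt position 10
print(hamming_correct(damaged) == (word, 10))
```

This covers correction of one flipped bit only; detecting double errors needs one additional overall parity bit, which the sketch does not implement.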
Virtual Machines
Examples
Address Translation
Page table
(30 bits)
Page Tables
TLB Misses
If page is in memory
Or in software
Raise exception
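The TLB miss cases above can be sketched as a lookup with two fallbacks: use the TLB entry if present, refill it from the page table if the page is in memory, and otherwise raise an exception for the operating system to handle. The sketch assumes 4 KiB pages (12 offset bits) and models the TLB and page table as plain dictionaries from virtual to physical page numbers; all names are illustrative.

```python
PAGE_OFFSET_BITS = 12   # assumption: 4 KiB pages


class PageFault(Exception):
    """Raised when the page is not present in memory."""


def translate(vaddr, tlb, page_table):
    """Translate a virtual address, refilling the TLB on a miss."""
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    if vpn in tlb:                          # TLB hit
        ppn = tlb[vpn]
    elif vpn in page_table:                 # TLB miss, page is in memory
        ppn = page_table[vpn]
        tlb[vpn] = ppn                      # load the entry and retry
    else:                                   # page not in memory
        raise PageFault(f"virtual page {vpn:#x} is not resident")
    return (ppn << PAGE_OFFSET_BITS) | offset


tlb = {}
page_table = {0x1: 0x5}
print(hex(translate(0x1234, tlb, page_table)))
```

A real TLB has a fixed number of entries and a replacement policy; the unbounded dictionary here keeps the sketch focused on the hit, refill, and fault paths.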
Complications due to aliasing
Different virtual addresses for shared physical address
Memory Protection
Block placement
Finding a block
Replacement on a miss
Write policy
Block Placement
Determined by associativity
Fully associative
Any location
Finding a Block
Associativity           Location method                                 Tag comparisons
Direct mapped           Index                                           1
n-way set associative   Set index, then search entries within the set   n
Fully associative       Search all entries                              #entries
Hardware caches
Replacement
Random
Pseudo LRU
Virtual memory
Write Policy
Write-through
Write-back
Virtual memory
Sources of Misses
Capacity misses
Negative performance effect
Decrease capacity misses
Increase associativity
Decrease conflict misses
Field    Bits     Width
Tag      31-14    18 bits
Index    13-4     10 bits
Offset   3-0      4 bits
Cache Control
Interface Signals
CPU to Cache: Read/Write, Valid, Address (32 bits), Write Data (32 bits), Read Data (32 bits), Ready
Cache to Memory: Read/Write, Valid, Address (32 bits), Write Data (128 bits), Read Data (128 bits), Ready
Multiple cycles per access
Use an FSM to sequence control steps
Set of states, transition on each clock edge
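A finite-state machine for cache control can be sketched as a next-state function evaluated once per clock edge. The four states and the input names below are assumptions made for illustration (a common arrangement for a write-back cache: idle, compare tag, write back a dirty block, allocate the new block); the slide text itself does not list them.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    COMPARE_TAG = auto()
    WRITE_BACK = auto()
    ALLOCATE = auto()


def next_state(state, cpu_valid=False, hit=False, dirty=False, mem_ready=False):
    """Return the state entered on the next clock edge."""
    if state is State.IDLE:
        return State.COMPARE_TAG if cpu_valid else State.IDLE
    if state is State.COMPARE_TAG:
        if hit:
            return State.IDLE               # request satisfied from the cache
        return State.WRITE_BACK if dirty else State.ALLOCATE
    if state is State.WRITE_BACK:
        return State.ALLOCATE if mem_ready else State.WRITE_BACK
    if state is State.ALLOCATE:
        return State.COMPARE_TAG if mem_ready else State.ALLOCATE
    raise ValueError(f"unknown state: {state}")


# A miss on a clean block: idle -> compare tag -> allocate -> compare tag.
s = next_state(State.IDLE, cpu_valid=True)
s = next_state(s, hit=False, dirty=False)
s = next_state(s, mem_ready=True)
print(s.name)
```

Waiting in WRITE_BACK or ALLOCATE until memory signals ready is how the controller absorbs the multiple cycles a memory access takes.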
Write-through caches
Time step   Event                 CPU A's cache   CPU B's cache   Memory
0                                                                 0
1           CPU A reads X         0                               0
2           CPU B reads X         0               0               0
3           CPU A writes 1 to X   1               0               1
Coherence Defined
P1 writes X, P2 writes X: all processors see the writes in the same order
Snooping protocols
Directory-based protocols
[Figure: coherence example. Labels: P1, P3, cache ($), Memory, I/O devices; copies of u marked u:5, a write u = 7, and later reads u = ?]
Coherence
Consistency
Directory-Based Schemes
Read Miss:
[Figure: snooping bus. Labels: P1 ... Pn, bus snoop, cache directory, Mem, I/O devices, cache-memory transaction]
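[Figure: coherence example, continued. Labels: P1, P3, cache ($), Memory, I/O devices; copies of u marked u:5, a write u = 7, reads u = ?, and a resulting value u = 7]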
CPU activity          Bus activity       CPU A's cache   CPU B's cache   Memory
                                                                         0
CPU A reads X         Cache miss for X   0                               0
CPU B reads X         Cache miss for X   0               0               0
CPU A writes 1 to X   Invalidate for X   1                               0
CPU B read X          Cache miss for X   1               1               1
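The invalidation sequence above can be replayed with a small model. This is a sketch under stated assumptions: write-back caches, a single shared bus, one value per address, and a class name (`SnoopingBus`) invented for illustration. It is not a full coherence protocol with per-block states.

```python
class SnoopingBus:
    """Write-invalidate snooping over write-back caches (illustrative only)."""

    def __init__(self, cpus, memory):
        self.memory = dict(memory)
        self.caches = {cpu: {} for cpu in cpus}
        self.dirty = set()          # (cpu, addr) pairs holding modified data
        self.log = []               # bus activity, in order

    def read(self, cpu, addr):
        if addr not in self.caches[cpu]:
            self.log.append(f"Cache miss for {addr}")
            for other, cache in self.caches.items():
                if (other, addr) in self.dirty:
                    # The owner snoops the miss and supplies the new value.
                    self.memory[addr] = cache[addr]
                    self.dirty.discard((other, addr))
            self.caches[cpu][addr] = self.memory[addr]
        return self.caches[cpu][addr]

    def write(self, cpu, addr, value):
        self.log.append(f"Invalidate for {addr}")
        for other, cache in self.caches.items():
            if other != cpu:
                cache.pop(addr, None)           # other copies become invalid
                self.dirty.discard((other, addr))
        self.caches[cpu][addr] = value
        self.dirty.add((cpu, addr))


bus = SnoopingBus(["A", "B"], {"X": 0})
bus.read("A", "X")
bus.read("B", "X")
bus.write("A", "X", 1)
print(bus.read("B", "X"), bus.log)
```

Because CPU B's copy is invalidated by the write, its next read misses and picks up the new value instead of the stale 0.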
Data prefetching
DGEMM
Pitfalls
Principle of locality
Memory hierarchy
Concluding Remarks