Memory Hierarchy, Part One
Computer Organization II, Spring 2017
Gedare Bloom
Memory throttles computation
[Figure: processor vs. memory performance, normalized to 1980, on a log scale from 1 to 100,000. The widening processor-memory gap is the "Memory Wall"; the flattening of the processor curve reflects the "Power Wall".]
The Memory Wall Matters
[Figure: Processor-DRAM speed disparity on a log scale (0.01 to 1000), from VAX/1980 through PPro/1996 to 2010+, with the gap annotated in human terms (minutes vs. hours).]
Hierarchy Works Due to Locality
Temporal locality: recently accessed items are likely to be accessed again soon
Spatial locality: items near recently accessed items are likely to be accessed soon
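As an illustrative sketch (the loop and variable names are my own, not from the slides), a simple summation loop exhibits both kinds of locality:

```python
# Hypothetical sketch: a summation loop exhibits both kinds of locality.
data = list(range(1024))

total = 0
for x in data:      # sequential traversal: spatial locality
    total += x      # 'total' is reused every iteration: temporal locality

# Sequential accesses let the cache fetch one block and then serve
# several subsequent accesses from it before missing again.
```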
How Hierarchy Works: An Analogy
Block (or line): the minimum unit of information that is present (or not) in a cache
How Hierarchy Works: An Analogy
[Figure: analogy with Main Memory in which a hit takes about 1 second while a miss incurs a roughly 30-second miss penalty; intermediate levels are annotated at 4 and 8 seconds.]
Apply hierarchy to memory
[Figure: four cores, each with a private L1 cache, connected by an on-chip interconnect to a last-level cache and an on-chip memory controller, which connects to off-chip Memory.]
Intel Haswell (2013)
[Figure: die layout with four cores and their L2 caches, a shared L3 cache, and the I/O and memory controller.]
Memory Hierarchy Technologies
Caches use SRAM for speed:
- Fast (typical access times of 0.5 to 2.5 ns)
- Low density (6-transistor cells), higher power, expensive ($2000 to $5000 per GB in 2008)
- Static: content lasts as long as power is left on
Main memory uses DRAM for size (density):
- Slower (typical access times of 50 to 70 ns)
- High density (1-transistor cells), lower power, cheaper ($20 to $75 per GB in 2008)
- Dynamic: needs to be refreshed regularly (~every 8 ms), which consumes 1% to 2% of the active cycles of the DRAM
Apply hierarchy to memory
[Figure: the same four-core hierarchy annotated with approximate access latencies: L1 cache 4 cycles, on-chip interconnect 10 cycles, last-level cache 40 cycles, Memory 300 cycles.]
Why not just a really big
L1?
Smaller, core core core core
faster, L1 cache L1 cache L1 cache L1 cache
and
costlier on-chip interconnect
(per byte) Memory
on-chip cache
controller
Larger,
slower, last-level cache
and
cheaper
(per byte) Memory
12
Cache Basics
Two questions to answer (in hardware):
Q1: How do we know if a data item is in the cache?
Q2: If it is, how do we find it?
Direct mapped:
- Each memory block is mapped to exactly one block in the cache, so many memory blocks must share each cache block
- Address mapping (answers Q2): (block address) modulo (# of blocks in the cache)
- A tag associated with each cache block holds the address information (the upper portion of the address) required to identify the block (answers Q1)
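The two answers above can be sketched as address arithmetic (function and parameter names are illustrative, not from the slides):

```python
def cache_map(block_address, num_blocks):
    """Direct-mapped placement for a cache with num_blocks blocks."""
    index = block_address % num_blocks   # Q2: which cache block to look in
    tag = block_address // num_blocks    # Q1: upper bits stored and compared as the tag
    return index, tag

# With a 4-block cache, memory blocks 1, 5, 9, and 13 all share index 1
# and are distinguished only by their tags (0, 1, 2, 3).
```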
A Simple First Example
[Figure: a cache of four one-word blocks (Index, Valid, Tag, Data; indexes 00-11) alongside a main memory of 16 words with 4-bit block addresses 0000-1111.]
One-word blocks; the two low-order bits of the address select the byte in the word (32-bit words).
Q2: How do we find it? Use the next two low-order memory address bits (the index) to determine which cache block: (block address) modulo (# of blocks in the cache).
Q1: Is it there? Compare the cache tag to the high-order two memory address bits to tell if the memory block is in the cache.
Direct Mapped Cache
Consider the main memory word reference string 0 1 2 3 4 3 4 15, starting with an empty cache (all blocks initially marked as not valid):
- 0 miss, 1 miss, 2 miss, 3 miss: blocks 0-3 fill indexes 00-11 (tag 00)
- 4 miss: maps to index 00 with tag 01, replacing Mem(0) with Mem(4)
- 3 hit, 4 hit
- 15 miss: maps to index 11 with tag 11, replacing Mem(3) with Mem(15)
8 requests, 6 misses
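The trace above can be checked with a small simulator (a sketch, assuming one-word blocks and the 4-block cache of the example):

```python
def simulate_direct_mapped(refs, num_blocks):
    """Count misses for a direct-mapped cache of one-word blocks."""
    cache = {}          # index -> tag of the block currently held there
    misses = 0
    for addr in refs:
        index, tag = addr % num_blocks, addr // num_blocks
        if cache.get(index) != tag:   # not valid, or tag mismatch: a miss
            misses += 1
            cache[index] = tag        # fetch the block, replacing any occupant
    return misses

# Reference string from the example: 8 requests, 6 misses
print(simulate_direct_mapped([0, 1, 2, 3, 4, 3, 4, 15], 4))   # -> 6
```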
Cache effect on memory performance
Average Memory Access Time (AMAT):
AMAT = Hit Time + Miss Rate * Miss Penalty
- Hit Time: time to look in the cache
- Miss Rate: fraction of accesses that miss
- Miss Penalty: time to fetch from the next level (recursive: it is the AMAT of the next level)
Note: AMAT is NOT a measure of CPU performance.
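The recursion in the miss penalty can be sketched directly. The cycle counts below reuse the earlier hierarchy figures (4-cycle L1, 40-cycle last-level cache, 300-cycle memory); the miss rates are made-up numbers for illustration:

```python
def amat(hit_times, miss_rates, memory_time):
    """AMAT where each level's miss penalty is the AMAT of the next level."""
    if not hit_times:
        return memory_time
    return hit_times[0] + miss_rates[0] * amat(hit_times[1:], miss_rates[1:], memory_time)

# Assumed miss rates of 10% (L1) and 20% (LLC):
# 4 + 0.1 * (40 + 0.2 * 300) = 14 cycles
print(amat([4, 40], [0.1, 0.2], 300))
```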
Example: One-level cache (Pentium)
AMAT = HitTimeL1 + MissRateL1 * MissPenaltyL1
- L1 access time: 1 cycle
- Memory access time: 8 cycles
- Program behavior: 2% miss rate
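Plugging the slide's numbers into the formula:

```python
hit_time = 1        # L1 access time, in cycles
miss_rate = 0.02    # 2% of accesses miss in L1
miss_penalty = 8    # memory access time, in cycles

amat = hit_time + miss_rate * miss_penalty
print(amat)   # 1 + 0.02 * 8 = 1.16 cycles per access
```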
Reducing Compulsory Misses
1. Increase block size
2. Prefetching
Reducing Capacity Misses
1. Increase cache size
Reducing Conflict Misses
1. Increase cache size
2. Increase associativity
3. Improve replacement policy
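A small sketch (illustrative code, not from the slides) shows why associativity helps: with four one-word blocks, the ping-pong trace 0, 4, 0, 4 misses every time in a direct-mapped cache, but fits comfortably in a 2-way set-associative cache with LRU replacement:

```python
def direct_mapped_misses(refs, num_blocks):
    cache = {}  # index -> tag
    misses = 0
    for addr in refs:
        index, tag = addr % num_blocks, addr // num_blocks
        if cache.get(index) != tag:
            misses += 1
            cache[index] = tag
    return misses

def two_way_lru_misses(refs, num_blocks):
    num_sets = num_blocks // 2
    sets = {s: [] for s in range(num_sets)}  # each set holds up to 2 tags, LRU first
    misses = 0
    for addr in refs:
        s, tag = addr % num_sets, addr // num_sets
        ways = sets[s]
        if tag in ways:
            ways.remove(tag)      # hit: move to the MRU position
        else:
            misses += 1
            if len(ways) == 2:
                ways.pop(0)       # evict the least recently used tag
        ways.append(tag)
    return misses

# Blocks 0 and 4 conflict on the same direct-mapped index, but share a 2-way set:
print(direct_mapped_misses([0, 4, 0, 4], 4))  # -> 4
print(two_way_lru_misses([0, 4, 0, 4], 4))    # -> 2
```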
Tradeoffs in Cache Performance
Reducing Miss Rate is just one part of AMAT:
- Increasing block size increases Miss Penalty
- Increasing associativity increases Hit Time
[Figure: miss rate (%) vs. block size (16 to 256 bytes) for cache sizes of 8 KB, 16 KB, 64 KB, and 256 KB.]
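A quick back-of-the-envelope check (the miss rates and penalties are invented for illustration) shows how a lower miss rate can still lose on AMAT once larger blocks raise the miss penalty:

```python
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Small blocks: higher miss rate but cheap refills (assumed numbers)
small_blocks = amat(hit_time=1, miss_rate=0.04, miss_penalty=50)   # 3.0 cycles
# 4x larger blocks: miss rate drops, but each refill moves more data
large_blocks = amat(hit_time=1, miss_rate=0.03, miss_penalty=100)  # 4.0 cycles

print(small_blocks, large_blocks)
```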
3 Cs and block placement policies
Cache size trends
20-year trend: Latency lags bandwidth
[Figure: relative bandwidth improvement vs. normalized latency on log-log axes for Intel microprocessors (80286 (1982), Pentium (1993), Pentium Pro (1997), Pentium 4 (2001)) and memory (DRAM (1980) through DDR-200 SDRAM), with a BW = Latency reference line.]
Source: Patterson, D. [2004]. "Latency lags bandwidth," CACM 47:10 (October), 71-75.