Memory Hierarchy, Part One
Computer Organization II, Spring 2017
Gedare Bloom
Memory throttles computation
[Figure: processor vs. memory performance, normalized to 1980, on a log scale from 1 to 100,000. The widening processor-memory gap is the "Memory Wall"; the flattening of the processor curve reflects the "Power Wall".]
The Memory Wall Matters
[Figure: Processor-DRAM speed disparity on a log scale (0.01 to 1000), from VAX/1980 through PPro/1996 to 2010+, with the gap annotated in human terms (minutes vs. hours).]
Hierarchy Works Due to Locality
Temporal locality: recently accessed items are likely to be accessed again soon
Spatial locality: items near recently accessed items are likely to be accessed soon
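As an illustrative sketch (the loop and variable names are my own, not from the slides), a simple summation loop exhibits both kinds of locality:

```python
# Hypothetical sketch: a summation loop exhibits both kinds of locality.
data = list(range(1024))

total = 0
for x in data:      # sequential traversal: spatial locality
    total += x      # 'total' is reused every iteration: temporal locality

# Sequential accesses let the cache fetch one block and then serve
# several subsequent accesses from it before missing again.
```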
How Hierarchy Works: An Analogy
Block (or line): the minimum unit of information that is present (or not) in a cache
How Hierarchy Works: An Analogy
[Figure: analogy with Main Memory in which a hit takes about 1 second while a miss incurs a roughly 30-second miss penalty; intermediate levels are annotated at 4 and 8 seconds.]
Apply hierarchy to memory
[Figure: four cores, each with a private L1 cache, connected by an on-chip interconnect to a last-level cache and an on-chip memory controller, which connects to off-chip Memory.]
Intel Haswell (2013)
[Figure: die layout with four cores and their L2 caches, a shared L3 cache, and the I/O and memory controller.]
Memory Hierarchy Technologies
Caches use SRAM for speed:
- Fast (typical access times of 0.5 to 2.5 ns)
- Low density (6-transistor cells), higher power, expensive ($2000 to $5000 per GB in 2008)
- Static: content lasts as long as power is left on
Main memory uses DRAM for size (density):
- Slower (typical access times of 50 to 70 ns)
- High density (1-transistor cells), lower power, cheaper ($20 to $75 per GB in 2008)
- Dynamic: needs to be refreshed regularly (~every 8 ms), which consumes 1% to 2% of the active cycles of the DRAM
Apply hierarchy to memory
[Figure: the same four-core hierarchy annotated with approximate access latencies: L1 cache 4 cycles, on-chip interconnect 10 cycles, last-level cache 40 cycles, Memory 300 cycles.]
Why not just a really big
L1?
Smaller, core core core core
faster, L1 cache L1 cache L1 cache L1 cache
and
costlier on-chip interconnect
(per byte) Memory
on-chip cache
controller
Larger,
slower, last-level cache
and
cheaper
(per byte) Memory
12
Cache Basics
Two questions to answer (in hardware):
Q1: How do we know if a data item is in the cache?
Q2: If it is, how do we find it?
Direct mapped:
- Each memory block is mapped to exactly one block in the cache, so many memory blocks must share each cache block
- Address mapping (answers Q2): (block address) modulo (# of blocks in the cache)
- A tag associated with each cache block holds the address information (the upper portion of the address) required to identify the block (answers Q1)
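The two answers above can be sketched as address arithmetic (function and parameter names are illustrative, not from the slides):

```python
def cache_map(block_address, num_blocks):
    """Direct-mapped placement for a cache with num_blocks blocks."""
    index = block_address % num_blocks   # Q2: which cache block to look in
    tag = block_address // num_blocks    # Q1: upper bits stored and compared as the tag
    return index, tag

# With a 4-block cache, memory blocks 1, 5, 9, and 13 all share index 1
# and are distinguished only by their tags (0, 1, 2, 3).
```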
A Simple First Example
[Figure: a cache of four one-word blocks (Index, Valid, Tag, Data; indexes 00-11) alongside a main memory of 16 words with 4-bit block addresses 0000-1111.]
One-word blocks; the two low-order bits of the address select the byte in the word (32-bit words).
Q2: How do we find it? Use the next two low-order memory address bits (the index) to determine which cache block: (block address) modulo (# of blocks in the cache).
Q1: Is it there? Compare the cache tag to the high-order two memory address bits to tell if the memory block is in the cache.
Direct Mapped Cache
Consider the main memory word reference string 0 1 2 3 4 3 4 15, starting with an empty cache (all blocks initially marked as not valid):
- 0 miss, 1 miss, 2 miss, 3 miss: blocks 0-3 fill indexes 00-11 (tag 00)
- 4 miss: maps to index 00 with tag 01, replacing Mem(0) with Mem(4)
- 3 hit, 4 hit
- 15 miss: maps to index 11 with tag 11, replacing Mem(3) with Mem(15)
8 requests, 6 misses
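The trace above can be checked with a small simulator (a sketch, assuming one-word blocks and the 4-block cache of the example):

```python
def simulate_direct_mapped(refs, num_blocks):
    """Count misses for a direct-mapped cache of one-word blocks."""
    cache = {}          # index -> tag of the block currently held there
    misses = 0
    for addr in refs:
        index, tag = addr % num_blocks, addr // num_blocks
        if cache.get(index) != tag:   # not valid, or tag mismatch: a miss
            misses += 1
            cache[index] = tag        # fetch the block, replacing any occupant
    return misses

# Reference string from the example: 8 requests, 6 misses
print(simulate_direct_mapped([0, 1, 2, 3, 4, 3, 4, 15], 4))   # -> 6
```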
Cache effect on memory performance
Average Memory Access Time (AMAT):
AMAT = Hit Time + Miss Rate * Miss Penalty
- Hit Time: time to look in the cache
- Miss Rate: fraction of accesses that miss
- Miss Penalty: time to fetch from the next level (recursive: it is the AMAT of the next level)
Note: AMAT is NOT a measure of CPU performance.
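The recursion in the miss penalty can be sketched directly. The cycle counts below reuse the earlier hierarchy figures (4-cycle L1, 40-cycle last-level cache, 300-cycle memory); the miss rates are made-up numbers for illustration:

```python
def amat(hit_times, miss_rates, memory_time):
    """AMAT where each level's miss penalty is the AMAT of the next level."""
    if not hit_times:
        return memory_time
    return hit_times[0] + miss_rates[0] * amat(hit_times[1:], miss_rates[1:], memory_time)

# Assumed miss rates of 10% (L1) and 20% (LLC):
# 4 + 0.1 * (40 + 0.2 * 300) = 14 cycles
print(amat([4, 40], [0.1, 0.2], 300))
```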
Example: One-level cache (Pentium)
AMAT = HitTimeL1 + MissRateL1 * MissPenaltyL1
- L1 access time: 1 cycle
- Memory access time: 8 cycles
- Program behavior: 2% miss rate
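Plugging the slide's numbers into the formula:

```python
hit_time = 1        # L1 access time, in cycles
miss_rate = 0.02    # 2% of accesses miss in L1
miss_penalty = 8    # memory access time, in cycles

amat = hit_time + miss_rate * miss_penalty
print(amat)   # 1 + 0.02 * 8 = 1.16 cycles per access
```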
Reducing Compulsory Misses
1. Increase block size
2. Prefetching
Reducing Capacity Misses
1. Increase cache size
Reducing Conflict Misses
1. Increase cache size
2. Increase associativity
3. Improve replacement policy
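A small sketch (illustrative code, not from the slides) shows why associativity helps: with four one-word blocks, the ping-pong trace 0, 4, 0, 4 misses every time in a direct-mapped cache, but fits comfortably in a 2-way set-associative cache with LRU replacement:

```python
def direct_mapped_misses(refs, num_blocks):
    cache = {}  # index -> tag
    misses = 0
    for addr in refs:
        index, tag = addr % num_blocks, addr // num_blocks
        if cache.get(index) != tag:
            misses += 1
            cache[index] = tag
    return misses

def two_way_lru_misses(refs, num_blocks):
    num_sets = num_blocks // 2
    sets = {s: [] for s in range(num_sets)}  # each set holds up to 2 tags, LRU first
    misses = 0
    for addr in refs:
        s, tag = addr % num_sets, addr // num_sets
        ways = sets[s]
        if tag in ways:
            ways.remove(tag)      # hit: move to the MRU position
        else:
            misses += 1
            if len(ways) == 2:
                ways.pop(0)       # evict the least recently used tag
        ways.append(tag)
    return misses

# Blocks 0 and 4 conflict on the same direct-mapped index, but share a 2-way set:
print(direct_mapped_misses([0, 4, 0, 4], 4))  # -> 4
print(two_way_lru_misses([0, 4, 0, 4], 4))    # -> 2
```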
Tradeoffs in Cache Performance
Reducing Miss Rate is just one part of AMAT:
- Increasing block size increases Miss Penalty
- Increasing associativity increases Hit Time
[Figure: miss rate (%) vs. block size (16 to 256 bytes) for cache sizes of 8 KB, 16 KB, 64 KB, and 256 KB.]
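A quick back-of-the-envelope check (the miss rates and penalties are invented for illustration) shows how a lower miss rate can still lose on AMAT once larger blocks raise the miss penalty:

```python
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Small blocks: higher miss rate but cheap refills (assumed numbers)
small_blocks = amat(hit_time=1, miss_rate=0.04, miss_penalty=50)   # 3.0 cycles
# 4x larger blocks: miss rate drops, but each refill moves more data
large_blocks = amat(hit_time=1, miss_rate=0.03, miss_penalty=100)  # 4.0 cycles

print(small_blocks, large_blocks)
```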
3 Cs and block placement policies
Cache size trends
20-year trend: Latency lags bandwidth
[Figure: relative bandwidth improvement vs. normalized latency on log-log axes for Intel microprocessors (80286 (1982), Pentium (1993), Pentium Pro (1997), Pentium 4 (2001)) and memory (DRAM (1980) through DDR-200 SDRAM), with a BW = Latency reference line.]
Source: Patterson, D. [2004]. "Latency lags bandwidth," CACM 47:10 (October), 71-75.