
Cache and the Memory Hierarchy
Part One

Computer Organization II
Spring 2017

Gedare Bloom
Memory throttles computation

[Figure: processor vs. memory performance, normalized to 1980, on a log scale from 1 to 100,000. Processor performance climbs until it hits the "Power Wall"; the widening processor-memory gap is the "Memory Wall".]
The Memory Wall Matters

[Figure: processor-DRAM speed disparity from VAX/1980 through PPro/1996 to 2010+, log scale from 0.01 to 1000. Core clocks per instruction fall below 1 while clocks per DRAM access grow into the hundreds.]

What does the memory wall imply for the pipeline?

How do humans handle
bottlenecks?
Months

Hours Minute
s

4
Hierarchy Works Due to Locality

- Spatial locality: access nearby items
- Temporal locality: access items repeatedly

Cache recently used items.
How Hierarchy Works: An Analogy

Block (or line): the minimum unit of information that is present (or not) in a cache.
How Hierarchy Works: An Analogy

[Figure: the analogy annotated with times - a 1 second hit time, 4 second and 8 second accesses to intermediate levels, and a 30 second miss penalty to Main Memory.]
Apply hierarchy to memory

[Figure: four cores, each with its own L1 cache, joined by an on-chip interconnect to a shared last-level cache and an on-chip memory controller, which connects to off-chip Memory.]
Intel Haswell (2013)

[Die photo: four cores with per-core L2 caches, a shared L3 cache, and I/O and memory controller blocks.]
Memory Hierarchy Technologies

- Caches use SRAM for speed
  - Fast (typical access times of 0.5 to 2.5 ns)
  - Low density (6-transistor cells), higher power, expensive ($2000 to $5000 per GB in 2008)
  - Static: content lasts as long as power is left on
- Main memory uses DRAM for size (density)
  - Slower (typical access times of 50 to 70 ns)
  - High density (1-transistor cells), lower power, cheaper ($20 to $75 per GB in 2008)
  - Dynamic: needs to be refreshed regularly (~every 8 ms); refresh consumes 1% to 2% of the active cycles of the DRAM
Apply hierarchy to memory

[Figure: the same four-core hierarchy annotated with latencies - L1 cache: 4 cycles; on-chip cache: 10 cycles; last-level cache: 40 cycles; Memory: 300 cycles.]
Why not just a really big L1?

[Figure: the same hierarchy diagram. Levels near the cores are smaller, faster, and costlier (per byte); levels near Memory are larger, slower, and cheaper (per byte).]
Cache Basics

Two questions to answer (in hardware):
- Q1: How do we know if a data item is in the cache?
- Q2: If it is, how do we find it?

Direct mapped:
- Each memory block is mapped to exactly one block in the cache
- Lots of lower-level blocks must share blocks in the cache
- Address mapping (to answer Q2): (block address) modulo (# of blocks in the cache)
- A tag associated with each cache block holds the address information (the upper portion of the address) required to identify the block (to answer Q1)
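The address split described above can be sketched in code. This is an illustrative helper, not from the slides: the function name and the geometry (a 4-block cache with one-word, 4-byte blocks) are assumptions chosen to match the simple example that follows.

```python
# Illustrative sketch: splitting a byte address for a tiny direct-mapped
# cache with one-word (4-byte) blocks and 4 cache blocks (assumed geometry).
BLOCK_BYTES = 4   # 2 low-order bits select the byte within the word
NUM_BLOCKS = 4    # 2 index bits

def split_address(addr):
    """Return (tag, index, byte_offset) for a byte address."""
    byte_offset = addr % BLOCK_BYTES
    block_addr = addr // BLOCK_BYTES
    index = block_addr % NUM_BLOCKS   # (block address) modulo (# of blocks)
    tag = block_addr // NUM_BLOCKS    # upper portion of the address
    return tag, index, byte_offset
```

A lookup then uses `index` to pick the cache block and compares the stored tag (plus the valid bit) against `tag` to decide hit or miss.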
A Simple First Example

[Figure: a 4-block direct-mapped cache (Index, Valid, Tag, Data) alongside a 16-word Main Memory, showing which memory blocks map to which cache blocks.]

- One-word blocks; the two low-order bits define the byte in the word (32-bit words)
- Q2: How do we find it? Use the next 2 low-order memory address bits - the index - to determine which cache block: (block address) modulo (# of blocks in the cache)
- Q1: Is it there? Compare the cache tag to the high-order 2 memory address bits to tell if the memory block is in the cache
Direct Mapped Cache

Consider the main memory word reference string 0 1 2 3 4 3 4 15. Start with an empty cache - all blocks initially marked as not valid.

[Figure: cache contents after each reference. 0, 1, 2, and 3 miss and fill blocks 0-3 with Mem(0)-Mem(3); 4 misses and replaces Mem(0) with Mem(4) in block 0; 3 and 4 hit; 15 misses and replaces Mem(3) with Mem(15) in block 3.]

8 requests, 6 misses
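The trace above can be replayed with a small simulation - a sketch, not the slides' code - of a 4-block direct-mapped cache with one-word blocks:

```python
# Sketch: replay the reference string 0 1 2 3 4 3 4 15 through a
# 4-block direct-mapped cache of one-word blocks, counting hits/misses.
NUM_BLOCKS = 4
cache = [None] * NUM_BLOCKS   # each entry holds a block address, or None

hits = misses = 0
for block_addr in [0, 1, 2, 3, 4, 3, 4, 15]:
    index = block_addr % NUM_BLOCKS
    if cache[index] == block_addr:
        hits += 1
    else:
        misses += 1
        cache[index] = block_addr   # replace whatever was there

print(hits, misses)   # 2 6 -- i.e., 8 requests, 6 misses, as on the slide
```

Block 4 evicts block 0 (both map to index 0), and block 15 evicts block 3 (both map to index 3), which is why only the repeated 3 and 4 hit.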
Taking Advantage of Spatial Locality

Let the cache block hold more than one word. Same reference string: 0 1 2 3 4 3 4 15. Start with an empty cache - all blocks initially marked as not valid.

[Figure: a 2-block cache with two-word blocks. 0 misses and brings in Mem(1)/Mem(0); 1 hits; 2 misses and brings in Mem(3)/Mem(2); 3 hits; 4 misses and replaces Mem(1)/Mem(0) with Mem(5)/Mem(4); 3 and 4 hit; 15 misses and replaces Mem(3)/Mem(2) with Mem(15)/Mem(14).]

8 requests, 4 misses
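The same simulation sketch with two-word blocks shows the spatial-locality win (again illustrative code, not from the slides):

```python
# Sketch: the same reference string with two-word blocks and a
# 2-block direct-mapped cache; neighboring words now share a block.
NUM_BLOCKS = 2
WORDS_PER_BLOCK = 2
cache = [None] * NUM_BLOCKS   # each entry holds a block number, or None

hits = misses = 0
for word_addr in [0, 1, 2, 3, 4, 3, 4, 15]:
    block = word_addr // WORDS_PER_BLOCK   # words 0-1 -> block 0, 2-3 -> block 1, ...
    index = block % NUM_BLOCKS
    if cache[index] == block:
        hits += 1
    else:
        misses += 1
        cache[index] = block

print(hits, misses)   # 4 4 -- half the misses of the one-word-block cache
```

References 1 and 3 now hit because they were fetched along with 0 and 2 as part of the same block.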
Cache effect on memory performance

Average Memory Access Time (AMAT):

AMAT = Hit Time + Miss Rate * Miss Penalty

- Hit Time: time to look in the cache
- Miss Rate: fraction of accesses that miss
- Miss Penalty: time to fetch from the next level (recursive)

Note: AMAT is NOT a measure of CPU performance.
Example: One-level cache (Pentium)

AMAT = HitTimeL1 + MissRateL1 * MissPenaltyL1

- L1 access time: 1 cycle
- Memory access time: 8 cycles
- Program behavior: 2% miss rate

AMAT with cache: 1 + (0.02 * 8) = 1.16

What is the AMAT without a cache?
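The arithmetic can be sketched directly (the function and variable names are mine, not the slides'):

```python
# Sketch of the one-level AMAT computation from the slide.
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

with_cache = amat(1, 0.02, 8)   # 1 + 0.02 * 8 = 1.16 cycles
without_cache = 8               # no cache: every access pays the full memory latency
```

Without a cache, every access goes straight to memory, so the average is simply the 8-cycle memory access time - almost 7x worse than 1.16.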
Exercise: Two-level cache (Pentium 4)

AMAT = HitTimeL1 + MissRateL1 * (HitTimeL2 + MissRateL2 * MissPenaltyL2)

- L1 access time: 2 cycles
- L2 access time: 19 cycles
- Memory access time: 240 cycles
- Program behavior: 5% L1 and 25% L2 miss rates

What is the AMAT?
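One way to check your answer afterward is to plug the exercise's numbers into the formula (variable names are mine):

```python
# Sketch: evaluating the two-level AMAT formula with the exercise's numbers.
hit_l1, miss_rate_l1 = 2, 0.05
hit_l2, miss_rate_l2 = 19, 0.25
mem_penalty = 240

amat = hit_l1 + miss_rate_l1 * (hit_l2 + miss_rate_l2 * mem_penalty)
# 2 + 0.05 * (19 + 0.25 * 240) = 2 + 0.05 * 79 = 5.95 cycles
```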
The 3 Cs of Cache Misses

- Compulsory (cold): first access to a block
- Capacity: information used exceeds cache size
- Conflict (collision): multiple blocks mapped to the same cache location
Reducing Compulsory Misses

1. Increase block size
2. Prefetching
Reducing Capacity Misses

1. Increase cache size
Reducing Conflict Misses

1. Increase cache size
2. Increase associativity
3. Improve replacement policy

[Figures: the same blocks placed in direct mapped, set associative, and fully associative caches.]
Tradeoffs in Cache Performance

- Reducing Miss Rate is just one part of AMAT
  - Increasing block size increases Miss Penalty
  - Increasing associativity increases Hit Time
- Cache size is limited by cost and technology scaling
- AMAT is not a direct measure of CPU performance
Miss Rate vs Block Size vs Cache Size

[Figure: miss rate (%) vs. block size (16 to 256 bytes) for cache sizes of 8 KB, 16 KB, 64 KB, and 256 KB.]

Miss rate goes up if the block size becomes a significant fraction of the cache size, because the number of blocks that can be held in the same-size cache is smaller (increasing capacity misses).
Extra Slides

3 Cs and block placement policies

Cache size trends
20-year trend: Latency lags bandwidth

[Figure: relative bandwidth improvement vs. normalized latency, with a BW = Latency line. Microprocessor (Intel) points run from the 80286 (1982) through Pentium (1993), Pentium Pro (1997), and Pentium 4 (2001); Memory Module points run from DRAM (1980) through DDR-200 SDRAM.]

Patterson, D. [2004]. Latency lags bandwidth. CACM 47:10 (October), 71-75.
