
MEMORY HIERARCHY

ISAAC HANSON
IT/FINF/GTUC
Processor-Memory Performance Gap
[Figure: processor vs. DRAM performance, 1980-2004, log scale from 1 to 10,000. "Moore's Law" processor performance grows ~55%/year (2X/1.5 yr) while DRAM performance grows only ~7%/year (2X/10 yrs), so the processor-memory performance gap grows ~50%/year.]
The Memory Hierarchy Goal
• Fact: Large memories are slow and fast memories are
small

• How do we create a memory that gives the illusion of
being large, cheap and fast (most of the time)?
– With hierarchy
– With parallelism
A Typical Memory Hierarchy
[Figure: on-chip components – control, datapath, register file, instruction cache, data cache, instruction TLB, and data TLB – backed by a second-level cache (SRAM or eDRAM), main memory (DRAM), and secondary memory (disk).]
Speed (cycles):  ½'s      1's     10's    100's    1,000's     (fastest near the processor)
Size (bytes):    100's    K's     10K's   M's      G's to T's  (largest at the disk)
Cost:            highest  ................................     lowest

• By taking advantage of the principle of locality, we can present the user with as
much memory as is available in the cheapest technology
– at the speed offered by the fastest technology
Characteristics of the Memory
Hierarchy
[Figure: the memory hierarchy as a pyramid. The processor sits at the top, followed by L1$, L2$, Main Memory, and Secondary Memory, with access time and (relative) size increasing with distance from the processor. Transfer units grow downward: 4-8 bytes (a word) between processor and L1$, 8-32 bytes (a block) between L1$ and L2$, 1 to 4 blocks between L2$ and Main Memory, and 1,024+ bytes (a disk sector = page) between Main Memory and Secondary Memory. The hierarchy is inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which is a subset of what is in SM.]
Memory Performance Metrics
• Latency: Time to access one word
– Access time: time between the request and when the data is
available (or written)
– Cycle time: time between requests
– Usually cycle time > access time
– Typical read access times for SRAMs in 2004 are 2 to 4 ns for the fastest parts
and 8 to 20 ns for the typical largest parts
• Bandwidth: How much data from the memory can be
supplied to the processor per unit time
– width of the data channel * the rate at which it can be used

• Size: DRAMs have 4 to 8 times the capacity of SRAMs
• Cost/Cycle time: SRAMs are 8 to 16 times faster than DRAMs, but also 8 to 16 times as expensive
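As an illustration of the bandwidth product above (the numbers are assumed, not from the slides): a 64-bit (8-byte) data channel that can be used 200 million times per second supplies a peak bandwidth of 8 bytes × 200 × 10^6 /s = 1.6 GB/s.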
DRAM Memory Latency & Bandwidth
Milestones
• In the time that the memory-to-processor bandwidth doubles, memory latency
improves by a factor of only 1.2 to 1.4
• To deliver such high bandwidth, the internal DRAM has
to be organized as interleaved memory banks
                  DRAM   Page    FastPage  FastPage  Synch   DDR
                         DRAM    DRAM      DRAM      DRAM    SDRAM
Module Width      16b    16b     32b       64b       64b     64b
Year              1980   1983    1986      1993      1997    2000
Mb/chip           0.06   0.25    1         16        64      256
Die size (mm²)    35     45      70        130       170     204
Pins/chip         16     16      18        20        54      66
BWidth (MB/s)     13     40      160       267       640     1600
Latency (nsec)    225    170     125       75        62      52
Patterson, CACM Vol 47, #10, 2004
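To make the interleaving point above concrete, a minimal sketch in C (the four-bank figure is an assumption for illustration, not a value from the slides): with low-order interleaving, consecutive block addresses fall in different banks, so several accesses can be in flight at once.

#include <stdio.h>

#define NUM_BANKS 4u     /* assumed number of interleaved banks */

int main(void) {
    /* Low-order interleaving: consecutive block addresses hit different banks. */
    for (unsigned block = 0; block < 8; block++)
        printf("block %u -> bank %u, row %u\n",
               block, block % NUM_BANKS, block / NUM_BANKS);
    return 0;
}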
Review: The Memory Hierarchy
[Figure repeated from "Characteristics of the Memory Hierarchy": Processor, L1$, L2$, Main Memory, and Secondary Memory, with access time and relative size increasing with distance from the processor, each level inclusive of the one above, and with transfer units of 4-8 bytes (word), 8-32 bytes (block), 1 to 4 blocks, and 1,024+ bytes (disk sector = page).]

• Take advantage of the principle of locality to present the user with as much
memory as is available in the cheapest technology, at the speed offered by the
fastest technology
The Memory Hierarchy: Terminology
• Hit: data is in some block in the upper level (Blk X)
– Hit Rate: the fraction of memory accesses found in the upper
level
– Hit Time: Time to access the upper level which consists of
RAM access time + Time to determine hit/miss




• Miss: data is not in the upper level, so it needs to be
retrieved from a block in the lower level (Blk Y)
– Miss Rate = 1 - (Hit Rate)
– Miss Penalty: Time to replace a block in the upper level
+ Time to deliver the block to the processor
– Hit Time << Miss Penalty
[Figure: the processor exchanges data with the Upper Level Memory (holding Blk X); on a miss, Blk Y is brought in from the Lower Level Memory.]
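These terms combine in the standard average memory access time relation, AMAT = Hit Time + Miss Rate × Miss Penalty. With illustrative numbers (assumptions, not from the slides): a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty give AMAT = 1 + 0.05 × 100 = 6 cycles.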
4 Questions for the Memory Hierarchy
• Q1: Where can a block be placed in the upper level?
(Block placement)

• Q2: How is a block found if it is in the upper level?
(Block identification)

• Q3: Which block should be replaced on a miss?
(Block replacement)

• Q4: What happens on a write?
(Write strategy)
Q1&Q2: Where can a block be
placed/found?
Placement:
  Scheme              # of sets                        Blocks per set
  Direct mapped       # of blocks in cache             1
  Set associative     (# of blocks in cache) /         Associativity (typically 2 to 16)
                      associativity
  Fully associative   1                                # of blocks in cache

Identification:
  Scheme              Location method                  # of comparisons
  Direct mapped       Index                            1
  Set associative     Index the set; compare the       Degree of associativity
                      set's tags
  Fully associative   Compare all blocks' tags         # of blocks
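A minimal sketch of how the identification step works for a direct-mapped cache: the address is split into tag, index, and block offset. The geometry (32-byte blocks, 256 blocks) and the example address are assumptions for illustration, not values from the slides.

#include <stdio.h>
#include <stdint.h>

#define BLOCK_BYTES 32u     /* assumed block size  -> 5 offset bits */
#define NUM_BLOCKS  256u    /* assumed cache size  -> 8 index bits  */

int main(void) {
    uint32_t addr   = 0x12345678u;                        /* example byte address */
    uint32_t offset = addr % BLOCK_BYTES;                 /* byte within the block */
    uint32_t index  = (addr / BLOCK_BYTES) % NUM_BLOCKS;  /* which cache block to check */
    uint32_t tag    = addr / (BLOCK_BYTES * NUM_BLOCKS);  /* compared against the stored tag */
    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}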
Q1: Block Placement:
Where can a Block be Placed in the Upper Level?
• Block 12 placed in an 8-block cache:
– Fully Associative (FA), Direct Mapped, 2-way Set Associative (SA)
– SA mapping: (Block #) modulo (# of sets)
[Figure: memory block frame addresses 0, 1, 2, … 15, …, with block 12 highlighted, mapped into a cache of blocks 0-7.]
• Fully Associative: block 12 can go anywhere
• Direct mapped: block 12 can go only into cache block 4 (12 mod 8 = 4)
• 2-way Set Associative (sets 0-3): block 12 can go anywhere in set 0 (12 mod 4 = 0)
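The mod arithmetic behind the figure, as a small sketch (the 8-block cache matches the slide's example; the code itself is only illustrative):

#include <stdio.h>

int main(void) {
    unsigned block      = 12;   /* memory block from the example above */
    unsigned num_blocks = 8;    /* cache size in blocks */
    unsigned num_sets   = 4;    /* 2-way set associative: 8 blocks / 2 ways */

    printf("Direct mapped:     block %u -> cache block %u\n", block, block % num_blocks);  /* 4 */
    printf("2-way set assoc.:  block %u -> set %u (either way)\n", block, block % num_sets); /* 0 */
    printf("Fully associative: block %u -> any of the %u cache blocks\n", block, num_blocks);
    return 0;
}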
Q3: Which block should be replaced on a miss?
• Easy for direct mapped – there is only one choice
• Set associative or fully associative
–Random
–LRU (Least Recently Used)
–FIFO

• For a 2-way set associative cache, random
replacement has a miss rate about 1.1 times
higher than LRU.
• LRU is too costly to implement for high levels of
associativity (> 4-way) since tracking the usage
information is costly
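As a hedged sketch of why LRU is cheap at low associativity: for a 2-way set a single bit per set is enough to track usage (the structure and names below are made up for illustration, not taken from the slides).

#include <stdbool.h>

/* One LRU bit per 2-way set: the way to evict on the next miss. */
struct set2 { bool lru_way; };

/* Record that way 'used' (0 or 1) of set 's' was just accessed:
   the other way becomes the least recently used one. */
static void touch(struct set2 *s, int used) { s->lru_way = !used; }

/* On a miss in this set, this is the way to replace. */
static int victim(const struct set2 *s) { return s->lru_way; }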
Q4: What happens on a write?
• Write-through – The information is written to both the block
in the cache and to the block in the next lower level of the
memory hierarchy
– Write-through is always combined with a write buffer, so that waits for writes to
lower-level memory can be eliminated (as long as the write buffer doesn't fill)
• Write-back – The information is written only to the block in
the cache. The modified cache block is written to main
memory only when it is replaced.
– Need a dirty bit to keep track of whether the block is clean or
dirty
• Pros and cons of each?
– Write-through: read misses don’t result in writes (so are simpler
and cheaper)
– Write-back: repeated writes require only one write to lower level
Interaction Policies with Main
Memory
• The read policies are:
Read Through - reading a word from main memory to CPU
No Read Through - reading a block from main memory to
cache and then from cache to CPU
• Such is not the case for writes. Modifying a block cannot
begin until the tag is checked to see if the address is a
hit. Also the processor specifies the size of the write, usually
between 1 and 8 bytes; only that portion of the block can
be changed. In contrast, reads can access more bytes than
necessary without a problem.

Write policies
• The write policies on write hit often distinguish cache
designs:
Write Through - the information is written to both the block
in the cache and to the block in the lower-level memory.
Advantage:
- read miss never results in writes to main memory
- easy to implement
- main memory always has the most current copy of the data
(consistent)
Disadvantage:
- write is slower
- every write needs a main memory access
- as a result uses more memory bandwidth
Write back
• Write back - the information is written only to the block in the cache.
The modified cache block is written to main memory only when it is
replaced. To reduce the frequency of writing back blocks on replacement,
a dirty bit is commonly used. This status bit indicates whether the block is
dirty (modified while in the cache) or clean (not modified). If it is clean the
block is not written on a miss.
Advantage:
- writes occur at the speed of the cache memory
- multiple writes within a block require only one write to main memory
- as a result uses less memory bandwidth
Disadvantage:
- harder to implement
- main memory is not always consistent with cache
- reads that result in replacement may cause writes of dirty blocks to
main memory
Write Miss Policies
• There are two common options on a write miss:
Write Allocate - the block is loaded on a write miss, followed
by the write-hit action.
No Write Allocate - the block is modified in the main memory
and not loaded into the cache.
• Although either write-miss policy could be used with write
through or write back, write-back caches generally use write
allocate (hoping that subsequent writes to that block will be
captured by the cache) and
write-through caches often use no-write allocate (since
subsequent writes to that block will still have to go to
memory).

Interaction policies with main memory on write

Possible combinations of interaction policies with main memory on write:

  Write hit policy    Write miss policy
  Write Through       Write Allocate
  Write Through       No Write Allocate
  Write Back          Write Allocate
  Write Back          No Write Allocate
Continued…
• Write Through with Write Allocate:
– on hits it writes to cache and main memory
– on misses it updates the block in main memory and brings the block to the cache
– Bringing the block to cache on a miss does not make a lot of sense in this combination
because the next hit to this block will generate a write to main memory anyway
(according to Write Through policy)
• Write Through with No Write Allocate:
– on hits it writes to cache and main memory;
– on misses it updates the block in main memory not bringing that block to the cache;
– Subsequent writes to the block will update main memory because Write Through policy
is employed. So, some time is saved not bringing the block in the cache on a miss
because it appears useless anyway
Continued…
• Write Back with Write Allocate:
– on hits it writes to cache setting “dirty” bit for the block, main memory is not updated;
– on misses it updates the block in main memory and brings the block to the cache;
– Subsequent writes to the same block, if the block originally caused a miss, will hit in the
cache next time, setting the dirty bit for the block. That eliminates extra memory
accesses and results in very efficient execution compared with the Write Through with Write
Allocate combination (see the sketch after this list).

• Write Back with No Write Allocate:
– on hits it writes to cache setting “dirty” bit for the block, main memory is not updated;
– on misses it updates the block in main memory not bringing that block to the cache;
– Subsequent writes to the same block, if the block originally caused a miss, will generate
misses all the way and result in very inefficient execution
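A hedged sketch of the Write Back with Write Allocate combination described above, for a direct-mapped cache. The geometry and the helper names (fetch_block, write_block_to_memory) are hypothetical, introduced only for illustration; real controllers do this in hardware.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 32u
#define NUM_BLOCKS  256u

struct line {
    bool     valid, dirty;
    uint32_t tag;
    uint8_t  data[BLOCK_BYTES];
};

static struct line cache[NUM_BLOCKS];

/* Stubs standing in for the (hypothetical) memory interface. */
static void fetch_block(uint32_t block_addr, uint8_t *dst) {
    (void)block_addr; memset(dst, 0, BLOCK_BYTES);
}
static void write_block_to_memory(uint32_t tag, uint32_t index, const uint8_t *src) {
    (void)tag; (void)index; (void)src;
}

/* Write-back + write-allocate handling of a one-byte store. */
static void cache_write(uint32_t addr, uint8_t value) {
    uint32_t index = (addr / BLOCK_BYTES) % NUM_BLOCKS;
    uint32_t tag   = addr / (BLOCK_BYTES * NUM_BLOCKS);
    struct line *l = &cache[index];

    if (!(l->valid && l->tag == tag)) {            /* write miss */
        if (l->valid && l->dirty)                  /* only now does memory see earlier writes */
            write_block_to_memory(l->tag, index, l->data);
        fetch_block(addr / BLOCK_BYTES, l->data);  /* allocate: bring the block into the cache */
        l->valid = true;
        l->tag   = tag;
        l->dirty = false;
    }
    l->data[addr % BLOCK_BYTES] = value;           /* write hit path: cache only ...        */
    l->dirty = true;                               /* ... main memory is not updated here   */
}

int main(void) {
    cache_write(0x1000, 0xAA);   /* miss: write back victim (if dirty), allocate, then write */
    cache_write(0x1001, 0xBB);   /* hit: cache only, no memory traffic */
    return 0;
}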

Improving Cache Performance
0. Reduce the time to hit in the cache
– smaller cache
– direct mapped cache
– smaller blocks
– for writes
• no write allocate – no “hit” on cache, just write to write buffer
• write allocate – to avoid two cycles (first check for hit, then write) pipeline
writes via a delayed write buffer to cache
1. Reduce the miss rate
– bigger cache
– more flexible placement (increase associativity)
– larger blocks (16 to 64 bytes typical)
– victim cache – small buffer holding most recently
discarded blocks
Improving Cache Performance
2. Reduce the miss penalty
– smaller blocks
– use a write buffer to hold dirty blocks being replaced so
don’t have to wait for the write to complete before
reading
– check write buffer (and/or victim cache) on read miss –
may get lucky
– for large blocks fetch critical word first
– use multiple cache levels – an L2 cache is not tied to the CPU clock
rate (see the AMAT expansion after this list)
– faster backing store/improved memory bandwidth
• wider buses
• memory interleaving, page mode DRAMs
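The multi-level AMAT expansion referenced above, with illustrative numbers (assumptions, not from the slides): AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2). With a 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit, 20% L2 local miss rate, and a 100-cycle memory access, AMAT = 1 + 0.05 × (10 + 0.2 × 100) = 2.5 cycles, versus 6 cycles without the L2, which is why the second level pays off even though it runs slower than the CPU.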

Summary: The Cache Design Space
• Several interacting dimensions
– cache size
– block size
– associativity
– replacement policy
– write-through vs write-back
– write allocation
• The optimal choice is a
compromise
– depends on access characteristics
• workload
• use (I-cache, D-cache, TLB)
– depends on technology / cost
• Simplicity often wins
[Figure: the cache design space – a sketch of how performance moves between "Bad" and "Good" as some Factor A or Factor B varies from Less to More along the associativity, cache size, and block size dimensions.]
Virtual Storage System
• Fulfills main memory capacity needs; it is not meant to fill the speed gap
between processor and memory
• Thus, it is largely a software-managed system
• Virtual address space >> physical address space
• Needs translation of virtual addresses to physical addresses
– Page table
– Usually the entire page table cannot be kept in memory
• Very slow translation
• Address translation should be fast – it shouldn't need more than one memory
access
– TLB, made of associative memory
– Cache
Virtual Memory
• Virtual address (2^32, 2^64) to physical address (2^28) mapping
• Virtual memory terms vs. cache terms:
– Cache block? – page (or segment)
– Cache miss? – page fault (or address fault)
• How is virtual memory different from caches?
– What controls replacement: HW (caches) vs. SW (virtual memory)
– Size (transfer unit, mapping mechanisms)
– Lower level used: secondary storage (magnetic disk)
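A minimal sketch of the mapping just described, assuming 4 KB pages and a toy single-level page table; the names and sizes here are assumptions for illustration only (real systems use multi-level or inverted page tables plus a TLB).

#include <stdint.h>
#include <stdio.h>

#define PAGE_BYTES 4096u      /* assumed 4 KB pages -> 12 offset bits        */
#define NUM_PAGES  1024u      /* toy virtual address space of 1024 pages     */

/* Toy page table: virtual page number -> physical page number. */
static uint32_t page_table[NUM_PAGES];

static uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr / PAGE_BYTES;   /* virtual page number            */
    uint32_t offset = vaddr % PAGE_BYTES;   /* unchanged by translation       */
    uint32_t ppn    = page_table[vpn];      /* a real system also checks a valid
                                               bit; a miss here is a page fault */
    return ppn * PAGE_BYTES + offset;
}

int main(void) {
    page_table[3] = 17;   /* pretend virtual page 3 lives in physical page 17 */
    uint32_t va = 3 * PAGE_BYTES + 0x2A;
    printf("0x%x -> 0x%x\n", va, translate(va));
    return 0;
}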
Typical Ranges Of Parameters
Parameter            First-level cache       Virtual memory
Block (page) size    16-128 bytes            4 KB - 64 KB
Hit time             1-2 clock cycles        40-100 clock cycles
Miss penalty         8-100 clock cycles      700,000-6,000,000 clock cycles
  (Access time)      (6-60 clock cycles)     (500,000-4,000,000 clock cycles)
  (Transfer time)    (2-40 clock cycles)     (200,000-2,000,000 clock cycles)
Miss rate            0.5-10%                 0.00001-0.001%
Data memory size     0.016-1 MB              16-8192 MB

Virtual Storage
• 4Qs for Virtual Memory?
– Q1: Where can a block be placed in the upper level?
Fully Associative, Set Associative, Direct Mapped

– Q2: How is a block found if it is in the upper level?
Tag/Block
• page table, multi-level page table, inverted page table, translation
look-aside buffer (TLB)

– Q3: Which block should be replaced on a miss?
Random, LRU

– Q4: What happens on a write?
Write Back or Write Through (with Write Buffer)
Selecting the Page Size
• Reasons for larger page size
– Page table size is inversely proportional to the page size;
memory can be saved with larger page, i.e. with smaller PT
– Fast cache hit time, easy when cache  page size (VA caches);
bigger page is feasible as cache size grows
– Transferring larger pages to or from secondary storage, possibly over a
network, is more efficient
– Number of TLB entries are restricted by clock cycle time, so a larger page size
maps more memory, thereby reducing TLB misses
• Reasons for a smaller page size
– Fragmentation: don’t waste storage; data must be contiguous within page
– Quicker process start for small pages (??)
• Hybrid solution: multiple page sizes
– Alpha: 8KB, 16KB, 32 KB, 64 KB pages (43, 47, 51, 55 virt addr bits)
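To make the page-table-size point concrete (an illustrative calculation, not from the slides): with a 32-bit virtual address and 4-byte page table entries, 4 KB pages need 2^32 / 2^12 = 2^20 entries ≈ 4 MB of page table per process, while 64 KB pages need only 2^32 / 2^16 = 2^16 entries ≈ 256 KB.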
Thank you