CMP3010: Computer Architecture
L09: Memory Hierarchy II
Dina Tantawy
Computer Engineering Department
Cairo University
Agenda
• Review
• Fully Associative Cache
• Set Associative Cache
• Cache performance
• Cache vs virtual memory
• Software optimization based on caching
Review: Memory Hierarchy
[Figure: the memory hierarchy.]
Review: Memory Hierarchy
Principle of locality: programs access a relatively small portion of their address space at any instant of time.
• Temporal locality: if an item is referenced, it will tend to be referenced again soon.
• Spatial locality: if an item is referenced, items whose addresses are close by will tend to be referenced soon.
Review: Terminologies
• Hit: the data appears in some block in the upper level.
  – Hit rate: the fraction of memory accesses found in the upper level.
  – Hit time: time to access the upper level, which consists of cache access time + time to determine hit/miss.
• Miss: the data needs to be retrieved from a block in the lower level.
  – Miss rate = 1 − (hit rate).
  – Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor.
• Hit time << miss penalty.
Review: The Basics of Caches
• How do we know if a data item is in the cache?
• How do we find it?
Review: Direct Mapped Cache
[Figure: direct-mapped cache organization.]
Review: Read and Write Policies
• Two write options when the data block is already in the cache (a write hit):
  • Write through: write to the cache and memory at the same time.
    • Isn't memory too slow for this?
  • Write back: write to the cache only; write the cache block back to memory when that block is replaced on a cache miss.
    • Needs a "dirty" bit for each cache block.
    • Control can be complex.
Review: Write Miss Policies
• Write allocate (also called fetch on write): the data at the missed-write location is loaded into the cache, followed by a write-hit operation. In this approach, write misses are like read misses.
• No-write allocate (also called write-no-allocate or write around): the data at the missed-write location is not loaded into the cache, and is written directly to the backing store. In this approach, data is loaded into the cache on read misses only. A toy simulation contrasting the two policies follows.
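To make the two policies concrete, here is a minimal sketch (entirely illustrative: the direct-mapped structure, sizes, and all names are assumptions, and the cache shown is write-through):

#include <stdbool.h>
#include <stdint.h>

/* Toy direct-mapped, write-through cache with one word per block,
   used only to contrast the two write-miss policies. */
#define SETS 256
#define MEM_WORDS (1u << 20)

typedef struct { bool valid; uint32_t tag; uint32_t data; } Line;

static Line     cache[SETS];
static uint32_t memory[MEM_WORDS];    /* word-addressed backing store */

void write_word(uint32_t addr, uint32_t value, bool write_allocate)
{
    uint32_t idx = addr % SETS, tag = addr / SETS;
    Line *l = &cache[idx];

    if (!(l->valid && l->tag == tag)) {   /* write miss */
        if (!write_allocate) {
            memory[addr] = value;         /* write around: memory only */
            return;
        }
        l->valid = true;                  /* write allocate: fetch the */
        l->tag   = tag;                   /* block, then fall through  */
        l->data  = memory[addr];          /* to the write-hit path     */
    }
    l->data      = value;                 /* write hit */
    memory[addr] = value;                 /* write through to memory */
}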
What about other mappings?
Flexible Placement of Blocks: Associativity
Where can memory block 12 be placed in an 8-block cache?
• Fully associative: anywhere in the cache.
• 2-way set associative (4 sets): anywhere in set 0 (12 mod 4).
• Direct mapped: only into cache block 4 (12 mod 8).
Fully Associative
• Fully associative cache: push the set-associative idea to its limit!
  • Forget about the cache index.
  • Compare the cache tags of all cache entries in parallel.
  • Example: with 32-byte blocks, we need N 27-bit comparators.
• By definition: conflict misses = 0 for a fully associative cache.
[Figure: a fully associative cache. The 32-bit address splits into a 27-bit cache tag (bits 31–5) and a byte select (bits 4–0, e.g. 0x01); each entry holds a valid bit, a cache tag, and 32 bytes of cache data, and every tag is compared in parallel.]
A Four-Way Set Associative Cache
• N-way set associative: N entries for each cache index.
• N direct-mapped caches operate in parallel.
• Example: four-way set associative cache:
  • The cache index selects a "set" from the cache.
  • The four tags in the set are compared in parallel.
  • Data is selected based on the tag result.
Replacement Policy
In an associative cache, which block from a set should be evicted when the set becomes full?
• Random
• Least recently used (LRU)
  • LRU cache state must be updated on every access.
  • A true implementation is only feasible for small sets (2-way); see the sketch below.
• First in, first out (FIFO), a.k.a. round-robin
  • Used in highly associative caches.
Replacement only happens on misses.
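For 2-way sets, true LRU needs only one bit per set. A minimal sketch (the names and set count here are assumptions, not from the slides):

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 128

/* lru_is_way1[s] == true means way 1 of set s is the least recently used. */
static bool lru_is_way1[NUM_SETS];

/* Update on every access (hit or fill) to 'way' of 'set':
   the way just touched makes the *other* way the LRU one. */
void lru_update(uint32_t set, int way)
{
    lru_is_way1[set] = (way == 0);
}

/* On a miss, evict the least recently used way of 'set'. */
int lru_victim(uint32_t set)
{
    return lru_is_way1[set] ? 1 : 0;
}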
Quiz
Assume a 16 Kbyte cache that holds both instructions and data.
Additional specs for the 16 Kbyte cache include:
- Each block will hold 32 bytes of data
- The cache would be 4-way set associative
- Physical addresses are 32 bits
Q1: How many blocks would be in this cache?
Q2: How many bits of tag are stored with each block entry?
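A worked solution (arithmetic not on the original slide):
• Blocks: 16 KB / 32 B per block = 512 blocks.
• Sets: 512 blocks / 4 ways = 128 sets, so the index is 7 bits; the 32-byte block gives a 5-bit offset.
• Tag: 32 − 7 − 5 = 20 bits stored with each block entry.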
Cache Performance
Measuring Cache Performance
Impact of cache misses on performance
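In the standard formulation, misses add stall cycles to execution time:
Memory-stall cycles = Memory accesses × Miss rate × Miss penalty
CPU time = (CPU execution cycles + Memory-stall cycles) × Clock cycle time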
Example
[Figures: worked example and its solution.]
Improving Cache Performance
Average memory access time (AMAT) = Hit time + Miss rate × Miss penalty
To improve performance:
• reduce the hit time
• reduce the miss rate
• reduce the miss penalty
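For instance, with illustrative numbers (assumed here, not from the slides), a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty give:
AMAT = 1 + 0.05 × 100 = 6 cycles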
Sources of Cache Misses
• Compulsory (cold start, first reference): first access to a block
• Misses that would occur even with infinite cache
• “Cold” fact of life: not a whole lot you can do about it
• Conflict (collision):
• Multiple memory locations mapped to the same cache location
• Solution 1: increase cache size
• Solution 2: increase associativity
• Capacity:
• Cache cannot contain all blocks accessed by the program
• Solution: increase cache size
Reducing Miss Penalty Using Multilevel Caches
• Use a smaller L1 if there is also an L2:
  • Trade increased L1 miss rate for reduced L1 hit time and reduced L1 miss penalty.
  • Reduces average access time.
CPU ↔ L1 ↔ L2 ↔ DRAM
Performance of Multilevel Caches
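With two levels, the L1 miss penalty is itself the time to access L2 (and, on an L2 miss, DRAM), so the AMAT expands recursively:
AMAT = L1 hit time + L1 miss rate × (L2 hit time + L2 miss rate × L2 miss penalty)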
Effect of Cache Parameters on Performance
• Larger cache size
+ reduces capacity and conflict misses
- hit time will increase
• Higher associativity
+ reduces conflict misses
- may increase hit time
• Larger block size
+ reduces compulsory misses (exploits spatial locality)
- increases conflict misses and miss penalty
Quiz
• Suppose a processor executes at
  • Clock rate = 1 GHz (1 ns per cycle), ideal (no misses) CPI = 1.5
  • 40% arith/logic, 40% ld/st, 20% control
• Suppose that 5% of memory operations (involving data) incur a 100-cycle miss penalty
• Suppose that 2% of instructions incur the same miss penalty
Determine how much faster a processor with a perfect cache that never missed would run.
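A worked solution (arithmetic not on the original slide):
• Data stalls per instruction: 40% ld/st × 5% miss rate × 100 cycles = 2 cycles.
• Instruction stalls per instruction: 2% × 100 cycles = 2 cycles.
• Real CPI = 1.5 + 2 + 2 = 5.5, versus 1.5 with a perfect cache.
• Speedup = 5.5 / 1.5 ≈ 3.67×.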
Is Virtual Memory the Same as Caching?
Virtual Memory vs Cache Memory
• Virtual memory increases the capacity of main memory; cache memory increases the effective access speed of the CPU.
• Virtual memory is not a memory unit, it is a technique; cache memory is hardware.
• The operating system manages virtual memory; hardware manages the cache.
• The size of virtual memory can be greater than main memory; the size of cache memory is less than main memory.
How can we benefit from cache?
Software Optimization via Blocking
• When dealing with arrays, we can get good performance from the memory system if we store the array so that accesses to it are sequential in memory. What about a matrix?
• How is a matrix stored?
  • Row major (row by row)
  • Column major (column by column)
• A 512×512 matrix needs 1 MB, much bigger than a level-1 cache, so it doesn't fit in the cache!
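A concrete note on the two layouts (standard C indexing, spelled out here for the code below): in a row-major n×n array, element (i, j) lives at a[i*n + j]; in a column-major layout it lives at a[i + j*n]. The DGEMM code below uses column-major indexing, as its comments show.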
Software Optimization via Blocking
• How is matrix multiplication done? The slide shows only the inner j loop; the enclosing function (named dgemm here) and the i loop are restored so the code is complete:

void dgemm(int n, double *A, double *B, double *C)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
        {
            double cij = C[i+j*n];            /* cij = C[i][j] */
            for (int k = 0; k < n; k++)
                cij += A[i+k*n] * B[k+j*n];   /* cij += A[i][k]*B[k][j] */
            C[i+j*n] = cij;                   /* C[i][j] = cij */
        }
}
Software Optimization via Blocking
Do we need to store all three matrices in the cache at once? Isn't that increasing cache misses due to replacement?
[Figure legend: white = not accessed; light grey = old access; dark grey = new access.]
Software Optimization via Blocking: Blocked DGEMM
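A sketch of the blocked version (in the spirit of the textbook's blocked DGEMM; BLOCKSIZE, do_block, and dgemm_blocked are names chosen here, and n is assumed to be a multiple of BLOCKSIZE). It computes C one BLOCKSIZE×BLOCKSIZE submatrix at a time, so the submatrices of A, B, and C being worked on all fit in the cache together:

#define BLOCKSIZE 32

/* Accumulate one BLOCKSIZE x BLOCKSIZE block of C, starting at (si, sj),
   using the block of A starting at (si, sk) and the block of B at (sk, sj). */
static void do_block(int n, int si, int sj, int sk,
                     double *A, double *B, double *C)
{
    for (int i = si; i < si + BLOCKSIZE; ++i)
        for (int j = sj; j < sj + BLOCKSIZE; ++j)
        {
            double cij = C[i+j*n];            /* cij = C[i][j] */
            for (int k = sk; k < sk + BLOCKSIZE; ++k)
                cij += A[i+k*n] * B[k+j*n];   /* cij += A[i][k]*B[k][j] */
            C[i+j*n] = cij;                   /* C[i][j] = cij */
        }
}

void dgemm_blocked(int n, double *A, double *B, double *C)
{
    for (int sj = 0; sj < n; sj += BLOCKSIZE)
        for (int si = 0; si < n; si += BLOCKSIZE)
            for (int sk = 0; sk < n; sk += BLOCKSIZE)
                do_block(n, si, sj, sk, A, B, C);
}

Because each do_block call touches only about 3 × BLOCKSIZE² doubles, a block size of 32 keeps the working set around 24 KB, small enough for a typical L1 cache, so the same data is reused many times before being evicted.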
Thank you