Architecture
Lecture 6
Introduction to Caches
• Cache
– is a small, very fast memory (SRAM, expensive)
– contains copies of the most recently accessed memory locations (data and instructions): temporal locality
– is fully managed by hardware (unlike virtual memory)
– storage is organized in blocks of contiguous memory locations: spatial locality
– unit of transfer to/from main memory (or L2) is the cache block
• General structure
– n blocks per cache, organized in s sets
– b bytes per block
– total cache size is n*b bytes
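A minimal C sketch of this geometry, just to fix the names (the struct and function are illustrative, not from the lecture):

struct cache_geometry {
    unsigned n_blocks;    /* n: total blocks in the cache               */
    unsigned n_sets;      /* s: sets; associativity = n_blocks / n_sets */
    unsigned block_bytes; /* b: bytes per block                         */
};

/* Total capacity in bytes: n * b. */
unsigned cache_size_bytes(const struct cache_geometry *g)
{
    return g->n_blocks * g->block_bytes;
}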
[Figure: word transfers between the CPU and the cache]
Cache operation - overview
• CPU requests contents of a memory location
• Check cache for this data
• If present, get from cache (fast)
• If not present, read the required block from main memory into the cache
• Then deliver from cache to CPU
• Cache includes tags to identify which block of main memory is in each cache slot
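A minimal C sketch of this lookup flow for a direct-mapped cache; the sizes and names (NUM_SLOTS, cache_read, main_memory) are assumptions for illustration:

#include <stdbool.h>
#include <stdint.h>

#define NUM_SLOTS   1024
#define BLOCK_BYTES 64

struct slot {
    bool     valid;              /* does this slot hold a block?  */
    uint32_t tag;                /* which memory block it holds   */
    uint8_t  data[BLOCK_BYTES];  /* copy of that block            */
};

static struct slot cache[NUM_SLOTS];

/* Read one byte at 'addr', filling the cache from memory on a miss. */
uint8_t cache_read(uint32_t addr, const uint8_t *main_memory)
{
    uint32_t block  = addr / BLOCK_BYTES;   /* memory block number */
    uint32_t index  = block % NUM_SLOTS;    /* which cache slot    */
    uint32_t offset = addr % BLOCK_BYTES;   /* byte within block   */
    struct slot *s = &cache[index];

    if (!s->valid || s->tag != block) {     /* miss: fetch block   */
        for (unsigned i = 0; i < BLOCK_BYTES; i++)
            s->data[i] = main_memory[block * BLOCK_BYTES + i];
        s->tag   = block;
        s->valid = true;
    }
    return s->data[offset];                 /* deliver from cache  */
}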
Cache Design
• Size
• Mapping Function
• Replacement Algorithm
• Write Policy
• Block Size
• Number of Caches
Size does matter
• Cost
– More cache is expensive
• Speed
– More cache gives a higher hit rate, so faster average access (up to a point)
– But checking a larger cache for data takes more time
Cache Organization
[Figure: 16 memory blocks mapped onto a 4-line direct-mapped cache; the memory block address splits into a tag and an index, so blocks 0010, 0110, 1010, and 1110 (binary) all share the same cache index and are distinguished only by their tags]
Direct-mapped Cache
• The direct-mapped cache is simple to design and its access time is fast (Why? Each address maps to exactly one slot, so only one tag comparison is needed)
• Good for L1 (on-chip) cache
• Problem: conflict misses, so a low hit ratio
– Conflict misses are misses caused by accessing different memory locations that are mapped to the same cache index
– In a direct-mapped cache there is no flexibility in where a memory block can be placed, which contributes to conflict misses
Direct Mapping
• Each block of main memory maps to only one cache line
– i.e. if a block is in the cache, it must be in one specific place
• The address is in two parts
• The least significant w bits identify a unique word within a block
• The most significant s bits specify one memory block
• The MSBs are further split into a cache line field of r bits and a tag of s-r bits (the most significant bits)
Direct Mapping
Address Structure

| Tag: 8 bit | Line: 14 bit | Word: 2 bit |

• 24 bit address
• 2 bit word identifier (4 byte block)
• 22 bit block identifier
– 8 bit tag (= 22 - 14)
– 14 bit slot or line
• No two blocks that map to the same line have the same tag field
• Check the contents of the cache by finding the line and comparing the tag
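A few C helpers showing this exact 8/14/2 split (the function names are mine, not the lecture's):

#include <stdint.h>

/* 24-bit address = | tag: 8 | line: 14 | word: 2 | */
enum { WORD_BITS = 2, LINE_BITS = 14 };

uint32_t word_of(uint32_t addr) { return addr & ((1u << WORD_BITS) - 1); }
uint32_t line_of(uint32_t addr) { return (addr >> WORD_BITS) & ((1u << LINE_BITS) - 1); }
uint32_t tag_of (uint32_t addr) { return addr >> (WORD_BITS + LINE_BITS); }

For example, the 24-bit address 0x123456 decomposes into tag 0x12, line 0x0D15, and word 2.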
Direct Mapping Example
[Figure omitted: worked example of direct mapping]
Direct Mapping pros & cons
• Simple
• Inexpensive
• Fixed location for a given block
– If a program repeatedly accesses 2 blocks that map to the same line, cache misses are very high (the blocks keep evicting each other)
Associative Mapping
• A main memory block can load into any line of the cache
• The memory address is interpreted as a tag and a word
• Tag uniquely identifies block of memory
• Every line’s tag is examined for a match
• Cache searching gets expensive
Fully Associative Cache
• Must search all tags in the cache, as an item can be in any cache block
• The search for a tag must be done by hardware in parallel (a serial search would be too slow)
• But the necessary parallel comparator hardware is very expensive
• Therefore, fully associative placement is practical only for a very small cache
Associative Mapping
Address Structure

| Tag: 22 bit | Word: 2 bit |

Set Associative Mapping
Address Structure

| Tag: 9 bit | Set: 13 bit | Word: 2 bit |
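A hedged C sketch of a lookup using this 9/13/2 split; the associativity (2 ways here) and all names are assumptions for illustration:

#include <stdbool.h>
#include <stdint.h>

#define WAYS     2              /* assumed associativity */
#define NUM_SETS (1u << 13)     /* 13-bit set field      */

struct line { bool valid; uint32_t tag; };
static struct line cache[NUM_SETS][WAYS];

/* Hit test for a 24-bit address split as | tag: 9 | set: 13 | word: 2 |. */
bool cache_hit(uint32_t addr)
{
    uint32_t set = (addr >> 2) & (NUM_SETS - 1);  /* skip the 2 word bits  */
    uint32_t tag = addr >> 15;                    /* above word + set bits */

    /* Hardware compares the tags of all ways in the set in parallel. */
    for (unsigned w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return true;
    return false;
}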
[Figure: write buffer between the cache and L2]
Avg. Memory Access Time vs. Miss Rate
• Associativity reduces the miss rate, but increases the hit time due to the increase in hardware complexity!
• Since average memory access time = hit time + miss rate × miss penalty, the lower miss rate pays off only if it outweighs the longer hit time
Unified vs Split Caches
Split cache
• Separate I-cache optimized for the instruction stream
• Separate D-cache optimized for read+write
• Can independently tune the caches
• Provides increased bandwidth via replication (2 caches accessed in parallel)
Unified cache
• Single cache holds both instructions and data
• More flexible for changing instruction & data locality
• No problem with instruction modification (self-modifying code, etc.)
• Increased cost to provide enough bandwidth for instruction + data access every clock cycle
– Need dual-ported memory, or must cycle the cache at 2× the clock speed
– Alternately, low-cost designs can take an extra clock for loads/stores, since they don't happen for every instruction
Review: Four Questions for Memory
Hierarchy Designers
• Block placement
– Fully Associative, Set Associative, Direct Mapped
• Block identification
– Tag/Block
• Block replacement
– Random, LRU
• Write strategy
– Write Back or Write Through (with Write Buffer)
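To make the replacement question concrete, a minimal LRU sketch for one set (an illustrative sketch; real controllers typically use cheaper approximations):

#include <stdint.h>

#define WAYS 4

/* Per-set LRU bookkeeping: a larger counter means more recently used. */
struct set_lru {
    uint32_t last_use[WAYS];
    uint32_t clock;
};

/* Call on every hit or fill of 'way' in this set. */
void lru_touch(struct set_lru *s, unsigned way)
{
    s->last_use[way] = ++s->clock;
}

/* On a miss, evict the least recently used way. */
unsigned lru_victim(const struct set_lru *s)
{
    unsigned v = 0;
    for (unsigned w = 1; w < WAYS; w++)
        if (s->last_use[w] < s->last_use[v])
            v = w;
    return v;
}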
Intel Cache Evolution

Problem: External memory slower than the system bus.
Solution: Add external cache using faster memory technology.
First appears: 386

Problem: Increased processor speed results in the external bus becoming a bottleneck for cache access.
Solution: Move external cache on-chip, operating at the same speed as the processor.
First appears: 486

Problem: Internal cache is rather small, due to limited space on chip.
Solution: Add external L2 cache using faster technology than main memory.
First appears: 486

Problem: Contention occurs when both the Instruction Prefetcher and the Execution Unit simultaneously require access to the cache. In that case, the Prefetcher is stalled while the Execution Unit's data access takes place.
Solution: Create separate data and instruction caches.
First appears: Pentium

Problem: Some applications deal with massive databases and must have rapid access to large amounts of data. The on-chip caches are too small.
Solution: Add external L3 cache.
First appears: Pentium III

Problem: (same as above)
Solution: Move L3 cache on-chip.
First appears: Pentium 4
More Caching Concepts
• Hit Rate: the percentage of memory accesses found in a level of the memory hierarchy
• Hit Time: time to access that level, which consists of:
– time to determine hit/miss + time to access the block
• Miss Rate: the percentage of memory accesses not found in a level of the memory hierarchy, that is, 1 - (Hit Rate)
• Miss Penalty: time to replace a block in that level with the corresponding block from a lower level, which consists of:
– time to access the block in the lower level
– + time to transmit that block to the level that experienced the miss
– + time to insert the block in that level
– + time to pass the block to the requester
• Hit Time « Miss Penalty
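These definitions combine into the average memory access time, AMAT = Hit Time + Miss Rate × Miss Penalty; a tiny C check with made-up numbers:

#include <stdio.h>

int main(void)
{
    double hit_time     = 1.0;    /* cycles (assumed)              */
    double miss_rate    = 0.05;   /* 5% of accesses miss (assumed) */
    double miss_penalty = 100.0;  /* cycles to fetch from below    */

    /* AMAT = hit time + miss rate * miss penalty */
    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.1f cycles\n", amat);   /* prints 6.0 */
    return 0;
}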