
Computer Architecture
Lecture 6: Introduction to Caches
• Cache
– is a small, very fast memory (SRAM, expensive)
– contains copies of the most recently accessed memory locations (data and instructions): temporal locality
– is fully managed by hardware (unlike virtual memory)
– storage is organized in blocks of contiguous memory locations: spatial locality
– the unit of transfer to/from main memory (or L2) is the cache block
• General structure
– n blocks per cache, organized in s sets
– b bytes per block
– total cache size: n*b bytes
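To make the geometry concrete, here is a minimal sketch (the helper name and the assumption of a byte-addressed machine with power-of-two sizes are illustrative, not from the lecture):

```python
# Minimal sketch: derive cache geometry from n blocks, b bytes/block, s sets.
# Assumes a byte-addressed machine and power-of-two sizes (illustrative only).

def cache_geometry(n_blocks, block_bytes, n_sets, addr_bits):
    total_bytes = n_blocks * block_bytes          # total cache capacity
    ways = n_blocks // n_sets                     # associativity (blocks per set)
    offset_bits = block_bytes.bit_length() - 1    # log2(b): byte within a block
    index_bits = n_sets.bit_length() - 1          # log2(s): which set
    tag_bits = addr_bits - index_bits - offset_bits
    return total_bytes, ways, offset_bits, index_bits, tag_bits

# Example: 64 kByte cache, 4-byte blocks, direct mapped (one block per set), 24-bit addresses
print(cache_geometry(16 * 1024, 4, 16 * 1024, 24))   # (65536, 1, 2, 14, 8)
```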
Cache
[Figure: the CPU reads and writes individual words from the cache; the cache transfers whole blocks to and from main memory.]
Cache operation - overview
• CPU requests contents of memory location
• Check cache for this data
• If present, get from cache (fast)
• If not present, read required block from main
memory to cache
• Then deliver from cache to CPU
• Cache includes tags to identify which block of
main memory is in each cache slot
Cache Design
• Size
• Mapping Function
• Replacement Algorithm
• Write Policy
• Block Size
• Number of Caches
Size does matter
• Cost
– More cache is expensive
• Speed
– More cache is faster (up to a point)
– Checking cache for data takes time
Cache Organization

• In a direct mapped cache, each memory address is associated with one possible block within the cache
– Therefore, we only need to look in a single location in the cache for the data, if it exists in the cache
Mapping Function
• Cache of 64 kBytes
• Cache block of 4 bytes
– i.e. cache is 16k (2^14) lines of 4 bytes
• 16 MBytes main memory
• 24 bit address
– (2^24 = 16M)
Simplest Cache: Direct Mapped

[Figure: a 16-block main memory mapped onto a 4-block direct mapped cache. The memory block address splits into a tag and a cache index; memory blocks 0010, 0110, 1010 and 1110 all share cache index 2.]
Direct-mapped Cache (contd.)
• The direct mapped cache is simple to design and its access time is fast (Why?)
• Good for L1 (on-chip) cache
• Problem: Conflict Misses, so a lower hit ratio
– Conflict Misses are misses caused by accessing different memory locations that are mapped to the same cache index
– In a direct mapped cache there is no flexibility in where a memory block can be placed, contributing to conflict misses
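A tiny direct mapped cache simulator (illustrative Python, not the lecture's code) showing the conflict-miss ping-pong described above:

```python
# Tiny direct mapped cache simulator (illustrative sketch).
# 4 lines, 4-byte blocks: index = bits [3:2] of the address, tag = the rest.

NUM_LINES, BLOCK_BYTES = 4, 4
cache = [None] * NUM_LINES            # each entry holds the tag of the resident block, or None

def access(addr):
    block = addr // BLOCK_BYTES
    index = block % NUM_LINES
    tag = block // NUM_LINES
    if cache[index] == tag:
        return "hit"
    cache[index] = tag                # miss: fetch the block, evicting whatever was there
    return "miss"

# Addresses 0x08 and 0x28 differ only in the tag, so they fight over index 2.
for addr in [0x08, 0x28, 0x08, 0x28]:
    print(hex(addr), access(addr))    # miss, miss, miss, miss: a conflict-miss ping-pong
```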
Direct Mapping
• Each block of main memory maps to only one
cache line
– i.e. if a block is in cache, it must be in one specific place
• Address is in two parts
• Least Significant w bits identify unique word
• Most Significant s bits specify one memory block
• The MSBs are split into a cache line field r and a
tag of s-r (most significant)
Direct Mapping
Address Structure

Tag (s-r): 8 bits | Line or Slot (r): 14 bits | Word (w): 2 bits
• 24 bit address
• 2 bit word identifier (4 byte block)
• 22 bit block identifier
– 8 bit tag (=22-14)
– 14 bit slot or line
• No two blocks in the same line have the same Tag field
• Check contents of cache by finding line and checking Tag
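A minimal sketch of this 8/14/2 split (illustrative Python; the constants match the 64 kByte / 4-byte-block example above):

```python
# Split a 24-bit address into tag (8), line (14) and word (2) fields (illustrative sketch).
def split_direct_mapped(addr):
    word = addr & 0x3                 # low 2 bits: byte within the 4-byte block
    line = (addr >> 2) & 0x3FFF       # next 14 bits: which of the 16k cache lines
    tag = (addr >> 16) & 0xFF         # top 8 bits: stored in the line to identify the block
    return tag, line, word

print([hex(f) for f in split_direct_mapped(0xFFFFFC)])   # ['0xff', '0x3fff', '0x0']
```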
Direct Mapping Example
Direct Mapping pros & cons
• Simple
• Inexpensive
• Fixed location for given block
– If a program accesses 2 blocks that map to the
same line repeatedly, cache misses are very high
Associative Mapping
• A main memory block can load into any line of
cache
• Memory address is interpreted as tag and
word
• Tag uniquely identifies block of memory
• Every line’s tag is examined for a match
• Cache searching gets expensive
Fully Associative Cache
• Must search all tags in cache, as item can be
in any cache block
• Search for tag must be done by hardware in
parallel (other searches too slow)
• But, the necessary parallel comparator
hardware is very expensive
• Therefore, fully associative placement
practical only for a very small cache
Associative Mapping
Address Structure
Tag: 22 bits | Word: 2 bits

• 22 bit tag stored with each 32 bit block of data
• Compare tag field with each tag entry in cache to check for hit
• Least significant 2 bits of address identify which byte is required from the 32 bit data block
• e.g.
– Address FFFFFC: Tag 3FFFFF, Data 24682468, Cache line 3FFF
Associative Mapping Example
Another Extreme: Fully Associative
• Fully Associative Cache (8 word block)
– Omit cache index; place item in any block!
– Compare all Cache Tags in parallel
• By definition: Conflict Misses = 0 for a fully associative cache
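A software model of the fully associative lookup (illustrative only; a dict stands in for the parallel tag comparators the hardware would use):

```python
# Fully associative cache model: the tag is the whole block address,
# and every stored tag must be compared (hardware does this in parallel).
BLOCK_BYTES = 4
memory = {tag: bytes([tag & 0xFF] * BLOCK_BYTES) for tag in range(2**10)}  # fake backing store
cache = {}                               # tag -> block data

def read(addr):
    tag, offset = addr // BLOCK_BYTES, addr % BLOCK_BYTES
    if tag not in cache:                 # "search all tags"; any block can hold the data
        cache[tag] = memory[tag]         # miss: fetch the whole block
    return cache[tag][offset]

print(read(0x0FFC), read(0x0FFD))        # same block: the second access hits
```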
Compromise: N-way Set Associative
Cache
• N-way set associative:
N cache blocks for each Cache Index
– Like having N direct mapped caches operating in
parallel
– Select the one that gets a hit
• Example: 2-way set associative cache
– Cache Index selects a “set” of 2 blocks from the
cache
– The 2 tags in set are compared in parallel
– Data is selected based on the tag result (which
matched the address)
Set Associative Mapping
• Cache is divided into a number of sets
• Each set contains a number of lines
• A given block maps to any line in a given set
– e.g. Block B can be in any line of set i
• e.g. 2 lines per set
– 2 way associative mapping
– A given block can be in one of 2 lines in only one
set
Set Associative Cache (contd.)
• Direct Mapped and Fully Associative can be seen as just variations of the Set Associative block placement strategy
• Direct Mapped = 1-way Set Associative Cache
• Fully Associative = n-way Set Associative for a cache with exactly n blocks
Set Associative Mapping
Address Structure

Tag: 9 bits | Set: 13 bits | Word: 2 bits

• Use set field to determine which cache set to look in
• Compare tag field of each line in that set to see if we have a hit
• e.g.
– Address 1FF 7FFC: Tag 1FF, Data 12345678, Set number 1FFF
– Address 001 7FFC: Tag 001, Data 11223344, Set number 1FFF
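A minimal sketch of this 9/13/2 split (illustrative Python), checking that both example addresses above land in the same set with different tags:

```python
# Split a 24-bit address into tag (9), set (13) and word (2) fields (illustrative sketch).
def split_set_associative(addr):
    word = addr & 0x3
    set_index = (addr >> 2) & 0x1FFF      # 13-bit set field
    tag = (addr >> 15) & 0x1FF            # 9-bit tag field
    return tag, set_index, word

# "1FF 7FFC" and "001 7FFC" written as full 24-bit values:
for addr in (0xFFFFFC, 0x00FFFC):
    tag, s, w = split_set_associative(addr)
    print(hex(addr), hex(tag), hex(s), hex(w))   # both map to set 0x1fff with different tags
```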
REPLACEMENT ALGORITHM
Replacement Algorithms (1)
Direct mapping
• No choice
• Each block only maps to one line
• Replace that line
Block Replacement Policy
• N-way Set Associative or Fully Associative caches have a choice of where to place a block (and which block to replace)
– Of course, if there is an invalid block, use it
• Whenever we get a cache hit, record the cache block that was touched
• When we need to evict a cache block, choose one which hasn't been touched recently: “Least Recently Used” (LRU)
– Past is prologue: history suggests it is the least likely of the choices to be used soon
– Flip side of temporal locality
Replacement Algorithms (2)
Associative & Set Associative
• Hardware implemented algorithm (speed)
• Least Recently used (LRU)
• e.g. in a 2-way set associative cache
– Which of the 2 blocks is the LRU?
• First in first out (FIFO)
– replace block that has been in cache longest
• Least frequently used
– replace block which has had fewest hits
• Random
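A minimal sketch (illustrative Python, not the lecture's code) of LRU replacement in a 2-way set associative cache:

```python
from collections import OrderedDict

# 2-way set associative cache with LRU replacement (illustrative sketch).
NUM_SETS, WAYS, BLOCK_BYTES = 4, 2, 4
sets = [OrderedDict() for _ in range(NUM_SETS)]   # per set: tag -> None, oldest first

def access(addr):
    block = addr // BLOCK_BYTES
    index, tag = block % NUM_SETS, block // NUM_SETS
    s = sets[index]
    if tag in s:
        s.move_to_end(tag)                        # hit: mark as most recently used
        return "hit"
    if len(s) == WAYS:
        s.popitem(last=False)                     # full set: evict the least recently used tag
    s[tag] = None
    return "miss"

# Three blocks that all map to set 0; with 2 ways, LRU evicts the stalest one.
for addr in [0x00, 0x10, 0x00, 0x20, 0x10]:
    print(hex(addr), access(addr))                # miss, miss, hit, miss (evicts 0x10), miss
```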
WRITE POLICY
Write Policy
• Must not overwrite a cache block unless main
memory is up to date
• Multiple CPUs may have individual caches
• I/O may address main memory directly
Write through
• All writes go to main memory as well as cache
• Multiple CPUs can monitor main memory
traffic to keep local (to CPU) cache up to date
• Lots of traffic
• Slows down writes

• Remember bogus write-through caches!
Write back
• Updates initially made in cache only
• Update bit for cache slot is set when update
occurs
• If block is to be replaced, write to main
memory only if update bit is set
• Other caches get out of sync
• I/O must access main memory through cache
• N.B. 15% of memory references are writes
Write Policy:
Write-Through vs Write-Back
• Write-through: all writes update cache and underlying memory/cache
– Can always discard cached data - most up-to-date data is in memory
– Cache control bit: only a valid bit
• Write-back: all writes simply update cache
– Can’t just discard cached data - may have to write it back to memory
– Flagged write-back
– Cache control bits: both valid and dirty bits
• Other Advantages:
– Write-through:
• memory (or other processors) always have latest data
• Simpler management of cache
– Write-back:
• Needs much lower bus bandwidth due to infrequent access
• Better tolerance to long-latency memory?
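A minimal sketch of the write-back bookkeeping with valid and dirty bits (the class and helper names are illustrative, not a real implementation):

```python
# Write-back cache line with valid and dirty bits (illustrative sketch).
class Line:
    def __init__(self):
        self.valid = False
        self.dirty = False
        self.tag = None
        self.data = None

def write(cache, memory, index, tag, data):
    line = cache[index]
    if line.valid and line.tag != tag and line.dirty:
        memory[line.tag] = line.data      # evicting a dirty line: write it back first
    line.valid, line.dirty = True, True   # write-back: update the cache only, mark dirty
    line.tag, line.data = tag, data

memory = {}
cache = [Line() for _ in range(4)]
write(cache, memory, 0, tag=1, data="A")  # stays in the cache; memory not touched yet
write(cache, memory, 0, tag=2, data="B")  # conflict: dirty block 1 is written back
print(memory)                             # {1: 'A'}
```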
Write Through: Write Allocate vs Non-Allocate
• Write allocate: allocate a new cache line in the cache
– Usually means that you have to do a “read miss” to fill in the rest of the cache line!
– Alternative: per-word valid bits
• Write non-allocate (or “write-around”):
– Simply send write data through to the underlying memory/cache; don't allocate a new cache line!
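A sketch of the two write-miss policies (illustrative Python; block granularity and the backing-store model are simplified):

```python
# On a write miss: write-allocate fills a cache line, write-around bypasses the cache (sketch).
BLOCK_WORDS = 4

def write(cache, memory, addr, value, write_allocate=True):
    block, offset = addr // BLOCK_WORDS, addr % BLOCK_WORDS
    if block in cache:
        cache[block][offset] = value                                  # write hit: update cached block
    elif write_allocate:
        cache[block] = list(memory.get(block, [0] * BLOCK_WORDS))     # "read miss" to fill the rest
        cache[block][offset] = value
    else:
        memory.setdefault(block, [0] * BLOCK_WORDS)[offset] = value   # write-around: straight to memory

cache, memory = {}, {}
write(cache, memory, 5, "A", write_allocate=True)    # miss: block 1 is allocated in the cache
write(cache, memory, 9, "B", write_allocate=False)   # miss: block 2 goes straight to memory
print(cache, memory)   # {1: [0, 'A', 0, 0]} {2: [0, 'B', 0, 0]}
```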
Write Buffers
• Write Buffers (for write-through)
– buffer words to be written to the L2 cache/memory, along with their addresses
– 2 to 4 entries deep
– all read misses are checked against pending writes for dependencies (associatively)
– allow reads to proceed ahead of writes
– can coalesce writes to the same address
• Write-back Buffers
– sit between a write-back cache and L2 or main memory
– algorithm: move the dirty block to the write-back buffer, read the new block, then write the dirty block to L2 or main memory
– can be associated with a victim cache (later)
[Figure: the write buffer sits between the L1 cache (which serves the CPU) and L2.]
Avg. Memory Access Time vs. Miss Rate
• Associativity reduces the miss rate, but increases hit time due to the increase in hardware complexity!
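The usual figure of merit behind this trade-off is AMAT = Hit Time + Miss Rate × Miss Penalty (the standard formula, not stated explicitly on the slide); a quick numeric check with illustrative numbers:

```python
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty   # average memory access time

# Illustrative numbers only: a more associative cache with a slower hit but fewer misses.
print(amat(1.0, 0.05, 100))   # 6.0 cycles
print(amat(1.2, 0.03, 100))   # 4.2 cycles: the lower miss rate can win despite the slower hit
```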
Unified vs Split Caches
Split cache
• Separate I-cache optimized for Instruction stream
• Separate D-cache optimized for read+write
• Can independently tune caches
• Provides increased bandwidth via replication (2
caches accessed in parallel)
Unified cache
• Single cache holds both Instructions and Data
• More flexible for changing instruction & data locality
• No problem with instruction modification (self-modifying code,
etc.)
• Increased cost to provide enough bandwidth for instruction + data every clock cycle
– Need dual-ported memory, or cycle the cache at 2x clock speed
– Alternately, low-cost designs can take an extra clock for loads/stores; they don't happen for every instruction
Review: Four Questions for Memory
Hierarchy Designers

• Block placement
– Fully Associative, Set Associative, Direct Mapped
• Block identification
– Tag/Block
• Block replacement
– Random, LRU
• Write strategy
– Write Back or Write Through (with Write Buffer)
Intel Cache Evolution

Problem → Solution (processor on which the feature first appears)

• External memory slower than the system bus.
→ Add external cache using faster memory technology. (386)
• Increased processor speed results in the external bus becoming a bottleneck for cache access.
→ Move the external cache on-chip, operating at the same speed as the processor. (486)
• Internal cache is rather small, due to limited space on the chip.
→ Add external L2 cache using faster technology than main memory. (486)
• Contention occurs when both the Instruction Prefetcher and the Execution Unit simultaneously require access to the cache; the Prefetcher is stalled while the Execution Unit's data access takes place.
→ Create separate data and instruction caches. (Pentium)
• Increased processor speed results in the external bus becoming a bottleneck for L2 cache access.
→ Create a separate back-side bus (BSB) that runs at a higher speed than the main (front-side) external bus; the BSB is dedicated to the L2 cache. (Pentium Pro)
→ Move the L2 cache onto the processor chip. (Pentium II)
• Some applications deal with massive databases and must have rapid access to large amounts of data; the on-chip caches are too small.
→ Add external L3 cache. (Pentium III)
→ Move the L3 cache on-chip. (Pentium 4)
More Caching Concepts
• Hit Rate: the percentage of memory accesses found in a level of the memory hierarchy
• Hit Time: time to access that level, which consists of:
– time to determine hit/miss + time to access the block
• Miss Rate: the percentage of memory accesses not found in a level of the memory hierarchy, that is, 1 - (Hit Rate)
• Miss Penalty: time to replace a block in that level with the corresponding block from a lower level, which consists of:
– time to access the block in the lower level
+ time to transmit that block to the level that experienced the miss
+ time to insert the block in that level
+ time to pass the block to the requester
• Hit Time << Miss Penalty
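Putting the pieces together with illustrative numbers (not from the slide):

```python
# Miss penalty assembled from its components (cycles; illustrative numbers only).
access_lower   = 60   # time to access the block in the lower level
transmit_block = 20   # time to transmit the block to the level that missed
insert_block   = 2    # time to insert the block in that level
pass_to_req    = 1    # time to pass the block to the requester

miss_penalty = access_lower + transmit_block + insert_block + pass_to_req
hit_time, miss_rate = 1, 0.02
print(miss_penalty)                          # 83 cycles, far larger than the 1-cycle hit time
print(hit_time + miss_rate * miss_penalty)   # about 2.66 cycles average access time
```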
