

Memory Hierarchy Design


5.1 Introduction
5.2 Review of the ABCs of Caches
5.3 Cache Performance
5.4 Reducing Cache Miss Penalty
5.5 Reducing Cache Miss Rate
5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism
5.7 Reducing Hit Time
5.8 Main Memory and Organizations for Improving Performance
5.9 Memory Technology
5.10 Virtual Memory
5.11 Protection and Examples of Virtual Memory

The five classic components of a computer:

Input, Output, Memory, Datapath, and Control (the datapath and control together form the processor)


Where do we fetch instructions to execute?

Build a memory hierarchy that includes caches and main memory (internal memory) and hard disk (external memory).
Instructions are first fetched from external storage such as a hard disk and kept in main memory.
Before they reach the CPU, they are usually copied into the caches.

Technology Trends

        Capacity          Speed (latency)
CPU:    2x in 1.5 years   2x in 1.5 years
DRAM:   4x in 3 years     2x in 10 years
Disk:   4x in 3 years     2x in 10 years

Year    Size      Cycle Time
1980    64 Kb     250 ns
1983    256 Kb    220 ns
1986    1 Mb      190 ns
1989    4 Mb      165 ns
1992    16 Mb     145 ns
1995    64 Mb     120 ns
2000    256 Mb    100 ns



Performance Gap between CPUs and Memory

CPU (improvement ratio): 1.35X/yr, then 1.55X/yr
Memory: 7%/yr

The gap (latency) grows about 50% per year!


Memory Hierarchy
Levels of the Memory Hierarchy
Level            Capacity     Access Time
CPU Registers    500 bytes    0.25 ns
Cache            64 KB        1 ns
Main Memory      512 MB       100 ns
Disk             100 GB       5 ms
I/O Devices      ???

Upper levels are faster; lower levels are larger.
Transfer units shown in the figure include pages and files.





ABCs of Caches
Cache:
  In this textbook it mainly means the first level of the memory hierarchy encountered once the address leaves the CPU
  The term is also applied whenever buffering is employed to reuse commonly occurring items, e.g., file caches, name caches, and so on
Principle of Locality: Programs access a relatively small portion of the address space at any instant of time.
Two Different Types of Locality:
  Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)

Memory Hierarchy: Terminology

Hit: data appears in some block in the cache (example: Block X)
  Hit Rate: the fraction of memory accesses found in the cache
  Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
Miss: data needs to be retrieved from a block in main memory (Block Y)
  Miss Rate = 1 - (Hit Rate)
  Miss Penalty: time to replace a block in the cache + time to deliver the block to the processor
Hit Time << Miss Penalty (e.g., 1 clock cycle vs. 40 clock cycles)
[Figure: data moves to and from the processor through the cache (holding Blk X); blocks move between the cache and main memory (holding Blk Y).]

Cache Measures
CPU execution time incorporating cache performance:
  CPU execution time = (CPU clock cycles + Memory stall cycles) * Clock cycle time
Memory stall cycles: number of cycles during which the CPU is stalled waiting for a memory access
  Memory stall clock cycles = Number of misses * Miss penalty
    = IC * (Misses/Instruction) * Miss penalty
    = IC * (Memory accesses/Instruction) * Miss rate * Miss penalty
    = IC * Reads per instruction * Read miss rate * Read miss penalty
      + IC * Writes per instruction * Write miss rate * Write miss penalty
Memory accesses consist of fetching instructions and reading/writing data

P.395 Example
Example: Assume we have a computer where the CPI is 1.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all memory accesses hit in the cache?

(A) If all accesses hit in the cache, CPI = 1.0 and there are no memory stalls:
  CPU(A) = (IC*CPI + 0) * Clock cycle time = IC * Clock cycle time
(B) With a 2% miss rate, the base CPI is still 1.0, but the memory stalls must be added:
  Memory stall cycles = IC * (Memory accesses/Instruction) * Miss rate * Miss penalty = IC * (1 + 50%) * 2% * 25 = IC * 0.75
  CPU(B) = (IC + IC*0.75) * Clock cycle time = 1.75 * IC * Clock cycle time
The performance ratio is the inverse of the ratio of CPU execution times: CPU(B)/CPU(A) = 1.75
The computer with no cache misses is 1.75 times faster.
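The stall-cycle formula used in this example can be sketched in a few lines of C. This is only an illustrative sketch: the function names `stall_cycles_per_instr` and `effective_cpi` are hypothetical, and all quantities are per instruction, so IC and the clock cycle time cancel out of the speedup.

```c
#include <assert.h>

/* Memory stall cycles per instruction:
   (memory accesses / instruction) * miss rate * miss penalty.
   Function names here are hypothetical, chosen for illustration. */
double stall_cycles_per_instr(double accesses_per_instr, double miss_rate,
                              double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return accesses_per_instr * miss_rate * miss_penalty;
}

/* Effective CPI = CPI with all hits + memory stall cycles per instruction. */
double effective_cpi(double base_cpi, double accesses_per_instr,
                     double miss_rate, double miss_penalty)
{
    return base_cpi +
           stall_cycles_per_instr(accesses_per_instr, miss_rate, miss_penalty);
}
```

Calling `effective_cpi(1.0, 1.5, 0.02, 25.0)` reproduces the 1.75 of case (B): one instruction fetch plus 0.5 data accesses per instruction, a 2% miss rate, and a 25-cycle penalty.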

Four Memory Hierarchy Questions

Q1 (block placement): Where can a block be placed in the upper level?
Q2 (block identification): How is a block found if it is in the upper level?
Q3 (block replacement): Which block should be replaced on a miss?
Q4 (write strategy): What happens on a write?


Q1(block placement): Where can a block be placed?

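Direct mapped: (Block number) mod (Number of blocks in cache)
Set associative: (Block number) mod (Number of sets in cache)
  Number of sets <= number of blocks
  n-way: n blocks in a set
  1-way = direct mapped
Fully associative: number of sets = 1

Example: block 12 placed in an 8-block cache
  Direct mapped: block frame 12 mod 8 = 4
  2-way set associative (4 sets): set 12 mod 4 = 0
  Fully associative: any of the 8 block frames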
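The three placement rules differ only in the number of sets, so one function can cover all of them. The sketch below assumes the cache size in blocks is a multiple of the associativity; the name `set_index` is hypothetical.

```c
#include <assert.h>

/* Set index for a block in a cache with num_blocks block frames and the
   given associativity (ways). ways == 1 is direct mapped; ways == num_blocks
   is fully associative (a single set, so the index is always 0).
   set_index is a hypothetical name used for illustration. */
unsigned set_index(unsigned block_number, unsigned num_blocks, unsigned ways)
{
    assert(ways > 0 && num_blocks >= ways && num_blocks % ways == 0);
    unsigned num_sets = num_blocks / ways;
    return block_number % num_sets;
}
```

For block 12 in an 8-block cache this gives frame 4 when direct mapped, set 0 when 2-way set associative, and set 0 (the only set) when fully associative.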


Simplest Cache: Direct Mapped (1-way)

[Figure: memory blocks 0 through F map onto a 4-block direct-mapped cache with block indices 0 to 3.]

Each block has only one place it can appear in the cache. The mapping is usually (Block address) MOD (Number of blocks in cache)

Example: 1 KB Direct Mapped Cache, 32B Blocks

For a 2^N-byte cache:
  The uppermost (32 - N) bits are always the Cache Tag
  The lowest M bits are the Byte Select (Block Size = 2^M)

Address fields (N = 10, M = 5):
  Bits 31-10   Cache Tag      Example: 0x50
  Bits 9-5     Cache Index    Example: 0x01
  Bits 4-0     Byte Select    Example: 0x00

Cache contents (the Valid Bit and Cache Tag are stored as part of the cache state):
  Index 0:    Byte 0 to Byte 31
  Index 1:    Byte 32 to Byte 63 (Cache Tag 0x50 in the example)
  Index 2, 3, ...
  Index 31:   Byte 992 to Byte 1023
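The field split for this 1 KB cache with 32-byte blocks can be sketched with shifts and masks. The type and function names (`addr_fields`, `split_address`) are hypothetical, and a 32-bit address is assumed as in the example.

```c
#include <assert.h>
#include <stdint.h>

#define OFFSET_BITS 5u   /* 32-byte blocks: 2^5 */
#define INDEX_BITS  5u   /* 1 KB / 32 B = 32 blocks: 2^5 */

/* Hypothetical container for the three address fields. */
typedef struct {
    uint32_t tag;     /* bits 31..10 */
    uint32_t index;   /* bits 9..5   */
    uint32_t offset;  /* bits 4..0 (byte select) */
} addr_fields;

addr_fields split_address(uint32_t addr)
{
    assert(OFFSET_BITS + INDEX_BITS < 32u);
    addr_fields f;
    f.offset = addr & ((1u << OFFSET_BITS) - 1u);
    f.index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
    f.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    return f;
}
```

The address 0x14020 decomposes into tag 0x50, index 0x01, and byte select 0x00, matching the example values.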


Q2 (block identification): How is a block found?

Three portions of an address in a set-associative or direct-mapped cache:
  Block Address (Tag, then Cache/Set Index), followed by the Block Offset (whose width is set by the block size)

The block offset selects the desired data from the block, the index field selects the set, and the tag field is compared against the CPU address for a hit
  Use the Cache Index to select the cache set
  Check the Tag on each block in that set
    No need to check the index or block offset
    A valid bit is added to the Tag to indicate whether or not this entry contains a valid address
  Select the desired bytes using the Block Offset
Increasing associativity => shrinks the index, expands the tag

Example: Two-way set associative cache

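Cache Index selects a set from the cache
The two tags in the set are compared in parallel
Data is selected based on the tag result

[Figure: two-way set-associative cache. The address is split into Cache Tag (bits 31-10, example 0x50), Cache Index (bits 9-5, example 0x01), and Byte Select (bits 4-0, example 0x00). The Cache Index selects one Cache Block and Cache Tag from each way; each stored tag is compared with the address tag (Adr Tag); the compare results drive mux inputs Sel1 and Sel0 to choose the Cache Block and are ORed to produce Hit.]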


Disadvantage of Set Associative Cache

N-way Set Associative Cache vs. Direct Mapped Cache:
  N comparators vs. 1
  Extra MUX delay for the data
  Data comes AFTER Hit/Miss
In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss:
  Possible to assume a hit and continue. Recover later if miss.

[Figure: the two-way set-associative datapath again, with per-way tag compare against Adr Tag, mux selects Sel1 and Sel0, and the Hit and Cache Block outputs.]


Q3 (block replacement): Which block should be replaced on a cache miss?

Easy for Direct Mapped: hardware decisions are simplified
  Only one block frame is checked and only that block can be replaced

Set Associative or Fully Associative
  There are many blocks to choose from on a miss

Three primary strategies for selecting a block to be replaced
  Random: the block is selected randomly
  LRU: the Least Recently Used block is removed
  FIFO (First In, First Out): the oldest block is removed

Data cache misses per 1000 instructions for various replacement strategies

                  2-way                    4-way                    8-way
Size      LRU     Random  FIFO     LRU     Random  FIFO     LRU     Random  FIFO
16 KB     114.1   117.3   115.5    111.7   115.1   113.3    109.0   111.8   110.4
64 KB     103.4   104.3   103.9    102.4   102.3   103.1     99.7   100.5   100.3
256 KB     92.2    92.1    92.5     92.1    92.1    92.5     92.1    92.1    92.5

There is little difference between LRU and random for the largest cache size, with LRU outperforming the others for smaller caches. FIFO generally outperforms random in the smaller cache sizes.
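As a rough illustration of the LRU strategy, the sketch below models one cache set with a per-way timestamp. It is a software model under stated assumptions, not how hardware tracks recency: the associativity `WAYS`, the `cache_set` layout, and the function names are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WAYS 4   /* hypothetical associativity */

typedef struct {
    bool     valid[WAYS];
    uint32_t tag[WAYS];
    uint64_t last_used[WAYS];   /* value of `now` at the most recent access */
    uint64_t now;               /* per-set access counter */
} cache_set;

void set_init(cache_set *s)
{
    assert(s != NULL);
    for (int w = 0; w < WAYS; w++) {
        s->valid[w] = false;
        s->tag[w] = 0;
        s->last_used[w] = 0;
    }
    s->now = 0;
}

/* Access `tag` in the set; returns true on a hit. On a miss the block is
   placed in an invalid way if one exists, otherwise in the LRU way. */
bool access_lru(cache_set *s, uint32_t tag)
{
    assert(s != NULL);
    s->now++;

    for (int w = 0; w < WAYS; w++) {
        if (s->valid[w] && s->tag[w] == tag) {
            s->last_used[w] = s->now;   /* refresh recency on a hit */
            return true;
        }
    }

    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!s->valid[w]) { victim = w; break; }   /* free way wins */
        if (s->last_used[w] < s->last_used[victim]) victim = w;
    }
    s->valid[victim] = true;
    s->tag[victim] = tag;
    s->last_used[victim] = s->now;
    return false;
}
```

Replacing the timestamp comparison with a pseudo-random choice or an insertion-order counter would give the Random and FIFO strategies from the same skeleton.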


Q4(write strategy): What happens on a write?

Reads dominate processor cache accesses. E.g., writes are 7% of overall memory traffic but 21% of data cache accesses
Two options when writing to the cache:
  Write through: the information is written to both the block in the cache and the block in the lower-level memory.
  Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
    To reduce the frequency of writing back blocks on replacement, a dirty bit indicates whether the block was modified while in the cache (dirty) or not (clean). If clean, no write back is needed, since the lower level holds information identical to the cache
Pros and Cons
  WT: simple to implement. The cache is always clean, so read misses never result in writes to the lower level
  WB: writes occur at the speed of the cache, and multiple writes within a block require only one write to the lower-level memory


Write Stall and Write Buffer

When the CPU must wait for writes to complete during write through, the CPU is said to write stall.
A common optimization to reduce write stalls is a write buffer, which allows the processor to continue as soon as the data are written to the buffer, thereby overlapping processor execution with memory updating.




Write Buffer

A Write Buffer is needed between the Cache and Memory
  Processor: writes data into the cache and the write buffer
  Memory controller: writes the contents of the buffer to memory
Write buffer is just a FIFO:
  Typical number of entries: 4

Write-Miss Policy: Write Allocate vs. Not Allocate

Two options on a write miss

Write allocate: the block is allocated on a write miss, followed by the write hit actions
  Write misses act like read misses

No-write allocate: write misses do not affect the cache. The block is modified only in the lower-level memory

Blocks stay out of the cache in no-write allocate until the program tries to read them, but with write allocate even blocks that are only written will still be in the cache


Write-Miss Policy Example

Example: Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations:
  Write Mem[100]; Write Mem[100]; Read Mem[200]; Write Mem[200]; Write Mem[100].
What are the numbers of hits and misses (counting both reads and writes) when using no-write allocate versus write allocate?

Answer:

No-write allocate: 4 misses; 1 hit
  Write Mem[100]   write miss
  Write Mem[100]   write miss
  Read Mem[200]    read miss
  Write Mem[200]   write hit
  Write Mem[100]   write miss

Write allocate: 2 misses; 3 hits
  Write Mem[100]   write miss
  Write Mem[100]   write hit
  Read Mem[200]    read miss
  Write Mem[200]   write hit
  Write Mem[100]   write hit
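The hit and miss counts in this example can be checked with a small simulation. The sketch assumes each address names its own block and that the cache never needs to evict ("many cache entries"); `mem_op`, `counts`, `simulate`, and `MAX_BLOCKS` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BLOCKS 16   /* assumed large enough that nothing is evicted */

typedef struct { char op; unsigned addr; } mem_op;   /* op: 'R' or 'W' */
typedef struct { unsigned hits, misses; } counts;

/* Count hits and misses for a fully associative cache that starts empty. */
counts simulate(const mem_op *ops, size_t n, bool write_allocate)
{
    unsigned resident[MAX_BLOCKS];
    size_t used = 0;
    counts c = {0, 0};

    for (size_t i = 0; i < n; i++) {
        bool hit = false;
        for (size_t b = 0; b < used; b++) {
            if (resident[b] == ops[i].addr) { hit = true; break; }
        }
        if (hit) { c.hits++; continue; }

        c.misses++;
        /* Reads always allocate; writes allocate only under write allocate. */
        if (ops[i].op == 'R' || write_allocate) {
            assert(used < MAX_BLOCKS);
            resident[used++] = ops[i].addr;
        }
    }
    return c;
}
```

Running the five-operation sequence with `write_allocate` set to false and then true gives 4 misses and 1 hit, then 2 misses and 3 hits.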

Cache Performance
Example: Split Cache vs. Unified Cache
Which has the better average memory access time: a 16-KB instruction cache with a 16-KB data cache (split cache), or a 32-KB unified cache?

Miss rates
  Size     Instruction Cache   Data Cache   Unified Cache
  16 KB    0.4%                11.4%
  32 KB                                     3.18%

Assume
  A hit takes 1 clock cycle and the miss penalty is 100 cycles
  A load or store takes 1 extra clock cycle on a unified cache since there is only one cache port
  36% of the instructions are data transfer instructions
  About 74% of the memory accesses are instruction references

Answer:
Average memory access time (split)
  = % instructions x (Hit time + Instruction miss rate x Miss penalty) + % data x (Hit time + Data miss rate x Miss penalty)
  = 74% x (1 + 0.4% x 100) + 26% x (1 + 11.4% x 100)
  = 4.24 (using the rounded miss rates above directly gives 4.26)
Average memory access time (unified)
  = 74% x (1 + 3.18% x 100) + 26% x (1 + 1 + 3.18% x 100)
  = 4.44
The split cache has the lower average memory access time.
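The two averages can be recomputed with a pair of helper functions. This is a sketch with hypothetical names (`amat`, `amat_mixed`); it uses the miss rates exactly as listed in the example, so the split-cache result comes out at about 4.26 rather than a value based on unrounded rates.

```c
#include <assert.h>

/* Average memory access time = hit time + miss rate * miss penalty (cycles). */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}

/* Weight instruction and data access times by the fraction of memory
   accesses that are instruction fetches. */
double amat_mixed(double instr_fraction, double amat_instr, double amat_data)
{
    assert(instr_fraction >= 0.0 && instr_fraction <= 1.0);
    return instr_fraction * amat_instr + (1.0 - instr_fraction) * amat_data;
}
```

For the split cache, `amat_mixed(0.74, amat(1, 0.004, 100), amat(1, 0.114, 100))`; for the unified cache, the data term uses a hit time of 2 to model the extra cycle caused by the single port: `amat_mixed(0.74, amat(1, 0.0318, 100), amat(2, 0.0318, 100))`.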


Impact of Memory Access on CPU Performance

Example: Suppose a processor has:
  Ideal CPI = 1.0 (ignoring memory stalls)
  Avg. miss rate of 2%
  Avg. memory references per instruction of 1.5
  Miss penalty of 100 cycles
What is the impact on performance when the behavior of the cache is included?

Answer:
  CPI = CPU execution cycles per instr. + Memory stall cycles per instr.
      = CPI execution + Miss rate x Memory accesses per instr. x Miss penalty
  CPI with cache = 1.0 + 2% x 1.5 x 100 = 4
  CPI without cache = 1.0 + 1.5 x 100 = 151
  CPU time with cache = IC x CPI x Clock cycle time = IC x 4.0 x Clock cycle time
  CPU time without cache = IC x 151 x Clock cycle time
Without a cache, the CPI of the processor increases from 1 to 151!
Even with the cache, the processor is stalled waiting for memory 75% of the time (CPI: 1 -> 4)

Impact of Cache Organizations on CPU Performance

Example: What is the impact of two different cache organizations (direct mapped vs. 2-way set associative) on the performance of a CPU?
  Ideal CPI = 2.0 (ignoring memory stalls)
  Clock cycle time is 1.0 ns
  Avg. memory references per instruction is 1.5
  Cache size: 64 KB, block size: 64 bytes
  For set-associative, assume the clock cycle time is stretched 1.25 times to accommodate the selection multiplexer
  Cache miss penalty is 75 ns
  Hit time is 1 clock cycle
  Miss rate: direct mapped 1.4%; 2-way set-associative 1.0%

Answer:
  Avg. memory access time (1-way) = 1.0 + (0.014 x 75) = 2.05 ns
  Avg. memory access time (2-way) = 1.0 x 1.25 + (0.01 x 75) = 2.00 ns
  CPU time = IC x (CPI execution x Clock cycle time + Memory accesses per instruction x Miss rate x Miss penalty), with the miss penalty in ns
  CPU time (1-way) = IC x (2.0 x 1.0 + (1.5 x 0.014 x 75)) = 3.58 x IC ns
  CPU time (2-way) = IC x (2.0 x 1.0 x 1.25 + (1.5 x 0.01 x 75)) = 3.63 x IC ns
The 2-way cache has the lower average memory access time, but the direct-mapped cache gives the lower CPU time because the stretched clock slows every instruction.

Summary of Performance Equations


Improving Cache Performance

The next few sections in the textbook look at ways to improve cache and memory access times.

Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty
  Hit Time: Section 5.7
  Miss Rate: Section 5.5
  Miss Penalty: Section 5.4

CPU Time = IC * (CPI_Execution + (Memory Accesses / Instruction) * Miss Rate * Miss Penalty) * Clock Cycle Time


Reducing Cache Miss Penalty

Time to handle a miss is becoming more and more the controlling factor, because processor speed has improved much faster than memory speed.
Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty
Five optimizations
  1. Multilevel caches
  2. Critical word first and early restart
  3. Giving priority to read misses over writes
  4. Merging write buffer
  5. Victim caches


Multilevel Caches
Approaches
  Make the cache faster to keep pace with the speed of CPUs
  Make the cache larger to overcome the widening gap
  L1: fast hits, L2: fewer misses
L2 Equations
  Average Memory Access Time = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
  Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
  Average Memory Access Time = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
  Hit Time_L1 << Hit Time_L2 << ... << Hit Time_Mem
  Miss Rate_L1 < Miss Rate_L2 < ...
Definitions:
  Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L1, Miss Rate_L2)
    L1 cache skims the cream of the memory accesses
  Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1, Miss Rate_L1 x Miss Rate_L2)
    Indicates what fraction of the memory accesses that leave the CPU go all the way to memory
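The two-level equations and the local versus global miss rate can be sketched as follows. The struct layout and function names are hypothetical, and the numbers in the usage note are illustrative assumptions rather than measurements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameter bundle; times are in clock cycles and both miss
   rates are local miss rates. */
typedef struct {
    double hit_time_l1, miss_rate_l1;
    double hit_time_l2, miss_rate_l2;
    double miss_penalty_l2;            /* cycles to reach main memory */
} two_level;

/* Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2 */
double miss_penalty_l1(const two_level *c)
{
    assert(c != NULL);
    return c->hit_time_l2 + c->miss_rate_l2 * c->miss_penalty_l2;
}

/* AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1 */
double amat_two_level(const two_level *c)
{
    assert(c != NULL);
    return c->hit_time_l1 + c->miss_rate_l1 * miss_penalty_l1(c);
}

/* Global L2 miss rate = Miss Rate_L1 x Miss Rate_L2 (local) */
double global_miss_rate_l2(const two_level *c)
{
    assert(c != NULL);
    return c->miss_rate_l1 * c->miss_rate_l2;
}
```

With assumed values of a 1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit, 50% local L2 miss rate, and 100-cycle memory penalty, the L1 miss penalty is 60 cycles, the average memory access time is 3.4 cycles, and the global L2 miss rate is 2%. A local L2 miss rate of 50% looks alarming in isolation, which is why the global rate is the more useful figure for the second level.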

Design of L2 Cache
Since everything in L1 cache is likely to be in L2 cache, L2 cache should be much bigger than L1

Whether data in L1 is in L2
  Novice approach: design L1 and L2 independently
  Multilevel inclusion: L1 data are always present in L2
    Advantage: easy to keep I/O and caches consistent (checking L2 only)
    Drawback: with different block sizes, L2 must invalidate all L1 blocks that map onto the 2nd-level block to be replaced => slightly higher 1st-level miss rate
    e.g., Intel Pentium 4: 64-byte blocks in L1 and 128-byte blocks in L2
  Multilevel exclusion: L1 data is never found in L2
    A cache miss in L1 results in a swap of blocks between L1 and L2
    Advantage: prevents wasting space in L2
    e.g., AMD Athlon: 64 KB L1 and 256 KB L2

O2: Critical Word First and Early Restart

Don't wait for the full block to be loaded before restarting the CPU
  Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
  Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
    Given spatial locality, the CPU tends to want the next sequential word soon, so it is not clear how much early restart helps

Generally useful only with large blocks



O3: Giving Priority to Read Misses over Writes

Serve reads before writes have been completed
Write through with write buffers
  SW R3, 512(R0)    ; M[512] <- R3     (cache index 0)
  LW R1, 1024(R0)   ; R1 <- M[1024]    (cache index 0)
  LW R2, 512(R0)    ; R2 <- M[512]     (cache index 0)

  Problem: write through with write buffers can create RAW conflicts with main memory reads on cache misses
  Simply waiting for the write buffer to empty might increase the read miss penalty (by 50% on the old MIPS 1000)
  Solution: check the write buffer contents before the read; if there are no conflicts, let the memory access continue
Write back
  Suppose a read miss will replace a dirty block
  Normal: write the dirty block to memory, and then do the read
  Instead: copy the dirty block to a write buffer, do the read, and then do the write
  The CPU stalls less since it restarts as soon as the read is done

O4: Merging Write Buffer

If the write buffer is empty, the data and the full address are written in the buffer, and the write is finished from the CPU's perspective
  Usually a write buffer entry holds multiple words

Write merging: the addresses in the write buffer are checked to see if the address of the new data matches the address of a valid write buffer entry. If so, the new data are combined with that entry

[Figure: a write buffer with 4 entries, each holding four 64-bit words. Left: without merging. Right: four writes are merged into a single entry.]
Writing multiple words at the same time is faster than writing them one at a time
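Write merging can be sketched as a lookup on the entry's base address before a new entry is allocated. The sketch assumes 4 entries of four 64-bit words, as in the figure, and word-aligned writes; the types and function names (`wb_entry`, `write_buffer`, `wb_write`, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRIES         4   /* buffer entries */
#define WORDS_PER_ENTRY 4   /* four 64-bit words per entry */

typedef struct {
    bool     valid;
    uint64_t base;                        /* address of word 0 (32-byte aligned) */
    bool     word_valid[WORDS_PER_ENTRY];
    uint64_t data[WORDS_PER_ENTRY];
} wb_entry;

typedef struct { wb_entry e[ENTRIES]; } write_buffer;

void wb_init(write_buffer *wb)
{
    for (int i = 0; i < ENTRIES; i++)
        wb->e[i].valid = false;
}

/* Buffer one 64-bit write. Returns false if the buffer is full, in which
   case the CPU would have to stall. */
bool wb_write(write_buffer *wb, uint64_t addr, uint64_t value)
{
    assert(addr % 8 == 0);   /* word-aligned writes only (assumption) */
    uint64_t base = addr & ~(uint64_t)(WORDS_PER_ENTRY * 8 - 1);
    unsigned word = (unsigned)((addr - base) / 8);
    wb_entry *free_slot = NULL;

    for (int i = 0; i < ENTRIES; i++) {
        wb_entry *e = &wb->e[i];
        if (e->valid && e->base == base) {   /* address match: merge */
            e->data[word] = value;
            e->word_valid[word] = true;
            return true;
        }
        if (!e->valid && free_slot == NULL)
            free_slot = e;
    }
    if (free_slot == NULL)
        return false;

    free_slot->valid = true;
    free_slot->base = base;
    for (int w = 0; w < WORDS_PER_ENTRY; w++)
        free_slot->word_valid[w] = false;
    free_slot->data[word] = value;
    free_slot->word_valid[word] = true;
    return true;
}

int wb_entries_used(const write_buffer *wb)
{
    int used = 0;
    for (int i = 0; i < ENTRIES; i++)
        if (wb->e[i].valid) used++;
    return used;
}
```

Four writes to consecutive words starting at a 32-byte boundary occupy a single entry here; without the merge branch each would take its own entry and the fifth write would stall.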

O5: Victim Caches

Idea of recycling: remember what was most recently discarded on a cache miss in case it is needed again,
  rather than simply discarding it or swapping it into L2

Victim cache: a small, fully associative cache between a cache and its refill path
  Contains only blocks that are discarded from a cache because of a miss (victims)
  Checked on a miss before going to the next lower-level memory
  Victim caches of 1 to 5 entries are effective at reducing misses, especially for small, direct-mapped data caches
  AMD Athlon: 8 entries

Reducing Miss Rate

3 Cs of Cache Miss
Compulsory: The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an Infinite Cache)
Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in Fully Associative Size X Cache)
Conflict: If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in N-way Associative but hits in Fully Associative Size X Cache)

3Cs Absolute Miss Rate (SPEC92)

[Figure: miss rate per type (0 to 0.14) versus cache size (1 KB to 64 KB), with conflict components for 2-way, 4-way, and 8-way associativity above the capacity component. The compulsory component is vanishingly small.]

2:1 Cache Rule: miss rate of a 1-way associative cache of size X = miss rate of a 2-way associative cache of size X/2




3Cs Relative Miss Rate

[Figure: relative miss rate per type (0% to 100%) versus cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity.]

Flaws: for fixed block size
Good: insight => invention



Five Techniques to Reduce Miss Rate

1. Larger block size
2. Larger caches
3. Higher associativity
4. Way prediction and pseudoassociative caches
5. Compiler optimizations


O1: Larger Block Size

[Figure: miss rate (0% to 25%) versus block size (16 to 128 bytes), one curve per cache size: 1K, 4K, 16K, 64K, 256K.]

Take advantage of spatial locality
  Using the principle of locality: the larger the block, the greater the chance that parts of it will be used again
Larger blocks reduce the number of blocks in a cache of the same size
  May increase conflict misses and even capacity misses if the cache is small
Larger blocks increase the miss penalty
Usually high latency and high bandwidth encourage large block size


O2: Larger Caches

[Figure: miss rate per type (0 to 0.14) versus cache size (1 KB to 128 KB) for 1-way, 2-way, and 8-way associativity.]

Increasing the capacity of the cache reduces capacity misses (Figures 5.14 and 5.15)
May mean longer hit time and higher cost
Trend: larger L2 or L3 off-chip caches


O3: Higher Associativity

Figures 5.14 and 5.15 show how miss rates improve with higher associativity
  8-way set associative is as effective as fully associative for practical purposes
  2:1 Cache Rule: Miss Rate of a direct-mapped cache of size N = Miss Rate of a 2-way cache of size N/2

Tradeoff: a more highly associative cache complicates the circuit
  May have a longer clock cycle

Beware: execution time is the only final measure!
  Will clock cycle time increase as a result of having a more complicated cache?
  Hill [1988] suggested the hit time for 2-way vs. 1-way is: external cache +10%, internal +2%

O4: Way Prediction & Pseudoassociative Caches

Way prediction: extra bits are kept in the cache to predict the way, or block within the set, of the next cache access
  Example: 2-way I-cache of the Alpha 21264
    If the predictor is correct, I-cache latency is 1 clock cycle
    If incorrect, it tries the other block, changes the way predictor, and has a latency of 3 clock cycles
    Prediction accuracy is in excess of 85%
  Reduces conflict misses while maintaining the hit speed of a direct-mapped cache

Pseudoassociative or column associative
  One fast hit and one slow hit
  On a miss, a 2nd cache entry is checked before going to the next lower level
  Invert the most significant bit of the index to find the other block in the pseudoset
  Miss penalty may become slightly longer
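Finding the second entry of a pseudoset is a single XOR on the index. A minimal sketch, assuming the inverted bit is the most significant bit of the index field; `pseudo_index` is a hypothetical name.

```c
#include <assert.h>

/* Index of the second ("slow hit") location in the pseudoset: the original
   index with its most significant bit inverted. */
unsigned pseudo_index(unsigned index, unsigned index_bits)
{
    assert(index_bits > 0 && index_bits < 32);
    assert(index < (1u << index_bits));
    return index ^ (1u << (index_bits - 1));
}
```

With a 5-bit index, entry 1 pairs with entry 17 and vice versa, so every block has exactly one alternative location and the mapping is its own inverse.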


O5: Compiler Optimizations

Improve hit rate by compile-time optimization
Reordering instructions with profiling information (McFarling [1989])
  Reduced misses by 50% for a 2-KB direct-mapped I-cache with 4-byte blocks, and by 75% in an 8-KB cache
  Best performance was obtained when it was possible to prevent some instructions from entering the cache

Aligning basic blocks: the entry point is at the beginning of a cache block
  Decreases the chance of a cache miss for sequential code

Loop Interchange: exchanging the nesting of loops

Improve spatial locality => reduce misses
Make data be accessed in the order it is stored => maximize use of the data in a cache block before it is discarded

/* Before: skips through memory in strides of 100 words */
for (j = 0; j < 100; j = j + 1)
    for (i = 0; i < 5000; i = i + 1)
        x[i][j] = 2 * x[i][j];

/* After: accesses all the words in a cache block before moving on */
for (i = 0; i < 5000; i = i + 1)
    for (j = 0; j < 100; j = j + 1)
        x[i][j] = 2 * x[i][j];


Blocking: operating on submatrices or blocks

Maximize accesses to the data loaded into the cache before it is replaced
Improve temporal locality

X = Y*Z

/* Before */
for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1) {
        r = 0;
        for (k = 0; k < N; k = k + 1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

# of capacity misses depends on N and cache size

/* After: B = blocking factor; x must be zero before the loops start */
for (jj = 0; jj < N; jj = jj + B)
    for (kk = 0; kk < N; kk = kk + B)
        for (i = 0; i < N; i = i + 1)
            for (j = jj; j < min(jj + B, N); j = j + 1) {
                r = 0;
                for (k = kk; k < min(kk + B, N); k = k + 1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

Total # of memory words accessed = 2N^3/B + N^2
y benefits from spatial locality
z benefits from temporal locality
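A compilable version of the two loop nests makes it easy to confirm that blocking changes only the order of the work, not the result. This sketch fixes small, hypothetical values for `N` and `B` and adds the zeroing of `x` that the blocked form relies on; `min_int`, `matmul_naive`, and `matmul_blocked` are names chosen here for illustration.

```c
#include <assert.h>

#define N 8   /* hypothetical matrix dimension */
#define B 4   /* hypothetical blocking factor */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Unblocked x = y * z. */
void matmul_naive(double x[N][N], double y[N][N], double z[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0.0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Blocked x = y * z: works on B-wide strips of z so the strip stays in the
   cache while it is reused for every row i. */
void matmul_blocked(double x[N][N], double y[N][N], double z[N][N])
{
    assert(B > 0);
    for (int i = 0; i < N; i++)          /* partial sums accumulate in x */
        for (int j = 0; j < N; j++)
            x[i][j] = 0.0;

    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min_int(jj + B, N); j++) {
                    double r = 0.0;
                    for (int k = kk; k < min_int(kk + B, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```

The benefit only shows up in timing or miss counts once `N` is large enough that a full matrix no longer fits in the cache; `B` is then chosen so that the blocks being reused do fit.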

5.6 Reducing Cache Penalty or Miss Rate via Parallelism

Three techniques that overlap the execution of instructions with memory activity
  1. Nonblocking caches to reduce stalls on cache misses
     (to match out-of-order processors)
  2. Hardware prefetching of instructions and data
  3. Compiler-controlled prefetching


O1: Nonblocking cache to reduce stalls on cache miss

For pipelined computers that allow out-of-order completion, the CPU need not stall on a cache miss
  Uses separate I-cache and D-cache
Nonblocking cache (lockup-free cache)
  Continue fetching instructions from the I-cache while waiting for the D-cache to return missing data
  Hit under miss: the D-cache continues to supply cache hits during a miss
  Hit under multiple miss or miss under miss: overlap multiple misses
Average memory stall time of hit-under-miss schemes relative to a blocking cache (blocking cache = 100%)
  First 14 are FP programs; average: 76% for 1-miss, 51% for 2-miss, 39% for 64-miss
  Final 4 are INT programs; average: 81%, 78%, and 78%

O2: Hardware Prefetching of Instructions and Data

Prefetch instructions or data before they are requested by the CPU
  Either directly into the caches or into an external buffer (faster to access than main memory)

Instruction prefetch: frequently done in hardware outside the cache
  Fetch two blocks on a miss
    The requested block is placed in the I-cache when it returns
    The prefetched block is placed in an instruction stream buffer (ISB)
  A single ISB would catch 15% to 25% of misses from a 4-KB direct-mapped I-cache with 16-byte blocks
  4 ISBs increased the data hit rate to 43% (Jouppi [1990])

UltraSPARC III: data prefetch
  If a load hits in the prefetch cache
    The block is read from the prefetch cache
    The next prefetch request is issued: the stride of the next prefetched block is calculated from the difference between the current address and the previous address
  Up to 8 simultaneous prefetches

Prefetching may interfere with demand misses, lowering performance

O3: Compiler-Controlled Prefetching

Compiler-controlled prefetching
  Register prefetch: load the value into a register
  Cache prefetch: load data only into the cache (not a register)

Faulting vs. nonfaulting: the address does or does not cause an exception for virtual address faults and protection violations
  A normal load instruction = a faulting register prefetch instruction

Most effective prefetch: semantically invisible to a program
  Doesn't change the contents of registers or memory, and cannot cause virtual memory faults
  Nonbinding prefetch: nonfaulting cache prefetch

Overlapping execution: the CPU proceeds while the prefetched data are being fetched
Advantage: the compiler can avoid the unnecessary prefetches that hardware might issue
Drawback: prefetch instructions incur instruction overhead
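A nonbinding cache prefetch can be written by hand in C with the GCC/Clang builtin `__builtin_prefetch`, which is a hint that does not fault on an invalid address. The sketch below is compiler-specific and falls back to a no-op elsewhere; `PREFETCH_DISTANCE`, the `PREFETCH` macro, and `sum_with_prefetch` are hypothetical names, and the distance of 16 elements is an assumption that would need tuning on a real machine.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 16   /* assumed: how many elements to fetch ahead */

#if defined(__GNUC__) || defined(__clang__)
/* Arguments: address, 0 = prefetch for read, 3 = keep in all cache levels. */
#define PREFETCH(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH(p) ((void)0)   /* no-op on other compilers */
#endif

/* Sum an array while prefetching a fixed distance ahead. The prefetch only
   touches the cache; it changes no register or memory contents, so the
   result is identical with or without it. */
double sum_with_prefetch(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            PREFETCH(&a[i + PREFETCH_DISTANCE]);
        sum += a[i];
    }
    return sum;
}
```

The extra compare and prefetch in every iteration is the instruction overhead the drawback refers to; for a simple sequential scan like this one, hardware prefetchers often do the same job without it.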


5.7 Reducing Hit Time

Importance of cache hit time
Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty
More importantly, cache access time limits the clock cycle rate in many processors today!

Fast hit time:
  Quickly and efficiently find out if data is in the cache, and if it is, get that data out of the cache

Four techniques:
  1. Small and simple caches
  2. Avoiding address translation during indexing of the cache
  3. Pipelined cache access
  4. Trace caches

O1: Small and Simple Caches

A time-consuming portion of a cache hit is using the index portion of the address to read the tag memory and then comparing it to the address

Guideline: smaller hardware is faster
  Why does the Alpha 21164 have an 8-KB instruction cache and an 8-KB data cache plus a 96-KB second-level cache?
  A small data cache allows a fast clock rate

Guideline: simpler hardware is faster
  Direct mapped, on chip

General design:
  Small and simple cache for the 1st-level cache
  Keep the tags on chip and the data off chip for 2nd-level caches
  The recent emphasis is on fast clock time while hiding L1 misses with dynamic execution and using L2 caches to avoid going to memory


O2: Avoiding address translation during cache indexing

Two tasks: indexing the cache and comparing addresses
Virtually vs. physically addressed cache
  Virtual cache: uses the virtual address (VA) for the cache
  Physical cache: uses the physical address (PA) after translating the virtual address

Challenges to virtual cache

1. Protection: page-level protection (RW/RO/Invalid) must be checked
  It is checked as part of the virtual-to-physical address translation
  Solution: an additional field to copy the protection information from the TLB and check it on every access to the cache

2. Context switching: the same VA in different processes refers to different PAs, requiring the cache to be flushed
  Solution: increase the width of the cache address tag with a process-identifier tag (PID)

3. Synonyms or aliases: two different VAs for the same PA
  Inconsistency problem: two copies of the same data in a virtual cache
  Hardware antialiasing solution: guarantee every cache block a unique PA
    Alpha 21264: checks all possible locations. If one is found, it is invalidated
  Software page-coloring solution: force aliases to share some address bits
    Sun's Solaris: all aliases must be identical in the last 18 bits => no duplicate PAs

4. I/O: typically uses PAs, so it needs to interact with the cache (see Section 5.12)

Virtually indexed, physically tagged cache

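[Figure: three cache organizations.
  Conventional Organization: the CPU sends the VA to the TB, and the resulting PA accesses the $ and then MEM
  Virtually Addressed Cache: the CPU accesses the $ (VA tags) with the VA; the TB translates only on a miss; synonym problem
  Overlapped organization: the CPU sends the VA to the $ (PA tags) and the TB in parallel, with a physically addressed L2 $ in front of MEM]

Overlap cache access with VA translation: requires the $ index to remain invariant across translation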
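The index stays invariant across translation when the index and block-offset bits all lie within the page offset, which is the part of the address that translation does not change. That is equivalent to one way of the cache being no larger than a page. A minimal check, with a hypothetical function name and sizes given in bytes:

```c
#include <assert.h>
#include <stdbool.h>

/* True when the index + block-offset bits fall inside the page offset,
   i.e. when (cache size / associativity) <= page size. */
bool index_invariant_under_translation(unsigned long cache_bytes,
                                       unsigned ways,
                                       unsigned long page_bytes)
{
    assert(ways > 0 && cache_bytes % ways == 0 && page_bytes > 0);
    return cache_bytes / ways <= page_bytes;
}
```

With 8-KB pages, for example, an 8-KB direct-mapped cache passes and a 16-KB direct-mapped cache does not, while a 16-KB 2-way cache passes again. This is one reason higher associativity is used to grow a virtually indexed, physically tagged cache.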


O3: Pipelined Cache Access

Simply pipeline the cache access
  A 1st-level cache hit takes multiple clock cycles

Advantage: fast cycle time, at the cost of a slower hit
  Example: accessing instructions from the I-cache
    Pentium: 1 clock cycle
    Pentium Pro through Pentium III: 2 clocks
    Pentium 4: 4 clocks

Drawback: increasing the number of pipeline stages leads to
  a greater penalty on mispredicted branches, and
  more clock cycles between the issue of the load and the use of the data

Note that pipelining increases the bandwidth of instructions rather than decreasing the actual latency of a cache hit

O4: Trace Caches

Trace cache for instructions: find a dynamic sequence of instructions, including taken branches, to load into a cache block
  The cache blocks contain dynamic traces of executed instructions as determined by the CPU, rather than static sequences of instructions as determined by memory
  Branch prediction is folded into the cache: it is validated along with the addresses to have a valid fetch
  e.g., Intel NetBurst microarchitecture

Advantage: better utilization
  Trace caches store instructions only from the branch entry point to the exit of the trace
  In a conventional I-cache, the unused part of a long block entered or exited by a taken branch may not be fetched

Downside: the same instructions may be stored multiple times