ACA Unit-5

UNIT-V

Memory Hierarchy Design

Memory Hierarchy Design
5.1 Introduction
5.2 Review of the ABCs of Caches
5.3 Cache Performance
5.4 Reducing Cache Miss Penalty
5.5 Reducing Cache Miss Rate
5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism
5.7 Reducing Hit Time
5.8 Main Memory and Organizations for Improving Performance
5.9 Memory Technology
5.10 Virtual Memory
5.11 Protection and Examples of Virtual Memory

The five classic components of a computer: Input, Output, Memory, Datapath, and Control (the Datapath and Control together form the Processor).

Where do we fetch instructions to execute?
Build a memory hierarchy that includes main memory and caches (internal memory) and the hard disk (external memory). Instructions are first fetched from external storage such as the hard disk and are kept in main memory. Before they go to the CPU, they are typically brought into the caches.

Technology Trends

          Capacity            Speed (latency)
CPU:      2x in 1.5 years     2x in 1.5 years
DRAM:     4x in 3 years       2x in 10 years
Disk:     4x in 3 years       2x in 10 years

DRAM generations:
Year    Size      Cycle Time
1980    64 Kb     250 ns
1983    256 Kb    220 ns
1986    1 Mb      190 ns
1989    4 Mb      165 ns
1992    16 Mb     145 ns
1995    64 Mb     120 ns
2000    256 Mb    100 ns

Over this period DRAM capacity improved about 4000:1, while cycle time improved only about 2.5:1.

Performance Gap between CPUs and Memory
CPU performance (improvement ratio) grew about 1.35x per year in the early years and about 1.55x per year thereafter, while memory (DRAM) performance improved only about 7% per year.
The gap (latency) grows about 50% per year!

Levels of the Memory Hierarchy (capacity and access time):
– CPU Registers: about 500 bytes, 0.25 ns
– Cache: about 64 KB, 1 ns
– Main Memory: about 512 MB, 100 ns
– Disk: about 100 GB, 5 ms
– I/O Devices: capacity ???
Data moves between adjacent levels in blocks (cache), pages (main memory), and files (disk).
Upper levels are faster and smaller; lower levels are larger and slower.

ABCs of Caches
• Cache:
– In this textbook it mainly means the first level of the memory hierarchy encountered once the address leaves the CPU
– The term is applied whenever buffering is employed to reuse commonly occurring items, e.g., file caches, name caches, and so on
• Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time
• Two Different Types of Locality:
– Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
– Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)

Memory Hierarchy: Terminology
• Hit: the data appears in some block in the cache (example: Block X)
– Hit Rate: the fraction of cache accesses found in the cache
– Hit Time: time to access the upper level, which consists of the RAM access time plus the time to determine hit/miss
• Miss: the data must be retrieved from a block in the main memory (Block Y)
– Miss Rate = 1 - (Hit Rate)
– Miss Penalty: time to replace a block in the cache plus the time to deliver the block to the processor
• Hit Time << Miss Penalty (e.g., 1 clock cycle vs. 40 clock cycles)

Cache Measures
CPU execution time incorporating cache performance:
CPU execution time = (CPU clock cycles + Memory stall cycles) x Clock cycle time
Memory stall cycles: the number of cycles during which the CPU is stalled waiting for a memory access
Memory stall clock cycles = Number of misses x Miss penalty
  = IC x (Misses/Instruction) x Miss penalty
  = IC x (Memory accesses/Instruction) x Miss rate x Miss penalty
  = IC x Reads per instruction x Read miss rate x Read miss penalty
    + IC x Writes per instruction x Write miss rate x Write miss penalty
Memory accesses consist of fetching instructions and reading/writing data.

Example
Assume we have a computer where the CPI is 1.0 when all memory accesses hit the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all memory accesses hit in the cache?

Answer:
(A) If all accesses hit in the cache, there are no memory stalls and CPI = 1.0:
CPU(A) = (IC x CPI + 0) x Clock cycle time = IC x Clock cycle time
(B) With a 2% miss rate, we need to calculate the memory stalls:
Memory stall cycles = IC x (Memory accesses/Instruction) x Miss rate x Miss penalty
  = IC x (1 + 50%) x 2% x 25 = 0.75 x IC
CPU(B) = (IC x 1.0 + 0.75 x IC) x Clock cycle time = 1.75 x IC x Clock cycle time
The performance ratio is the inverse of the CPU execution times:
CPU(B)/CPU(A) = 1.75
The computer with no cache misses is 1.75 times faster.
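A tiny C program (an illustrative sketch, not from the text) can confirm the arithmetic of this example; the instruction count and clock cycle time below are placeholder values, since they cancel out of the ratio.

#include <stdio.h>

int main(void)
{
    /* Values from the example above; IC and cycle time are arbitrary placeholders. */
    double ic = 1e9, cycle = 1e-9;
    double cpi = 1.0, accesses_per_instr = 1.5, miss_rate = 0.02, miss_penalty = 25.0;

    double cpu_a = ic * cpi * cycle;                                /* all accesses hit      */
    double stalls = accesses_per_instr * miss_rate * miss_penalty;  /* 0.75 cycles per instr */
    double cpu_b = ic * (cpi + stalls) * cycle;                     /* 2% of accesses miss   */

    printf("stall cycles per instruction = %.2f\n", stalls);
    printf("speedup with a perfect cache = %.2f\n", cpu_b / cpu_a); /* about 1.75 */
    return 0;
}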

Four Memory Hierarchy Questions
Q1 (block placement): Where can a block be placed in the upper level?
Q2 (block identification): How is a block found if it is in the upper level?
Q3 (block replacement): Which block should be replaced on a miss?
Q4 (write strategy): What happens on a write?

Q1 (block placement): Where can a block be placed?
Direct mapped: (Block number) mod (Number of blocks in cache)
Set associative: (Block number) mod (Number of sets in cache)
– # of sets ≤ # of blocks
– n-way: n blocks in a set
– 1-way = direct mapped
Fully associative: # of sets = 1
Example: block 12 placed in an 8-block cache

Simplest Cache: Direct Mapped (1-way)
(Figure: a memory of 16 blocks, numbered 0 through F, mapped onto a 4-block direct-mapped cache with indexes 0 through 3.)
Each block has only one place it can appear in the cache.
The mapping is usually (Block address) MOD (Number of blocks in cache).

Example: 1 KB Direct Mapped Cache, 32-Byte Blocks
For a 2^N-byte cache with 2^M-byte blocks:
– The uppermost (32 - N) bits of the address are the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2^M)
For this 1 KB cache with 32-byte blocks, bits 31..10 are the Cache Tag (example: 0x50), bits 9..5 are the Cache Index (example: 0x01), and bits 4..0 are the Byte Select (example: 0x00).
The Cache Tag and a Valid Bit are stored as part of the cache "state" alongside the Cache Data: Bytes 0–31 for index 0, Bytes 32–63 for index 1, and so on up to Byte 1023 for index 31.
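The bit slicing above can be written down directly. The following C sketch (illustrative, not from the text) extracts the tag, index, and byte select for this 1 KB, 32-byte-block, direct-mapped configuration; the sample address is chosen so it reproduces the tag/index/byte-select values used in the example.

#include <stdio.h>
#include <stdint.h>

#define BLOCK_BITS 5   /* 32-byte blocks  -> 5 byte-select bits          */
#define INDEX_BITS 5   /* 32 blocks total -> 5 index bits (1 KB cache)   */

int main(void)
{
    uint32_t addr = 0x00014020u;   /* assumed example address */

    uint32_t byte_select = addr & ((1u << BLOCK_BITS) - 1);
    uint32_t index = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag = addr >> (BLOCK_BITS + INDEX_BITS);

    /* Prints tag=0x50 index=0x1 byte_select=0x0, matching the slide's example. */
    printf("tag=0x%x index=0x%x byte_select=0x%x\n", tag, index, byte_select);
    return 0;
}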

Q2 (block identification): How is a block found?
An address has three portions in a set-associative or direct-mapped cache: the Block Address, split into Tag and Cache/Set Index, followed by the Block Offset (determined by the block size).
– The Block Offset selects the desired data from the block
– The Index field selects the set, and the Tag field is compared against the CPU address for a hit
• Use the Cache Index to select the cache set
• Check the Tag on each block in that set
– There is no need to compare the index or block offset
– A valid bit is added to the Tag to indicate whether or not this entry contains a valid address
• Select the desired bytes using the Block Offset
Increasing associativity shrinks the index and expands the tag.

Example: Two-Way Set Associative Cache
• The Cache Index selects a "set" from the cache
• The two tags in the set (each with its valid bit) are compared in parallel
• Data is selected based on the tag comparison result: the Sel1/Sel0 signals drive a multiplexer that picks Cache Block 0 or 1, and the OR of the comparisons gives the Hit signal
(Address fields as before: Cache Tag, e.g., 0x50; Cache Index, e.g., 0x01; Byte Select, e.g., 0x00.)
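A minimal software model of this lookup, assuming a simple array-of-sets layout (all names and sizes here are illustrative, not from the text): hardware compares the ways in parallel, while in C we simply loop over them.

#include <stddef.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 32
#define NUM_WAYS 2

struct line {
    bool     valid;
    uint32_t tag;
    uint8_t  data[32];          /* 32-byte block */
};

static struct line cache[NUM_SETS][NUM_WAYS];

/* Returns a pointer to the block data on a hit, NULL on a miss. */
uint8_t *lookup(uint32_t addr)
{
    uint32_t index = (addr >> 5) & (NUM_SETS - 1);   /* set index            */
    uint32_t tag   = addr >> 10;                     /* remaining upper bits */

    for (int way = 0; way < NUM_WAYS; way++) {       /* hardware does these compares in parallel */
        if (cache[index][way].valid && cache[index][way].tag == tag)
            return cache[index][way].data;           /* hit: the mux selects this way */
    }
    return NULL;                                     /* miss */
}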

Disadvantage of Set Associative Cache
• N-way set associative cache vs. direct mapped cache:
– N comparators vs. 1
– Extra MUX delay for the data
– Data comes AFTER Hit/Miss is determined
• In a direct mapped cache, the cache block is available BEFORE Hit/Miss:
– It is possible to assume a hit and continue, and recover later if it was a miss
(Same two-way organization as the previous figure: parallel tag compares, a Sel1/Sel0 multiplexer, and an OR gate producing Hit.)

Q3 (block replacement): Which block should be replaced on a cache miss?
Easy for direct mapped: the hardware decision is simplified, since only one block frame is checked and only that block can be replaced.
Set associative or fully associative: there are many blocks to choose from on a miss. Three primary strategies for selecting the block to be replaced:
– Random: the victim is selected at random
– LRU: the Least Recently Used block is removed
– FIFO (First In, First Out): the oldest block is removed
(Table: data cache misses per 1000 instructions for LRU, Random, and FIFO replacement, for 16 KB, 64 KB, and 256 KB caches at 2-way, 4-way, and 8-way associativity.)
There is little difference between LRU and Random for the largest cache size, with LRU outperforming the others for smaller caches; FIFO generally outperforms Random at the smaller cache sizes.
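One common way to model the LRU policy in software is to keep a per-way timestamp; the sketch below (illustrative, with assumed structure names) picks an invalid way if one exists and otherwise evicts the way with the oldest timestamp.

#include <stdint.h>

#define NUM_WAYS 4

struct way_state {
    int      valid;
    uint32_t tag;
    uint64_t last_used;    /* timestamp of the most recent access */
};

/* Choose a victim way within one set: prefer a free slot,
   otherwise evict the least recently used block. */
int choose_victim(struct way_state set[NUM_WAYS])
{
    int victim = 0;
    for (int w = 0; w < NUM_WAYS; w++) {
        if (!set[w].valid)
            return w;                                  /* free slot: no eviction needed   */
        if (set[w].last_used < set[victim].last_used)
            victim = w;                                /* older access -> better LRU pick */
    }
    return victim;
}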

Q4 (write strategy): What happens on a write?
Reads dominate processor cache accesses: e.g., only about 7% of overall memory traffic is writes, while about 21% of data cache accesses are writes.
Two options when writing to the cache:
Write through: the information is written to both the block in the cache and the block in the lower-level memory.
Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced. To reduce the frequency of writing back blocks on replacement, a dirty bit indicates whether the block was modified in the cache (dirty) or not (clean). If the block is clean, there is no write back, since identical information is already in the lower-level memory.
Pros and cons:
WT: simple to implement; the cache is always clean, so read misses never result in writes to the lower level.
WB: writes occur at the speed of the cache, and multiple writes within a block require only one write to the lower-level memory.
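The dirty-bit mechanism can be sketched in a few lines of C. The toy model below (not from the text; memory is modeled as a simple array and the tag is just the block number) shows how write back defers the memory update until replacement, while write through updates memory on every write.

#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 32
#define MEM_BLOCKS 1024

static uint8_t main_memory[MEM_BLOCKS][BLOCK_SIZE];   /* toy lower-level memory */

struct cache_line {
    int      valid;
    int      dirty;      /* write back only: modified since the block was filled? */
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
};

/* Write-back replacement: only a dirty victim is copied back to memory. */
void replace_block(struct cache_line *line, uint32_t old_block, uint32_t new_block)
{
    if (line->valid && line->dirty)
        memcpy(main_memory[old_block], line->data, BLOCK_SIZE);  /* write back the victim */
    memcpy(line->data, main_memory[new_block], BLOCK_SIZE);      /* fill the new block    */
    line->valid = 1;
    line->dirty = 0;                                             /* clean after the fill  */
    line->tag = new_block;                                       /* toy model: tag = block number */
}

/* Write hit: write back updates only the cache and sets the dirty bit;
   write through also updates main memory immediately. */
void write_byte(struct cache_line *line, uint32_t block, uint32_t offset,
                uint8_t value, int write_through)
{
    line->data[offset] = value;
    if (write_through)
        main_memory[block][offset] = value;   /* keep memory identical to the cache */
    else
        line->dirty = 1;                      /* defer the update until replacement */
}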

Write Stall and Write Buffer
When the CPU must wait for writes to complete under write through, the CPU is said to write stall.
A common optimization to reduce write stalls is a write buffer, which allows the processor to continue as soon as the data are written to the buffer, thereby overlapping processor execution with memory updating.
Processor -> Cache and Write Buffer -> DRAM
• A write buffer is needed between the cache and memory
– Processor: writes data into the cache and the write buffer
– Memory controller: writes the contents of the buffer to memory
• The write buffer is just a FIFO
– Typical number of entries: 4

Write-Miss Policy: Write Allocate vs. No-Write Allocate
Two options on a write miss:
Write allocate: the block is allocated on a write miss, followed by the write-hit actions above. Write misses act like read misses.
No-write allocate: write misses do not affect the cache; the block is modified only in the lower-level memory. Blocks stay out of the cache under no-write allocate until the program tries to read them, but with write allocate even blocks that are only written will be in the cache.

Write-Miss Policy Example
Example: Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations:
Write Mem[100]; Write Mem[100]; Read Mem[200]; Write Mem[200]; Write Mem[100].
What are the numbers of hits and misses (including reads and writes) when using no-write allocate versus write allocate?

Answer:
No-write allocate:
Write Mem[100]: 1 write miss
Write Mem[100]: 1 write miss
Read Mem[200]:  1 read miss
Write Mem[200]: 1 write hit
Write Mem[100]: 1 write miss
Total: 4 misses, 1 hit

Write allocate:
Write Mem[100]: 1 write miss
Write Mem[100]: 1 write hit
Read Mem[200]:  1 read miss
Write Mem[200]: 1 write hit
Write Mem[100]: 1 write hit
Total: 2 misses, 3 hits

Cache Performance Example: Split Cache vs. Unified Cache
Which has the better average memory access time: a 16-KB instruction cache with a 16-KB data cache (split cache), or a 32-KB unified cache?

Miss rates:
Size     Instruction Cache    Data Cache    Unified Cache
16 KB    0.4%                 11.4%         -
32 KB    -                    -             3.18%

Assume:
• A hit takes 1 clock cycle and the miss penalty is 100 cycles
• A load or store takes 1 extra clock cycle on a unified cache, since there is only one cache port
• 36% of the instructions are data transfer instructions
• About 74% of the memory accesses are instruction references

Answer:
Average memory access time (split)
  = % instructions x (Hit time + Instruction miss rate x Miss penalty)
    + % data x (Hit time + Data miss rate x Miss penalty)
  = 74% x (1 + 0.4% x 100) + 26% x (1 + 11.4% x 100) = 4.26
Average memory access time (unified)
  = 74% x (1 + 3.18% x 100) + 26% x (1 + 1 + 3.18% x 100) = 4.44
The split cache therefore has the lower average memory access time.
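To make the arithmetic in this example easy to check, here is a small C sketch (not from the text) that evaluates both expressions with the assumed miss rates and penalty.

#include <stdio.h>

int main(void)
{
    const double instr_frac = 0.74, data_frac = 0.26;   /* fractions of memory accesses */
    const double hit_time = 1.0, miss_penalty = 100.0;

    /* Split: 16 KB I-cache (0.4% misses) plus 16 KB D-cache (11.4% misses). */
    double amat_split = instr_frac * (hit_time + 0.004 * miss_penalty)
                      + data_frac  * (hit_time + 0.114 * miss_penalty);

    /* Unified 32 KB cache (3.18% misses); loads/stores pay 1 extra cycle for the single port. */
    double amat_unified = instr_frac * (hit_time + 0.0318 * miss_penalty)
                        + data_frac  * (hit_time + 1.0 + 0.0318 * miss_penalty);

    printf("split = %.2f cycles, unified = %.2f cycles\n", amat_split, amat_unified);
    return 0;
}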

Impact of Memory Access on CPU Performance
Example: Suppose a processor with:
– Ideal CPI = 1.0 (ignoring memory stalls)
– Average miss rate of 2%
– Average memory references per instruction of 1.5
– Miss penalty of 100 cycles
What is the impact on performance when the behavior of the cache is included?

Answer:
CPI = CPU execution cycles per instr. + Memory stall cycles per instr.
    = CPI execution + Miss rate x Memory accesses per instr. x Miss penalty
CPI with cache = 1.0 + 2% x 1.5 x 100 = 4
CPI without any cache = 1.0 + 1.5 x 100 = 151
CPU time with cache = IC x CPI x Clock cycle time = IC x 4.0 x Clock cycle time
CPU time without cache = IC x 151 x Clock cycle time
• Without any cache, the CPI of the processor increases from 1 to 151!
• Even with the cache, 75% of the time the processor is stalled waiting for memory (CPI goes from 1 to 4)

Impact of Cache Organization on CPU Performance
Example: What is the impact of two different cache organizations (direct mapped vs. 2-way set associative) on the performance of a CPU?
– Ideal CPI = 2.0 (ignoring memory stalls)
– Clock cycle time is 1.0 ns
– Average memory references per instruction: 1.5
– Cache size: 64 KB, block size: 64 bytes
– For the set-associative cache, assume the clock cycle time is stretched 1.25 times to accommodate the selection multiplexer
– Cache miss penalty is 75 ns
– Hit time is 1 clock cycle
– Miss rate: direct mapped 1.4%, 2-way set associative 1.0%

Answer:
• Average memory access time (1-way) = 1.0 + (0.014 x 75) = 2.05 ns
  Average memory access time (2-way) = 1.0 x 1.25 + (0.010 x 75) = 2.00 ns
• CPU time (1-way) = IC x (CPI execution + Miss rate x Memory accesses per instruction x Miss penalty) x Clock cycle time
  = IC x (2.0 x 1.0 + (1.5 x 0.014 x 75)) = 3.58 x IC
  CPU time (2-way) = IC x (2.0 x 1.0 x 1.25 + (1.5 x 0.010 x 75)) = 3.63 x IC
Although the 2-way cache has the lower average memory access time, the direct-mapped cache gives slightly better CPU time because of the stretched clock cycle.

Summary of Performance Equations
(Figure: a table summarizing the cache performance equations used in this chapter.)

Improving Cache Performance
The next few sections in the textbook look at ways to improve cache and memory access times.
Average Memory Access Time = Hit Time (Section 5.7) + Miss Rate (Section 5.5) x Miss Penalty (Section 5.4)
CPU Time = IC x (CPI Execution + Memory Accesses per Instruction x Miss Rate x Miss Penalty) x Clock Cycle Time

Reducing Cache Miss Penalty
The time to handle a miss is becoming more and more the controlling factor, because of the great improvement in the speed of processors compared to the speed of memory.
Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty
Five optimizations:
1. Multilevel caches
2. Critical word first and early restart
3. Giving priority to read misses over writes
4. Merging write buffer
5. Victim caches

Multilevel Caches
• Approaches:
– Make the cache faster to keep pace with the speed of CPUs
– Make the cache larger to overcome the widening gap
L1: fast hits; L2: fewer misses.
• L2 equations:
Average Memory Access Time = Hit Time(L1) + Miss Rate(L1) x Miss Penalty(L1)
Miss Penalty(L1) = Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2)
Average Memory Access Time = Hit Time(L1) + Miss Rate(L1) x (Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2))
Hit Time(L1) << Hit Time(L2) << ... << Hit Time(Mem)
Miss Rate(L1) < Miss Rate(L2) < ...
• Definitions:
– Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate(L1), Miss Rate(L2))
• The L1 cache skims the cream of the memory accesses
– Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate(L1), and Miss Rate(L1) x Miss Rate(L2))
• Indicates what fraction of the memory accesses that leave the CPU go all the way to memory
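The two-level equations can be checked numerically; the sketch below (with assumed example values, not from the text) computes the overall average memory access time and the global L2 miss rate.

#include <stdio.h>

int main(void)
{
    /* Assumed example parameters, in clock cycles. */
    double hit_l1 = 1.0,  miss_rate_l1 = 0.04;    /* 4% of CPU accesses miss in L1         */
    double hit_l2 = 10.0, miss_rate_l2 = 0.25;    /* local L2 miss rate: 25% of L1 misses  */
    double penalty_l2 = 100.0;                    /* cost of going to main memory          */

    double miss_penalty_l1 = hit_l2 + miss_rate_l2 * penalty_l2;
    double amat = hit_l1 + miss_rate_l1 * miss_penalty_l1;
    double global_l2_miss_rate = miss_rate_l1 * miss_rate_l2;   /* fraction reaching memory */

    printf("AMAT = %.2f cycles, global L2 miss rate = %.1f%%\n",
           amat, 100.0 * global_l2_miss_rate);
    return 0;
}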

Design of L2 Cache
• Size
– Since everything in the L1 cache is likely to be in the L2 cache, the L2 cache should be much bigger than L1
• Whether data in L1 is also in L2
– Novice approach: design L1 and L2 independently
– Multilevel inclusion: L1 data are always present in L2
• Advantage: easy consistency between I/O and the caches (only L2 need be checked)
• Drawback: L2 must invalidate all L1 blocks that map onto a second-level block being replaced, giving a slightly higher first-level miss rate
• e.g., Intel Pentium 4: 64-byte blocks in L1 and 128-byte blocks in L2
– Multilevel exclusion: L1 data is never found in L2
• A cache miss in L1 results in a swap of blocks between L1 and L2
• Advantage: prevents wasting space in L2
• e.g., AMD Athlon: 64 KB L1 and 256 KB L2

O2: Critical Word First and Early Restart
Don't wait for the full block to be loaded before restarting the CPU.
• Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled. Also called wrapped fetch or requested word first.
• Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
– Given spatial locality, the CPU tends to want the next sequential word, so it is not clear how much early restart helps.
Generally useful only with large blocks.

O3: Giving Priority to Read Misses over Writes
• Serve reads before outstanding writes have completed
• Write through with write buffers:
SW R3, 512(R0)    ; M[512] <- R3     (cache index 0)
LW R1, 1024(R0)   ; R1 <- M[1024]    (cache index 0)
LW R2, 512(R0)    ; R2 <- M[512]     (cache index 0)
Problem: write through with write buffers creates RAW conflicts between main memory reads on cache misses and writes still sitting in the buffer.
– Simply waiting for the write buffer to empty can increase the read miss penalty (by about 50% on the old MIPS M/1000)
– Instead, check the write buffer contents before the read; if there is no conflict, let the memory access continue
• Write back: suppose a read miss will replace a dirty block
– Normal: write the dirty block to memory, and then do the read
– Instead: copy the dirty block to a write buffer, do the read, and then do the write
– The CPU stalls less since it restarts as soon as the read is done
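A small sketch of the conflict check described above (the buffer layout and names are assumed for illustration): before a read miss goes to memory, the write buffer is searched for a pending write to the same address, and the buffered value is forwarded if one is found.

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4

struct wb_slot {
    bool     valid;
    uint32_t addr;
    uint32_t data;
};

static struct wb_slot write_buffer[WB_ENTRIES];

/* On a read miss, check the write buffer first.  If the address matches a
   pending write, forward that data instead of stalling until the buffer drains. */
bool read_check_write_buffer(uint32_t addr, uint32_t *out)
{
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].addr == addr) {
            *out = write_buffer[i].data;   /* RAW conflict resolved by forwarding */
            return true;
        }
    }
    return false;                          /* no conflict: the read may proceed to memory */
}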

O4: Merging Write Buffer
• If a write buffer entry is empty, the data and the full address are written into the buffer, and the write is finished from the CPU's perspective
– Usually a write buffer entry holds multiple words
• Write merging: the addresses of the write buffer entries are checked to see whether the address of the new data matches the address of a valid entry; if so, the new data are combined with that entry
Example: a write buffer with 4 entries, each able to hold four 64-bit words. Without merging, four sequential writes occupy four entries (one word each); with merging, the four writes are combined into a single entry.
• Writing multiple words at the same time is faster than writing them one at a time
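A minimal sketch of the merging check (the data structures and sizes are assumed for illustration): a new write is merged into an existing valid entry when it falls within that entry's aligned address range, and otherwise takes a free entry.

#include <stdbool.h>
#include <stdint.h>

#define NUM_ENTRIES     4
#define WORDS_PER_ENTRY 4        /* each entry holds four 64-bit words (32 bytes) */

struct wb_entry {
    bool     valid;
    uint64_t base_addr;                    /* aligned to a 32-byte entry */
    bool     word_valid[WORDS_PER_ENTRY];
    uint64_t data[WORDS_PER_ENTRY];
};

static struct wb_entry write_buffer[NUM_ENTRIES];

/* Buffer one 64-bit write; returns true if it was merged or newly buffered. */
bool buffer_write(uint64_t addr, uint64_t value)
{
    uint64_t base = addr & ~(uint64_t)(WORDS_PER_ENTRY * 8 - 1);
    unsigned word = (addr >> 3) & (WORDS_PER_ENTRY - 1);

    for (int i = 0; i < NUM_ENTRIES; i++) {           /* merge into a matching entry */
        if (write_buffer[i].valid && write_buffer[i].base_addr == base) {
            write_buffer[i].data[word] = value;
            write_buffer[i].word_valid[word] = true;
            return true;
        }
    }
    for (int i = 0; i < NUM_ENTRIES; i++) {           /* otherwise allocate a free entry */
        if (!write_buffer[i].valid) {
            write_buffer[i].valid = true;
            write_buffer[i].base_addr = base;
            write_buffer[i].data[word] = value;
            write_buffer[i].word_valid[word] = true;
            return true;
        }
    }
    return false;                                     /* buffer full: the CPU would stall */
}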

O5: Victim Caches
The idea of recycling: remember what was most recently discarded on a cache miss in case it is needed again, rather than simply discarding it or swapping it into L2.
Victim cache: a small, fully associative cache between a cache and its refill path
– Contains only blocks that were discarded from the cache because of a miss ("victims")
– Checked on a miss before going to the next lower-level memory
– Victim caches of 1 to 5 entries are effective at reducing misses, especially for small, direct-mapped data caches
– The AMD Athlon uses a victim cache with 8 entries

Reducing Miss Rate: The 3 C's of Cache Misses
• Compulsory: the very first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (These are the misses that would occur even in an infinite cache.)
• Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses occur because blocks are discarded and later retrieved. (Misses in a fully associative cache of size X.)
• Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) occur because a block can be discarded and later retrieved when too many blocks map to its set. Also called collision misses or interference misses. (Misses that occur in an N-way associative cache but would hit in a fully associative cache of size X.)

3 C's of Cache Misses: 3Cs Absolute Miss Rate (SPEC92)
(Figure: miss rate per type vs. cache size from 1 KB to 128 KB for 1-way, 2-way, 4-way, and 8-way associativity; conflict misses shrink with higher associativity, capacity misses shrink with larger caches, and compulsory misses are vanishingly small.)
2:1 Cache Rule: the miss rate of a 1-way associative cache of size X is about the same as that of a 2-way associative cache of size X/2.

3Cs Relative Miss Rate
(Figure: the same data shown as a percentage of all misses at each cache size, for 1-way through 8-way associativity; the conflict fraction shrinks as associativity grows.)
Flaw: this view assumes a fixed block size. Good: the insight leads to invention.

Five Techniques to Reduce Miss Rate
1. Larger block size
2. Larger caches
3. Higher associativity
4. Way prediction and pseudoassociative caches
5. Compiler optimizations

O1: Larger Block Size
(Figure: miss rate vs. block size of 16, 32, 64, 128, and 256 bytes for cache sizes of 1 KB, 4 KB, 16 KB, 64 KB, and 256 KB.)
• Takes advantage of spatial locality: the larger the block, the greater the chance that parts of it will be used again
• The number of blocks is reduced for a cache of the same total size, which increases the miss penalty
• It may increase conflict misses, and even capacity misses if the cache is small
• Usually, high memory latency and high bandwidth encourage large block sizes

O2: Larger Caches
• Increasing the capacity of the cache reduces capacity misses (see Figures 5.14 and 5.15)
• Drawbacks: possibly longer hit time and higher cost
• Trend: larger L2 or L3 off-chip caches

O3: Higher Associativity
• Figures 5.14 and 5.15 show how miss rates improve with higher associativity
– 8-way set associative is as effective as fully associative for practical purposes
– 2:1 Cache Rule: miss rate of a direct-mapped cache of size N ≈ miss rate of a 2-way set-associative cache of size N/2
• Tradeoff: a more associative cache complicates the circuit
– It may lengthen the clock cycle
• Beware: execution time is the only final measure!
– Will clock cycle time increase as a result of having a more complicated cache?
– Hill [1988] suggested the hit time for 2-way vs. 1-way is about +10% for an external cache and +2% for an internal cache

O4: Way Prediction and Pseudoassociative Caches
Way prediction: extra bits are kept in the cache to predict the way, or block within the set, of the next cache access.
Example: the 2-way I-cache of the Alpha 21264
– If the predictor is correct, the I-cache latency is 1 clock cycle
– If incorrect, it tries the other block, changes the way predictor, and has a latency of 3 clock cycles
– With prediction accuracy in excess of 85%, this reduces conflict misses while maintaining the hit speed of a direct-mapped cache
Pseudoassociative (or column associative) caches: one fast hit and one slow hit
– On a miss, a second cache entry is checked before going to the next lower level
– Invert the most significant bit of the index to find the other block in the "pseudoset"
– The miss penalty may become slightly longer

O5: Compiler Optimizations
Improve the hit rate through compile-time optimization.
• Reordering instructions using profiling information (McFarling [1989])
– Reduced misses by 50% for a 2 KB direct-mapped I-cache with 4-byte blocks, and by 75% for an 8 KB cache
– Best performance was obtained when it was possible to prevent some instructions from entering the cache
• Aligning basic blocks so the entry point is at the beginning of a cache block
– Decreases the chance of a cache miss for sequential code
• Loop interchange: exchanging the nesting of loops
– Improves spatial locality and thus reduces misses
– Makes data be accessed in order, maximizing use of the data in a cache block before it is discarded

/* Before: column order; skips through memory in strides of 100 words */
for (j = 0; j < 100; j = j + 1)
    for (i = 0; i < 5000; i = i + 1)
        x[i][j] = 2 * x[i][j];

/* After: row order; accesses all words in a cache block before moving on */
for (i = 0; i < 5000; i = i + 1)
    for (j = 0; j < 100; j = j + 1)
        x[i][j] = 2 * x[i][j];

Blocking: operating on submatrices or blocks
– Maximizes accesses to the data loaded into the cache before it is replaced
– Improves temporal locality

/* Before: X = Y*Z as a straightforward triple loop */
for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1) {
        r = 0;
        for (k = 0; k < N; k = k + 1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

/* After: B = blocking factor */
for (jj = 0; jj < N; jj = jj + B)
    for (kk = 0; kk < N; kk = kk + B)
        for (i = 0; i < N; i = i + 1)
            for (j = jj; j < min(jj + B, N); j = j + 1) {
                r = 0;
                for (k = kk; k < min(kk + B, N); k = k + 1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

• The number of capacity misses depends on N and the cache size
• The total number of memory words accessed is 2N^3/B + N^2
• y benefits from spatial locality; z benefits from temporal locality

5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism
Three techniques that overlap the execution of instructions with memory accesses:
1. Nonblocking caches to reduce stalls on cache misses (to match out-of-order processors)
2. Hardware prefetching of instructions and data
3. Compiler-controlled prefetching

O1: Nonblocking Caches to Reduce Stalls on Cache Misses
For pipelined computers that allow out-of-order completion, the CPU need not stall on a cache miss.
• Separate I-cache and D-cache
• Nonblocking (lockup-free) cache:
– Continue fetching instructions from the I-cache while waiting for the D-cache to return the missing data
– "Hit under miss": the D-cache continues to supply cache hits during a miss
– "Hit under multiple miss" or "miss under miss": overlap multiple outstanding misses
Ratio of the average memory stall time of a blocking cache to that of hit-under-miss schemes:
• For the first 14 (floating-point) programs, the averages are 76% for 1 outstanding miss, 51% for 2, and 39% for 64
• For the final 4 (integer) programs, the averages are 81%, 78%, and 78%

O2: Hardware Prefetching of Instructions and Data
Prefetch instructions or data before they are requested by the CPU, either directly into the caches or into an external buffer (faster than accessing main memory).
• Instruction prefetch is frequently done in hardware outside the cache:
– Fetch two blocks on a miss: the requested block is placed in the I-cache when it returns, and the prefetched block is placed in an instruction stream buffer (ISB)
– A single ISB would catch 15% to 25% of the misses from a 4 KB direct-mapped I-cache with 16-byte blocks; 4 ISBs increased the data hit rate to 43% (Jouppi 1990)
• UltraSPARC III: data prefetch
– If a load hits in the prefetch cache, the block is read from the prefetch cache and the next prefetch request is issued, calculating the "stride" of the next prefetched block from the difference between the current address and the previous address
– Up to 8 simultaneous prefetches
Prefetching may interfere with demand misses and so can lower performance.
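The UltraSPARC-style stride calculation can be sketched in a few lines of C; the structure, field names, and the stubbed issue_prefetch helper below are assumptions for illustration, not the actual hardware design.

#include <stdint.h>

/* One entry of a simple stride prefetcher: remembers the previous address
   and predicts the next block as current + (current - previous). */
struct stride_entry {
    uint64_t prev_addr;
    int64_t  stride;
    int      valid;
};

/* Stub for the memory system hook; a real design would enqueue a cache fill. */
static void issue_prefetch(uint64_t addr)
{
    (void)addr;
}

void on_demand_access(struct stride_entry *e, uint64_t addr)
{
    if (e->valid) {
        e->stride = (int64_t)addr - (int64_t)e->prev_addr;   /* stride = address difference      */
        if (e->stride != 0)
            issue_prefetch(addr + (uint64_t)e->stride);      /* prefetch the predicted next block */
    }
    e->prev_addr = addr;
    e->valid = 1;
}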

O3: Compiler-Controlled Prefetching
• The compiler inserts prefetch instructions:
– Register prefetch: load the value into a register
– Cache prefetch: load the data only into the cache (not a register)
• Faulting vs. nonfaulting: the address does or does not cause an exception for virtual address faults and protection violations
– A normal load instruction is a faulting register prefetch instruction
• The most effective prefetch is "semantically invisible" to a program:
– it does not change the contents of registers and memory, and
– it cannot cause virtual memory faults
• Nonbinding prefetch: a nonfaulting cache prefetch
– Overlapping execution: the CPU proceeds while the prefetched data are being fetched
– Advantage: the compiler can avoid issuing prefetches that would be unnecessary, unlike blind hardware prefetching
– Drawback: prefetch instructions incur instruction overhead
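As a concrete illustration (not from the text), GCC and Clang expose nonbinding cache prefetches through __builtin_prefetch; the sketch below prefetches array elements a fixed distance ahead of the loop. The prefetch distance of 16 is an assumed tuning value.

#include <stddef.h>

#define PREFETCH_AHEAD 16

/* Sum an array while prefetching elements PREFETCH_AHEAD iterations in advance.
   __builtin_prefetch is a nonfaulting, nonbinding hint: it never traps and may be ignored. */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 3);  /* read access, high temporal locality */
        sum += a[i];
    }
    return sum;
}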

5.7 Reducing Hit Time
• Importance of cache hit time:
– Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty
– More importantly, cache access time limits the clock cycle rate in many processors today!
• A fast hit time means:
– quickly and efficiently finding out whether the data is in the cache, and
– if it is, getting that data out of the cache
• Four techniques:
1. Small and simple caches
2. Avoiding address translation during indexing of the cache
3. Pipelined cache access
4. Trace caches

O1: Small and Simple Caches
• A time-consuming portion of a cache hit is using the index portion of the address to read the tag memory and then compare it to the address
• Guideline: smaller hardware is faster
– Why does the Alpha 21164 have only an 8 KB instruction cache and an 8 KB data cache, plus a 96 KB second-level cache, on chip? A small data cache allows a fast clock rate
• Guideline: simpler hardware is faster
– Direct mapped, on chip
• General design:
– a small and simple first-level cache
– for second-level caches, keep the tags on chip and the data off chip
The recent emphasis is on fast clock time while hiding L1 misses with dynamic execution and using L2 caches to avoid going to memory.

O2: Avoiding Address Translation During Cache Indexing
• Two tasks: indexing the cache and comparing addresses
• Virtually vs. physically addressed caches:
– Virtual cache: uses the virtual address (VA) for the cache
– Physical cache: uses the physical address (PA) obtained after translating the virtual address
• Challenges for virtual caches:
1. Protection: page-level protection (read/write, read-only, invalid) must be checked
– It is normally checked as part of the virtual-to-physical address translation
– Solution: an additional field copies the protection information from the TLB and is checked on every access to the cache
2. Context switching: the same VA in different processes refers to different PAs, requiring the cache to be flushed
– Solution: widen the cache address tag with a process-identifier tag (PID)
3. Synonyms or aliases: two different VAs for the same PA
– Inconsistency problem: two copies of the same data in a virtual cache
– Hardware antialiasing solution: guarantee every cache block a unique PA; the Alpha 21264 checks all possible locations and invalidates a duplicate if one is found
– Software page-coloring solution: force aliases to share some address bits; in Sun's Solaris all aliases must be identical in the last 18 bits, so no duplicate PAs can occur
4. I/O: I/O typically uses physical addresses, so it needs to interact with the cache (see Section 5.12)

(Figure: three cache organizations. Conventional organization: the CPU's virtual address goes through the TLB to produce a physical address, which then indexes a physically addressed cache before memory. Virtually addressed cache: the cache is indexed and tagged with virtual addresses and translation occurs only on a miss, which raises the synonym problem. Virtually indexed, physically tagged cache: cache access is overlapped with address translation, which requires the cache index to remain invariant across translation.)

O3: Pipelined Cache Access
Simply pipeline the cache access, so that a first-level cache hit takes multiple clock cycles.
• Advantage: fast cycle time (at the cost of a slow hit measured in cycles)
Example: clock cycles to access instructions from the I-cache
– Pentium: 1 clock cycle
– Pentium Pro through Pentium III: 2 clocks
– Pentium 4: 4 clocks
• Drawback: increasing the number of pipeline stages leads to
– a greater penalty on mispredicted branches, and
– more clock cycles between the issue of a load and the use of its data
Note that this increases the bandwidth of instructions rather than decreasing the actual latency of a cache hit.

O4: Trace Caches
A trace cache for instructions finds a dynamic sequence of instructions, including taken branches, to load into a cache block.
– The cache blocks contain dynamic traces of executed instructions as determined by the CPU, rather than static sequences of instructions as determined by memory layout
– Branch prediction is folded into the cache: the predictions are validated along with the addresses to give a valid fetch
– e.g., the Intel NetBurst microarchitecture
• Advantage: better utilization
– Trace caches store instructions only from the branch entry point to the exit of the trace
– In a conventional I-cache, the unused part of a long block that is entered or exited by a taken branch may be fetched but never used
• Downside: the same instructions may be stored multiple times
