Chapter 5A: Exploiting the Memory Hierarchy, Part 1

Mary Jane Irwin ( www.cse.psu.edu/~mji )

[Adapted from Computer Organization and Design, 4th Edition, Patterson & Hennessy, © 2008 MK]


Review: Major Components of a Computer

[Block diagram: the processor (control plus datapath), memory, and input/output devices; the memory side expands into the hierarchy of cache, main memory, and secondary memory (disk).]


Processor-Memory Performance Gap
[Plot: relative performance (log scale, 1 to 10000) versus year, 1980-2004. Processor performance grows at about 55%/year (2X every 1.5 years, "Moore's Law") while DRAM performance grows at about 7%/year (2X every 10 years), so the processor-memory performance gap grows at roughly 50%/year.]

The “Memory Wall”

- The processor vs. DRAM speed disparity continues to grow. [Plot: clock cycles per DRAM access and clocks per instruction, from core memory on the VAX/1980 (~0.01) through the Pentium Pro/1996 toward 2010+ (~1000 clocks per DRAM access).]
- Good memory hierarchy (cache) design is increasingly important to overall performance.

The Memory Hierarchy Goal

- Fact: large memories are slow, and fast memories are small.
- How do we create a memory that gives the illusion of being large, cheap, and fast (most of the time)?
  - With hierarchy
  - With parallelism

A Typical Memory Hierarchy

- Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.
- On-chip components: control, datapath with register file, ITLB and DTLB, and first-level instruction and data caches; off chip: a second-level cache (SRAM), main memory (DRAM), and secondary memory (disk).

  Level                     Speed (cycles)   Size (bytes)    Cost
  Register file             1/2's            100's           highest
  L1 / L2 caches (SRAM)     1's to 10's      10K's to M's      |
  Main memory (DRAM)        100's            G's               |
  Secondary memory (disk)   10,000's         T's             lowest

Memory Hierarchy Technologies

- Caches use SRAM for speed and technology compatibility:
  - Fast (typical access times of 0.5 to 2.5 nsec)
  - Low density (6-transistor cells), higher power, expensive ($2000 to $5000 per GB in 2008)
  - Static: content will last “forever” (as long as power is left on)
- Main memory uses DRAM for size (density):
  - Slower (typical access times of 50 to 70 nsec)
  - High density (1-transistor cells), lower power, cheaper ($20 to $75 per GB in 2008)
  - Dynamic: needs to be “refreshed” regularly (~every 8 ms); refreshing consumes 1% to 2% of the DRAM's active cycles
  - Addresses are divided into 2 halves (row and column): RAS (Row Access Strobe) triggers the row decoder, CAS (Column Access Strobe) triggers the column selector

The Memory Hierarchy: Why Does It Work?

- Temporal locality (locality in time): if a memory location is referenced, it will tend to be referenced again soon ⇒ keep the most recently accessed data items closer to the processor.
- Spatial locality (locality in space): if a memory location is referenced, locations with nearby addresses will tend to be referenced soon ⇒ move blocks consisting of contiguous words closer to the processor.
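As a quick illustration (an addition, not from the original slides), the C loop below shows both kinds of locality: sum is reused on every iteration (temporal locality) and consecutive elements of a fall in the same cache block (spatial locality). The function and array names are arbitrary.

#include <stddef.h>

/* Sketch: a sequential traversal exhibits both kinds of locality.
 * 'sum' is reused on every iteration (temporal locality);
 * a[i] and a[i+1] usually share a cache block (spatial locality). */
long sum_array(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];          /* consecutive addresses -> few misses per block */
    return sum;
}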

The Memory Hierarchy: Terminology

- Block (or line): the minimum unit of information that is present (or not) in a cache.
- Hit rate: the fraction of memory accesses found in a level of the memory hierarchy.
  - Hit time: the time to access that level, which consists of the time to access the block plus the time to determine hit/miss.
- Miss rate: the fraction of memory accesses not found in a level of the memory hierarchy = 1 − (hit rate).
  - Miss penalty: the time to replace a block in that level with the corresponding block from a lower level, which consists of the time to access the block in the lower level + the time to transmit that block to the level that experienced the miss + the time to insert the block in that level + the time to pass the block to the requestor.
- Hit time << miss penalty.

Characteristics of the Memory Hierarchy

- Transfer units grow with distance from the processor: processor ↔ L1$ moves 4-8 bytes (a word); L1$ ↔ L2$ moves 8-32 bytes (a block); L2$ ↔ main memory moves 1 to 4 blocks; main memory ↔ secondary memory moves 1,024+ bytes (a disk sector = a page).
- Both the access time and the (relative) size of the memory at each level increase with distance from the processor.
- The hierarchy is inclusive: what is in the L1$ is a subset of what is in the L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory.

How Is the Hierarchy Managed?

- registers ↔ memory: by the compiler (and the programmer?)
- cache ↔ main memory: by the cache controller hardware
- main memory ↔ disks: by the operating system (virtual memory), with virtual-to-physical address mapping assisted by the hardware (TLB), and by the programmer (files)

Cache Basics

- Two questions to answer (in hardware):
  - Q1: How do we know if a data item is in the cache?
  - Q2: If it is, how do we find it?
- Direct mapped: each memory block is mapped to exactly one block in the cache, so many lower-level blocks must share a block in the cache.
- Address mapping (to answer Q2): cache index = (block address) modulo (# of blocks in the cache).
- A tag associated with each cache block holds the address information (the upper portion of the address) required to identify the block (to answer Q1).
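A minimal sketch of the address split described above, assuming a hypothetical direct-mapped cache with 1024 one-word (4-byte) blocks and 32-bit byte addresses; the constants and function name are illustrative, not part of the slides.

#include <stdint.h>

#define BLOCKS      1024u          /* assumed number of cache blocks (2^10) */
#define INDEX_BITS  10u
#define OFFSET_BITS 2u             /* one 4-byte word per block */

/* Split a byte address into the fields used to answer Q1 and Q2. */
static inline void split_address(uint32_t addr,
                                 uint32_t *tag, uint32_t *index, uint32_t *offset) {
    *offset = addr & ((1u << OFFSET_BITS) - 1);                 /* byte within the word        */
    *index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); /* (block address) mod BLOCKS  */
    *tag    = addr >> (OFFSET_BITS + INDEX_BITS);               /* compared against stored tag */
}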

Caching: A Simple First Example

- A four-block direct mapped cache (indices 00, 01, 10, 11), each entry holding a valid bit, a tag, and one word of data; main memory holds blocks 0000xx through 1111xx. With one-word blocks, the two low-order address bits define the byte within the word (32-bit words).
- Q2: How do we find it? Use the next 2 low-order memory address bits – the index – to determine which cache block (i.e., block address modulo the number of blocks in the cache).
- Q1: Is it there? Compare the cache tag to the high-order 2 memory address bits to tell if the memory block is in the cache.


Direct Mapped Cache

- Consider the main memory word reference string 0 1 2 3 4 3 4 15, starting with an empty cache (all blocks initially marked as not valid).
- Trace for the four-block, one-word-per-block cache: 0 miss, 1 miss, 2 miss, 3 miss, 4 miss (replaces 0 in block 00), 3 hit, 4 hit, 15 miss (replaces 3 in block 11).
- Result: 8 requests, 6 misses.

MIPS Direct Mapped Cache Example

- One-word blocks, cache size = 1K words (or 4KB).
- The 32-bit address splits into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset (bits 1-0). [Figure: the index selects one of 1024 entries; the stored tag is compared with the address tag and combined with the valid bit to produce Hit, while the 32-bit data word is read out in parallel.]
- What kind of locality are we taking advantage of?

Multiword Block Direct Mapped Cache

- Four words per block, cache size = 1K words: the address splits into a 20-bit tag, an 8-bit index (256 entries), a 2-bit block offset that selects the word within the block, and a 2-bit byte offset. [Figure: the block offset drives a 4-to-1 multiplexor over the selected entry's four data words.]
- What kind of locality are we taking advantage of?

Taking Advantage of Spatial Locality

- Let the cache block hold more than one word, and consider the same reference string 0 1 2 3 4 3 4 15, starting with an empty cache (all blocks initially marked as not valid).
- Trace with two-word blocks: 0 miss, 1 hit, 2 miss, 3 hit, 4 miss (replaces the 0-1 block), 3 hit, 4 hit, 15 miss (replaces the 2-3 block).
- Result: 8 requests, 4 misses.

Miss Rate vs Block Size vs Cache Size

- [Plot: miss rate (%) versus block size (16 to 256 bytes) for 8 KB, 16 KB, 64 KB, and 256 KB caches.]
- Miss rate goes up if the block size becomes a significant fraction of the cache size, because the number of blocks that can be held in the same size cache is smaller (increasing capacity misses).

Cache Field Sizes

- The number of bits in a cache includes both the storage for data and for the tags.
- For a 32-bit byte address and a direct mapped cache with 2^n blocks, n bits are used for the index.
- For a block size of 2^m words (2^(m+2) bytes), m bits are used to address the word within the block and 2 bits are used to address the byte within the word.
- What is the size of the tag field? 32 − (n + m + 2) bits.
- The total number of bits in a direct-mapped cache is then 2^n × (block size + tag field size + valid field size).
- How many total bits are required for a direct mapped cache with 16KB of data and 4-word blocks, assuming a 32-bit address? (A worked answer follows below.)
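One way to work the closing question (the slide leaves it as an exercise): 16 KB of data is 2^12 words; with 4-word blocks that is 2^10 blocks, so n = 10 and m = 2. The tag is therefore 32 − 10 − 2 − 2 = 18 bits, and the total is 2^10 × (4 × 32 + 18 + 1) = 2^10 × 147 = 147 Kibits ≈ 18.4 KB — about 1.15 times the size of the data alone.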

Handling Cache Hits

- Read hits (I$ and D$): this is what we want!
- Write hits (D$ only):
  - Require the cache and memory to be consistent: always write the data into both the cache block and the next level in the memory hierarchy (write-through). Writes then run at the speed of the next level in the hierarchy – so slow! – or a write buffer can be used, stalling only if the write buffer is full.
  - Or allow the cache and memory to be inconsistent: write the data only into the cache block, and write the cache block back to the next level in the memory hierarchy when that block is “evicted” (write-back). This needs a dirty bit for each data cache block to tell whether it must be written back to memory when it is evicted; a write buffer can help “buffer” write-backs of dirty blocks.

Sources of Cache Misses

- Compulsory (cold start or process migration; first reference): the first access to a block. A “cold” fact of life; not a whole lot you can do about it, and if you are going to run “millions” of instructions, compulsory misses are insignificant. Solution: increase the block size (which increases the miss penalty; very large blocks could increase the miss rate).
- Capacity: the cache cannot contain all the blocks accessed by the program. Solution: increase the cache size (which may increase the access time).
- Conflict (collision): multiple memory locations map to the same cache location. Solution 1: increase the cache size. Solution 2: increase associativity (stay tuned), which may increase the access time.

Handling Cache Misses (Single Word Blocks)

- Read misses (I$ and D$): stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache, send the requested word to the processor, then let the pipeline resume.
- Write misses (D$ only), three options (a sketch of option 1 follows below):
  1. Stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may involve evicting a dirty block if using a write-back cache), write the word from the processor into the cache, then let the pipeline resume.
  2. Write allocate – just write the word into the cache, updating both the tag and the data; no need to check for a cache hit, no need to stall.
  3. No-write allocate – skip the cache write (but invalidate that cache block, since it would now hold stale data) and just write the word to the write buffer (and eventually to the next memory level); no need to stall if the write buffer isn't full.
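The fragment below is a simulator-style sketch (software, not hardware, and not from the slides) of option 1 above for a write-back, write-allocate data cache with 4-word blocks: a dirty victim is written back, the missing block is fetched, and only then is the word written. The memory helper functions are hypothetical stand-ins for the next level of the hierarchy.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     valid, dirty;
    uint32_t tag;
    uint32_t data[4];          /* 4-word block, as in the multiword examples */
} Line;

/* Hypothetical helpers standing in for the next memory level. */
void mem_write_block(uint32_t tag, uint32_t index, const uint32_t *data);
void mem_read_block(uint32_t tag, uint32_t index, uint32_t *data);

/* Write one word using write-allocate + write-back (option 1 above). */
void cache_write(Line *cache, uint32_t tag, uint32_t index, uint32_t word, uint32_t value) {
    Line *line = &cache[index];
    if (!line->valid || line->tag != tag) {                /* write miss */
        if (line->valid && line->dirty)                    /* old block is dirty:        */
            mem_write_block(line->tag, index, line->data); /* write it back first        */
        mem_read_block(tag, index, line->data);            /* allocate: fetch new block  */
        line->tag   = tag;
        line->valid = true;
        line->dirty = false;
    }
    line->data[word] = value;                              /* write the word into the cache */
    line->dirty = true;                                    /* block now differs from memory */
}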

Multiword Block Considerations

- Read misses (I$ and D$): processed the same as for single-word blocks – a miss returns the entire block from memory. The miss penalty grows as the block size grows; to limit it:
  - Early restart – the processor resumes execution as soon as the requested word of the block is returned.
  - Requested word first – the requested word is transferred from memory to the cache (and processor) first.
  - A nonblocking cache allows the processor to continue to access the cache while the cache is handling an earlier miss.
- Write misses (D$): if using write allocate, the block must first be fetched from memory and then the word written into it – otherwise the cache could end up with a “garbled” block (e.g., for 4-word blocks: a new tag, one word of data from the new block, and three words of data from the old block).

Memory Systems that Support Caches

- The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways.
- Baseline: a one-word-wide organization (one-word-wide bus and one-word-wide memory), with the cache on chip with the CPU and 32-bit data & 32-bit address transferred per cycle to DRAM. Assume:
  - 1 memory bus clock cycle to send the address
  - 15 memory bus clock cycles to get the 1st word in the block from DRAM (row cycle time)
  - 5 memory bus clock cycles for each of the 2nd, 3rd, and 4th words (column access time)
  - 1 memory bus clock cycle to return a word of data
- Memory-bus-to-cache bandwidth = the number of bytes accessed from memory and transferred to the cache/CPU per memory bus clock cycle.

Review: (DDR) SDRAM Operation

- After a row is read into the SRAM row register, input CAS as the starting “burst” address along with a burst length.
- The part then transfers a burst of data (ideally a cache block) from a series of sequential addresses within that row; the memory bus clock controls the transfer of successive words in the burst.
- [Timing diagram: RAS with the row address, then CAS with the column address, followed by the 1st through 4th M-bit data transfers; the array is N rows × N columns × M bit planes with an N × M SRAM row buffer.]

DRAM Size Increase

- [Placeholder: add a table like Figure 5.12 to show DRAM growth since 1980.]

One Word Wide Bus, One Word Blocks

- If the block size is one word, then for a memory access due to a cache miss the pipeline stalls for:
  1 memory bus clock cycle to send the address
  + 15 memory bus clock cycles to read the DRAM
  + 1 memory bus clock cycle to return the data
  = 17 memory bus clock cycles of miss penalty.
- The number of bytes transferred per memory bus clock cycle (bandwidth) for a single miss is 4/17 ≈ 0.235.

One Word Wide Bus, Four Word Blocks

- What if the block size is four words and each word is in a different DRAM row?
  1 cycle to send the 1st address + 4 × 15 = 60 cycles to read the DRAM + 1 cycle to return the last data word = 62 clock cycles of miss penalty.
  Bandwidth for a single miss = (4 × 4)/62 ≈ 0.258 bytes per clock.
- What if the block size is four words and all words are in the same DRAM row?
  1 cycle to send the 1st address + 15 + 3 × 5 = 30 cycles to read the DRAM + 1 cycle to return the last data word = 32 clock cycles of miss penalty.
  Bandwidth for a single miss = (4 × 4)/32 = 0.5 bytes per clock.

Interleaved Memory, One Word Wide Bus

- For a block size of four words spread across four memory banks:
  1 cycle to send the 1st address + 15 cycles to read the DRAM banks (in parallel) + 4 × 1 cycles to return the data words = 20 clock cycles of miss penalty.
  Bandwidth for a single miss = (4 × 4)/20 = 0.8 bytes per clock.
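A small helper (an illustration added here, not part of the slides) that reproduces the three miss-penalty numbers above from the assumed timing parameters: 1 cycle to send an address, a 15-cycle row access, a 5-cycle column access, and 1 cycle per word returned.

/* Cycles to fetch a 4-word block under the three organizations discussed above. */
enum { ADDR = 1, ROW = 15, COL = 5, XFER = 1, WORDS = 4 };

int penalty_separate_rows(void) { return ADDR + WORDS * ROW + XFER; }              /* 1 + 60 + 1 = 62 */
int penalty_same_row(void)      { return ADDR + ROW + (WORDS - 1) * COL + XFER; }  /* 1 + 30 + 1 = 32 */
int penalty_interleaved(void)   { return ADDR + ROW + WORDS * XFER; }              /* 1 + 15 + 4 = 20 */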

PSU. preferably ones that match the block size of the cache ‰ with the memory-bus characteristics z z make sure the memory-bus can support the DRAM access rates and patterns with the goal of increasing the Memory-Bus to Cache bandwidth CSE431 Chapter 5A.DRAM Memory System Summary ‰ Its important to match the cache characteristics z caches access one block at a time (usually more than one word) ‰ with the DRAM characteristics z use DRAMs that support fast multiple word accesses.38 Irwin. 2008 .

Measuring Cache Performance

- Assuming cache hit costs are included as part of the normal CPU execution cycle, then
  CPU time = IC × CPI × CC = IC × (CPI_ideal + Memory-stall cycles per instruction) × CC, where CPI_ideal plus the memory-stall cycles is CPI_stall.
- Memory-stall cycles come from cache misses (a sum of read-stalls and write-stalls):
  Read-stall cycles = reads/program × read miss rate × read miss penalty
  Write-stall cycles = (writes/program × write miss rate × write miss penalty) + write buffer stalls
- For write-through caches, we can simplify this to
  Memory-stall cycles = accesses/program × miss rate × miss penalty.

Impacts of Cache Performance

- The relative cache penalty increases as processor performance improves (faster clock rate and/or lower CPI): memory speed is unlikely to improve as fast as the processor cycle time, so when calculating CPI_stall the cache miss penalty is measured in processor clock cycles needed to handle a miss. The lower the CPI_ideal, the more pronounced the impact of stalls.
- Example: a processor with a CPI_ideal of 2, a 100-cycle miss penalty, 36% load/store instructions, and 2% I$ and 4% D$ miss rates:
  Memory-stall cycles = 2% × 100 + 36% × 4% × 100 = 3.44
  So CPI_stall = 2 + 3.44 = 5.44, more than twice the CPI_ideal!
- What if the CPI_ideal is reduced to 1? 0.5? 0.25? What if the D$ miss rate went up 1%? 2%? What if the processor clock rate is doubled (doubling the miss penalty)? (Explored in the sketch below.)
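The same arithmetic can be packaged as a tiny program (an added illustration; the function and parameter names are mine) to explore the “what if” questions: with CPI_ideal = 1 the result is 4.44, and doubling the miss penalty to 200 cycles (the doubled clock rate) gives 8.88.

#include <stdio.h>

/* CPI including memory stalls, using the parameters from the slide's example. */
double cpi_stall(double cpi_ideal, double miss_penalty,
                 double ld_st_frac, double imiss_rate, double dmiss_rate) {
    double stalls = imiss_rate * miss_penalty                 /* I$ stall cycles per instruction */
                  + ld_st_frac * dmiss_rate * miss_penalty;   /* D$ stall cycles per instruction */
    return cpi_ideal + stalls;
}

int main(void) {
    printf("%.2f\n", cpi_stall(2.0, 100, 0.36, 0.02, 0.04));  /* 5.44, as on the slide        */
    printf("%.2f\n", cpi_stall(1.0, 100, 0.36, 0.02, 0.04));  /* 4.44 with CPI_ideal = 1      */
    printf("%.2f\n", cpi_stall(2.0, 200, 0.36, 0.02, 0.04));  /* 8.88 with the clock doubled  */
    return 0;
}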

Average Memory Access Time (AMAT)

- A larger cache will have a longer access time, and an increase in hit time will likely add another stage to the pipeline. At some point the increase in hit time for a larger cache overcomes the improvement in hit rate, leading to a decrease in performance.
- Average Memory Access Time (AMAT) is the average time to access memory considering both hits and misses:

  AMAT = Time for a hit + Miss rate × Miss penalty

- What is the AMAT for a processor with a 20 psec clock, a miss penalty of 50 clock cycles, a miss rate of 0.02 misses per instruction, and a cache access time of 1 clock cycle? (A worked answer follows below.)
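A worked answer (not given on the slide): AMAT = 1 + 0.02 × 50 = 2 clock cycles per access, which at a 20 psec clock is 40 psec.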

Reducing Cache Miss Rates #1

1. Allow more flexible block placement.

- In a direct mapped cache a memory block maps to exactly one cache block.
- At the other extreme, we could allow a memory block to be mapped to any cache block – a fully associative cache.
- A compromise is to divide the cache into sets, each of which consists of n “ways” (n-way set associative). A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices):

  set index = (block address) modulo (# sets in the cache)

Another Reference String Mapping

- Consider the main memory word reference string 0 4 0 4 0 4 0 4, starting with an empty cache (all blocks initially marked as not valid).
- In the direct mapped cache, words 0 and 4 map to the same cache block, so every access misses: 0 miss, 4 miss (replaces 0), 0 miss (replaces 4), 4 miss, 0 miss, 4 miss, 0 miss, 4 miss.
- Result: 8 requests, 8 misses.
- This is a ping-pong effect due to conflict misses: two memory locations that map into the same cache block keep evicting each other.

Set Associative Cache Example

- A 2-way set associative cache with two sets and one-word blocks; the two low-order address bits define the byte within the word (32-bit words), and main memory holds blocks 0000xx through 1111xx.
- Q2: How do we find it? Use the next low-order memory address bit – the index – to determine which cache set (i.e., block address modulo the number of sets in the cache).
- Q1: Is it there? Compare all the cache tags in the selected set to the high-order 3 memory address bits to tell if the memory block is in the cache.

Another Reference String Mapping

- Consider the same reference string 0 4 0 4 0 4 0 4 in the 2-way set associative cache, starting with an empty cache (all blocks initially marked as not valid).
- Words 0 and 4 map to the same set but can occupy different ways: 0 miss, 4 miss, then 0 hit, 4 hit, 0 hit, 4 hit, 0 hit, 4 hit.
- Result: 8 requests, 2 misses.
- This solves the ping-pong effect seen in the direct mapped cache, since two memory locations that map into the same cache set can now co-exist!

Four-Way Set Associative Cache

- 2^8 = 256 sets, each with four ways (each way holding one block).
- The address splits into a 22-bit tag, an 8-bit index, and a byte offset. [Figure: the index selects a set; the four stored tags are compared with the address tag in parallel, the per-way matches are combined into Hit, and a 4-to-1 multiplexor selects the data from the way that hit.]

Range of Set Associative Caches

- For a fixed size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets – it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit.
- Address fields: Tag (used for the tag compare) | Index (selects the set) | Block offset (selects the word in the block) | Byte offset.
- Decreasing associativity leads toward direct mapped (only one way): smaller tags and only a single comparator. Increasing associativity leads toward fully associative (only one set): the tag is all the bits except the block and byte offsets.
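The bit shifting described above can be checked with a short program (an added sketch; the 16KB cache with 16-byte blocks is an assumed example, not taken from the slides):

#include <stdio.h>

/* Log base 2 of a power of two. */
static int lg(unsigned x) { int n = 0; while (x >>= 1) n++; return n; }

/* Print tag/index/offset widths for a 32-bit byte address. */
void field_widths(unsigned cache_bytes, unsigned block_bytes, unsigned ways) {
    unsigned sets = cache_bytes / (block_bytes * ways);
    int offset = lg(block_bytes), index = lg(sets), tag = 32 - index - offset;
    printf("%4u-way: tag=%2d index=%2d offset=%d\n", ways, tag, index, offset);
}

int main(void) {
    /* 16KB cache, 16B blocks: each doubling of associativity moves one bit
       from the index into the tag, until the fully associative extreme. */
    for (unsigned ways = 1; ways <= 1024; ways *= 2)
        field_widths(16 * 1024, 16, ways);
    return 0;
}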

Costs of Set Associative Caches

- When a miss occurs, which way's block do we pick for replacement?
  - Least Recently Used (LRU): the block replaced is the one that has been unused for the longest time. This requires hardware to keep track of when each way's block was used relative to the other blocks in the set; for a 2-way set associative cache one bit per set suffices – set the bit when one way's block is referenced (and reset it when the other way's block is).
- An N-way set associative cache also costs:
  - N comparators (delay and area)
  - a MUX delay (way/data selection) before the data is available
  - data available only after the way selection (and Hit/Miss decision). In a direct mapped cache the cache block is available before the Hit/Miss decision, so it is possible to assume a hit and recover later if it was a miss – that is not possible here.

Benefits of Set Associative Caches

- The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation.
- [Plot: miss rate versus associativity (1-way, 2-way, 4-way, 8-way) for cache sizes from 4KB to 512KB; data from Hennessy & Patterson, Computer Architecture, 2003.]
- The largest gains are in going from direct mapped to 2-way (a 20%+ reduction in miss rate).

Reducing Cache Miss Rates #2

2. Use multiple levels of caches.

- With advancing technology there is more than enough room on the die for bigger L1 caches or for a second level of caches – normally a unified L2 cache (i.e., it holds both instructions and data), and in some cases even a unified L3 cache.
- For our example: a CPI_ideal of 2, a 100-cycle miss penalty (to main memory), a 25-cycle miss penalty (to the UL2$), 36% load/stores, 2% (4%) L1 I$ (D$) miss rates, and a 0.5% UL2$ miss rate:
  CPI_stalls = 2 + 0.02×25 + 0.36×0.04×25 + 0.005×100 + 0.36×0.005×100 = 3.54
  (as compared to 5.44 with no L2$).
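Spelling out the slide's sum term by term (a reconstruction of its arithmetic): 2 (ideal CPI) + 0.5 (instruction misses served by the L2) + 0.36 (data misses served by the L2) + 0.5 (instruction references that also miss in the L2 and go to main memory) + 0.18 (data references that also miss in the L2) = 3.54.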

Multilevel Cache Design Considerations

- Design considerations for L1 and L2 caches are very different:
  - The primary cache should focus on minimizing hit time in support of a shorter clock cycle: smaller, with smaller block sizes.
  - The secondary cache(s) should focus on reducing the miss rate, to reduce the penalty of long main memory access times: larger, with larger block sizes and higher levels of associativity.
- The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache, so the L1 can be smaller (i.e., faster) even though it has a higher miss rate.
- For the L2 cache, hit time is less important than miss rate:
  - The L2$ hit time determines the L1$'s miss penalty.
  - The L2$ local miss rate is much larger than the global miss rate.

Using the Memory Hierarchy Well

- [Placeholder: include plots from Figure 5.18.]

Two Machines' Cache Parameters

                          Intel Nehalem                            AMD Barcelona
  L1 organization & size  Split I$ and D$; 32KB each per core;     Split I$ and D$; 64KB each per core;
                          64B blocks                               64B blocks
  L1 associativity        4-way (I), 8-way (D) set assoc.;         2-way set assoc.; LRU replacement
                          ~LRU replacement
  L1 write policy         write-back, write-allocate               write-back, write-allocate
  L2 organization & size  Unified; 256KB (0.25MB) per core;        Unified; 512KB (0.5MB) per core;
                          64B blocks                               64B blocks
  L2 associativity        8-way set assoc.; ~LRU                   16-way set assoc.; ~LRU
  L2 write policy         write-back, write-allocate               write-back, write-allocate
  L3 organization & size  Unified; 8192KB (8MB) shared by cores;   Unified; 2048KB (2MB) shared by cores;
                          64B blocks                               64B blocks
  L3 associativity        16-way set assoc.; ~LRU                  32-way set assoc.; evict the block
                                                                   shared by the fewest cores
  L3 write policy         write-back, write-allocate               write-back, write-allocate

Two Machines' Cache Parameters

                    Intel P4                                  AMD Opteron
  L1 organization   Split I$ and D$                           Split I$ and D$
  L1 cache size     8KB for D$, 96KB for trace cache (~I$)    64KB for each of I$ and D$
  L1 block size     64 bytes                                  64 bytes
  L1 associativity  4-way set assoc.                          2-way set assoc.
  L1 replacement    ~LRU                                      LRU
  L1 write policy   write-through                             write-back
  L2 organization   Unified                                   Unified
  L2 cache size     512KB                                     1024KB (1MB)
  L2 block size     128 bytes                                 64 bytes
  L2 associativity  8-way set assoc.                          16-way set assoc.
  L2 replacement    ~LRU                                      ~LRU
  L2 write policy   write-back                                write-back

FSM Cache Controller

- Key characteristics for a simple L1 cache:
  - Direct mapped
  - Write-back using write-allocate
  - Block size of four 32-bit words (16B); cache size of 16KB (so 1024 blocks)
  - 18-bit tags, 10-bit index, 2-bit block offset, 2-bit byte offset; a valid bit and a dirty bit per block (plus LRU bits if set associative)
- Processor ↔ cache & cache controller interface: 1-bit Read/Write, 1-bit Valid, 32-bit address, 32-bit data in each direction, 1-bit Ready.
- Cache controller ↔ DDR SDRAM interface: 1-bit Read/Write, 1-bit Valid, 32-bit address, 128-bit data in each direction, 1-bit Ready.

Four State Cache Controller

- Idle: wait for a valid CPU request, then go to Compare Tag.
- Compare Tag: if the block is valid and the tags match, it is a cache hit – set the valid bit and tag, set the dirty bit if the access is a write, mark the cache ready, and return to Idle. On a miss, go to Allocate if the old block is clean, or to Write Back if the old block is dirty.
- Write Back: write the old block to memory; stay here while memory is not ready, then go to Allocate.
- Allocate: read the new block from memory; stay here while memory is not ready, then return to Compare Tag.
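In software form the four states might look like the sketch below (an illustration only; the real controller is hardware, and the signal names here are invented):

#include <stdbool.h>

typedef enum { IDLE, COMPARE_TAG, ALLOCATE, WRITE_BACK } CtrlState;

/* One step of the four-state controller described above. */
CtrlState step(CtrlState s, bool cpu_request, bool hit, bool dirty, bool mem_ready) {
    switch (s) {
    case IDLE:        return cpu_request ? COMPARE_TAG : IDLE;
    case COMPARE_TAG:                           /* valid && tag match -> hit, back to Idle */
        if (hit)   return IDLE;
        return dirty ? WRITE_BACK : ALLOCATE;   /* miss: dirty victim must be written back */
    case WRITE_BACK:  return mem_ready ? ALLOCATE : WRITE_BACK;   /* old block to memory   */
    case ALLOCATE:    return mem_ready ? COMPARE_TAG : ALLOCATE;  /* fetch the new block   */
    }
    return IDLE;
}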

Cache Coherence in Multicores

- In future multicore processors it is likely that the cores will share a common physical address space, causing a cache coherence problem.
- [Diagram: Core 1 and Core 2, each with private L1 I$ and L1 D$, over a unified (shared) L2.]

Cache Coherence in Multicores

- Example: Core 1 reads X (bringing X = 0 into its L1 D$ and the shared L2), Core 2 reads X (bringing X = 0 into its L1 D$), then Core 1 writes 1 to X. Core 1's cache now holds X = 1 while Core 2's cache and the shared L2 still hold X = 0 – a cache coherence problem.

A Coherent Memory System

- Any read of a data item should return the most recently written value of that data item.
  - Coherence defines what values can be returned by a read; writes to the same location are serialized (two writes to the same location must be seen in the same order by all cores).
  - Consistency determines when a written value will be returned by a read.
- To enforce coherence, caches must provide:
  - Replication of shared data items in multiple cores' caches: replication reduces both the latency of and the contention for reading shared data.
  - Migration of shared data items to a core's local cache: migration reduces the latency of accessing the data and the bandwidth demand on the shared memory (the L2 in our example).

Cache Coherence Protocols

- We need a hardware protocol to ensure cache coherence; the most popular is snooping: the cache controllers monitor (snoop) the broadcast medium (e.g., a bus), using duplicate address tag hardware (so they don't interfere with the core's access to the cache), to determine whether their cache has a copy of a requested block.
- Write-invalidate protocol: writes require exclusive access and invalidate all other copies. Exclusive access ensures that no other readable or writable copies of the item exist.
- If two processors attempt to write the same data at the same time, one of them wins the race, causing the other core's copy to be invalidated. For the other core to complete its write, it must obtain a new copy of the data, which must now contain the updated value – thus enforcing write serialization.

Handling Writes

Ensuring that all other processors sharing the data are informed of writes can be handled in two ways:

1. Write-invalidate: the writing processor issues an invalidation signal on the bus; cache snoops check whether they have a copy of the data and, if so, invalidate the cache block containing the word (this allows multiple readers but only one writer). The bus is used only on the first write, so bus traffic is lower and bus bandwidth is used better.
2. Write-update (write-broadcast): the writing processor broadcasts the new data over the bus, and all copies are updated. All writes go to the bus, so bus traffic is higher, but since new values appear in caches sooner, this can reduce latency.

Example of Snooping Invalidation

- [Diagram: Core 1 and Core 2 with private L1 caches over a unified (shared) L2.]
- Core 1 reads X, Core 2 reads X, then Core 1 writes 1 to X: the write invalidates Core 2's copy and the L2's copy of X.
- When Core 2 then reads X again and its second miss occurs, Core 1 responds with the value, cancelling the response from the L2 cache (and also updating the L2 copy).

A Write-Invalidate CC Protocol

- States (the write-back caching protocol is shown in black in the original diagram): Invalid, Shared (clean), and Modified (dirty).
- Transitions driven by the core (shown in red): a read miss takes Invalid to Shared; read hits or misses keep a block Shared; a write (hit or miss) takes the block to Modified, sending an invalidate on the bus; read or write hits keep a block Modified.
- Transitions driven by bus signals (shown in blue): receiving an invalidate (a write by another core to this block) takes the block to Invalid; a read miss by another core to a Modified block forces a write-back, and the block drops to Shared.
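A compact restatement of these transitions as code (a sketch; the event names are mine, and a real controller also drives the bus signals shown in the diagram):

typedef enum { INVALID, SHARED, MODIFIED } LineState;
typedef enum { CPU_READ, CPU_WRITE, BUS_INVALIDATE, BUS_READ_MISS } Event;

/* Next state of one cache block in the write-invalidate protocol above. */
LineState next_state(LineState s, Event e) {
    switch (e) {
    case CPU_READ:       return (s == INVALID) ? SHARED : s;   /* read miss fetches a clean copy      */
    case CPU_WRITE:      return MODIFIED;                       /* send an invalidate if not Modified  */
    case BUS_INVALIDATE: return INVALID;                        /* another core is writing this block  */
    case BUS_READ_MISS:  return (s == MODIFIED) ? SHARED : s;   /* write back and drop to Shared       */
    }
    return s;
}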

Write-Invalidate CC Examples

- Notation: I = invalid (possibly many copies), S = shared (possibly many copies), M = modified (only one copy). In each example Core 1 starts out holding block A, Core 2 starts with A invalid, and main memory also holds A.
- Core 1 holds A in S, Core 2 has a read miss for A: the snoop sees the read request and lets main memory supply A; Core 2 loads A and marks it S.
- Core 1 holds A in S, Core 2 has a write miss for A: Core 2 sends an invalidate for A, Core 1 changes its copy to I, and Core 2 gets A, writes it, and changes its state to M.
- Core 1 holds A in M, Core 2 has a read miss for A: the snoop sees the read request; Core 1 writes A back to main memory and changes its state to S, and Core 2 then gets the up-to-date A and marks it S.
- Core 1 holds A in M, Core 2 has a write miss for A: Core 2 sends an invalidate for A; Core 1 writes A back and changes its state to I, and Core 2 gets A, writes it, and changes its state to M.

Data Miss Rates

- Shared data has lower spatial and temporal locality, and shared-data misses often dominate cache behavior even though they may be only 10% to 40% of the data accesses.
- [Plots: capacity miss rate versus coherence miss rate as the number of processors grows from 1 to 16, for the FFT and Ocean benchmarks, using a 64KB 2-way set associative data cache with 32B blocks; from Hennessy & Patterson, Computer Architecture: A Quantitative Approach.]

Block Size Effects

- Writes to one word in a multi-word block mean that the full block is invalidated.
- Multi-word blocks can also result in false sharing: two cores write to two different variables that happen to fall in the same cache block. With write-invalidate, false sharing increases cache miss rates. [Diagram: Core 1 writes A and Core 2 writes B, where A and B sit in the same 4-word cache block.]
- Compilers can help reduce false sharing by allocating highly correlated data to the same cache block. (A small illustration follows below.)
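A hedged illustration of false sharing in C (the type and field names are mine, and a 64-byte block is assumed): two threads update different counters that share one block, so each write invalidates the other core's copy; aligning each counter to its own block removes the ping-ponging.

#include <stdatomic.h>

/* Two counters that fall in the same cache block: writes by different
 * cores invalidate each other even though the data is logically unshared. */
struct shared_bad {
    atomic_long a;   /* updated by core 1 */
    atomic_long b;   /* updated by core 2: same 64-byte block -> false sharing */
};

/* Padding each counter out to its own block avoids the coherence misses. */
struct shared_good {
    _Alignas(64) atomic_long a;
    _Alignas(64) atomic_long b;
};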

Other Coherence Protocols

- There are many variations on cache coherence protocols.
- Another write-invalidate protocol, used in the Pentium 4 (and many other processors), is MESI, with four states:
  - Modified – as before (dirty; only one copy).
  - Exclusive – only one copy of the shared data is allowed to be cached; memory has an up-to-date copy. Since there is only one copy of the block, write hits don't need to send an invalidate signal.
  - Shared – multiple copies of the shared data may be cached (i.e., the data is permitted to be cached by more than one processor); memory has an up-to-date copy.
  - Invalid – as before.

2008 Modified (dirty) CSE431 Chapter 5A.MESI Cache Coherency Protocol Processor shared read Invalid A th Another ( t valid (not lid processor block) has read/write miss for this bl k block Invalidate for this block Processor shared read miss Shared (clean) Processor write miss Processor write [Write back block] Processor write or read hit Processor P shared read Processor Processor exclusive read miss exclusive read Exclusive (clean) Processor exclusive read miss Processor exclusive read Irwin. PSU.74 .

2008 .no write allocate – no “hit” hit on cache cache. then write) pipeline writes via a delayed write buffer to cache 1.write allocate – to avoid two cycles (first check for hit.75 Irwin. Reduce the miss rate z z z z bigger cache more flexible placement (increase associativity) larger blocks (16 to 64 bytes typical) victim cache – small buffer holding most recently discarded blocks CSE431 Chapter 5A. PSU. z z z z smaller cache direct mapped cache smaller blocks for writes .Summary: Improving Cache Performance 0 Reduce the time to hit in the cache 0. just write to write buffer .

Summary: Improving Cache Performance

2. Reduce the miss penalty:
   - smaller blocks
   - use a write buffer to hold dirty blocks being replaced, so reads don't have to wait for the write to complete
   - check the write buffer (and/or victim cache) on a read miss – may get lucky
   - for large blocks, fetch the critical word first
   - use multiple cache levels – the L2 cache is not tied to the CPU clock rate
   - faster backing store / improved memory bandwidth: wider buses; memory interleaving, DDR SDRAMs

Summary: The Cache Design Space

- Several interacting dimensions: cache size, block size, associativity, replacement policy, write-through vs write-back, write allocation.
- The optimal choice is a compromise: it depends on access characteristics (the workload, and the use – I-cache, D-cache, TLB) and on technology and cost. [Diagram: the design space sketched as good/bad trade-off curves of one factor against another.]
- Simplicity often wins.