Quiz for Chapter 5: Large and Fast: Exploiting Memory Hierarchy

Date:

Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully.

Name: _____________________ Course: _____________________

Solutions in RED

1. [24 points] Caches and Address Translation. Consider a 64-byte cache with 8-byte blocks, an associativity of 2, and LRU block replacement. Virtual addresses are 16 bits. The cache is physically tagged. The processor has 16KB of physical memory.

(a) What is the total number of tag bits?

The cache is 64 bytes with 8-byte blocks, so there are 8 blocks. The associativity is 2, so there are 4 sets. Since there are 16KB of physical memory, a physical address is 14 bits long. Of these, 3 bits are taken for the offset (8-byte blocks) and 2 for the index (4 sets). That leaves 9 tag bits per block. Since there are 8 blocks, that makes 72 tag bits, or 9 tag bytes.

(b) Assuming there are no special provisions for avoiding synonyms, what is the minimum page size?

To avoid synonyms, the number of sets times the block size cannot exceed the page size. Hence, the minimum page size is 32 bytes.

(c) Assume each page is 64 bytes. How large would a single-level page table be, given that each page requires 4 protection bits and entries must be an integral number of bytes?

There is 16KB of physical memory and 64 B pages, meaning there are 256 physical pages and each PFN is 8 bits. Each entry has a PFN and 4 protection bits; that is 12 bits, which we round up to 2 bytes. A single-level page table has one entry per virtual page. With a 16-bit virtual address space, there is 64KB of memory. Since each page is 64 bytes, that means there are 1K pages. At 2 bytes per entry, the page table is 2KB in size.

(d) For the following sequence of references, label the cache misses. Also, label each miss as being either a compulsory miss, a capacity miss, or a conflict miss. The addresses are given in octal (each digit represents 3 bits). Assume the cache initially contains block addresses 000, 010, 020, 030, 040, 050, 060, and 070, which were accessed in that order.


Since there are 8-byte blocks and addresses are in octal, to keep track of blocks you need only worry about the two most significant digits of each address, so 020 and 024 are in the same block, which we write as 02. The cache is arranged into 4 sets of 2 blocks each. To map a block to a set, use (address / blocksize) % #sets, where address / blocksize is simply the address without its last octal digit. Block 02 is in set 02 % 4, which is set #2. In short, set mappings are determined by the middle digit of the address: 0 and 4 go into set #0, 1 and 5 into set #1, 2 and 6 into set #2, and 3 and 7 into set #3. From there, just run LRU on each of the individual sets. To separate the misses: any miss to a block you have not seen before is a compulsory miss. Any miss to a block you have not seen within the last N distinct blocks, where N is the total number of blocks in the cache, is a capacity miss (basically, the last N distinct blocks are the N blocks that would be present in a fully associative cache). All other misses are conflict misses.
Reference  Cache state prior to access          Miss? Which?
024        (00,04),(01,05),(02,06),(03,07)      Hit (update 02 LRU)
100        (00,04),(01,05),(06,02),(03,07)      Compulsory miss
270        (04,10),(01,05),(06,02),(03,07)      Compulsory miss
570        (04,10),(01,05),(06,02),(07,27)      Compulsory miss
074        (04,10),(01,05),(06,02),(27,57)      Conflict miss
272        (04,10),(01,05),(06,02),(57,07)      Conflict miss
004        (04,10),(01,05),(06,02),(07,27)      Capacity miss
044        (10,00),(01,05),(06,02),(07,27)      Capacity miss
640        (00,04),(01,05),(06,02),(07,27)      Compulsory miss
000        (04,64),(01,05),(06,02),(07,27)      Conflict miss
410        (64,00),(01,05),(06,02),(07,27)      Compulsory miss
710        (64,00),(05,41),(06,02),(07,27)      Compulsory miss
550        (64,00),(41,71),(06,02),(07,27)      Compulsory miss
570        (64,00),(71,55),(06,02),(07,27)      Capacity miss
410        (64,00),(71,55),(06,02),(27,57)      Conflict miss
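The hand simulation above can be replayed mechanically. The sketch below (not part of the original quiz) models the 2-way, 4-set LRU cache, plus a fully-associative 8-block LRU "shadow" cache used to tell capacity misses from conflict misses, exactly as the solution defines them.

```python
def touch(lru, blk, capacity):
    """Access blk in an LRU list (last element = most recent); return hit?"""
    hit = blk in lru
    if hit:
        lru.remove(blk)
    elif len(lru) == capacity:
        lru.pop(0)                     # evict the least recently used block
    lru.append(blk)
    return hit

def classify(refs, initial, num_sets=4, ways=2, total_blocks=8):
    sets = [[] for _ in range(num_sets)]
    shadow = []                        # fully associative, total_blocks blocks
    seen = set()
    for addr in initial:               # preload the initial cache contents
        blk = int(addr, 8) // 8        # octal address, 8-byte blocks
        touch(sets[blk % num_sets], blk, ways)
        touch(shadow, blk, total_blocks)
        seen.add(blk)
    labels = []
    for addr in refs:
        blk = int(addr, 8) // 8
        in_shadow = blk in shadow      # among the last 8 distinct blocks?
        hit = touch(sets[blk % num_sets], blk, ways)
        touch(shadow, blk, total_blocks)
        if hit:
            labels.append("hit")
        elif blk not in seen:
            labels.append("compulsory miss")
        elif not in_shadow:
            labels.append("capacity miss")
        else:
            labels.append("conflict miss")
        seen.add(blk)
    return labels

refs = ["024", "100", "270", "570", "074", "272", "004", "044",
        "640", "000", "410", "710", "550", "570", "410"]
initial = ["000", "010", "020", "030", "040", "050", "060", "070"]
labels = classify(refs, initial)
print(labels)
```

Running it reproduces the table above: one hit, seven compulsory misses, three capacity misses, and four conflict misses.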

(e) Which of the following techniques are aimed at reducing the cost of a miss: dividing the current block into sub-blocks, a larger block size, the addition of a second-level cache, the addition of a victim buffer, early restart with critical word first, a write-back buffer, skewed associativity, software prefetching, the use of a TLB, and multi-porting?

The techniques that reduce miss cost are sub-blocking (allows smaller blocks to be transferred on a miss), the addition of a second-level cache (obviously), early restart/critical word first (allows a miss to return to the processor earlier), and a write-back buffer (buffers write-backs, allowing replacement blocks to access the bus immediately). A larger block size (without early restart or sub-blocking) actually increases miss cost. Skewed associativity and software prefetching reduce the miss rate, not the cost of a miss. A TLB reduces the cost of a hit (over the use of a serial TB) and is in general a functionality structure rather than a performance structure. A victim buffer can reduce miss cost or not, depending on your definition: if you count the victim buffer as part of the cache (which it typically is), then it reduces miss rate; if you count it on its own, or as part of the next level of cache (which it really isn't, but OK), then it reduces miss cost. Multi-porting increases cache bandwidth, not miss cost.

(f) Why are the first-level caches usually split (instructions and data are in different caches) while the L2 is usually unified (instructions and data are both in the same cache)?

The primary reason for splitting the first-level cache is bandwidth, i.e., to avoid structural hazards. A typical processor must access instructions and data in the same cycle quite frequently. Supplying this kind of bandwidth in a unified cache requires either replication (in which case you may as well have a split cache), multi-porting, or multi-banking, all of which would slow down the hit time. Structural hazards are a much smaller issue for a second-level cache, which is accessed much less frequently; here the primary design goal is a low miss rate. By combining instructions and data in the same cache, the capacity of the cache will be better utilized and a lower overall miss rate may be achieved.

2. [6 points] A two-part question. Assume a 2-way set associative cache design that uses the LRU algorithm (with a cache that can hold a total of 4 blocks). The cache uses 4 bytes per block. Assume that the cache is initially empty.

(Part A) Assume the following 8-bit address sequence generated by the microprocessor:

Time  Access
0     10001101
1     10110010
2     10111111
3     10001100
4     10011100
5     11101001
6     11111110
7     11101001

First determine the TAG, SET, and BYTE OFFSET fields and fill in the table above. In the figure below, clearly mark for each access the TAG, LRU, and HIT/MISS information.

[Blank figures: for the Initial state and each of Accesses 0-7, a diagram showing Set 0 and Set 1, each with Block 0 and Block 1.]

With 4-byte blocks there are 2 byte-offset bits; with 4 blocks and 2-way associativity there are 2 sets, so 1 set bit; the remaining 5 bits are the tag:

Time  Access    TAG    SET  BYTE OFFSET
0     10001101  10001  1    01
1     10110010  10110  0    10
2     10111111  10111  1    11
3     10001100  10001  1    00
4     10011100  10011  1    00
5     11101001  11101  0    01
6     11111110  11111  1    10
7     11101001  11101  0    01

In the following, each access is represented by a triple (Tag, LRU Bit, Hit Bit): LRU Bit = 1 if the current block is the least recently used one in its set; Hit Bit = 1 if the current reference is a hit.

Initial:   Set 0: empty                                 Set 1: empty
Access 0:  Set 0: empty                                 Set 1: B0=(10001,0,0)
Access 1:  Set 0: B0=(10110,0,0)                        Set 1: B0=(10001,0,0)
Access 2:  Set 0: B0=(10110,0,0)                        Set 1: B0=(10001,1,0), B1=(10111,0,0)
Access 3:  Set 0: B0=(10110,0,0)                        Set 1: B0=(10001,0,1), B1=(10111,1,0)
Access 4:  Set 0: B0=(10110,0,0)                        Set 1: B0=(10001,1,0), B1=(10011,0,0)
Access 5:  Set 0: B0=(10110,1,0), B1=(11101,0,0)       Set 1: B0=(10001,1,0), B1=(10011,0,0)
Access 6:  Set 0: B0=(10110,1,0), B1=(11101,0,0)       Set 1: B0=(11111,0,0), B1=(10011,1,0)
Access 7:  Set 0: B0=(10110,1,0), B1=(11101,0,1)       Set 1: B0=(11111,0,0), B1=(10011,1,0)
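A short sketch (not part of the original quiz) that splits each 8-bit address into its fields and replays the 2-way LRU cache, confirming that accesses 3 and 7 are the hits:

```python
# 4-byte blocks -> 2 offset bits; 4 blocks, 2-way -> 2 sets -> 1 set bit.
addrs = ["10001101", "10110010", "10111111", "10001100",
         "10011100", "11101001", "11111110", "11101001"]
fields = [(a[:5], a[5], a[6:]) for a in addrs]   # (tag, set, byte offset)

sets = {"0": [], "1": []}    # per-set LRU lists, last element = most recent
hit_flags = []
for tag, s, _ in fields:
    lru = sets[s]
    hit = tag in lru
    if hit:
        lru.remove(tag)
    elif len(lru) == 2:
        lru.pop(0)           # evict the least recently used block
    lru.append(tag)
    hit_flags.append(hit)

print(hit_flags)             # True at accesses 3 and 7
```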

(Part B) Derive the hit ratio for the access sequence in Part A.

From the table above, accesses 3 and 7 are hits, so the hit ratio for the access sequence is 2/8 = 0.25.

3. [6 points] A two-part question.

(a) Why is miss rate not a good metric for evaluating cache performance? What is the appropriate metric? Give its definition.

The ultimate metric for cache performance is average access time: tavg = thit + miss-rate * tmiss. The miss rate is only one component of this equation. A cache may have a low miss rate but an extremely high penalty per miss, making it lower-performing than a cache with a higher miss rate but a substantially lower miss penalty. Alternatively, it may have a low miss rate but a high hit time (this is true for large fully associative caches, for instance).

What is the reason for using a combination of first- and second-level caches rather than using the same chip area for a larger first-level cache?

Multiple levels of cache are used for exactly this reason: not all of the performance factors can be optimized in a single cache. Specifically, with tmiss (memory latency) given, it is difficult to design a cache which is both fast in the common case (a hit) and minimizes the costly uncommon case by having a low miss rate. These two design goals are achieved using two caches. The first-level cache minimizes the hit time; therefore it is usually small with a low associativity. The second-level cache minimizes the miss rate; therefore it is usually large with large blocks and a higher associativity.

(b) The original motivation for using virtual memory was "compatibility". What does that mean in this context? What are two other motivations for using virtual memory?

Compatibility in this context means the ability to transparently run the same piece of (un-recompiled) software on different machines with different amounts of physical memory. This compatibility freed the application writer from worrying about how much physical memory a machine had. The motivations for using virtual memory today have mainly to do with multiprogramming: the ability to interleave the execution of multiple programs simultaneously on one processor. Virtual memory also provides protection, relocation, fast startup, sharing and communication, and the ability to memory-map peripheral devices.

4. [6 points] A four-part question.

(Part A) What are the two characteristics of program memory accesses that caches exploit?

Temporal and spatial locality.

(Part B) What are three types of cache misses?

Compulsory (cold) misses, capacity misses, and conflict misses.

(Part C) Design a 128KB direct-mapped data cache that uses a 32-bit address and 16 bytes per block. Calculate the following:
(a) How many bits are used for the byte offset? 4 bits (16-byte blocks)
(b) How many bits are used for the set (index) field? 13 bits (128KB / 16 B = 8192 lines)
(c) How many bits are used for the tag? 32 - 13 - 4 = 15 bits

(Part D) Design an 8-way set associative cache that has 16 blocks and 32 bytes per block. Assume a 32-bit address. Calculate the following:
(a) How many bits are used for the byte offset? 5 bits (32-byte blocks)
(b) How many bits are used for the set (index) field? 1 bit (16 blocks / 8 ways = 2 sets)
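The field widths in Parts C and D follow mechanically from the stated cache geometry. A small sketch (not part of the original quiz) that computes the offset/index/tag split:

```python
from math import log2

def fields(addr_bits, cache_bytes, block_bytes, ways):
    """Return (byte offset, index, tag) widths for a set-associative cache."""
    sets = cache_bytes // (block_bytes * ways)
    offset = int(log2(block_bytes))
    index = int(log2(sets))
    return offset, index, addr_bits - index - offset

# Part C: 128KB direct-mapped cache, 16-byte blocks, 32-bit address
print(fields(32, 128 * 1024, 16, 1))      # (4, 13, 15)
# Part D: 8-way cache, 16 blocks of 32 bytes, 32-bit address
print(fields(32, 16 * 32, 32, 8))         # (5, 1, 26)
```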

(c) How many bits are used for the tag? 32 - 5 - 1 = 26 bits

5. [6 points] The memory architecture of a machine X is summarized in the following table.

Virtual Address  54 bits
Page Size        16 K bytes
PTE Size         4 bytes

(a) Assume that there are 8 bits reserved for operating system functions (protection, replacement, valid, modified) other than required by the hardware translation algorithm. Derive the largest physical memory size (in bytes) allowed by this PTE format.

Since the page size is 16K bytes, we need log2(16 * 2^10) = 14 bits to hold the page offset. Each PTE has 4 bytes (32 bits), of which 8 are reserved, so we have 32 - 8 = 24 bits for the physical page number. The largest physical memory size is therefore 2^(14 + 24) = 2^38 bytes = 256 GB.

(b) How large (in bytes) is the page table?

The page table is indexed by the virtual page number, which uses 54 - 14 = 40 bits. The number of entries is therefore 2^40. Since each PTE has 4 bytes, the total size of the page table is 2^42 bytes, which is 4 terabytes.

(c) Assuming 1 application exists in the system and the maximum physical memory is devoted to the process, how much physical space (in bytes) is there for the application's data and code?

The application's page table needs an entry for every physical page that exists on the system. There are 2^38 / 2^14 = 2^24 physical pages, and each PTE has 4 bytes, which means the page table occupies 2^24 * 4 = 2^26 bytes. This leaves the remaining physical space to the process: 2^38 - 2^26 bytes.

6. [6 points] This question covers virtual memory access. Assume a 5-bit virtual address and a memory system that uses 4 bytes per page. The physical memory has 16 bytes (four page frames). The page table used is a one-level scheme that can be found in memory at the PTBR location. Each page table entry (PTE) is 1 byte. Initially the table indicates that no virtual pages have been mapped. Implementing an LRU page replacement algorithm, show the contents of physical memory after the following virtual accesses: 10100, 00011, 11111, 01011, 01000, 01011. Also indicate when a page fault occurs. Show the contents of memory and the page table information after each access successfully completes in the figures that follow. Make sure you consider all the fields required by the translation algorithm (e.g., valid, modified, and Hit/Miss).
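The answer figures for question 6 were not preserved in this text version, but the fault behavior can be sketched. The following (not part of the original quiz) assumes the access order 10100, 00011, 11111, 01011, 01000, 01011 as listed above, abstracts the PTE layout into a simple vpn-to-frame map, and uses LRU replacement:

```python
def page_faults(accesses, offset_bits=2, frames=4):
    """Return a per-access fault flag for an LRU-managed set of page frames."""
    frame_of = {}        # virtual page number -> physical frame
    lru = []             # vpns, last element = most recently used
    faults = []
    for va in accesses:
        vpn = int(va, 2) >> offset_bits   # 4-byte pages -> 2 offset bits
        fault = vpn not in frame_of
        if fault:
            if len(frame_of) == frames:   # all frames full: evict LRU page
                victim = lru.pop(0)
                frame_of[vpn] = frame_of.pop(victim)
            else:                         # take the next free frame
                frame_of[vpn] = len(frame_of)
        if vpn in lru:
            lru.remove(vpn)
        lru.append(vpn)
        faults.append(fault)
    return faults

accesses = ["10100", "00011", "11111", "01011", "01000", "01011"]
print(page_faults(accesses))
```

The accesses touch virtual pages 5, 0, 7, 2, 2, 2: the first four accesses fault (filling all four frames, so no replacement is actually needed), and the last two hit.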


7. [6 points] A multipart question.

(Part A) In what pipeline stage is the branch target buffer checked?
(a) Fetch  (b) Decode  (c) Execute  (d) Resolve

The branch target buffer is checked in (a), the Fetch stage.

What needs to be stored in a branch target buffer in order to eliminate the branch penalty for an unconditional branch?
(a) Address of branch target
(b) Address of branch target and branch prediction
(c) Instruction at branch target

To eliminate the branch penalty the BTB needs to store (b), the address of the branch target and the branch prediction.

The Average Memory Access Time (AMAT) equation has three components: hit time, miss rate, and miss penalty. For each of the following cache optimizations, indicate which component of the AMAT equation is improved.
• Using a second-level cache improves miss penalty
• Using a direct-mapped cache improves hit time
• Using a 4-way set-associative cache improves miss rate
• Using a virtually-addressed cache improves hit time
• Performing hardware pre-fetching using stream buffers improves miss rate
• Using a non-blocking cache improves miss penalty
• Using larger blocks improves miss rate

(Part B)
(a) (True/False) A virtual cache access time is always faster than that of a physical cache. True (for a single-level cache), since the virtual-to-physical address translation is avoided.
(b) (True/False) High associativity in a cache reduces compulsory misses. False. It reduces conflict misses.
(c) (True/False) Both DRAM and SRAM must be refreshed periodically using a dummy read/write operation. False. Only DRAMs need to be refreshed periodically.
(d) (True/False) A write-through cache typically requires less bus bandwidth than a write-back cache. False. It is the other way round.
(e) (True/False) Cache performance is of less importance in faster processors because the processor speed compensates for the high memory access time. False. The huge processor-memory gap implies cache performance is very critical.
(f) (True/False) Memory interleaving is a technique for reducing memory access time through increased bandwidth utilization of the data bus. True.

8. [12 points] A three-part question. This question covers cache and pipeline performance analysis.

(Part A) Write the formula for the average memory access time assuming one level of cache memory:

Average Memory Access Time = Time for a hit + Miss rate × Miss penalty

(Part B) For a data cache with a 92% hit rate and a 2-cycle hit latency, calculate the average memory access latency. Assume that the latency to memory and the cache miss penalty together is 124 cycles. Note: the cache must be accessed after memory returns the data.

AMAT = 2 + 0.08 × 124 = 11.92 cycles
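The Part B arithmetic can be checked directly with the Part A formula (this sketch is not part of the original quiz):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time for a single cache level, in cycles."""
    return hit_time + miss_rate * miss_penalty

# 92% hit rate -> 8% miss rate; 2-cycle hit; 124-cycle miss penalty
print(amat(2, 0.08, 124))   # 11.92 cycles
```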


(Part A)

Addr   H/M
10011  M
00001  H
00110  H
01010  M
01110  M
11001  M
00001  M
11100  M
10100  M

(Part B) Do the same thing as in Part A, except for a 4-way set associative cache. Use the Least Recently Used replacement policy. Assume 00110 and 11011 were the last two addresses to be accessed.

Addr   H/M
10011
00001
00110
01010
01110
11001
00001
11100
10100

Addr   H/M
10011  M
00001  H
00110  H
01010  M
01110  M
11001  M
00001  H
11100  M
10100  M

(Part C)
(a) What is the hit and miss rate of the direct-mapped cache in Part A?

Hit Rate = 2/9, Miss Rate = 7/9

(b) What is the hit and miss rate of the 4-way set associative cache in Part B?

Hit Rate = 3/9, Miss Rate = 6/9

(c) Assume a machine with a CPI of 4 and a miss penalty of 10 cycles. Ignoring writes, calculate the ratio of the performance of the 4-way set associative cache to the direct-mapped cache. In other words, what is the speedup when using the machine with the 4-way cache?

Each hit costs 4 cycles (the CPI) and each miss costs 4 + 10 = 14 cycles.

Speedup = ((2×4)/9 + (7×14)/9) / ((3×4)/9 + (6×14)/9) = 106/96 ≈ 1.104

10. [6 points] Consider a memory system with the following parameters:
• Translation Lookaside Buffer has 512 entries and is 2-way set associative
• 64 Kbyte L1 Data Cache has 128-byte lines and is also 2-way set associative
• Virtual addresses are 64 bits and physical addresses are 32 bits
• 8KB page size

Below are diagrams of the cache and TLB. Please fill in the appropriate information in the table that follows the diagrams:

[The labeled cache and TLB diagrams from the original are not reproduced here; the field widths are:]

L1 Cache: tag = 17 bits, index = 8 bits, byte offset = 7 bits
TLB: tag = 43 bits, index = 8 bits, page offset = 13 bits, physical page number = 19 bits, 512 entries

11. [6 points] How many total SRAM bits will be required to implement a 256KB four-way set associative cache? The cache is physically-indexed, and has 64-byte blocks. Assume that there are 4 extra bits per entry: 1 valid bit, 1 dirty bit, and 2 LRU bits for the replacement policy. Assume that the physical address is 50 bits wide.

The number of sets in the 256KB four-way set associative cache = (256 × 2^10)/(4 × 64) = 1024, so the index is 10 bits and the byte offset is 6 bits, leaving a 50 − 10 − 6 = 34-bit tag. A set has four entries. Each entry in the set occupies 4 bits + 34 tag bits + 64 × 8 data bits = 550 bits. The total number of SRAM bits required = 550 × 4 × 1024 = 2,252,800.

12. [3 points] Invalidation vs. Update-based Protocols.

(a) As miss latencies increase, does an update protocol become more or less preferable to an invalidation-based protocol? Explain.

As miss latencies increase, an update protocol becomes more preferable than an invalidation protocol, which is especially true for shared-memory workloads. This is because an invalidation protocol just flushes a cache line on a processor whenever some other processor writes to a cache line which has the same mapped block, and hence a subsequent load (which may occur much later after the invalidation) will incur the higher miss latency. In the case of an update-based protocol, the miss latency is not the critical determining factor for the subsequent load, since the other processor updates the cache line as and when it writes any value.

(b) In a multilevel cache hierarchy, would you propagate updates all the way to the first-level cache or only to the second-level cache? Explain the trade-offs.
Propagating updates all the way to the first-level cache could result in unnecessary traffic being thrust upon the first-level cache of one processor just because it has the memory blocks mapped into its cache from the first time they were accessed and they have not been evicted since. On the other hand, this could also result in tolerating L1 miss latency to a greater extent than before, if the blocks that are updated are immediately used by the second processor.
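The field widths for question 10 and the SRAM total for question 11 can be cross-checked with a short sketch (not part of the original quiz); note that it counts the tag bits as part of each cache entry, alongside the 4 overhead bits:

```python
from math import log2

# Question 10: 64KB 2-way L1 with 128-byte lines, 32-bit physical address.
l1_sets = (64 * 1024) // (128 * 2)           # 256 sets
l1_offset = int(log2(128))                   # 7 bits
l1_index = int(log2(l1_sets))                # 8 bits
l1_tag = 32 - l1_index - l1_offset           # 17 bits

# 512-entry 2-way TLB, 8KB pages, 64-bit virtual / 32-bit physical address.
page_offset = int(log2(8 * 1024))            # 13 bits
tlb_index = int(log2(512 // 2))              # 8 bits
tlb_tag = (64 - page_offset) - tlb_index     # 43 bits
ppn = 32 - page_offset                       # 19 bits

# Question 11: 256KB 4-way, 64-byte blocks, 50-bit physical address,
# 4 overhead bits per entry (valid + dirty + 2 LRU).
sets = (256 * 1024) // (4 * 64)              # 1024 sets
tag = 50 - int(log2(sets)) - int(log2(64))   # 34 bits
entry_bits = 64 * 8 + tag + 4                # 550 bits per entry
total_bits = entry_bits * 4 * sets           # 2,252,800 bits
print(total_bits)
```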

13. [6 points] TLBs are typically built to be fully-associative or highly set-associative. In contrast, first-level data caches are more likely to be direct-mapped or 2- or 4-way set associative. Give two good reasons why this is so.

TLBs cache the translation from a virtual page number to a physical page number. (a) Since a page size is much larger than a normal cache line size, there is likely to be much less spatial locality among the virtual page accesses from a single process than between successive memory blocks. (b) Different processes have their own virtual address space, and hence, to prevent a lot of conflict misses and to ensure TLBs don't become a source of unfairness as far as a process's memory accesses are concerned, they are designed to be fully associative.

14. [6 points] Caches: Misses and Hits

int i;
int x = 0;
int a[1024*1024];

for (i = 0; i < 1024; i++) {
    x += a[i] + a[1024*i];
}

Consider the code snippet above. Suppose that it is executed on a system with a 2-way set associative 16KB data cache with 32-byte blocks, 32-bit words, and an LRU replacement policy. Assume that int is word-sized, that i and x are in registers, that the address of a is 0x0, and that the cache is initially empty. How many data cache misses are there? How many hits are there?

The number of sets in the cache = (16 × 2^10)/(2 × 32) = 256. Since a word is 4 bytes, 8 ints fit in a cache block. Therefore the ints a[0] to a[1023] map to one of the cache lines of sets 0 to 127, while the ints a[1024] to a[1024×2 − 1] map to sets 128 to 255. Similarly, a[1024×2] to a[1024×3 − 1] map to cache lines of sets 0 to 127, a[1024×3] to a[1024×4 − 1] map to sets 128 to 255, and so on. The blocks holding a[1024×i] thus map alternately to sets 0 and 128, where each is a cold miss the first time it is referenced; every access to a[1024×i] inside the loop is therefore a miss, except for i = 0, where a[1024×0] is a[0] and hits on the block just loaded by the a[0] access. That gives 1023 misses. For the a[i] accesses, every access with i a multiple of 8 crosses into a new block and misses: 1024/8 = 128 misses; all other a[i] accesses hit, since the block holding a[i] is never evicted while it is still in use.

The total number of misses = 1023 + 128 = 1151.
The total number of hits = 2 × 1024 − 1151 = 897.
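The miss/hit counts above can be verified by simulating the loop's two access streams against the stated cache (a sketch, not part of the original quiz; it assumes a starts at byte address 0 and each access of one iteration evaluates a[i] before a[1024*i]):

```python
# 2-way set associative, 256 sets, 32-byte blocks, LRU replacement.
NUM_SETS, WAYS, BLOCK = 256, 2, 32
sets = [[] for _ in range(NUM_SETS)]   # per-set LRU lists, last = most recent
hits = misses = 0

def access(byte_addr):
    global hits, misses
    blk = byte_addr // BLOCK
    lru = sets[blk % NUM_SETS]
    if blk in lru:
        hits += 1
        lru.remove(blk)
    else:
        misses += 1
        if len(lru) == WAYS:
            lru.pop(0)                 # evict the least recently used block
    lru.append(blk)

for i in range(1024):
    access(4 * i)                      # a[i], 4-byte ints
    access(4 * 1024 * i)               # a[1024*i]

print(misses, hits)   # 1151 misses, 897 hits
```

The result is the same if a[1024*i] is evaluated first within an iteration, since only the i = 0 iteration touches a shared block.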