
CS433g: Computer System Organization
Fall 2005
Practice Set 5: Memory Hierarchy

Please refer to the newsgroup message for instructions on obtaining EXTRA CREDIT for this homework.

Problem 1
Consider a system with a 4-way set-associative cache of 256 KB. The cache line size is 8 words (32 bits per word). The smallest addressable unit is a byte, and memory addresses are 64 bits long.

a. Show the division of the bits in a memory address and how they are used to access the cache.

Solution: We are given that the block size is 8 words (32 bytes). Therefore, the number of bits required to specify the block offset is log2(32) = 5. The number of sets is 256 KB / (32 * 4) = 2048, so the index field requires 11 bits. The remaining 64 - 11 - 5 = 48 bits are used for the tag field.

b. Draw a diagram showing the organization of the cache and, using your answer from part (a), indicate how physical addresses are related to cache locations.

Solution: The diagram would look similar to Figures 5.4 and/or 5.5 from H&P. Any physical addresses with the same index bits map to the same set in the cache; the tag is used to distinguish between these physical locations.

c. What memory addresses can map to set 289 of the cache?

Solution: Memory addresses whose index bits are 00100100001 (289 in binary) map to set 289.

d. What percentage of the cache memory is used for tag bits?

Solution: For each cache line (block) we have one tag entry. The size of the cache line is 32 * 8 = 256 bits. Therefore, the percentage of cache memory used for tag bits is 48 / (48 + 256) = 15.8%.
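As an illustration of the address split in part (a), the following C sketch extracts the tag, set index, and offset fields from a 64-bit address, assuming the 5/11/48 bit widths derived above (the variable names and the example address are ours, not part of the problem):

#include <stdint.h>
#include <stdio.h>

/* Field widths from part (a): 5 offset bits, 11 index bits, 48 tag bits. */
#define OFFSET_BITS 5
#define INDEX_BITS  11

int main(void) {
    uint64_t addr = 0x123456789ABCDEF0ULL;  /* an arbitrary 64-bit byte address */

    uint64_t offset = addr & ((1ULL << OFFSET_BITS) - 1);
    uint64_t set    = (addr >> OFFSET_BITS) & ((1ULL << INDEX_BITS) - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("tag = 0x%012llx, set = %llu, offset = %llu\n",
           (unsigned long long)tag, (unsigned long long)set,
           (unsigned long long)offset);
    return 0;
}

Any two addresses that share the same 11 index bits compete for the 4 ways of one set, which is what parts (b) and (c) describe.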

Problem 2
You are building a computer system around a processor with in-order execution that runs at 1 GHz and has a CPI of 1, excluding memory accesses. The only instructions that read or write data from/to memory are loads (20% of all instructions) and stores (5% of all instructions).

The memory system for this computer has a split L1 cache. Both the I-cache and the D-cache are direct mapped and hold 32 KB each. The I-cache has a 2% miss rate and 64-byte blocks, and the D-cache is a write-through, no-write-allocate cache with a 5% miss rate and 64-byte blocks. The hit time for both the I-cache and the D-cache is 1 ns. The L1 cache has a write buffer. 95% of writes to L1 find a free entry in the write buffer immediately. The other 5% of the writes have to wait until an entry frees up in the write buffer (assume that such writes arrive just as the write buffer initiates a request to L2 to free up its entry, and the entry is not freed until the L2 is done with the request). The processor is stalled on a write until a free write buffer entry is available.

The L2 cache is a unified write-back, write-allocate cache with a total size of 512 KB and a block size of 64 bytes. The hit time of the L2 cache is 15 ns. Note that this is also the time taken to write a word to the L2 cache. The local hit rate of the L2 cache is 80%. Also, 50% of all L2 cache blocks replaced are dirty. The 64-bit wide main memory has an access latency of 20 ns (including the time for the request to travel from the L2 cache to main memory), after which any number of bus words may be transferred at the rate of one bus word (64 bits) per bus cycle on the 64-bit wide, 100 MHz main memory bus.

Assume inclusion between the L1 and L2 caches, and assume there is no write-back buffer at the L2 cache. Assume a write-back takes the same amount of time as an L2 read miss of the same size. When calculating any time values (such as hit time, miss penalty, or AMAT), please use ns (nanoseconds) as the unit of time. For the miss rates below, give the local miss rate for that cache. By miss penaltyL2, we mean the time from the miss request issued by the L2 cache up to the time the data comes back to the L2 cache from main memory.

Part A: Computing the AMAT (average memory access time) for instruction accesses.

i. Give the values of the following terms for instruction accesses: hit timeL1, miss rateL1, hit timeL2, miss rateL2.

hit timeL1 = 1 processor cycle = 1 ns
miss rateL1 = 0.02
hit timeL2 = 15 ns
miss rateL2 = 1 - 0.8 = 0.2

ii. Give the formula for calculating miss penaltyL2, and compute the value of miss penaltyL2.

miss penaltyL2 = memory access latency + time to transfer one L2 cache block
Transfer rate of the memory bus = 64 bits per bus cycle = 8 bytes / 10 ns = 0.8 bytes/ns
Time to transfer one L2 cache block = 64 bytes / (0.8 bytes/ns) = 80 ns
So, miss penaltyL2 = 20 + 80 = 100 ns
However, 50% of all replaced blocks are dirty and need to be written back to main memory, which takes another 100 ns. Therefore, miss penaltyL2 = 100 + 0.5 x 100 = 150 ns

iii. Give the formula for calculating the AMAT for this system using the five terms whose values you computed above and any other values you need.

AMAT = hit timeL1 + miss rateL1 x (hit timeL2 + miss rateL2 x miss penaltyL2)

iv. Plug the values into the AMAT formula above, and compute a numerical value for the AMAT for instruction accesses.

AMAT = 1 + 0.02 x (15 + 0.2 x 150) = 1.9 ns

Part B: Computing the AMAT for data reads.

i. Give the value of miss rateL1 for data reads.

miss rateL1 = 0.05

ii. Calculate the value of the AMAT for data reads using the above value, and other values you need.

AMAT = hit timeL1 + miss rateL1 x (hit timeL2 + miss rateL2 x miss penaltyL2)
AMAT = 1 + 0.05 x (15 + 0.2 x 150) = 3.25 ns

Part C: Computing the AMAT for data writes.

i. Give the value of miss penaltyL2 for data writes.

miss penaltyL2 = miss penaltyL2 for the data read case, so miss penaltyL2 = 150 ns. (This assumes that after the block is read into the L2 cache from main memory, no further time is spent writing to it; in other words, the time to write to it is included in the 150 ns value. This value of 150 ns is used in the solutions for all subsequent parts.) A value of 151 ns is also perfectly acceptable, assuming that one additional cycle (1 ns) is spent writing to the block once it has arrived in the L2 cache.

ii. Give the value of write timeL2Buff for a write buffer entry being written to the L2 cache.

Since the L2 cache hit rate is 80%, only 20% of the write buffer writes will miss in the L2 cache and thus incur miss penaltyL2. So, write timeL2Buff = hit timeL2 + 0.2 x miss penaltyL2 = 15 + 0.2 x 150 = 45 ns

iii. Calculate the value of the AMAT for data writes using the above two values, and any other values that you need. Only include the time that the processor will be stalled. Hint: There are two cases to be considered here, depending on whether the write buffer is full or not.

There are two cases to consider. In 95% of the cases the write buffer has a free entry, so the processor only needs to wait 1 cycle. In the remaining 5% of the cases, the write buffer is full, and the processor has to wait the additional time taken for a buffer entry to be written to the L2 cache, which is write timeL2Buff.

AMAT = hit timeL1 + 0.05 x write timeL2Buff = 1 + 0.05 x (45) = 3.25 ns

Part D: Compute the overall CPI, including memory accesses (instructions plus data). Assume that there is no overlap between the latencies of instruction and data accesses.

The CPI excluding memory accesses = 1. We are given that 20% of the instructions are data reads (loads) and 5% are data writes (stores). Also, note that 100% of the instructions require an instruction fetch. Since one clock cycle on this system is 1 ns, we can use the AMAT values directly.

CPI including memory accesses = 1 + (AMAT for instructions - 1) + 0.2 x (AMAT for data reads - 1) + 0.05 x (AMAT for data writes - 1)
= 1 + (1.9 - 1) + 0.2 x (3.25 - 1) + 0.05 x (3.25 - 1) = 2.46

Note: We subtract 1 cycle (1 ns) from all of the AMAT values (instruction, data read, and data write) because 1 cycle of memory access is already accounted for in the base CPI of 1.
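As a quick numerical cross-check of Parts A through D, the following C sketch plugs the given miss rates and latencies into the formulas above (the constant names are ours; the 150 ns miss penalty and 45 ns write-buffer drain time are taken from the solution):

#include <stdio.h>

int main(void) {
    /* Given parameters: hit times in ns, miss rates as fractions. */
    const double hit_l1 = 1.0, hit_l2 = 15.0;
    const double miss_l1_instr = 0.02, miss_l1_read = 0.05, miss_l2 = 0.20;
    /* 20 ns latency + 80 ns block transfer + 50%-likely 100 ns write-back (Part A ii). */
    const double miss_penalty_l2 = 150.0;

    double amat_instr = hit_l1 + miss_l1_instr * (hit_l2 + miss_l2 * miss_penalty_l2);
    double amat_read  = hit_l1 + miss_l1_read  * (hit_l2 + miss_l2 * miss_penalty_l2);

    double write_time_l2buff = hit_l2 + miss_l2 * miss_penalty_l2;  /* 45 ns (Part C ii) */
    double amat_write = hit_l1 + 0.05 * write_time_l2buff;          /* 5% of writes stall */

    double cpi = 1.0 + (amat_instr - 1.0)
                     + 0.20 * (amat_read  - 1.0)
                     + 0.05 * (amat_write - 1.0);

    printf("AMAT: instr %.2f ns, read %.2f ns, write %.2f ns; overall CPI %.2f\n",
           amat_instr, amat_read, amat_write, cpi);
    return 0;
}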

Problem 3
Way prediction allows an associative cache to provide the hit time of a direct-mapped cache. The MIPS R10000 processor uses way prediction to achieve a different goal: reduce the cost of the chip package. The R10000 hardware includes an on-chip L1 cache, on-chip L2 tag comparison circuitry, and an on-chip L2 way prediction table. L2 tag information is brought on chip to detect an L2 hit or miss. The way prediction table contains 8K 1-bit entries, each corresponding to two L2 cache blocks. L2 cache storage is built external to the processor package, is 2-way associative, and may have one of several block sizes.

a. How can way prediction reduce the number of pins needed on the R10000 package to read L2 tags and data, and what is the impact on performance compared to a package with a full complement of pins to interface to the L2 cache?

Solution: When way prediction is not used, the chip would need to access the L2 tags for both associative ways. Ideally, this would be done in parallel; thus, the R10000 and L2 chips would need enough pins to bring both tags onto the processor for comparison. With way prediction, we need only bring in the tag for the way that was predicted; in the less likely case where the predicted way is incorrect, we could load the other tag with minimal penalty.

b. What is the performance drawback of just using the same smaller number of pins but not including way prediction?

Solution: With the smaller number of pins but no way prediction, we would check the tags for the two ways one after the other. On a hit, on average half the time we would get the correct tag first and half the time second. With way prediction, we get the correct tag first a high fraction of the time, so without it the average L2 access time is higher.

c. Assume that the R10000 uses most-recently-used way prediction. What are reasonable design choices for the cache state update(s) to make when the desired data is in the predicted way, the desired data is in the non-predicted way, and the desired data is not in the L2 cache? Please fill in your answers in the following table.

Solution:

Cache Access Case | Way prediction entry | Tag and valid bits | Cache data
Desired data is in the predicted way | No change | No change | No change
Desired data is in the non-predicted way | Flip the way prediction bit | No change | No change
Desired data is not in the L2 cache | Set the way prediction bit to point to the new location of the data | Set tag and valid bits | Bring data from memory
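The table above can also be read as update logic. Below is a rough C sketch of that policy for one 2-way set with a single MRU prediction bit; all identifiers are illustrative and not taken from R10000 documentation:

/* Sketch of the MRU way-prediction update policy from the table above.
 * Assumes a 2-way set-associative L2 with one prediction bit per set. */
typedef struct {
    unsigned long tag[2];
    int valid[2];
    int predicted_way;          /* 1-bit MRU way predictor for this set */
} l2_set_t;

/* Returns the way that holds 'tag' after the access, updating the predictor. */
int access_set(l2_set_t *set, unsigned long tag, int replacement_way) {
    int p = set->predicted_way;
    if (set->valid[p] && set->tag[p] == tag)
        return p;                            /* hit in predicted way: no change */
    if (set->valid[1 - p] && set->tag[1 - p] == tag) {
        set->predicted_way = 1 - p;          /* hit in non-predicted way: flip bit */
        return 1 - p;
    }
    /* miss: fill the chosen way, set tag/valid, point predictor at the new data */
    set->tag[replacement_way] = tag;
    set->valid[replacement_way] = 1;
    set->predicted_way = replacement_way;
    return replacement_way;
}

On a miss the caller would choose replacement_way by whatever replacement policy the cache uses; the point here is only which state changes in each of the three cases.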

d. For a 1024 KB L2 cache with 64-byte blocks and 8-way set associativity, how many way prediction table entries are needed?

Solution: The number of blocks in the L2 cache = 1024 KB / 64 B = 16K. The number of sets in the L2 cache = 16K / 8 = 2K. Thus, the number of way prediction table entries needed = 2K.

e. For an 8 MB L2 cache with 128-byte blocks and 2-way set associativity, how many way prediction table entries are needed?

Solution: The number of blocks in the L2 cache = 8 MB / 128 B = 64K. The number of sets in the L2 cache = 64K / 2 = 32K. Thus, the number of way prediction table entries needed = 32K.
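Both answers follow from one formula: one prediction entry per set, i.e., entries = cache size / (block size x associativity). A small C check (ours, not part of the assignment) for parts (d) and (e):

#include <stdio.h>

/* One way-prediction entry per set: entries = cache_size / (block_size * associativity). */
static long wp_entries(long cache_bytes, long block_bytes, long assoc) {
    return cache_bytes / (block_bytes * assoc);
}

int main(void) {
    printf("part d: %ldK entries\n", wp_entries(1024L * 1024, 64, 8) / 1024);      /* 2K */
    printf("part e: %ldK entries\n", wp_entries(8L * 1024 * 1024, 128, 2) / 1024); /* 32K */
    return 0;
}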

f. What is the difference in the way that the R10000, with only 8K way prediction table entries, will support the cache in part (d) versus the cache in part (e)? Hint: Think about the similarity between a way prediction table and a branch prediction table.

Solution: Since the R10000 way prediction table has 8K entries, it can easily support the cache in part (d), which needs only 2K entries. However, the table is too small to hold the 32K entries required by the cache in part (e). One idea is to make each prediction entry in part (e) correspond to four different sets. However, this introduces the possibility of interference, just as we have seen previously with branch history tables.

Problem 4
Consider the following piece of code:

register int i, j;                     /* i, j are in the processor registers */
register float sum1, sum2, a[64][64], b[64][64];

for ( i = 0; i < 64; i++ ) {           /* 1 */
    for ( j = 0; j < 64; j++ ) {       /* 2 */
        sum1 += a[i][j];               /* 3 */
    }
    for ( j = 0; j < 32; j++ ) {       /* 4 */
        sum2 += b[i][2*j];             /* 5 */
    }
}

Assume the following:
- There is a perfect instruction cache; i.e., do not worry about the time for any instruction accesses.
- Both int and float are of size 4 bytes.
- Only the accesses to the array locations a[i][j] and b[i][2*j] generate loads to the data cache. The rest of the variables are all allocated in registers.
- The data cache is fully associative with LRU replacement and has 32 lines, where each line holds 16 bytes. Initially, the data cache is empty.
- The arrays a and b are stored in row-major form.
- To keep things simple, assume that statements in the above code are executed sequentially. The time to execute lines (1), (2), and (4) is 4 cycles for each invocation. Lines (3) and (5) take 10 cycles to execute and an additional 40 cycles to wait for the data if there is a data cache miss.
- There is a data prefetch instruction with the format prefetch(array[index]). This prefetches the entire block containing the word array[index] into the data cache. It takes 1 cycle for the processor to execute this instruction and send it to the data cache; the processor can then go ahead and execute subsequent instructions. If the prefetched data is not in the cache, it takes 40 cycles for the data to get loaded into the cache.
- The arrays a and b both start at cache line boundaries.

a. How many cycles does the above code fragment take to execute if we do NOT use prefetching? Also calculate the average number of cycles per outer-loop iteration.

Solution:
Number of cycles taken by line (1) = 64 x 4 = 256
Number of cycles taken by line (2) = 64 x 64 x 4 = 16384
Number of cycles taken by line (3) = 64 x 64 x (10 + 40/4) = 81920
(For line (3), every fourth cache access is a miss, which is where the 40/4 comes from.)
Number of cycles taken by line (4) = 64 x 32 x 4 = 8192
Number of cycles taken by line (5) = 64 x 32 x (10 + 40/2) = 61440
(For line (5), every second cache access is a miss, which is where the 40/2 comes from.)
Total number of cycles taken by the entire code fragment = 256 + 16384 + 81920 + 8192 + 61440 = 168192 cycles
The average number of cycles per outer-loop iteration = 168192 / 64 = 2628

b. Consider inserting prefetch instructions for the two inner loops for the arrays a and b, respectively. Explain why we may need to unroll the loops to insert prefetches. What is the minimum number of times you would need to unroll each of the two loops for this purpose?

Solution: Since a cache line is 16 bytes long and a float is 4 bytes, a cache block holds 4 floats. Thus, one prefetch instruction brings in four elements of the array, so we only need to issue a prefetch every 4 iterations of the first inner loop operating on array a. To achieve this, unroll that loop 4 times. For the second inner loop operating on array b, we need to issue a prefetch every 2 iterations, so unroll that loop 2 times.

c. Unroll the inner loops the number of times identified in part (b), and insert the minimum number of software prefetches to minimize execution time. The technique for inserting prefetches is analogous to software pipelining. You do not need to worry about startup and cleanup code, and do not introduce any new loops.

Solution:
for (i = 0; i < 64; i++) {              /* (1) */
    for (j = 0; j < 64; j += 4) {       /* (2) */
        prefetch(a[i][j+4]);            /* (3-0) */
        sum1 += a[i][j];                /* (3-1) */
        sum1 += a[i][j+1];              /* (3-2) */
        sum1 += a[i][j+2];              /* (3-3) */
        sum1 += a[i][j+3];              /* (3-4) */
    }

    for (j = 0; j < 32; j += 2) {       /* (4) */
        prefetch(b[i][2*(j+4)]);        /* (5-0) */
        sum2 += b[i][2*j];              /* (5-1) */
        sum2 += b[i][2*(j+1)];          /* (5-2) */
    }
}

d. How many cycles does each outer-loop iteration of the code in part (c) take to execute? Calculate the average speedup over the code without prefetching. Assume prefetches are not present in the startup code. Extra time needed by prefetches executing beyond the end of the loop should not be counted.

Solution:

Number of cycles taken by line (1) = 4
Number of cycles taken by line (2) = 16 x 4 = 64
Number of cycles taken by line (3-0) = 16 x 1 = 16
Number of cycles taken by line (3-1) = 15 x 10 + 1 x (10 + 40) = 200
(For line (3-1), the first unrolled iteration of the inner loop always misses, because the prefetch fetches the data required by the next unrolled iteration.)
Number of cycles taken by line (3-2) = 16 x 10 = 160
Number of cycles taken by line (3-3) = 16 x 10 = 160
Number of cycles taken by line (3-4) = 16 x 10 = 160
Number of cycles taken by line (4) = 16 x 4 = 64
Number of cycles taken by line (5-0) = 16 x 1 = 16
Number of cycles taken by line (5-1) = 14 x 10 + 2 x (10 + 40) = 240
(For line (5-1), the first two unrolled iterations of the inner loop always miss, because the prefetch fetches data needed two iterations ahead.)
Number of cycles taken by line (5-2) = 16 x 10 = 160
Total number of cycles per outer-loop iteration = 4 + 64 + 16 + 200 + 160 + 160 + 160 + 64 + 16 + 240 + 160 = 1244 cycles
The speedup over the code without prefetching = 168192 / (1244 x 64) = 2.11

e. In part (c) above it is possible that loop unrolling reduces performance by increasing code size. Is there another technique that can be used to achieve the same objective as loop unrolling in this example, but with fewer instructions? Explain this technique and illustrate its use for the code in part (c).

Solution: We could use an if statement to eliminate the loop unrolling.

for (i = 0; i < 64; i++) {              /* (1) */
    for (j = 0; j < 64; j++) {          /* (2) */
        if (j % 4 == 0)                 /* (3-0) */
            prefetch(a[i][j+4]);        /* (3-1) */
        sum1 += a[i][j];                /* (3-2) */
    }
    for (j = 0; j < 32; j++) {          /* (4) */
        if (j % 2 == 0)                 /* (5-0) */
            prefetch(b[i][2*(j+2)]);    /* (5-1) */
        sum2 += b[i][2*j];              /* (5-2) */
    }
}
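As a cross-check on the arithmetic in parts (a) and (d), the following small C program (ours, not part of the assignment) tabulates the cycle counts from the per-line costs above:

#include <stdio.h>

int main(void) {
    /* Part (a): no prefetching, whole fragment (64 outer iterations) */
    long no_pref = 64L*4                       /* line (1) */
                 + 64L*64*4                    /* line (2) */
                 + 64L*64*(10 + 40/4)          /* line (3): miss every 4th access */
                 + 64L*32*4                    /* line (4) */
                 + 64L*32*(10 + 40/2);         /* line (5): miss every 2nd access */

    /* Part (d): with prefetching, one outer-loop iteration */
    long pref_iter = 4                         /* (1) */
                   + 16*4 + 16*1               /* (2), (3-0) */
                   + 15*10 + 1*(10+40)         /* (3-1): first unrolled iteration misses */
                   + 3*16*10                   /* (3-2)..(3-4) */
                   + 16*4 + 16*1               /* (4), (5-0) */
                   + 14*10 + 2*(10+40)         /* (5-1): first two iterations miss */
                   + 16*10;                    /* (5-2) */

    printf("no prefetch: %ld cycles (%ld per outer iteration)\n", no_pref, no_pref/64);
    printf("with prefetch: %ld cycles per outer iteration, speedup = %.2f\n",
           pref_iter, (double)no_pref / (pref_iter * 64));
    return 0;
}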

Problem 5
Consider a system with the following processor components and policies:
- A direct-mapped L1 data cache of size 4 KB with a block size of 16 bytes, indexed and tagged using physical addresses, and using a write-allocate, write-back policy
- A fully-associative data TLB with 4 entries and an LRU replacement policy
- Physical addresses of 32 bits and virtual addresses of 40 bits
- Byte-addressable memory
- A page size of 1 MB

Part A: Which bits of the virtual address are used to obtain a virtual-to-physical translation from the TLB? Explain exactly how these bits are used to make the translation, assuming there is a TLB hit.

Solution: The virtual address is 40 bits long. Because the virtual page size is 1 MB = 2^20 bytes and memory is byte addressable, the virtual page offset is 20 bits. Thus, the upper 40 - 20 = 20 bits are used for address translation at the TLB. Since the TLB is fully associative, all of these bits are used for the tag; i.e., there are no index bits. When a virtual address is presented for translation, the hardware checks whether the 20-bit tag is present in the TLB by comparing it to all entries simultaneously. If a valid match is found (i.e., a TLB hit) and no protection violation occurs, the page frame number is read directly from the TLB.

Part B: Which bits of the virtual or physical address are used as the tag, index, and block offset bits for accessing the L1 data cache? Explicitly specify which of these bits can be used directly from the virtual address without any translation.

Solution: Since the cache is physically indexed and physically tagged, all of the bits for accessing the cache must come from the physical address. However, since the lowest 20 bits of the virtual address form the page offset and are therefore not translated, these 20 bits can be used directly from the virtual address. The remaining 12 bits (of the 32 bits in the physical address) must be used after translation. Since the block size is 16 bytes = 2^4 bytes and memory is byte addressable, the lowest 4 bits are used as the block offset. Since the cache is direct mapped, the number of sets is 4 KB / 16 bytes = 2^8, so 8 bits are needed for the index. The remaining 32 - 8 - 4 = 20 bits are needed for the tag.

Tag (20 bits) | Index (8 bits) | Offset (4 bits)

As mentioned above, the index and offset bits can be used before translation, while the tag bits must await the translation of the 12 uppermost bits.
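To make the bit manipulation in Parts A and B concrete, here is a C sketch (ours) that translates one of the Part C virtual addresses and splits the resulting physical address into cache fields; the hard-coded frame number stands in for a real TLB/page-table lookup:

#include <stdint.h>
#include <stdio.h>

/* Field widths from Parts A and B: 1 MB pages give a 20-bit page offset; the
   fully associative TLB uses the upper 20 VA bits as its tag; the 4 KB
   direct-mapped cache with 16-byte blocks uses 4 offset, 8 index, 20 tag bits. */
int main(void) {
    uint64_t va = 0xFFFFFABAC1ULL;             /* access 1 of Part C */
    uint32_t page_offset = (uint32_t)(va & 0xFFFFF);
    uint32_t vpn         = (uint32_t)(va >> 20);

    uint32_t ppn = 0xCFC;                      /* page table maps VPN FFFFF -> CFC */
    uint32_t pa  = (ppn << 20) | page_offset;

    uint32_t blk_offset = pa & 0xF;
    uint32_t index      = (pa >> 4) & 0xFF;
    uint32_t tag        = pa >> 12;

    printf("VPN %05X -> PA %08X: tag %05X, index %02X, offset %X\n",
           vpn, pa, tag, index, blk_offset);
    return 0;
}

Running this prints index AC for the first access, matching the first row of the solution table below.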

Part C: The following table lists part of the page table entries corresponding to a few virtual addresses (in hexadecimal notation). Protection bits of 01 imply read-only access and 11 implies read/write access. A dirty bit of 0 implies the page is not dirty. Assume the valid bits of all the following entries are set to 1.

Virtual page number | Physical page number | Protection bits | Dirty bit
FFFFF | CFC | 11 | 0
FFFFE | CAC | 11 | 0
FFFFD | CFC | 11 | 0
FFFFC | CBA | 11 | 0
FFFFB | CAA | 11 | 0
FFFFA | CCA | 01 | 0

The following table lists a stream of eight data loads and stores to virtual addresses by the processor (all addresses are in hexadecimal). Complete the rest of the entries in the table corresponding to these loads and stores using the above information and your solutions to parts A and B. For the data TLB hit, data cache hit, and protection violation columns, specify yes or no. Assume initially the data TLB and data cache are both empty.

# | Processor load/store to virtual address | Corresponding physical address | Part of the physical address used to index the data cache | Data TLB hit? | Data cache hit? | Protection violation? | Dirty bit
1 | Store FFFFF ABAC1 | | | | | |
2 | Store FFFFC ECAB1 | | | | | |
3 | Load FFFFF BAAE3 | | | | | |
4 | Load FFFFB CEBC3 | | | | | |
5 | Store FFFFE AAFA1 | | | | | |
6 | Store FFFFC AABC9 | | | | | |
7 | Load FFFFD BAAE2 | | | | | |
8 | Store FFFFA ABAC4 | | | | | |

Solution:

# | Processor load/store to virtual address | Corresponding physical address | Part of the physical address used to index the data cache | Data TLB hit? | Data cache hit? | Protection violation? | Dirty bit
1 | Store FFFFF ABAC1 | CFCABAC1 | AC | No | No | No | 1
2 | Store FFFFC ECAB1 | CBAECAB1 | AB | No | No | No | 1
3 | Load FFFFF BAAE3 | CFCBAAE3 | AE | Yes | No | No | 0
4 | Load FFFFB CEBC3 | CAACEBC3 | BC | No | No | No | 0
5 | Store FFFFE AAFA1 | CACAAFA1 | FA | No | No | No | 1
6 | Store FFFFC AABC9 | CBAAABC9 | BC | Yes | No | No | 1
7 | Load FFFFD BAAE2 | CFCBAAE2 | AE | No | Yes | No | 0
8 | Store FFFFA ABAC4 | CCAABAC4 | AC | No | No | Yes | 0
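The Data TLB hit? column can be reproduced by replaying the virtual page numbers through a small model of the 4-entry, fully associative, LRU TLB. The sketch below (ours, not part of the assignment) prints hit or miss for each of the eight accesses:

#include <stdio.h>

#define ENTRIES 4

int main(void) {
    unsigned vpn_stream[8] = {0xFFFFF, 0xFFFFC, 0xFFFFF, 0xFFFFB,
                              0xFFFFE, 0xFFFFC, 0xFFFFD, 0xFFFFA};
    unsigned tlb[ENTRIES];
    int age[ENTRIES];           /* larger age = less recently used */
    int used = 0;

    for (int i = 0; i < 8; i++) {
        int hit = -1;
        for (int e = 0; e < used; e++)
            if (tlb[e] == vpn_stream[i]) hit = e;
        for (int e = 0; e < used; e++) age[e]++;
        if (hit >= 0) {
            age[hit] = 0;
            printf("access %d (VPN %05X): TLB hit\n", i + 1, vpn_stream[i]);
        } else {
            int victim = 0;
            if (used < ENTRIES) victim = used++;
            else for (int e = 1; e < ENTRIES; e++)
                     if (age[e] > age[victim]) victim = e;   /* evict LRU entry */
            tlb[victim] = vpn_stream[i];
            age[victim] = 0;
            printf("access %d (VPN %05X): TLB miss\n", i + 1, vpn_stream[i]);
        }
    }
    return 0;
}

The output is miss, miss, hit, miss, miss, hit, miss, miss, matching the No/No/Yes/No/No/Yes/No/No column above.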

Problem 6
Consider a 4-way set-associative L1 data cache with a total of 64 KB of data. Assume the cache is write-back with a block size of 16 bytes. Further, the cache is virtually-indexed and physically-tagged, meaning that the index field of the address comes from the virtual address generated by the CPU, and the tag field comes from the physical address that the virtual address is translated into. The data TLB in the system has 128 entries and is 2-way set associative. The physical and virtual addresses are 32 bits and 50 bits, respectively. Both the cache and the memory are byte addressable. The physical memory page size is 4 KB.

Part A: Show the bits from the physical and virtual addresses that are used as the block offset, index, and tag bits for the cache. Similarly, show the bits from the two addresses that are used as the page offset, index, and tag bits for the TLB.

Solution: For the TLB, we only use the virtual address. Each page has 4 KB = 2^12 bytes, and we need 12 bits to uniquely address each of these bytes, so the page offset is the least significant (rightmost) 12 bits of the address. The TLB has 128 entries divided into 64 = 2^6 sets, and we need 6 bits to uniquely address each of these 64 sets, so the next 6 bits form the index. As stated, the virtual address is 50 bits long, so the remaining 32 most significant bits make up the tag field of the address. This is illustrated below.

TLB lookup (virtual address): Tag (32 bits) = bits 49-18 | Index (6 bits) = bits 17-12 | Page Offset (12 bits) = bits 11-0

For the data cache, a cache block is 16 = 2^4 bytes, so we need 4 bits to uniquely address each of these 16 bytes; the block offset is therefore the least significant (rightmost) 4 bits of the address. The number of blocks in the cache is 64 KB / 16 B = 2^12. Since the cache is 4-way set associative, there are 2^10 sets in the cache, and we need 10 bits to uniquely identify each of these sets, so the next 10 bits form the index. As stated, the physical address is 32 bits long, so the remaining 18 most significant bits make up the tag field of the address. This is illustrated below.

Cache lookup: Tag (18 bits) = bits 31-14 (physical) | Index (10 bits) = bits 13-4 (virtual) | Block Offset (4 bits) = bits 3-0 (virtual)

Since the cache is physically tagged, the tag bits come from the physical address, using bits 14 to 31; the index and block offset come from the virtual address.

Part B: Can the cache described for this problem have synonyms? If not, explain in detail why not. If so, first explain in detail why, and then explain how and why you could eliminate this problem by changing the cache configuration, without changing the total data storage in the cache. Would you expect this cache to perform worse or better than the original design? Please provide a justification for your answer.

Solution: Yes, the original cache can have synonyms, because 2 bits of the virtual page-frame number (bits 13 and 12) are used in the index. These bits can cause the synonym problem, since different values of these virtual bits may map to the same physical address. To eliminate this problem, the cache must use only virtual bits that correspond directly to physical bits (the page offset) for the index. This can be done by reducing the number of index bits and increasing the number of tag bits. Since we need to reduce the number of index bits by two, we increase the associativity from 4-way to 4 x 2^2 = 16-way. A cache lookup then uses bits 12-31 as tag, bits 4-11 as index, and bits 0-3 as block offset. Since bits 0-11 are the page offset, we effectively use only physical address bits for indexing. Since the cache is also physically tagged, the cache lookup uses only physical address bits, and there can be no synonyms.

Cache lookup with the new configuration: Tag (20 bits) = bits 31-12 | Index (8 bits) = bits 11-4 | Block Offset (4 bits) = bits 3-0

As to whether the new cache performs better, reasonable answers are accepted. The preferred response is that the new cache has the advantage that it can be indexed completely in parallel with the TLB access, since the index and block offset now lie entirely within the page offset and can be used in untranslated form. The disadvantage is that a 16-way associative cache is difficult to build and requires many tag comparisons to operate in parallel. Therefore, whether the new cache is faster depends on whether the advantage outweighs the disadvantage.
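The synonym argument boils down to counting bits: synonyms are possible exactly when index plus block-offset bits exceed the page-offset bits. A small C check (our helper names, not from the assignment) for the original 4-way and the proposed 16-way configurations:

#include <stdio.h>

/* Number of translated (virtual) bits that end up in the cache index. */
static int virtual_index_bits(long cache_bytes, long block_bytes, long assoc,
                              int page_offset_bits) {
    int offset_bits = 0, index_bits = 0;
    long sets = cache_bytes / (block_bytes * assoc);
    while ((1L << offset_bits) < block_bytes) offset_bits++;
    while ((1L << index_bits) < sets) index_bits++;
    int excess = offset_bits + index_bits - page_offset_bits;
    return excess > 0 ? excess : 0;
}

int main(void) {
    printf("original 4-way cache: %d virtual bits in index\n",
           virtual_index_bits(64L * 1024, 16, 4, 12));    /* 2 -> synonyms possible */
    printf("16-way cache:         %d virtual bits in index\n",
           virtual_index_bits(64L * 1024, 16, 16, 12));   /* 0 -> no synonyms */
    return 0;
}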

Problem 7
A graduate student comes to you with the following graph. The student is performing experiments by varying the amount of data accessed by a certain benchmark. The only thing the student tells you of the experiments is that their system uses virtual memory, a data TLB, only one level of data cache, and the data TLB maps a much smaller amount of data than can be contained in the data cache. You may assume that there are no conflict misses in the caches and TLB. Further assume that instructions always fit in the instruction TLB and an L1 instruction cache.

[Figure: Execution time versus data size (KB). The curve is divided into regions labeled 1 through 7, with points a, b, and c marked on the data-size axis.]

Part A: Give an explanation for the shape of the curve in each of the regions numbered 1 through 7.

Solution:
1: Execution time slowly increases (performance decreases) with increasing data size but remains at roughly the same level.
2: At this point the TLB overflows, and execution time sharply increases to handle the increased TLB misses.
3: Execution time again slowly increases with data size and plateaus at a higher level than before, due to the overhead of TLB misses.
4: At this point the data cache overflows, causing a high frequency of cache misses, and execution time again sharply increases.
5: Execution time again slowly increases with data size and plateaus at a high level, due to the overhead of retrieving data directly from main memory on cache misses.
6: Execution time again sharply increases because physical memory fills up and thrashing occurs between disk and physical memory.

7: Execution time is very high due to overhead from TLB misses, cache misses and virtual memory thrashing. It is slowly increasing due to increasing data size.

Part B: From the graph, can you make a reasonable guess at any of the following system properties? If so, what are they? If not, why not? Explain your answers. (Note: your answers can be in terms of a, b, and c.)
(i) Number of TLB entries
(ii) Page size
(iii) Physical memory size
(iv) Virtual memory size
(v) Cache size

Solution: There is no reasonable guess for the page size or the virtual memory size. There is also no reasonable guess for the number of TLB entries, since it depends on the page size. It is acceptable to guess that the cache size is b KB and the physical memory size is c KB, since these are the points at which the execution time shows significant degradations. However, these quantities are actually only upper bounds, since the actual size of these structures depends on the temporal and spatial reuse in the access stream. (The actual size depends on a property known as the working set of the application.)

Problem 8
Consider a memory hierarchy with the following parameters. Main memory is interleaved on a word basis with four banks, and a new bank access can be started every cycle. It takes 8 processor clock cycles to send an address from the cache to main memory, 50 cycles for memory to access a block, and an additional 25 cycles to send a word of data back from memory to the cache. The memory bus width is 1 word. There is a single level of data cache with a miss rate of 2% and a block size of 4 words. Assume 25% of all instructions are data loads and stores. Assume a perfect instruction cache; i.e., there are no instruction cache misses. If all data loads and stores hit in the cache, the CPI for the processor is 1.5.

Part A: Suppose the above memory hierarchy is used with a simple in-order processor and the cache blocks on all loads and stores until they complete. Compute the miss penalty and the resulting CPI for such a system.

Solution: Miss penalty = 8 + 50 + 25 x 4 = 158 cycles.
CPI = 1.5 + (0.25 x 0.02 x 158) = 2.29

Part B: Suppose we now replace the processor with an out-of-order processor and the cache with a non-blocking cache that can have multiple load and store misses outstanding. Such a configuration can overlap some part of the miss penalty, resulting in a lower effective penalty as seen by the processor. Assume that this configuration effectively reduces the miss penalty (as seen by the processor) by 20%. What is the CPI of this new system, and what is the speedup over the system in Part A?

Solution: Effective miss penalty = 0.80 x 158 = 126 cycles.
CPI = 1.5 + (0.25 x 0.02 x 126) = 2.13
Speedup over the system in Part A = 2.29 / 2.13 = 1.08.

Part C: Start with the system in Part A for this part. Suppose now we double the bus width and the width of each memory bank. That is, it still takes 50 cycles for memory to access the block, but each additional 25 cycles now sends a double word of data back from memory to the cache. What is the miss penalty now? What is the CPI? Is this system faster or slower than that in Part B?

Solution: Miss penalty = 8 + 50 + 25 x 2 = 108 cycles.
CPI = 1.5 + (0.25 x 0.02 x 108) = 2.04
This system is slightly faster than that in Part B.
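The three CPIs can be checked with a few lines of C (constant names are ours; Part B's penalty is rounded to 126 cycles as in the solution above):

#include <stdio.h>

int main(void) {
    const double base_cpi = 1.5, mem_refs = 0.25, miss_rate = 0.02;

    double penalty_a = 8 + 50 + 25.0 * 4;    /* 158 cycles, one word per transfer */
    double penalty_b = 126;                  /* 0.80 * 158 = 126.4, rounded as in Part B */
    double penalty_c = 8 + 50 + 25.0 * 2;    /* 108 cycles, double-word transfers */

    double cpi_a = base_cpi + mem_refs * miss_rate * penalty_a;
    double cpi_b = base_cpi + mem_refs * miss_rate * penalty_b;
    double cpi_c = base_cpi + mem_refs * miss_rate * penalty_c;

    printf("CPI A = %.2f, CPI B = %.2f, CPI C = %.2f, speedup B over A = %.2f\n",
           cpi_a, cpi_b, cpi_c, cpi_a / cpi_b);
    return 0;
}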
