Caches

An example memory hierarchy

Smaller, faster, and costlier (per byte) storage devices sit near the top; larger, slower, and cheaper (per byte) storage devices sit near the bottom.

L0: registers – CPU registers hold words retrieved from cache memory
L1: on-chip L1 cache (SRAM) – holds cache lines retrieved from the L2 cache
L2: off-chip L2 cache (SRAM) – holds cache lines retrieved from main memory
L3: main memory (DRAM) – holds disk blocks retrieved from local disks
L4: local secondary storage (local disks) – holds files retrieved from disks on remote network servers
L5: remote secondary storage (distributed file systems, Web servers)

Principle of Locality
• Principle of Locality
– Programs tend to re-use data and instructions near those they have used recently

Two types of locality
1. Temporal locality: items currently being used are likely to be reused in the near future
2. Spatial locality: items with nearby addresses tend to be referenced close together in time

Programming to Maximize Benefits of Locality

• Are there things we can do in our code to allow us to benefit from the principle of locality?
– Programs that reference the same variables repeatedly have good locality
– Programs with small strides in array locations have good locality
– Programs with small loops with many iterations have good locality

Locality Example
int sumvec(int v[N])
{
    int i, sum = 0;
    for (i = 0; i < N; i++)
        sum += v[i];
    return sum;
}

Locality:
• Data
• Data
  – Reference array elements in succession (stride-1 reference pattern): Spatial locality
  – Reference sum each iteration: Temporal locality
• Instructions
  – Reference instructions in sequence: Spatial locality
  – Cycle through loop repeatedly: Temporal locality

Locality Example
• How is the locality of this program?

• Note the elements are accessed in the order they are stored in memory
• This implies good locality

Locality Example 2
• Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer
• Question: Does this function have good locality?

Locality Example 2
• How is the locality of this program?
• Recall arrays in C/C++ are stored row-wise
• This means that all of row 1 is stored contiguously, then row 2, etc.
• This leads to good locality

Locality Example 3
• Question: Does this function have good locality?
• Assume a 2x3 array for ease

int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Locality Example 3
• How is the locality of this program?
• Recall arrays in C/C++ are stored row-wise
• But we are accessing the array by columns
• This leads to very poor locality

Locality Example 4
• Question: Can you permute the loops so that the function scans the 3-d array a[] with a stride-1 reference pattern (and thus has good spatial locality)?

int sumarray3d(int a[N][N][N])
{
    int i, j, k, sum = 0;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                sum += a[k][i][j];
    return sum;
}

Locality Example 4
• Recall: the best indexing for a 2-D array is to start at a row and count across all the columns in a[i][j]
  – i.e. increment the j counter in the inner loop
• So: for multi-dimensional arrays, simply run the loops in the order of the indices

int sumarray3d(int a[N][N][N])
{
    int i, j, k, sum = 0;
    for (k = 0; k < N; k++)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                sum += a[k][i][j];
    return sum;
}


Definition of a Cache
• The word cache is used throughout the memory hierarchy diagram
• What is a Cache?
  – A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device
• Why would a cache help us at all?
  – It relies on the Principle of Locality

Caching in the Memory Hierarchy
• Smaller, faster, more expensive memory at level k caches a subset of the blocks from level k+1
• The larger, slower, cheaper storage device at level k+1 is partitioned into blocks
• Data is copied between levels in block-sized transfer units
• It works because programs tend to access data at level k more often than at level k+1

Level k:    4  9 14  3
Level k+1:  0  1  2  3
            4  5  6  7
            8  9 10 11
           12 13 14 15

Caching in the Memory Hierarchy II
• Goal: the memory hierarchy provides a large pool of memory that:
  – Costs as much as the cheap storage near the bottom (well, only a little bit more)
  – But serves data to programs at the rate of the fast storage near the top

General caching concepts I
• Program needs object d, which is stored in block 14
• Block 14 is located in the level k cache
• This is called a cache hit

Level k:    4  9 14  3
Level k+1:  0  1  2  3
            4  5  6  7
            8  9 10 11
           12 13 14 15

General caching concepts II
• Program needs object b, which is stored in block 12
• b is not at level k, so the level k cache must fetch it from level k+1
• This is called a cache miss
• Simplest case is if there is a spot for block 12 in level k
• What happens if level k is full?

Level k:    4  9 14  3
Level k+1:  0  1  2  3
            4  5  6  7
            8  9 10 11
           12 13 14 15

General caching concepts III
• Program needs object b, which is stored in block 12
• b is not at level k, so the level k cache must fetch it from level k+1
• If the level k cache is full, some current block must be replaced (evicted)
  – Which one is the victim?
• Placement policy: where can the new block go? E.g. b mod 4
• Replacement policy: which block should be evicted? E.g. LRU

General Caching Concepts IV
• Block
  – A block is the smallest amount of memory that is stored in a cache
  – The size of the block can be different for each stage in the memory hierarchy
  – We rely on what kind of locality to ensure the additional data in the block is needed?
• Transfer Unit
  – This is the amount of data that we transfer into and out of a cache when we need new data
  – A transfer unit is the same size as the block for that stage

General Caching Concepts V
• Cache hit
  – Program finds b in the cache at level k
• Cache miss
  – b is not at level k, so the level k cache must fetch it from level k+1, e.g. the next lower level in the hierarchy
  – Placement policy: where can the new block go?
  – If the level k cache is full, then some current block must be replaced
  – Replacement policy: which block should be evicted?

General Caching Concepts - Cache Misses
• Recall a cache miss is when the data was not in the cache
  – What can cause a block not to be in the cache at a given time?
• There are three kinds of misses:
  1. Cold (compulsory) miss
  2. Conflict miss
  3. Capacity miss

General Caching Concepts - Cache Misses II
• Cold (compulsory) miss
  – Cold misses occur because the cache is empty
  – This happens when the machine starts up
• Capacity miss
  – Occurs when the set of active cache blocks (working set) is larger than the cache
  – There is no longer enough room in the cache for all the data, so some block needs to be discarded

General Caching Concepts - Cache Misses III
• Conflict miss
  – Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of positions at level k
  – E.g. block i at level k+1 must be placed in block (i mod 4) at level k
  – Conflict misses occur when multiple data objects all map to the same level k block
  – E.g. referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time in a cache with 4 blocks where we store at location i mod 4

General Org of a Cache Memory
• The cache contains sets of memory words called Blocks
  – These blocks usually contain a few words of memory
• Each block is part of a Line in the cache
  – The line contains identifying information on what is actually in the block and a status bit to say whether the data in the block is still valid
• These lines are then grouped into Sets
  – There may be 1 or more lines of cache data per set
  – The exact number of lines per set is determined by the type of cache (more on that in a few slides)

General Org of a Cache Memory .

Four Key Questions for Designing a Cache
1. Block Identification - How is a block found if it is in the "upper level"?
2. Block Placement - Where can a block be placed in the "upper level"?
3. Block Replacement - Which block should be replaced when there is a cache miss?
4. Write Strategy - What happens on a memory write?

Q1: How is a block found if it is in the cache?
• Blocks are maintained based on their block address (assume it is a physical address in memory)
• The address is divided into two parts:
  – The block address, and
  – The block offset (the LSBs that correspond to the location of the desired data within the block)
• The block address is further divided into two:
  1. The set index tells which set
  2. The tag provides the remaining block address bits

Decoding the Address for a Cached Word
• The following stages are needed to find a word in a cache
• Different parts of the actual word address provide these values
1. Read the set ID to see which set the word may be in
   • Called set selection
2. Read the tag to see if the word we want is actually in this line of the cache
   • The tag is a type of hash that provides the word address (at least part of it)
   • Called line matching

Decoding the Address for a Cached Word II
3. Read the valid bit
   • Only want to read the desired word if the cache line is valid
4. Read the block offset value to know which word in the line is the word we want
   • Called word extraction

Decoding the Address for a Cached Word .

Order of the Bits Used in the Address
• Remember that the bit fields in the actual word address are used to access the cache
  – A simple hash function uses these fields to access the lines and individual words in the cache
• Generally: a hash (hash value) is a non-unique address used to quickly access a table
  – It is not unique because hash tables can only store small subsets of a larger range of data values
  – The pigeonhole principle is in effect
• The order of the bit field selection is critical to effective performance

Order of the Bits Used in the Address II
• First, the middle bits are used for set selection
  – Why does this make sense?
• It's undesirable for nearby words in memory to map to the same line in the cache
  – It would cause the cache to replace possibly useful data too often

Order of the Bits Used in the Address III
• Mapping nearby words to the same line would go against our knowledge of spatial locality
  – Here we want consecutive words in different lines of the cache
• This way, when the program accesses the next data word in memory, the next cache line is accessed
  – The cache won't need to discard the last one; instead, the cache keeps them both

Why Use the Middle Bits as an Index?
• High-Order Bit Indexing
  – Adjacent memory lines would map to the same cache entry
  – Poor use of spatial locality
• Middle-Order Bit Indexing
  – Consecutive memory lines map to different cache lines
  – (Least significant bits are not shown here)

Order of the Bits Used in the Address IV
• Second, the top (most significant) bits are used for line matching (tag matching)
  – Why does this make sense?
• If the process got this far, then the middle order bits already match
  – Again, this is using knowledge of spatial locality

Order of the Bits Used in the Address V
• If the set index bits match, then the higher order bits need to be checked
  – These are only the same if the address is from the same higher order bit region of memory
  – With locality and the limited cache size, we know that the tag will indicate whether we are dealing with the same words (the same block)

Order of the Bits Used in the Address VI
• Third (finally), the low order bits are used for word extraction
  – Why does this make sense?
• These are the most unique identifiers for a word in memory
• Another word with the same low order bits will be located far away in memory
• Therefore, according to the principle of locality, it is unlikely that another word in memory with the same LSBs will be needed near the same time

Four Key Questions for Designing a Cache
1. Block Identification - How is a block found if it is in the "upper level"?
2. Block Placement - Where can a block be placed in the "upper level"?
3. Block Replacement - Which block should be replaced when there is a cache miss?
4. Write Strategy - What happens on a memory write?

Q2: Where Can a Block be Placed in a Cache?
• There are three approaches to block placement:
  1. Direct Mapped
     • The block can only be placed in a specific location in the cache
  2. Set-Associative
     • A set is a group of blocks in the cache
     • The incoming block can only be placed in a specific set of blocks
     • The block can be placed anywhere within this set
  3. Fully Associative
     • The block can be placed anywhere in the cache

Mapping Memory Blocks into the Cache
• This is really a continuum of associative properties:
  – Direct mapped can be considered 1-way set associative (the block has only 1 allowed destination)
  – For N-way set associative, the block has N places it can be placed
  – Fully associative can be considered m-way set associative (where m is the number of blocks), since the block can go anywhere

Mapping Memory Blocks into the Cache II .

Direct-Mapped Cache
• Simplest kind of cache
• Characterized by exactly one line per set
  – Therefore, the tag and set index are somewhat redundant
  – The tag is insurance that the wrong word is not used

Accessing Direct-Mapped Caches I
• Step 1: Set selection
  – Use the set index bits to determine the set of interest

Accessing Direct-Mapped Caches II
• Step 2: Line matching
  – Trivial, since there is only one line in this cache set

Accessing Direct-Mapped Caches III
• Step 3: Word selection
  – Use the block offset to grab the right word

Some Words About Book Conventions
• First, most direct mapped caches do not even talk about a set index
  – There is only 1 line per set, so it doesn't make much sense, and most systems don't use the set index with direct mapped organizations
  – They may do it to make the later caches easier to understand
• Second, the last figure talked about w0 being the lower order byte of word w, w1 is the next byte, etc.
  – It is easy in this convention to confuse words and bytes
  – It should have used w0 and w1 to talk about the words in the cache and then used b00 and b10 to talk about the low order bytes of words 0 and 1 in the cache line

Direct-Mapped Cache Simulation .

Recall: Cache Misses • A cache miss is when the data was not in the cache – What can cause a block not to be in the cache at a given time? – There are three kinds of misses: • Cold (compulsory) miss • Conflict miss • Capacity miss • What kinds of cache misses did we just see? .

Direct-Mapped Cache Simulation Results .

Direct Mapped Cache Observations
• Direct mapped caches are easy to implement
  – Finding the line and the set are one step, since there is only one line per set
• What are the problems?
  – The simple example showed that it is already experiencing misses due to conflicts
  – The conflicts occurred because there is only one line
• What was needed: a second line for each set index to allow both desired values to be placed
• This leads to set associative caches
  – It will also introduce the need for replacement strategies
  – Up to now, the replacement was only dictated by the address

Set Associative Caches • Characterized by more than one line per set .
