
MEMORY

Cache

 A cache is a hardware or software component that stores data so that
future requests for that data can be served faster; the data stored in a
cache might be the result of an earlier computation or a copy of data
stored elsewhere.
 Temporal locality – The locality principle stating that if a data location is
referenced, then it will tend to be referenced again soon.
 Spatial locality – The locality principle stating that if a data location is
referenced, data locations with nearby addresses will tend to be
referenced soon.
Memory Hierarchy

 A structure that uses multiple levels of memories; as the distance from the
processor increases, the size of the memories and the access time both
increase.
 The faster memories are more expensive per bit than the slower memories and
thus are smaller.
 Main memory is implemented from DRAM (dynamic random access memory),
while levels closer to the processor (caches) use SRAM (static random access
memory).
 DRAM uses significantly less area per bit of memory, and DRAMs thus have
larger capacity for the same amount of silicon.
 The third technology, used to implement the largest and slowest level in the
hierarchy, is usually magnetic disk. (Flash memory is used instead of disks in
many embedded devices.)
 A memory hierarchy can consist of multiple levels, but data is copied between
only two adjacent levels at a time.
 Block (or line) – The minimum unit of information that can be either present or not
present in a cache.
 Hit rate – The fraction of memory accesses found in a level of the memory
hierarchy.
 Miss rate – The fraction of memory accesses not found in a level of the memory
hierarchy (1 - hit rate).
 Hit time The time required to access a level of the memory hierarchy, including
the time needed to determine whether the access is a hit or a miss.
 Miss penalty The time required to fetch a block into a level of the memory
hierarchy from the lower level, including the time to access the block, transmit
it from one level to the other, insert it in the level that experienced the miss, and
then pass the block to the requestor.
 Direct-mapped cache
 A cache structure in which each memory location is mapped to exactly
one location in the cache.
 (Block address) modulo (Number of blocks in the cache) gives the block's
location in the cache; a small sketch of this computation follows this list.
 Tag – A field in a table used for a memory hierarchy that contains the
address information required to identify whether the associated block in
the hierarchy corresponds to a requested word.
 Valid bit – A field in the tables of a memory hierarchy that indicates that the
associated block in the hierarchy contains valid data.
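As a rough illustration of the direct-mapped mapping above, the following sketch splits a byte address into tag, index, and offset. The block size, block count, and address width here are assumptions for the example, not values taken from the text.

```python
# Sketch: splitting a byte address for a direct-mapped cache.
# The parameters (16-byte blocks, 256 blocks, 32-bit byte addresses)
# are assumptions for illustration only.
BLOCK_BYTES = 16          # bytes per block     -> 4 offset bits
NUM_BLOCKS = 256          # blocks in the cache -> 8 index bits

def map_address(addr):
    """Return (tag, index, offset) for a byte address."""
    offset = addr % BLOCK_BYTES
    block_address = addr // BLOCK_BYTES
    index = block_address % NUM_BLOCKS       # (block address) modulo (number of blocks)
    tag = block_address // NUM_BLOCKS        # remaining upper address bits
    return tag, index, offset

print(map_address(0x0000ABCD))               # (10, 188, 13) -> tag 0xA, index 0xBC, offset 0xD
```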
Direct Mapped Cache
Accessing the cache

 Recently referenced words replace less recently referenced words.


 The index of a cache block, together with the tag contents of that block,
uniquely specifies the memory address of the word contained in the
cache block.
 A tag field, which is used to compare with the value of the tag field of the
cache
 A cache index, which is used to select the block
 The total number of bits needed for a cache is a function of the cache size
and the address size, because the cache includes both the storage for the
data and the tags (a small sketch of this calculation follows this list).
 Larger blocks exploit spatial locality to lower miss rates. Increasing the block
size usually decreases the miss rate.
 The miss rate might increase if the block size becomes a significant fraction of
the cache size.
 An increase in block size might increase the cost of the miss penalty.
 The miss penalty includes the latency to fetch the first word and the transfer time
for the rest of the block.
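The total-bits calculation can be made concrete with a short sketch. The parameters below (16 KiB of data, 4-word blocks, 32-bit byte addresses, 32-bit words) are illustrative assumptions, not values given in the text.

```python
# Sketch: total storage for a direct-mapped cache, counting data, tag,
# and valid bits. All parameters are assumptions for illustration.
ADDRESS_BITS = 32
WORD_BYTES = 4
WORDS_PER_BLOCK = 4
DATA_BYTES = 16 * 1024

num_blocks = DATA_BYTES // (WORDS_PER_BLOCK * WORD_BYTES)        # 1024 blocks
index_bits = num_blocks.bit_length() - 1                         # 10
offset_bits = (WORDS_PER_BLOCK * WORD_BYTES).bit_length() - 1    # 4 (byte offset)
tag_bits = ADDRESS_BITS - index_bits - offset_bits               # 18

bits_per_block = WORDS_PER_BLOCK * WORD_BYTES * 8 + tag_bits + 1 # data + tag + valid
total_bits = num_blocks * bits_per_block
print(num_blocks, tag_bits, total_bits)   # 1024 18 150528  (about 147 Kibit)
```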
How to hide transfer time so that the miss penalty is effectively smaller?

 Early restart – Resume execution as soon as the requested word of the block
is returned, rather than waiting for the entire block.
- More effective for instruction access than for data access
 Requested word first/critical word first – Organize the memory so that the
requested word is transferred from the memory to the cache first.
- More effective for instruction access than for data access
Handling cache misses

 Done in collaboration with the processor control unit and with a separate
controller that initiates the memory access and refills the cache
 Processing of a cache miss creates a pipeline stall
 The processor might also be stalled by freezing the register contents while waiting for memory
 Out-of-order execution could be supported by advanced architectures
Handling instruction miss

 1. Send the original PC value (current PC – 4) to the memory.
 2. Instruct main memory to perform a read and wait for the memory to
complete its access.
 3. Write the cache entry, putting the data from memory in the data portion
of the entry, writing the upper bits of the address (from the ALU) into the
tag field, and turning the valid bit on.
 4. Restart the instruction execution at the first step, which will refetch the
instruction, this time finding it in the cache.
Handling writes

 If the cache and main memory hold different data for the same location, they are inconsistent.
 Write through
- Always write data to both the cache and main memory
- Reduces performance, since every write goes to main memory
 Write buffer
- Stores data while it is waiting to be written to memory; the processor
continues execution after writing to the cache and the write buffer
- To handle bursts of writes, the depth of the write buffer can be increased beyond a single entry
 Write back (a small sketch contrasting the two policies follows this list)
- The modified block is written to the lower level of the hierarchy only when it is replaced
- Improves performance, especially when processors generate writes faster
than main memory can handle them
- Write back is more complex to implement than write through
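A minimal sketch contrasting the two write policies, assuming a one-entry cache and a made-up write stream; it only counts how many writes reach main memory under each policy.

```python
# Sketch: count main-memory writes under write-through vs write-back for a
# stream of processor writes. The one-entry cache and the write stream are
# assumptions for illustration only.
def count_memory_writes(addresses, policy):
    cached_block = None          # block address currently held in the cache
    dirty = False
    mem_writes = 0
    for addr in addresses:
        block = addr // 4                    # assume 4-byte blocks
        if policy == "write-through":
            mem_writes += 1                  # every write also goes to memory
        else:                                # write-back
            if cached_block not in (None, block) and dirty:
                mem_writes += 1              # evict the dirty block to memory
            dirty = True
        cached_block = block
    if policy == "write-back" and dirty:
        mem_writes += 1                      # final flush of the dirty block
    return mem_writes

writes = [0, 0, 0, 0, 8, 8, 8]               # repeated writes to two blocks
print(count_memory_writes(writes, "write-through"))  # 7
print(count_memory_writes(writes, "write-back"))     # 2
```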
An Example Cache: The Intrinsity
FastMATH Processor

The 16 KB caches in the Intrinsity FastMATH each contain 256 blocks with 16 words per block.
The steps for a read request to either cache are as follows:

 Send the address to the appropriate cache. The address comes either
from the PC (for an instruction) or from the ALU (for data).
 If the cache signals hit, the requested word is available on the data lines.
Since there are 16 words in the desired block, we need to select the right
one. A block index field is used to control the multiplexor (shown at the
bottom of the figure), which selects the requested word from the 16 words
in the indexed block.
 If the cache signals miss, we send the address to the main memory. When
the memory returns with the data, we write it into the cache and then
read it to fulfill the request.
split cache

 A scheme in which a level of the memory hierarchy is composed of two
independent caches that operate in parallel with each other, with one handling
instructions and one handling data.
 Assume
1 memory bus clock cycle to send the address
15 memory bus clock cycles for each DRAM access initiated
1 memory bus clock cycle to send a word of data
 If we have a cache block of four words and a one-word-wide bank of DRAMs,
the miss penalty would be 1 + 4 × 15 + 4 × 1 = 65 memory bus clock cycles. Thus,
the number of bytes transferred per bus clock cycle for a single miss would be
(4 × 4)/65 = 0.25
Designing the memory system

 1. Memory is one word wide, and all accesses are made sequentially.
 2. Increases the bandwidth to memory by widening the memory and the
buses between the processor and memory; this allows parallel access to
multiple words of the block.
 3. Increases the bandwidth by widening the memory but not the
interconnection bus.
A sketch comparing the miss penalty and bandwidth of these three organizations, using the bus timings assumed above, follows.
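A small sketch of the miss penalty and bytes transferred per bus cycle for the three organizations, using the bus timings assumed earlier (1 cycle to send the address, 15 cycles per DRAM access, 1 cycle per word). The widths chosen for the wide and interleaved designs (a 4-word-wide memory and 4 banks) are assumptions for illustration.

```python
# Sketch: miss penalty in memory-bus clock cycles for a 4-word block under
# the three memory organizations. Timings and widths are assumed values.
SEND_ADDR, DRAM_ACCESS, SEND_WORD = 1, 15, 1
WORDS_PER_BLOCK = 4

# 1. One-word-wide memory: each word needs its own access and transfer.
narrow = SEND_ADDR + WORDS_PER_BLOCK * (DRAM_ACCESS + SEND_WORD)     # 65

# 2. Four-word-wide memory and bus: one access and one transfer per block.
wide = SEND_ADDR + DRAM_ACCESS + SEND_WORD                            # 17

# 3. Interleaved: 4 banks accessed in parallel, words sent one at a time.
interleaved = SEND_ADDR + DRAM_ACCESS + WORDS_PER_BLOCK * SEND_WORD   # 20

for name, cycles in [("one-word-wide", narrow), ("wide", wide),
                     ("interleaved", interleaved)]:
    bytes_per_cycle = (WORDS_PER_BLOCK * 4) / cycles
    print(f"{name}: {cycles} cycles, {bytes_per_cycle:.2f} bytes/cycle")
```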
How to avoid performance loss

 The bandwidth of main memory is increased to transfer cache blocks more
efficiently.
 Techniques external to the DRAM include making the memory wider and interleaving.
 The interface between the processor and memory is improved to increase
the bandwidth of burst-mode transfers and to reduce the cost of larger cache
block sizes.
Measuring and Improving Cache
Performance
 Performance is improved by
- reducing the miss rate, by reducing the probability that two different memory
blocks will contend for the same cache location
- reducing the miss penalty, by adding an additional level to the hierarchy
 CPU time can be divided into
- clock cycles that the CPU spends executing the program
- clock cycles that the CPU spends waiting for the memory system
 We assume the costs of cache accesses that are hits are part of the normal
CPU execution cycles.
CPU time = (CPU execution clock cycles + Memory-stall clock cycles) × Clock
cycle time
 Memory-stall clock cycles can be defined as the sum of the stall cycles
coming from reads plus those coming from writes:
Memory-stall clock cycles = Read-stall cycles + Write-stall cycles
 The read-stall cycles can be defined in terms of the number of read
accesses per program, the miss penalty in clock cycles for a read, and the
read miss rate:
Read-stall cycles = (Reads /Program)× Read miss rate × Read miss penalty
 Writes are more complicated.
 For a write-through scheme, we have two sources of stalls:
- write misses, which usually require that we fetch the block before continuing
the write
- write buffer stalls, which occur when the write buffer is full when a write
occurs.
Write-stall cycles = (( Writes/Program) × Write miss rate × Write miss penalty ) +
Write buffer stalls
 If we assume that the write buffer stalls are negligible
Memory-stall clock cycles =(Memory accesses/Program)× Miss rate × Miss
penalty
or
Memory-stall clock cycles = (Instructions/Program)× (Misses/Instruction)× Miss
penalty
Calculating Cache Performance

Assume the miss rate of an instruction cache is 2% and the miss rate of the data
cache is 4%. If a processor has a CPI of 2 without any memory stalls and the miss
penalty is 100 cycles for all misses, determine how much faster a processor would
run with a perfect cache that never missed. Assume the frequency of all loads
and stores is 36%.
The number of memory miss cycles for instructions in terms of the Instruction count
(I) is Instruction miss cycles = I × 2% × 100 = 2.00 × I
memory miss cycles for data references: Data miss cycles = I × 36% × 4% × 100 =
1.44 × I
The total number of memory-stall cycles is 2.00 I + 1.44 I = 3.44 I.
The total CPI including memory stalls is 2 + 3.44 = 5.44.
Since the instruction count and clock cycle time are unchanged, the ratio of the
CPU execution times is 5.44 / 2 = 2.72, so the processor with the perfect cache is 2.72 times faster.
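A sketch reproducing the calculation above; the helper function and its parameter names are illustrative, not part of the original text.

```python
# Sketch: total CPI = base CPI + instruction-miss and data-miss stall cycles
# per instruction, using the numbers from the example above.
def cpi_with_stalls(base_cpi, i_miss_rate, d_miss_rate, ls_frac, miss_penalty):
    instr_stalls = i_miss_rate * miss_penalty              # per instruction
    data_stalls = ls_frac * d_miss_rate * miss_penalty     # per instruction
    return base_cpi + instr_stalls + data_stalls

total = cpi_with_stalls(base_cpi=2, i_miss_rate=0.02, d_miss_rate=0.04,
                        ls_frac=0.36, miss_penalty=100)
print(total)        # 2 + 2.00 + 1.44 = 5.44
print(total / 2)    # speedup with a perfect cache: 2.72
```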
Average memory access time (AMAT) is a way to examine alternative cache designs.

AMAT = Time for a hit + Miss rate × Miss penalty


Find the AMAT for a processor with a 1 ns clock cycle time, a miss penalty of
20 clock cycles, a miss rate of 0.05 misses per instruction, and a cache access
time (including hit detection) of 1 clock cycle. Assume that the read and
write miss penalties are the same and ignore other write stalls
The average memory access time per instruction is
AMAT = Time for a hit + Miss rate × Miss penalty
= 1 + 0.05 × 20 = 2 clock cycles or 2 ns.
Fully associative cache – A cache structure in which a block can be placed in
any location in the cache.
Set-associative cache – A cache that has a fixed number of locations (at least
two) where each block can be placed.
Misses and Associativity in Caches

Assume there are three small caches, each consisting of four one-word blocks.
One cache is fully associative, a second is two-way set-associative, and the
third is direct-mapped. Find the number of misses for each cache organization
given the following sequence of block addresses: 0, 8, 0, 6, and 8.
The direct-mapped cache has five misses, the two-way set-associative cache has four, and the fully associative cache has three (a simulation sketch follows).
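A small simulation of the three four-block caches with LRU replacement, written from the description above; it reproduces the miss counts for the reference stream 0, 8, 0, 6, 8.

```python
# Sketch: LRU cache of 4 one-word blocks organized into `num_sets` sets.
# Written as an illustration of the example above, not as a general simulator.
def count_misses(block_addresses, num_sets):
    ways = 4 // num_sets
    sets = [[] for _ in range(num_sets)]      # each set keeps most-recent-last order
    misses = 0
    for block in block_addresses:
        s = sets[block % num_sets]
        if block in s:
            s.remove(block)                   # hit: refresh LRU position
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                      # evict the least recently used block
        s.append(block)
    return misses

refs = [0, 8, 0, 6, 8]
print(count_misses(refs, num_sets=4))  # direct-mapped:       5 misses
print(count_misses(refs, num_sets=2))  # two-way set-assoc.:  4 misses
print(count_misses(refs, num_sets=1))  # fully associative:   3 misses
```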
Choosing Which Block to Replace

least recently used (LRU) – A replacement scheme in which the block replaced
is the one that has been unused for the longest time.
Increasing associativity requires more comparators and more tag bits per cache block. Assuming a
cache of 4K blocks, a 4-word block size, and a 32-bit address, find the total number of sets and the
total number of tag bits for caches that are direct mapped, two-way and four-way set associative,
and fully associative.
 Since there are 16 (= 2^4) bytes per block, a 32-bit address yields 32 - 4 = 28 bits to be used for
index and tag.
 The direct-mapped cache has the same number of sets as blocks, and hence 12 bits of index,
since log2(4K) = 12; hence, the total number is (28 - 12) × 4K = 16 × 4K = 64 K tag bits.
 Each degree of associativity decreases the number of sets by a factor of 2 and thus decreases
the number of bits used to index the cache by 1 and increases the number of bits in the tag by
1.
 Thus, for a two-way set-associative cache, there are 2K sets, and the total number of tag bits is
(28 -11) × 2 × 2K = 34 × 2K = 68 Kbits.
 For a four-way set-associative cache, the total number of sets is 1K, and the total number is (28
- 10) × 4 × 1K = 72 × 1K = 72 K tag bits.
 For a fully associative cache, there is only one set with 4K blocks, and the tag is 28 bits, leading
to 28 × 4K × 1 = 112K tag bits
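The same tag-bit arithmetic as a short sketch, following the example's parameters (4K blocks, 4-word blocks, 32-bit addresses); the loop and its output format are illustrative.

```python
# Sketch: sets and total tag bits for each associativity in the example above.
NUM_BLOCKS = 4 * 1024
BLOCK_OFFSET_BITS = 4            # 16 bytes per block
ADDRESS_BITS = 32

for ways in (1, 2, 4, NUM_BLOCKS):            # direct-mapped ... fully associative
    sets = NUM_BLOCKS // ways
    index_bits = sets.bit_length() - 1        # log2(number of sets)
    tag_bits = ADDRESS_BITS - BLOCK_OFFSET_BITS - index_bits
    total_tag_bits = tag_bits * NUM_BLOCKS    # tag bits per block times all blocks
    print(f"{ways:>4}-way: {sets} sets, {total_tag_bits // 1024} Kbits of tags")
    # prints 64, 68, 72, and 112 Kbits respectively
```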
Reducing the Miss Penalty Using Multilevel Caches

Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the primary
cache, and a clock rate of 4 GHz. Assume a main memory access time of 100 ns, including all
the miss handling. Suppose the miss rate per instruction at the primary cache is 2%. How much
faster will the processor be if we add a secondary cache that has a 5 ns access time for either
a hit or a miss and is large enough to reduce the miss rate to main memory to 0.5%?

The effective CPI with one level of caching is given by
Total CPI = Base CPI + Memory-stall cycles per instruction
At 4 GHz the clock cycle time is 0.25 ns, so the miss penalty to main memory is
100 ns / 0.25 ns = 400 clock cycles. For the processor with one level of caching,
Total CPI = 1.0 + Memory-stall cycles per instruction = 1.0 + 2% × 400 = 9
The miss penalty for an access to the second-level cache is 5 ns / 0.25 ns = 20 clock cycles.
For a two-level cache, total CPI is the sum of the stall cycles from both levels of
cache and the base CPI:
Total CPI = 1 + Primary stalls per instruction + Secondary stalls per instruction
= 1 + 2% × 20 + 0.5% × 400 = 1 + 0.4 + 2.0 = 3.4
the stall cycles can be computed by summing the stall cycles of those references that
hit in the secondary cache ((2% - 0.5%) * 20 =0.3).
Those references that go to main memory, which must include the cost to
access the secondary cache as well as the main memory access time, is
(0.5% *(20 + 400) = 2.1).
The sum, 1.0 + 0.3 + 2.1, is again 3.4. Thus the processor with the secondary cache is faster by 9 / 3.4 ≈ 2.6.
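A sketch reproducing the two-level cache CPI calculation; the variable names are illustrative.

```python
# Sketch: CPI with L1 only vs L1 + L2, using the numbers from the example.
CLOCK_NS = 1 / 4.0                      # 4 GHz -> 0.25 ns per cycle
main_mem_penalty = 100 / CLOCK_NS       # 400 cycles
l2_penalty = 5 / CLOCK_NS               # 20 cycles

cpi_l1_only = 1.0 + 0.02 * main_mem_penalty                       # 9.0
cpi_l1_l2 = 1.0 + 0.02 * l2_penalty + 0.005 * main_mem_penalty    # 3.4
print(cpi_l1_only, cpi_l1_l2, cpi_l1_only / cpi_l1_l2)            # 9.0 3.4 ~2.6
```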
Virtual Memory
TLB – Translation lookaside buffer
Cache Coherence

 In a multiprocessor system, both shared and private data are cached
 Private data is used by a single processor
 Shared data is used by multiple processors
 When shared data is cached, the shared value is replicated in multiple
caches
 Caching shared data reduces access latency and the required memory bandwidth
 It also reduces contention for shared data that is read by multiple processors
simultaneously
 Problem: cache coherence
Cache Coherence
Two types of Protocol

 Directory-based protocol – The sharing status of a block of physical memory is
kept in just one location (the directory)
 Snoopy protocol – Every cache that has a copy of the data from a block of
physical memory also has a copy of the sharing status of the block, but no
centralized state is kept.
- Caches are accessible via some broadcast medium (bus or switch), and all
cache controllers monitor or snoop on the medium to know whether or not
they have a copy of a block that is requested on a bus or switch access
Snooping Protocol

 Two ways of maintaining coherence
- Write invalidate protocol
- Write update/write broadcast protocol
Write Invalidate protocol
 The writing processor must have exclusive access: any copy held by a reading
processor must be invalidated
 When two processors attempt to write simultaneously, the writes are serialized
 Owner-which indicates that a block may be shared, but the owning
processor is responsible for updating any other processors and memory
when it changes the block or replaces it.
Write update or write broadcast

 Updates all the cached copies of a data item when that item is written
 Because a write update protocol must broadcast all writes to shared cache lines, it
consumes considerably more bandwidth
 Write invalidate is most commonly used in modern multiprocessors
Implementation technique of write
invalidate

 To invalidate, the processor acquires bus access and broadcasts the
address to be invalidated on the bus
 All processors continuously snoop on the bus, watching the addresses
 If a processor finds that the address on the bus is in its cache, the corresponding data
in the cache are invalidated
 When a shared block write occurs, writing processor must acquire bus access
to broadcast its invalidation
 If 2 processors attempt to write at same time, their attempt to broadcast an
invalidate operation will be serialized when they arbitrate for the bus.
 A write to a shared data item cannot actually complete until it obtains bus
access
 In a write-through cache, when a cache miss occurs, it is easy to find the
most recent value of a data item, as it is always updated in main memory
 In a write-back cache, the problem of finding the most recent data value is harder
 Snooping is used for cache misses and for writes
 If a processor finds that it has a dirty copy of the requested cache block, it
provides that cache block in response to a read request and causes the
memory access to be aborted
Snooping in write back

 A read miss, whether generated by an invalidation or by some other event, is simple to handle
 A write miss requires knowing whether any other copies of the block are cached; if
there are no copies, the write need not be placed on the bus, which reduces the required bandwidth
 To track this, an extra bit indicating whether the block is shared can be added
 When a write to a block in the shared state occurs, the cache generates an invalidation on the bus and
marks the block exclusive
 The processor with the sole copy of a cache block is normally called the owner of the
cache block
 If another processor later issues a read request, the state is changed back to shared
Snooping protocol example

 Usually implemented using a finite-state controller at each node
 This controller responds to requests from the processor and the bus by changing the
state of the selected cache block, as well as using the bus to access data or to
invalidate it
 Interleaving is usually used so that requests for different blocks can be handled independently
 Three states: invalid, shared, and modified (exclusive); a sketch of the transitions follows this list
 When a write miss occurs, an invalidation is placed on the bus and any processor with a copy of the
cache block invalidates it
 The state in each node represents the state of the selected cache block
specified by the processor or bus request
 Any valid cache block is either in shared(one/more caches) or in
exclusive(exactly one)
 Any transition to exclusive state for write requires an invalidate/write miss to
be placed on the bus, causing all caches to make the block invalid
 If some other cache has the block in the exclusive state, that cache generates a write back
 If read miss occurs on the bus to a block in the exclusive state, the cache
with exclusive copy changes its state to shared
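A rough sketch of the three-state snooping protocol described above, written as a transition table. The event names and the exact set of transitions are assumptions for illustration, not a definitive specification of the protocol.

```python
# Sketch of a three-state (invalid/shared/modified) snooping protocol as a
# transition table. States: "I" = invalid, "S" = shared, "M" = modified.
# Each entry maps (state, event) -> (next state, bus action). Illustrative only.
TRANSITIONS = {
    # Processor-side events
    ("I", "proc_read"):  ("S", "place read miss on bus"),
    ("I", "proc_write"): ("M", "place write miss on bus"),
    ("S", "proc_read"):  ("S", None),
    ("S", "proc_write"): ("M", "place invalidate on bus"),
    ("M", "proc_read"):  ("M", None),
    ("M", "proc_write"): ("M", None),
    # Bus-side (snooped) events
    ("S", "bus_write_miss"): ("I", None),
    ("S", "bus_invalidate"): ("I", None),
    ("M", "bus_read_miss"):  ("S", "write back block"),
    ("M", "bus_write_miss"): ("I", "write back block"),
}

def next_state(state, event):
    return TRANSITIONS.get((state, event), (state, None))

print(next_state("M", "bus_read_miss"))   # ('S', 'write back block')
print(next_state("S", "proc_write"))      # ('M', 'place invalidate on bus')
```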
Directory based Protocol

 A directory keeps the state of every block that may be cached.
 Information in the directory includes which caches have copies of the
block, whether it is dirty, and so on.
 A directory protocol is used to reduce the bandwidth demands in a
centralized shared-memory machine.
 This is not a problem for up to about 200 processors; to prevent the directory from
becoming the bottleneck, the directory is distributed along with the
memory.
 A distributed directory retains the characteristic that the sharing status of a
block is always in a single known location.
 Shared—One or more processors have the block cached, and the value in
memory is up to date (as well as in all the caches).
 Uncached—No processor has a copy of the cache block.
 Modified—Exactly one processor has a copy of the cache block, and it
has written the block, so the memory copy is out of date. The processor is
called the owner of the block.
 The protocol needs to track the state of each potentially shared memory block
 It must also track which processors have copies of that block, since those copies
will need to be invalidated on a write.
 The simplest way to do this is to keep a bit vector for each memory block.
 The bit vector is also used to keep track of the owner of the block when the
block is in the exclusive state.
 The state of each cache block is also tracked at the individual caches (a small sketch of a directory entry follows this list).
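A minimal sketch of a per-block directory entry as described above: a state plus a record of sharers (the bit vector is represented here as a set of processor IDs). The field names and the owner lookup are illustrative assumptions.

```python
# Sketch: one directory entry per memory block, holding the sharing state and
# the set of processors with a copy. Illustrative field names only.
from dataclasses import dataclass, field

@dataclass
class DirectoryEntry:
    state: str = "uncached"                         # "uncached", "shared", or "modified"
    sharers: set = field(default_factory=set)       # processor IDs holding a copy

    def owner(self):
        """In the modified state, exactly one sharer is the owner of the block."""
        assert self.state == "modified" and len(self.sharers) == 1
        return next(iter(self.sharers))

entry = DirectoryEntry()
entry.state, entry.sharers = "modified", {3}        # processor 3 wrote the block
print(entry.owner())                                # 3
```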
Type of messages sent among nodes
 The local node is the node where a request originates.
 The home node is the node where the memory location and the directory
entry of an address reside
 A remote node is the node that has a copy of a cache block, whether
exclusive or shared.
Example of directory protocol

 The state transitions for an individual cache are caused by read misses, write
misses, invalidates, and data fetch requests;
 An individual cache also generates read miss, write miss, and invalidate
messages that are sent to the home directory.
 Read and write misses require data value replies, and these events wait for
replies before changing state
 The write miss operation, which was broadcast on the bus (or other network) in
the snooping scheme, is replaced by the data fetch and invalidate operations
that are selectively sent by the directory controller.
 A message sent to a directory causes two different types of actions: updating
the directory state and sending additional messages to satisfy the request
P = requesting processor number, A = requested
address, and D = data contents
