Memory Hierarchy Design
UEC509-Part-5
Dr. Debabrata Ghosh
Assistant Professor, ECED
Thapar University
Memory hierarchy
• A simple axiom in hardware design: smaller is faster (smaller hardware is usually faster than larger hardware). This applies directly to memory design
• Faster memories are available only with fewer bits per chip
• Principle of locality: data used most recently is very likely to be accessed again in the near future
• Keep recently used data in the fastest memory (small memory close to the CPU)
• Keep data not used recently in slower memory (larger memory farther from the CPU)
Cache hit, cache miss, page fault
• Cache memory: small, fast memory close to the CPU, holding the most recently used data/code
• Cache hit: the CPU finds the requested data (referenced by a program) in the cache
• Cache miss: the CPU does not find the requested data in the cache
• On a cache miss, a fixed-size chunk of data (called a block) containing the requested data is retrieved from main memory and placed in the cache
• Temporal locality: the retrieved data is likely to be used again in the near future
• Spatial locality: there is a high probability that other data within the block will be used soon
• A cache miss is handled by hardware; the CPU is stalled until the requested data is available
• Hit rate = hits / (hits + misses) = number of hits / total accesses
Page fault
• Page fault: the CPU does not find the requested data in the cache or in main memory
• The virtual address space is broken into multiple pages
• On a page fault, the page containing the requested data is retrieved from disk and placed in main memory
• A page fault is handled by software; the CPU is not stalled but is switched to another task until the requested data is available
Performance of Cache
• On a cache miss, the CPU is stalled
• CPU execution time must account for the number of clock cycles (CCs) for which the CPU is stalled
• CPU execution time = (CPU clock cycles + Memory stall cycles) x clock cycle time
• Memory stall cycles = number of cache misses x cost per cache miss in CCs (miss penalty)
• Number of cache misses = IC x cache misses per instruction
• Cache misses per instruction = memory references (cache accesses) per instruction x miss rate
• Memory stall cycles = IC x memory references per instruction x miss rate x miss penalty
• Miss rate: fraction of memory references (cache accesses) that miss in the cache
• Miss penalty: additional CCs needed to service a cache miss
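The formulas above can be checked with a small calculation. All the numbers below (instruction count, references per instruction, miss rate, penalty, clock period) are made-up values for illustration, not figures from the lecture.

```python
# Hypothetical numbers to illustrate the memory-stall-cycle formulas above.
IC = 1_000_000          # instruction count
refs_per_instr = 1.5    # memory references (cache accesses) per instruction
miss_rate = 0.02        # fraction of accesses that miss
miss_penalty = 100      # clock cycles to service one miss
cpi_base = 1.0          # CPI assuming no memory stalls
clock_cycle_ns = 0.5    # clock cycle time in ns

# Memory stall cycles = IC x refs per instruction x miss rate x miss penalty
memory_stall_cycles = IC * refs_per_instr * miss_rate * miss_penalty

# CPU execution time = (CPU clock cycles + memory stall cycles) x cycle time
cpu_clock_cycles = IC * cpi_base
exec_time_ns = (cpu_clock_cycles + memory_stall_cycles) * clock_cycle_ns

print(memory_stall_cycles)  # 3000000.0 stall cycles
print(exec_time_ns)         # 2000000.0 ns
```

Note how the stall cycles (3,000,000) dwarf the base 1,000,000 compute cycles: even a 2% miss rate triples execution time with a 100-cycle penalty.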
Block placement
• Fully associative: a block can be placed in any of the block frames
• Direct mapped: a block can only be placed in block frame (Block address) MOD (Number of blocks in cache)
• Set associative: a block can only be placed in set (Block address) MOD (Number of sets in cache)
Cache Mapping
• Bringing content (data/instruction) from main memory (MM) to cache memory (CM)
• Fully associative mapping, direct mapping, set associative mapping
Example:
• Assume there are 4,096 blocks in the MM, each having 16 words (total 4,096 x 16 = 65,536 words in the MM)
• 65,536 unique addresses are needed to address these words, i.e., a 16-bit address is needed (2^16 = 65,536)
• Assume there are 128 cache blocks (also called block frames or cache lines)
• Size of cache line = size of MM block, i.e., 16 words in each cache line
• How to map 4,096 MM blocks to 128 cache lines ?
Fully associative mapping
• Each cache line can hold 16 words; each MM block has 16 words
• A 16-bit MM address is divided into two fields: 12 most-significant bits identifying the MM block, and 4 least-significant bits identifying the location of a word within a MM block
• When a MM block is brought into a cache line, the 12 bits are stored as tag bits for that cache line (indicating which of the 4,096 MM blocks has been brought into the line), and the 4 bits are used as a line offset to identify a word within the cache line
• Block j of MM can be mapped to any line in CM (provided that line is not already occupied)
• Assume that the MM block containing the requested word has already been mapped into a CM line. How to find that word in CM?
• From the incoming 16-bit address, the 12 most-significant bits are compared to the tag bits of each cache line
• If a match is found, the 4 least-significant bits identify the word (out of 2^4 = 16 possible words) in that particular cache line
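The lookup described above can be sketched in a few lines. This is a toy model of the 16-bit example (12-bit tag, 4-bit offset, 128 lines), not real cache hardware; real caches compare all tags in parallel, whereas this sketch loops over them.

```python
# Minimal sketch of a fully associative lookup for the 16-bit address example:
# 12-bit tag (MM block number) + 4-bit word offset within the line.
def split_address(addr):
    tag = addr >> 4          # 12 most-significant bits
    offset = addr & 0xF      # 4 least-significant bits
    return tag, offset

# 128 cache lines; each entry holds (tag, 16-word block) or None if empty.
cache = [None] * 128

def lookup(addr):
    tag, offset = split_address(addr)
    for line in cache:                       # compare tag against every line
        if line is not None and line[0] == tag:
            return line[1][offset]           # hit: pick the word in the line
    return None                              # miss

# Example: place MM block 0xABC into an arbitrary free line, then read word 5.
cache[37] = (0xABC, [f"word{i}" for i in range(16)])
print(lookup((0xABC << 4) | 5))  # word5
```

Because the block may sit in any line, every tag must be examined on each access; this is why fully associative caches need one comparator per line.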
The main memory of a computer has 2cm blocks while the cache has 2c blocks (cache lines). If the cache uses the set-associative mapping scheme with 2 blocks per set, then block k of the main memory maps to which of the following sets?
A. (k mod m) of the cache
B. (k mod c) of the cache
C. (k mod 2c) of the cache
D. (k mod 2cm) of the cache
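The arithmetic behind this question can be checked directly: with 2c cache blocks and 2 blocks per set there are c sets, so block k maps to set (k mod c). The concrete numbers below (c = 64, k = 200) are made-up values for illustration.

```python
# Number of sets = cache blocks / blocks per set; block k maps to k mod (sets).
def set_index(k, num_cache_blocks, blocks_per_set):
    num_sets = num_cache_blocks // blocks_per_set
    return k % num_sets

c = 64  # assumed value for illustration
# Cache has 2c = 128 blocks, 2 blocks per set -> c = 64 sets.
print(set_index(k=200, num_cache_blocks=2 * c, blocks_per_set=2))  # 200 % 64 = 8
```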
Numerical example on cache mapping
Multilevel page table
Consider a system with a 32-bit logical address space. If the page size in such a system is 4 KB, then the page table consists of 2^32 / 2^12 = 2^20 entries. Assuming that each entry consists of 4 bytes, the page table takes 4 MB of physical address space.
The page table is larger than a page frame!
The entries of the level-1 page table are pointers to level-2 page tables, the entries of the level-2 page tables are pointers to level-3 page tables, and so on. The entries of the last-level page table store the actual page frame information. Level 1 contains a single page table (which fits into a page frame), and the address of that table is stored in the PTBR (page table base register).
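For the 32-bit, 4 KB-page example above, one common two-level split (assumed here for illustration; the lecture does not fix the field widths) is 10 bits of level-1 index, 10 bits of level-2 index, and a 12-bit page offset:

```python
# Hypothetical two-level split of a 32-bit logical address with 4 KB pages:
# 10-bit level-1 index + 10-bit level-2 index + 12-bit offset = 32 bits.
# Each level then has 2^10 = 1024 entries and fits in one 4 KB frame.
def split_two_level(addr):
    offset = addr & 0xFFF        # low 12 bits: byte within the page
    l2 = (addr >> 12) & 0x3FF    # next 10 bits: index into a level-2 table
    l1 = addr >> 22              # top 10 bits: index into the level-1 table
    return l1, l2, offset

l1, l2, offset = split_two_level(0x12345678)
print(l1, l2, offset)  # 72 837 1656
```

The translation then follows two pointers: PTBR + l1 gives a level-2 table, and that table's entry l2 gives the page frame.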
Replacement/swapping/paging algorithm
• Important to reduce the number of page faults
• Least recently used (LRU) algorithm: the OS replaces the page in MM that has been least recently used. A page that has been used recently is likely to be used again in the near future (temporal locality), hence should not be replaced. A reference bit is used to know whether a page has recently been used or not
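The LRU policy above can be sketched as a small simulation. This toy model tracks exact access order (real systems approximate LRU with reference bits, as noted above); the reference string below is a made-up example.

```python
from collections import OrderedDict

# Toy LRU page replacement: pages are kept in access order; on a fault with
# all frames full, evict the least recently used page (front of the dict).
def simulate_lru(references, num_frames):
    frames = OrderedDict()
    faults = 0
    for page in references:
        if page in frames:
            frames.move_to_end(page)        # hit: mark as most recently used
        else:
            faults += 1                     # page fault
            if len(frames) == num_frames:
                frames.popitem(last=False)  # evict least recently used
            frames[page] = True
    return faults

print(simulate_lru([1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5], 3))  # 10 faults
```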
Translation lookaside buffer
• Each memory access requires accessing memory twice: once to look up the page table and once to get the actual data
• The page table is substantially large, so the lookup process is considerably slow
• To make the address translation (i.e., the lookup process) faster, a dedicated cache is used: the translation lookaside buffer (TLB)
• Each TLB entry contains a logical address and its corresponding physical address
• Using the TLB, the CPU can quickly convert a logical address to a physical address
• When a process requests access to a memory address, the CPU first checks whether the TLB has an entry for that logical address
• If the TLB has an entry, the corresponding physical address is immediately known
• If the TLB has no entry, the CPU performs the laborious lookup in the page table to convert the logical address to a physical address
• Once done, the CPU adds the result as an entry in the TLB
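The steps above can be sketched with two dictionaries standing in for the TLB and the page table. The page size and mappings are made-up values; a real TLB is a small associative hardware structure, not a software dict.

```python
# Toy TLB in front of a page table (both plain dicts here). 4 KB pages.
PAGE_SIZE = 4096

tlb = {}                           # page number -> frame number (fast cache)
page_table = {5: 42, 6: 17}        # hypothetical resident mappings

def translate(logical_addr):
    page, offset = divmod(logical_addr, PAGE_SIZE)
    if page in tlb:                # TLB hit: translation known immediately
        frame = tlb[page]
    else:                          # TLB miss: laborious page-table lookup
        frame = page_table[page]
        tlb[page] = frame          # add the result as a TLB entry
    return frame * PAGE_SIZE + offset

print(translate(5 * PAGE_SIZE + 100))  # 42*4096 + 100 = 172132 (TLB miss)
print(translate(5 * PAGE_SIZE + 200))  # 42*4096 + 200 = 172232 (TLB hit)
```

The second access to page 5 skips the page-table walk entirely, which is the whole point of the TLB.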
Inverted page table
• Page tables are usually large
• Each process has its own page table, which accounts for significant space in MM
• Alternative approach: maintain a single global page table for all active processes, known as an inverted page table
• The number of entries equals the number of page frames (not the number of pages, as in a per-process page table)
• Each entry has two pieces of information: process ID (PID) and page number
• The PID specifies the process that owns the page
Inverted page table

Page table for process 1 (Pr 1):
  Page index   Page frame index
      1              -
      2              -
      3              2
      4              4
      5              7
      6              -
      7              6

Page table for process 2 (Pr 2):
  Page index   Page frame index
      1              3
      2              5
      3              8
      4              -
      5              -
      6              -
      7              1

Inverted page table (number of page frames = 8):
  Page frame index   PID, Page number
      1                  Pr 2, P7
      2                  Pr 1, P3
      3                  Pr 2, P1
      4                  Pr 1, P4
      5                  Pr 2, P2
      6                  Pr 1, P7
      7                  Pr 1, P5
      8                  Pr 2, P3
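Because the inverted table is indexed by frame rather than by page, translating a (PID, page) pair requires searching the entries. A minimal sketch using the frame contents from the example above:

```python
# Inverted page table from the example: frame -> (PID, page number).
inverted = {1: ("Pr 2", 7), 2: ("Pr 1", 3), 3: ("Pr 2", 1), 4: ("Pr 1", 4),
            5: ("Pr 2", 2), 6: ("Pr 1", 7), 7: ("Pr 1", 5), 8: ("Pr 2", 3)}

def frame_of(pid, page):
    for frame, entry in inverted.items():   # search, since frames are the index
        if entry == (pid, page):
            return frame
    return None                             # not resident: page fault

print(frame_of("Pr 1", 5))  # 7
print(frame_of("Pr 2", 4))  # None (page fault)
```

The linear search is the known cost of an inverted page table; real systems speed it up with hashing, which is beyond this sketch.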
Virtual memory
Suppose that a system has a 32-bit virtual address space. It has 1 GB of physical memory and uses 1 MB pages.
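The basic sizes implied by these figures can be worked out directly (the interpretation below, i.e. counting virtual pages and physical frames, is an assumption since the original exercise text is incomplete here):

```python
# Sizes implied by the virtual memory example above.
virtual_space = 2**32      # 32-bit virtual address space
physical_mem = 2**30       # 1 GB of physical memory
page_size = 2**20          # 1 MB pages

virtual_pages = virtual_space // page_size    # 2^12 = 4096 virtual pages
physical_frames = physical_mem // page_size   # 2^10 = 1024 physical frames

print(virtual_pages, physical_frames)  # 4096 1024
```

So a virtual address splits into a 12-bit page number plus a 20-bit offset, and a physical address into a 10-bit frame number plus the same 20-bit offset.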
Distributed-memory MIMD: all PUs have a local memory (not shared). Communication between PUs takes place through an interconnection network. Complex design, but high scalability.
• P1 and P2 can have two different values for the same memory location
• This problem is called the multiprocessor cache coherence problem
Cache coherence problem
Write-through cache
• A single memory location (X) is read and written by two processors (A and B)
• Initially assume neither cache contains the variable at X; the initial value at X is 1
• After the variable at X is written by A (new value 0), both A's cache and memory contain the new value, but B's cache does not
• Two different values exist for the same location (the caches of CPU A and CPU B hold 0 and 1 at location X, respectively)
• If B reads the variable at X, it will read 1, not the most recently written 0
• Cache coherence ensures that changes in the values of shared operands are propagated throughout the system in a timely fashion
Approaches for cache coherence problem
• Initially assume neither cache contains the data item at X; the initial value at X is 0
• When B wants to read X, A responds with the written value (1), cancelling the response from memory (0)
• The content of B's cache and the memory content at X are updated at the same time: write-back cache
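A toy model of the snoopy behaviour described above: a write by one CPU removes (invalidates) the copies held by the other caches, so a later read cannot return a stale value. For simplicity this sketch updates memory on every write (write-through), unlike the write-back variant in the text; the structure is illustrative only.

```python
# Toy snoopy write-invalidate sketch with two CPUs sharing location "X".
caches = {"A": {}, "B": {}}
memory = {"X": 0}

def write(cpu, addr, value):
    for other, cache in caches.items():
        if other != cpu:
            cache.pop(addr, None)    # snoop: invalidate other cached copies
    caches[cpu][addr] = value
    memory[addr] = value             # write-through here, for simplicity

def read(cpu, addr):
    if addr not in caches[cpu]:      # miss (or invalidated): fetch fresh value
        caches[cpu][addr] = memory[addr]
    return caches[cpu][addr]

read("A", "X"); read("B", "X")       # both caches now hold 0
write("A", "X", 1)                   # invalidates B's copy of X
print(read("B", "X"))  # 1, not the stale 0
```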
Cache coherence approaches
• Write-update (write-broadcast) snoopy protocol: update all cached copies of the data item when it is written
• Initially assume neither cache contains the data item at X; the initial value at X is 0
• When A broadcasts a write, both B's cache and memory location X are updated
Additional Topics
• Write-invalidate cache-coherence protocol (MSI protocol) for write-back caches
(p. 664 of Hennessy)
• Universal Serial Bus
• Direct Memory Access (DMA)
• Daisy Chain