
Computer Architecture

UEC509-Part-5
Dr. Debabrata Ghosh
Assistant Professor, ECED
Thapar University
Memory hierarchy
• Simple axiom in hardware design: smaller is faster (smaller h/w is usually faster than larger h/w); this applies to memory design
• Faster memories are available only with fewer bits per chip
• Principle of locality: data used most recently is very likely to be accessed again in the near future
• Keep recently used data in the fastest memory (small memory close to the CPU)
• Keep data not used recently in slower memory (larger memory farther from the CPU)
Cache hit, cache miss, page fault
• Cache memory: small, fast memory close to the CPU (holding the most recently used data/code)
• Cache hit: the CPU finds the requested data (referenced by a program) in cache memory
• Cache miss: the CPU doesn't find the requested data in cache memory
• When a cache miss happens, the block of data (called a block) containing the requested data is retrieved from main memory and placed in the cache
• Temporal locality: the retrieved data is likely to be used again in the near future
• Spatial locality: high probability that other data within the block will be used soon
• A cache miss is handled by hardware. The CPU is stalled until the requested data is available
• Hit rate = hits / (hits + misses) = number of hits / total accesses
Page fault
• Page fault: the CPU doesn't find the requested data in cache or main memory
• The virtual address space is broken into multiple pages
• When a page fault happens, the page containing the requested data is retrieved from disk memory and placed in main memory
• A page fault is handled by software. The CPU is not stalled but is switched to another task until the requested data is available
Performance of Cache
• On a cache miss, the CPU is stalled
• In CPU execution time, take into account the number of CCs for which the CPU is stalled
• CPU execution time = (CPU clock cycles + Memory stall cycles) x clock cycle time
• Memory stall cycles = number of cache misses x cost per cache miss in CCs (miss penalty)
• Number of cache misses = IC x cache misses per instruction
• Cache misses per instruction = memory references (cache accesses) per instruction x miss rate
• Memory stall cycles = IC x memory references per instruction x miss rate x miss penalty
• Miss rate: fraction of memory references (cache accesses) that are cache misses
• Miss penalty: additional CCs to service a cache miss
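The formulas above can be chained together. Below is a minimal Python sketch; every parameter value (IC, references per instruction, miss rate, miss penalty, cycle time) is a hypothetical number chosen only to illustrate the calculation, not taken from the slides.

# All parameters are hypothetical; they only illustrate the formulas above.
IC = 1_000_000          # instruction count
CPI_base = 1.0          # CPU clock cycles per instruction with an ideal cache
refs_per_instr = 1.5    # memory references (cache accesses) per instruction
miss_rate = 0.02        # fraction of references that miss in the cache
miss_penalty = 100      # extra clock cycles to service one miss
cycle_time_ns = 0.5     # clock cycle time in ns

memory_stall_cycles = IC * refs_per_instr * miss_rate * miss_penalty
cpu_time_ns = (IC * CPI_base + memory_stall_cycles) * cycle_time_ns
print(memory_stall_cycles, cpu_time_ns)   # 3,000,000 stall cycles; 2,000,000 ns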
Performance of Cache
• Worked example (problem details were given on the slide figure): Speedup = 2.7/2 = 1.35
Where can a block be placed in cache (cache mapping)
• When a cache miss happens, the block of data (called a block) containing the requested data is retrieved from main memory and placed in cache
• Example (figure): memory has 32 blocks, cache has 8 block frames. Assume there is nothing in the cache and lower-level block 12 needs to be placed into one of the block frames of the cache
Where can a block be placed in cache (cache mapping)
(Figure: memory has 32 blocks; the cache's 8 block frames are organized as 4 sets, each set with two blocks, i.e., 2-way set associative)
• Fully associative: the block can be placed in any of the block frames
• Direct mapped: the block can only be placed in block frame (Block address) MOD (number of blocks in cache)
• Set associative: the block can only be placed in set (Block address) MOD (number of sets in cache)
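A small Python sketch of the placements for the example above (block 12, 8 block frames, 4 sets); it is just the MOD arithmetic from the slide, not a full cache model.

block = 12
num_frames = 8                 # direct mapped: a single candidate frame
num_sets = 4                   # 2-way set associative: 8 frames / 2 per set
print(block % num_frames)      # direct mapped  -> frame 4
print(block % num_sets)        # set associative -> set 0 (either frame of set 0)
# fully associative -> block 12 may go into any of the 8 frames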
Cache Mapping
• Bringing content (data/instruction) from main memory (MM) to cache memory (CM)
• Fully associative mapping, direct mapping, set associative mapping
Example:
• Assume there are 4,096 blocks in the MM, each having 16 words (total 4,096 x 16 =
65,536 words in the MM)
• 65,536 unique addresses are needed to address these words, i.e., a 16-bit address is needed (2^16 = 65,536)
• Assume there are 128 cache blocks (also called block frames or cache lines)
• Size of cache line = size of MM block, i.e., 16 words in each cache line
• How to map 4,096 MM blocks to 128 cache lines ?
Fully associative mapping
• Each cache line can hold 16 words; each MM block has 16 words
• A 16-bit MM address is divided into two fields: the 12 most-significant bits identify the MM block, and the 4 least-significant bits identify the location of a word within the MM block
• Block j of MM can be mapped to any line in CM (provided that line is not already occupied)
• When a MM block is brought into a cache line, the 12 bits are stored as tag bits for that cache line (indicating which of the 4,096 MM blocks has been brought into the cache line), and the 4 bits are used as the line offset to identify a word within the cache line
• Address format: tag (12 bits) | word offset (4 bits)
• Assume that the MM block containing the requested word has already been mapped into a CM line. How to find that word in CM?
• From the incoming 16-bit address, the 12 most-significant bits are compared to the tag bits of each cache line
• If a match is found, the 4 least-significant bits identify the word (out of 2^4 = 16 possible words) in that particular cache line
• Ultimate flexibility, but searching overhead!
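A minimal Python sketch of the fully associative lookup just described; the 12-bit tag / 4-bit offset split follows the example, while the cache contents passed in are hypothetical.

def fully_associative_lookup(addr16, lines):
    """lines: list of (valid, tag) tuples, one per cache line."""
    tag = addr16 >> 4                    # 12 most-significant bits
    offset = addr16 & 0xF                # 4 least-significant bits
    for index, (valid, line_tag) in enumerate(lines):
        if valid and line_tag == tag:    # tag compared against every line
            return True, index, offset   # cache hit
    return False, None, offset           # cache miss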


Direct mapping
• Only 2^5 = 32 MM blocks can be mapped to each cache line
• A 16-bit MM address is divided into three fields: the 4 least-significant bits identify the location of a word within the MM block; the next 7 bits plus the 5 most-significant bits identify the MM block
• Block j of MM can be mapped only to line (j mod 128) in CM
• When a MM block is brought into a cache line, the 5 most-significant bits are stored as tag bits for that cache line (indicating which of the 32 candidate MM blocks has been brought into the cache line), the next 7 bits identify the cache line number (out of 128 possible), and the 4 least-significant bits are used as the line offset to identify a word within the cache line
• Address format: tag (5 bits) | line number (7 bits) | word offset (4 bits)
• MM blocks 0, 128, 256, …, 3968 map to CM line 0 (a total of 32)
• MM blocks 1, 129, 257, …, 3969 map to CM line 1 (a total of 32), and so on
• Assume that the MM block containing the requested word has already been mapped into a CM line. How to find that word in CM?
• From the incoming 16-bit address, the 7 middle bits indicate the cache line that may hold the requested word. The 5 most-significant bits are then matched with the tag bits of that cache line.
• If a match is found, the 4 least-significant bits identify the word (out of 2^4 = 16 possible words) in that particular cache line
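A corresponding Python sketch of the direct-mapped address split (5-bit tag, 7-bit line number, 4-bit word offset, as above); the function name is only illustrative.

def direct_mapped_split(addr16):
    offset = addr16 & 0xF            # 4 word-offset bits
    line = (addr16 >> 4) & 0x7F      # 7 line-number bits (0..127)
    tag = addr16 >> 11               # 5 tag bits
    return tag, line, offset
# A hit occurs when the stored tag of cache line 'line' equals 'tag'.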
Set associative mapping
• The 128 CM lines are divided into 64 sets, each having two CM lines (2-way set associative)
• Only 2^6 = 64 MM blocks can be mapped to each set in CM
• A 16-bit MM address is divided into three fields: the 4 least-significant bits identify the location of a word within the MM block; the next 6 bits plus the 6 most-significant bits identify the MM block
• Block j of MM can be mapped to set (j mod 64) in CM (to any line in that set)
• When a MM block is brought into a cache line, the 6 most-significant bits are stored as tag bits for that cache line (indicating which of the 64 candidate MM blocks has been brought into the cache line), the next 6 bits identify the set number (out of 64 possible), and the 4 least-significant bits are used as the line offset to identify a word within the cache line
• Address format: tag (6 bits) | set number (6 bits) | word offset (4 bits)
• MM blocks 0, 64, 128, …, 4032 map to CM set 0 (a total of 64)
• MM blocks 1, 65, 129, …, 4033 map to CM set 1 (a total of 64), and so on
• Assume that the MM block containing the requested word has already been mapped into a CM line. How to find that word in CM?
• From the incoming 16-bit address, the 6 middle bits indicate the set that may hold the requested word. The 6 most-significant bits are then compared with the tag bits of each cache line in that set.
• If a match is found, the 4 least-significant bits identify the word (out of 2^4 = 16 possible words) in that particular cache line
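And the same split for the 2-way set-associative organization above (6-bit tag, 6-bit set number, 4-bit word offset); again only a sketch.

def set_associative_split(addr16):
    offset = addr16 & 0xF            # 4 word-offset bits
    set_no = (addr16 >> 4) & 0x3F    # 6 set-number bits (0..63)
    tag = addr16 >> 10               # 6 tag bits
    return tag, set_no, offset
# The tag is compared against both lines of set 'set_no'.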
Valid bit and cache hit
• In addition to the tag bits associated with each cache line, there is a valid bit: 0 or 1
• 0: the cache line holds invalid data, so no cache hit is possible
• 1: the cache line has been loaded with valid data
• When the valid bit is 1 and the tag bits match the upper address bits of the incoming address, there is a cache hit
• Cache memory is analogous to a letter box
• Valid bit 1: letter inside the letter box
• Valid bit 0: no letter inside, or no letters addressed to you
Which line should be replaced on cache miss
• On miss, cache controller selects a cache line to be replaced with the
desired (requested) data
Which line to replace?
• Random: candidate lines are randomly selected
• Least recently used (LRU): accesses to lines are recorded; the line unused for the longest time is replaced
Numerical of cache mapping

The main memory of a computer has 2cm blocks while the cache has 2c blocks (cache lines). If the cache uses the set-associative mapping scheme with 2 blocks per set, then block k of the main memory maps to which of the following sets?
A. (k mod m) of the cache
B. (k mod c) of the cache
C. (k mod 2c) of the cache
D. (k mod 2cm) of the cache
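A hedged worked answer, reading 2cm and 2c as 2·c·m and 2·c blocks respectively: with 2 blocks per set, the number of sets is 2c / 2 = c, so block k maps to set (k mod c) of the cache, i.e., option B.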
Numerical of cache mapping

1. A computer system uses 16-bit memory addresses. It has a 2K-byte cache organized in a direct-mapped manner with 64 bytes per cache block. Assume that the size of each memory word is 1 byte.
Calculate the number of bits in each of the Tag, Block, and Word fields of the memory address.
2. Repeat the problem if the cache is organized as a 2-way set-associative cache
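A possible worked solution, sketched in Python; the field-size arithmetic follows directly from the problem statement, and the variable names are only for illustration.

addr_bits = 16
cache_bytes, block_bytes = 2 * 1024, 64
word_bits = (block_bytes - 1).bit_length()         # 6 bits select a byte within a block
num_blocks = cache_bytes // block_bytes            # 32 cache blocks
block_bits = (num_blocks - 1).bit_length()         # 5 block-field bits (direct mapped)
tag_bits = addr_bits - block_bits - word_bits      # 5 tag bits (direct mapped)
num_sets = num_blocks // 2                         # 16 sets (2-way set associative)
set_bits = (num_sets - 1).bit_length()             # 4 set-field bits
tag_bits_2way = addr_bits - set_bits - word_bits   # 6 tag bits (2-way set associative)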
Cache write strategies
How to write to cache?
• Write through (store through): Data written to both the cache line and
to the block in MM
• Write back (store back): Data written only to the cache line. Modified
cache line written to MM only when the cache line needs to be replaced.
‘Dirty bit’ used to reduce frequency of writing back on replacement. Dirty
bit ‘0’ indicates cache line is ‘clean’ (not modified), thus no write back
needed on replacement. Dirty bit ‘1’ indicates cache line is ‘dirty’
(modified), thus write back needed on replacement.
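A toy Python sketch of the write-back policy with a dirty bit; a single cache line and a dict standing in for MM, with all names illustrative rather than an actual cache-controller interface.

class WriteBackLine:
    def __init__(self):
        self.valid = self.dirty = False
        self.tag = self.data = None

    def write(self, tag, data):
        # Data is written only to the cache line; the line becomes dirty.
        self.valid, self.dirty, self.tag, self.data = True, True, tag, data

    def replace(self, memory):
        # A modified ('dirty') line is written back to MM only on replacement.
        if self.valid and self.dirty:
            memory[self.tag] = self.data
        self.valid = self.dirty = False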
Virtual memory
Why virtual memory?
• Each process needs its own segment of memory
• Total memory required by all active processes exceeds the amount of memory
available
How virtual memory helps?
• VM creates virtual address space in secondary memory
• Virtual address space: paging file
• Paging file combined with primary memory accounts for the available memory for
all active processes
Virtual memory
How virtual memory works?
• A process requests data
• CPU doesn’t find the data in primary memory (page fault)
• OS takes a block of data from primary mem that hasn’t been used recently, writes it to VM
• OS takes a block of data (containing the requested data) from VM and places in primary
mem in place of the old block of data: swapping or paging
• Blocks of data swapped: ‘page’ (in VM), ‘page frame’ (in MM)
• Virtual address space divided into pages, primary memory divided into page frames
• Typical size of a page or page frame is 4 KB
• OS maintains a page table in primary memory
• Page table has one entry for each page
• Page table indicates whether a page has been moved into a page frame in primary memory,
and if so, which page frame contains it
• CPU uses page table while looking for data in primary memory
Simple illustration of page table
(Figure: VM with 8 pages holding data items a–h; MM with 4 page frames; a page table with one entry per page. In the example, page 1 maps to frame 3, page 3 to frame 4, and page 5 to frame 2; pages not resident in MM are marked '-')
• 1 page = 4 KB = 4,096 bytes; 8 pages = 32,768 bytes
• Assuming memory is byte addressable, each memory address is 15 bits (2^15 = 32,768)
Address translation or memory mapping: when a process requests access to a particular memory address
• The CPU treats it as a logical (VM) address and breaks it into two parts (page index bits: the 3 most-significant bits; offset bits: the 12 least-significant bits)
• Page index bits identify the page where the address is located
• Offset bits identify the exact location of the address within that particular page
• The CPU checks the page table at the page index to find which page frame (f, called the page frame index) in MM holds that particular page
• If the page table reveals the page is not in MM: page fault. The OS brings the page into MM and updates the page table
• If the page table reveals the page is in MM, the page index is replaced with the page frame index
• The page frame index appended with the offset bits makes the physical address
Page table
Address translation
Logical address: page index (3 bits) | offset (12 bits)
Physical address: page frame index (2 bits) | offset (12 bits)
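A minimal Python sketch of this translation; the 3-bit page index / 12-bit offset split follows the example, the page-table contents are hypothetical, and a missing entry stands for a page fault.

page_table = {0: 2, 1: 3, 5: 0}         # page index -> page frame index (hypothetical)

def translate(logical_addr):
    page = logical_addr >> 12            # most-significant 3 bits: page index
    offset = logical_addr & 0xFFF        # least-significant 12 bits: offset
    frame = page_table.get(page)
    if frame is None:
        raise RuntimeError("page fault: OS must load the page and update the table")
    return (frame << 12) | offset        # page frame index appended with the offset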

Page table format


• Page table entry includes the following information:
• 1 bit to specify whether the page is currently loaded into main memory
• Several bits (2 bits in our example) to specify which page frame contains the page
• Dirty bit (1 bit): specifies whether the page has been modified in MM. When '1', the page in MM has been modified, so during replacement the OS must write it back to SM. When '0', the page has not been modified, so no write back to SM is needed.
• Reference bit (1 bit): specifies whether the page has been accessed recently in MM. The OS uses this information to decide whether the page should be swapped out to make room for a required page. The CPU sets this bit to '1' whenever the page in MM is accessed; the OS periodically resets it to '0'
Page table
Translating logical address to physical address
• The address of the page table (its starting location) is stored in the PTBR (page table base register)
(Figure: the page table of process P1, indexed by page number, with each entry holding a page frame number)
• What if the size of the page table is greater than the size of a page frame?
• Then the page table can't be stored in a single page frame in MM
• Solution: multilevel page table!
Multilevel page table
Consider a system with a 32-bit logical address space. If the page size in such a system is 4 KB, then a page table consists of 2^32/2^12 = 2^20 entries. Assuming that each entry consists of 4 bytes, it takes 4 MB of physical address space to hold the page table.
Page table size is more than the page frame size!

Multilevel page table


• Page table divided into several parts (each having the size of a page frame)
• These parts of the page table are stored in different page frames of the MM: level n page table
• Keep track of these page frames (which store the different parts of the page tables) using
another page table: level (n-1) page table
• Hierarchy of page table generated
• Continue until entire page table can be stored in a single page frame: lowest level page table
Multilevel page table
Consider a virtual memory system with physical memory of 8GB, a page size of
8KB and 46 bit virtual address. Assume every page table exactly fits into a
single page. If page table entry size is 4B then how many levels of page tables
would be required?
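A hedged worked computation using only the numbers in the question: a page size of 8 KB = 2^13 bytes gives a 13-bit offset, so the virtual page number has 46 − 13 = 33 bits; each page-table page holds 8 KB / 4 B = 2^11 entries, so each level resolves 11 bits; ceil(33/11) = 3, i.e., three levels of page tables are required.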
Multilevel page table
Three level page table

The entries of the level 1 page table are pointers to level 2 page tables, the entries of the level 2 page tables are pointers to level 3 page tables, and so on. The entries of the last-level page tables store the actual page frame information. Level 1 contains a single page table (which fits into a page frame), and the address of that table is stored in the PTBR.
Replacement/swapping/paging algorithm
• Important to reduce the number of page faults
• Least recently used (LRU) algorithm: OS replaces a page in MM
that has least recently been used. A page that has been used
recently is likely to be used again in near future (temporal
locality), hence should not be replaced. Reference bit used to
know if page has been recently used or not
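A compact Python sketch of LRU page replacement; the frame count and reference string are hypothetical, and an OrderedDict keeps pages in recency order in place of the reference bits an OS would use.

from collections import OrderedDict

def lru_page_faults(references, num_frames):
    frames, faults = OrderedDict(), 0
    for page in references:
        if page in frames:
            frames.move_to_end(page)        # page just used -> most recently used
        else:
            faults += 1                     # page fault
            if len(frames) == num_frames:
                frames.popitem(last=False)  # evict the least recently used page
            frames[page] = True
    return faults

# e.g. lru_page_faults([1, 2, 3, 1, 4, 2], num_frames=3) returns 5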
Translation lookaside buffer
• Each memory access requires accessing memory twice: once to look up
the page table and once to get the actual data
• Page table substantially large, thus look up process considerably slow
• To make the address translation (i.e. look up process) faster, a dedicated
cache used: translation lookaside buffer (TLB)
• Each TLB entry contains a logical address and its corresponding physical
address
• Using TLB, CPU can quickly convert logical address to physical address
• When process requests access to a memory address, CPU checks first if
TLB has an entry for that logical address
• If TLB has an entry, the corresponding physical address is immediately
known
• If TLB has no entry, CPU does laborious look up into page table to convert
logical address to physical address
• Once done, CPU adds the result as an entry in TLB
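A toy Python sketch of the TLB sitting in front of the page table; dictionaries stand in for the hardware structures, and their contents are hypothetical.

tlb = {}                                 # logical page -> page frame (small, fast)
page_table = {1: 3, 5: 2}                # full mapping kept in MM (hypothetical)

def lookup_frame(page):
    if page in tlb:                      # TLB hit: translation known immediately
        return tlb[page]
    frame = page_table.get(page)         # TLB miss: slow page-table look-up
    if frame is None:
        raise RuntimeError("page fault")
    tlb[page] = frame                    # add the result as a TLB entry
    return frame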
Inverted page table
• Page tables are usually large
• Each process has its own page table: accounts for significant space in MM
• Alternative approach: make a global page table for all the active processes,
known as inverted page table
• Number of entries is equal to the number of page frames (not equal to the number of pages, as in a per-process page table)
• Each entry has two information: process ID (PID) and page number
• PID specifies the process which owns the page
Inverted page table
Example (figure):
Page table for process 1 (Pr 1), page index → page frame index:
1 → -, 2 → -, 3 → 2, 4 → 4, 5 → 7, 6 → -, 7 → 6
Page table for process 2 (Pr 2), page index → page frame index:
1 → 3, 2 → 5, 3 → 8, 4 → -, 5 → -, 6 → -, 7 → 1
Inverted page table (number of page frames = 8), page frame → (PID, page number):
1 → (Pr 2, P7), 2 → (Pr 1, P3), 3 → (Pr 2, P1), 4 → (Pr 1, P4), 5 → (Pr 2, P2), 6 → (Pr 1, P7), 7 → (Pr 1, P5), 8 → (Pr 2, P3)
Virtual memory
Suppose that a system has a 32-bit virtual address space. It has 1 GB of physical memory and uses 1 MB pages.
a) How many virtual pages are there in the address space?
b) How many physical pages are there in the address space?
c) How many bits are there in the physical page number?

Virtual memory = 2^32 bytes = 4 GB
Number of virtual pages = 4 GB / 1 MB = 4096
Number of physical pages = 1 GB / 1 MB = 1024
1024 = 2^10, thus 10 bits in the physical page number
Virtual memory
Consider a system with byte-addressable memory, 32-bit logical addresses, 4 KB
page size and page table entries of 4 bytes each. What is the size of the page
table in the system?

Virtual memory = 2^32 bytes
Number of pages = 2^32/2^12 = 2^20
1 entry in the page table = 4 bytes
Size of the page table = 4 x 2^20 bytes = 4 MB
Parallel processors/multiprocessors
Based on number of instruction streams and data streams that can be processed
simultaneously, computing systems can be classified into four categories (Flynn’s
Classification)

• Single-instruction, single-data (SISD) systems


• Single-instruction, multi-data (SIMD) systems
• Multi-instruction, single-data (MISD) systems
• Multi-instruction, multi-data (MIMD) systems
Flynn’s classification
• SISD: Uniprocessor system. Capable of executing a single instruction, operating on a single data
stream. Traditional uniprocessor PC (before 2010)
• SIMD: Multiprocessor system. Capable of executing the same instruction on all the PUs but operating
on different data streams. Vector processing machines
• MISD: Multiprocessor system. Capable of executing different instructions on different PUs but all of
them operating on the same data stream. Not available commercially
• MIMD: Multiprocessor system. Capable of executing different instructions on different PUs operating
on different data streams. Truly parallel machines
Shared vs distributed-memory architecture for MIMD
Shared-memory MIMD: PUs are connected to a single global memory (either by a single bus
or by a network). All PUs have access to it. Communication between PUs takes place through
the shared memory. Modification of data stored in the global memory by one PU is visible to
all other PUs. Easy to design but less likely to scale (memory contention)

Distributed-memory MIMD: All PUs have a local memory. Communication between PUs
takes place through an interconnection network. Complex design but high scalability

(Figures: shared memory architecture, with PUs connected to shared memory modules; distributed memory architecture, with each PU having its own local, non-shared memory)


Shared memory architecture: UMA vs NUMA
• Uniform memory access: Memory access time uniform across all processors. Access time
independent of data location within memory (i.e. access time is same regardless of which
shared memory module contains the data)
• Non uniform memory access: Each processor has its own local memory module that it
can access directly (local access). Local memory of other processors can also be accessed
(remote access) with longer access time.

(Figures: UMA and NUMA organizations, each showing processors connected to shared memory modules)

• Interconnection network/ Processor to memory network: Butterfly, Benes


Butterfly vs Benes network
• Butterfly network: Blocking network. Some permutations result in link contention
• Benes network: Start with butterfly network. Flip it and repeat this network to the
other side. Non-blocking network. Any permutation can be realized without link
contention

Butterfly network: number of levels = log2 N + 1, where N = number of rows
Benes network: number of levels = 2 log2 N + 1, where N = number of rows
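A quick sanity check of these formulas for a hypothetical network with N = 8 rows: the butterfly network has log2 8 + 1 = 4 levels, and the Benes network built by mirroring it has 2 log2 8 + 1 = 7 levels.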


Cache coherence
• Multiple processor cores share the same memory, but have their own caches
• View of the memory (shared) by two different processors through their individual
caches

• P1 and P2 can have two different values for the same location
• Problem is called multiprocessor cache coherence problem
Cache coherence problem

Write-through cache

• Single memory location (X), read and written by two processors (A and B)
• Initially assume neither cache contains the variable at X, initial value at X is 1
• After variable at X is written by A (new value 0), both A’s cache and memory contain new value, but not B’s
cache
• Two different values for same location (Caches of CPU A and CPU B have 0 and 1 at location X, respectively)
• If B reads the value of the variable at X, it will read 1, not the most recently written 0
• Cache coherence ensures that changes in the values of shared operands are propagated throughout the
system in a timely fashion
Approaches for cache coherence problem

• Software-based approach: Detect the potential code


segments which might cause cache coherence issues and
treat them. Prevent any shared data variables from being
cached. Compile time approach. Inefficient utilization of
cache. 
• Hardware-based approach: Dynamically detect (at run
time) potential cache coherence issues. Efficient use of
cache. Can be of two types: directory protocol, snoopy
protocol
Approaches for cache coherence problem
• Directory-based approach (directory protocol): A single directory is maintained
in MM to keep track of the sharing status of each cache block. Responsibility
of cache coherence on central cache controller. Any local action changing a
cache block, is reported to the central controller. Central controller maintains
(using the global directory) information about which processors have copies of
which cache blocks. Before a processor can write to a local cache block, it must
request exclusive access to the cache block from the central controller
• Snooping-based approach (snoopy protocol): Responsibility of cache coherence
distributed among all the cache controllers. Each cache block is accompanied
by a sharing status. When write operation is performed on a shared cache
block, it is announced to all other caches by broadcasting mechanism. Each
cache controller snoops on the bus to determine if it has that particular block
where write operation is performed and react accordingly
Approaches for cache coherence problem
• Two common snoopy protocol approaches are:
• Write-invalidate snoopy protocol: Processor has exclusive access to a data item before it
writes that item. Invalidate all other cached copies of the data item when that data item is
written

• Initially assume neither cache contains the data item at X. Initial value at X is 0
• When B wants to read X, A responds with the written value (1), cancelling the response from memory (0)
• B's cache and the memory content at X are updated at the same time: write-back cache
Cache coherence approaches
• Write-update (write broadcast) snoopy protocol: Update all the cached copies of
the data item when it is written

• Initially assume neither cache contains the data item at X. Initial value at X is 0
• When A broadcasts a write, both B’s cache and memory location X are
updated
Additional Topics
• Write-invalidate, cache-coherence protocol (MSI protocol) for write-back cache
(p. 664 of Hennessy)
• Universal Serial Bus
• Direct Memory Access (DMA)
• Daisy Chain
