
Hierarchical Memory

Organization

Ajit Pal
Professor
Department of Computer Science and
Engineering
Indian Institute of Technology Kharagpur
INDIA -721302
Outline
Key characteristics of memory systems
Hierarchical Memory Organization
Basic principles of cache memory
Basic issues in cache design:
 Cache size
 Mapping functions
 Replacement algorithms
 Write policy
 Block size
 Number of caches
 Performance analysis
Summary
Ajit Pal, IIT Kharagpur
Introduction
 Memory systems are critical to performance
 Computer designers have devoted a great deal of attention to developing sophisticated mechanisms to improve the performance of memory systems
 The primary approach used to improve performance is Hierarchical Memory Organization
 Programs exhibit temporal locality and spatial locality



Von Neumann Computer
Architecture
[Block diagram: the CPU (ALU, registers, timing & control unit) connects through an interface to memory and I/O devices over the system bus]



Key characteristics of Computer Memory Systems
Location:
On-chip – Registers
On-board – Main memory
External – Secondary/backup memory
Capacity: Number of Bytes/words
Access method: Random access, sequential, associative
Performance: Access time, Transfer rate
Physical type: Semiconductor, magnetic, optical
Physical characteristics: Volatile/nonvolatile
Organization: bit-organized, byte-organized
Key characteristics of Memory Systems

[Plot: storage capacity (x-axis) versus access time (y-axis) for different types of memories]

Observation: the larger the capacity, the slower the device
Key characteristics of Memory Systems

[Plot: memory cost versus year for different memory technologies]

Observation: higher capacity, lower cost; cost has been decreasing over the years
Gap in Performance Between Memory and CPUs, Plotted over Time

[Plot: processor and memory performance diverging over the years]


Hierarchical Memory Organization
 Basic Objectives:
  Fast: approaching the speed of the fastest memory available
  Large: approaching the size of the largest memory available
  Optimum cost: close to the cost of the cheapest memory available

Level-0: Registers
Level-1: Cache memory
Level-2: Main memory
Level-3: Secondary memory
Attributes of memory hierarchy components
Component      Technology        Bandwidth   Latency  Cost per bit ($)  Cost per GB ($)
Disk drive     Magnetic          10+ MB/s    10 ms    < 1E-9            < 1
Main memory    DRAM              2+ GB/s     50+ ns   < 2E-7            < 200
On-chip L2     SRAM              10+ GB/s    2+ ns    < 1E-4            < 100K
On-chip L1     SRAM              50+ GB/s    300+ ps  > 1E-4            > 100K
Register file  Multiported SRAM  200+ GB/s   300+ ps  > 1E-2 (?)        > 10M (?)


Hierarchical Memory Organization
Five Parameters (for level i, where level i-1 is closer to the CPU):
 Access time (t_i): t_(i-1) < t_i
 Cost per byte (c_i): c_(i-1) > c_i
 Memory size (s_i): s_(i-1) < s_i
 Transfer bandwidth (b_i): b_(i-1) > b_i
 Unit of transfer (x_i): x_(i-1) < x_i

As you go from a lower to a higher level:
 Access time increases
 Cost per byte decreases
 Capacity increases
 Frequency of access decreases
Hierarchical Memory Organization

Three Properties:
 Inclusion: M1 ⊂ M2 ⊂ … ⊂ Mn (the contents of each level are a subset of the next)
 Coherence: copies of the same block at different levels are consistent
 Locality of Reference:
  Temporal locality
  Spatial locality
  Sequential locality



Hierarchical Memory Organization
CPU Registers → Cache → Main Memory → Hard Disk

Size:   500 bytes   64 KB   512 MB   500 GB
Speed:  0.25 ns     1 ns    100 ns   5 ms

Levels 0–3: faster toward the CPU, bigger toward the disk


Hierarchical Memory Organization
Crucial for contemporary multi-core processors; an
Intel Core i7 can generate two data references per core per clock
 With four cores and a 3.2 GHz clock:
  25.6 billion 64-bit data references per second
  12.8 billion 128-bit instruction references per second
  = 409.6 GB/s in total
DRAM bandwidth (25 GB/s) is only about 6% of this
It requires:
 Multi-port, pipelined cache memories
 Two levels of cache per core
 A shared 3rd-level cache on chip
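The bandwidth arithmetic above can be checked directly; a quick sketch of the slide's figures, where the 8-byte and 16-byte sizes come from the 64-bit and 128-bit reference widths:

```python
# Peak memory demand of a 4-core Intel Core i7 at 3.2 GHz,
# assuming two 64-bit data references per core per clock and
# one 128-bit instruction reference per core per clock.
cores, clock_hz = 4, 3.2e9
data_refs = cores * clock_hz * 2         # 25.6 billion per second
inst_refs = cores * clock_hz             # 12.8 billion per second
demand = data_refs * 8 + inst_refs * 16  # bytes per second
print(demand / 1e9)                      # 409.6 (GB/s)
print(25e9 / demand)                     # ~0.061, i.e. about 6%
```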



Cache Memory

A small, fast storage is introduced between the CPU and the slow main memory to improve average access time

It improves memory system performance by exploiting spatial and temporal locality



Cache Memory: Basic Principles
[Diagram: words are transferred between the CPU and a fast, small cache; blocks are transferred between the cache and the slow, large main memory. Main memory holds 2^n words grouped into blocks of k words each; the cache holds a much smaller number of block-sized slots]
Cache Memory: Basic Principles
Read operation (flowchart):
1. Start: receive address RA from the CPU
2. Is the block containing RA in the cache?
  Yes: fetch the RA word and deliver it to the CPU
  No: access main memory for the block containing RA;
     allocate a cache slot for the main memory block;
     load the main memory block into the cache slot;
     deliver the RA word to the CPU
3. Done
Basic Issues: Cache Size

 Small enough that the average cost per bit is close to that of main memory alone
 Large enough that the average access time is close to that of the cache alone
 Cache sizes between 1K and 512K words have been found to give optimum results



Basic Issues: Where can a Block be placed?
 As there are fewer lines in the cache than main memory blocks, a suitable technique for mapping main memory blocks onto cache lines is necessary

Mapping Functions:
 One place: Direct mapping
 Any place: Associative mapping
 A few places: Set-associative mapping
Block Identification
 Given an address, how do we find where it goes in the cache?
  Indexing
  Full search
  Limited search
 This is done by first breaking the address down into three parts:

 | tag | set index | block offset |

 Tag: used for identifying a match
 Set index: index of the set in the cache
 Block offset: offset of the addressed word within the cache block
Block Identification: Indexing
 Consider the following system:
 – Addresses are 32 bits; memory is byte-addressable
 – Block frame size is 2^2 = 4 bytes
 – Cache is 64 KB (2^16 bytes), consisting of 2^14 block frames
 – Direct-mapped: for each cache block brought in from memory, there is a single possible frame among the 2^14 available
 A 32-bit address can be decomposed as follows:

 | 16-bit tag | 14-bit index | 2-bit block offset |
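As a sketch, the 16/14/2 split above can be expressed with shifts and masks (the address values used in the checks are arbitrary):

```python
# Split a 32-bit byte address for the direct-mapped cache above:
# 64 KB cache, 4-byte blocks -> 2-bit offset, 14-bit index, 16-bit tag.
def split_address(addr):
    offset = addr & 0x3            # low 2 bits: byte within the block
    index = (addr >> 2) & 0x3FFF   # next 14 bits: one of 2**14 frames
    tag = addr >> 16               # remaining 16 bits
    return tag, index, offset
```

Reassembling the three fields with the same shifts must reproduce the original address.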



Direct Mapping
Address (s + w bits): | tag (s − r bits) | line (r bits) | word (w bits) |

 Address length = s + w bits
 Block size = 2^w words
 Number of blocks in main memory = 2^s
 Number of lines in cache memory = m = 2^r

Each cache line stores: V (1 valid bit) | TAG (16 bits) | DATA (32 bits)
The valid and tag fields are overhead bits


Direct Mapping

 i = j mod m,
where
 i = the cache line number
 j = the main memory block number
 m = the number of lines in the cache
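A small illustration of this rule (the 8-line cache size is arbitrary):

```python
# Direct mapping: memory block j may occupy only cache line j mod m.
m = 8  # hypothetical number of cache lines
lines = {j: j % m for j in [0, 5, 8, 13, 21]}
# Blocks 5, 13 and 21 all map to line 5: a program alternating
# between them keeps swapping the same line back and forth.
```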



Mapping in Direct Mapping
Cache line   Main memory blocks assigned
0            0, m, 2m, …, 2^s − m
1            1, m+1, 2m+1, …, 2^s − m + 1
…            …
m−1          m−1, 2m−1, …, 2^s − 1


Direct Mapping
Example: 64 Kb main memory, 4 Kb cache with 1024 blocks

Address: | tag (bits 15–12) | index (bits 11–2) | byte offset (bits 1–0) |

 Advantages:
  Simple
  Inexpensive
 Disadvantages:
  Fixed cache location for a given main memory word
  Two words with the same index but different tag values cannot reside in the cache simultaneously
  Vulnerable to continuous swapping
Associative Mapping: Full Search
[Diagram: the cache as an array of (tag, data) pairs, all searched in parallel]

 Associative mapping allows any main memory block to be mapped into any cache line
  Better performance
  Expensive to implement


Four Basic Questions
 Q1: Where can a block be placed in the cache?
(Block placement)
 Fully Associative, Set Associative, Direct Mapped
 Q2: How is a block found if it is in the cache?
(Block identification)
 Tag/Block
 Q3: Which block should be replaced on a miss?
(Block replacement)
 Random, LRU
 Q4: What happens on a write?
(Write strategy)
 Write Back or Write Through (with Write Buffer)
Hierarchical Memory
Organization (Contd.)

Mapping Functions
 Q1: Where can a block be placed in the
cache?
(Block placement)
 Direct Mapped, Fully Associative, Set
Associative
 Q2: How is a block found if it is in the
cache?
(Block identification)
 Tag/Block
Direct Mapping
[Diagram: direct-mapped lookup. The index field drives a decoder that selects one cache line; the stored tag is compared with the address tag, and together with the valid bit the comparison yields hit/miss; the line's data field is read out]
Fully Associative Mapping
[Diagram: fully associative lookup. The address tag is compared in parallel against every stored tag (an associative search in a content-addressable memory); a match on a valid line signals a hit and drives the data out]
Set-Associative Mapping: Limited Search
 A compromise that exhibits the strengths of both the direct and associative mappings and overcomes the disadvantages of both
 m = v × k, i = j mod v, where
  i = cache set number
  j = main memory block number
  m = number of lines in the cache
  v = number of sets
  k = number of lines in each set
 A generalization of the previous two approaches

Example (cache size = 4 Kb): Address: | tag (bits 31–10) | set (bits 9–2) | byte offset (bits 1–0) |
Set-Associative Mapping

[Diagram: two-way set-associative lookup for a 4 Kb cache; in each way the tag comparison is ANDed with the valid bit, and the way results are ORed to produce hit/miss]
Set-Associative Mapping
 Example: two-way set-associative mapping, with v = m/2 and k = 2
 For v = m and k = 1, it reduces to direct mapping
 For v = 1 and k = m, it reduces to associative mapping
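The relation m = v × k can be sketched as a small helper; the boundary cases below reproduce the two degenerate mappings described above (the 8-line cache size is arbitrary):

```python
# Block j maps to set i = j mod v; any of the set's k lines may hold it.
def set_for_block(j, m, k):
    v = m // k        # number of sets
    return j % v

# k = 1 (v = m): one line per set -> direct mapping.
# k = m (v = 1): one set holding every line -> fully associative.
```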
Size of Tags Versus Associativity

 Increasing associativity requires more


comparators, as well as more tag bits
per cache block
 The choice among direct-mapped, set-
associative and fully-associative
mapping in any memory hierarchy will
depend on cost of a miss versus the
cost of implementing associativity,
both in time and in extra hardware



Size of Tags versus Associativity
Assume: address size = 32 bits, word size = 32 bits, block size = word size, cache size = 16 Kb (2^12 blocks)

4-way set-associative:
 Address: | 20-bit tag | 10-bit index | 2-bit byte offset |
 Total no. of tag bits = 4 × 2^10 × 20
 Total no. of comparators = 4

Direct-mapped:
 Address: | 18-bit tag | 12-bit index | 2-bit byte offset |
 Total no. of tag bits = 2^12 × 18
 Total no. of comparators = 1

Fully associative:
 Address: | 30-bit tag | 2-bit byte offset |
 Total no. of tag bits = 2^12 × 30
 Total no. of comparators = 2^12
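The three tag-storage totals above can be recomputed for any associativity; a sketch of the slide's arithmetic:

```python
# Tag storage for a 16 Kb cache of one-word (4-byte) blocks and
# 32-bit addresses: 2**12 blocks in total.
ADDR_BITS, OFFSET_BITS, BLOCKS = 32, 2, 2**12

def total_tag_bits(ways):
    sets = BLOCKS // ways
    index_bits = sets.bit_length() - 1         # log2(sets)
    tag_bits = ADDR_BITS - index_bits - OFFSET_BITS
    return BLOCKS * tag_bits                   # summed over all blocks

print(total_tag_bits(1))       # 73728  = 2**12 * 18 (direct-mapped)
print(total_tag_bits(4))       # 81920  = 4 * 2**10 * 20 (4-way)
print(total_tag_bits(BLOCKS))  # 122880 = 2**12 * 30 (fully assoc.)
```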



Basic Issues: Replacement Algorithms
 Which block to be replaced on a cache
miss?
 One of the existing blocks is to be
replaced when a new block is brought in
 No alternative in case of direct mapping
 In case of associative and set-associative
mapping, several approaches are used:
 Least Recently Used (LRU)
 First-In-First-Out (FIFO)
 Least Frequently Used (LFU)
 Random
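For illustration, LRU replacement for a single k-way set can be sketched with an ordered dictionary (the tag strings and the 2-way size are made up for the example):

```python
from collections import OrderedDict

class LRUSet:
    """One k-way cache set with least-recently-used replacement."""
    def __init__(self, k):
        self.k, self.lines = k, OrderedDict()

    def access(self, tag):
        if tag in self.lines:               # hit: mark most recently used
            self.lines.move_to_end(tag)
            return True
        if len(self.lines) >= self.k:       # miss with a full set:
            self.lines.popitem(last=False)  # evict least recently used
        self.lines[tag] = True
        return False

s = LRUSet(2)
hits = [s.access(t) for t in ["A", "B", "A", "C", "B"]]
# "A" hits on its second access; "C" evicts "B" (the least recently
# used line), so the final "B" access misses again.
```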



Basic Issues: What happens on a write?
 Arises in the case of memory write requests
 Typical percentage of memory accesses that are writes: 15%
 Possible approaches:
  Write Through: the information is written to both the block in the cache and the corresponding block in main memory
   Disadvantage: generates substantial memory traffic
  Write Back: the information is written only to the block in the cache, and an 'update' (dirty) flag is set. This minimizes memory writes: main memory is modified only when the block is discarded and its 'update' flag is set. It creates a cache coherency problem in multiprocessor-based systems (later)
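The two policies can be contrasted in a toy model, with a dict standing in for main memory and a set of dirty addresses playing the role of the 'update' flags (all names here are illustrative):

```python
def write_through(cache, memory, addr, value):
    cache[addr] = value
    memory[addr] = value      # every write also reaches main memory

def write_back(cache, dirty, addr, value):
    cache[addr] = value
    dirty.add(addr)           # set the 'update' flag; memory untouched

def evict(cache, dirty, memory, addr):
    if addr in dirty:         # write back only if the flag is set
        memory[addr] = cache[addr]
        dirty.discard(addr)
    del cache[addr]
```

Repeated write-back writes to one address cost a single memory write at eviction time, which is exactly where the traffic saving comes from.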



Cache Optimization Techniques
Larger block size
 Reduces compulsory misses
 Increases capacity and conflict misses
Larger total capacity to reduce miss rate
 Increases hit time, increases power consumption
Higher associativity
 Reduces conflict misses
 Increases hit time, increases power consumption
Higher number of cache levels
 Reduces overall memory access time
Giving priority to read misses over writes
 Reduces miss penalty
Avoiding address translation in cache indexing
 Reduces hit time



Basic Issues: Block Size
[Diagram: direct-mapped lookup with a multiword block; the index selects a line via the decoder, the tag comparison yields hit/miss, and the block offset (BO) selects the requested word from among the line's data words]
Basic Issues: Block Size

 A multiword cache block gives better performance
  Takes advantage of spatial locality
  A cache block is made larger than one word of main memory
  On a miss, multiple adjacent words that are likely to be needed shortly are fetched
Basic Issues: Block Size
 As the block size increases, the hit ratio initially increases because of the principle of locality
 The miss rate may go up as the block size grows beyond some limit (when it becomes a significant fraction of the cache size):
  A larger block size reduces the number of blocks that can fit into the cache
  As a block becomes larger, each additional word is further away from the requested word


Basic Issues: Number of Caches
 Single- or two-level
  The advancement of VLSI technology made on-chip cache possible, which provides the fastest possible cache access
  This also eliminated external bus activity
  It led to two or more levels of cache: on-chip (L1) and off-chip (L2), providing still better performance
 Unified or split
  Unified: a single cache for both instructions and data
  Split: separate caches for instructions and data


Unified and Split Memory

[Diagram: in the Princeton architecture, the CPU connects to a single memory holding both program and data; in the Harvard architecture, the CPU connects to separate program and data memories]


Unified vs Split Caches
 A Load or Store instruction requires two memory accesses:
  One for the instruction itself
  One for the data
 Therefore, a unified cache causes a structural hazard!
 Modern processors use separate data and instruction L1 caches:
  As opposed to "unified" or "mixed" caches
  The CPU sends the instruction and data addresses simultaneously to the two ports
  The two caches can be configured differently: size, associativity, etc.



Unified vs Split Caches
 Separate instruction and data caches:
  Avoid the structural hazard
  Each cache can also be tailored to its specific needs

[Diagram: unified organization: processor → unified L1 cache → unified L2 cache; split organization: processor → I-cache-1 and D-cache-1 → unified L2 cache]


Cache in Intel Processors
80386: 80486:
32KB Off-chip 8KB On-chip
Direct mapped 4-way set-associative
Block size: 16 bytes Block size: 16 bytes
Write-through Write-through

Cache Pentium 4:
Two on-chip, each 8KB, Off-chip 256KB

Block size: 64 bytes Block size128 bytes

4-way set-associative 8-way set-associative


Ajit Pal, IIT Kharagpur
Alpha 21264 Data Cache
 Let us understand the operation of the Alpha 21264 data cache
 The Alpha processor's 48-bit virtual address is translated to a 44-bit physical address, which is presented to the cache:
  29 tag bits
  9 index bits
  6 offset bits
 The cache is 2-way set-associative



Case Study: The Alpha 21264 Cache

What happens on a read?



The Alpha 21264 Cache

Step 1:
The CPU generates a 44-bit (physical) address
The address is split into:
 a 29-bit tag
 a 9-bit set index (2^9 = 512 sets)
 a 6-bit block offset (2^6 = 64-byte blocks)



The Alpha 21264 Cache

Step 2:
The “right” set is selected
using the index bits



The Alpha 21264 Cache

Step 3:
The tag is compared to
both tags in the set; If a
match AND valid=1:
then a hit (If not: then a
miss)



The Alpha 21264 Cache

Step 4:
If a match, select the matching block and return the byte at the right offset
The selection of the matching block is done via a 2:1 multiplexer
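The four steps can be sketched for this 29/9/6 split; in the sketch the cache is a plain dict of sets, each a list of (valid, tag, data) ways, and the contents are made up for the example:

```python
def read(cache, addr):
    offset = addr & 0x3F                # step 1: split the 44-bit address
    index = (addr >> 6) & 0x1FF
    tag = addr >> 15
    ways = cache[index]                 # step 2: select the "right" set
    for valid, line_tag, data in ways:  # step 3: compare both tags
        if valid and line_tag == tag:
            return data[offset]         # step 4: "mux" out the byte
    return None                         # miss

# One set (index 5) with a single valid line of 64 identical bytes:
cache = {5: [(True, 0x1, bytes([0xAB]) * 64), (False, 0, b"")]}
addr = (0x1 << 15) | (5 << 6) | 3       # tag 1, index 5, offset 3
```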



Performance
Hit ratio h_i:
 probability of finding the referenced item in M_i
Miss ratio:
 (1 − h_i)
Average Memory Access Time (two levels):
 T_AMAT = h1·t1 + (1 − h1)(t1 + t2)
Effective cost per byte:
 c = (c1·s1 + c2·s2) / (s1 + s2)
Example: Performance
 Cache memory 1 Kb, main memory 1 Mb
 Fast:
  Cache hit 95% (10 ns), main memory 5% (100 ns)
  AMAT = (0.95)(10) + (0.05)(10 + 100) = 9.5 + 5.5 = 15 ns
 Large: increasing capacity:
  Closer to the size of the higher-level memory (1 Mb)
 Optimum cost:
  Cache (Rs. 1.0 per byte), main memory (Rs. 0.1 per byte)
  Average cost = (1.0 × 1K + 0.1 × 1M) / (1K + 1M) ≈ Rs. 0.1009 per byte
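The two figures above can be reproduced directly, assuming (as the example does) that the miss penalty includes the cache probe plus the memory access:

```python
h1, t_cache, t_mem = 0.95, 10e-9, 100e-9
amat = h1 * t_cache + (1 - h1) * (t_cache + t_mem)
print(amat * 1e9)             # ~15 ns

s_cache, s_mem = 1024, 2**20  # 1 Kb and 1 Mb, in bytes
c_cache, c_mem = 1.0, 0.1     # Rs. per byte
avg_cost = (c_cache * s_cache + c_mem * s_mem) / (s_cache + s_mem)
print(round(avg_cost, 4))     # ~0.1009, close to main memory's cost
```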
Summary
 Memory system in a computer is organized in a
hierarchical manner
 Cache memory is used in between the main memory
and the CPU
 Cache memory is faster and smaller than the main
memory
 Cache memory is accessed more frequently than
main memory
 There are three mapping functions
 Several replacement algorithms are possible
 There are two alternatives for write policies
 Several levels of cache memories are used
 Faster and larger memory at lower cost is achieved



Points to remember
 Where can a block be placed?
 One place
 A few places
 Any place
 How is a block found?
 Indexing
 Limited search
 Full search
 Which block to be replaced on a cache miss?
 Random
 LRU
 What happens on a write?
 Write-through
 Write-back



Thanks!
