
Hierarchical Memory

Organization

Ajit Pal
Professor
Department of Computer Science and
Engineering
Indian Institute of Technology Kharagpur
INDIA -721302
Outline
Key characteristics of memory systems
Hierarchical Memory Organization
Basic principles of cache memory
Basic issues in cache design:
 Cache size
 Mapping functions
 Replacement algorithms
 Write policy
 Block size
 Number of caches
 Performance analysis
Summary
Ajit Pal, IIT Kharagpur
Introduction
 Memory systems are critical to performance
 Computer designers have devoted a great deal of attention to developing sophisticated mechanisms to improve the performance of memory systems
 The primary approach used to improve performance is Hierarchical Memory Organization
 Programs exhibit temporal locality and spatial locality



Von Neumann Computer
Architecture
[Block diagram: the CPU (ALU, registers, timing & control unit) connects through an interface to memory and I/O devices over the system bus]



Key characteristics of Computer Memory Systems
Location:
On-chip – Registers
On-board – Main memory
External – Secondary/backup memory
Capacity: Number of Bytes/words
Access method: Random access, sequential, associative
Performance: Access time, Transfer rate
Physical type: Semiconductor, magnetic, optical
Physical characteristics: Volatile/nonvolatile
Organization: bit-organized, byte-organized
Key characteristics of Memory Systems

[Plot: storage capacity (x-axis) versus access time (y-axis) for different types of memories]

Observation: the larger the capacity, the slower the device
Key characteristics of Memory Systems

[Plot: memory cost versus year for different memory technologies]

Observation: higher capacity, lower cost; cost has been decreasing over the years
Gap in Performance Between Memory and CPUs, Plotted over Time

[Plot: processor and memory performance diverging over the years]


Hierarchical Memory Organization
 Basic Objectives:
  Fast: approaching the speed of the fastest memory available
  Large: approaching the size of the largest memory available
  Optimum cost: close to the cost of the cheapest memory available

Level-0: Registers
Level-1: Cache memory
Level-2: Main memory
Level-3: Secondary memory
Attributes of memory hierarchy components
Component      Technology        Bandwidth   Latency  Cost per bit ($)  Cost per GB ($)
Disk drive     Magnetic          10+ MB/s    10 ms    < 1E-9            < 1
Main memory    DRAM              2+ GB/s     50+ ns   < 2E-7            < 200
On-chip L2     SRAM              10+ GB/s    2+ ns    < 1E-4            < 100K
On-chip L1     SRAM              50+ GB/s    300+ ps  > 1E-4            > 100K
Register file  Multiported SRAM  200+ GB/s   300+ ps  > 1E-2 (?)        > 10M (?)


Hierarchical Memory Organization
Five Parameters (for level i, where level i-1 is closer to the CPU):
 Access time (t_i): t_(i-1) < t_i
 Cost per byte (c_i): c_(i-1) > c_i
 Memory size (s_i): s_(i-1) < s_i
 Transfer bandwidth (b_i): b_(i-1) > b_i
 Unit of transfer (x_i): x_(i-1) < x_i

As you go from a lower to a higher level:
 Access time increases
 Cost per byte decreases
 Capacity increases
 Frequency of access decreases
Hierarchical Memory Organization

Three Properties:
 Inclusion: M1 ⊂ M2 ⊂ … ⊂ Mn (the contents of each level are a subset of the next)
 Coherence: copies of the same block at different levels are consistent
 Locality of Reference:
  Temporal locality
  Spatial locality
  Sequential locality



Hierarchical Memory Organization
CPU Registers → Cache → Main Memory → Hard Disk

Size:   500 bytes   64 KB   512 MB   500 GB
Speed:  0.25 ns     1 ns    100 ns   5 ms

Levels 0–3: faster toward the CPU, bigger toward the disk


Hierarchical Memory Organization
Crucial for contemporary multi-core processors; an
Intel Core i7 can generate two data references per core per clock
 With four cores and a 3.2 GHz clock:
  25.6 billion 64-bit data references per second
  12.8 billion 128-bit instruction references per second
  = 409.6 GB/s in total
DRAM bandwidth (25 GB/s) is only about 6% of this
It requires:
 Multi-port, pipelined cache memories
 Two levels of cache per core
 A shared 3rd-level cache on chip
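The bandwidth arithmetic above can be checked directly; a quick sketch of the slide's figures, where the 8-byte and 16-byte sizes come from the 64-bit and 128-bit reference widths:

```python
# Peak memory demand of a 4-core Intel Core i7 at 3.2 GHz,
# assuming two 64-bit data references per core per clock and
# one 128-bit instruction reference per core per clock.
cores, clock_hz = 4, 3.2e9
data_refs = cores * clock_hz * 2         # 25.6 billion per second
inst_refs = cores * clock_hz             # 12.8 billion per second
demand = data_refs * 8 + inst_refs * 16  # bytes per second
print(demand / 1e9)                      # 409.6 (GB/s)
print(25e9 / demand)                     # ~0.061, i.e. about 6%
```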



Cache Memory

A small, fast storage is introduced between the CPU and the slow main memory to improve average access time

It improves memory system performance by exploiting spatial and temporal locality



Cache Memory: Basic Principles
[Diagram: words are transferred between the CPU and a fast, small cache; blocks are transferred between the cache and the slow, large main memory. Main memory holds 2^n words grouped into blocks of k words each; the cache holds a much smaller number of block-sized slots]
Cache Memory: Basic Principles
Read operation (flowchart):
1. Start: receive address RA from the CPU
2. Is the block containing RA in the cache?
  Yes: fetch the RA word and deliver it to the CPU
  No: access main memory for the block containing RA;
     allocate a cache slot for the main memory block;
     load the main memory block into the cache slot;
     deliver the RA word to the CPU
3. Done
Basic Issues: Cache Size

 Small enough that the average cost per bit is close to that of main memory alone
 Large enough that the average access time is close to that of the cache alone
 Cache sizes between 1K and 512K words have been found to give optimum results



Basic Issues: Where can a Block be placed?
 As there are fewer lines in the cache than main memory blocks, a suitable technique for mapping main memory blocks onto cache lines is necessary

Mapping Functions:
 One place: Direct mapping
 Any place: Associative mapping
 A few places: Set-associative mapping
Block Identification
 Given an address, how do we find where it goes in the cache?
  Indexing
  Full search
  Limited search
 This is done by first breaking the address down into three parts:

 | tag | set index | block offset |

 Tag: used for identifying a match
 Set index: index of the set in the cache
 Block offset: offset of the addressed word within the cache block
Block Identification: Indexing
 Consider the following system:
 – Addresses are 32 bits; memory is byte-addressable
 – Block frame size is 2^2 = 4 bytes
 – Cache is 64 KB (2^16 bytes), consisting of 2^14 block frames
 – Direct-mapped: for each cache block brought in from memory, there is a single possible frame among the 2^14 available
 A 32-bit address can be decomposed as follows:

 | 16-bit tag | 14-bit index | 2-bit block offset |
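As a sketch, the 16/14/2 split above can be expressed with shifts and masks (the address values used in the checks are arbitrary):

```python
# Split a 32-bit byte address for the direct-mapped cache above:
# 64 KB cache, 4-byte blocks -> 2-bit offset, 14-bit index, 16-bit tag.
def split_address(addr):
    offset = addr & 0x3            # low 2 bits: byte within the block
    index = (addr >> 2) & 0x3FFF   # next 14 bits: one of 2**14 frames
    tag = addr >> 16               # remaining 16 bits
    return tag, index, offset
```

Reassembling the three fields with the same shifts must reproduce the original address.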



Direct Mapping
Address (s + w bits): | tag (s − r bits) | line (r bits) | word (w bits) |

 Address length = s + w bits
 Block size = 2^w words
 Number of blocks in main memory = 2^s
 Number of lines in cache memory = m = 2^r

Each cache line stores: V (1 valid bit) | TAG (16 bits) | DATA (32 bits)
The valid and tag fields are overhead bits


Direct Mapping

 i = j mod m,
where
 i = the cache line number
 j = the main memory block number
 m = the number of lines in the cache
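A small illustration of this rule (the 8-line cache size is arbitrary):

```python
# Direct mapping: memory block j may occupy only cache line j mod m.
m = 8  # hypothetical number of cache lines
lines = {j: j % m for j in [0, 5, 8, 13, 21]}
# Blocks 5, 13 and 21 all map to line 5: a program alternating
# between them keeps swapping the same line back and forth.
```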



Mapping in Direct Mapping
Cache line   Main memory blocks assigned
0            0, m, 2m, …, 2^s − m
1            1, m+1, 2m+1, …, 2^s − m + 1
…            …
m−1          m−1, 2m−1, …, 2^s − 1


Direct Mapping
Example: 64 Kb main memory, 4 Kb cache with 1024 blocks

Address: | tag (bits 15–12) | index (bits 11–2) | byte offset (bits 1–0) |

 Advantages:
  Simple
  Inexpensive
 Disadvantages:
  Fixed cache location for a given main memory word
  Two words with the same index but different tag values cannot reside in the cache simultaneously
  Vulnerable to continuous swapping
Associative Mapping: Full Search
[Diagram: the cache as an array of (tag, data) pairs, all searched in parallel]

 Associative mapping allows any main memory block to be mapped into any cache line
  Better performance
  Expensive to implement


Four Basic Questions
 Q1: Where can a block be placed in the cache?
(Block placement)
 Fully Associative, Set Associative, Direct Mapped
 Q2: How is a block found if it is in the cache?
(Block identification)
 Tag/Block
 Q3: Which block should be replaced on a miss?
(Block replacement)
 Random, LRU
 Q4: What happens on a write?
(Write strategy)
 Write Back or Write Through (with Write Buffer)
Hierarchical Memory
Organization (Contd.)

Mapping Functions
 Q1: Where can a block be placed in the
cache?
(Block placement)
 Direct Mapped, Fully Associative, Set
Associative
 Q2: How is a block found if it is in the
cache?
(Block identification)
 Tag/Block
Direct Mapping
[Diagram: direct-mapped lookup. The index field drives a decoder that selects one cache line; the stored tag is compared with the address tag, and together with the valid bit the comparison yields hit/miss; the line's data field is read out]
Fully Associative Mapping
[Diagram: fully associative lookup. The address tag is compared in parallel against every stored tag (an associative search in a content-addressable memory); a match on a valid line signals a hit and drives the data out]
Set-Associative Mapping: Limited Search
 A compromise that exhibits the strengths of both the direct and associative mappings and overcomes the disadvantages of both
 m = v × k, i = j mod v, where
  i = cache set number
  j = main memory block number
  m = number of lines in the cache
  v = number of sets
  k = number of lines in each set
 A generalization of the previous two approaches

Example (cache size = 4 Kb): Address: | tag (bits 31–10) | set (bits 9–2) | byte offset (bits 1–0) |
Set-Associative Mapping

[Diagram: two-way set-associative lookup for a 4 Kb cache; in each way the tag comparison is ANDed with the valid bit, and the way results are ORed to produce hit/miss]
Set-Associative Mapping
 Example: two-way set-associative mapping, with v = m/2 and k = 2
 For v = m and k = 1, it reduces to direct mapping
 For v = 1 and k = m, it reduces to associative mapping
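The relation m = v × k can be sketched as a small helper; the boundary cases below reproduce the two degenerate mappings described above (the 8-line cache size is arbitrary):

```python
# Block j maps to set i = j mod v; any of the set's k lines may hold it.
def set_for_block(j, m, k):
    v = m // k        # number of sets
    return j % v

# k = 1 (v = m): one line per set -> direct mapping.
# k = m (v = 1): one set holding every line -> fully associative.
```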
Size of Tags Versus Associativity

 Increasing associativity requires more


comparators, as well as more tag bits
per cache block
 The choice among direct-mapped, set-
associative and fully-associative
mapping in any memory hierarchy will
depend on cost of a miss versus the
cost of implementing associativity,
both in time and in extra hardware



Size of Tags versus Associativity
Assume: address size = 32 bits, word size = 32 bits, block size = word size, cache size = 16 Kb (2^12 blocks)

4-way set-associative:
 Address: | 20-bit tag | 10-bit index | 2-bit byte offset |
 Total no. of tag bits = 4 × 2^10 × 20
 Total no. of comparators = 4

Direct-mapped:
 Address: | 18-bit tag | 12-bit index | 2-bit byte offset |
 Total no. of tag bits = 2^12 × 18
 Total no. of comparators = 1

Fully associative:
 Address: | 30-bit tag | 2-bit byte offset |
 Total no. of tag bits = 2^12 × 30
 Total no. of comparators = 2^12
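The three tag-storage totals above can be recomputed for any associativity; a sketch of the slide's arithmetic:

```python
# Tag storage for a 16 Kb cache of one-word (4-byte) blocks and
# 32-bit addresses: 2**12 blocks in total.
ADDR_BITS, OFFSET_BITS, BLOCKS = 32, 2, 2**12

def total_tag_bits(ways):
    sets = BLOCKS // ways
    index_bits = sets.bit_length() - 1         # log2(sets)
    tag_bits = ADDR_BITS - index_bits - OFFSET_BITS
    return BLOCKS * tag_bits                   # summed over all blocks

print(total_tag_bits(1))       # 73728  = 2**12 * 18 (direct-mapped)
print(total_tag_bits(4))       # 81920  = 4 * 2**10 * 20 (4-way)
print(total_tag_bits(BLOCKS))  # 122880 = 2**12 * 30 (fully assoc.)
```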



Basic Issues: Replacement Algorithms
 Which block to be replaced on a cache
miss?
 One of the existing blocks is to be
replaced when a new block is brought in
 No alternative in case of direct mapping
 In case of associative and set-associative
mapping, several approaches are used:
 Least Recently Used (LRU)
 First-In-First-Out (FIFO)
 Least Frequently Used (LFU)
 Random
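For illustration, LRU replacement for a single k-way set can be sketched with an ordered dictionary (the tag strings and the 2-way size are made up for the example):

```python
from collections import OrderedDict

class LRUSet:
    """One k-way cache set with least-recently-used replacement."""
    def __init__(self, k):
        self.k, self.lines = k, OrderedDict()

    def access(self, tag):
        if tag in self.lines:               # hit: mark most recently used
            self.lines.move_to_end(tag)
            return True
        if len(self.lines) >= self.k:       # miss with a full set:
            self.lines.popitem(last=False)  # evict least recently used
        self.lines[tag] = True
        return False

s = LRUSet(2)
hits = [s.access(t) for t in ["A", "B", "A", "C", "B"]]
# "A" hits on its second access; "C" evicts "B" (the least recently
# used line), so the final "B" access misses again.
```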



Basic Issues: What happens on a write?
 Arises in the case of memory write requests
 Typical percentage of memory accesses that are writes: 15%
 Possible approaches:
  Write Through: the information is written to both the block in the cache and the corresponding block in main memory
   Disadvantage: generates substantial memory traffic
  Write Back: the information is written only to the block in the cache, and an 'update' (dirty) flag is set. This minimizes memory writes: main memory is modified only when the block is discarded and its 'update' flag is set. It creates a cache coherency problem in multiprocessor-based systems (later)
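The two policies can be contrasted in a toy model, with a dict standing in for main memory and a set of dirty addresses playing the role of the 'update' flags (all names here are illustrative):

```python
def write_through(cache, memory, addr, value):
    cache[addr] = value
    memory[addr] = value      # every write also reaches main memory

def write_back(cache, dirty, addr, value):
    cache[addr] = value
    dirty.add(addr)           # set the 'update' flag; memory untouched

def evict(cache, dirty, memory, addr):
    if addr in dirty:         # write back only if the flag is set
        memory[addr] = cache[addr]
        dirty.discard(addr)
    del cache[addr]
```

Repeated write-back writes to one address cost a single memory write at eviction time, which is exactly where the traffic saving comes from.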



Cache Optimization Techniques
Larger block size
 Reduces compulsory misses
 Increases capacity and conflict misses
Larger total capacity to reduce miss rate
 Increases hit time, increases power consumption
Higher associativity
 Reduces conflict misses
 Increases hit time, increases power consumption
Higher number of cache levels
 Reduces overall memory access time
Giving priority to read misses over writes
 Reduces miss penalty
Avoiding address translation in cache indexing
 Reduces hit time



Basic Issues: Block Size
[Diagram: direct-mapped lookup with a multiword block; the index selects a line via the decoder, the tag comparison yields hit/miss, and the block offset (BO) selects the requested word from among the line's data words]
Basic Issues: Block Size

 A multiword cache block gives better performance
  Takes advantage of spatial locality
  A cache block is made larger than one word of main memory
  On a miss, multiple adjacent words that are likely to be needed shortly are fetched
Basic Issues: Block Size
 As the block size increases, the hit ratio initially increases because of the principle of locality
 The miss rate may go up as the block size grows beyond some limit (when it becomes a significant fraction of the cache size):
  A larger block size reduces the number of blocks that can fit into the cache
  As a block becomes larger, each additional word is further away from the requested word


Basic Issues: Number of Caches
 Single- or two-level
  The advancement of VLSI technology made on-chip cache possible, which provides the fastest possible cache access
  This also eliminated external bus activity
  It led to two or more levels of cache: on-chip (L1) and off-chip (L2), providing still better performance
 Unified or split
  Unified: a single cache for both instructions and data
  Split: separate caches for instructions and data


Unified and Split Memory

[Diagram: in the Princeton architecture, the CPU connects to a single memory holding both program and data; in the Harvard architecture, the CPU connects to separate program and data memories]


Unified vs Split Caches
 A Load or Store instruction requires two memory accesses:
  One for the instruction itself
  One for the data
 Therefore, a unified cache causes a structural hazard!
 Modern processors use separate data and instruction L1 caches:
  As opposed to "unified" or "mixed" caches
  The CPU sends the instruction and data addresses simultaneously to the two ports
  The two caches can be configured differently: size, associativity, etc.



Unified vs Split Caches
 Separate instruction and data caches:
  Avoid the structural hazard
  Each cache can also be tailored to its specific needs

[Diagram: unified organization: processor → unified L1 cache → unified L2 cache; split organization: processor → I-cache-1 and D-cache-1 → unified L2 cache]


Cache in Intel Processors
80386: 80486:
32KB Off-chip 8KB On-chip
Direct mapped 4-way set-associative
Block size: 16 bytes Block size: 16 bytes
Write-through Write-through

Cache Pentium 4:
Two on-chip, each 8KB, Off-chip 256KB

Block size: 64 bytes Block size128 bytes

4-way set-associative 8-way set-associative


Ajit Pal, IIT Kharagpur
Alpha 21264 Data Cache
 Let us understand the operation of the Alpha 21264 data cache
 The Alpha processor's 48-bit virtual address is translated to a 44-bit physical address, which is presented to the cache:
  29 tag bits
  9 index bits
  6 offset bits
 The cache is 2-way set-associative



Case Study: The Alpha 21264 Cache

What happens on a read?



The Alpha 21264 Cache

Step 1:
The CPU generates a 44-bit (physical) address
The address is split into:
 a 29-bit tag
 a 9-bit set index (2^9 = 512 sets)
 a 6-bit block offset (2^6 = 64-byte blocks)



The Alpha 21264 Cache

Step 2:
The “right” set is selected
using the index bits



The Alpha 21264 Cache

Step 3:
The tag is compared to
both tags in the set; If a
match AND valid=1:
then a hit (If not: then a
miss)



The Alpha 21264 Cache

Step 4:
If a match, select the matching block and return the byte at the right offset
The selection of the matching block is done via a 2:1 multiplexer
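The four steps can be sketched for this 29/9/6 split; in the sketch the cache is a plain dict of sets, each a list of (valid, tag, data) ways, and the contents are made up for the example:

```python
def read(cache, addr):
    offset = addr & 0x3F                # step 1: split the 44-bit address
    index = (addr >> 6) & 0x1FF
    tag = addr >> 15
    ways = cache[index]                 # step 2: select the "right" set
    for valid, line_tag, data in ways:  # step 3: compare both tags
        if valid and line_tag == tag:
            return data[offset]         # step 4: "mux" out the byte
    return None                         # miss

# One set (index 5) with a single valid line of 64 identical bytes:
cache = {5: [(True, 0x1, bytes([0xAB]) * 64), (False, 0, b"")]}
addr = (0x1 << 15) | (5 << 6) | 3       # tag 1, index 5, offset 3
```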



Performance
Hit ratio h_i:
 probability of finding the referenced item in M_i
Miss ratio:
 (1 − h_i)
Average Memory Access Time (two levels):
 T_AMAT = h1·t1 + (1 − h1)(t1 + t2)
Effective cost per byte:
 c = (c1·s1 + c2·s2) / (s1 + s2)
Example: Performance
 Cache memory 1 Kb, main memory 1 Mb
 Fast:
  Cache hit 95% (10 ns), main memory 5% (100 ns)
  AMAT = (0.95)(10) + (0.05)(10 + 100) = 9.5 + 5.5 = 15 ns
 Large: increasing capacity:
  Closer to the size of the higher-level memory (1 Mb)
 Optimum cost:
  Cache (Rs. 1.0 per byte), main memory (Rs. 0.1 per byte)
  Average cost = (1.0 × 1K + 0.1 × 1M) / (1K + 1M) ≈ Rs. 0.1009 per byte
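The two figures above can be reproduced directly, assuming (as the example does) that the miss penalty includes the cache probe plus the memory access:

```python
h1, t_cache, t_mem = 0.95, 10e-9, 100e-9
amat = h1 * t_cache + (1 - h1) * (t_cache + t_mem)
print(amat * 1e9)             # ~15 ns

s_cache, s_mem = 1024, 2**20  # 1 Kb and 1 Mb, in bytes
c_cache, c_mem = 1.0, 0.1     # Rs. per byte
avg_cost = (c_cache * s_cache + c_mem * s_mem) / (s_cache + s_mem)
print(round(avg_cost, 4))     # ~0.1009, close to main memory's cost
```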
Summary
 Memory system in a computer is organized in a
hierarchical manner
 Cache memory is used in between the main memory
and the CPU
 Cache memory is faster and smaller than the main
memory
 Cache memory is accessed more frequently than
main memory
 There are three mapping functions
 Several replacement algorithms are possible
 There are two alternatives for write policies
 Several levels of cache memories are used
 Faster and larger memory at lower cost is achieved



Points to remember
 Where can a block be placed?
 One place
 A few places
 Any place
 How is a block found?
 Indexing
 Limited search
 Full search
 Which block to be replaced on a cache miss?
 Random
 LRU
 What happens on a write?
 Write-through
 Write-back



Thanks!
