Chapter 9
Memory Hierarchy
MEMORY
• Reality…
• Processors have cycle times of ~1 ns
• Fast DRAM has a cycle time of ~100 ns
• We have to bridge this gap for pipelining to be effective!
• Clearly fast memory is possible
– Register files made of flip-flops operate at processor speeds
– Such memory is Static RAM (SRAM)
• Tradeoff
– SRAM is fast
– Economically infeasible for large memories
• SRAM
– High power consumption
– Large area on die
– Long delays if used for large memories
– Costly per bit
• DRAM
– Low power consumption
– Suitable for Large Scale Integration (LSI)
– Small size
– Ideal for large memories
– Circa 2007, a single DRAM chip may contain up to 256 Mbits with an access time of 70 ns.
[Figure omitted. Source: http://www.storagesearch.com/semico-art1.html]
9.1 The Concept of a Cache
• It is feasible to have a small amount of fast memory and/or a large amount of slow memory.
• We want increasing speed as we get closer to the processor, and increasing size as we get farther away from the processor.
– Size advantage of DRAM
– Speed advantage of SRAM
• The CPU looks in the cache for data it seeks from main memory.
• If the data is not there, it retrieves it from main memory.
• If the cache is able to service "most" CPU requests, then effectively we get the speed of the cache at the size of main memory.
[Figure: the hierarchy as a stack; CPU on top, cache in the middle, main memory at the bottom.]
[Figure: main memory blocks 0–15 mapped onto an 8-line cache; block i goes to cache line i mod 8.]
9.6 Direct-mapped cache organization
[Figure: direct-mapped organization; with 8 cache lines, memory blocks 0–31 map to line (block mod 8), so blocks 0, 8, 16, and 24 all compete for line 0.]
[Figure: the same mapping in binary; the low-order three bits of the 5-bit block address select the cache line, so 00000, 01000, 10000, and 11000 all map to line 000.]
9.6.1 Cache Lookup
Cache_Index = Memory_Address mod Cache_Size
[Figure: lookup in the 8-line direct-mapped cache; the index selects a line, but by itself it cannot tell which of the memory blocks that map there is currently resident.]
Cache_Tag = Memory_Address / Cache_Size
[Figure: each cache line now also stores a 2-bit tag alongside its contents; the tag identifies which of the four memory blocks that map to that line is currently resident.]
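As a minimal sketch of this lookup in C (the sizes and names here are illustrative, not from the text; main_memory_read stands in for the slow path):

#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINES 8                      /* 8-line direct-mapped cache */

struct line { bool valid; uint32_t tag; uint32_t data; };
static struct line cache[CACHE_LINES];

extern uint32_t main_memory_read(uint32_t addr);   /* hypothetical slow path */

/* Direct-mapped lookup: index = addr mod CACHE_LINES and
   tag = addr / CACHE_LINES, exactly as in the formulas above. */
uint32_t cpu_read(uint32_t addr)
{
    uint32_t index = addr % CACHE_LINES;
    uint32_t tag   = addr / CACHE_LINES;
    if (cache[index].valid && cache[index].tag == tag)
        return cache[index].data;               /* hit */
    uint32_t data = main_memory_read(addr);     /* miss: fetch from memory */
    cache[index] = (struct line){ true, tag, data };
    return data;
}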
• Keeping it real!
• Assume
– 4 GB memory: 32-bit address
– 256 KB cache
– Cache is organized by words
• 1 Gword memory
• 64 Kword cache: 16-bit cache index
[Figure: the 32-bit memory address splits into a 14-bit tag, a 16-bit cache index, and a 2-bit byte offset; the tag stored at the indexed cache line is compared (=) with the address tag, and on a hit the data is sent to the CPU.]
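A sketch of the field extraction for this configuration in C (the bit layout follows from the numbers above; a 2-bit byte offset is assumed for 4-byte words):

#include <stdint.h>

/* 32-bit byte address, word-organized cache:
   [31:18] tag (14 bits), [17:2] index (16 bits), [1:0] byte offset. */
#define BYTE_OFFSET_BITS 2
#define INDEX_BITS       16                 /* 64 Kword cache */

static inline uint32_t cache_index(uint32_t addr)
{
    return (addr >> BYTE_OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}

static inline uint32_t cache_tag(uint32_t addr)
{
    return addr >> (BYTE_OFFSET_BITS + INDEX_BITS);    /* 14 bits */
}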
Question
• Does the caching concept described so far exploit
1. Temporal locality
2. Spatial locality
3. Working set
9.7 Repercussion on pipelined processor design
[Figure: the five-stage pipeline (IF, ID/RR, EXEC, MEM, WB) with pipeline buffers between stages; an I-Cache serves instruction fetch in IF and a D-Cache serves data access in MEM.]
9.8 Basic cache read/write algorithms
[Figures: four cases are illustrated in turn; read hit, read miss, write-back, and write-through.]
9.8.1 Read Access to Cache from CPU
9.8.2 Write Access to Cache from CPU
• Two choices
– Write through policy
• Write allocate
• No-write allocate
– Write back policy
9.8.2.1 Write Through Policy
• Each write goes to the cache; the tag is set and the valid bit is set.
• Each write also goes to a write buffer (see the figure below).
[Figure: the CPU sends each write (address and data) both to the cache and to a write buffer holding a few address/data entries; the buffer drains its entries to main memory in the background so the CPU does not wait for memory.]
• Write Through
– Cache logic simpler and faster
– Creates more bus traffic
• Write back
– Requires dirty bit and extra logic
• Multilevel cache processors may use both
– L1 Write through
– L2/L3 Write back
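The two policies can be contrasted with a short sketch in C (the line layout and the helpers write_buffer_enqueue and memory_write are assumptions for illustration, not from the text):

#include <stdbool.h>
#include <stdint.h>

struct line { bool valid, dirty; uint32_t tag, data; };

extern void write_buffer_enqueue(uint32_t addr, uint32_t data);  /* hypothetical */
extern void memory_write(uint32_t addr, uint32_t data);          /* hypothetical */

/* Write-through: update the cache and queue the write to memory;
   no dirty bit is needed, but every store generates bus traffic. */
void write_through(struct line *l, uint32_t tag, uint32_t addr, uint32_t data)
{
    l->valid = true; l->tag = tag; l->data = data;
    write_buffer_enqueue(addr, data);
}

/* Write-back: update only the cache and mark the line dirty;
   memory is updated when the line is eventually evicted. */
void write_back(struct line *l, uint32_t tag, uint32_t data)
{
    l->valid = true; l->dirty = true; l->tag = tag; l->data = data;
}

/* Under write-back, eviction of a dirty line must write it out. */
void evict(struct line *l, uint32_t addr)
{
    if (l->valid && l->dirty)
        memory_write(addr, l->data);
    l->valid = false;
    l->dirty = false;
}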
9.9 Dealing with cache misses in the processor pipeline
• Read miss in the MEM stage:
I1: ld r1, a ; r1 <- MEM[a]
I2: add r3, r4, r5 ; r3 <- r4 + r5
I3: and r6, r7, r8 ; r6 <- r7 AND r8
I4: add r2, r4, r5 ; r2 <- r4 + r5
I5: add r2, r1, r2 ; r2 <- r1 + r2
• Write miss in the MEM stage: The write buffer alleviates the ill effects of write misses in the MEM stage (write-through).
9.9.1 Effect of Memory Stalls Due to Cache Misses on Pipeline Performance
• ExecutionTime = NumberInstructionsExecuted × CPI_Avg × ClockCycleTime
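As a worked example (all numbers assumed for illustration only): with a base CPI of 1, data accesses by 30% of instructions, a 5% data-cache miss rate, and a 20-cycle miss penalty,
CPI_Avg = 1 + 0.30 × 0.05 × 20 = 1.3
so data-cache stalls alone would stretch execution time by 30%.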
9.10 Exploiting spatial locality to improve cache performance
[Figure: with 16-byte (4-word) cache blocks, the 32-bit address of the earlier example now splits into a 14-bit tag, a 14-bit block index, a 2-bit word offset, and a 2-bit byte offset.]
[Figure: cache contents divided among working sets (WS 2, WS 3) with some space unused.]
9.11.2 Set associative caches
Assume we have a computer with 16-bit addresses and 64 KB of memory. Further assume cache blocks are 16 bytes long and we have 128 bytes available for cache data.

Cache Type                  Sets   Ways   Tag (bits)   Index (bits)   Block Offset (bits)
Direct Mapped               8      1      9            3              4
Two-way Set Associative     4      2      10           2              4
Four-way Set Associative    2      4      11           1              4
Fully Associative           1      8      12           0              4
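The rows of this table can be reproduced mechanically; a small C sketch (parameters as in the example above):

#include <stdio.h>

/* Integer log2 for powers of two. */
static int lg2(unsigned x) { int n = 0; while (x >>= 1) n++; return n; }

int main(void)
{
    const int addr_bits  = 16;    /* 16-bit addresses: 64 KB memory */
    const int block_size = 16;    /* bytes per block */
    const int cache_data = 128;   /* bytes of cache data: 8 lines */
    const int lines      = cache_data / block_size;

    for (int ways = 1; ways <= lines; ways *= 2) {
        int sets   = lines / ways;
        int offset = lg2(block_size);               /* 4 bits */
        int index  = lg2(sets);
        int tag    = addr_bits - index - offset;
        printf("%d-way: %d sets, tag %2d, index %d, offset %d\n",
               ways, sets, tag, index, offset);
    }
    return 0;
}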
[Figure: a four-way set associative lookup; the index selects one set, the address tag is compared in parallel against the tags of all four ways (four comparators), and a 4-to-1 multiplexor steers the 32-bit data of the matching way to the output on a hit.]
9.11.3 Extremes of set associativity
[Figure: the same eight cache entries organized four ways; direct mapped (8 sets, 1 way), two-way set associative (4 sets, 2 ways), four-way set associative (2 sets, 4 ways), and fully associative (1 set, 8 ways). Each entry holds a valid bit, a tag, and data.]
9.12 Instruction and Data caches
• Would it be better to have two separate caches, or just one larger cache with a lower miss rate?
• Roughly 30% of instructions are loads/stores, so the pipeline needs two simultaneous memory accesses: one for the instruction fetch and one for the data.
• The contention caused by combining the caches would cause more problems than it would solve by lowering the miss rate.
9.13 Reducing miss penalty
• Reducing the miss penalty is desirable
• It cannot be reduced enough just by making the block size larger, due to diminishing returns
• Bus Cycle Time: the time for each data transfer between memory and processor
• Memory Bandwidth: the amount of data transferred in each cycle between memory and processor
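For illustration (timing numbers assumed, not from the text): if sending the address takes 1 bus cycle, each memory access takes 15 cycles, and each 32-bit transfer takes 1 cycle, then fetching a 4-word block one word at a time costs 1 + 4 × (15 + 1) = 65 cycles, while a memory and bus widened to the full block (see 9.20.2) would cost 1 + 15 + 1 = 17 cycles.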
9.14 Cache replacement policy
• An LRU policy typically works best when deciding which of the multiple "ways" to evict upon a cache miss
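A minimal LRU sketch in C (timestamp-based for clarity, counter wraparound ignored; real hardware typically keeps a few pseudo-LRU bits per set instead, since exact LRU is expensive):

#include <stdint.h>

#define WAYS 4

struct set { uint32_t last_used[WAYS]; };  /* per-way timestamps */
static uint32_t now;                       /* global access counter */

/* On every access that hits (or fills) a way, stamp it as newest. */
void touch(struct set *s, int way) { s->last_used[way] = ++now; }

/* On a miss, evict the way with the oldest timestamp. */
int victim(const struct set *s)
{
    int v = 0;
    for (int w = 1; w < WAYS; w++)
        if (s->last_used[w] < s->last_used[v])
            v = w;
    return v;
}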
[Figure: the CPU issues a virtual address (VA); the TLB translates it to a physical address (PA), which is then used to access the instruction or data cache.]
9.17 Cache controller
• Upon a request from the processor, looks up the cache to determine hit or miss, serving the data up to the processor in case of a hit.
• Upon a miss, initiates a bus transaction to read the missing block from deeper levels of the memory hierarchy.
• Depending on the details of the memory bus, the requested data block may arrive asynchronously with respect to the request. In this case, the cache controller receives the block and places it in the appropriate spot in the cache.
• Provides the ability for the processor to specify certain regions of memory as “uncachable.”
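The read side of this behavior, as a hedged sketch in C (every helper here, tag_match, cache_data, is_uncachable, bus_read_word, and bus_read_block, is hypothetical):

#include <stdbool.h>
#include <stdint.h>

extern bool     is_uncachable(uint32_t addr);
extern bool     tag_match(uint32_t addr);       /* valid bit and tag compare */
extern uint32_t cache_data(uint32_t addr);
extern uint32_t bus_read_word(uint32_t addr);
extern void     bus_read_block(uint32_t addr);  /* installs the block, possibly
                                                   after an asynchronous wait */

uint32_t controller_read(uint32_t addr)
{
    if (is_uncachable(addr))           /* bypass the cache entirely */
        return bus_read_word(addr);
    if (tag_match(addr))               /* hit: serve from the cache */
        return cache_data(addr);
    bus_read_block(addr);              /* miss: bus transaction for the block */
    return cache_data(addr);           /* then serve like a hit */
}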
9.18 Virtually Indexed Physically Tagged Cache
[Figure: the cache index is drawn from the page-offset bits of the virtual address, which need no translation, so the cache lookup proceeds in parallel with the TLB's translation of the VPN; the physical tag from the TLB is then compared (=?) with the tag read from the cache to determine a hit.]
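A sizing consequence worth noting (standard for this organization; the page size here is assumed for illustration): since the index must come entirely from the untranslated page-offset bits, cache size divided by associativity must not exceed the page size. With 4 KB pages, a direct-mapped virtually indexed physically tagged cache is limited to 4 KB, while a 4-way version can reach 16 KB.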
9.19 Recap of Cache Design Considerations
• Principles of spatial and temporal locality
• Hit, miss, hit rate, miss rate, cycle time, hit time, miss penalty
• Multilevel caches and design considerations thereof
• Direct mapped caches
• Cache read/write algorithms
• Spatial locality and blocksize
• Fully- and set-associative caches
• Considerations for I- and D-caches
• Cache replacement policy
• Types of misses
• TLB and caches
• Cache controller
• Virtually indexed physically tagged caches
9.20 Main memory design considerations
[Figure: CPU and cache connected over address and data buses to a main memory that is 32 bits wide.]
9.20.2 Main memory and bus to match cache block size
[Figure: the same organization with a 128-bit-wide main memory and bus, so an entire cache block moves in one transfer.]
9.20.3 Interleaved memory
[Figure: interleaved memory; the CPU/cache sends the block address to four 32-bit-wide memory banks (M0–M3) at once, and the banks return successive words of the block over a shared 32-bit data bus.]
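With interleaving, the bank access times overlap. Using the same assumed timings as in 9.13 (1 cycle for the address, 15 cycles of access time, 1 cycle per 32-bit transfer), the four banks start their accesses together and then return their words back to back, so a 4-word block costs about 1 + 15 + 4 × 1 = 20 cycles, versus 65 with a single narrow bank.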
9.21 Elements of a modern main memory system
9.21.1 Page Mode DRAM
9.22 Performance implications of memory hierarchy
[Table: type of memory, typical size, and approximate latency in CPU clock cycles to read one word of 4 bytes; entries omitted.]