Anthony J Souza
April 2017
Topics And Reading
• Topics
• Technology Issues
• Cache Design
• Reading
• Patterson and Hennessy 5.1 to 5.5
CPU Speed vs. Memory Speed
• Main memory:
• Main storage for active program
• Based on Dynamic Random Access Memory (DRAM)
• In MIPS pipeline, organized into
• Instruction memory
• Data memory
• Each memory access takes one cycle
• Typical clock rate 2 GHz
• Cycle time = 500ps
Reference Material for Memory
• Memory technology:
• http://arstechnica.com/paedia/r/ram_guide/ram_guide.part1-1.html
• What are DRAM speeds?
• https://en.wikipedia.org/wiki/DDR3_SDRAM
• In 2010, the fastest DDR3 SDRAM (Double Data Rate Synchronous
DRAM) had a cycle time of < 4 ns
• Instruction memory and data memory are actually instruction cache
and data cache.
• A cache is a smaller, faster memory between the CPU and main
memory, built using static RAM (SRAM) technology.
Memory hierarchy concepts
• Levels of memory hierarchy:
• Registers
• Cache (1 to 3 levels; static RAM)
• Main memory (dynamic RAM)
• Disk / secondary storage (magnetic disk)
• SRAM: 0.5–2.5 ns access time, $2k–$5k per GB (2008)
• DRAM: 50–70 ns access time, $20–$75 per GB
• Disk: 5–20 ms access time, $0.20–$2 per GB
• Closer to CPU: small and fast, expensive
• Further from CPU: large and cheap
Memory hierarchy concepts
Suppose we are running a program that works with some data.
Most of the time, we are just
• Executing a small number of instructions
(ex.: loop, small block)
• Working with a small amount of data
(ex.: a few variables, a small piece of an array)
This is the principle of locality observed in programs.
If we keep the few instructions/data that we need in the fast storage, then
we can get to them quickly, most of the time.
Similar to keeping variables in registers!
Memory hierarchy concepts
Two types of locality
• Temporal locality
• If a memory address is referenced at the current time, the same address will
be referenced again soon.
• Examples: instructions in loops, recursive functions
• Spatial locality
• If a memory address is referenced at the current time, an address close to it
will be referenced soon.
• Examples: instructions in straight-line code, sequential array access.
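Both kinds of locality can be illustrated with a small Python sketch (the loop addresses and array base are made up for illustration, not taken from the slides):

```python
from collections import Counter

# Temporal locality: a hypothetical 3-instruction loop body executed
# 4 times fetches the same PCs over and over.
LOOP_PCS = [0x0040_0020, 0x0040_0024, 0x0040_0028]   # assumed addresses
fetches = LOOP_PCS * 4                               # 4 loop iterations
assert all(count == 4 for count in Counter(fetches).values())

# Spatial locality: a sequential array walk touches adjacent addresses.
data = [0x1001_0000 + 4 * i for i in range(8)]       # word-sized elements
deltas = [b - a for a, b in zip(data, data[1:])]
assert all(d == 4 for d in deltas)                   # neighbors, 4 bytes apart
```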
Technology Issues: DRAM vs SRAM
• Main memory constructed from DRAMs (dynamic random access
memory)
• one transistor holds one bit
• destructive reads; periodically refreshed
• DRAMs organized as dual inline memory modules
(DIMMs)
• typically 8 to 16 DRAM chips per DIMM
• usually 8 bytes wide
A 4 GB DDR4 DIMM: [photo omitted]
Operation:
• address divided into row and column
• send row address and row access strobe (RAS)
Read row of bits from memory array
• send column address and column access strobe (CAS)
Select bit(s) from within row, send out to CPU
Early DRAMs were asynchronous (no clock)
Most fast DRAMs today are synchronous
Decent reference:
http://pcguide.com/art/sdramBasics-c.html
[DRAM block diagram omitted]
Performance Considerations
• Performance considerations:
• DRAM much slower than CPU clock cycle time!
• Each memory access takes many cycles
• SRAM basics:
• Reads are not destructive; no refresh necessary
• typically 6 transistors per bit
• DRAMs 4 to 8 times denser per chip than SRAMs
• SRAMs 8 to 16 times faster than DRAMs
• SRAMs 8 to 16 times as expensive as DRAMs
Terminology
• Consider pair of connecting levels:
• Cache / main memory
• Main memory / disk
Basic Principles of Caches
• Usually, main memory much slower than CPU.
• Cache: small fast memory array close to CPU
• (AMD Opteron: 64KB L1 instruction cache, 64KB L1 data cache, 1MB
L2 cache)
• Block or line: Basic unit of information transferred between cache
and CPU
• Block size: with multi-word blocks, a miss reads the entire block from
memory; e.g., 4-word blocks read 16 bytes on each cache miss.
• Simplest: one-word block
• Since cache is much smaller than main memory,
must hash 32-bit memory address to small cache index.
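The hashing step can be sketched in Python; the defaults below match the 4KB, 1-word-block data cache used later in the trace (the function name is ours, not the textbook's):

```python
def split_address(addr, num_blocks=1024, block_bytes=4):
    """Split a 32-bit byte address into (tag, index, offset) for a
    direct-mapped cache (modulo hashing via the index bits)."""
    offset_bits = block_bytes.bit_length() - 1     # log2(block size)
    index_bits = num_blocks.bit_length() - 1       # log2(number of blocks)
    offset = addr & (block_bytes - 1)              # byte within block
    index = (addr >> offset_bits) & (num_blocks - 1)
    tag = addr >> (offset_bits + index_bits)       # remaining high bits
    return tag, index, offset

# 0x1001 0040 splits into tag 0x10010, index 0x010, offset 0.
print(split_address(0x1001_0040))
```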
Direct-Mapped Cache with one-word blocks
• Direct-mapped: Modulo hashing
• Suppose cache has 4 blocks:
memory word 0…0 00 0000 → cache block 00
memory word 0…0 01 0000 → cache block 00
memory word 0…0 10 0000 → cache block 00
The cache starts empty, so the first access is a miss. Load contents into cache. Set valid bit to 1.
Instruction Cache Trace
0x0040 0020 = 0000 0000 0100 0000 0000 0000 0010 0000 M
0x0040 0024 = 0000 0000 0100 0000 0000 0000 0010 0100 M
0x0040 0028 = 0000 0000 0100 0000 0000 0000 0010 1000
0x0040 002C = 0000 0000 0100 0000 0000 0000 0010 1100
Index 0000 0010 01 is empty, so we have a miss. Load contents into cache. Set
valid bit to 1.
Instruction Cache Trace
0x0040 0020 = 0000 0000 0100 0000 0000 0000 0010 0000 M
0x0040 0024 = 0000 0000 0100 0000 0000 0000 0010 0100 M
0x0040 0028 = 0000 0000 0100 0000 0000 0000 0010 1000 M
0x0040 002C = 0000 0000 0100 0000 0000 0000 0010 1100
Index 0000 0010 10 is empty, so we have a miss. Load contents into cache. Set
valid bit to 1.
Instruction Cache Trace
0x0040 0020 = 0000 0000 0100 0000 0000 0000 0010 0000 M
0x0040 0024 = 0000 0000 0100 0000 0000 0000 0010 0100 M
0x0040 0028 = 0000 0000 0100 0000 0000 0000 0010 1000 M
0x0040 002C = 0000 0000 0100 0000 0000 0000 0010 1100 M
Index 0000 0010 11 is empty, so we have a miss. Load contents into cache. Set
valid bit to 1.
Instruction Cache Trace
• On the first iteration of the loop, each instruction causes a cache miss.
• This causes each instruction to be pulled into the cache from
memory.
• This is slow. Pipeline stalls waiting for memory read to finish.
• On the next iteration, each instruction is already in the cache, so we
get 4 hits.
• The final iteration is all hits as well.
• This gives us 4 misses and 8 hits.
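The miss/hit count above can be checked with a minimal direct-mapped simulator (a sketch, not textbook code; sizes match the 4KB, 1-word-block cache):

```python
def simulate(addresses, num_blocks=1024, block_bytes=4):
    cache = {}                          # index -> tag (valid bit implied)
    hits = misses = 0
    for addr in addresses:
        index = (addr // block_bytes) % num_blocks
        tag = addr // (block_bytes * num_blocks)
        if cache.get(index) == tag:
            hits += 1
        else:
            misses += 1
            cache[index] = tag          # load block into cache on a miss
    return misses, hits

loop = [0x0040_0020, 0x0040_0024, 0x0040_0028, 0x0040_002C]
print(simulate(loop * 3))               # three iterations: (4, 8)
```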
Instruction Cache Trace = Iteration 2
0x0040 0020 = 0000 0000 0100 0000 0000 0000 0010 0000 H
0x0040 0024 = 0000 0000 0100 0000 0000 0000 0010 0100
0x0040 0028 = 0000 0000 0100 0000 0000 0000 0010 1000
0x0040 002C = 0000 0000 0100 0000 0000 0000 0010 1100
Index 0000 0010 00 is not empty, tags are equal. Therefore we have a hit.
Instruction Cache Trace = Iteration 2
0x0040 0020 = 0000 0000 0100 0000 0000 0000 0010 0000 H
0x0040 0024 = 0000 0000 0100 0000 0000 0000 0010 0100 H
0x0040 0028 = 0000 0000 0100 0000 0000 0000 0010 1000
0x0040 002C = 0000 0000 0100 0000 0000 0000 0010 1100
Index 0000 0010 01 is not empty, tags are equal. Therefore we have a hit.
Instruction Cache Trace = Iteration 2
0x0040 0020 = 0000 0000 0100 0000 0000 0000 0010 0000 H
0x0040 0024 = 0000 0000 0100 0000 0000 0000 0010 0100 H
0x0040 0028 = 0000 0000 0100 0000 0000 0000 0010 1000 H
0x0040 002C = 0000 0000 0100 0000 0000 0000 0010 1100
Index 0000 0010 10 is not empty, tags are equal. Therefore we have a hit.
Instruction Cache Trace = Iteration 2
0x0040 0020 = 0000 0000 0100 0000 0000 0000 0010 0000 H
0x0040 0024 = 0000 0000 0100 0000 0000 0000 0010 0100 H
0x0040 0028 = 0000 0000 0100 0000 0000 0000 0010 1000 H
0x0040 002C = 0000 0000 0100 0000 0000 0000 0010 1100 H
Index 0000 0010 11 is not empty, tags are equal. Therefore we have a hit.
Data Cache Trace
• 4KB direct-mapped data cache, 1-word blocks
• Code fragment:
lw $s4, 0($s0) # $s0 = 0x1001 0040
lw $s5, 0($s1) # $s1 = 0x1001 c040
lw $s6, 0($s0)
lw $s7, 0($s2) # $s2 = 0x1001 0840
lw $t0, 0($s0)
Data Cache Trace
• 4KB direct-mapped data cache, 1-word blocks
• Address fields:
# of blocks = 4KB / 4B = 2^12 / 2^2 = 2^10 = 1024
# of bits in index = log2(1024) = 10 (10-bit index)
# of bits in offset = log2(4) = 2 (2-bit offset)
# of bits in tag = 32 - 10 - 2 = 20 (20-bit tag)
Address layout: 20-bit tag | 10-bit index | 2-bit offset
• Code fragment:
lw $s4, 0($s0) # $s0 = 0x1001 0040
lw $s5, 0($s1) # $s1 = 0x1001 c040
lw $s6, 0($s0)
lw $s7, 0($s2) # $s2 = 0x1001 0840
lw $t0, 0($s0)
0x1001 0040 = 0001 0000 0000 0001 0000 0000 0100 0000
0x1001 c040 = 0001 0000 0000 0001 1100 0000 0100 0000
0x1001 0040 = 0001 0000 0000 0001 0000 0000 0100 0000
0x1001 0840 = 0001 0000 0000 0001 0000 1000 0100 0000
0x1001 0040 = 0001 0000 0000 0001 0000 0000 0100 0000
Data Cache Trace
0x1001 0040 = 0001 0000 0000 0001 0000 0000 0100 0000 M
0x1001 c040 = 0001 0000 0000 0001 1100 0000 0100 0000
0x1001 0040 = 0001 0000 0000 0001 0000 0000 0100 0000
0x1001 0840 = 0001 0000 0000 0001 0000 1000 0100 0000
0x1001 0040 = 0001 0000 0000 0001 0000 0000 0100 0000
OLD:     Index 0000 0100 00 (0x010): Valid = 0, Tag = none, Contents = trash
CURRENT: Index 0000 0100 00 (0x010): Valid = 1, Tag = 0001 0000 0000 0001 0000 (0x10010), Contents = some value
At index 0000 0100 00, we have a cache miss (cache slot is empty).
We read in block-size data, in this case 4 bytes.
Data Cache Trace
0x1001 0040 = 0001 0000 0000 0001 0000 0000 0100 0000 M
0x1001 c040 = 0001 0000 0000 0001 1100 0000 0100 0000 M
0x1001 0040 = 0001 0000 0000 0001 0000 0000 0100 0000
0x1001 0840 = 0001 0000 0000 0001 0000 1000 0100 0000
0x1001 0040 = 0001 0000 0000 0001 0000 0000 0100 0000
OLD:     Index 0000 0100 00 (0x010): Valid = 1, Tag = 0001 0000 0000 0001 0000 (0x10010), Contents = some value
CURRENT: Index 0000 0100 00 (0x010): Valid = 1, Tag = 0001 0000 0000 0001 1100 (0x1001C), Contents = some value
At index 0000 0100 00, we have a cache miss (tags don't match). We read in block-
size data, in this case 4 bytes.
Data Cache Trace
0x1001 0040 = 0001 0000 0000 0001 0000 0000 0100 0000 M
0x1001 c040 = 0001 0000 0000 0001 1100 0000 0100 0000 M
0x1001 0040 = 0001 0000 0000 0001 0000 0000 0100 0000 M
0x1001 0840 = 0001 0000 0000 0001 0000 1000 0100 0000
0x1001 0040 = 0001 0000 0000 0001 0000 0000 0100 0000
OLD:     Index 0000 0100 00 (0x010): Valid = 1, Tag = 0001 0000 0000 0001 1100 (0x1001C), Contents = some value
CURRENT: Index 0000 0100 00 (0x010): Valid = 1, Tag = 0001 0000 0000 0001 0000 (0x10010), Contents = some value
At index 0000 0100 00, we have a cache miss (tags don't match). We read in block-
size data, in this case 4 bytes.
Data Cache Trace
0x1001 0040 = 0001 0000 0000 0001 0000 0000 0100 0000 M
0x1001 c040 = 0001 0000 0000 0001 1100 0000 0100 0000 M
0x1001 0040 = 0001 0000 0000 0001 0000 0000 0100 0000 M
0x1001 0840 = 0001 0000 0000 0001 0000 1000 0100 0000 M
0x1001 0040 = 0001 0000 0000 0001 0000 0000 0100 0000
OLD:     Index 1000 0100 00 (0x210): Valid = 0, Tag = none, Contents = trash
CURRENT: Index 1000 0100 00 (0x210): Valid = 1, Tag = 0001 0000 0000 0001 0000 (0x10010), Contents = some value
At index 1000 0100 00, we have a cache miss (cache block is empty).
We read in block-size data, in this case 4 bytes.
Data Cache Trace
0x1001 0040 = 0001 0000 0000 0001 0000 0000 0100 0000 M
0x1001 c040 = 0001 0000 0000 0001 1100 0000 0100 0000 M
0x1001 0040 = 0001 0000 0000 0001 0000 0000 0100 0000 M
0x1001 0840 = 0001 0000 0000 0001 0000 1000 0100 0000 M
0x1001 0040 = 0001 0000 0000 0001 0000 0000 0100 0000 H
OLD:     Index 0000 0100 00 (0x010): Valid = 1, Tag = 0001 0000 0000 0001 0000 (0x10010), Contents = some value
CURRENT: unchanged; a hit does not modify the cache state
At index 0000 0100 00, we have a cache hit (tags match). We read the data directly from the cache.
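As a cross-check, replaying the five loads through a minimal direct-mapped sketch (4KB, 1-word blocks; our code, not the textbook's) reproduces the M M M M H pattern, including the conflict misses at index 0x010:

```python
def trace(addresses, num_blocks=1024, block_bytes=4):
    cache, result = {}, []
    for addr in addresses:
        index = (addr // block_bytes) % num_blocks
        tag = addr // (block_bytes * num_blocks)
        if cache.get(index) == tag:
            result.append("H")          # valid entry with matching tag
        else:
            result.append("M")          # empty slot or conflicting tag
            cache[index] = tag          # fill the block
    return result

loads = [0x1001_0040, 0x1001_C040, 0x1001_0040, 0x1001_0840, 0x1001_0040]
print(trace(loads))                     # ['M', 'M', 'M', 'M', 'H']
```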
Basic Cache Operation
• Operations for a hit (instruction cache again):
• Get memory address from PC
• Hash memory address to get index, tag.
• Use index to access cache, match tags, discover hit
• Get information directly from cache block
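The four hit steps above can be sketched as a function (the function name and the cached instruction word are illustrative):

```python
def icache_fetch(pc, cache, num_blocks=1024, block_bytes=4):
    index = (pc // block_bytes) % num_blocks      # hash address -> index
    tag = pc // (block_bytes * num_blocks)        # remaining bits -> tag
    line = cache.get(index)                       # use index to access cache
    if line is not None and line[0] == tag:       # match tags: hit
        return line[1]                            # get instruction from block
    raise LookupError("cache miss: stall and fill from memory")

# Hypothetical state: PC 0x0040 0020 hashes to index 0x008, tag 0x00400.
cache = {0x008: (0x00400, 0x2009_0004)}
print(hex(icache_fetch(0x0040_0020, cache)))
```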
Cache operation and pipelines
• Usually one instruction cache, one data cache (split caches)
• Pipeline needs up to two accesses per cycle
• If cache hit, pipeline executes one instruction per cycle.
• On cache miss, stall pipeline
• When instruction/data transferred to cache from memory, pipeline
resumes operation.
Multiword Blocks
• One-word cache blocks exploit temporal locality, not spatial locality.
• Most CPUs have more than one word per cache block
(block size and cache size usually given in bytes, not words)
• Given cache size C and block size B (both in bytes):
• Number of blocks: blocks = C / B
• Number of bits in index: I = log2(C / B)
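A quick sanity check of both formulas for the 4KB, 4-word-block example that follows:

```python
import math

C, B = 4096, 16                       # 4KB cache, 16-byte (4-word) blocks
blocks = C // B                       # blocks = C / B
index_bits = int(math.log2(blocks))   # I = log2(C / B)
print(blocks, index_bits)             # 256 blocks, 8 index bits
```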
loop: [instr 1]
[instr 2]
addi $s0, $s0, -1
bne $s0, $0, loop
• Example: 4KB direct-mapped cache, 4-word blocks
• Loop is at address 0x0040 0020
• Code fragment ($s0 initialized to 4):
loop: [instr 1]
      [instr 2]
      addi $s0, $s0, -1
      bne $s0, $0, loop
# of blocks = 4KB / 16B = 2^12 / 2^4 = 2^8 = 256
# of bits in index = log2(256) = 8 bits
# of bits in offset = 4-word blocks = 16-byte blocks, log2(16) = 4 bits
# of bits in tag = 32 - 8 - 4 = 20 bits
tag index offset
0x400020 = 0..0 0100 0000 0000 0000 0010 0000 Miss
0x400024 = 0..0 0100 0000 0000 0000 0010 0100
0x400028 = 0..0 0100 0000 0000 0000 0010 1000
0x40002C = 0..0 0100 0000 0000 0000 0010 1100
Cache miss for index 0000 0010. Therefore we must perform a read from memory,
which brings in 4 words, at addresses 0x0040 0020 through 0x0040 002C.
• Example: 4KB direct-mapped cache, 4-word blocks (continued)
tag index offset
0x400020 = 0..0 0100 0000 0000 0000 0010 0000 Miss
0x400024 = 0..0 0100 0000 0000 0000 0010 0100 Hit
0x400028 = 0..0 0100 0000 0000 0000 0010 1000
0x40002C = 0..0 0100 0000 0000 0000 0010 1100
Cache hit for index 0000 0010. Tags match. Get data at block offset 0100
• Example: 4KB direct-mapped cache, 4-word blocks (continued)
tag index offset
0x400020 = 0..0 0100 0000 0000 0000 0010 0000 Miss
0x400024 = 0..0 0100 0000 0000 0000 0010 0100 Hit
0x400028 = 0..0 0100 0000 0000 0000 0010 1000 Hit
0x40002C = 0..0 0100 0000 0000 0000 0010 1100
Cache hit for index 0000 0010. Tags match. Get data at block offset 1000
• Example: 4KB direct-mapped cache, 4-word blocks (continued)
tag index offset
0x400020 = 0..0 0100 0000 0000 0000 0010 0000 Miss
0x400024 = 0..0 0100 0000 0000 0000 0010 0100 Hit
0x400028 = 0..0 0100 0000 0000 0000 0010 1000 Hit
0x40002C = 0..0 0100 0000 0000 0000 0010 1100 Hit
Cache hit for index 0000 0010. Tags match. Get data at block offset 1100
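Re-running a minimal simulator with 16-byte blocks shows the benefit of spatial locality: three iterations of the same loop now cost a single compulsory miss (the 3-iteration count mirrors the earlier one-word trace; the slides themselves only trace one iteration):

```python
def simulate(addresses, num_blocks=256, block_bytes=16):
    cache, hits, misses = {}, 0, 0
    for addr in addresses:
        index = (addr // block_bytes) % num_blocks
        tag = addr // (block_bytes * num_blocks)
        if cache.get(index) == tag:
            hits += 1
        else:
            misses += 1
            cache[index] = tag    # one miss pulls in all 4 words of the block
    return misses, hits

loop = [0x0040_0020, 0x0040_0024, 0x0040_0028, 0x0040_002C]
print(simulate(loop * 3))         # (1, 11): one miss, eleven hits
```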
Write-through and write buffers
• So far, only consider reads and read misses.
• What about writes?
• Suppose we have a 4KB data cache, 4-byte blocks
• Consider MIPS code sequence:
lw $t0, ($s0) # $s0=0x10010008, cache index=2
add $t0, $t0, 1
sw $t0, ($s0)
With write-through, the sw updates both the cache block and main memory; the
memory write goes into a write buffer so the CPU need not wait for slow memory.
[Table of cache and memory contents after lw and after sw omitted]
• (If there are too many writes, write buffer will be full often!)
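A write-through policy can be sketched as follows (the class and its structure are illustrative, not from the text). Every store is queued for memory, which is exactly why a heavy write stream can fill the buffer:

```python
from collections import deque

class WriteThroughCache:
    def __init__(self, num_blocks=1024, block_bytes=4):
        self.cache = {}                     # index -> (tag, data)
        self.num_blocks, self.block_bytes = num_blocks, block_bytes
        self.write_buffer = deque()         # pending memory writes (FIFO);
                                            # a real CPU stalls when it fills

    def store(self, addr, data):
        index = (addr // self.block_bytes) % self.num_blocks
        tag = addr // (self.block_bytes * self.num_blocks)
        line = self.cache.get(index)
        if line is not None and line[0] == tag:
            self.cache[index] = (tag, data)     # update cache on a write hit
        # (this sketch does not allocate a block on a write miss)
        self.write_buffer.append((addr, data))  # always queue the memory update

c = WriteThroughCache()
c.store(0x1001_0008, 42)    # one store -> one buffered memory write
```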
Write Policy : Writeback
• Each cache block has dirty bit
• indicates whether cache block is different from memory block.
Example: reading an array X sequentially through a 4-block cache with 4-word
blocks. Cache state after reading X[0] through X[9]:

Index  Valid  Tag                               Contents
00     0      -                                 -
01     1      0000 0000 0001 0000 0000 0000 00  trash trash X[0] X[1]
10     1      0000 0000 0001 0000 0000 0000 00  X[2] X[3] X[4] X[5]
11     1      0000 0000 0001 0000 0000 0000 00  X[6] X[7] X[8] X[9]
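The dirty-bit mechanism can be sketched as a toy write-back cache (write-allocate is assumed; all names are ours). Stores only mark the block dirty; memory is updated when a dirty block is evicted:

```python
class WriteBackCache:
    def __init__(self, num_blocks=4, block_bytes=16):
        self.lines = {}          # index -> [valid, dirty, tag, data]
        self.num_blocks, self.block_bytes = num_blocks, block_bytes
        self.writebacks = 0      # dirty blocks written back to memory

    def store(self, addr, data):
        index = (addr // self.block_bytes) % self.num_blocks
        tag = addr // (self.block_bytes * self.num_blocks)
        line = self.lines.get(index)
        if line is not None and line[2] == tag:
            line[1], line[3] = True, data            # write hit: set dirty bit
            return
        if line is not None and line[1]:             # evicting a dirty block:
            self.writebacks += 1                     # write old block to memory
        self.lines[index] = [True, True, tag, data]  # allocate on write miss

c = WriteBackCache()
c.store(0x100, 1)   # miss: allocate block, mark dirty
c.store(0x100, 2)   # hit: update in place, no memory traffic
c.store(0x500, 3)   # conflict at same index: dirty block written back
```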
[Diagram: 2-way set-associative cache with 4 sets — 8 blocks (0–7)
organized as 4 sets (0–3) of 2 blocks each]
• On an access, first compute the index, then check all k blocks in the set
by matching k tags.
• If all k choices are full, must replace a block with new block.
• How to decide which block to replace?
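One common answer is least-recently-used (LRU) replacement; for k = 2, a single bit per set is enough to track which way was used last. A sketch of one set (our code, for illustration):

```python
class TwoWaySet:
    def __init__(self):
        self.ways = []                # cached tags, most recently used last

    def access(self, tag):
        """Return True on a hit; on a miss, evict the LRU way if full."""
        if tag in self.ways:
            self.ways.remove(tag)
            self.ways.append(tag)     # move to MRU position
            return True
        if len(self.ways) == 2:       # both ways full:
            self.ways.pop(0)          # evict the least recently used
        self.ways.append(tag)
        return False

s = TwoWaySet()
results = [s.access(t) for t in [1, 2, 1, 3, 2]]
print(results)   # [False, False, True, False, False]
```

Note the last access to tag 2 misses even though 2 was cached earlier: tag 3 evicted it, since 2 was the least recently used way at that point.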
• Disk controller handles control of disk and transfer between disk and
memory.