Professional Documents
Culture Documents
Memory System: UEEA4653: Computer Architecture
Memory System: UEEA4653: Computer Architecture
Memory System
Major Goals
Memory System
Memory Revisit
Memory System
DRAM
Memory System
Transistor:
To enable read/write to
the capacitor
A memory cell.
The first DRAM cell
was invented in 1966
by Robert Dennard, a
researcher at IBM's
Thomas J. Watson
Research Center
Capacitor:
To store the value
Electrons discharged
upon read
Memory System
SRAM
Memory System
DRAM
Slow; because capacitor charging and discharging times are long
Can be as long as hundreds of cycles w.r.t. processors speed
SRAM
Fast; because switching time at transistor level is short
May take a few cycles to a few tens of cycles, depending on the size
Mostly used as caches inside the CPU
Before, discussions always assumed that memory access takes one cycle
From now on, this assumption goes away
A memory access instruction NO LONGER takes 5 cycles
Implication performance of processor looks bad with this reality
(example in next slide)
In this chapter, well look into ways to minimize the impact
Memory System
Example
Summing up
addi
addi
L1: lw
add
addi
subi
bne
$zero, 0
$zero, 100
0($s2)
$s0, $t1
$s2, 4
$s1, 1
$zero, L1
# $s0: accumulator
# $s1: counter
# $s2: memory addr
Multicycle design: 4 cycles for addi, add, & subi; 3 cycles for bne
Before, the ideal case, each memory access took 1 cycles
Execution time = 4 + 4 + {(1 + 4) + 4 + 4 + 4 + 3}*100 = 2008
Now, in reality, assume that each memory access takes 100 cycles
Execution time = 4 + 4 + {(100 + 4) + 4 + 4 + 4 + 3}*100 = 11908
With real memory, slowdown by 1 - 2008 / 11908 = -83%
So, it is critical to find a way to bridge such performance gap
Copyright Dr. Lai An Chow
Memory System
Solution
Use a memory hierarchy exploiting the principle of locality
Copyright Dr. Lai An Chow
Memory System
10
1. Memory Hierarchy
Memory System
Principle of Locality
11
Memory System
Example
12
L1:
addi
addi
lw
add
addi
subi
bne
$s0,
$s1,
$t1,
$s0,
$s2,
$s1,
$s1,
$zero, 0
$zero, 100
0($s2)
$s0, $t1
$s2, 4
$s1, 1
$zero, L1
# $s0: accumulator
# $s1: counter
# $s2: memory addr
Memory System
Memory Hierarchy
13
size
smallest
Memory
speed
cost / bit
fastest
highest
slowest
lowest
.....
Main memory
(DRAM)
biggest
Magnetic Disk
(secondary storage)
Memory System
14
Memory Technology
$ per GB in 2004
SRAM
0.5-5 ns
$4000-$10,000
DRAM
50-70ns
$100-$200
Magnetic disk
5,000,000 20,000,000 ns
$0.50-$2
In memory hierarchy,
Different levels of memory use different technologies
Different technologies give different performance/cost trade-off
(example will be given later)
Cache was the name chosen to represent the level of the memory
hierarchy between the CPU and the main memory in the first
commercial machine
Memory System
15
CPU
L1 Cache
(SRAM)
32 Kbytes
L2 Cache
(SRAM)
1 Mbytes
Main memory
(DRAM)
2 Gbytes
On-chip
Bus
Off-chip
Magnetic Disk
(secondary storage)
Copyright Dr. Lai An Chow
160 Gbytes
Memory System
16
Initially,
Instructions and data are loaded into memory (DRAM) from disk
Upon first access,
A copy of the referenced instruction or data item is kept in cache
In subsequent accesses,
First, look for the requested item in the cache
If the item is in the cache, return the item to CPU
If NOT in the cache, look it up in the memory
WLOG, we can say
If not found in this level, look it up at next level until found
Keep (or cache) a copy of the found item at this level after use
Memory System
17
level 1
Level in the
memory hierarchy
level 2
.....
(lower)
Increasing distance
from the CPU in
access time
level n
Memory System
18
Memory System
Performance Measures
19
Memory System
20
CPU
Cache
bus
Memory
Memory System
21
CPU
L1 Cache
L2 Cache
bus
Memory
Memory System
Cache Performance
22
Average memory access latency = hit time + miss rate * miss penalty
For a configuration with a single level of cache, average memory latency
= hit time + miss ratio * miss penalty
= hit time + (1 - hit ratio) * miss penalty
For two levels of caches, average memory latency
= hit time1 + miss ratio1 * miss penalty1
= hit time1 + miss ratio1 * (hit time2 + miss ratio2 * miss penalty2)
= hit time1 +
miss ratio1 * hit time2 +
miss ratio1 * miss ratio2 * miss penalty2
Same idea can be extended to multiple levels of caches
Copyright Dr. Lai An Chow
Memory System
Example
23
Memory System
24
2. Cache
Memory System
Issues to Consider
25
Block placement:
Where is a block placed in the cache?
Block identification:
How can a block be found if it is in the cache?
Block replacement:
Upon miss, how a block in the cache is selected for replacement?
Write strategy:
When a write occurs, is the information written only to the cache?
Memory System
Implementation of Caches
26
Memory System
27
000
0 01
010
011
1 00
1 01
1 10
1 11
cache
cache location
memory
block address
00 00 1
0 01 01
01 00 1
0 11 01
10 00 1
1 01 01
1 1 001
11 1 01
Memory System
Example
28
M: cache miss
H: cache hit
0010000
set 0
set 1
set 2
set 3
set 4
set 5
set 6
set 7
0011010
0000011
3
0010110
8-entry cache
Memory System
29
byte address
bytes per block
and
byte address
Memory System
Explanation
30
Cache
32 bytes
Block 0
1152
1184
1216
Block 5
Block 37
Block 63
Copyright Dr. Lai An Chow
Memory System
10
31
Block
address
0010 1102
Miss
0011 0102
Miss
0010 1102
Hit
0010 1102
Hit
0010 0002
Miss
0000 0112
Miss
0010 0002
Hit
0010 0102
Miss
Memory System
32
Cache
# of cache sets
32 bytes
= 214 / 25
= 29
= 512
Memory System
33
Bigger cache block size, less number of cache blocks; and vice versa
Given cache size is fixed, varying cache block size changes miss rate
4 0%
3 5%
Miss ra te
3 0%
2 5%
2 0%
1 5%
1 0%
5%
0%
16
64
25 6
B loc k s iz e (byte s)
1 KB
8 KB
1 6 KB
6 4 KB
2 56 K B
Memory System
11
Disadvantage of DM Cache
34
Memory System
35
Memory System
36
Memory
block X
Fully associative
cache
cache
0 1 2 3 4 5 6 7
block number
set number
Memory System
12
# of
cache sets
37
32 bytes
N = 16K / 32 = 512
# of cache sets = N / 2
= 256
In general, for a m-way set-associative cache,
# of cache sets = cache size / cache block size / m
Copyright Dr. Lai An Chow
Memory System
38
Memory System
39
Memory System
13
40
Tag Data
Set
2
3
0
1
Memory System
41
15%
1 KB
12%
2 KB
9%
4 KB
6%
8 KB
16 KB
32 KB
3%
128 KB
64 KB
0
One-way
Two-way
Four-way
Eight-way
Associativity
Copyright Dr. Lai An Chow
Memory System
Block Identification in DM
42
Tag
byte
offset
10
17
data
index
index valid tag
data
0
1
2
1 0 21
1 0 22
1 0 23
17
32
Memory System
14
43
Memory System
44
Memory address
byte
offset
19
index
V tag
data
V tag
data
V tag
data
V tag
data
0
1
2
253
254
255
22
Questions to answer:
Which cache set?
i.e. which address bits are used?
32
4- to- 1 multiplexor
Hit
Data
Memory System
45
Memory address
tag
Copyright Dr. Lai An Chow
index
byte
offset
Memory System
15
46
Upon a memory access by CPU, a request is sent to the first level cache
If the cache reports a hit
CPU continues using the data from cache as if nothing had happened
If the cache reposts a miss, some extra work is needed
A separate controller initiates a request to refill the cache
This request is sent to the next level cache or memory
(Keep going to next level until data are found)
During the process of refilling, CPU is stalled
Stall entire CPU stops operating until the 1st-level cache is filled
Unlike interrupts, stall does not need saving the state of all registers
Memory System
47
CPU
memory request
CPU
return data
memory request
L1
Cache
return data
L1
Cache
memory request
return data
Main memory
Main memory
Magnetic Disk
(secondary storage)
Magnetic Disk
(secondary storage)
Cache hit
Cache Miss
Memory System
48
CPU
L1 Instruction
Cache
L1 Data
Cache
L2 Cache
Main memory
Magnetic Disk
(secondary storage)
Memory System
16
49
Put the data from memory in the data portion of the entry
Write the upper bits of the PC into the tag field
Turn the valid bit on
4. Restart the instruction execution at the first step
Which will re-fetch the instruction, this time finding it in the cache
On a miss, simply stall the CPU until the memory responds with instr.
Memory System
Block Replacement
50
Memory System
51
1000
0001
0010
0100
0011
0010
0110
1000
0001
0010
0100
0011
0010
0110
1000
0001
0010
0011
0010
0110
1000
0001
0011
0010
0110
1000
miss
miss
miss
miss
miss
hit
MRU
LRU
replace 0011
Copyright Dr. Lai An Chow
miss
replace 0110
Memory System
17
52
Upon miss:
Replace the victim from the LRU position with miss address
Move the miss address to the MRU position
Heuristic: the LRU is not referenced for a long time, good candidate
Upon hit:
Move the hit address to the MRU position
Then, pack the rest
Heuristic: the one just got hit should be the last one to be replaced
Memory System
53
Definition
If the content of a cache entry is modified, it is considered as dirty
If a line is dirty, what should we do upon block replacement?
It is a question about how we update the memory
Two options:
1. Write-back
2. Write-through
Memory System
Write-Back Strategy
54
Memory System
18
Write-Through Strategy
55
Memory System
Write-Back vs Write-Through
56
CPU
Write to block[0]
Write to block[4]
CPU
Write to block[0]
Write to block[4]
L1
Cache
Upon replacement, write
block[0~31] to memory
in ONE burst
L1
Cache
Upon each write, update
both cache and memory
Main memory
Write-Back
Main memory
Write-Through
Memory System
57
# of reads
read miss rate read miss penalty
program
miss penalty
program
instruction
Copyright Dr. Lai An Chow
Memory System
19
58
CPIstall
5.44
2.72
CPIperfect
2
Memory System
59
Message:
The lower the CPI, the more pronounced the impact of stall cycles
Memory is becoming the major performance bottleneck
As the performance of the underlying DRAM (main memory) is
unlikely to improve as fast as processor cycle time
Memory System
60
Cache
Memory System
20