CS152 / Kubiatowicz
11/14/01 ©UCB Fall 2001
Recall: The Big Picture: Where are We Now?
° Today’s Topics:
• Recap last lecture
• Simple caching techniques
• Many ways to improve cache performance
• Virtual memory?
Recall: Who Cares About the Memory Hierarchy?
[Figure: processor vs. DRAM performance, 1980-2000, log scale. µProc performance ("Moore's Law") grows at 60%/yr (2X/1.5yr); DRAM performance ("Less' Law?") grows at 9%/yr (2X/10 yrs); the Processor-Memory Performance Gap grows 50%/year.]
Recap: Memory Hierarchy of a Modern Computer System
° By taking advantage of the principle of locality:
• Present the user with as much memory as is available in the
cheapest technology.
• Provide access at the speed offered by the fastest technology.
[Diagram: the memory hierarchy — Processor (Control, Registers, On-Chip Cache), Second Level Cache, Main Memory, Secondary Storage (Disk), Tertiary Storage (Tape). Between any Upper Level and the Lower Level below it, blocks (Blk X, Blk Y) move to and from the processor.]
Recap: Cache Performance
Execution_Time =
Instruction_Count x Cycle_Time x
(ideal CPI + Memory_Stalls/Inst + Other_Stalls/Inst)
Memory_Stalls/Inst =
Instruction Miss Rate x Instruction Miss Penalty +
Loads/Inst x Load Miss Rate x Load Miss Penalty +
Stores/Inst x Store Miss Rate x Store Miss Penalty
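To make the formula concrete, here is a small worked example; the instruction mix, miss rates, penalties, and clock are illustrative assumptions, not numbers from the lecture:

```python
# Worked example of the performance model above.
# All numbers below are illustrative assumptions, not from the slides.
instruction_count = 1_000_000
cycle_time_ns = 2.0           # assume a 500 MHz clock
ideal_cpi = 1.0
other_stalls_per_inst = 0.0

# Memory_Stalls/Inst = sum over access types of
#   (accesses/inst) x miss rate x miss penalty (cycles)
inst_miss = 1.0 * 0.01 * 50    # 1 fetch/inst, 1% miss rate, 50-cycle penalty
load_miss = 0.30 * 0.05 * 50   # 30% loads, 5% miss rate
store_miss = 0.10 * 0.05 * 50  # 10% stores, 5% miss rate
memory_stalls_per_inst = inst_miss + load_miss + store_miss

execution_time_ns = instruction_count * cycle_time_ns * (
    ideal_cpi + memory_stalls_per_inst + other_stalls_per_inst)

print(memory_stalls_per_inst)   # stall cycles per instruction (~1.5)
print(execution_time_ns / 1e6)  # total time in milliseconds (~5.0)
```

Note that with these made-up numbers memory stalls (1.5 cycles/inst) exceed the ideal CPI of 1.0 — the hierarchy, not the pipeline, dominates.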
Recall: Static RAM Cell
° 6-Transistor SRAM Cell (word line = row select; complementary bit lines bit and bit; loads may be replaced with pullups to save area)
° Write:
1. Drive bit lines (bit=1, bit=0)
2. Select row
° Read:
1. Precharge bit and bit to Vdd or Vdd/2 => make sure equal!
2. Select row
3. Cell pulls one line low
4. Sense amp on column detects difference between bit and bit
Recall: 1-Transistor Memory Cell (DRAM)
° Write:
• 1. Drive bit line
• 2. Select row
° Read:
• 1. Precharge bit line to Vdd/2
• 2. Select row
• 3. Cell and bit line share charges
- Very small voltage changes on the bit line
• 4. Sense (fancy sense amp)
- Can detect changes of ~1 million electrons
• 5. Write: restore the value
° Refresh
• 1. Just do a dummy read to every cell.
Recall: Classical DRAM Organization (square)
[Diagram: a square RAM Cell Array — each intersection of a word (row) select line and a bit (data) line is a 1-T DRAM cell. A row decoder drives the word lines from the row address; Sense-Amps, Column Selector & I/O pick the data bit using the column address.]
° Row and Column Address Select 1 bit at a time
° Act of reading refreshes one complete row
• Sense amps detect slight variations from VDD/2 and amplify them
Recall: DRAM Read Timing
° Every DRAM access begins at the assertion of RAS_L
° 2 ways to read: early or late v. CAS
[Timing diagram: a 256K x 8 DRAM with 9 multiplexed address pins and control signals RAS_L, CAS_L, WE_L, OE_L. Within one DRAM Read Cycle, the Row Address is presented with RAS_L, then the Col Address with CAS_L. Early Read Cycle: OE_L asserted before CAS_L. Late Read Cycle: OE_L asserted after CAS_L.]
Recall: Fast Page Mode Operation
° Regular DRAM Organization:
• N rows x N columns x M-bit
• Read & Write M-bit at a time
• Each M-bit access requires a RAS/CAS cycle
° Fast Page Mode DRAM:
• N x M "SRAM" (row register) to save a row
° After a row is read into the register:
• Only CAS is needed to access other M-bit blocks on that row
• RAS_L remains asserted while CAS_L is toggled
[Timing diagram: RAS_L is asserted once with the Row Address; the 1st, 2nd, 3rd, and 4th M-bit accesses each need only a Col Address and a CAS_L pulse.]
SDRAM: Synchronous DRAM
° More complicated, on-chip controller
• Operations synchronized to clock
- So, give row address one cycle
- Column address some number of cycles later (say 3)
- Data comes out later (say 2 cycles later)
• Burst modes
- Typical might be 1,2,4,8, or 256 length burst
- Thus, only give RAS and CAS once for all of these accesses
• Multi-bank operation (on-chip interleaving)
- Lets you overlap startup latency (5 cycles above) of two banks
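The payoff of burst mode can be sketched with a toy cycle count; the 5-cycle figure is the startup latency from the example above, and the rest is an illustrative simplification:

```python
# Toy SDRAM timing: row address, ~3 cycles to column address, ~2 more
# to data => roughly 5 cycles of startup latency (as in the example
# above), then one data item per clock while a burst runs.
STARTUP = 5  # illustrative startup latency, per the slide's example

def cycles(words: int, burst: bool) -> int:
    """Cycles to read `words` data items."""
    if burst:
        return STARTUP + words      # pay startup once, then 1 word/cycle
    return words * (STARTUP + 1)    # pay startup again for every access

print(cycles(8, burst=False))  # 48 cycles: startup repaid 8 times
print(cycles(8, burst=True))   # 13 cycles: startup paid once
```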
Memory Systems: Delay more than raw DRAM
[Diagram: the n-bit address passes through a DRAM Controller and Memory Timing Controller to a 2^n x 1 DRAM chip; bus drivers carry the w-bit data back. Controller and bus-driver delays add to the raw DRAM access time.]
Main Memory Performance
° Simple:
• CPU, Cache, Bus, Memory same width (32 bits)
° Wide:
• CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits)
° Interleaved:
• CPU, Cache, Bus 1 word; Memory N Modules (4 Modules); example is word interleaved
Main Memory Performance
[Figure: DRAM cycle time.]
Increasing Bandwidth - Interleaving
° Access Pattern without Interleaving:
• CPU issues Start Access for D1 to Memory; it must wait until D1 is available before it can Start Access for D2
° Access Pattern with 4-way Interleaving (Memory Banks 0-3):
• Access Bank 0, Access Bank 1, Access Bank 2, Access Bank 3 on successive cycles
• We can Access Bank 0 again once its cycle completes
Main Memory Performance
° Timing model
• 1 to send address,
• 4 for access time, 10 cycle time, 1 to return data
• Cache Block is 4 words
° Simple M.P. = 4 x (1 + 10 + 1) = 48 cycles
° Wide M.P. = 1 + 10 + 1 = 12 cycles
° Interleaved M.P. = 1 + 10 + 1 + 3 = 15 cycles
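The three miss penalties follow directly from the timing model; as a sketch in code:

```python
# Miss penalties for a 4-word cache block under the slide's timing
# model: 1 cycle to send the address, 10-cycle DRAM cycle time
# (4 of which are access time), 1 cycle to return a word.
SEND, CYCLE, RETURN, WORDS = 1, 10, 1, 4

simple = WORDS * (SEND + CYCLE + RETURN)  # full round trip per word
wide = SEND + CYCLE + RETURN              # whole block in one transfer
interleaved = SEND + CYCLE + RETURN + (WORDS - 1)  # banks overlap;
                                                   # remaining words arrive
                                                   # on successive cycles
print(simple, wide, interleaved)  # 48 12 15
```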
Independent Memory Banks
Fewer DRAMs/System over Time
(from Pete MacWilliams, Intel)

DRAM Generation:  '86    '89    '92    '96    '99    '02
                  1 Mb   4 Mb   16 Mb  64 Mb  256 Mb 1 Gb
Minimum PC Memory Size (DRAMs per system):
  4 MB            32     8
  8 MB                   16     4
  16 MB                  8      2
  32 MB                         4      1
  64 MB                         8      2
  128 MB                               4      1
  256 MB                               8      2
° Memory per DRAM growth @ 60% / year
° Memory per System growth @ 25%-30% / year
Administrative Issues
° Should be reading Chapter 7 of your book
° Second midterm: Wednesday November 28th
• Pipelining
- Hazards, branches, forwarding, CPI calculations
- (may include something on dynamic scheduling)
• Memory Hierarchy
• Possibly something on I/O (see where we get in lectures)
• Possibly something on power (Broderson Lecture)
° Lab 6: Hopefully all is well!
• Sorry about the fact that burst writes don't work: will fix this!
• Cache/DRAM Bus options:
- #0 (no extra credit): 32-bit bus, 1 DRAM, 4 RAS/CAS per cache line
- #1 (extra cred 2a): 64-bit bus, 2 DRAMs, 2 RAS/CAS per line
- #2 (extra cred 2b): 32-bit bus, 1 DRAM, 1 RAS/CAS, 3 CAS per line
• Be careful with option #1: a single-word write must modify only 1 DRAM!
- This would happen with write-through policy!
The Art of Memory System Design
[Diagram: Workload or Benchmark programs run on the Processor, producing a reference stream <op,addr>, <op,addr>, <op,addr>, <op,addr>, . . . that the memory system must service.]
Example: 1 KB Direct Mapped Cache with 32 B Blocks
° For a 2^N byte cache:
• The uppermost (32 - N) bits are always the Cache Tag
• The lowest M bits are the Byte Select (Block Size = 2^M)
• On a cache miss, pull in the complete "Cache Block" (or "Cache Line")
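The field split for this 1 KB, 32 B-block cache can be sketched as bit arithmetic (the function name `split` is just for illustration):

```python
# Field extraction for a 1 KB direct-mapped cache with 32 B blocks:
# 1 KB / 32 B = 32 lines => 5 index bits; 32 B blocks => 5 byte-select
# bits; the remaining 32 - 10 = 22 upper bits are the Cache Tag.
BYTE_BITS, INDEX_BITS = 5, 5

def split(addr: int):
    byte_sel = addr & ((1 << BYTE_BITS) - 1)
    index = (addr >> BYTE_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BYTE_BITS + INDEX_BITS)
    return tag, index, byte_sel

# The slide's example: tag 0x50, index 0x01, byte select 0x00
addr = (0x50 << 10) | (0x01 << 5) | 0x00
print([hex(f) for f in split(addr)])  # ['0x50', '0x1', '0x0']
```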
[Diagram: the 32-bit Block Address split at bits 31..10 / 9..5 / 4..0 into Cache Tag (Example: 0x50), Cache Index (Ex: 0x01), and Byte Select (Ex: 0x00). The tag is stored as part of the cache "state" alongside each line; the cache holds 32 lines (0-31) of 32 bytes each (Byte 0 ... Byte 1023).]
Set Associative Cache
° N-way set associative: N entries for each Cache Index
• N direct mapped caches operate in parallel
[Diagram: two-way set-associative lookup — the Cache Index selects one entry in each way; the address tag (Adr Tag) is compared against both stored Cache Tags in parallel; Sel1/Sel0 drive a mux that picks the hitting way's Cache Block, and the compare results are ORed to form Hit.]
Disadvantage of Set Associative Cache
° N-way Set Associative Cache versus Direct Mapped
Cache:
• N comparators vs. 1
• Extra MUX delay for the data
• Data comes AFTER Hit/Miss decision and set selection
° In a direct mapped cache, Cache Block is available
BEFORE Hit/Miss:
• Possible to assume a hit and continue. Recover later if miss.
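The parallel tag compare and way-select mux can be sketched in software (a toy model with made-up sizes, not the lecture's hardware):

```python
# Minimal sketch of an N-way set-associative lookup. The geometry
# (16 sets, 2 ways, 32 B blocks) and names are illustrative.
class SetAssocCache:
    def __init__(self, num_sets=16, ways=2, block=32):
        self.num_sets, self.ways, self.block = num_sets, ways, block
        # each set holds one (valid, tag, data) triple per way
        self.sets = [[(False, 0, None)] * ways for _ in range(num_sets)]

    def lookup(self, addr):
        index = (addr // self.block) % self.num_sets
        tag = addr // (self.block * self.num_sets)
        # N comparators "in parallel"; the hit's way feeds the data mux
        for valid, stored_tag, data in self.sets[index]:
            if valid and stored_tag == tag:
                return True, data
        return False, None

    def fill(self, addr, data, way):
        index = (addr // self.block) % self.num_sets
        tag = addr // (self.block * self.num_sets)
        self.sets[index][way] = (True, tag, data)

c = SetAssocCache()
c.fill(0x1230, "blockA", way=0)
# 0x1230 + 16*32 maps to the same set but a different tag:
c.fill(0x1230 + 16 * 32, "blockB", way=1)
print(c.lookup(0x1230))            # (True, 'blockA')
print(c.lookup(0x1230 + 16 * 32))  # (True, 'blockB')
print(c.lookup(0x9990))            # (False, None)
```

Both addresses land in the same set — in a direct-mapped cache of the same size they would evict each other, which is the conflict-miss case discussed later.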
Example: Fully Associative
° Fully Associative Cache
• Forget about the Cache Index
• Compare the Cache Tags of all cache entries in parallel
• Example: Block Size = 32 B blocks, we need N 27-bit comparators
[Diagram: each entry's tag feeds its own comparator (=); all compare against the address tag simultaneously; the Byte Select picks bytes (Byte 32 ... Byte 63, etc.) within the 32 B block.]
A Summary on Sources of Cache Misses
° Compulsory (cold start or process migration,
first reference): first access to a block
• “Cold” fact of life: not a whole lot you can do about it
• Note: If you are going to run "billions" of instructions, Compulsory Misses are insignificant
° Capacity:
• Cache cannot contain all blocks accessed by the program
• Solution: increase cache size
° Conflict (collision):
• Multiple memory locations mapped
to the same cache location
• Solution 1: increase cache size
• Solution 2: increase associativity
Note:
If you are going to run "billions" of instructions, Compulsory Misses are insignificant
(except for streaming media types of programs).
Recap: Four Questions for Caches and Memory Hierarchy
Q1: Where can a block be placed in the upper level?
° Block 12 placed in 8 block cache:
• Fully associative, direct mapped, 2-way set associative
• S.A. Mapping = Block Number Modulo Number Sets
[Diagram: memory blocks numbered 0-31. In the 8-block cache, block 12 can go anywhere (fully associative), only in block 4 = 12 mod 8 (direct mapped), or anywhere in set 0 = 12 mod 4 (2-way set associative).]
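The mapping rule above can be computed directly (the helper `placements` is illustrative; it assumes set s occupies cache blocks s*assoc .. s*assoc+assoc-1, as in the standard figure):

```python
# Where can memory block 12 go in an 8-block cache?
# Rule from the slide: set = Block Number Modulo Number Sets.
BLOCK, CACHE_BLOCKS = 12, 8

def placements(associativity: int):
    num_sets = CACHE_BLOCKS // associativity
    s = BLOCK % num_sets
    # set s occupies `associativity` consecutive cache blocks
    return list(range(s * associativity, (s + 1) * associativity))

print(placements(1))  # direct mapped: [4]        (12 mod 8 = 4)
print(placements(2))  # 2-way: [0, 1]             (set 0 = 12 mod 4)
print(placements(8))  # fully associative: [0, 1, 2, 3, 4, 5, 6, 7]
```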
Q2: How is a block found if it is in the upper level?
[Diagram: the index field of the address performs Set Select; the block-offset field performs Data Select within the block.]
Q3: Which block should be replaced on a miss?
Q4: What happens on a write?
° Write through—The information is written to both the block in the cache and to the block in the lower-level memory.
° Write back—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
• is block clean or dirty?
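The dirty-bit mechanics of write back can be sketched with a toy one-block "cache" (an illustrative model, not the lecture's design):

```python
# Sketch of write-back with a dirty bit: memory is updated only when
# a modified (dirty) block is replaced. One block for simplicity.
class WriteBackBlock:
    def __init__(self):
        self.tag, self.data, self.dirty = None, None, False

    def write(self, tag, data, memory):
        if self.tag is not None and self.tag != tag and self.dirty:
            memory[self.tag] = self.data  # write the dirty victim back
        self.tag, self.data, self.dirty = tag, data, True

memory = {0xA: "old"}
blk = WriteBackBlock()
blk.write(0xA, "new", memory)
print(memory[0xA])               # 'old' -- memory not yet updated
blk.write(0xB, "other", memory)  # replacing dirty block 0xA...
print(memory[0xA])               # 'new' -- ...forces the write-back
```

Under write through, by contrast, `memory[0xA]` would read "new" immediately after the first write.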
Write Buffer for Write Through
[Diagram: Processor → Cache; Processor → Write Buffer → DRAM. Writes go into the cache and the write buffer; the buffer drains to DRAM so the processor need not wait on the DRAM write.]
Write Buffer Saturation
° Store frequency (w.r.t. time) > 1 / DRAM write cycle
• If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row):
- Store buffer will overflow no matter how big you make it
- The CPU Cycle Time <= DRAM Write Cycle Time
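The saturation condition can be sketched numerically (the rates below are illustrative assumptions): when the long-run store rate exceeds the drain rate of one store per DRAM write cycle, pending stores grow without bound.

```python
# If stores arrive faster than the buffer drains (1 per DRAM write
# cycle), occupancy grows linearly with time -- no finite buffer helps.
def buffer_occupancy(stores_per_cycle: float,
                     dram_write_cycles: int, cycles: int) -> float:
    """Net write-buffer occupancy after `cycles` CPU cycles."""
    drained = cycles / dram_write_cycles
    arrived = cycles * stores_per_cycle
    return max(0.0, arrived - drained)

# 15% of cycles issue a store; DRAM write takes 10 cycles (drain 0.1/cycle)
print(buffer_occupancy(0.15, 10, 1000))  # ~50 pending stores: saturated
# A 5% store rate stays below the drain rate: buffer stays empty
print(buffer_occupancy(0.05, 10, 1000))  # 0.0
```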
[Diagram: one solution — insert a second-level (L2) cache between the cache/write buffer and DRAM: Processor → Cache → L2 Cache → DRAM.]
Write-miss Policy: Write Allocate versus Not Allocate
° Assume: a 16-bit write to memory location 0x0 causes a miss
• Do we allocate space in cache and possibly read in the block?
- Yes: Write Allocate
- No: Not Write Allocate
[Diagram: the same 1 KB direct-mapped cache — for address 0x0, Cache Tag = 0x00, Cache Index = 0x00, Byte Select = 0x00.]
Impact of Memory Hierarchy on Algorithms
° Today CPU time is a function of (ops, cache misses)
° What does this mean to Compilers, Data structures, Algorithms?
• Quicksort: fastest comparison-based sorting algorithm when keys fit in memory
• Radix sort: also called "linear time" sort. For keys of fixed length and fixed radix, a constant number of passes over the data is sufficient, independent of the number of keys
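The radix-sort idea can be sketched as a minimal LSD implementation for non-negative integers (radix 256, illustrative); note that every pass streams through the entire array, which is exactly what shapes its cache behavior:

```python
# LSD radix sort, radix 256: a constant number of passes over the
# data (one per key byte), independent of the number of keys.
def radix_sort(keys, key_bytes=4):
    for shift in range(0, 8 * key_bytes, 8):  # fixed number of passes
        buckets = [[] for _ in range(256)]
        for k in keys:                        # stream the whole array
            buckets[(k >> shift) & 0xFF].append(k)
        keys = [k for b in buckets for k in b]  # stable concatenation
    return keys

print(radix_sort([170, 45, 75, 90, 2, 802, 24, 66]))
# [2, 24, 45, 66, 75, 90, 170, 802]
```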
Quicksort vs. Radix as vary number keys:
[Figure: Radix sort vs. Quicksort as the number of keys varies — panels plot instructions/key, time, and cache misses.]