
CS152

Computer Architecture and Engineering


Lecture 21

Memory Systems (recap)


Caches

November 11th, 2001


John Kubiatowicz (http.cs.berkeley.edu/~kubitron)

lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/

CS152 / Kubiatowicz
11/14/01 ©UCB Fall 2001
Recall: The Big Picture: Where are We Now?

° The Five Classic Components of a Computer

[Figure: Input and Output connect to the Processor (Control + Datapath) and to Memory]

° Today’s Topics:
• Recap last lecture
• Simple caching techniques
• Many ways to improve cache performance
• Virtual memory?

Recall: Who Cares About the Memory Hierarchy?

Processor-DRAM Memory Gap (latency)

[Plot, 1980-2000, log-scale performance from 1 to 1000: CPU (µProc) performance grows 60%/yr ("Moore's Law", 2X/1.5yr) while DRAM performance grows 9%/yr ("Less' Law?", 2X/10 yrs); the Processor-Memory Performance Gap grows 50% / year]
Recap: Memory Hierarchy of a Modern Computer System
° By taking advantage of the principle of locality:
• Present the user with as much memory as is available in the
cheapest technology.
• Provide access at the speed offered by the fastest technology.

[Figure: Processor (datapath + control, with on-chip registers and on-chip cache), backed by a second-level cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (tape)]

  Level:        Registers  Cache  Main Memory  Disk                  Tape
  Speed (ns):   1s         10s    100s         10,000,000s (10s ms)  10,000,000,000s (10s sec)
  Size (bytes): 100s       Ks     Ms           Gs                    Ts
Recap: Memory Hierarchy: Why Does it Work?
Locality!

[Figure: probability of reference plotted across the address space, 0 to 2^n - 1]

° Temporal Locality (Locality in Time):
  => Keep most recently accessed data items closer to the processor

° Spatial Locality (Locality in Space):
  => Move blocks consisting of contiguous words to the upper levels

[Figure: blocks Blk X and Blk Y moving between the upper-level memory (to/from the processor) and the lower-level memory]
Recap: Cache Performance
Execution_Time = Instruction_Count x Cycle_Time x
    (Ideal CPI + Memory_Stalls/Inst + Other_Stalls/Inst)

Memory_Stalls/Inst =
    Instruction Miss Rate x Instruction Miss Penalty +
    Loads/Inst x Load Miss Rate x Load Miss Penalty +
    Stores/Inst x Store Miss Rate x Store Miss Penalty

Average Memory Access Time (AMAT) =
    Hit Time_L1 + (Miss Rate_L1 x Miss Penalty_L1) =
    (Hit Rate_L1 x Hit Time_L1) + (Miss Rate_L1 x Miss Time_L1)
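As a sanity check on the formulas above, here is a minimal Python sketch; the miss rates and penalties plugged in are made-up example numbers for illustration, not measurements from these slides.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty."""
    return hit_time + miss_rate * miss_penalty

def memory_stalls_per_inst(imiss_rate, imiss_penalty,
                           loads_per_inst, load_miss_rate, load_penalty,
                           stores_per_inst, store_miss_rate, store_penalty):
    """Memory stall cycles per instruction, per the slide's decomposition."""
    return (imiss_rate * imiss_penalty
            + loads_per_inst * load_miss_rate * load_penalty
            + stores_per_inst * store_miss_rate * store_penalty)

# Hypothetical numbers: 1-cycle hit, 5% miss rate, 40-cycle miss penalty
print(amat(1, 0.05, 40))   # 3.0 cycles

# Hypothetical: 1% I-miss rate, 0.3 loads and 0.1 stores per inst, 5% data miss rate
print(memory_stalls_per_inst(0.01, 40, 0.3, 0.05, 40, 0.1, 0.05, 40))  # ~1.2 cycles/inst
```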

Recall: Static RAM Cell

[Figure: 6-Transistor SRAM Cell — cross-coupled inverters storing 0/1, a word line (row select), and complementary bit lines bit and bit-bar; in some designs the load transistors are replaced with pullups to save area]

° Write:
  1. Drive bit lines (bit=1, bit-bar=0)
  2. Select row
° Read:
  1. Precharge bit and bit-bar to Vdd or Vdd/2 => make sure equal!
  2. Select row
  3. Cell pulls one line low
  4. Sense amp on column detects difference between bit and bit-bar
Recall: 1-Transistor Memory Cell (DRAM)

[Figure: one transistor gated by the row select line, connecting a storage capacitor to the bit line]

° Write:
  1. Drive bit line
  2. Select row
° Read:
  1. Precharge bit line to Vdd/2
  2. Select row
  3. Cell and bit line share charges
     - Very small voltage changes on the bit line
  4. Sense (fancy sense amp)
     - Can detect changes of ~1 million electrons
  5. Write: restore the value
° Refresh
  1. Just do a dummy read to every cell.
Recall: Classical DRAM Organization (square)

[Figure: a square RAM cell array; a row decoder drives the word (row) select lines, each intersection of a word line and a bit (data) line is a 1-T DRAM cell, and sense amps plus a column selector / I/O block (fed by the column address) sit below the array]

° Row and Column Address Select 1 bit at a time
° Act of reading refreshes one complete row
  • Sense amps detect slight variations from Vdd/2 and amplify them
Recall: DRAM Read Timing

[Timing diagram: a 256K x 8 DRAM with 9 multiplexed address lines and control signals RAS_L, CAS_L, WE_L, OE_L; the row address and then the column address are presented across one DRAM read cycle, and data (D) goes from High-Z to valid Data Out after the read access time plus the output-enable delay]

° Every DRAM access begins at the assertion of RAS_L
° 2 ways to read: early or late OE_L relative to CAS
  • Early Read Cycle: OE_L asserted before CAS_L
  • Late Read Cycle: OE_L asserted after CAS_L
Recall: Fast Page Mode Operation

° Regular DRAM Organization:
  • N rows x N columns x M bits
  • Read & Write M bits at a time
  • Each M-bit access requires a RAS / CAS cycle
° Fast Page Mode DRAM:
  • Adds an N x M "SRAM" register to save a row
° After a row is read into the register:
  • Only CAS is needed to access other M-bit blocks on that row
  • RAS_L remains asserted while CAS_L is toggled

[Timing diagram: one RAS_L assertion with a row address, followed by four CAS_L pulses, each with a new column address, for the 1st through 4th M-bit accesses]
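A toy Python model shows why Fast Page Mode wins; the RAS and CAS cycle counts below are assumptions for illustration, not real device timings.

```python
RAS_CYCLES = 3   # assumed cost of the row-access (RAS) portion
CAS_CYCLES = 2   # assumed cost of the column-access (CAS) portion

def regular_dram(n_accesses):
    # every M-bit access pays a full RAS + CAS cycle
    return n_accesses * (RAS_CYCLES + CAS_CYCLES)

def fast_page_mode(n_accesses):
    # one RAS for the row, then only a CAS per access on that row
    return RAS_CYCLES + n_accesses * CAS_CYCLES

print(regular_dram(4))    # 20 cycles for four accesses
print(fast_page_mode(4))  # 11 cycles for the same four accesses
```

The savings grow with the number of accesses to the same row, which is why cache-line fills benefit so much.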
SDRAM: Synchronous DRAM?
° More complicated, on-chip controller
  • Operations synchronized to a clock
    - So, give the row address one cycle
    - Column address some number of cycles later (say 3)
    - Data comes out later still (say 2 cycles after that)
  • Burst modes
    - Typical might be a 1, 2, 4, 8, or 256-length burst
    - Thus, only give RAS and CAS once for all of these accesses
  • Multi-bank operation (on-chip interleaving)
    - Lets you overlap the startup latency (5 cycles above) of two banks
° Careful of timing specs!
  • A 10ns SDRAM may still require 50ns to get the first data!
  • A 50ns DRAM means first data out in 50ns
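The "10ns SDRAM" warning follows directly from the cycle counts above: 10ns is the clock period, and the slide's example startup latency is 5 cycles. A two-line check:

```python
CLOCK_NS = 10        # "10ns SDRAM": 10 ns is the clock period, not the latency
ROW_TO_COL = 3       # slide's example: column address ~3 cycles after the row
COL_TO_DATA = 2      # slide's example: data ~2 cycles after the column address

first_data_ns = (ROW_TO_COL + COL_TO_DATA) * CLOCK_NS
print(first_data_ns)   # 50 -- a "10ns" part still takes 50ns to first data
```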

Memory Systems: Delay is more than the raw DRAM

[Figure: the memory controller and timing logic drive an n-bit address through a DRAM controller to a bank of 2^n x 1 DRAM chips; w bits of data return through bus drivers]

Tc = Tcycle + Tcontroller + Tdriver

Main Memory Performance

° Simple:
  • CPU, Cache, Bus, Memory all the same width (32 bits)
° Wide:
  • CPU/Mux 1 word; Mux/Cache, Bus, Memory N words
    (Alpha: 64 bits & 256 bits)
° Interleaved:
  • CPU, Cache, Bus 1 word; Memory N modules
    (4 modules); example is word interleaved
Main Memory Performance

[Timing diagram: the access time is the initial portion of the longer cycle time]

° DRAM (Read/Write) Cycle Time >> DRAM (Read/Write) Access Time
  • ~2:1; why?
  • Compare with core memory: access time 750 ns, cycle time 1500-3000 ns!
° DRAM (Read/Write) Cycle Time:
  • How frequently can you initiate an access? 1/CycleTime
  • Analogy: a little kid can only ask his father for money on Saturday
° DRAM (Read/Write) Access Time:
  • How quickly will you get what you want once you initiate an access?
  • Analogy: as soon as he asks, his father will give him the money

Increasing Bandwidth - Interleaving

Access Pattern without Interleaving:
  [CPU issues Start Access for D1 to a single memory; it must wait until D1 is available before starting the access for D2]

Access Pattern with 4-way Interleaving:
  [CPU accesses Memory Bank 0, then Bank 1, Bank 2, and Bank 3 on successive cycles; by the time Bank 3's access has started, Bank 0 can be accessed again]
Main Memory Performance

° Timing model:
  • 1 cycle to send the address
  • 4 cycles access time, 10 cycles cycle time, 1 cycle to return data
  • Cache block is 4 words
° Simple miss penalty      = 4 x (1 + 10 + 1) = 48
° Wide miss penalty        = 1 + 10 + 1 = 12
° Interleaved miss penalty = 1 + 10 + 1 + 3 = 15

[Figure: words 0-15 laid out across four banks, word i in bank i mod 4 — Bank 0 holds 0, 4, 8, 12; Bank 1 holds 1, 5, 9, 13; Bank 2 holds 2, 6, 10, 14; Bank 3 holds 3, 7, 11, 15]
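The three miss penalties above can be reproduced directly from the timing model in a few lines of Python:

```python
# Slide's timing model: 1 cycle to send the address, 10-cycle memory cycle
# time, 1 cycle to return data; the cache block is 4 words.
SEND_ADDR, CYCLE, RETURN = 1, 10, 1
WORDS = 4

simple      = WORDS * (SEND_ADDR + CYCLE + RETURN)      # one word at a time
wide        = SEND_ADDR + CYCLE + RETURN                # whole block at once
interleaved = SEND_ADDR + CYCLE + RETURN + (WORDS - 1)  # banks overlap; one word returns per cycle

print(simple, wide, interleaved)   # 48 12 15
```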

Independent Memory Banks

° How many banks?
  number of banks >= number of clocks to access a word in a bank
  • For sequential accesses; otherwise you will return to the original bank
    before it has the next word ready
  • For strided accesses (access every N words), want a prime number of
    banks!!! (Address computation is easy if the prime is of the form 2^x - 1)
° Increasing DRAM capacity => fewer chips => harder to have many banks
  • Growth in bits/chip of DRAM: 50%-60%/yr
  • Nathan Myhrvold (Microsoft): mature software growth
    (33%/yr for NT) ~ growth in MB/$ of DRAM (25%-30%/yr)
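A quick Python experiment illustrates why a prime number of banks helps strided access: with the usual bank = address mod n_banks mapping, a stride that shares a factor with the bank count keeps hitting the same few banks.

```python
def banks_touched(stride, n_banks, n_accesses=32):
    """How many distinct banks does a strided access pattern reach?"""
    return len({(i * stride) % n_banks for i in range(n_accesses)})

# Stride 4 with 8 banks: only banks 0 and 4 are ever used -> constant conflicts
print(banks_touched(stride=4, n_banks=8))   # 2

# Stride 4 with 7 (prime) banks: all banks get used -> accesses spread out
print(banks_touched(stride=4, n_banks=7))   # 7
```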

Fewer DRAMs/System over Time
(from Pete MacWilliams, Intel)

Minimum PC memory size vs. DRAMs per system, by DRAM generation:

  Min PC memory | '86 1 Mb | '89 4 Mb | '92 16 Mb | '96 64 Mb | '99 256 Mb | '02 1 Gb
  4 MB          |    32    |    8     |           |           |            |
  8 MB          |          |   16     |     4     |           |            |
  16 MB         |          |          |     8     |     2     |            |
  32 MB         |          |          |     4     |     1     |            |
  64 MB         |          |          |           |     8     |     2      |
  128 MB        |          |          |           |     4     |     1      |
  256 MB        |          |          |           |           |     8      |    2

Memory per DRAM grows @ 60% / year, while memory per system grows @ 25%-30% / year, so the number of DRAMs per system keeps shrinking.
Administrative Issues
° Should be reading Chapter 7 of your book
° Second midterm: Wednesday November 28th
• Pipelining
- Hazards, branches, forwarding, CPI calculations
- (may include something on dynamic scheduling)
• Memory Hierarchy
• Possibly something on I/O (see where we get in lectures)
• Possibly something on power (Broderson Lecture)
° Lab 6: Hopefully all is well!
  • Sorry about the fact that burst writes don't work: will fix this!
  • Cache/DRAM Bus options:
    - #0 (no extra credit): 32-bit bus, 1 DRAM, 4 RAS/CAS per cache line
    - #1 (extra credit 2a): 64-bit bus, 2 DRAMs, 2 RAS/CAS per line
    - #2 (extra credit 2b): 32-bit bus, 1 DRAM, 1 RAS/CAS, 3 CAS per line
  • Be careful with option #1: a single-word write must only modify 1 DRAM!
    - This would happen with a write-through policy!
The Art of Memory System Design

[Figure: a workload or benchmark program runs on the processor, producing a reference stream
  <op,addr>, <op,addr>, <op,addr>, <op,addr>, . . .
  op: i-fetch, read, write
that flows through the memory system (cache $ backed by MEM)]

Goal: optimize the memory system organization to minimize the average memory access time for typical workloads.
Example: 1 KB Direct Mapped Cache with 32 B Blocks
° For a 2^N byte cache:
  • The uppermost (32 - N) bits are always the Cache Tag
  • The lowest M bits are the Byte Select (Block Size = 2^M)
  • On a cache miss, pull in a complete "Cache Block" (or "Cache Line")

Block address layout (1 KB cache, 32 B blocks):
  bits [31:10] Cache Tag (example: 0x50), bits [9:5] Cache Index (ex: 0x01),
  bits [4:0] Byte Select (ex: 0x00)

[Figure: 32 cache entries, each with a Valid Bit, a Cache Tag (stored as part of the cache "state"), and 32 bytes of Cache Data — Byte 0 ... Byte 31 in entry 0, Byte 32 ... Byte 63 in entry 1, up through Byte 992 ... Byte 1023 in entry 31; the example tag 0x50 is stored at index 1]
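The tag/index/byte-select split for this 1 KB, 32 B-block cache can be written out as a small Python sketch (the address used below is constructed to match the slide's example fields):

```python
CACHE_BYTES, BLOCK_BYTES = 1024, 32
OFFSET_BITS = BLOCK_BYTES.bit_length() - 1                   # 5 byte-select bits
INDEX_BITS = (CACHE_BYTES // BLOCK_BYTES).bit_length() - 1   # 32 entries -> 5 index bits

def split(addr):
    """Split a 32-bit address into (tag, index, byte_select)."""
    byte_select = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, byte_select

# Address with tag 0x50, index 0x01, byte select 0x00 (the slide's example)
addr = (0x50 << 10) | (0x01 << 5) | 0x00
print(split(addr))   # (0x50, 0x01, 0x00), printed in decimal as (80, 1, 0)
```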
Set Associative Cache
° N-way set associative: N entries for each Cache Index
  • N direct mapped caches operating in parallel
° Example: Two-way set associative cache
  • Cache Index selects a "set" from the cache
  • The two tags in the set are compared to the input in parallel
  • Data is selected based on the tag comparison result

[Figure: the Cache Index selects one set; each way has a Valid bit, Cache Tag, and Cache Data (Cache Block 0 of each way shown); the address tag (Adr Tag) feeds a comparator per way, the compare outputs Sel1/Sel0 drive a mux that picks the Cache Block, and their OR produces the Hit signal]
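A minimal sketch of the two-way lookup in Python (hardware compares both ways' tags in parallel; here they are simply both checked, and the hit signal is effectively the OR of the two compares):

```python
class TwoWaySet:
    """One set of a 2-way set associative cache: two (valid, tag, data) ways."""
    def __init__(self):
        self.ways = [{"valid": False, "tag": None, "data": None},
                     {"valid": False, "tag": None, "data": None}]

    def lookup(self, tag):
        for way in self.ways:
            if way["valid"] and way["tag"] == tag:   # per-way comparator
                return True, way["data"]             # mux selects the matching way
        return False, None                           # neither compare matched

s = TwoWaySet()
s.ways[1] = {"valid": True, "tag": 0x50, "data": "block A"}
print(s.lookup(0x50))   # (True, 'block A')
print(s.lookup(0x51))   # (False, None)
```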
Disadvantage of Set Associative Cache
° N-way Set Associative Cache versus Direct Mapped
Cache:
• N comparators vs. 1
• Extra MUX delay for the data
• Data comes AFTER Hit/Miss decision and set selection
° In a direct mapped cache, Cache Block is available
BEFORE Hit/Miss:
• Possible to assume a hit and continue. Recover later if miss.

Example: Fully Associative Cache
° Fully Associative Cache
  • Forget about the Cache Index
  • Compare the Cache Tags of all cache entries in parallel
  • Example: with a Block Size of 32 B, we need N 27-bit comparators
° By definition: Conflict Misses = 0 for a fully associative cache

[Figure: the address splits into a 27-bit Cache Tag (bits [31:5]) and a Byte Select (bits [4:0], ex: 0x01); each entry has its own comparator (=) matching its stored Cache Tag and Valid Bit against the address tag, alongside 32 bytes of Cache Data per entry]
A Summary on Sources of Cache Misses
° Compulsory (cold start or process migration, first reference):
  first access to a block
  • "Cold" fact of life: not a whole lot you can do about it
  • Note: if you are going to run "billions" of instructions,
    Compulsory Misses are insignificant
° Capacity:
  • Cache cannot contain all blocks accessed by the program
  • Solution: increase cache size
° Conflict (collision):
  • Multiple memory locations mapped to the same cache location
  • Solution 1: increase cache size
  • Solution 2: increase associativity
° Coherence (Invalidation): other process (e.g., I/O) updates memory
Design options at constant cost

                    Direct Mapped   N-way Set Associative   Fully Associative
  Cache Size        Big             Medium                  Small
  Compulsory Miss   Same            Same                    Same
  Conflict Miss     High            Medium                  Zero
  Capacity Miss     Low             Medium                  High
  Coherence Miss    Same            Same                    Same

Note: if you are going to run "billions" of instructions, Compulsory Misses are insignificant (except for streaming-media types of programs).

Recap: Four Questions for Caches and Memory Hierarchy

° Q1: Where can a block be placed in the upper level?
  (Block placement)
° Q2: How is a block found if it is in the upper level?
  (Block identification)
° Q3: Which block should be replaced on a miss?
  (Block replacement)
° Q4: What happens on a write?
  (Write strategy)

Q1: Where can a block be placed in the upper level?
° Block 12 placed in an 8-block cache:
  • Fully associative, direct mapped, 2-way set associative
  • S.A. Mapping = Block Number Modulo Number of Sets
° Fully associative: block 12 can go anywhere
° Direct mapped: block 12 can go only into block 4 (12 mod 8)
° Set associative: block 12 can go anywhere in set 0 (12 mod 4)

[Figure: three 8-block caches (block numbers 0-7), the set-associative one divided into sets 0-3; below, memory block-frame addresses 0-31 with block 12 highlighted]
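The three placement rules above can be written down directly; this sketch just enumerates the legal frames for block 12 in the slide's 8-block cache.

```python
N_BLOCKS = 8

def direct_mapped(block):
    # exactly one legal frame: block number mod number of frames
    return [block % N_BLOCKS]

def set_associative(block, n_sets):
    # anywhere within set (block mod n_sets)
    s = block % n_sets
    ways = N_BLOCKS // n_sets
    return [s * ways + w for w in range(ways)]

def fully_associative(block):
    # anywhere in the cache
    return list(range(N_BLOCKS))

print(direct_mapped(12))         # [4]      -- 12 mod 8
print(set_associative(12, 4))    # [0, 1]   -- set 0 of a 2-way cache
print(fully_associative(12))     # [0, 1, 2, 3, 4, 5, 6, 7]
```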
Q2: How is a block found if it is in the upper level?

Block Address = | Tag | Index | Block offset |
  (the Index drives Set Select; the Block offset drives Data Select)

° Direct indexing (using index and block offset), tag compares,
  or a combination
° Increasing associativity shrinks the index, expands the tag
Q3: Which block should be replaced on a miss?

° Easy for Direct Mapped
° Set Associative or Fully Associative:
  • Random
  • LRU (Least Recently Used)

Miss rates, LRU vs. Random:

  Associativity:   2-way            4-way            8-way
  Size             LRU     Random   LRU     Random   LRU     Random
  16 KB            5.2%    5.7%     4.7%    5.3%     4.4%    5.0%
  64 KB            1.9%    2.0%     1.5%    1.7%     1.4%    1.5%
  256 KB           1.15%   1.17%    1.13%   1.13%    1.12%   1.12%
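For concreteness, here is a tiny LRU sketch for one set: a hit moves the tag to the most-recently-used end, and a miss evicts the least-recently-used entry when the set is full.

```python
from collections import OrderedDict

class LRUSet:
    """One set of an N-way cache with LRU replacement."""
    def __init__(self, ways):
        self.ways, self.blocks = ways, OrderedDict()

    def access(self, tag):
        """Return True on hit, False on miss (filling/evicting as needed)."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)       # hit: now most recently used
            return True
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)    # evict the LRU entry
        self.blocks[tag] = True
        return False

s = LRUSet(2)
hits = [s.access(t) for t in ["A", "B", "A", "C", "B"]]
print(hits)   # [False, False, True, False, False] -- C evicted B, the LRU block
```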

Q4: What happens on a write?
° Write through—the information is written to both the block in the cache
  and the block in the lower-level memory.
° Write back—the information is written only to the block in the cache.
  The modified cache block is written to main memory only when it is replaced.
  • Is the block clean or dirty?
° Pros and Cons of each?
  • WT: read misses cannot result in writes
  • WB: no writes of repeated writes
° WT is always combined with write buffers so that the processor doesn't
  wait for the lower-level memory
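The "WB: no writes of repeated writes" point can be made concrete by counting lower-level memory writes for repeated stores to one cached block (a simplified model: the block stays resident and write-back pays one write at eviction):

```python
def write_through_traffic(n_writes):
    # every store also goes to lower-level memory
    return n_writes

def write_back_traffic(n_writes):
    # the dirty block is written once, when it is eventually replaced
    return 1 if n_writes > 0 else 0

print(write_through_traffic(100))   # 100 memory writes
print(write_back_traffic(100))      # 1 memory write (at replacement)
```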

Write Buffer for Write Through

[Figure: Processor <-> Cache, with a Write Buffer between the processor/cache and DRAM]

° A Write Buffer is needed between the Cache and Memory
  • Processor: writes data into the cache and the write buffer
  • Memory controller: writes contents of the buffer to memory
° Write buffer is just a FIFO:
  • Typical number of entries: 4
  • Must handle bursts of writes
  • Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle

Write Buffer Saturation

[Figure: Processor <-> Cache <-> Write Buffer <-> DRAM]

° Store frequency (w.r.t. time) > 1 / DRAM write cycle
  • If this condition exists for a long period of time (CPU cycle time too
    quick and/or too many store instructions in a row):
    - The store buffer will overflow no matter how big you make it
    - The CPU is effectively limited to the DRAM write cycle time
° Solutions for write buffer saturation:
  • Use a write back cache
  • Install a second-level (L2) cache (does this always work?)

[Figure: Processor <-> Cache <-> Write Buffer <-> L2 Cache <-> DRAM]
RAW Hazards from the Write Buffer!
° Write-Buffer Issues: could introduce a RAW hazard with memory!
  • The write buffer may contain the only copy of valid data =>
    reads to memory may get the wrong result if we ignore the write buffer
° Solutions:
  • Simply wait for the write buffer to empty before servicing reads:
    - Might increase the read miss penalty (old MIPS 1000 by 50%)
  • Check write buffer contents before a read ("fully associative"):
    - If no conflicts, let the memory access continue
    - Else grab the data from the buffer
° Can a Write Buffer help with Write Back?
  • Read miss replacing a dirty block:
    - Copy the dirty block to the write buffer while starting the read to memory
  • CPU stalls less since it restarts as soon as the read is done
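A minimal sketch of the second solution, checking write-buffer contents before the read goes to memory (the buffer match is the "fully associative" lookup; names and structure here are illustrative, not a real controller design):

```python
from collections import deque

write_buffer = deque()   # FIFO of (address, data) waiting to drain to DRAM

def read(addr, memory):
    """Service a read, forwarding from the write buffer on an address match."""
    for a, d in reversed(write_buffer):   # newest matching entry wins
        if a == addr:
            return d                      # forward from buffer: RAW hazard avoided
    return memory.get(addr, 0)            # no conflict: memory access continues

memory = {0x100: 1}
write_buffer.append((0x100, 42))          # store to 0x100 still in flight
print(read(0x100, memory))                # 42, not the stale 1 from memory
```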

Write-miss Policy: Write Allocate versus Not Allocate
° Assume: a 16-bit write to memory location 0x0 causes a miss
  • Do we allocate space in the cache and possibly read in the block?
    - Yes: Write Allocate
    - No: Not Write Allocate

[Figure: the same 1 KB direct-mapped cache as before; the write address splits into Cache Tag (example: 0x00, bits [31:10]), Cache Index (ex: 0x00, bits [9:5]), and Byte Select (ex: 0x00, bits [4:0]), missing against the stored tag 0x50]
Impact of Memory Hierarchy on Algorithms
° Today CPU time is a function of (ops, cache misses)
° What does this mean to compilers, data structures, algorithms?
  • Quicksort: fastest comparison-based sorting algorithm when keys fit in memory
  • Radix sort: also called "linear time" sort; for keys of fixed length and
    fixed radix, a constant number of passes over the data is sufficient,
    independent of the number of keys
° "The Influence of Caches on the Performance of Sorting" by A. LaMarca and
  R.E. Ladner. Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete
  Algorithms, January 1997, 370-379.
  • For an AlphaStation 250, 32-byte blocks, direct-mapped 2 MB L2 cache,
    8-byte keys, from 4,000 to 4,000,000 keys

Quicksort vs. Radix as we vary the number of keys: Instructions

[Plot: instructions/key vs. job size in keys, for quicksort and radix sort]
Quicksort vs. Radix as we vary the number of keys: Instructions & Time

[Plot: instructions/key and time/key vs. job size in keys, for quicksort and radix sort]
Quicksort vs. Radix as we vary the number of keys: Cache misses

[Plot: cache misses/key vs. job size in keys, for quicksort and radix sort]

What is the proper approach to fast algorithms?
Summary #1 / 2:

° The Principle of Locality:
  • A program is likely to access a relatively small portion of the address
    space at any instant of time.
    - Temporal Locality: Locality in Time
    - Spatial Locality: Locality in Space
° Three (+1) Major Categories of Cache Misses:
  • Compulsory Misses: sad facts of life. Example: cold start misses.
  • Conflict Misses: increase cache size and/or associativity.
    Nightmare scenario: ping-pong effect!
  • Capacity Misses: increase cache size
  • Coherence Misses: caused by external processors or I/O devices
° Cache Design Space
  • total size, block size, associativity
  • replacement policy
  • write-hit policy (write-through, write-back)
  • write-miss policy
Summary #2 / 2: The Cache Design Space

° Several interacting dimensions
  • cache size
  • block size
  • associativity
  • replacement policy
  • write-through vs write-back
  • write allocation
° The optimal choice is a compromise
  • depends on access characteristics
    - workload
    - use (I-cache, D-cache, TLB)
  • depends on technology / cost
° Simplicity often wins

[Figure: the design space sketched along axes of Cache Size, Associativity, and Block Size, with a Good/Bad tradeoff curve between Factor A and Factor B as a parameter varies from Less to More]