
Memory Hierarchy Design

5.1 Introduction
The five classic components of a computer: control, datapath, memory, input, and output.
Control and datapath together form the processor.

Where do we fetch instructions to execute?


• Build a memory hierarchy that includes main memory and caches (internal
  memory) and the hard disk (external memory)
• Instructions are first fetched from external storage such as the hard disk and
  kept in main memory. Before they reach the CPU, they are typically brought
  into the caches

2
Memory Hierarchy
Levels of the memory hierarchy, from the upper level (smaller, faster) to the lower
level (larger, slower):

  CPU registers:  ~500 bytes,  0.25 ns access time
  Cache:          ~64 KB,      1 ns      (exchanged with memory in blocks)
  Main memory:    ~512 MB,     100 ns    (exchanged with disk in pages)
  Disk (I/O):     ~100 GB,     5 ms      (accessed in files)

Capacity grows and access time lengthens as we move down the hierarchy.
3
5.2 ABCs of Caches
• Cache:
– In this textbook it mainly means the first level of the memory
hierarchy encountered once the address leaves the CPU
– the term is also applied whenever buffering is employed to reuse commonly
occurring items, e.g., file caches, name caches, and so on
• Principle of Locality:
– Programs access a relatively small portion of the address space at
any instant of time.
• Two Different Types of Locality:
– Temporal Locality (Locality in Time): If an item is referenced, it will
tend to be referenced again soon (e.g., loops, reuse)
– Spatial Locality (Locality in Space): If an item is referenced, items
whose addresses are close by tend to be referenced soon
(e.g., straight-line code, array access)

4
Memory Hierarchy: Terminology

• Hit: data appears in some block in the cache (example: Block X)


– Hit Rate: the fraction of memory accesses found in the cache
– Hit Time: Time to access the upper level which consists of
RAM access time + Time to determine hit/miss
• Miss: data needs to be retrieved from a block in the main
memory (Block Y)
– Miss Rate = 1 - (Hit Rate)
– Miss Penalty: Time to replace a block in cache
+ Time to deliver the block to the processor
• Hit Time << Miss Penalty (e.g., 1 clock cycle vs. 40 clock cycles)

Figure: the processor reads/writes Block X directly in the cache; on a miss, Block Y
is transferred from main memory into the cache.
5
Cache Measures
CPU execution time including cache performance:
CPU execution time = (CPU clock cycles + Memory stall cycles)
* Clock cycle time
Memory stall cycles: number of cycles during which the CPU is stalled
waiting for a memory access
Memory stall clock cycles = Number of misses * miss penalty
= IC*(Misses/Instruction)*Miss penalty
= IC*(Memory accesses/Instruction)*Miss rate*Miss penalty
= IC * Reads per instruction * Read miss rate * Read miss penalty
+IC * Writes per instruction * Write miss rate * Write miss penalty

Memory access consists of fetching instructions and reading/writing data
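
A small illustrative sketch of these formulas in C (not from the original slides; the
function and parameter names are made up):

#include <stdio.h>

/* Memory stall cycles = IC * (memory accesses/instruction) * miss rate * miss penalty */
double memory_stall_cycles(double ic, double accesses_per_instr,
                           double miss_rate, double miss_penalty)
{
    return ic * accesses_per_instr * miss_rate * miss_penalty;
}

/* CPU execution time = (CPU clock cycles + memory stall cycles) * clock cycle time */
double cpu_time(double ic, double cpi, double stall_cycles, double cycle_time)
{
    return (ic * cpi + stall_cycles) * cycle_time;
}

int main(void)
{
    /* Numbers from the example on the next slide: CPI = 1.0, 1.5 accesses per
       instruction, 2% miss rate, 25-cycle miss penalty, IC and cycle time = 1. */
    double stalls = memory_stall_cycles(1.0, 1.5, 0.02, 25.0);
    printf("CPU time = %.2f\n", cpu_time(1.0, 1.0, stalls, 1.0));   /* prints 1.75 */
    return 0;
}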

6
P.395 Example
Example Assume we have a computer where the CPI is 1.0 when all memory accesses
hit the cache. The only data accesses are loads and stores, and these total 50% of
the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%,
how much faster would the computer be if all instructions are in the cache?
Answer:
(A) If instructions always hit in the cache, CPI=1.0, no memory stalls, then
CPU(A) = (IC*CPI + 0)*clock cycle time = IC*clock cycle time
(B) If the miss rate is 2% and CPI = 1.0, we need to calculate memory stalls.
memory stall = IC*(Memory accesses/Instruction)*miss rate* miss penalty
= IC*(1+50%)*2%*25 = IC*0.75
then CPU(B) = (IC + IC*0.75)* Clock cycle time
= 1.75*IC*clock cycle time
The performance ratio is the inverse ratio of the CPU execution
times:
CPU(B)/CPU(A) = 1.75
The computer with no cache miss is 1.75 times faster.

7
Four Memory Hierarchy Questions

Q1 (block placement):
Where can a block be placed in the upper level?
Q2 (block identification):
How is a block found if it is in the upper level?
Q3 (block replacement):
Which bock should be replaced on a miss?
Q4 (write strategy):
What happens on a write?

8
Q1(block placement): Where can a block be placed?
Direct mapped: (Block number) mod (Number of blocks in cache)
Set associative: (Block number) mod (Number of sets in cache)
– # of sets ≤ # of blocks
– n-way: n blocks in a set
– 1-way = direct mapped
Fully associative: # of sets = 1

Example: block 12 placed in an 8-block cache (a small sketch follows)
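
A minimal C sketch of the three placement policies for this example (illustrative only;
not from the original slides):

#include <stdio.h>

int main(void)
{
    int block = 12, cache_blocks = 8;

    /* Direct mapped: (block number) mod (number of blocks in cache) */
    printf("direct mapped: frame %d\n", block % cache_blocks);             /* frame 4 */

    /* 2-way set associative: 8 blocks / 2 ways = 4 sets;
       the block may go in either way of set (block number) mod (number of sets) */
    printf("2-way set associative: set %d\n", block % (cache_blocks / 2)); /* set 0 */

    /* Fully associative: one set, so the block may go in any of the 8 frames */
    printf("fully associative: any frame\n");
    return 0;
}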

9
Simplest Cache: Direct Mapped (1-way)
Figure: a 16-block memory (block numbers 0–F) mapped onto a 4-block direct-mapped
cache (cache indexes 0–3). Each memory block has only one place it can appear in the
cache. The mapping is usually
(Block address) MOD (Number of blocks in cache)

10
Example: 1 KB Direct Mapped Cache, 32B Blocks
For a 2^N byte direct-mapped cache with 2^M byte blocks:
– The uppermost (32 - N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2^M)
– The remaining (N - M) bits are the Cache Index

Figure: address bits <31:10> form the Cache Tag (example: 0x50), bits <9:5> the
Cache Index (example: 0x01), and bits <4:0> the Byte Select (example: 0x00).
The cache has 32 entries (indexes 0–31); each entry holds a Valid bit, the Cache Tag
(stored as part of the cache "state"), and 32 bytes of Cache Data (e.g., entry 0 holds
bytes 0–31 of its block, entry 1 bytes 32–63, …, entry 31 bytes 992–1023).
11
Q2 (block identification): How is a block found?
Three portions of an address in a set-associative or direct-mapped cache

|          Block Address           |  Block Offset  |
|    Tag    |   Cache/Set Index    |  (Block Size)  |

The Block Offset selects the desired data from the block, the index field selects
the set, and the tag field is compared against the CPU address for a hit
• Use the Cache Index to select the cache set
• Check the Tag on each block in that set
– No need to check index or block offset
– A valid bit is added to the Tag to indicate whether or not this entry
contains a valid address
• Select the desired bytes using Block Offset

Increasing associativity ↑ => shrinks index↓ expands tag ↑
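
A sketch of how these fields can be extracted in C (illustrative only; the helper name
and parameters are made up). The usage plugs in the 1 KB direct-mapped cache with
32-byte blocks from the earlier slide:

#include <stdint.h>
#include <stdio.h>

/* Split a 32-bit address into tag, set index, and block offset.
   cache_bytes and block_bytes must be powers of two; assoc is the number of ways. */
static void split_address(uint32_t addr, uint32_t cache_bytes,
                          uint32_t block_bytes, uint32_t assoc)
{
    uint32_t sets        = cache_bytes / (block_bytes * assoc);
    uint32_t offset_bits = __builtin_ctz(block_bytes);   /* log2(block size)     */
    uint32_t index_bits  = __builtin_ctz(sets);          /* log2(number of sets) */

    uint32_t offset = addr & (block_bytes - 1);
    uint32_t index  = (addr >> offset_bits) & (sets - 1);
    uint32_t tag    = addr >> (offset_bits + index_bits);

    printf("tag=0x%x index=0x%x offset=0x%x\n", tag, index, offset);
}

int main(void)
{
    /* 1 KB direct-mapped cache, 32-byte blocks: 32 sets, 5 index bits, 5 offset bits */
    split_address(0x00014020u, 1024, 32, 1);   /* prints tag=0x50 index=0x1 offset=0x0 */
    return 0;
}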

12
Example: Two-way set associative cache
• Cache Index selects a “set” from the cache
• The two tags in the set are compared in parallel
• Data is selected based on the tag result
Figure: the address fields are as before (Cache Tag, e.g., 0x50; Cache Index, e.g.,
0x01; Byte Select, e.g., 0x00). The Cache Index selects one set; the two Cache Tags in
that set (qualified by their Valid bits) are compared with the address tag in parallel,
the compare results are ORed to produce the Hit signal, and a multiplexer driven by
the compare results selects the matching Cache Block's data.

13
Disadvantage of Set Associative Cache
• N-way Set Associative Cache v.s. Direct Mapped Cache:
– N comparators vs. 1
– Extra MUX delay for the data
– Data comes AFTER Hit/Miss
• In a direct mapped cache, Cache Block is available BEFORE Hit/Miss:
– Possible to assume a hit and continue. Recover later if miss.

Figure: the same two-way set-associative organization as on the previous slide; the
data multiplexer is controlled by the tag comparison, so data arrives only after
Hit/Miss is known.

14
Q3 (block replacement): Which block should be
replaced on a cache miss?
• Easy for Direct Mapped – hardware decisions are simplified
  Only one block frame is checked and only that block can be replaced
• Set Associative or Fully Associative
  There are many blocks to choose from on a miss to replace
• Three primary strategies for selecting a block to be replaced
  – Random: randomly selected
  – LRU: the Least Recently Used block is removed
  – FIFO (First In, First Out)
Data cache misses per 1000 instructions for various replacement strategies
Associativity: 2-way 4-way 8-way
Size LRU Random FIFO LRU Random FIFO LRU Random FIFO
16 KB 114.1 117.3 115.5 111.7 115.1 113.3 109.0 111.8 110.4
64 KB 103.4 104.3 103.9 102.4 102.3 103.1 99.7 100.5 100.3
256 KB 92.2 92.1 92.5 92.1 92.1 92.5 92.1 92.1 92.5
There is little difference between LRU and random for the largest cache size, with
LRU outperforming the others for the smaller caches. FIFO generally outperforms
random for the smaller cache sizes.
15
Q4(write strategy): What happens on a write?
Reads dominate processor cache accesses:
e.g., writes are only about 7% of overall memory traffic and about 21% of data
cache accesses.
Two options when writing to the cache:
• Write through — the information is written to both the block in the
  cache and to the block in the lower-level memory.
• Write back — the information is written only to the block in the cache.
  The modified cache block is written to main memory only when it is
  replaced.
To reduce the frequency of writing back blocks on replacement, a dirty
bit is used to indicate whether the block was modified in the cache
(dirty) or not (clean). If the block is clean, it is not written back, since the
lower level already holds identical information.
Pros and cons:
• WT: simple to implement; the cache is always clean, so read misses
  cannot result in writes.
• WB: writes occur at the speed of the cache, and multiple writes within
  a block require only one write to the lower-level memory.
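
A schematic C sketch of the two write-hit policies and the dirty bit (simplified and
illustrative; a real controller differs, and all names here are made up):

#include <stdbool.h>

#define BLOCK_SIZE 64

struct cache_block {
    bool          valid;
    bool          dirty;                 /* meaningful only for write back */
    unsigned long tag;
    unsigned char data[BLOCK_SIZE];
};

/* Stub standing in for the lower-level memory (illustration only). */
static void lower_level_write(unsigned long tag, const unsigned char *data)
{
    (void)tag; (void)data;               /* a real controller updates memory here */
}

/* Write hit, write through: update the cache block and the lower level. */
void write_hit_wt(struct cache_block *b, int offset, unsigned char byte)
{
    b->data[offset] = byte;
    lower_level_write(b->tag, b->data);  /* every write also goes down */
}

/* Write hit, write back: update the cache block only and mark it dirty. */
void write_hit_wb(struct cache_block *b, int offset, unsigned char byte)
{
    b->data[offset] = byte;
    b->dirty = true;                     /* written back only on replacement */
}

/* On replacement, a write-back cache writes the block out only if it is dirty. */
void replace_wb(struct cache_block *b)
{
    if (b->valid && b->dirty)
        lower_level_write(b->tag, b->data);
    b->valid = false;
    b->dirty = false;
}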
16
Write Stall and Write Buffer
• When the CPU must wait for writes to complete during WT, the CPU is
  said to write stall
• A common optimization to reduce write stalls is a write buffer, which
  allows the processor to continue as soon as the data are written to the
  buffer, thereby overlapping processor execution with memory updating

Figure: the Processor writes into the Cache and into the Write Buffer; the Write
Buffer drains to DRAM.
• A Write Buffer is needed between the Cache and Memory
– Processor: writes data into the cache and the write buffer
– Memory controller: write contents of the buffer to memory
• Write buffer is just a FIFO:
– Typical number of entries: 4

17
Write-Miss Policy: Write Allocate vs. Not Allocate

Two options on a write miss:

• Write allocate – the block is allocated on a write miss, followed
  by the write hit actions
  – Write misses act like read misses

• No-write allocate – write misses do not affect the cache; the
  block is modified only in the lower-level memory
  – Blocks stay out of the cache under no-write allocate until the program
    tries to read them, but with write allocate even blocks that are only
    written will still be in the cache

18
Write-Miss Policy Example

Example: Assume a fully associative write-back cache with many cache entries
that starts empty. Below is a sequence of five memory operations.
Write Mem[100];
Write Mem[100];
Read Mem[200];
Write Mem[200];
Write Mem[100].
What are the number of hits and misses (inclusive reads and writes) when
using no-write allocate versus write allocate?
Answer:
No-write allocate:                     Write allocate:
Write Mem[100];  1 write miss          Write Mem[100];  1 write miss
Write Mem[100];  1 write miss          Write Mem[100];  1 write hit
Read  Mem[200];  1 read miss           Read  Mem[200];  1 read miss
Write Mem[200];  1 write hit           Write Mem[200];  1 write hit
Write Mem[100];  1 write miss          Write Mem[100];  1 write hit
                 4 misses; 1 hit                        2 misses; 3 hits
19
The organization of the data cache in the Opteron microprocessor. The 64 KB cache is two-way set associative with 64-byte
blocks. The 9-bit index selects among 512 sets. The four steps of a read hit, shown as circled numbers in order of occurrence,
label this organization. Three bits of the block offset join the index to supply the RAM address to select the proper 8 bytes. Thus,
the cache holds two groups of 4096 64-bit words, with each group containing half of the 512 sets. Although not exercised in this
example, the line from lower-level memory to the cache is used on a miss to load the cache. The size of address leaving the
processor is 40 bits because it is a physical address and not a virtual address. Figure on page C-45 explains how the Opteron
maps from virtual (48-bit) to physical (40-bit) addresses for a cache access.
The Opteron Data Cache

• 64 KB cache
• 2-way set associative
• 64 byte block
• Address fields: Tag <25> | Index <9> | Block offset <6>
  – 64 KB / (64 bytes/block) = 1 K blocks; 1 K blocks / 2 ways = 512 sets = 2^9,
    i.e., a 9-bit index
  – 64-byte blocks = 2^6 bytes, i.e., a 6-bit block offset
  – 25 + 9 + 6 = 40-bit physical address

21
Impact of Memory Access on CPU
Performance

• Example: Let’s use an in-order execution computer.


Assume the cache miss penalty is 200 clock cycles,
and all instructions normally take 1.0 clock cycles
(ignoring memory stalls). Assume the average miss
rate is 2%, there is an average of 1.5 memory
references per instruction, and the average number
of cache misses per 1000 instructions is 30. What is
the impact on performance when behavior of the
cache is included? Calculate the impact using both
misses per instruction and miss rate.
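
A worked solution (not on the original slide; it follows the method of the preceding slides):
Using misses per instruction:
  Memory stall cycles per instruction = (30/1000) x 200 = 6.0
Using miss rate:
  Memory stall cycles per instruction = 1.5 x 2% x 200 = 6.0
CPI including memory stalls = 1.0 + 6.0 = 7.0,
so the computer with a perfect cache would be 7 times faster.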

22
Impact of Cache Organizations on CPU Performance
Example 1: What is the impact of two different cache organizations (direct
mapped vs. 2-way set associative) on the performance of a CPU?
– Ideal CPI = 2.0 (ignoring memory stalls)
– Clock cycle time is 1.0 ns
– Avg. memory references per instruction is 1.5
– Cache size: 64 KB, block size: 64 bytes
– For set-associative, assume the clock cycle time is stretched 1.25 times to
accommodate the selection multiplexer
– Cache miss penalty is 75 ns
– Hit time is 1 clock cycle
– Miss rate: direct mapped 1.4%; 2-way set-associative 1.0%. Calculate AMAT
and then processor performance.
Answer:
• Avg. memory access time (1-way) = 1.0 + (0.014 x 75) = 2.05 ns
  Avg. memory access time (2-way) = 1.0 x 1.25 + (0.01 x 75) = 2.00 ns

• CPU time (1-way) = IC x (CPI execution + Miss rate x Memory accesses per instruction
                      x Miss penalty) x Clock cycle time
                   = IC x (2.0 x 1.0 + (1.5 x 0.014 x 75)) = 3.58 x IC ns
  CPU time (2-way) = IC x (2.0 x 1.0 x 1.25 + (1.5 x 0.01 x 75)) = 3.63 x IC ns

23
• Example 2: What is the impact of two different cache organizations
(direct mapped vs. 2-way set associative) on the performance of a
CPU?
– Ideal CPI = 1.6 (ignoring memory stalls)
– Clock cycle time is 0.35 ns
– Avg. memory references per instruction is 1.4
– Cache size: 128 KB, block size: 64 bytes
– For set-associative, assume the clock cycle time is stretched 1.35
times to accommodate the selection multiplexer
– Cache miss penalty is 65 ns
– Hit time is 1 clock cycle
– Miss rate: direct mapped 2.1%; 2-way set-associative 1.9%.
Calculate AMAT and then processor performance.
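
A possible worked solution (not on the original slide; same steps as Example 1):
  Avg. memory access time (1-way) = 0.35 + (0.021 x 65) = 1.72 ns
  Avg. memory access time (2-way) = 0.35 x 1.35 + (0.019 x 65) = 1.71 ns
  CPU time (1-way) = IC x (1.6 x 0.35 + (1.4 x 0.021 x 65)) ≈ 2.47 x IC ns
  CPU time (2-way) = IC x (1.6 x 0.35 x 1.35 + (1.4 x 0.019 x 65)) ≈ 2.49 x IC ns
As in Example 1, the 2-way cache has the lower average memory access time, but the
direct-mapped cache gives the slightly better CPU time.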

24
Summary of Performance Equations

25
Types of Cache Misses
• Compulsory (cold start or process migration, first reference): first
access to a block
– The very first access to a block cannot be in the cache, so the block must
be brought into the cache.
– These misses occur even in an infinite cache.
• Capacity:
– The cache cannot contain all blocks accessed by the program, so blocks must be
discarded and later retrieved
– These misses occur even in a fully associative cache
– Solution: increase cache size (an upper level that is too small causes thrashing)
• Conflict (collision):
– Multiple memory locations map to the same cache location when the
block placement strategy is set associative or direct mapped.
– A fully associative cache has no conflict misses
– Solution 1: increase cache size
– Solution 2: increase associativity

26
6 Basic Cache Optimizations

AMAT = Hit Time + Miss Rate * Miss Penalty

• Reducing Miss Rate
  1. Larger Block size (compulsory misses)
  2. Larger Cache size (capacity misses)
  3. Higher Associativity (conflict misses)

• Reducing Miss Penalty
  4. Multilevel Caches
  5. Giving Reads Priority over Writes
     E.g., reads complete before earlier writes in the write buffer

• Reducing Hit Time
  6. Avoiding address translation when indexing the cache
27
1: Larger Block Size
Figure: miss rate vs. block size (16 to 256 bytes) for cache sizes of 1 KB, 4 KB,
16 KB, 64 KB, and 256 KB. The miss rate actually goes up if the block size is too
large relative to the cache size.

• Take advantage of spatial locality
  – The larger the block, the greater the chance parts of it will be used again
• For a cache of the same size, larger blocks reduce the number of blocks and
  increase the miss penalty (more data transferred per miss)
• Larger blocks may increase conflict misses, and even capacity misses if the
  cache is small
• Usually, the high latency and high bandwidth of lower-level memory encourage
  large block sizes

28
2: Larger Caches

Figure: miss rate per type vs. cache size (1 KB to 128 KB), showing conflict misses
for 1-way, 2-way, 4-way, and 8-way placement stacked on top of capacity and
compulsory misses; the vertical axis runs from 0 to 0.14.

• Increasing capacity of cache reduces capacity misses


• May result in longer hit time and higher cost
• Trends: Larger L2 or L3 off-chip caches
29
3: Higher Associativity
• Previous figure shows how miss rates improve with
higher associativity
– 8-way set associative is as effective as fully associative for practical
purposes
– 2:1 Cache Rule: a direct-mapped cache of size N has about the same miss
  rate as a 2-way set-associative cache of size N/2
• Tradeoff: a more highly associative cache complicates the circuitry
– May have longer clock cycle

• Beware: Execution time is the only final measure!


–Will Clock Cycle time increase as a result of having a more
complicated cache?

30
Reducing Cache Miss Penalty
Time to handle a miss is becoming more and more the
controlling factor. This is because of the great improvement in
speed of processors as compared to the speed of memory.

Average Memory Access Time


= Hit Time + Miss Rate * Miss Penalty
Two optimizations
1. Multilevel caches
2. Giving priority to read misses over writes

31
4: Multilevel Caches
• Approaches
– Make the cache faster to keep pace with the speed of CPUs
– Make the cache larger to overcome the widening gap
L1: fast hits, L2: fewer misses
• L2 Equations
  Average Memory Access Time = Hit Time(L1) + Miss Rate(L1) x Miss Penalty(L1)
  Miss Penalty(L1) = Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2)
  Average Memory Access Time = Hit Time(L1)
      + Miss Rate(L1) x (Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2))
  Hit Time(L1) << Hit Time(L2) << … << Hit Time(Mem)
  Miss Rate(L1) > Miss Rate(L2) > …
Definitions:
– Local miss rate — misses in this cache divided by the total number of memory
  accesses to this cache (1st-level cache: Miss Rate(L1); 2nd-level cache: Miss Rate(L2))

– Global miss rate — misses in this cache divided by the total number of memory
  accesses generated by the CPU (L1: Miss Rate(L1); L2: Miss Rate(L1) x Miss Rate(L2))
  • Indicates what fraction of the memory accesses that leave the CPU go all the
    way to memory.
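
A small C sketch of the L2 equations above (illustrative only; the names are made up,
and the numbers in main() come from the example a couple of slides later):

#include <stdio.h>

/* Two-level average memory access time; times are in clock cycles,
   and the miss rates are the local miss rates of each level.        */
double amat_two_level(double hit_l1, double miss_rate_l1,
                      double hit_l2, double miss_rate_l2,
                      double miss_penalty_l2)
{
    double miss_penalty_l1 = hit_l2 + miss_rate_l2 * miss_penalty_l2;
    return hit_l1 + miss_rate_l1 * miss_penalty_l1;
}

int main(void)
{
    /* L1 hit 1 cycle, 4% L1 miss rate, L2 hit 10 cycles,
       50% local L2 miss rate, 200-cycle L2 miss penalty. */
    printf("AMAT = %.1f cycles\n",
           amat_two_level(1.0, 0.04, 10.0, 0.50, 200.0));   /* 5.4 cycles */
    return 0;
}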
32
Design of L2 Cache

•Size
– Since everything in L1 cache is likely to be in L2 cache, L2 cache
should be much bigger than L1
•Whether data in L1 is in L2
– novice approach: design L1 and L2 independently
– multilevel inclusion: L1 data are always present in L2
• Advantage: easy for consistency between I/O and cache (checking L2 only)
• Drawback: L2 must invalidate all L1 blocks that map onto the 2nd-level
block to be replaced => slightly higher 1st-level miss rate
• e.g., Intel Pentium 4: 64-byte blocks in L1 and 128-byte blocks in L2
– multilevel exclusion: L1 data is never found in L2
• A cache miss in L1 results in a swap (not a replacement) of blocks between L1
and L2
• Advantage: prevent wasting space in L2
• e.g., AMD Athlon: 64 KB L1 and 256 KB L2

33
Example: Suppose that in 1000 memory references there are 40
misses in the first level cache and 20 misses in the second level
cache. What are the various miss rates? Assume the miss
penalty from the L2 cache to memory is 200 clock cycles, the hit
time of the L2 cache is 10 clock cycles, the hit time of L1 is 1
clock cycle, and there are 1.5 memory references per
instruction. What is the average memory access time and
average stall cycles per instruction?
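
A worked solution (not on the original slide):
  L1 miss rate (local = global) = 40/1000 = 4%
  L2 local miss rate = 20/40 = 50%;  L2 global miss rate = 20/1000 = 2%
  AMAT = 1 + 4% x (10 + 50% x 200) = 1 + 0.04 x 110 = 5.4 clock cycles
  Average memory stalls per instruction
    = (40 x 1.5/1000) x 10 + (20 x 1.5/1000) x 200 = 0.6 + 6.0 = 6.6 clock cycles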

34
5: Giving Priority to Read Misses over Writes
• Serve reads before writes have been completed
• Write through with write buffers
SW R3, 512(R0) ; M[512] <- R3 (cache index 0)
LW R1, 1024(R0) ; R1 <- M[1024] (cache index 0)
LW R2, 512(R0) ; R2 <- M[512] (cache index 0)
Problem: write through with write buffers can cause RAW conflicts with main
memory reads on cache misses
– Simply waiting for the write buffer to empty may increase the read miss
  penalty (by 50% on the old MIPS 1000)
– Check write buffer contents before read; if no conflicts, let the
memory access continue
• Write Back
Suppose a read miss will replace a dirty block
– Normal: Write dirty block to memory, and then do the read
– Instead: Copy the dirty block to a write buffer, do the read, and then
do the write
– The CPU stalls less since it can restart as soon as the read is done

35
6: Avoiding address translation during cache indexing
•Two tasks: indexing the cache and comparing addresses
•virtually vs. physically addressed cache
–virtual cache: use virtual address (VA) for the cache
–physical cache: use physical address (PA) after translating virtual address
•Challenges to virtual cache
1.Protection: page-level protection (RW/RO/Invalid) must be checked
–It’s checked as part of the virtual to physical address translation
–solution: an additional field to copy the protection information from the TLB and check
it on every access to the cache
2.context switching: same VA of different processes refer to different PA,
requiring the cache to be flushed
–solution: increase width of cache address tag with process-identifier tag (PID)
3.Synonyms or aliases: two different VA for the same PA
–inconsistency problem: two copies of the same data in a virtual cache
–hardware antialiasing solution: guarantee every cache block a unique PA
–Alpha 21264: check all possible locations. If one is found, it is invalidated
–software page-coloring solution: forcing aliases to share some address bits
–Sun’s Solaris: all aliases must be identical in the last 18 bits, so duplicate PAs cannot
 occur in the cache
4.I/O: typically use PA, so need to interact with cache (see Section 5.12)
36
37
Virtually indexed, physically tagged cache

Figure: three organizations.
– Conventional organization: the virtual address is translated by the TLB first, and the
  cache is accessed with the physical address.
– Virtually addressed cache: the cache is accessed with the virtual address; translation
  is needed only on a miss, but this creates the synonym problem.
– Virtually indexed, physically tagged (with an L2 cache below): cache access is
  overlapped with address translation, which requires the cache index to remain
  invariant across translation.
38
• One alternative that gets the best of both virtual and physical
caches is to use part of the page offset – the part that is
identical in both virtual and physical addresses – to index the cache.
• At the same time as the cache is being read using the index, the
virtual part of the address is translated, and the tag match uses
PAs.
• It allows the cache read to begin immediately, and yet the tag
comparison is still with PAs.

39
40
Virtual Memory
• Virtual memory (VM) allows programs to have the illusion
of a very large memory that is not limited by physical
memory size
– Makes main memory (DRAM) act like a cache for secondary
storage (magnetic disk)
– Otherwise, application programmers have to move data in/out main
memory
– That’s how virtual memory was first proposed
• Virtual memory also provides the following functions
– Allowing multiple processes to share the physical memory in a
multiprogramming environment
– Providing protection for processes (compare Intel 8086: without VM
applications can overwrite OS kernel)
– Facilitating program relocation in physical memory space
VM Example

42
Virtual Memory and Cache
• VM address translation provides a mapping from the
virtual address of the processor to the physical address
in main memory and secondary storage.

• Cache terms vs. VM terms


– Cache block => page
– Cache Miss => page fault

• Tasks of hardware and OS


– TLB does fast address translations
– OS handles less frequently events:
• page fault
• TLB miss (when software approach is used)

43
Virtual Memory and Cache
45
4 Qs for Virtual Memory
• Q1: Where can a block be placed in Main Memory?
– Miss penalty for virtual memory is very high => Full
associativity is desirable (so allow blocks to be placed
anywhere in the memory)
– Have software determine the location while accessing
disk (10M cycles enough to do sophisticated
replacement)

• Q2: How is a block found if it is in Main Memory?


– Address divided into page number and page offset
– Page table and translation buffer used for address
translation
4 Qs for Virtual Memory
• Q3: Which block should be replaced on a
miss?
– Want to reduce miss rate & can handle in software
– Least Recently Used typically used
– A typical approximation of LRU
• Hardware sets reference bits
• OS records reference bits and clears them periodically
• OS selects a page among the least-recently referenced for
replacement

• Q4: What happens on a write?


– Writing to disk is very expensive
– Use a write-back strategy

47
Virtual-Physical Translation
• A virtual address consists of a virtual page
number and a page offset.
• The virtual page number gets translated to a
physical page number.
• The page offset is not changed
Virtual Address:   | Virtual Page Number (36 bits)  | Page offset (12 bits) |
                              | Translation
                              v
Physical Address:  | Physical Page Number (33 bits) | Page offset (12 bits) |
48
Address Translation Via Page Table

Assume the access hits in main memory

49
TLB: Improving Page Table Access
• Cannot afford to access the page table on every
access, including cache hits (otherwise the cache
itself makes no sense)
• Again, use cache to speed up accesses to page
table! (cache for cache?)
• TLB (translation lookaside buffer): stores
frequently accessed page table entries
• A TLB entry is like a cache entry
– Tag holds portions of virtual address
– Data portion holds physical page number, protection
field, valid bit, use bit, and dirty bit (like in page table
entry)
– Usually fully associative or highly set associative
– Usually 64 or 128 entries
• Access page table only for TLB misses
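
A schematic C sketch of a TLB entry and a lookup, based on the fields listed above
(illustrative only: a real TLB compares all tags in parallel in hardware, and these
names are made up):

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64          /* a typical size per the slide */

struct tlb_entry {
    bool     valid;
    bool     dirty;             /* page has been written               */
    bool     use;               /* reference bit used for replacement  */
    uint8_t  prot;              /* protection field (e.g., read/write) */
    uint64_t tag;               /* portion of the virtual page number  */
    uint64_t ppn;               /* physical page number                */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Fully associative lookup: returns true on a TLB hit and sets *ppn.
   On a miss, the page table is accessed (by hardware or the OS).     */
bool tlb_lookup(uint64_t vpn, uint64_t *ppn)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].tag == vpn) {
            tlb[i].use = true;          /* record the reference */
            *ppn = tlb[i].ppn;
            return true;
        }
    }
    return false;                        /* TLB miss */
}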
50
TLB Characteristics
• The following are characteristics of TLBs
– TLB size : 32 to 4,096 entries
– Block size : 1 or 2 page table entries (4 or 8 bytes
each)
– Hit time: 0.5 to 1 clock cycle
– Miss penalty: 10 to 30 clock cycles (go to page table)
– Miss rate: 0.01% to 0.1%
– Associativity: fully associative or set associative
– Write policy: write back (entries are replaced infrequently)

51
52
• Selecting a page size
• The size of the page table is inversely proportional to the page
  size; memory can therefore be saved by making the pages
  bigger.
• A larger page size can allow larger caches with fast cache hit
  times.
• Transferring larger pages to or from secondary storage, possibly
  over a network, is more efficient than transferring smaller pages.
• The number of TLB entries is restricted, so a larger page size
  means that more memory can be mapped efficiently, thereby
  reducing the number of TLB misses.

53
54
10 Advanced Cache Optimizations
• Reducing hit time
  1. Small and simple caches
  2. Way prediction
• Increasing cache bandwidth
  3. Pipelined caches
  4. Multibanked caches
  5. Nonblocking caches
• Reducing miss penalty
  6. Critical word first
  7. Merging write buffers
• Reducing miss rate
  8. Compiler optimizations
• Reducing miss penalty or miss rate via parallelism
  9. Hardware prefetching
  10. Compiler prefetching

55
01: Small and Simple Caches
•A time-consuming portion of a cache hit is using the index portion of the
address to read the tag memory and then compare it to the address.
Guideline: smaller hardware is faster
– Why does the Alpha 21164 have an 8 KB instruction cache and an 8 KB data
  cache plus a 96 KB second-level cache?
• Small data cache and thus fast clock rate
Guideline: simpler hardware is faster
– Direct Mapped, on chip caches can overlap the tag check with
transmission of data, reducing hit time.
– Lower associativity reduces hit time and power because fewer cache
lines are accessed.
•General design:
– small and simple cache for 1st-level cache
– Keeping the tags on chip and the data off chip for 2nd-level caches
The emphasis recently is on fast clock time while hiding L1 misses with
dynamic execution and using L2 caches to avoid going to memory.
56
02. Fast Hit Times Via Way Prediction
• How to combine fast hit time of direct-mapped with lower conflict
misses of 2-way SA cache?
• Way prediction: keep extra bits in cache to predict “way” (block
within set) of next cache access.
– Multiplexer set early to select desired block; only 1 tag
comparison done that cycle (in parallel with reading data)
– Miss → check the other blocks for matches in the next cycle
– Added to each block of a cache are predictor bits, which
  indicate which of the blocks to try on the next cache access.
– If the prediction is correct → fast hit time; otherwise, the
  predictor bits are changed and there is a latency of one extra clock cycle.
• Accuracy > 90% for 2-way (popular), > 80% for 4-way
• Drawback: pipelining the CPU is harder if the hit time is variable

57
03: Pipelined Cache Access to increase cache
Bandwidth
Simply to pipeline cache access
– Multiple clock cycle for 1st-level cache hit giving fast clock cycle time
and high bandwidth.
• Tradeoff: fast cycle time but slow hits
Example: the pipeline for accessing instructions from I-cache
– Pentium: 1 clock cycle
– Pentium Pro ~ Pentium III: 2 clocks
– Pentium 4 & Intel Core i7: 4 clocks
•Drawback: Increasing the number of pipeline stages leads to
– greater penalty on mispredicted branches and
– more clock cycles between the issue of the load and the use of the data

Note that it increases the bandwidth of instructions rather than


decreasing the actual latency of a cache hit

58
04: Non-blocking Caches to increase cache
Bandwidth
• For out-of-order completion processors, the processor need not stall on a
data cache miss.
– E.g., the processor can continue fetching from the I-cache; it helps to keep
servicing processor requests during a miss instead of ignoring them.
• Blocking Caches
o When a "miss" occurs, CPU stalls until the data cache successfully
finds the missing data.
• Non-blocking or lockup-free caches
o Allow the CPU to continue being productive (such as continue fetching
instructions) while the "miss" resolves – “hit under miss” – reduces
miss penalty
• Effective miss penalty can further be reduced if cache can overlap
multiple misses: a “hit under multiple misses” or “miss under miss” (Intel
Core i7)
• Difficult to judge the impact of any single miss and hence to calculate the
AMAT.
05: Multibanked Caches
Increasing Cache Bandwidth Via Multiple
Banks
• Rather than treating cache as single monolithic block, divide into
independent banks to support simultaneous accesses
– E.g., Arm Cortex-A8 - L2 has one to 4 banks
– Intel Core i7 – 4 banks in L1 (2 memory accesses per clock
cycle), 8 banks in L2 cache
• Works best when accesses naturally spread across banks → the
mapping of addresses to banks affects the behavior of the memory
system.
• Simple mapping that works well is sequential interleaving
– Spread block addresses sequentially across banks
– E.g., bank i holds all blocks whose address modulo n equals i

61
Figure: a single-bank cache vs. a multibank cache supporting simultaneous access.
With the simple sequential-interleaving mapping across 4 banks:
  Bank #0 handles addresses with (block address) mod 4 = 0: blocks 0, 4, 8, 12, …
  Bank #1 handles addresses with (block address) mod 4 = 1: blocks 1, 5, 9, 13, …
  Bank #2 handles addresses with (block address) mod 4 = 2: blocks 2, 6, 10, 14, …
  Bank #3 handles addresses with (block address) mod 4 = 3: blocks 3, 7, 11, 15, …
06:Critical Word First / Early Restart
Reduce Miss Penalty:
Early Restart and Critical Word First
• Don’t wait for full block before restarting CPU
• Early restart—As soon as requested word of block arrives, send
to CPU and continue execution
– Spatial locality → the CPU tends to want the next sequential word,
  so it may still pay to fetch it
• Critical Word First—Request missed word from memory first,
send it to CPU right away; let CPU continue while filling rest of
block
– Long blocks are more popular today → Critical Word First is widely
  used


73
Figure (animation): after a miss, the unoptimized cache waits for the entire block to
be filled before supplying the requested word; with early restart, the CPU resumes as
soon as the requested word arrives; with critical word first, the missed word is
requested from memory first and sent to the CPU immediately while the rest of the
block fills in.
07:Merging Write Buffer
Merging Write Buffer
to Reduce Miss Penalty
• Write buffer lets processor continue while waiting for
write to complete
• If buffer contains modified blocks, addresses can be
checked to see if new data matches that of some
write buffer entry
• If so, new data combined with that entry
• For sequential writes in write-through caches,
increases block size of write (more efficient)
• Sun T1 (Niagara) and many others use write merging

97
Figure: without a merging write buffer, each sequential word occupies its own buffer
entry, wasting 75% of the space; with a merging write buffer, the four sequential
words are combined into a single entry.
8. Reducing Misses
by Compiler Optimizations
• McFarling [1989] reduced misses by 75% in software on 8KB
direct-mapped cache, 4 byte blocks
• Instructions
– Reorder procedures in memory to reduce conflict misses
– Profiling to look at conflicts (using tools they developed)
• Data
– Merging arrays: Improve spatial locality by single array of
compound elements vs. 2 arrays
– Loop interchange: Change nesting of loops to access data in
memory order
– Loop fusion: Combine 2 independent loops that have same
looping and some variable overlap
– Blocking: Improve temporal locality by accessing blocks of
data repeatedly vs. going down whole columns or rows (a sketch
follows the loop fusion example below)
Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */


struct merge {
int val;
int key;
};
struct merge merged_array[SIZE];

Reduce conflicts between val & key;


improve spatial locality
Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
for (j = 0; j < 100; j = j+1)
for (i = 0; i < 5000; i = i+1)
x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
for (i = 0; i < 5000; i = i+1)
for (j = 0; j < 100; j = j+1)
x[i][j] = 2 * x[i][j];
Sequential accesses instead of striding through memory
every 100 words; improved spatial locality
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{ a[i][j] = 1/b[i][j] * c[i][j];
d[i][j] = a[i][j] + c[i][j];}
2 misses per access to a & c vs. one miss per access; improves
temporal locality
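
The blocking optimization listed earlier has no example on these slides; below is a
common sketch (a blocked matrix multiply). Like the examples above it is a fragment:
it assumes N, the block factor B, the arrays, the loop variables, and a min() helper
are declared, and that x starts out zeroed.

/* Before: x = y * z on N x N matrices; z is traversed by whole columns,
   so its blocks are evicted before they can be reused */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
    { r = 0;
      for (k = 0; k < N; k = k+1)
          r = r + y[i][k] * z[k][j];
      x[i][j] = r;
    }

/* After: blocked version with block factor B; the loops work on
   B x B submatrices that fit in the cache, improving temporal locality */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B, N); j = j+1)
            { r = 0;
              for (k = kk; k < min(kk+B, N); k = k+1)
                  r = r + y[i][k] * z[k][j];
              x[i][j] = x[i][j] + r;
            }

Accumulating into x[i][j] across the kk blocks is what keeps the blocked version
equivalent to the original, given that x is initialized to zero.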
09. Reducing Misses by Hardware
Prefetching of Instructions & Data

• Prefetching relies on having extra memory bandwidth that can be


used without penalty
• Instruction prefetching
– Typically, CPU fetches 2 blocks on miss: requested and next
– Requested block goes in instruction cache, prefetched goes in
instruction stream buffer
• Data prefetching
– Pentium 4 can prefetch data into L2 cache from up to 8 streams
from 8 different 4 KB pages
– Prefetching invoked if 2 successive L2 cache misses to a page,
if distance between those cache blocks is < 256 bytes
10. Reducing Misses by
Software Prefetching Data
• Data prefetch
– Load data into register (HP PA-RISC loads)
– Cache prefetch: load into cache
(MIPS IV, PowerPC, SPARC v. 9)
– Special prefetching instructions cannot cause faults; form of
speculative execution

• Prefetch instructions take time


– Is the cost of issuing prefetches < the savings from reduced misses?
– Wider superscalar processors reduce the problem of issue bandwidth
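
A sketch of software prefetching in C using GCC's __builtin_prefetch intrinsic (one
possible realization, not from the original slides; the prefetch distance of 16 elements
is an arbitrary illustrative choice):

/* Sum an array while prefetching ahead. Like the special prefetch
   instructions described above, the prefetch cannot cause a fault,
   so prefetching past the end of the array is harmless.            */
double sum_with_prefetch(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 16], 0 /* read */, 1 /* low temporal locality */);
        s += a[i];
    }
    return s;
}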
Memory Technology and Optimizations
• Main memory satisfies the demands of caches and serves as the
  I/O interface.
• Performance measures of main memory: latency and bandwidth
• Latency is the primary concern of the cache (which benefits from low
  latency); bandwidth is the concern of multiprocessors and I/O.
• It is easier to increase bandwidth than to reduce latency
• Cache designers increase the block size in L2 to take
  advantage of high bandwidth
• Innovations in main memory are needed as well.
• Traditionally: organizing DRAM chips, multiple memory
  banks, making the memory and bus wider – at the board level
• Now, innovations are happening inside the DRAM chips
Main-Memory Background
• Performance of Main Memory:
– Latency: cache miss penalty
» Access time: time between request and when word arrives
» Cycle time: minimum time between requests
• Main Memory is DRAM: Dynamic Random Access
Memory
– Dynamic since needs to be refreshed (8 ms, 1% time)
– Emphasis is on cost per bit & capacity
– Addresses divided into 2 halves (memory as 2D matrix):
• Cache uses SRAM: Static Random Access
Memory
– No refresh (6 transistors/bit vs. 1 transistor)
– Designs are concerned with speed and capacity.
– Needs minimal power to retain charge in standby mode
Size: DRAM/SRAM ~ 4-8,
Cost/cycle time: SRAM/DRAM ~ 8-16
DRAM logical organization (4 Mbit)

Figure: a 4-Mbit DRAM organized as a 2,048 x 2,048 cell array. Address lines
A0…A10 are used for both the row and column addresses (row decoder and column
decoder); data enters on D and leaves on Q through the sense amps & I/O circuits;
each cell sits at the intersection of a word line and a bit line.
Quest for DRAM Performance

• DRAM access is divided into row and column


access.
• DRAMs must buffer a row of bits inside the DRAM
for column access and this row is square root of
DRAM size.
• Internal organization of DRAM actually consists of
many memory modules.
• A 1-Gbit DRAM may contain 512 memory arrays of
2 Mbits each.
• This large number of arrays provides much higher
bandwidth.

121
Quest for DRAM Performance

1. Fast Page mode


– Add timing signals to allow repeated accesses to row buffer
without another row access time
– Easy, since array buffers 1024-2048 bits for each access
2. Synchronous DRAM (SDRAM)
– Add clock signal to DRAM interface so repeated transfers
don’t pay overhead to synchronize with controller
3. Double Data Rate (DDR SDRAM)
– Transfer data on rising and falling edges of clock, doubling
peak data rate
– DDR2 lowers power by dropping voltage from 2.5 to 1.8 volts,
and offers higher clock rates: up to 400 MHz
– DDR3 drops to 1.5 volts and raises clock rates to 800 MHz
• Improved bandwidth, not latency

122
A Dynamic Memory Chip

Figure 5.7. Internal organization of a 2M x 8 dynamic memory chip: a
4096 x (512 x 8) cell array. The row address latch and row decoder are controlled by
RAS (Row Address Strobe) and the column address latch and column decoder by CAS
(Column Address Strobe); the multiplexed address lines supply A20–9 (row) and
A8–0 (column). Sense/Write circuits connect the array to data lines D7–D0 under
R/W and CS control.

123
Fast Page Mode

• When the DRAM in last slide is accessed, the


contents of all 4096 cells in the selected row
are sensed, but only 8 bits are placed on the
data lines D7-0, as selected by A8-0.
• Fast page mode – make it possible to access
the other bytes in the same row without
having to reselect the row.
• A latch is added at the output of the sense
amplifier in each column.
• Good for bulk transfer.

124
Synchronous DRAMs

• The operations of SDRAM are controlled by a clock signal.


Figure 5.8. Synchronous DRAM: a refresh counter and row/column address latches
(the column side with an address counter) feed the row and column decoders of the
cell array; Read/Write circuits & latches connect the array to the data input and data
output registers; a mode register and timing control are driven by the Clock, RAS,
CAS, R/W, and CS inputs.

125


Synchronous DRAMs

Figure 5.9. Burst read of length 4 in an SDRAM: the row address is presented with
RAS, then the column address with CAS while R/W selects a read; the data words
D0–D3 are then transferred on four successive clock cycles.

126
DDR SDRAM

• Double-Data-Rate SDRAM
• Standard SDRAM performs all actions on the
rising edge of the clock signal.
• DDR SDRAM accesses the cell array in the
same way, but transfers the data on both
edges of the clock.
• The cell array is organized in two banks. Each
can be accessed separately.
• DDR SDRAMs and standard SDRAMs are
most efficiently used in applications where
block transfers are prevalent.

127
DRAM Named by Peak Chip Xfers / Sec
DIMM Named by Peak DIMM MBytes / Sec

Standard  Clock Rate (MHz)  M transfers/second  DRAM Name   MBytes/s/DIMM  DIMM Name
DDR       133               266                 DDR266      2128           PC2100
DDR       150               300                 DDR300      2400           PC2400
DDR       200               400                 DDR400      3200           PC3200
DDR2      266               533                 DDR2-533    4264           PC4300
DDR2      333               667                 DDR2-667    5336           PC5300
DDR2      400               800                 DDR2-800    6400           PC6400
DDR3      533               1066                DDR3-1066   8528           PC8500
DDR3      666               1333                DDR3-1333   10664          PC10700
DDR3      800               1600                DDR3-1600   12800          PC12800

(Transfers per second = 2 x clock rate; MBytes/s per DIMM = 8 x M transfers/second.)
128
Protection : Virtual Memory

• Need for security and privacy against electronic


burglary has led to the innovations in architecture
and system software.
• Page based virtual memory, including TLB that
caches page table entries, is the primary mechanism
that protects processes from each other.
• Multiprogramming: a process = program + state. The OS
and architecture should allow processes to share the
hardware yet not interfere with each other:
– should allow context switching
– must limit what a process can access
when running a user process
– must allow an OS process to access more

129
Protection : Virtual Memory

The architecture must do the following


• Provide at least 2 modes indicating whether the
running process is a user process or an OS process
(kernel/supervisor process).
• Provide a portion of the processor state that a user
process can use but not write – includes the
user/supervisor mode bit, exception enable/disable
bit, and memory protection information
• Provide mechanisms whereby the processor can go
from user mode to supervisor mode and vice versa –
system calls, APIs
• Provide mechanisms to limit memory accesses to
protect the memory state of a process without having
to swap the process to disk on a context switch.
130
Protection : Virtual Machines (VM)

Reasons for popularity of VM


• Increasing importance of isolation and
security.
• Failures in security and reliability of standard
OSes.
• Sharing of a single computer among many
unrelated users.
• Increase in the raw speed of the processor
which makes the overhead of VMs more
acceptable.

131
Protection : Virtual Machines (VM)
• VMs include all emulation methods that provide a
standard software interface.
• System VMs: provide a complete system-level environment
at the binary ISA level (some can run a different ISA, but we
assume the VM ISA always matches the hardware) – e.g.,
VMware, IBM VM/370.
• Present an illusion that the users of a VM have an entire
computer to themselves, including a copy of OS
• A single OS owns all the hardware but with VM, multiple
OSes all share the hardware resources
• VMM/ hypervisor – software that supports VM, determines
how to map virtual resources to physical resources.
• The underlying H/W platform is the “host” and its resources are
shared among the “guest” VMs.

132
Protection : Virtual Machines (VM)

• Managing software: VMs provide an
abstraction that can run a complete software
stack. A typical example is some VMs
running legacy OSes, many running the
current stable OS release, and a few testing the
next OS release.
• Managing hardware: multiple servers run
different applications on separate computers,
but VMs allow these separate software stacks
to run independently yet share hardware,
thereby consolidating the number of servers.

133
Protection : Virtual Machines (VM)

Requirements of a VMM:
• The VMM presents a S/W interface to guest S/W and must
isolate the state of guests from each other.
• Must protect itself from guest S/W.
• Guest S/W should behave on a VM exactly as if it
were running on the native hardware
• Guest S/W should not be able to change the allocation of
real system resources directly.
• The VMM must control access to privileged state,
address translation, I/O, exceptions, and interrupts.
• The VMM must be at a higher privilege level than the
guest VM.

134
