Inserting an L1 Cache Between the CPU and Main Memory

The tiny, very fast CPU register file has room for four 4-byte words.
The transfer unit between the CPU register file and the cache is a
4-byte block.

The small, fast L1 cache has room for two 4-word blocks. The transfer
unit between the cache and main memory is a 4-word block (16 bytes).

The big, slow main memory has room for many 4-word blocks.
General Organization of a Cache Memory

A cache is an array of S = 2^s sets. Each set contains E lines (one or
more). Each line holds a block of data: 1 valid bit per line, t tag
bits per line, and B = 2^b bytes per cache block.

An m-bit address is partitioned into three fields:

  <tag> (t bits) | <set index> (s bits) | <block offset> (b bits)

The word at address A is in the cache if the tag bits in one of the
valid lines in set <set index> match <tag>. The word contents begin at
offset <block offset> bytes from the beginning of the block.
Direct-Mapped Cache

Simplest kind of cache: characterized by exactly one line per set
(E = 1).

Line matching and word selection:
(1) The valid bit must be set.
(2) The tag bits in the cache line must match the tag bits in the
address.
(3) If (1) and (2), then cache hit, and the block offset selects the
starting byte.

The address is divided into t tag bits, s set-index bits, and b
block-offset bits.
Direct-Mapped Cache Simulation

Small example: M = 16 byte addresses, B = 2 bytes/block, S = 4 sets,
E = 1 line/set, so t = 1, s = 2, b = 1.

Address trace (reads, addresses in binary):
0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]

Larger example: 8-bit addresses, direct mapped with 16 sets, one
4-byte word per line: s = 4, b = 2, and tag = 2 bits.

Access trace (address [tag | set index | offset]):
 2 [00 0000 10]: set 0  : miss
 3 [00 0000 11]: set 0  : hit
11 [00 0010 11]: set 2  : miss
16 [00 0100 00]: set 4  : miss
21 [00 0101 01]: set 5  : miss
13 [00 0011 01]: set 3  : miss
64 [01 0000 00]: set 0  : miss
48 [00 1100 00]: set 12 : miss
19 [00 0100 11]: set 4  : hit
11 [00 0010 11]: set 2  : hit
 3 [00 0000 11]: set 0  : miss (evicted by 64)
22 [00 0101 10]: set 5  : hit
 4 [00 0001 00]: set 1  : miss
27 [00 0110 11]: set 6  : miss
 6 [00 0001 10]: set 1  : hit
11 [00 0010 11]: set 2  : hit
Set Associative Caches

Characterized by more than one line per set (E > 1). Each set holds E
lines, each consisting of a valid bit, a tag, and a cache block.

Accessing Set Associative Caches

Set selection: identical to a direct-mapped cache. The s set-index
bits of the address select the set; the address is still divided into
t tag bits, s set-index bits, and b block-offset bits.
Line matching and word selection: the cache must compare the tag in
each valid line in the selected set.
(1) The valid bit must be set.
(2) The tag bits in one of the valid lines must match the tag bits in
the address.
(3) If (1) and (2), then cache hit, and the block offset selects the
starting byte.
Accessing Fully Associative Caches

No set selection: there is a single set containing all the lines, so
the address is divided into just tag and block offset (no set-index
bits). Every valid line's tag is compared against the address tag.
Multi-Level Cache Hierarchy (Pentium example)

Processor chip: registers, split L1 (i-cache and d-cache), unified L2;
then main memory and disk.

L1 Data: 16 KB, 4-way set associative, 32 B lines, 1-cycle latency,
write-through.
L1 Instruction: 16 KB, 4-way set associative, 32 B lines.
L2 Unified: 128 KB--2 MB, 4-way set associative, 32 B lines,
write-back, write allocate.
Main memory: up to 4 GB.
Cache Performance Metrics
Miss Rate
Fraction of memory references not found in cache
(misses/references)
Typical numbers:
3-10% for L1
can be quite small (e.g., < 1%) for L2, depending on size, etc.
Hit Time
Time to deliver a line in the cache to the processor (includes time to
determine whether the line is in the cache)
Typical numbers:
1-2 clock cycles for L1
5-20 clock cycles for L2
Miss Penalty
Additional time required because of a miss
Typically 25-100 cycles for main memory
Let's think about those numbers
Huge difference between a hit and a miss
Could be 100x, if just L1 and main memory
Would you believe 99% hits is twice as good as 97%?
Consider:
cache hit time of 1 cycle
miss penalty of 100 cycles
Average access time:
97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles
This is why “miss rate” is used instead of “hit rate”
Types of Cache Misses
Cold (compulsory) miss
Occurs on first access to a block
Conflict miss
Most hardware caches limit blocks to a small subset
(sometimes a singleton) of the available cache slots
e.g., block i must be placed in slot (i mod 4)
Conflict misses occur when the cache is large enough, but
multiple data objects all map to the same slot
e.g., referencing blocks 0, 8, 0, 8, ... would miss every time
Capacity miss
Occurs when the set of active cache blocks (working set) is
larger than the cache
Why Caches Work

Locality: programs tend to use data and instructions with addresses
near or equal to those they have used recently.

Temporal locality: recently referenced items are likely to be
referenced again in the near future.
Spatial locality: items with nearby addresses tend to be referenced
close together in time.

Example (summing over an array a[]):
Data:
  Temporal: sum referenced in each iteration
  Spatial: array a[] accessed in stride-1 pattern
Instructions:
  Temporal: cycle through loop repeatedly
  Spatial: reference instructions in sequence

Being able to assess the locality of code is a crucial skill for a
programmer.
Locality Example #1

int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

The array is scanned in row-major (stride-1) order, so this version
has good spatial locality.

The Memory Mountain

Measuring read throughput as a function of working-set size and
stride:

int main()
{
    int size;    /* Working set size (in bytes) */
    int stride;  /* Stride (in array elements) */
    double Mhz;  /* Clock frequency */
    ...
}
[Figure: the memory mountain -- read throughput (MB/s, 0 to 600) as a
function of working-set size (2k to 8m bytes) and stride (s1 to s15
words). Ridges of temporal locality appear where the working set fits
in a cache; slopes of spatial locality appear as throughput falls with
increasing stride (the L2 slope is labeled).]
Ridges of Temporal Locality

A slice through the memory mountain with stride = 1 illuminates the
read throughputs of the different caches and of main memory.

[Figure: read throughput (MB/s, 0 to 1200) vs. working-set size from
1k to 8m bytes, showing distinct plateaus for the L1 cache region, the
L2 cache region, and the main memory region.]
A Slope of Spatial Locality

A slice through the memory mountain at a fixed working-set size shows
how throughput falls as the stride grows, until there is one access
per cache line.

[Figure: read throughput (MB/s, 0 to 700) vs. stride (s1 to s16
words); beyond a certain stride, every access touches a new cache line
and throughput levels off.]
Matrix Multiplication Example

[Figure: matrices A, B, and C, indexed by i, j, and k.]
Layout of C Arrays in Memory (review)

C arrays are allocated in row-major order: each row occupies
contiguous memory locations.

Stepping through columns in one row:

for (i = 0; i < N; i++)
    sum += a[0][i];

accesses successive elements; if the block size B > 4 bytes, this
exploits spatial locality: compulsory miss rate = 4 bytes / B.

Stepping through rows in one column:

for (i = 0; i < N; i++)
    sum += a[i][0];

accesses distant elements: no spatial locality! Compulsory miss rate
= 1 (i.e., 100%).
Matrix Multiplication (ijk)

/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A is accessed row-wise (i,*), B column-wise (*,j), C is
fixed at (i,j).
Matrix Multiplication (kij)

/* kij */
for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A is fixed at (i,k), B is accessed row-wise (k,*), C
row-wise (i,*).
Matrix Multiplication (ikj)

/* ikj */
for (i = 0; i < n; i++) {
    for (k = 0; k < n; k++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A is fixed at (i,k), B is accessed row-wise (k,*), C
row-wise (i,*).
Matrix Multiplication (jki)

/* jki */
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A is accessed column-wise (*,k), B is fixed at (k,j), C
column-wise (*,j).
Matrix Multiplication (kji)

/* kji */
for (k = 0; k < n; k++) {
    for (j = 0; j < n; j++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A is accessed column-wise (*,k), B is fixed at (k,j), C
column-wise (*,j).

Misses per inner-loop iteration:
  A    B    C
  1.0  0.0  1.0
Summary of Matrix Multiplication

ijk (and jik):
  2 loads, 0 stores per iteration; misses/iter = 1.25

kij (and ikj):
  2 loads, 1 store per iteration; misses/iter = 0.5

jki (and kji):
  2 loads, 1 store per iteration; misses/iter = 2.0
Pentium Matrix Multiply Performance

Miss rates are helpful but not perfect predictors; code scheduling
matters, too.

[Figure: cycles/iteration (0 to 60) vs. array size n (25 to 400) for
all six loop orderings: kji, jki, kij, ikj, jik, ijk.]
Improving Temporal Locality by Blocking

Example: blocked matrix multiplication. Here "block" does not mean
"cache block"; instead, it means a sub-block within the matrix.

Example: N = 8, sub-block size = 4:

  | A11 A12 |   | B11 B12 |   | C11 C12 |
  | A21 A22 | X | B21 B22 | = | C21 C22 |

Key idea: sub-blocks (i.e., the Axy) can be treated just like scalars,
e.g. C11 = A11*B11 + A12*B21.
[Figure: Pentium blocked matrix multiply performance --
cycles/iteration (0 to 50) vs. array size n (25 to 400) for the six
unblocked orderings (kji, jki, kij, ikj, jik, ijk) plus two blocked
versions, bijk and bikj, each with bsize = 25.]