
EL3011 Computer System Architecture

Chapter 6 Cache Memories


Sekolah Teknik Elektro dan Informatika
Institut Teknologi Bandung
Topics
 Generic cache memory organization
 Direct mapped caches
 Set associative caches
 Impact of caches on performance

Cache Memories
 Definition: computer memory with short access time, used for the storage of frequently or recently used instructions or data
 Cache memories are small, fast SRAM-based memories managed automatically in hardware.
 Hold frequently accessed blocks of main memory
 The CPU looks for data first in L1, then in L2, then in main memory.
 Typical bus structure:
[Figure: the CPU chip contains the register file, ALU, and L1 cache; the L2 cache attaches over a dedicated cache bus; the bus interface connects via the system bus to the I/O bridge, which reaches main memory over the memory bus.]
Inserting an L1 Cache Between the CPU and Main Memory
 The tiny, very fast CPU register file has room for four 4-byte words.
 The transfer unit between the CPU register file and the cache is a 4-byte block.
 The small, fast L1 cache has room for two 4-word blocks (lines 0 and 1).
 The transfer unit between the cache and main memory is a 4-word block (16 bytes).
 The big, slow main memory has room for many 4-word blocks (e.g., block 10 "abcd", block 21 "pqrs", block 30 "wxyz", ...).
General Org of a Cache Memory
 A cache is an array of S = 2^s sets.
 Each set contains E lines (one or more lines per set).
 Each line holds a block of B = 2^b data bytes, plus 1 valid bit and t tag bits.
[Figure: sets 0 through S-1, each with E lines of the form <valid, tag, byte 0 ... byte B-1>.]

Cache size: C = B x E x S data bytes


Addressing Caches
Address A (m bits, from bit m-1 down to bit 0):
<tag> (t bits) | <set index> (s bits) | <block offset> (b bits)

The word at address A is in the cache if the tag bits in one of the <valid> lines in set <set index> match <tag>. The word contents begin at offset <block offset> bytes from the beginning of the block.
Direct-Mapped Cache
 Simplest kind of cache
 Characterized by exactly one line per set (E = 1).
[Figure: sets 0 through S-1, each a single <valid, tag, cache block> line.]
Accessing Direct-Mapped Caches
 Set selection
 Use the set index bits to determine the set of interest.
[Figure: the address splits into <tag (t bits), set index (s bits), block offset (b bits)>; the set index field selects one of the S sets.]
Accessing Direct-Mapped Caches
 Line matching and word selection
 Line matching: find a valid line in the selected set with a matching tag
 Word selection: then extract the word
 (1) The valid bit must be set.
 (2) The tag bits in the cache line must match the tag bits in the address.
 (3) If (1) and (2), then cache hit, and the block offset selects the starting byte.
[Figure: selected set (i) holds <1, 0110, w0 w1 w2 w3> over bytes 0-7; address <tag = 0110, set index = i, block offset = 100> hits, and offset 100 selects the byte where word w2 begins.]
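As a minimal C sketch of this hit test (the struct and function names are my illustration, not from the slides; assumes B = 8 and that the address fields have already been extracted):

/* One cache line: valid bit, tag, and a B = 8 byte block. */
typedef struct {
    int valid;              /* valid bit */
    unsigned tag;           /* t tag bits */
    unsigned char block[8]; /* B data bytes */
} cache_line;

/* Direct-mapped lookup: returns a pointer to the requested byte
   on a hit, NULL on a miss. */
unsigned char *lookup(cache_line *set, unsigned tag, unsigned offset)
{
    cache_line *line = &set[0];          /* direct-mapped: one line per set */
    if (line->valid && line->tag == tag) /* conditions (1) and (2) */
        return &line->block[offset];     /* hit: offset selects starting byte */
    return NULL;                         /* miss */
}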
Direct-Mapped Cache Simulation
M = 16 byte addresses, B = 2 bytes/block, S = 4 sets, E = 1 entry/set
Address fields: t = 1, s = 2, b = 1 (tag | set | offset)
Address trace (reads): 0 [0000₂], 1 [0001₂], 13 [1101₂], 8 [1000₂], 0 [0000₂]

(1) 0 [0000₂] (miss): set 0 <- v = 1, tag = 0, M[0-1]
(2) 1 [0001₂] (hit): same block, found in set 0
(3) 13 [1101₂] (miss): set 2 <- v = 1, tag = 1, M[12-13]
(4) 8 [1000₂] (miss): set 0 <- v = 1, tag = 1, M[8-9] (replaces M[0-1])
(5) 0 [0000₂] (miss): set 0 <- v = 1, tag = 0, M[0-1] (conflict: replaces M[8-9])
Why Use Middle Bits as Index?
[Figure: a 4-line cache (sets 00-11) shown against all sixteen 4-bit addresses 0000-1111, comparing high-order and middle-order bit indexing.]
 High-Order Bit Indexing
 Adjacent memory lines would map to the same cache entry
 Poor use of spatial locality
 Middle-Order Bit Indexing
 Consecutive memory lines map to different cache lines
 Can hold a C-byte region of address space in the cache at one time
Cache Example
 Problem
 32-bit address
 4K cache sets
 1 word per line (32-bit data)
 Byte addressable
 Memory address structure?
 Cache line structure?
 Cache size?
Cache Example
 Memory address
 4K sets requires 12 set bits (4096 = 2^s, so s = 12)
 1 word = 32 bits = 4 bytes (4 = 2^b, so b = 2)
 Tag = 32 - 12 - 2 = 18 bits
 Address layout: Tag (18) | Set (12) | Block (2)
 Cache line structure
 1 valid bit
 18 tag bits
 1 word = 32 data bits, stored as 4 bytes (8 | 8 | 8 | 8)
 Total: 51 bits per line
 Cache size
 C = S x line = 4096 x 51 bits = 208,896 bits
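A minimal C sketch of this address split, using the field widths just derived (the shift/mask helpers are my illustration, not from the slides):

#include <stdint.h>

/* Field widths from the example: b = 2, s = 12, t = 18. */
#define B_BITS 2
#define S_BITS 12

uint32_t block_offset(uint32_t addr) { return addr & ((1u << B_BITS) - 1); }
uint32_t set_index(uint32_t addr)    { return (addr >> B_BITS) & ((1u << S_BITS) - 1); }
uint32_t tag_bits(uint32_t addr)     { return addr >> (B_BITS + S_BITS); }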
Cache access problem
 Cache:
 8-bit address and 1 word (32-bit) block
 Direct mapped: 16 sets, each line contains one word (32 bits)
 Show the final content of the cache after the memory accesses
2, 3, 11, 16, 21, 13, 64, 48, 19, 11, 3, 22, 4, 27, 6, 11

Access trace (address [tag | set | offset]):
 2 [00 0000 10]: miss
 3 [00 0000 11]: hit
 11 [00 0010 11]: miss
 16 [00 0100 00]: miss
 21 [00 0101 01]: miss
 13 [00 0011 01]: miss
 64 [01 0000 00]: miss
 48 [00 1100 00]: miss
 19 [00 0100 11]: hit
 11 [00 0010 11]: hit
 3 [00 0000 11]: miss
 22 [00 0101 10]: hit
 4 [00 0001 00]: miss
 27 [00 0110 11]: miss
 6 [00 0001 10]: hit
 11 [00 0010 11]: hit

Final cache contents:
Set  V  Tag  B0     B1     B2     B3
0    1  00   M[0]   M[1]   M[2]   M[3]
1    1  00   M[4]   M[5]   M[6]   M[7]
2    1  00   M[8]   M[9]   M[10]  M[11]
3    1  00   M[12]  M[13]  M[14]  M[15]
4    1  00   M[16]  M[17]  M[18]  M[19]
5    1  00   M[20]  M[21]  M[22]  M[23]
6    1  00   M[24]  M[25]  M[26]  M[27]
7    0
8    0
9    0
10   0
11   0
12   1  00   M[48]  M[49]  M[50]  M[51]
13   0
14   0
15   0
Cache access problem
 Cache:
 8-bit address and 1 word block
 Direct mapped: 16 sets, each line contains one word
 s = 4, b = 2, and tag = 2
 Access
 2 [00 0000 10]: set 0: miss
 3 [00 0000 11]: set 0: hit
 11 [00 0010 11]: set 2: miss
 16 [00 0100 00]: set 4: miss
 21 [00 0101 01]: set 5: miss
 13 [00 0011 01]: set 3: miss
 64 [01 0000 00]: set 0: miss
 48 [00 1100 00]: set 12: miss
 19 [00 0100 11]: set 4: hit
 11 [00 0010 11]: set 2: hit
 3 [00 0000 11]: set 0: miss
 22 [00 0101 10]: set 5: hit
 4 [00 0001 00]: set 1: miss
 27 [00 0110 11]: set 6: miss
 6 [00 0001 10]: set 1: hit
 11 [00 0010 11]: set 2: hit

Cache contents mid-trace, after the first three accesses (2, 3, 11):
Set  Valid  Tag  Block
0    1      00   M[0-3]
1    0
2    1      00   M[8-11]
3    0
4    0
5    0
6    0
7    0
8    0
9    0
A    0
B    0
C    0
D    0
E    0
F    0
Set Associative Caches
 Characterized by more than one line per set
[Figure: sets 0 through S-1, each with E = 2 <valid, tag, cache block> lines.]
Accessing Set Associative Caches
 Set selection
 Identical to direct-mapped cache
[Figure: the set index field of the address selects one set; here every set holds two lines.]
Accessing Set Associative Caches
 Line matching and word selection
 Must compare the tag in each valid line in the selected set.
 (1) The valid bit must be set.
 (2) The tag bits in one of the cache lines must match the tag bits in the address.
 (3) If (1) and (2), then cache hit, and the block offset selects the starting byte.
[Figure: selected set (i) holds <1, 1001, ...> and <1, 0110, w0 w1 w2 w3>; address <tag = 0110, set index = i, block offset = 100> matches the second line.]
Cache Example
 Problem
 32-bit address
 2K cache sets (2-way set associative)
 1 word per line (32-bit data)
 Byte addressable
 Memory address structure?
 Cache line structure?
 Cache size?
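A worked solution following the same method as the direct-mapped example (my arithmetic, assuming the same line format):
 Memory address: 2K sets gives s = 11 (2048 = 2^s), b = 2, tag = 32 - 11 - 2 = 19, so Tag (19) | Set (11) | Block (2)
 Cache line structure: 1 valid bit + 19 tag bits + 32 data bits = 52 bits per line
 Cache size: C = S x E x line = 2048 x 2 x 52 bits = 212,992 bits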
Cache access problem
 Cache:
 8-bit address and 1 word block
 Two-way set associative: 8 sets, each line contains one word
 Show the final content of the cache after the memory accesses
2, 3, 11, 16, 21, 13, 64, 48, 19, 11, 3, 22, 4, 27, 6, 11

Access trace (address [tag | set | offset]):
 2 [000 000 10]: miss
 3 [000 000 11]: hit
 11 [000 010 11]: miss
 16 [000 100 00]: miss
 21 [000 101 01]: miss
 13 [000 011 01]: miss
 64 [010 000 00]: miss
 48 [001 100 00]: miss
 19 [000 100 11]: hit
 11 [000 010 11]: hit
 3 [000 000 11]: hit
 22 [000 101 10]: hit
 4 [000 001 00]: miss
 27 [000 110 11]: miss
 6 [000 001 10]: hit
 11 [000 010 11]: hit

Note: unlike the direct-mapped cache, the second access to 3 now hits, because set 0 can hold both tag 000 (M[0-3]) and tag 010 (M[64-67]).

Final cache contents (two lines per set):
Set  V  Tag  B0     B1     B2     B3
0    1  000  M[0]   M[1]   M[2]   M[3]
     1  010  M[64]  M[65]  M[66]  M[67]
1    1  000  M[4]   M[5]   M[6]   M[7]
     0
2    1  000  M[8]   M[9]   M[10]  M[11]
     0
3    1  000  M[12]  M[13]  M[14]  M[15]
     0
4    1  000  M[16]  M[17]  M[18]  M[19]
     1  001  M[48]  M[49]  M[50]  M[51]
5    1  000  M[20]  M[21]  M[22]  M[23]
     0
6    1  000  M[24]  M[25]  M[26]  M[27]
     0
7    0
     0
Fully Associative Caches
 Characterized by only one set
[Figure: a single set 0 containing E = n <valid, tag, cache block> lines.]
Accessing Fully Associative Caches
 No set selection
 The address divides into just <tag> (t bits) and <block offset> (b bits); every line in the single set is a candidate.
[Figure: set 0 with all n lines; there is no set-index field in the address.]
Accessing Fully Associative Caches
 Line matching and word selection
 Must compare the tag in each valid line.
 (1) The valid bit must be set.
 (2) The tag bits in one of the cache lines must match the tag bits in the address.
 (3) If (1) and (2), then cache hit, and the block offset selects the starting byte.
[Figure: set 0 holds <1, 1001, ...> and <1, 0110, w0 w1 w2 w3>; address <tag = 0110, block offset = 100> matches the second line.]
Cache Example
 Problem
 32-bit address
 1K cache lines (fully associative)
 1 word per line (32-bit data)
 Byte addressable
 Memory address structure?
 Cache line structure?
 Cache size?
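A worked solution by the same method (my arithmetic; fully associative means there is no set-index field):
 Memory address: no set bits, b = 2, tag = 32 - 2 = 30, so Tag (30) | Block (2)
 Cache line structure: 1 valid bit + 30 tag bits + 32 data bits = 63 bits per line
 Cache size: C = 1024 x 63 bits = 64,512 bits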
Cache access problem
 Cache:
 8-bit address and 1 word block
 Fully associative: 8 lines, each containing one word
 Show the final content of the cache after the memory accesses
2, 3, 11, 16, 21, 13, 64, 48, 19, 11, 3, 22, 4, 27, 6, 11

Access trace (address [tag | offset], tag = 6 bits, offset = 2 bits):
 2 [000000 10]: miss
 3 [000000 11]: hit
 11 [000010 11]: miss
 16 [000100 00]: miss
 21 [000101 01]: miss
 13 [000011 01]: miss
 64 [010000 00]: miss
 48 [001100 00]: miss
 19 [000100 11]: hit
 11 [000010 11]: hit
 3 [000000 11]: hit
 22 [000101 10]: hit
 4 [000001 00]: miss (fills the last free line)
 27 [000110 11]: miss (cache full; M[0-3], the oldest block, is replaced)
 6 [000001 10]: hit
 11 [000010 11]: hit

Final cache contents:
Line  V  Tag     B0     B1     B2     B3
0     1  000110  M[24]  M[25]  M[26]  M[27]
1     1  000010  M[8]   M[9]   M[10]  M[11]
2     1  000100  M[16]  M[17]  M[18]  M[19]
3     1  000101  M[20]  M[21]  M[22]  M[23]
4     1  000011  M[12]  M[13]  M[14]  M[15]
5     1  010000  M[64]  M[65]  M[66]  M[67]
6     1  001100  M[48]  M[49]  M[50]  M[51]
7     1  000001  M[4]   M[5]   M[6]   M[7]
Multi-Level Caches
 Options: separate data and instruction caches, or a unified cache

[Figure: Processor (regs) -> L1 d-cache / L1 i-cache -> unified L2 cache -> memory -> disk]

            Regs     L1 d/i-caches   Unified L2    Memory        Disk
size:       200 B    8-64 KB         1-4 MB SRAM   128 MB DRAM   30 GB
speed:      3 ns     3 ns            6 ns          60 ns         8 ms
$/Mbyte:                             $100/MB       $1.50/MB      $0.05/MB
line size:  8 B      32 B            32 B          8 KB

larger, slower, cheaper ->
Intel Pentium Cache Hierarchy
 L1 Data: 16 KB, 4-way assoc, write-through, 32 B lines, 1-cycle latency
 L1 Instruction: 16 KB, 4-way assoc, 32 B lines
 L2 Unified: 128 KB - 2 MB, 4-way assoc, write-back, write allocate, 32 B lines
 Main Memory: up to 4 GB
[Figure: regs, both L1 caches, and the unified L2 sit on the processor chip; main memory is off-chip.]
Cache Performance Metrics
 Miss Rate
 Fraction of memory references not found in cache
(misses/references)
 Typical numbers:
 3-10% for L1
 can be quite small (e.g., < 1%) for L2, depending on size, etc.
 Hit Time
 Time to deliver a line in the cache to the processor (includes time to
determine whether the line is in the cache)
 Typical numbers:
 1-2 clock cycles for L1
 5-20 clock cycles for L2
 Miss Penalty
 Additional time required because of a miss
 Typically 25-100 cycles for main memory
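These three metrics combine into a single figure of merit, used in the arithmetic on the next slide (the formula is standard, not stated on the slide):
average access time = hit time + miss rate x miss penalty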
Let's think about those numbers
 Huge difference between a hit and a miss
 Could be 100x, if just L1 and main memory
 Would you believe 99% hits is twice as good as 97%?
 Consider:
cache hit time of 1 cycle
miss penalty of 100 cycles
 Average access time:
 97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
 99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles
 This is why “miss rate” is used instead of “hit rate”
Types of Cache Misses
 Cold (compulsory) miss
 Occurs on first access to a block
 Conflict miss
 Most hardware caches limit blocks to a small subset
(sometimes a singleton) of the available cache slots
 e.g., block i must be placed in slot (i mod 4)
 Conflict misses occur when the cache is large enough, but
multiple data objects all map to the same slot
 e.g., referencing blocks 0, 8, 0, 8, ... would miss every time
 Capacity miss
 Occurs when the set of active cache blocks (working set) is
larger than the cache
Why Caches Work
 Locality: programs tend to use data and instructions with addresses near or equal to those they have used recently

 Temporal locality:
 Recently referenced items are likely to be referenced again in the near future

 Spatial locality:
 Items with nearby addresses tend to be referenced close together in time
Example: Locality?
sum = 0;
for (i = 0; i < n; i++)
sum += a[i];
return sum;

 Data:
 Temporal: sum referenced in each iteration
 Spatial: array a[] accessed in stride-1 pattern
 Instructions:
 Temporal: cycle through loop repeatedly
 Spatial: reference instructions in sequence
 Being able to assess the locality of code is a crucial skill
for a programmer
Locality Example #1
int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
Locality Example #2
int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;

    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
Locality Example #3
int sum_array_3d(int a[M][N][N])
{
    int i, j, k, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                sum += a[k][i][j];
    return sum;
}

 How can it be fixed?
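One possible fix (my sketch, not from the slides): reorder the loops so that each subscript is driven by a loop over its own dimension and the innermost loop walks the last, contiguous dimension.

int sum_array_3d_fixed(int a[M][N][N])
{
    int i, j, k, sum = 0;

    /* Loop nest reordered: k (outer) covers the first dimension, and the
       innermost loop (j) walks the last subscript -> stride-1 accesses. */
    for (k = 0; k < M; k++)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                sum += a[k][i][j];
    return sum;
}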


Writing Cache Friendly Code
 Repeated references to variables are good (temporal locality)
 Stride-1 reference patterns are good (spatial locality)
 Examples:
 cold cache, 4-byte words, 4-word cache blocks

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

Miss rate = 1/4 = 25%

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;

    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Miss rate = 100%


The Memory Mountain
 Read throughput (read bandwidth)
 Number of bytes read from memory per second (MB/s)
 Memory mountain
 Measured read throughput as a function of spatial and
temporal locality.
 Compact way to characterize memory system performance.
Memory Mountain Test Function
/* The test function */
void test(int elems, int stride) {
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result; /* So compiler doesn't optimize away the loop */
}

/* Run test(elems, stride) and return read throughput (MB/s) */
double run(int size, int stride, double Mhz)
{
    double cycles;
    int elems = size / sizeof(int);

    test(elems, stride);                     /* warm up the cache */
    cycles = fcyc2(test, elems, stride, 0);  /* call test(elems,stride) */
    return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */
}
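A note on that conversion (my unpacking of the slide's comment): the loop reads roughly elems/stride 4-byte ints, i.e. size/stride bytes, and cycles/Mhz is the elapsed time in microseconds, so (size/stride) / (cycles/Mhz) is bytes per microsecond, which equals MB/s.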
Memory Mountain Main Routine
/* mountain.c - Generate the memory mountain. */
#define MINBYTES (1 << 10) /* Working set size ranges from 1 KB */
#define MAXBYTES (1 << 23) /* ... up to 8 MB */
#define MAXSTRIDE 16 /* Strides range from 1 to 16 */
#define MAXELEMS MAXBYTES/sizeof(int)

int data[MAXELEMS]; /* The array we'll be traversing */

int main()
{
int size; /* Working set size (in bytes) */
int stride; /* Stride (in array elements) */
double Mhz; /* Clock frequency */

init_data(data, MAXELEMS); /* Initialize each element in data to 1 */


Mhz = mhz(0); /* Estimate the clock frequency */
for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
for (stride = 1; stride <= MAXSTRIDE; stride++)
printf("%.1f\t", run(size, stride, Mhz));
printf("\n");
}
exit(0);
}
The Memory Mountain
[Figure: measured memory mountain for a 550 MHz Pentium III Xeon with a 16 KB on-chip L1 d-cache, 16 KB on-chip L1 i-cache, and 512 KB off-chip unified L2 cache. Read throughput (MB/s, 0-1200) is plotted against stride (s1-s15, in words) and working set size (2k-8m bytes); ridges of temporal locality run along the working-set axis, and slopes of spatial locality run along the stride axis.]
Ridges of Temporal Locality
 Slice through the memory mountain with stride = 1
 Illuminates the read throughputs of the different caches and memory
[Figure: read throughput (MB/s, 0-1200) vs. working set size (1k-8m bytes); three plateaus mark the main memory region, the L2 cache region, and the L1 cache region.]
A Slope of Spatial Locality
 Slice through the memory mountain with size = 256 KB
 Shows the cache block size.
[Figure: read throughput (MB/s, 0-800) vs. stride (s1-s16, in words); throughput falls as stride grows until there is one access per cache line, then levels off.]
Matrix Multiplication Example
 Major cache effects to consider
 Total cache size
 Exploit temporal locality and keep the working set small (e.g., by using blocking)
 Block size
 Exploit spatial locality
 Description:
 Multiply N x N matrices
 O(N^3) total operations
 Accesses:
 N reads per source element
 N values summed per destination
 but may be able to hold in register

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;                /* Variable sum held in register */
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
Miss Rate Analysis for Matrix Multiply
 Assume:
 Line size = 32 B (big enough for 4 64-bit words)
 Matrix dimension (N) is very large
 Approximate 1/N as 0.0
 Cache is not even big enough to hold multiple rows
 Analysis method:
 Look at the access pattern of the inner loop
[Figure: the inner loop sweeps A along a row, B along a column, and touches a single element of C.]
Layout of C Arrays in Memory (review)
 C arrays allocated in row-major order
 each row in contiguous memory locations
 Stepping through columns in one row:
 for (i = 0; i < N; i++)
sum += a[0][i];
 accesses successive elements
 if block size (B) > 4 bytes, exploit spatial locality
 compulsory miss rate = 4 bytes / B
 Stepping through rows in one column:
 for (i = 0; i < n; i++)
sum += a[i][0];
 accesses distant elements
 no spatial locality!
 compulsory miss rate = 1 (i.e. 100%)
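To connect this to the analyses below (my arithmetic from the stated assumptions): with 32-byte lines and 8-byte doubles, a stride-1 row sweep misses at rate 8/32 = 0.25, while a column sweep misses at rate 1.0; these are exactly the per-array rates in the following slides.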
Matrix Multiplication (ijk)

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: a[i][k] sweeps row (i,*) of A row-wise; b[k][j] sweeps column (*,j) of B column-wise; c[i][j] stays fixed at (i,j).

Misses per inner loop iteration:
A     B     C
0.25  1.0   0.0
Matrix Multiplication (jik)

/* jik */
for (j=0; j<n; j++) {
    for (i=0; i<n; i++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A row-wise, B column-wise, C fixed.

Misses per inner loop iteration:
A     B     C
0.25  1.0   0.0
Matrix Multiplication (kij)

/* kij */
for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: a[i][k] fixed at (i,k); b[k][j] sweeps row (k,*) of B row-wise; c[i][j] sweeps row (i,*) of C row-wise.

Misses per inner loop iteration:
A    B     C
0.0  0.25  0.25
Matrix Multiplication (ikj)

/* ikj */
for (i=0; i<n; i++) {
    for (k=0; k<n; k++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A fixed, B row-wise, C row-wise.

Misses per inner loop iteration:
A    B     C
0.0  0.25  0.25
Matrix Multiplication (jki)

/* jki */
for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: a[i][k] sweeps column (*,k) of A column-wise; b[k][j] fixed at (k,j); c[i][j] sweeps column (*,j) of C column-wise.

Misses per inner loop iteration:
A    B    C
1.0  0.0  1.0
Matrix Multiplication (kji)

/* kji */
for (k=0; k<n; k++) {
    for (j=0; j<n; j++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A column-wise, B fixed, C column-wise.

Misses per inner loop iteration:
A    B    C
1.0  0.0  1.0
Summary of Matrix Multiplication

ijk (& jik): • 2 loads, 0 stores • misses/iter = 1.25
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

kij (& ikj): • 2 loads, 1 store • misses/iter = 0.5
for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

jki (& kji): • 2 loads, 1 store • misses/iter = 2.0
for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}
Pentium Matrix Multiply Performance
 Miss rates are helpful but not perfect predictors.
 Code scheduling matters, too.
[Figure: cycles/iteration (0-60) vs. array size n (25-400) for kji, jki, kij, ikj, jik, and ijk.]
Improving Temporal Locality by Blocking
 Example: blocked matrix multiplication
 "block" (in this context) does not mean "cache block".
 Instead, it means a sub-block within the matrix.
 Example: N = 8; sub-block size = 4

A11 A12     B11 B12     C11 C12
         x           =
A21 A22     B21 B22     C21 C22

Key idea: sub-blocks (i.e., Axy) can be treated just like scalars.

C11 = A11*B11 + A12*B21    C12 = A11*B12 + A12*B22
C21 = A21*B11 + A22*B21    C22 = A21*B12 + A22*B22
Blocked Matrix Multiply (bijk)

for (jj=0; jj<n; jj+=bsize) {
    for (i=0; i<n; i++)
        for (j=jj; j < min(jj+bsize,n); j++)
            c[i][j] = 0.0;
    for (kk=0; kk<n; kk+=bsize) {
        for (i=0; i<n; i++) {
            for (j=jj; j < min(jj+bsize,n); j++) {
                sum = 0.0;
                for (k=kk; k < min(kk+bsize,n); k++) {
                    sum += a[i][k] * b[k][j];
                }
                c[i][j] += sum;
            }
        }
    }
}
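A note on choosing bsize (general guidance, not from the slides): blocking pays off when the inner loops' working set, roughly one bsize x bsize block of B plus a row sliver each of A and C, fits in the cache; the performance results later use bsize = 25.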
Blocked Matrix Multiply Analysis
 Innermost loop pair multiplies a 1 x bsize sliver of A by a bsize x bsize block of B and accumulates into a 1 x bsize sliver of C
 Loop over i steps through n row slivers of A & C, using the same block of B

Innermost loop pair:
for (i=0; i<n; i++) {
    for (j=jj; j < min(jj+bsize,n); j++) {
        sum = 0.0;
        for (k=kk; k < min(kk+bsize,n); k++) {
            sum += a[i][k] * b[k][j];
        }
        c[i][j] += sum;
    }
}

[Figure: the row sliver of A is accessed bsize times; the block of B is reused n times in succession; successive elements of the C sliver are updated.]
Pentium Blocked Matrix Multiply Performance
 Blocking (bijk and bikj) improves performance by a factor of two over the unblocked versions (ijk and jik)
 relatively insensitive to array size.
[Figure: cycles/iteration (0-60) vs. array size n (25-400) for kji, jki, kij, ikj, jik, ijk, bijk (bsize = 25), and bikj (bsize = 25); the blocked curves stay flat across array sizes.]

Concluding Observations
 Programmer can optimize for cache performance
 How data structures are organized
 How data are accessed
 Nested loop structure
 Blocking is a general technique
 All systems favor “cache friendly code”
 Getting absolute optimum performance is very platform
specific
 Cache sizes, line sizes, associativities, etc.
 Can get most of the advantage with generic code
 Keep working set reasonably small (temporal locality)
 Use small strides (spatial locality)
