
Homework #6

6.41
You are writing a new 3D game that you hope will earn you fame and fortune. You are
currently working on a function to blank the screen buffer before drawing the next frame.
The screen you are working with is a 640 × 480 array of pixels. The machine you are
working on has a 64 KB direct-mapped cache with 4-byte lines. The C structures you are
using are as follows:

struct pixel {
    char r;
    char g;
    char b;
    char a;
};

struct pixel buffer[480][640];
int i, j;
char *cptr;
int *iptr;
Assume the following:
sizeof(char) = 1 and sizeof(int) = 4.
buffer begins at memory address 0.
The cache is initially empty.
The only memory accesses are to the entries of the array buffer. Variables i, j, cptr, and
iptr are stored in registers.

What percentage of writes in the following code will miss in the cache?

for (j = 0; j < 640; j++) {
    for (i = 0; i < 480; i++) {
        buffer[i][j].r = 0;
        buffer[i][j].g = 0;
        buffer[i][j].b = 0;
        buffer[i][j].a = 0;
    }
}

Solution:
Given the C structures and the memory layout, the code writes the buffer in column-major
order, so successive pixels touched by the loop are not contiguous in memory. However,
the pixel structure is 4 bytes long and buffer starts at address 0, so each pixel is
aligned to, and exactly fills, one 4-byte cache line. The first write to a pixel (the r
byte) therefore misses, and under a write-allocate policy the cache fetches that pixel's
4-byte line, which covers its r, g, b, and a bytes. The next three writes, to the g, b,
and a bytes of the same pixel, hit in the line just fetched. Note that because each line
holds only a single pixel, the traversal order does not matter here: there is no spatial
locality across pixels to exploit, and every pixel's first write is a compulsory miss.
Thus, of every group of four writes (to r, g, b, and a), exactly one misses, giving a
miss rate of 1 out of 4 writes. So the percentage of writes in the provided code that
will miss in the cache is 25%.

6.45
In this assignment, you will apply the concepts you learned in Chapters 5 and 6 to the
problem of optimizing code for a memory-intensive application. Consider a procedure to
copy and transpose the elements of an N × N matrix of type int. That is, for source
matrix S and destination matrix D, we want to copy each element S[i][j] to D[j][i].
This code can be written with a simple loop,

void transpose(int *dst, int *src, int dim)
{
    int i, j;

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[j*dim + i] = src[i*dim + j];
}
where the arguments to the procedure are pointers to the destination (dst) and source (src)
matrices, as well as the matrix size N (dim). Your job is to devise a transpose routine that
runs as fast as possible.

Solution:
The given code transposes the matrix with a double loop that accesses the src array in
row-major order but the dst array in column-major order. Each write to dst strides
dim * 4 bytes through memory, touching a different cache line on every iteration; for
large matrices, a fetched line is usually evicted before its other words are used, so
the writes get almost no benefit from spatial locality.
We will use blocking (or tiling) as the optimization technique. The matrix is divided
into smaller sub-matrices (blocks) small enough that a block of src and the
corresponding block of dst fit in the cache together, and each block is transposed
completely before moving on. BLOCK_SIZE is a tunable parameter, and we set it to the
size that gives the best performance on our specific cache architecture.
Here is an improved version of the transpose function using blocking:

#define BLOCK_SIZE 16

void transpose(int *dst, int *src, int dim)
{
    int i, j, m, n;

    for (i = 0; i < dim; i += BLOCK_SIZE) {
        for (j = 0; j < dim; j += BLOCK_SIZE) {
            /* Transpose one BLOCK_SIZE x BLOCK_SIZE tile; the "< dim"
               tests handle tiles that overhang the matrix edge. */
            for (m = i; m < i + BLOCK_SIZE && m < dim; ++m) {
                for (n = j; n < j + BLOCK_SIZE && n < dim; ++n) {
                    dst[n*dim + m] = src[m*dim + n];
                }
            }
        }
    }
}
