
Addis Ababa University

Addis Ababa Institute of Technology


School of Electrical and Computer Engineering

ECEG 6503 - Advanced Computer Architecture

Assignment Two

This assignment is adapted from the book by Hennessy and Patterson.

1. Memory Optimization
Compiler optimizations can improve the performance of the memory subsystem. In this
exercise you will run two experiments to observe this improvement.
a. Loop Interchange (a kind of)
i. Implement a simple matrix multiplication program. Use matrices larger
than 2048x2048. Measure the run time.
ii. Transpose matrix B and change the matrix multiplication algorithm to
accommodate this change. Measure the transposition time and the
multiplication time together. Comment on the results.
b. Blocking
i. For the same matrix size stated above, use the blocking method and algorithm
described on pages 107-109 of the textbook and measure the run time for
block sizes of 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024.
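As a starting point for both experiments, the following is a minimal sketch in C, not the textbook's exact code. It assumes square row-major matrices of double, a dimension n that is a multiple of the block size bs, and timing with the standard clock() function. All function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* CPU time in seconds. clock() is standard C but coarse. */
double seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Baseline: C = A * B for n x n row-major matrices. */
void matmul_naive(const double *A, const double *B, double *C, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[(size_t)i * n + k] * B[(size_t)k * n + j];
            C[(size_t)i * n + j] = sum;
        }
}

/* Writes the transpose of B into Bt. */
void transpose(const double *B, double *Bt, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            Bt[(size_t)j * n + i] = B[(size_t)i * n + j];
}

/* Same product, but Bt is read row by row, so both operands are unit-stride. */
void matmul_transposed(const double *A, const double *Bt, double *C, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[(size_t)i * n + k] * Bt[(size_t)j * n + k];
            C[(size_t)i * n + j] = sum;
        }
}

/* Blocked product: accumulates partial sums over bs-wide tiles of j and k.
   Assumes n is a multiple of bs. */
void matmul_blocked(const double *A, const double *B, double *C, int n, int bs)
{
    assert(bs > 0 && n % bs == 0);
    for (size_t idx = 0; idx < (size_t)n * n; idx++)
        C[idx] = 0.0;
    for (int jj = 0; jj < n; jj += bs)
        for (int kk = 0; kk < n; kk += bs)
            for (int i = 0; i < n; i++)
                for (int j = jj; j < jj + bs; j++) {
                    double sum = 0.0;
                    for (int k = kk; k < kk + bs; k++)
                        sum += A[(size_t)i * n + k] * B[(size_t)k * n + j];
                    C[(size_t)i * n + j] += sum;
                }
}
```

To time a variant, read seconds() before and after the call and subtract. For the transposed variant, keep the call to transpose inside the timed region, since item a.ii asks for both times together.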
2. Memory access behavior study
In this study you will follow the case study (case study 2) described on pages 150-152.
Answer question 2.4 using the plots from your experiment.

3. Vector Processing
Most modern processors come equipped with vector processing units to enhance the performance of
data-parallel applications. Do the following experiments and report your findings.
a. Simple vector addition: given the following vector addition function, measure the time
it takes to complete. Vary size over 1024, 2048, 4096, 8192, 16384, 32768, 65536,
131072, 262144, 524288, 1048576, 2097152, 4194304, 8388608, 16777216,
33554432, 67108864.
void vecAdd(int *A, int *B, int *C, int size)
{
    for (int i = 0; i < size; i++)
        C[i] = A[i] + B[i];
}
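
One possible way to take the measurement is sketched below. The helper time_vec_add and its reps parameter are hypothetical names introduced here, and clock() is assumed to be an acceptable timer. vecAdd is repeated so that the block compiles on its own. With aggressive optimization a compiler may still simplify the repeated calls, so treat this as a starting point only.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

void vecAdd(int *A, int *B, int *C, int size)
{
    for (int i = 0; i < size; i++)
        C[i] = A[i] + B[i];
}

/* Returns the average CPU time in seconds of one vecAdd call on vectors of
   `size` elements, or -1.0 if allocation fails. `reps` repeats the call so
   that small sizes take long enough to be measured by clock(). */
double time_vec_add(int size, int reps)
{
    assert(size > 0 && reps > 0);
    int *A = malloc((size_t)size * sizeof *A);
    int *B = malloc((size_t)size * sizeof *B);
    int *C = malloc((size_t)size * sizeof *C);
    if (!A || !B || !C) {
        free(A);
        free(B);
        free(C);
        return -1.0;
    }
    for (int i = 0; i < size; i++) {
        A[i] = i;
        B[i] = 1;
    }

    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        vecAdd(A, B, C, size);
    clock_t end = clock();

    /* Read one result so the additions are not trivially dead code. */
    volatile int sink = C[size - 1];
    (void)sink;

    free(A);
    free(B);
    free(C);
    return (double)(end - start) / CLOCKS_PER_SEC / reps;
}
```

Calling time_vec_add once per listed size, for example time_vec_add(1024, 1000) and time_vec_add(67108864, 1), gives one data point per size for the report.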

b. Follow the same process as in a, but use loop unrolling. How do the results differ from
those in a?

for (int i = 0; i < size; i += 4)   // assumes size is a multiple of 4
{
    C[i]   = A[i]   + B[i];
    C[i+1] = A[i+1] + B[i+1];
    C[i+2] = A[i+2] + B[i+2];
    C[i+3] = A[i+3] + B[i+3];
}

c. In this experiment we will implement the same vector addition task using the
vector processing capability of the processor. Most Intel CPUs have this
capability. Compare and contrast the results with those from a and b. Make sure to
include the header file <emmintrin.h>.

void vecAddRealSSE(float *a, float *b, float *c, int N)
{
    // a, b and c must be 16-byte aligned and N must be a multiple of 4
    for (int i = 0; i < N; i += 4)
    {
        __m128 sse_a = _mm_load_ps(&a[i]);        // load 4 floats into a vector register
        __m128 sse_b = _mm_load_ps(&b[i]);        // load 4 floats into a vector register
        __m128 sse_c = _mm_add_ps(sse_a, sse_b);  // add the two vectors
        _mm_store_ps(&c[i], sse_c);               // store the result back
    }
}
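
Because _mm_load_ps and _mm_store_ps require 16-byte-aligned addresses, the buffers passed to the SSE function need aligned allocation. The sketch below shows one possible setup. It assumes an x86 compiler whose intrinsics headers provide _mm_malloc and _mm_free. The helpers alloc_aligned_floats and check_sse_add are hypothetical names. The SSE function is repeated so that the block compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>

void vecAddRealSSE(float *a, float *b, float *c, int N)
{
    for (int i = 0; i < N; i += 4) {
        __m128 sse_a = _mm_load_ps(&a[i]);
        __m128 sse_b = _mm_load_ps(&b[i]);
        __m128 sse_c = _mm_add_ps(sse_a, sse_b);
        _mm_store_ps(&c[i], sse_c);
    }
}

/* Allocates n floats on a 16-byte boundary. Release with _mm_free. */
float *alloc_aligned_floats(int n)
{
    return (float *)_mm_malloc((size_t)n * sizeof(float), 16);
}

/* Runs the SSE addition on aligned buffers and returns 1 if every element
   equals the scalar sum, 0 otherwise (including allocation failure). */
int check_sse_add(int n)
{
    assert(n > 0 && n % 4 == 0);
    float *a = alloc_aligned_floats(n);
    float *b = alloc_aligned_floats(n);
    float *c = alloc_aligned_floats(n);
    int ok = (a && b && c);

    if (ok) {
        for (int i = 0; i < n; i++) {
            a[i] = (float)i;
            b[i] = 1.0f;
        }
        vecAddRealSSE(a, b, c, n);
        for (int i = 0; i < n; i++)
            if (c[i] != a[i] + b[i])
                ok = 0;
    }
    _mm_free(a);
    _mm_free(b);
    _mm_free(c);
    return ok;
}
```

A quick call such as check_sse_add(1024) before timing confirms that the vectorized version produces the same sums as scalar addition. The timed runs can then reuse buffers from alloc_aligned_floats.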
