
ADDIS ABABA INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING

ECEG-6518 PARALLEL COMPUTING

Assignment 5: OpenCL Optimization

BY: RAHEL ABERA
ID: GSR/0519/08
STREAM: COMPUTER ENGINEERING

Submitted To: Fitsum Assamnew
Submission Date: May 17, 2016
OBJECTIVE

The objective of this assignment is to learn about OpenCL optimization
strategies. This is achieved by analyzing the benefit of memory coalescing and
the benefit of caching in local memory. The benefit of memory coalescing is
examined with offset and stride access patterns, and the benefit of caching in
local memory is examined by comparing simple and cached matrix multiplication.

PROBLEM STATEMENT

This assignment addresses memory optimization when implementing programs in
OpenCL. Coalesced global memory access is one optimization strategy: when
global memory is accessed, misaligned (offset) and strided accesses hurt
performance, so in an OpenCL implementation the execution time is affected by
offset and stride access patterns.

The other memory optimization method is to use local memory as a cache.
Caching data in local memory reduces global memory accesses, and local memory
can also be used to avoid non-coalesced global memory accesses, which in turn
improves performance.

In this assignment, the effect of offset and stride access on memory
coalescing is shown using OpenCL. The efficiency of the simple (uncached)
matrix multiplication is also compared with that of the cached, coalesced
matrix multiplication.
METHODOLOGY

This assignment first requires a machine configured for OpenCL.

To see the benefit of memory coalescing, kernels with offset and stride
accesses are written. For the offset access, a kernel that accepts two arrays
of N = 67108864 chars and an offset is written, as given in the lecture notes.
The time it takes to run the kernel is then measured while varying the offset
from 0 to 16. Likewise, for the stride access, the kernel time is measured
while varying the stride from 0 to 16. The <ctime> library is used to measure
the time.
To see the benefit of caching in local memory, simple and cached versions of
the multiplication of two 1024x1024 floating-point matrices are implemented in
OpenCL. The time it takes to complete the multiplication is then measured
while varying the work-group size.
The experimental setup and tools used in this assignment:

Processor: Intel(R) Core i7 CPU Q720 @ 1.60 GHz
Installed memory (RAM): 4.00 GB
System type: 64-bit Operating System
IDE: Visual Studio 2012
Programming language: OpenCL 1.2, C++
Additional SDK: Intel SDK for OpenCL
Graphics card: ATI Radeon HD 5650 (Redwood)
RESULTS AND DISCUSSION

Offset access results (kernel time in seconds):

offset   kernel time
0        0.083
1        0.041
2        0.064
3        0.041
4        0.068
5        0.041
6        0.064
7        0.041
8        0.064
9        0.04
10       0.064
11       0.041
12       0.07
13       0.04
14       0.061
15       0.041
16       0.06

[Figure: kernel time (s) vs. offset, plotted from the table above]

Stride access results (kernel time in seconds):

stride   kernel time
0        0.111
1        0.041
2        0.067
3        0.044
4        0.075
5        0.044
6        0.067
7        0.045
8        0.07
9        0.046
10       0.072
11       0.045
12       0.071
13       0.045
14       0.072
15       0.045
16       0.078

[Figure: kernel time (s) vs. stride, plotted from the table above]
It is expected that performance decreases as the offset and stride values
increase: a larger stride makes consecutive work-items jump across more
memory-block boundaries, so fewer accesses can be coalesced and performance
drops. However, the measured results do NOT follow this expectation. This may
be due to the machine used for the experiment; another possible factor is that
the host-side clock() timer is too coarse to resolve the differences reliably.

For the cached and uncached versions of matrix multiplication, no output could
be obtained: the machine reports a kernel execution time of ZERO, which may
mean the kernel launch failed silently or the timer could not resolve it.

Therefore, neither the benefit of memory coalescing nor the benefit of caching
in local memory could be observed.
CONCLUSION

In this assignment, kernels with offset and stride access patterns were
implemented in OpenCL to measure their impact on performance. In addition,
simple (uncached) and cached versions of matrix multiplication were
implemented in OpenCL to measure the benefit of caching in local memory.

However, because the machine used for the experiment did not produce the
expected output, neither the benefit of memory coalescing nor the benefit of
caching in local memory could be seen in the results. A properly configured
machine is needed to observe these effects.
APPENDIX A: OpenCL code for offset access

#include<stdio.h>
#include<CL/cl.h>
#include<ctime>
#include<iostream>
using namespace std;

const char *source =


"__kernel void offsetCopy (__global char *a, \n"
" __global const char *b, \n"
" const int offset) \n"
"{ \n"
" int gid = get_global_id(0)+offset; \n"
" a[gid] = b[gid]; \n"
"} \n" ;

int main() {
int N = 67108864;
for(int offset = 0; offset <= 16; offset++) {
cout <<"with offset value: " << offset << endl ;

cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);

cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

cl_context context = clCreateContext(0, 1, &device, NULL, NULL, NULL);


cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

cl_program program = clCreateProgramWithSource(context, 1, &source, NULL,


NULL);
clBuildProgram (program, 1, &device, NULL, NULL, NULL);

cl_kernel kernel = clCreateKernel(program, "offsetCopy", NULL);

cl_char *a = (cl_char *) malloc(N*sizeof(cl_char));


cl_char *b = (cl_char *) malloc(N*sizeof(cl_char));

int i;
for(i = 0; i < N; i++){
a[i] = i;
b[i] = N - i;
}

double elapsed ;

cl_mem b_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY |


CL_MEM_COPY_HOST_PTR, N*sizeof(cl_char), b, NULL);
cl_mem a_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
N*sizeof(cl_char), NULL, NULL);

size_t global_work_size = N/16;

clSetKernelArg(kernel, 0, sizeof(a_buffer), (void*) &a_buffer);


clSetKernelArg(kernel, 1, sizeof(b_buffer), (void*) &b_buffer);
clSetKernelArg(kernel, 2, sizeof(cl_int), (void*) &offset);

clock_t start=clock();
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_work_size, NULL, 0,
NULL, NULL);
clFinish(queue);
clock_t end=clock();
elapsed = double (end - start) / CLOCKS_PER_SEC ;
cout << "the time it takes to run the kernel is: " << elapsed << endl << endl;

clEnqueueReadBuffer(queue, a_buffer, CL_TRUE, 0, N * sizeof(cl_char), a, 0,


NULL, NULL);

free(a);
free(b);

clReleaseMemObject(a_buffer);
clReleaseMemObject(b_buffer);

clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);
}

system("PAUSE");
}
APPENDIX B: OpenCL code for stride access

#include<stdio.h>
#include<CL/cl.h>
#include<ctime>
#include<iostream>
using namespace std;

const char *source =


"__kernel void strideCopy (__global char *a, \n"
" __global const char *b, \n"
" const int stride) \n"
"{ \n"
" int gid = get_global_id(0)*stride; \n"
" a[gid] = b[gid]; \n"
"} \n" ;

int main() {
int N = 67108864;
for(int stride = 0; stride <= 16; stride++) {
cout <<"with stride value: " << stride << endl ;

cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);

cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

cl_context context = clCreateContext(0, 1, &device, NULL, NULL, NULL);


cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

cl_program program = clCreateProgramWithSource(context, 1, &source, NULL,


NULL);
clBuildProgram (program, 1, &device, NULL, NULL, NULL);

cl_kernel kernel = clCreateKernel(program, "strideCopy", NULL);

cl_char *a = (cl_char *) malloc(N*sizeof(cl_char));


cl_char *b = (cl_char *) malloc(N*sizeof(cl_char));

int i;
for(i = 0; i < N; i++){
a[i] = i;
b[i] = N - i;
}

double elapsed ;

cl_mem b_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY |


CL_MEM_COPY_HOST_PTR, N*sizeof(cl_char), b, NULL);
cl_mem a_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
N*sizeof(cl_char), NULL, NULL);

size_t global_work_size = N/16;


clSetKernelArg(kernel, 0, sizeof(a_buffer), (void*) &a_buffer);
clSetKernelArg(kernel, 1, sizeof(b_buffer), (void*) &b_buffer);
clSetKernelArg(kernel, 2, sizeof(cl_int), (void*) &stride);

clock_t start=clock();
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_work_size, NULL, 0,
NULL, NULL);
clFinish(queue);
clock_t end=clock();
elapsed = double (end - start) / CLOCKS_PER_SEC ;
cout << "the time it takes to run the kernel is: " << elapsed << endl << endl;

clEnqueueReadBuffer(queue, a_buffer, CL_TRUE, 0, N * sizeof(cl_char), a, 0,


NULL, NULL);

free(a);
free(b);

clReleaseMemObject(a_buffer);
clReleaseMemObject(b_buffer);

clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);
}

system("PAUSE");
}
APPENDIX C: OpenCL code for simple matrix multiply

#include<stdio.h>
#include<CL/cl.h>
#include<ctime>
#include<iostream>
using namespace std;

const char *source =


"__kernel void simpleMultiply (__global float* a, \n"
" __global float* b, \n"
" __global float* c, \n"
" int N, \n"
" int TILE_DIM) \n"
"{ \n"
" int row = get_global_id(1); \n"
" int col = get_global_id(0); \n"
" float sum = 0.0f; \n"
" for (int i = 0; i < TILE_DIM; i++){ \n"
" sum += a[row*TILE_DIM+i]* b[i*N+col]; \n"
" } \n"
" c[row*N+col] = sum; \n"
"} \n";

int main() {
int N = 1024;
for(int workgroup = 4; workgroup <= 256; workgroup*=2) {
cout << "with work group size: " << workgroup << endl;

cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);

cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

cl_context context = clCreateContext(0, 1, &device, NULL, NULL, NULL);


cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

cl_program program = clCreateProgramWithSource(context, 1, &source, NULL,


NULL);
clBuildProgram (program, 1, &device, NULL, NULL, NULL);

cl_kernel kernel = clCreateKernel(program, "simpleMultiply", NULL);

cl_float *a = (cl_float *) malloc(N*N*sizeof(cl_float));


cl_float *b = (cl_float *) malloc(N*N*sizeof(cl_float));

int i;
for(i = 0; i < N*N; i++){
a[i] = i;
b[i] = N - i;
}

double elapsed ;
cl_mem a_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY |
CL_MEM_COPY_HOST_PTR, N*N*sizeof(cl_float), a, NULL);
cl_mem b_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY |
CL_MEM_COPY_HOST_PTR, N*N*sizeof(cl_float), b, NULL);
cl_mem c_buffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
N*N*sizeof(cl_float), NULL, NULL);

// a and b are already initialized on the device via CL_MEM_COPY_HOST_PTR,
// so no separate clEnqueueWriteBuffer calls are needed

// 2-D NDRange covering the whole N x N matrix; note that the device limit
// CL_DEVICE_MAX_WORK_GROUP_SIZE caps workgroup*workgroup
size_t global_work_size[2] = { (size_t)N, (size_t)N };
size_t local_work_size[2] = { (size_t)workgroup, (size_t)workgroup };
clock_t start=clock();
clSetKernelArg(kernel, 0, sizeof(a_buffer), (void*) &a_buffer);
clSetKernelArg(kernel, 1, sizeof(b_buffer), (void*) &b_buffer);
clSetKernelArg(kernel, 2, sizeof(c_buffer), (void*) &c_buffer);
clSetKernelArg(kernel, 3, sizeof(cl_int), (void*) &N);
clSetKernelArg(kernel, 4, sizeof(cl_int), (void*) &workgroup);

clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_work_size,
local_work_size, 0, NULL, NULL);
clFinish(queue);
clock_t end=clock();
elapsed = double (end - start) / CLOCKS_PER_SEC ;
cout << "the time it takes to run the kernel with work group size " << workgroup
<< " is " << elapsed << endl << endl;

cl_float *c = (cl_float *) malloc(N*N*sizeof(cl_float));


clEnqueueReadBuffer(queue, c_buffer, CL_TRUE, 0, N *N* sizeof(cl_float), c,
0, NULL, NULL);

free(a);
free(b);
free(c);
clReleaseMemObject(a_buffer);
clReleaseMemObject(b_buffer);
clReleaseMemObject(c_buffer);
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);
}

system("PAUSE");
}
APPENDIX D: OpenCL code for cached matrix multiply
#include<stdio.h>
#include<CL/cl.h>
#include<ctime>
#include<iostream>
using namespace std;

const char *source =


"__kernel void simpleMultiply (__global float* a, \n"
" __global float* b, \n"
" __global float* c, \n"
" int N, \n"
" int TILE_DIM) \n"
"{ \n"
" int row = get_global_id(1); \n"
" int col = get_global_id(0); \n"
" float sum = 0.0f; \n"
" for (int i = 0; i < TILE_DIM; i++){ \n"
" sum += a[row*TILE_DIM+i]* b[i*N+col]; \n"
" } \n"
" c[row*N+col] = sum; \n"
"} \n";

const char *source1 =


"__kernel void coalescedMultiply (__global float* a, \n"
" __global float* b, \n"
" __global float* c, \n"
" int N, \n"
" int TILE_DIM, \n"
" __local float* aTile) \n"
"{ \n"
" int row = get_global_id(1); \n"
" int col = get_global_id(0); \n"
" float sum = 0.0f; \n"
" int x = get_local_id(0); \n"
" int y = get_local_id(1); \n"
" aTile[y*TILE_DIM+x] = a[row*TILE_DIM+x]; \n"
" barrier(CLK_LOCAL_MEM_FENCE); \n"
" for (int i = 0; i < TILE_DIM; i++){ \n"
" sum += aTile[y*TILE_DIM+i]* b[i*N+col]; \n"
" } \n"
" c[row*N+col] = sum; \n"
"} \n" ;

int main (int argc, char** argv) {


int N = 1024; // 1024x1024 matrices, as stated in the methodology
for(int workgroup = 4; workgroup <= 256; workgroup*=2) {
cout << "with work group size: " << workgroup << endl;

cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);

cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

cl_context context = clCreateContext(0, 1, &device, NULL, NULL, NULL);


cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

cl_program program = clCreateProgramWithSource(context, 1, &source1, NULL,


NULL);
clBuildProgram (program, 1, &device, NULL, NULL, NULL);

//cl_kernel kernel = clCreateKernel(program, "simpleMultiply", NULL);


cl_kernel kernel = clCreateKernel(program, "coalescedMultiply", NULL);

cl_float *a = (cl_float *) malloc(N*N*sizeof(cl_float));


cl_float *b = (cl_float *) malloc(N*N*sizeof(cl_float));
// the __local aTile storage is allocated by clSetKernelArg below;
// no host-side allocation is needed
int i;
for(i = 0; i < N*N; i++){
a[i] = i;
b[i] = N - i;
}

double elapsed ;

cl_mem a_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY |


CL_MEM_COPY_HOST_PTR, N*N*sizeof(cl_float), a, NULL);
cl_mem b_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY |
CL_MEM_COPY_HOST_PTR, N*N*sizeof(cl_float), b, NULL);
cl_mem c_buffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
N*N*sizeof(cl_float), NULL, NULL);
// no cl_mem object is needed for aTile: __local memory is allocated by
// passing only a size to clSetKernelArg; likewise, a and b are already
// initialized on the device via CL_MEM_COPY_HOST_PTR, so no separate
// clEnqueueWriteBuffer calls are needed

// 2-D NDRange covering the whole N x N matrix; note that the device limit
// CL_DEVICE_MAX_WORK_GROUP_SIZE caps workgroup*workgroup
size_t global_work_size[2] = { (size_t)N, (size_t)N };
size_t local_work_size[2] = { (size_t)workgroup, (size_t)workgroup };
clock_t start=clock();
clSetKernelArg(kernel, 0, sizeof(a_buffer), (void*) &a_buffer);
clSetKernelArg(kernel, 1, sizeof(b_buffer), (void*) &b_buffer);
clSetKernelArg(kernel, 2, sizeof(c_buffer), (void*) &c_buffer);
clSetKernelArg(kernel, 3, sizeof(cl_int), (void*) &N);
clSetKernelArg(kernel, 4, sizeof(cl_int), (void*) &workgroup);
// __local aTile: pass only the size, with a NULL argument value
clSetKernelArg(kernel, 5, workgroup*workgroup*sizeof(cl_float), NULL);

clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_work_size,
local_work_size, 0, NULL, NULL);
clFinish(queue);
clock_t end=clock();
elapsed = double (end - start) / CLOCKS_PER_SEC ;
cout << "the time it takes to run the kernel with work group size " << workgroup
<< " is " << elapsed << endl << endl;

cl_float *c = (cl_float *) malloc(N*N*sizeof(cl_float));


clEnqueueReadBuffer(queue, c_buffer, CL_TRUE, 0, N *N* sizeof(cl_float), c,
0, NULL, NULL);

free(a);
free(b);
free(c);
clReleaseMemObject(a_buffer);
clReleaseMemObject(b_buffer);
clReleaseMemObject(c_buffer);
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);
}

system("PAUSE");
return 0;
}
