
ADDIS ABABA INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING

ECEG-6518 PARALLEL COMPUTING

Assignment 5: OpenCL Optimization

BY: RAHEL ABERA
ID: GSR/0519/08
STREAM: COMPUTER ENGINEERING

Submitted To: Fitsum Assamnew
Submission Date: May 17, 2016
OBJECTIVE

The objective of this assignment is to learn about OpenCL optimization
strategies. This is achieved by analyzing the benefit of memory coalescing and
the benefit of caching in local memory. The benefit of memory coalescing is
examined with offset and stride access patterns, and the benefit of caching in
local memory is examined by comparing simple and cached matrix multiplication.

PROBLEM STATEMENT

This assignment addresses memory optimization when implementing programs in
OpenCL. Coalesced global memory access is one optimization strategy: when
global memory is accessed, misaligned (offset) and strided accesses hurt
performance, so in an OpenCL implementation the execution time is affected by
offset and stride access patterns.

The other memory optimization method is to use local memory as a cache.
Caching data in local memory reduces global memory accesses, and local memory
can also be used to avoid non-coalesced global memory accesses, which in turn
improves performance.

In this assignment, the effect of offset and stride access on memory
coalescing is shown using OpenCL. The efficiency of the simple (uncached)
matrix multiplication is also compared with that of the cached, coalesced
matrix multiplication.
METHODOLOGY

This assignment first requires a machine configured for OpenCL.

To see the benefit of memory coalescing, kernels with offset and stride
accesses are written. For the offset access, a kernel that accepts two arrays
of N = 67108864 chars and an offset is written, as given in the lecture notes.
The time it takes to run the kernel is then measured while varying the offset
from 0 to 16. Likewise, for the stride access, the kernel time is measured
while varying the stride from 0 to 16. The <ctime> library is used to measure
the time.
To see the benefit of caching in local memory, simple and cached versions of
the multiplication of two 1024x1024 floating-point matrices are implemented in
OpenCL. The time it takes to complete the multiplication is then measured
while varying the work-group size.
The experimental setup and tools used in this assignment:

Processor: Intel(R) Core i7 CPU Q720 @ 1.60 GHz
Installed memory (RAM): 4.00 GB
System type: 64-bit Operating System
IDE: Visual Studio 2012
Programming language: OpenCL 1.2, C++
Additional SDK: Intel SDK for OpenCL
Graphics card: ATI Radeon HD 5650 (Redwood)
RESULTS AND DISCUSSION

Offset access results (kernel time in seconds):

offset   kernel time
0        0.083
1        0.041
2        0.064
3        0.041
4        0.068
5        0.041
6        0.064
7        0.041
8        0.064
9        0.04
10       0.064
11       0.041
12       0.07
13       0.04
14       0.061
15       0.041
16       0.06

[Figure: kernel time (s) vs. offset, plotted from the table above]

Stride access results (kernel time in seconds):

stride   kernel time
0        0.111
1        0.041
2        0.067
3        0.044
4        0.075
5        0.044
6        0.067
7        0.045
8        0.07
9        0.046
10       0.072
11       0.045
12       0.071
13       0.045
14       0.072
15       0.045
16       0.078

[Figure: kernel time (s) vs. stride, plotted from the table above]
It is expected that performance decreases as the offset and stride values
increase: a larger stride makes consecutive work-items jump across more
memory-block boundaries, so fewer accesses can be coalesced and performance
drops. However, the measured results do NOT follow this expectation. This may
be due to the machine used for the experiment; another possible factor is that
the host-side clock() timer is too coarse to resolve the differences reliably.

For the cached and uncached versions of matrix multiplication, no output could
be obtained: the machine reports a kernel execution time of ZERO, which may
mean the kernel launch failed silently or the timer could not resolve it.

Therefore, neither the benefit of memory coalescing nor the benefit of caching
in local memory could be observed.
CONCLUSION

In this assignment, kernels with offset and stride access patterns were
implemented in OpenCL to measure their impact on performance. In addition,
simple (uncached) and cached versions of matrix multiplication were
implemented in OpenCL to measure the benefit of caching in local memory.

However, because the machine used for the experiment did not produce the
expected output, neither the benefit of memory coalescing nor the benefit of
caching in local memory could be seen in the results. A properly configured
machine is needed to observe these effects.
APPENDIX A: OpenCL code for offset access

#include<stdio.h>
#include<CL/cl.h>
#include<ctime>
#include<iostream>
using namespace std;

const char *source =


"__kernel void offsetCopy (__global char *a, \n"
" __global const char *b, \n"
" const int offset) \n"
"{ \n"
" int gid = get_global_id(0)+offset; \n"
" a[gid] = b[gid]; \n"
"} \n" ;

int main() {
int N = 67108864;
for(int offset = 0; offset <= 16; offset++) {
cout <<"with offset value: " << offset << endl ;

cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);

cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

cl_context context = clCreateContext(0, 1, &device, NULL, NULL, NULL);


cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

cl_program program = clCreateProgramWithSource(context, 1, &source, NULL,


NULL);
clBuildProgram (program, 1, &device, NULL, NULL, NULL);

cl_kernel kernel = clCreateKernel(program, "offsetCopy", NULL);

cl_char *a = (cl_char *) malloc(N*sizeof(cl_char));


cl_char *b = (cl_char *) malloc(N*sizeof(cl_char));

int i;
for(i = 0; i < N; i++){
a[i] = i;
b[i] = N - i;
}

double elapsed ;

cl_mem b_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY |


CL_MEM_COPY_HOST_PTR, N*sizeof(cl_char), b, NULL);
cl_mem a_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
N*sizeof(cl_char), NULL, NULL);

size_t global_work_size = N/16;

clSetKernelArg(kernel, 0, sizeof(a_buffer), (void*) &a_buffer);


clSetKernelArg(kernel, 1, sizeof(b_buffer), (void*) &b_buffer);
clSetKernelArg(kernel, 2, sizeof(cl_int), (void*) &offset);

clock_t start=clock();
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_work_size, NULL, 0,
NULL, NULL);
clFinish(queue);
clock_t end=clock();
elapsed = double (end - start) / CLOCKS_PER_SEC ;
cout << "the time it takes to run the kernel is: " << elapsed << endl << endl;

clEnqueueReadBuffer(queue, a_buffer, CL_TRUE, 0, N * sizeof(cl_char), a, 0,


NULL, NULL);

free(a);
free(b);

clReleaseMemObject(a_buffer);
clReleaseMemObject(b_buffer);

clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);
}

system("PAUSE");
}
APPENDIX B: OpenCL code for stride access

#include<stdio.h>
#include<CL/cl.h>
#include<ctime>
#include<iostream>
using namespace std;

const char *source =


"__kernel void strideCopy (__global char *a, \n"
" __global const char *b, \n"
" const int stride) \n"
"{ \n"
" int gid = get_global_id(0)*stride; \n"
" a[gid] = b[gid]; \n"
"} \n" ;

int main() {
int N = 67108864;
for(int stride = 0; stride <= 16; stride++) {
cout <<"with stride value: " << stride << endl ;

cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);

cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

cl_context context = clCreateContext(0, 1, &device, NULL, NULL, NULL);


cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

cl_program program = clCreateProgramWithSource(context, 1, &source, NULL,


NULL);
clBuildProgram (program, 1, &device, NULL, NULL, NULL);

cl_kernel kernel = clCreateKernel(program, "strideCopy", NULL);

cl_char *a = (cl_char *) malloc(N*sizeof(cl_char));


cl_char *b = (cl_char *) malloc(N*sizeof(cl_char));

int i;
for(i = 0; i < N; i++){
a[i] = i;
b[i] = N - i;
}

double elapsed ;

cl_mem b_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY |


CL_MEM_COPY_HOST_PTR, N*sizeof(cl_char), b, NULL);
cl_mem a_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
N*sizeof(cl_char), NULL, NULL);

size_t global_work_size = N/16;


clSetKernelArg(kernel, 0, sizeof(a_buffer), (void*) &a_buffer);
clSetKernelArg(kernel, 1, sizeof(b_buffer), (void*) &b_buffer);
clSetKernelArg(kernel, 2, sizeof(cl_int), (void*) &stride);

clock_t start=clock();
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_work_size, NULL, 0,
NULL, NULL);
clFinish(queue);
clock_t end=clock();
elapsed = double (end - start) / CLOCKS_PER_SEC ;
cout << "the time it takes to run the kernel is: " << elapsed << endl << endl;

clEnqueueReadBuffer(queue, a_buffer, CL_TRUE, 0, N * sizeof(cl_char), a, 0,


NULL, NULL);

free(a);
free(b);

clReleaseMemObject(a_buffer);
clReleaseMemObject(b_buffer);

clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);
}

system("PAUSE");
}
APPENDIX C: OpenCL code for simple matrix multiply

#include<stdio.h>
#include<CL/cl.h>
#include<ctime>
#include<iostream>
using namespace std;

const char *source =


"__kernel void simpleMultiply (__global float* a, \n"
" __global float* b, \n"
" __global float* c, \n"
" int N, \n"
" int TILE_DIM) \n"
"{ \n"
" int row = get_global_id(1); \n"
" int col = get_global_id(0); \n"
" float sum = 0.0f; \n"
" for (int i = 0; i < TILE_DIM; i++){ \n"
" sum += a[row*TILE_DIM+i]* b[i*N+col]; \n"
" } \n"
" c[row*N+col] = sum; \n"
"} \n";

int main() {
int N = 1024;
for(int workgroup = 4; workgroup <= 256; workgroup*=2) {
cout << "with work group size: " << workgroup << endl;

cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);

cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

cl_context context = clCreateContext(0, 1, &device, NULL, NULL, NULL);


cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

cl_program program = clCreateProgramWithSource(context, 1, &source, NULL,


NULL);
clBuildProgram (program, 1, &device, NULL, NULL, NULL);

cl_kernel kernel = clCreateKernel(program, "simpleMultiply", NULL);

cl_float *a = (cl_float *) malloc(N*N*sizeof(cl_float));


cl_float *b = (cl_float *) malloc(N*N*sizeof(cl_float));

int i;
for(i = 0; i < N*N; i++){
a[i] = i;
b[i] = N - i;
}

double elapsed ;
cl_mem a_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY |
CL_MEM_COPY_HOST_PTR, N*N*sizeof(cl_float), a, NULL);
cl_mem b_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY |
CL_MEM_COPY_HOST_PTR, N*N*sizeof(cl_float), b, NULL);
cl_mem c_buffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
N*N*sizeof(cl_float), NULL, NULL);

// a and b are already initialized on the device via CL_MEM_COPY_HOST_PTR,
// so no separate clEnqueueWriteBuffer calls are needed

// 2-D NDRange covering the whole N x N matrix; note that the device limit
// CL_DEVICE_MAX_WORK_GROUP_SIZE caps workgroup*workgroup
size_t global_work_size[2] = { (size_t)N, (size_t)N };
size_t local_work_size[2] = { (size_t)workgroup, (size_t)workgroup };
clock_t start=clock();
clSetKernelArg(kernel, 0, sizeof(a_buffer), (void*) &a_buffer);
clSetKernelArg(kernel, 1, sizeof(b_buffer), (void*) &b_buffer);
clSetKernelArg(kernel, 2, sizeof(c_buffer), (void*) &c_buffer);
clSetKernelArg(kernel, 3, sizeof(cl_int), (void*) &N);
clSetKernelArg(kernel, 4, sizeof(cl_int), (void*) &workgroup);

clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_work_size,
local_work_size, 0, NULL, NULL);
clFinish(queue);
clock_t end=clock();
elapsed = double (end - start) / CLOCKS_PER_SEC ;
cout << "the time it takes to run the kernel with work group size " << workgroup
<< " is " << elapsed << endl << endl;

cl_float *c = (cl_float *) malloc(N*N*sizeof(cl_float));


clEnqueueReadBuffer(queue, c_buffer, CL_TRUE, 0, N *N* sizeof(cl_float), c,
0, NULL, NULL);

free(a);
free(b);
free(c);
clReleaseMemObject(a_buffer);
clReleaseMemObject(b_buffer);
clReleaseMemObject(c_buffer);
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);
}

system("PAUSE");
}
APPENDIX D: OpenCL code for cached matrix multiply
#include<stdio.h>
#include<CL/cl.h>
#include<ctime>
#include<iostream>
using namespace std;

const char *source =


"__kernel void simpleMultiply (__global float* a, \n"
" __global float* b, \n"
" __global float* c, \n"
" int N, \n"
" int TILE_DIM) \n"
"{ \n"
" int row = get_global_id(1); \n"
" int col = get_global_id(0); \n"
" float sum = 0.0f; \n"
" for (int i = 0; i < TILE_DIM; i++){ \n"
" sum += a[row*TILE_DIM+i]* b[i*N+col]; \n"
" } \n"
" c[row*N+col] = sum; \n"
"} \n";

const char *source1 =


"__kernel void coalescedMultiply (__global float* a, \n"
" __global float* b, \n"
" __global float* c, \n"
" int N, \n"
" int TILE_DIM, \n"
" __local float* aTile) \n"
"{ \n"
" int row = get_global_id(1); \n"
" int col = get_global_id(0); \n"
" float sum = 0.0f; \n"
" int x = get_local_id(0); \n"
" int y = get_local_id(1); \n"
" aTile[y*TILE_DIM+x] = a[row*TILE_DIM+x]; \n"
" barrier(CLK_LOCAL_MEM_FENCE); \n"
" for (int i = 0; i < TILE_DIM; i++){ \n"
" sum += aTile[y*TILE_DIM+i]* b[i*N+col]; \n"
" } \n"
" c[row*N+col] = sum; \n"
"} \n" ;

int main (int argc, char** argv) {


int N = 1024; // 1024x1024 matrices, as stated in the methodology
for(int workgroup = 4; workgroup <= 256; workgroup*=2) {
cout << "with work group size: " << workgroup << endl;

cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);

cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

cl_context context = clCreateContext(0, 1, &device, NULL, NULL, NULL);


cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

cl_program program = clCreateProgramWithSource(context, 1, &source1, NULL,


NULL);
clBuildProgram (program, 1, &device, NULL, NULL, NULL);

//cl_kernel kernel = clCreateKernel(program, "simpleMultiply", NULL);


cl_kernel kernel = clCreateKernel(program, "coalescedMultiply", NULL);

cl_float *a = (cl_float *) malloc(N*N*sizeof(cl_float));


cl_float *b = (cl_float *) malloc(N*N*sizeof(cl_float));
// the __local aTile storage is allocated by clSetKernelArg below;
// no host-side allocation is needed
int i;
for(i = 0; i < N*N; i++){
a[i] = i;
b[i] = N - i;
}

double elapsed ;

cl_mem a_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY |


CL_MEM_COPY_HOST_PTR, N*N*sizeof(cl_float), a, NULL);
cl_mem b_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY |
CL_MEM_COPY_HOST_PTR, N*N*sizeof(cl_float), b, NULL);
cl_mem c_buffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
N*N*sizeof(cl_float), NULL, NULL);
// no cl_mem object is needed for aTile: __local memory is allocated by
// passing only a size to clSetKernelArg; likewise, a and b are already
// initialized on the device via CL_MEM_COPY_HOST_PTR, so no separate
// clEnqueueWriteBuffer calls are needed

// 2-D NDRange covering the whole N x N matrix; note that the device limit
// CL_DEVICE_MAX_WORK_GROUP_SIZE caps workgroup*workgroup
size_t global_work_size[2] = { (size_t)N, (size_t)N };
size_t local_work_size[2] = { (size_t)workgroup, (size_t)workgroup };
clock_t start=clock();
clSetKernelArg(kernel, 0, sizeof(a_buffer), (void*) &a_buffer);
clSetKernelArg(kernel, 1, sizeof(b_buffer), (void*) &b_buffer);
clSetKernelArg(kernel, 2, sizeof(c_buffer), (void*) &c_buffer);
clSetKernelArg(kernel, 3, sizeof(cl_int), (void*) &N);
clSetKernelArg(kernel, 4, sizeof(cl_int), (void*) &workgroup);
// __local aTile: pass only the size, with a NULL argument value
clSetKernelArg(kernel, 5, workgroup*workgroup*sizeof(cl_float), NULL);

clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_work_size,
local_work_size, 0, NULL, NULL);
clFinish(queue);
clock_t end=clock();
elapsed = double (end - start) / CLOCKS_PER_SEC ;
cout << "the time it takes to run the kernel with work group size " << workgroup
<< " is " << elapsed << endl << endl;

cl_float *c = (cl_float *) malloc(N*N*sizeof(cl_float));


clEnqueueReadBuffer(queue, c_buffer, CL_TRUE, 0, N *N* sizeof(cl_float), c,
0, NULL, NULL);

free(a);
free(b);
free(c);
clReleaseMemObject(a_buffer);
clReleaseMemObject(b_buffer);
clReleaseMemObject(c_buffer);
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);
}

system("PAUSE");
return 0;
}
