
Heterogeneous Parallel Programming - Image Convolution Filters (Box Filter)

Pi19404
January 28, 2013

Contents

OpenCL Parallel Programming for Image Convolution
0.1 Abstract
0.2 2D Convolution
0.3 Naive 2D convolution
0.4 Optimization method 1 2D convolution
    0.4.1 Using Local Memory
    0.4.2 Using Ternary Conditional Operator
    0.4.3 Unrolling For Loops
    0.4.4 Read Only Memory and Constant Variables
    0.4.5 Performance Comparison
0.5 Code

OpenCL Parallel Programming for Image Convolution



0.1 Abstract
Data parallelism is one way to achieve parallelism: the data is distributed across several computation units. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on a different piece of the distributed data. Image convolution is a neighborhood operation: the value of a pixel is computed as a weighted linear combination of its neighboring pixels. The set of operations to be performed for convolution is the same for all pixels. Thus data parallelism can be achieved for image convolution by assigning each pixel to a computation unit, with every computation unit performing the same task.

OpenCL™ is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices. Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), DSPs and other processors. OpenCL includes a language (based on C99) for writing kernels (functions that execute on OpenCL devices), plus application programming interfaces (APIs) that are used to define and then control the platforms. OpenCL provides parallel computing using task-based and data-based parallelism.

In the present document we do not describe the details of the OpenCL APIs but rather how OpenCL is used efficiently to accomplish the desired task. We will look at image convolution using 2D kernels and separable kernels and compare the box filter's performance against standard CPU algorithms.


0.2 2D Convolution
A 2D convolution is a neighborhood operation. The value of a pixel in the output matrix depends on a weighted linear sum of pixels in the input matrix. The weight map used during the summation is defined by the convolution kernel. 2D convolution can be viewed as the output of a discrete linear time-invariant (LTI) system whose impulse response is defined by the convolution kernel. The value of a pixel at the system output is the linear sum of the pixels in the neighborhood of the corresponding pixel in the system input, weighted by the convolution kernel. The convolution kernel decides the neighborhood size and the weight map. Different weight maps correspond to different types of filtering operation. A box kernel or a Gaussian kernel defines a low-pass filter LTI system, while the Sobel kernel defines a high-pass filter LTI system.
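To make the low-pass versus high-pass distinction concrete, the following shows the common textbook 3x3 forms of these kernels. The specific weights and the normalization factors are assumptions for illustration; they are not taken from the accompanying code.

```latex
\[
K_{\mathrm{box}} = \frac{1}{9}
\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix},
\qquad
K_{\mathrm{gauss}} = \frac{1}{16}
\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix},
\qquad
K_{\mathrm{sobel},x} =
\begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}
\]
```

The weights of the box and Gaussian kernels sum to 1, so a region of constant intensity passes through unchanged while rapid variations are averaged out (low-pass). The Sobel weights sum to 0, so a constant region maps to 0 and only intensity changes survive (high-pass).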

0.3 Naive 2D convolution


A 320x240 image is divided into 16x16 blocks. Each thread is configured to compute the value of one output pixel, so the total number of threads equals the total number of pixels in the image. The expression for 2D convolution is given below, where P is the input image, K is the kernel, O is the output image, (i, j) is the pixel location and R is the kernel radius, so that the kernel covers (2R+1)x(2R+1) pixels.
\[
O[i, j] = \sum_{k=-R}^{R} \sum_{l=-R}^{R} P[i + k, j + l] \, K[k, l] \tag{1}
\]

Pixels at the image borders use pixel indices outside of the image, so we need to extrapolate the values of pixels lying outside the image. Different methods can be used for the extrapolation. One simple method is to set the pixel value to zero or another constant. Another method is to replicate the border pixels into the pixels outside the image border. In the present approach we set these pixel values to 0.

The data for the input image, output image and kernel are stored in device global memory. Thus each thread accesses its data from global device memory, and the same pixels in global memory are accessed multiple times by different local threads. Each thread implements the above expression to compute the value of output pixel O[i, j] in terms of local block/work-group indices, local thread IDs and global thread IDs. The naive parallel version is compared with the host CPU version of the code.
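The per-pixel work described above can be sketched in plain C as follows. This is a host-side illustration, not the author's kernel: the function names, the row-major float layout and the kernel storage order are all assumptions. `convolve_pixel` is what a single thread would compute, and the double loop in `convolve_naive` stands in for launching one thread per output pixel.

```c
#include <assert.h>
#include <stddef.h>

/* Read pixel (row i, column j) of a row-major image. Positions outside the
   image are extrapolated as 0, the border policy chosen in the text. */
float pixel_or_zero(const float *P, int height, int width, int i, int j)
{
    if (i < 0 || i >= height || j < 0 || j >= width)
        return 0.0f;
    return P[(size_t)i * width + j];
}

/* Work of one thread: evaluate the convolution sum for output pixel (i, j).
   Assumed layout: K holds (2R+1) x (2R+1) weights in row-major order, so the
   weight K[k, l] is stored at index (k + R) * (2R + 1) + (l + R). */
float convolve_pixel(const float *P, int height, int width,
                     const float *K, int R, int i, int j)
{
    assert(R >= 0);
    const int side = 2 * R + 1;
    float sum = 0.0f;
    for (int k = -R; k <= R; ++k)
        for (int l = -R; l <= R; ++l)
            sum += pixel_or_zero(P, height, width, i + k, j + l)
                 * K[(k + R) * side + (l + R)];
    return sum;
}

/* Host-side stand-in for the kernel launch: one call per output pixel. */
void convolve_naive(const float *P, float *O, int height, int width,
                    const float *K, int R)
{
    for (int i = 0; i < height; ++i)
        for (int j = 0; j < width; ++j)
            O[(size_t)i * width + j] =
                convolve_pixel(P, height, width, K, R, i, j);
}
```

Note that every call re-reads up to (2R+1)x(2R+1) input pixels, which is the repeated global memory traffic the next section tries to reduce.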

0.4 Optimization method 1 2D convolution


0.4.1 Using Local Memory
In the earlier method the data in global memory is accessed multiple times by different threads. As global memory reads are costly, these multiple reads may cause the parallel algorithm to take longer than the host CPU algorithm. The efficiency of the parallel algorithm can be improved by optimizing the access to global memory. Each thread block/work-group is allocated a fixed amount of local memory, which is accessible to all the threads in the thread block/work-group. The data required by all the threads of the thread block is loaded from global memory into local memory, so each thread block loads one sub-image from global memory. Thus each pixel in global memory is accessed only once, except the pixels at the borders of sub-images, which are also accessed by adjacent thread blocks. The program is executed in two parts. In the first part all the threads load the data from global to local memory.

In the second part, once the data is loaded into local memory, all the threads in the thread block perform the convolution on the sub-image held in local memory.


Thus the performance increase comes from reducing the number of loads from global memory, and from the fact that the convolution computation itself works on local memory, which is faster to access than global memory.
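The two-part scheme can be emulated in plain C to show the data movement. This is a sequential sketch under stated assumptions, not the author's kernel: the names `convolve_tile` and `convolve_tiled`, the `MAX_R` bound and the row-major float layout are hypothetical. The array `tile` plays the role of `__local` memory, and in a real OpenCL kernel a `barrier(CLK_LOCAL_MEM_FENCE)` would separate the two parts.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 16                       /* work-group side, as in the 16x16 blocks */
#define MAX_R 4                       /* assumed upper bound on the kernel radius */
#define LOCAL_SIDE (TILE + 2 * MAX_R)

/* Process the work-group whose top-left output pixel is (i0, j0). */
void convolve_tile(const float *P, float *O, int height, int width,
                   const float *K, int R, int i0, int j0)
{
    assert(R >= 0 && R <= MAX_R);
    float tile[LOCAL_SIDE * LOCAL_SIDE]; /* stands in for __local memory */
    const int side = TILE + 2 * R;       /* sub-image side including the halo */
    const int kside = 2 * R + 1;

    /* Part 1: copy the sub-image plus halo from global memory exactly once.
       Positions outside the image are stored as 0. */
    for (int y = 0; y < side; ++y)
        for (int x = 0; x < side; ++x) {
            const int gi = i0 + y - R, gj = j0 + x - R;
            const int inside = gi >= 0 && gi < height && gj >= 0 && gj < width;
            tile[y * side + x] = inside ? P[(size_t)gi * width + gj] : 0.0f;
        }

    /* A real kernel would synchronize the work-group here. */

    /* Part 2: every thread convolves using only the local copy. */
    for (int ty = 0; ty < TILE; ++ty)
        for (int tx = 0; tx < TILE; ++tx) {
            const int gi = i0 + ty, gj = j0 + tx;
            if (gi >= height || gj >= width)
                continue;                /* thread falls outside the image */
            float sum = 0.0f;
            for (int k = -R; k <= R; ++k)
                for (int l = -R; l <= R; ++l)
                    sum += tile[(ty + R + k) * side + (tx + R + l)]
                         * K[(k + R) * kside + (l + R)];
            O[(size_t)gi * width + gj] = sum;
        }
}

/* Host-side stand-in for launching all work-groups. */
void convolve_tiled(const float *P, float *O, int height, int width,
                    const float *K, int R)
{
    for (int i0 = 0; i0 < height; i0 += TILE)
        for (int j0 = 0; j0 < width; j0 += TILE)
            convolve_tile(P, O, height, width, K, R, i0, j0);
}
```

Only the halo of width R around each sub-image is read by more than one work-group, which matches the exception for border pixels noted above.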

0.4.2 Using Ternary Conditional Operator

Performance is also increased by replacing if-else blocks with the ternary conditional operator. An if-else block takes more than two instructions, while the ternary conditional operator is executed in a single instruction cycle on some devices.
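A typical place for this substitution is the border test when fetching a pixel. The sketch below shows the same fetch written both ways; the function names and the row-major float layout are assumptions for illustration, and whether the ternary form is actually faster depends on the device and compiler.

```c
#include <assert.h>

/* Border fetch written as an if-else block. */
float fetch_if_else(const float *P, int height, int width, int i, int j)
{
    float value;
    if (i >= 0 && i < height && j >= 0 && j < width) {
        value = P[i * width + j];
    } else {
        value = 0.0f;
    }
    return value;
}

/* The same fetch with the ternary conditional operator. Only the selected
   operand is evaluated, so no out-of-bounds read takes place. */
float fetch_ternary(const float *P, int height, int width, int i, int j)
{
    assert(height > 0 && width > 0);
    const int inside = (i >= 0) & (i < height) & (j >= 0) & (j < width);
    return inside ? P[i * width + j] : 0.0f;
}
```

Both functions return identical values for every (i, j); the difference is only in how the condition is expressed to the compiler.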

0.4.3 Unrolling For Loops

For loops are expensive operations. If the trip count of a for loop is known at compile time, the loop can be unrolled. With some compilers, for loops are unrolled automatically when the compiler is given a hint. If the trip count is not known, the loop can still be partially unrolled. To take full advantage of unrolling, the parameters used in the for loops can be passed as preprocessor defines at compile time rather than as kernel arguments. However, not all devices support unrolling, in which case we need to manually substitute the for loop with the equivalent statements; in some cases this provides a slight improvement over the for loop.
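Both variants can be sketched in C as follows. The macro name `KERNEL_RADIUS` and the function names are hypothetical; in OpenCL such a define would typically be supplied through the build options string passed to `clBuildProgram`, for example `-D KERNEL_RADIUS=1`. The first function has a compile-time trip count that a compiler is free to unroll; the second is the manual substitution for the 3x3 case.

```c
#include <assert.h>

#ifndef KERNEL_RADIUS
#define KERNEL_RADIUS 1   /* assumed default; normally set with -D at build time */
#endif
#define KERNEL_SIDE (2 * KERNEL_RADIUS + 1)

/* Loop version: the trip count is a compile-time constant, so the compiler
   can unroll it. win holds the pixel window in the same order as K. */
float weighted_sum_loop(const float *win, const float *K)
{
    float sum = 0.0f;
    for (int n = 0; n < KERNEL_SIDE * KERNEL_SIDE; ++n)
        sum += win[n] * K[n];
    return sum;
}

/* Manually unrolled version, valid only for a 3x3 kernel (radius 1). */
float weighted_sum_unrolled_3x3(const float *win, const float *K)
{
    assert(KERNEL_RADIUS == 1);
    return win[0] * K[0] + win[1] * K[1] + win[2] * K[2]
         + win[3] * K[3] + win[4] * K[4] + win[5] * K[5]
         + win[6] * K[6] + win[7] * K[7] + win[8] * K[8];
}
```

Passing the radius as a kernel argument instead would make the trip count a run-time value, which is why the text recommends a compile-time define.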

0.4.4 Read Only Memory and Constant Variables

Read-only memory is faster to access than read-write memory on some devices. Thus memory objects that do not need to be written to are labelled as read-only. Also, variables that are not going to change during the execution of the code are declared as const. These changes may provide an improvement on some devices.
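In C terms, this amounts to const-qualifying everything that is only read. The sketch below is a plain C illustration with hypothetical names; in an OpenCL kernel the analogous choices would be a `__global const` pointer for the input image and the `__constant` address space for the fixed weights, which is an assumption about how the idea maps onto the kernel rather than a description of the author's code.

```c
#include <assert.h>

/* The weight never changes at run time, so it is declared const. */
static const float BOX3_WEIGHT = 1.0f / 9.0f;

/* in is only read (const-qualified), out is only written. The size
   parameters are const as well since the function never modifies them.
   Only interior pixels are computed, to keep the sketch short. */
void box3_interior(const float *in, float *out,
                   const int height, const int width)
{
    assert(height >= 3 && width >= 3);
    for (int i = 1; i < height - 1; ++i)
        for (int j = 1; j < width - 1; ++j) {
            float sum = 0.0f;
            for (int k = -1; k <= 1; ++k)
                for (int l = -1; l <= 1; ++l)
                    sum += in[(i + k) * width + (j + l)];
            out[i * width + j] = sum * BOX3_WEIGHT;
        }
}
```

The qualifiers do not change the result; they only give the compiler and the device more freedom in how the data is cached and accessed.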

0.4.5 Performance Comparison


For small matrices the CPU version is faster; as the size of the matrices increases, the parallel version shows an improvement. The programs were executed on a 4-core device. The box filter shows an improvement of 2X for the optimized version. The convolution kernel for an NxN box filter is all 1s. The box filter represents an averaging filter.

0.5 Code
The code consists of two parts: the host code and the device code. The host-side code uses OpenCV APIs to read the image from a video file and demonstrates calling the kernel code for the box filter, Gaussian filter and Sobel filter with the naive and optimized parallel versions and the host CPU version. The code is available in the repository https://code.google.com/p/m19404/source/browse/OpenCL-Image-Processing/Convolution/

