OpenCL Image Convolution-Separable Filters

Pi19404
February 6, 2013

Contents

0.1 Abstract
0.2 Separable Filters
0.3 Computation Required for Convolution Using Separable kernel
0.4 Parallel Implementation
0.5 Comparison with CPU implementations
0.6 Code

0.1 Abstract
In an earlier article we looked at a parallel implementation of 2D image convolution and its application to a simple averaging box filter. In this article we look at the basics of separable filters and use them for image convolution, following a heterogeneous parallel programming approach with OpenCL.

0.2 Separable Filters
The image convolution operation can be viewed as the output of an LTI system whose impulse response is the convolution kernel. Separable filters are a special case of general convolution. A 2D kernel is called separable if it can be expressed as the outer product of two 1D vectors, that is, if it can be decomposed into vertical and horizontal projections.

Let $f(x,y)$ be the input image and $h(x,y)$ the convolution kernel, with $h = u * v$, where $u$ is a column vector, $v$ is a row vector and $*$ denotes convolution (the outer product of a column vector and a row vector equals their 2D convolution). Then

$$f * h = f * (u * v) = (u * v) * f = u * (v * f) \qquad (1)$$

Since $u$ and $v$ are 1D vectors, we first convolve $f$ with $v$ and then convolve the result with $u$. Convolving $f$ with $v$ amounts to convolving all the rows of $f$ with $v$; convolving the result with $u$ amounts to convolving all of its columns with $u$.
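Equation (1) can be checked numerically with a small sketch. The code below is not taken from the article's repository: the function names (`outer`, `convolve_rows`, `convolve_cols`, `convolve2d_same`), the zero-padded border handling and the unnormalized box weights are all assumptions made for illustration.

```python
def outer(u, v):
    """Build the 2D kernel h as the outer product of column u and row v."""
    return [[a * b for b in v] for a in u]


def convolve_rows(image, v):
    """Convolve every row of image with the 1D kernel v (zero padding)."""
    rows, cols = len(image), len(image[0])
    off = len(v) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            acc = 0.0
            for j, w in enumerate(v):
                xx = x + off - j          # convolution flips the kernel
                if 0 <= xx < cols:
                    acc += w * image[y][xx]
            out[y][x] = acc
    return out


def convolve_cols(image, u):
    """Convolve every column of image with the 1D kernel u (zero padding)."""
    rows, cols = len(image), len(image[0])
    off = len(u) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            acc = 0.0
            for i, w in enumerate(u):
                yy = y + off - i
                if 0 <= yy < rows:
                    acc += w * image[yy][x]
            out[y][x] = acc
    return out


def convolve2d_same(image, kernel):
    """Direct 2D convolution with zero padding, same output size as image."""
    rows, cols = len(image), len(image[0])
    r_off, c_off = len(kernel) // 2, len(kernel[0]) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            acc = 0.0
            for i, kernel_row in enumerate(kernel):
                for j, w in enumerate(kernel_row):
                    yy, xx = y + r_off - i, x + c_off - j
                    if 0 <= yy < rows and 0 <= xx < cols:
                        acc += w * image[yy][xx]
            out[y][x] = acc
    return out


if __name__ == "__main__":
    image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
    u = v = [1, 1, 1]  # 3x3 box filter, normalization by 9 omitted
    separable = convolve_cols(convolve_rows(image, v), u)
    direct = convolve2d_same(image, outer(u, v))
    print(separable == direct)
```

The weights are kept as integers so that both code paths produce exactly the same floating-point values; with a normalized box filter the two results would agree only up to rounding.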


0.3 Computation Required for Convolution Using Separable kernel
If an $M \times N$ image is convolved with a $P \times Q$ kernel, each output pixel requires $PQ$ multiplications and additions. Hence the total number of operations required is $MNPQ$. If we perform the convolution using a separable filter, we require $MN(P+Q)$ multiplications and additions.

Thus 2D convolution requires more computation than the separable method by a factor of $\frac{PQ}{P+Q}$. For a $3 \times 3$ convolution kernel this is a speedup by a factor of $1.5$.
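The operation counts above can be turned into a few lines of arithmetic. This is only an illustrative sketch: the helper names `direct_ops`, `separable_ops` and `speedup` are hypothetical, and the counts are the idealized multiply-add totals from the text, ignoring border handling and memory traffic.

```python
def direct_ops(m, n, p, q):
    """Multiply-add count for direct 2D convolution: M*N*P*Q."""
    return m * n * p * q


def separable_ops(m, n, p, q):
    """Multiply-add count for the two 1D passes: M*N*(P+Q)."""
    return m * n * (p + q)


def speedup(p, q):
    """Ratio of direct to separable cost: P*Q / (P+Q)."""
    return (p * q) / (p + q)


if __name__ == "__main__":
    # 320x240 image with a 3x3 kernel
    print(direct_ops(320, 240, 3, 3))     # 691200
    print(separable_ops(320, 240, 3, 3))  # 460800
    print(speedup(3, 3))                  # 1.5
```

The ratio grows with kernel size, so the saving from separability is larger for larger kernels than for the 3x3 case.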

0.4 Parallel Implementation
The parallel implementation consists of two parts.

In the first part, all the rows of the image are convolved with the 1D row vector. This is a 1D convolution operation. A thread is launched for each row in the image; each thread performs a 1D convolution of the image row with the 1D kernel and stores the result in global memory.

In the second part, all the columns of the image are convolved with the 1D column vector. This is again a 1D convolution operation: each thread reads a column vector and performs a 1D convolution. Before proceeding with the second part we need to make sure that all the threads of the first part have completed their operation. Thus the only common code required is that of the 1D convolution operation.

The issue with the column filter is that adjacent threads would access non-adjacent memory locations. A simple trick is to allocate the global memory as a square array of the largest dimension and, while writing the output of the row filter, write it along the columns, i.e. store the result as the transposed matrix. Performing the row convolution again on the resulting matrix is then equivalent to performing column filtering, and its result is again stored transposed, which restores the original orientation. This way the memory reads are coalesced: adjacent threads in a thread block access adjacent global memory locations.

In the present implementation each thread performs the computation for all three dimensions (color channels) of a pixel. Local memory is not used. In a later article we will implement a local memory version, in which data is copied from global memory to local memory, each thread block performs the computation on a block of data, and the results are combined to obtain the desired row/column filtering operation.
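The transposed-write trick described above can be sketched sequentially in Python. This is not the article's OpenCL kernel: the names `row_filter_transposed` and `separable_filter` are hypothetical, the loops stand in for the work-items a device would run in parallel, borders are zero padded, a single channel is processed, and the intermediate buffer uses the exact transposed shape rather than the square allocation mentioned in the text.

```python
def row_filter_transposed(src, kernel):
    """Convolve each row of src with a 1D kernel and store the result transposed."""
    rows, cols = len(src), len(src[0])
    off = len(kernel) // 2
    dst = [[0.0] * rows for _ in range(cols)]  # transposed shape: cols x rows
    for y in range(rows):        # on a device these iterations run in parallel
        for x in range(cols):
            acc = 0.0
            for j, w in enumerate(kernel):
                xx = x + off - j
                if 0 <= xx < cols:
                    acc += w * src[y][xx]
            dst[x][y] = acc      # write along the columns (transposed store)
    return dst


def separable_filter(image, row_kernel, col_kernel):
    """Two identical row passes; the transposes make the second act on columns."""
    # Pass 1: filter the rows, result stored transposed.
    tmp = row_filter_transposed(image, row_kernel)
    # Pass 2 starts only after pass 1 has finished. Rows of tmp are columns of
    # image, and the second transpose restores the original orientation.
    return row_filter_transposed(tmp, col_kernel)


if __name__ == "__main__":
    image = [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
    for row in separable_filter(image, [1, 1, 1], [1, 1, 1]):
        print(row)
```

Because both passes call the same routine, only one 1D convolution kernel needs to be written, which matches the observation that the common code is the 1D convolution.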

0.5 Comparison with CPU implementations
The separable filter implementation was compared with the 2D convolution implementation and with a CPU separable implementation for a 320x240 image on an Intel(R) Core(TM) i3 CPU at 2.53GHz. The separable filter was 8x faster than the CPU implementation of the box filter and 4x faster than the 2D parallel implementation of the box filter.

0.6 Code
The code consists of two parts: the host code and the device code. The host-side code uses the OpenCV APIs to read the image from a video file and demonstrates calling the kernel code for the 2D-convolution box filter, the separable filter and the host CPU implementation. The code is available in the repository https://code.google.com/p/m19404/source/browse/OpenCL-Image-Processing/Convolution/

