


An Honors Thesis

Presented to

The Faculty of the Department of Computer Science

Washington and Lee University

In Partial Fulfillment Of the Requirements for

Honors in Computer Science


Alexander Lee Jackson

May 2009
To my Mother and Father. . .

1 Introduction
1.1 General Purpose GPU Programming
1.2 Image Processing
1.3 Motivation
1.4 Thesis Layout

2 Background
2.1 Parallel Computing
2.2 Graphics Processing Units
2.2.1 General Purpose Graphics Processing Unit
2.2.2 Compute Unified Device Architecture
2.3 Image Processing

3 GPU Edge Detection Algorithms
3.1 One Pixel Per Thread Algorithm
3.2 Multiple Pixels Per Thread Algorithm

4 Evaluation and Results
4.1 Method
4.2 Results

5 Conclusions
5.1 Conclusion
5.2 Future Work

Bibliography


I would like to thank my adviser Rance Necaise for assisting me through this long and, at
times, frustrating process.
Thanks also to Tania S. Douglas and her research team at the University of Cape Town
in South Africa for inspiring this project and supplying a number of sample images.
A special thank you to Ryleigh for talking me down from many ledges when it got


Often, it is a race against time to make a proper diagnosis of a disease. In areas of the world
where qualified medical personnel are scarce, work is being done on the automated diagnosis
of illnesses. Automated diagnosis involves several stages of image processing on lab samples
in search of abnormalities that may indicate the presence of such things as tuberculosis.
These image processing tasks are good candidates for migration to parallelism, which would
significantly speed up the process. However, a traditional parallel computer is not a very
accessible piece of hardware to many. The graphics processing unit (GPU) has evolved into
a highly parallel component that recently has gained the ability to be utilized by developers
for non-graphical computations.
This paper demonstrates the parallel computing power of the GPU in the area of medical
image processing. We present a new algorithm for performing edge detection on images using
NVIDIA’s CUDA programming model in order to program the GPU in C. We evaluated our
algorithm on a number of sample images and compared it to two other implementations:
one sequential and one parallel. This new algorithm produces impressive speedup in the
edge detection process.

Chapter 1

Introduction


Graphics processing units (GPUs) have evolved over the past decade into highly parallel,

multicore processors[2]. Until recently, these extremely powerful pieces of hardware were

typically only used for processing graphical data. Unless the user is playing a graphics-intensive

computer game or executing some other application of a graphical nature, these

high-powered GPUs are often underutilized. In recent years researchers have begun to

investigate the viability of using this highly parallel, highly efficient processing unit for

computations of a non-graphical nature.

This project focused on using GPUs for image processing. We implemented two different

load-balancing algorithms for use with the GPU and showed the advantages of programming

the GPU over the central processing unit (CPU) for such problems.

1.1 General Purpose GPU Programming

Parallel computing has become the go-to method for dealing with problems that have large

data sets and computationally intense calculations. Some examples of such problems are

scientific modeling simulations, weather forecasting, and modeling the relations between

many heavenly bodies. However, there are limitations to the use of the high-powered computers

that are necessary for executing these parallel applications. These supercomputers

are often very large and typically quite expensive.

Working with the GPU for computationally intensive problems has several advantages

over the alternative options for parallelism. The GPU is a much more physically and

financially manageable piece of machinery with a top of the line unit going for at most

a few thousand dollars. It also has a growing community of enthusiasts that have been

showing impressive speed-up capabilities through GPU utilization. NVIDIA has become

the industry’s leading proponent of GPGPU (General Purpose Graphics Processing Unit)

programming through their release and support of CUDA. CUDA is an extension to the C

programming language that has helped to make GPGPU programming more accessible to


1.2 Image Processing

Image processing “refers to the manipulation and analysis of pictorial information”[3]. Image

processing has become widely used. The scientific and entertainment communities use it

to do such things as convert photographs to black

and white or increase the sharpness of an image. In relation to this thesis, manipulating an

image is an important step for implementation of computer vision. Computer vision involves

the automation of such important tasks as the diagnosis of illnesses or facial recognition.

The general idea behind image processing involves examining image pixels and manip-

ulating them as defined by the type of image processing desired. Image processing can be

a time consuming task and, luckily, lends itself nicely to conversion to a parallel algorithm.

1.3 Motivation

Performing research under the R.E. Lee Scholar program initially sparked interest in utilizing

parallel algorithms for high performance computing. The summers of 2007 and 2008

were spent working with this concept. Preliminary research was done for this thesis prior

to the beginning of classes in the fall. We spent this time becoming proficient in programming

with CUDA and converted a simple physics heat-diffusion algorithm to a GPU based


Inspiration for this project was taken from Professor Tania S. Douglas and her research

group, MRC/UCT Medical Imaging Research Unit, Department of Human Biology, Univer-

sity of Cape Town in South Africa. This research group has been developing an automated

process for diagnosis of tuberculosis particularly for low-income and developing countries[4].

In the current environment, the diagnosis of tuberculosis is a time consuming task that re-

quires a highly trained technician to examine a sputum smear underneath a microscope.

Examining sputum smears (smears of matter taken from the respiratory tract) under a

microscope is the number one method for diagnosing tuberculosis, according to the World

Health Organization (WHO) [1].

The problem with the current method for TB diagnosis is the human element inherent

to it. Each slide must be closely examined by a medical technician with a level of competency

that is not necessarily guaranteed. This problem is compounded by the fact that in

developing countries where TB is still a major health risk, there is usually a shortage of senior

pathologists to verify manual screening, a requirement of the WHO. Additionally, slide

examination is a tedious and time consuming task. On average a technician will examine

each sputum slide for five minutes and examine around 25 slides per day [4]. Automating

TB diagnosis will help to alleviate the need for highly trained medical technicians to per-

form unexciting tasks while at the same time increasing the accuracy of diagnosis and the

number of diagnoses that can be made in a period of time.

Our work on the GPU with CUDA is significant for the University of Cape Town group

because of the aforementioned advantages of utilizing the GPU in parallel applications.

Using CUDA has the potential for speedups of orders of magnitude over

sequential counterparts. Additionally, the small space requirement and relatively low cost of

implementing a GPU for scientific applications allows for a degree of portability unavailable

to a typical high powered computer. Creating a cost-effective method for medical computing

in undeveloped areas has the potential to help improve the conditions in places that do not

have the level of health care enjoyed in other countries.

The automated diagnosis of TB through the examination of sputum smears requires

several steps, one of which is the recognition of abnormal smears. This recognition requires

that the image of the smear go through some form of image processing involving edge

detection. There are a number of well documented edge detection algorithms but the

one which we chose to implement for this research was the Laplacian method of pixel


1.4 Thesis Layout

Our intention in this thesis is to show the advantages of utilizing CUDA for image processing.

In the second chapter we provide some background information on parallel computing in

general and GPGPU programming in addition to a more in-depth description of image

processing. The third chapter describes our two GPU algorithms. Chapter four reports our

method of experimentation and the results of our work, and we make our conclusions in

chapter five. We show that the edge detection algorithm on the GPU is considerably faster

and more efficient than that of the sequential version and discuss how a GPU implementation

of the entire diagnostic process could be achieved.

Chapter 2

Background


This thesis makes reference to the area of parallel computing. We used this area of computational

science as a foundation for our work with GPGPU programming with CUDA. Typical

parallel computing has a number of advantages and disadvantages, which we discuss in this

section. We also lay out how GPGPU programming compares to parallel computing.

2.1 Parallel Computing

Parallel computing is a simple idea. The human brain is well versed in performing tasks in

parallel; however, a single computer processor can only do one thing at a time in sequential

order. A parallel computer, which can perform multiple computations at once, can be used

to solve simple problems in a matter of minutes that would normally take hours or days on

a single processor.

The most basic concept behind parallel computing is the distribution of the workload

between the individual processors that are working together to perform computations. For

example, consider the problem of matrix addition, which is an embarrassingly parallel ap-

plication. Embarrassingly parallel problems are those that require no other effort beyond

dividing up the work and having each processor operate on its portion as though it were

a sequential algorithm. Suppose we have a parallel machine with p processors, and we want


to add two matrices with p elements each. This problem only requires that we distribute

the two corresponding elements from each matrix to their own processor, let the processor

add the elements, and then collect them into a solution matrix, as illustrated in Figure 2.1.

Other problems, such as matrix multiplication, are more complicated due to their require-

ment for processor cooperation or more careful and efficient distribution of data for their

[Figure: Matrix 1 and Matrix 2, each a 3 x 4 grid whose elements are labeled with the processors P0 through P11.]

Figure 2.1: Parallel matrix addition. Each processor, Pn, gets an element from each
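The element-per-processor distribution described above can be sketched in plain C. This is only a sequential stand-in, not code from the thesis: the loop plays the role of the p processors, and the names add_element, parallel_matrix_add, and proc_id are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Work done by one simulated processor: add the single pair of elements
   selected by its processor id (hypothetical name: proc_id). */
static void add_element(const int *a, const int *b, int *sum, size_t proc_id)
{
    sum[proc_id] = a[proc_id] + b[proc_id];
}

/* Sequential stand-in for p processors running add_element at once.
   No step reads a value written by another step, so the iterations could
   run in any order, or all at the same time. */
void parallel_matrix_add(const int *a, const int *b, int *sum, size_t p)
{
    assert(a != NULL && b != NULL && sum != NULL);
    for (size_t proc_id = 0; proc_id < p; proc_id++)
        add_element(a, b, sum, proc_id);
}
```

Because each simulated processor touches only its own element, no communication or synchronization is needed, which is what makes the problem embarrassingly parallel.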

The ideal theoretical parallel computer consists of p processors with an infinite amount of

memory. With such resources available, we can divide the workload between p processors to

the point where each processor works with the smallest possible piece of data. In practice

however, there does not exist such a computer. Efficient and proper use of a parallel

computer involves careful workload distribution between processors.

There are two basic architectures used with parallel computers: SIMD and MIMD.

Single Instruction Multiple Data (SIMD) architecture is defined by a simultaneous execution

of a single instruction on multiple pieces of data[9]. For example, referring back to the

matrix addition case, every processor has access to each element in both matrices and each

processor executes the same instructions, adding the corresponding elements based on their

processor ids. In Multiple Instruction Multiple Data (MIMD) architecture, each processor

runs independently using its own set of instructions[9]. A Beowulf cluster, a common type

of parallel computer, is implemented using the MIMD architecture[9].

Inter-processor communications allow for the transmission of data between processors


and sending of signals for reporting on their status. In many cases this allows for the syn-

chronization of the individual processors among their group, preventing some from moving

forward before the others are ready. Often, synchronization is crucial for reliable execution

of applications due to sharing of memory space. A race condition may occur if two processors

require read/write access to the same memory location. When this happens, it is impossible to

predict which processor gains access to the data first, and therefore the integrity of the

data cannot be guaranteed[8]. Race conditions can be avoided through inter-processor

communications. How the processors communicate usually depends on the type of processor

relationship being implemented.
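To make the race condition concrete, the following C sketch writes out two possible orderings of a read-modify-write performed by two processors. It is a deterministic simulation under assumed names (increment_in_turn, increment_interleaved), not real concurrent code; on actual hardware the ordering would be decided by timing.

```c
#include <assert.h>
#include <limits.h>

/* One read-modify-write split into its two halves, so the interleaving of
   two simulated processors can be written out explicitly. */
static int read_value(const int *shared) { return *shared; }
static void write_value(int *shared, int v) { *shared = v; }

/* Each processor finishes its update before the other starts. */
int increment_in_turn(int start)
{
    assert(start <= INT_MAX - 2);   /* guard against signed overflow */
    int shared = start;
    int p0 = read_value(&shared);
    write_value(&shared, p0 + 1);
    int p1 = read_value(&shared);
    write_value(&shared, p1 + 1);
    return shared;
}

/* Both processors read before either writes: one update is lost. */
int increment_interleaved(int start)
{
    assert(start <= INT_MAX - 2);   /* guard against signed overflow */
    int shared = start;
    int p0 = read_value(&shared);
    int p1 = read_value(&shared);
    write_value(&shared, p0 + 1);
    write_value(&shared, p1 + 1);
    return shared;
}
```

Starting from the same value, the two orderings give different results, which is why synchronization is needed whenever processors share writable data.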

Washington and Lee’s Beowulf cluster, The Inferno, is a 64 processor cluster that uses

the MPI (Message Passing Interface) protocol for communication between processors. In

clusters such as this one, there is no central shared memory repository for data; instead data

is often distributed by a central processor and stored in each processor’s local memory. Early

parallel computers made use of a central shared memory pool, but as time has progressed it

has become “difficult and expensive” to make larger machines with this form of memory[6].

MPI is used in order to distribute data between processors and allow for inter-processor

communication. With the message-passing model, we are able to work with very large sets

of data without being restricted by the size of a global shared memory pool[6].

The type of relationship between processors working in parallel determines the structure

for the program itself. The master/slave relationship is a common paradigm for working

in parallel. This consists of a sole “master” processor that presides over a group of “slave”

processors. The master is responsible for organizing and distributing the data while the

slaves operate on the data[6]. Typically, the slaves, upon starting, signal the master that

they are ready and waiting. As these signals are received, the master processor distributes

the data among the slave processors. The master is only responsible for managing the

data. In some cases, the master processor may clean up loose ends, but generally the slave

processors do the majority of the computing, as illustrated in Figure 2.2. Once they have

finished, the master collects the resulting data set from the slaves. The work-pool method

for processor communication, as illustrated in Figure 2.3, is similar to the master/slave

method, except that the slave processors make requests for smaller chunks of data from a

“pool” managed and distributed by the master processor [8].


Figure 2.2: Master/Slave workload distribution.

[Figure: processors such as P1 send data requests to the work pool and receive data in return.]

Figure 2.3: Work pool workload distribution.
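A minimal C sketch of the work-pool idea follows. It assumes a pool of numbered work items handed out in fixed-size chunks, and it simulates the workers by letting them take turns; in a real work pool, whichever processor finishes first would make the next request. WorkPool, pool_request, and run_work_pool are hypothetical names, and no MPI calls are shown.

```c
#include <assert.h>
#include <stddef.h>

/* A pool of `total` work items handed out in chunks of at most `chunk`. */
typedef struct {
    size_t next;   /* index of the first item not yet handed out */
    size_t total;  /* number of items in the pool */
    size_t chunk;  /* maximum items per request */
} WorkPool;

/* Master side: answer one request. Stores the start of the chunk in
   *begin and returns its length, or 0 when the pool is empty. */
size_t pool_request(WorkPool *pool, size_t *begin)
{
    assert(pool != NULL && begin != NULL && pool->chunk > 0);
    size_t remaining = pool->total - pool->next;
    size_t n = remaining < pool->chunk ? remaining : pool->chunk;
    *begin = pool->next;
    pool->next += n;
    return n;
}

/* Simulate `workers` slaves taking turns requesting chunks until the pool
   is drained; items_done[w] counts the items processed by worker w. */
void run_work_pool(size_t total, size_t chunk, size_t workers, size_t *items_done)
{
    assert(workers > 0 && items_done != NULL);
    WorkPool pool = { 0, total, chunk };
    for (size_t w = 0; w < workers; w++)
        items_done[w] = 0;

    size_t begin, n, w = 0;
    while ((n = pool_request(&pool, &begin)) > 0) {
        items_done[w] += n;       /* worker w "processes" the chunk */
        w = (w + 1) % workers;    /* next worker asks for more work */
    }
}
```

Handing out small chunks on request, rather than one large share per processor up front, is what lets faster processors take on more of the work.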

When the data set is large or the computation complex, utilizing a parallel computer and

an algorithm that employs an effective load balancing scheme can result in a dramatic

increase in performance, with a speedup that at best grows in proportion to the number of

processors. With enough processors working together, the main performance

limitation becomes the time for inter-processor communication.

2.2 Graphics Processing Units

A GPU is quite different from a CPU. The GPU is concerned with one thing, the processing

of graphics data whereas the CPU is responsible for general computations and system

administration. GPUs process graphical data in the form of vertices in a geometric space.

This data is converted into a 2-dimensional image for display on the monitor through a

process known as the graphics pipeline[5]. The pipeline consists of a number of stages,

as illustrated in Figure 2.4. At each stage, the input consists of a set of vertices that are

manipulated or transformed. Data can be streamed into the graphics pipeline since vertices

are able to be processed independently from each other. Thus, the basic operations of the

graphics pipeline can be performed in parallel.

Figure 2.4: The OpenGL graphics pipeline.

Over time, the GPU has evolved into a very different component. The first genera-

tion of graphics processors were not themselves programmable[5]. Their instructions were

hard-coded in the chip-set with data transmitted from the CPU. However, this changed

with the development of the programmable GPU. The original programmable GPUs were

utilized with graphics processing API’s such as OpenGL and DirectX[5]. Developers gained

more control over the GPU in the next generation with graphics specific languages such as

NVIDIA’s Cg[5].

Due in large part to increased consumer demand over the years for computer graphics

that continue to dazzle the eye, graphics processing units have “evolved into highly parallel,

multithreaded, many-core processors with tremendous computational horsepower and very

high memory bandwidth”[2]. Today’s GPUs are capable of performing the necessary cal-

culations in real-time without skipping a beat so that gamers and researchers can become

immersed in the newest state-of-the-art computer games and scientific imaging applications.

The tremendous power and level of control the developer has over the current generation of

GPUs has made it attractive for developers to utilize the GPU for general purpose appli-

cations in addition to graphics processing—a practice known as General Purpose Graphics

Processing Units (GPGPU). The rising popularity of GPGPU has led to the newest gen-

eration of GPUs being constructed with the idea that they might not necessarily be used

exclusively for graphics processing.

2.2.1 General Purpose Graphics Processing Unit

In recent years a new area of parallel computing has begun to garner a good deal of attention

due to its affordability and power. GPGPU programming takes the highly parallel nature

of the GPU and applies it to computationally expensive algorithms. The area of GPGPU

programming focuses on using these programmable graphics cards for more than what they

were intended for, that is, general purpose high end computations.

GPGPU programming came about from the powerful nature of the GPU. Graphics

processing involves a heavy volume of mathematically intensive operations to create and

transform objects within a geometric space. Since the GPU traditionally only handles one

aspect of the computer, there is considerably more space on the chip for data processing

and storage to the point where the number of transistors present on the GPU has, over the

past five years or so, greatly surpassed the number of transistors on the area of the CPU,

as illustrated in Figure 2.5[2]. Furthermore, since Moore’s Law of increasing computational

power also applies to the GPU, we can expect the number of transistors on the area of a

state-of-the-art graphics processor to double approximately every two years[2]. Addition-

ally, a powerful GPU simply resides in the computer tower along with the other pieces of

hardware as opposed to a parallel cluster, which takes up an entire room.

Figure 2.5: GPU vs. CPU speeds over time.

Taking advantage of the GPU’s ability for speed is not that simple. Early GPGPU

methods for programming were tedious with high learning curves because data had to

be represented in ways completely different from typical programming methods on the

CPU. Data in the GPU had to be stored in the form of vertices, which complicated

programming the GPU for tasks other than those of a graphical nature.

When considering the use of GPGPU programming, it is important to consider the

advantages and disadvantages. As stated earlier, the highly parallel nature of the GPU is

capable of far superior performance on certain types of computations when compared to the

CPU. Also, the price of a graphics card is another huge draw. A top-of-the-line GPU sells

for only a few thousand dollars whereas a traditional parallel computer of any reasonable

size goes for many times that.

2.2.2 Compute Unified Device Architecture

NVIDIA, the most prevalent graphics card producing corporation in the computing industry,

released its Compute Unified Device Architecture and the corresponding CUDA language in

2006 as one of the first programming languages meant specifically for GPGPU programming.

CUDA is an extension of the C programming language that adds some syntax for working

with the GPU. NVIDIA’s newest generation of graphics cards are CUDA capable allowing

GPU programming with this simple extension to the C programming language.

The CUDA based graphics cards use a SIMD-like architecture referred to by NVIDIA

as Single Instruction Multiple Thread (SIMT)[2]. As an application is executing, each

thread is mapped to one of the multiprocessor cores (8 to 128 cores per multiprocessor,

up to 30 multiprocessors, depending on the card). Each thread has an id that is used to

distribute the workload among them. Each core is able to run in parallel, resulting in as

many as 3840 threads running simultaneously. When there are too many threads for a total

parallel execution, scheduling of threads on cores is handled by the hardware. Scheduling

being handled on the GPU results in very fast context switches giving the illusion of a fast

parallel execution.

The section of code that is executed by the GPU is known as the device kernel. Before

this kernel can execute, data must be transferred to an allocated memory space on the GPU.

To make the most of CUDA, the programmer must distribute data throughout the GPU’s

memory and determine what type of memory to utilize[7]. Choosing the right memory

requires a basic understanding of how the different memory types are accessed and which

type is most efficient for the task at hand. If memory is used poorly,

performance can be even worse than if the problem were simply being solved on

the CPU.

When allocating memory on the device, there are primarily three kinds of memory that

can be accessed: global memory, shared memory, and local registers[2]. Figure 2.6 illustrates

how the different types of memory are arranged with respect to each other. Data stored in global

memory is available to all threads at once. Global data is loaded into the device from the

CPU during a memory allocation stage that occurs prior to the kernel execution. Memory

locations that are accessed consecutively within memory are most efficiently allocated to

global memory; however, care must be taken because if different thread blocks on the GPU

require read/write access to the same global memory register, there is no guarantee the

value in the memory location is correct due to race conditions[2].

In the case of shared memory, data stored in this memory type is only accessible by

threads in a common block. If the data can be divided into chunks, loading each chunk into shared

memory allows for extremely fast retrieval within the thread block. As with global memory,

the developer must be careful when using shared memory to avoid race conditions and the

resulting data corruption. Shared memory on the most recent generation of GPUs is limited

to 16 KB per block[2].
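As a rough check of what the 16 KB limit means for image data, the C sketch below computes the shared memory needed for a block-sized tile of pixels. It assumes 3 bytes per pixel (one byte each for red, green, and blue) with no padding; tile_bytes and tile_fits are hypothetical helper names, and any shared memory used for other purposes is ignored.

```c
#include <assert.h>
#include <stddef.h>

#define SHARED_LIMIT_BYTES (16u * 1024u)  /* per-block limit quoted above */
#define BYTES_PER_PIXEL 3u                /* assumed: one byte each for R, G, B */

/* Bytes needed to hold a tile_w x tile_h tile of RGB pixels. */
size_t tile_bytes(size_t tile_w, size_t tile_h)
{
    return tile_w * tile_h * BYTES_PER_PIXEL;
}

/* Nonzero when the tile fits within the shared memory limit. */
int tile_fits(size_t tile_w, size_t tile_h)
{
    assert(tile_w > 0 && tile_h > 0);
    return tile_bytes(tile_w, tile_h) <= SHARED_LIMIT_BYTES;
}
```

Under these assumptions a 32 x 16 tile needs 1536 bytes, far below the limit, so the constraint matters mainly when each block stages a much larger region of the image.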

Figure 2.6: CUDA memory organization and access.

CUDA takes care of organizing the threads in the GPU. The user simply specifies how

the threads are divided amongst a collection of blocks. The number of these blocks, and

how they relate to each other in a grid of up to three dimensions, is also specified by the

user. Thread blocks can have up to three dimensions. Taking advantage of these dimensions is useful

for working with data of various sizes. There are restrictions, however, as a thread block

has a limit to the number of threads that can be allocated to the block. CUDA allows

for a maximum of 512 threads per block[2]. How these threads are organized is up to the

user, so a one dimensional block of 512 threads and a two dimensional block of 32 x 16

are both valid. Some possible thread blocks are shown in Figure 2.7. Additionally, in the

background, threads within a block are grouped together in what is referred to as a warp[2].

These warps are made up of at most 32 threads and always contain consecutive threads

of increasing ids. The order of execution of warps is undefined. However, warps within

the same block are able to synchronize with each other for safe global and shared memory

[Figure: example thread blocks in one, two, and three dimensions, with threads indexed 0 through 5 along x and further rows and layers indexed by ty and tz.]

Figure 2.7: Some possible thread block allocations.

Each thread within a block is assigned a unique thread id that is determined by its

placement within the block dimensions. Typically these thread ids play a part in the

division of labor between threads of a block. For instance, the thread with id of one may

be responsible for all data points in the first column of some collection. Additionally each

block of threads is also assigned a unique block id that, like the thread id, determines the

part of the workload each block is responsible for[2].
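The relationship between a thread's position in its block, its linear id, and its warp can be sketched in C as follows. The linearization used here (x varies fastest, then y, then z) follows the convention in NVIDIA's programming guide; Dim3 is a plain stand-in for CUDA's dim3 type, and the function names are hypothetical.

```c
#include <assert.h>

#define WARP_SIZE 32

typedef struct { unsigned x, y, z; } Dim3;  /* stand-in for CUDA's dim3 */

/* Linear id of thread (x, y, z) in a block: x varies fastest, then y, then z. */
unsigned linear_thread_id(Dim3 idx, Dim3 block_dim)
{
    assert(idx.x < block_dim.x && idx.y < block_dim.y && idx.z < block_dim.z);
    return idx.x + idx.y * block_dim.x + idx.z * block_dim.x * block_dim.y;
}

/* Warp that a thread belongs to: consecutive ids are grouped 32 at a time. */
unsigned warp_of(unsigned linear_id)
{
    return linear_id / WARP_SIZE;
}

/* Number of warps needed for a block, rounding a partial warp up. */
unsigned warps_per_block(Dim3 block_dim)
{
    unsigned threads = block_dim.x * block_dim.y * block_dim.z;
    return (threads + WARP_SIZE - 1) / WARP_SIZE;
}
```

For a 32 x 16 block, for example, every row of 32 threads forms exactly one warp, so the block is executed as 16 warps.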

Blocks of threads execute a CUDA kernel. A kernel is a globally defined function that

is run by all threads. The threads within a block execute the kernel independently of each

other[2]. This independent thread execution results in the need for a way of synchronizing

the threads when data is shared. This would ensure reliable data retrieval. Luckily, CUDA

has a built-in __syncthreads() function that, when called within the kernel, forces all threads

to wait upon reaching the __syncthreads() function call until all threads within the same

block reach that point within the kernel[2]. Typically, the __syncthreads() function is needed

after the threads have loaded data into shared memory, before they begin retrieving and

performing computations on it. This synchronization process is illustrated in Figure 2.8.

[Figure: the threads of a thread block load data into shared memory, synchronize, and then execute their calculations.]

Figure 2.8: Threads loading data into memory and synchronizing.


2.3 Image Processing

Image processing is defined as the manipulation of images. Operations on images that are

considered a form of image processing include zooming, converting to gray scale, increasing

or decreasing image brightness, red-eye reduction in photographs, and, in the case of this

study, edge detection, as illustrated in Figures 2.9 and 2.10. These operations typically

involve an exhaustive iteration over each individual pixel in an image.

Figure 2.9: Test image before edge detection.

Figure 2.10: Test image after edge detection.


A common method for image processing is pixel classification. Pixel classification defines

a pixel’s class based on one of its features. In the case of edge detection, the feature examined

is its intensity versus the intensity of its neighboring pixels. Pixel classification is not limited

to edge detection alone; it is also used for converting an image to gray-scale. (Gray-scale

conversion is also used in our work since we found that edges of images that had been

run through an edge-detection algorithm were easier to discern if they had been converted

to gray scale first). Pixel classification works as follows: for each pixel in an image, its

desired feature is examined, and the pixel is modified as specified. For the Laplacian edge

detection method this process is defined by an image kernel. (Note: this kernel is unrelated

to the CUDA kernel that is executed on the GPU). The Laplacian image kernel is a 3 x

3 two-dimensional array, as shown in Figure 2.11. This kernel is applied to each pixel in

the image and takes into account the pixel’s neighbors in a 3 x 3 area around it. Given the

pixel identified as x_{i,j} and the kernel k, the formula for the new value of x_{i,j} is as follows:

\[
\begin{aligned}
out_{i,j} ={} & x_{i-1,j-1}\,k_{i-1,j-1} + x_{i,j-1}\,k_{i,j-1} + x_{i+1,j-1}\,k_{i+1,j-1} \\
 & + x_{i-1,j}\,k_{i-1,j} + x_{i,j}\,k_{i,j} + x_{i+1,j}\,k_{i+1,j} \\
 & + x_{i-1,j+1}\,k_{i-1,j+1} + x_{i,j+1}\,k_{i,j+1} + x_{i+1,j+1}\,k_{i+1,j+1}
\end{aligned}
\]

This algorithm is non-trivial for large images as it must perform this calculation three

times total for each pixel in an image, once for the red, blue, and green RGB values. The

number of computations that must be performed along with the ability to represent the

data as a two dimensional array indicated that edge detection would greatly benefit from a

parallel implementation on the GPU with CUDA.

Kernel          Pixel Neighborhood
-1  -1  -1      x(i-1,j-1)  x(i,j-1)  x(i+1,j-1)
-1   8  -1      x(i-1,j)    x(i,j)    x(i+1,j)
-1  -1  -1      x(i-1,j+1)  x(i,j+1)  x(i+1,j+1)

Figure 2.11: Laplacian edge detection.
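A sequential C sketch of the formula, applied to one pixel of a single-channel image, might look like the following. It is not the thesis implementation: the row-major layout, the restriction to interior pixels, and the name laplacian_at are assumptions, and for an RGB image the same sum would be computed once per color channel.

```c
#include <assert.h>

/* The 3 x 3 Laplacian image kernel from Figure 2.11. */
static const int LAPLACIAN[3][3] = {
    { -1, -1, -1 },
    { -1,  8, -1 },
    { -1, -1, -1 }
};

/* Weighted sum of the 3 x 3 neighborhood centred on column i, row j of a
   single-channel, row-major image. The caller must keep (i, j) off the
   image border so that every neighbor exists. */
int laplacian_at(const unsigned char *img, int width, int height, int i, int j)
{
    assert(i > 0 && i < width - 1 && j > 0 && j < height - 1);
    int out = 0;
    for (int dj = -1; dj <= 1; dj++)
        for (int di = -1; di <= 1; di++)
            out += LAPLACIAN[dj + 1][di + 1] * img[(j + dj) * width + (i + di)];
    return out;
}
```

Because the kernel weights sum to zero, a region of uniform intensity produces 0, while a pixel that differs from its neighbors produces a large positive or negative value. The result is returned as an int because it can fall outside [0 . . . 255].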

Chapter 3

GPU Edge Detection Algorithms

The organization of multidimensional thread blocks into multidimensional grids suits CUDA

development to the processing of arrays of data. As a result, data that can easily be

represented in these forms are usually best suited for migration to CUDA for processing.

Such data types include images that can be represented as a two-dimensional matrix where

each entry corresponds to a single pixel in the image. An image pixel consists of a discrete

red, green, and blue component in the range [0 . . . 255].

To develop a CUDA parallel algorithm for Laplacian edge detection, we took two approaches.

The first was straightforward in its data distribution scheme and organization

of thread blocks, while the second took a new approach in an attempt to increase efficiency

within thread blocks.

3.1 One Pixel Per Thread Algorithm

Our first implementation of the Laplacian edge detection algorithm using CUDA is fairly

straightforward. We create a two-dimensional grid that is overlaid on the image, segmenting

it into several rectangular sections, as illustrated in Figure 3.1. For simplicity, we assume

the image can be evenly divided into full sized segments. Processing images whose dimensions

do not divide evenly would not be an overly complicated addition to the application.


Each thread within the thread block corresponds to a single pixel within the image.

However, each thread is not necessarily only responsible for loading one pixel entry into

the shared memory. The nature of the Laplacian pixel group processing method for edge

detection requires that a 3x3 area surrounding the target pixel be analyzed to calculate

the output. Therefore, threads on the edge of a thread block must examine pixels that

are outside the dimensions of the thread block. In order to ensure accuracy of the output

image, these threads are responsible for loading the pixels they are adjacent to that do not

have a mapping in the thread block into shared memory. That is, the threads on the edge of

the block load the boundary pixels into shared memory. This extra step is performed after

the initial shared memory load that all threads perform. To compensate for the required

extra space, the two-dimensional shared memory array is allocated to have dimensions of

(blockDim.x + 2, blockDim.y + 2). This allocates two additional rows and two additional

columns of shared memory.
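The padded load described above can be sketched on the CPU as follows. This is an illustration only, using a single channel and a hypothetical function name; in particular, clamping coordinates at the image border is an assumption, since the text does not say how pixels outside the image are handled.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Copy the tile for block (blockX, blockY) plus a one-pixel border into a
// (blockW + 2) x (blockH + 2) buffer, mimicking the shared memory load.
// Assumption: coordinates outside the image are clamped to the nearest edge pixel.
std::vector<int> loadTileWithHalo(const std::vector<int>& image, int width, int height,
                                  int blockX, int blockY, int blockW, int blockH) {
    assert(static_cast<int>(image.size()) == width * height);
    const int tileW = blockW + 2;
    const int tileH = blockH + 2;
    std::vector<int> tile(tileW * tileH);
    for (int ty = 0; ty < tileH; ++ty) {
        for (int tx = 0; tx < tileW; ++tx) {
            // Tile cell (1, 1) corresponds to the first pixel owned by the block.
            int x = std::max(0, std::min(blockX * blockW + tx - 1, width - 1));
            int y = std::max(0, std::min(blockY * blockH + ty - 1, height - 1));
            tile[ty * tileW + tx] = image[y * width + x];
        }
    }
    return tile;
}
```

The interior of the buffer holds the pixels that map to threads, and the outer ring holds the boundary pixels loaded by the threads on the edge of the block.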



Figure 3.1: Thread blocks for single pixel per thread method.

Once the block has loaded its respective section of the target image into shared memory,

a __syncthreads() barrier is called so that the threads can regroup before proceeding. With

all shared memory loads complete, the kernel then proceeds with the convolution of the image.

The source code follows:

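// Helper functions (compute2DOffset, loadImageBlock, convolve) are not shown.
// BLOCK_DIM_X and BLOCK_DIM_Y are assumed compile-time constants equal to
// blockDim.x and blockDim.y; a static __shared__ array needs constant sizes.
__global__ void edged(uchar3* pixelsIn, uchar3* pixelsOut, int1* devKernel, int width)
{
    __shared__ uchar3 sData[BLOCK_DIM_X + 2][BLOCK_DIM_Y + 2];
    int tx = threadIdx.x, ty = threadIdx.y;
    int ndx = compute2DOffset(width, threadIdx, blockIdx, blockDim);
    loadImageBlock(sData, pixelsIn, ndx);
    __syncthreads();  // wait until the whole block, including its border, is loaded

    int3 value = make_int3(0, 0, 0);
    for (int u = -1; u < 2; u++)
        for (int v = -1; v < 2; v++)  // devKernel is assumed to be a flat, row-major 3x3 kernel
            convolve(value, devKernel[3 * (u + 1) + (v + 1)].x * sData[tx + 1 + u][ty + 1 + v]);

    /****** clamp RGB values *******/

    pixelsOut[ndx] = make_uchar3(value.x, value.y, value.z);
}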

For each thread in the block we iterate through the convolution kernel and the 3x3

pixel group in which the target pixel is the center element. After applying the convolution

formula the pixel’s red, green, and blue values may be outside the range of [0 . . . 255]. We

fix this by clamping the RGB values so that if the value is greater than 255, it is set to

255 and if the value is less than 0, it is set to 0. Without this clamping step, storing the

value in an 8-bit channel would keep only the value mod 256, producing an incorrect pixel value. As pixels

are calculated, they are stored in the out-pixel array that belongs to the designated output

image. Once the CUDA kernel has finished executing, the allocated memory within the

GPU is freed and the program exits.
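The effect of the clamping step can be seen in a small single-channel C++ sketch. The function names are hypothetical, and the 8-neighbor Laplacian kernel below is an assumed, commonly used variant chosen for illustration; it is not necessarily the kernel used in our implementation.

```cpp
#include <cassert>
#include <cstdint>

// Assumed 3x3 Laplacian kernel (8-neighbor variant), stored row-major.
const int kAssumedLaplacian[9] = {-1, -1, -1,
                                  -1,  8, -1,
                                  -1, -1, -1};

// Clamp a convolution result into the valid 8-bit channel range [0, 255].
std::uint8_t clampToByte(int value) {
    if (value > 255) return 255;
    if (value < 0) return 0;
    return static_cast<std::uint8_t>(value);
}

// Without clamping, the conversion keeps only the low 8 bits,
// which is the value modulo 256.
std::uint8_t wrapToByte(int value) {
    return static_cast<std::uint8_t>(value);
}

// Weighted sum of a 3x3 neighborhood (row-major) with a 3x3 kernel.
int convolve3x3(const int neighborhood[9], const int kernel[9]) {
    int sum = 0;
    for (int i = 0; i < 9; ++i) sum += kernel[i] * neighborhood[i];
    return sum;
}
```

With this kernel, a single pixel of value 200 surrounded by zeros produces a sum of 1600. Clamping stores 255, a strong edge response, whereas the unclamped conversion stores 64, a dark pixel that hides the edge.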

3.2 Multiple Pixels Per Thread Algorithm

Our second implementation of the edge detection algorithm takes a different approach to

the data distribution aspect of the problem. Instead of creating two-dimensional thread

blocks that map directly to the target image, we create a series of one dimensional thread

blocks that are each responsible for a 2-D shared memory space, as illustrated in Figure 3.2.



Figure 3.2: Thread blocks for multiple pixels per thread method.

The basic flow of the second GPU implementation works the same as the initial one.

After allocating GPU memory and copying the source image to the GPU, the threads

are responsible for copying the global memory into shared memory. Because the blocks

are responsible for more pixels than threads, iteration through the image is necessary. A

number of for loops are utilized within the CUDA kernel to cycle through the portions of

the image each block is responsible for. In order to maximize efficiency there is one for loop

per stage in the algorithm. This allows each stage to complete before moving on rather

than alternating between loading data into shared memory and solving.

Loading data into the 2-D shared array works similarly to the first implementation. An

initial load is done of the pixel that maps directly to a thread in the block. Then the kernel

iterates down the image segment within a for loop. In each iteration we increment the

index by width, where width is the width of the image. The first and last threads in the

block are responsible for loading the left and right boundaries of the shared memory block,

and all threads load the pixels they correspond to in the upper and lower boundaries.
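The index arithmetic of this column walk can be sketched in C++ as follows. The function names and the way the starting offset is derived are hypothetical; the only detail taken from the description above is that the index advances by the image width on every iteration.

```cpp
#include <cassert>
#include <vector>

// Global-memory indices visited by one thread as it walks down its column:
// start at the pixel the thread maps to, then step by the image width to
// reach the same column in each following row.
std::vector<int> columnIndices(int startIndex, int width, int rows) {
    assert(width > 0 && rows >= 0);
    std::vector<int> indices;
    indices.reserve(rows);
    int ndx = startIndex;
    for (int i = 0; i < rows; ++i) {
        indices.push_back(ndx);
        ndx += width;  // same column, next row
    }
    return indices;
}

// Hypothetical starting offset for thread threadX in a 1-D block whose
// image segment begins at column segmentX, row segmentY.
int startIndexFor(int segmentX, int segmentY, int threadX, int width) {
    return segmentY * width + segmentX + threadX;
}
```

Because neighboring threads start at neighboring offsets and all advance by the same stride, the threads of a block read adjacent pixels of the same row in every iteration.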

Following the successful data transfer to shared memory space, the actual computation

algorithm is nearly identical to the first implementation. The main difference is that the

calculations must be done for one row of the shared memory space at a time, because

fewer CUDA threads are executing. After each pixel is determined through the convolution

process, it is stored temporarily in the shared 2-D array until the entire convolution

algorithm is completed. We can store the end-pixel in the shared memory space without

corrupting the data needed for the next computation since the algorithm iterates through

the shared array one row at a time. The convolution process only requires knowledge of

a pixel’s immediate neighbors in a 3x3 region; therefore, once one row is completed, the

kernel no longer needs the row above in any future calculations. This allows the kernel

to store the newly determined pixels in the previous row of shared memory with impunity

so long as the __syncthreads() function is invoked to ensure thread synchronization. Once

all threads have finished convolution for the entire array of shared memory, the end-pixels

stored in shared space are then copied to the output pixel array.

__global__ void edged(uchar3* pixelsIn, uchar3* pixelsOut, int1* devKernel, int width)
{
    // BLOCK_DIM_X is an assumed compile-time constant equal to blockDim.x.
    __shared__ uchar3 sData[BLOCK_DIM_X + 2][SHARED_SIZE_Y + 2];
    int3 value = make_int3(0, 0, 0);
    int tx = threadIdx.x;
    int ndx = compute2DOffset(width, threadIdx, blockIdx, blockDim);

    /*****load shared data*****/
    for (int i = 1; i < SHARED_SIZE_Y - 1; i++)
        loadImageBlock(sData, pixelsIn, ndx, i);

    /*****sync from shared memory load******/
    __syncthreads();

    for (int i = 1; i < SHARED_SIZE_Y - 1; i++) {
        value.x = value.y = value.z = 0;
        for (int u = -1; u < 2; u++)
            for (int v = -1; v < 2; v++)  // row i is the center of the 3x3 window
                convolve(value, devKernel[3 * (u + 1) + (v + 1)].x * sData[tx + 1 + u][i + v]);

        /******clamp RGB values******/

        /****make sure all threads are done with this round of data****/
        __syncthreads();

        /***store calculated pixel values in shared memory that****/
        /*** will no longer be used ****/
        sData[tx + 1][i - 1] = make_uchar3(value.x, value.y, value.z);
    }

    /*****copy the results from shared memory to the output image*****/
    for (int i = 1; i < SHARED_SIZE_Y - 1; i++)
        loadOutImage(sData, pixelsOut, ndx, i);
}
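
The row-reuse idea in this kernel can be checked with a small CPU emulation. The sketch below is not the kernel itself: it uses a single channel, a hypothetical function name, and a plain loop over columns in place of the threads of a block, with the point where the barrier would fall marked in a comment.

```cpp
#include <cassert>
#include <vector>

// Filter a padded tile in place, one row at a time. The tile has `rows`
// interior rows and `cols` interior columns plus a one-cell border on every
// side, stored row-major with stride cols + 2. The result for interior row i
// is written into row i - 1, which no later row reads.
void convolveRowsInPlace(std::vector<int>& tile, int rows, int cols, const int kernel[9]) {
    const int stride = cols + 2;
    assert(static_cast<int>(tile.size()) == (rows + 2) * stride);
    std::vector<int> value(cols);  // one private result per emulated thread
    for (int i = 1; i <= rows; ++i) {
        for (int t = 0; t < cols; ++t) {  // emulated thread t handles column t + 1
            int sum = 0;
            for (int u = -1; u < 2; ++u)      // u: row offset
                for (int v = -1; v < 2; ++v)  // v: column offset
                    sum += kernel[(u + 1) * 3 + (v + 1)] * tile[(i + u) * stride + (t + 1 + v)];
            value[t] = sum;
        }
        // Barrier point: every thread has finished reading row i - 1.
        for (int t = 0; t < cols; ++t)
            tile[(i - 1) * stride + (t + 1)] = value[t];
    }
}
```

Each result is written only after the whole row has been computed, so every sum is formed from unmodified input rows, and the outputs end up shifted up by one row relative to their source pixels.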
Chapter 4

Evaluation and Results

To evaluate the parallel GPU algorithms, we performed several tests on images and com-

pared the times to the sequential implementation. The before and after sputum smear

images can be seen in Figures 4.1 and 4.2. The results of these tests showed an impressive

speedup of our parallel algorithm over the sequential. In addition, we further investigated

the ideal number of threads per block to maximize the efficiency of our algorithms.

4.1 Method

We evaluated our algorithms on three different GPUs whose specifications are in Table 4.1.

All machines run Fedora 9 Linux and use CUDA 2.0. For each implementation, a C++

driver runs a main method that processes the arguments from the user, opens or creates

the image objects to be manipulated, and initializes the convolution kernel. This driver then

invokes the CUDA source code file, which in turn calls the CUDA kernel (the GPU device

function), which executes the respective algorithms described above.

Our edge detection algorithms were evaluated on a collection of example images of

sputum smears provided by the University of Cape Town research group. Each image had

dimensions of 1280x968 pixels, and the block dimensions were adjusted accordingly so that

they would fit the image evenly. To make the detected edges in the output image as stark

as possible, it was usually best to first convert the image to gray-scale. Because edge detection

usually works by detecting a sudden change in pixel brightness levels, we observed that a

gray-scale image works best for an edge detection algorithm. Converting the image to

gray-scale helped bring out the contrast at object edges, making them more distinguishable

than in images with color.
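
A gray-scale conversion of the kind described above can be sketched in C++ as follows. The weights are an assumption: they are the widely used ITU-R BT.601 luma coefficients, and the text does not state which formula was actually applied to the sputum smear images. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed luma weights (ITU-R BT.601): 0.299 R + 0.587 G + 0.114 B.
// Integer arithmetic with rounding; the weights sum to 1000, so the
// result always stays within [0, 255].
std::uint8_t toGray(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    int luma = (299 * r + 587 * g + 114 * b + 500) / 1000;
    return static_cast<std::uint8_t>(luma);
}
```

Writing the same gray value to all three channels leaves the rest of the pipeline unchanged, since the convolution then produces identical results for red, green, and blue.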

Model                           GeForce 8400GS   GeForce 8500GT   GeForce 9800GTX
Total graphics memory           512 MB           512 MB           512 MB
Number of multiprocessors       1                2                8
Number of cores                 8                16               128
Clock rate                      1.4 GHz          0.92 GHz         1.89 GHz
Concurrent copy and execution   No               Yes              Yes

Table 4.1: Specifications of the NVIDIA graphics cards.

4.2 Results

Results differed, as expected, depending on the hardware being utilized. On average, the

speedup of the GPU algorithm over the sequential algorithm on the fastest GPU was about

an order of magnitude with the single pixel per thread algorithm and two to three orders

of magnitude with the multiple pixels per thread algorithm, as illustrated in Figures 4.3

and 4.4. Running on the slowest machine, the 8400GS, the single pixel per thread algorithm

averaged only about 20 ms faster than the sequential algorithm, while the multiple pixels

per thread algorithm achieved a 2.5x speedup. As can be seen in Figure 4.5, the multiple

pixels per thread algorithm was consistently and considerably faster than the single pixel

per thread algorithm.

Figure 4.1: Original sputum smear image.

After showing that the results of these implementations were far superior to the sequential

one, we wanted to investigate how much of an impact different sized thread blocks have

on execution. We ran each algorithm a number of times with thread block sizes starting at

32 and doubling every time, stopping at 256. The results are illustrated in Figures 4.6 and 4.7.

Figure 4.2: Sputum smear after being run through edge detection algorithm.



[Chart: legend includes Sequential and GTX+.]

Figure 4.3: Sequential time vs one pixel per thread GPU algorithm; 32 threads.



[Chart: time in ms; legend includes Sequential, 8400GS, and GTX+.]

Figure 4.4: Sequential time vs multiple pixels per thread GPU algorithm; 32 threads.



[Chart: series One Pixel per Thread and Multiple Pixels per Thread; GPUs shown include 8400GS and 9800GTX/GTX+.]

Figure 4.5: Comparison of GPU algorithms on different machines; 64 threads.




[Chart: time in ms for 32, 64, 128, and 256 threads; legend includes 8500GT and 9800GTX/GTX+.]

Figure 4.6: Differences in computation time with different sized thread blocks; one pixel per thread algorithm.




[Chart: time for 32, 64, 128, and 256 threads; legend includes 8400GS.]

Figure 4.7: Differences in computation time with different sized thread blocks; multiple pixels per thread algorithm.
Chapter 5

Conclusions


The GPU is an impressively powerful piece of hardware that has become well suited for parallel

applications. We have shown that edge detection is an excellent candidate for such applications,

especially where speed is of the essence. There are also a number

of possible projects that can be taken up as future work with this thesis as a foundation.

5.1 Conclusion

GPGPU programming is a powerful tool that, when applied correctly, can give impressive

results. In this paper we have described two possible load-balancing algorithms for

performing edge detection and compared them to their sequential counterpart. Our findings

indicate that our second implementation, the multiple pixels per thread method, was

significantly more efficient than the single pixel per thread method.

5.2 Future Work

Edge detection is only one part of the auto-diagnosis project that the South African

research group is undertaking. Further investigation of aspects of their research that could

be implemented in GPGPU programming with CUDA has the potential to make automatic

diagnosis an even more attractive project. Parts of their research that could

benefit include the auto-focus algorithm for the microscope and the actual diagnosis

of the image after it has gone through the edge detection process.

GPUs are intended for working with data streams. Future work could involve investi-

gating the possibility of streaming multiple images into the GPU for edge detection. Cre-

ating data streams would enable copying of data to the device as the prior image is being

processed [2]. Working with multiple images over the course of a single execution would

likely make working with CUDA even more efficient than it already is.

GPU programming has the potential to be just as effective at performing fast com-

putations as traditional parallel computing. Further investigation into this process could

produce a very fast, relatively inexpensive method for efficient medical computations and

image processing. With this technology it is feasible that medical care could be

improved in areas without access to expensive hospitals or medical experts.


Bibliography

[1] Tuberculosis Fact Sheets. 2007.

[2] CUDA Programming Guide. NVIDIA Corporation, Santa Clara, CA, 2009.

[3] Gregory A. Baxes. Digital Image Processing: Principles and Applications. John
Wiley and Sons, Inc, New York, NY, 1994.

[4] Tania S. Douglas, Rethabile Khutlang, Sriram Krishnan, Andrew Whitelaw,
and Genevieve Learmonth. Image segmentation for automatic detection of
tuberculosis in sputum smears. 2008.

[5] Randima Fernando and Mark J. Kilgard. The Cg Tutorial: The Definitive Guide
to Programmable Real-Time Graphics. Addison-Wesley, New York, NY, 2003.

[6] William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI: Portable
Parallel Programming with the Message-Passing Interface; second edition. The MIT
Press, Cambridge, MA, 1999.

[7] Tom R. Halfhill. Parallel processing with CUDA. Microprocessor Report, 2008.

[8] Abraham Silberschatz, Peter Baer Galvin, and Greg Gagne. Operating Sys-
tem Concepts; sixth edition. John Wiley and Sons, Inc, New York, NY, 2002.

[9] Barry Wilkinson and Michael Allen. Parallel Programming: Techniques and
Applications Using Networked Workstations and Parallel Computers. Prentice Hall,
Upper Saddle River, NJ, 2005.