
Fast and Accurate Color Image Processing

Using 3D Graphics Cards

Philippe COLANTONI†, Nabil BOUKALA‡, Jérôme DA RUGNA†

† Laboratoire LIGIV, Université Jean Monnet
10, rue Barrouin, 42000 Saint-Étienne - FRANCE
colantoni|darugna@ligiv.org

‡ Laboratoire DIPI, École Nationale d’Ingénieurs de Saint-Étienne
58, rue Jean Parot, 42023 Saint-Étienne - FRANCE
boukala@enise.fr

VMV 2003, Munich, Germany, November 19–21, 2003

1 Abstract

The aim of this paper is to demonstrate how the new technologies introduced in recent 3D cards can be used in color image processing and analysis. We used the latest programmability features available in 3D cards to implement and test five color image processing algorithms: local mean filtering, RGB to L∗a∗b∗ and RGB to HSV color space conversions, local principal component analysis and anisotropic diffusion filtering. Using an nVIDIA™ NV30 graphics processing unit (GPU) we obtained, in most of the cases studied, faster results than with the tested processors (CPU). We show that the GPU can be 10 times faster than the best CPU we tested in the case of per-pixel processing with complex mathematical functions and vectorial calculations.

2 Introduction

Image processing and analysis are fields which naturally require high performance and large computational capabilities. Most algorithms in image processing need a large amount of computation and high precision. This is particularly true in color and multispectral image processing. Nowadays, real-time and high frame rate digital image processing is, most of the time, obtained by using dedicated and expensive hardware, the only kind that can provide enough computational capability to process streams of color images.

Moreover, the latest 3D cards, which are now more affordable, include interesting features for image processing purposes.

The purpose of this study is to show how to perform color image processing using this type of hardware technology. The second aim of this work is to evaluate the performance of 3D cards with regard to several common algorithms, in order to study whether the use of such hardware is relevant for a given class of algorithms. This study is mainly based on the technologies available in the latest generation of chips.

Algorithms can be differentiated and classified by their structure and the programmability resources they require.

The tested algorithms have been carefully chosen to cover most types of implementations encountered in image processing, excluding those which cannot be applied to the graphics hardware. The proposed algorithms have been implemented twice: a first implementation makes use of the 3D card, and another is optimized for the CPU. This allows us to compare and analyze the performance obtained from each version.

3 Programmability

So far, 3D cards have always been designed to offer high performance in two main areas:
• arithmetic calculations on vectors and matrices, which are essential for the efficient computation of geometric transformations and color manipulations. In our case of study, colors are 3D or 4D vectors.
• the indispensable accesses to different memory locations, such as the texture memory, the Z-buffer and the frame buffer, which require a large memory bandwidth.
Due to the rigidity of the graphics pipeline, these capabilities could not be exploited for any purpose other than 3D rendering.
Today, the programmability provided by the new generation of GPUs (graphics processing units) makes it possible to alter the two main engines of the pipeline: the transforming and lighting engine (T&L) and the multi-texturing engine. Basically, the vertex processor allows the traditional transforming and lighting process to be replaced by a specific geometrical manipulation program, called a “vertex program”. A vertex program handles the 3D geometrical data (vertices) entering the graphics pipeline and outputs, for each vertex, the vertex itself and its parameters (color, texture coordinates...). Further in the pipeline, the fragment processor, usually in charge of the multi-texturing task, can instead run any complex color manipulation program, and thus processes each polygon on a per-pixel basis.

This flexibility, especially at the multi-texturing stage, is essential for our field of study, since the main idea of our work is to use the fragment processor for image processing.

This approach has already been partially explored by [1] within the scope of color image segmentation based on basic programming methods. More recently, GPU computational capabilities have been used for other applications such as numerical computations [2, 3] and simulations.

Figure 1: Image processing GPU implementation model

4 Image processing using the fragment processor

In addition to this programmability feature, these recent 3D cards are particularly well suited to floating point computations, which allows high precision calculations. Moreover, off-screen rendering, which is a very useful feature, can be done by means of pbuffers, and mathematical functions implemented in hardware make the computation of complex mathematical expressions more efficient. The combination of these features leads us to think that this type of hardware is well adapted to run image processing algorithms.

However, some types of image processing algorithms cannot be implemented because of the limitations of the fragment processor. These algorithms have therefore been excluded from our study. The first limitation is that a fragment program can only output, for a given pixel, its color and depth values. Secondly, vertex and fragment programs are restricted to a maximum number of instructions¹: 65536 for vertex programs and 1024 for fragment programs. Moreover, they cannot contain more than 256 loops, 256 constants and 16 temporary registers.

¹ In our study we have used the nVIDIA™ NV30 processor.

Finally, the parallel “architecture” of the fragment processor enables several pixels to be processed at the same time. This improves the performance of the implemented algorithms, but it also prevents the processor from knowing or controlling the order in which pixels are processed. Therefore, the fragment processor is limited to per-pixel processing. Algorithms based on sequential scans of the image, where the processing of the current pixel needs the result obtained for a previous pixel (e.g. labeling or edge following segmentation), are unsuitable for the GPU and cannot be implemented (or only with difficulty) on such a parallel architecture. Consequently, they are excluded from our study.

This severe limitation can, to some extent, be overcome. As for 3D image rendering, a multi-pass approach can be used to process an image. Two pbuffers are needed: the first contains the data to be processed and the other one receives the processed data. Once this pass is finished, a new pass can be performed on the resulting image, with the two pbuffers exchanging roles, and so forth. Thanks to this method, an image can be processed several times and the implementation of iterative algorithms is made possible. It should also be noted that the GPU and the CPU can work simultaneously, fully exploiting the system and giving better performance. The processing pipeline is used at its best when the CPU and GPU processing times are equal.
Consequently, most image processing algorithms can be implemented using 3D cards, such as filtering and the major part of segmentation methods, including those based on thresholding and multi-resolution approaches (e.g. those using splitting and merging processes). Likewise, most mathematical morphology methods can also be adapted without any difficulty. Our field of interest concerns only color image processing algorithms.

5 Adapted algorithms - fragment processor

Conscious of the fact that it is difficult or impossible to implement all types of image processing algorithms on a GPU, we restrict the study to showing the ability of the GPU to perform algorithms which differ in four aspects:
• the amount of complex mathematical function computations;
• the amount of vectorial computations: the strength of the GPU is exploited here mainly for arithmetic computations on color vectors;
• the use of conditional structures (“if” tests);
• the amount of memory accesses, which is a very important point for all image processing algorithms based on neighborhood computations.
While some of the proposed algorithms focus on one of these aspects, others combine them to present more realistic and common situations. In the following sections we present these algorithms and their features, which are summarized in table 1.

              Math   If   Memory   Vectorial
Local mean     N     N      H         H
L∗a∗b∗         H     L      L         M
HSV            L     H      L         N
Local PCA      M     M      M         L
Diffusion      M     N      H         M

Table 1: N, L, M and H (for Null, Low, Medium and High) describe the importance of these four aspects according to the global content of an algorithm.

As mentioned above, the GPU is designed for vectorial computations. To illustrate how this feature is naturally used in color image processing, we propose to compare these two codes corresponding to the same processing:

• GPU Implementation
    // all three color channels are processed at once through .xyz
    num.xyz=Ixx.xyz*Iy.xyz*Iy.xyz
            -2*Ixy.xyz*Ix.xyz*Iy.xyz
            +Iyy.xyz*Ix.xyz*Ix.xyz;
    denom.xyz=Ix.xyz*Ix.xyz+Iy.xyz*Iy.xyz;
    dst.xyz+=alphag*(num.xyz/denom.xyz);
    return clamp (dst.xyz, 0, 1);
• CPU implementation
    // the same update, one color channel at a time
    for (int c=0;c<3;c++)
    {
      num=Ixx[c]*Iy[c]*Iy[c]
          -2.0*Ixy[c]*Ix[c]*Iy[c]
          +Iyy[c]*Ix[c]*Ix[c];
      denom=Ix[c]*Ix[c]+Iy[c]*Iy[c];
      value=dst(i,j)[c]+alphag*num/denom;
      if (value>1) value=1;
      else if (value<0) value=0;
      dst(i,j)[c]= value;
    }

For this example, the CG compiler generates three times fewer instructions than the C++ compiler.

The five algorithms chosen to conduct our experiments are the following:
• Local mean algorithm: this algorithm belongs to the class of filtering algorithms (see fig. 3). It consists in computing, for each pixel, the mean of its local neighbors. We have tested this algorithm with different masks, i.e. different sizes and kernels, to highlight how the performance of this tool evolves when the amount of memory transfers varies. Figure 2 shows the principle of the GPU implementation used for this processing.

Figure 2: Local mean - GPU Implementation

Figure 3: Mean Result

• RGB to CIE L∗a∗b∗ color space conversion: The perceptually uniform L∗a∗b∗ color space (see fig. 5) was designed to compute small color distances [6, 10]. Performing conversions between the L∗a∗b∗ and RGB color spaces requires the computation of non-linear mathematical expressions; this makes intensive use of the hardware mathematical functions but involves negligible memory transfers. Figure 4 shows our GPU implementation.

Figure 4: L∗a∗b∗ conversion - GPU Implementation

Figure 5: L∗a∗b∗ color space

• HSV color space conversion: The HSV color space (see fig. 6), which stands for Hue-Saturation-Value color space [7, 10], was developed as an intuitive way of picking colors. The conversion from RGB to HSV color space is done with quite simple mathematical expressions. However, unlike the RGB to L∗a∗b∗ conversion expressions, passing from RGB to HSV requires several “if” tests. Consequently, this kind of transform allows us to test the behavior of “branching” algorithms.

Figure 6: RGB cube projected in HSV color space

• Local PCA: The principal component analysis (PCA) [8], known in data analysis theory as the Karhunen-Loève transform, is a way to find the directions of greatest variance of a data set, which is then projected into a new orthogonal space built from these directions. This method is mainly used to perform dimensionality reduction. The algorithm that we have tested in our experiment computes the local PCA of a color image in order to determine the local orientations of color clouds (see fig. 7). As previously mentioned, a fragment program can only output four values. This prevents us from getting the three resulting eigenvectors for each image pixel. The generated axes can instead be represented as a quaternion, i.e. a 4D vector. A quaternion is an extension of complex numbers which does not specify an orientation but a change of orientation with respect to a reference which remains the same for each local PCA (the z axis in our case).
This complex algorithm involves a lot of computations and tests, and requires a lot of memory accesses.

Figure 7: Local PCA visualization

• Color image restoration: The restoration of noisy and blurred images has been widely studied, and many algorithms based on variational or stochastic formulations have tried to solve this ill-posed problem. However, such algorithms are mostly designed for grey level images, and only a few methods exist for multi-channel/color images. For our experiment, we have implemented the anisotropic diffusion algorithm developed by Sapiro et al. [12] (see fig. 9). This algorithm is of interest since it uses an iterative process [11] completely carried out by the graphics card thanks to two pbuffers (see fig. 8).

Figure 8: Anisotropic diffusion - GPU Implementation

Figure 9: Anisotropic diffusion result (100 iterations)

6 Test Conditions

3D cards based on the NV30 graphics processor are available in 2 versions that differ in their GPU and memory frequencies. We have carried out our tests with a standard NV30 card with the following characteristics: a 400 MHz GPU core and 128 MB of 800 MHz memory (SDRAM DDR2). Our fragment programs have been implemented and compiled using version 1.1 of CG [4, 5] and have been executed with OpenGL 1.4 and the latest driver available: nVIDIA™ Detonator FX 44.03.

We have also tested our algorithms on two processors, the AMD Athlon 1900MP and the Intel P4 2GHz, and three compilers have been compared:
• Microsoft C++ Compiler V12.0 (included in VC++ 6.0 Enterprise) and Intel C++ Compiler V7.1, both under Windows XP Professional,
• GNU/GCC 3.3 under GNU/Linux Debian.
We have used very aggressive optimization options in order to generate the fastest binaries. No assembler code has been integrated in our source code.

7 Results

In this section, we present the results obtained from the different algorithms tested². For each studied case, the average execution time and the performance are expressed in number of processed images per second³.

² For size reasons, the full set of results is available online at the following address: http://www.couleur.org/articles/VMV2003/
³ which, we think, is a clearer way of showing the results.

It is also worth noting that, in all of our implementations, the computations were done using 32-bit floating point numbers, whatever the initial format of the data.

Each test was conducted using 3 different image sizes, 256×256, 512×512 and 1024×1024, in order to analyze how the processing time evolves with the image size. For all tested algorithms, this evolution was expected to be linear, since each algorithm has the same linear algorithmic complexity.

It is important to note that the numerical results of the CPU and the GPU differ by a negligible amount, which is, we suppose, due to differences in the implementation of the mathematical functions.
7.1 Color data transfer

Before studying the performance of the GPU when processing data already located in the graphics card memory, it seems essential to consider first the data transfer from the central memory to the graphics card memory, which involves the AGP bus (AGP X4 in our case, i.e. 1 GB/s).

Figure 10: Memory transfer CPU - GPU (image sizes 256 × 256, 512 × 512 and 1024 × 1024)
[Bar chart. Bars: UC-UC, UC-Float, Float-UC, Float-Float, Half-UC, Half-Float.]

Figure 11: Memory transfer GPU - CPU
[Bar chart. Bars: UC-UC, Float-UC, UC-Half, Float-Half, UC-Float, Float-Float.]

The evolution of the data transfer speed (in images per second) is given in figures 10 and 11 for three different methods of coding the RGBA channels: unsigned char (8 bits), half float (16 bits) or float (32 bits).

The results obtained clearly show that the AGP bus bandwidth is fully used during all the transfers. They also indicate that it is essential to select a good storage format for the texture, in order to:
• limit the memory used (only 128 MB of memory are available on our graphics card);
• increase the loading speed of the images into the video memory.

7.2 Local mean filtering

Figures 12, 13 and 14 give the computation times of the local mean filtering using square neighborhoods of size 3 × 3, 7 × 7 and 11 × 11. The fragment programs generated by CG in these three cases consist of 27, 147 and 343 instructions respectively. Figure 12 clearly shows that, for small neighborhood sizes, the GPU performs better than the CPU. But for larger neighborhoods (as illustrated in figures 13 and 14), the difference is no longer obvious and the CPU can even be a better option.

The same results can therefore be expected for all algorithms based on a neighborhood approach. Consequently, it seems that our GPU is not really efficient at fetching several texture pixels in a fragment program.

Figure 12: Mean 3 × 3
[Bar chart. Bars: GPU UC-UC, Half-Half, Float-Float, Float-UC, Half-UC; CPU VC++ Athlon, IC Athlon, GCC Athlon, VC++ P4, IC P4.]

Figure 13: Mean 7 × 7
[Bar chart. Same bars as figure 12.]

Figure 14: Mean 11 × 11
[Bar chart. Same bars as figure 12.]

Figure 15 shows the evolution of the results obtained by the GPU with a mean filtering of size 7×7 while modifying the frequencies of our graphics card. These results indicate that the most significant factor for improving the processing speed of the GPU is the increase of the core frequency of the GPU.

Figure 15: Mean results with different core and memory frequencies
[Bar chart. Bars (core/memory, all UC-Float): GPU 400/800MHz, GPU 400/900MHz, GPU 450/800MHz, GPU 450/900MHz.]

7.3 RGB to L∗a∗b∗ color transformation

The color space transformation from RGB to L∗a∗b∗, as well as from L∗a∗b∗ to RGB, corresponds to per-pixel processing which requires the execution of several complex mathematical functions and vector calculations. These transformations are carried out within the GPU by fragment programs made of 51 instructions. In the best case (see figure 16) the GPU can be 10 times faster than the best CPU tested.

Figure 16: L∗a∗b∗ color space transformation
[Bar chart. Bars: GPU Float-UC, Half-UC, Float-Float, Half-Half; CPU VC++ Athlon, GCC Athlon, VC++ P4, IC P4.]

7.4 Transformation RGB to HSV

RGB to HSV conversions also correspond to per-pixel processing, and require fragment programs composed of 75 instructions. However, they do not use any vectorial instructions. While they require few mathematical functions, they use many tests. The results obtained, which are up to 4 times faster than with the CPU (see figure 17), indicate that the GPU performs very well in this case of study. Nevertheless, it seems that the numerous tests used by this algorithm limit the processing speed.

Figure 17: HSV color space transformation
[Bar chart. Same bars as figure 16.]

7.5 Local color cloud orientation

By computing the local orientations of color clouds, we can test a complex algorithm (the fragment program generated for the 7×7 neighborhood is composed of 884 instructions) that performs many computations, most of which do not involve vectorial calculations. This allows us to illustrate the performance of our GPU in the framework of a generic algorithm.

The results that we have obtained (see figures 18 and 19) show that the GPU is not very well adapted to this kind of algorithm. Above a 5 × 5 neighborhood size, it becomes less efficient than the CPU.
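To make the cost of this algorithm more concrete, the sketch below computes the dominant color direction of one neighborhood on the CPU. It is an illustration under stated assumptions, not the fragment program measured above: the samples are assumed to be the RGB values gathered from one neighborhood, the dominant axis is found with a plain power iteration (a technique chosen here for brevity, which the paper does not describe), and the quaternion encoding of the axes is not reproduced. The names `Vec3`, `Mat3`, `covariance` and `principal_axis` are hypothetical.

```cpp
#include <array>
#include <cmath>
#include <vector>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<std::array<double, 3>, 3>;

// Covariance matrix of a non-empty set of RGB samples
// (e.g. the pixels of one neighborhood).
Mat3 covariance(const std::vector<Vec3>& samples) {
    const double n = static_cast<double>(samples.size());
    Vec3 mean{0.0, 0.0, 0.0};
    for (const Vec3& s : samples)
        for (int c = 0; c < 3; ++c) mean[c] += s[c];
    for (int c = 0; c < 3; ++c) mean[c] /= n;

    Mat3 cov{};  // zero-initialized
    for (const Vec3& s : samples)
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j)
                cov[i][j] += (s[i] - mean[i]) * (s[j] - mean[j]);
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) cov[i][j] /= n;
    return cov;
}

// Dominant eigenvector of a symmetric positive semi-definite matrix,
// approximated by power iteration (repeated multiply and normalize).
Vec3 principal_axis(const Mat3& m, int iterations = 50) {
    Vec3 v{1.0, 0.5, 0.25};  // arbitrary non-zero start vector
    for (int it = 0; it < iterations; ++it) {
        Vec3 w{0.0, 0.0, 0.0};
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j) w[i] += m[i][j] * v[j];
        const double norm = std::sqrt(w[0] * w[0] + w[1] * w[1] + w[2] * w[2]);
        if (norm == 0.0) return v;  // degenerate case: no variance in the samples
        for (int i = 0; i < 3; ++i) v[i] = w[i] / norm;
    }
    return v;
}
```

Even in this reduced form, every pixel requires one pass over all samples of its neighborhood to build the covariance matrix, followed by an iterative scalar computation, which is consistent with the observation that this kind of algorithm combines many memory accesses with few vectorial operations.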
Figure 18: Local PCA 3 × 3
[Bar chart. Bars: GPU Half-Half, Float-Float, Float-UC, Half-UC; CPU VC++ Athlon, GCC Athlon, VC++ P4, IC P4.]

Figure 19: Local PCA 7 × 7
[Bar chart. Same bars as figure 18.]

7.6 Anisotropic diffusion

With such an iterative algorithm we can measure the performance of the GPU when the processing is completely executed within the graphics card by means of two pbuffers. Figure 20 shows that the GPU is 3 times faster than the best CPU tested, even though the algorithm processes a pixel neighborhood.

Figure 20: Anisotropic diffusion
[Bar chart. Bars: GPU Float-Float, Half-Half; CPU VC++ Athlon, GCC Athlon, VC++ P4, IC P4.]

7.7 Results analysis

The results that we have obtained indicate that the GPU gives, in most cases of study (4/5), better results than a CPU, whatever the compiler. Moreover, we have shown that the performance depends on:
• the structure of the algorithms;
• the formats of the memory buffers used for textures and pbuffers. In our case of study, we have been confronted with a bandwidth limitation of the graphics card used.
• whether the processing is done on a neighborhood, and the size of that neighborhood.
Finally, we can note that the execution time of the GPU is proportional to the processed image size, whatever the tested algorithm.

8 Conclusion and perspectives

The aim of this study was to evaluate the computational capabilities of the currently available GPUs when employed for color image processing. The use of such specialized 3D processors requires defining the types of algorithms that can be implemented. Among these algorithms we have chosen five cases of study.

We were able to show that the GPU gives very good results:
• for per-pixel processing (with a computation power up to 10 times that of the tested processors);
• for iterative processing handled entirely by the graphics card (without any access to the computer memory).

The results obtained for processing based on neighborhood computations allowed us to establish that accessing many pixels of a texture in a fragment program reduces the performance. For complex algorithms, like the orientation computation of color clouds, there is no significant gain in comparison with the CPU. The CPU can even perform better in the case of complex algorithms involving large memory transfers.

Of course these results remain linked to the GPU used and its software layer (drivers). In fact, we have noted over the last months a very significant increase in the performance of the GPU with the use of new optimized drivers.
We are currently working on “hybrid” processing. These hybrid methods allow us to use the CPU and the GPU simultaneously for video stream processing. This method can be applied to a single image by a procedure of “tile processing”, where one image is decomposed into several images that can be processed as a video stream.

These new graphics processors will change the design of color image processing algorithms. It will be necessary to integrate a new notion of heterogeneous parallelism into our algorithms. The GPU will make it possible to efficiently perform processing requiring heavy mathematical and vectorial calculations, while the CPU will be used for complex analysis.

References

[1] R. Yang and G. Welch, “Fast Image Segmentation and Smoothing Using Commodity Graphics Hardware”, Journal of graphics tools, to appear, special issue on "Hardware-Accelerated Rendering Techniques", 2003.
[2] J. Krüger and R. Westermann, “Linear Algebra Operators for GPU Implementation of Numerical Algorithms”, SIGGRAPH, to appear, 2003.
[3] J. Bolz, I. Farmer, E. Grinspun and P. Schröder, “Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid”, SIGGRAPH, to appear, 2003.
[4] B. Mark, R. S. Glanville, K. Akeley and M. J. Kilgard, “Cg: A system for programming graphics hardware in a C-like language”, SIGGRAPH, to appear, 2003.
[5] nVidia, “C for Graphic”, www.nvidia.com, 2002.
[6] CIE, “Parametric effects in colour-difference evaluation”, Bureau Central de la CIE, 101, 1993.
[7] D. F. Rogers, “Procedural Elements for Computer Graphics”, McGraw-Hill, 1985.
[8] D. Freedman, R. Pisani and R. Purves, “Statistics”, W. W. Norton & Company, 1978.
[9] Tian-yuan Shih, “The reversibility of six geometric color spaces”, Photogrammetric Engineering and Remote Sensing, Volume 61(10), pp 1223-1232, October 1995.
[10] P. Colantoni and A. Trémeau, “3D Visualization of color data to analyze color images”, Proceedings of PICS Conference, pp 500-505, Rochester USA, May 2003.
[11] D. Tschumperlé and R. Deriche, “Constrained and unconstrained PDEs for vector image restoration”, Scandinavian Conference on Image Analysis, Bergen, Norway, June 2001.
[12] G. Sapiro and D.L. Ringach, “Anisotropic diffusion of multivalued images with applications to color filtering”, IEEE Transactions on Image Processing, Volume 5(11), pp 1582-1585, 1996.
