Fast and Accurate Color Image Processing
Consequently, most image processing algorithms can be implemented using 3D cards, such as filtering and the major part of segmentation methods, including those based on thresholding and multiresolution approaches (e.g. those using splitting and merging processes). Likewise, most mathematical morphology methods can also be adapted without any difficulty. Our field of interest concerns only color image processing algorithms.

5 Adapted algorithms - fragment processor

Conscious of the fact that it is difficult or impossible to implement all types of image processing algorithms on a GPU, we restrict the study to showing the ability of the GPU to perform algorithms which differ in four aspects:
• the amount of complex mathematical function computations;
• the amount of vectorial computations: the GPU strength is exploited here mainly for arithmetic computations on color vectors;
• the use of alternative (conditional) structures;
• the amount of memory accesses, which is a very important point for all image processing algorithms based on neighborhood computations.
While some of the proposed algorithms focus on one of these aspects, others combine them to present more realistic and common situations. In the following sections we present these algorithms and their features, which are summarized in table 1.

[...] to the same processing:
• GPU implementation
    num.xyz = Ixx.xyz*Iy.xyz*Iy.xyz
              - 2*Ixy.xyz*Ix.xyz*Iy.xyz
              + Iyy.xyz*Ix.xyz*Ix.xyz;
    denom.xyz = Ix.xyz*Ix.xyz + Iy.xyz*Iy.xyz;
    dst.xyz += alphag*(num.xyz/denom.xyz);
    return clamp(dst.xyz, 0, 1);
• CPU implementation
    for (int c = 0; c < 3; c++)
    {
        num = Ixx[c]*Iy[c]*Iy[c]
              - 2.0*Ixy[c]*Ix[c]*Iy[c]
              + Iyy[c]*Ix[c]*Ix[c];
        denom = Ix[c]*Ix[c] + Iy[c]*Iy[c];
        value = dst(i,j)[c] + alphag*num/denom;
        if (value > 1) value = 1;
        else if (value < 0) value = 0;
        dst(i,j)[c] = value;
    }

For this example, the CG compiler generates three times fewer instructions than the C++ compiler.

The five algorithms chosen to conduct our experiments are the following:
• Local mean algorithm: this algorithm belongs to the class of filtering algorithms (see fig. 3). It consists in computing, for each pixel, the mean of its local neighbors. We have tested this algorithm with different masks, i.e. different sizes and kernels, to highlight how the performance of this tool evolves when the amount of memory transfers varies. Figure 2 shows the principle of the GPU implementation used for this processing.
Table 1
              Math   If   Memory   Vectorial
Local mean    N      N    H        H
L*a*b*        H      L    L        M
HSV           L      H    L        N
Local PCA     M      M    M        L
Diffusion     M      N    H        M
[...] L*a*b* conversion expressions, passing from RGB to HSV requires several “if” tests. Consequently, this kind of transform allows us to test the behavior of “branching” algorithms.
Figure 7: Local PCA visualization

[...] algorithms are mostly designed for grey-level images, and only a few methods exist for multichannel/color images. For our experiment, we have implemented the anisotropic diffusion algorithm developed by Sapiro and Ringach [12] (see fig. 9). This algorithm is of interest since it uses an iterative process [11] carried out entirely by the graphic card thanks to two pbuffers (see fig. 8).

Figure 8: Anisotropic diffusion - GPU implementation

Figure 9: Anisotropic diffusion result (100 iterations)

6 Test Conditions

3D cards based on the NV30 graphic processor are available in two versions that differ in their GPU and memory frequencies. We have carried out our tests with a standard NV30 card with the following frequencies: 400 MHz for the GPU core and 800 MHz for its 128 MB of memory (DDR2 SDRAM). Our fragment programs have been implemented and compiled using version 1.1 of CG [4, 5] and have been executed with OpenGL 1.4 and the latest driver available: nVIDIA™ Detonator FX 44.03.

We have also tested our algorithms on two processors, the AMD Athlon 1900MP and the Intel P4 2GHz, and compared three compilers:
• Microsoft C++ Compiler V12.0 (included in VC++ 6.0 Enterprise) and Intel C++ Compiler V7.1, both under Windows XP Professional;
• GNU/GCC 3.3 under GNU/Linux Debian.
We have used very aggressive optimization options in order to generate the fastest binaries. No assembler code has been integrated in our source code.

7 Results

In this section, we present the results obtained from the different algorithms tested². For each studied case, the average execution time and the performance are expressed in number of processed images per second³.

² For size reasons, the full set of results is available online at the following address: http://www.couleur.org/articles/VMV2003/
³ which, we think, is a clearer way of showing the results.

It is also worth noting that, in all of our implementations, the computations were done using 32-bit floating-point numbers, whatever the initial format of the data.

Each test was conducted using three different image sizes, 256×256, 512×512 and 1024×1024, in order to analyze how the processing time evolves with the image size. For all tested algorithms, this evolution was expected to be linear (in the number of pixels), since each algorithm has the same linear algorithmic complexity.

It is important to note that the numerical results of the CPU and the GPU differ by a negligible amount, which is, we suppose, due to differences in the implementation of the mathematical functions.
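As a small worked example of the linear-scaling expectation stated in this section, the sketch below predicts the throughput at another image size from one measurement. It assumes that the processing time is proportional to the number of pixels; the function name `predictedImagesPerSecond` and the rate of 160 images per second used in the note are hypothetical and are not measurements from this paper.

```cpp
// Predicted throughput (images per second) at targetSide x targetSide,
// given a rate measured at measuredSide x measuredSide.
// Assumption: processing time is proportional to the number of pixels.
double predictedImagesPerSecond(double measuredRate, int measuredSide, int targetSide) {
    const double measuredPixels = static_cast<double>(measuredSide) * measuredSide;
    const double targetPixels = static_cast<double>(targetSide) * targetSide;
    return measuredRate * measuredPixels / targetPixels;
}
```

Under this assumption, a hypothetical 160 images per second at 256×256 would correspond to 40 images per second at 512×512 and 10 at 1024×1024, since each doubling of the side length quadruples the pixel count. Measured rates that fall well below such a prediction would point to a cost that does not scale linearly.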
7.1 Color data transfer

Before studying the performance of the GPU when processing data already located in the graphic card memory, it seems essential to first consider the data transfer from the central memory to the graphic card memory, which involves the AGP bus (AGP 4x in our case, i.e. about 1 GB/s).

7.2 Local mean filtering

Graphs 12, 13 and 14 give the computation times of the local mean filtering using square neighborhoods of size 3 × 3, 7 × 7 and 11 × 11. The fragment programs generated by CG in these three cases consist of 27, 147 and 343 instructions respectively. Graph 12 clearly shows that for small neighborhood sizes, the GPU performs better than the CPU. But for larger neighborhoods (as illustrated in graphs 13 and 14), the difference is no longer obvious and the CPU can even be the better option.

The same results can therefore be expected for all algorithms based on a neighborhood approach. Consequently, it seems that our GPU is not really efficient at fetching several texture pixels in a fragment program.
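To give a sense of how quickly the number of texture reads grows with the neighborhood size, the following sketch counts the reads needed for one image. It assumes one read per neighbor and per output pixel, with no reuse between pixels; the function name `neighborhoodReads` is hypothetical.

```cpp
// Number of neighbor reads needed to filter one side x side image
// with a kernel x kernel square neighborhood.
// Assumption: one read per neighbor per output pixel, no caching or reuse.
long long neighborhoodReads(int side, int kernel) {
    return static_cast<long long>(side) * side * kernel * kernel;
}
```

Per output pixel this gives 9, 49 and 121 reads for the 3 × 3, 7 × 7 and 11 × 11 neighborhoods. For a 1024×1024 image the 11 × 11 case amounts to about 127 million reads, against about 9.4 million for 3 × 3.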
Figure 12: Mean 3 × 3
(bar labels: UC/UC, Half/Half, Float/Float, Float/UC, Half/UC, VC++ Athlon, IC Athlon, GCC Athlon, VC++ P4, IC P4)

[Figure: bars for GPU 400/800MHz, 400/900MHz, 450/800MHz and 450/900MHz, all UC-Float; caption truncated: "... memory frequencies"]
7.7 Results analysis
We are currently working on “hybrid” processing. These hybrid methods allow us to use the CPU and the GPU simultaneously for video stream processing. This method can also be applied to a single image through “tile processing”, where one image is decomposed into several images that can be processed as a video stream.

These new graphic processors will change the design of color image processing algorithms. It will be necessary to integrate a new notion of heterogeneous parallelism in our algorithms. The GPU will efficiently carry out processing that requires heavy mathematical and vectorial calculations, while the CPU will be used for complex analysis.

References

[...] Proceedings of PICS Conference, pp 500-505, Rochester USA, May 2003.
[11] D. Tschumperlé and R. Deriche, “Constrained and unconstrained PDEs for vector image restoration”, Scandinavian Conference on Image Analysis, Bergen, Norway, June 2001.
[12] G. Sapiro and D.L. Ringach, “Anisotropic diffusion of multivalued images with applications to color filtering”, IEEE Transactions on Image Processing, Volume 5(11), pp 1582-1585, 1996.