
Optimized parallel implementation of face detection based on GPU component

Marwa Chouchene a,*, Fatma Ezahra Sayadi a, Haythem Bahri a, Julien Dubois b, Johel Miteran b, Mohamed Atri a

a Laboratory of Electronics and Microelectronics (ElE), Faculty of Sciences of Monastir, Tunisia
b Laboratory of Electronics, Informatics and Image (LE2I), Burgundy University, France

Article info

Article history: Available online 22 May 2015

Keywords: GPU, Parallel computing, Face detection, Viola and Jones algorithm, AdaBoost, WaldBoost, CUDA optimization

Abstract

Face detection is an important aspect of various domains such as biometrics, video surveillance and human-computer interaction. A generic face processing system generally includes a face detection or recognition step, as well as tracking and rendering phases. In this paper, we develop a real-time and robust face detection implementation based on a GPU component. Face detection is performed by adapting the Viola and Jones algorithm. We have designed and developed several optimized parallel implementations of these algorithms on graphics processors (GPU) using the CUDA (Compute Unified Device Architecture) description.

First, we implemented the Viola and Jones algorithm as a basic CPU version. The basic application was then widened to a GPU version using CUDA technology, freeing the CPU to perform other tasks. Then, the face detection algorithm was optimized for the GPU using a grid topology and shared memory. These programs are compared and the results are presented. Finally, to improve the quality of face detection, a second proposition was realized through the implementation of the WaldBoost algorithm.

(c) 2015 Elsevier B.V. All rights reserved.

1. Introduction

The analysis of technology evolution over the last decade, considering the number of processor cores on the same chip as well as the frequency improvement, clearly indicates that parallel computing is a serious candidate for future image processing implementations. Current, and probably future, microprocessor development efforts seem to focus on adding cores rather than on increasing single-thread performance.

The main processor in the Sony PlayStation 3 provides one example of this trend. Indeed, this heterogeneous nine-core Cell broadband engine has attracted substantial interest from the scientific computing community. Similarly, the Graphics Processing Unit (GPU), which proposes a highly parallel architecture, is rapidly gaining maturity as a powerful engine for computationally demanding applications. The GPU's performance and potential offer a great deal of promise for future computing systems; nevertheless, the architecture and programming model of the GPU are significantly different from those of most other commodity single-chip processors.

The reasons for such enthusiasm for these processors are numerous. Indeed, the demand for realism in rendered images requires continuously increasing computing power, which naturally pushed the industry to increase the physical capacity of graphics cards, in particular the number of parallel processors they contain.

Containing up to 512 CUDA cores (Fermi architecture), GPUs are designed to run up to several thousand threads. For this reason, the GPU can be seen as a supercomputer stripped of complex control structure, rather than as a multi-core CPU, which handles only a few threads simultaneously.

The achievement and success do not stop there: the real revolution of this product came in 2006, when the manufacturer NVIDIA proposed a language dedicated to GPGPU (General Purpose processing on Graphics Processing Units) named CUDA (Compute Unified Device Architecture). Its specificity is to unify all existing processors in the GPU, so that two different processors can handle the same task. This language enables the user to work at several levels of refinement in the same system description, by using the functions commonly available in C language libraries and by supporting specific CUDA constructs that refer to functions optimized for the GPU's architecture.

* Corresponding author.
E-mail addresses: ch.marwa.84@gmail.com (M. Chouchene), sayadi_fatma@yahoo.fr (F.E. Sayadi), bahri.haythem@hotmail.com (H. Bahri), julien.dubois@u-bourgogne.fr (J. Dubois), miteranj@u-bourgogne.fr (J. Miteran), mohamed.atri@fsm.rnu.tn (M. Atri).
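As a brief illustration of this programming style (a minimal sketch of our own, not code taken from the paper's sources; the kernel name and sizes are arbitrary), a CUDA kernel is declared with the __global__ qualifier and launched from C code with the <<<grid, block>>> syntax:

#include <cuda_runtime.h>

/* Minimal CUDA kernel: each thread scales one array element. */
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));   /* device allocation */
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);  /* grid of 256-thread blocks */
    cudaDeviceSynchronize();                           /* wait for the kernel */
    cudaFree(d_data);
    return 0;
}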


Conscious of the available GPU processing power that is frequently underutilized, this work aims to optimize some common image processing operations for the GPU architecture. Hence, our goal is to implement efficiently on this kind of architecture the detection of moving objects in a video sequence, and in particular face detection.

This work is organized as follows. First of all, general parallel computation is presented: the evolution of graphics processors is outlined along with a presentation of the CUDA environment. Next, we discuss recent advances and state-of-the-art implementations of face detection algorithms based on the framework originally described by Viola and Jones. Thereafter, a second implementation using a more recent classifier, the WaldBoost algorithm, is presented. The different steps of the face detection implementation on the GPU are detailed and the corresponding results are given. Finally, we close the paper with the conclusion.

2. Graphics processors

Graphics processors (GPUs), which currently operate in parallel, may take part in this revolution if they extend their architecture to support the execution of generic code that would otherwise run on a CPU. In fact, the GPU is a massively parallel unit containing several hundred computing cores, quite different from a conventional multi-core processor (Fig. 1).

The use of the GPU for scientific computing is not new, but the arrival of the CUDA language in 2006, together with the strong support of the manufacturer NVIDIA for scientific computing, enabled a large increase in interest and experimentation with computation on graphics cards. However, the choice of graphics accelerators is not motivated solely by the support of the manufacturer or by the CUDA language. The main reasons are a high peak floating-point performance and a higher memory bandwidth. Both of these arguments apparently promise an acceleration for any algorithm, given the superiority in memory speed and computation.

To reach peak performance, the scientific application must be able to express massive parallelism of hundreds or thousands of lightweight threads. Assuming this is possible, each thread must also have relatively steady access to memory.

Fig. 1 illustrates the general architecture of the GPU. It is composed of streaming multiprocessors (SM), each containing a number of streaming processors (SP), or processor cores. The SM also offers special functional units (SFU) that execute more complex floating-point operations, such as reciprocal, sine, cosine and square root, with low latency. The SM contains other resources such as shared memory and registers. A group of SMs forms a thread processing cluster (TPC), which also contains other resources (caches and texture units) shared between the SMs.

Fig. 1. GPU architecture composed of streaming multiprocessors (each SM groups SPs, SFUs, a register file and shared memory; SMs are grouped into TPCs connected through an interconnection network to the memory caches).

2.1. Why GPU: motivations

Hardware accelerators (currently graphics processing units) are an important component in many existing high-performance computing solutions [1]. Their growth in variety and usage is expected to skyrocket [2] for many reasons. First, GPUs offer impressive energy efficiency [3]. Second, when properly programmed, they yield impressive speedups by allowing programmers to model their computation around many fine-grained threads whose execution can be rapidly switched during memory stalls.

The current motivations for using graphics cards as computing processors in the field of research and modeling can be explained in several ways.

In recent years, CPUs have started to show their technological limits in terms of architecture and speed. The CPU has moved toward multi-core architectures, which still allows it to provide increasingly high computing power. But this architecture has a limit related to the fairly long latency when transferring data between the memory and the microprocessor. In other words, the bandwidth, i.e., the amount of information transferred per second, is not sufficient and is a limiting factor for CPU performance. The raw computing power offered by the GPU has in recent years far exceeded that of the most powerful CPUs: from 2003 onward, the progress of NVIDIA graphics cards has outpaced the evolution of CPUs (in terms of GB/s).

The sharp increase in GPU utilization is largely due to its highly specialized, parallel architecture optimized for graphics operations. In addition, the possible consolidation of GPUs into computing farms (clusters) further multiplies this computing power.

Furthermore, the development of GPGPU, which allows graphics cards to be used for intensive parallel computing and relieves the CPU of these calculations, now provides many digital tools and makes the GPU usable in a more accessible way. Moreover, NVIDIA has developed a programming environment called CUDA (Compute Unified Device Architecture), opening GPU supercomputing to a wide audience.

Given the specific benefits of the graphics card, our work relies on these new computing architectures and on the CUDA [2] approach to GPU programming.

3. Application's background

Real-time object detection is an important task for many applications. One very robust and general approach to this task is to use statistical classifiers that classify individual locations of the input image and make a binary decision: the location contains the object or it does not.

Viola and Jones [4] presented a very successful face detector, which combines boosting, Haar low-level features computed on an integral image, and an attentional cascade of classifiers. Their design was further developed by many researchers, most importantly to accelerate the detection time.

Some hardware solutions are able to accelerate face detection to real-time, but hardly any software implementation does. One of the best ways to meet real-time video processing requirements is to take advantage of the parallelization of the algorithm.
The massively parallel architecture of current GPUs is a platform suitable for the acceleration of mathematical computations in the field of digital image analysis.

The work of Krpec and Němec [5] described a GPU-accelerated face detection implementation using CUDA. They compared their implementation of the Viola and Jones algorithm to a basic one-thread CPU version. From their test results, it is convincing that GPU detection is usable, with reasonable processing times compared to the CPU variants.

Some works have also been written about accelerating object classification, with good results. As an illustration, Gao and Lu [6] reached a detection rate of 37 frames/s for 1 classifier and 98 frames/s for 16 classifiers at a 256 × 192 image resolution. Kong and Deng [7] proposed a GPU-based implementation of a face detection system that enables 48 faces to be detected with a 197 ms latency. Herout et al. [8] presented a GPU-based face detector based on local rank patterns as an alternative to the commonly used Haar wavelets [9]. Hefenbrock et al. [10] described another stream-based multi-GPU implementation on 4 cards. Finally, Sharma et al. [11] presented a working CUDA implementation that processed a resolution of 1280 × 960 pixels. They proposed a parallel integral image computation that performs both row-wise and column-wise prefix sums, fetching input data from the off-chip texture memory cached in each SM.

In the present work we propose a parallel algorithm for evaluating Haar filters that fully exploits the micro-architecture of the NVIDIA GeForce 310M GPU while freeing the CPU to perform other tasks. The results obtained on the GPU were further improved by using different optimization methods: the first exploits the shared memory, while the second studies the variation of the block size used.

To evaluate the performance of the proposed algorithm with CUDA, the following development environment was used:

(1) Intel(R) Core(TM) i5 CPU, 2.6 GHz, with 4 GB memory, 35 W,
(2) NVIDIA GeForce 310M with 1787 MB of available graphics memory; it belongs to the Tesla architecture, supports 16 CUDA cores and consumes 14 W,
(3) Microsoft Windows Se7en Titan,
(4) Microsoft Visual Studio 2008,
(5) CUDA Toolkit and SDK 2.3,
(6) NVIDIA driver for Microsoft Windows with CUDA support (258.96).

4. Implementation of face detection

Face recognition involves recognizing people by their intrinsic facial characteristics. Compared to other biometrics, such as fingerprint, DNA or voice, face recognition is more natural, non-intrusive and can be used without the cooperation of the subject. Since the first automatic system of Kanade, growing attention has been given to face recognition [12]. Thanks to powerful computers and recent advances in pattern recognition, face recognition systems can now perform in real-time and achieve satisfying performance under controlled conditions, leading to many potential applications.

Face recognition is a major area of research within image and video processing. Since most techniques assume face images normalized in terms of scale and rotation, their performance depends heavily upon the accuracy of the detected face position within the image. This makes face detection a crucial step in the process of face recognition.

In this part we are interested in face detection, particularly in the algorithm based on the work of Viola and Jones [4]. The first reason for selecting this face detection algorithm is the way it executes: through the use of detection windows and Haar features, it offers several ways to parallelize the detection step. The next reason is that there are diverse face detection algorithms based on Viola and Jones. Hence the Viola and Jones algorithm seems to be a good application with which to test the different CUDA optimization methods.

In recent years, face recognition has attracted much attention and its research has rapidly expanded, driven not only by engineers but also by neuroscientists, since it has many potential applications in computer vision, communication and automatic access control systems. In particular, face detection is an important part of face recognition, as the first step of automatic face recognition. However, face detection is not straightforward because images show many variations in appearance, such as pose (frontal, non-frontal), occlusion, image orientation, illumination conditions and facial expression.

The purpose of the face detection module is to determine whether there are any faces in an image or video sequence and, if so, to return their position and scale. Face detection is an important area of research in computer vision, because it serves as a necessary first step for any face processing system, such as face recognition, face tracking or expression analysis. Most of these techniques assume, in general, that the face region has been perfectly localized. Therefore, their performance depends significantly on the accuracy of the face detection step.

Face detection is thus the first step in facial recognition, and its effectiveness has a direct influence on the performance of the face recognition system. There are several methods for detecting faces: some use the color of the skin, the shape of the head or the facial appearance, while others combine several of these characteristics.

4.1. Conception schema of the implemented method

Our implementation of the face detection algorithm is organized according to the steps given in Fig. 2. The first step is fast and robust face detection in an image, based on adaptations of the AdaBoost algorithm using a Haar classifier cascade [13]. This part will be detailed later.

Then we proceed to the program analysis: we analyze the performance and carry out profiling using the Visual C++ profiler. Profiling is a method for measuring the execution time of a function or a procedure. It provides statistics and precise measurements, where the percentage of execution time is expressed relative to the time of the main function: the exclusive time is the time spent in the function itself, while the inclusive time is the time spent in the function and its children.

The measurements provided by the profiler make it possible to determine the most time-critical parts of the program. These parts are then optimized with a new parallel design: either by adding optimized libraries or by using parallel languages and parallel libraries. We selected the CUDA language for the optimization of the face detection algorithm because of its ability to describe parallelism on NVIDIA GPU components.

In fact, we propose a face detection algorithm that is able to handle a wide range of variations in static color images, based on the work of Viola and Jones.

4.2. Complexity analysis on CPU

Automatic face location is a very important task which constitutes the first step of a large range of applications: face recognition, face retrieval by similarity, face tracking, etc.

For the step of detecting and locating faces, we propose a robust and fast approach based on density images and AdaBoost, which combines simple descriptors (Haar features) into a strong classifier.

The concept of Boosting was introduced in 1995 by Freund [14].
Fig. 2. General schema of the implemented method (face detection, profiling, parallelization, optimization, profiling): analysis of the problem complexity (fast and robust face detection); analysis of the available program (performance analysis, profiling); design of the parallelization (optimized libraries, parallel objects, parallel languages, parallel libraries); implementation; optimization; profiling.

The Boosting algorithm uses weak hypotheses and prior knowledge to build a strong hypothesis. In 1996, Freund and Schapire [15] proposed the AdaBoost algorithm, which automatically chooses the weak hypotheses with appropriate weights. In 2001, Viola and Jones [4] applied the AdaBoost algorithm to face detection. They used simple descriptors (Haar features), the integral image as the method of calculating the value of the descriptors, and a cascade of classifiers (a minimal sketch of the integral image computation is given below).

In our work we applied the flowchart of [16] (Fig. 3). An overview of our face detection algorithm is depicted in Fig. 3, which contains several major modules: "Read image", "Downloading Cascade Classifier", "Display of results", and "Detection", which is the main module.

The implementation of our algorithm in C++ on the CPU requires distributing the program into a set of procedures following the steps already described (Fig. 3).

After generating the source code successfully, the main inputs are embedded to see the result of the detection. These are images of different sizes containing one or more faces. Before compiling the source code, this series is introduced into the main project file. Once the source code is updated and saved, it goes to compilation. We subsequently detail the different images used to test the effectiveness of our algorithm. We apply our algorithm to the images and obtain the results of Fig. 4.
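To make the role of the integral image concrete, the following minimal CPU sketch (our illustration with hypothetical names, not the authors' code) builds the integral image and then evaluates a two-rectangle Haar feature with a handful of table lookups:

#include <vector>

// Integral image: ii[y][x] holds the sum of all source pixels in [0,x) x [0,y).
// The extra zero row and column simplify border handling.
std::vector<std::vector<long>> integralImage(
    const std::vector<std::vector<unsigned char>>& src, int w, int h)
{
    std::vector<std::vector<long>> ii(h + 1, std::vector<long>(w + 1, 0));
    for (int y = 1; y <= h; ++y)
        for (int x = 1; x <= w; ++x)
            ii[y][x] = src[y-1][x-1] + ii[y-1][x] + ii[y][x-1] - ii[y-1][x-1];
    return ii;
}

// Sum of any rectangle in constant time: four lookups, whatever its size.
long rectSum(const std::vector<std::vector<long>>& ii, int x, int y, int w, int h)
{
    return ii[y+h][x+w] - ii[y][x+w] - ii[y+h][x] + ii[y][x];
}

// A two-rectangle Haar feature is the difference between the sums of its halves.
long haarTwoRect(const std::vector<std::vector<long>>& ii, int x, int y, int w, int h)
{
    return rectSum(ii, x, y, w/2, h) - rectSum(ii, x + w/2, y, w/2, h);
}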

Fig. 3. Organization of the implemented algorithm: variable declaration (size, memory); read image; downloading cascade classifier; image transformation; computation of the integral image; staging the image for the cascade classifier; run and evaluation of the cascade classifier; save result; save image; display of results.


Fig. 4. Results after the execution of the application.

Additional images from [17] were subjected to the implementation to ensure that the results were similar for faces of varying sizes and orientations. Faces are detected and surrounded by a square; the developed algorithm successfully performed the face detection.

After these tests, we can speak about the effectiveness of the algorithm for face detection. In the next section, we measure the execution time of the main functions.

An evaluation of the execution time of this algorithm (face detection based on density images and AdaBoost, which combines simple descriptors into a strong classifier) is given in Table 1. Note that the execution time of the "Detection" function is much larger than that of the other functions of the program. Indeed, the time required by the "Detection" function represents 65% of the global execution time, which makes optimization worth considering for this part of the algorithm.

In Visual Studio, the profiling tools for Windows applications allow us to measure, evaluate and target performance issues in our code. The profiler collects timing information for applications written in Visual C++, using a sampling method that records the processor call stack at regular intervals. The views of the profiling report display graphical and tabular representations of detailed, rich context about the performance of the application, and help to navigate the execution paths of the code and to evaluate the cost of our functions in order to find the best opportunities for optimization. Profiling information collected from the beginning to the end of a profiling run is shown in Table 2.

As shown in the flowchart in Fig. 3, the "Detection" function is a collection of sub-functions, which explains these results. That is why we detail these functions, giving the exact execution time of each sub-function, in Table 3. Table 3 shows the distribution; we note that the most time-critical sub-function is the "Run and evaluation cascade classifier" function.

The same approach applied to the main program is then applied to the "Detection" function. This function contains more procedures, which explains its rather large share of the time. The percentage of the execution time of the procedures relative to the time of the main program is shown in Table 4. We can see that 66.95% of the total time is spent in the run and evaluation procedure.

Up to now, we have implemented the face detection algorithm in C on the CPU. We have demonstrated the effectiveness of this application for face detection. We have also determined the execution time of each processing step, in order to improve the result by using different optimization tools.

In general, there are different ways to speed up a numerical calculation. One solution is to increase the clock frequency of the processor, which is an expensive replacement; moreover, the intensive increase of processor frequency seems to have reached its limits [8,7].

A second line of investigation is to enable the simultaneous execution of multiple instructions. Here we find parallel computing, which uses more electronic components (multi-core processors, multi-CPU, GPU, ...), and pipelining, which is parallelization within the same processor.

A final method for accelerating numerical computation consists in improving memory access. Indeed, data transfers are frequently responsible for system limitations. Therefore, memory management should be optimized, especially when using graphics cards.

Table 1
Profiling results: execution time on CPU.

Function                                 Time CPU (s)
Read image                               0.011
Downloading Cascade Classifier           0.036
Detection                                0.11
Display of results                       0.01
Total                                    0.167

Table 2
Time statistics: percentage of inclusive and exclusive time elapsed in the application.

Function name                    %Application       %Application       No. of calls
                                 Inclusive Time     Exclusive Time
Main                             99.7               40.09              1
Read image                       3.28               0.00               1
Downloading Cascade Classifier   10.26              0.37               1
Detection                        43.05              0.00               1
Display results                  3.00               0.21               1

Table 3
Measured execution time of the "Detection" sub-functions.

Sub-function                                  Time CPU (s)
Declaration                                   0.003
Image transformation                          0.01
Computation of integral images                0.012
Staging image for cascade classifier          0.01
Run and evaluation cascade classifier         0.065
Save result                                   0.01
Total (Detection)                             0.11

Table 4
Time statistics for the procedure "Detection".

Function name                                 %Application      %Application
                                              Inclusive Time    Exclusive Time
Computation of integral images                0.11              0.11
Image transformation                          0.21              9.37
Run and evaluation cascade classifier         66.95             24.04
Staging image for cascade classifier          0.55              0.55
Save result                                   0.001             0.001
Frequently, a sequential approach is proposed to implement image processing algorithms, one pixel after another, whereas an efficient parallel implementation can be considered. Obviously, these kinds of implementations should target appropriate platforms. The demands of high-performance computing often lead to hardware solutions for critical problems. The different processor cores of a central unit can be used, but their number is still, nowadays, quite limited.

In this context, we focus on the use of the GPU to improve our processing, using the CUDA programming tool [18].

4.3. Performance analysis on GPU

One cornerstone of our work has been to define our own version of a face detection algorithm dedicated to a GPU implementation, in order to benefit from the potential parallelism of the application. Since data processing on the GPU follows a single-instruction, multiple-data model, the same operations are performed on a set of data in parallel. The algorithms presented in this section have been designed to allow parallel processing. The code for this algorithm was derived from the code of the sequential algorithm by parallelizing the loops (Fig. 5).

The strategy adopted for the graphics processor is to perform all the computation of the critical part (Tables 3 and 4) simultaneously. The computation grid contains threads, and each thread performs a computation of the "Run and evaluation cascade classifier" function. This processing is done for each filter of the cascade; the most critical calculations are invoked with a suitable optimization, and the same cascade filter is called at each processing step.

The grid used in the CUDA computation is two-dimensional, and the global coordinates of the threads in the grid correspond to the coordinates of the processed image. The embedded loops of the classical algorithm (traversal of rows and columns) are replaced by the grid topology [20]. Each thread then only has to carry out the computation of these functions, i.e., the loop bodies of "Run and evaluation cascade classifier". Finally, the different tests are performed. A simplified kernel sketch of this mapping is given below.
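In the sketch that follows (our illustration; the stage count and the stage evaluator are placeholders, since the paper does not list its kernel), each thread takes its window coordinates from the 2D grid and walks through the cascade, so the two loops over rows and columns disappear:

#define NUM_STAGES 22  /* placeholder depth; the trained cascade defines the real value */

/* Placeholder stage evaluator: a real stage sums thresholded Haar feature
   responses and compares the sum with the trained stage threshold. */
__device__ bool evalStage(const long *ii, int pitch, int x, int y, int s)
{
    long d = ii[(y + 8) * pitch + (x + 8)] - ii[y * pitch + x];
    return d > s;  /* dummy criterion, kept only for the structure */
}

/* One thread per candidate window: the grid replaces the row/column loops. */
__global__ void runCascade(const long *ii, int pitch, int imgW, int imgH,
                           int winW, int winH, unsigned char *hits)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  /* window column */
    int y = blockIdx.y * blockDim.y + threadIdx.y;  /* window row    */
    if (x + winW >= imgW || y + winH >= imgH)
        return;

    bool face = true;
    for (int s = 0; s < NUM_STAGES && face; ++s)    /* attentional cascade:     */
        face = evalStage(ii, pitch, x, y, s);       /* most windows exit early  */
    hits[y * imgW + x] = face ? 1 : 0;
}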
The resulting program, mixing C and CUDA, includes a main function (in a C file) which calls the initialization functions, performs the computation and measures the execution time. All C functions that perform the kernel calls are stored in a specific CUDA ".cu" file, which requires the dedicated nvcc compiler. We developed the computational kernels from the initial functions written in C. The CUDA kernels were developed for single-precision calculations.

Data transfers are significantly reduced in this approach to GPU computing. Only the initial data are transferred from the host to the GPU; intermediate data are built directly in the graphics memory (Fig. 6).

We thus propose a GPU implementation, using the NVIDIA CUDA API, to solve the problem of face detection based on density images and AdaBoost, which combines simple descriptors into a strong classifier. The aim of this implementation is to assess the interest of a GPU implementation compared to traditional approaches and optimizations in terms of CPU programming. It is necessary to calculate the theoretical gain of the traditional approaches and then compare it with the gains we can get from the CUDA algorithm.

GPU computing uses the graphics processor in parallel to accelerate tasks and offer maximum performance. The GPU accelerates the slower portions of the code, using CUDA, which gives us the ability to create as many threads as we need for the computation.

Fig. 7 shows the performance, in terms of execution time, obtained by the CPU implementation and the GPU one. The main information to take from Fig. 7 is the processing-time acceleration obtained using the parallel computing resources of the GPU compared to the optimized C++ implementation on the CPU. The CPU/GPU association combines the CPU's efficiency on the sequential parts of the code, while the GPU handles the parallel processing of the regular parts.

4.4. Performance optimization in GPU

The objective of parallel computing is a significant reduction of the computation time of a process, or an increase in the number of operations for a fixed time. Historically, software has been written for sequential treatment, to run on a single machine with a single computing unit. The development of parallel approaches is quite new and opens up many possibilities in terms of hardware architectures but also in terms of programming tools.

The new GPU architectures are increasingly exploited for purposes other than graphics, given the massive parallelization they offer. This parallelization provides computational performance gains, but several factors must be taken into account when developing a CUDA parallel algorithm.

To exploit GPU performance it is necessary, first and foremost, to know well the properties of the hardware architecture and the programming environment of the GPU graphics card. The efficiency of an algorithm implemented on a GPU is closely related to how the GPU resources are used. To optimize the performance of an algorithm on a GPU, it is necessary to maximize the use of the GPU cores (maximize the number of threads running in parallel) and to optimize the use of the different GPU memories, while never exceeding the capabilities of the GPU.

Fig. 5. Parallel CUDA code [19].


Fig. 6. Interconnections between the CPU and the GPU: the host initializes the GPU environment and allocates memory (creating storage space on the device); data are transferred once into global memory; the kernel computing the "Run and evaluation cascade classifier" function is launched (processing the filter cascade with a suitable optimization, the same cascade filter being called at each step); the obtained data are transferred back and the GPU memory is freed.
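A host-side sketch of this flow (our illustration; buffer names are hypothetical and error checking is omitted) shows that only the input image crosses the bus at the start and only the result comes back at the end, with CUDA events used for the timing mentioned above:

#include <cuda_runtime.h>

__global__ void buildIntegral(const unsigned char *img, long *ii, int w, int h); /* hypothetical  */
__global__ void runCascade(const long *ii, int pitch, int imgW, int imgH,
                           int winW, int winH, unsigned char *hits);             /* sketched above */

void detectOnGpu(const unsigned char *hImage, int w, int h, unsigned char *hHits, float *ms)
{
    unsigned char *dImage, *dHits;
    long *dIntegral;
    cudaMalloc((void **)&dImage, w * h);                               /* create storage space */
    cudaMalloc((void **)&dIntegral, (w + 1) * (h + 1) * sizeof(long));
    cudaMalloc((void **)&dHits, w * h);

    cudaMemcpy(dImage, hImage, w * h, cudaMemcpyHostToDevice);         /* single upload        */

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);

    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
    buildIntegral<<<grid, block>>>(dImage, dIntegral, w, h);           /* intermediate data    */
    runCascade<<<grid, block>>>(dIntegral, w + 1, w, h, 24, 24, dHits);/* stay in GPU memory   */

    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(ms, t0, t1);                                  /* kernel time in ms    */

    cudaMemcpy(hHits, dHits, w * h, cudaMemcpyDeviceToHost);           /* single download      */
    cudaFree(dImage); cudaFree(dIntegral); cudaFree(dHits);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
}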

4.4.1. Optimization of the grid topology

CUDA gives us the ability to create as many threads as there are points to process. The computation grid assigns the chosen threads to each calculation instruction. These threads are grouped into blocks; the particularity of threads in the same block is that they share a common memory called shared memory. The programmer determines the block size, which is an important step in the definition of the CUDA computation grid. The total number of threads allocated per block should be determined according to the size of the image and the capacity of the GPU. The number of blocks in a grid should be larger than the number of multiprocessors, so that all multiprocessors have at least one block to execute. This recommendation is subject to resource availability; therefore, the second execution parameter, the number of threads per block, should be determined together with the shared memory usage (a small configuration sketch is given below).

We propose here to study the influence of the block size on the measured execution time; the results presented below reflect the experience we gained in programming this architecture. We define square blocks of size n. Fig. 8 shows the execution time as a function of the block size n.

Fig. 7. Comparison between the CPU and GPU execution times of the "Run and evaluation cascade classifier" function.

As shown in Fig. 8, the lowest execution time is obtained with blocks of 32 × 32 threads for the Lena image and 64 × 64 threads for the Face image when running the "Detection" function on the GPU.
Fig. 8. Influence of the block size on the GPU execution time of the "Detection" function, for the Lena and Face images.

The results show that the execution time initially improves as the number of threads per block increases. However, there is an upper limit to the performance improvement: as shown in Fig. 8, the gain in execution time becomes less visible when the number of threads per block approaches the total image size. The reason for this is linked to the number of bank ports available in the shared memory, as well as to the occupancy of the computing cores; as recommended by the NVIDIA strategy, the larger the blocks we define, the closer we get to maximum occupancy of the GPU processing units.

4.4.2. Optimization with shared memory

Shared memory has much higher bandwidth and much lower latency than local or global memory. To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module.

The optimization we apply uses shared memory as a buffer between the global GPU memory (where the intermediate data are stored) and the registers associated with the computing cores. However, synchronization is then required, which may cost more than what is saved by reducing the number of accesses to global memory (Fig. 9). A simplified tiled kernel illustrating this buffering is sketched below.

We present in Fig. 10 the execution time of the functions "Image transformation" and "Computation of Integral images". We compare the performance of the basic kernels with that of the kernels exploiting shared memory. The performance is clearly improved: the optimized kernel is 2 times faster in this case. This example illustrates that optimization requires understanding the interaction between the algorithm and the hardware.

To justify this performance, we used the NVIDIA profiler, which is provided along with the beta version of CUDA 2.0 and allows one to see the time spent in each kernel. Fig. 11 gives the runtime results for the various kernels implemented on the GPU. The kernels implemented are "Image transformation" and "Computation of Integral images"; the two other bars in the graph correspond to the memory copy operations.

This relative histogram confirms that the kernels were executed exactly as many times as required by the algorithm. The previous results are retrieved in terms of relative occupancy time. It is also possible to view the sequence of kernels and their respective execution times.

To conclude, we present a summary table (Table 5) showing the measured execution time of the "Detection" function implemented on the CPU, on the GPU, and on the GPU optimized by varying the number of threads per block.

The results obtained confirm that the GPU version of the code achieves better performance than the CPU version. This is due to the graphics processor architecture, which sets up a cache system for the management of the global memory. In addition, shared memory is used, which explains the observed performance gain.
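The buffering of Fig. 9 can be sketched as follows (our simplified illustration on a dummy point-wise transform): each block stages a tile of the image from global into shared memory, synchronizes once, and then reads the tile at low latency; consecutive threads touch consecutive addresses, so the accesses fall into distinct banks.

#define TILE 16

__global__ void transformTiled(const unsigned char *src, unsigned char *dst, int w, int h)
{
    __shared__ unsigned char tile[TILE][TILE];      /* one tile per block */

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < w && y < h)
        tile[threadIdx.y][threadIdx.x] = src[y * w + x];   /* single global read */

    __syncthreads();   /* required barrier: it costs time, but every later access */
                       /* hits shared memory instead of global memory             */

    if (x < w && y < h)
        dst[y * w + x] = 255 - tile[threadIdx.y][threadIdx.x];  /* dummy transform */
}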

Fig. 9. Optimization of access to the global memory in the kernel computation: data move from the global GPU memory to the shared memory of a block of threads, and then to the registers of the threads.

Fig. 10. Influence of the shared memory on the GPU execution time of the "Image Transformation" and "Computation of Integral Image" functions in CUDA (µs), with and without shared memory.

Fig. 11. Relative histogram of the kernels: (a) without shared memory; (b) with shared memory.

In parallel computing, the speed-up shows to what extent a parallel algorithm is faster than the corresponding sequential algorithm [21]. Analytically, we define the speed-up as:

Speed up = Sequential execution time / Parallel execution time    (1)

Table 6 shows the speed-up of the CPU and GPU implementation versions of our algorithm versus the size of the image.
Table 5
Compared execution times of the "Detection" function.

                                 CPU      GPU       Optimized GPU
Time "Detection" function (s)    0.110    0.0073    0.0065

Table 6
Execution-time comparison between the CPU- and GPU-based algorithms.

Image size     Time CPU "Detection"    Time GPU "Detection"    Speed-up
               function (s)            function (s)
64 × 64        0.069                   0.0042                  16.42
128 × 128      0.073                   0.0046                  15.86
256 × 256      0.086                   0.0051                  16.86
512 × 512      0.12                    0.0073                  16.43
1024 × 1024    0.15                    0.0089                  16.85

We note that as the image size used in the experiment increases, the speed-up increases (Fig. 12). Our results show that the implementation of our algorithm on the GPU is 16 times faster than the one on the CPU.

Next we study the energy saving of the GPU-accelerated software, as presented in [22]. The comparison is made between a program running on a CPU (Intel Core i5, 35 W) and a GPU-enhanced program running on a GPU (NVIDIA GeForce 310M, 14 W). The energy consumption is reduced from 3.85 J (35 W × 0.11 s) on the CPU to 0.91 J (14 W × 0.065 s) on the GPU.

4.5. Comparison with state of the art

The evaluation of our work is a major problem, especially since there is no common baseline for GPU acceleration available; most previously reported works create their own for evaluation purposes, so it is hard to compare them with our method directly. Krpec and Němec [5] obtained an execution time of 0.25 s for an image of size 1280 × 1024, against an execution time of 0.0089 s obtained by our work for a very close image size of 1024 × 1024. This performance improvement comes mainly from the use of the optimized method, especially the use of the shared memory.

Kong and Deng [7] obtained a speed-up of 14.7 times for an image of size 512 × 512, for an implementation of the full face detection algorithm. In contrast, we work only on the cascade detection function, and we found a speed-up of 16 times for the same image size, so we can conclude that the results are comparable.

Another comparison can be made by studying the number of frames per second (FPS). As shown in Table 6, the FPS achieved for 512 × 512 images is around 136, while it is 13 in [7] for a comparable image and around 8 in [5] for a much bigger image. Moreover, the results reported in [5] and [7] are 4 FPS and 5 FPS respectively, against 112 FPS for the presented results at a very close image size (1280 × 1024 vs 1024 × 1024). Thus the results achieved in this paper are better than those given in the state of the art.

Another comparison has been made for the same function, but this time on another platform. Gao and Lu [6] implemented the cascade function on an FPGA; their results vary between 0.25 s and 0.95 s, while our GPU implementation takes at most 0.0042 s to 0.0089 s.

4.6. Implementation on GPU of the WaldBoost classifier

Having presented in the previous section the GPU performance of a fixed-size linear classifier, AdaBoost, we now focus on the GPU implementation of a more recent boosting algorithm, the WaldBoost classifier [23].

WaldBoost is a combination of AdaBoost and Wald's sequential probability ratio test [24]. Face detection is performed by evaluating the classifier at all positions and scales. The positions with positive classifier responses can be clustered to remove possible multiple detections of the same face (a simplified sketch of this sequential decision rule is given below, after Table 7).

The implementation of WaldBoost face detection is provided by combining many weak classifiers into one strong classifier (a variation of AdaBoost). The implementation presented in this part can be divided into these steps:

1. Loading and representing the classifier data,
2. Image treatment,
3. Face detection,
4. Display of results.

The performance of the implementations is measured and given in Table 7, which contains the total detection time for an image of size 512 × 512.

Table 7
Time comparison between the CPU- and GPU-based WaldBoost algorithm.

                                 CPU      GPU       Speed-up
Time WaldBoost detection (s)     0.115    0.0051    22.54
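The distinguishing point of WaldBoost, the sequential probability ratio test, can be sketched as an early-exit loop (our illustration with hypothetical names; the weights and thresholds come from training, and in detection usually only the rejection branch of the test is kept):

/* Placeholder weak classifier: a real one evaluates a trained Haar feature. */
__device__ float weakResponse(const long *ii, int pitch, int x, int y, int t)
{
    long d = ii[(y + 4) * pitch + (x + 4)] - ii[y * pitch + x];
    return d > t ? 1.0f : -1.0f;
}

/* Per-window WaldBoost evaluation: after each weak classifier, the running
   response is compared against the trained rejection threshold of Wald's
   sequential test, so most background windows exit after a few features. */
__device__ int evalWaldBoost(const long *ii, int pitch, int x, int y,
                             const float *alpha,   /* weak-classifier weights   */
                             const float *theta,   /* per-step rejection bounds */
                             int T)
{
    float H = 0.0f;
    for (int t = 0; t < T; ++t) {
        H += alpha[t] * weakResponse(ii, pitch, x, y, t);
        if (H < theta[t])
            return 0;          /* early rejection by the sequential test */
    }
    return H > 0.0f ? 1 : 0;   /* final decision after all T classifiers */
}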

Fig. 12. Compared execution times of the CPU and GPU "Detection" function for different image sizes (64 × 64 to 1024 × 1024).



As illustrated in Table 7, the CUDA implementation outperforms the CPU by 22 times for this image of size 512 × 512.

Some existing works in the literature have used a modern classifier (WaldBoost) for GPU implementation, such as Herout et al. [8]. They obtained a speed-up of four to eight times per frame for the implementation on high-resolution video. For now it is difficult to compare these works with our method, because of the variation of the evaluation criteria. A comparison will be the subject of future work integrating many optimizations, such as the use of the different GPU memory spaces (shared, texture, ...). Then we will move to a real-time implementation for high-resolution videos.

5. Conclusion

We have presented in this paper a GPU implementation of the Viola–Jones face detection algorithm that clearly outperforms the CPU implementation.

Due to its C-based interface, programming the GPU using CUDA is much easier for developers without a graphics background than using OpenGL: parallelizing an algorithm using CUDA does not require mapping the algorithm to graphics concepts. However, a complete understanding of the memory and programming model is needed to achieve maximum efficiency on the GPU. Based on our experience with CUDA, intelligent use of the memory hierarchy (global memory, shared memory, registers, texture cache) and ensuring high processor occupancy go a long way toward achieving good speed-ups.

From the test results, it is convincing that GPU detection is usable, with reasonable processing times compared to the CPU variants. The GPU detection is on average 16 times faster than the CPU detection.

The performance is not as high as could be expected from the computational power of the GPU relative to the CPU. This is especially because of the nature of the detection algorithm, which does not fully match the requirements of CUDA and of the GPU environment in general. Further algorithmic adjustments may be sought to suit the detection algorithms better to the execution platform.

Finally, the use of the WaldBoost classifier for face detection improves the detection quality as well as the speed-up, which can exceed 20 times.

References

[1] D. Kirk, W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann Publishers Inc., 2010.
[2] S. Borkar, A. Chien, The future of microprocessors, Commun. ACM 54 (5) (2011) 67–77.
[3] W. Dally, Effective computer architecture research in academy and industry, in: International Conference on Supercomputing, Japan, 2010.
[4] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2001.
[5] J. Krpec, M. Němec, Face detection CUDA accelerating, in: ACHI 2012: The Fifth International Conference on Advances in Computer–Human Interactions, 2012, pp. 155–160.
[6] C. Gao, S.L. Lu, Novel FPGA based Haar classifier face detection algorithm acceleration, in: Proceedings of the International Conference on Field Programmable Logic and Applications, 2008, pp. 373–378.
[7] J. Kong, Y. Deng, GPU accelerated face detection, in: International Conference on Intelligent Control and Information Processing, 2010, pp. 584–588.
[8] A. Herout, R. Josth, R. Juranek, J. Havel, M. Hradis, P. Zemcik, Real-time object detection on CUDA, J. Real-Time Image Process. (2010) 1–12.
[9] M. Hradis, A. Herout, P. Zemcik, Local rank patterns: novel features for rapid object detection, Comput. Vis. Graph. (2009) 239–248.
[10] D. Hefenbrock, J. Oberg, N. Thanh, R. Kastner, S.B. Baden, Accelerating Viola–Jones face detection to FPGA-level using GPUs, in: 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2010, pp. 11–18.
[11] B. Sharma, R. Thota, N. Vydyanathan, A. Kale, Towards a robust, real-time face processing system using CUDA-enabled GPUs, in: International Conference on High Performance Computing, 2009, pp. 368–377.
[12] T. Kanade, Picture processing by computer complex and recognition of human faces, PhD thesis, Kyoto University, 1973.
[13] C. Marwa, B. Haythem, E. Fatma, A. Mohamed, T. Rached, Software, hardware for face detection, in: International Conference on Control, Engineering & Information Technology (CEIT'13), Proceedings Engineering & Technology, vol. 1, 2013.
[14] Y. Freund, Boosting a weak learning algorithm by majority, Inf. Comput. 121 (2) (1995) 256–285.
[15] Y. Freund, R.E. Schapire, Experiments with a new boosting algorithm, in: Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy, 1996, pp. 148–156.
[16] F. Comaschi, Face Detection on Embedded Systems, Eindhoven University of Technology, 2013.
[17] YUV Video Sequences, <http://trace.eas.asu.edu/yuv/>.
[18] NVIDIA, NVIDIA CUDA Compute Unified Device Architecture – Programming Guide, NVIDIA, 2012.
[19] NVIDIA, GPU Computing with NVIDIA's Kepler Architecture, 2013, p. 11.
[20] J.P. Harvey, GPU acceleration of object classification algorithms using NVIDIA CUDA, Master's thesis, Rochester Institute of Technology, Rochester, NY, 2009.
[21] A. Dhraief, R. Issaoui, A. Belghith, Parallel computing the Longest Common Subsequence (LCS) on GPUs: efficiency and language suitability, in: The First International Conference on Advanced Communication and Computation, INFOCOMP, 2011.
[22] J. Cheng Wu, L. Chen, T. Chiueh, Design of a real-time software-based GPS baseband receiver using GPU acceleration, in: International Symposium on VLSI Design, Automation, and Test (VLSI-DAT), 2012.
[23] J. Sochman, J. Matas, WaldBoost – learning for time constrained sequential detection, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, pp. 150–156.
[24] A. Wald, Sequential Analysis, John Wiley and Sons Inc., 1947.

Marwa Chouchene received her M.S. degree in Electronic Materials and Devices from the Faculty of Sciences of Monastir, Tunisia, in 2010. She is currently a PhD student at the Laboratory of Electronics and Micro-electronics of the University of Monastir. Her research interests include image and video processing, motion tracking and pattern recognition, multimedia applications, video surveillance, graphics processors and the CUDA language.

Fatma Sayadi received the PhD degree in Micro-electronics from the Faculty of Sciences of Monastir, Tunisia, in collaboration with the LESTER Laboratory, University of South Brittany, Lorient, France, in 2006. She is currently a member of the Laboratory of Electronics & Micro-electronics. Her research interests include image and video processing, motion tracking and pattern recognition, circuit and system design, and graphics processors.

Haythem Bahri received a Master degree in Micro-electronics and Nano-electronics from the University of Monastir, Tunisia, in 2012. He is currently a PhD student at the Laboratory of Electronics and Micro-electronics of the University of Monastir. His research interests are focused on image and video processing on graphics processors.
Julien Dubois has been an associate professor at the University of Burgundy since 2003. He is a member of the Le2i Laboratory (UMR CNRS 6063). His research interests include real-time implementation, smart cameras, hardware design based on data-flow modeling, motion estimation and image compression. In 2001, he received a PhD in Electronics from the University Jean Monnet of Saint-Etienne (France) and joined EPFL in Lausanne (Switzerland) as a project leader to develop an FPGA-based co-processor for a new CMOS camera.

Johel Miteran received the PhD degree in image processing from the University of Burgundy, Dijon, France, in 1994. Since 1996 he has been an assistant professor, and since 2006 a professor, at Le2i, University of Burgundy. He is now engaged in research on classification algorithms, face recognition, access control problems and the real-time implementation of these algorithms on software and hardware architectures.

Mohamed Atri received his PhD degree in Micro-electronics from the Faculty of Sciences of Monastir, Tunisia, in 2001 and his Habilitation in 2011. He is currently a member of the Laboratory of Electronics & Micro-electronics. His research includes circuit and system design, pattern recognition, and image and video processing.
