
doi:10.1145/2184319.2184337

Article development led by queue.acm.org

Mobile computer-vision technology will soon become as ubiquitous as touch interfaces.
By Kari Pulli, Anatoly Baksheev,
Kirill Kornyakov, and Victor Eruhimov

Real-Time Computer Vision with OpenCV

Computer vision is a rapidly growing field devoted to the analysis, modification, and high-level understanding of images. Its objective is to determine what is happening in front of a camera and use that understanding to control a computer or robotic system, or to provide people with new images that are more informative or aesthetically pleasing than the original camera images. Application areas for computer-vision technology include video surveillance, biometrics, automotive, photography, movie production, Web search, medicine, augmented reality gaming, new user interfaces, and many more.

Modern cameras can automatically focus on people's faces and trigger the shutter when they smile. Optical text-recognition systems help transform scanned documents into text that can be analyzed or read aloud by a voice synthesizer. Cars may include automated driver-assistance systems that help users park or warn them about potentially dangerous situations. Intelligent video surveillance plays an increasingly important role in monitoring the security of public areas.

As mobile devices such as smartphones and tablets come equipped with cameras and more computing power, the demand for computer-vision applications is increasing. These devices have become smart enough to merge several photographs into a high-resolution panorama, or to read a QR code, recognize it, and retrieve information about a product from the Internet. It will not be long before mobile computer-

June 2012 | Vol. 55 | No. 6 | Communications of the ACM | 61


practice

vision technology becomes as ubiquitous as touch interfaces.

Computer vision is computationally expensive, however. Even an algorithm dedicated to solving a very specific problem, such as panorama stitching or face and smile detection, requires a lot of power. Many computer-vision scenarios must be executed in real time, which implies that the processing of a single frame should complete within 30 to 40 milliseconds. This is a very challenging requirement, especially for mobile and embedded computing architectures. Often, it is possible to trade off quality for speed. For example, the panorama-stitching algorithm can find more matches in source images and synthesize an image of higher quality, given more computation time. To meet the constraints of time and the computational budget, developers either compromise on quality or invest more time into optimizing the code for specific hardware architectures.

Vision and Heterogeneous Parallel Computing
In the past, an easy way to increase the performance of a computing device was to wait for the semiconductor processes to improve, which resulted in an increase in the device's clock speed. When the speed increased, all applications got faster without having to modify them or the libraries that they relied on. Unfortunately, those days are over.

As transistors get denser, they also leak more current, and hence are less energy efficient. Improving energy efficiency has become an important priority. Process improvements now allow for more transistors per area, and there are two primary ways to put them to good use. The first is via parallelization: creating more identical processing units instead of making a single unit faster and more powerful. The second is via specialization: building domain-specific hardware accelerators that can perform a particular class of functions more efficiently. The concept of combining these two ideas, that is, running a CPU or CPUs together with various accelerators, is called heterogeneous parallel computing.

High-level computer-vision tasks often contain subtasks that can be run faster on special-purpose hardware architectures than on the CPU, while other subtasks are computed on the CPU. The GPU (graphics processing unit), for example, is an accelerator that is now available on every desktop computer, as well as on mobile devices such as smartphones and tablets.

The first GPUs were fixed-function pipelines specialized for accelerated drawing of shapes on a computer display, as illustrated in Figure 1. As GPUs gained the capability of using color images as input for texture mapping, and their results could be shared back with the CPU rather than just being sent to the display, it became possible to use GPUs for simple image-processing tasks.

Figure 1. Computer vision and GPU. [Diagram: computer vision turns a raster image into high-level information about a scene (for example, "red ball" or "human face"); computer graphics goes the other way. The same hardware boosts both.]

Making the fixed-function GPUs partially programmable by adding shaders was a big step forward. This enabled programmers to write special programs that were run by the GPU on every three-dimensional point of the surface and at every pixel rendered onto the output canvas. This vastly expanded the GPU's processing capability, and clever programmers began to try general-purpose computing on a GPU (GPGPU), harnessing the graphics accelerator for tasks for which it was not originally designed. The GPU became a useful tool for image processing and computer-vision tasks.

The graphics shaders, however, did not provide access to many useful hardware capabilities such as synchronization and atomic memory operations. Modern GPU computation languages such as CUDA, OpenCL, and DirectCompute are explicitly designed to support general-purpose computing on graphics hardware. GPUs are still not quite as flexible as CPUs, but they perform parallel stream processing much more efficiently, and an increasing number of nongraphics applications are being rewritten using the GPU compute languages.

Figure 2. CPU versus GPU performance comparison. [Bar chart of GPU speedup over the CPU for primitive image processing, stereo vision, pedestrian detection (HOG), Viola-Jones face detector, and SURF keypoints; the plotted speedups range from 6x to 30x.]

Computer vision is one of the tasks that often naturally map to GPUs. This is not a coincidence, as computer vision really solves the inverse to the


computer graphics problem. While graphics transforms a scene or object description to pixels, vision transforms pixels to higher-level information. GPUs contain lots of similar processing units and are very efficient in executing simple, similar subtasks such as rendering or filtering pixels. Such tasks are often known as embarrassingly parallel, because they are so easy to parallelize efficiently on a GPU.

Many tasks, however, do not parallelize easily, as they contain serial segments where the results of the later stages depend on the results of earlier stages. These serial algorithms do not run efficiently on GPUs and are much easier to program and often run faster on CPUs. Many iterative numerical optimization algorithms and stack-based tree-search algorithms belong to that class.

Since many high-level tasks consist of both parallel and serial subtasks, the entire task can be accelerated by running some of its components on the CPU and others on the GPU. Unfortunately, this introduces two sources of inefficiency. One is synchronization: when one subtask depends on the results of another, the later stage needs to wait until the previous stage is done. The other inefficiency is the overhead of moving the data back and forth between the GPU and CPU memories, and since computer-vision tasks need to process lots of pixels, it can mean moving massive data chunks back and forth. These are the key challenges in accelerating computer-vision tasks on a system with both a CPU and GPU.

OpenCV Library
The open source computer vision library, OpenCV, began as a research project at Intel in 1998 [5]. It has been available since 2000 under the BSD open source license. OpenCV is aimed at providing the tools needed to solve computer-vision problems. It contains a mix of low-level image-processing functions and high-level algorithms such as face detection, pedestrian detection, feature matching, and tracking. The library has been downloaded more than three million times.

In 2010 a new module that provides GPU acceleration was added to OpenCV. The GPU module covers a significant part of the library's functionality and is still in active development. It is implemented using CUDA and therefore benefits from the CUDA ecosystem, including libraries such as NVIDIA Performance Primitives (NPP).

The GPU module allows users to benefit from GPU acceleration without requiring training in GPU programming. The module is consistent with the CPU version of OpenCV, which makes adoption easy. There are differences, however, the most important of which is the memory model. OpenCV implements a container for images called cv::Mat that exposes access to image raw data. In the GPU module the container cv::gpu::GpuMat stores the image data in the GPU memory and does not provide direct access to the data. If users want to modify the pixel data in the main program running on the CPU, they first need to copy the data from GpuMat to Mat.

    #include <opencv2/opencv.hpp>
    #include <opencv2/gpu/gpu.hpp>
    using namespace cv;
    ...
    // Load as a single-channel image, which gpu::threshold expects.
    Mat image = imread("file.png", CV_LOAD_IMAGE_GRAYSCALE);
    gpu::GpuMat image_gpu;
    image_gpu.upload(image);      // copy from CPU to GPU memory
    gpu::GpuMat result;
    // Arguments: source, destination, threshold, maximum value, type.
    gpu::threshold(image_gpu, result, 128, 255, CV_THRESH_BINARY);
    result.download(image);       // copy the result back to CPU memory
    imshow("WindowName", image);
    waitKey();

In this example, an image is read from a file and then uploaded to GPU memory. The image is thresholded there, and the result is downloaded to CPU memory and displayed. In this simple example only one operation is performed on the image, but several others could be executed on the GPU without transferring images back and forth. The usage of the GPU module is straightforward for someone who is already familiar with OpenCV.

This design provides the user with explicit control over how data is moved between CPU and GPU memory. Although the user must write some additional code to start using the GPU, this approach is flexible and allows more efficient computations. In general, it is a good idea to research, develop, and debug a computer-vision application using the CPU part of OpenCV, and then accelerate it with the GPU module. Developers should try different combinations of CPU and GPU processing, measure their timing, and then choose the combination that performs the best.

Another piece of advice for developers is to use the asynchronous mechanisms provided by CUDA and the GPU module. This allows simultaneous execution of data transfer, GPU processing, and CPU computations. For example, while one frame from the camera is processed by the GPU, the next frame is uploaded to it, minimizing data-transfer overheads and increasing overall performance.

Performance of OpenCV GPU Module
OpenCV's GPU module includes a large number of functions, and many of them have been implemented in different versions, such as the image types (char, short, float), number of channels, and border extrapolation modes. This makes it challenging to report exact performance numbers. An added source of difficulty in distilling the performance numbers down is the overhead of synchronizing and transferring data. This means that best performance is obtained for large images where a lot of processing can be done while the data resides on the GPU.

To help the developer figure out the trade-offs, OpenCV includes a performance benchmarking suite that runs GPU functions with different parameters and on different datasets. This provides a detailed benchmark of how much different datasets are accelerated on the user's hardware.

Figure 2 is a benchmark demonstrating the advantage of the GPU module. The speedup is measured against the baseline of a heavily optimized CPU implementation of OpenCV. OpenCV was compiled with Intel's Streaming SIMD Extensions (SSE) and Threading Building Blocks (TBB) for multicore support, but not all algorithms use them. The primitive image-processing speedups have been averaged across roughly 30 functions. Speedups are also reported for several high-level algorithms.

It is quite normal for a GPU to show a speedup of 30 times for low-level functions and up to 10 times for high-level


functions, which include more overhead and many steps that are not easy to parallelize with a GPU. For example, the granularity for color conversion is per-pixel, making it easy to parallelize. Pedestrian detection, on the other hand, is performed in parallel for each possible pedestrian location, and parallelizing the processing of each window position is limited by the amount of on-chip GPU memory.

As an example, we accelerated two packages from Robot Operating System (ROS) [8], stereo visual odometry and textured object detection, that were originally developed for the CPU. They contain many functional blocks and a class hierarchy.

Wherever it made sense, we offloaded the computations to the GPU. For example, OpenCV GPU implementations performed Speeded-Up Robust Features (SURF) key point detection, matching, and search of stereo correspondences (block matching) for stereo visual odometry. The accelerated packages were a mix of CPU/GPU implementations. As a result, the visual odometry pipeline was accelerated 2.7 times, and textured object detection was accelerated from 1.5 to 4 times, as illustrated in Figure 3. Data-transfer overhead was not a significant part of the total algorithm time. This example shows that replacing only a few lines of code results in a considerable speedup of a high-level vision application.

Figure 3. Textured object detection application: CPU and GPU.

Stereo Correspondence with GPU Module
Stereo correspondence search in a high-resolution video is a demanding application that demonstrates how CPU and GPU computations can be overlapped. OpenCV's GPU module includes an implementation that can process a full HD resolution stereo pair in real time (24 frames per second) on the NVIDIA GTX580.

In a stereo system, two cameras are mounted facing in the same direction. While faraway objects project to the same image locations on each camera, nearby objects project to different locations. This is called disparity. By locating, for each pixel in the left camera image, where the same surface point projects in the right image, you can compute the distance to that surface point from the disparity. Finding these correspondences between pixels in the stereo image pairs is the key challenge in stereo vision.

This task is made easier by rectifying the images. Rectification warps the images to an ideal stereo pair where each scene surface point projects to a matching image row. This way, only points on the same scan line need to be searched. The quality of the match is evaluated by comparing the similarity of a small window of pixels around the pixel in the left image with a window around each candidate matching pixel. Then the pixel in the right image whose window best matches the window of the pixel on the left image is selected as the corresponding match.

The computational requirements obviously increase as the image size increases, because there are more pixels to process. In a larger image the range of disparities measured in pixels also increases, which requires a larger search radius. For small-resolution images the CPU may be sufficient to calculate the disparities; with full HD resolution images, however, only the GPU can provide enough processing power.

Figure 4. Stereo block matching pipeline. [GPU stages: rectification, matching, low-texture filtering, color and show. CPU stage: speckle filtering.]

Figure 5. RGB frame, depth frame, ray-casted frame, and point cloud.

Figure 4 presents a block-matching pipeline that produces a disparity image d(x,y) such that LeftImage(x,y) corresponds to RightImage(x-d(x,y),y). The pipeline first rectifies the images
The pipeline first rectifies the images


and then finds the best matches, as previously described. In areas where there is little texture (for example, a blank wall) the calculated matches are unreliable, so all such areas are marked to be ignored in later processing. As the disparity values are expected to change significantly near object borders, the speckle-filtering stage eliminates speckle noise within large continuous regions of the disparity image. Unfortunately, the speckle-filtering algorithm requires a stack-based depth-first search that is difficult to parallelize, so it is run on the CPU. The results are visualized using a false-color image.

All the steps except speckle filtering are implemented on the GPU. The most compute-intensive step is block matching. On an NVIDIA GTX580 it runs seven times faster than a CPU implementation on a quad-core Intel i5-760 2.8GHz processor with SSE and TBB optimizations. After this speedup the speckle filtering becomes the bottleneck, consuming 50% of the frame-processing time.

An elegant parallel-processing solution is to run speckle filtering on the CPU in parallel with the GPU processing. While the GPU processes the next frame, the CPU performs speckle filtering for the current frame. This can be done using asynchronous OpenCV GPU and CUDA capabilities. The heterogeneous CPU/GPU system now provides a sevenfold speedup for the high-resolution stereo correspondence problem, allowing real-time (24fps) performance at full HD resolution.

KinectFusion
Microsoft's KinectFusion [4] is an example of an application that previously required slow batch processing but now, when powered by GPUs, can be run at interactive speeds. Kinect is a camera that produces color and depth images. Just by aiming the Kinect device around, one can digitize the 3D geometry of indoor scenes at an amazing fidelity, as illustrated in Figure 5. An open source implementation of such a scanning application is based on the Point Cloud Library [6], a companion library to OpenCV that uses 3D points and voxels instead of 2D pixels as basic primitives.

Implementing KinectFusion is not a simple task. Kinect does not return range measurements for all the pixels, and it works reliably only on continuous smooth matte surfaces. The range measurements that it returns are noisy, and depending on the surface shapes and reflectance properties, the noise can be significant. The noise also increases with the distance to the measured surface. Kinect generates a new depth frame 30 times a second. If the user moves the Kinect device too fast, the algorithm gets confused and cannot track the motion using the range data. With a clever combination of good algorithms and the processing power provided by GPUs, however, KinectFusion works robustly.

There are three key concepts that make a robust interactive implementation feasible. First, the tracking algorithm is able to process the new scan data so fast that the camera has time to move very little between consecutive frames. This makes it feasible to track the camera position and orientation using just the range data.

Second, fusion of depth data is done using a volumetric surface representation. The representation is a large voxel grid that makes it easier to merge the data from different scans in comparison with surface-based representations. To obtain high model quality, the grid resolution is chosen to be as dense as possible (512×512×512), so it has to be processed by the GPU for real-time rates.

Finally, the manner in which the new data is merged with the old reduces the noise and uncertainty as more data is gathered, and the accuracy of the model keeps improving. As the model gets better, tracking gets easier as well. Parallel ray casting through the volume is done on the GPU to get depth information, which is used for camera tracking on the next frame. So frame-to-frame movement estimation is performed only between the first and second frames. All other movements are computed on model-to-frame data, which makes camera tracking very robust.

All of these steps are computationally intensive. Volumetric integration requires the high memory bandwidth that only the GPU can deliver at a price low enough to be affordable by normal consumers. Without GPUs this system would simply not be feasible. However, not every step of the computation is easy to do on a GPU. For example, tracking the camera position is done on a CPU. Though the linear equation matrix required for camera position estimation is fully computed on the GPU, computing the final solution does not parallelize well, so it is done on the CPU, which results in some download and API call overhead. Another problem is that the bottom-level image in the hierarchical image processing approach is only 160×120, which is not large enough to fully load a GPU. All the other parts are ideal for GPU but limited by the amount of available GPU memory and computing resources.

Further development requires even more GPU power. At the moment, the size of the scene is limited by the volumetric representation. Using the same number of voxels but making them bigger would allow us to capture a larger scene but at a coarser resolution. Retaining the same resolution while scanning larger scenes would require more voxels, but the number of voxels is limited by the amount of memory available on the GPU and by its computational power.

Mobile Devices
While PCs are often built with a CPU and a GPU on separate chips, mobile devices such as smartphones and tablets put all the computing elements on a single chip. Such an SoC (system on chip) contains one or more CPUs, a GPU, as well as several signal processors for audio and video processing and data communication. All modern smartphones and some tablets also contain one or more cameras, and OpenCV is available on both Android and iOS operating systems. With all these components, it is possible to create mobile vision applications. The following sections look at the mobile hardware in more detail, using NVIDIA's Tegra 2 and Tegra 3 SoCs as examples, and then introduce several useful multimedia APIs. Finally, two mobile vision applications are presented: panorama creation and video stabilization.

Tools for Mobile Computer Vision
At the core of any general-purpose computer is the CPU. While Intel's x86 instruction set rules on desktop computers, almost all mobile phones and tablets are powered by CPUs from ARM. ARM processors follow the RISC (reduced instruction set computing)


approach, as can be deduced from ARM's original name, Advanced RISC Machines. While x86 processors were traditionally designed for high computing power, ARM processors were designed primarily for low-power usage, which is a clear benefit for battery-powered devices. As Intel is reducing power usage in its Atom family for mobile devices, and recent ARM designs are getting increasingly powerful, they may in the future reach a similar design point, at least on the high end of mobile computing devices. Both Tegra 2 and Tegra 3 use ARM Cortex-A9 CPUs.

Mobile phones used to have only one CPU, but modern mobile SoCs are beginning to sport several, providing symmetric multiprocessing. The reason is the potential for energy savings. One can reach roughly the same level of performance using two cores running at 1GHz each as with one core running at 2GHz. Since the power consumption increases super-linearly with the clock speed, however, these two slower cores together consume less power than the single faster core. Tegra 2 provides two ARM cores, while Tegra 3 provides four. Tegra 3 actually contains five (four plus one) cores, out of which one, two, three, or four cores can be active at the same time. One of the cores, known as the shadow or companion core, is designed to use particularly little energy but can run only at relatively slow speeds. That mode is sufficient for standby, listening to music, voice calls, and other applications that rely on dedicated hardware such as the audio codec and require only a few CPU cycles. When more processing power is needed (for example, reading email), the slower core is replaced by one of the faster cores, and for increased performance (browsing, gaming) additional cores kick in.

SIMD (single instruction, multiple data) processing is particularly useful for pixel data, as the same instruction can be used on multiple pixels simultaneously. SSE is Intel's SIMD technology, which exists on all modern x86 chips. ARM has a similar technology called NEON, which is an optional coprocessor in the Cortex-A9. NEON can process up to eight, and sometimes even 16, pixels at the same time, while the CPU can process only one element at a time. This is very attractive for computer-vision developers, as it is often easy to obtain a three to four times performance speedup, and with careful optimization even more than six times. Tegra 2 did not include the NEON extension, but each of Tegra 3's ARM cores has a NEON coprocessor.

All modern smartphones include a GPU. The first generation of mobile GPUs implemented the fixed-functionality graphics pipeline of OpenGL ES 1.0 and 1.1. Even though the GPUs were designed for 3D graphics, they could be used for a limited class of image-processing operations such as warping and blending. The current mobile GPUs are much more flexible and support OpenGL shading language (GLSL) programming with the OpenGL ES 2.0 API, allowing programmers to run fairly complicated shaders at each pixel. Thus, many old-school GPGPU tricks developed for desktop GPUs about 10 years ago can now be reused on mobile devices. The more flexible GPU computing languages such as CUDA and OpenCL will replace those tricks in the coming years but are not available yet.

Table. Energy savings with GLSL on Tegra 3.

OpenCV function (10,000 iterations)    Energy savings (times)
median blur                            3.43
planar warper                          6.25
warpPerspective                        6.45
cylindrical warper                     3.89
blur3x3                                3.60
warpAffine                             15.38

Consumption and creation of audio and video content is an important use case on modern mobile devices. To support it, smartphones contain dedicated hardware encoders and decoders both for audio and video. Additionally, many devices have a special ISP (image signal processor) that processes the pixels streaming out from the camera. These media accelerators are not as easily accessible and useful for computer-vision processing, but the OpenMAX standard helps [1]. OpenMAX defines three different layers: AL (application), IL (integration), and DL (development). The lowest, DL, specifies a set of primitive functions from five domains: audio/video/image coding and image/signal processing. Some of them are of potential interest for computer-vision developers, especially video coding and image processing, because they provide a number of simple filters, color space conversions, and arithmetic operations. IL is meant for system programmers implementing the multimedia framework and provides tools such as camera control. AL is meant for application developers and provides high-level abstractions and objects such as Camera, Media Player, and Media Recorder. The OpenMAX APIs are useful for passing image data efficiently between the various accelerators and other APIs such as OpenGL ES.

Sensors provide another interesting opportunity for computer-vision developers. Many devices contain sensors such as accelerometers, gyroscopes, compasses, and GPS receivers. They are not able to perform calculations but can be useful if the application needs to reconstruct the camera orientation or 3D trajectory. The problem of extracting the camera motion from a set of frames is challenging, both in terms of performance and accuracy. Simultaneous localization and mapping (SLAM), structure from motion (SfM), and other approaches can compute both the camera position and even the shapes of the objects the camera sees, but these methods are not easy to implement, calibrate, and optimize, and they require a lot of processing power. The sensors can nonetheless deliver a fairly accurate estimate of the device orientation at a fraction of the cost of relying only on visual processing. For accurate results the sensor input should be used only as a starting point, to be refined using computer-vision techniques.

OpenCV on Tegra
A major design and implementation goal for OpenCV has always been high performance. Porting both OpenCV and applications to mobile devices requires care, however, to retain a sufficient level of performance. OpenCV has been available on Android since the Google Summer of Code 2010, when it was first built and run on Google Nexus One. Several demo applications illustrated almost real-time behavior, but it was obvious that OpenCV needed optimization and fine-tuning for mobile hardware.


That is why NVIDIA and Itseez de- bined. If the algorithm is constrained Figure 7 shows example speedups of
cided to create a Tegra-optimized by the speed of memory access, how- some filters and geometric transfor-
version of OpenCV. This work ben- ever, multithreading may not provide mations from the OpenCV library.
efited from three major optimization the expected performance improve- An additional benefit of using the
opportunities: code vectorization with ment. For example, the NEON version GPU is that at full speed it runs at a
NEON, multithreading with the Intel TBB (Threading Building Blocks) library, and GPGPU with GLSL.

Taking advantage of the NEON instruction set was the most attractive of the three choices. Figure 6 compares the performance of original and NEON-optimized versions of OpenCV. In general, NEON requires basic arithmetic operations using simple and regular memory-access patterns. Those requirements are often satisfied by image-processing primitives, which are almost ideal for acceleration by NEON vector operations. As those primitives are often in the critical path of high-level computer vision workflows, NEON instructions can significantly accelerate OpenCV routines.

[Figure 6. Performance improvement with NEON on Tegra 3. Bar chart of time (ms), Tegra CPU versus Tegra NEON, for Canny, Median Blur, Optical Flow, Color Conversion, Morphology, Gaussian Blur, FAST Detector, Sobel, pyrDown, and Image Resize; the reported speedups range from 1.6x to 23x.]

Multithreading on up to four symmetric CPUs can help at a higher level. TBB and other threading technologies enable application developers to get the parallel-processing advantage of multiple CPU cores. At the application level independent activities can be distributed among different cores, and the operating system will take care of load balancing. This approach is consistent with the general OpenCV strategy for multithreading (to parallelize the whole algorithmic pipeline), while on a mobile platform we often also have to speed up primitive functions.

One approach is to split low-level functions into several smaller subtasks, which produces faster results. A popular technique is to split an input image into several horizontal stripes and process them simultaneously. An alternative approach is to create a background thread and get the result later while the main program works on other parts of the problem. For example, in the video stabilization application a special class returns an asynchronously calculated result from the previous iteration. With multithreading, the speedup factor is limited by the number of cores, which on the most advanced current mobile platforms is four, while NEON supports vector operations on 16 elements. Of course, both of these technologies can be combined.

[...] of cv::resize does not gain from adding new threads, because a single thread already fully consumes the memory-bus capacity.

The final method applied during the optimization of the OpenCV library for the Tegra platform is GPGPU with GLSL shaders. Though the mobile GPU has limitations as discussed previously, on certain classes of algorithms the GPU is able to show an impressive performance boost while consuming very little energy. On mobile SoCs it is possible to share memory between CPU and GPU, which allows interleaving C++ and GLSL processing of the same image buffer.

[Figure 7. Performance improvement with GLSL on Tegra 3. Bar chart of time (ms), Tegra CPU versus Tegra GPU, for Median Blur, Planar Warper, warpPerspective, Cylindrical Warper, blur3x3, and warpAffine; the reported speedups range from 2.4x to 14x.]

[...] lower average power than the CPU. On mobile devices this is especially important, since one of the main usability factors for consumers is how long the battery lasts on a charge. We measured the average power and time elapsed to perform 10,000 iterations of some optimized C++ functions, compared with the same functions written in GLSL. Since these functions are both faster on the GPU, and the GPU runs at lower peak power, the result is significant energy savings (see the accompanying table). We measured energy savings of 3 to 15 times when porting these functions to GPU.
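The energy argument on this page can be made concrete with a small calculation: energy is average power multiplied by elapsed time, so a function that is both faster and lower-power on the GPU saves energy by the product of the two ratios. The sketch below is only illustrative. The function names and every numeric value are hypothetical; they are not the measurements from the accompanying table.

```python
def energy_joules(avg_power_watts: float, elapsed_seconds: float) -> float:
    """Energy consumed by a run: average power multiplied by elapsed time."""
    return avg_power_watts * elapsed_seconds


def energy_saving_factor(cpu_power_w: float, cpu_time_s: float,
                         gpu_power_w: float, gpu_time_s: float) -> float:
    """How many times less energy the GPU run uses than the CPU run."""
    return (energy_joules(cpu_power_w, cpu_time_s)
            / energy_joules(gpu_power_w, gpu_time_s))


# Hypothetical measurements for 10,000 iterations of one function
# (illustrative numbers only, not taken from the article's table).
cpu_power_w, cpu_time_s = 4.0, 50.0
gpu_power_w, gpu_time_s = 2.0, 10.0

speedup = cpu_time_s / gpu_time_s
saving = energy_saving_factor(cpu_power_w, cpu_time_s, gpu_power_w, gpu_time_s)
print(f"speedup: {speedup:.1f}x, energy saving: {saving:.1f}x")
```

With these made-up inputs the energy saving is larger than the speedup alone, because the power ratio multiplies the time ratio.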


Applications
We have developed two mobile vision applications using OpenCV: one that stitches a panoramic image from several normal photographs, and another that stabilizes streaming video. The performance requirements are challenging. Our goal is real-time performance, where each frame should be processed within about 30 milliseconds, of which basic operations such as simply copying a 1280×720-pixel frame may take eight milliseconds. Consequently, to a large extent the final design of an application and its underlying algorithm is determined by this constraint.
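To see how tight this constraint is, the two figures quoted above (a budget of about 30 milliseconds per frame and about eight milliseconds per full-frame copy) can be turned into a simple budget calculation. This is a back-of-the-envelope sketch, not a measurement: the helper names are hypothetical, and the frame size in bytes assumes four 8-bit channels per pixel.

```python
FRAME_BUDGET_MS = 30.0  # per-frame target quoted in the text
FRAME_COPY_MS = 8.0     # cost of copying one 1280x720 frame, per the text

WIDTH, HEIGHT = 1280, 720
BYTES_PER_PIXEL = 4     # assumption: four channels, 8 bits each


def remaining_budget_ms(num_full_frame_passes: int) -> float:
    """Time left for real work after N passes that each cost one frame copy."""
    return FRAME_BUDGET_MS - num_full_frame_passes * FRAME_COPY_MS


def max_full_frame_passes() -> int:
    """How many copy-like full-frame passes fit into one frame budget."""
    return int(FRAME_BUDGET_MS // FRAME_COPY_MS)


frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
print(f"frame size: {frame_bytes} bytes")
print(f"full-frame passes that fit in the budget: {max_full_frame_passes()}")
print(f"budget left after one copy: {remaining_budget_ms(1)} ms")
```

Under these assumptions only a handful of full-frame passes fit into one frame time, which is why every extra copy or intermediate buffer matters in the designs described next.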
[Figure 8. Input images and the resulting panorama.]

In both cases we were able to satisfy the time limits by using the GPU for optimizing the applications' bottlenecks. Several geometric transformation functions such as image resizing and various types of image warping were ported to the GPU, resulting in a doubling of the application performance. The results were not nearly as good when performing the same tasks using NEON and multithreading. One of the reasons was that both applications deal with high-resolution four-channel images. As a result, the memory bus was overloaded and the CPU cores competed for the cache memory. At the same time we needed to program bilinear interpolation manually on the CPU, whereas on the GPU it is implemented in hardware. We learned that the CPU does not work as well for full-frame geometric transformations, and the help of the GPU was invaluable. Let's consider both applications in more detail.

[Figure 9. Panorama stitching pipeline. High-level pipeline: Image Registration, Seam Finding, Compositing. GPU calls: Image Resize, Image Warp.]

Panorama stitching. In the panorama-stitching application our goal was to combine several ordinary images into a single panorama with a much larger field of view (FOV) than the input images [7]. Figure 8 demonstrates the stitching of several detailed shots into a single high-resolution image of the whole scene.

[Figure 10. Video stabilization input sequence.]
[Figure 11. Video stabilization pipeline. High-level pipeline: Preprocessing, Motion Estimation, Motion Smoothing, Motion Compensation. GPU calls: Image Resize, Image Resize.]

Figure 9 shows the processing pipeline for the OpenCV panorama-stitching application. The process of porting to Tegra started from some algorithmic improvements, followed by NEON and multithreading optimization; yet after all these efforts, the application still was not responsive enough and could not stitch and preview the resulting panorama at interactive speeds. Among the top bottlenecks were image resizing and warping. The former is required because different algorithmic steps are performed at different resolutions, and each input frame is resized about three times, depending on the algorithmic parameters. The type of warping needed depends on the desired projection mode (spherical, cylindrical, among others) and is performed before the final panorama blending.
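Resizing and warping both come down to sampling the source image at fractional coordinates, which is where the bilinear interpolation mentioned earlier comes in. The following is a minimal sketch of what doing this by hand on the CPU involves, for a single-channel image stored as a list of rows. It is an illustration under our own assumptions (the function names, the pixel-center mapping, and the border clamping are ours) and is neither the OpenCV implementation nor the GLSL code used in the application.

```python
def bilinear_sample(img, x, y):
    """Sample a single-channel image (list of rows) at fractional (x, y)."""
    h, w = len(img), len(img[0])
    # Clamp the coordinates so that border pixels are replicated.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # Blend horizontally on the two neighboring rows, then vertically.
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def resize_bilinear(img, new_w, new_h):
    """Resize by mapping each output pixel center back into the source."""
    h, w = len(img), len(img[0])
    sx, sy = w / new_w, h / new_h
    return [[bilinear_sample(img, (j + 0.5) * sx - 0.5, (i + 0.5) * sy - 0.5)
             for j in range(new_w)]
            for i in range(new_h)]


image = [[0, 10],
         [20, 30]]
print(bilinear_sample(image, 0.5, 0.5))
print(resize_bilinear(image, 3, 3))
```

Each output pixel needs four reads and several multiplications here, repeated for every channel of every pixel, which gives a sense of why a GPU texture unit that performs this filtering in hardware is attractive for full-frame transformations.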


With the GPU version of cv::resize we were able to decrease scaling time from 41 milliseconds to 26 milliseconds for each input frame, which is equal to 1.6 times local speedup. Because of the GPU implementation of image warping, we could achieve even better local improvements: a boost of 8 to 14 times in performance, depending on the projection type. As a result, total application speedup was 1.5 to 2.0 times, meeting performance requirements.

Video stabilization. One of the negative consequences of recording video without a tripod is camera shake, which significantly degrades the viewing experience. To achieve visually pleasant results, all movements should be smooth, and the high-frequency variations in camera orientation and translation must be filtered. Numerous approaches have been developed, and some have become open-source or commercially available tools. There exist computationally intensive offline approaches that take a considerable amount of time, while the lightweight online algorithms are more suitable for mobile devices. High-end approaches often reconstruct the 3D movement of the camera and apply sophisticated nonrigid image warping to stabilize the video [2]. On mobile devices more lightweight approaches using translation, affine warping, or planar perspective transformations may make more sense [3].

We experimented with translation and affine models, and in both cases the GPU was able to eliminate the major hotspot, which was the application of the compensating transformation to an input frame. Applying translation to compensate for the motion simply means shifting the input frame along the x and y axes and cutting off some of the boundary areas for which some of the frames now do not contain color information (see Figure 10).

In terms of programming, one should choose a properly located submatrix and then resize it into a new image at the same resolution as the original video stream, as suggested in Figure 11. Surprisingly, this simple step consumed more than 140 milliseconds. Our GPU GLSL implementation was five to six times faster than C++ and took about 25 milliseconds.

Nevertheless, 25 milliseconds is still too long for a real-time algorithm, which is why we next tried to obtain more speed from asynchronous calls. A special class was created for stabilizing frames on the GPU. This class immediately returns a result from the previous iteration stored in its image-buffer field and creates a TBB::task for processing the next frame. As a result, GPU processing is performed in the background, and the apparent cost and delay for the caller is equal to just copying a full frame. This trick was also applied to an expensive color-conversion procedure, and with further optimizations of the memory-access patterns, we achieved real-time processing performance.

Future Directions
GPUs were originally developed to accelerate the conversion of 3D scene descriptions into 2D images at interactive rates, but as they have become more programmable and flexible, they have also been used for the inverse task of processing and analyzing 2D images and image streams to create a 3D description, to control some application so it can react to the user or events in the environment, or simply to create higher-quality images or videos. As computer-vision applications become more commonplace, it will be interesting to see whether a different type of computer-vision processor that would be even more suitable for image processing is created to work with a GPU, or whether the GPU remains suitable even for this task. The current mobile GPUs are not yet as flexible as those on larger computers, but this will change soon enough.

OpenCV (and other related APIs such as Point Cloud Library) have made it easier for application developers to use computer vision. They are well-documented and vibrant open source projects that keep growing, and they are being adapted to new computing technologies. Examples of this evolution are the transition from a C to a C++ API in OpenCV and the appearance of the OpenCV GPU module. The basic OpenCV architecture, however, was designed mostly with CPUs in mind. Maybe it is time to design a new API that explicitly takes heterogeneous multiprocessing into account, where the main program may run on a CPU or several CPUs, while major parts of the vision API run on different types of hardware: a GPU, a DSP (digital signal processor), or even a dedicated vision processor. In fact, Khronos has recently started working on such an API, which could work as an abstraction layer that allows innovation independently on the hardware side and allows for high-level APIs such as OpenCV to be developed on top of this layer while being somewhat insulated from the changes in the underlying hardware architecture.

Acknowledgments
We thank Colin Tracey and Marina Kolpakova for help with power analysis; Andrey Pavlenko and Andrey Kamaev for GLSL and NEON code; and Shalini Gupta, Shervin Emami, and Michael Stewart for additional comments. NVIDIA provided support, including hardware used in the experiments.

References
1. Khronos OpenMAX standard; http://www.khronos.org/openmax.
2. Liu, F., Gleicher, M., Wang, J., Jin, H., Agarwala, A. Subspace video stabilization. ACM Transactions on Graphics 30, 1 (2011), 4:1-4:10.
3. Matsushita, Y., Ofek, E., Ge, W., Tang, X., Shum, H.-Y. Full-frame video stabilization with motion inpainting. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 7 (2006), 1150-1163.
4. Newcombe, R.A., Izadi, S. et al. Kinectfusion: Real-time dense surface mapping and tracking. IEEE International Symposium on Mixed and Augmented Reality (2011), 127-136.
5. OpenCV library; http://code.opencv.org.
6. Point Cloud Library; http://pointclouds.org.
7. Szeliski, R. Image alignment and stitching: a tutorial. Foundations and Trends in Computer Graphics and Vision 2, 1 (2006), 1-104.
8. Willow Garage. Robot Operating System; http://www.ros.org/wiki/.

Kari Pulli is a senior director at NVIDIA Research, where he heads the Mobile Visual Computing Research team and works on topics related to cameras, imaging, and vision on mobile devices. He has worked on standardizing mobile media APIs at Khronos and JCP and was technical lead of the Digital Michelangelo Project at Stanford University.

Anatoly Baksheev is a project manager at Itseez. He started his career there in 2006 and was the principal developer of the multi-projector system Argus Planetarium. Since 2010 he has been the leader of the OpenCV GPU project. Since 2011 he has worked on the GPU acceleration module for Point Cloud Library.

Kirill Kornyakov is a project manager at Itseez, where he leads the development of the OpenCV library for mobile devices. He manages activities on mobile operating-system support and computer-vision applications development, including performance optimization for the NVIDIA Tegra platform.

Victor Eruhimov is CTO of Itseez. Prior to co-founding the company, he worked as a project manager and senior research scientist at Intel, where he applied computer-vision and machine-learning methods to automate Intel fabs and revolutionize data processing in semiconductor manufacturing.

© 2012 ACM 0001-0782/12/06 $10.00
