Parallel Vision by
GPGPU/CUDA
Wang, Yuan-Kai (王元凱)
Electrical Engineering Department, Fu Jen Catholic
University (輔仁大學電機工程系)
ykwang@mail.fju.edu.tw
http://www.ykwang.tw
2014/07/17
Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. 2
Multicore Computing
❖ What Is Multicore
• Combine multiple processors
(CPU, DSP, GPGPU, FPGA)
into a single chip
❖ Multicore computing is inevitable
Moore's Law
❖ In 1965, Gordon Moore (Intel co-founder)
predicted
• The number of transistors on an IC would
double every year (revised in 1975 to every
two years; "18 months" is the popular version)
❖ The well-known law
• The performance of computers
doubles every 18 months
• More transistors
→ More performance
❖ The prediction held
for Intel's CPUs
for 40 years
Problems
❖ More transistors need high frequency
• We come into the Clock Speed Race
❖ But high frequency needs high power
consumption
• High power consumption → Heat problem
• About 4 GHz has been the practical limit of clock speed
Single Core
Time
Two Architectures
for Multicore
❖ Symmetric multiprocessing (SMP)
• Multicore CPU, GPGPU, DSP multicore
• Homogeneous computing
❖ Asymmetric multiprocessing (AMP)
• CPU+GPGPU,
CPU+FPGA,
CPU+DSP
• Heterogeneous computing
Multiple
Execution Cores
GPGPU (1/2)
❖ GPU (Graphics Processing Unit)
• The processor in a graphics card that
speeds up 3D graphics
• Gaming is a major
application
❖ GPGPU: General-Purpose GPU
• General purpose computation using
GPU in applications other than 3D
graphics
GPGPU (2/2)
❖ GPGPU has more cores than CPU
• 120 ~ 3072 cores vs. 2 ~ 8 cores
(Many-core vs. Multi-core)
❖ GPGPU offers higher peak throughput than
a multicore CPU on data-parallel workloads
❖ Vendors:
• nVidia
• Qualcomm
(AMD, ATI)
• ARM
• Intel
[Figure: video analytics pipeline. Video Capture → Image Enhance → Object/Event Detection → Object Tracking → Object/Event Recognition → Behavior Analysis → Retrieval]
HPC Approaches
❖ Cluster/distributed computing
• Hadoop/MAP-REDUCE Supercomputer
(Google, Cloud Computing)
• MPI
❖ Multi-processing
computing
• Multicore (GPGPU, CPU, FPGA/DSP)
• Programming: multi-threading
• Windows thread, Pthread, OpenMP
• CUDA, RenderScript, C++ AMP, …
However
❖ Can CV algorithms speed up every 18
months with multicore?
❖ Multicore is not a simple solution for
upgrading CV algorithm performance
• The transition from single core to
multicore will be blocked by software
• We are not ready to face the software
programming challenges
• It is the software, stupid.
Multi-threading Demands
New Programming Skills
❖ Previous multi-threading techniques
• Windows thread, pthread, OpenMP,
MPI, …
❖ New techniques
• CUDA, C++ AMP, OpenCL, Renderscript,
OpenACC, Map Reduce, …
❖ Concepts
• Race condition, deadlock
• Domain partition, function partition, …
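The concepts listed above can be illustrated with a minimal C++ sketch. It assumes standard `std::thread` rather than any specific library named on the slide, and the function name `parallel_sum` and its parameters are hypothetical. The array is split by domain partition; each thread writes only its own result slot, so no two threads touch the same variable and no race condition can occur.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Domain partition: split `data` into `num_threads` contiguous slices.
// Each thread writes only partial[t], so there is no shared mutable
// state between threads and therefore no race condition.
long long parallel_sum(const std::vector<int>& data, std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    // Slice length, rounded up so that every element is covered.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&data, &partial, chunk, t] {
            const std::size_t begin = std::min(t * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            long long local = 0;
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;  // private slot for this thread
        });
    }
    for (auto& w : workers) w.join();  // wait until every slice is done

    // Sequential combine of the per-thread results.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

If every thread instead added directly into one shared `long long`, the unsynchronized updates would be a race condition; the private slots avoid the need for a lock, which also rules out deadlock in this small case.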
Multicore Programming
Practice (MPP)
❖ Goal: Write portable C/C++
programs to be "Multicore ready"
and platform compatible
• Proposed by a
MPP working group
in the Multicore
Association
http://www.multicore-association.org/workgroup/mpp.php
OpenACC
❖ An organization that develops an API which
• Describes a collection of
compiler directives
• Specifies loops and regions of
code in standard C, C++ and Fortran
• To be offloaded from a host CPU to
an attached accelerator, including
• APUs, GPUs, and many-core
coprocessors
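A directive of this kind can be sketched as follows. This is a minimal example, not taken from the slides: the function name `saxpy` is hypothetical, and the data clauses shown are one reasonable choice among several. A compiler without OpenACC support ignores the pragma and runs the loop on the host, so the same source stays portable.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// SAXPY: y[i] = a * x[i] + y[i].
// With an OpenACC-enabled compiler the directive asks for the loop to be
// offloaded to an attached accelerator; other compilers ignore the pragma
// and execute the same loop sequentially on the host CPU.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const float* xp = x.data();
    float* yp = y.data();

    // copyin: x is only read on the device; copy: y is read and written back.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

The loop body is unchanged standard C++; only the directive marks the region to offload.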
HSA Foundation
❖Heterogeneous System Architecture
• Key members: AMD, QUALCOMM, ARM,
SAMSUNG, TI
❖System architecture easing efficient
use of accelerators, SoCs
• Intended to support high-level parallel
programming frameworks
• OpenCL, C++, C#, OpenMP, Java
• Accelerator requirements
• Full-system SVM, memory coherency,
preemption,
user-mode dispatch
HPEC
❖ High Performance Embedded
Computing
• MIT Lincoln Lab, 1997 ~
OpenCL
❖ Royalty-free, cross-platform, cross-
vendor standard
• Targeting: supercomputers
→ embedded systems
→ mobile devices
❖Enables programming of diverse
compute resources
• CPU, GPU,
DSP, FPGA …
2. GPGPU
PC platform
Mobile platform
Why GPGPU
❖ GPGPU is many-core (vs. multi-core)
• Suitable for massively parallel computing
GPGPU as a Coprocessor
Heterogeneous Computing
PC Platform
• Discrete GPUs
• GPGPU card as a coprocessor
• CPU and GPGPU connected over PCIe
Mobile Platform
• Integrated GPUs
• GPGPU sub-chip as a coprocessor
• CPU and GPGPU on one chip, no PCIe
GPGPU Solutions
– Qualcomm/AMD
❖ Qualcomm, AMD, ATI
❖ APU: integrated CPU+GPU
❖ Low energy consumption
❖ PC (AMD): FirePro
❖ Mobile (Snapdragon):
• Adreno 330 (32 cores)
Applications of GPGPU
http://developer.nvidia.com/category/zone/cuda-zone
Heterogeneous Architecture
❖Host: CPU
❖Device: GPGPU
❖Notice: memory hierarchy in device
GPGPUs Architecture
- nVidia
❖ GT200
• GTX 260/280, Quadro 5800, Tesla C1060
❖ Fermi
• Tesla 2060
[Figure: CPU (host), multicore, with Control, Cache, ALUs and DRAM, vs. GPU (device), many-core, with DRAM]
Memory Hierarchy
❖ On-Chip Memory
• Registers
• Shared Memory
• Constant Memory (cached on chip; data resides off chip)
• Texture Memory (cached on chip; data resides off chip)
❖ Off-Chip Memory
• Local Memory
• Global Memory
A Comparison of FPGA and GPU for Real-Time Phase-Based Optical Flow, Stereo, and Local
Image Features, IEEE Transactions on Computers, 2012.
nVidia CUDA
Qualcomm OpenCL
ARM RenderScript
3. Parallel
Programming
Multi-threading
Programming Languages for Parallels
Parallel Computing
❖ Serial
Computing
❖ Parallel
Computing
[Figure: multiple cores in a CPU/GPU]
Parallel Programming
❖ Much code is written in C/C++/Java
• Especially algorithmic programs
❖ Can we write GPGPU parallel
programs in C/C++/Java?
❖ However, C/C++ is sequential
• Three control structures of C/C++/Java:
sequence, selection, repetition
Multi-threading
❖ Multi-threading is the fundamental
concept for parallel programming
• Some techniques are ready
• Pthread, Win32 thread, OpenMP,
MPI, Intel TBB (Threading Building
Blocks)...
• New techniques
• CUDA, OpenCL, Renderscript,
OpenACC, C++ AMP, ...
Parallel Programming in
Sequential Language
❖ Do we need to learn new languages for
multi-threading?
• No
❖ Write multi-threaded code in C/C++
• Add functions/directives to C/C++ for
multi-threading
• This is what current solutions do
• pthread, Win32 thread, OpenMP,
MPI, CUDA, OpenCL, ...
Multi-Threading
❖ A program running
In Serial
In Parallel
http://en.wikipedia.org/wiki/Thread_(computer_science)
GPGPU Programming:
SIMT model
❖ CPU (“host”) program often
written in C or C++
❖ GPU code is written as a sequential
kernel in (usually) a C or C++
dialect
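The idea of a "sequential kernel" can be sketched with a host-only C++ emulation. This is an assumption-laden illustration, not CUDA itself: the names `add_kernel` and `launch_add` are hypothetical, and the index arguments stand in for the CUDA built-ins `blockIdx.x`, `blockDim.x` and `threadIdx.x`. The kernel is written as ordinary sequential code for one thread, and the launch applies it to every (block, thread) pair.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The "kernel": sequential code for ONE thread, which handles one element.
// In CUDA the index would be blockIdx.x * blockDim.x + threadIdx.x.
void add_kernel(std::size_t block_idx, std::size_t block_dim, std::size_t thread_idx,
                const float* a, const float* b, float* c, std::size_t n) {
    const std::size_t i = block_idx * block_dim + thread_idx;
    if (i < n) {  // guard: the last block may be only partly used
        c[i] = a[i] + b[i];
    }
}

// Host-side emulation of a kernel launch. On a GPU all (block, thread)
// pairs run in parallel; here they are simply visited one after another.
std::vector<float> launch_add(const std::vector<float>& a, const std::vector<float>& b,
                              std::size_t block_dim) {
    assert(a.size() == b.size() && block_dim > 0);
    const std::size_t n = a.size();
    const std::size_t grid_dim = (n + block_dim - 1) / block_dim;  // round up
    std::vector<float> c(n);
    for (std::size_t block = 0; block < grid_dim; ++block) {
        for (std::size_t thread = 0; thread < block_dim; ++thread) {
            add_kernel(block, block_dim, thread, a.data(), b.data(), c.data(), n);
        }
    }
    return c;
}
```

The kernel contains no loop over the data: the loop is replaced by the grid of threads, which is the core of the SIMT model.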
GPGPU Programming
Techniques
CUDA
OpenCL
C++ AMP
Renderscript
CUDA
CUDA
❖ CUDA: Compute Unified Device
Architecture
❖ Parallel programming
for nVidia's GPGPU
❖ Use C/C++ language
• Java, Fortran, and Matlab can also be used
❖ When executing CUDA programs,
the GPU operates as a coprocessor to
the main CPU
Step 6: Release device memory
Programming with
Memory Hierarchy
❖ Locality
principle
• Temporal
locality
• Spatial
locality
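The two kinds of locality can be sketched in plain C++. This is a minimal host-side illustration with hypothetical function names; it assumes an image stored in row-major order. Both functions compute the same sum, but the first visits consecutive addresses while the second jumps a full row between accesses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Image stored row-major: pixel (r, c) lives at index r * cols + c.
// Row-by-row traversal touches consecutive addresses: good spatial locality.
long long sum_row_order(const std::vector<int>& img, std::size_t rows, std::size_t cols) {
    assert(img.size() == rows * cols);
    long long sum = 0;  // reused on every iteration: temporal locality
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            sum += img[r * cols + c];
        }
    }
    return sum;
}

// Column-by-column traversal jumps `cols` elements between accesses, so it
// produces the same result with poorer spatial locality.
long long sum_column_order(const std::vector<int>& img, std::size_t rows, std::size_t cols) {
    assert(img.size() == rows * cols);
    long long sum = 0;
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            sum += img[r * cols + c];
        }
    }
    return sum;
}
```

On large images the row-order version typically runs faster on a cached CPU; on a GPU the analogous concern is arranging accesses so that neighbouring threads read neighbouring addresses.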
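printf("%s\n", h_hello);   // print the host string
cudaFree(d_hello1);        // release device memory
cudaFree(d_hello2);
system("pause");           // Windows only: keep the console open
return 0;
}
[Figure: Host holds src and h_hello; Device holds d_hello1 and d_hello2]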
Result:
OpenCL
Standard
What's OpenCL
❖One code tree can be executed on CPUs, GPUs,
DSPs and hardware
• Dynamically interrogate system load and
balance across available processors
❖Powerful, low-level flexibility
• Foundational access to compute resources for
higher-level engines, frameworks and
languages
Declaring
C++ AMP
Microsoft
Declaring &
copying to device
Display
RenderScript
Google Android
What's Renderscript(1/2)
❖Higher-level than CUDA or OpenCL: simpler &
less performance control
• Emphasis on mobile devices & cross-SoC
performance portability
❖Programming model
• C99-based kernel language, JIT-compiled,
single input-single output
• Automatic Java class reflection
• Intrinsics: built-in, highly-tuned operations,
e.g. ScriptIntrinsicConvolve3x3
• Script groups combine kernels to amortize
launch cost & enable kernel fusion
What's Renderscript(2/2)
❖ Data type:
• 1D/2D collections of elements, C types like int
and short2, types include size
• Runtime type checking
❖ Parallelism
• Implicit: one thread per data element,
atomics for thread-safe access
• Thread scheduling not exposed, VM-decided
RenderScript Architecture
Renderscript Project
Framework
Comparison (1/2)
❖Renderscript vs. Native(NDK) vs. Java(SDK)
• OS: Honeycomb v3.2(CPU only)
Qian, Xi, Guangyu Zhu, and Xiao-Feng Li. "Comparison and analysis of
the three programming models in google android." in Proc. First Asia-
Pacific Programming Languages and Compilers Workshop (APPLC). 201
Comparison(2/2)
❖OpenCL & CUDA
• Sobel filter with (CMw) and without (CMw/o)
constant memory
GPGPU Programming
Parallelization
❖ Multicore/Multi-threading
❖ Data Parallelization
• Data distribution
• Parallel convolution
• Reduction algorithm
• Amdahl’s law
❖ Memory Hierarchy Management
• Locality principle
• Program accesses a relatively small portion
of the address space at any instant of time
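Amdahl's law, listed above, bounds the speedup that parallelization can deliver. As a small sketch (the function name `amdahl_speedup` is hypothetical), with p the fraction of run time that can be parallelized and n the number of cores, the speedup is 1 / ((1 - p) + p / n):

```cpp
#include <cassert>

// Amdahl's law: speedup = 1 / ((1 - p) + p / n), where
//   p = fraction of the run time that can be parallelized (0..1),
//   n = number of cores working on the parallel part.
double amdahl_speedup(double p, double n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}
```

For example, with p = 0.9 the speedup can never exceed 10 however many cores are used, because the remaining 10% of the work stays serial. This is why the serial parts of a vision pipeline (such as host-device copies) matter as much as the kernels.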
4. OpenCV
Acceleration
What Is OpenCV
❖A very popular computer vision
library
• 6M downloads
• BSD license
• 2000 ~ CV functions
• Modularized and efficient
• Optimization
• Intel SSE, IPP, TBB
• ARM NEON & GLSL (Tegra)
• CUDA, OpenCL
OpenCV Modules
❖Image/video I/O, processing, display (core,
imgproc, highgui)
❖Object/feature detection (objdetect, features2d,
nonfree)
❖Geometry-based monocular or stereo computer
vision (calib3d, stitching, videostab)
❖Computational photography (photo, video,
superres)
❖Machine learning & clustering (ml, flann)
❖CUDA and OpenCL GPU acceleration (gpu, ocl)
Normal CV modules: 14
Acceleration modules: 2
Focus on PC platform
Not fully compatible with mobile GPGPU on Android
CUDA De-noising
❖Gaussian noise removal
• gpu::FastNonLocalMeansDenoising()
❖Edge preserving smoothing
• gpu::bilateralFilter()
CUDA Background
Segmentation
❖Foreground/background segmentation (e.g.,
object detection/removal, motion tracking,
background removal)
• gpu::FGDStatModel, ::GMG_GPU,
::MOG_GPU, ::MOG2_GPU
5. Computer Vision
Acceleration on PC
Image enhancement (HDR)
Feature extraction
Video surveillance cloud
HDR and
Image Enhancement
Algorithms for
Image Restoration
❖ Wiener Filter
❖ Histogram Based Approach
• Histogram Equalization,
Histogram Modification, …
❖ Retinex
• Path-based Retinex
• Recursive Retinex
• Center/surround Retinex
• Non-iterative, so it is suitable for parallelization
• Multi-Scale Retinex with Color Restoration (MSRCR)
[Rahman et al. 1997]
MSRCR Algorithm
The Method
[Figure: processing flow. CPU: copy data from CPU to GPGPU. GPGPU: Gaussian Blur → Log-domain Processing → Normalization]
Parallelization by GPGPU
❖ Multicore/Multi-threading
• Tesla C1060: 240 SPs (Stream Processors)
• CUDA: Thread, Block, Grid
❖ Data Parallelization
• Parallel convolution
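Why convolution parallelizes well can be seen in a short host-side C++ sketch. It is an illustration under stated assumptions, not the implementation from the talk: the function name `convolve3x3` is hypothetical, the image is row-major grayscale, borders use zero padding, and the kernel is applied without flipping (identical to true convolution for symmetric kernels such as a Gaussian). Every output pixel depends only on the input image, so each (r, c) iteration could be given to its own GPU thread.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// 3x3 filter on a row-major grayscale image with zero padding at the borders.
// The kernel is applied without flipping (correlation form), which equals
// convolution for symmetric kernels. Each output pixel is independent of the
// others, so the two outer loops are the part a GPU would run in parallel.
std::vector<float> convolve3x3(const std::vector<float>& in, std::size_t rows,
                               std::size_t cols, const std::array<float, 9>& k) {
    assert(in.size() == rows * cols);
    std::vector<float> out(in.size(), 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            float acc = 0.0f;
            for (int dr = -1; dr <= 1; ++dr) {
                for (int dc = -1; dc <= 1; ++dc) {
                    const long rr = static_cast<long>(r) + dr;
                    const long cc = static_cast<long>(c) + dc;
                    if (rr < 0 || cc < 0 || rr >= static_cast<long>(rows) ||
                        cc >= static_cast<long>(cols)) {
                        continue;  // outside the image: contributes zero
                    }
                    const std::size_t ki = static_cast<std::size_t>((dr + 1) * 3 + (dc + 1));
                    const std::size_t ii = static_cast<std::size_t>(rr) * cols +
                                           static_cast<std::size_t>(cc);
                    acc += k[ki] * in[ii];
                }
            }
            out[r * cols + c] = acc;
        }
    }
    return out;
}
```

Because no iteration writes a location that another iteration reads, the work can be split by domain partition without any synchronization.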
[Figure: M pixels distributed over processing elements (PE i); tree-based sum over time steps t0 to t5: A(0)+A(1) and A(2)+A(3), then A(0)+A(1)+A(2)+A(3)]
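The pairwise summation pattern A(0)+A(1), A(2)+A(3), then A(0)+A(1)+A(2)+A(3) is a tree-based reduction. The following C++ sketch runs the steps sequentially on the host; the function names are hypothetical, and a GPU version would execute all additions within one step in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tree-based (pairwise) reduction. In each step, element i accumulates
// element i + stride. All additions within one step are independent, so a
// GPU could run them in parallel; here the steps run sequentially.
long long tree_reduce_sum(std::vector<long long> a) {  // by value: scratch copy
    if (a.empty()) return 0;
    for (std::size_t stride = 1; stride < a.size(); stride *= 2) {
        for (std::size_t i = 0; i + stride < a.size(); i += 2 * stride) {
            a[i] += a[i + stride];
        }
    }
    return a[0];  // the total ends up in the first element
}

// Number of steps the tree needs for n elements: ceil(log2(n)).
std::size_t reduction_steps(std::size_t n) {
    assert(n > 0);
    std::size_t steps = 0;
    for (std::size_t stride = 1; stride < n; stride *= 2) ++steps;
    return steps;
}
```

With enough processing elements the sum of M values takes about log2(M) steps instead of M - 1 sequential additions.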
Parallel Normalization
Shared
Memory
Parallel Histogram
Stretching
Feature Extraction
(SIFT)
What Is SIFT
❖ SIFT
• Scale Invariant Feature Transform
❖ Invariance of feature points
• Translation
• Rotation
• Scale
Applications of SIFT
❖Object recognition/tracking
❖Image retrieval
❖Autostitch
Experimental Results
CPU GPU
Execution Time
CPU: 10 seconds on average
GPGPU: 0.8 seconds on average
[Figure: execution time chart, time axis in ms]
Speedup
Video
Surveillance Cloud
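GPGPU Cloud Video Surveillance System
[Figure: system architecture]
• Multi-resolution wide-area surveillance system
• Central video and message management system
• High-efficiency video event browsing system
• Clients: PC, mobile device
• Analytics: restricted-area intrusion detection, face detection in dynamic scenes, illegal parking detection, PTZ camera tracking, outdoor camera anomaly detection, parking-lot vacancy detection
• Infrastructure: Hypervisor, Multi-core, GPGPU, Storage Area Network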
Private cloud server room
6. Computer Vision
Acceleration on
Android
OpenCV
RenderScript
OpenCV
on Android
OpenCV4Android SDK
❖Enables development of Android applications
with use of OpenCV library.
❖Uses the Java Native Interface (JNI) to directly access C
code
❖Supports nVidia's Tegra Android Development
Pack (TADP)
Not fully
compatible with
GPU module
System Framework
RenderScript on
Android with GPU
Acceleration
RenderScript on android
with GPU(1/5)
RenderScript on android
with GPU(2/5)
RenderScript on android
with GPU(3/5)
RenderScript on android
with GPU(4/5)
RenderScript on android
with GPU(5/5)
RenderScript Image
Processing Intrinsics
Name | Operation
ScriptIntrinsicConvolve3x3, ScriptIntrinsicConvolve5x5 | Performs a 3x3 or 5x5 convolution.
ScriptIntrinsicBlur | Performs a Gaussian blur. Supports grayscale and RGBA buffers and is used by the system framework for drop shadows.
ScriptIntrinsicYuvToRGB | Converts a YUV buffer to RGB. Often used to process camera data.
ScriptIntrinsicColorMatrix | Applies a 4x4 color matrix to a buffer.
ScriptIntrinsicBlend | Blends two allocations in a variety of ways.
ScriptIntrinsicLUT | Applies a per-channel lookup table to a buffer.
ScriptIntrinsic3DLUT | Applies a color cube with interpolation to a buffer.
ScriptIntrinsicHistogram | Intrinsic histogram filter.
RenderScript Intrinsic
Example(1/2)
RenderScript Intrinsic
Example(2/2)
Blur Intrinsic
Performance Analysis
Performance of
RenderScript Intrinsics
❖On new Nexus 7
❖Relative to equivalent multithreaded C
implementations.
RenderScript Image
Processing Benchmarks(1/2)
❖CPU only on a Galaxy Nexus device.
RenderScript Image
Processing Benchmarks(2/2)
Le, Duc Phuoc Phat Dat, et al. "Acceleration of Retinex Algorithm for Image
Processing on Android Device Using Renderscript." in Proc. The 8th International
Conference on Robotic, Vision, Signal Processing & Power Applications, 2014.
7. Summary
GPGPU
❖ Single-core
→ Multi-core
→ Many-core
❖PC
• nVidia Tesla + CUDA/OpenCV
❖Android
• Qualcomm Adreno + OpenCV ocl
• nVidia Tegra + OpenCV gpu
Parallel Programming
❖C/C++/OpenCV
• OpenMP, OpenACC, CUDA, C++ AMP
• OpenCL
❖Java
• OpenCL, RenderScript
❖Note that OpenCL and
RenderScript are
• Not efficient in parallelization.
• Efficient in CV algorithmic design.
OpenCL 2.0
❖Released in 2013
❖SVM: Shared Virtual Memory
• OpenCL 1.2: Explicit memory
management
❖Dynamic (Nested) Parallelism
• Allows a device to enqueue kernels onto
itself – no round trip to host required
❖Disadvantages
• Requires strong hardware support
• Not well supported in current GPGPUs
[Figure: CUDA application showcase with reported speedups of 3.5X to 26X. Titles shown: PowerDirector7 Ultra; CUDA JPEG Decoder; DivideFrame GPU Decoder; Hyperspectral Image Compression on NVIDIA GPUs; GPU Decoder (Vegas/Premiere) - Using the Power of NVIDIA Graphic Card to Decode H.264 Video Files; Motion Estimation for H.264/AVC on Multiple GPUs Using NVIDIA CUDA]
• 87X: CUDA SURF - A Real-time Implementation for SURF (TU Darmstadt)
• 26X: Leukocyte Tracking: ImageJ Plugin (University of Virginia)
• 200X: Real-time Spatiotemporal Stereo Matching Using the Dual-Cross-Bilateral Grid
• 100X: Image Denoising with Bilateral Filter (Wroclaw University of Technology)
• 85X: Digital Breast Tomosynthesis Reconstruction (Massachusetts General Hospital)
• 100X: Fast Optical Flow on GPU At Video Rate for Full HD Resolution (Onera)
• 8X: A Framework for Efficient and Scalable Execution of Domain-specific Templates On GPU (NEC Labs, Berkeley, Purdue)
• 13X: Accelerating Advanced MRI Reconstructions (University of Illinois)
• 20X: GPU for Surveillance
• 13X: Fast Human Detection with Cascaded Ensembles
• 109X: Fast Sliding-Window Object Detection
• 263X: GPU Acceleration of Object Classification Algorithm Using NVIDIA CUDA
• 300X: Audience Measurement - Real-time Video Analysis for Counting People, Face Detection and Tracking
• 10X: Real-time Visual Tracker by Stream Processing
• 45X: A GPU Accelerated Evolutionary Computer Vision System
• 3X: Canny Edge Detection
Readings (1/2)
• Wang, Yuan-Kai, and Wen-Bin Huang. "Acceleration of an improved
Retinex algorithm." IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR). 2011.
• Wang, Yuan-Kai, and Wen-Bin Huang. "A CUDA-enabled parallel
algorithm for accelerating retinex." Journal of Real-Time Image
Processing (2012): 1-19.
• Pauwels, Karl, et al. "A comparison of FPGA and GPU for real-time
phase-based optical flow, stereo, and local image features."
Computers, IEEE Transactions on 61.7 (2012): 999-1012.
• Pratx, Guillem, and Lei Xing. "GPU computing in medical physics: A
review." Medical physics 38.5 (2011): 2685-2697.
• Cope, Ben, et al. "Performance comparison of graphics processors
to reconfigurable logic: a case study." Computers, IEEE
Transactions on 59.4 (2010): 433-448.
Readings (2/2)
❖ “Designing Visionary Mobile Apps Using the Tegra
Android Development Pack,” http://bit.ly/1jvwbgV
❖ “Getting Started With GPU-Accelerated Computer
Vision Using OpenCV and CUDA,”
http://bit.ly/1oMwJEG
❖ “The open standard for parallel programming of
heterogeneous systems,”
https://www.khronos.org/opencl/
OpenCV Acceleration
❖ GPU Module Introduction — OpenCV 2.4.9.0
documentation
❖ OpenCL Module Introduction - OpenCV documentation
❖ OpenCV-CL: Computer vision with OpenCL
acceleration, AMD Developer Central, 2013.
❖ Pulli, Kari, et al. "Real-time computer vision with
OpenCV." Communications of the ACM 55.6 (2012):
61-69.
❖ Allusse, Yannick, et al. "GpuCV: A GPU-accelerated
framework for image processing and computer vision."
Advances in Visual Computing. Springer Berlin
Heidelberg, 2008. 430-439.
CUDA
❖ CUDA Programming guide. nVidia.
❖ CUDA Best Practices Guide. nVidia.
❖ CUDA Reference Manual. nVidia.
❖ CUDA Zone - NVIDIA Developer,
https://developer.nvidia.com/cuda-zone
❖ Parallel Programming and Computing Platform | CUDA
Home, www.nvidia.com/object/cuda_home_new.html
❖ Applications of CUDA for Imaging and Computer
Vision
http://www.nvidia.com/object/imaging_comp_vision.html
❖ nVidia Performance Primitives (NPP)
http://developer.nvidia.com/object/npp_home.html
OpenCL
❖ Khronos OpenCL specification, reference card, tutorials, etc:
http://www.khronos.org/opencl
❖ AMD OpenCL Resources:
http://developer.amd.com/opencl
❖ NVIDIA OpenCL Resources:
http://developer.nvidia.com/opencl
❖ Books
• Using OpenCL: Programming Massively Parallel Computers.
IOS Press, 2012.
• OpenCL programming guide. Pearson Education, 2011.
• Heterogeneous Computing with OpenCL: Revised OpenCL 1.
Newnes, 2012.
• OpenCL in Action: how to accelerate graphics and
computation. Manning, 2012.
RenderScript
❖ RenderScript for Android Developer, Official web site
http://developer.android.com/guide/topics/renderscript/compute.html
❖ Qian, Xi, Guangyu Zhu, and Xiao-Feng Li.
"Comparison and analysis of the three programming
models in google android." First Asia-Pacific
Programming Languages and Compilers Workshop.
2012.
❖ "High Performance Apps Development with
RenderScript," 12th Kandroid Conference, 2013.