
Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. 1

Parallel Vision by
GPGPU/CUDA
Wang, Yuan-Kai (王元凱)
Electrical Engineering Department, Fu Jen Catholic
University (輔仁大學電機工程系)
ykwang@mail.fju.edu.tw
http://www.ykwang.tw
2014/07/17

About this Course


❖ Multicore Era for Computer Vision
❖ GPGPU
❖ Parallel Programming
(CUDA, OpenCL, Renderscript)
❖ OpenCV Acceleration with GPGPU
❖ Computer Vision Acceleration

1. Multicore Era for Computer Vision
Paradigm shift
from Clock Speed Race
to Multicore Race

Multicore Computing
❖ What Is Multicore
• Combine multiple processors
(CPU, DSP, GPGPU, FPGA)
into single chip
❖ Multicore computing is inevitable

Moore's Law
❖ In 1965, Gordon Moore (Intel co-founder)
predicted
• The number of transistors on an IC would
double every year (revised in 1975 to
every two years)
❖ The popular version of the law
• The performance of computers
doubles every 18 months
• More transistors
→ More performance
❖ The prediction held
for Intel's CPUs for about 40 years

Review of Moore's Law


❖ Transistors in a chip did increase

Software enjoys the fruits of hardware's labour.



Problems
❖ Turning more transistors into single-thread speed needed higher frequency
• We entered the Clock Speed Race
❖ But high frequency needs high power
consumption
• High power consumption → Heat problem
• About 4 GHz has been the practical limit of clock speed

Paradigm Shift from 2000 AD


❖ General-purpose multicore
comes of age
❖ Chip companies race to create multicore
processors
• CPU: Intel Core Duo, Quad-core,
ARM v7, ...
• DSP: TI OMAP, ARM NEON, …
• GPU/GPGPU:
• nVidia: GeForce/Tesla, Tegra
• ARM: Mali-T6x
• …

The Multicore Evolution


From large mono-core to multiple lightweight cores
• Pentium processor: optimized for a single thread
• Core Duo
• In 5~10 years: 10~100 energy-efficient cores optimized for parallel execution

Moore’s Law Needs Multicore


❖ Single core cannot fit Moore's law
❖ Multicore can fit Moore's law if a
parallel programming model exists
[Figure: performance vs. time for single core and multi-core]

Two Architectures
for Multicore
❖ Symmetric multiprocessing (SMP)
• Multicore CPU, GPGPU, DSP multicore
• Homogeneous computing
❖ Asymmetric multiprocessing (AMP)
• CPU+GPGPU,
CPU+FPGA,
CPU+DSP
• Heterogeneous computing

Multicore CPU (1/2)


❖ Two or more CPU cores in one chip
❖ Ex.: Intel Core i7

Multiple
Execution Cores

Multicore CPU (2/2)


❖ Windows Task Manager
[Screenshots: two cores vs. eight cores]

GPGPU (1/2)
❖ GPU (Graphics Processing Unit)
• The processor in a graphics card that speeds
up 3D graphics
• Game playing is a major application
❖ GPGPU: General-Purpose GPU
• General purpose computation using
GPU in applications other than 3D
graphics

GPGPU (2/2)
❖ GPGPU has more cores than CPU
• 120 ~ 3072 cores vs. 2 ~ 8 cores
(Many-core vs. Multi-core)
❖ GPGPU offers higher peak throughput than
a multicore CPU on data-parallel workloads
❖ Vendors:
• nVidia
• Qualcomm
(AMD, ATI)
• ARM
• Intel

It is the Software, Stupid


❖Gary Smith and Daya Nadamuni, Gartner
Dataquest, Design Automation Conf., 2006
❖The biggest problem with SoC design
is embedded software development.
❖"The next big hurdle is
programmability. It's the ability to
program these multicore platforms."
❖You can have elegant algorithms,
first-pass silicon, and fancy intellectual
property. But without software, the
product goes nowhere.

Multicore Demands Threading





What Is Computer Vision



A Complete Vision System


– Video Surveillance as an Example
[Figure: processing pipeline] Video Capture → Image Enhance → Object/Event Detection → Object Tracking → Object/Event Recognition → Behavior Analysis → Retrieval
Examples: Imaging, Image/Video Enhancement, Tripwire Event Detection, Abnormal Detection, Face Recognition, Retrieval

Computer Vision Needs High Performance Computing
❖ A CV example: video processing
• Intelligent video surveillance
❖ Its complexity is high
• Video (1080p RGB):
about 2 megapixels × 3 channels ≈ 6 M samples per frame, 30 fps
• 100 – 1K floating-point operations per sample
• ⇒ 18 – 180 GFLOP/s
❖ Massive data processing
• Intensive computation
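The estimate above can be checked with a few lines of C. This is only a sketch: the helper name `required_gflops` is hypothetical, and counting three color samples per pixel is an assumption chosen because it reproduces the 18 to 180 GFLOP/s range on the slide.

```c
#include <assert.h>

/* Hypothetical helper: sustained GFLOP/s needed to process a video stream.
 * One "sample" is one color channel of one pixel. */
double required_gflops(int width, int height, int channels,
                       double fps, double flops_per_sample)
{
    assert(width > 0 && height > 0 && channels > 0);
    double samples_per_frame = (double)width * height * channels;
    return samples_per_frame * fps * flops_per_sample / 1e9;
}
```

For 1080p RGB at 30 fps, `required_gflops(1920, 1080, 3, 30.0, 100.0)` gives roughly 18.7, and ten times that at 1K operations per sample.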

HPC Approaches
❖ Cluster/distributed computing (supercomputers)
• Hadoop/MapReduce
(Google, Cloud Computing)
• MPI
❖ Multi-processing
computing
• Multicore (GPGPU, CPU, FPGA/DSP)
• Programming: multi-thread
• Windows thread, Pthreads, OpenMP
• CUDA, Renderscript, C++ AMP, …

However
❖ Can CV algorithms speed up every 18
months with multicore?
❖ Multicore is not a simple solution for
upgrading CV algorithm performance
• The transition from single core to
multicore will be blocked by software
• We are not ready to face the software
programming challenges
• It is the software, stupid.

Software, Threading, and Parallel Computing
❖ Identify parallelism: Analyze algorithm
❖ Express parallelism: Write parallel code
❖ Validate parallelism: Debug & verify parallel code
❖ Optimize parallelism: enhance parallel
performance

Multi-threading Demands
New Programming Skills
❖ Previous multi-threading techniques
• Windows thread, pthread, OpenMP,
MPI, …
❖ New techniques
• CUDA, C++ AMP, OpenCL, Renderscript,
OpenACC, MapReduce, …
❖ Concepts
• Race condition, deadlock, …
• Domain partition, function partition, …

Multicore Programming
Practice (MPP)
❖ Goal: Write portable C/C++
programs to be "Multicore ready"
and platform compatible
• Proposed by a
MPP working group
in the Multicore
Association

http://www.multicore-association.org/workgroup/mpp.php

OpenACC
❖ An organization that develops an API which
• Describes a collection of
compiler directives
• Specifies loops and regions of
code in standard C, C++ and Fortran
• To be offloaded from a host CPU to
an attached accelerator, including
• APUs, GPUs, and many-core
coprocessors
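A minimal sketch of what such a directive looks like in C. The function `scale_pixels` and its parameters are hypothetical; only the `#pragma acc parallel loop` directive with `copyin`/`copyout` data clauses comes from the OpenACC standard. A compiler without OpenACC support ignores the pragma and runs the loop on the host CPU, so the result is the same either way.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: multiply every pixel by a gain factor.
 * The directive asks the compiler to offload the loop to an
 * attached accelerator and to copy the arrays in and out. */
void scale_pixels(const float *in, float *out, int n, float gain)
{
    assert(in != NULL && out != NULL && n >= 0);
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; i++) {
        out[i] = gain * in[i];
    }
}
```

The loop body stays ordinary sequential C; the parallelism is expressed only by the directive.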

HSA Foundation
❖Heterogeneous System Architecture
• Key members: AMD, QUALCOMM, ARM,
SAMSUNG, TI
❖System architecture easing efficient
use of accelerators, SoCs
• Intended to support high-level parallel
programming frameworks
• OpenCL, C++, C#, OpenMP, Java
• Accelerator requirements
• Full-system SVM, memory coherency,
preemption,
user-mode dispatch

The ParLab in Berkeley


❖ The Parallel Computing Lab at UC
Berkeley: http://parlab.eecs.berkeley.edu
• The ParLab. offers programmers a
practical introduction to parallel
programming techniques and tools on
current parallel computers,
emphasizing multicore and manycore
computers.

HPEC
❖ High Performance Embedded
Computing
• MIT Lincoln Lab, 1997 ~

OpenCL
❖ Royalty-free, cross-platform, cross-
vendor standard
• Targeting: supercomputers
→ embedded systems
→ mobile devices
❖Enables programming of diverse
compute resources
• CPU, GPU,
DSP, FPGA …

OpenCL Working Group Members
❖Diverse industry participation – many
industry experts
❖NVIDIA is chair, Apple is specification
editor

Today We Talk About


❖ Why GPGPU's multicore is better (Sec. 2)
❖ Vendor, Hardware
❖ How to do parallel programming (Sec. 3)
❖ OpenCV Acceleration (Sec. 4)
❖ Computer vision Acceleration-PC (Sec. 5)
❖ Computer vision Acceleration-Android
(Sec. 6)

2. GPGPU

PC platform
Mobile platform

Why GPGPU
❖ GPGPU has many-core (vs. multi-core)
• Suitable for massively parallel computing

GPGPU as a Coprocessor

Heterogeneous Computing

PC Platform
• Discrete GPUs
• GPGPU card as a coprocessor

PCIe

From PC to PSC (Personal Super-Computer)



Mobile Platform
• Integrated GPUs
• GPGPU sub-chip as a coprocessor
GPGPU

No PCIe CPU

From mobile phone to mobile personal computer



GPGPU Solutions - nVidia


• Compute Architecture:
Tesla, Fermi, Kepler, …
• PC
• GeForce, Quadro
• Tesla
• 870, 1060, 2070, K40
• Mobile
• Tegra: …, 4, K1(192 cores)

It’s Tegra K1 Everywhere at Google I/O, Embedded Vision Alliance, 2014/7/7.



GPGPU Solutions
– Qualcomm/AMD
❖ Qualcomm, AMD, ATI
❖ APU: integrated CPU+GPU
❖ Low energy consumption
❖ PC(AMD): FirePro
❖ Mobile(Snapdragon):
❖ Adreno: 330(32 cores)

GPGPU Solutions - ARM


❖ Mali
❖ Samsung Exynos, MediaTek
❖ Compute engine
after T-600
❖ Exynos 5
❖ At most 8 cores
(Mali-T678)

Intel – Multicore CPU


• PC (Xeon Phi)
• Iris Pro GPU
• Knights Landing: 60 cores
• Knights Corner: 48 CPU cores,
PCIe
• Mobile
• Haswell
• Atom

Applications of GPGPU

http://developer.nvidia.com/category/zone/cuda-zone

Heterogeneous Architecture
❖Host: CPU
❖Device: GPGPU
❖Notice: memory hierarchy in device

GPGPUs Architecture
- nVidia
❖ GT200
• GTX 260/280, Quadro 5800, Tesla 1060
❖ Fermi
• Tesla 2060
[Figure: CPU (host), multicore: control unit, a few ALUs, cache, DRAM; GPU (device), many-core: many ALUs, DRAM]

nVidia GPGPU Architecture


❖ SM/SP (Streaming multiprocessor/Streaming
processor) + Shared memory + DRAM

Memory Hierarchy
❖ On-Chip Memory
• Registers
• Shared Memory
• Constant Cache
• Texture Cache
❖ Off-Chip Memory
• Local Memory
• Global Memory
• Constant Memory and Texture Memory
(read-only, cached on chip)

GPGPU vs. FPGA

❖GPU: nVidia GeForce GTX 280, GTX 580
❖FPGA: Xilinx Virtex-4, Virtex-5

A Comparison of FPGA and GPU for real-Time Phase-Based Optical Flow, Stereo, and Local
Image Features, IEEE Transactions on Computers, 2012.

GPGPU vs. FPGA

❖GPU: nVidia GeForce 7900 GTX


❖FPGA: Xilinx Virtex-4

Performance Comparison of Graphics Processors to Reconfigurable Logic: A Case
Study, IEEE Transactions on Computers, 2010.

GPGPU vs. FPGA vs. Multicore


❖Application: 2-D image convolution

GPU: nVidia GeForce 295 GTX


FPGA: Altera Stratix III E260
A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-
Window Applications, ACM/SIGDA international symposium on FPGA, 2012.

However, GPGPU May Not Always Improve Speed & Energy

Hardware vs. Software


GPGPU (hardware): nVidia, Qualcomm, ARM, Intel
Parallel Programming (software): CUDA, OpenCL, RenderScript, C++ AMP



Today We Talk About


❖ Why GPGPU's multicore is better (Sec. 2)
❖ How to do parallel programming (Sec. 3)
• CUDA, renderscript, OpenCL, …
❖ OpenCV Acceleration (Sec. 4)
❖ Computer vision Acceleration-PC (Sec. 5)
❖ Computer vision Acceleration-Android
(Sec. 6)

3. Parallel
Programming
Multi-threading
Programming Languages for Parallelism

Parallel Computing
❖ Serial
Computing
❖ Parallel
Computing
[Figure: work distributed over multiple cores of a CPU/GPU]

Parallel Programming
❖ Much code is written in C/C++/Java
• Especially algorithmic programs
❖ Can we write GPGPU parallel
programs in C/C++/Java?
❖ However, C/C++/Java are sequential languages
• Three control structures of C/C++/Java:
sequence, selection, repetition

Multi-threading
❖ Multi-threading is the fundamental
concept for parallel programming
• Some techniques are ready
• Pthread, Win32 thread, OpenMP,
MPI, Intel TBB (Threading Building
Blocks), ...
• New techniques
• CUDA, OpenCL, Renderscript,
OpenACC, C++ AMP, ...

Parallel Programming Models



Parallel Programming in
Sequential Language
❖ Do we need to learn new languages for
multi-threading?
• No
❖ Write multi-threading codes in C/C++
• Add functions/directives to C/C++ for
multi-threading
• This is what current solutions do
• pthread, Win32 thread, OpenMP,
MPI, CUDA, OpenCL, ...
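As an illustration of adding a directive to ordinary C, the sketch below uses OpenMP's `parallel for` with a `reduction` clause. The function `sum_sq_diff` is a hypothetical example, not taken from the slides. Compiled without OpenMP support, the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>

/* Hypothetical example: sum of squared differences between two
 * images stored as flat byte arrays of length n. */
long sum_sq_diff(const unsigned char *a, const unsigned char *b, int n)
{
    assert(n >= 0);
    long total = 0;
    /* Each thread accumulates a private partial sum; OpenMP adds
     * the partial sums together when the loop ends. */
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++) {
        long d = (long)a[i] - (long)b[i];
        total += d * d;
    }
    return total;
}
```

The sequential program and the parallel program are the same source file; only the directive line differs.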

Decompose the Problem


❖ Two basic approaches to partition
computational work
• Domain decomposition (GPGPU)
• Partition the data used
in solving the problem
• Function decomposition (CPU)
• Partition the jobs (functions)
from the overall work (problem)
• GPGPU and CPU cooperate

Multi-Threading
❖ A program running
In Serial

In Parallel

http://en.wikipedia.org/wiki/Thread_(computer_science)

Domain Decomposition (1/3)


❖An image example
• It is 2D data
• Three popular ways to partition it

Domain Decomposition (2/3)


❖Domain data are usually processed
by loops (execution models: SIMD, SPMD, SIMT)
• for (i=0; i<height; i++)
      for (j=0; j<width; j++)
          img2[i][j] = RemoveNoise(img1[i][j]);
[Figure: the X-ray image of a circuit board, with row index i and column index j; original image (img1) and enhanced image (img2)]

Domain Decomposition (3/3)


❖A three-block partition
example (OpenMP, CUDA (SPMD))
[Figure: rows i split into subdomain 1 (i=0..3), subdomain 2 (i=4..7), and subdomain 3 (i=8..11); columns indexed by j]
fork(threads)
• // Thread 1 (subdomain 1)
  for (i=0; i<height/3; i++)
      for (j=0; j<width; j++)
          img2[i][j] = RemoveNoise(img1[i][j]);
• // Thread 2 (subdomain 2)
  for (i=height/3; i<height*2/3; i++)
      for (j=0; j<width; j++)
          img2[i][j] = RemoveNoise(img1[i][j]);
• // Thread 3 (subdomain 3)
  for (i=height*2/3; i<height; i++)
      for (j=0; j<width; j++)
          img2[i][j] = RemoveNoise(img1[i][j]);
join(barrier)
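The three hand-written loops can be generalized to any number of blocks. The sketch below is one way to do it, under stated assumptions: `block_range`, `enhance`, the fixed 12×4 image size, and the clamping stand-in `remove_noise` for the slide's RemoveNoise() are all hypothetical. The OpenMP directive plays the role of fork and join; without OpenMP support the blocks are simply processed one after another.

```c
#include <assert.h>

#define HEIGHT 12
#define WIDTH  4

/* Stand-in for the slide's RemoveNoise(); here it only clamps a pixel. */
static int remove_noise(int pixel) { return pixel > 255 ? 255 : pixel; }

/* Rows [*lo, *hi) owned by block b when height rows are split
 * into nblocks contiguous subdomains. */
void block_range(int height, int nblocks, int b, int *lo, int *hi)
{
    assert(nblocks > 0 && b >= 0 && b < nblocks);
    *lo = height * b / nblocks;
    *hi = height * (b + 1) / nblocks;
}

void enhance(int img1[HEIGHT][WIDTH], int img2[HEIGHT][WIDTH], int nblocks)
{
    /* fork: each value of b is one subdomain, handled by one thread */
    #pragma omp parallel for
    for (int b = 0; b < nblocks; b++) {
        int lo, hi;
        block_range(HEIGHT, nblocks, b, &lo, &hi);
        for (int i = lo; i < hi; i++)
            for (int j = 0; j < WIDTH; j++)
                img2[i][j] = remove_noise(img1[i][j]);
    }
    /* join: implicit barrier at the end of the parallel loop */
}
```

With 12 rows and 3 blocks, `block_range` yields rows 0..3, 4..7, and 8..11, matching the subdomains in the figure.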

GPGPU Programming:
SIMT model
❖ CPU (“host”) program often
written in C or C++
❖ GPU code is written as a sequential
kernel in (usually) a C or C++
dialect

GPGPU Programming
Techniques

CUDA
OpenCL
C++ AMP
Renderscript

GPGPU Programming
Techniques

CUDA

CUDA
❖ CUDA: Compute Unified Device
Architecture
❖ Parallel programming
for nVidia's GPGPU
❖ Uses the C/C++ language
• Bindings and toolchains also exist for Java, Fortran, and MATLAB
❖ When executing CUDA programs,
the GPU operates as a coprocessor to
the main CPU

CUDA Hardware Environment: CPU+GPU
[Figure: CPU connected to GPU through PCI-E]
❖ CPU
• Organizes, interprets, and
communicates information
❖ GPU
• Handles the core processing on large quantities
of parallel information
• Compute-intensive portions of applications
that are executed many times, but on different
data, are extracted from the main application
and compiled to execute in parallel on the GPU

CUDA Software Stack



Processing Flow on CUDA


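[Figure: CPU with main memory; GPU with memory for GPU]
1. Allocate device memory
2. Copy processing data (main memory → memory for GPU)
3. CPU instructs the processing
4. Execute in parallel in each core
5. Copy the result (memory for GPU → main memory)
6. Release device memory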

Programming with
Memory Hierarchy
❖ Locality principle
• Temporal locality
• Spatial locality
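A small sketch of the two kinds of locality in plain C. The function names and the 256×256 size are arbitrary choices for illustration. Both functions compute the same sum; they differ only in the order in which memory is visited.

```c
#include <assert.h>

#define ROWS 256
#define COLS 256

/* Row-major traversal: consecutive accesses touch adjacent addresses,
 * because C stores each row contiguously (spatial locality).
 * The accumulator s is reused on every iteration (temporal locality). */
long sum_row_major(int m[ROWS][COLS])
{
    assert(m);
    long s = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal: consecutive accesses are COLS elements
 * apart, so they tend to land in different cache lines. */
long sum_col_major(int m[ROWS][COLS])
{
    assert(m);
    long s = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}
```

On most cached machines the row-major version is expected to run faster for large arrays, although the actual gap depends on the hardware.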

Example - Hello World(1/3)


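#include <stdio.h>
#include <stdlib.h>

__global__ void hello(char* hello1, char* hello2);  // kernel, defined in (2/3)

int main()
{
    char src[12] = "Hello World";  // host input
    char h_hello[12];              // host output
    char* d_hello1;                // device input
    char* d_hello2;                // device output
    // Allocate device memory
    cudaMalloc((void**) &d_hello1, sizeof(char) * 12);
    cudaMalloc((void**) &d_hello2, sizeof(char) * 12);
    // Copy the input string from host to device
    cudaMemcpy(d_hello1, src, sizeof(char) * 12,
               cudaMemcpyHostToDevice);
    // Call the kernel function with one block of one thread
    hello<<<1,1>>>(d_hello1, d_hello2);
[Figure: Host holds src and h_hello; Device holds d_hello1 and d_hello2]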

Example - Hello World(2/3)


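❖ Kernel Function
__global__ void hello(char* hello1, char* hello2)
{
    int k;
    // Copy the string character by character
    for (k = 0; hello1[k] != '\0'; k++) {
        hello2[k] = hello1[k];
    }
    hello2[k] = '\0';  // terminate the copy so the host can print it
}
No parallel processing in this example
[Figure: Host holds src and h_hello; Device holds d_hello1 and d_hello2]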

Example - Hello World(3/3)


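    // Copy the result from device to host
    cudaMemcpy(h_hello, d_hello2, sizeof(char) * 12,
               cudaMemcpyDeviceToHost);
    printf("%s\n", h_hello);
    // Release device memory
    cudaFree(d_hello1);
    cudaFree(d_hello2);
    system("pause");  // Windows only: keeps the console window open
    return 0;
}
[Figure: Host holds src and h_hello; Device holds d_hello1 and d_hello2]
Result: Hello World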

OpenCL

Standard

The Inspiration for OpenCL



What's OpenCL
❖One code tree can be executed on CPUs, GPUs,
DSPs and hardware
• Dynamically interrogate system load and
balance across available processors
❖Powerful, low-level flexibility
• Foundational access to compute resources for
higher-level engines, frameworks and
languages

Broad OpenCL Implementer Adoption
❖Multiple conformant implementations shipping
on desktop and mobile
❖Android ICD extension released in latest
extension specification
❖Multiple implementations shipping in Android
NDK

OpenCL Enables Portability


❖C to gates programs are
proprietary

Altera OpenCL SDK for FPGAs



NVIDIA OpenCL SDK for GPU



AMD OpenCL Optimization Case Study
❖Platform
• AMD Phenom II X4 965 CPU (quad core)
• ATI Radeon HD 5870 GPU
❖Unoptimized CPU performance: 1 GFLOP/s
❖Optimized CPU performance reaches: 4 GFLOP/s
❖Optimized GPU performance reaches: 50 GFLOP/s

Example - Hello World(1/3)


Including

Declaring

Example - Hello World(2/3)


Creating

Example - Hello World(2/3)


Creating

Do

Copy to host & display

Example - Hello World(3/3)


Kernel Function

C++ AMP

Microsoft

What's C++ AMP(1/2)


❖Microsoft’s C++ AMP (Accelerated Massive
Parallelism)
• Part of Visual C++, integrated with Visual
Studio, built on Direct3D
• “Performance for the mainstream”
❖STL-like library for multidimensional array
data
• Special convenience support for 1, 2, and 3
dimensional arrays on CPU or GPU
• C++ AMP runtime handles CPU<->GPU data
copying
• Tiles enable efficient processing of sub-arrays

What's C++ AMP(2/2)


❖parallel_for_each
• Executes a kernel (C++ lambda) at
each point in the extent
• restrict() clause specifies where the
kernel can run: cpu (default) or
amp (GPU; named direct3d in pre-release versions)

Example - Hello World(1/2)

Declaring &
copying to device

Example - Hello World(2/2)


Do

Display

RenderScript

Google Android

What's Renderscript(1/2)
❖Higher-level than CUDA or OpenCL: simpler &
less performance control
• Emphasis on mobile devices & cross-SoC
performance portability
❖Programming model
• C99-based kernel language, JIT-compiled,
single input-single output
• Automatic Java class reflection
• Intrinsics: built-in, highly-tuned operations,
e.g. ScriptIntrinsicConvolve3x3
• Script groups combine kernels to amortize
launch cost & enable kernel fusion

What's Renderscript(2/2)
❖ Data type:
• 1D/2D collections of elements, C types like int
and short2, types include size
• Runtime type checking
❖ Parallelism
• Implicit: one thread per data element,
atomics for thread-safe access
• Thread scheduling not exposed, VM-decided

RenderScript Architecture

Low Level Virtual Machine


❖Low Level Virtual Machine (LLVM)
is a compiler infrastructure

Offline Compiler Flow



Renderscript Compiler: libbcc



Renderscript Project
Framework

Example - Hello World(1/8)



Example - Hello World(2/8)


HelloWorld.java

Example - Hello World(3/8)


HelloWorld.java

Example - Hello World(4/8)


HelloWorldView.java

Example - Hello World(5/8)


HelloWorldView.java

Example - Hello World(6/8)


HelloWorldRS.java

Example - Hello World(7/8)


HelloWorldRS.java

Example - Hello World(7/8)


ScriptC_helloworld.java

Example - Hello World(7/8)


ScriptC_helloworld.java

Example - Hello World(8/8)


HelloWorld.rs

Comparison (1/2)
❖Renderscript vs. Native(NDK) vs. Java(SDK)
• OS: Honeycomb v3.2(CPU only)

Qian, Xi, Guangyu Zhu, and Xiao-Feng Li. "Comparison and analysis of
the three programming models in google android." in Proc. First Asia-
Pacific Programming Languages and Compilers Workshop (APPLC). 201

Comparison(2/2)
❖OpenCL & CUDA
• Sobel filter with (CMw) and without (CMw/o)
constant memory

OpenCL’s portability does not fundamentally affect its performance

Fang, Jianbin, Ana Lucia Varbanescu, and Henk Sips. "A comprehensive
performance comparison of CUDA and OpenCL." in
Proc. International Conference on Parallel Processing (ICPP), 2011.

GPGPU Programming

Performance: more control, better performance

Productivity: ease of use, quick programming, portability

Parallelization
❖ Multicore/Multi-threading
❖ Data Parallelization
• Data distribution
• Parallel convolution
• Reduction algorithm
• Amdahl’s law
❖ Memory Hierarchy Management
• Locality principle
• Program accesses a relatively small portion
of the address space at any instant of time
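Amdahl's law, listed above, bounds the speedup that more cores can deliver. A minimal sketch in C, with a hypothetical function name: if a fraction p of the run time can be parallelized over n cores, the speedup is 1 / ((1 − p) + p / n).

```c
#include <assert.h>

/* Amdahl's law: speedup on n cores when a fraction p of the run
 * time is parallelizable and the remaining (1 - p) stays serial. */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}
```

With p = 0.9, ten cores give a speedup of about 5.26, and no number of cores can push it past 10, because the serial 10% always remains.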

Multi-thread Programming with the Discipline of Parallelization
❖ Identify parallelism: Analyze algorithm
❖ Express parallelism: Write parallel code
❖ Validate parallelism: Debug & verify parallel code
❖ Optimize parallelism: enhance parallel
performance

Today We Talk About


❖ Why GPGPU's multicore is better (Sec. 2)
❖ How to do parallel programming (Sec. 3)
❖ OpenCV Acceleration (Sec. 4)
❖ Computer vision Acceleration-PC (Sec. 5)
❖ Computer vision Acceleration-Android
(Sec. 6)

4. OpenCV
Acceleration

What Is OpenCV
❖A very popular computer vision
library
• 6M downloads
• BSD license
• 2000+ CV functions
• Modularized and efficient
• Optimization
• Intel SSE, IPP, TBB
• ARM NEON & GLSL (Tegra)
• CUDA, OpenCL

OpenCV Modules
❖Image/video I/O, processing, display (core,
imgproc, highgui)
❖Object/feature detection (objdetect, features2d,
nonfree)
❖Geometry-based monocular or stereo computer
vision (calib3d, stitching, videostab)
❖Computational photography (photo, video,
superres)
❖Machine learning & clustering (ml, flann)
❖CUDA and OpenCL GPU acceleration (gpu, ocl)
Normal CV modules: 14
Acceleration modules: 2

OpenCV GPU Module


❖Implemented using NVIDIA CUDA
Runtime API
❖Latest version: 2.4.9
• Utilizing Multiple GPUs
❖Implemented modules: 11
❖Implemented functions: 270

Focus on PC platform
Not fully compatible with mobile GPGPU on Android

CUDA Matrix Operations


❖Point-wise matrix math
• gpu::add(), ::sum(), ::div(), ::sqrt(),
::sqrSum(), ::meanStdDev, ::min(), ::max(),
::minMaxLoc(), ::magnitude(), ::norm(),
::countNonZero(), ::cartToPolar(), etc.
❖Matrix multiplication
• gpu::gemm()
❖Channel manipulation
• gpu::merge(), ::split()

CUDA Geometric Operations


❖Image resize with sub-pixel interpolation
• gpu::resize()
❖Image rotate with sub-pixel interpolation
• gpu::rotate()
❖Image warp (e.g., panoramic stitching)
• gpu::warpPerspective(), ::warpAffine()

CUDA Other Math and Geometric Operations
❖Integral images
• gpu::integral(), ::sqrIntegral()
❖Custom geometric transformation (e.g., lens
distortion correction)
• gpu::remap(), ::buildWarpCylindricalMaps(),
::buildWarpSphericalMaps()

CUDA Image Processing(1/2)


❖Smoothing
• gpu::blur(), ::boxFilter(),
::GaussianBlur()
❖Morphological
• gpu::dilate(), ::erode(), ::morphologyEx()
❖Edge Detection
• gpu::Sobel(), ::Scharr(), ::Laplacian(),
gpu::Canny()
❖Custom 2D filters
• gpu::filter2D(), ::createFilter2D_GPU(),
::createSeparableFilter_GPU()
❖Color space conversion
• gpu::cvtColor()

CUDA Image Processing(2/2)


❖Image blending
• gpu::blendLinear()
❖Template matching (automated inspection)
• gpu::matchTemplate()
❖Gaussian pyramid (scale invariant
feature/object detection)
• gpu::pyrUp(), ::pyrDown()
❖Image histogram
• gpu::calcHist(), gpu::histEven,
gpu::histRange()
❖Contrast enhancement
• gpu::equalizeHist()

CUDA De-noising
❖Gaussian noise removal
• gpu::FastNonLocalMeansDenoising()
❖Edge preserving smoothing
• gpu::bilateralFilter()

CUDA Fourier and MeanShift


❖Fourier analysis
• gpu::dft(), ::convolve(),
::mulAndScaleSpectrums(), etc.
❖MeanShift
• gpu::meanShiftFiltering(),
::meanShiftSegmentation()

CUDA Shape Detection


❖Line detection (e.g., lane detection, building
detection, perspective correction)
• gpu::HoughLines(), ::HoughLinesDownload()
❖Circle detection (e.g., cells, coins, balls)
• gpu::HoughCircles(),
::HoughCirclesDownload()

CUDA Object Detection


❖HAAR and LBP cascaded adaptive boosting
(e.g., face, nose, eyes, mouth)
• gpu::CascadeClassifier_GPU::detectMultiScale()
❖HOG detector (e.g., person, car, fruit, hand)
• gpu::HOGDescriptor::detectMultiScale()

CUDA Object Recognition


❖Interest point detectors
• gpu::cornerHarris(), ::cornerMinEigenVal(),
::SURF_GPU, ::FAST_GPU, ::ORB_GPU(),
::GoodFeaturesToTrackDetector_GPU()
❖Feature matching
• gpu::BruteForceMatcher_GPU(),
::BFMatcher_GPU()

CUDA Stereo and 3D


❖RANSAC
• gpu::solvePnPRansac()
❖Stereo correspondence (disparity map)
• gpu::StereoBM_GPU(),
::StereoBeliefPropagation(),
::StereoConstantSpaceBP(),
::DisparityBilateralFilter()
❖Represent stereo disparity as 3D or 2D
• gpu::reprojectImageTo3D(),
::drawColorDisp()

CUDA Optical Flow


❖Dense/sparse optical flow
• gpu::FastOpticalFlowBM(),
::PyrLKOpticalFlow, ::BroxOpticalFlow(),
::FarnebackOpticalFlow(),
::OpticalFlowDual_TVL1_GPU(),
::interpolateFrames()

CUDA Background
Segmentation
❖Foreground/background segmentation (e.g.,
object detection/removal, motion tracking,
background removal)
• gpu::FGDStatModel, ::GMG_GPU,
::MOG_GPU, ::MOG2_GPU

Performance of OpenCV GPU Accelerators on PC

Today We Talk About


❖ Why GPGPU's multicore is better (Sec. 2)
❖ How to do parallel programming (Sec. 3)
❖ OpenCV acceleration (Sec. 4)
❖ Computer vision acceleration on PC (Sec. 5)
❖ Computer vision acceleration on Android (Sec. 6)

5. Computer Vision
Acceleration on PC
Image enhancement (HDR)
Feature extraction
Video surveillance cloud

HDR and
Image Enhancement

HDR Image Enhancement


❖ Restore and enhance an image
❖ Its complexity is high for large images

• Complexity: O(N^2) ~ O(N^3)
[Figure: original image and restored image]

Algorithms for
Image Restoration
❖ Wiener Filter
❖ Histogram Based Approach
• Histogram Equalization,
Histogram Modification, …
❖ Retinex
• Path-based Retinex
• Recursive Retinex
• Center/surround Retinex
• Has no iterative process, so it is suitable for parallelization
• Multi-Scale Retinex with Color Restoration (MSRCR)
[Rahman et al. 1997]

MSRCR Algorithm

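$$R_{MSRCR_i}(x,y) = C_i(x,y) \sum_{k=1}^{K} w_k \left\{ \log I_i(x,y) - \log\left[ F_k(x,y) * I_i(x,y) \right] \right\}$$
$$C_i(x,y) = \beta \left\{ \log\left[ \alpha I_i(x,y) \right] - \log\left[ \sum_{j=1}^{N} I_j(x,y) \right] \right\}$$
• $R_{MSRCR_i}(x,y)$: the MSRCR output
• $I_i(x,y)$: the original image distribution in the ith spectral band
• $F_k(x,y)$: the kth Gaussian surround function
• $*$: the convolution operation
• $w_k$: the weight
• $C_i(x,y)$: the color restoration factor in the ith spectral band
• $N$: the number of spectral bands
• $\beta$: the gain constant
• $\alpha$: controls the strength of the nonlinearity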
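
The per-pixel MSRCR computation can be sketched in plain C++ as follows. This is a minimal illustration, assuming the standard MSRCR formulation (a weighted sum over scales of log(image) minus log(blurred image), multiplied by a color restoration factor built from the gain constant and the nonlinearity strength). The function names `colorRestoration` and `msrcrPixel`, and the parameter names `alpha` and `beta`, are hypothetical. The Gaussian-blurred values are assumed to be computed elsewhere.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Color restoration factor for one band (assumed standard form):
// C_i = beta * (log(alpha * I_i) - log(sum of I over all N bands)).
double colorRestoration(double bandValue, double bandSum, double alpha, double beta) {
    return beta * (std::log(alpha * bandValue) - std::log(bandSum));
}

// MSRCR output for one pixel of one spectral band.
// blurred[k] is the value of (F_k * I_i) at this pixel, i.e. the band already
// convolved with the k-th Gaussian surround. All inputs must be positive
// (a common trick is to add 1 to every intensity before taking logs).
double msrcrPixel(double bandValue, double bandSum,
                  const std::vector<double>& blurred,
                  const std::vector<double>& weights,
                  double alpha, double beta) {
    assert(blurred.size() == weights.size());
    double retinex = 0.0;
    for (std::size_t k = 0; k < blurred.size(); ++k) {
        retinex += weights[k] * (std::log(bandValue) - std::log(blurred[k]));
    }
    return colorRestoration(bandValue, bandSum, alpha, beta) * retinex;
}
```

Each pixel depends only on its own value and the blurred values at the same position, so every pixel can be handled by an independent GPGPU thread.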

The Method
❖ CPU side
• Copy data from CPU to GPGPU
• Copy data from GPGPU to CPU (after the GPGPU stages finish)
❖ GPGPU side, in order
• Gaussian Blur
• Log-domain Processing
• Normalization
• Histogram Stretching
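
The final stage of this pipeline, histogram stretching, can be sketched on the CPU as a simple min-max stretch to the 0..255 range. This is an illustrative assumption: the cited papers may use a different variant (for example, one that clips outliers first), and the name `stretchToByteRange` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Linearly stretch values so that the smallest maps to 0 and the largest to 255.
std::vector<std::uint8_t> stretchToByteRange(const std::vector<double>& values) {
    std::vector<std::uint8_t> out(values.size(), 0);
    if (values.empty()) return out;
    const auto mm = std::minmax_element(values.begin(), values.end());
    const double lo = *mm.first;
    const double range = *mm.second - lo;
    if (range <= 0.0) return out;  // flat input: nothing to stretch
    for (std::size_t i = 0; i < values.size(); ++i) {
        const double scaled = (values[i] - lo) / range * 255.0;
        out[i] = static_cast<std::uint8_t>(scaled + 0.5);  // round to nearest
    }
    return out;
}
```

Finding the minimum and maximum is a reduction, while the rescaling loop is independent per pixel, so both parts map naturally onto GPGPU threads.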

• Wang, Yuan-Kai, and Wen-Bin Huang. "Acceleration of an improved Retinex algorithm."
Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer
Society Conference on. IEEE, 2011.
• Wang, Yuan-Kai, and Wen-Bin Huang. "A CUDA-enabled parallel algorithm for
accelerating retinex." Journal of Real-Time Image Processing (2012): 1-19.

Parallelization by GPGPU
❖ Multicore/Multi-threading
• Tesla C1060 : 240 SP (Stream Processor)
• CUDA: Thread, Block, Grid
❖ Data Parallelization
• Parallel convolution
[Figure: left, parallel convolution, with each processing element (PE) handling its own block of an M-pixel image; right, tree-structured parallel sum of A(0)..A(7) on PEs 0-7 over time steps t0-t5, with partial sums A(0)+A(1), A(2)+A(3), A(0)+A(1)+A(2)+A(3), and so on up to the total sum]
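
The tree-structured sum of A(0)..A(7) shown on this slide can be sketched in plain C++ as below. It is a sequential simulation, not CUDA code: the inner loop stands in for the processing elements that would run concurrently on a GPGPU, and the name `treeSum` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tree-structured sum: in each step, element i absorbs element i + stride.
// All additions inside one step are independent of each other, so on a GPGPU
// each could be done by a separate thread; here they run in a plain loop.
double treeSum(std::vector<double> a) {  // by value: partial sums overwrite a
    const std::size_t n = a.size();
    if (n == 0) return 0.0;
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t i = 0; i + stride < n; i += 2 * stride) {
            a[i] += a[i + stride];
        }
    }
    return a[0];
}
```

For eight values the first step forms A(0)+A(1), A(2)+A(3), A(4)+A(5) and A(6)+A(7), the second step forms the two four-element sums, and the third step produces the total, so the number of steps grows with log2 of the input size rather than linearly.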

Our Memory Hierarchy

❖ Memory types used
• Texture Memory
• Constant Memory
• Global Memory
• Shared Memory
❖ Parallel stages
• Parallel Gaussian Blur
• Parallel Log-domain Processing
• Parallel Normalization
• Parallel Histogram Stretching

Experimental Results (1/2)

Original images CPU results GPGPU results



Experimental Results (2/2)

Original images CPU results GPGPU results



GPGPU Speedup over CPU


74x

2x

• Ideal speedup: 240 * (1.296 GHz / 3 GHz) ≈ 103.7
• NPP: nVidia Performance Primitives
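
The ideal-speedup estimate on this slide can be reproduced with a few lines of C++. This is a back-of-the-envelope sketch that assumes perfect scaling across all stream processors and a single 3 GHz CPU core as the baseline. The function names are hypothetical, and treating 74x as the measured speedup is a reading of the chart labels.

```cpp
#include <cassert>
#include <cmath>

// Ideal speedup of a GPGPU over a single CPU core, assuming perfect scaling:
// (number of stream processors) * (GPGPU clock / CPU clock).
double idealSpeedup(int streamProcessors, double gpuClockGHz, double cpuClockGHz) {
    return streamProcessors * (gpuClockGHz / cpuClockGHz);
}

// Fraction of the ideal speedup that was actually achieved.
double parallelEfficiency(double measuredSpeedup, double ideal) {
    return measuredSpeedup / ideal;
}
```

With the Tesla C1060 figures, `idealSpeedup(240, 1.296, 3.0)` gives about 103.7, and `parallelEfficiency(74.0, 103.68)` gives about 0.71, meaning roughly 71% of the ideal under these assumptions.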

Feature Extraction
(SIFT)

What Is SIFT
❖ SIFT
• Scale Invariant Feature Transform
❖ Invariance of feature points
• Translation
• Rotation
• Scale

Applications of SIFT
❖Object recognition/tracking
❖Image retrieval
❖Autostitch

Parallelize SIFT by GPGPU

❖ CPU: Intel Q9400, quad-core (2.66 GHz)
❖ GPGPU: GeForce GTS 250, 128 SPs (1.836 GHz)

Experimental Results
CPU GPU

Execution Time

❖ CPU: 10 seconds on average
❖ GPGPU: 0.8 seconds on average
[Chart: execution time in ms, CPU vs. GPGPU]

Speedup

13x speedup on average
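
As a quick consistency check, the average execution times reported on the previous slide (10 s on the CPU, 0.8 s on the GPGPU) give:

```latex
S = \frac{T_{\mathrm{CPU}}}{T_{\mathrm{GPGPU}}} = \frac{10\ \mathrm{s}}{0.8\ \mathrm{s}} = 12.5
```

This ratio of averages is close to the 13x reported here. The small gap is possibly because the reported times are rounded, or because an average of per-image speedups is not the same quantity as a ratio of average times.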



Video
Surveillance Cloud

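GPGPU Cloud Video Surveillance System
❖ Subsystems and detection functions
• Multi-resolution wide-area surveillance system
• Central video and message management system
• High-efficiency video event browsing system
• Restricted-area intrusion detection
• Face detection in dynamic scenes
• Illegal parking detection
• PTZ camera tracking
• Camera anomaly detection
• Outdoor parking-lot vacancy detection
❖ Infrastructure
• PC, mobile device
• Hypervisor
• Multi-core GPGPU
• Storage Area Network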

Private cloud server room

Today We Talk About


❖ Why GPGPU's multicore is better (Sec. 2)
❖ How to do parallel programming (Sec. 3)
❖ OpenCV acceleration (Sec. 4)
❖ Computer vision acceleration on PC (Sec. 5)
❖ Computer vision acceleration on Android (Sec. 6)

6. Computer Vision
Acceleration on
Android
OpenCV
RenderScript

OpenCV
on Android

OpenCV4Android SDK
❖Enables development of Android applications
that use the OpenCV library
❖Uses the Java Native Interface (JNI) to call native
C/C++ code directly
❖Supports nVidia's Tegra Android Development
Pack (TADP)

Not fully
compatible with
GPU module

System Framework

Two Methods to Call OpenCV


❖Using Java API

❖Using native C++



OpenCV for Android SDK by GPU (1/5)

OpenCV for Android SDK by GPU (2/5)

OpenCV for Android SDK by GPU (3/5)

OpenCV for Android SDK by GPU (4/5)

OpenCV for Android SDK by GPU (5/5)

RenderScript on
Android with GPU
Acceleration

RenderScript on Android
with GPU (1/5)

RenderScript on Android
with GPU (2/5)

RenderScript on Android
with GPU (3/5)

RenderScript on Android
with GPU (4/5)

RenderScript on Android
with GPU (5/5)

RenderScript Image
Processing Intrinsics
Name | Operation
ScriptIntrinsicConvolve3x3, ScriptIntrinsicConvolve5x5 | Performs a 3x3 or 5x5 convolution.
ScriptIntrinsicBlur | Performs a Gaussian blur. Supports grayscale and RGBA buffers and is used by the system framework for drop shadows.
ScriptIntrinsicYuvToRGB | Converts a YUV buffer to RGB. Often used to process camera data.
ScriptIntrinsicColorMatrix | Applies a 4x4 color matrix to a buffer.
ScriptIntrinsicBlend | Blends two allocations in a variety of ways.
ScriptIntrinsicLUT | Applies a per-channel lookup table to a buffer.
ScriptIntrinsic3DLUT | Applies a color cube with interpolation to a buffer.
ScriptIntrinsicHistogram | Intrinsic histogram filter.
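
To make the first row concrete, here is a minimal C++ sketch of what a 3x3 convolution computes on a single-channel image. It only illustrates the operation and is not the RenderScript implementation. The name `convolve3x3` is hypothetical, and the clamp-to-edge border handling is an assumption.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

// Apply a 3x3 kernel to a row-major grayscale image of size width x height.
// The kernel is applied without flipping, as is common in image processing.
// Pixels outside the image are replaced by the nearest edge pixel (clamping).
std::vector<float> convolve3x3(const std::vector<float>& image, int width, int height,
                               const std::array<float, 9>& kernel) {
    std::vector<float> out(image.size(), 0.0f);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int ky = -1; ky <= 1; ++ky) {
                for (int kx = -1; kx <= 1; ++kx) {
                    const int sx = std::min(std::max(x + kx, 0), width - 1);
                    const int sy = std::min(std::max(y + ky, 0), height - 1);
                    sum += kernel[(ky + 1) * 3 + (kx + 1)] * image[sy * width + sx];
                }
            }
            out[y * width + x] = sum;
        }
    }
    return out;
}
```

Every output pixel is computed independently from a small input neighborhood, which is why this kind of filter maps well onto a GPU.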

Gaussian Blur Example by RenderScript Intrinsic
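import android.renderscript.Allocation;
import android.renderscript.Element;
import android.renderscript.RenderScript;
import android.renderscript.ScriptIntrinsicBlur;

// Create the RenderScript context and the built-in blur intrinsic (RGBA, 8 bits per channel).
RenderScript rs = RenderScript.create(theActivity);
ScriptIntrinsicBlur theIntrinsic = ScriptIntrinsicBlur.create(rs, Element.U8_4(rs));
// Wrap the input and output bitmaps in allocations.
Allocation tmpIn = Allocation.createFromBitmap(rs, inputBitmap);
Allocation tmpOut = Allocation.createFromBitmap(rs, outputBitmap);
theIntrinsic.setRadius(25.f);  // 25 is the largest radius the intrinsic accepts
theIntrinsic.setInput(tmpIn);
theIntrinsic.forEach(tmpOut);  // run the blur over the whole image
tmpOut.copyTo(outputBitmap);   // copy the result back into the bitmap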
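
For readers who want to see what a Gaussian blur computes, the following C++ sketch builds a normalized 1D Gaussian kernel and applies it to one row of pixels. It illustrates the filter only, not how ScriptIntrinsicBlur is implemented. The names `gaussianKernel` and `blurRow`, the explicit `sigma` parameter, and the clamp-to-edge border handling are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Build a normalized 1D Gaussian kernel with 2 * radius + 1 taps.
std::vector<double> gaussianKernel(int radius, double sigma) {
    std::vector<double> k(2 * radius + 1);
    double sum = 0.0;
    for (int i = -radius; i <= radius; ++i) {
        k[i + radius] = std::exp(-(i * i) / (2.0 * sigma * sigma));
        sum += k[i + radius];
    }
    for (double& w : k) w /= sum;  // weights now add up to 1
    return k;
}

// Blur one row of pixels; out-of-range samples reuse the nearest edge pixel.
std::vector<double> blurRow(const std::vector<double>& row, const std::vector<double>& k) {
    const int n = static_cast<int>(row.size());
    const int radius = static_cast<int>(k.size()) / 2;
    std::vector<double> out(row.size(), 0.0);
    for (int x = 0; x < n; ++x) {
        for (int i = -radius; i <= radius; ++i) {
            int sx = x + i;
            if (sx < 0) sx = 0;
            if (sx >= n) sx = n - 1;
            out[x] += k[i + radius] * row[sx];
        }
    }
    return out;
}
```

Because the 2D Gaussian is separable, a full image blur can be done as one pass of `blurRow` over every row followed by the same pass over every column.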

RenderScript Intrinsic
Example(1/2)

RenderScript Intrinsic
Example(2/2)

Blur Intrinsic
Performance Analysis

Performance of
RenderScript Intrinsics
❖On new Nexus 7
❖Relative to equivalent multithreaded C
implementations.

RenderScript Image
Processing Benchmarks(1/2)
❖CPU only on a Galaxy Nexus device.

RenderScript Image
Processing Benchmarks(2/2)

Acceleration of Retinex Using RenderScript
❖The paper cited below presents rsRetinex, a Retinex
implementation optimized with RenderScript.
❖Its experimental results show that rsRetinex gains
up to a fivefold speedup across different image
resolutions.

Le, Duc Phuoc Phat Dat, et al. "Acceleration of Retinex Algorithm for Image
Processing on Android Device Using Renderscript." in Proc. The 8th International
Conference on Robotic, Vision, Signal Processing & Power Applications, 2014.

Mobile GPGPU List


GPU | Adoption | OpenCL/CUDA | OpenCV | Renderscript
Qualcomm Adreno | Google Nexus 10, Google new Nexus 7, SONY Xperia Tablet Z2 | OpenCL 1.2 (302~420) | OCL module | Android 4.0 or later
ARM Mali | Nexus 10, Samsung Note 3, Samsung Note PRO 12.2, Meizu MX3 | OpenCL 1.1 (T604~T678) | OCL module | Android 4.0 or later
nVIDIA Tegra | Google Project Tango, HTC Nexus 9, Microsoft Surface 2, Nvidia Shield, Note 7 | CUDA, OpenCL 1.2 (K1 only) | GPU module | Android 4.0 or later (K1 only)
PowerVR | iPad Air, iPad mini | OpenCL 1.2 | OCL module | none
Intel HD Graphics | Microsoft Surface Pro 3, Sony VAIO Tap 11 | OpenCL 1.1 | OCL module | none
Sources: "Nvidia CEO sees future in cars and gaming," CNet, 2014/5/19; AnandTech.

7. Summary

GPGPU
❖ Single-core
→ Multi-core
→ Many-core
❖PC
• nVidia Tesla + CUDA/OpenCV
❖Android
• Qualcomm Adreno + OpenCV ocl
• nVidia Tegra + OpenCV gpu

Parallel Programming
❖C/C++/OpenCV
• OpenMP, OpenACC, CUDA, C++ AMP
• OpenCL
❖Java
• OpenCL, RenderScript
❖Note that OpenCL and
RenderScript are
• Not efficient in parallelization
• Efficient in CV algorithmic design

OpenCV Acceleration (1/2)


❖Ver. 2.4.x
• gpu module: CUDA, PC
• ocl module: OpenCL, mobile
❖Ver. 3.0 (2014/6)
• Transparent API for GPGPU
acceleration

OpenCV Acceleration (2/2)



OpenCL 2.0
❖Released in 2013
❖SVM: Shared Virtual Memory
• OpenCL 1.2: Explicit memory
management
❖Dynamic (Nested) Parallelism
• Allows a device to enqueue kernels onto
itself – no round trip to host required
❖Disadvantages
• Requires strong hardware support
• Not well supported in current GPGPUs

CUDA Still Dominant in the Near Future
❖ However, we have to manually parallelize
the algorithm: more design overhead
❖ We need expertise in
• Algorithms of image and signal processing
• Filtering, frequency analysis, compression,
feature extraction, recognition, ...
• Theory, tools and methodology of parallel
computing
• Communication, synchronization, resource
management, load balancing, debugging, ...

GPUs for Multimedia

❖ 3.5X: PowerDirector7 Ultra
❖ 10X: CUDA JPEG Decoder
❖ 10X: DivideFrame GPU Decoder (Vegas/Premiere) - Using the Power of NVIDIA Graphic Card to Decode H.264 Video Files
❖ 26X: Hyperspectral Image Compression on NVIDIA GPUs
❖ 10X: Motion Estimation for H.264/AVC on Multiple GPUs Using NVIDIA CUDA

GPUs for Computer Vision(1/2)

❖ 87X: CUDA SURF - A Real-time Implementation for SURF (TU Darmstadt)
❖ 26X: Leukocyte Tracking: ImageJ Plugin (University of Virginia)
❖ 200X: Real-time Spatiotemporal Stereo Matching Using the Dual-Cross-Bilateral Grid
❖ 100X: Image Denoising with Bilateral Filter (Wroclaw University of Technology)
❖ 85X: Digital Breast Tomosynthesis Reconstruction (Massachusetts General Hospital)
❖ 100X: Fast Optical Flow on GPU At Video Rate for Full HD Resolution (Onera)
❖ 8X: A Framework for Efficient and Scalable Execution of Domain-specific Templates On GPU (NEC Labs, Berkeley, Purdue)
❖ 13X: Accelerating Advanced MRI Reconstructions (University of Illinois)

GPUs for Computer Vision(2/2)

❖ 20X: GPU for Surveillance
❖ 13X: Fast Human Detection with Cascaded Ensembles
❖ 109X: Fast Sliding-Window Object Detection
❖ 263X: GPU Acceleration of Object Classification Algorithm Using NVIDIA CUDA
❖ 300X: Audience Measurement - Real-time Video Analysis for Counting People, Face Detection and Tracking
❖ 10X: Real-time Visual Tracker by Stream Processing
❖ 45X: A GPU Accelerated Evolutionary Computer Vision System
❖ 3X: Canny Edge Detection

The Embedded Vision Alliance

Readings (1/2)
• Wang, Yuan-Kai, and Wen-Bin Huang. "Acceleration of an improved
Retinex algorithm." IEEE Computer Society Conference on
Computer Vision and Pattern Recognition Workshops (CVPRW). 2011.
• Wang, Yuan-Kai, and Wen-Bin Huang. "A CUDA-enabled parallel
algorithm for accelerating retinex." Journal of Real-Time Image
Processing (2012): 1-19.
• Pauwels, Karl, et al. "A comparison of FPGA and GPU for real-time
phase-based optical flow, stereo, and local image features."
Computers, IEEE Transactions on 61.7 (2012): 999-1012.
• Pratx, Guillem, and Lei Xing. "GPU computing in medical physics: A
review." Medical physics 38.5 (2011): 2685-2697.
• Cope, Ben, et al. "Performance comparison of graphics processors
to reconfigurable logic: a case study." Computers, IEEE
Transactions on 59.4 (2010): 433-448.

Readings (2/2)
❖ “Designing Visionary Mobile Apps Using the Tegra
Android Development Pack,” http://bit.ly/1jvwbgV
❖ “Getting Started With GPU-Accelerated Computer
Vision Using OpenCV and CUDA,”
http://bit.ly/1oMwJEG
❖ “The open standard for parallel programming of
heterogeneous systems,”
https://www.khronos.org/opencl/

OpenCV Acceleration
❖ GPU Module Introduction — OpenCV 2.4.9.0
documentation
❖ OpenCL Module Introduction - opencv documentation!
❖ OpenCV-CL: Computer vision with OpenCL
acceleration, AMD Developer Central, 2013.
❖ Pulli, Kari, et al. "Real-time computer vision with
OpenCV." Communications of the ACM 55.6 (2012):
61-69.
❖ Allusse, Yannick, et al. "GpuCV: A GPU-accelerated
framework for image processing and computer vision."
Advances in Visual Computing. Springer Berlin
Heidelberg, 2008. 430-439.

CUDA
❖ CUDA Programming guide. nVidia.
❖ CUDA Best Practices Guide. nVidia.
❖ CUDA Reference Manual. nVidia.
❖ CUDA Zone - NVIDIA Developer,
https://developer.nvidia.com/cuda-zone
❖ Parallel Programming and Computing Platform | CUDA
Home, www.nvidia.com/object/cuda_home_new.html
❖ Applications of CUDA for Imaging and Computer
Vision
http://www.nvidia.com/object/imaging_comp_vision.html
❖ nVidia Performance Primitives (NPP)
http://developer.nvidia.com/object/npp_home.html

OpenCL
❖ Khronos OpenCL specification, reference card, tutorials, etc:
http://www.khronos.org/opencl
❖ AMD OpenCL Resources:
http://developer.amd.com/opencl
❖ NVIDIA OpenCL Resources:
http://developer.nvidia.com/opencl
❖ Books
• Using OpenCL: Programming Massively Parallel Computers.
IOS Press, 2012.
• OpenCL programming guide. Pearson Education, 2011.
• Heterogeneous Computing with OpenCL: Revised OpenCL 1.
Newnes, 2012.
• OpenCL in Action: how to accelerate graphics and
computation. Manning, 2012.

RenderScript
❖ RenderScript for Android Developer, Official web site
http://developer.android.com/guide/topics/renderscript/compute.html
❖ Qian, Xi, Guangyu Zhu, and Xiao-Feng Li.
"Comparison and analysis of the three programming
models in google android." First Asia-Pacific
Programming Languages and Compilers Workshop.
2012.
❖ "High Performance Apps Development with
RenderScript," 12th Kandroid Conference, 2013.

Web Sites and Resources


❖Embedded Vision Alliance,
http://www.embedded-vision.com
❖GPUComputing.Net,
http://www.gpucomputing.net
❖HSA Foundation, www.hsafoundation.com


Parallel Computing with GPGPU
❖Programming Massively Parallel
Processors – A Hands-on Approach
• D. B. Kirk, W. M. Hwu
• Morgan Kaufmann, 2010
• http://www.nvidia.com/object/promotion_david_kirk_book.html
