MOTIVATION
Bolt: C++ Template Library for HSA
- Improve developer productivity
  - Optimized library routines for common GPU operations
  - Works with open standards (OpenCL and C++ AMP)
  - Distributed as open source
- Make GPU programming as easy as CPU programming
  - Resembles the familiar C++ Standard Template Library
  - Customizable via C++ template parameters
- Leverage high-performance shared virtual memory
- Optimize for HSA
  - Single source base for GPU and CPU
  - Platform load balancing
AGENDA
- Introduction and Motivation
- Bolt Code Examples for C++ AMP and OpenCL
- ISV Proof Point
- Single source code base for CPU and GPU
- Platform Load Balancing
- Summary
- Interface similar to the familiar C++ Standard Template Library
- No explicit mention of C++ AMP or OpenCL (or the GPU!)
- More advanced use cases allow the programmer to supply a kernel in C++ AMP or OpenCL
- Direct use of host data structures (e.g. std::vector); bolt::sort implicitly runs on the platform
- Runtime automatically selects CPU or GPU (or both)
5 | BOLT | June 2012
- Functor (a * xx + yy) is now specified inline
- Can capture variables from the surrounding scope (a), eliminating the boilerplate class
- Interface similar to the familiar C++ Standard Template Library; clbolt uses OpenCL below the API level
- Host data is copied or mapped to the GPU
- The first call to clbolt::sort will generate and compile a kernel
Solution:
- The BOLT_FUNCTOR macro creates both host-side and string versions of the SaxpyFunctor class definition
- The class name (SaxpyFunctor) is stored in the TypeName trait
- The OpenCL kernel code (the SaxpyFunctor class definition) is stored in the ClCode trait
BOLT: WHAT'S NEW?
- Optimized template library routines for common GPU functions
- For OpenCL and C++ AMP, across multiple platforms
// x, y are coordinates of the pixel to transform
// Pixel difference:           It = W(y, x) - I(y, x);
// Average left/right pixels:  Ix = 0.5f * ( W(y, x+1) - W(y, x-1) );
// Average top/bottom pixels:  Iy = 0.5f * ( W(y+1, x) - W(y-1, x) );
// X = x distance of this pixel from center
// Y = y distance of this pixel from center
// Compute for each pixel:
H[ 0] = (Ix*X+Iy*Y) * (Ix*X+Iy*Y);
H[ 1] = (Ix*X-Iy*Y) * (Ix*X+Iy*Y);
H[ 2] = (Ix*X-Iy*Y) * (Ix*X-Iy*Y);
H[ 3] = (Ix       ) * (Ix*X+Iy*Y);
H[ 4] = (Ix       ) * (Ix*X-Iy*Y);
H[ 5] = (Ix       ) * (Ix       );
H[ 6] = (Iy       ) * (Ix*X+Iy*Y);
H[ 7] = (Iy       ) * (Ix*X-Iy*Y);
H[ 8] = (Iy       ) * (Ix       );
H[ 9] = (Iy       ) * (Iy       );
H[10] = (It       ) * (Ix*X+Iy*Y);
H[11] = (It       ) * (Ix*X-Iy*Y);
H[12] = (It       ) * (Ix       );
H[13] = (It       ) * (Iy       );
[Performance chart: lines of code (LOC, 0-250) and relative performance for Serial CPU, TBB, Intrinsics+TBB, C++ AMP, and HSA Bolt implementations; each runtime bar is broken into Init, Compile, Copy, Launch, Algorithm, and Copy-back phases]
CPU implementations:
- CPU-friendly data strides
- Launch enough threads to use all cores
Bolt has access to both optimized CPU and GPU implementations at the same time. Let's use both!
- Size-based dispatch: run large data sizes on the GPU and small ones on the CPU; the same call site is used for varying data sizes.
- Reduction: run initial reduction phases on the GPU and the final stages on the CPU; applies to any reduction operation.
- Region split: run wide center regions on the GPU and border regions on the CPU; e.g. image processing.
- Whole-platform dispatch: distribute workgroups to the available processing units on the entire platform, when a kernel has similar performance/energy on CPU and GPU.
- Pipeline: run a pipelined series of user-defined stages; stages can be CPU-only, GPU-only, or CPU-or-GPU; e.g. a video processing pipeline.
- Filter: the GPU scans all candidates and rejects early mismatches; the CPU more deeply evaluates the survivors; e.g. Haar detector, word search, audio search.
HETEROGENEOUS PIPELINE
- Mimics a traditional manufacturing assembly line
- Developer supplies a series of pipeline stages
- Each stage processes its input token and passes an output token to the next stage
- Stages can be CPU-only, GPU-only, or CPU/GPU
- CPU/GPU tasks are dynamically scheduled
  - Queue depth and estimated execution time drive the scheduling decision
  - Adapts to variation in target hardware or system utilization
- Data location is not an issue in HSA; leverages the single source code base
- GPU kernels are scheduled asynchronously; completion invokes the next stage of the pipeline
Simple video pipeline example:
PARALLEL_FILTER
- Targets applications with a filter pattern: filter out a small number of results from a large initial pool of candidates
- Initial phases are best run on the GPU: large data sets (too big for caches), wide vectors, high bandwidth
- Examples: Haar detector, word search, acoustic search
- Developer specifies:
  - Execution grid
  - Iteration state type and initial value
  - Filter function: accepts a point to process and the current iteration state; returns true to continue processing or false to exit
- The Bolt/HSA runtime automatically hands off work between CPU and GPU, balancing work by adjusting the split point between GPU and CPU
SUMMARY
Bolt: C++ Template Library for HSA
- Optimized GPU and HSA library routines
  - Customizable via templates
  - For both OpenCL and C++ AMP
- Enjoy the unique advantages of the HSA platform
  - High-performance shared virtual memory
  - Tightly integrated CPU and GPU
- Enable advanced HSA features
  - A single source base for CPU and GPU
  - Platform load balancing across CPU and GPU
BACKUP
Software:
- Windows 7 Professional SP1 (64-bit OS)
- AMD OpenCL 1.2 AMD-APP (937.2)
- Microsoft Visual Studio 11 Beta