
BOLT: A C++ TEMPLATE LIBRARY FOR HSA
Ben Sander AMD Senior Fellow
MOTIVATION
Improve developer productivity
  Optimized library routines for common GPU operations
  Works with open standards (OpenCL and C++ AMP)
  Distributed as open source
Make GPU programming as easy as CPU programming
  Resemble familiar C++ Standard Template Library
  Customizable via C++ template parameters
  Leverage high-performance shared virtual memory
Optimize for HSA
  Single source base for GPU and CPU
  Platform load balancing
3 | BOLT | June 2012
AGENDA
Introduction and Motivation
Bolt Code Examples for C++ AMP and OpenCL
ISV Proof Point
Single source code base for CPU and GPU
Platform Load Balancing
Summary
SIMPLE BOLT EXAMPLE

    #include <bolt/sort.h>
    #include <vector>
    #include <algorithm>
    #include <cstdlib>   // rand

    int main()
    {
        // generate random data (on host)
        std::vector<int> a(1000000);
        std::generate(a.begin(), a.end(), rand);

        // sort, run on best device
        bolt::sort(a.begin(), a.end());
    }
Interface similar to the familiar C++ Standard Template Library
No explicit mention of C++ AMP or OpenCL (or the GPU!)
  More advanced use cases allow the programmer to supply a kernel in C++ AMP or OpenCL
Direct use of host data structures (e.g. std::vector)
bolt::sort implicitly runs on the platform
  Runtime automatically selects CPU or GPU (or both)
BOLT FOR C++ AMP : USER-SPECIFIED FUNCTOR

    #include <bolt/transform.h>
    #include <vector>

    // Functor computing a * x + y; usable on both CPU and C++ AMP accelerators
    struct SaxpyFunctor
    {
        float _a;
        SaxpyFunctor(float a) : _a(a) {}

        float operator() (const float &xx, const float &yy) restrict(cpu,amp)
        {
            return _a * xx + yy;
        }
    };

    int main()
    {
        SaxpyFunctor s(100);
        std::vector<float> x(1000000); // initialization not shown
        std::vector<float> y(1000000); // initialization not shown
        std::vector<float> z(1000000);

        bolt::transform(x.begin(), x.end(), y.begin(), z.begin(), s);
    }
BOLT FOR C++ AMP : LEVERAGING C++11 LAMBDA

    #include <bolt/transform.h>
    #include <vector>

    int main()
    {
        const float a = 100;
        std::vector<float> x(1000000); // initialization not shown
        std::vector<float> y(1000000); // initialization not shown
        std::vector<float> z(1000000);

        // saxpy with C++ lambda
        bolt::transform(x.begin(), x.end(), y.begin(), z.begin(),
            [=] (float xx, float yy) restrict(cpu, amp) {
                return a * xx + yy;
            });
    }
Functor (a * xx + yy) now specified inline
Can capture variables from the surrounding scope (a), which eliminates the boilerplate class
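For comparison, the same saxpy can be written against the standard library alone. This is a CPU-only sketch, not Bolt code: saxpy_reference is a hypothetical helper name, and std::transform stands in for bolt::transform so that the lambda's capture of a can be checked with any C++11 compiler.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not part of Bolt): computes z[i] = a * x[i] + y[i]
// on the host with std::transform and a capturing lambda.
std::vector<float> saxpy_reference(float a,
                                   const std::vector<float>& x,
                                   const std::vector<float>& y) {
    assert(x.size() == y.size());  // both inputs must have the same length
    std::vector<float> z(x.size());
    std::transform(x.begin(), x.end(), y.begin(), z.begin(),
                   [=](float xx, float yy) { return a * xx + yy; });
    return z;
}
```

The lambda is the same one passed to bolt::transform on the slide, minus the restrict(cpu, amp) qualifier, which only a C++ AMP compiler accepts.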
BOLT FOR OPENCL

    #include <clbolt/sort.h>
    #include <vector>
    #include <algorithm>
    #include <cstdlib>   // rand

    int main()
    {
        // generate random data (on host)
        std::vector<int> a(1000000);
        std::generate(a.begin(), a.end(), rand);

        // sort, run on best device
        clbolt::sort(a.begin(), a.end());
    }
Interface similar to the familiar C++ Standard Template Library
clbolt uses OpenCL below the API level
  Host data copied or mapped to the GPU
  First call to clbolt::sort will generate and compile a kernel
More advanced use cases allow the programmer to supply a kernel in OpenCL

BOLT FOR OPENCL : USER-SPECIFIED FUNCTOR

    #include <clbolt/transform.h>
    #include <vector>

    // The macro keeps the class definition for the host compiler and also
    // captures it as a string for the OpenCL compiler
    BOLT_FUNCTOR(SaxpyFunctor,
    struct SaxpyFunctor
    {
        float _a;
        SaxpyFunctor(float a) : _a(a) {};

        float operator() (const float &xx, const float &yy)
        {
            return _a * xx + yy;
        };
    };
    );

    void main2()
    {
        SaxpyFunctor s(100);
        std::vector<float> x(1000000); // initialization not shown
        std::vector<float> y(1000000); // initialization not shown
        std::vector<float> z(1000000);

        clbolt::transform(x.begin(), x.end(), y.begin(), z.begin(), s);
    }
Challenge: OpenCL split-source model
  Host code in C or C++
  OpenCL code specified in strings
Solution:
  BOLT_FUNCTOR macro creates both host-side and string versions of the SaxpyFunctor class definition
    Class name (SaxpyFunctor) stored in the TypeName trait
    OpenCL kernel code (SaxpyFunctor class definition) stored in the ClCode trait
  clbolt function implementation
    Can retrieve traits from the class name
    Uses TypeName and ClCode to construct a customized transform kernel
    First call to clbolt::transform compiles the kernel
Advanced users can directly create the ClCode trait
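One way such a macro could work is preprocessor stringification. The sketch below is an assumption about the mechanism, not Bolt's actual source: SKETCH_FUNCTOR, the trait definitions, and make_transform_kernel_source are hypothetical names, and the generated kernel text is illustrative only (it is built as a string and never compiled here).

```cpp
#include <cassert>
#include <string>

// Trait templates named after the slide; these definitions are assumptions.
template <typename T> struct TypeName;  // class name as a string
template <typename T> struct ClCode;    // class definition as a string

// Hypothetical stand-in for BOLT_FUNCTOR: emits the definition for the host
// compiler and records its name and source text in the two traits.
#define SKETCH_FUNCTOR(T, ...)                                   \
    __VA_ARGS__                                                  \
    template <> struct TypeName<T> {                             \
        static std::string get() { return #T; }                  \
    };                                                           \
    template <> struct ClCode<T> {                               \
        static std::string get() { return #__VA_ARGS__; }        \
    };

SKETCH_FUNCTOR(SaxpyFunctor,
struct SaxpyFunctor {
    float _a;
    SaxpyFunctor(float a) : _a(a) {}
    float operator()(const float& xx, const float& yy) const {
        return _a * xx + yy;
    }
};
)

// Assumed shape of the library side: splice the user's class text and class
// name into a kernel template. The kernel body is illustrative, not Bolt's.
template <typename F>
std::string make_transform_kernel_source() {
    return ClCode<F>::get() +
           "\nkernel void transform_sketch(global float* x, global float* y,"
           " global float* z, " + TypeName<F>::get() + " f) {"
           " int i = get_global_id(0); z[i] = f(x[i], y[i]); }\n";
}
```

The class is written once, yet it is usable as an ordinary host type and is also available as text to hand to a runtime kernel compiler, which is the split-source problem the slide describes.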
BOLT: C++ AMP VS. OPENCL

BOLT for C++ AMP
C++ template library for HSA
  Developer can customize data types and operations
  Provide library of optimized routines for AMD GPUs
C++ host language
Kernels marked with restrict(cpu, amp)
Kernels written in C++ AMP kernel language
  Restricted set of C++
Kernels compiled at compile-time
C++ lambda syntax supported
Functors may contain array_view
Parameters can use host data structures (e.g. std::vector)
Parameters can be array or array_view types
Use bolt namespace
BOLT for OpenCL
C++ template library for HSA
  Developer can customize data types and operations
  Provide library of optimized routines for AMD GPUs
C++ host language
Kernels marked with BOLT_FUNCTOR macro
Kernels written in OpenCL kernel language
  Subset of C99, with extensions (e.g. vectors, builtins)
Kernels compiled at runtime, on first call
  Some compile errors shown on first call
C++11 lambda syntax NOT supported
Functors may not contain pointers
Parameters can use host data structures (e.g. std::vector)
Parameters can be cl::Buffer or cl_buffer types
Use clbolt namespace
BOLT : WHAT'S NEW?
Optimized template library routines for common GPU functions
  For OpenCL and C++ AMP, across multiple platforms
Direct interfaces to host memory structures (e.g. std::vector)
  Leverage HSA unified address space and zero-copy memory
  C++ AMP array and cl::Buffer also supported if memory is already on the device
Bolt submits to the entire platform rather than a specific device
  Runtime automatically selects the device
  Provides opportunities for load-balancing
  Provides optimal CPU path if no GPU is available
  Override to specify a specific accelerator is supported
  Enables developers to fearlessly move to the GPU
Bolt will contain new APIs optimized for HSA devices
  Multi-device bolt::pipeline, bolt::parallel_filter
EXEMPLARY ISV PROOF-POINT

Hessian kernel from MotionDSP Ikena
  Commercially available video enhancement software
  Optimized for CPU and GPU

Hessian Algorithm Pseudo Code:

    // x,y are coordinates of pixel to transform
    // Pixel difference:
    It = W(y, x) - I(y, x);
    // average left/right pixels:
    Ix = 0.5f * ( W(y, x+1) - W(y, x-1) );
    // average top/bottom pixels:
    Iy = 0.5f * ( W(y+1, x) - W(y-1, x) );

    X = x dist of this pixel from center
    Y = y dist of this pixel from center

    // Compute for each pixel:
    H[ 0] = (Ix*X+Iy*Y) * (Ix*X+Iy*Y)
    H[ 1] = (Ix*X-Iy*Y) * (Ix*X+Iy*Y)
    H[ 2] = (Ix*X-Iy*Y) * (Ix*X-Iy*Y)
    H[ 3] = (Ix       ) * (Ix*X+Iy*Y)
    H[ 4] = (Ix       ) * (Ix*X-Iy*Y)
    H[ 5] = (Ix       ) * (Ix       )
    H[ 6] = (Iy       ) * (Ix*X+Iy*Y)
    H[ 7] = (Iy       ) * (Ix*X-Iy*Y)
    H[ 8] = (Iy       ) * (Ix       )
    H[ 9] = (Iy       ) * (Iy       )
    H[10] = (It       ) * (Ix*X+Iy*Y)
    H[11] = (It       ) * (Ix*X-Iy*Y)
    H[12] = (It       ) * (Ix       )
    H[13] = (It       ) * (Iy       )
Basic Hessian Algorithm
  Two input images, I and W
  Transform, followed by reduce (transform_reduce)
    For each pixel in the image, compute 14 float coefficients
    Sum the coefficients over all pixels; the final result is 14 floats
Complex, computationally intense, real-world algorithm
Developed multiple implementations of the Hessian kernel
  CPU, GPU, Bolt
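The transform-then-reduce shape can be sketched serially in standard C++. The pairing of factors into the 14 products follows the slide's pseudo code as best it can be read. The Image struct, the row-major layout, the use of a signed offset from the geometric center for X and Y, and the skipping of border pixels are all assumptions made for this illustration. It is not the MotionDSP or Bolt implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::array<float, 14> Coeffs;

// Row-major single-channel image; this layout is an assumption.
struct Image {
    int width;
    int height;
    std::vector<float> pixels;
    float at(int y, int x) const {
        return pixels[static_cast<std::size_t>(y) * width + x];
    }
};

// Transform step: the 14 per-pixel products from the slide's pseudo code.
Coeffs hessian_pixel(const Image& I, const Image& W, int y, int x) {
    const float It = W.at(y, x) - I.at(y, x);
    const float Ix = 0.5f * (W.at(y, x + 1) - W.at(y, x - 1));
    const float Iy = 0.5f * (W.at(y + 1, x) - W.at(y - 1, x));
    // Assumed: signed offset from the geometric image center.
    const float X = x - 0.5f * (W.width - 1);
    const float Y = y - 0.5f * (W.height - 1);
    const float p = Ix * X + Iy * Y;
    const float m = Ix * X - Iy * Y;
    Coeffs h = {{p * p, m * p, m * m, Ix * p, Ix * m, Ix * Ix, Iy * p,
                 Iy * m, Iy * Ix, Iy * Iy, It * p, It * m, It * Ix, It * Iy}};
    return h;
}

// Reduce step: sum the coefficients over all interior pixels. Border pixels
// are skipped here because the stencil reads all four neighbours.
Coeffs hessian_transform_reduce(const Image& I, const Image& W) {
    assert(I.width == W.width && I.height == W.height);
    Coeffs sum{};  // value-initialised to zeros
    for (int y = 1; y + 1 < W.height; ++y) {
        for (int x = 1; x + 1 < W.width; ++x) {
            const Coeffs h = hessian_pixel(I, W, y, x);
            for (std::size_t k = 0; k < h.size(); ++k) sum[k] += h[k];
        }
    }
    return sum;
}
```

Because the per-pixel function has no side effects and the reduction is a plain sum, the pixel loop can be split across CPU threads or GPU work-items without changing hessian_pixel. Floating-point summation order may then differ slightly between devices.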
LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS
(Exemplary ISV Hessian Kernel)

[Figure: lines of code (LOC, axis 0 to 350) broken down into Init, Compile, Copy, Launch, Algorithm, and Copy-back, plotted together with relative performance (axis 0 to 35), for Serial CPU, TBB, Intrinsics+TBB, OpenCL-C, OpenCL-C++, C++ AMP, and HSA Bolt.]
13 | The Programmers Guide to a Universe of Possibility
June 12, 2012
PERFORMANCE PORTABILITY - INTRODUCTION

For many algorithms, the core operation is the same between CPU and GPU
  See sort, saxpy, hessian examples
Differences lie in how data is routed to the core operation
Bolt hides the device-specific routing details inside the library function implementation
GPU implementations:
  GPU-friendly data strides
  Launch enough threads to hide memory latency
  Group memory and work-group communication
CPU implementations:
  CPU-friendly data strides
  Launch enough threads to use all cores
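The difference between the two routings can be illustrated with a serial simulation in which the core operation f is identical and only the index pattern changes. The function names and the worker counts are hypothetical, and the outer loops merely stand in for CPU threads or GPU work-items. Nothing here is taken from Bolt's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU-style routing: each worker walks one contiguous block, which keeps a
// core's accesses within its own cache lines.
template <typename F>
void transform_blocked(const std::vector<float>& in, std::vector<float>& out,
                       std::size_t workers, F f) {
    assert(workers > 0 && in.size() == out.size());
    const std::size_t chunk = (in.size() + workers - 1) / workers;
    for (std::size_t w = 0; w < workers; ++w)  // stands in for one CPU thread
        for (std::size_t i = w * chunk; i < in.size() && i < (w + 1) * chunk; ++i)
            out[i] = f(in[i]);
}

// GPU-style routing: neighbouring work-items touch neighbouring elements and
// advance by the total work-item count, so each step reads adjacent memory.
template <typename F>
void transform_strided(const std::vector<float>& in, std::vector<float>& out,
                       std::size_t work_items, F f) {
    assert(work_items > 0 && in.size() == out.size());
    for (std::size_t w = 0; w < work_items; ++w)  // stands in for one work-item
        for (std::size_t i = w; i < in.size(); i += work_items)
            out[i] = f(in[i]);
}
```

Both functions produce the same output for the same f, which is the sense in which a library can swap the routing per device while the user-supplied functor stays unchanged.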
PERFORMANCE PORTABILITY RESULTS
CPU Performance vs Programming Model
(Exemplary ISV "Hessian" Kernel)

[Figure: relative performance (axis 0.00 to 4.50) for Serial CPU, TBB CPU, OpenCL (CPU), and HSA Bolt (CPU).]
PERFORMANCE PORTABILITY - WHAT'S NEW?

New GPU programming models are close to CPU programming models
  C++ AMP: single-source, (restricted) C++11 kernel language, high-quality debugger/profiler, etc.
Shared virtual memory in HSA
  Removes tedious copies between address spaces
  Will allow use of complex pointer-containing data structures
Fewer performance cliffs in modern GPU architectures (e.g. AMD GCN)
  Reduces the need for GPU-specific optimizations in the core operation
  Example: 14:7:1 bandwidth ratio for Group:Cache:Global memory
Autovectorization
  Modern compilers include auto-vectorization support
  Restrictions of GPU programming models facilitate vectorization
Finally, Bolt functors can provide device-specific implementations if needed
HSA LOAD BALANCING : KEY FEATURES AND OBSERVATIONS

High-performance shared virtual memory
  Developers no longer have to worry about data location (i.e. device vs. host)
HSA platforms have tightly integrated CPU and GPU
  GPU better at wide vector parallelism, extracting memory bandwidth, latency hiding
  CPU better at fine-grained vector parallelism, cache-sensitive code, control flow
Bolt abstractions
  Provide insight into the characteristics of the algorithm
    Reduce vs. Transform vs. parallel_filter
  Abstraction above the details of a kernel launch
    Don't need to specify device, workgroup shape, work-items, number of kernels, etc.
    Runtime may optimize these for the platform
Bolt has access to both optimized CPU and GPU implementations at the same time
  Let's use both!
EXAMPLES OF HSA LOAD-BALANCING

Example: Data Size
  Description: Run large data sizes on GPU, small on CPU.
  Exemplary use cases: Same call-site used for varying data sizes.

Example: Reduction
  Description: Run initial reduction phases on GPU, run final stages on CPU.
  Exemplary use cases: Any reduction operation.

Example: Border/Edge Optimization
  Description: Run wide center regions on GPU, run border regions on CPU.
  Exemplary use cases: Image processing.

Example: Platform Super-Device
  Description: Distribute workgroups to available processing units on the entire platform.
  Exemplary use cases: Kernel has similar performance/energy on CPU and GPU.

Example: Heterogeneous Pipeline
  Description: Run a pipelined series of user-defined stages. Stages can be CPU-only, GPU-only, or CPU or GPU.
  Exemplary use cases: Video processing pipeline.

Example: Parallel_filter
  Description: GPU scans all candidates and rejects early mismatches; CPU more deeply evaluates the survivors.
  Exemplary use cases: Haar detector, word search, audio search.
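The data-size case is the simplest to picture: one call site, with the device chosen from the input length. The sketch below assumes a fixed crossover threshold, which is a hypothetical policy (a real runtime would presumably tune or measure it). Both branches fall back to std::sort, so only the decision logic is modelled, not an actual GPU launch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

enum class Device { Cpu, Gpu };

// Hypothetical policy: inputs at or above the threshold go to the GPU,
// smaller ones stay on the CPU where launch overhead would dominate.
Device choose_device(std::size_t n, std::size_t gpu_threshold) {
    return n >= gpu_threshold ? Device::Gpu : Device::Cpu;
}

// Single call site for every input size. The return value reports which
// device the policy picked; the sort itself runs on the host in this sketch.
Device sort_on_platform(std::vector<int>& data, std::size_t gpu_threshold) {
    const Device chosen = choose_device(data.size(), gpu_threshold);
    std::sort(data.begin(), data.end());  // placeholder for the device path
    return chosen;
}
```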
HETEROGENEOUS PIPELINE
Mimics a traditional manufacturing assembly line
  Developer supplies a series of pipeline stages
  Each stage processes its input token, passes an output token to the next stage
  Stages can be either CPU-only, GPU-only, or CPU/GPU
CPU/GPU tasks are dynamically scheduled
  Use queue depth and estimated execution time to drive scheduling decision
  Adapt to variation in target hardware or system utilization
  Data location not an issue in HSA
  Leverage single source code
GPU kernels scheduled asynchronously
  Completion invokes next stage of the pipeline
Simple video pipeline example:
  Video Decode (CPU-only) -> Video Processing (CPU/GPU) -> Video Render (GPU-only)
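A scheduling decision driven by queue depth and estimated execution time might look like the following. All names are hypothetical, and the cost model (every queued task costs the same as the new one) is a deliberate simplification chosen for this sketch. The slides do not describe the actual heuristic.

```cpp
#include <cassert>

enum class Target { Cpu, Gpu };
enum class Affinity { CpuOnly, GpuOnly, Either };

struct DeviceLoad {
    int queue_depth;     // tasks already waiting on this device
    double est_task_ms;  // estimated time for one task of this stage
};

// Assumed cost model: wait for everything already queued, then run.
double estimated_finish_ms(const DeviceLoad& d) {
    return (d.queue_depth + 1) * d.est_task_ms;
}

// Fixed-affinity stages go where they must; CPU/GPU stages go to whichever
// device is expected to finish the task sooner.
Target schedule_stage(Affinity affinity, const DeviceLoad& cpu,
                      const DeviceLoad& gpu) {
    if (affinity == Affinity::CpuOnly) return Target::Cpu;
    if (affinity == Affinity::GpuOnly) return Target::Gpu;
    return estimated_finish_ms(gpu) <= estimated_finish_ms(cpu) ? Target::Gpu
                                                                : Target::Cpu;
}
```

With an idle system the faster device wins, and as its queue fills up the same stage starts spilling over to the other device, which is the adaptive behaviour the slide describes for CPU/GPU stages such as video processing.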
CASCADE DEPTH ANALYSIS

[Figure: plot of cascade depth (axis 0 to 25), with legend bands 0-5, 5-10, 10-15, 15-20, and 20-25.]
PARALLEL_FILTER
Target applications with a "Filter" pattern
  Filter out a small number of results from a large initial pool of candidates
  Initial phases best run on GPU:
    Large data sets (too big for caches), wide vector, high-bandwidth
  Tail phases best run on CPU:
    Smaller data sets (may fit in cache), divergent control flow, fine-grained vector width
  Examples: Haar detector, word search, acoustic search
Developer specifies:
  Execution grid
  Iteration state type and initial value
  Filter function
    Accepts a point to process and the current iteration state
    Returns true to continue processing or false to exit
BOLT / HSA runtime:
  Automatically hands off work between CPU and GPU
  Balances work by adjusting the split point between GPU and CPU
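The hand-off can be modelled serially to show that moving the split point changes where work runs but not the result. The sketch below is an assumption-laden illustration and not the bolt::parallel_filter interface: the function name, the 1-D grid, the max_iters bound, and the choice to return the surviving points are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Runs `filter` on each point of a 1-D grid of n candidates. Iterations
// [0, split) model the GPU phase: every live candidate advances one step at
// a time. Later iterations model the CPU phase: each survivor is followed to
// completion on its own. Returns the points that were never rejected.
template <typename State, typename Filter>
std::vector<std::size_t> parallel_filter_sketch(std::size_t n, State initial,
                                                int max_iters, int split,
                                                Filter filter) {
    assert(split >= 0 && max_iters >= 0);
    std::vector<State> state(n, initial);
    std::vector<std::size_t> alive(n);
    for (std::size_t i = 0; i < n; ++i) alive[i] = i;

    // "GPU" phase: wide sweeps over the whole (shrinking) candidate pool.
    for (int it = 0; it < split && it < max_iters; ++it) {
        std::vector<std::size_t> next;
        for (std::size_t p : alive)
            if (filter(p, state[p])) next.push_back(p);
        alive.swap(next);
    }

    // "CPU" phase: deep, divergent evaluation of the few survivors.
    std::vector<std::size_t> survivors;
    for (std::size_t p : alive) {
        bool keep = true;
        for (int it = split; keep && it < max_iters; ++it)
            keep = filter(p, state[p]);
        if (keep) survivors.push_back(p);
    }
    return survivors;
}
```

Each candidate sees the same sequence of filter calls regardless of split, so a runtime is free to move the split point to balance load, provided the filter depends only on its point and its own iteration state.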
SUMMARY
Bolt: C++ Template Library
  Optimized GPU and HSA library routines
  Customizable via templates
  For both OpenCL and C++ AMP
Enjoy the unique advantages of the HSA platform
  High-performance shared virtual memory
  Tightly integrated CPU and GPU
Enable advanced HSA features
  A single source base for CPU and GPU
  Platform load balancing across CPU and GPU
BACKUP
BENCHMARK CONFIGURATION INFORMATION

Slides 13 and 15: AMD A10-5800K APU with Radeon HD Graphics
  CPU: 4 cores, 3800 MHz (4200 MHz Turbo)
  GPU: AMD Radeon HD 7660D, 6 compute units, 800 MHz
  4 GB RAM
Software:
  Windows 7 Professional SP1 (64-bit OS)
  AMD OpenCL 1.2 AMD-APP (937.2)
  Microsoft Visual Studio 11 Beta
Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

[For AMD-speakers only] 2012 Advanced Micro Devices, Inc.

[For non-AMD speakers only] The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD's positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.