OpenCL: Parallel Computing on the GPU and CPU
Aaftab Munshi
Beyond Programmable Shading: Fundamentals
Today's processors are increasingly parallel
CPUs
- Multiple cores are driving performance increases
GPUs
- Transforming into general purpose data-parallel computational coprocessors
- Improving numerical precision (single and double)
Opportunity: Processor Parallelism
Writing parallel programs is different for the CPU and GPU
- Differing domain-specific techniques
- Vendor-specific technologies
Graphics API is not an ideal abstraction for general purpose compute
Challenge: Processor Parallelism
OpenCL: Open Computing Language
Approachable language for accessing heterogeneous computational resources
Supports parallel execution on single or multiple processors
- GPU, CPU, GPU + CPU or multiple GPUs
Desktop and Handheld Profiles
Designed to work with graphics APIs such as OpenGL
Introducing OpenCL
OpenCL = Open Standard
Specification under review
- Royalty free, cross-platform, vendor neutral
- Khronos OpenCL working group (www.khronos.org)
Based on a proposal by Apple
- Developed in collaboration with industry leaders
- Performance-enhancing technology in Mac OS X Snow Leopard
OpenCL Working Group Members
Broad Industry Support
Copyright Khronos Group, 2008
OpenCL: A Sneak Preview
Use all computational resources in system
- GPUs and CPUs as peers
- Data- and task-parallel compute model
Efficient parallel programming model
- Based on C
- Abstract the specifics of underlying hardware
Specify accuracy of floating-point computations
- IEEE 754 compliant rounding behavior
- Define maximum allowable error of math functions
Drive future hardware requirements

Design Goals of OpenCL
Platform Layer
- query and select compute devices in the system
- initialize compute device(s)
- create compute contexts and work-queues
Runtime
- resource management
- execute compute kernels
Compiler
- A subset of ISO C99 with appropriate language additions
- Compile and build compute program executables, online or offline
OpenCL Software Stack
Compute Kernel
- Basic unit of executable code, similar to a C function
- Data-parallel or task-parallel
Compute Program
- Collection of compute kernels and internal functions
- Analogous to a dynamic library
Applications queue compute kernel execution instances
- Queued in-order
- Executed in-order or out-of-order
- Events are used to implement appropriate synchronization of execution instances
OpenCL Execution Model
Define N-Dimensional computation domain
- Each independent element of execution in the N-D domain is called a work-item
- The N-D domain defines the total number of work-items that execute in parallel: the global work size
Work-items can be grouped together into a work-group
- Work-items in a group can communicate with each other
- Can synchronize execution among work-items in a group to coordinate memory access
Execute multiple work-groups in parallel
Mapping of global work size to work-groups

OpenCL Data-Parallel Execution
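The relationship between the global work size, work-groups, and work-items described above can be sketched in plain C for a one-dimensional domain. This is an illustration only, not OpenCL API code: the type `work_item_pos` and the functions `num_work_groups` and `locate_work_item` are hypothetical names, and the sketch assumes the global work size is an exact multiple of the work-group size.

```c
#include <assert.h>
#include <stddef.h>

/* Position of one work-item in a 1-D domain (hypothetical helper type). */
typedef struct {
    size_t group_id;  /* which work-group the work-item belongs to */
    size_t local_id;  /* index of the work-item inside its work-group */
} work_item_pos;

/* Number of work-groups needed to cover the global work size.
   Assumption: the global size is an exact multiple of the group size. */
size_t num_work_groups(size_t global_work_size, size_t local_work_size)
{
    assert(local_work_size > 0);
    assert(global_work_size % local_work_size == 0);
    return global_work_size / local_work_size;
}

/* Split a global work-item index into (work-group, index within group). */
work_item_pos locate_work_item(size_t global_id, size_t local_work_size)
{
    work_item_pos pos;
    assert(local_work_size > 0);
    pos.group_id = global_id / local_work_size;
    pos.local_id = global_id % local_work_size;
    return pos;
}
```

With a global work size of 1024 and a work-group size of 64, `num_work_groups` yields 16 groups, and the work-item with global index 130 sits in group 2 at local index 2.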
Data-parallel execution model must be implemented by all OpenCL compute devices
Some compute devices such as CPUs can also execute task-parallel compute kernels
- Executes as a single work-item
- A compute kernel written in OpenCL
- A native C / C++ function
OpenCL Task-Parallel Execution
OpenCL Memory Model
Implements a relaxed consistency, shared memory model
Multiple distinct address spaces
- Address spaces can be collapsed
[Figure: OpenCL memory hierarchy. A compute device contains compute units 1 to N. Each compute unit runs work-items 1 to M, each with its own private memory, and has one local memory. All compute units share a global / constant memory data cache, backed by global memory in compute device memory.]
Address Qualifiers
- __private
- __local
- __constant and __global
Example:
- __global float4 *p;
Derived from ISO C99
- A few restrictions: recursion, function pointers, functions in C99 standard headers ...
Preprocessing directives defined by C99 are supported
Built-in Data Types
- Scalar and vector data types
- Structs, Pointers
- Data-type conversion functions: convert_type<_sat><_roundingmode>
- Image types
Language for writing compute kernels
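The `_sat` part of the conversion pattern `convert_type<_sat><_roundingmode>` requests saturation: out-of-range values clamp to the limits of the destination type instead of wrapping. The plain C sketch below emulates that behavior for an 8-bit unsigned destination. All function names are hypothetical stand-ins, not OpenCL built-ins. The float version rounds toward zero because that is what a C cast does, and it maps NaN to 0 as a choice made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp x into the closed range [lo, hi]. */
static int clamp_int(int x, int lo, int hi)
{
    assert(lo <= hi);
    return x < lo ? lo : (x > hi ? hi : x);
}

/* Saturating int -> 8-bit unsigned: out-of-range inputs stick at 0 or 255. */
uint8_t sat_int_to_uchar(int x)
{
    return (uint8_t)clamp_int(x, 0, UINT8_MAX);
}

/* Non-saturating conversion for contrast: the value wraps modulo 256. */
uint8_t wrap_int_to_uchar(int x)
{
    return (uint8_t)x;
}

/* Saturating float -> 8-bit unsigned, rounding toward zero. */
uint8_t sat_float_to_uchar_rtz(float x)
{
    if (!(x > 0.0f))      /* negatives, zero and NaN all map to 0 here */
        return 0;
    if (x >= 255.0f)      /* too large: clamp to the maximum */
        return UINT8_MAX;
    return (uint8_t)x;    /* in range: the cast truncates toward zero */
}
```

For an input of 300, the saturating version returns 255 while the wrapping version returns 44, which is why saturation matters for image data such as pixel channels.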
Built-in Functions: Required
- work-item functions
- math.h
- read and write image
- relational
- geometric functions
- synchronization functions
Built-in Functions: Optional
- double precision
- atomics to global and local memory
- selection of rounding mode
OpenCL FFT Example - Host API
// create a compute context with GPU device
context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);
// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);
// allocate the buffer memory objects: input (copied from srcA) and output
memobjs[0] = clCreateBuffer(context,
                            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(float)*2*num_entries, srcA);
memobjs[1] = clCreateBuffer(context,
                            CL_MEM_READ_WRITE,
                            sizeof(float)*2*num_entries, NULL);
// create the compute program
program = clCreateProgramFromSource(context, 1,
                                    &fft1D_1024_kernel_src, NULL);
// build the compute program executable
clBuildProgramExecutable(program, false, NULL, NULL);
// create the compute kernel
kernel = clCreateKernel(program, "fft1D_1024");
// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1,
                                 global_work_size,
                                 local_work_size);
// set the args values: input buffer, output buffer, two local scratch arrays
clSetKernelArg(kernel, 0, (void *)&memobjs[0],
               sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 1, (void *)&memobjs[1],
               sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 2, NULL,
               sizeof(float)*(local_work_size[0]+1)*16, NULL);
clSetKernelArg(kernel, 3, NULL,
               sizeof(float)*(local_work_size[0]+1)*16, NULL);
// execute kernel
clExecuteKernel(queue, kernel, NULL, range, NULL, 0, NULL);
OpenCL FFT Example - Compute Kernel
// This kernel computes FFT of length 1024. The 1024 length FFT is decomposed into
// calls to a radix 16 function, another radix 16 function and then a radix 4 function
// Based on "Fitting FFT onto G80 Architecture". Vasily Volkov & Brian Kazian, UC Berkeley CS258 project report, May 2008
__kernel void fft1D_1024 (__global float2 *in, __global float2 *out,
                          __local float *sMemx, __local float *sMemy) {
    int tid = get_local_id(0);
    int blockIdx = get_group_id(0) * 1024 + tid;
    float2 data[16];
    // starting index of data to/from global memory
    in = in + blockIdx; out = out + blockIdx;
    globalLoads(data, in, 64); // coalesced global reads
    fftRadix16Pass(data); // in-place radix-16 pass
    twiddleFactorMul(data, tid, 1024, 0);
    // local shuffle using local memory
    localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));
    fftRadix16Pass(data); // in-place radix-16 pass
    twiddleFactorMul(data, tid, 64, 4); // twiddle factor multiplication
    localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));
    // four radix-4 function calls
    fftRadix4Pass(data); fftRadix4Pass(data + 4);
    fftRadix4Pass(data + 8); fftRadix4Pass(data + 12);
    // coalesced global writes
    globalStores(data, out, 64);
}
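The kernel above offsets its pointers by `get_group_id(0) * 1024 + tid` and keeps 16 values per work-item, so a work-group of 64 work-items handles one 1024-point FFT. The plain C sketch below checks that indexing arithmetic. The slides do not show the body of `globalLoads`, so the sketch assumes its third argument (64) is a stride between successive elements of `data`. The names `element_index` and `group_covers_fft_once` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { FFT_LEN = 1024, WORK_ITEMS = 64, POINTS_PER_ITEM = 16 };

/* Global index of the k-th element touched by work-item `tid` of work-group
   `group`. Assumption: accesses start at group * 1024 + tid and advance by
   `stride` elements per step. */
size_t element_index(size_t group, size_t tid, size_t k, size_t stride)
{
    assert(tid < WORK_ITEMS && k < POINTS_PER_ITEM);
    return group * FFT_LEN + tid + k * stride;
}

/* Returns 1 if the 64 work-items of `group` together touch each of that
   group's 1024 elements exactly once, 0 otherwise. */
int group_covers_fft_once(size_t group)
{
    unsigned char hits[FFT_LEN] = {0};
    for (size_t tid = 0; tid < WORK_ITEMS; ++tid) {
        for (size_t k = 0; k < POINTS_PER_ITEM; ++k) {
            size_t idx = element_index(group, tid, k, WORK_ITEMS)
                         - group * FFT_LEN;
            if (idx >= FFT_LEN || hits[idx] != 0)
                return 0;  /* out of range, or touched twice */
            hits[idx] = 1;
        }
    }
    return 1;
}
```

Under that stride assumption, neighbouring work-items read neighbouring addresses at every step, which matches the "coalesced global reads" comment in the kernel.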
Sharing OpenGL Resources
- OpenCL is designed to efficiently share with OpenGL
- Textures, Buffer Objects and Renderbuffers
- Data is shared, not copied
Efficient queuing of OpenCL and OpenGL commands
Apps can select compute device(s) that will run OpenGL and OpenCL
OpenCL and OpenGL
A new compute language that works across GPUs and CPUs
- C99 with extensions
- Familiar to developers
- Includes a rich set of built-in functions
- Makes it easy to develop data- and task-parallel compute programs
Defines hardware and numerical precision requirements
Open standard for heterogeneous parallel computing

Summary
