You are on page 1of 38

OpenCL

Parallel Computing on the GPU and


CPU
Aaftab Munshi
Beyond Programmable Shading: Fundamentals

Todays processors are increasingly parallel

CPUs
!
Multiple cores are driving performance increases

GPUs
!
Transforming into general purpose data-parallel
computational coprocessors
!
Improving numerical precision (single and double)
Opportunity: Processor
Beyond Programmable Shading: Fundamentals

Writing parallel programs di"erent for the CPU and


GPU
!
Di"ering domain-specic techniques
!
Vendor-specic technologies

Graphics API is not an ideal abstraction for general


purpose compute
Challenge: Processor Parallelism
Beyond Programmable Shading: Fundamentals

OpenCL Open Computing Language

Approachable language for accessing heterogeneous


computational resources

Supports parallel execution on single or multiple


processors
!
GPU, CPU, GPU + CPU or multiple GPUs

Desktop and Handheld Proles

Designed to work with graphics APIs such as


OpenGL
Introducing OpenCL
Beyond Programmable Shading: Fundamentals
OpenCL = Open Standard

Specication under review


!
Royalty free, cross-platform, vendor neutral
!
Khronos OpenCL working group (www.khronos.org)

Based on a proposal by Apple


!
Developed in collaboration with industry leaders
!
Performance-enhancing technology in Mac OS X Snow
Leopard
Beyond Programmable Shading: Fundamentals
OpenCL Working Group Members
Broad Industry Support
Copyright Khronos Group, 2008 - Page
OpenCL A Sneak Preview
Beyond Programmable Shading: Fundamentals

Use all computational resources in system


!
GPUs and CPUs as peers
!
Data- and task- parallel compute model

E#cient parallel programming model


!
Based on C
!
Abstract the specics of underlying hardware

Specify accuracy of oating-point computations


!
IEEE 754 compliant rounding behavior
!
Dene maximum allowable error of math functions

Drive future hardware requirements


Design Goals of OpenCL
Beyond Programmable Shading: Fundamentals

Platform Layer
!
query and select compute devices in the system
!
initialize a compute device(s)
!
create compute contexts and work-queues

Runtime
!
resource management
!
execute compute kernels

Compiler
!
A subset of ISO C99 with appropriate language
additions
!
Compile and build compute program executables
!
online or o$ine
OpenCL Software Stack
Beyond Programmable Shading: Fundamentals

Compute Kernel
!
Basic unit of executable code similar to a C
function
!
Data-parallel or task-parallel

Compute Program
!
Collection of compute kernels and internal functions
!
Analogous to a dynamic library

Applications queue compute kernel execution


instances
!
Queued in-order
!
Executed in-order or out-of-order
!
Events are used to implement appropriate
OpenCL Execution Model
Beyond Programmable Shading: Fundamentals

Dene N-Dimensional computation domain


!
Each independent element of execution in N-D
domain is called a work-item
!
The N-D domain denes the total number of work-
items that execute in parallel global work size.

Work-items can be grouped together work-group


!
Work-items in group can communicate with each
other
!
Can synchronize execution among work-items in
group to coordinate memory access

Execute multiple work-groups in parallel

Mapping of global work size to work-groups


OpenCL Data-Parallel Execution
Beyond Programmable Shading: Fundamentals

Data-parallel execution model must be implemented


by all OpenCL compute devices

Some compute devices such as CPUs can also


execute task-parallel compute kernels
!
Executes as a single work-item
!
A compute kernel written in OpenCL
!
A native C / C++ function
OpenCL Task-Parallel Execution
Beyond Programmable Shading: Fundamentals
OpenCL Memory Model

Implements a relaxed
consistency, shared memory
model

Multiple distinct address spaces


!
Address spaces can be collapsed
Beyond Programmable Shading: Fundamentals
Compute Unit 1
Private
Memory
Private
Memory
WorkItem
1
WorkItem
M
Compute Unit N
Private
Memory
Private
Memory
WorkItem
1
WorkItem
M
OpenCL Memory Model

Implements a relaxed
consistency, shared memory
model

Multiple distinct address spaces


!
Address spaces can be collapsed
!
Address Qualiers
!
__private
Beyond Programmable Shading: Fundamentals
Compute Unit 1
Private
Memory
Private
Memory
WorkItem
1
WorkItem
M
Compute Unit N
Private
Memory
Private
Memory
WorkItem
1
WorkItem
M
Local Memory Local Memory
OpenCL Memory Model

Implements a relaxed
consistency, shared memory
model

Multiple distinct address spaces


!
Address spaces can be collapsed
!
Address Qualiers
!
__private
!
__local
Beyond Programmable Shading: Fundamentals
Compute Device
Compute Unit 1
Private
Memory
Private
Memory
WorkItem
1
WorkItem
M
Compute Unit N
Private
Memory
Private
Memory
WorkItem
1
WorkItem
M
Local Memory Local Memory
Global / Constant Memory Data Cache
Compute Device Memory
Global Memory
OpenCL Memory Model

Implements a relaxed
consistency, shared memory
model

Multiple distinct address spaces


!
Address spaces can be collapsed
!
Address Qualiers
!
__private
!
__local
!
__constant and __global
!
Example:
!
__global oat4 *p;
Beyond Programmable Shading: Fundamentals

Derived from ISO C99

A few restrictions
!
Recursion, function pointers, functions in C99
standard headers ...

Preprocessing directives dened by C99 are


supported

Built-in Data Types


!
Scalar and vector data types
!
Structs, Pointers
!
Data-type conversion functions
!
convert_type<_sat><_roundingmode>
!
Image types
Language for writing compute
Beyond Programmable Shading: Fundamentals
Language for writing compute
Beyond Programmable Shading: Fundamentals

Built-in Functions Required


!
work-item functions
!
math.h
!
read and write image
!
relational
!
geometric functions
!
synchronization functions
Language for writing compute
Beyond Programmable Shading: Fundamentals

Built-in Functions Required


!
work-item functions
!
math.h
!
read and write image
!
relational
!
geometric functions
!
synchronization functions

Built-in Functions Optional


!
double precision
!
atomics to global and local memory
!
selection of rounding mode
Language for writing compute
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API
// create a compute context with GPU device
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API
// create a compute context with GPU device
context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API
// create a compute context with GPU device
context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);
// create a work-queue
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API
// create a compute context with GPU device
context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);
// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API
// create a compute context with GPU device
context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);
// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);
// allocate the buffer memory objects
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API
// create a compute context with GPU device
context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);
// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);
// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context,
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API
// create a compute context with GPU device
context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);
// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);
// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
sizeof(float)*2*num_entries, srcA);
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API
// create a compute context with GPU device
context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);
// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);
// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
sizeof(float)*2*num_entries, srcA);
memobjs[1] = clCreateBuffer(context,
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API
// create a compute context with GPU device
context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);
// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);
// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
sizeof(float)*2*num_entries, srcA);
memobjs[1] = clCreateBuffer(context,
CL_MEM_READ_WRITE,
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API
// create a compute context with GPU device
context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);
// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);
// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
sizeof(float)*2*num_entries, srcA);
memobjs[1] = clCreateBuffer(context,
CL_MEM_READ_WRITE,
sizeof(float)*2*num_entries, NULL);
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API
// create the compute program
program = clCreateProgramFromSource(context, 1,
&fft1D_1024_kernel_src, NULL);
// build the compute program executable
clBuildProgramExecutable(program, false, NULL, NULL);
// create the compute kernel
kernel = clCreateKernel(program, fft1D_1024);
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Host API
// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1,
global_work_size,
local_work_size);
// set the args values
clSetKernelArg(kernel, 0, (void *)&memobjs[0],
sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 1, (void *)&memobjs[1],
sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 2, NULL,
sizeof(float)*(local_work_size[0]+1)*16, NULL);
clSetKernelArg(kernel, 3, NULL,
sizeof(float)*(local_work_size[0]+1)*16, NULL);
// execute kernel
clExecuteKernel(queue, kernel, NULL, range, NULL, 0, NULL);
Beyond Programmable Shading: Fundamentals
OpenCL FFT Example - Compute
// This kernel computes FFT of length 1024. The 1024 length FFT is decomposed into
// calls to a radix 16 function, another radix 16 function and then a radix 4 function
// Based on "Fitting FFT onto G80 Architecture". Vasily Volkov & Brian Kazian, UC Berkeley CS258 project report, May 2008
__kernel void fft1D_1024 (__global float2 *in, __global float2 *out,
__local float *sMemx, __local float *sMemy) {
int tid = get_local_id(0);
int blockIdx = get_group_id(0) * 1024 + tid;
float2 data[16];
// starting index of data to/from global memory
in = in + blockIdx; out = out + blockIdx;
globalLoads(data, in, 64); // coalesced global reads
fftRadix16Pass(data); // in-place radix-16 pass
twiddleFactorMul(data, tid, 1024, 0);
// local shuffle using local memory
localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));
fftRadix16Pass(data); // in-place radix-16 pass
twiddleFactorMul(data, tid, 64, 4); // twiddle factor multiplication
localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));
// four radix-4 function calls
fftRadix4Pass(data); fftRadix4Pass(data + 4);
fftRadix4Pass(data + 8); fftRadix4Pass(data + 12);
// coalesced global writes
globalStores(data, out, 64);
}
Beyond Programmable Shading: Fundamentals

Sharing OpenGL Resources


!
OpenCL is designed to e#ciently share with OpenGL
!
Textures, Bu"er Objects and Renderbu"ers
!
Data is shared, not copied

E#cient queuing of OpenCL and OpenGL commands

Apps can select compute device(s) that will run


OpenGL and OpenCL
OpenCL and OpenGL
Beyond Programmable Shading: Fundamentals

A new compute language that works across GPUs


and CPUs
!
C99 with extensions
!
Familiar to developers
!
Includes a rich set of built-in functions
!
Makes it easy to develop data- and task- parallel
compute programs

Denes hardware and numerical precision


requirements

Open standard for heterogeneous parallel computing


Summary

You might also like