Opencl: Parallel Computing On The Gpu and Cpu

OpenCL
Parallel Computing on the GPU and

CPU
Aaftab Munshi
Opportunity: Processor
•Today’s processors are increasingly parallel
•CPUs
■ Multiple cores are driving performance increases
•GPUs
■ Transforming into general purpose data-parallel
computational coprocessors
■ Improving numerical precision (single and double)
Beyond Programmable Shading: Fundamentals

Challenge: Processor Parallelism
•Writing parallel programs different for the CPU and
GPU
■ Differing domain-specific techniques
■ Vendor-specific technologies
•Graphics API is not an ideal abstraction for general
purpose compute

Introducing OpenCL
•OpenCL – Open Computing Language
•Approachable language for accessing heterogeneous
computational resources
•Supports parallel execution on single or multiple
processors
■ GPU, CPU, GPU + CPU or multiple GPUs
•Desktop and Handheld Profiles
•Designed to work with graphics APIs such as
OpenGL

OpenCL = Open Standard
•Specification under review
■ Royalty free, cross-platform, vendor neutral
■ Khronos OpenCL working group (www.khronos.org)
•Based on a proposal by Apple
■ Developed in collaboration with industry leaders
■ Performance-enhancing technology in Mac OS X Snow
Leopard

OpenCL Working Group Members
Broad Industry Support
© Copyright Khronos Group, 2008 - Page

OpenCL — A Sneak Preview
Design Goals of OpenCL
•Use all computational resources in system
■ GPUs and CPUs as peers
■ Data- and task- parallel compute model
•Efficient parallel programming model
■ Based on C
■ Abstract the specifics of underlying hardware
•Specify accuracy of floating-point computations
■ IEEE 754 compliant rounding behavior
■ Define maximum allowable error of math functions
•Drive future hardware requirements

OpenCL Software Stack
•Platform Layer
■ query and select compute devices in the system
■ initialize a compute device(s)
■ create compute contexts and work-queues
•Runtime
■ resource management
■ execute compute kernels
•Compiler
■ A subset of ISO C99 with appropriate language
additions
■ Compile and build compute program executables
■ online or offline
OpenCL Execution Model
•Compute Kernel
■ Basic unit of executable code — similar to a C
function
■ Data-parallel or task-parallel
•Compute Program
■ Collection of compute kernels and internal functions
■ Analogous to a dynamic library
•Applications queue compute kernel execution
instances
■ Queued in-order
■ Executed in-order or out-of-order
■ Events are used to implement appropriate

OpenCL Data-Parallel Execution
•Define N-Dimensional computation domain
■ Each independent element of execution in N-D
domain is called a work-item
■ The N-D domain defines the total number of work-
items that execute in parallel — global work size.
•Work-items can be grouped together — work-group
■ Work-items in group can communicate with each
other
■ Can synchronize execution among work-items in
group to coordinate memory access
•Execute multiple work-groups in parallel
•Mapping of global work size to work-groups

OpenCL Task-Parallel Execution
•Data-parallel execution model must be implemented
by all OpenCL compute devices
•Some compute devices such as CPUs can also
execute task-parallel compute kernels
■ Executes as a single work-item
■ A compute kernel written in OpenCL
■ A native C / C++ function

OpenCL Memory Model
•Implements a relaxed
consistency, shared memory
model
•Multiple distinct address spaces
■ Address spaces can be collapsed

OpenCL Memory Model
Private Private Private Private
model Memory Memory Memory Memory

WorkItem WorkItem WorkItem WorkItem
1 M 1 M

Compute Unit 1 Compute Unit N
■ Address Qualifiers
■ __private

OpenCL Memory Model

1 M 1 M

■ Address Qualifiers Local Memory Local Memory
■ __private
■ __local

OpenCL Memory Model

1 M 1 M

■ Address Qualifiers Local Memory Local Memory
■ __private Global / Constant Memory Data Cache
■ __local Compute Device
■ __constant and __global

Global Memory
■ Example:
Compute Device Memory
■ __global float4 *p;

Language for writing compute
•Derived from ISO C99
•A few restrictions
■ Recursion, function pointers, functions in C99
standard headers ...
•Preprocessing directives defined by C99 are
supported
•Built-in Data Types
■ Scalar and vector data types
■ Structs, Pointers
■ Data-type conversion functions
■ convert_type<_sat><_roundingmode>
■ Image types

•Built-in Functions — Required
■ work-item functions
■ math.h
■ read and write image
■ relational
■ geometric functions
■ synchronization functions

•Built-in Functions — Required
■ work-item functions
■ math.h
■ read and write image
■ relational
■ geometric functions
■ synchronization functions
•Built-in Functions — Optional
■ double precision
■ atomics to global and local memory
■ selection of rounding mode

OpenCL FFT Example - Host API

// create a compute context with GPU device

context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);

// create a work-queue

queue = clCreateWorkQueue(context, NULL, NULL, 0);

// allocate the buffer memory objects


memobjs[0] = clCreateBuffer(context,


CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
sizeof(float)*2*num_entries, srcA);




CL_MEM_READ_WRITE,


CL_MEM_READ_WRITE,
sizeof(float)*2*num_entries, NULL);


// create the compute program

program = clCreateProgramFromSource(context, 1,
&fft1D_1024_kernel_src, NULL);
// build the compute program executable

clBuildProgramExecutable(program, false, NULL, NULL);
// create the compute kernel

kernel = clCreateKernel(program, “fft1D_1024”);


// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1,
global_work_size,
local_work_size);
// set the args values

clSetKernelArg(kernel, 0, (void *)&memobjs[0],
sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 1, (void *)&memobjs[1],
sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 2, NULL,
sizeof(float)*(local_work_size[0]+1)*16, NULL);
clSetKernelArg(kernel, 3, NULL,
sizeof(float)*(local_work_size[0]+1)*16, NULL);
// execute kernel
clExecuteKernel(queue, kernel, NULL, range, NULL, 0, NULL);

OpenCL FFT Example - Compute
// This kernel computes FFT of length 1024. The 1024 length FFT is decomposed into
// calls to a radix 16 function, another radix 16 function and then a radix 4 function
// Based on "Fitting FFT onto G80 Architecture". Vasily Volkov & Brian Kazian, UC Berkeley CS258 project report, May 2008
__kernel void fft1D_1024 (__global float2 *in, __global float2 *out,
__local float *sMemx, __local float *sMemy) {
int tid = get_local_id(0);
int blockIdx = get_group_id(0) * 1024 + tid;
float2 data[16];
// starting index of data to/from global memory

in = in + blockIdx; out = out + blockIdx;
globalLoads(data, in, 64); // coalesced global reads

fftRadix16Pass(data); // in-place radix-16 pass
twiddleFactorMul(data, tid, 1024, 0);
// local shuffle using local memory

localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));
fftRadix16Pass(data); // in-place radix-16 pass
twiddleFactorMul(data, tid, 64, 4); // twiddle factor multiplication
localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));
// four radix-4 function calls
fftRadix4Pass(data); fftRadix4Pass(data + 4);
fftRadix4Pass(data + 8); fftRadix4Pass(data + 12);
// coalesced global writes

globalStores(data, out, 64);
}

OpenCL and OpenGL
•Sharing OpenGL Resources
■ OpenCL is designed to efficiently share with OpenGL
■ Textures, Buffer Objects and Renderbuffers
■ Data is shared, not copied
•Efficient queuing of OpenCL and OpenGL commands
•Apps can select compute device(s) that will run
OpenGL and OpenCL

Summary
•A new compute language that works across GPUs
and CPUs
■ C99 with extensions
■ Familiar to developers
■ Includes a rich set of built-in functions
■ Makes it easy to develop data- and task- parallel
compute programs
•Defines hardware and numerical precision
requirements
•Open standard for heterogeneous parallel computing

Opencl: Parallel Computing On The Gpu and Cpu

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Opencl: Parallel Computing On The Gpu and Cpu

Uploaded by

Copyright:

Available Formats

OpenCL

Parallel Computing on the GPU and

Beyond Programmable Shading: Fundamentals

Beyond Programmable Shading: Fundamentals

Beyond Programmable Shading: Fundamentals

Beyond Programmable Shading: Fundamentals

© Copyright Khronos Group, 2008 - Page

Beyond Programmable Shading: Fundamentals

Beyond Programmable Shading: Fundamentals

Beyond Programmable Shading: Fundamentals

Beyond Programmable Shading: Fundamentals

Beyond Programmable Shading: Fundamentals

Beyond Programmable Shading: Fundamentals

•Multiple distinct address spaces

■ Address spaces can be collapsed

Beyond Programmable Shading: Fundamentals

•Multiple distinct address spaces

■ Address spaces can be collapsed

■ Address Qualifiers Local Memory Local Memory

Beyond Programmable Shading: Fundamentals

•Multiple distinct address spaces

■ Address spaces can be collapsed

■ Address Qualifiers Local Memory Local Memory

■ __private Global / Constant Memory Data Cache

■ __local Compute Device

■ __constant and __global

Beyond Programmable Shading: Fundamentals

Beyond Programmable Shading: Fundamentals

Beyond Programmable Shading: Fundamentals

Beyond Programmable Shading: Fundamentals

Beyond Programmable Shading: Fundamentals

Beyond Programmable Shading: Fundamentals

Beyond Programmable Shading: Fundamentals

Beyond Programmable Shading: Fundamentals

Beyond Programmable Shading: Fundamentals

// allocate the buffer memory objects

Beyond Programmable Shading: Fundamentals

// allocate the buffer memory objects

Beyond Programmable Shading: Fundamentals

// allocate the buffer memory objects

Beyond Programmable Shading: Fundamentals

// allocate the buffer memory objects

Beyond Programmable Shading: Fundamentals

// allocate the buffer memory objects

Beyond Programmable Shading: Fundamentals

// allocate the buffer memory objects

Beyond Programmable Shading: Fundamentals

Beyond Programmable Shading: Fundamentals

// create the compute program

// build the compute program executable

// create the compute kernel

Beyond Programmable Shading: Fundamentals

Beyond Programmable Shading: Fundamentals

// set the args values

Beyond Programmable Shading: Fundamentals

// starting index of data to/from global memory

globalLoads(data, in, 64); // coalesced global reads

// local shuffle using local memory

// coalesced global writes

Beyond Programmable Shading: Fundamentals

Beyond Programmable Shading: Fundamentals

Beyond Programmable Shading: Fundamentals

You might also like

■ constant and global