CUDA
Antonyus Pyetro do Amaral Ferreira
+ The problem
The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. The challenge is to develop application software that transparently scales its parallelism to leverage the increasing number of processor cores.
+ A solution
CUDA is a parallel computing architecture and programming model whose abstractions guide the programmer to partition a problem into sub-problems that threads can solve independently. A compiled CUDA program can therefore execute on any number of processor cores, and only the runtime system needs to know the physical processor count.
The GPU is especially well-suited to problems that can be expressed as data-parallel computations. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control.
+ Applications?
Many applications that process large data sets can use a data-parallel programming model, from general signal processing and physics simulation to computational finance and computational biology. The latest generation of NVIDIA GPUs, based on the Tesla architecture, supports the CUDA programming model.
+ What is CUDA?
CUDA extends C by allowing the programmer to define C functions, called kernels, that are executed N times in parallel by N different CUDA threads.
Each of the threads that execute a kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.
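A minimal sketch of such a kernel (the names and sizes here are illustrative, not from the slides; unified memory is used only to keep the host code short, and a CUDA-capable GPU is required):

```cuda
#include <cstdio>

// Kernel: executed N times in parallel by N CUDA threads.
// Each thread adds one pair of elements, selected by its thread ID.
__global__ void VecAdd(const float* A, const float* B, float* C)
{
    int i = threadIdx.x;   // unique thread ID within the block
    C[i] = A[i] + B[i];
}

int main()
{
    const int N = 256;
    float *A, *B, *C;
    cudaMallocManaged(&A, N * sizeof(float));
    cudaMallocManaged(&B, N * sizeof(float));
    cudaMallocManaged(&C, N * sizeof(float));
    for (int i = 0; i < N; ++i) { A[i] = i; B[i] = 2 * i; }

    VecAdd<<<1, N>>>(A, B, C);   // one block of N threads
    cudaDeviceSynchronize();

    printf("C[10] = %f\n", C[10]);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```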
+ Concurrency
Threads within a block can cooperate among themselves by sharing data through shared memory.
__syncthreads() acts as a barrier at which all threads in the block must wait before any are allowed to proceed.
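A short sketch of block-level cooperation through shared memory (an illustrative kernel, not from the slides; it assumes a launch with one block of 256 threads):

```cuda
// Threads in a block cooperate through shared memory.
// Each thread writes one element, then all wait at the barrier
// before reading an element written by a *different* thread.
__global__ void ReverseBlock(float* data)
{
    __shared__ float buf[256];           // one copy per thread block
    int i = threadIdx.x;
    buf[i] = data[i];
    __syncthreads();                     // barrier: all writes now visible to all threads
    data[i] = buf[blockDim.x - 1 - i];   // safe only because of the barrier
}
```

Without the __syncthreads() call, a thread could read buf entries that its partner thread has not yet written.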
+ Process Hierarchy
+ Memory Hierarchy
CUDA assumes that the CUDA threads may execute on a physically separate device that operates as a coprocessor to the host.
CUDA also assumes that both the host and the device maintain their own DRAM, referred to as host memory and device memory, respectively.
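The host/device memory split means data must be moved explicitly. A minimal host-side sketch (variable names are illustrative):

```cuda
#include <cstdlib>

int main()
{
    int n = 1024;
    size_t bytes = n * sizeof(float);

    float* h_A = (float*)malloc(bytes);   // host memory (CPU DRAM)
    float* d_A = nullptr;
    cudaMalloc(&d_A, bytes);              // device memory (GPU DRAM)

    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);   // host -> device
    // ... launch kernels that read and write d_A here ...
    cudaMemcpy(h_A, d_A, bytes, cudaMemcpyDeviceToHost);   // device -> host

    cudaFree(d_A);
    free(h_A);
    return 0;
}
```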
+ Software Stack
The stack is layered: application, CUDA libraries, CUDA runtime, and the CUDA driver, which communicates with the device.
+ Language Extensions
Function type qualifiers specify whether a function executes on the host or on the device and whether it is callable from the host or from the device.
Variable type qualifiers specify the memory location of a variable on the device.
A new directive to specify how a kernel is executed on the device from the host:
vecAdd<<<1, N>>>(A, B, C);
Four built-in variables that specify the grid and block dimensions and the block and thread indices.
__global__ Executed on the device. Callable from the host only.
__host__ Executed on the host. Callable from the host only. Default type.
__device__ Executed on the device. Callable from the device only.
__constant__ Resides in constant memory space. Accessible from all the threads within the grid.
__shared__ Resides in the shared memory space of a thread block. Only accessible from all the threads within the block.
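The qualifiers above can be sketched in one fragment (all names here are hypothetical, chosen only to show where each qualifier appears):

```cuda
__constant__ float coeffs[16];            // constant memory, readable by the whole grid

__device__ float square(float x)          // runs on the device, callable from device code
{
    return x * x;
}

__host__ __device__ float halve(float x)  // compiled for both host and device
{
    return 0.5f * x;
}

__global__ void Kernel(float* out)        // runs on the device, launched from the host
{
    __shared__ float tile[128];           // shared memory, one copy per thread block
    int i = threadIdx.x;
    tile[i] = square(coeffs[i % 16]);
    __syncthreads();
    out[i] = halve(tile[i]);
}
```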
+ Execution Configuration
Any call to a __global__ function must specify the execution configuration for that call.
The execution configuration defines the dimension of the grid and blocks that will be used to execute the function on the device.
<<< Dg, Db, Ns, S >>>
Dg (dim3): the dimension and size of the grid, such that Dg.x * Dg.y equals the number of blocks being launched; Dg.z is unused.
Db (dim3): the dimension and size of each block, such that Db.x * Db.y * Db.z equals the number of threads per block.
Ns (size_t): the number of bytes in shared memory dynamically allocated per block for this call, in addition to the statically allocated memory; this memory is used by any variable declared as an external array. Optional; defaults to 0.
S (cudaStream_t): the associated stream. Optional; defaults to 0.
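A sketch of a launch that uses all four arguments (an illustrative kernel; it assumes n is a multiple of 128 and that d_data points to device memory):

```cuda
extern __shared__ float scratch[];    // sized by Ns at launch time

__global__ void Scale(float* data, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[threadIdx.x] = data[i] * s;
    __syncthreads();
    data[i] = scratch[threadIdx.x];
}

void launch(float* d_data, int n, cudaStream_t stream)
{
    dim3 Dg(n / 128);                 // grid: n/128 blocks
    dim3 Db(128);                     // block: 128 threads
    size_t Ns = 128 * sizeof(float);  // dynamic shared memory per block
    Scale<<<Dg, Db, Ns, stream>>>(d_data, 2.0f);
}
```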
+ Built-in Variables
gridDim This variable is of type dim3 and contains the dimensions of the grid.
blockIdx This variable is of type uint3 and contains the block index within the grid.
blockDim This variable is of type dim3 and contains the dimensions of the block.
threadIdx This variable is of type uint3 and contains the thread index within the block.
In the matrix-multiplication example, each thread within the block is responsible for computing one element of the sub-matrix Csub.
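A simplified sketch of that pattern, using the built-in variables to locate each thread's output element (a naive version without the shared-memory tiling; wA and wB are the widths of A and B):

```cuda
// Each thread computes one element of C, found from its block
// and thread indices within the launch grid.
__global__ void MatMul(const float* A, const float* B, float* C,
                       int wA, int wB)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    float sum = 0.0f;
    for (int k = 0; k < wA; ++k)
        sum += A[row * wA + k] * B[k * wB + col];
    C[row * wB + col] = sum;   // one C element per thread
}
```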
+ CUDA Interoperability
CUDA interoperates with OpenGL and Direct3D: resources created through those graphics APIs can be mapped into CUDA's address space and read or written by kernels.