CS6963
L2: Introduction to CUDA
January 14, 2009
Outline
• The programmer expresses:
- Thread programs to be launched on the GPU, and how to launch them
- Data organization and movement between host and GPU
- Synchronization, memory management, testing, …
What Programmer Expresses in CUDA
[Figure: HOST (CPU) and DEVICE (GPU), each with a processor (P) and memory (M), joined by an interconnect between devices and memories]
• Concurrency management
- E.g., __syncthreads(), as in the sketch below
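When threads in a block cooperate through shared memory, each thread must wait until every thread has finished writing before it reads values written by others. A minimal sketch of this use of __syncthreads() (the kernel name, array names, and the 256-thread bound are assumptions, not from the slides):

    __global__ void shift_sum(float *in, float *out)
    {
        __shared__ float buf[256];      /* one element per thread; assumes blockDim.x <= 256 */
        int t = threadIdx.x;
        int i = blockIdx.x * blockDim.x + t;
        buf[t] = in[i];                 /* each thread writes one element */
        __syncthreads();                /* barrier: all writes complete before any reads below */
        /* now safe to read an element written by a different thread */
        float right = (t + 1 < blockDim.x) ? buf[t + 1] : 0.0f;
        out[i] = buf[t] + right;
    }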
Minimal Extensions to C + API
• Declspecs
- global, device, shared, local, constant
• Keywords
- threadIdx, blockIdx
• Intrinsics
- __syncthreads
• Runtime API
- Memory, symbol, execution management

Example:

    __device__ float filter[N];

    __global__ void convolve (float *image)
    {
        __shared__ float region[M];
        ...
        region[threadIdx.x] = image[i];
        __syncthreads();
        ...
        image[j] = result;
    }

    // Allocate GPU memory
    void *myimage;
    cudaMalloc(&myimage, bytes);
Slide source: Austin Robison (NVIDIA)
NVCC Compiler's Role: Partition Code and Compile for Device

nvcc partitions mycode.cu: host-only code is compiled by the native compiler (gcc, icc, cc), while the kernel-launch interface and device code are compiled by the nvcc compiler.

mycode.cu:

    int main_data;
    __shared__ int sdata;

    main() { }

    __host__ hfunc () {
        int hdata;
        <<<gfunc(g,b,m)>>>();
    }

    __global__ gfunc() {
        int gdata;
    }

Host Only (compiled by native compiler: gcc, icc, cc):

    int main_data;
    main() { }
    __host__ hfunc () {
        int hdata;
    }

Interface (compiled by nvcc):

    <<<gfunc(g,b,m)>>>();

Device Only (compiled by nvcc):

    __shared__ int sdata;
    __global__ gfunc() {
        int gdata;
    }
CUDA Programming Model: A Highly Multithreaded Coprocessor
• The GPU is viewed as a compute device that:
- Is a coprocessor to the CPU or host
- Has its own DRAM (device memory)
- Runs many threads in parallel
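These characteristics can be inspected at run time through the CUDA runtime API; a small sketch (not from the slides) that queries device 0:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   /* fill in properties of device 0 */
        printf("Device: %s\n", prop.name);
        printf("Device DRAM: %lu bytes\n", (unsigned long) prop.totalGlobalMem);
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        return 0;
    }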
Thread Batching: Grids and Blocks

• A kernel is executed as a grid of thread blocks

[Figure: Host launches Grid 1 on Device; the grid contains blocks of threads, each thread carrying a 2-D ID from Thread (0, 0) through Thread (4, 1)]

• Thread and block IDs simplify memory addressing when processing multidimensional data
• In the running example (SIZE = 16 elements, one block of BLOCKSIZE = 4 threads), each thread examines every 4th element, as shown in the sketch after this list:
- threadIdx.x = 0 examines in_array elements 0, 4, 8, 12
- threadIdx.x = 1 examines in_array elements 1, 5, 9, 13
- threadIdx.x = 2 examines in_array elements 2, 6, 10, 14
- threadIdx.x = 3 examines in_array elements 3, 7, 11, 15
This is known as a cyclic data distribution.
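A sketch of this access pattern as a kernel, using the SIZE and BLOCKSIZE constants defined later in the lecture (the kernel name compute is an assumption):

    __global__ void compute(int *in_array, int *out_array)
    {
        int sum = 0;
        /* cyclic distribution: thread t reads elements t, t+4, t+8, t+12 */
        for (int i = threadIdx.x; i < SIZE; i += BLOCKSIZE)
            sum += in_array[i];
        out_array[threadIdx.x] = sum;    /* one partial result per thread */
    }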
CUDA Pseudo-Code
MAIN PROGRAM:
Initialization
• Allocate memory on host for input and output
• Assign random numbers to input array
Call host function
Calculate final output from per-thread output
Print result

HOST FUNCTION:
Allocate memory on device for copy of input and output
Copy input to device
Set up grid/block
Call global function
Copy device output to host
Main Program: Invoke Global Function
MAIN PROGRAM:
Initialization (OMIT)
• Allocate memory on host for input and output
• Assign random numbers to input array
Call host function
Calculate final output from per-thread output
Print result

    #include <stdio.h>
    #define SIZE 16
    #define BLOCKSIZE 4

    __host__ void outer_compute(int *in_arr, int *out_arr);

    int main(int argc, char **argv)
    {
        int *in_array, *out_array;
        /* initialization */ …
        outer_compute(in_array, out_array);
        …
    }
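The omitted initialization might look like the following sketch (heap allocation and the value range are assumptions; the slides only say random numbers are assigned):

    /* requires #include <stdlib.h> for malloc() and rand() */
    in_array  = (int *) malloc(SIZE * sizeof(int));
    out_array = (int *) malloc(BLOCKSIZE * sizeof(int));
    for (int i = 0; i < SIZE; i++)
        in_array[i] = rand() % 100;      /* assign random numbers to input array */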
Main Program: Calculate Output & Print Result
MAIN PROGRAM:
Initialization (OMIT)
• Allocate memory on host for input and output
• Assign random numbers to input array
Call host function
Calculate final output from per-thread output
Print result

    #include <stdio.h>
    #define SIZE 16
    #define BLOCKSIZE 4

    __host__ void outer_compute(int *in_arr, int *out_arr);

    int main(int argc, char **argv)
    {
        int *in_array, *out_array;
        int sum = 0;
        /* initialization */ …
        outer_compute(in_array, out_array);
        for (int i = 0; i < BLOCKSIZE; i++) {
            sum += out_array[i];
        }
        printf("Result = %d\n", sum);
    }
Host Function: Preliminaries & Allocation
HOST FUNCTION:
Allocate memory on device for copy of input and output
Copy input to device
Set up grid/block
Call global function
Copy device output to host

    __host__ void outer_compute(int *h_in_array, int *h_out_array)
    {
        int *d_in_array, *d_out_array;

        cudaMalloc((void **) &d_in_array, SIZE*sizeof(int));
        cudaMalloc((void **) &d_out_array, BLOCKSIZE*sizeof(int));
        …
    }
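cudaMalloc returns a cudaError_t, so each allocation can be checked; a sketch of one way to do so (the error handling is an addition, not shown on the slides):

    cudaError_t err = cudaMalloc((void **) &d_in_array, SIZE*sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        exit(1);
    }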
Host Function: Copy Data To/From Host
HOST FUNCTION:
Allocate memory on device for copy of input and output
Copy input to device
Set up grid/block
Call global function
Copy device output to host

    __host__ void outer_compute(int *h_in_array, int *h_out_array)
    {
        int *d_in_array, *d_out_array;

        cudaMalloc((void **) &d_in_array, SIZE*sizeof(int));
        …
    }
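The two copy steps map onto cudaMemcpy calls along these lines (a sketch of where they sit inside outer_compute):

    /* copy input to device */
    cudaMemcpy(d_in_array, h_in_array, SIZE*sizeof(int), cudaMemcpyHostToDevice);

    /* ... launch the kernel, then copy device output back to host ... */
    cudaMemcpy(h_out_array, d_out_array, BLOCKSIZE*sizeof(int), cudaMemcpyDeviceToHost);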
Host Function: Setup & Call Global Function
HOST FUNCTION:
Allocate memory on device for copy of input and output
Copy input to device
Set up grid/block
Call global function
Copy device output to host

    __host__ void outer_compute(int *h_in_array, int *h_out_array)
    {
        int *d_in_array, *d_out_array;

        cudaMalloc((void **) &d_in_array, SIZE*sizeof(int));
        …
    }
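Setting up the grid/block and calling the global function would look roughly like this (the kernel name compute and the single-block launch configuration are assumptions, not given above):

    /* set up grid/block: one block of BLOCKSIZE threads */
    dim3 dimGrid(1);
    dim3 dimBlock(BLOCKSIZE);

    /* call global function */
    compute<<<dimGrid, dimBlock>>>(d_in_array, d_out_array);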
Device Function
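A device function is a helper callable only from device code. A sketch of how the per-thread work might be factored into one (all names assumed):

    __device__ int sum_cyclic(int *in_array, int stride)
    {
        int sum = 0;
        /* each thread sums its cyclically-distributed elements */
        for (int i = threadIdx.x; i < SIZE; i += stride)
            sum += in_array[i];
        return sum;
    }

    __global__ void compute(int *d_in, int *d_out)
    {
        d_out[threadIdx.x] = sum_cyclic(d_in, BLOCKSIZE);   /* one partial sum per thread */
    }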
Reductions
• This type of computation is called a parallel reduction
- An operation is applied across a large data structure
- The computed result represents the aggregate solution across that data structure
- Large data structure → computed result (perhaps a single number) [dimensionality reduced]
• Why might parallel reductions be well-suited to GPUs?
• What if we tried to compute the final sum on the GPU? (next class and assignment)
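One common approach to summing on the GPU is a tree-shaped reduction in shared memory, halving the number of active threads each step; a sketch, assuming BLOCKSIZE is a power of two (this is a preview, not code from this lecture):

    __global__ void reduce_sum(int *d_in, int *d_out)
    {
        __shared__ int sdata[BLOCKSIZE];
        int t = threadIdx.x;
        sdata[t] = d_in[t];
        __syncthreads();
        /* tree reduction: 4 partial sums -> 2 -> 1 */
        for (int s = BLOCKSIZE / 2; s > 0; s >>= 1) {
            if (t < s)
                sdata[t] += sdata[t + s];
            __syncthreads();
        }
        if (t == 0)
            *d_out = sdata[0];   /* thread 0 holds the total */
    }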
Standard Parallel Construct
• Sometimes called “embarrassingly parallel” or “pleasingly parallel”
• Each thread is completely independent of the others
• Final result copied to CPU
• Another example, adding two matrices:
- A more careful examination of decomposing computation into grids and thread blocks
Another Example: Adding Two Matrices
CPU C program:

    void add_matrix_cpu(float *a, float *b, float *c, int N)
    {
        int i, j, index;
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                index = i + j*N;
                c[index] = a[index] + b[index];
            }
        }
    }

    int main() {
        .....
        add_matrix_cpu(a, b, c, N);
    }

CUDA C program:

    __global__ void add_matrix_gpu(float *a, float *b, float *c, int N)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        int j = blockIdx.y*blockDim.y + threadIdx.y;
        int index = i + j*N;
        if (i < N && j < N)
            c[index] = a[index] + b[index];
    }

    int main() {
        dim3 dimBlock(blocksize, blocksize);
        dim3 dimGrid(N/dimBlock.x, N/dimBlock.y);
        add_matrix_gpu<<<dimGrid, dimBlock>>>(a, b, c, N);
    }
Example source: Austin Robison, NVIDIA
Closer Inspection of Computation and Data Partitioning

• Define a 2-D set of blocks, and a 2-D set of threads per block:

    dim3 dimBlock(blocksize, blocksize);
    dim3 dimGrid(N/dimBlock.x, N/dimBlock.y);
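Note that N/dimBlock.x covers all the data only when N is a multiple of blocksize; a common fix is to round the grid size up and rely on the kernel's i < N && j < N guard (a sketch, not from the slides):

    dim3 dimBlock(blocksize, blocksize);
    dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,    /* ceiling division */
                 (N + dimBlock.y - 1) / dimBlock.y);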
Summary of Lecture
• Introduction to CUDA
• Essentially, a few extensions to C + API supporting heterogeneous data-parallel CPU+GPU execution
- Computation partitioning
- Data partitioning (parts of this implied by decomposition into threads)
- Data organization and management
- Concurrency management
• Two examples
- Parallel reduction
- Embarrassingly/pleasingly parallel computation
Next Week
• A few more details to prepare you for your first assignment
- More on synchronization for reductions
- More on decomposing into grids and thread blocks
- More on the run-time library
- Especially constructs to test for correct execution
- A little on debugging