Course Instructor
Dr. M. Rajeev Kumar
Professor (CSE)
UNIT V
CO Nos.: CO5
Course Outcome(s): Implement the CUDA programming model for parallel computing.
Level of learning domain (based on revised Bloom's): K3
CUDA programming model, basic principles of CUDA programming, concepts of threads and blocks, GPU and CPU data exchange.
GPU and CPU data exchange
Typical CUDA Program Flow
Step 1: Allocate host memory:
  int *data = (int *)malloc(n * sizeof(int));
Step 2: Allocate device memory and copy the input data to the device:
  cudaMalloc(&d_data, n * sizeof(int));
  cudaMemcpy(d_data, data, n * sizeof(int), cudaMemcpyHostToDevice);
Step 3: Launch the kernel:
  execute<<<… , …>>>(d_data);
Step 4: Copy the results back to the host:
  cudaMemcpy(data, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
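The four steps above can be sketched as one complete program. The kernel body, the element count n = 1024, and the launch configuration are illustrative assumptions, since the slides leave the kernel unspecified:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Example kernel (assumption): double each element in place
__global__ void execute(int *d_data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_data[i] *= 2;
}

int main(void)
{
    const int n = 1024;

    // Step 1: allocate host memory
    int *data = (int *)malloc(n * sizeof(int));
    for (int i = 0; i < n; ++i) data[i] = i;

    // Step 2: allocate device memory and copy host -> device
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemcpy(d_data, data, n * sizeof(int), cudaMemcpyHostToDevice);

    // Step 3: launch the kernel (256 threads per block)
    execute<<<n / 256, 256>>>(d_data);

    // Step 4: copy results device -> host, then clean up
    cudaMemcpy(data, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("data[10] = %d\n", data[10]);

    cudaFree(d_data);
    free(data);
    return 0;
}
```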
Types of Data Transfer in CUDA
• Peer-to-peer (P2P) transfers and Unified Virtual Addressing (UVA) can be used to both simplify and accelerate CUDA programs
• UVA: one address space for all CPU and GPU memory
o The physical memory location is determined from the pointer value
o Simplified binary interface – cudaMemcpy()
• P2P: faster memory copies between GPUs with less host overhead
GPU Direct – Overview
Asynchronous data transfer
• cudaMemcpy() is blocking – it does not return until the copy is complete
• cudaMemcpyAsync() is non-blocking – it returns immediately, so the CPU can perform useful computation in the meantime
• An asynchronous memcpy has two additional requirements:
o Pinned (page-locked) host memory
o A stream id
Asynchronous data transfer (Contd.)
• What is it good for?
o Overlapping CPU and GPU work
o Overlapping computation with memory transfers
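A minimal sketch of the pattern above: an asynchronous host-to-device copy issued into a stream, with the two stated requirements (pinned memory via cudaMallocHost, and a stream id) in place. The buffer size is an illustrative assumption:

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const int n = 1 << 20;
    int *h_data, *d_data;

    cudaMallocHost(&h_data, n * sizeof(int));   // pinned (page-locked) host memory
    cudaMalloc(&d_data, n * sizeof(int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Non-blocking copy: returns immediately
    cudaMemcpyAsync(d_data, h_data, n * sizeof(int),
                    cudaMemcpyHostToDevice, stream);

    // ... the CPU is free to do useful computation here while the copy runs ...

    cudaStreamSynchronize(stream);              // wait for the copy to finish

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```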
CUDA Streams
• What is it?
o A sequence of CUDA operations that execute in issue order on the GPU
• What is it good for?
o Operations from different streams may run concurrently
o Hiding memory latencies and working around memory size limitations
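The concurrency point above can be sketched with two streams, each processing half of a buffer; the copy issued in one stream may overlap with the kernel running in the other. The kernel "work" and the sizes are illustrative assumptions:

```cuda
#include <cuda_runtime.h>

// Example kernel (assumption): add 1 to each element
__global__ void work(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main(void)
{
    const int n = 1 << 20, half = n / 2;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));      // pinned, required for async copies
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s[2];
    for (int k = 0; k < 2; ++k) cudaStreamCreate(&s[k]);

    // Each stream handles one half of the data; within a stream the
    // operations run in issue order, but the two streams may overlap.
    for (int k = 0; k < 2; ++k) {
        int off = k * half;
        cudaMemcpyAsync(d + off, h + off, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        work<<<(half + 255) / 256, 256, 0, s[k]>>>(d + off, half);
        cudaMemcpyAsync(h + off, d + off, half * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();                    // wait for both streams

    for (int k = 0; k < 2; ++k) cudaStreamDestroy(s[k]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```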
CUDA Streams Synchronization
Explicit:
• cudaDeviceSynchronize()
o Blocks until all issued CUDA operations have finished
• cudaStreamSynchronize(stream)
o Blocks until all CUDA operations in the given stream have finished
• cudaEventRecord(event, stream1) + cudaStreamWaitEvent(stream2, event)
o Fine-grained synchronization between streams
Implicit:
• Page-locked memory allocation
o cudaMallocHost, cudaHostAlloc
• Device memory allocation
o cudaMalloc
• Blocking versions of memory operations
o cudaMemcpy, cudaMemset
• These operations implicitly synchronize all CUDA operations
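The event-based, fine-grained synchronization listed under "Explicit" can be sketched as follows: the kernel in stream2 does not start until the copy issued in stream1 has completed. The kernel "consume" is an illustrative assumption:

```cuda
#include <cuda_runtime.h>

// Example kernel (assumption): double each element
__global__ void consume(int *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2;
}

int main(void)
{
    const int n = 1 << 16;
    int *h, *d;
    cudaMallocHost(&h, n * sizeof(int));        // pinned host memory
    cudaMalloc(&d, n * sizeof(int));

    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    cudaEvent_t event;
    cudaEventCreate(&event);

    // Issue the copy in stream1, then record the event in stream1
    cudaMemcpyAsync(d, h, n * sizeof(int), cudaMemcpyHostToDevice, stream1);
    cudaEventRecord(event, stream1);

    // stream2 waits for the event before its kernel may run
    cudaStreamWaitEvent(stream2, event, 0);
    consume<<<(n + 255) / 256, 256, 0, stream2>>>(d, n);

    cudaDeviceSynchronize();

    cudaEventDestroy(event);
    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```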