
Department of Computer Science and Engineering

COURSE CODE – TITLE: 1021CS129 – MODERN COMPUTER ARCHITECTURE
SUMMER SEMESTER 2023-24

Course Instructor
Dr. M. Rajeev Kumar
Professor (CSE)
UNIT V
CO Nos. | Course Outcome(s)                                             | Level of learning domain (based on revised Bloom's)
CO5     | Implement the CUDA programming model for parallel computing. | K3

Unit 5: High Performance Computing with CUDA (9 Hours)

CUDA programming model, basic principles of CUDA programming, concepts of threads and blocks, GPU and CPU data exchange.

GPU and CPU data exchange
Typical CUDA Program Flow

// Step 1: allocate host memory
int *data = (int *)malloc(n * sizeof(int));

// Step 2: allocate device memory and copy the input to the device
int *d_data;
cudaMalloc(&d_data, n * sizeof(int));
cudaMemcpy(d_data, data, n * sizeof(int), cudaMemcpyHostToDevice);

// Step 3: launch the kernel on the device
execute<<<… , …>>>(d_data);

// Step 4: copy the result back to the host
cudaMemcpy(data, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
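
Putting the four steps together, a minimal end-to-end sketch (the kernel body, the problem size n = 1024, and the 256-thread launch configuration are illustrative assumptions, not part of the original slides):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical kernel: each thread doubles one element of the array.
__global__ void execute(int *d_data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_data[i] = 2 * d_data[i];
}

int main(void) {
    const int n = 1024;                                   // assumed problem size
    int *data = (int *)malloc(n * sizeof(int));           // Step 1: host memory
    for (int i = 0; i < n; i++) data[i] = i;

    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));                 // Step 2: device memory
    cudaMemcpy(d_data, data, n * sizeof(int), cudaMemcpyHostToDevice);

    execute<<<n / 256, 256>>>(d_data);                    // Step 3: 4 blocks x 256 threads

    cudaMemcpy(data, d_data, n * sizeof(int), cudaMemcpyDeviceToHost); // Step 4

    printf("data[1] = %d\n", data[1]);                    // expect 2
    cudaFree(d_data);
    free(data);
    return 0;
}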
Types of Data Transfer in CUDA

• Pageable and Pinned
• Explicit and Implicit (UVM)
• Peer to Peer (between GPUs of the same host)
• GPUDirect (between GPU and network interface)
• Synchronous and asynchronous
Pageable and Pinned memory transfer

[Figure: Wave13pt kernel benchmark – pageable vs. pinned memory]
Summary
• Pageable memory – user memory space; transfers require an extra staging copy through a pinned driver buffer
• Pinned memory – kernel memory space
• Pinned memory performs better (higher bandwidth)
• Do not over-allocate pinned memory – it reduces the amount of physical memory available to the OS
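
A minimal sketch contrasting the two allocation paths (the 1 MiB buffer size is an assumption):

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    const size_t bytes = 1 << 20;   // assumed 1 MiB buffer

    // Pageable host memory: the driver stages the transfer through an
    // internal pinned buffer, costing an extra copy.
    int *pageable = (int *)malloc(bytes);

    // Pinned (page-locked) host memory: the DMA engine can access it
    // directly, so cudaMemcpy achieves higher bandwidth.
    int *pinned;
    cudaMallocHost(&pinned, bytes);

    int *d_buf;
    cudaMalloc(&d_buf, bytes);
    cudaMemcpy(d_buf, pageable, bytes, cudaMemcpyHostToDevice); // staged copy
    cudaMemcpy(d_buf, pinned,   bytes, cudaMemcpyHostToDevice); // direct DMA

    cudaFree(d_buf);
    cudaFreeHost(pinned);  // pinned memory has its own free call
    free(pageable);
    return 0;
}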
Unified memory
Unified memory – Usage
Unified memory – Use Case

[Figures: Wave13pt kernel – UVM vs. the old memory-mapping system]

Simplified memory transfers: UVM
• How does UVM perform in the case of multi-threading?
  o UVM uses an internal critical section (CS) – threads are serialized, causing performance degradation
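
A minimal sketch of the managed-memory style (the kernel and the size n are illustrative assumptions): a single cudaMallocManaged pointer is valid on both host and device, so the explicit cudaMemcpy calls disappear.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

int main(void) {
    const int n = 1024;                        // assumed problem size
    int *data;
    cudaMallocManaged(&data, n * sizeof(int)); // one pointer, visible to CPU and GPU
    for (int i = 0; i < n; i++) data[i] = i;   // host writes directly

    scale<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();                   // required before the host touches the data again

    printf("data[1] = %d\n", data[1]);         // expect 2
    cudaFree(data);
    return 0;
}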
UVM Summary

• Simplifies the programming model. But:
  o Performance issues on device-to-host (D → H) transfers
  o Critical-section serialization in multithreaded applications
• What could it still be good for?
Peer to Peer data transfer – Overview
Peer to Peer data transfer – Unified Virtual Addressing
P2P memory transfer – Usage
P2P memory transfer – Summary

• P2P and UVA can be used to both simplify and accelerate CUDA programs
• One address space for all CPU and GPU memory
  o The physical memory location is determined from the pointer value
  o Simplified binary interface – cudaMemcpy()
• Faster memory copies between GPUs with less host overhead
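
A minimal sketch of enabling P2P between two devices (the device IDs 0 and 1, and the buffer size, are assumptions; the host must have at least two GPUs):

#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 1 << 20;   // assumed 1 MiB buffer
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);  // can GPU 0 reach GPU 1 directly?

    float *d0, *d1;
    cudaSetDevice(0);
    cudaMalloc(&d0, bytes);
    if (canAccess) cudaDeviceEnablePeerAccess(1, 0);  // flags must be 0

    cudaSetDevice(1);
    cudaMalloc(&d1, bytes);

    // Direct GPU-to-GPU copy; with UVA, a plain cudaMemcpy with
    // cudaMemcpyDefault would also resolve the locations from the pointers.
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return 0;
}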
GPUDirect – Overview
Asynchronous data transfer
• cudaMemcpy() is blocking – it does not return until the memcopy is complete
• cudaMemcpyAsync() is non-blocking – it returns immediately, so the CPU can be utilized for useful computation
• Asynchronous memcopy has two additional requirements:
  o Pinned memory
  o A stream id
Asynchronous data transfer (Contd.,)
• What is it good for?
  o Overlapping CPU work with GPU communication
  o Overlapping computation and memory transfer (see the sketch below)
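
A minimal sketch of overlapping an asynchronous copy with CPU work (the buffer size and the cpu_work placeholder are assumptions); note both requirements from above, pinned memory and a stream id:

#include <cuda_runtime.h>

void cpu_work(void) { /* hypothetical useful CPU computation */ }

int main(void) {
    const size_t bytes = 1 << 20;    // assumed 1 MiB buffer

    int *h_buf;                      // requirement 1: pinned host memory
    cudaMallocHost(&h_buf, bytes);

    int *d_buf;
    cudaMalloc(&d_buf, bytes);

    cudaStream_t stream;             // requirement 2: a stream id
    cudaStreamCreate(&stream);

    // Returns immediately; the copy proceeds in the background.
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);

    cpu_work();                      // CPU does useful work while the DMA runs

    cudaStreamSynchronize(stream);   // wait for the copy before using d_buf

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}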
CUDA Streams
• What is it?
  o A sequence of CUDA operations that execute in issue-order on the GPU
• What is it good for?
  o Operations from different streams may run concurrently
  o Hiding memory latencies and memory size limitations
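
A minimal sketch of two streams working on independent chunks (the kernel and sizes are illustrative assumptions): within each stream the copy–kernel–copy chain runs in issue-order, but the two chains may overlap with each other.

#include <cuda_runtime.h>

__global__ void process(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main(void) {
    const int n = 1 << 20;                         // assumed elements per chunk
    float *h[2], *d[2];
    cudaStream_t s[2];

    for (int k = 0; k < 2; k++) {
        cudaMallocHost(&h[k], n * sizeof(float));  // pinned, required for async copies
        cudaMalloc(&d[k], n * sizeof(float));
        cudaStreamCreate(&s[k]);
    }

    for (int k = 0; k < 2; k++) {
        // Each stream's chain executes in issue-order; the two streams
        // may run concurrently, hiding transfer latency behind compute.
        cudaMemcpyAsync(d[k], h[k], n * sizeof(float), cudaMemcpyHostToDevice, s[k]);
        process<<<(n + 255) / 256, 256, 0, s[k]>>>(d[k], n);
        cudaMemcpyAsync(h[k], d[k], n * sizeof(float), cudaMemcpyDeviceToHost, s[k]);
    }

    cudaDeviceSynchronize();                       // wait for both streams

    for (int k = 0; k < 2; k++) {
        cudaStreamDestroy(s[k]);
        cudaFree(d[k]);
        cudaFreeHost(h[k]);
    }
    return 0;
}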
CUDA Streams Synchronization
Explicit:
• cudaDeviceSynchronize()
  o Blocks until all CUDA operations are finished
• cudaStreamSynchronize(stream)
  o Blocks until all CUDA operations in the given stream are finished
• cudaEventRecord(event, stream1), cudaStreamWaitEvent(stream2, event)
  o Fine-grained synchronization
Implicit:
• Page-locked memory allocation
  o cudaMallocHost, cudaHostAlloc
• Device memory allocation
  o cudaMalloc
• Blocking versions of memory operations
  o cudaMemcpy, cudaMemset
• These implicitly synchronize all CUDA operations
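
A minimal sketch of the event-based, fine-grained form (the producer/consumer kernels are illustrative assumptions): stream s2 waits only for the recorded event in s1, not for everything on the device.

#include <cuda_runtime.h>

__global__ void produce(float *d) { d[0] = 42.0f; }   // hypothetical producer
__global__ void consume(float *d) { d[0] += 1.0f; }   // hypothetical consumer

int main(void) {
    float *d_buf;
    cudaMalloc(&d_buf, sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    cudaEvent_t done;
    cudaEventCreate(&done);

    produce<<<1, 1, 0, s1>>>(d_buf);
    cudaEventRecord(done, s1);            // mark the point s2 must wait for

    cudaStreamWaitEvent(s2, done, 0);     // s2 stalls until 'done' fires in s1
    consume<<<1, 1, 0, s2>>>(d_buf);      // safe: the producer has finished

    cudaDeviceSynchronize();

    cudaEventDestroy(done);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_buf);
    return 0;
}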
