Vishal Mehta, DevTech Compute | GTC Fall, September 2022
Hopper Architecture
H100 GPU Key Features
• Confidential Computing
• PCIe Gen5
• Larger 50 MB L2
• 132 SMs with 4th Gen Tensor Cores
• Thread Block Clusters
• 4th Gen NVLink, 900 GB/s total bandwidth
See also:
Enabling Hopper-Specific Optimizations in CUDA Applications (A41147)
CUDA: New Features and Beyond (A41100)
Hopper Architecture
H100 Streaming Multiprocessor Key Features
• Asynchronous CUDA operations
Scaling the CUDA Programming Model
• All thread blocks together form the CUDA grid, i.e., the CUDA kernel.
[Diagram: threads group into thread blocks, each with its own shared memory; the thread blocks form the grid. Kepler GK110 had 15 SMs; H100 has 132 SMs, grouped into GPCs.]
Thread and Memory Hierarchy
Introducing Thread Block Clusters in Hopper

• Thread Block Cluster introduces a new, optional level of hierarchy in the CUDA programming model.
• H100 can support up to 16 thread blocks, or 16384 threads, per cluster.
• Annotate kernels with a compile-time cluster size; the kernel launch is done in the classical way (<<< , >>>).
• Alternatively, launch using the CUDA Extensible Kernel Launch API.

[Diagram: thread block clusters, each grouping thread blocks and their shared memories, above global memory.]

namespace cg = cooperative_groups;
auto block = cg::this_thread_block();
cg::cluster_group cluster = cg::this_cluster();
<..>
<..>
cluster.sync();
if (threadIdx.x == 0)
    *remote_smem = 10; // Store to remote memory

// Launch via extensible launch API
{
    cudaLaunchConfig_t config = {0};
    cudaLaunchAttribute attribute[1];
    attribute[0].id = cudaLaunchAttributeClusterDimension;
    attribute[0].val.clusterDim.x = 2; // 2 blocks in X
    attribute[0].val.clusterDim.y = 2; // 2 blocks in Y
    attribute[0].val.clusterDim.z = 1;
    config.attrs = attribute;
    config.numAttrs = 1;
    const int clusterSize = 2 * 2;
    config.gridDim = numClusters * clusterSize;
    config.blockDim = numThreads;
    cudaLaunchKernelEx(&config, (void*)clusterKernel,
                       param1, param2, ...);
}

Disclaimer: Preliminary CUDA API, subject to change
Histogram Example
Histogram Computing in Distributed Shared Memory
[Diagram: the N histogram bins are split as N/2 bins per thread block, held in the distributed shared memory of the cluster.]
Histogram Example
Histogram Performance: A100 vs H100, up to 3.0x
[Bar chart, speedup relative to A100: A100 = 1x, H100 = 1.37x, H100 with Clusters = 2.2x.]
ASYNCHRONOUS SIMT PROGRAMMING
Asynchronous Transaction Barrier
[Diagram, built up across slides: each thread Arrives at the barrier, performs Independent Work, then Waits and Consumes Data once the barrier completes.]
ASYNCHRONOUS SIMT PROGRAMMING
Tensor Memory Accelerator: Efficient Copy of Tensor Memory
• Fire-and-forget from a single thread – everything is handled by the TMA.
• No iteration or bounds-checking code required.
• Saves the registers that would otherwise do loop calculations and boundary checks.
[Diagram: a tile of block width × block height copied out of a tensor of tensor width × tensor height.]
Fully asynchronous with threads; uses the Asynchronous Transaction Barrier to signal completion.

[Diagram, built up across slides: an H100 SM with shared memory, registers, barrier, TMA unit, and L1 cache over HBM. Steps: (1) initialize the barrier, (2) collectively issue the asynchronous copy to the TMA unit, (3) arrive on the barrier, (4) do independent work while the TMA writes into shared memory, (5) data arrives and the wait completes.]

extern __shared__ int smem[];
namespace cg = cooperative_groups;

__shared__ cuda::barrier<cuda::thread_scope_block> barrier;
auto block = cg::this_thread_block();
if (block.thread_rank() == 0) {
    init(&barrier, block.size());
}
block.sync();

// Collectively issue asynchronous copy operation
cuda::memcpy_async(block, smem, gmem,
                   block.size() * sizeof(int), // int/thread
                   barrier);                   // Barrier

auto token = barrier.arrive();
// <independent work>
barrier.wait(std::move(token));
Execution Timeline
[Diagram: per-thread-block timeline — the TMA unit streams data from HBM through the L1 cache into shared memory while threads do independent work in registers, until "Data arrived" is signaled on the barrier.]
ASYNCHRONOUS SIMT PROGRAMMING
Producer/consumer synchronization with transaction barriers, step by step:
1. Initialize the barrier with producer and consumer thread counts.
2. Consumers arrive and wait for data from the producers.
3. Producers issue memcpy_async into the consumers' shared memory, using the consumer barrier.
4. Producers perform a barrier arrival on the barrier in the consumer thread block.
5. Consumers remain blocked until all data has arrived, as well as all producer threads.
6. Once all threads and all data have arrived, the consumers are unblocked.
Machine Learning Classification using SVM on Hopper
Support Vector Machine (SVM)
Large Margin Classifier
Support Vector Machine (SVM)
Algorithm
[Diagram: over N iterations, the thread blocks of a cluster compute a cluster-wide minimum, synchronizing with barrier.wait().]

Support Vector Machine (SVM)
One-sided Communication with Barriers
[Diagram: thread blocks write results directly into each other's shared memory and synchronize with barrier.wait().]

[Bar chart, SVM speedup relative to A100: A100 = 1x, H100 = 1.3x, H100 + Thread Block Cluster = up to 4x.]
[Diagram: inputs are two sequences (columns T G T T A C G G); the output is a cost matrix filled cell by cell with the recurrence below, shown with simplified calculations and optional 0 clamping.]

new value = max(diagonal cell + 3,    // match reward
                diagonal cell - 3,    // mis-match cost
                vertical cell - 2,    // insertion cost
                horizontal cell - 2,  // deletion cost
                0)                    // clamp to a min of zero
Dynamic Programming
CUDA Math Functions for Dynamic Programming

D = MAX(A + B, C, 0)  — INT32/INT16 datatypes, optional 0 clamping
D = MAX(A, B, C, 0)   — INT32/INT16 datatypes, optional 0 clamping

[Diagram: two-lane datapath — inputs A0 A1, B0 B1, C0 C1 feed an ADD then a MAX tree (left operation) or a MAX tree directly (right operation), producing D0 D1.]
Dynamic Programming
Accelerating Genomics on H100 by over 8x compared to A100
[Bar chart, Performance (TCUPS), relative to A100 (int32): H100 (int32) = 1.9x, H100 (int16 DPX) = 4.4x.]
Asynchronous Operations