Vishal Mehta, DevTech Compute | GTC Fall, September 2022
Hopper Architecture
H100 GPU Key Features
• Confidential Computing
• PCIe Gen5
• Larger 50 MB L2
• 132 SMs with 4th Gen Tensor Cores
• Thread Block Clusters
• 4th Gen NVLink, 900 GB/s total bandwidth
See also:
Enabling Hopper-Specific Optimizations in CUDA Applications (A41147)
CUDA: New Features and Beyond (A41100)
Hopper Architecture
H100 Streaming Multiprocessor Key Features
• Asynchronous CUDA operations
Scaling the CUDA Programming Model
• All thread blocks together form the CUDA grid, i.e., the CUDA kernel.
[Diagram: threads group into thread blocks, each with its own shared memory; the thread blocks form the grid. Kepler GK110 had 15 SMs; H100 has 132 SMs, grouped into GPCs.]
Thread and Memory Hierarchy
Introducing Thread Block Clusters in Hopper

• Thread Block Cluster introduces a new, optional level of hierarchy in the CUDA programming model.
• H100 can support up to 16 thread blocks, or 16384 threads, per cluster.
• Annotate kernels with a compile-time cluster size; the kernel launch is done in the classical way (<<< , >>>).
• Alternatively, launch using the CUDA Extensible Kernel Launch API.

[Diagram: thread block clusters, each grouping thread blocks and their shared memories, above global memory.]

namespace cg = cooperative_groups;
auto block = cg::this_thread_block();
cg::cluster_group cluster = cg::this_cluster();
<..>
<..>
cluster.sync();
if (threadIdx.x == 0)
    *remote_smem = 10; // Store to remote memory

// Launch via extensible launch API
{
    cudaLaunchConfig_t config = {0};
    cudaLaunchAttribute attribute[1];
    attribute[0].id = cudaLaunchAttributeClusterDimension;
    attribute[0].val.clusterDim.x = 2; // 2 blocks in X
    attribute[0].val.clusterDim.y = 2; // 2 blocks in Y
    attribute[0].val.clusterDim.z = 1;
    config.attrs = attribute;
    config.numAttrs = 1;
    const int clusterSize = 2 * 2;
    config.gridDim = numClusters * clusterSize;
    config.blockDim = numThreads;
    cudaLaunchKernelEx(&config, (void*)clusterKernel,
                       param1, param2, ...);
}

Disclaimer: Preliminary CUDA API, subject to change
Histogram Example
Histogram Computing in Distributed Shared Memory
[Diagram: the N histogram bins are split as N/2 bins per thread block, held in the distributed shared memory of the cluster.]
Histogram Example
Histogram Performance: A100 vs H100, up to 3.0x
[Bar chart, speedup relative to A100: A100 = 1x, H100 = 1.37x, H100 with Clusters = 2.2x.]
ASYNCHRONOUS SIMT PROGRAMMING
Asynchronous Transaction Barrier
[Diagram, built up across slides: each thread Arrives at the barrier, performs Independent Work, then Waits and Consumes Data once the barrier completes.]
ASYNCHRONOUS SIMT PROGRAMMING
Tensor Memory Accelerator: Efficient Copy of Tensor Memory
• Fire-and-forget from a single thread – everything is handled by the TMA.
• No iteration or bounds-checking code required.
• Saves the registers that would otherwise do loop calculations and boundary checks.
[Diagram: a tile of block width × block height copied out of a tensor of tensor width × tensor height.]
Fully asynchronous with threads; uses the Asynchronous Transaction Barrier to signal completion.

[Diagram, built up across slides: an H100 SM with shared memory, registers, barrier, TMA unit, and L1 cache over HBM. Steps: (1) initialize the barrier, (2) collectively issue the asynchronous copy to the TMA unit, (3) arrive on the barrier, (4) do independent work while the TMA writes into shared memory, (5) data arrives and the wait completes.]

extern __shared__ int smem[];
namespace cg = cooperative_groups;

__shared__ cuda::barrier<cuda::thread_scope_block> barrier;
auto block = cg::this_thread_block();
if (block.thread_rank() == 0) {
    init(&barrier, block.size());
}
block.sync();

// Collectively issue asynchronous copy operation
cuda::memcpy_async(block, smem, gmem,
                   block.size() * sizeof(int), // int/thread
                   barrier);                   // Barrier

auto token = barrier.arrive();
// <independent work>
barrier.wait(std::move(token));
Execution Timeline
[Diagram: per-thread-block timeline — the TMA unit streams data from HBM through the L1 cache into shared memory while threads do independent work in registers, until "Data arrived" is signaled on the barrier.]
ASYNCHRONOUS SIMT PROGRAMMING
Producer/consumer synchronization with transaction barriers, step by step:
1. Initialize the barrier with producer and consumer thread counts.
2. Consumers arrive and wait for data from the producers.
3. Producers issue memcpy_async into the consumers' shared memory, using the consumer barrier.
4. Producers perform a barrier arrival on the barrier in the consumer thread block.
5. Consumers remain blocked until all data has arrived, as well as all producer threads.
6. Once all threads and all data have arrived, the consumers are unblocked.
Machine Learning Classification using SVM on Hopper
Support Vector Machine (SVM)
Large Margin Classifier
Support Vector Machine (SVM)
Algorithm
[Diagram: over N iterations, the thread blocks of a cluster compute a cluster-wide minimum, synchronizing with barrier.wait().]

Support Vector Machine (SVM)
One-sided Communication with Barriers
[Diagram: thread blocks write results directly into each other's shared memory and synchronize with barrier.wait().]

[Bar chart, SVM speedup relative to A100: A100 = 1x, H100 = 1.3x, H100 + Thread Block Cluster = up to 4x.]
[Diagram: inputs are two sequences (columns T G T T A C G G); the output is a cost matrix filled cell by cell with the recurrence below, shown with simplified calculations and optional 0 clamping.]

new value = max(diagonal cell + 3,    // match reward
                diagonal cell - 3,    // mis-match cost
                vertical cell - 2,    // insertion cost
                horizontal cell - 2,  // deletion cost
                0)                    // clamp to a min of zero
Dynamic Programming
CUDA Math Functions for Dynamic Programming

D = MAX(A + B, C, 0)  — INT32/INT16 datatypes, optional 0 clamping
D = MAX(A, B, C, 0)   — INT32/INT16 datatypes, optional 0 clamping

[Diagram: two-lane datapath — inputs A0 A1, B0 B1, C0 C1 feed an ADD then a MAX tree (left operation) or a MAX tree directly (right operation), producing D0 D1.]
Dynamic Programming
Accelerating Genomics on H100 by over 8x compared to A100
[Bar chart, Performance (TCUPS), relative to A100 (int32): H100 (int32) = 1.9x, H100 (int16 DPX) = 4.4x.]
Asynchronous Operations