
A DEEP DIVE INTO THE LATEST HPC SOFTWARE

TIM COSTA | GROUP PRODUCT MANAGER, HPC AND QUANTUM COMPUTING


AGENDA
Accelerated Computing with Standard Languages

GPU Supercomputing in the PyData Ecosystem

Advancements in HPC Libraries

Advancements in Quantum Information Science



PROGRAMMING THE NVIDIA PLATFORM
CPU, GPU, and Network

ACCELERATED STANDARD LANGUAGES AND PLATFORM SPECIALIZATION

ISO C++
std::transform(par, x, x+n, y, y,
  [=](float x, float y){
    return y + a*x;
  }
);
...
matrix_product(par, mA, mB, mC);

ISO Fortran
do concurrent (i = 1:n)
  y(i) = y(i) + a*x(i)
enddo
...
C = matmul(A, B)

Python (cuNumeric)
import cunumeric as np
...
def saxpy(a, x, y):
    y[:] += a*x
...
c = np.matmul(a, b)

CUDA (platform specialization)
__global__
void saxpy(int n, float a, float *x, float *y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] += a*x[i];
}

int main(void) {
  ...
  cudaMemcpy(d_x, x, ...);
  cudaMemcpy(d_y, y, ...);
  saxpy<<<(N+255)/256,256>>>(...);
  cudaMemcpy(y, d_y, ...);
}
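For reference, a complete, self-contained version of the ISO C++ saxpy column is sketched below; the main() harness, vector sizes, and values are illustrative additions, and building with nvc++ -stdpar (which offloads the parallel algorithm to the GPU) is an assumed configuration rather than part of the slide.

// Self-contained ISO C++ saxpy sketch: y = a*x + y via std::transform.
#include <algorithm>
#include <execution>
#include <vector>
#include <cstdio>

int main() {
  const int n = 1 << 20;
  const float a = 2.0f;
  std::vector<float> x(n, 1.0f), y(n, 3.0f);

  // Standard parallel algorithm; with nvc++ -stdpar this runs on the GPU.
  std::transform(std::execution::par, x.begin(), x.end(), y.begin(), y.begin(),
                 [=](float xi, float yi) { return yi + a * xi; });

  std::printf("y[0] = %f\n", y[0]);
  return 0;
}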

ACCELERATION LIBRARIES
Core Math Communication Data Analytics AI Quantum

PILLARS OF STANDARD LANGUAGE PARALLELISM

Common Algorithms that Dispatch to Vendor-Optimized Parallel Libraries
Tools to Write Your Own Parallel Algorithms that Run Anywhere
Mechanisms for Composing Parallel Invocations into Task Graphs

sender auto
algorithm (sender auto s) {
  return s | bulk(N,
    [] (auto data) {
      // ...
    }
  ) | bulk(N,
    [] (auto data) {
      // ...
    }
  );
}
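The bulk pipeline above follows the proposed C++ senders & receivers design (P2300). As a hedged, minimal sketch of the same idea, the following uses NVIDIA's experimental stdexec reference implementation; the header paths, the exec::static_thread_pool scheduler, and the saxpy-style payload are assumptions taken from that library, not yet ISO C++.

// Minimal senders & receivers sketch using the experimental stdexec library
// (https://github.com/NVIDIA/stdexec); API names are subject to change.
#include <stdexec/execution.hpp>
#include <exec/static_thread_pool.hpp>
#include <cstdio>
#include <vector>

int main() {
  std::vector<float> x(1000, 1.0f), y(1000, 2.0f);
  const float a = 2.0f;

  exec::static_thread_pool pool(4);            // CPU thread-pool scheduler
  auto sched = pool.get_scheduler();

  auto work = stdexec::schedule(sched)         // start the chain on the pool
            | stdexec::bulk(x.size(),          // N independent invocations
                [&](std::size_t i) { y[i] += a * x[i]; })
            | stdexec::then([&] { return y[0]; });  // continue with a result

  auto [first] = stdexec::sync_wait(std::move(work)).value();
  std::printf("y[0] = %f\n", first);
  return 0;
}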
M-AIA WITH C++17 PARALLEL ALGORITHMS
Multi-physics simulation framework
from RWTH Aachen University
§ Hierarchical grids, complex moving geometries
§ Adaptive meshing, load balancing
§ Numerical methods: FV, DG, LBM, FEM, Level-Set, ...
§ Physics: aeroacoustics, combustion, biomedical, ...
§ Developed by ~20 PhDs (Mech. Eng.), ~500k LOC of C++
§ Programming model: MPI + ISO C++ parallelism

#pragma omp parallel // OpenMP parallel region
{
#pragma omp for // OpenMP for loop
for (MInt i = 0; i < noCells; i++) { // Loop over all cells
  if (timeStep % ipow2[maxLevel_ - clevel[i * distLevel]] == 0) { // Multi-grid loop
    const MInt distStartId = i * nDist; // Local offsets for 1D accesses
    const MInt distNeighStartId = i * distNeighbors;
    const MFloat* const distributionsStart = &distributions[distStartId];
    for (MInt j = 0; j < nDist - 1; j += 2) { // Loop over distributions, unrolled by a factor of 2
      if (neighborId[distNeighStartId + j] > -1) { // First unrolled iteration
        const MInt n1StartId = neighborId[distNeighStartId + j] * nDist;
        oldDistributions[n1StartId + j] = distributionsStart[j]; // 1D access, AoS format
      }
      if (neighborId[distNeighStartId + j + 1] > -1) { // Second unrolled iteration
        const MInt n2StartId = neighborId[distNeighStartId + j + 1] * nDist;
        oldDistributions[n2StartId + j + 1] = distributionsStart[j + 1];
      }
    }
    oldDistributions[distStartId + lastId] = distributionsStart[lastId]; // Zero-th distribution
  }
}
}

C++ with OpenMP


§ Composable, compact and elegant
§ Easy to read and maintain
§ ISO Standard
§ Portable: nvc++, g++, icpc, MSVC, …

std::for_each_n(par_unseq, start, noCells, [=](auto i) { // Parallel for over all cells
  if (timeStep % IPOW2[maxLevel_ - a_level(i)] != 0) // Multi-level loop
    return;
  for (MInt j = 0; j < nDist; ++j) {
    if (auto n = c_neighborId(i, j); n == -1) continue;
    a_oldDistribution(n, j) = a_distribution(i, j); // SoA or AoS memory layout
  }
});

Standard C++
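In the standard C++ version above, `start` stands in for an index-producing iterator from the M-AIA code base. The following self-contained sketch shows the same index-based for_each_n pattern; the use of thrust::counting_iterator and the toy data are assumptions for illustration, not M-AIA code.

// Index-based parallel loop in ISO C++; thrust::counting_iterator supplies
// the indices (any random-access counting iterator would work).
#include <algorithm>
#include <execution>
#include <vector>
#include <cstdio>
#include <thrust/iterator/counting_iterator.h>

int main() {
  const int n = 1 << 20;
  std::vector<float> in(n, 1.0f), out(n, 0.0f);

  std::for_each_n(std::execution::par_unseq,
                  thrust::counting_iterator<int>(0), n,
                  [inp = in.data(), outp = out.data()](int i) {
                    outp[i] = 2.0f * inp[i];   // per-cell work, addressed by index
                  });

  std::printf("out[0] = %f\n", out[0]);
  return 0;
}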
M-AIA
Multi-physics simulation framework developed at the Institute of Aerodynamics, RWTH Aachen University

Chart: relative speed-up for a decaying isotropic turbulence case with 400k fully-resolved particles.
OpenMP (2x EPYC 7742): 1.0; ISO C++ (2x EPYC 7742): 1.025; ISO C++ (A100): 8.74.
PARALLELISM IN C++ ROADMAP

• C++11: memory model, atomics, lambdas
• C++14: memory model enhancements, generic lambda expressions
• C++17: parallel algorithms, forward progress guarantees
• C++20: scalable synchronization library, atomics extensions, memory model clarifications, ranges, span
• C++ pipeline: mdspan, range-based parallel algorithms, linear algebra algorithms, asynchronous parallel algorithms, extended floating-point types, senders-receivers

Parallel algorithms are how users run C++ code on GPUs today. The scalable synchronization library was co-designed with V100 hardware. Mdspan brings N-dimensional loops and usability; the linear algebra algorithms extend the C++ interface to BLAS/LAPACK. Senders-receivers and the asynchronous parallel algorithms are the general, user-facing parallelism features, supporting custom algorithms, asynchrony, and control flow, with general usability of the performance provided by executors.
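As a hedged illustration of the mdspan item in the pipeline column, the sketch below uses the C++23 std::mdspan interface; it assumes a standard library that already ships <mdspan>, and the buffer and extents are illustrative.

// Non-owning multidimensional view over flat storage (C++23 std::mdspan).
#include <mdspan>
#include <vector>
#include <cstdio>

int main() {
  std::vector<double> storage(2 * 3, 0.0);

  std::mdspan grid(storage.data(), 2, 3);   // 2 x 3 view, no copy

  for (std::size_t i = 0; i < grid.extent(0); ++i)
    for (std::size_t j = 0; j < grid.extent(1); ++j)
      grid[i, j] = 10.0 * i + j;            // C++23 multidimensional subscript

  std::printf("grid[1, 2] = %.1f\n", grid[1, 2]);
  return 0;
}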



PILLARS OF STANDARD LANGUAGE PARALLELISM

Common Algorithms that Dispatch to Vendor-Optimized Parallel Libraries
Tools to Write Your Own Parallel Algorithms that Run Anywhere
Mechanisms for Composing Parallel Invocations into Task Graphs

sender auto
algorithm (sender auto s) {
  return s | bulk(N,
    [] (auto data) {
      // ...
    }
  ) | bulk(N,
    [] (auto data) {
      // ...
    }
  );
}

Today vs. with Senders & Receivers


SENDERS & RECEIVERS
Simplify Work Across CPUs and Accelerators

• Uniform abstraction between code and diverse resources
• ISO standard
• Write once, run everywhere

Maxwell's equations:

template <ComputeSchedulerT, WriteSchedulerT>
auto maxwell_eqs(ComputeSchedulerT &scheduler, WriteSchedulerT &writer)
{
  return repeat_n(
           n_outer_iterations,
           repeat_n(
             n_inner_iterations,
             schedule(scheduler)
               | bulk(grid.cells, update_h(accessor))
               | bulk(grid.cells, update_e(time, dt, accessor)))
             | transfer(writer)
             | then(dump_results(report_step, accessor)))
         | then([]{ printf("simulation complete\n"); });
}
ELECTROMAGNETISM
Raw performance & % of peak

std::sync_wait(maxwell(inline_scheduler, inline_scheduler));
std::sync_wait(maxwell(openmp_scheduler, inline_scheduler));
std::sync_wait(maxwell(cuda, inline_scheduler));

Chart: speedup (left) and efficiency vs. STREAM TRIAD (right) for the OpenMP-128, OpenMP-256, CUDA (1x A100), and CUDA (2x A100) schedulers.

§ CPUs: AMD EPYC 7742; GPUs: NVIDIA A100-SXM4-80
§ Inline (1 CPU HW thread), OpenMP-128 (1x CPU), OpenMP-256 (2x CPUs), Graph (1x GPU), Multi-GPU (2x GPUs)
§ clang-12 with -O3 -DNDEBUG -mtune=native -fopenmp
STRONG SCALING USING ISO STANDARD C++
PARALLEL ALGORITHMS AND SENDERS & RECEIVERS

Chart: strong scaling of the Maxwell senders & receivers code against the ideal scaling curve, with parallel efficiency on the secondary axis, plotted versus the number of GPUs (up to ~1,100).

NVIDIA SUPERPOD
§ 140x NVIDIA DGX-A100 640
§ 1120x NVIDIA A100-SXM4-80 GPUs

PALABOS CARBON SEQUESTRATION

Chart: strong scaling from 32 to 512 A100 GPUs.

§ Palabos is a framework for fluid dynamics simulations using Lattice-Boltzmann methods.
§ Code for multi-component flow through a porous medium, ported to C++ senders and receivers.
§ Application: simulating carbon sequestration in sandstone.
Christian Huber (Brown University), Jonas Latt (University of Geneva)
Georgy Evtushenko (NVIDIA), Gonzalo Brito (NVIDIA)
MODERN FORTRAN FEATURES FOR HPC
Standard Parallelism and Concurrency Features

Fortran 2018
• DO CONCURRENT: data-parallel loop construct with locality specifiers. Supported in nvfortran.
• Array intrinsics: various math intrinsics that may apply to entire arrays and map to accelerated libraries. Supported in nvfortran.
• Co-arrays: Partitioned Global Address Space arrays, teams of processes (images), collectives & synchronization. Awaiting F18.

Fortran 202X (coming in 2023)
• DO CONCURRENT reductions: support for reduction operations on concurrent loops (a la OpenACC/OpenMP). Supported in nvfortran since 21.11.

Fortran 202Y (in discussion)
• Atomics: proposed support for atomic variable accesses.
• Asynchronous tasking: proposed support for asynchronous tasks.
MINIWEATHER
Standard Language Parallelism in Climate/Weather Applications
MiniWeather
Mini-App written in C++ and Fortran that simulates
weather-like fluid flows using Finite Volume and
Runge-Kutta methods.
Existing parallelization in MPI, OpenMP, OpenACC, …
Included in the SPEChpc benchmark suite*
Open-source and commonly-used in training events.
https://github.com/mrnorman/miniWeather/

do concurrent (ll=1:NUM_VARS, k=1:nz, i=1:nx) &
    local(x,z,x0,z0,xrad,zrad,amp,dist,wpert)
  if (data_spec_int == DATA_SPEC_GRAVITY_WAVES) then
    x = (i_beg-1 + i-0.5_rp) * dx
    z = (k_beg-1 + k-0.5_rp) * dz
    x0 = xlen/8
    z0 = 1000
    xrad = 500
    zrad = 500
    amp = 0.01_rp
    dist = sqrt( ((x-x0)/xrad)**2 + ((z-z0)/zrad)**2 ) * pi / 2._rp
    if (dist <= pi / 2._rp) then
      wpert = amp * cos(dist)**2
    else
      wpert = 0._rp
    endif
    tend(i,k,ID_WMOM) = tend(i,k,ID_WMOM) + wpert*hy_dens_cell(k)
  endif
  state_out(i,k,ll) = state_init(i,k,ll) + dt * tend(i,k,ll)
enddo

Chart: MiniWeather performance for OpenMP (CPU), DO CONCURRENT (CPU), DO CONCURRENT (GPU), and OpenACC.
Source: HPC SDK 22.1, AMD EPYC 7742, NVIDIA A100. MiniWeather: NX=2000, NZ=1000, SIM_TIME=5. The OpenACC version uses the -gpu=managed option.
*SPEChpc is a trademark of the Standard Performance Evaluation Corporation.
POT3D: DO CONCURRENT
POT3D
POT3D is a Fortran application for approximating solar
coronal magnetic fields.
Included in the SPEChpc benchmark suite*
Existing parallelization in MPI & OpenACC
Optimized the DO CONCURRENT version by using
OpenACC solely for data motion and atomics
https://github.com/predsci/POT3D

!$acc enter data copyin(phi,dr_i)
!$acc enter data create(br)
do concurrent (k=1:np,j=1:nt,i=1:nrm1)
  br(i,j,k) = (phi(i+1,j,k) - phi(i,j,k)) * dr_i(i)
enddo
!$acc exit data delete(phi,dr_i,br)
Data courtesy of Predictive Science Inc. *SPEChpc is a trademark of The Standard Performance Evaluation Corporation
AGENDA
Accelerated Computing with Standard Languages

GPU Supercomputing in the PyData Ecosystem

Advancements in HPC Libraries

Advancements in Quantum Information Science


PRODUCTIVITY
Sequential and Composable Code

§ Sequential semantics: no visible parallelism or synchronization
§ Name-based global data: no partitioning
§ Composable: can combine with other libraries and datatypes

def cg_solve(A, b, conv_iters):
    x = np.zeros_like(b)
    r = b - A.dot(x)
    p = r
    rsold = r.dot(r)
    converged = False
    max_iters = b.shape[0]
    for i in range(max_iters):
        Ap = A.dot(p)
        alpha = rsold / (p.dot(Ap))
        x = x + alpha * p
        r = r - alpha * Ap
        rsnew = r.dot(r)
        if i % conv_iters == 0 and \
           np.sqrt(rsnew) < 1e-10:
            converged = i
            break
        beta = rsnew / rsold
        p = r + beta * p
        rsold = rsnew
PERFORMANCE
Transparent Acceleration

§ Transparently run at any scale needed to address computational challenges at hand


§ Automatically leverage all the available hardware

Diagram: scaling transparently from a single GPU, to a multi-GPU node, to a supercomputer built from Grace CPUs, GPUs, and DPUs.
MICROSCOPY WITH RICHARDSON-LUCY DECONVOLUTION

def richardson_lucy(image, psf, num_iter=50,
                    clip=True, filter_epsilon=None):
    float_type = _supported_float_type(image.dtype)
    image = image.astype(float_type, copy=False)
    psf = psf.astype(float_type, copy=False)
    im_deconv = np.full(image.shape, 0.5, dtype=float_type)
    psf_mirror = np.flip(psf)

    for _ in range(num_iter):
        conv = convolve(im_deconv, psf, mode='same')
        if filter_epsilon:
            with np.errstate(invalid='ignore'):
                relative_blur = np.where(conv < filter_epsilon, 0,
                                         image / conv)
        else:
            relative_blur = image / conv
        im_deconv *= convolve(relative_blur, psf_mirror,
                              mode='same')

    if clip:
        im_deconv[im_deconv > 1] = 1
        im_deconv[im_deconv < -1] = -1

    return im_deconv
COMPUTATIONAL FLUID DYNAMICS

CFD Python on cuNumeric!

• CFD codes like:
  • Shallow-Water Equation Solver
  • Oil Pipeline Risk Management: Geoclaw-landspill simulations
• Python Libraries: Jupyter, NumPy, SciPy, SymPy, Matplotlib

Chart: distributed NumPy performance (weak scaling), time in seconds for cuPy vs. Legate as the relative dataset size grows with the number of GPUs (1 to 1024).

for _ in range(iter):
    un = u.copy()
    vn = v.copy()

    b = build_up_b(rho, dt, dx, dy, u, v)
    p = pressure_poisson_periodic(b, nit, p, dx, dy)


Extracted from “CFD Python” course at https://github.com/barbagroup/CFDPython
Barba, Lorena A., and Forsyth, Gilbert F. (2018). CFD Python: the 12 steps to Navier-Stokes equations. Journal of
Open Source Education, 1(9), 21, https://doi.org/10.21105/jose.00021
MICRO-JOIN

cuDF + cuNumPy Micro-Join Benchmark (weak scaling), size = num_rows_per_gpu * num_gpus

~1,700 lines of MPI+CUDA vs. ~10 lines of cuDF + Legate:

key_l = np.arange(size)
val_l = np.random.randn(size)
lhs = pd.DataFrame({ "key": key_l, "val": val_l })

key_r = key_l // 3 * 3  # selectivity: 0.33
val_r = np.random.randn(size)
rhs = pd.DataFrame({ "key": key_r, "val": val_r })

out = lhs.merge(rhs, on="key")

Chart: time in seconds on 8, 16, 32, 64, and 128 GPUs comparing the ~1,700 LOC MPI+CUDA implementation with the ~10 LOC cuDF + Legate version.

Machine: DGX SuperPOD with A100-80GB GPUs


AGENDA
Accelerated Computing with Standard Languages

GPU Supercomputing in the PyData Ecosystem

Advancements in HPC Libraries

Advancements in Quantum Information Science


cuBLAS: GPU OPTIMIZED BLAS IMPLEMENTATION
Maximum Speedups of CTK 11.6U1 over CTK 11.1: Sizes < 2k

Recently Introduced
§ FP64 Tensor Core accelerated BLAS3
§ Improved heuristics for GEMV
§ Batched GEMV
§ Helper functions for improved error management

Chart: speedup over CTK 11.1 for SYRK, SYMM, and TRMM (double real) and SYRK, SYMM, TRMM, HEMM, and HERK (double complex).
* A100 80GB @ 1095 MHz: CTK 11.1 vs. CTK 11.6U1

[CWE41721] Connect with the Experts: NVIDIA Math Libraries
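To make the BLAS3 bullet concrete, here is a minimal, hedged host-side sketch of a cuBLAS DGEMM call; the sizes, data, and minimal error handling are illustrative additions, and on A100-class GPUs this FP64 GEMM can be accelerated by FP64 Tensor Cores. It is not the benchmark code behind the chart.

// C = alpha*A*B + beta*C with cuBLAS (column-major), illustrative sizes.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main() {
  const int n = 1024;
  const double alpha = 1.0, beta = 0.0;
  std::vector<double> hA(n * n, 1.0), hB(n * n, 2.0), hC(n * n, 0.0);

  double *dA, *dB, *dC;
  cudaMalloc((void**)&dA, sizeof(double) * n * n);
  cudaMalloc((void**)&dB, sizeof(double) * n * n);
  cudaMalloc((void**)&dC, sizeof(double) * n * n);
  cudaMemcpy(dA, hA.data(), sizeof(double) * n * n, cudaMemcpyHostToDevice);
  cudaMemcpy(dB, hB.data(), sizeof(double) * n * n, cudaMemcpyHostToDevice);

  cublasHandle_t handle;
  cublasCreate(&handle);
  cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &alpha, dA, n, dB, n, &beta, dC, n);

  cudaMemcpy(hC.data(), dC, sizeof(double) * n * n, cudaMemcpyDeviceToHost);
  std::printf("C[0] = %f (expected %f)\n", hC[0], 2.0 * n);

  cublasDestroy(handle);
  cudaFree(dA); cudaFree(dB); cudaFree(dC);
  return 0;
}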


cuTENSOR: GPU ACCELERATED TENSOR PRIMITIVES
Tensor Contraction Performance Improvements for Complex FP32 + TF32

Releasing cuTENSOR v1.5
• Improvements to tensor contractions, up to 33x
• Supporting more than 28 modes

Chart: tensor contraction performance (TFLOPS, larger is better) for v1.2.2 vs. v1.5 across 300 random cases (complex FP32 + TF32), measured on a DGX A100 80GB.
Multi-Node Multi-GPU FFT in cuFFT
Coming to HPC SDK 22.3

§ Distributed 2D/3D FFTs
§ Slab decomposition
§ Pencil decomposition (preview)
§ Helper functions: pencils <-> slabs

Charts: C2C and Z2Z performance in TFLOPS (larger is better), scaling from 8 to 4,096 GPUs for problem sizes (cubed) from 2,048 to 16,384, and from 32 to 2,048 GPUs for problem sizes (cubed) from 1,024 to 8,192; measured values reach roughly 1,860 TFLOPS at the largest scale.
* Selene: A100 80GB @ 1410 MHz
MATH LIBRARIES DEVICE EXTENSIONS
cuFFTDx Performance: Comparison with cuFFT across various sizes

Released in MathDx 22.02
§ Available on DevZone
§ Supports Volta+ architectures
§ FFT 1D sizes up to 32k

Future Releases
§ cuBLASDx / cuSOLVERDx
§ 2D/3D FFTs
§ Windows support

Chart: TFLOPS (larger is better) for cuFFTDx vs. cuFFT across 1D FFT sizes from 2 to 32,768.
* A100 80GB @ 1410 MHz
AGENDA
Accelerated Computing with Standard Languages

GPU Supercomputing in the PyData Ecosystem

Advancements in HPC Libraries

Advancements in Quantum Information Science


QC RESEARCH ROADMAP
Large improvements in qubit quantity and quality, and in error correction, are needed for wide adoption

Quantum Supremacy Threshold: experimental confirmation of a quantum speedup on a well-defined (not necessarily useful) problem. Qubits and quantum gates are very noisy, and the hardware is not very usable. Active research: can this be simulated efficiently on GPU supercomputers?

Noisy Intermediate-Scale Quantum (NISQ) Era: quantum gates are noisy, errors accumulate, and qubits lose coherence. QC hardware will mitigate errors by using tens to hundreds of redundant physical qubits per logical qubit. Active research: will NISQ machines reach quantum advantage on useful workloads?

Fault-Tolerant QC Era: 1000:1 to 10000:1 redundancy for error-corrected logical qubits [Fowler 2012][Reiher 2016]. Exponential speedups on a limited set of applications with hundreds to thousands of logical qubits (millions of physical qubits). Active research: what are the best error correction algorithms?

Chart: physical qubits (1 to 10,000,000, log scale) versus year (2010-2040), with the threshold for fault-tolerant quantum computing speedups marked.
GPU-BASED SUPERCOMPUTING IN THE QC ECOSYSTEM
Researching the Quantum Computers of Tomorrow with the Supercomputers of Today

QUANTUM CIRCUIT SIMULATION
Critical tool for answering today's most pressing questions in Quantum Information Science (QIS):
• What quantum algorithms are most promising for near-term or long-term quantum advantage?
• What are the requirements (number of qubits and error rates) to realize quantum advantage?
• What quantum processor architectures are best suited to realize valuable quantum applications?

HYBRID CLASSICAL/QUANTUM APPLICATIONS
Impactful QC applications (e.g., simulating quantum materials and systems) will require classical supercomputers with quantum co-processors.
• How can we accelerate HPC through hybrid classical/quantum algorithms?
TWO LEADING QUANTUM CIRCUIT SIMULATION APPROACHES

State Vector Simulation
• "Gate-based emulation of a quantum computer"
• Maintain the full 2^n-amplitude state vector in memory
• Update all amplitudes every time step; probabilistically sample the states for measurement
• Memory capacity and time grow exponentially with the number of qubits; the practical limit is around 50 qubits on a supercomputer
• Can model either ideal or noisy qubits

Tensor Networks
• "Only simulate the states you need"
• Uses tensor network contractions to dramatically reduce the memory needed to simulate circuits
• Can simulate 100s or 1000s of qubits for many practical quantum circuits

GPUs are a great fit for either approach

Tensor Network image from Quimb: https://quimb.readthedocs.io/en/latest/index.html
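To make the state-vector description concrete, here is a small, CPU-only illustrative sketch in plain C++ (not the cuQuantum API): the full 2^n amplitude vector is stored, and every amplitude pair is touched when a single-qubit gate is applied. The gate, qubit count, and layout are illustrative assumptions.

// Apply a single-qubit gate to a 2^n-amplitude state vector (CPU sketch).
#include <cmath>
#include <complex>
#include <cstdio>
#include <vector>

using amp = std::complex<double>;

void apply_1q_gate(std::vector<amp>& state, int q, const amp g[2][2]) {
  const std::size_t stride = std::size_t{1} << q;
  for (std::size_t i = 0; i < state.size(); i += 2 * stride)
    for (std::size_t j = i; j < i + stride; ++j) {
      const amp a0 = state[j], a1 = state[j + stride];
      state[j]          = g[0][0] * a0 + g[0][1] * a1;
      state[j + stride] = g[1][0] * a0 + g[1][1] * a1;
    }
}

int main() {
  const int n = 3;                                 // 3 qubits -> 8 amplitudes
  std::vector<amp> state(std::size_t{1} << n);
  state[0] = 1.0;                                  // start in |000>
  const double s = 1.0 / std::sqrt(2.0);
  const amp h[2][2] = {{s, s}, {s, -s}};           // Hadamard gate
  apply_1q_gate(state, 0, h);                      // H on qubit 0
  for (std::size_t i = 0; i < state.size(); ++i)
    std::printf("|%zu>: p = %.3f\n", i, std::norm(state[i]));
  return 0;
}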


cuQuantum

§ Platform for quantum computing research
§ Accelerate quantum circuit simulators on GPUs
§ Simulate ideal or noisy qubits
§ Enable algorithms research with scale and performance not possible on quantum hardware or on simulators today
§ cuQuantum GA release available now
§ Integrated into leading quantum computing frameworks: Cirq, Qiskit, and PennyLane
§ C and Python APIs
§ Available at developer.nvidia.com/cuquantum

Stack diagram: Quantum Computing Application -> Quantum Computing Frameworks (e.g., Cirq, Qiskit) -> Quantum Circuit Simulators (e.g., Qsim, Qiskit Aer) -> cuQuantum (cuStateVec, cuTensorNet, ...) -> GPU Accelerated Computing, alongside a QPU.


DGX Quantum Appliance Available Now
MULTI-GPU CONTAINER WITH CIRQ/QSIM/CUQUANTUM

• Full Quantum Simulation Stack
• Get started with QC research today!
• World class performance on key quantum algorithms
• Available on NGC
• See

cuQuantum
ACCELERATING THE QUANTUM COMPUTING ECOSYSTEM
Leading groups across academia and industry are jump-starting their Quantum Computing Research with cuQuantum

Quantum Computing Frameworks

Industry Partners

Supercomputing Partners
CUQUANTUM PARTNER HIGHLIGHTS
cuQuantum-accelerated research in quantum chemistry, climate modeling, quantum supremacy, and more

CLASSIQ
Quantum Algorithm Design

QC WARE
Quantum Chemistry
Rob Parrish - QPUs and GPUs: Practical Applications of NVIDIA Accelerators to Emulate and Co-process with Emerging Quantum Computers [S41805]

PASQAL
Quantum Machine Learning
Loic Henriet - Accelerating the Simulation of Neutral-atom Quantum Processors with GPUs [S41435]

RIGETTI COMPUTING
Quantum Machine Learning, Climate Modeling

CHINESE ACADEMY OF SCIENCES
Quantum Supremacy
Pan Zhang - Solving the Sampling Problem of the Sycamore Quantum Supremacy Circuits [S41488]
FUTURE DIRECTIONS
Programming NVIDIA Quantum-Classical Heterogeneous Architectures

§ Introducing NVQ++
§ State-of-the-art quantum-classical C++ compiler
§ Interoperable with existing parallel programming models
§ Implements a kernel-based approach that can be compiled to the Quantum Intermediate Representation (QIR)
§ Native support for cuQuantum backend emulation, extensible to vendor quantum computers

Diagram: application domains (pharmaceutical, chemistry, weather, finance, logistics) sit on top of a leading framework and the kernel-based quantum-classical programming model; the nvq++ system-level compiler toolchain lowers this onto GPU and quantum device drivers targeting NVIDIA GPUs, cuQuantum QPU emulation, and partner QPUs.

You might also like