
A DEEP DIVE INTO THE LATEST HPC SOFTWARE

TIM COSTA | GROUP PRODUCT MANAGER, HPC AND QUANTUM COMPUTING


AGENDA
Accelerated Computing with Standard Languages

GPU Supercomputing in the PyData Ecosystem

Advancements in HPC Libraries

Advancements in Quantum Information Science



PROGRAMMING THE NVIDIA PLATFORM
CPU, GPU, and Network

ACCELERATED STANDARD LANGUAGES AND PLATFORM SPECIALIZATION

ISO C++
std::transform(par, x, x+n, y, y,
  [=](float x, float y){
    return y + a*x;
  }
);
...
matrix_product(par, mA, mB, mC);

ISO Fortran
do concurrent (i = 1:n)
  y(i) = y(i) + a*x(i)
enddo
...
C = matmul(A, B)

Python (cuNumeric)
import cunumeric as np
...
def saxpy(a, x, y):
    y[:] += a*x
...
c = np.matmul(a, b)

CUDA (platform specialization)
__global__
void saxpy(int n, float a, float *x, float *y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] += a*x[i];
}

int main(void) {
  ...
  cudaMemcpy(d_x, x, ...);
  cudaMemcpy(d_y, y, ...);
  saxpy<<<(N+255)/256,256>>>(...);
  cudaMemcpy(y, d_y, ...);
}
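For reference, a complete, self-contained version of the ISO C++ saxpy column is sketched below; the main() harness, vector sizes, and values are illustrative additions, and building with nvc++ -stdpar (which offloads the parallel algorithm to the GPU) is an assumed configuration rather than part of the slide.

// Self-contained ISO C++ saxpy sketch: y = a*x + y via std::transform.
#include <algorithm>
#include <execution>
#include <vector>
#include <cstdio>

int main() {
  const int n = 1 << 20;
  const float a = 2.0f;
  std::vector<float> x(n, 1.0f), y(n, 3.0f);

  // Standard parallel algorithm; with nvc++ -stdpar this runs on the GPU.
  std::transform(std::execution::par, x.begin(), x.end(), y.begin(), y.begin(),
                 [=](float xi, float yi) { return yi + a * xi; });

  std::printf("y[0] = %f\n", y[0]);
  return 0;
}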

ACCELERATION LIBRARIES
Core Math Communication Data Analytics AI Quantum

PILLARS OF STANDARD LANGUAGE PARALLELISM

Common Algorithms that Dispatch to Vendor-Optimized Parallel Libraries
Tools to Write Your Own Parallel Algorithms that Run Anywhere
Mechanisms for Composing Parallel Invocations into Task Graphs

sender auto
algorithm (sender auto s) {
  return s | bulk(N,
    [] (auto data) {
      // ...
    }
  ) | bulk(N,
    [] (auto data) {
      // ...
    }
  );
}
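The bulk pipeline above follows the proposed C++ senders & receivers design (P2300). As a hedged, minimal sketch of the same idea, the following uses NVIDIA's experimental stdexec reference implementation; the header paths, the exec::static_thread_pool scheduler, and the saxpy-style payload are assumptions taken from that library, not yet ISO C++.

// Minimal senders & receivers sketch using the experimental stdexec library
// (https://github.com/NVIDIA/stdexec); API names are subject to change.
#include <stdexec/execution.hpp>
#include <exec/static_thread_pool.hpp>
#include <cstdio>
#include <vector>

int main() {
  std::vector<float> x(1000, 1.0f), y(1000, 2.0f);
  const float a = 2.0f;

  exec::static_thread_pool pool(4);            // CPU thread-pool scheduler
  auto sched = pool.get_scheduler();

  auto work = stdexec::schedule(sched)         // start the chain on the pool
            | stdexec::bulk(x.size(),          // N independent invocations
                [&](std::size_t i) { y[i] += a * x[i]; })
            | stdexec::then([&] { return y[0]; });  // continue with a result

  auto [first] = stdexec::sync_wait(std::move(work)).value();
  std::printf("y[0] = %f\n", first);
  return 0;
}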
M-AIA WITH C++17 PARALLEL ALGORITHMS
Multi-physics simulation framework
from RWTH Aachen University
§ Hierarchical grids, complex moving geometries
§ Adaptive meshing, load balancing
§ Numerical methods: FV, DG, LBM, FEM, Level-Set, ...
§ Physics: aeroacoustics, combustion, biomedical, ...
§ Developed by ~20 PhDs (Mech. Eng.), ~500k LOC of C++
§ Programming model: MPI + ISO C++ parallelism

#pragma omp parallel // OpenMP parallel region
{
#pragma omp for // OpenMP for loop
for (MInt i = 0; i < noCells; i++) { // Loop over all cells
  if (timeStep % ipow2[maxLevel_ - clevel[i * distLevel]] == 0) { // Multi-grid loop
    const MInt distStartId = i * nDist; // Local offsets for 1D accesses
    const MInt distNeighStartId = i * distNeighbors;
    const MFloat* const distributionsStart = &distributions[distStartId];
    for (MInt j = 0; j < nDist - 1; j += 2) { // Loop over distributions, unrolled by a factor of 2
      if (neighborId[distNeighStartId + j] > -1) { // First unrolled iteration
        const MInt n1StartId = neighborId[distNeighStartId + j] * nDist;
        oldDistributions[n1StartId + j] = distributionsStart[j]; // 1D access, AoS format
      }
      if (neighborId[distNeighStartId + j + 1] > -1) { // Second unrolled iteration
        const MInt n2StartId = neighborId[distNeighStartId + j + 1] * nDist;
        oldDistributions[n2StartId + j + 1] = distributionsStart[j + 1];
      }
    }
    oldDistributions[distStartId + lastId] = distributionsStart[lastId]; // Zero-th distribution
  }
}
}

C++ with OpenMP


§ Composable, compact and elegant
§ Easy to read and maintain
§ ISO Standard
§ Portable: nvc++, g++, icpc, MSVC, …

std::for_each_n(par_unseq, start, noCells, [=](auto i) { // Parallel for over all cells
  if (timeStep % IPOW2[maxLevel_ - a_level(i)] != 0) // Multi-level loop
    return;
  for (MInt j = 0; j < nDist; ++j) {
    if (auto n = c_neighborId(i, j); n == -1) continue;
    a_oldDistribution(n, j) = a_distribution(i, j); // SoA or AoS memory layout
  }
});

Standard C++
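In the standard C++ version above, `start` stands in for an index-producing iterator from the M-AIA code base. The following self-contained sketch shows the same index-based for_each_n pattern; the use of thrust::counting_iterator and the toy data are assumptions for illustration, not M-AIA code.

// Index-based parallel loop in ISO C++; thrust::counting_iterator supplies
// the indices (any random-access counting iterator would work).
#include <algorithm>
#include <execution>
#include <vector>
#include <cstdio>
#include <thrust/iterator/counting_iterator.h>

int main() {
  const int n = 1 << 20;
  std::vector<float> in(n, 1.0f), out(n, 0.0f);

  std::for_each_n(std::execution::par_unseq,
                  thrust::counting_iterator<int>(0), n,
                  [inp = in.data(), outp = out.data()](int i) {
                    outp[i] = 2.0f * inp[i];   // per-cell work, addressed by index
                  });

  std::printf("out[0] = %f\n", out[0]);
  return 0;
}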
M-AIA
Multi-physics simulation framework developed at the Institute of Aerodynamics, RWTH Aachen University

Chart: relative speed-up for a decaying isotropic turbulence case with 400k fully-resolved particles.
OpenMP (2x EPYC 7742): 1.0; ISO C++ (2x EPYC 7742): 1.025; ISO C++ (A100): 8.74.
PARALLELISM IN C++ ROADMAP

• C++11: memory model, atomics, lambdas
• C++14: memory model enhancements, generic lambda expressions
• C++17: parallel algorithms, forward progress guarantees
• C++20: scalable synchronization library, atomics extensions, memory model clarifications, ranges, span
• C++ pipeline: mdspan, range-based parallel algorithms, linear algebra algorithms, asynchronous parallel algorithms, extended floating-point types, senders-receivers

Parallel algorithms are how users run C++ code on GPUs today. The scalable synchronization library was co-designed with V100 hardware. Mdspan brings N-dimensional loops and usability; the linear algebra algorithms extend the C++ interface to BLAS/LAPACK. Senders-receivers and the asynchronous parallel algorithms are the general, user-facing parallelism features, supporting custom algorithms, asynchrony, and control flow, with general usability of the performance provided by executors.
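As a hedged illustration of the mdspan item in the pipeline column, the sketch below uses the C++23 std::mdspan interface; it assumes a standard library that already ships <mdspan>, and the buffer and extents are illustrative.

// Non-owning multidimensional view over flat storage (C++23 std::mdspan).
#include <mdspan>
#include <vector>
#include <cstdio>

int main() {
  std::vector<double> storage(2 * 3, 0.0);

  std::mdspan grid(storage.data(), 2, 3);   // 2 x 3 view, no copy

  for (std::size_t i = 0; i < grid.extent(0); ++i)
    for (std::size_t j = 0; j < grid.extent(1); ++j)
      grid[i, j] = 10.0 * i + j;            // C++23 multidimensional subscript

  std::printf("grid[1, 2] = %.1f\n", grid[1, 2]);
  return 0;
}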



PILLARS OF STANDARD LANGUAGE PARALLELISM

Common Algorithms that Dispatch to Vendor-Optimized Parallel Libraries
Tools to Write Your Own Parallel Algorithms that Run Anywhere
Mechanisms for Composing Parallel Invocations into Task Graphs

sender auto
algorithm (sender auto s) {
  return s | bulk(N,
    [] (auto data) {
      // ...
    }
  ) | bulk(N,
    [] (auto data) {
      // ...
    }
  );
}

Today vs. with Senders & Receivers


SENDERS & RECEIVERS
Simplify Work Across CPUs and Accelerators

• Uniform abstraction between code and diverse resources
• ISO standard
• Write once, run everywhere

Maxwell's equations:

template <ComputeSchedulerT, WriteSchedulerT>
auto maxwell_eqs(ComputeSchedulerT &scheduler, WriteSchedulerT &writer)
{
  return repeat_n(
           n_outer_iterations,
           repeat_n(
             n_inner_iterations,
             schedule(scheduler)
               | bulk(grid.cells, update_h(accessor))
               | bulk(grid.cells, update_e(time, dt, accessor)))
             | transfer(writer)
             | then(dump_results(report_step, accessor)))
         | then([]{ printf("simulation complete\n"); });
}
ELECTROMAGNETISM
Raw performance & % of peak

std::sync_wait(maxwell(inline_scheduler, inline_scheduler));
std::sync_wait(maxwell(openmp_scheduler, inline_scheduler));
std::sync_wait(maxwell(cuda, inline_scheduler));

Chart: speedup (left) and efficiency vs. STREAM TRIAD (right) for the OpenMP-128, OpenMP-256, CUDA (1x A100), and CUDA (2x A100) schedulers.

§ CPUs: AMD EPYC 7742; GPUs: NVIDIA A100-SXM4-80
§ Inline (1 CPU HW thread), OpenMP-128 (1x CPU), OpenMP-256 (2x CPUs), Graph (1x GPU), Multi-GPU (2x GPUs)
§ clang-12 with -O3 -DNDEBUG -mtune=native -fopenmp
STRONG SCALING USING ISO STANDARD C++
PARALLEL ALGORITHMS AND SENDERS & RECEIVERS

Chart: strong scaling of the Maxwell senders & receivers code against the ideal scaling curve, with parallel efficiency on the secondary axis, plotted versus the number of GPUs (up to ~1,100).

NVIDIA SUPERPOD
§ 140x NVIDIA DGX-A100 640
§ 1120x NVIDIA A100-SXM4-80 GPUs

PALABOS CARBON SEQUESTRATION

Chart: strong scaling from 32 to 512 A100 GPUs.

§ Palabos is a framework for fluid dynamics simulations using Lattice-Boltzmann methods.
§ Code for multi-component flow through a porous medium, ported to C++ senders and receivers.
§ Application: simulating carbon sequestration in sandstone.
Christian Huber (Brown University), Jonas Latt (University of Geneva)
Georgy Evtushenko (NVIDIA), Gonzalo Brito (NVIDIA)
MODERN FORTRAN FEATURES FOR HPC
Standard Parallelism and Concurrency Features

Fortran 2018
• DO CONCURRENT: data-parallel loop construct with locality specifiers. Supported in nvfortran.
• Array intrinsics: various math intrinsics that may apply to entire arrays and map to accelerated libraries. Supported in nvfortran.
• Co-arrays: Partitioned Global Address Space arrays, teams of processes (images), collectives & synchronization. Awaiting F18.

Fortran 202X (coming in 2023)
• DO CONCURRENT reductions: support for reduction operations on concurrent loops (a la OpenACC/OpenMP). Supported in nvfortran since 21.11.

Fortran 202Y (in discussion)
• Atomics: proposed support for atomic variable accesses.
• Asynchronous tasking: proposed support for asynchronous tasks.
MINIWEATHER
Standard Language Parallelism in Climate/Weather Applications
MiniWeather
Mini-App written in C++ and Fortran that simulates
weather-like fluid flows using Finite Volume and
Runge-Kutta methods.
Existing parallelization in MPI, OpenMP, OpenACC, …
Included in the SPEChpc benchmark suite*
Open-source and commonly-used in training events.
https://github.com/mrnorman/miniWeather/

do concurrent (ll=1:NUM_VARS, k=1:nz, i=1:nx) &
    local(x,z,x0,z0,xrad,zrad,amp,dist,wpert)
  if (data_spec_int == DATA_SPEC_GRAVITY_WAVES) then
    x = (i_beg-1 + i-0.5_rp) * dx
    z = (k_beg-1 + k-0.5_rp) * dz
    x0 = xlen/8
    z0 = 1000
    xrad = 500
    zrad = 500
    amp = 0.01_rp
    dist = sqrt( ((x-x0)/xrad)**2 + ((z-z0)/zrad)**2 ) * pi / 2._rp
    if (dist <= pi / 2._rp) then
      wpert = amp * cos(dist)**2
    else
      wpert = 0._rp
    endif
    tend(i,k,ID_WMOM) = tend(i,k,ID_WMOM) + wpert*hy_dens_cell(k)
  endif
  state_out(i,k,ll) = state_init(i,k,ll) + dt * tend(i,k,ll)
enddo

Chart: MiniWeather performance for OpenMP (CPU), DO CONCURRENT (CPU), DO CONCURRENT (GPU), and OpenACC.
Source: HPC SDK 22.1, AMD EPYC 7742, NVIDIA A100. MiniWeather: NX=2000, NZ=1000, SIM_TIME=5. The OpenACC version uses the -gpu=managed option.
*SPEChpc is a trademark of the Standard Performance Evaluation Corporation.
POT3D: DO CONCURRENT
POT3D
POT3D is a Fortran application for approximating solar
coronal magnetic fields.
Included in the SPEChpc benchmark suite*
Existing parallelization in MPI & OpenACC
Optimized the DO CONCURRENT version by using
OpenACC solely for data motion and atomics
https://github.com/predsci/POT3D

!$acc enter data copyin(phi,dr_i)
!$acc enter data create(br)
do concurrent (k=1:np,j=1:nt,i=1:nrm1)
  br(i,j,k) = (phi(i+1,j,k) - phi(i,j,k)) * dr_i(i)
enddo
!$acc exit data delete(phi,dr_i,br)
Data courtesy of Predictive Science Inc. *SPEChpc is a trademark of The Standard Performance Evaluation Corporation
AGENDA
Accelerated Computing with Standard Languages

GPU Supercomputing in the PyData Ecosystem

Advancements in HPC Libraries

Advancements in Quantum Information Science


PRODUCTIVITY
Sequential and Composable Code

§ Sequential semantics: no visible parallelism or synchronization
§ Name-based global data: no partitioning
§ Composable: can combine with other libraries and datatypes

def cg_solve(A, b, conv_iters):
    x = np.zeros_like(b)
    r = b - A.dot(x)
    p = r
    rsold = r.dot(r)
    converged = False
    max_iters = b.shape[0]
    for i in range(max_iters):
        Ap = A.dot(p)
        alpha = rsold / (p.dot(Ap))
        x = x + alpha * p
        r = r - alpha * Ap
        rsnew = r.dot(r)
        if i % conv_iters == 0 and \
           np.sqrt(rsnew) < 1e-10:
            converged = i
            break
        beta = rsnew / rsold
        p = r + beta * p
        rsold = rsnew
PERFORMANCE
Transparent Acceleration

§ Transparently run at any scale needed to address computational challenges at hand


§ Automatically leverage all the available hardware

Diagram: scaling transparently from a single GPU, to a multi-GPU node, to a supercomputer built from Grace CPUs, GPUs, and DPUs.
MICROSCOPY WITH RICHARDSON-LUCY DECONVOLUTION

def richardson_lucy(image, psf, num_iter=50,
                    clip=True, filter_epsilon=None):
    float_type = _supported_float_type(image.dtype)
    image = image.astype(float_type, copy=False)
    psf = psf.astype(float_type, copy=False)
    im_deconv = np.full(image.shape, 0.5, dtype=float_type)
    psf_mirror = np.flip(psf)

    for _ in range(num_iter):
        conv = convolve(im_deconv, psf, mode='same')
        if filter_epsilon:
            with np.errstate(invalid='ignore'):
                relative_blur = np.where(conv < filter_epsilon, 0,
                                         image / conv)
        else:
            relative_blur = image / conv
        im_deconv *= convolve(relative_blur, psf_mirror,
                              mode='same')

    if clip:
        im_deconv[im_deconv > 1] = 1
        im_deconv[im_deconv < -1] = -1

    return im_deconv
COMPUTATIONAL FLUID DYNAMICS

CFD Python on cuNumeric!

• CFD codes like:
  • Shallow-Water Equation Solver
  • Oil Pipeline Risk Management: Geoclaw-landspill simulations
• Python Libraries: Jupyter, NumPy, SciPy, SymPy, Matplotlib

Chart: distributed NumPy performance (weak scaling), time in seconds for cuPy vs. Legate as the relative dataset size grows with the number of GPUs (1 to 1024).

for _ in range(iter):
    un = u.copy()
    vn = v.copy()

    b = build_up_b(rho, dt, dx, dy, u, v)
    p = pressure_poisson_periodic(b, nit, p, dx, dy)


Extracted from “CFD Python” course at https://github.com/barbagroup/CFDPython
Barba, Lorena A., and Forsyth, Gilbert F. (2018). CFD Python: the 12 steps to Navier-Stokes equations. Journal of
Open Source Education, 1(9), 21, https://doi.org/10.21105/jose.00021
MICRO-JOIN

cuDF + cuNumPy Micro-Join Benchmark (weak scaling), size = num_rows_per_gpu * num_gpus

~1,700 lines of MPI+CUDA vs. ~10 lines of cuDF + Legate:

key_l = np.arange(size)
val_l = np.random.randn(size)
lhs = pd.DataFrame({ "key": key_l, "val": val_l })

key_r = key_l // 3 * 3  # selectivity: 0.33
val_r = np.random.randn(size)
rhs = pd.DataFrame({ "key": key_r, "val": val_r })

out = lhs.merge(rhs, on="key")

Chart: time in seconds on 8, 16, 32, 64, and 128 GPUs comparing the ~1,700 LOC MPI+CUDA implementation with the ~10 LOC cuDF + Legate version.

Machine: DGX SuperPOD with A100-80GB GPUs


AGENDA
Accelerated Computing with Standard Languages

GPU Supercomputing in the PyData Ecosystem

Advancements in HPC Libraries

Advancements in Quantum Information Science


cuBLAS: GPU OPTIMIZED BLAS IMPLEMENTATION
Maximum Speedups of CTK 11.6U1 over CTK 11.1: Sizes < 2k

Recently Introduced
§ FP64 Tensor Core accelerated BLAS3
§ Improved heuristics for GEMV
§ Batched GEMV
§ Helper functions for improved error management

Chart: speedup over CTK 11.1 for SYRK, SYMM, and TRMM (double real) and SYRK, SYMM, TRMM, HEMM, and HERK (double complex).
* A100 80GB @ 1095 MHz: CTK 11.1 vs. CTK 11.6U1

[CWE41721] Connect with the Experts: NVIDIA Math Libraries
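To make the BLAS3 bullet concrete, here is a minimal, hedged host-side sketch of a cuBLAS DGEMM call; the sizes, data, and minimal error handling are illustrative additions, and on A100-class GPUs this FP64 GEMM can be accelerated by FP64 Tensor Cores. It is not the benchmark code behind the chart.

// C = alpha*A*B + beta*C with cuBLAS (column-major), illustrative sizes.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main() {
  const int n = 1024;
  const double alpha = 1.0, beta = 0.0;
  std::vector<double> hA(n * n, 1.0), hB(n * n, 2.0), hC(n * n, 0.0);

  double *dA, *dB, *dC;
  cudaMalloc((void**)&dA, sizeof(double) * n * n);
  cudaMalloc((void**)&dB, sizeof(double) * n * n);
  cudaMalloc((void**)&dC, sizeof(double) * n * n);
  cudaMemcpy(dA, hA.data(), sizeof(double) * n * n, cudaMemcpyHostToDevice);
  cudaMemcpy(dB, hB.data(), sizeof(double) * n * n, cudaMemcpyHostToDevice);

  cublasHandle_t handle;
  cublasCreate(&handle);
  cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &alpha, dA, n, dB, n, &beta, dC, n);

  cudaMemcpy(hC.data(), dC, sizeof(double) * n * n, cudaMemcpyDeviceToHost);
  std::printf("C[0] = %f (expected %f)\n", hC[0], 2.0 * n);

  cublasDestroy(handle);
  cudaFree(dA); cudaFree(dB); cudaFree(dC);
  return 0;
}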


cuTENSOR: GPU ACCELERATED TENSOR PRIMITIVES
Tensor Contraction Performance Improvements for Complex FP32 + TF32

Releasing cuTENSOR v1.5
• Improvements to tensor contractions, up to 33x
• Supporting more than 28 modes

Chart: tensor contraction performance (TFLOPS, larger is better) for v1.2.2 vs. v1.5 across 300 random cases (complex FP32 + TF32), measured on a DGX A100 80GB.
Multi-Node Multi-GPU FFT in cuFFT
Coming to HPC SDK 22.3

§ Distributed 2D/3D FFTs
§ Slab decomposition
§ Pencil decomposition (preview)
§ Helper functions: pencils <-> slabs

Charts: C2C and Z2Z performance in TFLOPS (larger is better), scaling from 8 to 4,096 GPUs for problem sizes (cubed) from 2,048 to 16,384, and from 32 to 2,048 GPUs for problem sizes (cubed) from 1,024 to 8,192; measured values reach roughly 1,860 TFLOPS at the largest scale.
* Selene: A100 80GB @ 1410 MHz
MATH LIBRARIES DEVICE EXTENSIONS
cuFFTDx Performance: Comparison with cuFFT across various sizes

Released in MathDx 22.02
§ Available on DevZone
§ Supports Volta+ architectures
§ FFT 1D sizes up to 32k

Future Releases
§ cuBLASDx / cuSOLVERDx
§ 2D/3D FFTs
§ Windows support

Chart: TFLOPS (larger is better) for cuFFTDx vs. cuFFT across 1D FFT sizes from 2 to 32,768.
* A100 80GB @ 1410 MHz
AGENDA
Accelerated Computing with Standard Languages

GPU Supercomputing in the PyData Ecosystem

Advancements in HPC Libraries

Advancements in Quantum Information Science


QC RESEARCH ROADMAP
Large improvements in qubit quantity and quality, and in error correction, are needed for wide adoption

Quantum Supremacy Threshold: experimental confirmation of a quantum speedup on a well-defined (not necessarily useful) problem. Qubits and quantum gates are very noisy, and the hardware is not very usable. Active research: can this be simulated efficiently on GPU supercomputers?

Noisy Intermediate-Scale Quantum (NISQ) Era: quantum gates are noisy, errors accumulate, and qubits lose coherence. QC hardware will mitigate errors by using tens to hundreds of redundant physical qubits per logical qubit. Active research: will NISQ machines reach quantum advantage on useful workloads?

Fault-Tolerant QC Era: 1000:1 to 10000:1 redundancy for error-corrected logical qubits [Fowler 2012][Reiher 2016]. Exponential speedups on a limited set of applications with hundreds to thousands of logical qubits (millions of physical qubits). Active research: what are the best error correction algorithms?

Chart: physical qubits (1 to 10,000,000, log scale) versus year (2010-2040), with the threshold for fault-tolerant quantum computing speedups marked.
GPU-BASED SUPERCOMPUTING IN THE QC ECOSYSTEM
Researching the Quantum Computers of Tomorrow with the Supercomputers of Today

QUANTUM CIRCUIT SIMULATION
Critical tool for answering today's most pressing questions in Quantum Information Science (QIS):
• What quantum algorithms are most promising for near-term or long-term quantum advantage?
• What are the requirements (number of qubits and error rates) to realize quantum advantage?
• What quantum processor architectures are best suited to realize valuable quantum applications?

HYBRID CLASSICAL/QUANTUM APPLICATIONS
Impactful QC applications (e.g., simulating quantum materials and systems) will require classical supercomputers with quantum co-processors.
• How can we accelerate HPC through hybrid classical/quantum algorithms?
TWO LEADING QUANTUM CIRCUIT SIMULATION APPROACHES

State Vector Simulation
• "Gate-based emulation of a quantum computer"
• Maintain the full 2^n-amplitude state vector in memory
• Update all amplitudes every time step; probabilistically sample the states for measurement
• Memory capacity and time grow exponentially with the number of qubits; the practical limit is around 50 qubits on a supercomputer
• Can model either ideal or noisy qubits

Tensor Networks
• "Only simulate the states you need"
• Uses tensor network contractions to dramatically reduce the memory needed to simulate circuits
• Can simulate 100s or 1000s of qubits for many practical quantum circuits

GPUs are a great fit for either approach

Tensor Network image from Quimb: https://quimb.readthedocs.io/en/latest/index.html
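To make the state-vector description concrete, here is a small, CPU-only illustrative sketch in plain C++ (not the cuQuantum API): the full 2^n amplitude vector is stored, and every amplitude pair is touched when a single-qubit gate is applied. The gate, qubit count, and layout are illustrative assumptions.

// Apply a single-qubit gate to a 2^n-amplitude state vector (CPU sketch).
#include <cmath>
#include <complex>
#include <cstdio>
#include <vector>

using amp = std::complex<double>;

void apply_1q_gate(std::vector<amp>& state, int q, const amp g[2][2]) {
  const std::size_t stride = std::size_t{1} << q;
  for (std::size_t i = 0; i < state.size(); i += 2 * stride)
    for (std::size_t j = i; j < i + stride; ++j) {
      const amp a0 = state[j], a1 = state[j + stride];
      state[j]          = g[0][0] * a0 + g[0][1] * a1;
      state[j + stride] = g[1][0] * a0 + g[1][1] * a1;
    }
}

int main() {
  const int n = 3;                                 // 3 qubits -> 8 amplitudes
  std::vector<amp> state(std::size_t{1} << n);
  state[0] = 1.0;                                  // start in |000>
  const double s = 1.0 / std::sqrt(2.0);
  const amp h[2][2] = {{s, s}, {s, -s}};           // Hadamard gate
  apply_1q_gate(state, 0, h);                      // H on qubit 0
  for (std::size_t i = 0; i < state.size(); ++i)
    std::printf("|%zu>: p = %.3f\n", i, std::norm(state[i]));
  return 0;
}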


cuQuantum

§ Platform for quantum computing research
§ Accelerate quantum circuit simulators on GPUs
§ Simulate ideal or noisy qubits
§ Enable algorithms research with scale and performance not possible on quantum hardware or on simulators today
§ cuQuantum GA release available now
§ Integrated into leading quantum computing frameworks: Cirq, Qiskit, and PennyLane
§ C and Python APIs
§ Available at developer.nvidia.com/cuquantum

Stack diagram: Quantum Computing Application -> Quantum Computing Frameworks (e.g., Cirq, Qiskit) -> Quantum Circuit Simulators (e.g., Qsim, Qiskit Aer) -> cuQuantum (cuStateVec, cuTensorNet, ...) -> GPU Accelerated Computing, alongside a QPU.


DGX Quantum Appliance Available Now
MULTI-GPU CONTAINER WITH CIRQ/QSIM/CUQUANTUM

• Full Quantum Simulation Stack
• Get started with QC research today!
• World class performance on key quantum algorithms
• Available on NGC
• See

cuQuantum
ACCELERATING THE QUANTUM COMPUTING ECOSYSTEM
Leading groups across academia and industry are jump-starting their Quantum Computing Research with cuQuantum

Quantum Computing Frameworks

Industry Partners

Supercomputing Partners
CUQUANTUM PARTNER HIGHLIGHTS
cuQuantum-accelerated research in quantum chemistry, climate modeling, quantum supremacy, and more

CLASSIQ
Quantum Algorithm Design

QC WARE
Quantum Chemistry
Rob Parrish - QPUs and GPUs: Practical Applications of NVIDIA Accelerators to Emulate and Co-process with Emerging Quantum Computers [S41805]

PASQAL
Quantum Machine Learning
Loic Henriet - Accelerating the Simulation of Neutral-atom Quantum Processors with GPUs [S41435]

RIGETTI COMPUTING
Quantum Machine Learning, Climate Modeling

CHINESE ACADEMY OF SCIENCES
Quantum Supremacy
Pan Zhang - Solving the Sampling Problem of the Sycamore Quantum Supremacy Circuits [S41488]
FUTURE DIRECTIONS
Programming NVIDIA Quantum-Classical Heterogeneous Architectures

§ Introducing NVQ++
§ State-of-the-art quantum-classical C++ compiler
§ Interoperable with existing parallel programming models
§ Implements a kernel-based approach that can be compiled to the Quantum Intermediate Representation (QIR)
§ Native support for cuQuantum backend emulation, extensible to vendor quantum computers

Diagram: application domains (pharmaceutical, chemistry, weather, finance, logistics) sit on top of a leading framework and the kernel-based quantum-classical programming model; the nvq++ system-level compiler toolchain lowers this onto GPU and quantum device drivers targeting NVIDIA GPUs, cuQuantum QPU emulation, and partner QPUs.

You might also like