GPU DPDK: recipe
• GPUDirect RDMA: NVIDIA GPU + Mellanox CX5
• External buffers (DPDK 18.05)
• NVIDIA API to allocate mbuf payloads in GPU memory (see the sketch after this list)
• RX and TX (inlining not allowed)
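A minimal sketch of the recipe above, assuming CUDA and DPDK 18.05+ external buffers: one GPU allocation is carved into payload slots and attached to mbufs with rte_pktmbuf_attach_extbuf(). DMA mapping/registration of the GPU region for the NIC is omitted, and names such as NB_MBUFS, PAYLOAD_SZ and alloc_mbuf_with_gpu_payload() are illustrative, not part of any NVIDIA API.

/* Sketch: back mbuf payloads with GPU memory via DPDK 18.05 external buffers.
 * cudaMalloc() stands in for the GPU-memory allocation; registering the
 * region for NIC DMA is omitted for brevity. */
#include <cuda_runtime.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

#define NB_MBUFS   8192               /* illustrative sizes */
#define PAYLOAD_SZ 2048

static void gpu_buf_free_cb(void *addr, void *opaque)
{
    (void)addr; (void)opaque;         /* GPU region is freed once at teardown */
}

struct rte_mbuf *alloc_mbuf_with_gpu_payload(struct rte_mempool *hdr_pool)
{
    static void *gpu_base;
    static struct rte_mbuf_ext_shared_info shinfo;
    static unsigned int next;

    if (gpu_base == NULL) {
        /* One large GPU allocation, carved into per-mbuf payload slots. */
        if (cudaMalloc(&gpu_base, (size_t)NB_MBUFS * PAYLOAD_SZ) != cudaSuccess)
            return NULL;
        shinfo.free_cb = gpu_buf_free_cb;
        shinfo.fcb_opaque = NULL;
        rte_mbuf_ext_refcnt_set(&shinfo, 1);
    }

    struct rte_mbuf *m = rte_pktmbuf_alloc(hdr_pool);   /* header in sysmem */
    if (m == NULL)
        return NULL;

    char *payload = (char *)gpu_base + (size_t)(next++ % NB_MBUFS) * PAYLOAD_SZ;

    /* Keep the shared refcount above zero so the no-op free_cb never
     * "releases" the persistent GPU region (refcounting simplified). */
    rte_mbuf_ext_refcnt_update(&shinfo, 1);

    /* Attach the GPU buffer as the mbuf's external data area. Using the VA
     * as IOVA assumes IOVA-as-VA mode; real code must pass the DMA address
     * the GPU memory was registered with. */
    rte_pktmbuf_attach_extbuf(m, payload, (rte_iova_t)(uintptr_t)payload,
                              PAYLOAD_SZ, &shinfo);
    return m;
}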
[Diagram: GPUDirect RDMA over PCIe — the NIC receives pkt0 and pkt1; mbuf 0 and mbuf 1 headers sit on the RX queue in system memory while the pkt0 and pkt1 payloads are placed in a mempool in GPU memory]
• Example: l2fwd-nv
NVIDIA GTC ’19, Session S9730: E. Agostini, C. Tekur, "Packet Processing on GPU at 100GbE Line Rate"
GPU DPDK: Header/Data split extension
• Having the whole packet in GPU memory may force you to offload the control logic onto the GPU
• Header/Data split: part of the packet can be received in system memory and the remaining part in GPU memory
• Each packet is stored into two chained mbufs on the RX queue (see the sketch below)
  • Initial X bytes in system memory
  • Remaining Y bytes in GPU memory
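A rough sketch of how an application might consume such split packets, assuming the RX queue has already been configured (by the out-of-tree extension described here) to chain a system-memory header mbuf with a GPU-memory payload mbuf; function and variable names are illustrative.

/* Sketch: walking header/data-split packets after rte_eth_rx_burst().
 * Segment 0 (first X bytes) is in system memory, segment 1 (remaining
 * Y bytes) is in GPU memory; the split RX setup itself is not shown. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

static void handle_split_burst(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *pkts[32];
    uint16_t nb = rte_eth_rx_burst(port_id, queue_id, pkts, 32);

    for (uint16_t i = 0; i < nb; i++) {
        struct rte_mbuf *hdr_seg = pkts[i];        /* X bytes, sysmem      */
        struct rte_mbuf *gpu_seg = hdr_seg->next;  /* Y bytes, GPU memory  */

        /* Control logic (parsing, filtering, steering) stays on the CPU,
         * because the interesting header bytes are host-accessible. */
        const void *hdr = rte_pktmbuf_mtod(hdr_seg, const void *);

        /* The payload pointer is a device address: hand it to a CUDA
         * kernel or batch it for GPU processing, do not dereference it. */
        void *gpu_payload = gpu_seg ? rte_pktmbuf_mtod(gpu_seg, void *) : NULL;

        (void)hdr; (void)gpu_payload;

        rte_pktmbuf_free(hdr_seg);                 /* frees the whole chain */
    }
}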
[Diagram: Header/Data split — Pkt0 (X + Y bytes) arrives at the NIC and is split across two chained mbufs: the first X bytes go to a system-memory mempool, while the remaining Y bytes are written via GPUDirect RDMA into a GPU-memory mempool; the mbuf headers of both segments stay in system memory]
Benchmarks: testpmd vs testpmd
System setup
• System HW:
  • Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz (Skylake)
  • NVIDIA GPU V100
  • Mellanox ConnectX-5, 100Gbps
  • 4kB MaxReadRequest
  • PCIe bridge Broadcom PLX Technology 9797
• Testpmd with GPU mbufs support
  • 2 cores, 4 RX/TX queues per core, 256 descriptors per queue
• Testpmd TX only mode
  • 99 Gbps, 8 TX cores, 4 RX/TX queues, 256 descriptors per queue

Packet size     | GPU mem only | H/D split
128 B           | 30 Gbps      | 40 Gbps
512 B           | 91 Gbps      | 98 Gbps
1024 B          | 96 Gbps      | 99 Gbps
2048 B (Jumbo)  | 96 Gbps      | 99 Gbps
Benchmarks: testpmd vs l2fwd-nv
System setup
• System HW:
  • Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz (Skylake)
  • NVIDIA GPU V100
  • Mellanox ConnectX-5, 100Gbps
  • 4kB MaxReadRequest
  • PCIe bridge Broadcom PLX Technology 9797
Aerial SDK
• NVIDIA Aerial is a set of SDKs that enables GPU-accelerated, software-defined 5G wireless RANs. Today, NVIDIA Aerial provides two critical SDKs: cuVNF and cuBB.
• The NVIDIA cuVNF SDK provides optimized input/output (IO) and packet processing, whereby 5G packets are sent directly to GPU memory from GPUDirect-capable network interface cards (NICs), such as the Mellanox CX-5.
• The NVIDIA cuBB SDK provides a fully offloaded 5G signal processing pipeline (L1 5G PHY), which delivers unprecedented throughput and efficiency by keeping all physical layer processing within the GPU's high-performance memory.
• https://developer.nvidia.com/aerial-sdk
Thoughts toward upstream
• New APIs to create a mempool with persistent external buffers
  • Buffers can come from any memory on the platform
• New mempool flag - MEMPOOL_F_PINNED_EXT_MBUF (see the sketch below)
  • Avoid detaching the external buffer during mbuf free.
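A sketch of what creating such a pinned external-buffer mempool could look like. The flag and creation API were only a proposal at this point; the code below is modeled on DPDK's rte_pktmbuf_pool_create_extbuf() and struct rte_pktmbuf_extmem, so treat the exact names and signature as an assumption rather than the API proposed on this slide.

/* Illustrative only: creates a pktmbuf pool whose data buffers live in an
 * external (e.g. GPU) memory area that stays attached ("pinned") to the
 * mbufs, so freeing an mbuf does not detach the external buffer. */
#include <rte_mbuf.h>
#include <rte_mempool.h>
#include <rte_lcore.h>

#define POOL_MBUFS 8192
#define DATAROOM   2048

struct rte_mempool *
create_pinned_extbuf_pool(void *ext_buf, size_t ext_buf_len, rte_iova_t ext_iova)
{
    struct rte_pktmbuf_extmem ext = {
        .buf_ptr  = ext_buf,      /* e.g. a registered GPU memory region  */
        .buf_iova = ext_iova,     /* DMA address the NIC will use         */
        .buf_len  = ext_buf_len,
        .elt_size = DATAROOM,     /* per-mbuf slice of the external area  */
    };

    /* Mbuf headers come from regular (sysmem) mempool memory; only the
     * data room is carved out of the external region. */
    return rte_pktmbuf_pool_create_extbuf("gpu_pinned_pool", POOL_MBUFS,
                                          256 /* cache */, 0 /* priv */,
                                          DATAROOM, rte_socket_id(),
                                          &ext, 1);
}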
Use Case: ViaSat Satellite Internet
ViaSat Signal Processing Pipeline
[Diagram: ViaSat signal processing pipeline — 100Gbps streams flow into ring buffers in CPU memory serviced by an lcore, then to the GPU over PCIe3 x16 (~100Gbps); GPU memory bandwidth is ~400Gbps]
Supermicro 9029 (16 NICs/16 GPUs)
[Diagram: Supermicro 9029 topology — 16 GPU/NIC pairs attached across two Intel Xeon sockets connected by UPI]
Before and After GPUDirect RX
Before GPUDirect:
• Payload reassembled in CPU memory
• Copied to the device in a single (large) copy (see the sketch below)
[Diagram: packet headers + data land in CPU memory, where payloads are reordered (packet ordering 0 1 2 3 ... N) and then copied to payload data buffers in GPU memory]
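A minimal sketch of the "before GPUDirect" receive path described above, under the assumption that payloads are gathered into a contiguous host staging buffer and then moved to the GPU with one large copy; buffer names and sizes are illustrative.

/* Sketch of the pre-GPUDirect path: reassemble payloads in CPU memory,
 * then copy them to the device in a single (large) transfer. */
#include <cuda_runtime.h>
#include <rte_mbuf.h>
#include <string.h>

static size_t stage_and_copy(struct rte_mbuf **pkts, uint16_t nb,
                             uint8_t *host_staging, void *gpu_dst)
{
    size_t off = 0;

    /* Reassemble: append each packet's data to the staging buffer
     * in packet order (0, 1, 2, ..., N). */
    for (uint16_t i = 0; i < nb; i++) {
        const uint8_t *data = rte_pktmbuf_mtod(pkts[i], const uint8_t *);
        uint16_t len = rte_pktmbuf_data_len(pkts[i]);
        memcpy(host_staging + off, data, len);
        off += len;
    }

    /* One large host-to-device copy instead of one copy per packet. */
    cudaMemcpy(gpu_dst, host_staging, off, cudaMemcpyHostToDevice);
    return off;
}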
Thank you – Questions?