DPDK acceleration with GPU


ELENA AGOSTINI (NVIDIA),
CLIFF BURDICK (VIASAT),
SHAHAF SHULER (MELLANOX)
Agenda

• GPUDIRECT RDMA
• GPU DPDK
  • General overview
  • Header/Data split feature
  • Benchmarks
• USE CASE: VIASAT
  • Converting a legacy application
GPUDirect RDMA

• 3rd party PCIe devices (e.g. a network card) can directly read/write GPU memory
• GPU and external device must be under the same PCIe root complex
• No unnecessary system memory copies, no CPU overhead
  • nv_peer_mem kernel module: https://github.com/Mellanox/nv_peer_memory
• Example: cudaMalloc(gpu_buffer) + MPI_Send(gpu_buffer) with a CUDA-aware MPI (minimal sketch below)
• https://docs.nvidia.com/cuda/gpudirect-rdma/index.html

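A minimal sketch of the cudaMalloc + MPI_Send pattern above. It assumes a CUDA-aware MPI build (e.g. one backed by UCX) so MPI can read the GPU buffer directly through GPUDirect RDMA; error handling is omitted.

```cuda
// Rank 0 sends a GPU-resident buffer to rank 1 without staging it in sysmem.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int len = 1 << 20;             /* 1 MiB payload */
    void *gpu_buffer = NULL;
    cudaMalloc(&gpu_buffer, len);        /* buffer lives in GPU memory */

    if (rank == 0)
        MPI_Send(gpu_buffer, len, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(gpu_buffer, len, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(gpu_buffer);
    MPI_Finalize();
    return 0;
}
```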
GPU DPDK: recipe
• GPUDirect RDMA: NVIDIA GPU + Mellanox ConnectX-5
• External buffers (DPDK 18.05): see the allocation sketch after the diagram below
• NVIDIA API to allocate mbuf payloads in GPU memory
• RX and TX (inlining not allowed)

[Diagram: GPUDirect RDMA receive path. Packets (pkt0, pkt1, ...) arrive at the NIC and are DMA-written over PCIe into the RX queue's mempool: the mbuf headers live in system memory, while the mbuf payloads live in GPU memory.]

GPU memory consistency problem: NVIDIA GTC '19, D. Rossetti, E. Agostini, Session S9653, "How to make your life easier in the age of exascale computing using NVIDIA GPUDirect technologies"
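A minimal sketch of the external-buffer recipe, assuming DPDK >= 18.05 with nv_peer_mem loaded. It uses the stock rte_pktmbuf_attach_extbuf() API rather than the NVIDIA allocator mentioned above, and gpu_buffer_iova() is a hypothetical helper standing in for however the NIC-visible address of the GPU buffer is obtained on a given platform.

```cuda
/*
 * Sketch: back an mbuf's payload with GPU memory via a DPDK external buffer.
 * The shared info must stay in host memory (the CPU cannot write into the
 * tail of a cudaMalloc'd buffer), and the IOVA lookup is platform dependent.
 */
#include <cuda_runtime.h>
#include <rte_mbuf.h>

/* Hypothetical: NIC-visible address of a GPU buffer (platform specific). */
extern rte_iova_t gpu_buffer_iova(void *gpu_addr);

static void gpu_buf_free_cb(void *addr, void *opaque)
{
    (void)opaque;
    cudaFree(addr);                             /* release GPU memory on last unref */
}

struct rte_mbuf *
alloc_gpu_payload_mbuf(struct rte_mempool *mp,
                       struct rte_mbuf_ext_shared_info *shinfo, /* host memory */
                       uint16_t len)
{
    void *gpu_buf = NULL;
    cudaMalloc(&gpu_buf, len);                  /* payload lives in GPU memory */

    shinfo->free_cb = gpu_buf_free_cb;
    shinfo->fcb_opaque = NULL;
    rte_mbuf_ext_refcnt_set(shinfo, 1);

    struct rte_mbuf *m = rte_pktmbuf_alloc(mp); /* mbuf header stays in sysmem */
    rte_pktmbuf_attach_extbuf(m, gpu_buf, gpu_buffer_iova(gpu_buf), len, shinfo);
    return m;
}
```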
GPU DPDK: interaction with CUDA kernel
• Spend some time batching the packets
• Option 1: launch a new CUDA kernel for each batch
  • No GPU memory consistency problem
  • Adds CUDA kernel launch latency for each batch
• Option 2: polling from a CUDA persistent kernel
  • GPU memory consistency problem
  • No additional latency for CUDA kernel launch
  • Needs a CPU <> GPU communication protocol: the CPU has to notify the persistent kernel about newly received packets (a minimal sketch follows)
• Example: l2fwd-nv

NVIDIA GTC '19, Session S9730: E. Agostini, C. Tekur, "Packet Processing on GPU at 100GbE Line Rate"
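A minimal sketch of an Option 2 style notification protocol, not the l2fwd-nv code: the doorbell counts batches published by the CPU and lives in pinned, host-mapped memory (cudaHostAlloc with cudaHostAllocMapped), so a single-block persistent kernel can spin on it. All names are illustrative.

```cuda
#include <cuda_runtime.h>

__global__ void persistent_rx_kernel(volatile unsigned int *doorbell,
                                     unsigned char * const *payloads,
                                     volatile int *exit_flag)
{
    unsigned int done = 0;                   /* batches consumed so far */
    while (!*exit_flag) {
        if (*doorbell == done)
            continue;                        /* nothing new: keep polling */
        __threadfence_system();              /* see CPU writes before touching data */

        /* ... all threads of the block process the packets of batch `done`
         *     through `payloads` (already resident in GPU memory) ... */

        __syncthreads();
        done++;                              /* consume exactly one batch */
    }
}

/* Host side: called once per received burst, after the mbufs are ready. */
void publish_batch(volatile unsigned int *doorbell)
{
    __sync_synchronize();                    /* payload writes visible before the bump */
    (*doorbell)++;                           /* persistent kernel picks it up */
}
```

The kernel is launched once at startup, e.g. persistent_rx_kernel<<<1, 256>>>(doorbell, d_payloads, exit_flag), and the CPU only ever touches the doorbell and exit flag afterwards.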
GPU DPDK: Header/Data split extension
• Having the whole packet in GPU memory may force you to offload the control logic to the GPU
• Header/Data split: you can receive part of the packet in system memory and the remaining part in GPU memory
• Each packet is stored into two chained mbufs on the RX queue (a consumer sketch follows the diagram below)
  • Initial X bytes in system memory
  • Remaining Y bytes in GPU memory

[Diagram: Header/Data split RX path. Each received packet (X + Y bytes) is split by the NIC across two chained mbufs: the first mbuf, from a system-memory mempool, holds the initial X bytes; the second mbuf, filled via GPUDirect RDMA from a GPU-memory mempool, holds the remaining Y bytes. The mbuf headers always stay in system memory.]
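A sketch of consuming a header/data-split packet under the layout above (names and the IPv4 filter are illustrative, and the newer rte_ether_* names from DPDK 19.08+ are assumed): the CPU reads the protocol header from the first, sysmem segment and only records the device address of the payload segment.

```cuda
#include <rte_byteorder.h>
#include <rte_ether.h>
#include <rte_mbuf.h>

struct gpu_work {
    void    *gpu_payload;   /* device address of the Y payload bytes */
    uint16_t payload_len;
};

static inline int split_pkt_to_work(struct rte_mbuf *m, struct gpu_work *w)
{
    /* Control logic on the CPU: the headers live in the sysmem segment. */
    struct rte_ether_hdr *eth = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
    if (eth->ether_type != rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4))
        return -1;                       /* not for us: let the caller drop it */

    /* Payload segment: its data pointer is a GPU address, never dereferenced
     * on the CPU. */
    struct rte_mbuf *seg = m->next;
    w->gpu_payload = rte_pktmbuf_mtod(seg, void *);
    w->payload_len = rte_pktmbuf_data_len(seg);
    return 0;
}
```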
Benchmarks: testpmd vs testpmd
System setup
• System HW:
  • Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz (Skylake)
  • NVIDIA V100 GPU
  • Mellanox ConnectX-5, 100Gbps
  • 4kB MaxReadRequest
  • PCIe bridge: Broadcom PLX Technology 9797
• Testpmd with GPU mbufs support
  • 2 cores, 4 RX/TX queues per core, 256 descriptors per queue
• Testpmd TX only mode
  • 99 Gbps, 8 TX cores, 4 RX/TX queues, 256 descriptors per queue

  Packet size      GPU mem only   H/D split
  128 B            30 Gbps        40 Gbps
  512 B            91 Gbps        98 Gbps
  1024 B           96 Gbps        99 Gbps
  2048 B (Jumbo)   96 Gbps        99 Gbps
Benchmarks: testpmd vs l2fwd-nv
System setup
• System HW:
  • Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz (Skylake)
  • NVIDIA V100 GPU
  • Mellanox ConnectX-5, 100Gbps
  • 4kB MaxReadRequest
  • PCIe bridge: Broadcom PLX Technology 9797
• l2fwd-nv with GPU memory mempool, 64 packets per burst
  • 8 RX/TX cores
• Testpmd TX only mode
  • 99 Gbps, 8 TX cores, 32 descriptors per queue

  Packet size      No swap    Swap with persistent kernel   Swap with dedicated kernel
  128 B            31 Gbps    25 Gbps                       9 Gbps
  256 B            53 Gbps    46 Gbps                       18 Gbps
  512 B            86 Gbps    80 Gbps                       35 Gbps
  1024 B           88 Gbps    88 Gbps                       65 Gbps
  2048 B (Jumbo)   88 Gbps    88 Gbps                       86 Gbps
Aerial SDK

• NVIDIA Aerial is a set of SDKs that enables GPU-accelerated, software-defined 5G wireless RANs. Today, NVIDIA Aerial provides two critical SDKs: cuVNF and cuBB.
• The NVIDIA cuVNF SDK provides optimized input/output (IO) and packet processing, whereby 5G packets are sent directly to GPU memory from GPUDirect-capable network interface cards (NICs), such as the Mellanox ConnectX-5.
• The NVIDIA cuBB SDK provides a fully offloaded 5G signal processing pipeline (L1 5G PHY) which delivers unprecedented throughput and efficiency by keeping all physical layer processing within the GPU's high-performance memory.
• https://developer.nvidia.com/aerial-sdk
Thoughts toward upstream
• New APIs to create a mempool with persistent external buffers
  • Buffers can come from any memory on the platform
  • New mempool flag: MEMPOOL_F_PINNED_EXT_MBUF
  • Avoid detaching the external buffer during mbuf free
• Fine-grained PMD inline control
  • An mbuf flag to hint the PMD whether or not to inline the mbuf data
  • The flag will live in the dynamic flags section, after a handshake between application and PMD is complete (see the sketch below)
• Header/Data split
  • Populate the mempool with two types of external buffers (CPU and GPU)
  • Even mbufs in GPU memory, odd mbufs in CPU memory
  • Configure the PMD RX queue to always drain X mbufs per post-recv
  • Each packet is then spread between GPU and CPU memory
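A sketch of how the proposed per-mbuf "do not inline" hint could look, assuming the dynamic mbuf flag mechanism that landed in DPDK's rte_mbuf_dyn.h (19.11+); the flag name is hypothetical and would be the string the application and PMD agree on during the handshake.

```cuda
#include <stdio.h>
#include <string.h>
#include <rte_mbuf.h>
#include <rte_mbuf_dyn.h>

static uint64_t no_inline_flag;     /* bit mask returned by registration */

static int register_no_inline_flag(void)
{
    struct rte_mbuf_dynflag desc;
    memset(&desc, 0, sizeof(desc));
    /* Hypothetical name shared between application and PMD. */
    snprintf(desc.name, sizeof(desc.name), "app_pmd_no_inline");

    int bit = rte_mbuf_dynflag_register(&desc);
    if (bit < 0)
        return -1;
    no_inline_flag = 1ULL << bit;
    return 0;
}

/* Mark an mbuf whose payload sits in GPU memory so the PMD never inlines it. */
static inline void mark_no_inline(struct rte_mbuf *m)
{
    m->ol_flags |= no_inline_flag;
}
```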
Use Case: ViaSat Satellite Internet

• Global satellite internet provider
  • Residential
  • Mobile
• Users of DPDK since 2.x
  • MAC layer and above packet processing
• Migrating some signal processing code out of FPGAs into GPUs
  • Rapid prototyping
  • Highly parallel, real-time tasks
  • Some packet loss is okay
ViaSat Signal Processing Pipeline

• Using multiple 100Gbps NICs to keep feeding the processing pipeline in real time
• CPU only needs to know a few bytes in the header for control logic
• Payload is processed entirely on the GPU
• Legacy DPDK-based application started hitting memory bandwidth limitations early on
  • Larger systems exacerbated this
• Analyzed RDMA (RoCE/iWARP)
  • Line rate issues
  • Control over senders
  • Tolerance to lossy data
ViaSat Signal Processing Pipeline

[Diagram: Receive → Processing → Transmit pipeline, fed through ring buffers by an lcore. A 100Gbps ingress NIC and a 100Gbps egress NIC attach alongside the GPU over PCIe3 x16 (~100Gbps) and CPU memory; memory bandwidth is annotated at ~400Gbps against ~100Gbps over PCIe.]
Supermicro 9029 (16 NICs / 16 GPUs)

[Diagram: PCIe topology. Sixteen GPU+NIC pairs sit behind eight leaf PCIe switches (two GPUs and two NICs each); the leaf switches aggregate into four upstream PCIe switches, which attach to two Intel Xeon sockets connected by UPI.]
Before and After GPUDirect RX

Before GPUDirect:
• Payload reassembled in CPU memory (packet ordering done on the CPU)
• Copied to the device in a single (large) copy
[Diagram: CPU mbufs (header + payload data) are reordered in CPU memory, then the full block of payload data for packets 0..N is copied to GPU memory.]

After GPUDirect:
• Payload already in GPU memory
• Device pointers ordered in CPU memory
• Only the pointers are copied to GPU memory (sketch below)
[Diagram: CPU mbufs hold the packet headers plus a device pointer per packet; the ordered array of device pointers 0..N is copied to GPU memory, where the payloads (Pkt0..PktN) already reside.]

~500x less data copied from CPU->GPU!
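A sketch of the "After GPUDirect" step, assuming the header/data-split mbuf layout shown earlier (payload in the second, GPU-memory segment); the function and kernel names are illustrative, not Viasat's code, and packet reordering is elided.

```cuda
#include <cuda_runtime.h>
#include <rte_mbuf.h>

__global__ void process_payloads(void * const *payloads, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        /* ... signal processing on payloads[i], entirely in GPU memory ... */
    }
}

void submit_batch(struct rte_mbuf **pkts, unsigned int n,
                  void **d_ptrs,            /* device array of n void*   */
                  void **h_ptrs,            /* pinned host staging array */
                  cudaStream_t stream)
{
    /* Order the device payload pointers in CPU memory; the headers stay in
     * the first, sysmem segment and are only used for control logic. */
    for (unsigned int i = 0; i < n; i++)
        h_ptrs[i] = rte_pktmbuf_mtod(pkts[i]->next, void *);

    /* Copy only the pointers (8 bytes per packet) instead of the payloads. */
    cudaMemcpyAsync(d_ptrs, h_ptrs, n * sizeof(void *),
                    cudaMemcpyHostToDevice, stream);

    process_payloads<<<(n + 255) / 256, 256, 0, stream>>>(d_ptrs, n);
}
```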
Thank you – Questions?

Elena Agostini (NVIDIA),


Cliff Burdick (Viasat),
Shahaf Shuler (Mellanox)
