DPDK acceleration with GPU


ELENA AGOSTINI (NVIDIA),
CLIFF BURDICK (VIASAT),
SHAHAF SHULER (MELLANOX)
Agenda

• GPUDIRECT RDMA
• GPU DPDK
  • General overview
  • Header/Data split feature
  • Benchmarks
• USE CASE: VIASAT
  • Converting a legacy application
GPUDirect RDMA

• 3rd party PCIe devices (e.g. a network card) can directly read/write GPU memory
• GPU and external device must be under the same PCIe root complex
• No unnecessary system memory copies, no CPU overhead
  • nv_peer_mem kernel module: https://github.com/Mellanox/nv_peer_memory
• Example: cudaMalloc(gpu_buffer) + MPI_Send(gpu_buffer) with a CUDA-aware MPI (minimal sketch below)
• https://docs.nvidia.com/cuda/gpudirect-rdma/index.html

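A minimal sketch of the cudaMalloc + MPI_Send pattern above. It assumes a CUDA-aware MPI build (e.g. one backed by UCX) so MPI can read the GPU buffer directly through GPUDirect RDMA; error handling is omitted.

```cuda
// Rank 0 sends a GPU-resident buffer to rank 1 without staging it in sysmem.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int len = 1 << 20;             /* 1 MiB payload */
    void *gpu_buffer = NULL;
    cudaMalloc(&gpu_buffer, len);        /* buffer lives in GPU memory */

    if (rank == 0)
        MPI_Send(gpu_buffer, len, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(gpu_buffer, len, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(gpu_buffer);
    MPI_Finalize();
    return 0;
}
```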
GPU DPDK: recipe
• GPUDirect RDMA: NVIDIA GPU + Mellanox ConnectX-5
• External buffers (DPDK 18.05): see the allocation sketch after the diagram below
• NVIDIA API to allocate mbuf payloads in GPU memory
• RX and TX (inlining not allowed)

[Diagram: GPUDirect RDMA receive path. Packets (pkt0, pkt1, ...) arrive at the NIC and are DMA-written over PCIe into the RX queue's mempool: the mbuf headers live in system memory, while the mbuf payloads live in GPU memory.]

GPU memory consistency problem: NVIDIA GTC '19, D. Rossetti, E. Agostini, Session S9653, "How to make your life easier in the age of exascale computing using NVIDIA GPUDirect technologies"
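A minimal sketch of the external-buffer recipe, assuming DPDK >= 18.05 with nv_peer_mem loaded. It uses the stock rte_pktmbuf_attach_extbuf() API rather than the NVIDIA allocator mentioned above, and gpu_buffer_iova() is a hypothetical helper standing in for however the NIC-visible address of the GPU buffer is obtained on a given platform.

```cuda
/*
 * Sketch: back an mbuf's payload with GPU memory via a DPDK external buffer.
 * The shared info must stay in host memory (the CPU cannot write into the
 * tail of a cudaMalloc'd buffer), and the IOVA lookup is platform dependent.
 */
#include <cuda_runtime.h>
#include <rte_mbuf.h>

/* Hypothetical: NIC-visible address of a GPU buffer (platform specific). */
extern rte_iova_t gpu_buffer_iova(void *gpu_addr);

static void gpu_buf_free_cb(void *addr, void *opaque)
{
    (void)opaque;
    cudaFree(addr);                             /* release GPU memory on last unref */
}

struct rte_mbuf *
alloc_gpu_payload_mbuf(struct rte_mempool *mp,
                       struct rte_mbuf_ext_shared_info *shinfo, /* host memory */
                       uint16_t len)
{
    void *gpu_buf = NULL;
    cudaMalloc(&gpu_buf, len);                  /* payload lives in GPU memory */

    shinfo->free_cb = gpu_buf_free_cb;
    shinfo->fcb_opaque = NULL;
    rte_mbuf_ext_refcnt_set(shinfo, 1);

    struct rte_mbuf *m = rte_pktmbuf_alloc(mp); /* mbuf header stays in sysmem */
    rte_pktmbuf_attach_extbuf(m, gpu_buf, gpu_buffer_iova(gpu_buf), len, shinfo);
    return m;
}
```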
GPU DPDK: interaction with CUDA kernel
• Spend some time batching the packets
• Option 1: launch a new CUDA kernel for each batch
  • No GPU memory consistency problem
  • Adds CUDA kernel launch latency for each batch
• Option 2: polling from a CUDA persistent kernel
  • GPU memory consistency problem
  • No additional latency for CUDA kernel launch
  • Needs a CPU <> GPU communication protocol: the CPU has to notify the persistent kernel about newly received packets (a minimal sketch follows)
• Example: l2fwd-nv

NVIDIA GTC '19, Session S9730: E. Agostini, C. Tekur, "Packet Processing on GPU at 100GbE Line Rate"
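A minimal sketch of an Option 2 style notification protocol, not the l2fwd-nv code: the doorbell counts batches published by the CPU and lives in pinned, host-mapped memory (cudaHostAlloc with cudaHostAllocMapped), so a single-block persistent kernel can spin on it. All names are illustrative.

```cuda
#include <cuda_runtime.h>

__global__ void persistent_rx_kernel(volatile unsigned int *doorbell,
                                     unsigned char * const *payloads,
                                     volatile int *exit_flag)
{
    unsigned int done = 0;                   /* batches consumed so far */
    while (!*exit_flag) {
        if (*doorbell == done)
            continue;                        /* nothing new: keep polling */
        __threadfence_system();              /* see CPU writes before touching data */

        /* ... all threads of the block process the packets of batch `done`
         *     through `payloads` (already resident in GPU memory) ... */

        __syncthreads();
        done++;                              /* consume exactly one batch */
    }
}

/* Host side: called once per received burst, after the mbufs are ready. */
void publish_batch(volatile unsigned int *doorbell)
{
    __sync_synchronize();                    /* payload writes visible before the bump */
    (*doorbell)++;                           /* persistent kernel picks it up */
}
```

The kernel is launched once at startup, e.g. persistent_rx_kernel<<<1, 256>>>(doorbell, d_payloads, exit_flag), and the CPU only ever touches the doorbell and exit flag afterwards.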
GPU DPDK: Header/Data split extension
• Having the whole packet in GPU memory may force you to offload the control logic to the GPU
• Header/Data split: you can receive part of the packet in system memory and the remaining part in GPU memory
• Each packet is stored into two chained mbufs on the RX queue (a consumer sketch follows the diagram below)
  • Initial X bytes in system memory
  • Remaining Y bytes in GPU memory

[Diagram: Header/Data split RX path. Each received packet (X + Y bytes) is split by the NIC across two chained mbufs: the first mbuf, from a system-memory mempool, holds the initial X bytes; the second mbuf, filled via GPUDirect RDMA from a GPU-memory mempool, holds the remaining Y bytes. The mbuf headers always stay in system memory.]
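A sketch of consuming a header/data-split packet under the layout above (names and the IPv4 filter are illustrative, and the newer rte_ether_* names from DPDK 19.08+ are assumed): the CPU reads the protocol header from the first, sysmem segment and only records the device address of the payload segment.

```cuda
#include <rte_byteorder.h>
#include <rte_ether.h>
#include <rte_mbuf.h>

struct gpu_work {
    void    *gpu_payload;   /* device address of the Y payload bytes */
    uint16_t payload_len;
};

static inline int split_pkt_to_work(struct rte_mbuf *m, struct gpu_work *w)
{
    /* Control logic on the CPU: the headers live in the sysmem segment. */
    struct rte_ether_hdr *eth = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
    if (eth->ether_type != rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4))
        return -1;                       /* not for us: let the caller drop it */

    /* Payload segment: its data pointer is a GPU address, never dereferenced
     * on the CPU. */
    struct rte_mbuf *seg = m->next;
    w->gpu_payload = rte_pktmbuf_mtod(seg, void *);
    w->payload_len = rte_pktmbuf_data_len(seg);
    return 0;
}
```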
Benchmarks: testpmd vs testpmd
System setup
• System HW:
  • Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz (Skylake)
  • NVIDIA V100 GPU
  • Mellanox ConnectX-5, 100Gbps
  • 4kB MaxReadRequest
  • PCIe bridge: Broadcom PLX Technology 9797
• Testpmd with GPU mbufs support
  • 2 cores, 4 RX/TX queues per core, 256 descriptors per queue
• Testpmd TX only mode
  • 99 Gbps, 8 TX cores, 4 RX/TX queues, 256 descriptors per queue

  Packet size      GPU mem only   H/D split
  128 B            30 Gbps        40 Gbps
  512 B            91 Gbps        98 Gbps
  1024 B           96 Gbps        99 Gbps
  2048 B (Jumbo)   96 Gbps        99 Gbps
Benchmarks: testpmd vs l2fwd-nv
System setup
• System HW:
  • Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz (Skylake)
  • NVIDIA V100 GPU
  • Mellanox ConnectX-5, 100Gbps
  • 4kB MaxReadRequest
  • PCIe bridge: Broadcom PLX Technology 9797
• l2fwd-nv with GPU memory mempool, 64 packets per burst
  • 8 RX/TX cores
• Testpmd TX only mode
  • 99 Gbps, 8 TX cores, 32 descriptors per queue

  Packet size      No swap    Swap with persistent kernel   Swap with dedicated kernel
  128 B            31 Gbps    25 Gbps                       9 Gbps
  256 B            53 Gbps    46 Gbps                       18 Gbps
  512 B            86 Gbps    80 Gbps                       35 Gbps
  1024 B           88 Gbps    88 Gbps                       65 Gbps
  2048 B (Jumbo)   88 Gbps    88 Gbps                       86 Gbps
Aerial SDK

• NVIDIA Aerial is a set of SDKs that enables GPU-accelerated, software-defined 5G wireless RANs. Today, NVIDIA Aerial provides two critical SDKs: cuVNF and cuBB.
• The NVIDIA cuVNF SDK provides optimized input/output (IO) and packet processing, whereby 5G packets are sent directly to GPU memory from GPUDirect-capable network interface cards (NICs), such as the Mellanox ConnectX-5.
• The NVIDIA cuBB SDK provides a fully offloaded 5G signal processing pipeline (L1 5G PHY) which delivers unprecedented throughput and efficiency by keeping all physical layer processing within the GPU's high-performance memory.
• https://developer.nvidia.com/aerial-sdk
Thoughts toward upstream
• New APIs to create a mempool with persistent external buffers
  • Buffers can come from any memory on the platform
  • New mempool flag: MEMPOOL_F_PINNED_EXT_MBUF
  • Avoid detaching the external buffer during mbuf free
• Fine-grained PMD inline control
  • An mbuf flag to hint the PMD whether or not to inline the mbuf data
  • The flag will live in the dynamic flags section, after a handshake between application and PMD is complete (see the sketch below)
• Header/Data split
  • Populate the mempool with two types of external buffers (CPU and GPU)
  • Even mbufs in GPU memory, odd mbufs in CPU memory
  • Configure the PMD RX queue to always drain X mbufs per post-recv
  • Each packet is then spread between GPU and CPU memory
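A sketch of how the proposed per-mbuf "do not inline" hint could look, assuming the dynamic mbuf flag mechanism that landed in DPDK's rte_mbuf_dyn.h (19.11+); the flag name is hypothetical and would be the string the application and PMD agree on during the handshake.

```cuda
#include <stdio.h>
#include <string.h>
#include <rte_mbuf.h>
#include <rte_mbuf_dyn.h>

static uint64_t no_inline_flag;     /* bit mask returned by registration */

static int register_no_inline_flag(void)
{
    struct rte_mbuf_dynflag desc;
    memset(&desc, 0, sizeof(desc));
    /* Hypothetical name shared between application and PMD. */
    snprintf(desc.name, sizeof(desc.name), "app_pmd_no_inline");

    int bit = rte_mbuf_dynflag_register(&desc);
    if (bit < 0)
        return -1;
    no_inline_flag = 1ULL << bit;
    return 0;
}

/* Mark an mbuf whose payload sits in GPU memory so the PMD never inlines it. */
static inline void mark_no_inline(struct rte_mbuf *m)
{
    m->ol_flags |= no_inline_flag;
}
```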
Use Case: ViaSat Satellite Internet

• Global satellite internet provider
  • Residential
  • Mobile
• Users of DPDK since 2.x
  • MAC layer and above packet processing
• Migrating some signal processing code out of FPGAs into GPUs
  • Rapid prototyping
  • Highly parallel, real-time tasks
  • Some packet loss is okay
ViaSat Signal Processing Pipeline

• Using multiple 100Gbps NICs to keep feeding the processing pipeline in real time
• CPU only needs to know a few bytes in the header for control logic
• Payload is processed entirely on the GPU
• Legacy DPDK-based application started hitting memory bandwidth limitations early on
  • Larger systems exacerbated this
• Analyzed RDMA (RoCE/iWARP)
  • Line rate issues
  • Control over senders
  • Tolerance to lossy data
ViaSat Signal Processing Pipeline

[Diagram: Receive → Processing → Transmit pipeline, fed through ring buffers by an lcore. A 100Gbps ingress NIC and a 100Gbps egress NIC attach alongside the GPU over PCIe3 x16 (~100Gbps) and CPU memory; memory bandwidth is annotated at ~400Gbps against ~100Gbps over PCIe.]
Supermicro 9029 (16 NICs / 16 GPUs)

[Diagram: PCIe topology. Sixteen GPU+NIC pairs sit behind eight leaf PCIe switches (two GPUs and two NICs each); the leaf switches aggregate into four upstream PCIe switches, which attach to two Intel Xeon sockets connected by UPI.]
Before and After GPUDirect RX

Before GPUDirect:
• Payload reassembled in CPU memory (packet ordering done on the CPU)
• Copied to the device in a single (large) copy
[Diagram: CPU mbufs (header + payload data) are reordered in CPU memory, then the full block of payload data for packets 0..N is copied to GPU memory.]

After GPUDirect:
• Payload already in GPU memory
• Device pointers ordered in CPU memory
• Only the pointers are copied to GPU memory (sketch below)
[Diagram: CPU mbufs hold the packet headers plus a device pointer per packet; the ordered array of device pointers 0..N is copied to GPU memory, where the payloads (Pkt0..PktN) already reside.]

~500x less data copied from CPU->GPU!
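A sketch of the "After GPUDirect" step, assuming the header/data-split mbuf layout shown earlier (payload in the second, GPU-memory segment); the function and kernel names are illustrative, not Viasat's code, and packet reordering is elided.

```cuda
#include <cuda_runtime.h>
#include <rte_mbuf.h>

__global__ void process_payloads(void * const *payloads, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        /* ... signal processing on payloads[i], entirely in GPU memory ... */
    }
}

void submit_batch(struct rte_mbuf **pkts, unsigned int n,
                  void **d_ptrs,            /* device array of n void*   */
                  void **h_ptrs,            /* pinned host staging array */
                  cudaStream_t stream)
{
    /* Order the device payload pointers in CPU memory; the headers stay in
     * the first, sysmem segment and are only used for control logic. */
    for (unsigned int i = 0; i < n; i++)
        h_ptrs[i] = rte_pktmbuf_mtod(pkts[i]->next, void *);

    /* Copy only the pointers (8 bytes per packet) instead of the payloads. */
    cudaMemcpyAsync(d_ptrs, h_ptrs, n * sizeof(void *),
                    cudaMemcpyHostToDevice, stream);

    process_payloads<<<(n + 255) / 256, 256, 0, stream>>>(d_ptrs, n);
}
```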
Thank you – Questions?

Elena Agostini (NVIDIA),


Cliff Burdick (Viasat),
Shahaf Shuler (Mellanox)
