
Application Offloading

Huỳnh Nhật Quang, Nguyễn Đức Tài, Nguyễn Hữu Thắng,


Phạm Nguyễn Thiên Phúc, Nguyễn Gia Phần, Dương Tiến Thành
VNU-HCM

Abstract: Network speeds have grown significantly faster than CPU performance, forcing many data center applications to sacrifice application cycles for packet
processing. The literature shows that there is a performance gap between a general-purpose CPU and a network processor ASIC. Exploiting SmartNIC capabilities to
improve TCP performance is a challenging task due to the design of the network protocol stack.
As a result, developers have started to offload computation to programmable NICs, dramatically improving the performance and energy efficiency of many
data center applications. However, implementing data center network applications in a combined CPU-NIC environment is challenging.

1. Introduction

λ-NIC is a system that aims at offloading serverless workloads to SmartNICs. Serverless workloads are fine-grained functions with short service times and low memory usage. If correctly balanced and distributed, serverless applications can be more efficient and cost-saving. The authors used SmartNICs, or more specifically ASIC-based NICs, to run these lambdas more efficiently.

ASIC-based NICs consist of hundreds of Reduced Instruction Set Computer (RISC) processors (i.e., NPUs) with independent local instruction storage and memory. Consequently, these SmartNICs, unlike GPUs and FPGAs that are optimized to accelerate specific workloads, are better suited to running many discrete functions in parallel at high speed and low latency.

2. Literature Review

In this model, λs are completely independent programs. The matching stage operates as a scheduler that forwards packets to the matching lambdas or to the host OS. In the last stage, a parser handles packet operations. The authors developed an open-source implementation of λ-NIC using P4-enabled SmartNICs. They also implemented methodologies to optimize λs so that they utilize the SmartNIC resources efficiently. Results show that λ-NIC improves workloads' response latency by up to 880x and throughput by up to 736x, while reducing host CPU and memory usage.

A scheme called nanoPU minimizes the tail latency of Remote Procedure Calls (RPCs). The system mitigates the causes of high RPC tail latency by bypassing the cache and memory hierarchy: the nanoPU places arriving messages directly into the CPU register file. Figure 25 shows the block diagram of the nanoPU.

3. Application Offloading Comparison, Discussions, and Limitations

Table 12 compares the schemes described in the previous sections. P4-based NICs are continually evolving to support new features. NICs can now host advanced features such as protocol offloading, packet classification, rate limiting, cryptographic functions, and virtualization. However, a drawback of P4-based NICs is the complexity of the platforms. High-level programmable platforms support only limited network functions, and existing NICs do not offer a systematic methodology for chaining multiple offload capabilities.

P4-programmable NICs present a higher level of flexibility and programmability. These programmable NICs can offload almost any packet processing function from the server. Moreover, they employ processors that are more energy-efficient than x86-based server processors, achieving higher energy efficiency in packet processing.

4. Comparison with Legacy Approaches

P4-programmable NICs can be:
• equipped with specialized packet engines that allow the implementation of a variety of protocol accelerators;
• considered a general-purpose offloading platform for developing key/value store applications, microservices, and other types of network functions;
• considered network accelerators capable of offloading communication routines and computational kernels from the CPU;
• extended to support applications, thus minimizing CPU intervention.

Programmable NICs that do not use the P4 language (e.g., Mellanox BlueField SmartNICs [153]) present a lower degree of flexibility, making the development process more difficult and tied to a vendor's technology. P4-programmable NICs offer similar performance and a scalable solution for large and diverse deployments.
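The matching-stage scheduler that λ-NIC uses to dispatch packets can be illustrated with a small match-action sketch. This is a hypothetical Python model, not the authors' P4 code: the choice of destination port as the match key, the handler names, and the table layout are all assumptions made for illustration.

```python
# Minimal sketch of a lambda-NIC-style matching stage: a match-action
# table keyed on a packet header field dispatches each packet either to
# a lambda deployed on the NIC or to the host OS (the default action).

def resize_thumbnail(payload: bytes) -> bytes:
    # Stand-in for a short-lived serverless function.
    return payload[:16]

def word_count(payload: bytes) -> bytes:
    # Another illustrative fine-grained function.
    return str(len(payload.split())).encode()

# Match-action table: header value -> lambda running on the NIC.
MATCH_TABLE = {
    5001: resize_thumbnail,
    5002: word_count,
}

def host_os(payload: bytes) -> bytes:
    # Default action: punt unmatched packets to the host network stack.
    return b"to-host:" + payload

def matching_stage(dst_port: int, payload: bytes) -> bytes:
    """Scheduler: forward the packet to the matching lambda, or to the host OS."""
    handler = MATCH_TABLE.get(dst_port, host_os)
    return handler(payload)
```

In a real P4 pipeline, MATCH_TABLE would be a match-action table populated by the control plane, and the default action would model forwarding unmatched traffic to the host OS.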
