
Why is a SmartNIC needed?

Specifically, there are three major reasons why a smart offload NIC, or SmartNIC, is needed:
1) The complexity of the server-based networking data plane has increased dramatically with the introduction of
overlay tunneling protocols such as VXLAN, and virtual switching with complex actions.
2) Increasing network interface bandwidths mean that performing these functions in software creates an untenable
load on the CPU resources, leaving little or no CPU left over to run applications.
3) A key requirement of SDN is that the networking data plane must remain fungible, so fixed-function offload
technologies cannot be applied.
What exactly is a SmartNIC anyway?
The term SmartNIC is being bandied about quite a bit in the industry right now, and as is often the case with any new
terminology, there is some confusion over the precise definition. Netronome’s perspective is that a SmartNIC must:
1) Be able to implement complex server-based networking data plane functions, including multiple match-action
processing, tunnel termination and origination, metering and shaping and per-flow statistics, for example.
2) Support a fungible data plane either through updated firmware loads or customer programming, with little or no
predetermined limitations on functions that can be performed.
3) Work seamlessly with existing open source ecosystems to maximize software feature velocity and leverage.
Server Offload Models
Let’s compare four basic models of server offload:
Model 1: Partial data plane offload using a fixed function NIC
In this case, a portion of the server networking data plane is implemented in and offloaded by the NIC. Typically, such
a NIC is not programmable and hence the data plane implemented is fixed. For the features supported by the data
plane implemented in the NIC silicon the packet forwarding is performed by the NIC, once a flow has been identified.
The identified flow rules are stored in memory on-chip in the NIC. All traffic that relates to features not implemented
by the NIC, or flows that are not stored in the memory on-chip in the NIC, are handled by the networking data plane
in the host. Initial packets in a flow are handled by the networking data plane in the host. The performance of such an
implementation is dependent on the percentage of data plane features implemented by the NIC and the number of flow
rules that can be stored on-chip in the NIC.
Model 2: Partial data plane offload using a SmartNIC
In this case, a portion of the server networking data plane is implemented in and offloaded by the SmartNIC. Since
the SmartNIC is programmable, new features that become available in the server networking data plane in the host
can be implemented in the SmartNIC, based on customer requests, to match the features available in the host.
Typically, a SmartNIC includes larger memory on-chip or on the SmartNIC board to hold a much larger number of
flows.
For the features supported by the data plane implemented in the SmartNIC, the packet forwarding is performed by the
SmartNIC, once a flow has been identified. The identified flow rules are stored in memory on-chip or on the SmartNIC.
All traffic that relates to features not implemented by the SmartNIC or flows that are not stored in the memory on-
chip or on the SmartNIC are handled by the server networking data plane in the host. Initial packets in a flow are
handled by the server networking data plane in the host. The performance of such an implementation is dependent on
the percentage of data plane features implemented by the SmartNIC and the number of flow rules that can be stored
on-chip or on the SmartNIC.
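
The forwarding behaviour described for Models 1 and 2 can be sketched as a small simulation. This is only a toy model under stated assumptions: the class name, the action names and the least-recently-used eviction policy are hypothetical and are not taken from any Netronome product. It shows that the first packet of a flow goes to the host, later packets are forwarded by the NIC if the flow rule fits in the on-NIC table, and features the NIC does not implement always fall back to the host.

```python
from collections import OrderedDict


class OffloadNic:
    """Toy model of partial data plane offload with a bounded on-NIC flow table."""

    def __init__(self, capacity, supported_actions):
        self.capacity = capacity                 # max flow rules the NIC can hold
        self.supported = set(supported_actions)  # features the NIC data plane implements
        self.flows = OrderedDict()               # flow key -> action, oldest first

    def receive(self, flow_key, action):
        """Return 'nic' or 'host' depending on where the packet is forwarded."""
        if flow_key in self.flows:
            self.flows.move_to_end(flow_key)     # refresh recency on a hit
            return "nic"
        # Miss: the host data plane handles the packet (for example the
        # initial packet of a flow) and may then install a rule on the NIC.
        if action in self.supported:
            if len(self.flows) >= self.capacity:
                self.flows.popitem(last=False)   # evict the oldest rule (assumed policy)
            self.flows[flow_key] = action
        return "host"


nic = OffloadNic(capacity=2, supported_actions={"forward", "vxlan_decap"})
print([nic.receive(("10.0.0.1", 80), "forward") for _ in range(3)])
```

In this model the fraction of packets that return "nic" grows with both the table capacity and the set of supported actions, which are the two factors the text names as determining performance.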
Model 3: Whole data plane offload using a SmartNIC
In this case, the server networking data plane is implemented in and offloaded by the SmartNIC. Since the SmartNIC
is programmable, new features available in new versions of the server networking data plane in the host can be
implemented in the SmartNIC to match the features in the server networking data plane in the host, thereby
maintaining feature parity. Typically, a SmartNIC includes larger memory on-chip or on the SmartNIC board to hold
a large number of flows.
For all packets, packet forwarding is performed by the SmartNIC once a flow has been identified. The identified flow
rules are stored in memory on-chip or on the SmartNIC. Initial packets in a flow are handled by the server networking
data plane in the host. All traffic that relates to flows not stored in the memory on-chip or on the SmartNIC are
handled by the server networking data plane in the host. The performance of such an implementation is dependent
on the number of flow rules that can be stored on-chip or on the SmartNIC.
Model 4: Complete control plane and data plane offload using a SmartNIC
In this case, the server networking control plane and data plane are implemented in and offloaded by the SmartNIC.
Since the SmartNIC is programmable, new features available in new versions of the control plane and data plane in
the host can be implemented in the SmartNIC to match the features in the host, thereby maintaining feature parity.
This model assumes there is no control or data plane in the host. Typically, a SmartNIC includes larger memory on-
chip or on the SmartNIC board to hold a large number of flows. For all packets, packet forwarding is performed by
the SmartNIC once a flow has been identified. The identified flow rules are stored in memory on-chip or on the
SmartNIC. Initial packets in a flow are handled by the SmartNIC or in collaboration with a centralized SDN Controller
that the SmartNIC is connected to. The performance of such an implementation is dependent on the number of flow
rules that can be stored on-chip or on the SmartNIC.
Models supported by Netronome
Netronome Agilio CX ISAs (2x10GbE, 2x25GbE, 2x40GbE) are designed to support Models 2 and 3 with the Agilio
Software available today. The Netronome Agilio CX ISAs include large memory on-chip and on the SmartNIC board. With
up to 2GB of DDR memory on the SmartNIC, up to 2M flow rules can be supported, helping boost performance
significantly. Netronome Agilio CX ISAs include an ARM processor and in-band or out-of-band control mechanisms,
making the hardware capable of supporting Model 4. Software support for Model 4 will be coming in a future release
of the Agilio Software.

A network interface card connects your computer to a local data network or the Internet. The
card translates computer data into electrical signals it sends through the network; the signals are
compatible with the network so computers can reliably exchange information. Because of the
popularity of the Internet and networks in general, virtually all desktop and notebook PCs have
some form of interface card included. You can add a network card to bare-bones computers
which don't have one.

Description

A standard PC network adapter is a plastic circuit board about the size of a playing card. It has
several computer chips that process signals from the network and the PC. The card slides into the
PC's chassis and mates firmly with a connector on the motherboard. A steel bracket holds the
card in place. The bracket may have a network cable jack or an antenna, depending on the type
of network to which the computer connects. The bracket also has light-emitting diodes that
indicate network status and activity.

Function

Drivers

In addition to the network card's hardware, it needs programming to make it work. When you
install the card, Microsoft Windows loads software drivers for your card's make and model. Your
browser, email and other programs communicate through Windows to reach the network;
Windows, in turn, passes data to the drivers that are programmed specifically for your network
interface card. The card cannot function without the right driver software.

Types

Most contemporary network cards work with Wi-Fi wireless networks; these cards have an
antenna to send data signals via radio waves. Some networks still use wired Ethernet
connections; these cables have a rectangular plug which mates with a jack on the network card's
bracket. In many new computers, network adapter cards are actually custom computer chips built
into the PC's motherboard. Because of the nearly universal use of the Internet, almost all
computers include a network capability. Having the chips on the motherboard frees up a card slot
for other devices you may want to add later. Computer retailers sell network accessory cards if
you want to install one in a PC. For desktop computers, these are standard PCI cards; notebook
computers use smaller PC accessory cards that slide into a slot on the computer's side.

Chicago native J.T. Barett has a Bachelor of Science in physics from Northeastern Illinois University and
has been writing since 1991. He has contributed to "Foresight Update," a nanotechnology newsletter
from the Foresight Institute. He also contributed to the book, "Nanotechnology: Molecular Speculations
on Global Abundance." Ref https://itstillworks.com/function-network-interface-card-1422.html

1. We're here at QCon London 2016. I'm sitting here with David Riddoch. So David, how are
you?

Hi. I'm the chief architect at Solarflare. Solarflare makes the world's fastest Ethernet network
adapters.

2. So what makes your Ethernet adapters fast? What's Solarflare?

On the one hand they're just good network adapters. So they do the same job as the network
adapters that most people buy today from companies like Intel or Broadcom or Mellanox. They
do the same job. So you plug them into a server. They connect to a network using maybe 1
gigabit or 10 gigabit or 40 gigabit Ethernet. They move packets in and out of the host and ours
will give you better performance than everybody else's. But in addition to that, we have some
special software which is called OpenOnload that uses a technology called kernel bypass to make
the networking go much, much faster. Now, kernel bypass is a technology that was invented in
the high performance computing industry in around the 1980's.

Originally it was used in these high performance computing clusters where you have a whole
load of machines that are essentially identical. They're all working together on a big number
crunching problem, passing messages between each other in order to collaborate on this problem
and they have to have a fast network to do that. And so that was really -- in those days, that was
the application of fast networking. So they came up with the idea of basically the protocol stacks
are slow. So we're going to bypass them and maybe we'll move some of the work into the
network adapters. But we want to avoid all those costs like context switching, going in and out of
the Kernel, interrupts, and all of those slow things. That's essentially what kernel bypass is. It's
moving networking into the user space applications and allowing those applications to talk to the
network hardware directly.

Now, I was involved in a kernel bypass networking research project at AT&T labs in around
1999. We had a great technology. It was 1 gigabit, so super fast which was actually, you know,
that was reasonably fast at the time but the key thing was for low latency. So it could get a
message from one machine to another in less than two microseconds which is actually pretty
comparable with the best you can do today. It's slightly worse than what you can do today. So
that was really, really fast in 1999.

But apart from being super fast, it had the same flaws that every other kernel bypass network
technology had at that time which was it wasn't Ethernet which is the network that everybody
uses. It used proprietary link layer protocols and other protocols on top of that. So it didn't use
IP, TCP, and UDP and it required that you use new APIs or adapt your middleware to use new
APIs in order to be able to use it all. So that meant that you can't just take this technology, put it
on a server and use it. That means of course you can't sell it.

So we all happily got made redundant in 2002 and AT&T ran out of money. And that was the
opportunity for us to start the company called Cambridge Internetworking that became Solarflare
and our idea for that company was take what we have learned about high performance
networking, apply it in a way that people can actually deploy easily, which means it's got to be
Ethernet. It's got to be IP, TCP, and UDP. And it's got to be the standard BSD sockets API.

Werner: You basically made normal APIs fast for the rest of us who don't use fancypants
interconnects.

Exactly, yes. So you can take a server, put a Solarflare network adapter in it and it'll work
exactly the same as it did with anybody else's network adapter. You can put the OpenOnload
acceleration technology on there and you can then select which applications you want to
accelerate with OpenOnload. And the idea is they'll work in exactly the way they did before. It's
just the networking calls suddenly become much, much cheaper. So that gives you lower latency
and it gives you high throughput, high CPU efficiency. And it literally can be that easy. Some
applications, because the environment that it runs in has different assumptions, different locking
behaviors, some applications don't benefit from this technology but many applications do and it
can be as simple as: install our drivers, go two to three times as fast or in some cases, install our
drivers, do a little bit of tuning, go fast.

3. Obviously I have to pick up on which applications don't benefit or which applications benefit
more?

Yes. Well first off, the easy answer is the ones that are already fast benefit the most and in one
sense that's Amdahl's law in a sense of if you've got bottlenecks elsewhere, we can't do anything
about those bottlenecks elsewhere but if your bottleneck is in the networking, then we'll make
that a bit faster. If networking is a big part of what you do, making that faster will make your
application a lot faster. But a classic example of the sort of application that's unlikely to get
much speed-up is one where you have many connections, one thread per connection and you're
throwing around lots of small messages distributed over those many connections. What you'll
find is that the cost of thread switching between those connections vastly outweighs the cost of
the networking even without our technology and therefore no matter how cheap we make the
networking, that application isn't going to get very much of a speed up.
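
The Amdahl's law argument in this answer can be made concrete with a short calculation. This is a sketch, not a measurement: the networking fractions (80% and 10%) and the fivefold networking speed-up are hypothetical values chosen only for illustration.

```python
def amdahl_speedup(network_fraction, network_speedup):
    """Overall speed-up when only the networking share of the run time gets faster.

    network_fraction: share of total time spent in networking (0.0 to 1.0)
    network_speedup:  factor by which the networking part is accelerated
    """
    if not 0.0 <= network_fraction <= 1.0:
        raise ValueError("network_fraction must be between 0 and 1")
    if network_speedup <= 0:
        raise ValueError("network_speedup must be positive")
    return 1.0 / ((1.0 - network_fraction) + network_fraction / network_speedup)


# Application dominated by networking (hypothetical 80% of run time).
print(round(amdahl_speedup(0.8, 5.0), 2))   # about 2.78x overall

# Application dominated by thread switching (hypothetical 10% networking).
print(round(amdahl_speedup(0.1, 5.0), 2))   # about 1.09x overall
```

The second case matches the thread-per-connection example: when most of the time goes elsewhere, even a large networking speed-up barely moves the total.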

4. Let's dig into the secret sauce behind it. Let's spill the beans. How does it work? So how do
you implement this? Is there a special kernel or you said I just install special drivers?

That's right. Yes, so it runs on Linux. It works with any kernel. We distribute it as source. So it
includes really two parts. There are kernel drivers and then there's the user space library. The
kernel drivers are partly there to provide configuration set up and allocate resources. The really
interesting stuff happens in user space. And the user space library gets loaded into your
application at runtime using a technique called LD_PRELOAD. What this does is when your
application starts, the runtime linker will load our library ahead of the standard C library.

So the places where your applications call send and receive and connect and socket and epoll and
all of those things. The linker will send those calls to our library and when the application calls
them, it'll come to us. We will then have a look at the file descriptors that have been invoked,
we'll make a decision as to whether we can handle that call ourselves. If we can, we do, it goes
fast. If we can't, then we just forward that on to libc which forwards them onto the Kernel. And
so the normal behavior is preserved for everything that we can't accelerate. So that gives us
compatibility. It's perfectly okay for you to do some of your networking over Solarflare and that
will go fast and some of it to do over the built-in network adapter that's on your motherboard and
that'll just get the normal kernel based performance.
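
The dispatch decision described here (handle the call in the library if possible, otherwise forward it to libc) can be sketched as follows. This is a toy model of the decision logic only, not OpenOnload code and not an LD_PRELOAD library: the function names and the set of accelerated file descriptors are hypothetical stand-ins.

```python
def make_interposed_send(accelerated_fds, fast_send, libc_send):
    """Build a send() that uses the fast path when the descriptor allows it.

    accelerated_fds: descriptors the user space stack owns (assumed known)
    fast_send:       stand-in for the user space stack's send
    libc_send:       stand-in for the normal libc/kernel send
    """
    def send(fd, data):
        if fd in accelerated_fds:
            return fast_send(fd, data)   # handled in user space, no system call
        return libc_send(fd, data)       # normal behaviour is preserved
    return send


log = []
send = make_interposed_send(
    accelerated_fds={3},
    fast_send=lambda fd, data: log.append(("fast", fd)) or len(data),
    libc_send=lambda fd, data: log.append(("libc", fd)) or len(data),
)
send(3, b"order")   # descriptor on the accelerated adapter
send(7, b"log")     # descriptor on another interface
print(log)
```

Because the fallback path is the unmodified call, an application can mix accelerated and ordinary sockets, as the answer points out.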

Now, essentially what's happening at the sort of stack level. So we're implementing the sockets
API. We're accelerating TCP and UDP sockets. So we have our own TCP, UDP stack inside this
library which is implementing those protocols and we are not offloading those protocols
themselves to the adapters. The only offloads the adapters do really are checksum off-loads and
distributing load. So the adapter has to know where to send packets so they arrive at the right
application. So the actual protocols are done in software and the conventional wisdom used to be
that TCP was horribly complicated and required enormous amounts of CPU time and it's
complete nonsense. TCP is perfectly fast as we have proven.

The issue really is the efficiency of the interfaces between hardware and software and the costs
of going in and out of the kernel, so syscalls, the cost of interrupts and other context switches, it's
all of those things that add up. The reason this technology gets you a big win in terms of
performance is it can bypass interrupts. It does bypass system calls on the fast paths. It has more
efficient locking. So there are far fewer bus-locked atomic operations on our fast paths than
there are going through the kernel stack. And we get a huge advantage from just being very
tightly integrated whereas the kernel stack is a very general purpose piece that supports millions
of protocols and does it very well, I should say, with quite incredible performance, but that
generality comes at a certain cost in terms of CPU overhead and by being integrated, we
essentially cut down the number of CPU cycles that it takes to do these operations enormously.

Just to give you an idea, roughly speaking, a send or a receive call takes less than one-fifth of the
number of CPU cycles that it takes through the kernel stack. The effect of that is on the one hand lower latency,
so you can get messages from A to B much more quickly. And we're talking about -- with our
technology, current generation adapters, we have got some faster ones coming very soon, but our
current generation takes a little over a microsecond to get a message from one application to an
application in another machine over the network excluding switches. With a kernel stack, if you
tune it really well, you can do that in about four microseconds, more normally if you just use
default settings out of the box, you'll get probably something like 50 microseconds.

Werner: Pretty speedy.

Yes, and this is really important for a number of different people. I mean that low latency aspect
is critically important for people like electronic traders who are essentially engaged in a race.
They are looking for opportunities to trade. Whoever spots the opportunity and submits an order
first makes more money. So those guys will go to great lengths to be the fastest and this is a
technology that helps them do that. But it's kind of like the nuclear arms race. As soon as one of
them started buying Solarflare, the rest of them had to buy Solarflare or they were out of the
game. So that was great from our point of view. That's the electronic traders.

Other people who have interactive responsive applications, that could be microservices, that
could be more traditional architectures, anything where there are multiple hops involved in
responding to requests. The faster each of those hops executes, the more responsive your overall
application will be, and the more hops you have to handle, the more benefit you will get. But that's just
the latency part. The other thing that we are improving is throughput. By reducing the number of
CPU cycles for sends and receives and epoll calls just means you can do more of them per
second and that means you can handle a much higher throughput.
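
The two effects in this answer, latency adding up across hops and throughput rising as cycles per call fall, can be illustrated with simple arithmetic. The per-hop latencies below are rounded from figures quoted earlier in the interview (roughly 1, 4 and 50 microseconds). The hop count, clock rate and cycle counts are hypothetical, and the model ignores application processing time entirely.

```python
def network_latency_us(hops, per_hop_us):
    """Network latency of a request that crosses `hops` sequential hops."""
    return hops * per_hop_us


def calls_per_second(cpu_hz, cycles_per_call):
    """Upper bound on calls per second for one core doing nothing else."""
    return cpu_hz / cycles_per_call


HOPS = 10  # hypothetical number of sequential hops in one request
for label, per_hop in [("kernel bypass", 1.0),
                       ("tuned kernel stack", 4.0),
                       ("default kernel stack", 50.0)]:
    print(label, network_latency_us(HOPS, per_hop), "us")

# Hypothetical 3 GHz core: 5000 cycles per call versus one-fifth of that.
print(calls_per_second(3e9, 5000), calls_per_second(3e9, 1000))
```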

So that obviously gives you cost savings. It means you can do more on one machine or you can
do the same amount with fewer machines and save money. And it's not just about message rate,
it's also about connection rate. So if you are running something like a load balancer or a web
server, it's all about how many messages and connections per second can we handle, and we
accelerate all of those things.

Werner: So that improves the efficiency, basically. It's not just about being faster it's doing
more with the same hardware.

Yes. That's right. The capacity of each server is much greater, potentially, with this technology
assuming networking is your bottleneck. We won't help if disk is your bottleneck but we'll help if
networking is a bottleneck. Yes, so more capacity per server. You don't need that many servers.

Werner: Well that's always good because as we know, Moore's Law is history, so we have
to make better use of our servers.

Yes. And we are also improving the scaling over the cores within a server as well. So as the
servers get more and more cores, it becomes harder and harder to actually scale applications on
those boxes. And Kernel-bypass, it's not inherent that it makes that better but our implementation
of it essentially allows you to share nothing between the threads that are running on different
cores. By sharing no state, they don't have any of the inefficiencies associated with moving that
state between the caches on the different cores.

Werner: So no cache coherence overhead, or less.

Yes, certainly a lot less. I mean you will get those overheads from other parts of your application
potentially and there is some cost to cache coherency even if you are completely local to your
cache. But we scale a lot better and it can be dramatically better. It depends very much on the
nature of the application and its threading model. But for example, a quite challenging
application is load balancing, particularly TCP load balancing where you have incoming
connections on the internet, outgoing connections into your backend services.

That's very, very hard to scale and the many-core performance if we're using the kernel stack is
only roughly two or three times the single core performance if you do a really good job of tuning
it, when you're using the standard technologies. But OpenOnload can scale that much, much
better on a 12-core box, it's probably going eight times as fast as the single core. So it's not
linear, but it's really good. You're taking advantage of those cores.
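
The scaling figures in this answer can be restated as parallel efficiency, meaning the speed-up divided by the number of cores. The sketch below uses the numbers quoted above (about 8x on 12 cores, against 2x to 3x for the kernel stack). Applying the kernel stack figures to the same 12-core box is an assumption made for comparison.

```python
def scaling_efficiency(speedup, cores):
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


CORES = 12
print(round(scaling_efficiency(8.0, CORES), 2))   # quoted OpenOnload figure: 0.67
print(round(scaling_efficiency(3.0, CORES), 2))   # kernel stack, upper figure: 0.25
print(round(scaling_efficiency(2.0, CORES), 2))   # kernel stack, lower figure: 0.17
```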

5. Is it also because of the specialization, removing locking etc.?

Yes. Not so much. I mean that, the integration, the lack of locking, the saving of the CPU cycles,
that allows us to be twice as fast on one core compared to the kernel stack technology. But the
fact that we can scale when you add more cores, that's all down to the fact that each of those
worker processes running on a separate core doesn't share any state with the other workers so
that they don't have any contention.

6. You basically pin these processes to these cores?

No. We don't need to. I mean as long as -- you might get a benefit by pinning. It depends.
Particularly, the electronic trading guys, they are very careful to pin their applications and
threads and that's primarily because they are very sensitive to latency. If you don't pin your
threads, what can happen is another thread will essentially sit on the core that you want your
critical path to run on. The critical path is held up, you get a latency spike that costs you money,
which is bad.

For that reason, financial guys are very keen on getting that right. If you only care about
throughput then latency spikes don't matter very much and what you care about is once a core
becomes saturated, you want to make sure that maybe one worker stays on that core but anything
else migrates off it in order to keep that core just for that one worker and then other workers can
run on other cores. And the scheduler in the kernel will just do that naturally. So as long as you
don't mind that you get some latency spikes, it's not strictly necessary to pin. You will get
slightly more predictable performance if you do. That's really an application tuning issue that is
kind of orthogonal to kernel bypass.
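
For readers who have not pinned a process before, the following minimal sketch shows one way to do it from Python on Linux using the standard library's os.sched_setaffinity. The helper name pin_to_core is hypothetical, and the call is not available on every platform, so the sketch checks for it first. It illustrates pinning in general and is unrelated to any Solarflare tooling.

```python
import os


def pin_to_core(core, pid=0):
    """Restrict a process (0 means the current one) to a single CPU core.

    Returns the resulting affinity set. Raises NotImplementedError where the
    operating system does not expose the scheduler affinity calls.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("CPU affinity calls are not available here")
    os.sched_setaffinity(pid, {core})
    return os.sched_getaffinity(pid)


if hasattr(os, "sched_getaffinity"):
    allowed = os.sched_getaffinity(0)      # cores we are currently allowed on
    print(pin_to_core(min(allowed)))       # pin to the lowest-numbered one
    os.sched_setaffinity(0, allowed)       # restore the original affinity
```

Pinning keeps other threads scheduled by the same rules off nothing by itself; it only fixes where this process runs, so latency-sensitive setups usually also keep other work away from that core.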

7. That's already interesting enough. Lots of stuff to dig through, but what are other things that
you do?

Oh, so Solarflare on the one hand, so it does kernel bypass acceleration technology. On the other
hand, we just sell network adapters without that technology and they are just really good fast
network adapters that run on all the different operating systems that you might want to support
including all the virtualization platforms, KVM and Hyper-V and ESX and Xen. And we do a
similar acceleration technology for those as well. So there is a technology called SR-IOV which
stands for Single Root I/O Virtualization and what that allows you to do is hypervisor bypass.

So when you go into a virtual world, you add an extra layer of management but you also add an
extra layer into the data paths, the networking. So your application talks to the kernel which does
the protocol and then your kernel talks to a device driver and behind that device driver isn't real
hardware, it's a virtual network device. When you send and receive packets through that
interface, you are doing traps into the hypervisor and in the hypervisor there's a soft switch and
on the other side of that soft switch, there is a real network adapter interfacing into a real
network. So you have added this extra layer of stuff into the fast path and it really, really slows
networking down.

Hypervisor bypass allows us to cut that layer out and essentially get back to bare metal
performance. What happens is you slice up your network adapter into a bunch of virtual network
adapters, virtual channels and SR-IOV gives you a standard way of exposing those slices of the
network adapter and you can give one slice to each guest in your virtual machine. So each virtual
machine instance has its own real slice of real hardware so it can talk to it directly. The system
IOMMU does address translation between the virtual machine's address space and the real
physical address space. That's great. It essentially gives you back almost all of the performance
that you lost.

And the other really cool thing is that we can do that and kernel bypass at the same time. So you
can now have applications running in a virtual machine talking directly to network hardware and
getting latency around the one point something microseconds mark from application to
application and that might be application to application within the same physical box. So two
VMs talking to each other or it could be across the network. So that's pretty cool.

Werner: That's pretty exciting.

It means the financial guys can now think about moving their high performance applications into
their private cloud infrastructure which obviously gives a lot of flexibility and cost savings and
all that stuff.

Werner: Right. Just from a management point of view because obviously virtual machines
are nice to use but they introduce overheads.

Exactly. It's not entirely without cost. The downside is your virtual machine now really is talking
to real hardware. So you can't just pause it or migrate it to another machine without dealing with
that hardware dependency. So there is a small cost to that. But the people who really care about
performance, that's a small price to pay in order to be able to get this really, really good
performance even when you're in a virtualized environment. We also do some other special
features that are particularly valuable in the financial services area but are useful elsewhere as
well.

Our adapters have built in support for hardware time stamping of packets and they have a stable
oscillator. This allows you to do two things. It allows you to get very accurate timestamps for
every packet that you send and receive and it allows you together with the PTP protocol to
synchronize the clocks on all of your different servers. The effect of that is that the clocks on the
network adapter and the clocks on the servers are all synchronized to UTC to within +/-100
nanoseconds or so of each other. And that means that if you take a timestamp on this server and
then you take another timestamp on another server on the other side of the network and you
compare those two timestamps, you'll get an accurate measurement of the time difference
between when those timestamps were taken. So you can measure latency accurately from
moving messages between machines.

If you don't use this sort of technology, if you just use standard NTP, then you can't get accuracy
better than a millisecond. If it only takes packets 10 microseconds to get from one side of a
network to another, millisecond accuracy isn't enough. It's useless. So that's extremely useful.
The other thing is that of course our network adapters are able to send packets at enormously
high rates, millions and millions of packets. I think, tens of millions of packets actually now.
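
The clock synchronization point made above can be sketched as an error budget. A one-way latency measured as the receive timestamp minus the send timestamp is off by at most the sum of the two clocks' errors. The sketch assumes each clock is within 100 ns of UTC for PTP and within 1 ms for NTP, following the rough figures quoted in the answer. The function names and the example timestamps are hypothetical.

```python
def one_way_latency_ns(tx_timestamp_ns, rx_timestamp_ns):
    """Latency as seen by two different clocks (sender and receiver)."""
    return rx_timestamp_ns - tx_timestamp_ns


def worst_case_error_ns(sender_clock_error_ns, receiver_clock_error_ns):
    """Largest possible measurement error given each clock's offset bound."""
    return sender_clock_error_ns + receiver_clock_error_ns


def is_measurable(true_latency_ns, error_ns):
    """A measurement is only meaningful if the error is below the latency."""
    return error_ns < true_latency_ns


LATENCY_NS = 10_000                                  # 10 microseconds across the network
ptp_error = worst_case_error_ns(100, 100)            # both clocks within 100 ns
ntp_error = worst_case_error_ns(1_000_000, 1_000_000)  # both clocks within 1 ms

print(one_way_latency_ns(5_000_000, 5_010_000))      # hypothetical timestamps
print(ptp_error, is_measurable(LATENCY_NS, ptp_error))
print(ntp_error, is_measurable(LATENCY_NS, ntp_error))
```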

They're able to do hardware time stamping. You can receive packets from a 10 gigabit network at
line rate, any packet size and do that reliably without dropping anything. And that's exactly what
a capture card does. The difference is that a network adapter will cost you a few hundred dollars.
A capture card will cost you thousands of dollars. So now you have the ability to take a standard
server, a standard commodity network adapter. You can do highly accurate packet capture with
time stamping on that platform and it's the same platform that you might be using to run your other
applications that's very flexible. It's also very cost effective.
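
To give a sense of what "line rate, any packet size" means on a 10 gigabit link, the calculation below derives the maximum frame rate from standard Ethernet framing: every frame is accompanied on the wire by 8 bytes of preamble and start-of-frame delimiter and at least 12 bytes of inter-frame gap. The function name is hypothetical and the result is a theoretical ceiling, not a measurement of any adapter.

```python
PREAMBLE_AND_SFD_BYTES = 8    # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12     # minimum idle time between frames


def max_frames_per_second(link_bits_per_second, frame_bytes):
    """Theoretical frame rate for back-to-back frames of one size.

    frame_bytes is the Ethernet frame length including the 4-byte FCS,
    so the minimum legal value is 64.
    """
    wire_bits = (frame_bytes + PREAMBLE_AND_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_bits_per_second // wire_bits


TEN_GIGABIT = 10_000_000_000
print(max_frames_per_second(TEN_GIGABIT, 64))     # 14880952 frames per second
print(max_frames_per_second(TEN_GIGABIT, 1518))   # 812743 frames per second
```

Small frames are the hard case for capture: at the minimum frame size a new frame can arrive roughly every 67 nanoseconds.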

8. There's a lot of exciting stuff. Where can we find all of this?

So www.solarflare.com. We sell our adapters through the channels. So your VARs will be able to
sell you Solarflare. We also sell through the major OEMs. So IBM, HP, Dell all have our
adapters available as options to be fitted at a factory and of course we have sales agents and
system engineers throughout all of the major centers in the world.

Werner: Great. We'll all check it out and thank you, David.

Thank you very much.

Ultra Low Latency


One microsecond network packet delivery.

Latency is the time one waits for a task to complete. Ultra-low latency is reducing that wait time down
to the limits physics imposes on the execution of the task. In networking, we often measure latency by
the half round trip, that’s the time for a single receive plus a single send.

Onload is a high-performance POSIX compliant network stack from Solarflare that dramatically reduces
latency and x86 utilization while increasing throughput. Onload supports
TCP/UDP/IP network protocols by providing a standard BSD (Berkeley Software Distribution) sockets Application
Programming Interface (API), and requires no modifications to end user applications. Onload delivers
half round trip latency in the 1,000 nanosecond range; by contrast, the typical kernel stack latency is
about 7,000 nanoseconds. It achieves these performance improvements in part by performing network
processing in user space, bypassing the OS kernel entirely, reducing the number of data copies, and kernel
context switches. Networking performance is improved without sacrificing the security and multiplexing
functions that the OS kernel normally provides.

https://www.solarflare.com/ultra-low-latency
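
The half round trip metric defined above can be measured with an ordinary sockets program. The sketch below is only an illustration of the measurement method: it bounces a UDP datagram between two sockets on the loopback interface through the normal kernel stack and halves the average round trip time. The function name and parameters are hypothetical, and the result is dominated by Python interpreter overhead, so it is not comparable with the nanosecond figures quoted above.

```python
import socket
import threading
import time


def measure_half_round_trip(iterations=1000, payload=b"x" * 32):
    """Return the mean half round trip time in seconds over UDP loopback."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))        # port 0 lets the OS pick a free port
    server.settimeout(5.0)
    address = server.getsockname()

    def echo():
        # Send every datagram straight back to whoever sent it.
        for _ in range(iterations):
            data, peer = server.recvfrom(2048)
            server.sendto(data, peer)

    worker = threading.Thread(target=echo, daemon=True)
    worker.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(5.0)
    start = time.perf_counter()
    for _ in range(iterations):
        client.sendto(payload, address)  # one send ...
        client.recvfrom(2048)            # ... plus one receive per round trip
    elapsed = time.perf_counter() - start

    worker.join()
    client.close()
    server.close()
    return elapsed / iterations / 2      # half of the mean round trip


if __name__ == "__main__":
    print(f"{measure_half_round_trip() * 1e6:.1f} microseconds")
```

Because the program uses only the standard sockets API, it is the kind of unmodified application that the text says a sockets-compatible user space stack is meant to accelerate.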

The performance of networks

Types of TCP Offload Engine Implementation


Architecture.
• stateful
• stateless
Stateful:

Stateless: A stateless offload adapter can provide optimizations based on the local state contained
in the upper-layer protocols of a single frame. Stateless offloads that can be performed based on
higher-level protocols include header splitting, and TCP/IP checksum calculation and
verification.
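
As an illustration of the kind of per-frame work a stateless checksum offload takes over, here is a software version of the 16-bit one's-complement Internet checksum used by IP, TCP and UDP (described in RFC 1071). It is a minimal sketch of the arithmetic only. A real TCP or UDP checksum also covers a pseudo-header, which is left out here.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit one's-complement checksum of `data`."""
    if len(data) % 2:
        data += b"\x00"                  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]   # add big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Example bytes from RFC 1071: the words sum to 0xddf2, so the checksum is 0x220d.
sample = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(sample)
print(hex(checksum))

# Verification: data followed by its checksum must itself check to zero.
print(internet_checksum(sample + checksum.to_bytes(2, "big")))
```

The computation depends only on the bytes of a single frame, which is why it qualifies as a stateless offload.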

Even simple offloads such as checksum can introduce operational conflicts in a system. For
example, studies of network error root causes not detected by the Ethernet FCS have found
systematic errors in hardware, including in the DMA controllers within the network interface
adapter. Applications are not protected from such errors when checksums are validated in
hardware. Thus, when introducing complex hardware or troubleshooting, disable acceleration
features to eliminate them as error sources. As offloads and network adapters become more
complex, however, that may cease to be an

Full TCP/IP Offload Network Interface Card: an ASIC-based hardware offload engine.
The full TCP packet processing is implemented in hardware, and all operations are performed
by the hardware. Packet forwarding is performed by the NIC silicon once a flow
has been identified. The identified flow rules are stored in memory on-chip in the NIC. The
full offload model has limited flexibility and expandability.
Fig: Computer system based on RDMA/TCP/IP offload engine.

Hybrid TCP/IP Offload Engine: a software-based implementation with the advantage of
flexibility compared to the full TCP/IP offload model. Critical TCP tasks such as checksum calculation are
carried out by the Network Interface Card, while non-critical tasks such as TCP connection
establishment are handled by software. Upgrading the software or adding new features to it is
easy to perform.
Fig: Hybrid architecture of TCP/IP offload engine.