
InfiniBand: An Overview

Everything changes. In the early 90s the microprocessor was a prized possession. By the year 2000, PCs were running microprocessors at GHz clock speeds. But the way in which I/O was carried out remained more or less the same. The processor can now deliver data at blistering speeds, but the I/O subsystem that is supposed to accept it cannot keep up. The bottleneck is the shared bus architecture.

Problems with PCI

In a shared bus architecture, the various components connected to the bus vie for control of it. A prime example of this is the familiar Peripheral Component Interconnect (PCI) bus [4].

Fig 1. The shared PCI bus in the system architecture

Source: How PCI works? by Jeff Tyson (

As shown in Fig 1, the PCI devices are all attached to a parallel PCI bus, which they all contend for. In such a scenario, contention is inevitable. The performance chart is shown below.

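PCI              | Max BW  | Architecture        | Issues
PCI (66 MHz)     | 4 Gbps  | Shared Parallel Bus | Bus Contention
PCI-X (133 MHz)  | 8 Gbps  | Shared Parallel Bus | Bus Contention
DDR*             | 16 Gbps | Shared Parallel Bus | Bus Contention
QDR**            | 32 Gbps | Shared Parallel Bus | Bus Contention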

Fig 2. Table of PCI standards.

*Double Data Rate ** Quad Data Rate

Though the maximum bandwidth shown in the table looks enormous, the bandwidth actually available turns out to be about 533 Mbps for the 66 MHz version of PCI. Also, due to the shared nature of the PCI bus, the fanout has to be lowered as the operating frequency is increased; that is, fewer devices can be attached to the bus. So PCI does not look like a viable option for next-generation I/O systems, though it looks poised to exist for quite some time owing to its wide market acceptance. What could be the solution to the bus contention issue? It is the use of serial switched architectures, and InfiniBand is a technology that employs one.
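The peak figures in the PCI table can be roughly reproduced as bus width times clock rate times transfers per clock. The sketch below assumes a 64-bit bus, which the table does not state; the function name and the row list are illustrative only. The results come out slightly above the rounded values in the table.

```python
# Peak bandwidth of a parallel bus = width (bits) x clock (MHz) x transfers per clock.
# ASSUMPTION: a 64-bit bus width; the table lists only clock rates.
BUS_WIDTH_BITS = 64


def peak_gbps(clock_mhz, transfers_per_clock=1, width_bits=BUS_WIDTH_BITS):
    """Return the theoretical peak bandwidth in Gbps."""
    return width_bits * clock_mhz * transfers_per_clock / 1000


# (label, clock in MHz, transfers per clock cycle)
standards = [
    ("PCI (66 MHz)", 66, 1),
    ("PCI-X (133 MHz)", 133, 1),
    ("DDR", 133, 2),   # double data rate: two transfers per clock
    ("QDR", 133, 4),   # quad data rate: four transfers per clock
]

for name, mhz, per_clock in standards:
    print(f"{name}: {peak_gbps(mhz, per_clock):.1f} Gbps peak")
```

These are theoretical ceilings for the whole bus. Because every attached device shares that one bus, no single device can count on the full figure.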

InfiniBand to the Rescue

Only a technology that is implemented very close to the processor memory bus can be seen as a replacement for PCI. InfiniBand (IB) breaks through the bandwidth and fanout constraints of the PCI bus by moving to a serial switched fabric architecture. But established networking technologies such as Fibre Channel (FC) and Gigabit Ethernet (GigE) already provide the same serial switched architecture, so why is a new one needed? The answer can be summarized in three keywords [6]: data storage, networking, and server clustering.

FC is a proven technology in the field of data storage. GigE is also coming up in a big way, and networking is its USP. But what about server clustering? Server clustering needs a low-overhead, quick messaging service that is very reliable. This is where InfiniBand scores. Unlike other networking technologies, InfiniBand is designed to bypass the overhead of multi-layered protocol processing. The comparison in other areas is shown in the graphic.

Technology                  | InfiniBand                                     | Ethernet                                                        | Fibre Channel
Application focus           | Server I/O & Clustering                        | Local Area Networks                                             | Storage Area Networks
Data Transport/Reliability  | High Reliability                               | Data packets dropped during congestion; no failover capability  | High Reliability
Systems Management          | Built-in, in-band fabric and H/W management    | No form factors or built-in management systems                  | No form factors or built-in management systems

Fig 3. Differences between technologies.

Source: Understanding InfiniBand by Gene Risi & Philip Bender

Components of InfiniBand
System Area Layout

Fig 4. InfiniBand topology

Fig 4 shows the InfiniBand topology in its most basic form. A node could be a server, a PC, or an I/O device such as a RAID subsystem. The fabric may be a single switch or an interconnection of switches and routers. All connections in this topology are switched, i.e. they are point to point, which eliminates contention for a shared medium. Also, because the links are serial, they require only four wires instead of the wide parallel connection of the PCI bus.

Fig 5. A system level view of the basic topology

In the system level view (Fig 5) there are certain elements that need explanation. The leftmost part of the figure depicts the internals of a node. The memory controller is connected to a Host Channel Adapter (HCA), which is the entry point of the node into the fabric. The HCA provides an interface for InfiniBand to integrate with the operating system. The HCA links the node with the switch, which in turn is connected to a number of Target Channel Adapters (TCAs). The TCAs interface target I/O devices such as RAID and JBOD subsystems with the InfiniBand fabric. Each TCA serves a specific kind of target, though multi-utility TCAs are also a possibility. These channel adapters contain ports, and a single TCA or HCA can contain more than one port. The ports connect the node to the fabric and vice versa.

InfiniBand Architecture
As is evident from Fig 7, InfiniBand operates via a network protocol stack. This protocol stack is compared with the OSI model layers for convenience.

Fig 7. InfiniBand Protocol Stack compared with the OSI network Model
Source: InfiniBand Architecture Tutorial Hot Chips by Daniel Cassiday (InfiniBand Trade Association)

At the top, client layers communicate in the form of transactions. These transactions are composed of messages that are moved through the transport layer. The messages are further divided into packets at the network layer, as shown in the graphic. IB routers can route these packets across network domains, using a global identifier called the GID [3] for this purpose. For forwarding within a subnet at the data-link layer, an identifier local to the subnet, known as the LID [3], is used. An IB switch generally does this.

Fig 6. IB PDUs at various layers.
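The division of a message into packets can be illustrated with a small sketch. It is a toy model, not the InfiniBand wire format: the `Packet` fields, the `segment` and `reassemble` helpers, the 256-byte payload limit, and the example identifier values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    dest_lid: int    # local identifier: used by switches inside one subnet
    dest_gid: str    # global identifier: used by routers between subnets
    seq: int         # position of this packet within its message
    payload: bytes


def segment(message: bytes, dest_lid: int, dest_gid: str, mtu: int = 256):
    """Split one message into packets of at most `mtu` payload bytes (hypothetical helper)."""
    if mtu <= 0:
        raise ValueError("mtu must be positive")
    # An empty message still travels as one empty packet.
    chunks = [message[i:i + mtu] for i in range(0, len(message), mtu)] or [b""]
    return [Packet(dest_lid, dest_gid, seq, chunk) for seq, chunk in enumerate(chunks)]


def reassemble(packets):
    """Rebuild the message at the receiver by ordering packets on `seq`."""
    return b"".join(p.payload for p in sorted(packets, key=lambda p: p.seq))


message = bytes(600)  # a 600-byte message of zeros
packets = segment(message, dest_lid=7, dest_gid="example-gid")
print([len(p.payload) for p in packets])
print(reassemble(packets) == message)
```

Every packet carries both identifiers in this sketch, so a switch could forward on `dest_lid` alone while a router would look at `dest_gid`.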

At the lowest layer of the stack (which corresponds to the physical and data-link layers of the OSI model) the standards are more or less similar to FC. InfiniBand uses both optical fibre cables and copper cables. The IB bit error rate is 10^-12, and IB uses the 8B/10B encoding standard. 8B/10B means that for every 8 bits of data to be sent, 10 bits are actually sent over the physical cabling.

IB also supports aggregating 4 or 12 physical lanes [6] into a single link. These are known as 4X and 12X respectively. Moreover, IB cabling is full duplex, i.e. a 4X channel contains 4 send and 4 receive lanes. This combination gives a higher throughput. Though there are 4 lanes, they are treated as a single entity for management purposes.

IB incorporates a concept of segmenting bandwidth using virtual lanes (VLs) [6]. These VLs are formed by a multiplexing arrangement in which unrelated data can flow over the same link. IB has configurations of 1, 2, 4, 8 and 15 virtual lanes. VL15 is used only for network management and the rest are data lanes. By implementing this, IB allows multipoint communication among nodes and provides better utilization of the fabric.

IB provides a method to logically group together nodes that are otherwise physically distant. This is known as partitioning [6]. It is analogous to VLANs in Ethernet data networks.
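The cost of 8B/10B encoding and the gain from lane aggregation can be worked out in a few lines. The sketch assumes a signaling rate of 2.5 Gbps per lane, a figure that is not given in the text above; the function name is illustrative.

```python
# ASSUMPTION: each lane signals at 2.5 Gbps (2500 Mbps); the text does not state a rate.
SIGNALING_MBPS_PER_LANE = 2500


def data_rate_mbps(lanes, signaling_mbps=SIGNALING_MBPS_PER_LANE):
    """Usable data rate in one direction after 8B/10B encoding.

    Only 8 of every 10 bits on the wire carry data, so the usable
    rate is 80% of the raw signaling rate across all lanes.
    """
    return lanes * signaling_mbps * 8 // 10


for lanes in (1, 4, 12):
    print(f"{lanes}X: {data_rate_mbps(lanes)} Mbps of data per direction")
```

Since the link is full duplex, the same rate is available in the receive direction at the same time.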

Virtual Interface Protocol

The Virtual Interface protocol is used at the IB transport layer and is what makes IB different. As mentioned earlier, the main area where IB scores over FC and GigE is clustering. A cluster heartbeat requires a very low latency network. The main purpose of the Virtual Interface (VI) [6] protocol is to reduce the latency between communicating servers.

Using a conventional network protocol architecture for the cluster heartbeat causes latency, both because of the overhead of executing the network protocol code and because of the context switches needed to accept data in the privileged mode of the OS. The privileged mode comes into the picture because the network adapter, which receives the data, has to hand it over to the OS. The VI protocol reduces the latency by allowing the network adapter to bypass the OS and perform its functions in non-privileged mode. VI uses certain memory-like operations to directly access buffers on the receiver. This process is known as Remote Direct Memory Access (RDMA) [3].

In order to bypass the privileged mode of the OS, the various I/O and process related management functions have to be taken up by the VI protocol. Each application that wants to send or receive creates a Queue Pair (QP) [3]. A QP is a combination of a send queue and a receive queue at each port. An application that wants to communicate places a Work Queue Element (WQE) [3] in the send queue. From the send queue of the sender, the data is sent to the receive queue of the receiver. When a WQE is executed, a Completion Queue Element (CQE) [3] is generated and placed in a completion queue. The completion queue informs the application that posted the WQE of its completion and also reduces the number of interrupts generated.

Certain functions are defined for the send and the receive queues. The send queue can perform basic message sending and three RDMA related functions, known as RDMA-Read, RDMA-Write and RDMA-Atomic.

For the receive queue, the only type of operation is Post Receive Buffer, which identifies a buffer into which a client may send data, or from which it may receive data, through a Send, RDMA-Write or RDMA-Read operation.

Fig 8. VI protocol communication mechanism

Source: An introduction to InfiniBand Architecture by Odysseas Pentakalos (
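The queue pair mechanism described above can be modelled in a few lines. This is a toy, in-process simulation and not a real InfiniBand API: the class names, the `process` function that stands in for the channel adapter, and the two opcodes are all assumptions made for illustration. Only the basic Send operation is modelled.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WQE:
    opcode: str         # "SEND" or "RECV" in this toy model
    buffer: bytearray   # data to send, or space to receive into


@dataclass
class CQE:
    opcode: str
    byte_len: int


@dataclass
class QueuePair:
    send_queue: deque = field(default_factory=deque)
    recv_queue: deque = field(default_factory=deque)
    completions: deque = field(default_factory=deque)  # completion queue

    def post_send(self, data: bytes):
        """Application places a send WQE in the send queue."""
        self.send_queue.append(WQE("SEND", bytearray(data)))

    def post_receive(self, size: int):
        """Application posts an empty receive buffer of `size` bytes."""
        self.recv_queue.append(WQE("RECV", bytearray(size)))


def process(sender: QueuePair, receiver: QueuePair):
    """Stand-in for the channel adapter: execute one send WQE, if any."""
    if not sender.send_queue:
        return None
    if not receiver.recv_queue:
        raise RuntimeError("receiver has no posted receive buffer")
    n = len(sender.send_queue[0].buffer)
    if n > len(receiver.recv_queue[0].buffer):
        raise ValueError("posted receive buffer is too small")
    send_wqe = sender.send_queue.popleft()
    recv_wqe = receiver.recv_queue.popleft()
    recv_wqe.buffer[:n] = send_wqe.buffer          # data lands in the posted buffer
    sender.completions.append(CQE("SEND", n))      # tell the sender it is done
    receiver.completions.append(CQE("RECV", n))    # tell the receiver data arrived
    return bytes(recv_wqe.buffer[:n])


a, b = QueuePair(), QueuePair()
b.post_receive(16)
a.post_send(b"heartbeat")
print(process(a, b), a.completions[0], b.completions[0])
```

In this model the applications learn the outcome by reading their completion queues rather than by being interrupted for each transfer, which mirrors the role the text gives to the CQE.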

Types of services:
IB provides 5 different types of transport services [6]: Reliable Connection, Unreliable Connection, Reliable Datagram, Unreliable Datagram, and Raw Datagram.

Scope as a PCI replacement

IB came into the market and was immediately touted as the PCI replacement. But any technology takes a while to become popular in the market. PCI is an established technology and a lot of IT professionals are at ease with it. In this scenario, the chances of IB displacing PCI seem very slim. IB is making inroads into the market not as a competitor to PCI but as a complementary technology. In fact, adapters that support both IB and PCI-X are already on the market [2]. A comparative chart is shown in the figure:

Comparison     | PCI, PCI-X, DDR, QDR                  | InfiniBand
Advantages     | Lower Cost; Simpler for chip to chip  | Clustering; Scalability; Quality of Service; Security; Fault Tolerance; Multi-Cast; Fabric Convergence; PCB, Copper & Fiber
Disadvantages  | Bus Contention                        | Low market acceptance

Fig 9. Table showing comparison between PCI and IB

Source: Introduction to the value proposition of InfiniBand by Marc Staimer (Dragon Slayer Consulting)

The response to IB has been positive. According to analysts, a large percentage of servers will soon be IB enabled. This growth will take place when IB becomes native to the server motherboard. It is predicted that IB will soon be used as a technology for clustering and storage as well as networking. The predictions may be positive, but the IT world is such that what is hot property today may be obsolete tomorrow. So what lies in store for InfiniBand is for time to tell.

1. AGP - Accelerated Graphics Port
2. BW - Bandwidth
3. CPU - Central Processing Unit
4. CQE - Completion Queue Element
5. DDR - Double Data Rate
6. FC - Fibre Channel
7. GID - Global Identifier
8. GigE - Gigabit Ethernet
9. HCA - Host Channel Adapter
10. IB - InfiniBand
11. IBTA - InfiniBand Trade Association
12. ISA - Industry Standard Architecture
13. LID - Local Identifier
14. PCI - Peripheral Component Interconnect
15. QDR - Quadruple Data Rate
16. QP - Queue Pair
17. RAM - Random Access Memory
18. RDMA - Remote Direct Memory Access
19. SNIA - Storage Networking Industry Association
20. TCA - Target Channel Adapter
21. VI - Virtual Interface
22. VL - Virtual Lanes
23. WQE - Work Queue Element

References

1. InfiniBand Architecture Tutorial Hot Chips by Daniel Cassiday (InfiniBand Trade Association)
2. Introduction to the value proposition of InfiniBand by Marc Staimer (Dragon Slayer Consulting)
3. An introduction to InfiniBand Architecture by Odysseas Pentakalos (
4. How PCI works? by Jeff Tyson (
5. Understanding InfiniBand by Gene Risi & Philip Bender
6. Building Storage Networks - 2nd Edition by Marc Farley (Storage Networking Industry Association)