
InfiniBand: Today and Tomorrow

Jamie Riotto
Sr. Director of Engineering
Cisco Systems (formerly Topspin Communications)
jriotto@cisco.com

Agenda

• InfiniBand Today
– State of the market
– Cisco and InfiniBand
– InfiniBand products available now
– Open source initiatives

• InfiniBand Tomorrow
– Scaling InfiniBand
– Future Issues

• Q&A

InfiniBand Maturity Milestones

• High adoption rates
  – Currently shipping > 10,000 IB ports per quarter
• Cisco acquisition will drive broader market adoption
• End-to-end price points of < $1,000
• New cluster scalability proof points
  – 1,000 to 4,000 nodes

Cisco Adopts InfiniBand

• Cisco acquired Topspin on May 16, 2005
• Adds InfiniBand to the switching portfolio
  – Network switches, storage switches, and now server switches
  – Creates an independent business unit to promote InfiniBand and server virtualization
• New product line of Server Fabric Switches (SFS)
  – SFS 7000 Series InfiniBand Server Switches
  – SFS 3000 Series Multifabric Server Switches

Cisco and InfiniBand
The Server Fabric Switch

[Diagram: three switch categories side by side]
  – Network Switch: connects clients to network resources (Internet, printers, servers)
  – Storage Switch: connects servers to storage (SAN)
  – Server Switch: connects servers to other servers, to the network, and to storage

Cisco HPC Case Studies

Real Deployments Today: Wall Street Bank with 512-Node Grid

• Fibre Channel and GigE connectivity built seamlessly into the cluster via the existing SAN and LAN networks

[Cluster topology]
  – Grid I/O: 2 TS-360s with Ethernet and Fibre Channel gateways
  – Core fabric: 2 96-port TS-270s
  – Edge fabric: 23 24-port TS-120s
  – 512 server nodes

NCSA
National Center for Supercomputing Applications

Tungsten 2: 520-Node Supercomputer

[Cluster topology]
  – Core fabric: 6 72-port TS270s
  – Edge fabric: 29 24-port TS120s, with 174 uplink cables to the core
  – 512 1m cables to 520 dual-CPU nodes (1,040 CPUs), 18 compute nodes per edge switch (18 host ports + 6 uplinks per 24-port switch)

• Parallel MPI codes for commercial clients
• Point-to-point MPI latency of 5.2 us
• Deployed: November 2004
D.E. Shaw Bio-Informatics: 1,066-Node Supercomputer

A 1,066-node, fully non-blocking, fault-tolerant IB cluster

[Cluster topology]
  – Fault-tolerant core fabric: 12 96-port TS-270s
  – Edge fabric: 89 24-port TS-120s, with 1,068 uplink cables (5m/7m/10m/15m) to the core
  – 1,066 1m cables to the compute nodes, 12 per edge switch (12 host ports + 12 uplinks per 24-port switch: fully non-blocking)

Large Government Lab
World's Largest Commodity Server Cluster: 4,096 Nodes

• Application:
  – High-performance supercomputing cluster
• Environment:
  – 4,096 Dell servers
  – Core fabric: 8 SFS TS-740s, 288 ports each
  – Edge fabric: 256 TS-120s, 24 ports each
  – 2,048 uplinks (7m/10m/15m/20m)
  – 50% blocking ratio
• Benefits:
  – Compelling price/performance
  – Largest cluster ever built (by approx. 2X)
  – Expected to be the 2nd-largest supercomputer in the world by node count: an 8,192-processor, 60-TFlop SuperCluster
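
The 50% blocking ratio falls straight out of the port counts above: 4,096 servers across 256 24-port edge switches is 4,096 / 256 = 16 host ports per switch, and 2,048 uplinks across the same 256 switches is 2,048 / 256 = 8 uplinks per switch (16 + 8 = 24, so every port is used). With 16 hosts sharing 8 uplinks, only half the edge bandwidth can leave a switch at once: a 2:1 oversubscription, i.e. the stated 50% blocking ratio.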

InfiniBand Products Available Today

InfiniBand Switches and HCAs

• Fully non-blocking switch building blocks available in sizes from 24 up to 288 ports
• Blade servers offer integrated switches and pass-through modules
• HCAs available in PCI-X and PCI-Express
• IP and Fibre Channel gateway modules

Integrated InfiniBand for Blade Servers
Create “wire-once” fabric

• Integrated 10 Gbps InfiniBand switches provide a unified "wire-once" fabric
• Optimize density, cooling, space, and cable management
• Option of an integrated InfiniBand switch (e.g., IBM BladeCenter) or a pass-through module (e.g., Dell 1855)
• Virtual I/O provides shared Ethernet and Fibre Channel ports across blades and racks

[Diagram: blade chassis with two integrated IB switches; each blade's HCA connects at 10 Gbps, with 30 Gbps uplinks out of the chassis]

Ethernet and Fibre Channel Gateways
Unified “wire-once” fabric
• A single InfiniBand link per server carries both storage and network traffic

[Diagram: a server cluster on the server fabric; a Fibre Channel to InfiniBand gateway provides storage (SAN) access, and an Ethernet to InfiniBand gateway provides LAN/WAN access]

InfiniBand Price / Performance

                     InfiniBand
                     PCI-Express   10GigE      GigE        Myrinet D   Myrinet E
Data Bandwidth       950 MB/s      900 MB/s    100 MB/s    245 MB/s    495 MB/s
(Large Messages)
MPI Latency          5 us          50 us       50 us       6.5 us      5.7 us
(Small Messages)
HCA Cost             $550          $2K-$5K     Free        $535        $880
(Street Price)
Switch Port Cost     $250          $2K-$6K     $100-$300   $400        $400
Cable Cost           $100          $100        $25         $175        $175
(3m Street Price)

*   Myrinet pricing data from the Myricom web site (Dec 2004)
**  InfiniBand pricing data based on Topspin average sales price (Dec 2004)
*** Myrinet, GigE, and IB performance data from the June 2004 OSU study

• Note: MPI latency is processor to processor; switch latency alone is lower
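
The small-message latency row is the kind of number an MPI ping-pong microbenchmark produces. A minimal sketch using only standard MPI calls (any interconnect above would run it unchanged; the iteration count and 1-byte message size are arbitrary choices here):

    /* pingpong.c: measure small-message MPI latency between two ranks.
     * Build: mpicc -O2 pingpong.c -o pingpong ; run: mpirun -np 2 ./pingpong */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i, iters = 10000;
        char buf[1] = {0};                      /* 1-byte message */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)                          /* one-way latency = round trip / 2 */
            printf("avg one-way latency: %.2f us\n", (t1 - t0) / iters / 2 * 1e6);
        MPI_Finalize();
        return 0;
    }

Rank 0 reports half the average round-trip time, which matches the processor-to-processor definition in the note above.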

InfiniBand Cabling

• CX4 copper (up to 15m)
• Flexible 30-gauge copper (up to 3m)
• Fiber optics up to 150m

Host Drivers for Standard Protocols

• Open-source strategy = reliability at low cost
• IPoIB: legacy TCP/IP applications
• SDP: reliable socket connections (optional RDMA; see the sketch after this list)
• MPI: leading-edge HPCC applications (RDMA)
• SRP: block storage access (RDMA)
• uDAPL: user-level RDMA
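
To make the SDP bullet concrete: SDP keeps the sockets API and swaps the transport underneath. The sketch below asks for SDP explicitly via the address family; AF_INET_SDP is not in standard headers, and the value 27, the port, and the peer address are assumptions for illustration (check your system's libsdp headers). Alternatively, preloading libsdp via LD_PRELOAD can redirect existing TCP applications with no code changes.

    /* sdp_client.c: open a reliable, in-order byte stream over InfiniBand via
     * SDP. The socket API is unchanged; only the address family differs. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    #ifndef AF_INET_SDP
    #define AF_INET_SDP 27   /* assumption: value used by common libsdp installs */
    #endif

    int main(void)
    {
        struct sockaddr_in srv;
        int fd = socket(AF_INET_SDP, SOCK_STREAM, 0);  /* SDP instead of AF_INET */
        if (fd < 0) { perror("socket"); return 1; }

        memset(&srv, 0, sizeof srv);
        srv.sin_family = AF_INET;            /* addresses stay IPv4; some stacks
                                                expect AF_INET_SDP here instead */
        srv.sin_port = htons(5000);          /* hypothetical port */
        inet_pton(AF_INET, "192.168.0.10", &srv.sin_addr);  /* hypothetical peer */

        if (connect(fd, (struct sockaddr *)&srv, sizeof srv) == 0)
            write(fd, "hello over SDP\n", 15);
        close(fd);
        return 0;
    }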

OS Support

• Operating systems available:
  – Linux (Red Hat, SuSE, Fedora, Debian, etc.)
  – Windows 2000 and 2003
  – HP-UX (via HP)
  – Solaris (via Sun)

The InfiniBand Driver Architecture
[Diagram: the InfiniBand driver stack alongside a traditional network stack.
User level: BSD sockets applications, uDAPL, and NFS-RDMA/file-system APIs.
Kernel level: TCP/IP running over IPoIB; SDP beneath the sockets layer (with
DAT above it); the file system and SCSI layers running over SRP and FCP.
All paths converge on the VERBS interface to the InfiniBand HCA. The
InfiniBand switch fabric then reaches the LAN/WAN through an Ethernet
gateway and the SAN through a Fibre Channel gateway.]
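
Every path in the diagram above ultimately funnels through the verbs interface. As a flavor of what that layer looks like from user space, here is a minimal sketch against the OpenIB libibverbs API (device enumeration plus a port query); the API was still stabilizing in 2005, so treat the exact calls as illustrative of the verbs style rather than a fixed contract.

    /* verbs_query.c: enumerate InfiniBand HCAs and query port state via verbs.
     * Build (OpenIB/libibverbs): gcc verbs_query.c -libverbs */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num, i;
        struct ibv_device **list = ibv_get_device_list(&num);
        if (!list) { fprintf(stderr, "no IB devices found\n"); return 1; }

        for (i = 0; i < num; i++) {
            struct ibv_context *ctx = ibv_open_device(list[i]);
            struct ibv_port_attr port;
            if (!ctx)
                continue;
            if (ibv_query_port(ctx, 1, &port) == 0)   /* port 1 of the HCA */
                printf("%s: port 1 %s, LID 0x%x\n",
                       ibv_get_device_name(list[i]),
                       port.state == IBV_PORT_ACTIVE ? "ACTIVE" : "not active",
                       port.lid);
            ibv_close_device(ctx);
        }
        ibv_free_device_list(list);
        return 0;
    }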
Open Software Initiatives
• OpenIB.org
  – Topspin was a primary author of major portions, including IPoIB, SDP, SRP, and the TS-API; Cisco will continue to invest
  – Current protocol development is nearing production-quality code; expect a release by the end of the year
  – The charter has been expanded to include Windows and iWARP
  – MPI will be available in the near future (MVAPICH 0.96)
• OpenSM
• OpenMPI
InfiniBand Tomorrow

Looking into the future

• Cost
• Speed
• Distance Limitations
• Cable Management
• Scalability
• IB and Ethernet

Speed: InfiniBand DDR / QDR, 4X / 12X

• DDR available at the end of 2005
  – Doubles wire speeds to ? (ok, still working on this one; see the lane math below)
  – PCI-Express DDR
  – Distances of 5-10m using copper
  – Distances of 100m using fiber
• QDR available WHEN?
• 12X (30 Gb/s) available for over one year!!
  – Not interesting until there are 12X HCAs
  – Not interesting until > 16X PCIe
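
For reference, the arithmetic behind these labels, assuming the usual IB lane math (2.5 Gb/s per lane per direction for SDR, 8b/10b encoding at 80% efficiency, with DDR and QDR doubling and quadrupling the lane rate). A small sketch that just prints the resulting table:

    /* ib_rates.c: signaling and data rates for IB link widths and speeds.
     * Assumes 2.5 Gb/s/lane SDR and 8b/10b encoding (80% efficiency). */
    #include <stdio.h>

    int main(void)
    {
        const int widths[] = {1, 4, 12};               /* 1X, 4X, 12X */
        const char *speeds[] = {"SDR", "DDR", "QDR"};
        int w, s;
        for (s = 0; s < 3; s++)
            for (w = 0; w < 3; w++) {
                double lane = 2.5 * (1 << s);          /* Gb/s per lane */
                double sig = lane * widths[w];         /* signaling rate */
                printf("%2dX %s: %5.1f Gb/s signaling, %5.1f Gb/s data\n",
                       widths[w], speeds[s], sig, sig * 0.8);
            }
        return 0;
    }

By this arithmetic, 4X DDR lands at 20 Gb/s signaling (16 Gb/s of data), and 12X SDR is the 30 Gb/s figure above.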
Future InfiniBand Cables

• InfiniBand over CAT5 / CAT6 / CAT7
  – Shielded cable distances up to ???
  – Leverage existing 10-GigE cabling
  – 10-GigE too expensive?

IB Distance Scaling
• IB Short Haul
  – New copper drivers
  – 25-50 meters (KeyEye)
  – 75-100 meters (IEEE 10GE)
• IB WAN
  – Same subnet over distance (300 km target)
  – Buffer / credit / timeout issues (see the sketch below)
  – Applications: disaster recovery, data mirroring
• IB Long Haul
  – IB over IP (over SONET?)
  – Utilizes the existing public plant (WDM, debugging, etc.)
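
A rough sense of why buffering and credits dominate at 300 km: link-level flow control must cover a full round trip of in-flight data. A back-of-the-envelope sketch (assuming ~5 us/km propagation in fiber and a 10 Gb/s 4X SDR link):

    /* ib_wan_buffer.c: in-flight data a 300 km IB link must buffer/credit.
     * Assumptions: ~5 us/km propagation, 10 Gb/s signaling, 8 Gb/s data. */
    #include <stdio.h>

    int main(void)
    {
        double km = 300.0;
        double one_way_s = km * 5e-6;                /* ~1.5 ms one way */
        double rtt_s = 2.0 * one_way_s;              /* ~3.0 ms round trip */
        double data_rate = 8e9;                      /* 4X SDR payload, bits/s */
        double bits = data_rate * rtt_s;             /* bandwidth-delay product */
        printf("RTT: %.1f ms, in-flight data: %.1f MB\n",
               rtt_s * 1e3, bits / 8 / 1e6);         /* ~3.0 MB of credits */
        return 0;
    }

Roughly 3 MB of credit-backed buffering, far beyond what switch hardware designed for machine-room distances carries on chip, hence the buffer/credit/timeout work noted above.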

Scaling InfiniBand

• Subnet Management
• Host-side Drivers
  – MPI
  – IPoIB
  – SRP
• Memory Utilization

IB Subnet Manager

• Subnets are getting bigger
  – 4,000 -> 10,000 nodes
• Topology convergence times
• Topology disturbance times
• Topology disturbance minimization

Subnet Management Challenges
• Cluster Cold Start Times
  – Template Routing
  – Persistent Routing
• Cluster Topology Change Management
  – Intentional Change: Maintenance
  – Unintentional Change: Dealing with Faults
  – How to impact the minimum number of connections
  – Predetermine a fault-reaction strategy?
• Topology Diagnostic Tools
  – Link/Route Verification
  – Built-in BERT Testing
• Partition Management
Multiple Routing Models
• Minimum Latency Routing:
  – Load-balanced shortest-path routing (see the sketch after this list)
• Minimum Contention Routing:
  – Lowest-interference divergent-path routing
• Template-Driven Routing:
  – Supports a pre-determined routing topology
  – For example: Clos routing, matrix row/column, etc.
  – Automatic cabling verification for large installations
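
To make "load-balanced shortest-path" concrete, here is a toy sketch of the idea (not any vendor's actual SM code): for each destination, among the ports that lie on a shortest path, pick the one carrying the fewest routes so far.

    /* minhop_balance.c: toy load-balanced shortest-path port selection for one
     * switch, in the spirit of min-hop routing. Illustrative only. */
    #include <stdio.h>

    #define PORTS 4
    #define DESTS 6

    int main(void)
    {
        /* hops[p][d]: hop count to destination LID d if we exit via port p
         * (a hypothetical, hand-made example topology) */
        int hops[PORTS][DESTS] = {
            {1, 2, 3, 2, 3, 4},
            {2, 1, 2, 3, 2, 3},
            {2, 2, 1, 2, 3, 2},
            {3, 3, 2, 1, 2, 1},
        };
        int load[PORTS] = {0};          /* routes already assigned per port */
        int lft[DESTS];                 /* linear forwarding table being built */
        int d, p;

        for (d = 0; d < DESTS; d++) {
            int best = -1;
            int min_hops = hops[0][d];
            /* shortest hop count to d over all ports */
            for (p = 1; p < PORTS; p++)
                if (hops[p][d] < min_hops) min_hops = hops[p][d];
            /* among shortest-path ports, pick the least-loaded one */
            for (p = 0; p < PORTS; p++)
                if (hops[p][d] == min_hops && (best < 0 || load[p] < load[best]))
                    best = p;
            lft[d] = best;
            load[best]++;
            printf("dest LID %d -> port %d (%d hops)\n", d, lft[d], min_hops);
        }
        return 0;
    }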

IB Routing Challenges
• Static / Dynamic Routing
  – IB implements static routing through linear forwarding tables at each switch chip
  – Multi-LID routing enables dynamic routing
• Credit Loops
• Cost-Based Routing (see the sketch after this list)
  – Speed mismatches force store-and-forward (vs. cut-through)
  – SDR <> DDR <> QDR
  – 4X <> 12X
  – Short Haul <> Long Haul
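
Why speed mismatches push a switch from cut-through to store-and-forward: once the ingress and egress rates differ, the packet must be fully buffered before it can be forwarded, so each such hop pays a full serialization delay. A back-of-the-envelope sketch (the 2 KB MTU and data rates are illustrative assumptions):

    /* store_forward.c: serialization delay added when a packet must be fully
     * buffered at a speed boundary (e.g., SDR <-> DDR), vs. cutting through. */
    #include <stdio.h>

    int main(void)
    {
        double mtu_bits = 2048 * 8;        /* 2 KB IB MTU */
        double sdr = 8e9, ddr = 16e9;      /* 4X data rates, bits/s */
        printf("store-and-forward at SDR: %.2f us\n", mtu_bits / sdr * 1e6);
        printf("store-and-forward at DDR: %.2f us\n", mtu_bits / ddr * 1e6);
        /* cut-through forwarding starts after only the header arrives, so
         * per-hop latency stays well under 1 us regardless of MTU */
        return 0;
    }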

Multi-LID Source-Based Routing Support

• Applications can implement "dynamic" routing for contention avoidance, failover, and parallel data transfer (see the sketch below)

[Diagram: four paths (1, 2, 3, 4) from a leaf switch fanning out across different spine switches to the destination leaf]
New IB Peripherals
• CPUs?
• Storage
  – SAN
  – NFS-RDMA
• Memory (coherent / non-coherent)
• Purpose-built processors?
  – Floating-point processors
  – Graphics processors
  – Pattern-matching hardware
  – XML processors

THANK YOU!

• Questions & Answers

