You are on page 1of 385

High Performance Data Transfer

SC2004 Tutorial M6
8 November 2004
Phillip Dykstra Chief Scientist WareOnEarth Communications Inc. phil@sd.wareonearth.com

Online Copy
• This tutorial can be found at
– http://www.wcisd.hpc.mil/~phil/sc2004/

Dykstra, SC2004

2

Motivation
If our networks are so fast, how come my ftp is so slow?

Dykstra, SC2004

3

Tutorial Focus
• Moving a lot of data
– Quickly – Securely – Error free

Dykstra, SC2004

4

Unique HPC Environment
• The Internet has been optimized for
– millions of users behind low speed connections – thousands of high bandwidth servers serving millions of low speed streams

• Single high-speed to high-speed flows get little commercial attention

Dykstra, SC2004

5

Objectives
• Look at current high performance networks • Fundamental understanding of delay, loss, bandwidth, routes, MTU, windows • Learn what is required for high speed data transfer and what to expect • Look at many useful tools and how to use them for performance testing and debugging • Examine TCP dynamics and TCP’s future • Look at alternatives to TCP and higher level approaches to data transfer
Dykstra, SC2004 6

Topics
• Fundamental Concepts
– – – – – – – Layers, Circuits High Performance Networks Delay Routes Packet Size Bandwidth Windows

Slide #
9 18 36 57 76 86 110

• TCP Performance • Performance Tuning
Dykstra, SC2004

126 168
7

SC2004 184 249 272 286 306 319 338 351 366 8 .Topics • • • • • • • • • Testing and Debugging Data Integrity TCP Security TCP Improvements Alternatives to TCP Quality of Service Storage Area Networks Peer to Peer Abstract Storage Layers Dykstra.

Quick Layer Review Applications (Middleware) TCP / UDP / SCTP IP / ATM Ethernet / MPLS/ POS Copper / Fiber 4. Transport / End-to-End 3. Network / Routing 2. SC2004 9 . Data Link / Switching 1. Application Dykstra. Physical / Optical 7.

44. 622.813 Gbps (“40 Gbps”) Dykstra.488 Gbps OC192.736 Mbps • SONET – Synchronous Optical Network – – – – – OC3.080 Mbps OC48.WAN Circuits • TDM – Time Division Multiplexing – T1. 9. 1.544 Mbps – T3. 2. 155. 39.953 Gbps (“10 Gbps”) OC768.520 Mbps OC12. SC2004 10 .

SC2004 11 .620 voice Dykstra.560 voice 1 Gbps 15.Relative Bandwidths T1 24 voice T3 720 voice 100 Mbps 1.

Fiber Optic Attenuation David Goff. Fiber Optic Reference Guide 12 .

Wave Division Multiplexing (WDM) Fiber • Use color / frequency to separate channels • Course (CWDM) and Dense (DWDM) varieties • Usually in the 1550 nm band Dykstra. SC2004 13 .

WDM Evolution Toward 100 Tbps per fiber David Goff. Fiber Optic Reference Guide 14 .

2 Tbps) 2010. 8 channels.5x cost for 4x bps within a TDM channel The return of wider channels (for higher rates)? A lot of fiber in the ground today will be obsolete Dykstra. SC2004 15 . 2. 80 channels.4 Gbps each (19 Gbps) 2000. 40 Gbps each (3. 40 channels. 10 Gbps each (400 Gbps) 2004. 320 channels.DWDM Progress • Example DWDM Technology Deployment – – – – 1996. 160 Gbps each (51 Tbps) • • • • Cost is linear with # of channels ~2.

SC2004 16 . $100k/router port • Why aren’t WANs Ethernet too? Dykstra.$6000 (Oct 2004) – $10k/switch port.Ethernet World Domination • Over 95% of the worlds computers connect via Ethernet • 1 Gbps on copper can be had for $10/port – Fiber is at least 10x more expensive • 10 Gbps fiber NICs run $2000 .

Encapsulation 20 TCP header 20/40 IP header 14/18 Ethernet header Ethernet Payload (1500/9000) IP Payload 4 Ether CRC TCP Payload Dykstra. SC2004 17 .

High Performance Networks in the USA .

SC2004 19 .NGI Architecture Dykstra.

Abilene Backbone Sep 2004 20 .

21 .

22 .

MCI vBNS+ Aug 2004 Dykstra. SC2004 23 .

DREN Map Dykstra. SC2004 24 .

UK . NOAA) Laboratory Sponsored (6) peering points SNV HUB hubs PA NT E Abilene X ATL HUB SRS HUB ELP ESnet backbone: Optical Ring and Hubs International (high speed) OC192 (10G/s optical) OC48 (2.DOE ESnet.Germany .France .5 Gb/s optical) Gigabit Ethernet (1 Gb/s) OC12 ATM (622 Mb/s) OC12 OC3 (155 Mb/s) T3 (45 Mb/s) T1-T3 T1 (1 Mb/s) .etc Sinet (Japan) Japan – Russia(BINP) SEA HUB LIGO ne Japan TW C INEE L Ab ile ne Ch iN Ab ile W PN JGI SNLL LLNL SNV HUB G PNNL ESnet IP ligh Star t YC N B U H MIT QWEST ATM AP FNAL ANL AMES CHI HUB ANL-DC INEEL-DC ORAU-DC LLNL/LANL-DC BNL NY-NAP PPPL MAE-E PAIX-E JLAB ile Ab ne LBNL NERSC SLAC 4xLAB-DC GTN&NNSA W IXPA inix uq E MAE-W Fix-W YUCCA MT EL HT EC B SDSC ALB HUB LANL ARM L NRE KCP DC HUB ORNL OSTI NOAA ORAU SNLA Allied Signal DOE-ALB GA 42 end user sites Office Of Science Sponsored (22) NNSA Sponsored (12) Joint Sponsored (3) Other Sponsored (NSF LIGO.Italy . March 2004 Australia CA*net4 Taiwan (TANet2) Singaren CA*net4 KDDI (Japan) France Switzerland Taiwan (TANet2) CA*net4 CERN MREN Netherlands Russia StarTap Taiwan (ASCC) GEANT .

ESnet Future S C S C I C S C 1-3 yrs 1-40 Gb/s. 4-7 yrs requirement is for high bandwidth and QoS and network resident cache and compute elements. S end-to-end S 3-5 yrs requirement is for high bandwidth and QoS and network resident cache and compute elements. and robust bandwidth (multiple paths) C&C C I C C . end-to-end C C S I S storage C compute instrument C&C cache & compute In the near term applications need high bandwidth I 2-4 yrs requirement is for high bandwidth and QoS. 2-4 yrs guaranteed bandwidth paths S C& C S C& C C& C& 4-7 yrs 3-5 yrs C& C& C C C C C&C I C C &C C C& C C C &C &C C 100-200 Gb/s.

NREN Testbed NGIX-Chi ARC/NGIX-West GRC NGIX-East GSFC NASA WAN Testbed MSFC JPL NPN Sites OC-3 ATM NREN Sites OC-12ATM Hybrid Ground Station (35 Mbps) NASA WAN Testbed: • NASA Research and Education Network (NREN) • NASA Prototyping Network (NPN) 27 .

Target NREN Backbone December 2005 ARC/NGIX-West NLR Sunnyvale StarLight NLR Chicago GRC NLR Cleveland NGIX-East MATP LRC GSFC JPL NLR Los Angeles MSFC NLR MSFC NREN Sites NGIX Peering Points 10 GigE JSC NLR Houston 28 .

National LambdaRail Architecture 29 .

Phase 1 Completed Aug 24.NLR Optical Infrastructure . NLR 30 . 2004 Seattle Portland Boise Chicago &Starlight Sunnyvale LA San Diego NLR Route NLR Layer 1 Ogden /Salt Lake Denver Clev Pitts Wash DC Raleigh Atlanta Jacksonville KC Debbie Montano.

NLR 10 Gbps Jacksonville 31 .Seattle 30 Gbps 40 Gbps Starlight Pitts/SC04 70 Gbps 20 Gbps Sunn 20 Gbps Chicago DC LA NLR to SC2004 8 to 9 Potential Temporary 10 Gig Lambdas – *Draft* Debbie Montano.

NLR 32 .Seattle NLR Layer 2 Network Nationwide Gigabit Ethernet Early 2005 Chicago Denver Kans Pitts Albu Phoenix El Paso /Las Cruces Baton Rouge Houston DC Raleigh Atlanta Jacksonville Clev NY Sunnyvale Los Angeles Tulsa Cisco 6509 Switch 10GE LAN PHY Lambda Debbie Montano.

Network Speeds Over Time 10 Gbps 10 GigE HIPPI-6400 / GSN OC48 1 Gbps HIPPI-800 GigE vBNS OC12 OC3 T3 100 Mbps Ethernet FDDI 100 Ethernet 10 Mbps ~Computer I/O requirements NSFNET 1 Mbps T1 ATM HPCC NGI NITRD 2000 2005 33 100 kbps ARPANET / MILNET 10 kbps 1970 IP 1975 1980 1985 1990 1995 Dykstra. SC2004 .

fixedorbit.Commercial Networks The top ten networks defined by number of peers Rank Peers ASN Description 1 2.347 701 UUNET Technologies.com.092 209 Qwest 6 636 2914 Verio. 2 1. Sep 2004 Dykstra. Inc 10 533 4513 Globix Corporation Source: www. Inc. Inc.171 3356 Level 3 Communications.902 7018 AT&T WorldNet Services 3 1. LLC 5 1. 7 623 174 Cogent Communications 8 597 3549 Global Crossing 9 549 6461 Abovenet Communications. SC2004 34 .732 1239 Sprint 4 1.

abilene.internet2.net Dykstra. SC2004 35 .mil/Htdocs/DREN ESnet.gov vBNS+.hpc. www. www.es. www.vbns.Network References • • • • • • Abilene.nren.net NREN.edu DREN. www.net NLR. www.hpcmo.nasa.nlr.

k.a.Delay a. Latency .

Speed Limit • The speed of light is a constant • It seems slower every year Dykstra. SC2004 37 .

3x108 m/s in copper • ~2.0x108 m/s in free space • ~2.0x108 m/s in fiber = 200 km / ms [100 km of distance = 1 ms of round trip time] Dykstra. SC2004 38 .Speed of Light in Media • ~3.

5x longer 20 ms Dykstra.4 . SC2004 30 ms 40 ms 39 .3.Light Speed Delay in Fiber 10 ms Actual rtt’s often 1.

High “Speed” Networks Capacity OC3 155 Mbps DS3 45 Mbps Dykstra. SC2004 40 .

056 T1 1. SC2004 .2 ms ms ms us us us us us us us 42857 1554 240 53 24 15 3859 2400 965 240 km km km km km km m m m m 41 Dykstra.8 1.2 267 120 77 19 12 4.Packet Durations and Lengths 1500 Byte Packets in Fiber Mbps pps sec/pkt length 56k 0.8 1.544 Eth 10 T3 45 FEth 100 OC3 155 OC12 622 GigE 1000 OC48 2488 10GigE 10000 4.7 129 833 3750 8333 13k 52k 83k 207k 833k 214 7.

Observations on Packet Lengths • A 56k packet could wrap around the earth! • A 10GigE packet fits in the convention center Dykstra. SC2004 42 .

Observations on Packet Lengths • Each store and forward hop adds the packet duration to the delay – In the old days (< 10 Mbps) such hops dominated delay – Today (> 10 Mbps) store and forward delays on WANs are minimal compared to propagation Dykstra. SC2004 43 .

1/6th as many per second Dykstra. SC2004 44 . 30x as many per second – One of the reasons we haven’t seen OC48 SAR until recently (2002) • Jumbo Frames (9000 bytes) are 6x longer.Observations on Packet Lengths • ATM cells (and TCP ACK packets) are ~1/30th as long.

Router Queues Arriving Packets sum of input rates Departing Packets output line rate • The major source of variable delay • Handle temporary inequalities between arrival rate and output interface speed • Small queues minimize delay variation • Large queues minimize packet drop Dykstra. SC2004 45 .

SC2004 46 .Passive Queue Management (PQM) Arriving Packets 1 sum of input rates 3 2 Departing Packets output line rate When full. Tail-Drop – arriving packet Drop-From-Front – packet at front of queue Push-Out – packet at back of queue Random-Drop – pick any packet Dykstra. 4. choose a packet to drop 1. 2. 3.

RFC 2309. Maxth. SC2004 47 .html Dykstra.org/floyd/red. Maxdrop. w – www.Active Queue Management (AQM) Arriving Packets sum of input rates Maxth Minth Departing Packets output line rate • Addresses lock-out and full queue problems • Incoming packet drop probability is a function of average queue length • Random Early Detection (RED) – Recommended Practice. Apr 98 – Minth.icir. Mindrop.

8 ms 64 bytes from www. 10% loss.133.133.25): icmp_seq=3 ttl=241 time=25.cisco.1 ms 64 bytes from www.25): icmp_seq=9 ttl=241 time=25.25): icmp_seq=5 ttl=241 time=25.com (198.219. 9 received.com (198.com PING cisco.com (198.71.2 ms --.133.6 ms 64 bytes from www.com (198.25) from 63.133.25): icmp_seq=1 ttl=241 time=25.0 ms 64 bytes from www.053/25.585/26.1 ms 64 bytes from www.com (198.25): icmp_seq=4 ttl=241 time=26.219.219.455 ms Dykstra. SC2004 48 .133.cisco.cisco.229/0.com (198.cisco.cisco.219.133.25): icmp_seq=2 ttl=241 time=25.Measuring Delay: Ping $ ping –s 56 cisco.219.196.219.com (198.219.219.cisco.com ping statistics --10 packets transmitted.25): icmp_seq=6 ttl=241 time=25. 64 bytes from www.219.5 ms 64 bytes from www.cisco.246 : 56(84) bytes of data.cisco.com (198.133. time 9082ms rtt min/avg/max/mdev = 25.133.133.com (198.25): icmp_seq=8 ttl=241 time=25.25): icmp_seq=10 ttl=241 time=26.219.cisco.cisco.133.4 ms 64 bytes from www.1 ms 64 bytes from www.com (198.

SC2004 49 .Ping Observations IP 20 8 ICMP ts User Data 0+ bytes • Ping packet = 20 bytes IP + 8 bytes ICMP + “user data” (first 8 bytes = timestamp) • Default = 56 user bytes = 64 byte IP payload = 84 total bytes • Small pings (-s 8 = 36 bytes) take less time than large pings (-s 1472 = 1500 bytes) Dykstra.

Ping Observations • TTL = 241 indicates 255-241 = 14 hops • Delay variation indicates congestion or system load • Not good at measuring small loss – An HPC network should show zero ping loss • Depends on ICMP ECHO which is sometimes blocked for “security” Dykstra. SC2004 50 .

Delay Changes • Delay jumping between two distinct values • Does not show up in traceroute as a route change • Layer 2 problem: SONET protect failover Dykstra. SC2004 51 .

SC2004 52 .Delay Creep When layer 2 “heals” itself May 4th: 69 msec min May 5th: 138 msec min VC Reset May 7th: 168 msec min May 9th: 168 to 68 msec drop Dykstra.

One-Way Active Measurement Protocol (OWAMP) http://e2epi.040404% no 2-reordering Dykstra.0018921 s) no reordering --.880 ms (precision 0.0% loss) one-way delay min/median = 43.edu/owamp $ owping damp-arl --.com]:33530 --SID: 3fc447f6c4ab3560c54b599ab1a6185e 100 packets transmitted.internet2.owping statistics from [sd.wareonearth.0% loss) one-way delay min/median = 48.0018921 s) 1-reordering = 4. 0 packets lost (0. 0 packets lost (0.com]:33529 to [damp-arl-ge]:37545 --SID: 8a121503c4ab3560b8ca3e7df7886da8 100 packets transmitted.152 ms (precision 0.473/45. SC2004 53 .owping statistics from [damp-arl-ge]:37546 to [sd.429/48.wareonearth.

SC2004 54 .Surveyor • Showed that one-way is interesting Dykstra.

SC2004 55 .Real-Time Delay Dykstra.

075 = 7.Bandwidth*Delay Product • The number of bytes in flight to fill the entire path • Includes data in queues if they contributed to the delay • Example – 100 Mbps path – ping shows a 75 ms rtt – BDP = 100 * 0. SC2004 56 .5 million bits (916 KB) Dykstra.

Routes The path taken by your packets .

pass to the other network ASAP Dykstra. i. SC2004 58 .How Routers Choose Routes • Within a network – Smallest number of hops – Highest bandwidth paths – Usually ignore latency and utilization • From one network to another – Often “hot potato” routing.e.

EIGRP.IP Routing Hierarchy Autonomous System Interior Gateway Protocol (IGP) Exterior Gateway Protocol BGP Autonomous System Interior Gateway Protocol (IGP) OSPF. SC2004 59 . IS-IS OSPF. EIGRP. IS-IS Dykstra.

SC2004 60 .“Scenic” Routes Dykstra.

SC2004 61 .Asymmetric Routes Dykstra.

Level3 Dark Fiber 62 .

Qwest Fiber Routes Qwest Fiber Routes Dykstra. SC2004 63 .

SC2004 64 .MCI Fiber Routes Dykstra.

SC2004 65 .NGI Architecture Dykstra.

Bandwidth The highest bandwidth path is not always the highest throughput path! Host A Perryman.Path Performance: Latency vs. MD The network chose the OC3 path with 24x the rtt. SC2004 66 . NJ • Host A&B are 15 miles apart Host B Aberdeen. MD OC3 Path vBNS DS3 Path SDSC. 80x BDP • DS3 path is ~250 miles • OC3 path is ~6000 miles Dykstra. CA DREN SprintNAP.

SC2004 67 .com/pub/software/traceroute/ • Supports – – – – – – MTU discovery (-M) Report ASNs (-A) Registered owners (-O) Set IP TOS field (-t) Microsecond timestamps (-u) Set IP protocol (-I) Dykstra.A Modern Traceroute • ftp://ftp.login.

net (138.32.dren.dren.8.22.168.5.47.55) [AS668] 76 ms 75 ms 76 ms 3 Abilene-peer.7.5.1) [AS668] 0 ms 0 ms 0 ms 2 so-0-0-0.34) [AS668] 76 ms 77 ms 76 ms 4 nycmng-washng.edu (198.18.22. 64 hops max.84) [<NONE>] 80 ms 88 ms 80 ms 5 ATM10-420-OC12-GIGAPOPNE.18.190.sandiego. SC2004 68 .ucaid.18.net (138.90) [<NONE>] 85 ms 86 ms 104 ms 7 W92-RTR-1-BACKBONE.89.Traceroute Example [phil@damp-ssc phil]$ traceroute -A mit.9) [<NONE>] 87 ms 86 ms 85 ms 6 192.org (192.ngixeast.nox.7.EDU (18.1.89.ngixeast.25) [AS3] 85 ms 85 ms 85 ms 8 WEB.MIT.edu (18.dren.abilene.0.89.EDU (18.net (138. 40 byte packets 1 ge-0-1-0.MIT.90 (192.5.69) [AS3] 86 ms 86 ms 85 ms • DREN (AS668) to Abilene to MIT (AS3) • ASN “NONE” results from private/unadvertised address space • Hop 1 to 2 was over an (invisible) MPLS path Dykstra.69).edu traceroute to mit.

caida. that node returns an ICMP TTL Expired • The destination host returns an ICMP Port Unreachable Dykstra.How Traceroute Works www. TTL of 1 to 30 • Each router hop decrements the TTL • If the TTL=0.org/outreach/resources/animations/ • Sends UDP packets to ports (-p) 33434 and up. SC2004 69 .

Traceroute Observations
• Shows the return interface addresses of the forwarding path • You can’t see hops through switches or over tunnels (e.g. ATM VC’s, GRE, MPLS) • The required ICMP replies are sometimes blocked for “security”, or not generated, or sent without resetting the TTL
Dykstra, SC2004 70

Matt’s Traceroute
www.bitwizard.nl/mtr/
Matt's traceroute damp-ssc.spawar.navy.mil Keys: D - Display mode [v0.41] Sun Apr 23 23:29:51 2000 Q - Quit Pings Snt Last Best Avg Worst 24 0 0 0 1 24 1 1 1 6 24 84 84 84 86 24 84 84 84 86 24 89 88 152 407 23 88 88 88 90 23 88 88 88 90 23 89 88 91 116 23 112 111 112 113 23 135 134 135 135 23 147 147 147 147 23 98 98 113 291 23 152 152 152 156 23 152 152 152 160

R - Restart statistics Packets Hostname %Loss Rcv 1. taco2-fe0.nci.net 0% 24 2. nccosc-bgp.att-disc.net 0% 24 3. pennsbr-aip.att-disc.net 0% 24 4. sprint-nap.vbns.net 0% 24 5. cs-hssi1-0.pym.vbns.net 0% 23 6. jn1-at1-0-0-0.pym.vbns.net 0% 23 7. jn1-at1-0-0-13.nor.vbns.net 0% 23 8. jn1-so5-0-0-0.dng.vbns.net 0% 23 9. jn1-so5-0-0-0.dnj.vbns.net 0% 23 10. jn1-so4-0-0-0.hay.vbns.net 0% 23 11. jn1-so0-0-0-0.rto.vbns.net 0% 23 12. 192.12.207.22 5% 22 13. pinot.sdsc.edu 0% 23 14. ipn.caida.org 0% 23

Dykstra, SC2004

71

GTrace – Graphical Traceroute
www.caida.org/tools/visualization/gtrace/

Dykstra, SC2004

72

tcptraceroute
• http://michael.toren.net/code/tcptraceroute/ • Useful when ICMP is blocked
– But still depends on TTL expired replies

• Can select the TCP port number
– Helps locate firewall or ACL blocks – Defaults to port 80 (http)

Dykstra, SC2004

73

Routing vs. Switching
• IP routing requires a longest prefix match
– Harder than switching, but now wire speed – Inspired IP switching / MPLS

• Switches have gained features
– Some even route

• Simplicity is good
– Used switched infrastructure where you can, routers where you need better separation or control
Dykstra, SC2004 74

Multi-Protocol Label Switching
(MPLS)
• Adds switched (layer 2) paths to below IP
– Useful for traffic engineering, VPN’s, QoS control, high speed switching

• IP packets get wrapped in MPLS frames and “labeled” • MPLS routers switch the packets along Label Switched Paths (LSP’s) • Being generalized for optical switching
Dykstra, SC2004 75

Packet Sizes
MTU

Maximum Transmission Unit (MTU) • Maximum packet size that can be sent as one unit (no fragmentation) • Usually “IP MTU” which is the largest IP datagram including the IP header – Shown with ifconfig on Unix/Linux • Sometimes a layer 2 MTU – IP MTU 1500 = Ethernet MTU 1518 (or 1514. SC2004 77 . or 1522) Dykstra.

IP MTU Sizes Default IPv4 IPv6 HSSI IP on FDDI IP on ATM IP on FC POS 576 1280 4470 4352 9180 65280 4470/9180 4500 64K 64K – 4GB 64K .Inf RFC 1188 RFC 2225 RFC 2652 RFC 2615 Maximum 64K 64K – 4GB RFC 2460 Reference .

com/~phil/jumbo.Ethernet Jumbo Frames • Non-standard Ethernet extension – Usually 9000 bytes (IP MTU) – Sometimes anything over 1500 • Usually only used with 1GigE and above • Requires separate LANs or VLANs to accommodate non-jumbo equipment • http://sd. SC2004 79 .wareonearth.html Dykstra.

Reasons for Jumbo • Reduce system and network overhead • Handle 8KB NFS or SAN packets • Improve TCP performance! – Greater throughput – Greater loss tolerance – Reduced router queue loads Dykstra. SC2004 80 .

Nortel • • • • Juniper = 9178 (9192 – 14) Cisco (5000/6000) = 9216 Foundry JetCore 1 & 10 GbE = 14000 NetGear GA621 = 16728 • http://darkwing. SC2004 81 .html Dykstra. Avaya.Vendor Jumbo Support • MTU 9000 interoperability demonstrated: – Allied Telesyn. Extreme.edu/~joe/jumbo-clean-gear. Foundry. Cisco. NEC.uoregon.

Path MTU (PMTU) • Minimum MTU of all hops in a path • Hosts can do Path MTU Discovery to find it – Depends on ICMP replies • Without PMTU Discovery hosts should assume PMTU is only 576 bytes – Some hosts falsely assume 1500! Dykstra. SC2004 82 .

18.144ms 2: so-1_0_0.18.225ms 3: so-0_2_0.1) asymm 2 1.203.dren.981ms reached Resume: pmtu 1500 hops 7 back 6 Dykstra.1.18.uhm.dren.rome.25.4.net (138.net (138.mhpcc.11) asymm 6 163.18.seattle-m20.18.504ms 6: t3-0_0_0.49) asymm 3 3.11) 162.18.net (138.dren.194.net (138.6) asymm 6 158.net (138.976ms pmtu 1500 7: damp-rome-fe (138.Check the MTU [phil@damp-mhpcc phil]$ tracepath damp-rome 1?: [LOCALHOST] pmtu 9000 1: ge-0_1_0.net (138.1.33) asymm 5 69.dren.dren.52) asymm 4 69.597ms 5: t3-0_0_0.18.395ms 4: ge-0_1_0.dren.rome.seattle. SC2004 83 .4.

177.5) icmp_seq=1 Frag needed and DF set (mtu = 4352) 8008 bytes from damp-asc2-ge (138.wpafb.5): icmp_seq=1 ttl=60 time=47.5) from 204.22.18.net (138.Using Ping to Check the MTU damp-navo$ ping -s 8000 -d -v -M do damp-asc2 PING damp-asc2-ge (138. SC2004 84 .22.6 ms ping: local error: Message too long.220 : 8000(8028) bytes of data. mtu=4352 Dykstra.1.222.dren. From so-0_0_0.18.18.

9000+) – ATM CLIP (9180) • Never reduce the MTU (or bandwidth) on the path between each/every host and the WAN • Make sure your TCP uses Path MTU Discovery Dykstra.Things You Can Do • Use only large MTU interfaces/routers/links – Gigabit Ethernet with Jumbo Frames (9000) – Packet over SONET (POS) (4470. SC2004 85 .

Bandwidth and throughput .

SC2004 87 .Hops of Different Bandwidth 45Mbps 10Mbps 100Mbps 45Mbps • The “Narrow Link” has the lowest bandwidth • The “Tight Link” has the least Available bandwidth • Queues can form wherever available bandwidth decreases • A queue buildup is most likely in front of the Tight Link Dykstra.

SC2004 88 .Throughput Limit • throughput <= available bandwidth (“tight link” with the minimum unused bandwidth) – A high performance network should be lightly loaded (<50%) – A loaded high speed network is no better to the end user than a lightly loaded slow one Dykstra.

Bandwidth Determination • Ask the routers/switches – SNMP – RMON / NetFlow / sFlow • Passive Measurement – tcpdump – OCxmon • Active Measurement – Light weight tests – Full bandwidth tests Dykstra. SC2004 89 .

net-snmp.org/ – Library and utilities Dykstra. SC2004 90 .SNMP • Simple Network Monitoring Protocol – Version 1: RFC 1155-1157 (1990) – Version 2c: RFC 1901-1908 (1996) – Version 3: RFC 2571-2574 (1999) • Net-SNMP (was UCD-SNMP) – http://www.

2 = Counter64: 4049716647 IF-MIB::ifHCOutOctets.net system.sysUpTime.dren.Net-SNMP Example $ snmpget -v 2c -c public router.51 $ snmpwalk -v 2c -c public router. 22:16:35.2 = STRING: fxp1 … IF-MIB::ifHCInOctets.0 SNMPv2-MIB::sysUpTime. SC2004 91 .1 = STRING: fxp0 IF-MIB::ifName.2 = Gauge32: 100 IF-MIB::ifAlias.net IF-MIB::ifXTable IF-MIB::ifName.2 = Counter64: 3264374754 IF-MIB::ifHighSpeed.0 = Time ticks: (1407699551) 162 days.dren.2 = STRING: Router 1 LAN … Dykstra.

SC2004 Wrap Time 57 min 5.SNMP Counter Wrap Counter Bits 32 32 32 32 64 Data Rate 10 Mbps 100 Mbps 1 Gbps 10 Gbps 10 Gbps Dykstra.7 min 34 sec 3.4 sec 468 years 92 .

SNMP Resolution • Juniper caches interface statistics for five seconds • Reads provide new data only if cache > 5 seconds old • No timestamps on the data Dykstra. SC2004 93 .

Real-Time DREN Traffic Five second snapshots of DREN traffic Dykstra. SC2004 94 .

mrtg. SC2004 95 .MRTG • www.org • Extremely popular network monitoring tool • Most common display: – Five minute average link utilizations – Green into interface – Blue out of interface Dykstra.

SC2004 96 . OC48 link. 7 October 2002 Dykstra. Indianapolis to Kansas City.MRTG Example Abilene.

RRDtool • Round Robin Database Tool • Reimplementation of MRTG’s backend – Generalized data storage.org Dykstra. retrieval. graphing – Many front ends available. including MRTG • www.rrdtool. SC2004 97 .

Abilene Weather Map http://loadrunner.iu.edu/weathermaps/abilene/ 98 .uits.

SC2004 99 .tcpdump • www.tcpdump.com – Nice GUI protocol analyzer Dykstra.ethereal.org – Raw packet capture / decode from hosts – libpcap • www.

1993 – SNMP controllable packet capture • NetFlow. 1997 – Cisco flow records or samples • sFlow. SC2004 100 . 2001 – Low level packet samples and statistics Dykstra.RMON / NetFlow / sFlow • RMON.

NetFlow • Cisco flow records (also on Juniper) • Version 5 and 9 formats common today – www.doit. SC2004 101 .org/internet-drafts/draft-claise-netflow-version9-07.com/go/netflow – www.org/tools/measurement/cflowd/ – flow-tools.net/sw/flow-tools/ – FlowScan.caida.splintered.txt • Many tools – Cflowd. net. www.ietf.cisco. www.edu/~plonka/FlowScan/ Dykstra.wisc.

org Dykstra. Sep 2001 • Switch or router level agent – Controlled by SNMP • Statistical flow samples – Flow is one ingress port to one egress port(s) – Up to 256 packet header bytes.ntop. or summary • Periodic counter statistics – typical 20-120 sec interval • www.sFlow • RFC 3176.sflow. www.org. SC2004 102 .

Downey .Bandwidth Estimation – Single Packet • Larger packets take longer • Delay from intercept • Bandwidth from slope From A.

SC2004 104 .Bandwidth Estimation – Multi Packet 987 6 5 4 3 2 1 • Packet pairs or trains are sent • The slower link causes packets to spread • The packet spread indicates the bandwidth Dykstra.

org/~bmah/Software/pchar/ Dykstra. Wellesley College – http://rocky. Sandia/Cisco – http://www.wellesley. LBL – ftp://ftp.lbl.Bandwidth Measurement Tools • pathchar – Van Jacobson.employees.gov/pathchar/ • clink – Allen Downey. Mah.ee.edu/downey/clink/ • pchar – Bruce A. SC2004 105 .

Jin Guojun. Georgia Tech – http://www.html Dykstra.edu/~laik/projects/nettimer/ • pathrate/pathload .edu/fac/Constantinos.gov/pipechar/ • nettimer .cc.Kevin Lai.lbl.Bandwidth Measurement Tools • pipechar . Stanford University – http://gunpowder.gatech.Dovrolis/bwmeter.Constantinos Dovrolis.didc. LBL – http://www. SC2004 106 .stanford.

5 200.110 69.1.stanford.5 Mbps RTT: 69. estimates (in ~1 second): – ABw: Available Bandwidth – Xtr: Cross traffic – DBC: Dominating Bottleneck Capacity damp-ssc$ abing -d damp-arl 1095727946 T: 192.edu/tools/abing/ • Sends 20 pairs (40 packets) of 1450 byte UDP to a reflector on port 8176.042 69.slac.1 104.1.168.322 ms 20 20 1095727946 F: 192. SC2004 107 .abing • http://www-iepm.110 69.1 645.9 ABW: 364.042 69.2 ABw-Xtr-DBC: 364.1 Mbps RTT: 69.322 ms 20 20 Dykstra.2 ABw-Xtr-DBC: 541.168.5 564.3 ABW: 541.

Les R.Abing Example 1 Jiri Navratil. Cottrell (SLAC) 108 .

Abing Example 2

Jiri Navratil, Les R. Cottrell (SLAC)

109

Windows
Flow/rate control and error recovery

Windows
• Windows control the amount of data that is allowed to be “in flight” in the network • Maximum throughput is one window full per round trip time • The sender, receiver, and the network each determine a different window size

Dykstra, SC2004

111

Window Sizes 1,2,3

Data packets go one way ACK packets come back
Dykstra, SC2004 112

MPing – A Windowed Ping
www.psc.edu/~mathis/wping/ • Excellent tool to view the packet forwarding and loss properties of a path under varying load! • Sends windows full of ICMP Echo or UDP packets • Treats ICMP Echo_Reply or Port_Unreachable packets as “ACKs” • Make sure destination responds well to ICMP • Consumes a lot of resources: use with care

Dykstra, SC2004

113

How MPing Works
Example: window size = 5
5 4 3 2 1 1 2
transmit

bad things happen 5

comes back

recv 1 recv 2 recv 5 recv 6

6 7 10 11 9 8

(can send ack 1 + win 5 = 6) (can send ack 2 + win 5 = 7) (can send ack 5 + win 5 = 10) (can send ack 6 + win 5 = 11)

Dykstra, SC2004

114

MPing on a “Normal” Path

Stable queueing region

Tail-drop behavior

Rtt window not yet full Slope = 1 / rtt

Dykstra, SC2004

115

SC2004 116 .MPing on a “Normal” Path Stable data rate 3600*1000*8 = 29 Mbps Packet Loss 1 .3600/4800 = 25% Queue size (280-120)*1000 = 160 KB Effective BW*Delay Product 120*1000 = 120 KB Rtt = 120/3600 = 33 msec Dykstra.

SC2004 117 .Some MPing Results #1 Fairly normal behavior Discarded packets are costing some performance loss RTT is increasing as load increases Slow packet processing? Dykstra.

Some MPing Results #2 Very little stable queueing Insufficient memory? Spikes from some periodic event (cache cleaner?) Discarding packets comes at some cost to performance Error logging? Dykstra. SC2004 118 .

Some MPing Results #3 Oscillations with little loss Rate shaping? Decreasing performance with increasing queue length Typical of Unix boxes with poor queue insertion Dykstra. SC2004 119 .

Some MPing Results #4 Fairly constant packet loss. SC2004 120 . ~7/8 or 88% Hump at 50 may be duplex problem Both turned out to be an auto-negotiation duplex problem Setting to static full-duplex fixed these! Dykstra. even under light load Major packet loss.

SC2004 121 .Ethernet Duplex Problems An Internet Epidemic! • Ethernet “auto-negotiation” can select the speed and duplex of a connected pair • If only one end is doing it: – It can get the speed right – but it will assume half-duplex • Mismatch loss only shows up under load – Can’t see it with ping Dykstra.

Windows and TCP Throughput • TCP Throughput is at best one window per round trip time (window/rtt) • The maximum TCP window of the receiving or sending host is the most common limiter of performance • Examples – 8 KB window / 87 msec ping time = 753 Kbps – 64 KB window / 14 msec rtt = 37 Mbps Dykstra. SC2004 122 .

Maximum TCP/IP Data Rate With 64KB window 622Mbps 100Mbps 45Mbps Dykstra. SC2004 123 .

Receive Windows for 1 Gbps 1 MB 64KB limit is 32 miles 2 MB 3 MB Dykstra. SC2004 4 MB 5 MB 124 .

Mathis.Observed Receiver Window Sizes • ATM traffic from the Pittsburgh Gigapop • 50% had windows < 20 KB – These are obsolete systems! • 20% had 64 KB windows – Limited to ~ 8 Mbps coast-to-coast • ~9% are assumed to be using window scale M. 2002 125 . PSC.

TCP The Internet’s transport .

Important Points About TCP • TCP is adaptive • It is constantly trying to go faster • It slows down when it detects a loss • How much it sends is controlled by windows • When it sends is controlled by received ACK’s (or timeouts) Dykstra. SC2004 127 .

Time (Maryland to California) Dykstra.TCP Throughput vs. SC2004 128 .

The Three TCP Windows Sender Receiver Socket buffer cwnd Congestion window Receive window The smallest of these three will limit throughput Dykstra. SC2004 129 .

SC2004 130 .TCP Receive Window • A flow control mechanism • The amount of data the receiver is ready to receive – Updated with each ACK • Loosely set by receive socket buffer size • Application has to drain the buffer fast enough • System has to wait for missing or out of order data (reassembly queue) Dykstra.

SC2004 131 .Sender Socket Buffers • Must hold two round trip times of data – One set for the current round trip time – One set for possible retransmits from the last round trip time (retransmit queue) • A system maximum often limits size • The application must write data quickly enough Dykstra.

etc. rtt.) • In the old days. based on observations of the data transfer (loss. there was none – TCP would send a full receive window in one round trip time • TCP now has two modes – Slow-start: cwnd increases exponentially – Congestion-Avoidance: cwnd increases linearly Dykstra. SC2004 132 .TCP Congestion Window • Flow control. calculated by sender.

TCP Congestion Window (cwnd) • Congestion window controls startup and limits throughput in the face of loss • cwnd gets larger after every new ACK • cwnd get smaller when loss is detected • cwnd amounts to TCP’s estimate of the available bandwidth at any given time Dykstra. SC2004 133 .

Cwnd During Slow Start

• cwnd increased by one for every new ACK • cwnd doubles every round trip time (exponential) • cwnd is reset to zero after a loss
Dykstra, SC2004 134

Slow Start and Congestion Avoidance Together

Dykstra, SC2004

135

AIMD
• Additive Increase, Multiplicative Decrease
– of cwnd, the congestion window

• The core of TCP’s congestion avoidance phase, or “steady state”
– Standard increase = +1.0 MSS per loss free rtt – Standard decrease = *0.5 (i.e. halve cwin on loss)

• Avoids congestion collapse • Promotes fairness among flows
Dykstra, SC2004 136

Iperf : TCP California to Ohio
damp-ssc2$ iperf -c damp-asc2 -p56117 -w750k -t20 -i1 -fm -----------------------------------------------------------Client connecting to damp-asc2, TCP port 56117 TCP window size: 1.5 MByte (WARNING: requested 0.7 MByte) -----------------------------------------------------------[ 3] local 192.168.25.74 port 35857 connected with 192.168.244.240 port 56117 [ ID] Interval Transfer Bandwidth [ 3] 0.0- 1.0 sec 1.7 MBytes 14.2 Mbits/sec [ 3] 1.0- 2.3 sec 1.5 MBytes 9.5 Mbits/sec [ 3] 2.3- 3.2 sec 1.2 MBytes 11.2 Mbits/sec [ 3] 3.2- 4.1 sec 1.2 MBytes 12.2 Mbits/sec [ 3] 4.1- 5.1 sec 1.6 MBytes 12.5 Mbits/sec [ 3] 5.1- 6.1 sec 1.6 MBytes 13.6 Mbits/sec [ 3] 6.1- 7.0 sec 1.6 MBytes 14.6 Mbits/sec [ 3] 7.0- 8.2 sec 2.0 MBytes 14.7 Mbits/sec [ 3] 8.2- 9.1 sec 1.6 MBytes 15.4 Mbits/sec 82 msec rtt [ 3] 9.1-10.1 sec 2.0 MBytes 16.7 Mbits/sec [ 3] 10.1-11.1 sec 2.0 MBytes 17.0 Mbits/sec [ 3] 11.1-12.0 sec 2.0 MBytes 18.5 Mbits/sec [ 3] 12.0-13.1 sec 2.4 MBytes 18.8 Mbits/sec [ 3] 13.1-14.1 sec 2.4 MBytes 19.1 Mbits/sec [ 3] 14.1-15.1 sec 2.4 MBytes 20.8 Mbits/sec [ 3] 15.1-16.1 sec 2.4 MBytes 20.6 Mbits/sec [ 3] 16.1-17.0 sec 2.4 MBytes 22.0 Mbits/sec [ 3] 17.0-18.1 sec 2.8 MBytes 22.2 Mbits/sec [ 3] 18.1-19.1 sec 1.6 MBytes 13.6 Mbits/sec [ 3] 19.1-20.1 sec 1.6 MBytes 12.3 Mbits/sec [ ID] Interval Transfer Bandwidth [ 3] 0.0-20.6 sec 38.2 MBytes 15.5 Mbits/sec

Dykstra, SC2004

137

TCP Examples from Maui HI
8 ms rtt

98 ms rtt

153 ms rtt

Dykstra, SC2004

138

TCP Acceleration
rtt (msec) Mbps/s

2) (MSS/rtt
0-100Mbps (sec)

(Congestion avoidance rate increase, MSS = 1448)

5 10 20 50 100 200

463 116 29 4.6 1.16 0.29
Dykstra, SC2004

0.216 0.864 3.45 21.6 86.4 345
139

Bandwidth*Delay Product and TCP
• TCP needs a receive window (rwin) equal to or greater than the BW*Delay product to achieve maximum throughput • TCP needs sender side socket buffers of 2*BW*Delay to recover from errors • You need to send about 3*BW*Delay bytes for TCP to reach maximum speed

Dykstra, SC2004

140

SC2004 141 .Delayed ACKs • TCP receivers send ACK’s: – after every second segment – after a delayed ACK timeout – on every segment after a loss (missing segment) • A new segment sets the delayed ACK timer – Typically 0-200 msec • A second segment (or timeout) triggers an ACK and clears the delayed ACK timer Dykstra.

SC2004 142 .ACK Clocking 987 6 5 4 3 2 1 • • • • A queue forms in front of a slower speed link The slower link causes packets to spread The spread packets result in spread ACK’s The spread ACK’s end up clocking the source packets at the slower link rate Dykstra.

Detecting Loss • Packets get discarded when queues are full (or nearly full) • Duplicate ACK’s get sent after missing or out of order packets • Most TCP’s retransmit after the third duplicate ACK (“triple duplicate ACK”) – Windows XP now uses 2nd dup ACK Dykstra. SC2004 143 .

psc. 40% higher throughput • www.edu/networking/papers/model_abstract. double the throughput – shortest path matters • Halve the loss rate.TCP Throughput Model Once recv window size and available bandwidth aren’t the limit ~0. et al. double the throughput • Halve the latency. SC2004 144 . Mathis. • Double the MTU.html Dykstra.7 * Max Segment Size (MSS) Rate = Round Trip Time * sqrt[pkt_loss] M.

7 * MSS / (rtt * sqrt(p)) • If we could halve the delay we could double throughput! • Most delay is caused by speed of light in fiber (~200 km/msec) • “Scenic routing” and fiber paths raise the minimum • Congestion (queueing) adds delay Dykstra. SC2004 145 .Round Trip Time (RTT) rate = 0.

MSS = 1448 or 1460 Jumbo frame => ~6x throughput increase The Abilene and DREN WANs support MTU 9000 everywhere – Most sites still have a lot of work to do Dykstra.7 * MSS / (rtt * sqrt(p)) • • • • MSS = MTU – packet headers Most often.Max Segment Size (MSS) rate = 0. SC2004 146 .

4470. or 9000. exactly 9000 is being used to the sites with GigE interfaces – Sites choose 1500. others by exception • Sites are encouraged to support 9000 on their GigE infrastructures Dykstra. SC2004 147 .DREN MTU Increase • IP Maximum Transmission Unit (MTU) on the DREN WAN is now 9146 bytes • After much debate.

into Ohio Jumbo enabled 23 July 2003 Dykstra.Impact of Jumbo. SC2004 148 .

SC2004 149 . San Diego to DC Jumbo enabled 17 July 2003 Dykstra.Impact of Jumbo.

7 * MSS / (rtt * sqrt(p)) • Loss dominates throughput ! • At least 6 orders of magnitude observed on the Internet • 100 Mbps throughput requires O(10-6) • 1 Gbps throughput requires O(10-8) Dykstra.Packet Loss (p) rate = 0. SC2004 150 .

Loss Limits for 1 Gbps 7x10-7 2x10-7 MSS = 1460 Dykstra. SC2004 151 7x10-8 4x10-8 .

0. SC2004 152 . i.Specifying Loss • TCP loss limits for 1 Gbps across country are O(10-8).01% Dykstra.000001% packet loss – – – – About 1 “ping” packet every three years Systems like AMP would never show loss Try to get 10-8 in writing from a provider! Most providers won’t guarantee < 0.e.

g. ~300 Mbps on OC12) • A low loss requirement comes with this! Dykstra. SC2004 153 .Specifying Throughput • Require the provider to demonstrate TCP throughput – DREN contract requires ½ line rate TCP flow sustained for 10 minutes cross country (e.

SC2004 154 .g. 10-12 BER => 10-8 packet ER (1500 bytes) – 10 hops => 10-7 packet drop rate Dykstra.Concerns About Bit Errors • Bit Error Rate (BER) specs for networking interfaces/circuits may not be low enough – E.

10-14 – One error in 10 TB • SONET Equipment. 10-6 Dykstra. 10-12 • SONET Circuits. SC2004 155 . 10-10 • 1000Base-T. 10-10 – One bit error every 10 seconds at line rate • Copper T1.Example Bit Error Rate Specs • Hard Disk.

Is 10-8 Packet Loss Reasonable? • Requires a bit error rate (BER) of 10-12 or better! • Perhaps TCP is expecting too much from the network • Loss isn’t always congestion • Modify TCP’s congestion control Dykstra. SC2004 156 .

4 x throughput Dykstra.TCP Throughput Model Revisited • Original formula assumptions – constant packet loss rate – dominated by congestion from other flows – 6 x MTU provides 6 x throughput • HPC environment (low congestion) – Loss may be dominated by bit error rates – MSS becomes sqrt(MSS) – 6 x MTU provides 2. SC2004 157 .

More About TCP Some high performance options .

High Performance TCP Features • • • • • Window Scale Timestamps Selective Acknowledgement (SACK) Path MTU Discovery (PMTUD) Explicit Congestion Notification (ECN) Dykstra. SC2004 159 .

16KB Dykstra. 65535 byte maximum • RFC1323 window scale option in SYN packets • Creates a 32-bit window by left shifting 0 to 14 bit positions – New max window is 1 GB (230) – Granularity: 1B. ….TCP Window Scale • 16-bit window. 4B. 2B. SC2004 160 .

SC2004 161 .TCP Timestamps • 12 byte option – drops MSS of 1460 to 1448 • Allows better Round Trip Time Measurement (RTTM) • Prevention Against Wrapped Sequence numbers (PAWS) – 32 bit sequence number wraps in 17 sec at 1 Gbps – TCP assumes a Maximum Segment Lifetime (MSL) of 120 seconds Dykstra.

Selective ACK • TCP originally could only ACK the last in sequence byte received (cumulative ACK) • RFC2018 SACK allows ranges of bytes to be ACK’d – Sender can fill in holes without resending everything – Up to four blocks fit in the TCP option space (three with the timestamp option) • Surveys have shown that SACK implementations often have errors – RFC3517 addresses how to respond to SACK Dykstra. SC2004 162 .

SC2004 163 .Duplicate SACK • RFC2883 extended TCP SACK option to allow duplicate packets to be SACK’d • D-SACK is a SACK block that reports duplicates • Allows the sender to infer packet reordering Dykstra.

recommended Dykstra. keeps sending as long as outstanding data < cwnd • Generally a good performer .Forward ACK Algorithm • Forward ACK (FACK) is the forward-most segment acknowledged in a SACK • FACK TCP uses this to keep track of outstanding data • During fast recovery. SC2004 164 .

Path MTU Discovery • Probes for the largest end-to-end unfragmented packet • Uses don’t-fragment option bit. SC2004 165 . waits for must-fragment replies • Possible black holes – some devices will discard the large packets without sending a must-fragment reply – Problems are discussed in RFC2923 Dykstra.

Explicit Congestion Notification (ECN) • RFC3168. SC2004 166 . Proposed Standard • Indicates congestion without packet discard • Uses last two bits of the IP Type of Service (TOS) field • Black hole warning – Some devices silently discard packets with these bits set Dykstra.

SC2004 167 .edu:7961/ • telnet syntest.edu 7960 Your IP address is: 192.psc.200 ! ! Check of SYN options ! !======================================================= ! Variable : Val : Warning (if any) !======================================================= SACKEnabled : 3 : TimestampsEnabled : 1 : CurMSS : 1448 : WinScaleRcvd : 2 : CurRwinRcvd : 1460 : ! ! End of SYN options ! Dykstra.SYN Option Check Server • http://syntest.26.psc.168.

. buffers. etc.System Tuning Interfaces. routes.

SC2004 169 .Things You Can Do • Throw out your low speed interfaces and networks! • Make sure routes and DNS report high speed interfaces • Don’t over-utilize your links (<50%) • Use routers sparingly. host routers not at all routed -q Dykstra.

SC2004 170 .Things You Can Do • Make sure your HPC apps offer sufficient receive windows and use sufficient send buffers – But don’t run your system out of memory – Find out the rtt with ping. by application. or automatically • Check your TCP for high performance features • Look for sources of loss – Watch out for duplex problems Dykstra. compute BDP – Tune system wide.

TCP Performance • Maximum TCP throughput is one window per round trip time • System default and maximum window sizes are usually too small for HPC • The #1 cause of slow TCP transfers! Dykstra. SC2004 171 .

0 MB 24 MB 97 MB Dykstra.5 MB 6.Minimum TCP Window Sizes Throughput\rtt 40 Mbps 140 Mbps 570 Mbps 2280 Mbps 9100 Mbps 85 msec 425KB 1.7 MB 11 MB 44 MB 176 MB 172 . SC2004 155 msec 775KB 2.

wmem_max=8388608 • In /etc/sysctl. SC2004 173 .Increase Your Max Window Size Linux example • Command line: sysctl -w net.core.rmem_max = 8388608 net.wmem_max = 8388608 Dykstra.conf (makes it permanent) net.core.core.rmem_max=8388608 sysctl -w net.core.

tcp_rmem = 4096 349520 4194304 net. maximum for automatic buffer allocation on Linux Dykstra. SC2004 174 .ipv4.tcp_wmem = 4096 65536 4194304 • Three values are: minimum.ipv4.Increase Your Default Window Size Linux example • In /etc/sysctl. default.conf net.

…) Dykstra.Application Buffer Sizes • Before doing a connect or accept setsockopt(fd. SOL_SOCKET. SO_SNDBUF. SC2004 175 . SOL_SOCKET. …) setsockopt(fd. SO_RCVBUF.

dslreports.html • Beware that modem utilities such as DunTweak can reduce performance on high speed nets Dykstra.Dr. SC2004 176 . TCP A TCP Stack Tuner for Windows http://www.com/front/drtcp.

psc. SC2004 177 .edu/networking/projects/tcptune/ Dykstra.Tuning Other Systems • See “Enabling High Performance Data Transfers” http://www.

net/Projects/FTP.html Dykstra.nlanr. see http://dast.FTP Buffer Sizes • Many FTP’s allow the user to set buffer sizes – The commands are different everywhere! • kftp buffer size commands lbufsize 8388806 rbufsize 8388806 – Other kftp commands (local) (remote) • protect/cprotect (clear|safe|private) • passive – client opens the data channel connection • For other versions of FTP. SC2004 178 .

fast.x has SSL / TLS support • http://vsftpd.org/ Dykstra. secure • 2.vsftpd FTP server • Small. SC2004 179 .beasts.

SC2004 180 .GridFTP • FTP Protocol Extensions – SBUF and ABUF – set / auto buffer sizes – SPOR and SPAS – striped port / passive Dykstra.

nlanr.net/Projects/Autobuf/ • Measures the spread of a burst of ICMP Echo packets to estimate BDP.Autobuf – An Auto-tuning FTP http://dast. sets bufs Dykstra. SC2004 181 .

kftp • Before tuning: 3. SC2004 182 .2 Mbps • After tuning: 40 Mbps (line rate) A very common story… Dykstra.DREN Success Story • Rome NY to Maui HI.

edu/People/vwelch/net_perf/tcp_windows.psc.html • TCP Tuning Guide www-didc.lbl. SC2004 183 .Good Tuning References • Users Guide to TCP Windows www.ncsa.internet2.edu/networking/projects/tcptune/ • WAN Tuning and Troubleshooting www.html Dykstra.uiuc.edu/~shalunov/writing/tcp-perf.gov/TCP-tuning/ • Enabling High Performance Data Transfers on Hosts www.

Throughput Tests .

edu/networking/treno_info.16%) in 9.1.77 ms. path MTU was 1440 bytes Dykstra.16%) in 10..9 kbp/s (54475 pkts in + 86 lost = 0.168.psc..html • Tells you what a good TCP should be able to achieve (Bulk Transfer Capacity) damp-mhpcc% treno damp-pmrf MTU=8166 MTU=4352 MTU=2002 MTU=1492 . SC2004 185 . Replies were from damp-pmrf [192...1] Average rate: 63470.5 kbp/s (55241 pkts in + 87 lost = 0...03 s Equilibrium rate: 63851....Treno Throughput Test www.828 s Path properties: min RTT was 8.

Treno Observations • Easy 10 second test. no remote access or receiver process required • Emulates TCP but doesn’t use TCP – Problems with host TCP or tuning are avoided • Does Path MTU Discovery • Reports rtt and loss rates • A zero equilibrium result means there was too much packet loss to exit “slow start” Dykstra. SC2004 186 .

SC2004 187 .Treno Observations • Can send ICMP (-i) or UDP (default) – ICMP replies (ECHO or UNREACH) could be blocked for “security” • Routers send ICMP replies very slowly – So don’t test routers with treno • ICMP is often rate limited now by hosts • Port numbers can wrap around (and look like port scans) Dykstra.

SC2004 188 .net/Projects/Iperf/ • netperf – dated but still in wide use – http://www.nrl.lcp.nlanr.org/ • ftp – nothing beats a real application Dykstra.TCP Throughput Tests • ttcp – the original.mil/pub/nuttcp/ • Iperf – great TCP/UDP tool (recommended) – http://dast.com/~phil/net/ttcp/ • nuttcp – great successor to ttcp (recommended) – ftp://ftp.netperf.navy.wareonearth. many variations – http://sd.

ttcp) • User payload data rates are reported by tools – No TCP.5293 Mbps • http://sd. etc. IP.Throughput Testing Notes • Network data rates (bps) are powers of 10.wareonearth. SC2004 189 . 100 Mbps ethernet is 100.g. not powers of 2 as used for Bytes – E. 100 Mbps ethernet max is 97.g.com/~phil/net/overhead/ Dykstra.000.g.000 bits/sec – Some tools wrongly use powers of 2 (e. Ethernet. headers are included – E.

SC2004 190 .c file – ftp://ftp.c • Start a server: – nuttcp –S – Allows remote users to run tests to/from that host without accounts Dykstra.nrl.lcp.navy.1.11.mil/pub/nuttcp/ • Compile it: – cc –O3 –o nuttcp nuttcp-v5.nuttcp • My favorite TCP/UDP test tool • Get the latest .

A Permanent nuttcp Server RedHat Linux example Create a file /etc/xinetd. SC2004 191 .d/nuttcp # default: off # description: nuttcp service nuttcp { socket_type wait user server server_args disable } = = = = = = stream no nobody /usr/local/bin/nuttcp -S no Dykstra.

Testing a Path Example San Diego CA to Washington DC OC12 path .

21) 72.18.217 ms 3 damp-ssc-ge (138.190.33) 0.079 ms 72.dren.162 ms 72. 30 hops max.5).net (138.dren.190.18.net (138.069 ms 72.255 ms 0.238 ms 2 so12-0-0-0.sandiego. 30 hops max.23.23.net (138.37) 72.1.Traceroute [phil@damp-ssc phil]$ nuttcp -xt damp-nrl traceroute to damp-nrl-ge (138.dren.5 (138.nrldc.18.070 ms traceroute to 138.18.4.190.190.237 ms 0.18.net (138.18.199 ms 0.5) 72.18.184 ms 2 so12-0-0-0.1) 0.239 ms 0.nrldc.185 ms 72.18.087 ms 72.075 ms 72.068 ms Make sure it goes where you expect! Dykstra. 38 byte packets 1 ge-0-1-0.979 ms 72.37). 38 byte packets 1 ge-0-1-0.dren.164 ms 3 damp-nrl-ge (138.18.210 ms 72.sandiego.7) 72.23. SC2004 193 .

0 ms 64 bytes from damp-nrl-ge (138.23.Check Path RTT and MTU [phil@damp-ssc phil]$ ping damp-nrl PING damp-nrl-ge (138.190.7) 3: damp-nrl-ge (138.sandiego.007 ms [phil@damp-ssc phil]$ tracepath damp-nrl 1?: [LOCALHOST] pmtu 9000 1: ge-0-1-0.905ms 72.092/0.18.37) from 138.net (138.37): icmp_seq=3 ttl=62 time=72.18.23.0 ms --.37): icmp_seq=2 ttl=62 time=72.nrldc.18.747ms reached RTT = 0.23.damp-nrl-ge ping statistics --5 packets transmitted.190.0 ms 64 bytes from damp-nrl-ge (138.082/72.37): icmp_seq=5 ttl=62 time=72.1) 2: so12-0-0-0.18.18.37): icmp_seq=4 ttl=62 time=72. SC2004 194 .18.0 ms 64 bytes from damp-nrl-ge (138.dren.dren.1. MSS = 9000-20-20-12 = 8948 Dykstra.23. 5 received.23.18. 64 bytes from damp-nrl-ge (138.072.18.23.5 : 56(84) bytes of data. 0% loss.net (138.37) Resume: pmtu 9000 hops 3 back 3 asymm 2 1.18.071/72.23.37): icmp_seq=1 ttl=62 time=72.060ms asymm 3 72.18.0 ms 64 bytes from damp-nrl-ge (138. time 4035ms rtt min/avg/max/mdev = 72.

072 = 5. SC2004 195 .Do the Math • Bandwidth * Delay Product – BDP = 622000000/8 * 0.6 MB • The TCP receive window needs to be at least this large • The sender buffer should be twice this size • We choose 8MB for both – Knowing that Linux will double it for us! Dykstra.

nstream=1.33 nuttcp-t: 0. calls/sec = 973. port=5001 tcp nuttcp-r: accept from 138.0user 0.0786 MB in 10.0360 Mbps nuttcp-r: 71224 I/O calls.97 nuttcp-r: 0. SC2004 196 .00 seconds nuttcp-t: connect to 138.23.10: socket nuttcp-t: buflen=65536.1. nstream=1.00 real seconds = 62288.10: socket nuttcp-r: buflen=65536.18.05.5 nuttcp-r: send window size = 103424. calls/sec = 6981. msec/call = 0.20 real seconds = 61039.18. port=5001 tcp -> damp-nrl nuttcp-t: time limit = 10.55 KB/sec = 500.0786 MB in 10.15.190.Check Buffer Sizes [phil@damp-ssc phil]$ nuttcp -v -t -T10 -w8m damp-nrl nuttcp-t: v5. receive window size = 16777216 nuttcp-r: 608. receive window size = 103424 nuttcp-t: 608. msec/call = 1.2659 Mbps nuttcp-t: 9730 I/O calls.37 nuttcp-t: send window size = 16777216.0user 0.5sys 0:10real 5% 0i+0d 0maxrss 3+0pf 0+0csw Dykstra.32 KB/sec = 510.9sys 0:10real 9% 0i+0d 0maxrss 2+0pf 0+0csw nuttcp-r: v5.1.

SC2004 197 .Things to Check • The receiver’s (nuttcp-r) receive buffer size • The transmitter’s (nuttcp-t) send buffer size • The receiver’s reported throughput – The transmitter number is often too high • The CPU utilization of both sender and receiver – Make sure you didn’t run out of CPU Dykstra.

9192 Mbps 67.00 sec = 567.5339 MB / 1.00 sec = 566.5937 MB / 1.5937 MB / 1.7978 Mbps Slow start 67.6363 MB / 1.00 sec = 567.1195 Mbps 67.0000 MB / 10.00 sec = 567.00 sec = 567.00 sec = 569.5531 MB / 1.7558 MB / 1.00 sec = 13.5603 Mbps 8 %TX 4 %RX Dykstra.Throughput Test (with -i) [phil@damp-ssc phil]$ nuttcp -t -T10 -i1 -w8m damp-nrl 1.0935 Mbps 44.5937 MB / 1.19 sec = 494.3144 MB / 1.9265 MB / 1.4850 Mbps 67.00 sec = 569.4661 Mbps 67.4834 Mbps 601.1184 Mbps Stable 67.8753 MB / 1. SC2004 198 .6178 Mbps 67.1195 Mbps 67.00 sec = 568.00 sec = 371.

00 sec = 207.00 sec = 126.5891 MB / 1.0547 Mbps 50.6874 MB / 1.0350 Mbps 29.00 sec = 420.0616 MB / 1.7883 Mbps Slow start 38.6979 MB / 10.3801 MB / 1.9842 Mbps 48.00 sec = 406.2614 MB / 1.00 sec = 18.Reverse Throughput Test (-r) [phil@damp-ssc phil]$ nuttcp -r -T10 -i1 -w8m damp-nrl 2.1768 MB / 1.00 sec = 139.1180 Mbps 15.9809 MB / 1.60 sec = 240.00 sec = 132.8672 MB / 1.3490 Mbps 4 %TX 2 %RX Dykstra. SC2004 199 .4872 MB / 1.8211 MB / 1.1766 Mbps 303.00 sec = 327.3602 Mbps 15.00 sec = 431.00 sec = 250.7315 Mbps 16.5731 Mbps 51.9644 Mbps Unstable 24.

5932 MB / 1.5932 MB / 1.0000 Mbps 0 / 7811 ~drop/pkt 0.9920 Mbps 0 / 7811 ~drop/pkt 0.9778 MB / 10.5932 MB / 1.00 ~%loss 59.9930 Mbps 0 / 7811 ~drop/pkt 0.00 ~%loss 59.5932 MB / 1.00 sec = 499.9950 Mbps 0 / 7811 ~drop/pkt 0.00 ~%loss 59.UDP Test (-u –l –R) [phil@damp-ssc phil]$ nuttcp -t -u -l8000 -R500m -T10 -i1 –w8m damp-nrl 59.5932 MB / 1.99 sec = 499.9925 Mbps 0 / 7811 ~drop/pkt 0.6008 MB / 1.00 ~%loss 59.9905 Mbps 0 / 7811 ~drop/pkt 0. SC2004 200 .01 sec = 499.5932 MB / 1.00 sec = 500.00 ~%loss 595.5932 MB / 1.9950 Mbps 0 / 7811 ~drop/pkt 0.00 sec = 499.00 ~%loss 59.00 sec = 499.00 sec = 500.00 sec = 499.0570 Mbps 0 / 7812 ~drop/pkt 0.6586 Mbps 0 / 7764 ~drop/pkt 0.9935 Mbps 0 / 7811 ~drop/pkt 0.00 ~%loss 59.00 %loss Dykstra.00 sec = 499.00 ~%loss 59.2346 MB / 0.00 ~%loss 59.00 sec = 499.00 ~%loss 59.5932 MB / 1.00 sec = 499.5061 Mbps 99 %TX 3 %RX 0 / 78116 drop/pkt 0.

5169 MB / 1.00 ~%loss 59.01 ~%loss 59.00 ~%loss 595.7300 MB / 1.8939 MB / 10.03 ~%loss 59.00 sec = 502.00 sec = 501.8439 Mbps 0 / 7731 ~drop/pkt 0.00 sec = 509.9828 MB / 1.00 sec = 495.2097 Mbps 0 / 7768 ~drop/pkt 0.7424 Mbps 2 / 7794 ~drop/pkt 0.9463 Mbps 0 / 7717 ~drop/pkt 0.2651 MB / 1.00 sec = 499.01 sec = 499.4702 Mbps 100 %TX 3 %RX 5 / 78110 drop/pkt 0.5001 Mbps 0 / 7960 ~drop/pkt 0.00 ~%loss 59.00 ~%loss 59.4825 Mbps 0 / 7741 ~drop/pkt 0.8526 MB / 1.5677 Mbps 2 / 7838 ~drop/pkt 0.00 sec = 494.00 sec = 497.00 ~%loss 58.1408 Mbps 0 / 7448 ~drop/pkt 0.4482 MB / 1.8760 MB / 1.Reverse UDP Test [phil@damp-ssc phil]$ nuttcp -r -u -l8000 -R500m -T10 -i1 –w8m damp-nrl 56.0591 MB / 1.00 ~%loss 60.1347 Mbps 0 / 7845 ~drop/pkt 0.00 sec = 493.00 ~%loss 59.03 ~%loss 58.01 %loss Dykstra.99 sec = 481. SC2004 201 .00 sec = 498.8237 MB / 0.3279 Mbps 1 / 7802 ~drop/pkt 0.7839 MB / 1.

072 * sqrt(5/78110)) = 87 Mbps Conclusion: Too much loss on DC to CA path Dykstra.Is Loss the Limiter? bps = 0.7 * MSS / (rtt * sqrt(loss)) = 0. SC2004 202 .7 * 8948*8 / (0.

SC2004 203 .Testing Notes • When UDP testing – – – – – Be careful of fragmentation (path MTU) Most systems are better at TCP than UDP Setting large socket buffers helps Raising process priority helps Duplex problems don’t show up with UDP! • Having debugged test points is critical – Local and remote • A performance history is valuable Dykstra.

Duplex Problem Symptoms • Will never see it with pings or UDP tests • TCP throughput is sometimes ½ of expected rate. SC2004 204 . sometimes under 1 Mbps • Worse performance usually results when the mismatch is near the receiver of a TCP flow • Sometimes reported as “late collisions” Dykstra.

Checking Systems for Errors • Don’t forget – ifconfig UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1 RX packets:414639041 errors:0 dropped:0 overruns:0 frame:0 TX packets:378926231 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 – netstat –s Tcp: 59450 segments retransmited – dmesg or /var/log ixgb: eth2 NIC Link is Up 10000 Mbps Full Duplex Dykstra. SC2004 205 .

Iperf • Iperf is probably better for – Parallel stream reporting (-P) – Bi-directional tests (-d) – MSS control (-M) and reporting (-m) • Nuttcp is better at – – – – – Getting server results back to the client Third party support Traceroute (-xt) Setting priority (-xP) Undocumented instantaneous rate limit (-Ri) Dykstra. SC2004 206 .Nuttcp vs.

Example Measurement Infrastructure .

loss.nlanr.DREN Active Measurement Program (DAMP) • Grew out of the NLANR AMP project – http://amp. throughput – Dedicated hardware. known performance – User accessible test points • Invaluable at identifying and debugging problems Dykstra. routes.net/ • Long term active performance monitoring – Delay. SC2004 208 .

1 Alaska Dykstra.DREN AMP Systems pnw asc arl nrl ssc navo 23 CONUS. 3 Hawaii. SC2004 209 .

2 Gbps TCP from MD to MS over OC48 • New NICs: Intel. tg3 driver – Intel PXLA8590LR 10GigE NIC.6 Gbps TCP on LAN after tuning (7. S2io X-Frame. PCI-X – Linux 2.7 Gbps loopback) – 2. 512 KB L2 cache.4.65 driver • 4. SC2004 210 .0 GHz Xeon. 3.26-web100 – Two built-in BCM5703 10/100/1000 copper NICs. 533 MHz FSB.10GigE Test Systems • Dell 2650.0. Chelsio T110/N110 • PCI Express solution early 2005? Dykstra. 1 MB L3 cache. ixgb 1.

tagged events • Would love to see NTP nanokernel patch adopted by Linux! Dykstra.Precise Time • 10 msec NTP accuracy w/o clock – asymmetric paths and clock hopping • 10 usec NTP accuracy with CDMA clock – EndRun Technologies Praecis Ct ($1100) – Serial port. SC2004 211 .

AMP Tests on Demand • Web page interface • 10 second TCP throughput test – Uses nuttcp third party support • Can also access them directly Dykstra. SC2004 212 .

SC2004 213 .e.Need to test over a distance • Many local problems won’t be seen over a short round trip time (i. within your site!) – Window size won’t matter – TCP error recovery is extremely quick • Firewalls and screening routers (ACLs) could be part of the problem Dykstra.

SC2004 214 .Test Pairs AMP Performance Data AMP1 AMP2 A Actual Application Pair B Dykstra.

cat5) Bad switches or router modules Slow firewalls or routers – Excessive logging • WAN bit errors • Routing issues • Heat problems! Dykstra.Problems We Have Found • • • • • Small windows on hosts Duplex problems Bad cables (patch panels. SC2004 215 . fiber.

Advanced Debugging TCP Traces and Testrig .

xplot.TCP/IP Analysis Tools • tcpdump – www.ethereal.GUI tcpdump (protocol analyzer) – www.ncne. etc.org • ethereal .org • testrig – tcpdump.tcpdump.net/research/tcp/testrig/ Dykstra. tcptrace.tcptrace.nlanr.com • tcptrace – stats/graphs of tcpdump data – www. SC2004 217 . – www.

SC2004 218 .“A Preconfigured TCP Test Rig” Dykstra.

SC2004 219 . SackOK. MSS. window scale. timestamps. Don’t-Fragment (DF) Dykstra.TCP Connection Establishment • Three-way handshake – SYN. SYN+ACK. ACK • Use tcpdump. look for performance features – window sizes.

wscale 0> (DF) 16:08:33.timestamp 263520796 364570771> (DF) Dykstra.hpc.wscale 5> (DF) 16:08:33.sackOK.nop. SC2004 220 .40874: S 490305274:490305274(0) ack 488615736 win 5792 <mss 1460.Tcpdump of TCP Handshake 16:08:33.timestamp 263520790 0.sackOK.56117 > wcisd.hpc.mil.nop.timestamp 364570771 263520790.56117: .734103 wcisd.40874 > damp-nrl.734045 damp-nrl.40874 > damp-nrl. ack 1 win 5840 <nop.mil.hpc.56117: S 488615735:488615735(0) win 5840 <mss 1460.mil.674226 wcisd.nop.

Collecting A TCP Trace tcpdump -p -w trace. SC2004 221 .out tcptrace -S trace.xpl Dykstra.out (or –G for all graphs) xplot *.out -s 100 host $desthost and port 5001 & nuttcp -T10 -n200m -i1 -w10m $desthost tcptrace -l -r trace.

272579 2004 last packet: Wed Oct 13 17:30:39.trace a->b: b->a: total packets: 23473 total packets: ack pkts sent: 23472 ack pkts sent: pure acks sent: 1 pure acks sent: sack pkts sent: 0 sack pkts sent: dsack pkts sent: 0 dsack pkts sent: max sack blks/ack: 0 max sack blks/ack: unique bytes sent: 209714276 unique bytes sent: actual data pkts: 23471 actual data pkts: actual data bytes: 210018508 actual data bytes: rexmt data pkts: 34 rexmt data pkts: rexmt data bytes: 304232 rexmt data bytes: zwnd probe pkts: 0 zwnd probe pkts: zwnd probe bytes: 0 zwnd probe bytes: outoforder pkts: 0 outoforder pkts: pushed data pkts: 555 pushed data pkts: 12590 12590 12589 1774 0 1 0 0 0 0 0 0 0 0 0 Dykstra.289916 2004 elapsed time: 0:00:09.017336 total packets: 36063 filename: nrl-ssc.tcptrace -l -r TCP connection 1: host a: damp-nrl-ge:58310 host b: damp-ssc-ge:5001 complete conn: no (SYNs: 2) (FINs: 0) first packet: Wed Oct 13 17:30:30. SC2004 222 .

945 secs idletime max: 141.000 73.) SYN/FIN pkts sent: 1/0 req 1323 ws/ts: Y/Y adv wind scale: 7 req sack: Y sacks sent: 0 urgent data pkts: 0 pkts urgent data bytes: 0 bytes mss requested: 8960 bytes max segm size: 8948 bytes min segm size: 8948 bytes avg segm size: 8947 bytes max win adv: 17920 bytes min win adv: 17920 bytes zero win adv: 0 times avg win adv: 17920 bytes initial window: 17896 bytes initial window: 2 pkts ttl stream length: NA missed data: NA truncated data: 209220494 bytes truncated packets: 23471 pkts data xmit time: 8.tcptrace -l -r (cont.8 0 pkts bytes bytes bytes bytes bytes bytes bytes times bytes bytes pkts bytes pkts secs ms Bps Dykstra. SC2004 223 .4 ms throughput: 23256786 Bps SYN/FIN pkts sent: req 1323 ws/ts: adv wind scale: req sack: sacks sent: urgent data pkts: urgent data bytes: mss requested: max segm size: min segm size: avg segm size: max win adv: min win adv: zero win adv: avg win adv: initial window: initial window: ttl stream length: missed data: truncated data: truncated packets: data xmit time: idletime max: throughput: 1/0 Y/Y 7 Y 1774 0 0 8960 0 0 0 8388480 17792 0 8044910 0 0 NA NA 0 0 0.

3 ms sdv retr time: 0 0.3 ms RTT avg (last): RTT sdv (last): 0.8 ms RTT sdv (last): segs cum acked: 12508 segs cum acked: duplicate acks: 1742 duplicate acks: triple dupacks: 4 triple dupacks: max # retrans: 1 max # retrans: min retr time: 73.9 ms ms ms ms RTT RTT RTT RTT RTT samples: min: max: avg: stdev: 1 0.7 ms min retr time: max retr time: 140.0 0. ambiguous acks: 30 ambiguous acks: RTT min (last): 72.9 ms RTT max (last): RTT avg (last): 73.0 0.2 ms avg retr time: sdv retr time: 10.1 28.2 102.0 0.tcptrace -l -r (cont.0 0.0 0.7 ms RTT min (last): RTT max (last): 74.7 144.0 0.0 ms ms ms ms ms ms ms ms 224 .9 RTT from 3WHS: RTT RTT RTT RTT RTT full_sz full_sz full_sz full_sz full_sz smpls: min: max: avg: stdev: 0.0 ms ms ms ms RTT from 3WHS: RTT RTT RTT RTT RTT full_sz full_sz full_sz full_sz full_sz smpls: min: max: avg: stdev: 72.0 0.0 0.0 0.2 102.1 28. only ACKs for multiply-transmitted segments (ambiguous ACKs) were considered. Times are taken from the last instance of a segment.0 0.0 0.0 0 0 0 0 0.0 0.0 0 ms ms ms ms ms ms ms ms post-loss acks: 4 post-loss acks: For the following 5 RTT statistics.) RTT RTT RTT RTT RTT samples: min: max: avg: stdev: 10814 72.2 ms max retr time: avg retr time: 89.1 144.1 ms 10813 72.0 ms 1 0.

TCP Startup Dykstra. SC2004 225 .

SC2004 226 .TCP Startup – Detail 1 Dykstra.

TCP Startup – Detail 2 Dykstra. SC2004 227 .

TCP Single Loss/Retransmit Dykstra. SC2004 228 .

TCP Sender Pause Dykstra. SC2004 229 .

26-web100 Example 3 4 2 Linux 2.4.4.Linux 2.26-web100 Example 1 .

Detail 1: Startup Detail 1: Start .

Detail 2: Timeout Detail 2: Timeout .

Detail 3: Single SACK Loss Detail 3: Single SACK Loss .

Detail 4: Multiple SACK Loss Detail 4: Multiple SACK Loss .

Normal TCP Scallops NLANR NCNE Dykstra. SC2004 235 .

A Little More Loss NLANR NCNE Dykstra. SC2004 236 .

Excessive Timeouts NLANR NCNE Dykstra. SC2004 237 .

Bad Window Behavior NLANR NCNE Dykstra. SC2004 238 .

Receiving Host/App Too Slow NLANR NCNE Dykstra. SC2004 239 .

tune) Dykstra. /proc/web100/1010/{read.org • Set out to make 100 Mbps TCP common • “TCP knows what’s wrong with the network” – The sender side knows the most • • • • Instruments the TCP stack for diagnostics Enhanced TCP MIB (IETF Draft) Linux 2.Web100 www. SC2004 240 .spec.g.spec-ascii.test.4 kernel patches + library and tools /proc/web100 file system e.web100.

Web100 – Connection Selection Dykstra. SC2004 241 .

SC2004 242 .Web100 .Tool/Variable Selection Dykstra.

net100.Web100 – Variable Display. SC2004 243 . Triage Chart See also www.org for more work based on Web100 Dykstra.

ANL • Try it out on damp-ssc (San Diego) Dykstra.internet2. SC2004 244 .edu/ndt/ • Java applet runs a test and uses Web100 stats to diagnose the end to end path • Developed by Richard Carlson.Network Diagnostic Tool (NDT) • http://e2epi.

NDT Example
http://miranda.ctd.anl.gov:7123/

Dykstra, SC2004

245

NDT Statistics

Dykstra, SC2004

246

Iperf with Web100, Clean Link
wcisd$ iperf-web100 -e -w400k -p56117 -c damp-wcisd -----------------------------------------------------------Client connecting to damp-wcisd, TCP port 56117 TCP window size: 800 KByte (WARNING: requested 400 KByte) -----------------------------------------------------------[ 3] local 192.168.26.200 port 33185 connected with 192.168.26.61 port 56117 [ ID] Interval Transfer Bandwidth [ 3] 0.0-10.0 sec 113 MBytes 94.1 Mbits/sec ------ Web100 Analysis -----100 Mbps FastEthernet link found Good network cable(s) found Duplex mismatch condition NOT found Link configured for Full Duplex operation Information: This link is congested with traffic Web100 reports the Round trip time = 14.0 msec; the Packet size = 1448 Bytes; and There were 1 packets retransmitted, 0 duplicate acks received, and 0 SACK blocks received This connection is network limited 99.99% of the time. Contact your local network administrator to report a network problem Web100 reports the Tweakable Settings are: RFC-1323 Time Stamping: On RFC-1323 Window Scaling Option: On RFC-2018 Selective Acknowledgment (SACK): On 247

Iperf with Web100, Lossy Link
wcisd$ iperf-web100 -e -w400k -p56117 -c damp-ssc2 -----------------------------------------------------------Client connecting to damp-ssc2, TCP port 56117 TCP window size: 800 KByte (WARNING: requested 400 KByte) -----------------------------------------------------------[ 3] local 192.168.26.200 port 33198 connected with 192.168.25.74 port 56117 [ ID] Interval Transfer Bandwidth [ 3] 0.0-10.2 sec 35.0 MBytes 28.9 Mbits/sec ------ Web100 Analysis -----Unknown link type found Good network cable(s) found Warning: Duplex mismatch condition exists: Host HD and Switch FD Information: link configured for Half Duplex operation No congestion found on this link Web100 reports the Round trip time = 2.0 msec; the Packet size = 1448 Bytes; and There were 617 packets retransmitted, 4072 duplicate acks received, and 4370 SACK blocks received The connection stalled 1 times due to packet loss The connection was idle for 0.21 seconds (2.06%) of the time This connection is network limited 99.99% of the time. Contact your local network administrator to report a network problem 248

Data Integrity

Corrupted Packets
• V. Paxson
– Claims most corruption caused by T1’s – 1 in 5000 packets corrupted – TCP checksum would then let 1 in 300M corrupted packets pass
• Roughly two errors per terabyte

Dykstra, SC2004

250

More Corrupted Packets
• J. Stone, C. Partridge, SIGCOMM 2000
– “Traces of Internet packets over the last two years show that 1 in 30,000 packets fails the TCP checksum, even on links where link-level CRCs should catch all but 1 in 4 billion errors.” – DMA transfers account for many of these errors – Conclude that 1 in 200M TCP packets pass with undetected errors
Dykstra, SC2004 251

SC2004 252 .Data Integrity • How do you know that your transferred copy is correct? • A simple approach would be copy it two (or more) times and compare the results • But you would like to check it with something smaller than the entire dataset Dykstra.

sorting. SC2004 253 . CRC’s. – Cryptographic hashes are good for data integrity checks Dykstra. Hashes • All map large blocks of data into smaller keys • Checksums are simple algebraic sums • Cyclic Redundancy Checks (CRC’s) are good at detecting multiple bit errors • Hashes are good at randomizing the key space – Good for database lookups.Checksums. etc.

SC2004 254 .Simple Checksums • Can not detect several kinds of errors – Reordering of bytes – Inserting or deleting zero valued bytes – Multiple errors that cancel (same sum) Dykstra.

includes a 12 byte “pseudo header” from the IP layer with the network addresses and payload length • RFC 1071 – Computing the Internet Checksum Dykstra.The Internet Checksum • 16 bits long. summed 16 bits at a time • For UDP/TCP. SC2004 255 .

the TTL changes • IPv6 has none. not even over the header – Assumes layer 2 will protect it Dykstra. since e.g. SC2004 256 .IP Data Protection • IPv4 uses the 16-bit Internet Checksum over the header information only – This checksum is recalculated/rewritten hop by hop.

SC2004 257 .IPv4 and IPv6 Headers Vers 4 IHL Type of Service Total Length Flags Frag Offset Vers 6 Traffic Class Flow Label Next Hdr Hop Limit Identification Time to Live Payload Length Protocol Header Checksum Source Address Source Address Destination Address IP Options v4 Header = 20 Bytes + Options v6 Header = 40 Bytes Destination Address Dykstra.

UDP Data Protection • 16-bit Internet Checksum covers the UDP header and payload • Optional for IPv4 UDP • Required for IPv6 UDP Dykstra. SC2004 258 .

SC2004 259 .TCP Data Protection • 16-bit Internet Checksum over the entire segment • Roughly every 65536th corrupted segment goes undetected Dykstra.

e.The Danger Of Offloading • Many NICs today support TCP Checksum offloading • Studies have shown many errors occur during DMA transfers – Offloading would miss these • Nothing beats reading the data back from disk.g. SC2004 260 . md5sum Dykstra.

but strength is non-uniform • http://www.ohio-state.cse.Ethernet Data Protection • IEEE 802 CRC32 • Strong error detection for 1500 byte packets • Detects all triple bit errors up to 11455 bytes • Roughly 1 in 4 billion errors go undetected. SC2004 261 .edu/%7Ejain/papers/xie1.htm Dykstra.

nearly impossible to invert • Collision free – Infeasible to find two inputs that result in the same output • The workhorses of modern cryptography Dykstra.Cryptographic Hashes • One-way functions – Easy to compute. SC2004 262 .

SHA-224 2010. NIST no longer endorses SHA-1 Dykstra. Ron Rivest 1992. NSA (a. SHA-512 2004. SHA-0) 1995. SHA-1 2002.Cryptographic Hash History • • • • • • • 1990. SC2004 263 . SHA-256.k. SHA-384. SHA.a. MD5 1993. MD4.

Common Hashes • MD5 – Message Digest algorithm #5 – 128 bit hash. SC2004 264 . RFC1321 (April 1992) – md5sum --check md5sums • SHA-1 – Secure Hash Algorithm #1 – 160 bit hash Dykstra.

SHA-384.Stronger Hashes • Recent work has shown some weakness in MD5 and SHA-0 (Crypto2004 conference) • NIST plans to phase out SHA-1 by 2010 • NIST FIPS 180-2. Aug 2002. SC2004 265 . SHA-512 – http://csrc. Secure Hash Standard. also includes: – SHA-224.gov/CryptoToolkit/tkhash.nist.html Dykstra. SHA-256.

Feb 1997.71 Dykstra. ANS X9. HMAC-RIPEMD • RFC2104. SC2004 266 . HMAC-MD5.HMAC • Keyed-Hash Message Authentication Codes • Combines a secret key and a cryptographic hash • HMAC-SHA1.

RFC 3566. 128 bit data blocks • Can be used as a Message Authentication Code (MAC) – AES-XCBC-MAC-96. 256 bit keys. FIPS-197 • 128. Sep 2003 • scp defaults to aes128-cbc and hmac-md5 Dykstra.Advanced Encryption Standard (AES) • AES (Rijndael) data encryption/decryption • Federal Information Processing Standards Publication. SC2004 267 . 192.

21k 78330.11k 65752.31k 51452.66k 43858.20k 43846.09k 19268.05k 51292.74k 313778.02k 58603.9.09k 84785.74k 90881.44k 52584.97k 31237.68k 25317.68k 24957. 2.24k 78255. SC2004 268 .10k 96411.88k 90386.70k 173880.07k 1024 bytes 8827.93k 17158.59k 25950.78k 51385.89k 56045.32k 246730.82k 95925.71k 90675.99k Dykstra.85k 7677.7d.23k 16325.51k 240482.31k 54524.90k 58446.82k 24287.22k 52542.37k 55336.82k 391984.70k 24765. 400 MHz FSB.38k 57739.65k 70686.61k 7717.90k 63363.21k 91339.89k 93663.85k 84407.Encryption/Hash Speeds “openssl speed” OpenSSL 0. 512KB L2 cache type md2 mdc2 md4 md5 hmac(md5) sha1 rmd160 rc4 des cbc des ede3 idea cbc rc2 cbc rc5-32/12 cbc blowfish cbc cast cbc aes-128 cbc aes-192 cbc aes-256 cbc 16 bytes 2335.81k 17168.04k 62984.31k 51985.17k 50893.38k 252513.77k 65180.26k 25599.00k 43868.36k 64 bytes 5298.85k 25439.08k 51747.06k 17186.35k 89333.86k 65725.43k 311775.38k 57590.43k 64708.6 GHz Xeon.12k 7374.95k 126032.14k 199600.76k 93179.50k 17128.74k 96438.92k 303676.02k 57441.48k 25818.52k 17483.14k 51628.15k 15804.38k 78479.38k 6480.37k 17191.13k 143885.96k 256 bytes 7781.69k 43456.54k 43784.63k 25979.90k 77552.97k 8192 bytes 9161.67k 7790.13k 77119.43k 48512.65k 25301.24k 151600.80k 11629.

loopback interface.6 GHz Xeon – Disk read. 0:49 sec. 455 Mbps • scp localhost:/tmp/file /dev/null – AES-128 (default). 153 Mbps Dykstra. 121 Mbps – 3DES. 2. no disk write – both encrypt and decrypt for scp • wget –O /dev/null ftp://localhost/tmp/file – 0:13 sec. SC2004 269 .Local Transfer Examples • 744 MB file. 1:53. 0:39. 53 Mbps – Blowfish.

15 seconds – wget with large window modification Dykstra. 4 minutes – from vsftpd 1.Remote Transfer Examples • Aberdeen. 744 MB • scp 14 minutes – compared to 49 seconds on localhost • wget v1. MD to San Diego. CA.1.2.9. SC2004 270 .2 server at Aberdeen • wget –passive-ftp.

ECC. FEC Dykstra. SC2004 271 .Summary: Layers of Protection • Application: Hashes • Transport: TCP/UDP checksum – Optional: SSL/TLS • Network: IP header checksum – Optional: IPsec • Link: Ethernet CRC32 or POS FCS-32 • Physical: symbols.

TCP Security .

nop. src port.hpc.817322 IP a.1128: S 121297587:121297587(0) ack 3993350671 win 5840 <mss 1460.mil.mil. SC2004 273 .nop.mil.hpc.nop.http: S 3993350670:3993350670(0) win 16384 <mss 1460.hpc.1128 > b.hpc. dst addr.mil. dst port) • 32 bit sequence numbers (byte counts) – One for each direction – Random starting value for each connection 14:03:49.http > a.sackOK> Dykstra.TCP Security – Sequence Numbers • Connections identified by – (src addr.nop.817340 IP b.sackOK> 14:03:49.

SC2004 274 .TCP Security Proposals • Ephemeral Port Randomization – 16 bit port space subset • SYN Cookies • Timestamps as Cookies • Tighter sequence number checking Dykstra.

g. can reset a connection in about a minute • Proposal to require an exact match. i.Guessing A Valid Seq Number • Today we only need one within the window – 2^32 / window_size – E.e. the next number Dykstra. SC2004 275 . 32 KB window = 128 k guesses – Even at DSL rates.

send a RST • Blind data injection – Guessing a valid sequence number and ACK lets us inject bogus data segments • draft-ietf-tcpm-tcpsecure-01. close our connection • Blind reset attack. SC2004 276 .txt Dykstra.TCP Security • Blind reset attack. RST bit – If we receive a RST within the window. SYN bit – If we receive a SYN within the window of an open connection.

TCP Security • Blind ACK DoS attack – ACK segments not yet received – Client can cause server to saturate it’s connection by sending too fast – draft-azcorra-tcpm-tcp-blind-ack-dos-01. SC2004 277 . to open senders window faster – Addressed by “Appropriate Byte Counting” Dykstra.txt • Related: ACK every byte.

SC2004 278 . RFC2385 – Manually configured shared keys – Often used between BGP peers – RFC3562 provides guidance on keys • IPsec – Layer 3 solution • SSL / TLS – Layer 5 “solution” Dykstra.TCP Security • TCP MD5 Signature Option.

IPsec History • IPsec work began in the early 1990s • Three revisions – 1995: RFC 1825-1829 . SC2004 279 .IKE v2 key management • At least 20 related RFC’s Dykstra.no key management – 1998: RFC 2401-2412 .IKE v1 key management – Current: RFC TBD .

SC2004 280 .IPsec Security Services • Authentication Header (AH) – – – – Authentication of data source Integrity of message content and ordering Replay Detection Access Control (i.e. packet filtering) • Encapsulating Security Payload (ESP) – Above plus encryption of packet data – Limited traffic flow confidentiality Dykstra.

IPsec Transport and Tunnel Modes • Transport Mode – AH and/or ESP is placed between IP header and next higher level header • IP-TCP -> IP-AH-TCP • IP-TCP -> IP-ESP-TCP – Can only be used host-to-host • Tunnel Mode – A new IP header is created • IP-UDP -> IP’-AH-IP-UDP • IP-UDP -> IP’-ESP-IP-UDP – Common way to implement VPNs Dykstra. SC2004 281 .

IPsec: Host-to-Host Modes IP ESP TCP Transport Mode Host A IP’ ESP IP TCP Host B Tunnel Mode Ken Renard. WCI 282 .

Host-to-Gateway Tunnel Mode Host A IP TCP • Security Gateway performs IPsec services on behalf of Host A • Offloads IPsec processing and policy enforcement from Host A • Road-warrior VPN ESP Tunnel Security Gateway IP’ ESP IP TCP Host B Ken Renard. WCI 283 .

WCI 284 .Gateway-to-Gateway Tunnel Mode Network A • Security Gateways perform IPsec services on behalf of Network A & B • Offloads IPsec processing and policy enforcement from hosts • Inter-site VPN Network B IP TCP ESP Tunnel IP TCP Security Gateway IP’ ESP IP TCP Security Gateway Ken Renard.

IPsec Support • Native support for IPsec is required for IPv6 – That’s where “IPv6 is secure” comes from – But many vendors don’t support it yet • For Linux. see – www.strongswan.org – www.openswan.linux-ipv6.org – www. SC2004 285 .org Dykstra.

Going Faster Cheating Today. Improving TCP Tomorrow .

org/datagrid/gridftp.nj.Parallel TCP Streams • PSockets (Parallel Sockets library) – http://citeseer.globus.fr/bbftp/ • MulTCP – a TCP that acts like N TCP’s – UCL Dykstra.html • bbFTP – parallel ‘ftp’ uses ssh or GSI – http://doc.html • GridFTP – enhanced parallel wu-ftpd – http://www. SC2004 287 .com/386275.in2p3.nec.

Parallel TCP Stream Performance Les Cottrell. SLAC 288 .

Parallel TCP Streams Throughput and RTT by Window Size Les Cottrell. SLAC 289 .

RFC 2018. July 2000 RTO Calculation. Oct 2002 SACK loss recovery. RFC 1191. RFC 793. RFC 2883. Nov 2000 ECN. RFC 1323. Apr 2003 Dykstra. RFC 2988. Oct 1996 NewReno. RFC 3042. RFC 3168. SC2004 290 . RFC 3390. 1990 Path MTU Discovery. Jan 2001 Increased Initial Windows. Nov 1990 Window Scale. BSD. Sep 2001 Limited Transmit. RFC 3517. Sep 1981 Reno. May 1992 SACK. PAWS.TCP Keeps Evolving • • • • • • • • • • • • TCP. April 1999 D-SACK. RFC 2581.

RFC3782.TCP Reno • Most modern TCP’s are “Reno” based. from the 1990 BSD Reno Unix release • Reno defined (refined) four key mechanisms – – – – Slow Start Congestion Avoidance Fast Retransmit Fast Recovery • NewReno refined fast retransmit/recovery when non-SACK partial acknowledgements are available – Proposed Standard. Apr 2004 Dykstra. SC2004 291 .

SC2004 292 . Vegas. FAST. … • Pacing .removing burstiness by spreading the packets over a round trip time – Vegas. BLUE • Autotuning windows and buffer space • Modifications to prevent “cheating” • Kick-starting TCP after timeouts Dykstra.The Future of TCP/IP • Different retransmit/recovery schemes – TCP Tahoe. Westwood. Peach.

4380 bytes)) • RFC 3390.Increased Initial Windows • Allows ~4KB initial window rather than one or two segments – min(4*MSS. Oct 2002. SC2004 293 . Proposed Standard Dykstra. max(2*MSS.

SC2004 294 . receiver ACKs every byte • Increases by at most 2*MSS bytes per ACK – To avoid bursts when one ACK covers a huge number of bytes • RFC 3465.g. Feb 2003 Dykstra. increase cwin based on the number of new bytes ACK’d • Prevents receiver from “cheating” and making the sender open cwin too quickly – e.Appropriate Byte Counting • When an ACK is received.

Limited Slow-Start • In slow-start.5*max_ssthresh/cwin) * MSS • RFC 3742. SC2004 295 . this doubling can cause massive packet loss (and network load) • Limited slow-start adds max_ssthresh (proposed value of 100 MSS) • Above max_ssthresh cwin opens slower. the congestions window (cwin) doubles each round trip time • For large cwins. never bursts more than 100 MSS cwin += (0. Apr 2004 Dykstra.

SC2004 296 .Limit Slow-Start Example Tom Dunigan. ORNL Dykstra.

Quick-Start draft-amit-quick-start-01. SC2004 297 .txt • IP option in the TCP SYN specifies desired initial sending rate – Routers on the path decrement a TTL counter and decrease initial sending rate if necessary • If all routers participated. receiver tells the sender the initial rate in the SYN+ACK pkt • The sender can set cwin based on the rtt of the SYN and SYN+ACK packets Dykstra.

SC2004 298 .Experimental TCP Modifications • TCP Westwood – Rate Estimation from ACK stream (cwnd = RE x RTTmin) • HighSpeed TCP (HSTCP) – Scaled AIMD parameters • Scalable TCP (STCP) • Binary Increase Congestion Control (BIC/CUBIC) • TCP Vegas / FAST TCP – Queuing delay based congestion control Dykstra.

org/floyd/hstcp.icir.html • Changes the AIMD parameters • Identical to standard TCP for loss rates above 10-3 for fairness (cwin <= 38) • Allows cwin to reach 83000 segments for 10-7 loss rates – Good for 10 Gbps over 100 msec rtt – Std TCP would be limited to ~440 Mbps • RFC 3649. Dec 03 Dykstra.HighSpeed TCP www. SC2004 299 .

SC2004 300 . ICSI Dykstra.HighSpeed TCP: response Sally Floyd.

HighSpeed TCP: fairness Sally Floyd. ICSI Dykstra. SC2004 301 .

Standard vs. SC2004 302 . ORNL Dykstra. HighSpeed TCP Tom Dunigan.

SC2004 303 .Cwin Opening Comparison Tom Dunigan. ORNL Dykstra.

uk/~ctk21/scalable/ Dykstra.eng. SC2004 304 .cam.ac.Scalable TCP (STCP) Tom Kelly • Loss recovery time is independent of sending rate • http://www-lce.

STCP Reno cwnd • Delay based congestion control – “multi-bit” feedback vs. Vegas D: DUAL R: Reno. binary loss signal – Avoids oscillations of cwnd and unnecessary loss • http://netlab.caltech.FAST TCP loss Queue Delay C F D R Window TCP Operating Points C: CARD F: FAST.edu/FAST/ Dykstra. SC2004 305 . HSTCP.

UDP Transfer Protocols .

SC2004 307 . rtt=150msec (DC to Maui) – Short flow throughput is determined by slow start • Loss != congestion • Large TCP queues increase latency • End users never tune their systems Dykstra. mtu=9000. mtu=1500. rtt=72msec (DC to San Diego) • 15 minutes.Why Not TCP? • Slow adaptation after loss – 0 to 500 Mbps takes • 36 seconds.

etc. 1985 • Block transfer protocol (not streaming) – Send a block at predetermined rate – Wait for lost packet list – Resend those. SC2004 308 . Dykstra.NETBLT • RFC969.

evl.uic.RBUDP • Reliable Blast UDP • http://www.edu/cavern/quanta • Similar to NETBLT but in active development Dykstra. SC2004 309 .

Dec 2002 • UDP data.edu/anmlresearch.html • Last release.anml.iu. Sep 2002 • Was created because TCP over that path was getting only 10’s to 100’s of Mbps! Dykstra. SC2004 310 . TCP control.Tsunami • http://www. rate adaptive – Loss rate controls sending rate • File transfer protocol. no API • Transferred 1 TB of data at ~1 Gbps over a 12000 km “light path” (Vancouver to Geneva).

window + AIMD rate control (not rtt dependent) • Includes FTP like application.SABUL • Simple Available Bandwidth Utilization Library • National Center for Data Mining (NCDM) at UIC (University of Illinois at Chicago) • http://www.htm • 2000-2003.net/sabul. now in third generation • UDP data. TCP control.dataspaceweb. SC2004 311 . rate adaptive • Streaming protocol. API Dykstra.

net/projects/dataspace UDP for data and control Grew out of SABUL work. 2003+ Dykstra. SC2004 312 .UDT • • • • UDP-based Data Transport http://sourceforge.

11 Hdr 128-bit IV Data Decapsulate Seq Num 128-bit MIC Jesse Walker.11 wireless? 802. encryption – Learn something from 802. corruption.UDP Protocol Security! • Session hijacking. Intel 313 .11 Hdr Data Encapsulate 802.

SC2004 314 .TCP Revisited • “Anything you can do I can do too” – It’s just algorithms (e. easier experimentation and deployment! – But user space is not as efficient Dykstra. congestion control) • The real UDP advantage: – User space implementations.e.g. i.

RFC3286 http://www.org/ Dykstra.sctp. TCP’s byte stream Multi-streaming Multi-homing Cookie based DoS protection RFC2960. SC2004 315 .SCTP • Proposed Standard transport protocol – Peer of TCP and UDP • • • • • • Message oriented vs. RFC3309.

SC2004 316 .SCTP • Allows out of order delivery – Each message can be marked as ordered or not – Avoids TCP’s head-of-line blocking • CRC32c data protection – Changed from Adler-32 checksum Dykstra.

Remote Direct Data Placement • RDDP.k. Remote DMA • Something SCTP is better at than TCP – Easy for the sender (“zero copy”) – Hard for the receiver • Marker PDU Aligned framing for TCP (MPA) Dykstra. a.a. SC2004 317 .

Is TCP The Right Choice? • TCP assures 100% in-order delivery • If your app can handle out-of-order delivery. use something that isn’t 100% reliable – TCP stalls are sometimes worse than a loss Dykstra. try SCTP • If you app can handle some loss. SC2004 318 .

Quality of Service (QoS) .

What is QoS? • Quality of Service “The collective effect of service performances which determines the degree of satisfaction of the user” (ITU) • A user perception of how good the service is Dykstra. SC2004 320 .

VPN’s. can cause burst losses • Circuit emulation or ATM QoS over MPLS – Preservation of expected behavior • Critical services – DoS protection Dykstra. Real-time. • TCP builds large queues – Increases delay. Voice. etc.Why Improve QoS? • Performance Guarantees – SLA’s. SC2004 321 .

traffic engineering (MPLS) • Discriminate flows into service classes – Can be signaled/reserved. or marked – All members of a class get the same type of service Dykstra. SC2004 322 .How Do You Improve QoS? • Improve the entire network (egalitarian) – larger pipes. congestion control (RED. ECN).

VBR. 2.Classes of Service (CoS) • Signaled – ATM • CBR. Assured 1. SC2004 323 . 4 Dykstra. ABR. UBR – IP Integrated Services (IntServ) • Best-effort (BE). Controlled-load (CLS). 3. Guaranteed (GS) • Marked – IP Differentiated Services (DiffServ) • Expedited.

SC2004 324 .Integrated Services (IntServ) • IETF working group since 1993 • Resource Reservation – RSVP is the most common choice • • • • Packet Classification (multi field) Admission Control (capacity and/or policy) Packet Scheduling QoS based routing Dykstra.

SC2004 325 .Resource Reservation Protocol (RSVP) • “Soft State” – Typical: PATH messages every 30 seconds. but removed from XP in favor of DiffServ marking • NSIS may be the next generation of RSVP Dykstra. soft-state timeout in 90 seconds • RSVP only blocks when congested – ATM often blocks allocations even if unused • RSVP is in Windows 2000.

Multi Field Classification • Every RSVP router classifies every packet by inspecting up to seven fields – dest addr. SC2004 326 . src port • IPv4: ToS (8 bits) • IPv6: Traffic Class (8 bits). src addr. proto. dest port. Flow label (20 bits) • DiffServ reduces this burden by marking packets • MPLS can help! Dykstra.

DiffServ Code Points map to Per Hop Behaviors (PHB) • Cisco Committed Access Rate (CAR) was an early implementation of DiffServ Dykstra. policing. SC2004 327 . signaling to the edges (hosts or edge routers) • In the core. shaping.Differentiated Services (DiffServ) • IETF working group since 1998 • Keeps the core stateless – Pushes any/all marking.

each with three drop priorities) Dykstra.DiffServ Classes (PHB Groups) • Expedited Forwarding (EF) – Emulates virtual leased line – Absolute assurance. admission controlled – One PHB: Priority or Class Based Queue • Assured Forwarding (AF) – Emulates unloaded network – Statistical assurance – 12 PHB’s (4 classes/queues. SC2004 328 .

IPv4 and IPv6 CoS fields Vers 4 IHL Type of Service Total Length Flags Frag Offset Vers 6 Traffic Class Flow Label Next Hdr Hop Limit Identification Time to Live Payload Length Protocol Header Checksum Source Address Source Address Destination Address IP Options Destination Address Dykstra. SC2004 329 .

SC2004 330 . RFC1349 Dykstra.IP TOS Field (8 bits) Precedence 3 bits D T R M 0 • • • • • • D = Minimize Delay T = Maximize Throughput R = Maximize Reliability M = Minimize Monetary Cost (RFC1349) Maximum Security = 1111 (RFC1349) RFC791. RFC795.

Priority 2 .Internetwork Control 7 .IP TOS Precedence 3 bit precedence in the TOS byte: 0 .Flash Override 5 . SC2004 331 .Routine 1 .Immediate 3 .Flash 4 .Network Control Dykstra.CRITIC/ECP 6 .

DiffServ Code Points CS0 CS1 CS2 CS3 CS4 CS5 CS6 CS7 AF11 AF12 AF13 AF21 AF22 AF23 AF31 AF32 AF33 AF41 AF42 AF43 EF 000000 001000 010000 011000 100000 101000 110000 111000 001010 001100 001110 010010 010100 010110 011010 011100 011110 100010 100100 100110 101110 [RFC2474] [RFC2474] [RFC2474] [RFC2474] [RFC2474] [RFC2474] [RFC2474] [RFC2474] [RFC2597] [RFC2597] [RFC2597] [RFC2597] [RFC2597] [RFC2597] [RFC2597] [RFC2597] [RFC2597] [RFC2597] [RFC2597] [RFC2597] [RFC3246] • Six bit field. SC2004 332 . goes in the IPv4 TOS octet or the IPv6 Traffic Class • Last two bits are being used by ECN Dykstra.

Comparison to ATM: Policing • ATM uses a leaky bucket to determine and enforce conformance – Smoothes traffic • IntServ and DiffServ use a token bucket – Preserves packet spacing and bursts Dykstra. SC2004 333 .

classification. and tagging capability Dykstra.Routing vs. Switching • Does my switch have QoS controls? – Some QoS in 802. SC2004 334 .1p(D)/Q • 3 bit User Priority in VLAN tag • Simple priority queuing (possible starvation) – IETF defined Subnet Bandwidth Manager (SBM) • Routers have better queue management.

1Q User Priority? – MPLS label or Exp field (in shim header)? – Separate port? Dykstra. and can hide DiffServ marks • Need a way to identify service classes to the WAN provider – 802.Classes of Encrypted Traffic? • Encryption can make multi field inspection impossible. SC2004 335 .

QoS Recommendations • • • • Implement DiffServ where needed Utilize MPLS+TE where available Reduce TCP queues/bursts (pacing) Better Active Queue Management – Have routers penalize “bad” flows Dykstra. SC2004 336 .

QoS/CoS Summary • QoS is a user perception of performance • CoS discriminates flows into service types – CoS can be signaled (IntServ/RSVP) – CoS can be marked (DiffServ) • Current research could improve QoS even without CoS Dykstra. SC2004 337 .

Storage Area Networks SANs and IP Storage .

Storage Area Network (SAN) • Dedicated network for accessing Network Attached Storage (NAS) • Usually Fibre Channel – Fibre Channel Protocol (FCP) = SCSI Commands over FC • IP Storage is coming Dykstra. SC2004 339 .

Fibre Channel • Began in 1988 to improve HIPPI connections – First ANSI standard in 1994 – 1 Gbps (100 MBps) standard in 1996 • Runs on both fiber and copper • Ring or mesh (switch) topology • FC disks have dual / redundant connections Dykstra. SC2004 340 .

SC2004 341 . Hunt groups.Fibre Channel Architecture FCP Striping. but not backward compatible Dykstra. Datagrams 8B/10B encoding 12 25 50 100 MBps FC100 = GigE physical layer • 2 Gbps 2001. 4 Gbps 2004 • 10 Gbps available. Multicast Connections.

July 2004 Dykstra.FCIP • Fibre Channel over TCP/IP – Encapsulates FC frames in TCP/IP • Provides Inter-Switch Links (ISLs) • Allows multiple FC SAN’s to be connected via the internet • RFC 3821. SC2004 342 .

org/html.iFCP • Internet Fiber Channel Protocol • FC-4 layer interface with a TCP/IP sublayer – Replaces the FC SAN with an IP network • Gateways interface between FC and IP devices • Emulates fabric services • Work in progress – http://www.html Dykstra.ietf. SC2004 343 .charters/ips-charter.

Apr 2004 Dykstra.iSCSI • Internet Small Computer Systems Interface – Serialized SCSI over TCP – Initiator to target (tcp port 3260) • Removes 25 meter distance limit • Includes initiator/target authentication via iSCSI Login PDUs – Challenge Response Authentication Protocol (CHAP) – Secure Remote Password (SRP) • Uses CRC32c (but optional!) • RFC 3720. SC2004 344 .

charters/ips-charter.iSNS • Internet Storage Name Server • Discovery and management of FCP and iSCSI storage devices • Required by iFCP. SC2004 345 .html Dykstra. optional in iSCSI • Work in progress – http://www.org/html.ietf.

SC2004 346 .org Dykstra.Remote Direct Memory Access (RDMA) • Zero-copy end-to-end data movement • Available in InfiniBand • RDMA Consortium defined it for TCP/IP via iWarp – www.rdmaconsortium.

iWarp • Collective name given to three protocols – MPA – Marker PDU Alignment for TCP – DDP – Direct Data Placement – RDMAP – RDMA Protocol Dykstra. SC2004 347 .

SC2004 348 .Enhanced Network Interface Cards iSCSI iSER RDMAP DDP MPA TCP IP NIC TOE NIC RNIC iSCSI NIC TCP Interface RDMA Interface SCSI Interface Dykstra.

IP SAN Security • FC SANs provide little to no security • FCIP. iSCSI all depend on IPsec for perpacket – – – – Authentication Confidentiality Integrity Replay protection • RFC 3723. iFCP. Apr 2004 Dykstra. Securing Block Storage Protocols over IP. SC2004 349 .

SC2004 350 .SANs in Perspective • Came out in the 10/100 Mbps Ethernet era – IP networks simply weren’t fast enough. nor did they have QoS controls • Held on through the introduction of 1Gbps Ethernet – Claimed to still be faster and cheaper • Today: Can no longer compete with 1 GigE and soon 10 GigE pricing • IP attached iSCSI storage is coming Dykstra.

Peer to Peer (P2P) File Transfer .

SC2004 352 .P2P File Sharing Networks • FastTrack – Closed source network – Morpheus. Grokster. eMule • Direct Connect – started Nov 1999. KaZaa. now over 12 PB of data! – i2Hub Dykstra. Limewire • Gnutella (G1) – Open source FastTrack like implementation • eDonkey2000 – eDonkey.

Gnutella (G1) • Open source FastTrack like system • No central database (unlike Napster) • Flooding. order N lookups Dykstra. SC2004 353 .

eDonkey 2000 (ed2k) • Did for movies what Napster did for music • One of the first to allow partial file uploads • Many clients. SC2004 354 . including eMule Dykstra.

SC2004 355 .neo-modus.Direct Connect • 12 PB of data on 250k hosts (6 Oct 04) • http://www.com/ Dykstra.

com • Direct Connect P2P network on Internet2 • “faster because it uses Internet2” • Reality check: – “at UCLA. SC2004 356 .i2Hub • www.i2hub. i2hub downloads run 30 kBps to 200 kBps. vs. 600 kBps for a good server” Dykstra.

org/BitTorrent/ • http://btfaq. SC2004 357 .com/ Dykstra.BitTorrent • http://bitconjurer.

torrent How peers find files Tracker Host How peers find each other Peer Peer Peer Peer Peer Data Transfer Dykstra.BitTorrent Web Server file. SC2004 358 .

BitTorrent Terminology • Peer – a client with a file – “Seed” if complete file – “Leach” if partial file • Swarm – the collection of all peers with parts or all of a given file • “Swarming” – uploading partial files Dykstra. SC2004 359 .

BitTorrent – How it Works • .torrent file – Tracker location(s) – 256 KB file pieces. SHA1 hash of each piece • Downloads – 16 KB sub-piece transfers – 5 pipelined requests (~80 KB window) – 4 active peers. SC2004 360 . reselected every 10 seconds • New opportunistic unchoke every 30 seconds Dykstra.

Local LAN – 14 Mbps + 14 Mbps = 30 Mbps • One seed.4 Mbps Dykstra. SC2004 361 . One leach.BitTorrent – Reality Check • One seed. CA to MD – 3. One leach. Local LAN – 14 Mbps to 40 Mbps • Two seeds. One leach.

SC2004 362 .mod_bt – Integrated Web/BitTorrent • Combines tracker and seed functions in an Apache module • Worse case. downloads like a web server • BitTorrent function can only help • Not quite ready for prime time Dykstra.

net/ Dykstra. SC2004 363 .sourceforge. more efficient than BitTorrent for large numbers of clients • http://konspire. clients subscribe Does not require balanced upload/download In theory.konspire[2b] • • • • Similar to BitTorrent Distribution tree.

web caches. • O(log N) vs.Distributed Hash Tables (DHT) • Hot topic since 2001 – CAN. SC2004 364 . O(N) lookups • Provable properties • Many research projects in DHT’s – FreeNet. etc. Chord. Pastry. Tapestry – Could be used to replace DNS. OceanStore – Automated P2P backup projects • Pastiche and PAST built on Pastry Dykstra.

Newer P2P Networks • Gnutella2 (G2) – DHT network from Shareaza • Overnet – DHT network from eDonkey2000 • NEONet – DHT network from Morpheus Dykstra. SC2004 365 .

SC2004 366 .Abstract Storage Layers • • • • • Internet caches DataSpace Transfer Protocol (DSTP) I2 Logistical Networking Decouple LAN/WAN tuning issues Should storage be a network resource? Dykstra.

Internet Web Caches • IRCache project 1995-2000 • Internet Cache Protocol (ICP) – RFC2186. Sep 1997 – Used in Squid and several web cache products – http://icp.net/ • Hyper Text Caching Protocol (HTCP) – RFC2756. RFC2187. SC2004 367 .ircache. Jan 2000 Dykstra.

edu/ • Internet Backplane Protocol (IBP) • Logistical Backbone (L-Bone) – Directory of IBP depots • exNodes – Collection of IBP allocations • Logistical Runtime System (LoRS) – Upload and Download services Dykstra.utk.Logistical Networking • http://loci.cs. SC2004 368 .

limited time storage • Need capabilities to access data – Crypto secure URL’s: read/write/manage • Supports Data Mover plugins • 4 GB allocation limit? Dykstra. SC2004 369 .IBP • Implements append-only byte arrays • Often anonymous creation.

IBP Commands • Storage Management – IBP_allocate. IBP_copy. IBP_manage • Data Transfer – IBP_store. IBP_mcopy • Depot Management – IBP_status Dykstra. SC2004 370 . IBP_load.

Dykstra. SC2004 371 . etc.exNodes • Holds information and capabilities about storage on one or more IBP depots – think inode for L-Bone storage • XML representation – Can be emailed.

91% free – 6 Oct 2004 Dykstra.L-Bone 487 public IBP servers in 30 countries and 35 American states. 29 TB online. SC2004 372 .

SC2004 373 .Logistical Run Time System • LoRS tools for using the L-Bone • Finds allocations. and java versions Dykstra. creates exNodes. visual. • Command line. etc.

html Dykstra. SC2004 374 .cs.utk.sinrg.edu/ibpvo/about.Example Application: IBPvo http://promise.

routing. etc.PlanetLab • Think world-wide shared Linux cluster • Over 100 universities doing research in overlay networks. SC2004 375 . • Over half of the L-Bone servers are PlanetLab hosts • Internet2 has PlanetLab hosts on the backbone Dykstra. P2P. performance. security.

packet size • Importance of window and buffer sizes and how to tune them Dykstra. loss. speed • How TCP throughput depends on delay.Review • High performance comes from all levels – They complement each other • Network capacity vs. SC2004 376 .

Review • How to test and debug network performance • Protect your data. the network isn’t perfect • TCP continues to improve. notably in congestion control and security modifications • Alternatives exist: SCTP and UDP transport experiments • Quality of Service: class based and egalitarian • Storage Area Networks and IP storage Dykstra. SC2004 377 .

Review • Peer to Peer is where the file transfer action is – But most P2P assumes poor network performance – High performance BitTorrent hasn’t been written yet • There is a lot of work today in DHT and abstract storage layers • Is storage a future network function? Dykstra. SC2004 378 .

psc.edu/networking/projects/tcptune/ • Tom Dunigan’s Network Performance Links – http://www.org/tools/ Dykstra.Recommended Resources • System tuning details – http://www.gov/~dunigan/netperf/netlinks.slac.stanford. SC2004 379 .csm.html • SLAC’s Network Monitoring pages – http://www-iepm.edu/ • CAIDA Internet Measurement Tool Taxonomy – http://www.caida.ornl.

org/ – http://www.nrl.nlanr.navy.net/Projects/Iperf/ • Testrig for TCP traces – http://ncne.csm.web100.net/research/tcp/testrig/ • Web100 / Net100 – http://www.Recommended Resources • Nuttcp and Iperf for TCP and UDP testing – ftp://ftp.ornl.gov/~dunigan/net100/ Dykstra.mil/pub/nuttcp/ – http://dast.lcp.nlanr. SC2004 380 .

org/ – ftp://ftp.Resources • Request For Comments (RFC) – http://www.html Dykstra.charters/wg-dir.rfc-editor.ietf.org/ – http://www.ietf. SC2004 381 .org/in-notes/rfc968.rfc-editor.org/html.txt • Internet Engineering Task Force (IETF) – http://www.

com/start/ Dykstra.W. 3 volumes • Unix Network Programming. Richard Stevens’ Books • TCP/IP Illustrated. 2 volumes • http://www. SC2004 382 .kohala.

SC2004 383 .Resources • Stream Control Transmission Protocol (STCP) • Randall R. Qiaobing Xie • November. Stewart. 2001 • ISBN: 0-201-72186-4 Dykstra.

2003 • ISBN: 0130646342 Dykstra. Jain • October.Resources • High Performance TCP/IP Networking • M. Hassan. SC2004 384 . R.

2109 Mergho Impasse San Diego. CA 92110 phil@sd.Thank You! Phillip Dykstra WareOnEarth Communications Inc.com 619-574-7796 .wareonearth.