David V. Schuehler and John W. Lockwood
Washington University in St. Louis

High-speed network switches currently operate at OC-48 (2.5 gigabits per second) line rates, and faster OC-192 (10 Gbps) and OC-768 (40 Gbps) networks are on the horizon. At the same time, network traffic continues to increase.1 Studies have found that more than 85% of the packets traveling on the Internet are based on the Transmission Control Protocol/Internet Protocol (TCP/IP).2,3 The latest network processing systems require scanning and processing of data in both the headers and payloads of TCP/IP packets. To scan payloads at high rates, these systems need new methods of processing TCP/IP data in hardware. A hardware implementation of a full TCP/IP protocol stack acting as a communication end point would be useful. Unfortunately, several problems make the full implementation of a TCP/IP stack in hardware impractical; these include the need for many TCP timers, the need for large memories for reassembly buffers, and the need to support many connections.

At Washington University's Applied Research Laboratory, we have developed a TCP-flow-monitoring circuit that provides client application systems with an ordered TCP data stream. Instead of acting as a connection end point for a few TCP connections, this circuit, called TCP Splitter, monitors all TCP flows passing through the network hardware. This technique has many advantages over implementing a TCP end point. For reliable delivery of a data set to a client application, a TCP connection only needs to transit the device monitoring the data. The TCP end points, not the logic on the network hardware, manage the work of guaranteeing delivery. Because the retransmission logic remains at the connection end points, not in the active network switch, the lightweight monitor does not require a complex protocol stack. (The "Related work" sidebar summarizes other TCP/IP-monitoring research.)
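To make the idea concrete, the monitoring technique can be modeled in a few lines of software. This is only an illustrative sketch, not the hardware design; the class and method names are our invention. It forwards in-order segments, copies their payload into the client's ordered byte stream, and drops out-of-order segments so that the end points' normal retransmission repairs the gap:

```python
# Minimal software model of TCP Splitter's monitoring idea (illustrative only;
# the real design is an FPGA circuit, and all names here are our invention).

class FlowMonitor:
    """Tracks one TCP flow and splits it: packets continue to the destination,
    while the client sees only the in-order byte stream."""

    def __init__(self, initial_seq):
        self.expected_seq = initial_seq  # next in-order sequence number
        self.client_bytes = bytearray()  # ordered stream handed to the client

    def on_segment(self, seq, payload):
        """Return 'forward' or 'drop' for the segment toward the destination."""
        if seq == self.expected_seq:      # in order: forward and copy to client
            self.client_bytes += payload
            self.expected_seq += len(payload)
            return "forward"
        if seq < self.expected_seq:       # old retransmission: forward only
            return "forward"
        return "drop"                     # gap: drop until sender retransmits

# A lost segment creates a gap; later segments are dropped, so the sender's
# go-back-n retransmission refills the stream in order.
m = FlowMonitor(initial_seq=1000)
m.on_segment(1000, b"abc")      # forwarded; client sees "abc"
m.on_segment(1006, b"ghi")      # the segment at 1003 was lost upstream: dropped
m.on_segment(1003, b"defghi")   # retransmission resumes the ordered stream
```

Because the monitor never buffers out-of-order data, the client's view stays ordered without any reassembly memory in the switch.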

Current-generation field-programmable gate arrays (FPGAs) have approximately the capacity of a million-gate application-specific IC (ASIC), a few hundred Kbytes of on-chip memory, and operating speeds ranging from 50 to 200 MHz. By placing FPGAs in the data path of a high-speed network switch, designers can implement network-processing functions without reducing the switch's overall throughput.


Published by the IEEE Computer Society

0272-1732/03/$17.00 © 2003 IEEE

Authorized licensed use limited to: IEEE Xplore. Downloaded on April 28, 2009 at 09:52 from IEEE Xplore. Restrictions apply.

Related work

Protocol analyzers and packet-capturing programs have been around as long as there have been protocols and networks to monitor. These tools provide a wide range of capabilities for capturing and saving network data. Programs such as tcpdump capture and store TCP packets.1 These tools work well for monitoring data at low bandwidth rates, but their performance is limited because they execute in software. Typically, reconstructing TCP data streams requires postprocessing.

HTTPDUMP captures and stores Web-based hypertext transfer protocol (HTTP) traffic,2 but as a result of the extra filtering logic for processing HTTP traffic, this tool requires more processing and runs slower than tcpdump. PacketScope, developed at AT&T, monitors much larger volumes of network traffic but relies on tcpdump's capabilities to perform packet capturing.3 BLT (bilayer tracing) leverages the PacketScope monitor to perform HTTP monitoring of links with line speeds greater than 100 Mbps.4 This tool does not ensure the processing of all packets but instead attempts to obtain statistically relevant results by capturing a large portion of the HTTP traffic; it captures only header information, which it writes to a log file for TCP stream reconstruction. The Cluster-Based Online Monitoring System does a much better job of capturing data associated with Web requests.5 Multiple analysis engines working in parallel improve its performance over other systems, yet even with eight analysis engines, it does not consistently monitor traffic on a 100-Mbps network. The Internet Protocol Scanning Engine is another software-based TCP/IP monitor.6 This program also has performance limitations that preclude it from monitoring high-bandwidth traffic.

Researchers at the Georgia Institute of Technology have developed a TCP state-tracking engine with buffer reassembly.7 This solution uses a hardware environment similar to that of TCP Splitter and processes data at equal line rates. The project focuses on detecting intrusion and tracking a single connection's TCP/IP processing state. Currently, the engine monitors a maximum of 30 TCP/IP connections simultaneously on a single field-programmable gate array. The state-tracking engine also performs limited buffer reassembly.

None of these solutions can operate in a high-speed active networking environment where data rates exceed 1 Gbps, nor can they guarantee the processing of every byte of data on the network.

References
1. V. Jacobson, C. Leres, and S. McCanne, "tcpdump," http://www.tcpdump.org/.
2. R. Wooster, S. Williams, and P. Brooks, "HTTPDUMP Network HTTP Packet Snooper," 1996.
3. N. Anerousis et al., "Using the AT&T Labs PacketScope for Internet Measurement, Design, and Performance Analysis," 1997.
4. A. Feldmann, "BLT: Bi-Layer Tracing of HTTP and TCP/IP," WWW9/Computer Networks, vol. 33, nos. 1-6, 2000, pp. 321-335; http://citeseer.nj.nec.com/feldmann00blt.html.
5. Y. Mao et al., "Cluster-Based Online Monitoring System of Web Traffic," Proc. 3rd Int'l Workshop Web Information and Data Management, ACM Press, 2001, pp. 47-53.
6. I. Goldberg, "Internet Protocol Scanning Engine," http://www.cs.berkeley.edu/~iang/isaac/ipse.html.
7. M. Necker, D. Contis, and D. Schimmel, "TCP-Stream Reassembly and State Tracking in Hardware," Proc. 10th Ann. Symp. Field-Programmable Custom Computing Machines (FCCM 02), IEEE CS Press, 2002, pp. 286-287.

The Applied Research Laboratory developed the Washington University Gigabit Switch as a research platform for high-speed networking.4 We used this hardware, along with the Field-Programmable Port Extender (FPX),5 as the testbed for the TCP Splitter project. For TCP Splitter's foundation, we used components of the layered protocol wrappers developed for the FPX.6 They include an asynchronous transfer mode (ATM) cell wrapper, an ATM adaptation layer 5 (AAL5) frame wrapper, an IP wrapper, and a user datagram protocol (UDP) wrapper. These wrappers process high-level packets in reprogrammable logic, and this set of wrappers lets a client application send and receive packets with FPGA hardware. We used the cell, frame, and IP wrappers as a framework in which to implement TCP Splitter.

Design requirements

TCP Splitter is a lightweight, high-performance circuit that contains a simple client interface and can monitor an almost unlimited number of flows. To achieve this result within the practical bounds of today's hardware, we made design tradeoffs. Some of the issues we faced were handling dropped and reordered packets, maintaining state for numerous flows, processing data at line rates, and minimizing hardware gate count.

To overcome these challenges, we restricted the way data flows through the network switch. All packets associated with a monitored TCP/IP connection must pass through the networking node where monitoring takes place. It would be impossible to provide a client application with a consistent TCP byte stream from a connection if the switch performing the monitoring processed only a fraction of the TCP packets. Thus, TCP Splitter must reside in a part of the network through which all packets of monitored flows will pass. Generally, this requirement applies to edge routers but not to interior Internet nodes. Private networks designed to pass traffic in a certain manner can also enforce this requirement.

An active solution

A problem with attempting to monitor many TCP/IP flows is that such a system would require a large amount of memory for reassembly buffers. The maximum window scale factor that TCP supports is 2^14,7 which leads to a maximum window size of 1 Gbyte. In a worst-case scenario, reassembling packets in each direction of a TCP connection would require that much memory, and a high-speed switch monitoring both directions of 128,000 connections would require 256 Tbytes of high-speed RAM. Even if we assume that TCP window size is limited to 1 Mbyte, the system still requires 256 Gbytes of memory to monitor both directions of the same 128,000 connections. This prohibitive quantity of memory led us to consider other lightweight designs. As it turns out, we could develop a solution that does not need reassembly buffers.

We chose a design that eliminates the need for reassembly buffers. Because there is no guarantee that TCP frames will cross the network in order, some action must occur when packets arrive out of order. When a packet is dropped upstream of the monitoring node, TCP Splitter detects the missing packet and actively drops the subsequent packets until the sender retransmits the missing packet. This ensures in-order packet flow through the switch. If all frames for a particular flow transit the switch in order, TCP Splitter can provide an ordered TCP byte stream to the client application without requiring reassembly buffers.

This design feature forces the TCP connections into a go-back-n sliding-window mode of operation. Many TCP implementations throughout the Internet, including those of Windows 98, FreeBSD 4.1, and Linux 2.4, use the go-back-n retransmission policy.8 TCP Splitter's benefit or detriment to overall throughput depends on the specific TCP implementations in use at the end points. In cases where the receiving TCP stack is performing go-back-n sliding-window behavior, actively dropping frames might improve overall network throughput by eliminating packets that the receiver will discard. On the other hand, when the end points use selective retransmission and the percentage of network data loss is high, active dropping potentially can exacerbate the dropped-packet problem and render the connection unusable.

Architecture

Figure 1 shows a high-level view of the data flow through TCP Splitter. The circuit splits each monitored TCP byte stream into two separate flows: one goes to the client application on a local host, while the other goes to the destination. TCP Splitter's name reflects this fact.

Figure 1. TCP Splitter's data flow.

TCP Splitter consists of two logical sections. The first, TCP input, handles the ingress of IP frames; this section performs most of TCP Splitter's processing. The second section, TCP output, handles packet routing and frame delivery to the outbound IP stack and the client application. Outbound IP frames go back to the IP wrapper and then to the next-hop router, and TCP Splitter also delivers a TCP byte stream to the client application for each TCP flow.

We implemented TCP Splitter in FPGA hardware, and it fits within the FPX protocol wrapper framework. Figure 2 shows TCP Splitter's layout, which consists of six components.

Input processing

As Figure 2 shows, IP frames enter TCP Splitter from the IP protocol layer contained in the protocol wrappers. Inbound frames enter TCP Splitter, which classifies, checksums, and caches them. The flow classifier, the checksum engine, the input state machine, the control FIFO, and the frame FIFO all process IP packet data received from the IP protocol wrapper.
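In software terms, the checksum engine's job looks like the following sketch of TCP's 16-bit one's-complement checksum over the pseudo-header and segment (RFC 793). The function names and field packing are ours; the hardware computes the same sum 16 bits at a time as each data word arrives:

```python
# Simplified software model of the checksum engine: TCP's 16-bit one's-complement
# checksum over the pseudo-header and segment (RFC 793). Packing details are ours.
import struct

def ones_complement_sum(data):
    data = bytes(data)
    if len(data) % 2:                 # pad odd-length data with a zero byte
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)   # fold carries back in
    return total

def tcp_checksum(src_ip, dst_ip, segment):
    """Checksum to place in a segment whose checksum field is zeroed."""
    pseudo = struct.pack("!4s4sBBH", src_ip, dst_ip, 0, 6, len(segment))
    return ~ones_complement_sum(pseudo + segment) & 0xFFFF

def checksum_ok(src_ip, dst_ip, segment):
    """A received segment (checksum field included) must sum to 0xFFFF."""
    pseudo = struct.pack("!4s4sBBH", src_ip, dst_ip, 0, 6, len(segment))
    return ones_complement_sum(pseudo + segment) == 0xFFFF
```

As the Results section notes, the 16-bit arithmetic of this sum sits on the design's critical path in hardware.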

Figure 2. TCP Splitter's layout.

IP frames enter the input section 32 bits at a time. The checksum engine computes the TCP checksum using the appropriate bits in each data word. The frame FIFO stores the input data so that the output state machine can send the TCP checksum result to the output section along with the start of the IP packet. Once the checksum engine computes the TCP checksum, it writes information about the current frame to the control FIFO. This data includes the checksum result (pass or fail), the flow identifier, the start and end of flow signals, a TCP frame indicator, and a signal that indicates whether or not the output section should forward the frame to the destination. The control FIFO holds state information for smaller frames while the output state machine is still retrieving preceding larger frames from the frame FIFO. Upon detecting a nonempty control FIFO, the output state machine starts reading the next frame from the frame FIFO. The output state machine passes this frame data and the associated control signals from the control FIFO to the TCP output section.

Flow classification

TCP Splitter's simple flow classifier can operate at high speed and has minimal hardware complexity. An 18-bit hash of the source IP address, destination IP address, source TCP port, and destination TCP port serves as the index into the flow table. The flow table is a 262,144-element array contained in a low-latency static-RAM chip, and each table entry contains 33 bits of state information. Currently, the flow classifier does not handle hash table collisions, which cause TCP Splitter to process packets from different flows as if they were a single connection. The detection of a TCP FIN or RST flag signals the end of a TCP flow and clears the hash table entry for that flow.

There are many recent innovations in high-performance flow classifiers capable of operating at network line speeds. The recursive flow classification algorithm is one such technique.9 The aggregate bit vector approach, another high-performance classification technique, reduces the number of required memory lookups.11 Both of these research projects are developing flow classifiers to perform 30 million to 100 million classifications per second. Other researchers have proposed a packet classification solution that performs lookups using a series of pipelined SRAMs; this technology could support 1 billion packet classification lookups per second.10 Switchgen is a tool that transforms packet classification rules into a reconfigurable hardware-based circuit design; it optimizes rules by removing redundancy to achieve a high-performance classifier supporting large rule sets.12 We could use many of these classifiers to identify traffic flows for TCP Splitter. TCP Splitter imposes no restrictions on the flow classification technique and can use any flow classifier.
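In software, the classifier's indexing scheme might be sketched as follows. The article does not specify the hardware hash function, so `zlib.crc32` here is purely a stand-in; the 18-bit index, the 262,144-entry table, the FIN/RST clearing, and the unresolved collisions mirror the description above:

```python
# Software sketch of the flow classifier's table lookup. The actual hardware
# hash function is not specified in the article; zlib.crc32 is a stand-in.
import zlib

TABLE_BITS = 18
FLOW_TABLE = [None] * (1 << TABLE_BITS)     # 262,144 entries held in SRAM

def flow_index(src_ip, dst_ip, src_port, dst_port):
    """18-bit hash of the 4-tuple, used directly as the flow-table index."""
    key = (src_ip.to_bytes(4, "big") + dst_ip.to_bytes(4, "big") +
           src_port.to_bytes(2, "big") + dst_port.to_bytes(2, "big"))
    return zlib.crc32(key) & ((1 << TABLE_BITS) - 1)

def lookup(src_ip, dst_ip, src_port, dst_port, fin_or_rst=False):
    idx = flow_index(src_ip, dst_ip, src_port, dst_port)
    if fin_or_rst:
        FLOW_TABLE[idx] = None       # FIN/RST ends the flow: clear the entry
        return idx
    if FLOW_TABLE[idx] is None:
        FLOW_TABLE[idx] = {"expected_seq": 0}   # 33 bits of state in hardware
    # Note: two different 4-tuples hashing to the same index share this entry;
    # collisions are not resolved, matching the current hardware.
    return idx
```

Because the hash output is used directly as the table address, a lookup costs a bounded number of memory accesses regardless of how many flows are active.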

Output processing

TCP Splitter's output-processing section determines how a packet should be processed. There are three possible choices: Packets can go to the outbound IP layer only, go to both the outbound IP layer and the client application, or be discarded. The rules for processing packets are as follows:

• Send all non-TCP packets to the outbound IP stack.
• Drop all TCP packets with invalid checksums.
• Send all SYN packets to the outbound IP stack.
• Send all TCP packets with sequence numbers less than the current expected sequence number to the outbound IP stack. Such packets contain data that TCP Splitter has already processed; TCP Splitter forwards them to the destination to account for packets that were dropped between the monitor and the destination.
• Drop all TCP packets with sequence numbers greater than the current expected sequence number.
• Send all other packets to both the outbound IP stack and the client application.

To avoid forwarding erroneous frames, TCP Splitter adds one store-and-forward delay, allowing time to compute and verify the TCP checksum.

Client interface

The client interface provides a simple hardware interface for application circuits. It provides only valid, checksummed, in-order TCP packet data for each flow to the client application, which processes only the ordered byte stream of each TCP connection. The interface clocks all packet headers into the client application, along with a start-of-header signal so that the client can extract information from the headers. This method eliminates the need to store header information but still gives the client access to this data. Because the client application is not in the network data path, it does not induce delay into the packets crossing the network switch. Thus, the client application can be arbitrarily complex without affecting TCP Splitter's throughput rate.

Results

We implemented TCP Splitter as a module on a Xilinx Virtex XCV1000E-7 FPGA and synthesized the circuit to provide processing at full OC-48 line speeds on the FPX platform. It operates at a post-place-and-route frequency of 101 MHz and has a corresponding throughput of 3.2 Gbps. The TCP Splitter implementation is small; it uses only 2 percent of the FPGA. A complete solution, including TCP Splitter, the protocol wrappers, and a sample client application that simply counts TCP data bytes, requires 21 percent of the FPGA's resources. The design's critical path includes the 16-bit arithmetic operations that compute the TCP checksum. TCP Splitter has a pipeline delay of only seven clock cycles, which introduces a total data path delay of 70 ns.

Future work

We plan to increase TCP Splitter's throughput to support OC-768 line rates. To accomplish this goal, we will exploit additional pipeline stages and parallelism available in the FPGA. In the current implementation, the flow classifier performs a maximum of two memory accesses for each packet. Given that TCP Splitter's input data width is 32 bits (4 bytes), and assuming minimum-length packets of 64 bytes, the smallest operation period is 16 clock cycles. In that time, eight TCP Splitter engines could run in parallel and perform one memory access on every clock cycle; this is sufficient bandwidth to monitor all TCP/IP flows at OC-768 line rates. By using both of the static-RAM modules on the FPX platform, we could design a solution with 16 TCP Splitter engines, each operating at 101 MHz, which would process data at 51 Gbps. We also plan improvements to the flow classifier to eliminate the hash table collision problem. Another planned enhancement is the addition of a few packet reassembly buffers; these buffers would support the reassembly of IP fragments and TCP packets to provide a passive monitoring solution.

TCP Splitter differs from other TCP/IP network monitors because it

• is implemented in reconfigurable hardware,
• processes data in real time,
• processes packets at line rates exceeding 3 Gbps,
• delivers a consistent byte stream for each TCP flow to a client application,
• can monitor 256,000 TCP flows simultaneously, and
• eliminates the need for large reassembly buffers.

We have successfully tested a sample client application in hardware, using simulated TCP data packets. Although we developed the circuit as a module on the FPX platform, TCP Splitter can easily be ported to other FPGA- or ASIC-based packet-processing systems. MICRO

References
1. L. Roberts, "Internet Still Growing Dramatically Says Internet Founder," Aug. 2001; http://www.caspiannetworks.com.
2. S. Shalunov and B. Teitelbaum, "Bulk TCP Use and Performance on Internet2," 2001; http://www.internet2.edu.
3. RFC793: Transmission Control Protocol, 1981; http://www.faqs.org/rfcs/rfc793.html.
4. T. Chaney et al., "Design of a Gigabit ATM Switch," Proc. Infocom 97, IEEE CS Press, 1997, pp. 2-11.
5. J.W. Lockwood, "An Open Platform for Development of Network Processing Modules in Reprogrammable Hardware," Proc. IEC DesignCon 01, Int'l Eng. Consortium, 2001, pp. WB-19.
6. F. Braun, J.W. Lockwood, and M. Waldvogel, "Layered Protocol Wrappers for Internet Packet Processing in Reconfigurable Hardware," Proc. Symp. High-Performance Interconnects (Hot Interconnects IX), IEEE CS Press, 2001, pp. 93-98.
7. V. Jacobson and R. Braden, "RFC1072: TCP Extensions for Long-Delay Paths," 1988; http://www.faqs.org/rfcs/rfc1072.html.
8. A. Gurtov, "Effect of Delays on TCP Performance," Proc. IFIP Personal Wireless Communications, Int'l Federation for Information Processing, 2001, pp. 87-108.
9. P. Gupta and N. McKeown, "Packet Classification on Multiple Fields," Proc. ACM Sigcomm, ACM Press, 1999, pp. 147-160.
10. A. Prakash and A. Aziz, "OC-3072 Packet Classification Using BDDs and Pipelined SRAMs," Proc. Symp. High-Performance Interconnects (Hot Interconnects IX), IEEE CS Press, 2001, pp. 15-20.
11. F. Baboescu and G. Varghese, "Scalable Packet Classification," Proc. ACM Sigcomm, ACM Press, 2001, pp. 199-210.
12. D. Johnson and K. Mackenzie, "Pattern Matching in Reconfigurable Logic for Packet Classification," Proc. Int'l Conf. Compilers, Architectures and Synthesis for Embedded Systems (CASES 01), ACM Press, 2001, pp. 126-130.

David V. Schuehler is a doctoral student in the Applied Research Laboratory of Washington University in St. Louis. His research interests include real-time processing, embedded systems, and high-speed networking. Schuehler has a BS in aeronautical and astronautical engineering from Ohio State University and an MS in computer science from the University of Missouri-Rolla. He is also vice president of research and development for Reuters. He is a member of the IEEE and the ACM.

The biography of John W. Lockwood appears elsewhere in this issue.

Direct questions and comments about this article to David V. Schuehler, Applied Research Laboratory, Washington University, Campus Box 1045, One Brookings Dr., St. Louis, MO 63130; dvs1@arl.wustl.edu.

For further information on this or any other computing topic, visit our Digital Library at http://computer.org/publications/dlib.