
Study of Internet Router Architectures

Kiran Mukesh Misra (A30214824) and Freddy Kharoliwalla (A30790087)

{misrakir, kharoliw}@msu.edu

May 1, 2001

Abstract: The Internet today is undergoing a high degree of transformation; new generations of innovative and fast applications, such as e-commerce, video streaming, and the transfer of medical transcripts, are placing high performance demands on the Internet infrastructure. To keep pace with these continuous demands, not only does bandwidth need to be increased, but the routers that power the Internet must also evolve architecturally to keep up with the escalating use of the web. Here we look at the different router architectures and study the merits of some of the major architectures used in today's routers.

1. Introduction

The Internet has developed at an exponential rate, and so has the traffic on it. This has placed enormous pressure on vendors to continually improve router performance, and the diverse nature of Internet traffic makes that task extremely challenging. The main demands on any Internet infrastructure today are:

• To utilize the network capacity to the fullest so as to transfer large volumes of traffic.

• To be able to scale networks quickly and cost-effectively, with minimal impact on network operations.

• To ensure the delivery of packets in proper sequence and minimize packet loss over the network.

The report starts by looking at the functionality of routers and the different popular architectures in use. We then look at router performance in light of various switching techniques. We also look at the use of caches in routers and how they affect performance. Finally, we look at two commercially available routers and examine their architecture and performance in light of the architectural details we study.

2. Related Work

Routers have adopted various techniques to meet the high performance requirements of the Internet environment. Some of them use larger caches, while others have adopted a distributed architecture for better performance. In a centralized architecture, performance degrades as the volume of traffic increases; in a distributed architecture, the processing load is spread out, ensuring faster and more reliable communication. In this report we look at the following techniques adopted to improve router performance: caching, switching, forwarding engines, and distributed processing.

3. Basic Router Functionality

Generally, a router performs two main functions: control path routines and data path control (switching). Routers maintain and manipulate routing tables; they listen for updates and change the routing tables to reflect the new network topology. The network topology in the core of the Internet and in enterprise networks is extremely dynamic and changes very frequently. Routers also classify packets and perform control actions on them; they perform Layer 3 switching and sometimes maintain statistical data on the data flow. Typically, packets are received at an inbound network interface; they are then processed by the processing module (CPU) and possibly stored in the buffering module. The packets are then forwarded through the switching fabric to the outbound interface, which transmits each packet to the next-hop router. The architecture of a conventional router is given in Figure 1. The CPU typically performs functions such as path computation, routing table maintenance, and reachability propagation. The router adjusts the

Time-to-live (TTL) field in the packet to prevent packets from circulating endlessly. Packets whose lifetime has expired are dropped by the router (the sender may or may not receive an error message).

Figure 1: Conventional Router Architecture [7]

The router also checks the validity of the header based on the checksum; since the router changes the TTL field, it must incrementally update the checksum before forwarding the packet. One of the performance bottlenecks in routers is looking up the address of the next hop.

The first approaches used Patricia trees [1][2] combined with hash tables; these are binary trees that use the destination IP address as the lookup key. A key to overcoming this bottleneck was the introduction of lookup caches, which rely on there being enough locality in the traffic to maintain a high hit rate. However, frequent changes in network topology in the core of the Internet cause cache entries to be invalidated often, resulting in lower hit rates. Hardware-based caching, route lookup, and forwarding solutions are generating a lot of interest due to the speed advantage they offer.
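The incremental checksum update mentioned above can be sketched per RFC 1624, which gives HC' = ~(~HC + ~m + m') in 16-bit one's complement arithmetic, where m is the old header word containing the TTL and m' the new one. A minimal Python sketch (the sample header words are hypothetical):

```python
def ones_complement_add(a, b):
    """16-bit one's complement addition with end-around carry."""
    s = a + b
    return (s & 0xFFFF) + (s >> 16)

def full_checksum(header_words):
    """Internet checksum over 16-bit words (checksum field set to 0)."""
    s = 0
    for w in header_words:
        s = ones_complement_add(s, w)
    return ~s & 0xFFFF

def incremental_checksum(old_cksum, old_word, new_word):
    """RFC 1624, Eqn. 3: HC' = ~(~HC + ~m + m'). Lets a router patch the
    checksum after decrementing the TTL without re-summing the header."""
    s = ones_complement_add(~old_cksum & 0xFFFF, ~old_word & 0xFFFF)
    s = ones_complement_add(s, new_word & 0xFFFF)
    return ~s & 0xFFFF

# Hypothetical IPv4 header words; word 4 packs (TTL=0x40, protocol=6).
words = [0x4500, 0x003C, 0x1C46, 0x4000, 0x4006,
         0x0000, 0xAC10, 0x0A63, 0xAC10, 0x0A0C]
old = full_checksum(words)
patched = incremental_checksum(old, 0x4006, 0x3F06)   # TTL 0x40 -> 0x3F
words[4] = 0x3F06
assert patched == full_checksum(words)                # matches a full recompute
```

The incremental form touches only three words instead of the whole header, which is why routers prefer it on the fast path.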

4. Techniques to improve performance

The various techniques used for improving router performance are: route caching, switching, forwarding engines, and distributed processing.

4.1 Route Caching

Figure 2: Route Cache and Distributed Processing [7]

Traditionally, all route lookups have been done centrally by the CPU, but this places a heavy load on the CPU, which then becomes a performance bottleneck. Routers have therefore moved towards providing interfaces (line cards) with processing power. The line cards now have a CPU of their own, which determines the outbound interface using a cache it maintains (figure 2). The cache contains lookup information on the next hop for a destination IP address. The cache is updated based on recently forwarded packets, using replacement algorithms such as LRU or FIFO. Entries in the cache may be invalidated as the network topology changes. This not only speeds up packet forwarding, as packets are forwarded directly from one interface to another (fast path), but also reduces the load on the system bus. Packets are transmitted only once over the shared bus, and the CPU is now free to perform other functions, such as routing and determining the policies and resources used by the interfaces. If there is a cache miss, the packet is forwarded to the CPU, which performs a route table lookup (slow path) before forwarding the packet to the appropriate interface. The use of a cache makes the throughput of the architecture dependent on the nature of the traffic. The performance of a router using a route cache is influenced by:

• The size of the cache

• The replacement algorithm (LRU, FIFO)

• The performance of the slow path

• The hashing function used in cache lookup
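The fast path/slow path split with LRU replacement can be sketched as follows; the class, the routing-table mapping, and the addresses are hypothetical placeholders, not any vendor's implementation:

```python
from collections import OrderedDict

class RouteCache:
    """LRU route cache: fast path on a hit, slow path (full route table
    lookup by the central CPU) on a miss."""
    def __init__(self, capacity, routing_table):
        self.capacity = capacity
        self.routing_table = routing_table   # hypothetical dest -> next-hop map
        self.cache = OrderedDict()           # insertion order tracks recency

    def lookup(self, dest):
        if dest in self.cache:
            self.cache.move_to_end(dest)     # fast path: refresh LRU position
            return self.cache[dest]
        next_hop = self.routing_table[dest]  # slow path: full table lookup
        self.cache[dest] = next_hop
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used entry
        return next_hop

    def invalidate(self, dest):
        """Topology change: drop the stale entry so the slow path is used."""
        self.cache.pop(dest, None)
```

The `invalidate` method mirrors the point made above: frequent topology changes flush entries and push traffic back onto the slow path.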

4.2 Switching

Figure 3: Crossbar Data Path [7]

As route lookup was off-loaded to the interfaces, the CPU became less of a bottleneck; however, the shared bus limited the speed at which packets could be forwarded from one interface to another. This led to the next improvement in routers: the shared bus was replaced by a switch fabric (figure 3). The switch fabric provides large bandwidth for transmitting packets between interface cards and increases throughput considerably. All commercially available Gigabit routers use a switch fabric. A switching fabric offers parallelism, but makes it difficult to multicast packets.

4.3 Forwarding Engines

When the function of route lookup was delegated to the interfaces, it required a CPU and cache in every interface. This restricted the number of interfaces, as cost increased with their number, putting a restriction on port density. The approach adopted to overcome this problem was to separate the forwarding engines from the line cards.

Figure 4: Forwarding Engines [11]

Multiple forwarding engines (figure 4) are connected in parallel to achieve high throughput. As packets arrive through the inbound interface, the header is stripped from the packet and a tag is attached.

This tagged header is then assigned to a forwarding engine in round-robin fashion; each forwarding engine has a FIFO queue in which the header is placed, so the load can be shared equally among the engines. Every forwarding engine has its own route cache. The forwarding engine performs basic error checks and then computes the hash offset into the route cache. It then generates the appropriate header and forwards the IP packet (along with that header) to the appropriate interface. The basic function of the forwarding engines is to generate next-hop addresses. Transferring only IP headers to the forwarding engines keeps unnecessary payload off the system bus: packets are transferred directly between interfaces and never go to the forwarding engines or the route processor unless they are destined for them. The use of forwarding engines is based on the presumption that it is unlikely that all interfaces will be bottlenecked at the same time; sharing forwarding engines can therefore increase the port density of the router. In some applications the order in which packets are sent and received may be important, so the forwarding engines can be made to output their data in the same order they received it. The performance of a forwarding engine can be considerably improved by designing it as an ASIC; however, the Internet is constantly evolving, and changes to the IP specification, which a fixed-function ASIC cannot easily accommodate, cannot be ruled out.
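The header-stripping and round-robin dispatch step described above can be sketched as follows; the fixed header length, class name, and queue structure are simplifying assumptions for illustration:

```python
from collections import deque
from itertools import cycle

HEADER_LEN = 20  # hypothetical fixed IP header length for this sketch

class ForwardingEngineArray:
    """Round-robin dispatch of tagged headers to per-engine FIFO queues.
    Only headers cross the system bus; payloads stay at the interfaces."""
    def __init__(self, n_engines):
        self.queues = [deque() for _ in range(n_engines)]
        self._rr = cycle(range(n_engines))

    def dispatch(self, packet, tag):
        header, payload = packet[:HEADER_LEN], packet[HEADER_LEN:]
        engine = next(self._rr)                  # round-robin load sharing
        self.queues[engine].append((tag, header))
        return engine, payload                   # payload buffered at inbound side
```

Because dispatch is strictly round-robin, the per-engine FIFO depths stay balanced, which is the load-sharing property the text relies on.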

4.4 Distributed Processing

Figure 5: Example of Functional Partition of Distributed Router Architecture [11]. (The CPU handles routing protocols such as RIP and BGP, error and maintenance protocols such as ICMP and IGMP, network management, QoS policies, applications over UDP/TCP, IP options processing, packet fragmentation and reassembly, and ARP; each network interface, attached via the switch fabric, handles IP header validation, route lookup and packet classification, TTL update, and checksum update.)

The distributed processing architecture is a combination of all the techniques discussed above. It contains a switch fabric, which increases the number of packets that can be transmitted between interfaces. Communication between the interfaces and the CPU can also go through the switch fabric, making the CPU equivalent to an interface, but with additional functionality. Routing decisions are made by the forwarding engines, which contain their own route caches. The various functions of a router may be built into the slow path or the fast path based on their nature. For example, sometimes the IP datagram received is too large for the MTU of the output port; the datagram is then fragmented and forwarded, and the fragments are reassembled at the destination. Although fragment reassembly is resource intensive, the number of packets that are fragmented is normally quite low, so fragmentation handling is generally built into the slow path. An example functional partitioning of a distributed router architecture is displayed in figure 5.
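The slow-path fragmentation step can be illustrated with a minimal sketch. As in IPv4, fragment offsets are counted in 8-byte units, so every fragment but the last must carry a multiple of 8 data bytes; the header length and field names here are simplified assumptions:

```python
def fragment(payload, mtu, header_len=20):
    """Slow-path sketch: split a datagram payload into fragments that fit
    the outbound MTU. Offsets are in 8-byte units, as in IPv4."""
    max_data = (mtu - header_len) // 8 * 8   # largest 8-byte-aligned chunk
    fragments, offset = [], 0
    while offset < len(payload):
        chunk = payload[offset:offset + max_data]
        more = (offset + len(chunk)) < len(payload)   # "more fragments" flag
        fragments.append({"offset": offset // 8, "mf": more, "data": chunk})
        offset += len(chunk)
    return fragments
```

Since each fragment costs an extra header and an extra lookup downstream, keeping this work off the fast path is the sensible trade-off the text describes.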

5. Types of Switch Fabrics

A switch fabric is a mechanism that allows each line card to transmit data to any other line card as needed. A high-performance switch fabric needs to be non-blocking, i.e., it must eliminate IP traffic jams within the router. However, using high-speed memory alone to achieve this goal is expensive and impractical; a more efficient design relies on buffering, queuing, and scheduling techniques. The switch fabric is a big influence on the performance of routers, especially in the Gigabit domain. Its design is shaped by many factors, such as the need for multicasting, fault tolerance, and delay priorities; this leads to building redundancy into switch fabrics, congestion monitoring and control, packet discarding, etc.

The four basic types of switch fabrics are: shared medium, shared memory, distributed output buffered, and space division (e.g., the crossbar of figure 3).

5.1 Shared Medium

This is the simplest type of switch fabric: IP packets are transferred from one interface to another over a shared medium, e.g., a bus, ring, or dual bus. Data is transmitted over the shared medium using time division multiplexing (TDM); however, the bandwidth of the bus is a major bottleneck, and there is also a certain overhead for bus arbitration, which becomes significant in the Gigabit domain. For all its limitations, the shared medium architecture has found several implementations, as it is easy to implement and is a natural candidate for multicasting and broadcasting.

5.2 Shared Memory

The incoming data is stored in a shared memory pool. The headers are examined to determine the appropriate output port, which can then read the data from the shared memory. This method has the advantage that it minimizes the need for per-port buffering, as the shared memory absorbs large bursts of output. However, since packets are written into and read out of the memory one at a time, the speed of the memory must be at least equal to the total throughput, which may be a problem.

5.3 Distributed Output Buffered

Figure 6: Distributed Output Buffered Switch Fabric [11]

Figure 6 shows the various components of this type of switch fabric. There are N² independent paths from the inbound interfaces to the outbound interfaces. Every output port determines whether a packet is destined for it using address filters (AF). There is no waiting at the input; all the data is buffered at the output, and the memory buffer at the output needs to be only as fast as the output port. Ideally, the depth of the output buffer needs to be N to make sure there is no packet loss. In practice, however, the depth is usually kept at some L < N, at the cost of some packet loss (generally L = 8, with a loss rate around 10⁻⁶).

5.4 Space Division

Figure 7: Head Of Line Blocking [7]

Another popular name for the space division switch fabric is the crossbar switch (figure 3). This architecture establishes a connection between an inbound interface and an outbound interface based on the destination IP address. It has the following advantages: (i) low cost, (ii) good scalability, (iii) non-blocking properties, and (iv) convenience in providing QoS guarantees. The speed at which the switch fabric must operate is at least equal to the aggregate speed of all input links connected to it. A serious drawback of crossbar switches is head-of-line blocking (figure 7). If an interface receives a number of packets, it may need to write them to the same output port, leading to queuing of packets. Another scenario is when an interface receives several inbound packets and puts them in its input queue: the packet at the head of the queue is blocked because the output port corresponding to its destination is not free, and the packets behind the head, which may be destined for different (and possibly available) output ports, are blocked as well. This slows down the forwarding rate. There are several ways to combat this. One is to keep track of the traffic on every output port: every cycle (epoch), each input port is matched to an output port depending on the traffic; the input ports bid for output ports, and an allocator arbitrates and assigns them. This scheduling is pipelined. Another method is to give every output port its own lane (input buffer), so that a packet waiting for one output port does not block packets bound for a different output port. Increasing the speed of the input/output channels is another alternative. These approaches can be used to maximize the throughput of the crossbar switch.
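The per-output lanes described above are commonly known as virtual output queues. A minimal sketch of the idea, with hypothetical class and method names:

```python
from collections import deque

class VOQInput:
    """Virtual output queues: one lane per output port at each input, so a
    packet blocked on a busy output does not block packets bound elsewhere."""
    def __init__(self, n_outputs):
        self.lanes = [deque() for _ in range(n_outputs)]

    def enqueue(self, packet, out_port):
        self.lanes[out_port].append(packet)

    def schedule(self, free_outputs):
        """Each epoch, forward at most one packet to each free output port."""
        sent = []
        for port in free_outputs:
            if self.lanes[port]:
                sent.append((port, self.lanes[port].popleft()))
        return sent
```

With a single FIFO per input, the head packet waiting for a busy port would stall everything behind it; here, a packet in lane 0 waits without delaying lane 1.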

The shared memory and shared medium approaches achieve throughput limited by memory access time; the crossbar switch has no such limitation. However, fabrication density limits the speed of switching in the crossbar switch. It is generally accepted that large router switch fabrics (of 1 Tb/s or higher) cannot be obtained by simply scaling up a fabric design in size and speed [11].

6. Other Issues

The nature of the traffic is another factor influencing the performance of routers. Although not strictly under the control of the designer, it is helpful to know the behavior of routers under different data profiles. Figure 8 displays a graph of the number of packets transmitted per second as a function of frame size over various connections such as T1, Ethernet, and T3.

Figure 8: PPS Vs Frame Size [9]

It is observed that routers perform better when the packet size is large than when it is small (for the same amount of data). This is logical, as the lookup overhead per byte is lower when the packet size is larger.
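The effect is easy to quantify: the maximum packet rate a link can impose on a router is simply the line rate divided by the frame size in bits, so every halving of the frame size doubles the lookups per second. A small sketch (the line rates used are nominal values):

```python
def max_pps(link_bps, frame_bytes):
    """Upper bound on packets per second for a given frame size. Every
    packet costs a lookup, so smaller frames mean more lookups per byte."""
    return link_bps / (frame_bytes * 8)

# Nominal line rates: T1 ~1.544 Mbps, Ethernet 10 Mbps, T3 ~44.736 Mbps
for name, bps in [("T1", 1_544_000), ("Ethernet", 10_000_000), ("T3", 44_736_000)]:
    print(name, round(max_pps(bps, 64)), "pps @ 64 B,",
          round(max_pps(bps, 1500)), "pps @ 1500 B")
```

On 10 Mbps Ethernet, for example, 64-byte frames demand roughly 19,500 lookups per second versus about 833 for 1500-byte frames, for the same volume of data.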

7. Case Studies

We have performed case studies of two router architectures as part of this survey: the Cisco 7500 and the Cisco 10000. An explanation of some of the terms used follows:

Cisco Express Forwarding (CEF) [3][4]: CEF is a non-cache-based switching mode for IP packets. In cache-based switching, the first packet of a flow is sent up to the process level (CPU), where its destination address is compared with the routing table to obtain the forwarding information; a route cache entry for this forwarding information is built, so that subsequent packets of the same flow can be fast-switched based on the route cache. With CEF, instead of building route cache entries on demand, a Forwarding Information Base (FIB) is built. It is based on the entire routing table and is downloaded to all the interfaces for distributed switching. The FIB has a one-to-one correspondence with the routing table and is updated only when there are changes to the routing table.
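The FIB idea can be sketched as a longest-prefix-match table built from the routing table and updated only on routing changes. The linear scan below is purely illustrative — production FIBs use trie or TCAM structures — and all names are hypothetical:

```python
import ipaddress

class FIB:
    """Sketch of a Forwarding Information Base mirroring the routing table:
    one entry per prefix, rebuilt only when the routing table changes."""
    def __init__(self):
        self.entries = {}   # ip_network prefix -> next hop

    def update(self, prefix, next_hop):
        self.entries[ipaddress.ip_network(prefix)] = next_hop

    def lookup(self, dest):
        """Longest-prefix match: the most specific covering prefix wins."""
        addr = ipaddress.ip_address(dest)
        best = None
        for prefix, hop in self.entries.items():
            if addr in prefix and (best is None or prefix.prefixlen > best[0].prefixlen):
                best = (prefix, hop)
        return best[1] if best else None
```

Because the table covers every routing-table prefix up front, there is no miss/slow-path case and no cache churn, which is the advantage CEF claims over on-demand route caches.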

Tag Switching [3][4]: Tag switching is a new IP packet switching scheme that adds a tag to each IP packet. The tag is used by tag switches (which can be routers or ATM switches) as the basis for packet switching, instead of the original IP destination address. A separate Tag Distribution Protocol (TDP) maintains the mapping between IP addresses and tags. Tag switching offers the following benefits: (i) flexible traffic engineering, (ii) IP-ATM integration, (iii) scalable Border Gateway Protocol, and (iv) Virtual Private Network (VPN) implementations. A VPN uses encryption or tunneling to connect users or sites over a public network.

Parallel Express Forwarding (PXF) [5]: The Cisco PXF is a powerful new multiprocessor technology enabling forwarding performance on the order of millions of packets per second. PXF makes it possible to deploy new services such as Multiprotocol Label Switching (MPLS), VPNs, etc. Like CEF, PXF avoids the potential overhead of continuous cache churn by using a FIB for the destination switching decision. PXF enhances the FIB model by separating control plane functions from forwarding plane functions, and uses a parallel array of processors for an accelerated switching path. The architecture allocates independent memory to each processor, as well as memory for each column of processors, to optimize memory access. A more complete explanation of PXF is given in [5][6].

7.1 Cisco 7500

Figure 9: Cisco 7500 RSP [7]

The Cisco 7500 [3] is designed for environments requiring high-performance, high-availability routing. It was introduced in 1995 with a centralized architecture, and uses CEF and tag switching to obtain higher performance. The Route Switch Processor (RSP) in the Cisco 7500 performs the following functions: switching data packets; providing additional services (such as encryption, compression, access control, QoS, and traffic accounting) for data packets; running routing protocols to maintain switching intelligence; and handling other system maintenance functions such as network management. The architecture of the RSP is shown in figure 9; at its heart lies a RISC processor (MIPS R4600/R4700/R5000). The Cisco Internetwork Operating System (IOS) executes on this processor to perform all the functions of the RSP. A description of the other major components of the RSP and their functionality follows [7]:

Boot ROM: Contains the ROM Monitor with the startup diagnostic code and exception handling.

NVRAM: Contains the startup configuration file and 16-bit configuration registers, which specify router startup parameters.

PCMCIA: Flash memory containing Cisco IOS® files.

Boot Flash: Houses the software component RxBoot; in host mode, RxBoot is used for downloading the full Cisco IOS®.

QA ASIC: The Quality Assurance ASIC, used to maintain QoS.

SRAM: The RAM can be divided functionally into two parts, main and I/O. The I/O part contains buffers for interfaces and some system buffers; the main part contains the executing Cisco IOS® image, routing tables, the running configuration, etc.

Figure 10: Versatile Interface Processor [7]

In 1996, a distributed architecture using Versatile Interface Processors (VIPs, figure 10) was introduced. Each VIP has its own processor, which is capable of switching IP data packets and providing certain network services such as encryption, compression, access control, QoS, and traffic accounting. The RSP can now devote all its CPU cycles to other essential tasks, such as routing protocols, non-IP traffic, tunneling, and network management. The VIP2-50 [3] is the most recent addition to the VIP2 family. It uses the MIPS R5000 processor, with up to 8 MB of SRAM and up to 128 MB of DRAM. The additional memory capacity gives the VIP2-50 more queuing capability and more storage to handle large routing tables. The VIP2-50 supports all available WAN and LAN port adapters (PAs), and there are also Packet over Synchronous Optical Network (SONET) Interface Processors (POSIPs) and Channelized T3 Interface Processors (CT3IPs) based on the VIP2-50 platform. The VIP2-50 has a minimum software requirement of Cisco IOS Release 11.1(14)CA.

7.2 Cisco 10000

Figure 11: Cisco 10000, Performance Routing Engine [6]

The Cisco 10000 Edge Service Router (ESR) [6] is a Layer 3, 10-slot platform optimized to meet the large-scale leased-line aggregation requirements of Internet service providers (ISPs). The Cisco 10000 ESR dedicates two slots to active and redundant processor modules and eight slots to interface modules. Interface modules can be configured in any of the eight available slots; all modules require a single slot and are hot-swappable. The Cisco 10000 ESR supports the following interface modules:

• Six-port channelized T3 module

• Channelized OC-12 module

• Four-port channelized STM-1 module

• Gigabit Ethernet module

• OC-12 Asynchronous Transfer Mode (ATM) module

• OC-12 Packet over SONET (PoS) module

The heart of the Cisco 10000 ESR's high performance and throughput is the Performance Routing Engine (PRE). The PRE uses PXF to support high-performance throughput with IP services enabled on every port. The PRE has two PCMCIA slots, 32 MB of Flash memory, and a 128 MB packet buffer; it also supports 512 MB of SDRAM for use by Cisco IOS® software. The Cisco 10000 ESR has two major blocks: the line cards and the PRE. Each line card manages its own interface type, sending and receiving complete packets to the PRE across the backplane. Most communication devices are based on a shared system bus to which all circuit cards are attached, but the Cisco 10000 ESR replaces it with a line-card interconnect that uses point-to-point links between each line card and each PRE. This provides high bandwidth and fault isolation; transfer speeds of up to 3.2 Gbps in each direction can be achieved over the point-to-point links.

Two PREs can be configured in a single router to provide fault tolerance and high availability. The PRE consists of the forwarding path (FP) and the route processor (RP). The FP executes the packet-forwarding algorithm on every packet flowing through the router. The FP is based on the PXF network processor; each PXF network processor provides a packet-processing engine consisting of 16 microcoded processors arranged as multiple pipelines.

Figure 12: Cisco 10000 ESR Forwarding Path Processor Array [6]

The RP runs the routing protocols, does update calculations, and handles other control-plane functions such as the SNMP agent and the command-line interface (CLI). Each of the 16 processors in a PXF network processor is an independent, high-performance processor customized for packet processing. Each processor, called an eXpress Micro Controller (XMC), provides a sophisticated dual-instruction-issue execution unit with a variety of special instructions designed to execute packet-processing tasks efficiently. Within a single PXF network processor, the 16 XMCs are linked together in four parallel pipelines. Each pipeline comprises four microcontrollers arranged as a systolic array, where each processor can efficiently pass its result to its neighboring downstream processor; the four parallel pipelines further increase throughput. Within the Cisco 10000 ESR, two PXF network processor ASICs are used, yielding four parallel processing pipelines, each containing eight processors in a row (figure 12). In this array of processors, hardware, microcode, and Cisco IOS software resources are combined to provide advanced, high-touch feature processing on the Cisco 10000 ESR. The allocation of features is constantly changing, but one such allocation could be: Layer 2 analysis (level 1), FIB switching (level 2), additional features (levels 3, 4, 5, and 7), MAC rewrite (level 6), and enqueue/dequeue (level 8).

The RP also includes standard Cisco IOS facilities such as Flash memory, nonvolatile RAM for storing configuration files, and Ethernet connections for network management.

8. Conclusion

The growth of the Internet has propelled the emergence of routers with forwarding rates in the Gigabit and Terabit range. This report has briefly presented the various approaches used for designing high-speed routers. We have seen that shared buses become a bottleneck in high-speed routers, necessitating the use of a switched backplane. High-speed routers must be robust and must have enough parallelism to support QoS, multicast, etc. Routing table lookups and data movement are the major performance bottlenecks.

To improve performance, critical functions are now performed in ASICs, and parallelism is exploited through pipelining in the forwarding path. The cost of a router port depends on the type and size of memory at the port: faster memory is expensive, but may be required to meet performance criteria, and buffers should be sufficiently large to avoid packet losses, or at least to contain them within reasonable limits. In the case of crossbar switches, head-of-line blocking plays a major role in determining the effective switching speed; crossbar switches require techniques such as faster input/output channels, or the allocation of lanes (input queues) for every output port, to obtain satisfactory performance.

Industry has developed its own architectural designs to improve performance. CEF and tag switching are used in all Cisco high-speed routers, and PXF (pipelining) cuts the average time spent in the forwarding path. Extensive research is being carried out in industry to improve router performance to keep up with the exponential growth of the Internet. The field presents many unique and interesting challenges, on which the research community and industry have been working actively.

References

[1] K. Sklower, "A Tree-Based Packet Routing Table for Unix", USENIX Winter '91, Dallas, TX, 1991.

[2] W. Doeringer, G. Karjoth, and M. Nassehi, "Routing on Longest-Matching Prefixes", IEEE/ACM Trans. on Networking, Vol. 4, No. 1, Feb 1996, pp. 86-97.

[3] Cisco 7500 Advanced Router System – Data Sheet, http://www.cisco.com

[4] White Paper, "Cisco Express Forwarding", http://www.cisco.com

[5] White Paper, "Parallel eXpress Forwarding in the Cisco 10000 Edge Service Router", http://www.cisco.com

[6] Cisco 10000 Edge Service Router, Hardware Architecture, http://www.cisco.com

[7] "Cisco Router Architectures", http://www.cisco.com/networkers/nw99_pres/601.pdf

[8] White Paper, "The Evolution of High End Router Architectures", http://www.cisco.com/warp/public/cc/pd/rt/12000/tech/ruar_wp.htm

[9] "Router Performance", http://www.cisco.com/networkers/nw99_pres/602.pdf

[10] "Troubleshooting Layer 3 Network Connections", http://www.cisco.com/univercd/cc/td/doc/product/atm/c8540/12_0/13_19/trouble/l3_net.htm

[11] James Aweya, "IP Router Architectures: An Overview", Nortel Networks, Ottawa, Canada, K1Y 4H7.

[12] White Paper, "The Evolution of High-End Router Architectures – Basic Scalability and Performance Considerations for Evaluating Large-Scale Router Designs", http://www.cisco.com

[13] Partridge, Carvey, Burgess, Castineyra, Clarke, Graham, Hathaway, Herman, King, Kohlami, Ma, Mcallen, Mendez, Milliken, Osterlind, Pettyjohn, Rokosz, Seeger, Sollins, Storch, Tober, Troxel, Waitzman, and Winterble, "A Fifty Gigabit Per Second IP Router", BBN Technologies (a part of GTE Corporation).

[14] Vibhavasu Vuppala and Lionel M. Ni, "Virtual Network Ports: An Inter-network Switching Framework", Proc. of the 1999 International Conference on Computer Communication (ICCC), Tokyo, September 1999.

[15] Newman, Minshall, Lyon, and Hutson, "IP Switching and Gigabit Routers", Ipsilon Networks Inc.

