Supporting Differentiated Service Classes in Large IP Networks Copyright 2001, Juniper Networks, Inc.
Executive Summary
This white paper is the introduction to a series of papers published by Juniper Networks, Inc. that describe the support of differentiated service classes in large IP networks. This overview presents the motivations for deploying multiple service classes, the fundamentals of statistical multiplexing, and the impact of statistical multiplexing on the quality of service delivered by a network in terms of packet throughput, delay, jitter, and loss. We also provide a brief history of the various approaches that have been proposed to support differentiated service classes, a description of the IETF DiffServ architecture, and general observations about what you can expect from the deployment of multiple service classes in your network. The other papers in this series provide technical discussions of queue scheduling disciplines, queue memory management, host TCP congestion-avoidance mechanisms, and other issues related to the deployment of multiple service classes in your network.
Perspective
Service provider IP networks have traditionally supported only public Internet service. Initially, Internet applications (e-mail, remote login, file transfer, and Web access) were not considered mission-critical and did not have specific performance requirements for throughput, delay, jitter, and packet loss. As a result, a single best-effort class of service (CoS) was adequate to support all Internet applications. However, the commercial success of the Internet has caused all of this to change, thus affecting service providers in several ways.
- Your IP network is now the single largest consumer of bandwidth, or is rapidly growing toward that point.
- Your network's 24/7 availability and reliability are even more imperative. Internet services have become mission-critical. For some organizations, such as online retailers or stock markets, an hour-long network outage can be extremely expensive.
- You need to differentiate your company from the competition by offering a range of service classes with service-level agreements (SLAs) that are specifically tailored to meet your customers' and their customers' requirements.
- You want to offer better classes of service to your premium customers and charge more for those services.
- You are probably considering offering services such as voice-over-IP (VoIP) or virtual private networks (VPNs) that have more rigid performance requirements than traditional Internet applications. You may also be considering deploying a variety of services over a shared IP infrastructure, each of which has different performance requirements. In a multiservice IP network, IP routers, rather than Frame Relay switches, ATM switches, or voice switches, are used to access the transmission network.
- A larger service portfolio allows you to attract and keep new customers.
- Converged networks minimize your operating expenses, because you have fewer networks to manage.
- A packet-based network maximizes bandwidth efficiency through the use of statistical multiplexing.
There are two fundamentally different approaches to supporting the delivery of multiple service classes in large IP networks. One approach is simply to overprovision the network and throw raw bandwidth at the problem. The other approach is to build a CoS-enabled backbone based on bandwidth management. Those who favor overprovisioning argue that:
- The additional cost and complexity of managing traffic outweighs the gain it provides in bandwidth efficiency.
- It is very difficult to monitor, verify, and account for multiple service classes in large IP networks.
- You already have other CoS-enabled infrastructures (TDM and ATM) that you can use to support services that have strict performance requirements.

Those who favor bandwidth management argue that:
- Bandwidth management allows you to optimize bandwidth utilization and run your network at close to its maximum capacity.
- New applications emerge, you deploy new networking equipment, and bandwidth arrives in discrete chunks. These events rarely occur in a coordinated manner, and traffic management allows you to control bandwidth and smoothly handle mismatches in network capacity as these transitions occur.
- Bandwidth management allows you to increase your revenue by selling multiple service classes over a shared infrastructure, such as a converged IP/MPLS backbone. A converged infrastructure allows you to reduce your operating expenses, to use a single access technology, and to market a wide range of integrated products, such as Internet access, VPN access, and videoconferencing.

While the arguments for both of these approaches are convincing, the costs of the two approaches are roughly equal. Initially, the deployment of bandwidth management in your network involves simply enabling specific router functions. However, there are a number of hidden training, operational, and maintenance costs involved in successfully managing bandwidth in a production network. Also, while it is relatively easy to understand how to manage bandwidth from an engineering perspective, service providers have very little practical experience in supporting, debugging, tuning, and accounting for multiple service classes in large IP networks. On the other hand, if you do not have the ability to throttle traffic to some degree, even a network of enormous bandwidth can be overrun by misbehaving applications to the point that mission-critical and delay-sensitive services are severely impacted.

Successful providers will adopt a solution that combines overprovisioned bandwidth and MPLS traffic engineering, to minimize the long-term average level of congestion, with Integrated Services (IntServ) and Differentiated Services (DiffServ), to address the requirements of delay- and jitter-sensitive traffic during short-term periods of congestion. It is only through a combination of technologies that you will be able to support the delivery of differentiated service classes on a large scale and at a reasonable cost.
Fundamentals of Differentiated Services
To support business objectives that require multiple service classes, there is growing interest in the mechanisms that make it possible to deliver differentiated traffic classes over a common IP infrastructure. Because these mechanisms are widely misunderstood, we begin with a discussion of some of the fundamental concepts that are relevant to the deployment of differentiated service classes.
Classic Time-division Multiplexing vs. Statistical Multiplexing
Network transmission facilities are an expensive resource, as you know. Multiplexing can save you money by allowing many different data flows to share a common physical transmission path, rather than requiring that each flow have a dedicated transmission path. There are currently two basic types of multiplexing used in data communications:
- Time-division multiplexing (TDM): The transmission facility is divided into multiple channels by allocating the facility to several different channels, one at a time.
- Frequency-division multiplexing (FDM): The transmission facility is divided into multiple channels by using different frequencies to carry different signals.

Within TDM, there are two methods of arbitrating bandwidth on an output port: the static allocation of fixed-sized time slots and the dynamic allocation of variable-sized time slots. Classic TDM devices switch traffic by using static arbitration to allocate input bandwidth to an equal amount of output bandwidth and by mapping traffic to a specific output time slot. Packet switches use dynamic arbitration, with bandwidth allocated on demand on a per-packet basis.
Classic Time-division Multiplexing (TDM)
Classic time-division multiplexing (TDM) is a technique that is applied to circuit-switched networks. TDM assumes that data streams are organized into bits, bytes, or words rather than packets. Figure 1 illustrates the basic concept behind TDM.
Figure 1: Classic Time-division Multiplexing (TDM)

Although the following description is not the classic definition of TDM, it is sufficient to provide a background for our discussion of differentiated service classes. At the ingress end of the shared link, the TDM multiplexer samples and then interleaves the five discrete input data streams in a round-robin fashion, granting each stream the entire bandwidth of the shared link for a very short time. TDM guarantees that the bandwidth of the output link is never less than the sum of the rates of the individual input streams, because, at input, each unit of bandwidth is mapped at configuration time to an equal-sized unit of bandwidth on the output link. At the egress end of the shared link, the TDM demultiplexer processes the traffic and reconstructs the five individual data streams.

There are two key features of classic TDM that are relevant to our discussion of supporting multiple service classes:
- First, it is not necessary to buffer data when input streams are multiplexed onto the shared output link, because the capacity of the output link is always greater than or equal to the sum of the rates of the individual input streams.
- Second, classic TDM leads to an aggregate underutilization of bandwidth on an output port. Assuming that you are transmitting packet data over a classic TDM system, each input channel consumes somewhere between zero percent and 100 percent of its available bandwidth, depending on the burstiness of the application. If you add up the unused bandwidth across all of the channels in your system, you may find an overall bandwidth utilization on the output port of only 10 to 15 percent, depending on the specific behavior of your traffic.

Two common examples of classic TDM in large carrier or provider networks are:
- A T-1 multiplexer with 28 T-1 circuits on the input side and one DS-3 circuit on the output side, or
- A SONET multiplexer with four OC-12c/STM-4 circuits on the input side and one OC-48c/STM-16 circuit on the output side.
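The round-robin interleaving performed by a TDM multiplexer can be sketched in a few lines of Python. This is a simplified model with three equal-rate streams, not a model of any particular device:

```python
def tdm_mux(streams):
    """Interleave equal-length input streams one unit at a time, round robin."""
    return [stream[i] for i in range(len(streams[0])) for stream in streams]

def tdm_demux(slots, n_streams):
    """Reconstruct the original streams by taking every n-th slot."""
    return [slots[i::n_streams] for i in range(n_streams)]

streams = [["A1", "A2"], ["B1", "B2"], ["C1", "C2"]]
slots = tdm_mux(streams)                      # ['A1', 'B1', 'C1', 'A2', 'B2', 'C2']
assert tdm_demux(slots, len(streams)) == streams
```

Note that every stream is granted a slot on every pass whether or not it has data to send; idle slots are transmitted empty, which is exactly the aggregate underutilization described above.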
Statistical Multiplexing
Statistical multiplexing is designed to support packet-switched networks by dynamically allocating variable-length time slots on an output port. Statistical multiplexing devices assume that data flows are organized into packets, frames, or cells rather than bits, bytes, or words. Figure 2 illustrates the basic concept behind statistical multiplexing.
Figure 2: Statistical Multiplexing
Unlike classic TDM devices, a statistical multiplexing device does not map each unit of input bandwidth to an equal-sized unit of bandwidth on an output port. Statistical multiplexing dynamically allocates bandwidth on an output port only to active input streams, making better use of the available bandwidth and allowing more streams to be transported across the shared port than with other multiplexing techniques.
A packet, frame, or cell arriving on one port of a statistical multiplexing device can potentially exit from any other port of the device. The specific output port is determined by the result of a lookup based on the contents of the packet header: a MAC address, a VPI/VCI, a DLCI, or an IP address. This means that there may be times when more packets, frames, or cells need to be transmitted from a port than the given port has bandwidth to support. When this occurs, the statistical multiplexing device places the oversubscribed packets, frames, or cells into a buffer (queue) that is associated with the output port. The buffer absorbs packets during the extremely short periods of time when the output port experiences congestion. Common examples of statistical multiplexing devices in large carrier or provider networks include:
- IP routers,
- Ethernet switches, and
- Frame Relay switches.
Optimal Buffer Size
Determining the optimal size for a packet buffer is critical, because providing a packet buffer that is too small is just as bad as providing a packet buffer that is too large.
- Small packet buffers can cause packets from bursts to be dropped. This forces a host TCP to reduce its transmission rate by returning to slow-start or congestion-avoidance mode, which can severely reduce the session's overall packet throughput rate.
- Large packet buffers at each hop can cause the total round-trip time (RTT) to increase to the point where packets that are waiting in buffers in the core of a network are retransmitted by the source TCP even though they have not been dropped. A source TCP maintains a retransmission timer that it uses to decide when it should start retransmitting lost packets if it does not receive an ACK from the destination TCP.

Optimally, a router buffer needs to be large enough to absorb the burstiness of traffic flows but small enough that the RTT remains relatively small, so that packets waiting in queues are not mistakenly retransmitted. The amount of memory that needs to be assigned to each queue is determined by the speed of the link, the behavior of the traffic, and the characteristics of the higher-layer transport protocol that provides flow control. For a queue designed to support UDP-based, real-time applications, such as VoIP, a large packet buffer is not desirable, because it can increase end-to-end delay. However, for a queue designed to support TCP-based applications, optimal performance requires that the bandwidth-delay buffer size be calculated using the following formula:

Buffer_Size = (Port bandwidth) * (longest RTT of any flow forwarded across the port)

For example, the size of the buffer required to support a maximum round-trip delay of 100 ms on an OC-48c/STM-16 port is ~32 MB.
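The bandwidth-delay formula can be checked with a short calculation. The OC-48c line rate of 2488.32 Mbps is the standard SONET figure; the 100 ms RTT is the example from the text:

```python
def buffer_size_bytes(port_bps: float, longest_rtt_s: float) -> float:
    """Bandwidth-delay product: bits in flight during one RTT, expressed in bytes."""
    return port_bps * longest_rtt_s / 8

OC48_BPS = 2488.32e6                      # OC-48c/STM-16 line rate
size = buffer_size_bytes(OC48_BPS, 0.100)
print(f"{size / 1e6:.1f} MB")             # ~31 MB, the "~32 MB" cited above
```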
Bandwidth Oversubscription
Voice networks have always been oversubscribed, in that dedicated bandwidth is not reserved for each potential voice user. Carriers can oversubscribe their voice networks because there are far more voice subscribers than there are voice calls at any given moment. Generally, it is easier to provision a voice network than a data network, because you have a much better understanding of the call activity you expect to see at any time of the day than you do of the amount of data traffic your network will be required to transport at the same time. However, we have all experienced situations in which all circuits are busy during catastrophic events.
In packet-based networks, statistical multiplexing takes advantage of the fact that each host attached to a network is not always active, and when it is active, data is transmitted in bursts. As a result, statistical multiplexing allows you to oversubscribe network resources and support a greater number of flows than classic TDM using the same amount of bandwidth. This is known as the statistical multiplexing gain. Figure 3 shows three hosts transmitting data. When the network uses classic TDM to access the output port, certain time slots remain empty, which causes the bandwidth of those time slots to be wasted. In contrast, when the network uses statistical multiplexing to access the output port, empty time slots are not transmitted, so this extra bandwidth can be used to support the transmission of other statistically multiplexed flows.
Figure 3: Classic TDM vs. Statistical Multiplexing
Let's examine typical oversubscription numbers used by large service providers. Core links are typically oversubscribed by a factor of 2X, while access links are generally oversubscribed by a factor of 8X (the potential capacity feeding the core is eight times what the core can transport). As long as the queues in the network usually remain empty, the network will continue to provide satisfactory performance at these oversubscription levels. If the traffic patterns in the network are well understood, then it is possible to apply an oversubscription policy that ensures that queues do, in fact, usually remain empty. The oversubscription capabilities supported by statistical multiplexing devices offer monetary savings. For example, an oversubscription policy of 20 percent allows packets from almost 23 E-3 circuits (775 Mbps) to be aggregated onto a single OC-3/STM-1 circuit (155 Mbps).
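The E-3/OC-3 arithmetic works out as follows. A 20 percent policy is read here as the trunk rate being 20 percent of the aggregate access capacity (a 5:1 ratio); the E-3 rate of 34.368 Mbps is the standard figure, and the 155 Mbps OC-3 rate is the round number quoted in the text:

```python
E3_MBPS = 34.368    # standard E-3 line rate
OC3_MBPS = 155.0    # OC-3/STM-1 rate as quoted in the text
policy = 0.20       # trunk capacity as a fraction of aggregate access capacity

aggregate_input = OC3_MBPS / policy       # 775 Mbps of access capacity
n_circuits = aggregate_input / E3_MBPS    # ~22.5, i.e. "almost 23" E-3 circuits
print(round(aggregate_input), round(n_circuits, 1))
```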
Statistical Multiplexing and Multiple Service Classes
As a foundation to our discussion of differentiated service classes, there are two key features to keep in mind regarding statistical multiplexing:

- Statistical multiplexing requires packet buffering during transient periods of congestion, when the output-port bandwidth is momentarily less than the sum of the rates of the input flows seeking to use that bandwidth.
- Statistical multiplexing provides significantly better utilization of the output-port bandwidth than classic TDM. This enhanced utilization can be approximately four times (4X) greater than with classic TDM, depending on the specific traffic flows. The higher utilization of output-port bandwidth is the key benefit of statistical multiplexing when compared with classic TDM.
Best-effort Delivery
IP routers perform statistical multiplexing because they are packet switches. The Internet Protocol is a datagram protocol, where each packet is routed independently of all other packets without the concept of a connection. IP has traditionally offered only a single class of service, known as best-effort delivery, where all packets traversing the network were treated with the same priority. Best-effort means that IP makes a reasonable effort to deliver each datagram to its destination with uncorrupted data, but there are no guarantees that a packet will not be corrupted, duplicated, reordered, or misdelivered. Additionally, there are no promises with respect to the amount of throughput, delay, jitter, or loss that a traffic stream will experience. The network makes a best-effort attempt to satisfy its clients and does not arbitrarily discard packets. However, best-effort service without the support of intelligent transport protocols would lead to chaos. The only reason that best-effort works in global IP networks is that TCP does not compromise the network when it experiences congestion, but rather detects and then responds smoothly to packet loss by reducing its transmission rate. TCP is the basic building block that makes the best-effort queue the most well-behaved queue in a router, because it backs off when it experiences congestion.
Best-effort delivery is not a pejorative term. In fact, the ability to support a single best-effort service has allowed large IP networks and the Internet to become what they are today: the unchallenged technology of choice for supporting mission-critical applications at a global scale. However, there are a number of perceived issues related to IP's ability to support only a single best-effort class of service and the potential impact on IP's continued commercial success. Some carriers and providers see the need to offer multiple service levels if they are to support the deployment of new services, each with different performance requirements, over a shared IP infrastructure.
Differentiated Service Classes
Supporting multiple service classes for specific applications or customers is concerned with treating packets that belong to certain data streams differently from packets that belong to other data streams. Multiple service classes are all about providing managed unfairness to certain traffic classes. Differentiated service levels are supported by manipulating the key attributes of certain streams to change the customer's perception of the quality of service that the network is delivering. These attributes include:
- The amount of data that can be transmitted per unit of time (throughput),
- The amount of time that it takes for data to be transmitted from one point to another point in the network (delay or latency),
- The variation in this delay over time (jitter) for consecutive packets in a given flow, and
- The percentage of transmitted data that does not arrive at its destination correctly (loss).

However, the quality of service provided to a given service class can be only as good as the lowest quality of service delivered by the weakest link in the end-to-end path.
The concept of multiple service classes is not applicable to classic TDM services, because if a TDM link is up, then bandwidth, delay, and jitter are constant, and packet loss is zero. Any errors that occur result from the link failing outright: bandwidth goes to zero, delay goes to infinity, and loss goes to 100 percent. For classic TDM services, the concept of differentiated service classes involves providing different uptime commitments and meeting different customer service requirements, restoration times, and so forth.

The need to provide multiple service classes for customers or applications applies much more to the delivery of statistically multiplexed services. This is because specific packet flows of interest traverse several routers, and the quality of service perceived by individual users is a function of the way that statistical multiplexing is performed at each hop in the path, as well as the characteristics of the individual links in the path. By treating some packets differently from others when performing statistical multiplexing, a network of routers can offer different kinds of throughput, delay, jitter, and loss for different packet flows.

Finally, supporting differentiated service classes through bandwidth reservations or lower oversubscription factors for higher-priority services results in a less efficient use of network bandwidth than providing only a single best-effort statistical multiplexing service. However, you can compensate for your lower bandwidth efficiency by charging your subscribers a premium for higher-priority services.
NOTE: Once you make the business decision to offer multiple levels of service, it is important to perform the analysis that is necessary to determine exactly how much more you need to charge your subscribers to maintain your profit margins and to compensate for your loss of bandwidth efficiency.
The Impact of Statistical Multiplexing on Perceived Quality of Service
In this section, we examine how the statistical multiplexing performed by routers can influence the user's perception of the quality of service delivered by a network. The quality of service attributes that can be affected by statistical multiplexing include:
- Throughput,
- Delay,
- Jitter, and
- Loss.
Throughput
Throughput is a generic term used to describe the capacity of a system to transfer data. It is easy to measure the throughput of a TDM service, because the throughput is simply the bandwidth of the transmission channel. For example, the throughput of a DS-3 circuit is 45 Mbps. However, for TCP/IP statistically multiplexed services, throughput is much harder to define and measure, because there are numerous ways it can be calculated, including:
- The packet or byte rate across the circuit,
- The packet or byte rate of a specific application flow,
- The packet or byte rate of host-to-host aggregated flows, or
- The packet or byte rate of network-to-network aggregated flows.
The most direct way that a router's statistical multiplexing can be tuned to affect throughput is by the amount of bandwidth it allocates to different types of packets.
- In classic best-effort service, the router does not specifically control the amount of bandwidth assigned to different traffic classes. Instead, during periods of congestion, all packets are placed into a single first-in, first-out (FIFO) queue. When faced with congestion, User Datagram Protocol (UDP) flows continue to transmit at the same rate, but TCP flows detect and then react to packet loss by reducing their transmission rate. As a result, UDP flows end up consuming the majority of the bandwidth on the congested port, while each TCP flow receives a roughly equal share of the leftover bandwidth.
- When attempting to support differentiated treatment for different traffic classes, each class of traffic can be given a different share of output-port bandwidth. For example, a router can be configured to allocate different amounts of bandwidth to each class of traffic on the output port; one class of traffic can be given strict priority over all other classes; or one class of traffic can be given strict priority with a bandwidth limit (to prevent the starvation of the other classes). The support of differentiated service classes implies the use of more than just a single FIFO queue on each output port.
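As an illustration of the last option, strict priority with a bandwidth limit, here is a toy scheduler. The packet names, sizes, and per-round byte budget are hypothetical, and real routers implement this in hardware with token buckets rather than Python loops:

```python
from collections import deque

def schedule_round(priority_q, best_effort_q, priority_budget_bytes):
    """Serve the priority queue first, but cap it at a byte budget per round
    so that best-effort traffic is never starved."""
    order = []
    budget = priority_budget_bytes
    while priority_q or best_effort_q:
        if priority_q and budget >= priority_q[0][1]:
            name, size = priority_q.popleft()
            budget -= size
        elif best_effort_q:
            name, size = best_effort_q.popleft()
        else:
            break   # priority packets remain, but their budget is spent
        order.append(name)
    return order

voice = deque([("v1", 200), ("v2", 200), ("v3", 200)])
web = deque([("w1", 1500), ("w2", 1500)])
print(schedule_round(voice, web, priority_budget_bytes=400))
```

With a 400-byte budget, the first two voice packets go out ahead of everything else, but the web packets are still served once the budget is exhausted.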
Delay
Delay (or latency) is the amount of time that it takes for a packet to be transmitted from one point in a network to another point in the network. There are a number of factors that contribute to the amount of delay experienced by a packet as it traverses your network:
- Forwarding delay,
- Queuing delay,
- Propagation delay, and
- Serialization delay.

Figure 4 illustrates that the end-to-end delay can be calculated as the sum of the individual forwarding, queuing, serialization, and propagation delays occurring at each node and link in your network.
Figure 4: End-to-end Delay Calculation
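The Figure 4 calculation is a straight sum over nodes and links. A sketch with made-up per-hop numbers (the values are illustrative, not measurements):

```python
# Each hop contributes forwarding + queuing delay at the node and
# serialization + propagation delay on the outgoing link (all in ms).
hops = [
    {"forwarding": 0.05, "queuing": 0.20, "serialization": 0.0048, "propagation": 5.0},
    {"forwarding": 0.05, "queuing": 0.00, "serialization": 0.0048, "propagation": 12.0},
]
end_to_end_ms = sum(sum(hop.values()) for hop in hops)
print(round(end_to_end_ms, 4))   # ~17.31 ms for this two-hop example
```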
However, when examining the causes of application delay in your network, it is important to remember that the routers represent only a part of the end-to-end path and that you must also consider several other factors:
- The performance bottlenecks within hosts and servers,
- Operating system scheduling delays,
- Application resource contention delays,
- Physical layer framing delays,
- CODEC encoding, compression, and packetization delays,
- The quality of the different TCP/IP implementations running on these end systems, and
- The stability of routing in the network.

Sources of Network Delay

In this section, we examine each of the sources of delay: forwarding, queuing, propagation, and serialization delay.

Forwarding Delay

Forwarding delay is the amount of time that it takes a router to receive a packet, make a forwarding decision, and then begin transmitting the packet through an uncongested output port. This represents the minimum amount of time that it takes the router to perform its basic function and is typically measured in tens or hundreds of microseconds (0.000001 sec). Other than deploying industry-standard hardware-based routers, you have no real control over forwarding delay.

Queuing Delay

Queuing delay is the amount of time that a packet has to wait in a queue, as the system performs statistical multiplexing and other packets are serviced ahead of it, before it can be transmitted on the output port. The queuing delay at a given router can vary over time from zero seconds, for an uncongested link, to the sum of the times that it takes to transmit each of the other packets queued ahead of it. During periods of congestion, queue memory management and queue scheduling disciplines allow you to control the amount of queuing delay experienced by different classes of traffic placed in different queues.

Propagation Delay

Propagation delay is the amount of time that it takes for electrons or photons to traverse a physical link. The propagation delay is based on the speed of light and is measured in milliseconds (0.001 sec). When estimating the propagation delay across a point-to-point link, you can assume one millisecond (1 ms) of propagation delay per 100 miles (160 km) of round-trip distance. Consequently, the speed-of-light propagation RTT delay from San Francisco to New York (6,000 mi, 9,654 km round trip) is between 60 and 70 ms (0.060 sec to 0.070 sec). Because you can't change the speed of light in optical fiber, you have no control over propagation delay.
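The rule of thumb above, roughly 1 ms of round-trip propagation delay per 100 round-trip miles, is easy to apply. The city-pair distance is the one used in the text:

```python
def propagation_rtt_ms(round_trip_miles: float) -> float:
    """Speed-of-light RTT in fiber, using the ~1 ms per 100 round-trip miles rule."""
    return round_trip_miles / 100.0

print(propagation_rtt_ms(6000))   # San Francisco-New York round trip: 60.0 ms
```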
It is interesting to note that the speed of light in optical ber is approximately 65 percent of the speed of light in a vacuum, while the speed of electron propagation through copper is slightly faster, at 75 percent of the speed of light. Although the signal representing each bit travels slightly faster in copper than in ber, ber has numerous advantages over copper, because it results in fewer bit errors, supports longer cable runs between repeaters, and allows more bits to be packed into a given length of cable. For example, a 10 Mbps copper interface (traditional Ethernet) transports 78 bits per mile (124 bits per km), resulting in a 1500-byte packet that is Supporting Differentiated Service Classes in Large IP Networks Copyright 2001, Juniper Networks, Inc. 14 154 miles (248 km) long. In contrast, a 2.488 Gbps ber interface (OC-48c/STM-16) transports 19,440 bits per mile (31,104 bits per km), creating a 1500-byte packet that is only 3,260 feet (994 m) long. Serialization Delay Serialization delay is the amount of time that it takes to place the bits of a packet onto the wire when a router transmits a packet. Serialization delay is measured in milliseconds (ms, or 0.001 sec) and is a function of the size of the packet and the speed of the port. Since there is no practical mechanism to control the size of the packets in your network (other than reducing the MTU or forcing packet fragmentation), the only action you can take to reduce serialization delay is to install higher-speed router interfaces. Table 1 displays the serialization delay for various packet sizes and different port speeds. Table 1: Serialization DelayPacket Size vs. Port Speed From Table 1, you can see that it takes 7.7 ms to place a 1500-byte packet on a DS-1 circuit. This is a signicant amount of time if you consider that the typical one-way propagation delay from San Francisco to New York (3000 mi, 4827 km) is between 30 and 35 ms. 
On the other hand, the serialization delay for a 1500-byte packet on an OC-192c/STM-64 port is only 0.0012 ms. In a network consisting of high-speed interfaces, serialization delay contributes an insignificant amount to the overall end-to-end delay. However, in a network consisting of low-speed interfaces, serialization delay can contribute significantly to the overall end-to-end delay.

Table 1: Serialization Delay — Packet Size vs. Port Speed

Packet Size   DS-1        DS-3       OC-3       OC-12      OC-48      OC-192
40 byte       0.2073 ms   0.0072 ms  0.0021 ms  0.0005 ms  0.0001 ms  < 0.0001 ms
256 byte      1.3264 ms   0.0458 ms  0.0132 ms  0.0033 ms  0.0008 ms  0.0002 ms
320 byte      1.6580 ms   0.0572 ms  0.0165 ms  0.0041 ms  0.0010 ms  0.0003 ms
512 byte      2.6528 ms   0.0916 ms  0.0264 ms  0.0066 ms  0.0016 ms  0.0004 ms
1500 byte     7.7720 ms   0.2682 ms  0.0774 ms  0.0193 ms  0.0048 ms  0.0012 ms
4470 byte     23.1606 ms  0.7994 ms  0.2307 ms  0.0575 ms  0.0144 ms  0.0036 ms
9180 byte     47.5648 ms  1.6416 ms  0.4738 ms  0.1181 ms  0.0295 ms  0.0074 ms

Managing Delay While Maximizing Bandwidth Utilization

Given that the only component of end-to-end delay that you can actually control is queuing delay, support for differentiated service classes is based on managing the queuing delay experienced by different traffic classes during periods of network congestion. In the absence of active queue management techniques, such as Random Early Detection (RED), there is a direct relationship between the bandwidth utilization on a link and the round-trip time (RTT) delay. If you maintain a 5-minute weighted bandwidth utilization of 10 percent, there will be minimal packet loss and minimal RTT delay, because the output ports are generally underused. However, once the 5-minute weighted bandwidth utilization passes approximately 50 percent, the average RTT starts to increase exponentially as the load on your network increases. (See Figure 5.)

Figure 5: Bandwidth Utilization vs. Round-trip Time (RTT) Delay

The challenge when trying to manage delay is that, at the same time, you also need to maximize bandwidth utilization in your network for financial reasons. Bear in mind that bandwidth utilization statistics are meaningful only when the length of the circuit observation period is specified. If you measure bandwidth utilization over one nanosecond (0.000000001 sec), you get one of two values: zero percent or 100 percent utilization. If you measure the utilization of a circuit over 5 minutes, you get a reasonably damped average. Whenever we discuss bandwidth utilization here, we always mean a 5-minute weighted average.

A 5-minute weighted bandwidth utilization of 50 percent doesn't mean a steady 50 percent load. It means that there are short, sub-second intervals when utilization is close to 100 percent, queues fill up, and packets are dropped. It also means that there are other periods when the bandwidth utilization is close to zero percent, queue depth is zero, and packets are never dropped. A 5-minute weighted average utilization of 50 to 60 percent is considered heavy bandwidth utilization. If financial factors compel you to drive your utilization up to 70 or 75 percent, you dramatically increase the RTT delay, and the variation in RTT delay, for all applications running across your network.

So your dilemma is how to optimize the bandwidth utilization of your network while also managing queuing delays for delay-sensitive traffic. To find the solution, you must first determine which applications in your network can cope with increasing delay and delay variation. TCP-based applications are specifically designed to be rate-adaptive and to cope with delay, but other types of applications, such as real-time voice, are unable to operate smoothly when experiencing long delays or delay variation.
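The shape of the utilization-versus-delay curve in Figure 5 can be illustrated with the textbook M/M/1 queuing formula, in which mean delay scales as 1/(1 - utilization). This is an idealized model used here only to show why delay grows so sharply past 50 percent load, not a measurement of any real network:

```python
def mm1_delay(utilization: float, service_time: float) -> float:
    """Mean time in an M/M/1 queuing system: service_time / (1 - utilization).
    Delay grows without bound as utilization approaches 100 percent."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)

light = mm1_delay(0.10, 1.0)   # barely above the bare service time
heavy = mm1_delay(0.75, 1.0)   # four times the service time
```

Pushing the model from 50 to 90 percent utilization quintuples the mean delay, which matches the qualitative behavior the figure describes.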
Therefore, the solution to optimizing bandwidth utilization while also managing queuing delays is to isolate the applications that cannot handle delay from the 50 to 60 percent utilization class. You can accomplish this by placing packets from those applications into a dedicated queue that does not experience the aggregate delay caused by the high utilization of the circuit. In effect, you identify a certain set of applications, isolate those applications from other types of traffic by placing them into a dedicated queue, and then control the amount of queuing delay experienced by those specific applications.

There are three things to keep in mind with respect to delay in your network:

I In a well-designed and properly functioning network, queuing delay should average close to zero when measured over time. There will always be extremely short periods of congestion, but network links need to be properly provisioned. Otherwise, queuing delay increases rapidly, because you are asking too much traffic to cross an underprovisioned link.

I If you examine the relative impact of the factors other than queuing delay (forwarding, propagation, and serialization) that contribute to delay, propagation delay is the major source of delay by several orders of magnitude.

I The only delay factor that you can control is queuing delay. The challenge with the other factors is that you have no real control over them.

Jitter

Jitter is the variation in delay over time experienced by consecutive packets that are part of the same flow. (See Figure 6.) You can measure jitter by using a number of different techniques, including the mean, standard deviation, maximum, or minimum of the interpacket arrival times for consecutive packets in a given flow.
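The measurement techniques just listed are easy to express in code. A small sketch (the timestamps are invented for illustration) that computes interpacket-gap statistics for one flow:

```python
import statistics

def jitter_stats(arrival_times: list[float]) -> dict[str, float]:
    """Interpacket-gap statistics for consecutive packets of one flow.
    arrival_times: receive timestamps in seconds, in arrival order."""
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    return {
        "mean": statistics.mean(gaps),
        "stdev": statistics.pstdev(gaps),  # one common single-number jitter measure
        "max": max(gaps),
        "min": min(gaps),
    }

# Packets sent every 20 ms, arriving with variable queuing delay:
stats = jitter_stats([0.000, 0.021, 0.045, 0.060, 0.084])
```

With zero jitter, every gap would equal the sending interval and the standard deviation would be zero; the spread of the gaps is the jitter.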
Figure 6: Jitter Makes Packet Spacing Uneven

TDM systems can cause jitter, but the variation in delay is so small that, for all practical purposes, you can ignore it. In a statistically multiplexed network, the primary source of jitter is the variability of queuing delay over time for consecutive packets in a given flow. Another potential source of jitter is that consecutive packets in a flow may not follow the same physical path across the network, due to equal-cost load balancing or routing changes. Like delay, jitter increases exponentially with bandwidth utilization. You can see this by executing a number of pings across a highly used link: you will notice not only an increase in delay, but also an increase in the variation of delay.

There are a couple of other considerations relevant to jitter in statistically multiplexed networks:

I In statistically multiplexed networks, the end-to-end jitter is never constant, because the level of congestion in a network changes from place to place and from moment to moment. Unless you are assured that the transmission of a packet will begin immediately after a router's forwarding decision, the amount of delay introduced at each hop in an end-to-end path is variable.

I ATM has traditionally supported real-time traffic by using 53-byte cells as a way to place an upper bound on the amount of delay that a cell is subject to at any single network node. The point is that a 53-byte transmission time is much shorter than a 1500-byte transmission time.

Impact of Jitter on Perceived QoS

Some applications are unable to handle jitter.

I With interactive voice or video applications, jitter can result in a jerky or uneven quality to the sound or image.
The solution is to properly provision the network, including the queue scheduling discipline, and to condition traffic so that jitter stays within acceptable limits. The jitter that remains can be handled by a short playback buffer on the destination host, which buffers packets briefly before playing them back as a smoothed data stream.

I For emulated TDM service over a statistically multiplexed network, jitter outside of a narrowly defined range can introduce errors. The solution is to properly provision the network, including priority queuing, and to condition traffic at the edges of the network so that jitter stays within a predefined range.

However, there are other types of applications (such as those that run over TCP/IP) for which jitter is not a problem. Also, for non-interactive applications, such as streaming voice or video, jitter does not present serious problems, because it can be overcome by using large playback buffers.

Loss

There are three sources of packet loss in an IP network, as illustrated in Figure 7:

I A break in a physical link that prevents the transmission of a packet,

I A packet that is corrupted by noise, detected by a checksum failure at the downstream node, and

I Network congestion that leads to buffer overflow.

Figure 7: Sources of Packet Loss in IP Networks

Breaks in physical links do occur, but they are rare, and the combination of self-healing physical layers and redundant topologies responds dynamically to this source of packet loss. With the exception of wireless networking, when using modern physical-layer technologies the chance of packet corruption is statistically insignificant, so you can ignore this source of packet loss as well. Consequently, the primary cause of packet loss in a non-wireless IP network is buffer overflow resulting from congestion. The amount of packet loss in a network is typically expressed as the probability that a given packet will be discarded by the network.
IP networks do not carry a constant load; traffic is bursty, and this causes the load on the network to vary over time. There are periods when the volume of traffic that the network is asked to carry exceeds the capacity of some of the components in the network. When this occurs, congested network nodes attempt to reduce their load by discarding packets. When the TCP/IP stack on a host system detects a packet loss, it assumes that the loss is due to congestion somewhere in the network.

Packet Loss Can Be Good

It is important to understand that packet loss in an IP network is not always a bad thing. Each TCP session seeks all of the bandwidth that it can for its flow, but it must find the maximum bandwidth without causing sustained congestion in the network. TCP accomplishes this by transmitting slowly at the beginning of each session (slow start), and then increasing the transmission rate until it eventually detects the loss of a packet. Since TCP understands that a packet drop means congestion is present at some point in the network, it reacts by temporarily reducing the transmission rate of the flow. Given enough time, each TCP flow eventually settles on the maximum bandwidth it can get across the network without experiencing sustained congestion. When multiple TCP flows do this in parallel, the result is fairness for all TCP sessions across the network. Thus, occasional packet loss is good, because each TCP session needs to experience some amount of packet loss to find all of the bandwidth that it can for its flow.

Host response to network congestion is the same whether your IP network runs over packets or over an ATM infrastructure, because TCP congestion-avoidance mechanisms are executed at the transport layer, not the data link layer.
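The probe-until-loss behavior described above can be caricatured in a few lines. This is a deliberately simplified additive-increase/multiplicative-decrease sketch (real TCP manages a congestion window in bytes, with distinct slow-start and congestion-avoidance phases), but it shows the sawtooth that each flow traces around the available bandwidth:

```python
def simulate_aimd(capacity: float, rounds: int) -> list[float]:
    """Per-round sending rate of one flow probing for bandwidth."""
    rate, rates = 1.0, []
    for _ in range(rounds):
        rates.append(rate)
        if rate > capacity:   # loss detected: congestion somewhere in the path
            rate /= 2         # multiplicative decrease in response
        else:
            rate += 1.0       # additive increase, probing upward
    return rates

rates = simulate_aimd(capacity=10.0, rounds=30)
# After the initial ramp, the rate saws between roughly capacity/2 and capacity.
```

When many flows run this loop independently against the same link, each backs off on its own losses, which is the fairness mechanism the text describes.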
An ATM transport does not possess special properties that allow you to better control the amount of traffic that a host injects into your network, because the applications are native-IP-based, not ATM-based. If you need to control end-system behavior, you still have to perform traffic policing or shaping at the ingress edges of your network. ATM can support this only at a relatively coarse level, because it is not aware of TCP/IP or the operation of its congestion-avoidance mechanisms. In fact, running TCP/IP over an ATM infrastructure has a number of well-known limitations (cell tax, the number of routing adjacencies required, the inability to identify IP packets in the core of the network without reassembly, and so forth) that may actually obscure congestion-avoidance issues, because there are more network layers that can hide the problem.

Mindful that a certain amount of packet loss is to be expected in any IP network, how can you support differentiated service classes for specific customers or applications by arranging for some packets to be treated differently from other packets with respect to packet loss? Assume that you offer a fixed amount of bandwidth between two points in your network. As long as the total amount of traffic sent along the path between these two points is less than or equal to the agreed-upon throughput, there should be minimal packet loss after TCP sessions stabilize. This assumption allows us to support the differentiated treatment of packets with respect to loss by deploying multiple queues on each port, rather than just a single FIFO queue. The output traffic stream is first classified, and then different types of packets are placed into different queues. Finally, each queue is given a different share of the port's bandwidth.
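One way to give each queue a share of the port, as just described, is weighted round-robin scheduling. The sketch below is illustrative only: the class names and weights are invented, and production routers schedule in hardware using byte-based deficits rather than packet counts:

```python
from collections import deque

# Hypothetical service classes; each queue is served up to its weight per
# round, approximating a proportional share of the port's bandwidth.
queues = {"voice": deque(), "business": deque(), "best-effort": deque()}
weights = {"voice": 4, "business": 3, "best-effort": 1}

def schedule(rounds: int) -> list[str]:
    """Return the class of each packet transmitted, in order."""
    sent = []
    for _ in range(rounds):
        for name, q in queues.items():
            for _ in range(weights[name]):
                if q:
                    sent.append(q.popleft())
    return sent

for name, q in queues.items():
    q.extend([name] * 5)   # five queued packets per class
out = schedule(2)
```

With these weights, the voice class receives half of the port's service opportunities while the best-effort class receives one eighth, even though all three queues are equally backlogged.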
As long as the amount of traffic placed into each of the queues is less than or equal to the agreed-upon bandwidth for the particular queue, each queue should experience minimal packet loss after the TCP sessions traversing the queue stabilize. (See Figure 8.)

Figure 8: Multiple Queues with Different Shares of a Port's Bandwidth

But what do you do if the amount of traffic placed into a given service class exceeds its agreed-upon throughput? This becomes a policy decision, with a number of options for managing the traffic load:

I Drop packets that are out-of-profile.

I Mark the packet, and then forward it with an increased drop probability. If the out-of-profile packet experiences congestion at a downstream node, it can be dropped before other, in-profile packets are dropped.

I Queue the packet, and then use traffic conditioning tools to control its rate on egress.

I Transmit an explicit congestion notification (ECN) by setting the congestion experienced (CE) bit in the header of packets sourced from ECN-capable transport protocols.

It is important to note that, up to this point, we have limited our discussion of packet loss to the case when a queue becomes 100 percent full. This mechanism is known as tail-drop queue management, because packets are dropped from the logical back, or tail, of the queue. (See Figure 9.)

Figure 9: Tail-drop Queue Management

Tail-drop queue management is a simple algorithm that is easy to implement. However, it does not discard packets fairly, because it allows a poorly behaved, bursty stream to consume all of a queue's resources, causing packets from other, well-behaved streams to be discarded because the queue is 100 percent full.
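A minimal sketch of the tail-drop behavior just described — arrivals beyond the queue's capacity are simply discarded (illustrative only; real router queues are sized in buffers or bytes, not packet counts):

```python
from collections import deque

class TailDropQueue:
    """Tail-drop FIFO: arrivals are dropped whenever the queue is full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.queue = deque()
        self.drops = 0

    def enqueue(self, packet) -> bool:
        if len(self.queue) >= self.capacity:
            self.drops += 1   # discarded at the tail; no choice of victim
            return False
        self.queue.append(packet)
        return True

    def dequeue(self):
        return self.queue.popleft() if self.queue else None

q = TailDropQueue(capacity=3)
accepted = [q.enqueue(n) for n in range(5)]   # a burst of 5 into a 3-packet queue
```

Note that the drop decision depends only on whether the queue is full, never on which stream filled it, which is exactly the unfairness described above.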
Although you should expect a limited amount of packet loss in any network, significant packet loss due to sustained congestion adversely affects the operation of your network in several ways:

I The exchange of routing information is disrupted, which can lead to route instability.

I The network is no longer able to absorb bursts of traffic.

I New TCP sessions cannot be established.

I Each of the existing TCP sessions traversing a heavily congested link begins to experience some amount of packet loss. Because packets from different sessions are interleaved, all of the sessions begin to experience packet loss at roughly the same time, and each of the individual sessions goes into slow start. This creates a phenomenon known as global TCP synchronization: all of the TCP sessions across the congested link become synchronized, resulting in periodic surges of traffic. The link alternates between heavy congestion, as each TCP session seeks its maximum bandwidth, and light use, as all of the sessions return to slow start when they again experience congestion. This cycle repeats itself over and over. Depending on where the congestion occurs in your network, this phenomenon can involve hundreds, thousands, or even tens of thousands of TCP sessions.

Random Early Detection (RED) is an active queue management mechanism that combats the problem of global TCP synchronization, while also introducing a degree of fairness into the discard-selection process.
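RED's discard decision, in its basic form, is a linear ramp: below a minimum average queue length nothing is dropped, above a maximum everything is, and in between the drop probability rises toward a configured maximum. A sketch of that ramp (the parameter values are illustrative; real RED also smooths the queue length with a moving average and spaces drops with a packet counter):

```python
def red_drop_probability(avg_qlen: float, min_th: float,
                         max_th: float, max_p: float) -> float:
    """Basic RED drop probability for the current average queue length."""
    if avg_qlen < min_th:
        return 0.0   # no congestion signal yet
    if avg_qlen >= max_th:
        return 1.0   # queue effectively full: drop every arrival
    # Linear ramp from 0 to max_p between the two thresholds.
    return max_p * (avg_qlen - min_th) / (max_th - min_th)

p = red_drop_probability(avg_qlen=20, min_th=10, max_th=30, max_p=0.1)
```

Because drops begin early, are probabilistic, and fall more often on heavy senders (who have more packets in flight), the affected TCP sessions back off at different times instead of all at once.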
A Brief History of Differentiated Services in Large IP Networks

The notion of providing more than just a single best-effort class of service has been part of the IP architecture for more than 20 years. In this section, we examine some of the historical approaches to supporting differentiated service classes in large IP networks.

The First Approach: RFC 791

In September 1981, RFC 791 standardized the Internet Protocol and reserved the second byte of the IP header as the type of service (ToS) field. The bits of the ToS byte were defined as Figure 10 shows.

Figure 10: RFC 791 Bit Definitions of the ToS Byte (bits 0-2: Precedence; bit 3: D; bit 4: T; bit 5: R; bits 6-7: Reserved)

The first three bits in the ToS byte (precedence bits) could be set by a node to select the relative priority or precedence of the packet. The next three bits could be set to specify normal or low delay (D), normal or high throughput (T), and normal or high reliability (R). The final two bits of the ToS byte were reserved for future use. However, very little architecture was provided to support the delivery of differentiated service classes in IP networks using these capabilities.

The only application of the IP precedence bits until the mid-1990s was to support a feature known as selective packet discard (SPD). SPD set the precedence bits for control packets (link-level keepalives, routing protocol keepalives, and routing protocol updates) so that, if the network experienced congestion, critical control traffic would be the last to be discarded. The goal was to enhance network stability during periods of congestion. In practice, the DTR bits were never used.

The Second Approach: The Integrated Services Model (IntServ)

Around 1993, comprehensive work began in the IETF to develop a mechanism that would allow IP to support more than a single best-effort class of service.
The goal was to provide real-time service simultaneously with traditional non-real-time service in a shared IP network. This work resulted in the development of the Integrated Services (IntServ) architecture, which is based on per-flow resource reservation.

IntServ Architecture

The IntServ architecture defined a reference model that specifies a number of different components and the interplay among them:

I The resource reservation setup protocol (RSVP), which allows individual applications to request resources from routers and then install per-flow state along the path of the packet flow.

I Two new service models: guaranteed service and controlled-load service. Guaranteed service provides firm assurances (through strict admission control, bandwidth allocation, and fair queuing) for applications that require guaranteed bandwidth and delay. Controlled-load service does not provide guaranteed bounds on bandwidth or delay; it emulates a lightly loaded, best-effort network.

I Flow specifications that provide a syntax allowing applications to state their specific resource requirements.

I A packet classification process that examines incoming packets and decides which of the various classes of service should be applied to each packet.

I An admission control process that determines whether a requested reservation can be supported, based on the availability of both local and network resources.

I A policing and shaping process that monitors each flow to ensure that it conforms to its traffic profile.

I A packet scheduling process that distributes network resources (buffers and bandwidth) among the different flows.

The IntServ model requires that source and destination hosts exchange RSVP signaling messages to establish packet classification and forwarding state at each node along the path between them. (See Figure 11.)
Figure 11: Resource Reservation Protocol (RSVP)

While people in the industry learned a tremendous amount during the development of the IntServ architecture, they eventually concluded that IntServ was not a suitable mechanism for supporting the delivery of differentiated service classes in large IP networks:

I IntServ is not scalable, because it requires significant amounts of per-flow state and packet processing at each node along the end-to-end path. In the absence of state aggregation, the amount of state that must be maintained at each node scales in proportion to the number of simultaneous reservations through that node. The number of flows on a high-speed backbone link could potentially range from tens of thousands to over a million.

I IntServ requires that applications running on end systems support the RSVP signaling protocol. Very few operating systems offered an RSVP API that application developers could access.

I IntServ requires that all nodes in the network path support the IntServ model, including the ability to map IntServ service classes to link-layer technologies.

While the IntServ model failed, it led to the development and deployment of RSVP, which we now use as a general-purpose signaling protocol for MPLS traffic engineering, fast LSP restoration, and the rapid provisioning of optical links (GMPLS or MPLambdaS). RSVP performs very well as a signaling protocol for MPLS because, in this application, it does not experience the scalability problems associated with IntServ.

An IntServ Enhancement: Aggregation of RSVP Reservations

As discussed above, one of the major scalability limitations of RSVP is that it cannot aggregate individually reserved sessions into a single, shared class. In September 2001, RFC 3175 ("Aggregation of RSVP for IPv4 and IPv6 Reservations") defined procedures that allow a single RSVP reservation to aggregate other RSVP reservations across a large IP network.
It proposed mechanisms to dynamically establish the aggregate reservation, identify the specific traffic to which the aggregate reservation applies, determine how much bandwidth is required to satisfy the reservation, and reclaim bandwidth when the subreservations are no longer required. RFC 3175 enhances the scalability of RSVP for use in large IP networks by:

I Reducing the number of signaling messages exchanged and the amount of reservation state that must be maintained, by making a limited number of large reservations rather than a large number of small, flow-specific reservations,

I Streamlining the packet classification process in core routers by using the Differentiated Services codepoint, or DSCP (see the discussion of DiffServ that follows), to identify an aggregated flow, instead of the traditional RSVP flow classification mechanism, and

I Simplifying packet queuing and scheduling by combining the aggregated streams into the same queue on an output port.

Among the potential applications for aggregation of RSVP reservations are these three:

I Interconnection of PSTN-call gateways across a provider backbone,

I Aggregation of RSVP paths at the edges of a provider network, and

I Aggregation of RSVP paths across the core of a provider network.

One of the strengths of RSVP is that it supports admission control on a per-flow basis. This can be a powerful tool when supporting premium interactive voice services. Assume that you establish an aggregated RSVP reservation to support 1000 voice calls. As long as there are fewer than 1000 active calls, a new call will be accepted by admission control, which allocates adequate bandwidth to support subscriber performance requirements.
The 1001st call will be denied by admission control, preserving the quality of service delivered to the 1000 established calls. As you will see in the next section, the DiffServ model performs admission control on a per-packet basis, not on a per-flow basis. This means that, at the edge of a DiffServ domain, calls 1001 through 1100 will be accepted but, because the service class is now out-of-profile, packets will be randomly dropped, degrading the quality of service delivered to all of the calls. You can overcome this limitation of DiffServ by using aggregated RSVP at the edges of the network to perform per-flow admission control for a voice gateway, combined with DiffServ in the core of the network to support application performance requirements across the backbone.

The Third Approach: The Differentiated Services Model (DiffServ)

Around 1995 or 1996, service providers and various academic institutions began to examine alternative approaches to supporting more than a single best-effort class of service, this time using mechanisms that could provide the requisite scalability. As discussed in the previous section, the failure of the IntServ model was due to the signaling explosion and the amount of per-flow state that had to be maintained at each node in the packet-forwarding path. As a result, all of the new proposals sought to avoid these scalability issues. Figure 12 illustrates the cost, relative to complexity, of the new approaches to supporting differentiated service classes.

Figure 12: Cost Relative to Complexity of Differentiated Services Solutions

At that time, there were a number of different proposals to redefine the meaning of the three precedence bits in the ToS byte of the IP header.
The proposals ranged from using a single bit, similar to the Frame Relay DE bit, to arbitrary bit definitions and even hybrid approaches, where some bits were used for certain functions and the remaining bits were used for other functions. There was a lot of talk, and some vendor code, but never any real production deployment. The lack of deployment was because routers were software-based, and any attempt to make the packet-forwarding process more complicated affected forwarding performance, so it was simply easier to overprovision congested links.

By 1997, the IETF realized that IntServ was not going to be deployed in production networks, and that the commercial sector had been thinking about supporting differentiated service classes for specific customers or applications in a more coarse-grained and more scalable way by using the IP precedence bits. As a result, the IETF created the DiffServ Working Group, which met for the first time in March 1998. The goal of this group was to create relatively simple and coarse methods of providing differentiated classes of service for Internet traffic, to support various types of applications and specific business models.

The IETF Architecture for Differentiated Services

The DiffServ Working Group changed the name of the IPv4 ToS octet to the DS byte and defined new meanings for each of the bits. (See Figure 13.) The new specification for the DS Field applies to both the IPv4 ToS octet and the IPv6 traffic class octet, so that they use a common set of mechanisms to support the delivery of differentiated service classes.

Figure 13: Differentiated Services Field (DS Field)

The IETF's DiffServ Working Group divides the DS byte into two subfields:

I The six high-order bits are known as the Differentiated Services codepoint (DSCP).
The DSCP is used by a router to select the per-hop behavior (PHB) that a packet experiences at each hop within a Differentiated Services domain. A PHB is an externally observable forwarding treatment applied to all packets that belong to the same service class, or behavior aggregate (BA).

I The two low-order bits are currently unused (CU) and reserved for future use. These two bits are presently set aside for the explicit congestion notification (ECN) experiment. The values of the CU bits are ignored by each node when it determines the PHB to apply to a packet.

The complete DiffServ architecture, defined in RFC 2475, is based on a relatively simple model, whereby traffic that enters a network is first classified, and then possibly conditioned, at the edges of the network. Depending on the result of the packet classification process, each packet is associated with one of the BAs supported by the Differentiated Services domain. The BA to which each packet is assigned is indicated by the specific value carried in the DSCP bits of the DS Field. When a packet enters the core of the network, each router along the transit path applies the appropriate PHB, based on the DSCP carried in the packet's header. It is this combination of traffic conditioning (policing and shaping) at the edges of the network, packet marking at the edges of the network, local per-class forwarding behaviors in the interior of the network, and adequate network provisioning that allows the DiffServ model to support scalable service discrimination across a common IP infrastructure.

Differentiated Services Domain (DS Domain)

A Differentiated Services domain (DS domain) is a contiguous set of routers that operate with common sets of service provisioning policies and PHB group definitions. (See Figure 14.)
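The DSCP/CU split of the DS byte described above is a simple bit-field division, which a sketch in code makes concrete (DSCP 101110 is the recommended EF codepoint discussed later in this paper):

```python
def dscp(ds_byte: int) -> int:
    """Six high-order bits of the DS byte: the Differentiated Services codepoint."""
    return (ds_byte >> 2) & 0x3F

def cu_bits(ds_byte: int) -> int:
    """Two low-order bits: currently unused, set aside for the ECN experiment."""
    return ds_byte & 0x03

EF = 0b101110          # recommended codepoint for Expedited Forwarding
ds_byte = EF << 2      # an EF-marked packet with the CU bits zero
```

A BA classifier in the core needs nothing more than the `dscp` lookup above to assign a packet to its behavior aggregate.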
A DS domain is typically managed by a single administrative authority that is responsible for ensuring that adequate network resources are available to support the service level specifications (SLSs) and traffic conditioning specifications (TCSs) offered by the domain.

Figure 14: Differentiated Services Domain (DS Domain)

A DS domain consists of DS boundary nodes and DS interior nodes.

I DS boundary nodes sit at the edges of a DS domain and function as both DS ingress and egress nodes for different directions of traffic flow. When functioning as a DS ingress node, a DS boundary node is responsible for the classification, marking, and possibly conditioning of ingress traffic. It classifies each packet based on an examination of the packet header, and then writes the DSCP to indicate one of the PHB groups supported within the DS domain. When functioning as a DS egress node, a DS boundary node may be required to perform traffic conditioning functions on traffic forwarded to a directly connected peering domain. DS boundary nodes connect a DS domain to another DS domain or to a non-DS-capable domain.

I DS interior nodes select the forwarding behavior applied to each packet based on an examination of the packet's DSCP (they honor the PHB indicated in the packet header). DS interior nodes map the DSCP to one of the PHB groups supported by all of the DS interior nodes within the DS domain. DS interior nodes connect only to other DS interior nodes or boundary nodes within the same DS domain.

Differentiated Service Router Functions

Figure 15 provides a logical view of the operation of a packet classifier and traffic conditioner on a DiffServ-capable router.
Figure 15: Packet Classifier and Traffic Conditioner

Packet Classification

A packet classifier selects packets in a traffic stream based on the content of fields in the packet header. The DiffServ architecture defines two types of packet classifiers:

I A behavior aggregate (BA) classifier selects packets based on the value of the DSCP only.

I A multifield (MF) classifier selects packets based on a combination of the values of one or more header fields. These fields can include the source address, destination address, DS Field, protocol ID, source port, and destination port, or other information, such as the incoming interface.

The result of the classification is written to the DS Field to simplify the packet classification task for nodes in the interior of the DS domain. After the packet classifier identifies packets that match specific rules, each packet is directed to a logical instance of a traffic conditioner for further processing.

Traffic Conditioning

A traffic conditioner may consist of various elements that perform traffic metering, marking, shaping, and dropping; a traffic conditioner is not required to support all of these functions.

I A meter measures a traffic stream to determine whether a particular packet from the stream is in-profile or out-of-profile. The meter passes this state information to other traffic conditioning elements so that different conditioning actions can be applied to in-profile and out-of-profile packets.

I A marker writes (or rewrites) the DS Field of a packet header to a specific DSCP, so that the packet is assigned to a particular DS behavior aggregate.

I A shaper delays some or all packets in a traffic stream to bring the stream into conformance with its traffic profile.

I A dropper (policer) discards some or all packets in a traffic stream to bring the stream into conformance with its traffic profile.
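A meter is commonly implemented as a token bucket; the sketch below (the rate and burst values are illustrative) marks each packet in-profile or out-of-profile, leaving the resulting action to the marker, shaper, or dropper:

```python
class TokenBucketMeter:
    """Token-bucket meter sketch: rate in bytes/sec, burst in bytes."""

    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst   # start with a full bucket
        self.last = 0.0

    def check(self, size: int, now: float) -> str:
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= size:
            self.tokens -= size
            return "in-profile"
        return "out-of-profile"

meter = TokenBucketMeter(rate=1000.0, burst=1500.0)
first = meter.check(1500, now=0.0)    # consumes the entire burst allowance
second = meter.check(1500, now=0.1)   # only 100 bytes have been refilled
```

The burst parameter is what lets a conforming stream send short bursts above its average rate without being marked out-of-profile.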
Differentiated Services Router Functions

Figure 16 illustrates the functions that are typically performed by DS boundary routers and DS interior routers.

Figure 16: DiffServ Router Functions

The DS ingress boundary router generally performs MF packet classification and traffic conditioning functions on incoming microflows. A microflow is a single instance of an application-to-application flow that is ultimately assigned to a behavior aggregate. A DS ingress boundary router can also apply the appropriate PHB, based on the result of this packet classification process.

NOTE: A DS ingress boundary router may also perform BA packet classification if it trusts an upstream DS domain's packet classification.

A DS interior router usually performs BA packet classification to associate each packet with a behavior aggregate. It then applies the appropriate PHB by using specific buffer-management and packet-scheduling mechanisms to support the specific packet-forwarding treatment. Although the DiffServ architecture assumes that the majority of complex packet classification and conditioning occurs at DS boundary routers, the use of MF classification is also supported in the interior of the network.

The DS egress boundary router normally performs traffic shaping as packets leave the DS domain for another DS domain or a non-DS-capable domain. A DS egress boundary router may also perform MF or BA packet classification and precedence rewriting if it has an agreement with a downstream DS domain.
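BA classification in the interior reduces to a single lookup on the DSCP. The sketch below is illustrative only: the forwarding-class names and the table contents are our own, though the codepoints shown are the DiffServ recommended values.

```python
# Illustrative DSCP-to-forwarding-class table for a DS interior router.
DSCP_TO_CLASS = {
    0b101110: "expedited-forwarding",   # EF
    0b001010: "assured-forwarding-1",   # AF11
    0b010010: "assured-forwarding-2",   # AF21
    0b000000: "best-effort",            # default PHB
}

def ba_classify(dscp):
    """BA classification: examine only the 6-bit DSCP, and fall back to
    best-effort treatment for unrecognized codepoints."""
    return DSCP_TO_CLASS.get(dscp, "best-effort")


assert ba_classify(0b101110) == "expedited-forwarding"
assert ba_classify(0b110111) == "best-effort"   # unknown codepoint
```

This is exactly why the DiffServ model scales: interior routers need no per-flow state, only a small, fixed mapping from codepoints to locally supported PHBs.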
Per-hop Behaviors (PHBs)

A per-hop behavior (PHB) is a description of the externally observable forwarding behavior applied to a particular behavior aggregate. The PHB is the means by which a DS node allocates its resources to different behavior aggregates. The DiffServ architecture supports the delivery of scalable service discrimination based on this hop-by-hop resource allocation mechanism.

PHBs are defined in terms of the behavior characteristics that are relevant to a provider's service provisioning policies. A specific PHB may be defined in terms of:

- The amount of resources allocated to the PHB (buffer size and link bandwidth),
- The relative priority of the PHB compared with other PHBs, or
- The observable traffic characteristics (delay, jitter, and loss).

However, PHBs are not defined in terms of specific implementation mechanisms. Consequently, a variety of different implementation mechanisms may be acceptable for implementing a specific PHB group. The IETF DiffServ Working Group has defined two PHBs:

- Expedited Forwarding PHB
- Assured Forwarding PHB

In the future, new DSCPs can be assigned by a provider for its own local use or by new standards activity.

Expedited Forwarding (EF PHB)

According to the IETF's DiffServ Working Group, the Expedited Forwarding (EF) PHB is designed to provide low-loss, low-delay, low-jitter, assured-bandwidth, end-to-end service. In effect, the EF PHB simulates a virtual leased line to support highly reliable voice or video and to emulate dedicated circuit services. The recommended DSCP for the EF PHB is 101110.

Since the only aspect of delay that you can control in your network is the queuing delay, you can minimize both delay and jitter when you minimize queuing delays. Thus, the intent of the EF PHB is to arrange that suitably marked packets encounter extremely short or empty queues to ensure minimal delay and jitter.
You can achieve this only if the service rate for EF packets on a given output port exceeds the usual rate of packet arrival at that port, independent of the load on other (non-EF) PHBs. The EF PHB can be supported on DS-capable routers in several ways:

- By policing EF microflows to prescribed values at the edge of the DS domain (this is required to ensure that the service rate for EF packets exceeds their arrival rate in the core of the network),
- By ensuring adequate provisioning of bandwidth across the core of your network,
- By placing EF packets in the highest strict-priority queue and ensuring that the minimum output rate is at least equal to the maximum input rate, or
- By rate-limiting the EF aggregate load in the core of your network to prevent inadequate bandwidth for other service classes.

Generally, you will not use RED as a queue memory-management mechanism when supporting the EF PHB, because the majority of EF traffic is UDP-based, and UDP does not respond to packet drops by reducing its transmission rate.

Assured Forwarding (AF PHB)

The Assured Forwarding (AF) PHB is a group of PHBs designed to ensure that packets are forwarded with a high probability of delivery, as long as the aggregate traffic in a forwarding class does not exceed the subscribed information rate. If ingress traffic exceeds its subscribed information rate, then out-of-profile traffic is not delivered with as high a probability as traffic that is in-profile.

The AF PHB group includes four traffic classes. Packets within each AF class can be marked with one of three possible drop-precedence values. The AF PHB group can be used to implement an Olympic-style service that consists of three service classes: gold, silver, and bronze. If you wish, you can further differentiate packets within each class by giving them low, medium, or high drop precedence within the service class.
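The recommended AF codepoints follow a regular bit pattern: the three high-order bits of the 6-bit DSCP carry the class number, the next two bits carry the drop precedence, and the low-order bit is zero. The helper function below is our own sketch that reproduces the recommended values from RFC 2597:

```python
def af_dscp(af_class, drop_precedence):
    """Return the recommended DSCP for an AF class (1-4) and drop
    precedence (1 = low, 2 = medium, 3 = high), per RFC 2597.

    The class occupies the three high-order bits of the 6-bit DSCP,
    the drop precedence the next two bits; the low-order bit is zero.
    """
    if not (1 <= af_class <= 4 and 1 <= drop_precedence <= 3):
        raise ValueError("AF classes are 1-4, drop precedences 1-3")
    return (af_class << 3) | (drop_precedence << 1)


# AF11 -> 001010, AF43 -> 100110
assert format(af_dscp(1, 1), "06b") == "001010"
assert format(af_dscp(4, 3), "06b") == "100110"
```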
Table 2 summarizes the recommended DSCPs for the four AF PHB groups.

Table 2: Recommended AF DiffServ Codepoint (DSCP) Values

                        AF Class 1  AF Class 2  AF Class 3  AF Class 4
Low drop precedence     001010      010010      011010      100010
Medium drop precedence  001100      010100      011100      100100
High drop precedence    001110      010110      011110      100110

The AF PHB groups have not been assigned specific service definitions by the DiffServ Working Group. The groups can be viewed as the mechanism that allows a provider to offer differentiated levels of forwarding assurances for IP packets. It is the responsibility of each DS domain to set the quantitative and qualitative differences between AF classes. In a DS-capable router, the level of forwarding assurance for any given packet depends on:

- The amount of bandwidth and buffer space allocated to the packet's AF class,
- The amount of congestion for the AF class within the router, and
- The drop precedence of the packet.

The AF PHB group can be supported on DS-capable routers by:

- Policing AF microflows to prescribed values at the edge of the DS domain,
- Ensuring adequate provisioning of bandwidth across the core of your network,
- Placing each AF service class into a separate queue,
- Selecting the appropriate queue scheduling discipline to allocate buffer space and bandwidth to each AF service class, and
- Configuring RED to honor the three low-order bits in the DSCP to determine how aggressively a packet is dropped during periods of congestion.

Default PHB

RFC 1812 specifies the default PHB as the conventional best-effort forwarding behavior. When no other agreements are in place, all packets are assumed to belong to this traffic aggregate. A packet assigned to this aggregate may be sent into a network without following any specific
rules, and the network will deliver as many of these packets as possible, as soon as possible, subject to other resource-policy constraints. The recommended DSCP for the default PHB is 000000.

General Observations about Differentiated Services

In this section, we discuss general observations about the nature of the DiffServ architecture to help you understand what you can or should expect if you decide to deploy it. It is important to maintain a healthy skepticism about DiffServ, because it is not a magic solution that can solve all of the congestion-related problems in your network.

DiffServ Does Not Create Free Bandwidth

Routers are statistical multiplexing devices; therefore, they can experience congestion when the amount of traffic that needs to traverse a port exceeds the output port's capacity. This means that statistical multiplexing devices require buffer memory to absorb temporary bursts, so that packets are not immediately dropped when congestion occurs. In a well-designed and properly functioning network, packet buffers must be empty most of the time, so that they are available to absorb occasional packet bursts. Sustained congestion is an unacceptable condition, because it exhausts buffer memory, which causes packets to be dropped.

The fundamental idea behind the DiffServ model is that deploying multiple queues on a port allows a router to service certain traffic before other types of traffic, and thus isolate congestion to a subset of a router's queues. The queue-servicing algorithm arbitrates each queue's access to the link, so some queues can experience congestion while other queues do not. In a way, you can look at the DiffServ model as a mechanism that tries to ensure that important traffic has access to network resources and does not experience congestion, which essentially makes it managed unfairness.
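The idea of isolating congestion to a subset of queues can be made concrete with a toy strict-priority scheduler. This is a deliberately simplified sketch of our own: real schedulers bound each queue and rate-limit the top priority so it cannot starve the rest.

```python
from collections import deque

class StrictPriorityScheduler:
    """Toy strict-priority scheduler: queue 0 is always served before
    queue 1, and so on. Under overload, only the lowest-priority
    queues build up and drop packets; higher priorities are unaffected."""

    def __init__(self, num_queues):
        self.queues = [deque() for _ in range(num_queues)]

    def enqueue(self, priority, packet):
        self.queues[priority].append(packet)

    def dequeue(self):
        # Serve the highest-priority (lowest-numbered) non-empty queue.
        for q in self.queues:
            if q:
                return q.popleft()
        return None     # all queues empty


s = StrictPriorityScheduler(2)
s.enqueue(1, "best-effort-1")
s.enqueue(0, "ef-1")
assert s.dequeue() == "ef-1"           # priority 0 jumps the queue
assert s.dequeue() == "best-effort-1"
```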
This approach is based on the assumption that it is acceptable for less important traffic not to have access to network resources during periods of congestion. The primary problem with this assumption is that, even for the worst service class, you still need to be concerned about provisioning bandwidth, because sustained congestion for this class will result in a poor experience for users. The key question you need to ask yourself is, "How much delay, jitter, or loss can the least-important service accommodate and still be commercially viable?" If you are not willing to provision adequate bandwidth for this service class, these customers will leave your network and purchase their service from another provider.

It is important to remember that the deployment of DiffServ in your network does not create free bandwidth. Neither does it replace the need for careful capacity planning for network resources for all types of services. You are still required to ensure that you have enough bandwidth in the core of your network to match the bandwidth that is available for your customers to inject traffic into the core.

DiffServ Does Not Change the Speed of Light

There are four sources of packet delay in your network: forwarding delay, queuing delay, serialization delay, and propagation delay. Assuming that you have deployed the industry standard in hardware-based routers, you have no real control over forwarding delay, serialization delay, or propagation delay. The only aspect of packet delay across your network that you can actually control is the queuing delay. But queuing delay in a well-designed and properly functioning network is several orders of magnitude less than the propagation delay. The deployment of DiffServ in your network does not impact the speed of light in optical fiber, which is the primary contributor to delay in your network.
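A back-of-the-envelope comparison makes the point; the path length and link rate below are our own illustrative assumptions:

```python
# Rough delay comparison on a long-haul path: propagation dwarfs the
# components a router (or DiffServ) can influence.
FIBER_SPEED_M_PER_S = 2.0e8      # light in fiber, roughly two-thirds of c
LINK_RATE_BPS = 2.488e9          # OC-48 line rate
PACKET_BITS = 1500 * 8           # a full-size 1,500-byte packet

path_m = 4_000_000               # an assumed 4,000 km backbone path
propagation_ms = path_m / FIBER_SPEED_M_PER_S * 1e3
serialization_ms = PACKET_BITS / LINK_RATE_BPS * 1e3

print(f"propagation:   {propagation_ms:.1f} ms")    # 20.0 ms
print(f"serialization: {serialization_ms:.4f} ms")  # 0.0048 ms
```

On these numbers, propagation delay is more than three orders of magnitude larger than per-hop serialization delay, so even eliminating queuing entirely barely moves the end-to-end total.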
The Strictest Service Guarantees Will Be between Well-Known Endpoints

Large IP networks have a global reach and provide any-to-any communication. This means that traffic patterns are of statistical significance only, and that they can change dramatically without a moment's notice (think World Trade Center attacks, stock market fluctuations, election nights, and so forth). Large IP networks are completely different from traditional ATM and Frame Relay networks, where the carrier knows in advance the traffic patterns that will persist on the network. As a consequence, the strictest of guarantees are likely to be offered in IP networks only for traffic flows that have well-known endpoints: VoIP gateways, VPNs, video conferencing, and so forth. For these types of applications, you know the ingress node, the egress node, and the amount of traffic that needs to be supported.

Support for Interprovider DiffServ Is a Business Issue

If the existing technology allows us to support DiffServ between a subscriber and a server in a provider's network, then the technology offers all of the features needed to support DiffServ between two different providers. Support for interprovider DiffServ is therefore a business issue, not a technical issue.

Providers Do Not Control All Aspects of the User Experience

A user's experience is affected by factors other than just their service provider's ability to support Differentiated Services. For example, assume that you want to measure the performance of various providers by going to the theoretical Web site of www.reallyfastprovider.com. After conducting this performance test with a few different providers, you make your selection based on which provider's Web page appears on your screen fastest.
The problem with this approach is that a number of other factors can skew the results of this test, including the performance of your PC, the performance of your DNS server, the presence of congestion on your access link, the stability of routing in the network, and the performance of the target Web server.

Conclusion

So, which is the best approach to support the delivery of multiple service classes for specific customers or applications in large IP networks: bandwidth overprovisioning or bandwidth management? We know that there are a number of hidden costs associated with the bandwidth management approach. However, we also believe, based on our collective experience with these issues, that neither of the two approaches is clearly superior to the other. In fact, advocates for both approaches present compelling arguments to defend their positions.

Generally, a balanced approach between bandwidth overprovisioning and bandwidth management will provide the most satisfactory solution. It is certainly acceptable to throw bandwidth at a problem when it is cost-effective to do so. Nevertheless, active bandwidth management can play an important role in helping you improve the performance of your network, can support different forwarding behaviors for different classes of traffic, and can reduce operating expenses.

Over the past few years, the IETF has developed a number of technologies to help you support differentiated service classes in large IP networks:

- Recent additions to the IntServ architecture enhance the scalability of RSVP by allowing it to aggregate resource reservations and to potentially play a more significant role in large IP networks.
- The DiffServ architecture is an alternative to IntServ that provides the scalability required for deployment in large IP networks when attempting to offer better than best-effort service.
- Multiprotocol Label Switching (MPLS) supports the convergence of two fundamentally different approaches to data networking (datagram and virtual circuit) in a seamless and fully integrated manner.
- MPLS traffic engineering reduces congestion and optimizes the use of existing network resources by allowing you to carefully manage the distribution of traffic across your network.

You need to be aware of all of the tools that are available to you, carefully select the proper subset of tools, and then use them in combination to achieve the results you want in your network.

Acronym Definitions

AF       Assured Forwarding
BA       behavior aggregate
CE       congestion experienced
CU       currently unused
DEMUX    demultiplexer
DiffServ Differentiated Services
DLCI     data-link connection identifier
DSCP     DiffServ codepoint
ECN      explicit congestion notification
EF       Expedited Forwarding
FDM      frequency-division multiplexing
FIFO     first in, first out
GMPLS    Generalized MPLS
IntServ  Integrated Services
MF       multifield
MUX      multiplexer
MPLS     Multiprotocol Label Switching
PHB      per-hop behavior
RED      Random Early Detection
RTT      round-trip time
SLA      service-level agreement
SLS      service-level specification
SPD      selective packet discard
TCS      traffic-conditioning specification
TDM      time-division multiplexing
ToS      type of service
UDP      User Datagram Protocol
VCI      virtual channel identifier
VoIP     voice over IP
VPI      virtual path identifier
VPN      virtual private network

References

Requests for Comments (RFCs)

RFC 791. Postel, J. "Internet Protocol: DARPA Internet Program Protocol Specification." September 1981.
RFC 1349. Almquist, P. "Type of Service in the Internet Protocol Suite." July 1992.
RFC 1633. Braden, R., Clark, D., and Shenker, S. "Integrated Services in the Internet Architecture: An Overview." June 1994.
RFC 1812. Baker, F. (Editor).
"Requirements for IP Version 4 Routers." June 1995.
RFC 2205. Braden, R. (Editor), et al. "Resource ReSerVation Protocol (RSVP) Version 1 Functional Specification." September 1997.
RFC 2309. Braden, R., Clark, D., Crowcroft, J., et al. "Recommendations on Queue Management and Congestion Avoidance in the Internet." April 1998.
RFC 2474. Nichols, K., Blake, S., Baker, F., and Black, D. "Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers." December 1998.
RFC 2475. Blake, S., Black, D., Carlson, M., Davies, E., et al. "An Architecture for Differentiated Services." December 1998.
RFC 2597. Heinanen, J., Baker, F., Weiss, W., and Wroclawski, J. "Assured Forwarding PHB Group." June 1999.
RFC 2598. Jacobson, V., Nichols, K., and Poduri, K. "An Expedited Forwarding PHB." June 1999.
RFC 3140. Black, D., Brim, S., Carpenter, B., and Le Faucheur, F. "Per-Hop Behavior Identification Codes." June 2001.
RFC 3175. Baker, F., Iturralde, C., Le Faucheur, F., and Davie, B. "Aggregation of RSVP for IPv4 and IPv6 Reservations." September 2001.

Internet Drafts

Charny, A. (Editor), et al. "Supplemental Information for the New Definition of the EF PHB." <draft-ietf-diffserv-ef-supplemental-01.txt> June 2001.
Davie, B. (Editor), et al. "An Expedited Forwarding PHB." <draft-ietf-diffserv-rfc2598bis-02.txt> September 2001.
Grossman, Dan. "New Terminology for DiffServ." <draft-ietf-diffserv-new-terms-06.txt> November 2001.

Textbooks

Comer, Douglas E. Internetworking with TCP/IP Vol. I: Principles, Protocols, and Architecture. Prentice Hall, February 2000. (ISBN 0130183806)
Croll, Alistair and Packman, Eric. Managing Bandwidth: Deploying Across Enterprise Networks. Prentice Hall PTR, January 2000. (ISBN 0130113913)
Ferguson, Paul and Huston, Geoff. Quality of Service: Delivering QoS on the Internet and in Corporate Networks.
John Wiley & Sons, January 1998. (ISBN 0471243582)
Huitema, Christian. Routing in the Internet, 2nd edition. Prentice Hall PTR, January 2000. (ISBN 0130226475)
Huston, Geoff. Internet Performance Survival Guide: QoS Strategies for Multiservice Networks. John Wiley & Sons, February 2000. (ISBN 0471378089)
Kilkki, Kalevi. Differentiated Services for the Internet. New Riders Publishing, June 1999. (ISBN 1578701325)
Partridge, Craig. Gigabit Networking. Addison-Wesley Publishing Co., January 1994. (ISBN 0201563339)
Perlman, Radia. Interconnections: Bridges, Routers, Switches, and Internetworking Protocols (Second Edition). Addison-Wesley Publishing Co., October 1999. (ISBN 0201634481)
Stallings, William. High-Speed Networks: TCP/IP and ATM Design Principles. Prentice Hall, January 1998. (ISBN 0135259657)
Stallings, William. Data & Computer Communications. Prentice Hall, November 1999. (ISBN 0130843709)
Stevens, W. Richard. TCP/IP Illustrated, Volume 1: The Protocols. Addison-Wesley Publishing Co., January 1994. (ISBN 0201633469)
Wright, Gary R. and Stevens, W. Richard. TCP/IP Illustrated, Volume 2: The Implementation. Addison-Wesley Publishing Co., January 1995. (ISBN 020163354X)
Wang, Zheng. Internet QoS: Architectures and Mechanisms for Quality of Service. Morgan Kaufmann Publishers, March 2001. (ISBN 1558606084)

Technical Papers

Bennett, J. and Zhang, H. "Hierarchical Packet Fair Queueing Algorithms." Proc. ACM SIGCOMM '96, August 1996.
Floyd, S. and Jacobson, V. "Random Early Detection Gateways for Congestion Avoidance." IEEE/ACM Transactions on Networking, Volume 1, Number 4, August 1993, pp. 397-413.
Shreedhar, M. and Varghese, G. "Efficient Fair Queueing Using Deficit Round Robin." Proc. ACM SIGCOMM '95, 1995.

Copyright 2001, Juniper Networks, Inc. All rights reserved. Juniper Networks is registered in the U.S. Patent and Trademark Office and in other countries as a trademark of Juniper Networks, Inc.
Internet Processor, Internet Processor II, JUNOS, JUNOScript, M5, M10, M20, M40, and M160 are trademarks of Juniper Networks, Inc. All other trademarks, service marks, registered trademarks, or registered service marks are the property of their respective owners. All specifications are subject to change without notice. Juniper Networks assumes no responsibility for any inaccuracies in this document. Juniper Networks reserves the right to change, modify, transfer, or otherwise revise this publication without notice.