Congestion-Adaptive Encoding for VoIP
Sebastian Böck (sxb01@aub.edu.lb), Tobias Knaup (txk00@aub.edu.lb)
Abstract: This paper introduces an adaptive VoIP system which uses TCP as its transport protocol and TCP stack parameters to control its bit-rate at the sender side. The presented architecture adapts to varying network conditions by encoding and sending only at a bit-rate the network is currently capable of handling. It thus reduces congestion while maintaining the highest possible voice quality. The performance of this approach is investigated in an experimental setup and compared to non-adaptive systems on top of either TCP or UDP.

I. INTRODUCTION

With its worldwide deployment and rapidly growing popularity, the Internet has evolved into a universal communication network. In recent years it has increasingly been used to deliver Voice over IP (VoIP) services to end customers, and has thus emerged as an alternative to the public switched telephone network (PSTN). VoIP, being a real-time application, places high demands on the quality of the network it runs on. For example, the delay must not exceed a certain threshold if it is to remain imperceptible to users. Furthermore, packets suffering a significant delay are usually dropped by the VoIP application because they would otherwise cause a confusing conversation. The jitter also has to be low, since it forces the receiver to use a playout buffer. Buffering guarantees timely playback of the voice data but also induces additional delay. On top of that, the Internet with its best-effort nature does not guarantee the delivery of packets at all. From an application perspective, lost packets are simply missing if UDP is used, and are retransmitted if TCP is used. For this reason, UDP is usually chosen for VoIP systems, because the retransmission delays of TCP are usually high enough to render timely delivery impossible. However, UDP has a destructive impact on concurrent TCP flows, which form most of the Internet traffic [1][2]. Because of this, a lot of effort has been put into the development of TCP-friendly application-layer protocols running on top of UDP. Furthermore, from a VoIP perspective, UDP has limited performance in a congested network, which results in long and variable delays that cause conversational gaps [3]. TCP, on the other hand, is able to react to network congestion and outperforms UDP under certain conditions. Simulation studies have shown that TCP congestion control does not have a bad impact on time-sensitive applications like VoIP, but can in fact result in performance gains [4].

According to [3], a key to improving the perceived quality is to avoid congestion in the first place. A different approach to achieving better quality is to equip a VoIP system with a voice encoder which supports different output bit-rates. In [5] the network quality is measured in terms of packet delays and losses and supplied to the encoder via RTCP reports calculated at the receiving side. The bit-rate of the encoder is then controlled by these parameters to adapt to the varying conditions. As a result, the adaptive encoder achieves higher bandwidth efficiency than a fixed-rate system while maintaining voice quality. The aim of our work is to combine the two approaches, using TCP as the transport protocol and a controllable voice encoder, to obtain a system which is able to sense network congestion at an early stage. The sender then reduces its bit-rate fast enough to avoid further packet loss, thus keeping the perceived voice quality stable and as high as possible under the particular conditions. In contrast to previous research in this field, our system is a real software implementation running on a real network.

II. THEORY

A. Implementation

The software described in this paper has been implemented for the Darwin (Mac OS X) operating system, version 8.9.0. Darwin uses the TCP protocol stack of FreeBSD, which has the RFC 1323 [6] high performance extensions enabled. It also features selective acknowledgments as described in RFC 2018 [7]. The NewReno modifications described in RFC 2582 [8] are disabled by default. As previous research has shown, TCP achieves better performance in VoIP applications if the TCP_NODELAY socket option is turned on [9]. We enabled this option for all the experiments we conducted.

The codec we used for our system is the Speex voice codec, available at [10]. It is specifically tailored for speech and offers several modes of operation as well as different encoding bit-rates. Speex operates at a fixed frame length of 20 ms and thus outputs 50 frames per second. The size of each frame varies between 5 bits and about 110 bytes for audio sampled at 16 kHz. The frame size can be controlled by different parameters, one of them being a quality setting in the range of 0-10. In Constant Bit Rate (CBR) mode, the size of an output frame is directly proportional to the quality setting. If Variable Bit Rate (VBR) is selected, the setting only affects the average frame size.

To evaluate the performance of our approach under different network conditions, we used the Dummynet traffic shaper [11]. It is part of the FreeBSD packet filter ipfw and offers the notion of a pipe for intentionally constraining selected traffic. A pipe can be configured by parameters such as its bandwidth, propagation delay, queue size and packet loss probability.
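The socket option and the frame arithmetic described above can be illustrated with a minimal Python sketch. It is not the implementation used in this paper; the helper names make_voice_socket and payload_bitrate_kbit are hypothetical, and the bit-rate covers the codec payload only, ignoring TCP/IP header overhead.

```python
import socket

FRAME_MS = 20                           # Speex frame length stated in the text
FRAMES_PER_SECOND = 1000 // FRAME_MS    # 50 frames per second


def make_voice_socket() -> socket.socket:
    """Create a TCP socket with TCP_NODELAY set (hypothetical helper).

    TCP_NODELAY disables Nagle's algorithm, so small voice packets are
    sent immediately instead of being coalesced.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def payload_bitrate_kbit(frame_bytes: int) -> float:
    """Payload bit-rate in Kbit/s for one frame of frame_bytes every 20 ms.

    Assumption: 1 Kbit = 1000 bits; protocol headers are not counted.
    """
    return frame_bytes * 8 * FRAMES_PER_SECOND / 1000.0
```

For example, a (hypothetical) frame size of 100 bytes corresponds to a payload rate of 40 Kbit/s before any protocol overhead is added.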


B. CWND-based approach

The first approach that we implemented was inspired by a statement in [12] that the TCP Congestion Window (cwnd) could serve as a measure for controlling the sending rate of a VoIP system. Our software reads the cwnd from the TCP stack before a packet is sent and sets the encoder quality as a linear function of it. The link bandwidth is limited by dummynet in order to force congestion. Figure 1 shows the results of our first experiment.
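A linear mapping of this kind could look like the following sketch. The text does not give the scaling that was actually used, so the reference window cwnd_max_bytes and the function name are purely hypothetical; only the Speex quality range of 0-10 is taken from the text.

```python
def quality_from_cwnd(cwnd_bytes: int,
                      cwnd_max_bytes: int = 65535,
                      q_min: int = 0,
                      q_max: int = 10) -> int:
    """Map the congestion window linearly onto the Speex quality range.

    cwnd_max_bytes is a hypothetical reference window at which the
    highest quality is selected; it is not a value from the paper.
    """
    if cwnd_max_bytes <= 0:
        raise ValueError("cwnd_max_bytes must be positive")
    # Clamp the ratio to [0, 1] so the result stays within the quality range.
    fraction = min(max(cwnd_bytes / cwnd_max_bytes, 0.0), 1.0)
    return q_min + round(fraction * (q_max - q_min))
```

Because the quality follows the cwnd directly, the encoder can only react once TCP itself has adjusted its window.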
[Figure 2 (flowchart). Nodes: Start; read srtt; srtt > last_srtt? (YES: decr. quality, no_chng = 0); srtt < last_srtt? (YES: incr. quality, no_chng = 0); otherwise no_chng++; no_chng > chng_th? (YES: incr. quality); last_srtt = srtt.]
Figure 2: SRTT based adaption algorithm

[Figure 1 (plot). Plotted quantities: delay/10 (ms), cwnd/100, packet (bytes); value axis from 0 to 250, horizontal axis from 0 to 24.]

Figure 1: Congestion window size

The red line depicts the cwnd and clearly shows that TCP senses congestion at about 16.5 seconds. The blue line represents the number of bytes used to encode each Speex frame. As the cwnd increases, the encoder increases its bit-rate, and it decreases abruptly where TCP becomes aware of congestion. The important time slice is between 9.5 seconds and the point of the cwnd drop. At 9.5 s the bit-rate is increased again, which overwhelms the link and leads to a significantly rising end-to-end delay, depicted by the green line. However, since TCP only senses congestion about 7 seconds later, the encoder is not able to adapt to the new network condition quickly enough. We therefore conclude that the cwnd is not appropriate for controlling the encoder of a VoIP system.

C. SRTT-based approach

The poor results of the first experiment motivated the search for a better suited parameter among the various TCP state variables. It turned out that the Smoothed Round Trip Time (srtt) increases rapidly whenever congestion is present in the network. The srtt is an internal variable of TCP and is used to adapt retransmission timeouts to the network condition [13]. However, the srtt depends heavily on link characteristics such as propagation delay, and thus the encoder bit-rate cannot be mapped linearly to it. Instead, we need an algorithm which takes into account the changes in srtt over time. Simulation studies have shown that an algorithm which follows the Additive Increase, Multiplicative Decrease (AIMD) paradigm yields appropriate performance in adaptive VoIP systems [5]. Figure 2 outlines the main steps of the algorithm we developed based on this concept.
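As background on the srtt variable, the smoothing and timeout computation given in RFC 793 [13] can be sketched as follows. The constants are example values within the ranges RFC 793 suggests, and the function names are hypothetical. Actual TCP stacks, including the one used here, may compute srtt with a refined estimator, so this is an illustration of the principle only.

```python
def update_srtt(srtt: float, rtt_sample: float, alpha: float = 0.875) -> float:
    """RFC 793 smoothing: SRTT = ALPHA * SRTT + (1 - ALPHA) * RTT.

    alpha is an example value; RFC 793 suggests 0.8 to 0.9.
    """
    return alpha * srtt + (1.0 - alpha) * rtt_sample


def rto(srtt: float, beta: float = 2.0,
        lbound: float = 1.0, ubound: float = 60.0) -> float:
    """RFC 793 timeout: RTO = min(UBOUND, max(LBOUND, BETA * SRTT)).

    beta, lbound and ubound (seconds) are example values from RFC 793.
    """
    return min(ubound, max(lbound, beta * srtt))
```

With these example constants, a single round-trip sample of 0.5 s arriving at an srtt of 0.05 s moves the srtt to about 0.106 s, so queueing delay becomes visible in the srtt within a few samples.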

First, the srtt parameter is read from the TCP stack. It is then compared to the srtt of the last encountered frame, which is referred to as last_srtt. If srtt is larger than last_srtt, we assume that the network conditions have worsened and perform a multiplicative decrease of the Speex quality parameter. We also set an internal counter called no_chng to zero; it is used in another step which is explained below. If, on the other hand, srtt is lower than last_srtt, the network is regarded as less congested, and the quality is additively increased. The no_chng counter is zeroed in this step as well.

Brief experiments with the algorithm have shown that the additive increase triggered by a lower srtt should not be too aggressive if stable behavior is to be achieved. The drawback is that in some cases the algorithm does not adapt optimally to improved network conditions. To overcome this, an additional step handles the case that srtt and last_srtt are equal. Each time this happens, the no_chng counter is increased by 1. If it exceeds the configurable threshold chng_th, the quality is increased as well. This effectively means that we also increase quality whenever srtt has been stable over a certain number of frames.

The controlling parameters are more easily determined by experiment than by calculation. We found the following values to yield good results:

Initial quality level:          3
Additive increase by:           0.4
Multiplicative decrease by:     0.7
No Change Threshold chng_th:    10 frames
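The steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: "multiplicative decrease by 0.7" is read as multiplying the quality by 0.7, the quality is clamped to the Speex range 0-10 and rounded for the encoder, the first frame only initializes last_srtt, and no_chng is reset after a threshold-triggered increase. The class name is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SrttQualityController:
    quality: float = 3.0       # initial quality level (from the text)
    add_step: float = 0.4      # additive increase (from the text)
    mult_factor: float = 0.7   # multiplicative decrease (from the text)
    chng_th: int = 10          # no-change threshold in frames (from the text)
    q_min: float = 0.0         # assumption: clamp to Speex quality range
    q_max: float = 10.0
    last_srtt: Optional[float] = None
    no_chng: int = 0

    def update(self, srtt: float) -> int:
        """Process the srtt read for one frame and return the quality to use."""
        if self.last_srtt is None:
            pass  # assumption: first frame has nothing to compare against
        elif srtt > self.last_srtt:
            # Network got worse: multiplicative decrease.
            self.quality *= self.mult_factor
            self.no_chng = 0
        elif srtt < self.last_srtt:
            # Network improved: additive increase.
            self.quality += self.add_step
            self.no_chng = 0
        else:
            # srtt unchanged: count stable frames, probe upwards when stable.
            self.no_chng += 1
            if self.no_chng > self.chng_th:
                self.quality += self.add_step
                self.no_chng = 0  # assumption: restart the stable-frame count
        self.quality = min(self.q_max, max(self.q_min, self.quality))
        self.last_srtt = srtt
        return round(self.quality)  # assumption: encoder takes an integer
```

A caller would invoke update() once per 20 ms frame with the srtt read from the TCP stack and pass the returned value to the encoder as its quality setting.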

III. EXPERIMENTAL RESULTS

A. Setup

To test our implementation, we set up a testbed consisting of a sender and a receiver, both located on the same machine. This way we ensure that both sides use the same system clock, which is essential for end-to-end delay measurements. All network traffic is sent through a dummynet pipe with different bandwidth settings to simulate two different levels of congestion. The available bandwidth was set according to the following scheme:

0-10 seconds:     80 Kbit/s
10-15 seconds:    56 Kbit/s
15-20 seconds:    32 Kbit/s
20-25 seconds:    80 Kbit/s
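For reference, the bandwidth scheme above can be written as a small lookup, for instance when annotating measured data with the bandwidth that was available at a given time. The names are hypothetical, and treating each interval as half-open (start included, end excluded) is an assumption, since the scheme does not say which setting applies exactly at 10, 15 and 20 seconds.

```python
# (start_s, end_s, bandwidth in Kbit/s), taken from the scheme in the text.
BANDWIDTH_SCHEDULE = [
    (0.0, 10.0, 80),
    (10.0, 15.0, 56),
    (15.0, 20.0, 32),
    (20.0, 25.0, 80),
]


def available_kbit(t_seconds: float) -> int:
    """Return the pipe bandwidth in Kbit/s at time t_seconds.

    Assumption: intervals are half-open, [start, end).
    """
    for start, end, kbit in BANDWIDTH_SCHEDULE:
        if start <= t_seconds < end:
            return kbit
    raise ValueError("time outside the 0-25 s experiment window")
```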

The link queue had a size of 50 slots, which is a typical value for Ethernet devices. Each slot can hold a full Ethernet packet of 1500 bytes. We modeled four different scenarios to demonstrate the effectiveness of our adaption algorithm:

• UDP with CBR¹ voice encoding, non-adaptive
• TCP with CBR voice encoding, non-adaptive
• TCP with CBR voice encoding, adaptive
• TCP with VBR² voice encoding, adaptive
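To give a feeling for how much delay such a queue can add, the following sketch computes the time needed to drain a completely full queue at a given pipe bandwidth. It assumes 1 Kbit = 1000 bits and that every slot holds a packet of the given size; the function name is hypothetical. Since voice packets are much smaller than 1500 bytes, the figure for full-size packets is an upper bound rather than an expected value.

```python
def max_queueing_delay_s(bandwidth_kbit: float,
                         slots: int = 50,
                         packet_bytes: int = 1500) -> float:
    """Time in seconds to drain a full queue at the given bandwidth.

    slots and the 1500-byte default come from the setup described in the
    text; other packet sizes are hypothetical inputs.
    """
    if bandwidth_kbit <= 0:
        raise ValueError("bandwidth must be positive")
    queued_bits = slots * packet_bytes * 8
    return queued_bits / (bandwidth_kbit * 1000.0)
```

With full-size packets this gives 7.5 s at 80 Kbit/s and 18.75 s at 32 Kbit/s. With a hypothetical packet size of 100 bytes, a full queue still corresponds to 1.25 s at 32 Kbit/s, far more than a playout buffer of a few hundred milliseconds can absorb.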

To ensure comparability of the results of the different scenarios, we designed the sender so that it reads its input from audio files. To ensure real-time behavior of the whole setup, the sender reads in a frame every 20 ms, encodes this frame and sends a data packet over the network to the receiver. The receiver has a configurable playout buffer to smooth out network delay and jitter.

¹ CBR refers to the constant bit-rate mode of the Speex voice encoder, and should not be confused with fixed Speex quality settings in a non-adaptive mode.
² VBR refers to the variable bit-rate mode of the Speex voice encoder, and should not be confused with changing Speex quality settings in an adaptive mode.

B. Results

In all scenarios, the displayed delay values (green line in Figure 3) are measured end-to-end values, not estimates. We chose to plot the quality level (red line) instead of the real bit-rate or packet size of the audio stream to make the graphs comparable even if they use different encoding settings, e.g. constant or variable bit-rate. The available bandwidth is always shown as a black line, and the actual TCP srtt value is depicted in blue.

[Figure 3 (four plots). Plotted quantities: delay/10 (ms), bandwidth (Kbit/s), quality (%) and, in the TCP panels, srtt (ms); value axis from 0 to 100, horizontal axis from 0 to 24.]
(a) UDP with CBR voice encoding, non-adaptive
(b) TCP with CBR voice encoding, non-adaptive
(c) TCP with CBR voice encoding, adaptive
(d) TCP with VBR voice encoding, adaptive
Figure 3: Experimental results for the non-adaptive scenarios

The first scenario examines the performance of a non-adaptive VoIP application which uses UDP as its transport protocol. Figure 3a shows that very low delay values can be achieved if the available bandwidth exceeds the bit-rate of the voice encoder by a large margin. These values increase as the available bandwidth decreases, but still remain at a stable and tolerable level within the 10-15 second timeframe. However, there is a large jump in delay exactly at the time the pipe is no longer able to handle the audio stream, which is caused by queueing in the pipe. The end-to-end delay reaches very high values and does not even return to an acceptable level when sufficient bandwidth is available again.

The second scenario is also non-adaptive, but uses TCP as the transport protocol instead of UDP. Notable in Figure 3b is that under normal load conditions the experienced delays are much higher than with UDP. The delay peak at around four seconds is due to a retransmission of a lost packet and shows another shortcoming of TCP. Consistent with the results in [3], TCP shows a much more predictable behavior in the 15-20 second timeframe than UDP. It also recovers much faster from high delay values when bandwidth becomes available again. Also important is the correlation between the delay and the srtt value: the latter follows the former immediately, which demonstrates the suitability of this parameter for our adaption algorithm. For both scenarios the results change only marginally if variable bit-rate encoding is used instead of constant bit-rate encoding.

Figure 3c shows a substantial reduction of end-to-end delay in the critical period of network congestion (15 to 20 seconds). Our algorithm reacts very quickly to the changing network conditions. It needs at most two seconds to adapt to the new bandwidth setting of the pipe. The effectiveness can also be seen at the transition from 80 to 56 Kbit/s at t = 10 s. This holds not only when network capacity decreases, but also in the opposite direction. The initial adaption to an adequate quality setting of the voice encoder takes at most three seconds. This is probably because of an imperfect estimation of the initial srtt values. Assuming a playout buffer size of 200 ms as suggested by [14], the resulting playback of the audio stream is noticeably interrupted for a total duration of three seconds, with a complete loss of continuous playback limited to about one second.

The last scenario analyzes the potential of variable bit-rate encoding in conjunction with adaptive control. Figure 3d shows a noticeable increase in end-to-end delay in the region around 16.5 to 20 seconds, which leads to a virtually complete loss of the audio stream (once more assuming a playout buffer of 200 ms).
The degradation can be explained by the unpredictable bit-rate, and therefore unpredictable packet sizes, when the encoder operates in VBR mode. This mode allows the codec to change its bit-rate dynamically to adapt to the complexity of the audio being processed. There are no guarantees about the maximum bit-rate used, yet such guarantees are needed to operate satisfactorily in an adaptive VoIP application.

IV. CONCLUSION

We showed that adaptive VoIP applications provide much better perceived quality than non-adaptive ones. This is because they can avoid network congestion or, if congestion occurs, adapt their bit-rate to the newly available bandwidth, and hence avoid playback gaps caused by delayed or lost packets. Furthermore, we can state that TCP is in fact a viable choice for VoIP applications. However, the TCP cwnd state variable is not suitable for controlling adaptive VoIP systems. We developed a congestion-adaptive encoding algorithm that uses the TCP srtt parameter instead. Nonetheless, there are still unacceptably high delays in case of TCP retransmission. As a consequence, we propose to replace TCP by SCTP [15], which offers a partially reliable extension [16] to avoid retransmissions in case they are not wanted. Promising results have already been achieved with SCTP for video streaming [17]. Since SCTP also uses the srtt value internally, it should be easy to migrate our adaption algorithm to the new protocol. Although variable bit-rate encoding can result in better perceived audio quality when used in a non-streaming application, it is not suitable for adaptive VoIP applications.

Some work remains to be done in the future. One task is to explore the cause of the zig-zag nature of the delay in the TCP scenarios. Another is to test all scenarios under real-world conditions, compare them with state-of-the-art adaptive UDP-based implementations, and refine the adaption algorithm and the parameter values used.

REFERENCES
[1] S. Floyd and K. Fall. Promoting the use of end-to-end congestion control in the internet. IEEE/ACM Transactions on Networking (TON), 7(4):458–472, 1999.
[2] R. Rejaie, M. Handley, and D. Estrin. RAP: An end-to-end rate-based congestion control mechanism for realtime streams in the internet. IEEE INFOCOM (3), 1337–1345, 1999.
[3] P. Papadimitriou and V. Tsaoussidis. Assessment of internet voice transport with TCP. International Journal of Communication Systems, 19:381–405, 2006.
[4] P. Papadimitriou and V. Tsaoussidis. Evaluation of transport services for VoIP. In IEEE ICC Proceedings, 2006.
[5] A. Barberis, C. Casetti, J. D. Martin, and M. Meo. A simulation study of adaptive voice communications on IP networks. Computer Communications, 24:757–767, 2001.
[6] V. Jacobson, R. Braden, and D. Borman. TCP extensions for high performance. IETF RFC 1323, May 1992.
[7] M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow. TCP selective acknowledgment options. IETF RFC 2018, October 1996.
[8] S. Floyd and T. Henderson. The NewReno modification to TCP's fast recovery algorithm. IETF RFC 2582, April 1999.
[9] X. Zhang and H. Schulzrinne. Voice over TCP and UDP. Technical report, Department of Computer Science, Columbia University, 2004.
[10] J.-M. Valin, Xiph.org Foundation. Speex voice codec. http://www.speex.org.
[11] L. Rizzo. Dummynet: A simple approach to the evaluation of network protocols. ACM SIGCOMM Computer Communication Review, 27(1):31–41, 1997.
[12] M.-F. Horng, H.-W. Hsu, W.-L. Du, Y.-H. Hung, and M.-H. Lee. A fast-startup TCP mechanism for VoIP services in long-distance networks. In Proceedings of the 2006 International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP'06), 2006.
[13] J. Postel. Transmission control protocol. IETF RFC 793, September 1981.
[14] H. Oouchi, T. Takenaga, H. Sugawara, and M. Masugi. Study on appropriate voice data length of IP packets for VoIP network adjustment. In IEEE, pages 1618–1622, 2002.
[15] R. Stewart, Q. Xie, K. Morneault, C. Sharp, H. Schwarzbauer, T. Taylor, I. Rytina, M. Kalla, L. Zhang, and V. Paxson. Stream control transmission protocol. IETF RFC 2960, October 2000.
[16] R. Stewart, M. Ramalho, Q. Xie, M. Tuexen, and P. Conrad. Stream control transmission protocol (SCTP) partial reliability extension. IETF RFC 3758, May 2004.
[17] M. Molteni and M. Villari. Using SCTP with partial reliability for MPEG-4 multimedia streaming. In Proceedings of BSDCon Europe 2002, 2002.