
Equilibrium & Dynamics

of TCP/AQM

Steven Low
CS & EE, Caltech

netlab.caltech.edu

Sigcomm
August 2001



Acknowledgments

 S. Athuraliya, D. Lapsley, V. Li, Q. Yin


(UMelb)
 S. Adlakha (UCLA), J. Doyle (Caltech), F.
Paganini (UCLA), J. Wang (Caltech)
 L. Peterson, L. Wang (Princeton)
 Matthew Roughan (AT&T Labs)
Outline

 Introduction
 TCP/AQM algorithms
 Duality model
 Equilibrium of TCP/AQM
 Linear dynamic model
 Dynamics of TCP/AQM
 A scalable control
Schedule

1:30 – 2:30 TCP/AQM algorithms
2:30 – 3:00 Duality model (equilibrium)
3:00 – 3:30 Break
3:30 – 4:00 Duality model (equilibrium)
4:00 – 4:30 Linear model (dynamic)
4:30 – 5:00 Scalable control
Part 0
Introduction
TCP/IP Protocol Stack

Applications (e.g. Telnet, HTTP)


TCP UDP ICMP
IP ARP
Link Layer (e.g. Ethernet, ATM)
Physical Layer (e.g. Ethernet, SONET)
Packet Terminology

 Application Message
 TCP Segment: TCP hdr (20 bytes) + TCP data (up to MSS)
 IP Packet: IP hdr (20 bytes) + IP data
 Ethernet Frame: Ethernet hdr (14 bytes) + Ethernet data (up to MTU 1500 bytes) + trailer (4 bytes)
Success of IP
Simple/Robust
 Applications (WWW, Email, Napster, FTP, …) sit above IP; transmission technologies (Ethernet, ATM, POS, WDM, …) sit below
 Robustness against failure
 Robustness against technological evolutions
 Provides a service to applications
 Doesn't tell applications what to do
Quality of Service
 Can we provide QoS with simplicity?
 Not with current TCP…
 … but we can fix it!
IETF
 Internet Engineering Task Force
 Standards organisation for Internet
 Publishes RFCs - Requests For Comment
 standards track: proposed, draft, Internet standard
 non-standards track: experimental, informational
 best current practice
 poetry/humour (RFC 1149: Standard for the transmission of
IP datagrams on avian carriers)
 TCP should obey RFC
 no means of enforcement
 some versions have not followed RFC
 http://www.ietf.org/index.html
Simulation
 ns-2: http://www.isi.edu/nsnam/ns/index.html
 Wide variety of protocols
 Widely used/tested
 SSFNET: http://www.ssfnet.org/homePage.html
 Scalable to very large networks
 Care should be taken in simulations!
 Multiple independent simulations
 confidence intervals
 transient analysis – make sure simulations are long enough
 Wide parameter ranges
 All simulations involve approximation
Other Tools
 tcpdump
 Get packet headers from real network traffic
 tcpanaly (V.Paxson, 1997)
 Analysis of TCP sessions from tcpdump
 traceroute
 Find routing of packets
 RFC 2398
 http://www.caida.org/tools/
Part I
Algorithms
Outline

 Introduction
 TCP/AQM algorithms
 Window flow control
 Source algorithm: Tahoe, Reno, Vegas
 Link algorithm: RED, REM
 Duality model
 Equilibrium of TCP/AQM
 Linear dynamic model
 Dynamics of TCP/AQM
 A scalable control
Early TCP
 Pre-1988
 Go-back-N ARQ
 Detects loss from timeout
 Retransmits from lost packet onward
 Receiver window flow control
 Prevent overflows at receive buffer
 Flow control: self-clocking
Why Flow Control?

 October 1986, Internet had its first


congestion collapse
 Link LBL to UC Berkeley
 400 yards, 3 hops, 32 Kbps
 throughput dropped to 40 bps
 factor of ~1000 drop!
 1988, Van Jacobson proposed TCP flow
control
Window Flow Control

[Figure: source sends packets 1 … W each RTT; destination returns ACKs, which clock out the next window]

 ~ W packets per RTT
 Lost packet detected by missing ACK
Source Rate
 Limit the number of packets in the network to window W
 Source rate = (W × MSS × 8) / RTT bps
 If W too small then rate « capacity
 If W too big then rate > capacity
   => congestion
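As a quick check of the rate formula above, with illustrative numbers (not from the slides):

```python
# Illustrative values: W = 20 packets, MSS = 1460 bytes, RTT = 100 ms.
W, MSS, RTT = 20, 1460, 0.1            # packets, bytes, seconds
rate_bps = W * MSS * 8 / RTT           # source rate in bits per second
print(rate_bps / 1e6)                  # ~2.3 Mbps
```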
Effect of Congestion
 Packet loss
 Retransmission
 Reduced throughput
 Congestion collapse due to
 Unnecessarily retransmitted packets
 Undelivered or unusable packets
 Congestion may continue after the overload!
[Figure: throughput vs. load; throughput rises, then collapses beyond the overload point]
Congestion Control
 TCP seeks to
 Achieve high utilization
 Avoid congestion
 Share bandwidth
 Window flow control
 Source rate = W/RTT packets/sec
 Adapt W to network conditions: ideally
W = BW x RTT
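For example, the window needed to fill a link can be computed from the bandwidth-delay product (the link speed, RTT, and packet size here are assumed values for illustration):

```python
# Window needed to fill a 64 Mbps link at 20 ms RTT, assuming 1500-byte packets.
bw_bps, rtt_s, pkt_bytes = 64e6, 0.020, 1500
bdp_bits = bw_bps * rtt_s                 # bandwidth-delay product (bits)
W = bdp_bits / (pkt_bytes * 8)            # window in packets
print(round(W, 1))                        # ~106.7 packets
```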
TCP Window Flow Controls
 Receiver flow control
 Avoid overloading receiver
 Set by receiver
 awnd: receiver (advertised) window
 Network flow control
 Avoid overloading network
 Set by sender
 Infer available network capacity
 cwnd: congestion window
 Set W = min (cwnd, awnd)
Receiver Flow Control

 Receiver advertises awnd with each ACK


 Window awnd
 closed when data is received and ack’d
 opened when data is read
 Size of awnd can be the performance limit
(e.g. on a LAN)
 sensible default ~16kB
Network Flow Control
 Source calculates cwnd from indication of
network congestion
 Congestion indications
 Losses
 Delay
 Marks
 Algorithms to calculate cwnd
 Tahoe, Reno, Vegas, RED, REM …
Outline

 Introduction
 TCP/AQM algorithms
 Window flow control
 Source algorithm: Tahoe, Reno, Vegas
 Link algorithm: RED, REM
 Duality model
 Equilibrium of TCP/AQM
 Linear dynamic model
 Dynamics of TCP/AQM
 A scalable control
TCP Congestion Control
 Tahoe (Jacobson 1988)
 Slow Start
 Congestion Avoidance
 Fast Retransmit
 Reno (Jacobson 1990)
 Fast Recovery
 Vegas (Brakmo & Peterson 1994)
 New Congestion Avoidance
 RED (Floyd & Jacobson 1993)
 Probabilistic marking
 REM (Athuraliya & Low 2000)
 Clear buffer, match rate
 Others…
Variants
 Tahoe & Reno
 NewReno
 SACK
 Rate-halving
 Modifications for high performance
 AQM
 RED, ARED, FRED, SRED
 BLUE, SFB
 REM, PI
TCP Tahoe (Jacobson 1988)
[Figure: window vs. time, a sawtooth: SS (Slow Start) phases followed by CA (Congestion Avoidance)]
Slow Start

 Start with cwnd = 1 (slow start)


 On each successful ACK increment cwnd
cwnd ← cwnd + 1
 Exponential growth of cwnd
each RTT: cwnd ← 2 x cwnd
 Enter CA when cwnd ≥ ssthresh
Slow Start

[Figure: sender/receiver timeline; cwnd doubles each RTT: 1, 2, 4, 8, …]

cwnd ← cwnd + 1 (for each ACK)
Congestion Avoidance
 Starts when cwnd ≥ ssthresh
 On each successful ACK: cwnd ← cwnd + 1/cwnd
 Linear growth of cwnd
each RTT: cwnd ← cwnd + 1
Congestion Avoidance

[Figure: sender/receiver timeline; cwnd grows by 1 per RTT]

cwnd ← cwnd + 1 (for each cwnd ACKs)
Packet Loss
 Assumption: loss indicates congestion
 Packet loss detected by
 Retransmission TimeOuts (RTO timer)
 Duplicate ACKs (at least 3)
[Figure: packets 1–7 sent, packet 4 lost; ACKs returned: 1, 2, 3, 3, 3, 3, i.e. duplicate ACKs for 3]
Fast Retransmit
 Waiting for a timeout is quite long
 Retransmit immediately after 3 dupACKs,
without waiting for timeout
 Adjust ssthresh
flightsize = min(awnd, cwnd)
ssthresh ← max(flightsize/2, 2)
 Enter Slow Start (cwnd = 1)
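A minimal sketch of the dupACK trigger described above; the threshold and ACK stream are illustrative, not from any real TCP stack:

```python
# Detect 3 duplicate ACKs in a cumulative-ACK stream.
def fast_retransmit_trigger(acks, dup_threshold=3):
    """Return the ACK number repeated dup_threshold extra times (3 dupACKs);
    the missing segment is the one just past it. None if no trigger."""
    last, dups = None, 0
    for a in acks:
        if a == last:
            dups += 1
            if dups == dup_threshold:
                return a
        else:
            last, dups = a, 0
    return None

# ACK stream from the figure above: packet 4 lost, so ACKs repeat 3.
print(fast_retransmit_trigger([1, 2, 3, 3, 3, 3]))  # -> 3
```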
Successive Timeouts
 When there is a timeout, double the RTO
 Keep doing so for each lost retransmission
 Exponential back-off
 Max 64 seconds1
 Max 12 retransmits1

1 - Net/3 BSD
Summary: Tahoe
 Basic ideas
 Gently probe network for spare capacity
 Drastically reduce rate on congestion
 Windowing: self-clocking
 Other functions: round trip time estimation, error
recovery
for every ACK {
if (W < ssthresh) then W++ (SS)
else W += 1/W (CA)
}
for every loss {
ssthresh = W/2
W = 1
}
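The pseudocode above can be sketched as runnable Python; this is a toy model of the window variable only, ignoring timing, loss detection, and retransmission:

```python
# Toy model of the Tahoe pseudocode above.
def tahoe_on_ack(W, ssthresh):
    if W < ssthresh:
        return W + 1, ssthresh        # SS: +1 per ACK (doubles each RTT)
    return W + 1.0 / W, ssthresh      # CA: +1/W per ACK (+1 per RTT)

def tahoe_on_loss(W, ssthresh):
    return 1.0, max(W / 2, 2)         # halve ssthresh, restart from W = 1

W, ssthresh = 1.0, 8
for _ in range(20):                   # feed 20 ACKs
    W, ssthresh = tahoe_on_ack(W, ssthresh)
print(round(W, 2))                    # past ssthresh, now growing linearly

W, ssthresh = tahoe_on_loss(W, ssthresh)
print(W, round(ssthresh, 2))          # back to W = 1, ssthresh halved
```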
TCP Tahoe (Jacobson 1988)
[Figure: window vs. time, a sawtooth: SS (Slow Start) phases followed by CA (Congestion Avoidance)]
TCP Reno (Jacobson 1990)
[Figure: window vs. time: SS (Slow Start), CA (Congestion Avoidance), and fast retransmission/fast recovery]
Fast recovery
 Motivation: prevent `pipe’ from emptying after
fast retransmit
 Idea: each dupACK represents a packet having
left the pipe (successfully received)
 Enter FR/FR after 3 dupACKs
 Set ssthresh  max(flightsize/2, 2)
 Retransmit lost packet
 Set cwnd  ssthresh + ndup (window inflation)
 Wait till W=min(awnd, cwnd) is large enough;
transmit new packet(s)
 On non-dup ACK (1 RTT later), set cwnd  ssthresh
(window deflation)
 Enter CA
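A minimal sketch of the ssthresh/cwnd arithmetic in the FR/FR steps above (not a full TCP implementation; the numbers are illustrative):

```python
# Window inflation on 3 dupACKs and deflation on the first new ACK.
def reno_on_3_dupacks(cwnd, awnd, ndup=3):
    flightsize = min(awnd, cwnd)
    ssthresh = max(flightsize // 2, 2)   # halve the flight size
    cwnd = ssthresh + ndup               # window inflation
    # ...retransmit the lost packet; each further dupACK adds 1 to cwnd...
    return cwnd, ssthresh

def reno_on_new_ack(ssthresh):
    return ssthresh                      # window deflation, then enter CA

cwnd, ssthresh = reno_on_3_dupacks(cwnd=8, awnd=16)
print(cwnd, ssthresh)                    # -> 7 4
print(reno_on_new_ack(ssthresh))         # -> 4
```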
Example: FR/FR
[Figure: timeline; packet 1 of 1–8 lost. On 3 dupACKs, ssthresh ← 4, packet 1 retransmitted, cwnd inflates 8 → 9 → 11; on the new ACK, cwnd deflates to 4 and FR/FR exits]
 Fast retransmit
 Retransmit on 3 dupACKs
 Fast recovery
 Inflate window while repairing loss to fill pipe
Summary: Reno
 Basic ideas
 Fast recovery avoids slow start
 dupACKs: fast retransmit + fast recovery
 Timeout: fast retransmit + slow start
[State diagram: congestion avoidance → FR/FR on dupACKs, then back to congestion avoidance; timeout → retransmit + slow start]

NewReno: Motivation
[Figure: two packets lost in one window. FR/FR repairs packet 1, but the partial ACK deflates the window, leaving 8 unack'd packets and no further dupACKs; the sender stalls until timeout]
 On 3 dupACKs, receiver has packets 2, 4, 6, 8; sender has cwnd=8,
retransmits pkt 1, enters FR/FR
 Next dupACK increments cwnd to 9
 After an RTT, ACK arrives for pkts 1 & 2; sender exits FR/FR with
cwnd=5 and 8 unack'd pkts
 No more ACKs arrive: sender must wait for timeout
NewReno Fall & Floyd '96 (RFC 2582)

 Motivation: multiple losses within a window


 Partial ACK acknowledges some but not all packets
outstanding at start of FR
 Partial ACK takes Reno out of FR, deflates window
 Sender may have to wait for timeout before proceeding
 Idea: partial ACK indicates lost packets
 Stays in FR/FR and retransmits immediately
 Retransmits 1 lost packet per RTT until all lost packets
from that window are retransmitted
 Eliminates timeout
SACK Mathis, Mahdavi, Floyd, Romanow ’96 (RFC 2018, RFC 2883)

 Motivation: Reno & NewReno retransmit at most


1 lost packet per RTT
 Pipe can be emptied during FR/FR with multiple losses
 Idea: SACK provides better estimate of packets
in pipe
 SACK TCP option describes received packets
 On 3 dupACKs: retransmits, halves window, enters FR
 Updates pipe = packets in pipe
 Increment when lost or new packets sent
 Decrement when dupACK received
 Transmits a (lost or new) packet when pipe < cwnd
 Exit FR when all packets outstanding when FR was
entered are acknowledged
Outline

 Introduction
 TCP/AQM algorithms
 Window flow control
 Source algorithm: Tahoe, Reno, Vegas
 Link algorithm: RED, REM
 Duality model
 Equilibrium of TCP/AQM
 Linear dynamic model
 Dynamics of TCP/AQM
 A scalable control
TCP Reno (Jacobson 1990)

window

time
SS CA

SS: Slow Start


CA: Congestion Avoidance Fast retransmission/fast recovery
TCP Vegas (Brakmo & Peterson 1994)
[Figure: window vs. time: SS, then CA converging to a constant window]
 Converges, no retransmission
 … provided buffer is large enough
Vegas CA algorithm
for every RTT
{ if W/RTTmin – W/RTT < α then W ++
  if W/RTTmin – W/RTT > β then W -- }
for every loss
  W := W/2

(W/RTTmin – W/RTT, scaled by RTTmin, estimates this flow's queue size in the network)
Implications
 Congestion measure = end-to-end queueing
delay
 At equilibrium
 Zero loss
 Stable window at full utilization
 Approximately weighted proportional fairness
 Nonzero queue, larger for more sources
 Convergence to equilibrium
 Converges if sufficient network buffer
 Oscillates like Reno otherwise
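A toy sketch of the Vegas update above; the α/β thresholds (in packets) and the RTT-vs-window model are assumed values, not from the slides:

```python
# Vegas congestion avoidance: hold the estimated backlog between alpha and beta.
def vegas_update(W, rtt_min, rtt, alpha=1.0, beta=3.0):
    diff = W / rtt_min - W / rtt       # expected minus actual rate (pkts/s)
    backlog = diff * rtt_min           # ~ this flow's packets queued in network
    if backlog < alpha:
        return W + 1                   # too little queueing: speed up
    if backlog > beta:
        return W - 1                   # too much queueing: slow down
    return W                           # within [alpha, beta]: hold steady

W = 1
for _ in range(50):
    rtt = 0.1 + W * 0.002              # hypothetical: RTT grows with window
    W = vegas_update(W, rtt_min=0.1, rtt=rtt)
print(W)                               # settles at a fixed window, no loss
```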
Outline

 Introduction
 TCP/AQM algorithms
 Window flow control
 Source algorithm: Tahoe, Reno, Vegas
 Link algorithm: RED, REM
 Duality model
 Equilibrium of TCP/AQM
 Linear dynamic model
 Dynamics of TCP/AQM
 A scalable control
RED (Floyd & Jacobson 1993)

 Idea: warn sources of incipient congestion by


probabilistically marking/dropping packets
 Link algorithm to work with source algorithm
(Reno)
 Bonus: desynchronization
 Prevent bursty loss from buffer overflow

[Figure: marking probability vs. average queue, rising from 0 to 1 as the average queue approaches buffer size B; router marks, hosts back off]
RED
 Implementation
 Probabilistically drop packets
 Probabilistically mark packets
 Marking requires ECN bit (RFC 2481)
 Performance
 Desynchronization works well
 Extremely sensitive to parameter setting
 Fails to prevent buffer overflow as #sources increases
Variant: ARED (Feng, Kandlur, Saha, Shin 1999)

 Motivation: RED extremely sensitive to


#sources
 Idea: adapt maxp to load
 If avg. queue < minth, decrease maxp
 If avg. queue > maxth, increase maxp
 No per-flow information needed
Variant: FRED (Ling & Morris 1997)

 Motivation: marking packets in proportion to flow


rate is unfair (e.g., adaptive vs unadaptive flows)
 Idea
 A flow can buffer up to minq packets without being
marked
 A flow that frequently buffers more than maxq packets
gets penalized
 All flows with backlogs in between are marked according
to RED
 No flow can buffer more than avgcq packets persistently
 Need per-active-flow accounting
Variant: SRED (Ott, Lakshman & Wong 1999)

 Motivation: wild oscillation of queue in RED


when load changes
 Idea:
 Estimate number N of active flows
   each arriving packet is compared with a randomly
   chosen recently seen flow; N ≈ 1/Prob(hit)
 cwnd ∼ p^(-1/2) and N·p^(-1/2) = Q0 imply p = (N/Q0)²
 Marking prob = m(q) · min(1, p)
 No per-flow information needed
Variant: BLUE (Feng, Kandlur, Saha, Shin 1999)

 Motivation: wild oscillation of RED leads to


cyclic overflow & underutilization
 Algorithm
 On buffer overflow, increment marking prob
 On link idle, decrement marking prob
Variant: SFB
 Motivation: protection against nonadaptive flows
 Algorithm
 L hash functions map a packet to L bins (out of NxL )
 Marking probability associated with each bin is
 Incremented if bin occupancy exceeds threshold
 Decremented if bin occupancy is 0
 Packets marked with min {p1, …, pL}
[Figure: L hash functions h1 … hL map each flow to L bins; a nonadaptive flow drives the marking probability to 1 in every bin it maps to, while an adaptive flow shares only some of its bins]
Variant: SFB
 Idea
 A nonadaptive flow drives marking prob to 1
at all L bins it is mapped to
 An adaptive flow may share some of its L bins
with nonadaptive flows
 Nonadaptive flows can be identified and
penalized
REM Athuraliya & Low 2000

 Main ideas
 Decouple congestion & performance measure
 Price adjusted to match rate and clear buffer
 Marking probability exponential in `price’

[Figure: link marking probability vs. link congestion measure. REM: exponential in price; RED: piecewise linear in average queue]
Part II
Models
Congestion Control
 Heavy-tailed traffic: mice & elephants
 Elephants need efficient & fair sharing => congestion control (TCP)
 Mice need small delay (queueing + propagation) => CDN
TCP & AQM

[Feedback loop: sources set rates xi(t); links feed back congestion measures pl(t)]

 Example congestion measures pl(t)
 Loss (DropTail)
 Queue length (RED)
 Queueing delay (Vegas)
 Price (REM)
TCP & AQM

[Same feedback loop: xi(t) at sources, pl(t) at links]

 Duality theory => equilibrium
 Source rates xi(t) are primal variables
 Congestion measures pl(t) are dual variables
 Congestion control is optimization process over
Internet
TCP & AQM

[Same feedback loop: xi(t) at sources, pl(t) at links]

 Control theory => stability & robustness
 Internet is a gigantic feedback system
 Distributed & delayed
Outline

 Introduction
 TCP/AQM algorithms
 Duality model
 Equilibrium of TCP/AQM
 Linear dynamic model
 Dynamics of TCP/AQM
 A scalable control
Overview: equilibrium
 Interaction of source rates xs(t) and
congestion measures pl(t)
 Duality theory
 They are primal and dual variables
 Flow control is optimization process
 Example congestion measure
 Loss (Reno)
 Queueing delay (Vegas)
Overview: equilibrium
 Congestion control problem
   max_{x≥0} Σs Us(xs)
   subject to Σ_{s∈S(l)} xs ≤ cl,  l ∈ L
 Primal-dual algorithm
   x(t+1) = F(p(t), x(t))   Reno, Vegas
   p(t+1) = G(p(t), x(t))   DropTail, RED, REM
 TCP/AQM protocols (F, G)
 Maximize aggregate source utility
 With different utility functions Us(xs)
Overview: Reno
 Equilibrium characterization
   (1 – qi)/τi² = qi xi²/2   =>   xi = (1/τi) √(2(1 – qi)/qi)
 Duality => Ui_reno(xi) = (√2/τi) tan⁻¹(τi xi/√2)
 Congestion measure p = loss
 Implications
 Reno equalizes window wi = τi xi
   inversely proportional to delay τi
 1/√p dependence for small p
 DropTail fills queue, regardless of queue capacity
Overview: AQM

Decouple congestion & performance measure

 DropTail: queue = 94% full
 REM: queue = 1.5 pkts, utilization = 92%
   (γ = 0.05, α = 0.4, φ = 1.15)
 RED: min_th = 10 pkts, max_th = 40 pkts, max_p = 0.1
Overview: equilibrium

 DropTail
 High utilization
 Fills buffer, maximizes queueing delay
 RED
 Couples queue length with congestion measure
 High utilization or low delay
 REM
 Decouple performance & congestion measure
 Match rate, clear buffer
 Sum prices
Overview: Vegas
 Delay
 Congestion measure: end-to-end queueing delay qs
 Sets rate xs(t) = αs ds / qs(t)
 Equilibrium condition: Little's Law
 Loss
 No loss if converge (with sufficient buffer)
 Otherwise: revert to Reno (loss unavoidable)
 Fairness
 Utility function Us_vegas(xs) = αs ds log xs
 => Proportional fairness
Model
 Sources s
   L(s) - links used by source s
   Us(xs) - utility if source rate = xs
 Network
 Links l of capacities cl

[Example: two links of capacities c1, c2; x1 traverses both, x2 only link 1, x3 only link 2: x1 + x2 ≤ c1, x1 + x3 ≤ c2]
Primal problem

   max_{xs≥0} Σs Us(xs)
   subject to Σ_{s∈S(l)} xs ≤ cl,  l ∈ L

Assumptions
 Strictly concave increasing Us
 Unique optimal rates xs exist
 Direct solution impractical
Prior Work
 Formulation
 Kelly 1997
 Penalty function approach
 Kelly, Maulloo and Tan 1998
 Kunniyur and Srikant 2000
 Duality approach
 Low and Lapsley 1999
 Athuraliya and Low 2000, Low 2000
 Extensions
 Mo & Walrand 1998
 La & Anantharam 2000
Example

   max_{xs≥0} Σs log xs
   subject to x1 + x2 ≤ 1
              x1 + x3 ≤ 1

 Lagrange multipliers: p1 = p2 = 3/2
 Optimal: x1 = 1/3, x2 = x3 = 2/3

[Network: two unit-capacity links; x1 uses both, x2 uses link 1, x3 uses link 2]
Example
 xs : proportionally fair
 pl : Lagrange multiplier, (shadow) price, congestion
measure
 How to compute (x, p)?
 Relevance to TCP/AQM ??
[Same two-link network]
Example
 xs : proportionally fair (Vegas)
 pl : Lagrange multiplier, (shadow) price, congestion measure
 How to compute (x, p)?
 Gradient algorithms, Newton algorithm, Primal-dual algorithms…
 Relevance to TCP/AQM ??
 TCP/AQM protocols implement primal-dual algorithms over Internet

   Links update prices from aggregate rates:
   p1(t+1) = [p1(t) + γ(x1(t) + x2(t) – 1)]⁺
   p2(t+1) = [p2(t) + γ(x1(t) + x3(t) – 1)]⁺

   Sources update rates from path prices:
   x1(t+1) = 1/(p1(t) + p2(t));  x2(t+1) = 1/p1(t);  x3(t+1) = 1/p2(t)
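Running the primal-dual iteration above on the two-link, three-source example recovers the optimum computed earlier; the step size γ is an assumed value:

```python
# Primal-dual iteration for max sum(log xs) s.t. x1+x2 <= 1, x1+x3 <= 1.
gamma = 0.05
p1 = p2 = 1.0
x1 = x2 = x3 = 0.5
for _ in range(5000):
    # Links: raise price when aggregate rate exceeds the unit capacity.
    p1 = max(p1 + gamma * (x1 + x2 - 1), 0.0)
    p2 = max(p2 + gamma * (x1 + x3 - 1), 0.0)
    # Sources: rate = U'^{-1}(path price) for U(x) = log x, i.e. x = 1/price.
    x1 = 1 / (p1 + p2)
    x2 = 1 / p1
    x3 = 1 / p2
print(round(x1, 3), round(x2, 3), round(x3, 3))  # -> 0.333 0.667 0.667
print(round(p1, 3), round(p2, 3))                # -> 1.5 1.5
```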
Example
 xs : proportionally fair (Vegas)
 pl : Lagrange multiplier, (shadow) price, congestion
measure
 How to compute (x, p)?
 Gradient algorithms, Newton algorithm, Primal-dual algorithms…
 Relevance to TCP/AQM ??
 TCP/AQM protocols implement primal-dual algorithms over Internet
[Same two-link network]
Duality Approach

   Primal: max_{xs≥0} Σs Us(xs)   subject to Σ_{s∈S(l)} xs ≤ cl,  l ∈ L

   Dual: min_{p≥0} D(p) = max_{xs≥0} Σs Us(xs) + Σl pl (cl – x^l)

Primal-dual algorithm:

   x(t+1) = F(p(t), x(t))
   p(t+1) = G(p(t), x(t))
Duality Model of TCP

 TCP iterates on rates (windows)


 AQM iterates on congestion measures
 With different utility functions

Primal-dual algorithm:

   x(t+1) = F(p(t), x(t))   Reno, Vegas
   p(t+1) = G(p(t), x(t))   DropTail, RED, REM

Summary
 Congestion control problem
   max_{x≥0} Σs Us(xs)
   subject to Σ_{s∈S(l)} xs ≤ cl,  l ∈ L
 Primal-dual algorithm
   x(t+1) = F(p(t), x(t))   Reno, Vegas
   p(t+1) = G(p(t), x(t))   DropTail, RED, REM
 TCP/AQM protocols (F, G)
 Maximize aggregate source utility
 With different utility functions Us(xs)
(F, G, U) model

 Derivation
 Derive (F, G) from protocol description
 Fixed point (x, p) = (F(p, x), G(p, x)) gives equilibrium
 Derive U:
regard fixed point as Kuhn-Tucker condition

 Application: equilibrium properties


 Performance
Throughput, loss, delay, queue length
 Fairness, friendliness
Outline

 Introduction
 TCP/AQM algorithms
 Duality model (F, G, U)
 Queue management G : RED, REM
 TCP F and U : Reno, Vegas
 Linear dynamic model
 Dynamics of TCP/AQM
 A scalable control
Active queue management
 Idea: provide congestion information by
probabilistically marking packets
 Issues
 How to measure congestion (p and G)?
 How to embed congestion measure?
 How to feed back congestion info?

x(t+1) = F( p(t), x(t) ) Reno, Vegas

p(t+1) = G( p(t), x(t) ) DropTail, RED, REM


RED (Floyd & Jacobson 1993)

 Congestion measure: average queue length


pl(t+1) = [pl(t) + xl(t) - cl]+

 Embedding: marking probability piecewise linear in avg queue

[Figure: marking probability vs. avg queue]
 Feedback: dropping or ECN marking
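A sketch of RED's piecewise-linear marking curve, using the parameter values quoted later in these slides (min_th = 10, max_th = 40, max_p = 0.1); the hard drop above max_th follows classic (non-gentle) RED:

```python
# RED marking probability as a function of the average queue length.
def red_mark_prob(avg_queue, min_th=10, max_th=40, max_p=0.1):
    if avg_queue < min_th:
        return 0.0                    # below min_th: never mark
    if avg_queue >= max_th:
        return 1.0                    # above max_th: mark/drop everything
    # linear ramp from 0 to max_p between min_th and max_th
    return max_p * (avg_queue - min_th) / (max_th - min_th)

print(red_mark_prob(25))              # -> 0.05
```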


REM (Athuraliya & Low 2000)

 Congestion measure: price


pl(t+1) = [pl(t) + (l bl(t)+ xl (t) - cl )]+

 Embedding:

 Feedback: dropping or ECN marking


REM (Athuraliya & Low 2000)

 Congestion measure: price

   pl(t+1) = [pl(t) + γ(αl bl(t) + xl(t) – cl)]⁺

 Embedding: exponential probability function
   ml(t) = 1 – φ^(–pl(t))

[Figure: link marking probability vs. link congestion measure, an exponential curve]

 Feedback: dropping or ECN marking

Key features
 Clear buffer and match rate

   pl(t+1) = [pl(t) + γ(αl bl(t) + x̂l(t) – cl)]⁺
              (αl bl(t): clear buffer;  x̂l(t) – cl: match rate)

 Sum prices
   end-to-end marking prob = 1 – φ^(–Σl pl(t)) = 1 – φ^(–p^s(t))
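A sketch of the REM price update and exponential marking, using the parameter values quoted for the simulations in these slides (γ = 0.05, α = 0.4, φ = 1.15); it also checks the sum-prices property:

```python
# REM: price update per link, exponential marking, and price summation.
gamma, alpha, phi = 0.05, 0.4, 1.15

def rem_price(p, backlog, rate, capacity):
    # price rises with backlog (clear buffer) and excess rate (match rate)
    return max(p + gamma * (alpha * backlog + rate - capacity), 0.0)

def rem_mark(p):
    return 1 - phi ** (-p)            # marking probability, exponential in price

# Sum-prices property: independent exponential marking at two links
# composes into marking by the end-to-end price p1 + p2.
p1, p2 = 2.0, 3.0
no_mark = (1 - rem_mark(p1)) * (1 - rem_mark(p2))
print(abs((1 - no_mark) - rem_mark(p1 + p2)) < 1e-9)  # -> True
```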
Theorem (Paganini 2000)


Global asymptotic stability for general utility
function (in the absence of delay)
Active Queue Management

            pl(t)    G(p(t), x(t))
 DropTail   loss     [1 – cl/xl(t)]⁺ (?)
 RED        queue    [pl(t) + xl(t) – cl]⁺
 Vegas      delay    [pl(t) + xl(t)/cl – 1]⁺
 REM        price    [pl(t) + γ(αl bl(t) + xl(t) – cl)]⁺

x(t+1) = F(p(t), x(t))   Reno, Vegas
p(t+1) = G(p(t), x(t))   DropTail, RED, REM

Congestion & performance

            pl(t)    G(p(t), x(t))
 DropTail   loss     [1 – cl/xl(t)]⁺ (?)
 RED        queue    [pl(t) + xl(t) – cl]⁺
 Vegas      delay    [pl(t) + xl(t)/cl – 1]⁺
 REM        price    [pl(t) + γ(αl bl(t) + xl(t) – cl)]⁺

 Decouple congestion & performance measure
 RED: `congestion' = `bad performance'
 REM: `congestion' = `demand exceeds supply'
   But performance remains good!
Outline

 Introduction
 TCP/AQM algorithms
 Duality model (F, G, U)
 Queue management G : RED, REM
 TCP F and U : Reno, Vegas
 Linear dynamic model
 Dynamics of TCP/AQM
 A scalable control
Utility functions

 Reno:  Ui_reno(xi) = (√2/τi) tan⁻¹(τi xi/√2)

 Vegas: Ui_vegas(xi) = αi di log xi
Reno: F
for every ack (ca)
{ W += 1/W }
for every loss
{ W := W/2 }

   ẇs(t) = xs(t)(1 – p(t)) · 1/ws(t)  –  xs(t) p(t) · ws(t)/2

Primal-dual algorithm:
x(t+1) = F(p(t), x(t))   Reno, Vegas
p(t+1) = G(p(t), x(t))   DropTail, RED, REM
Reno: F
for every ack (ca)
{ W += 1/W }
for every loss
{ W := W/2 }

   ẇs(t) = xs(t)(1 – p(t)) · 1/ws(t)  –  xs(t) p(t) · ws(t)/2

   Fs(p(t), x(t)) = xs(t) + (1 – p(t))/Ds²  –  p(t) xs²(t)/2

Primal-dual algorithm:
x(t+1) = F(p(t), x(t))   Reno, Vegas
p(t+1) = G(p(t), x(t))   DropTail, RED, REM
Implications
 Equilibrium characterization
   (1 – qi)/τi² = qi xi²/2   =>   xi = (1/τi) √(2(1 – qi)/qi)
 Duality => Ui_reno(xi) = (√2/τi) tan⁻¹(τi xi/√2)
 Congestion measure p = loss
 Implications
 Reno equalizes window wi = τi xi
   inversely proportional to delay τi
 1/√p dependence for small p
 DropTail fills queue, regardless of queue capacity
Validation - Reno

 30 sources, 3 groups with RTT = 3, 5, 7ms + 6ms (queueing delay)


Link capacity = 64 Mbps, buffer = 50 kB
 Measured windows equalized, match well with theory (black line)
Validation – Reno/RED

 30 sources, 3 groups with RTT = 3, 5, 7 ms + 6 ms (queueing delay)


 Link capacity = 64 Mbps, buffer = 50 kB
Validation – Reno/REM

 30 sources, 3 groups with RTT = 3, 5, 7 ms


 Link capacity = 64 Mbps, buffer = 50 kB
 Smaller window due to small RTT (~0 queueing delay)
Queue
 REM: queue = 1.5 pkts, utilization = 92%
   (γ = 0.05, α = 0.4, φ = 1.15)
   p = Lagrange multiplier, decoupled from queue
 DropTail: queue = 94% full
 RED: min_th = 10 pkts, max_th = 40 pkts, max_p = 0.1
   p increasing in queue!
Summary

 DropTail
 High utilization
 Fills buffer, maximizes queueing delay
 RED
 Couples queue length with congestion measure
 High utilization or low delay
 REM
 Decouple performance & congestion measure
 Match rate, clear buffer
 Sum prices
Gradient algorithm
 Gradient algorithm
   source: xi(t+1) = Ui'⁻¹(qi(t))
   link:   pl(t+1) = [pl(t) + γ(yl(t) – cl)]⁺

Theorem (ToN '99)
Converges to optimal rates in asynchronous
environment
Reno & gradient algorithm
 Gradient algorithm
   source: xi(t+1) = Ui'⁻¹(qi(t))
   link:   pl(t+1) = [pl(t) + γ(yl(t) – cl)]⁺

 TCP: approximate version of gradient algorithm

   Fi(qi(t), xi(t)) = xi(t) + (1 – qi(t))/τi²  –  qi(t) xi²(t)/2
Reno & gradient algorithm
 Gradient algorithm
   source: xi(t+1) = Ui'⁻¹(qi(t))
   link:   pl(t+1) = [pl(t) + γ(yl(t) – cl)]⁺

 TCP: approximate version of gradient algorithm

   xi(t+1) = xi(t) + (qi(t)/2) · (x̂i²(t) – xi²(t)),   where x̂i(t) = Ui'⁻¹(qi(t))
Outline

 Introduction
 TCP/AQM algorithms
 Duality model (F, G, U)
 Queue management G : RED, REM
 TCP F and U : Reno, Vegas
 Linear dynamic model
 Dynamics of TCP/AQM
 A scalable control
Vegas
for every RTT
{ if W/RTTmin – W/RTT < α then W ++
  if W/RTTmin – W/RTT > β then W -- }
for every loss
  W := W/2

F:  xs(t+1) = xs(t) + 1/Ds²(t)   if ws(t) – ds xs(t) < αs ds
    xs(t+1) = xs(t) – 1/Ds²(t)   if ws(t) – ds xs(t) > αs ds
    xs(t+1) = xs(t)              else

G:  pl(t+1) = [pl(t) + xl(t)/cl – 1]⁺
Implications
 Performance
 Rate, delay, queue, loss
 Interaction
 Fairness
 TCP-friendliness
 Utility function
Performance
 Delay
 Congestion measure: end-to-end queueing delay qs
 Sets rate xs(t) = αs ds / qs(t)
 Equilibrium condition: Little's Law
 Loss
 No loss if converge (with sufficient buffer)
 Otherwise: revert to Reno (loss unavoidable)
Vegas Utility

 Equilibrium (x, p) = (F(p, x), G(p, x))

   Us_vegas(xs) = αs ds log xs

 => Proportional fairness
Vegas & Basic Algorithm
 Basic algorithm
   source: xs(t+1) = Us'⁻¹(p(t))
 TCP: smoothed version of basic algorithm …
Vegas & Gradient Algorithm

 Basic algorithm
   source: xs(t+1) = Us'⁻¹(p(t))
 TCP: smoothed version of basic algorithm …

 Vegas
   xs(t+1) = xs(t) + 1/Ds²   if xs(t) < x̂s(t)
   xs(t+1) = xs(t) – 1/Ds²   if xs(t) > x̂s(t)
   xs(t+1) = xs(t)           else
   where x̂s(t) = Us'⁻¹(p(t))
Validation

                      Source 1      Source 3      Source 5
 RTT (ms)             17.1 (17)     21.9 (22)     41.9 (42)
 Rate (pkts/s)        1205 (1200)   1228 (1200)   1161 (1200)
 Window (pkts)        20.5 (20.4)   27 (26.4)     49.8 (50.4)
 Avg backlog (pkts)   9.8 (10)

 measured (theory)

 Single link, capacity = 6 pkts/ms
 5 sources with different propagation delays, αs = 2 pkts/RTT
Persistent congestion
 Vegas exploits buffer process to compute prices
(queueing delays)
 Persistent congestion due to
 Coupling of buffer & price
 Error in propagation delay estimation
 Consequences
 Excessive backlog
 Unfairness to older sources
Theorem
A relative error of εs in propagation delay estimation
distorts the utility function to
   Ûs(xs) = (1 + εs) αs ds log xs + εs ds xs
Evidence

Without estimation error With estimation error

 Single link, capacity = 6 pkts/ms, αs = 2 pkts/ms, ds = 10 ms


 With finite buffer: Vegas reverts to Reno
Evidence
Source rates (pkts/ms)
# src1 src2 src3 src4 src5
1 5.98 (6)
2 2.05 (2) 3.92 (4)
3 0.96 (0.94) 1.46 (1.49) 3.54 (3.57)
4 0.51 (0.50) 0.72 (0.73) 1.34 (1.35) 3.38 (3.39)
5 0.29 (0.29) 0.40 (0.40) 0.68 (0.67) 1.30 (1.30) 3.28 (3.34)

# queue (pkts) baseRTT (ms)


1 19.8 (20) 10.18 (10.18)
2 59.0 (60) 13.36 (13.51)
3 127.3 (127) 20.17 (20.28)
4 237.5 (238) 31.50 (31.50)
5 416.3 (416) 49.86 (49.80)
Vegas/REM
 To preserve Vegas utility function & rates
   xs = αs ds / qs      (qs: end-to-end queueing delay)
 REM
 Clear buffer : estimate of ds
 Sum prices : estimate of ps
 Vegas/REM
   xs(t+1) = xs(t) + 1/Ds²   if xs(t) < x̂s(t)
   xs(t+1) = xs(t) – 1/Ds²   if xs(t) > x̂s(t)
   xs(t+1) = xs(t)           else
Vegas/REM
 To preserve Vegas utility function & rates
   xs = αs ds / qs      (qs: end-to-end queueing delay)
 REM
 Clear buffer : Ds ≈ ds, so the estimate of ds is accurate
 Sum prices : estimate of ps
 Vegas/REM
   xs(t+1) = xs(t) + 1/Ds²   if xs(t) < x̂s(t)
   xs(t+1) = xs(t) – 1/Ds²   if xs(t) > x̂s(t)
   xs(t+1) = xs(t)           else
Vegas/REM
 To preserve Vegas utility function & rates
   xs = αs ds / ps      (ps: end-to-end price)
 REM
 Clear buffer : estimate of ds
 Sum prices : estimate of ps
 Vegas/REM
   xs(t+1) = xs(t) + 1/Ds²   if xs(t) < x̂s(t)
   xs(t+1) = xs(t) – 1/Ds²   if xs(t) > x̂s(t)
   xs(t+1) = xs(t)           else
Performance

[Figure: queue traces, Vegas/REM vs. Vegas; peak = 43 pkts, utilization 90%–96%]
Conclusion

Duality model of TCP: (F, G, U)

   x(t+1) = F(p(t), x(t))
   p(t+1) = G(p(t), x(t))

 Reno, Vegas
 Maximize aggregate utility
 With different utility functions
 DropTail, RED, REM
 Decouple congestion & performance
 Match rate, clear buffer
 Sum prices
Food for thought
 How to tailor utility to application?
 Choosing congestion control automatically
fixes utility function
 Can use utility function to determine
congestion control
Outline

 Introduction
 TCP/AQM algorithms
 Duality model (F, G, U)
 Equilibrium of TCP/AQM
 Linear dynamic model
 Stability of TCP/AQM
 A scalable control
Model assumptions
 Small marking probabilities
   end-to-end marking probability:
   1 – Πl (1 – pl) ≈ Σl pl
 Congestion avoidance dominates
 Receiver not limiting
 Decentralized
 TCP algorithm depends only on end-to-end measure
of congestion
 AQM algorithm depends only on local & aggregate
rate or queue
 Constant (equilibrium) RTT
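A quick numeric check of the small-p approximation above, with illustrative per-link marking probabilities:

```python
# Check: 1 - prod(1 - pl) is close to sum(pl) when the pl are small.
import math

probs = [0.01, 0.02, 0.005]                   # illustrative per-link values
exact = 1 - math.prod(1 - p for p in probs)   # exact end-to-end marking prob
approx = sum(probs)                           # first-order approximation
print(round(exact, 5), round(approx, 5))      # -> 0.03465 0.035
```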
Dynamics
 Small effect on queue
 AIMD
 Mice traffic
 Heterogeneity
 Big effect on queue
 Stability!
Stable: 20ms delay

[Figure: individual windows (pkts) vs. time (ms); windows settle near equilibrium]

Ns-2 simulations, 50 identical FTP sources, single link 9 pkts/ms, RED marking
Stable: 20ms delay

[Figures: individual & average window, and instantaneous queue (pkts), vs. time (ms); both settle]

Ns-2 simulations, 50 identical FTP sources, single link 9 pkts/ms, RED marking
Unstable: 200ms delay

[Figure: individual windows (pkts) vs. time; sustained oscillations]

Ns-2 simulations, 50 identical FTP sources, single link 9 pkts/ms, RED marking
Unstable: 200ms delay

[Figures: windows and instantaneous queue (pkts) vs. time; large sustained oscillations]

Ns-2 simulations, 50 identical FTP sources, single link 9 pkts/ms, RED marking
Other effects on queue

[Top row: instantaneous queue at 20 ms delay: baseline, with 50% noise traffic, and with heterogeneous delays (avg 16 ms); queue remains stable]

[Bottom row: instantaneous queue at 200 ms delay: baseline, with 50% noise, and with heterogeneous delays (avg 208 ms); oscillations persist]
Effect of instability
 Larger jitters in throughput
 Lower utilization

[Figure: mean and variance of individual window vs. delay (20–200 ms), no noise; variance grows with delay while the mean stays roughly flat]
Linearized system
[block diagram: sources F1 … FN and links G1 … GL in closed loop; rates x reach the links through forward routing Rf(s) as aggregate rates y; prices p return through backward routing Rb(s) as aggregate prices q]

F_i(q_i(t), x_i(t)) = \frac{1 - q_i(t)}{\tau_i^2} - \frac{q_i(t)\, x_i^2(t)}{2}

G_l(p_l(t), y_l(t)) := \tilde{G}_l(z_l(t), p_l(t), y_l(t))
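The source law can be checked numerically. Below is a minimal sketch (marking probability and RTT are assumed values, not from the slides) that integrates dx/dt = (1 − q)/τ² − q·x²/2 by forward Euler and compares against the equilibrium x* = √(2(1 − q)/q)/τ obtained by setting F_i = 0:

```python
import math

def reno_equilibrium(q, tau):
    # F_i = 0  =>  (1 - q)/tau^2 = q x^2 / 2  =>  x* = sqrt(2(1 - q)/q) / tau
    return math.sqrt(2.0 * (1.0 - q) / q) / tau

def simulate(q, tau, x0=1.0, dt=0.001, T=200.0):
    # forward-Euler integration of  dx/dt = (1 - q)/tau^2 - q x^2 / 2
    x = x0
    for _ in range(int(T / dt)):
        x += dt * ((1.0 - q) / tau**2 - q * x * x / 2.0)
    return x

q, tau = 0.01, 0.1   # assumed: 1% marking probability, 100 ms RTT (seconds)
print(simulate(q, tau), reno_equilibrium(q, tau))  # the two should agree
```

Since dx/dt = A − Bx² is monotone toward its fixed point, the simulated rate settles exactly at the closed-form equilibrium.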
Linearized system
[block diagram: sources F1 … FN, links G1 … GL, forward routing Rf(s), backward routing Rb(s), as above]

L(s) = \underbrace{\frac{c^2\tau^2/2N}{\tau s + p^* w^*}}_{\text{TCP}} \cdot \underbrace{\frac{\rho}{\tau_c s + 1}}_{\text{RED}} \cdot \underbrace{\frac{1}{\tau s}}_{\text{queue}} \cdot \underbrace{e^{-\tau s}}_{\text{delay}}

with equilibrium window w^* = \tau c / N and RED averaging time constant \tau_c
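A Nyquist-style numerical check of the linearized loop: evaluate the loop gain on the imaginary axis and read off the magnitude at the −180° phase crossover. The sketch below uses the factored form L(s) = [c²τ²/2N]/(τs + p*w*) · e^{−τs} · ρ/(1 + s/αc) · 1/(τs) with assumed RED parameters ρ and α (not the tutorial's exact constants):

```python
import numpy as np

def gain_at_phase_crossover(c, N, tau, rho=1e-3, alpha=1e-4):
    # Loop gain on the imaginary axis (time in ms, capacity in pkts/ms):
    #   L(s) = [c^2 tau^2/(2N)]/(tau s + p*w*) * e^{-tau s}
    #          * rho/(1 + s/(alpha c)) * 1/(tau s)
    wstar = tau * c / N                 # equilibrium window (pkts)
    pstar = 2.0 / wstar**2              # Reno equilibrium marking probability
    w = np.logspace(-6, 0, 100000)      # frequency grid (rad/ms)
    s = 1j * w
    L = (c**2 * tau**2 / (2 * N)) / (s * tau + pstar * wstar) * np.exp(-s * tau)
    L = L * rho / (1 + s / (alpha * c)) / (s * tau)
    ph = np.unwrap(np.angle(L))
    idx = int(np.argmax(ph <= -np.pi))  # first -180 degree crossing
    return float(abs(L[idx]))

print(gain_at_phase_crossover(9.0, 50, 200.0))  # >> 1: unstable at 200 ms delay
print(gain_at_phase_crossover(9.0, 50, 10.0))   # < 1: stable at short delay
```

With the 50-source, 9 pkts/ms scenario from the simulations, the crossover gain is far above 1 at a 200 ms delay and below 1 at 10 ms, matching the instability seen in the ns-2 traces.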
Stability condition
Theorem
TCP/RED stable if  c\,\tau\,|H| \le \frac{1}{3\sqrt{3}}

[figure: Nyquist plot of h(ν, θ)]
Stability condition
Theorem
TCP/RED stable if  c\,\tau\,|H| \le \frac{1}{3\sqrt{3}}

[figure: Nyquist plot of h(ν, θ); small-gain region annotated:
 Slow response
 Large delay]
Stability condition
Theorem
TCP/RED stable if  c\,\tau\,|H| \le \frac{1}{3\sqrt{3}}

[figure: Nyquist plot of h(ν, θ)]
[figure: LHS of stability condition (×10^10) vs delay (ms), N = 200, for c = 10, 20, 30, 40, 50 pkts/ms; the LHS grows with both capacity and delay]
Validation
[figure: round-trip propagation delay at critical frequency (ms), model vs NS; dynamic-link model, 30 data points]
[figure: critical frequency (Hz), model vs NS]
Validation
[figure: round-trip propagation delay at critical frequency (ms), model vs NS; static-link vs dynamic-link model, 30 data points]
[figure: critical frequency (Hz), model vs NS]
Stability region
[figure: round-trip propagation delay at critical frequency (ms) vs capacity (pkts/ms), curves for N = 20, 30, 40, 60]
Unstable for
 Large delay
 Large capacity
 Small load
Role of AQM
 (Sufficient) stability condition
-1 \notin K \cdot \text{co}\{\, H(\omega; \theta) \,\}
Role of AQM
 (Sufficient) stability condition
-1 \notin K \cdot \text{co}\{\, H(\omega; \theta) \,\}

TCP: \frac{c^2\tau^2}{2N} \cdot \frac{e^{-j\omega\tau}}{j\omega\tau + p^* w^*}

AQM: scale down K & shape H
Role of AQM
-1 \notin K \cdot \text{co}\{\, H(\omega; \theta) \,\}

TCP: \frac{c^2\tau^2}{2N} \cdot \frac{e^{-j\omega\tau}}{j\omega\tau + p^* w^*}

RED: \frac{\rho\, c\, e^{-j\omega\tau^b}}{1 + j\omega\tau_c}    Problem!!

REM/PI: \frac{a_2\,(j\omega + a_1/a_2)\, e^{-j\omega\tau^b}}{j\omega}

 Queue dynamics (windowing)
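The shaping difference can be seen directly in frequency responses. A small sketch (pole and zero values are assumed for illustration, not the tutorial's constants) comparing the extra phase lag RED's averaging filter adds on top of the queue integrator against the phase a PI-style zero restores:

```python
import numpy as np

w = np.logspace(-4, 1, 5000)     # frequency grid, rad/ms (assumed range)
s = 1j * w
alpha_c = 1e-3                   # RED averaging pole alpha*c (assumed value)
a1, a2 = 1e-3, 1.0               # PI/REM constants (assumed values)

# RED stacks a low-pass averaging filter on the queue integrator
red_shape = 1.0 / (s * (1.0 + s / alpha_c))
# PI/REM places a zero at a1/a2 that gives the phase back near crossover
pi_shape = a2 * (s + a1 / a2) / s

i = int(np.argmin(np.abs(w - 0.1)))   # compare phase around 0.1 rad/ms
red_phase = float(np.degrees(np.angle(red_shape[i])))
pi_phase = float(np.degrees(np.angle(pi_shape[i])))
print(red_phase, pi_phase)   # RED close to -180 deg, PI close to 0 deg
```

Well above the averaging pole, RED contributes nearly −180° of lag by itself, leaving no phase margin for the delay; the PI zero flattens the phase, which is the "shape H" role of AQM.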
Scalable control??
[diagram: links adapt prices pl(t); sources adapt rates xi(t)]

Scalable control + REM: Stability, Utilization, Delay
Outline

 Introduction
 TCP/AQM algorithms
 Duality model (F, G, U)
 Equilibrium of TCP/AQM
 Linear dynamic model
 Stability of TCP/AQM
 A scalable control
Delay compensation

 Equilibrium
 High utilization: integrator at links
 Low loss & delay: VQ, REM/PI
 Stability
 Integrator + network delay always unstable for high gain!

[block diagram: Source → network delay e^{-τs} → link integrator 1/s]
Delay compensation

 Equilibrium
 High utilization: integrator at links
 Low loss & delay: REM, PI (HMTG01a)
 Stability
 Integrator + network delay always unstable for high gain!
 Delay invariance
 Scale down gain by τ_i (known at sources)
 This is self-clocking!

[block diagram: Source → network delay e^{-τs} → link integrator 1/(τs)]
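The delay-invariance claim can be checked directly: with the gain scaled down by τ, the loop κ·e^{−τs}/(τs) has unity-gain crossover at ω = κ/τ, where the phase is −π/2 − κ, so the phase margin π/2 − κ does not depend on τ. A numerical check (κ = 1 is an assumed value):

```python
import numpy as np

def phase_margin(kappa, tau):
    # Compensated loop: L(s) = kappa * e^{-tau s} / (tau s)
    w = np.logspace(-6, 2, 200000)
    L = kappa * np.exp(-1j * w * tau) / (1j * w * tau)
    i = int(np.argmin(np.abs(np.abs(L) - 1.0)))   # unity-gain crossover
    return float(np.pi + np.angle(L[i]))          # phase margin (rad)

margins = [phase_margin(1.0, tau) for tau in (10.0, 100.0, 500.0)]
print(margins)   # same margin, pi/2 - kappa, for every delay
```

The crossover frequency moves with 1/τ, but the margin stays fixed, which is why one gain rule works for all RTTs.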
Compensation for capacity

 Speed of adaptation increases with
 Link capacity
 #bottleneck links in path
 Capacity invariance
 Scale down by capacity cl at links
 Scale up by rate xi(t) at sources
Scalable control (Paganini, Doyle, Low 01)

TCP: x_i(t) = x_i^{\max}\, e^{-\alpha_i q_i(t) / (m_i \tau_i)}

AQM: \dot{p}_l(t) = \frac{1}{c_l}\, ( y_l(t) - c_l )

Theorem (Paganini, Doyle, Low 2001)
Provided R is full rank, the feedback loop is locally stable for arbitrary delay, capacity, load and topology.
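The two laws are easy to exercise in a delay-free, single-bottleneck sketch (all constants below are assumed for illustration): identical sources set rate x = x_max·exp(−α·p/(m·τ)) from the price, and the link integrates its rate mismatch scaled by 1/c.

```python
import math

def run(N=50, c=9.0, x_max=5.0, alpha=0.5, m=1, tau=1.0, dt=0.01, steps=20000):
    # source statics:  x_i = x_max * exp(-alpha_i * q_i / (m_i * tau_i))
    # link dynamics :  p'  = (y - c) / c, projected to stay nonnegative
    p = 0.0
    y = N * x_max
    for _ in range(steps):
        x = x_max * math.exp(-alpha * p / (m * tau))  # identical sources
        y = N * x
        p = max(0.0, p + dt * (y - c) / c)
    return y, p

y, p = run()
print(y, p)   # aggregate rate settles at the link capacity c = 9
```

This delay-free version only illustrates the equilibrium behavior (full utilization with a positive price); the theorem's content is that the same laws stay stable with feedback delay in the loop.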
Scalable control (Paganini, Doyle, Low 01)

TCP: x_i(t) = x_i^{\max}\, e^{-\alpha_i q_i(t) / (m_i \tau_i)}

AQM: \dot{p}_l(t) = \frac{1}{c_l}\, ( y_l(t) - c_l )   (link marking probability)

[figure: link marking probability (exponential utility function) vs link congestion measure]

Theorem (Paganini, Doyle, Low 2001)
Provided R is full rank, the feedback loop is locally stable for arbitrary delay, capacity, load and topology.
Stability
[figure: 40 ms delay — individual window (pkts) and instantaneous queue (pkts) vs time (sec); queue with 54% noise; 50 sources]
[figure: 200 ms delay — individual window (pkts) and instantaneous queue (pkts) vs time (sec); queue with 56% noise]
Papers
netlab.caltech.edu

 Scalable laws for stable network congestion control (CDC 2001)
 A duality model of TCP flow controls (ITC, Sept 2000)
 Optimization flow control, I: basic algorithm & convergence (ToN, 7(6), Dec 1999)
 Understanding Vegas: a duality model (ACM Sigmetrics, June 2000)
 REM: active queue management (IEEE Network, May/June 2001)
The End
Backup slides
The 1/\sqrt{p} Law

 Equilibrium window size: w_s = \frac{a}{\sqrt{p_s}}
 Equilibrium rate: x_s = \frac{a}{D_s \sqrt{p_s}}
 Empirically constant a ≈ 1
 Verified extensively through simulations and on the Internet
 References
 T. J. Ott, J. H. B. Kemperman and M. Mathis (1996)
 M. Mathis, J. Semke, J. Mahdavi, T. Ott (1997)
 T. V. Lakshman and U. Madhow (1997)
 J. Padhye, V. Firoiu, D. Towsley, J. Kurose (1998)
 J. Padhye, V. Firoiu, D. Towsley (1999)
Implications
 Applicability
 Additive increase multiplicative decrease (Reno)
 Congestion avoidance dominates
 No timeouts, e.g., SACK+RH
 Small losses
 Persistent, greedy sources
 Receiver not bottleneck
 Implications
 Reno equalizes window
 Reno discriminates against long connections
Derivation (I)
[figure: sawtooth window oscillating between 2w/3 and 4w/3, so w = (4w/3 + 2w/3)/2; each cycle lasts 2w/3 RTTs and covers area 2w^2/3]

 Each cycle delivers 2w^2/3 packets
 Assume: each cycle delivers 1/p packets
 Delivers 1/p packets followed by a drop
 Loss probability = p/(1+p) ≈ p if p is small
 Hence w = \sqrt{3/(2p)}
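The cycle accounting is a one-liner to verify: choosing w so that the 2w²/3 packets per cycle equal the 1/p packets between drops gives w = √(3/2p). A quick check with an assumed loss probability:

```python
import math

p = 0.01                          # assumed loss probability
w = math.sqrt(3.0 / (2.0 * p))    # window solving 2 w^2 / 3 = 1/p
print(w, 2.0 * w * w / 3.0)       # packets per cycle equals 1/p = 100
```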
Derivation (II)

 Assume: loss occurs as a Bernoulli process with rate p
 Assume: spend most time in CA
 Assume: p is small
 w_n is the window size after the nth RTT

w_{n+1} = w_n/2,   if a packet is lost (prob. p w_n)
w_{n+1} = w_n + 1, if no packet is lost (prob. 1 - p w_n)

In equilibrium:  w = \frac{w}{2}\, p w + (w + 1)(1 - p w)
\Rightarrow w^2 \approx 2/p \Rightarrow w = \sqrt{2/p}
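The recursion also lends itself to a quick Monte Carlo sanity check (loss rate and run length are assumed values); the time-average window comes out on the order of √(2/p), in line with the fixed-point calculation:

```python
import math
import random

def mean_window(p, rtts=200000, seed=1):
    # w_{n+1} = w_n / 2 with prob. p * w_n, else w_n + 1
    rng = random.Random(seed)
    w, total = 10.0, 0.0
    for _ in range(rtts):
        if rng.random() < p * w:
            w /= 2.0
        else:
            w += 1.0
        total += w
    return total / rtts

p = 0.005
m = mean_window(p)
print(m, math.sqrt(2.0 / p))  # time-average window vs sqrt(2/p) = 20
```

The sample mean is not exactly √(2/p) (the sawtooth spends time on both sides of the fixed point), but it tracks the 1/√p scaling.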
Simulations
Refinement (Padhye, Firoiu, Towsley & Kurose 1998)

 Renewal model including
 FR/FR with Delayed ACKs (b packets per ACK)
 Timeouts
 Receiver awnd limitation
 Source rate

x_s = \min\left( \frac{W_r}{D_s},\; \frac{1}{D_s\sqrt{\frac{2bp}{3}} + T_o \min\left(1,\, 3\sqrt{\frac{3bp}{8}}\right) p\,(1 + 32 p^2)} \right)

 When p is small and W_r is large, reduces to
x_s = \frac{a}{D_s \sqrt{p_s}}
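The refined formula drops straight into code. A sketch (loss rate, RTT and RTO are assumed values) that also confirms the small-p, large-W_r limit recovers the a/(D√p) law with a = √(3/2) when b = 1:

```python
import math

def pftk_rate(p, D, To, b=1, Wr=float("inf")):
    # x_s = min( Wr/D, 1/( D*sqrt(2bp/3)
    #                      + To*min(1, 3*sqrt(3bp/8)) * p * (1 + 32 p^2) ) )
    denom = (D * math.sqrt(2.0 * b * p / 3.0)
             + To * min(1.0, 3.0 * math.sqrt(3.0 * b * p / 8.0))
               * p * (1.0 + 32.0 * p * p))
    return min(Wr / D, 1.0 / denom)

p, D, To = 1e-4, 0.1, 0.4   # assumed: loss rate, RTT (s), RTO (s)
x = pftk_rate(p, D, To)
print(x, math.sqrt(1.5) / (D * math.sqrt(p)))  # refined rate vs square-root law
```

At this loss rate the timeout term is negligible, so the two printed values nearly coincide; as p grows, the timeout term pulls the refined rate well below the square-root law.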
Further Refinements
 Further refinements of previous formula
 Padhye, Firoiu, Towsley and Kurose (1999)
 Other loss models
 E.Altman, K.Avrachenkov and C.Barakat
(Sigcomm 2000)
 Square root p still appears!
 Dynamic models of TCP
 e.g. RTT evolves as window increases
Link layer protocols
 Interference suppression
 Reduces link error rate
 Power control, spreading gain control
 Forward error correction (FEC)
 Improves link reliability
 Link layer retransmission
 Hides loss from transport layer
 Source may timeout while BS retransmits
Split TCP
[diagram: TCP connection split into two at the base station (BS)]
 Each TCP connection is split into two


 Between source and BS
 Between BS and mobile
 Disadvantages
 TCP not suitable for lossy link
 Overhead: packets TCP-processed twice at BS (vs. 0)
 Violates end-to-end semantics
 Per-flow information at BS complicates handover
Snoop protocol
[diagram: snoop agent at the BS on the TCP connection]
 Snoop agent
 Monitors packets in both directions
 Detects loss by dupACKs or local timeout
 Retransmits lost packet
 Suppresses dupACKs
 Disadvantages
 Cannot shield all wireless losses
 One agent per TCP connection
 Source may timeout while BS retransmits
Explicit Loss Notification
 Noncongestion losses are marked in ACKs
 Source retransmits but does not reduce
window
 Effective in improving throughput
 Disadvantages
 Overhead (TCP option)
 May not be able to distinguish types of losses,
e.g., corrupted headers
REM (Athuraliya & Low 2000)

 Congestion measure: price
p_l(t+1) = \left[\, p_l(t) + \gamma\,(\alpha_l b_l(t) + x_l(t) - c_l) \,\right]^+

 Embedding: exponential probability function
[figure: link marking probability vs link congestion measure]

 Feedback: dropping or ECN marking
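The two REM pieces, the price update and the exponential embedding m = 1 − φ^(−p), are compact enough to sketch directly. The toy closed loop below uses assumed constants (γ, α, φ, and a made-up source that backs off linearly in the marking probability) just to exhibit the match-rate-and-clear-buffer behavior:

```python
def rem_price(p, b, x, c, gamma=0.001, alpha=0.1):
    # p_l(t+1) = [ p_l(t) + gamma * (alpha_l * b_l(t) + x_l(t) - c_l) ]^+
    return max(0.0, p + gamma * (alpha * b + x - c))

def mark_prob(p, phi=1.1):
    # exponential embedding of the price into a marking probability
    return 1.0 - phi ** (-p)

# toy closed loop: backlog integrates the rate mismatch, and a hypothetical
# source reduces its rate in proportion to the marking probability
p, b, x, c = 0.0, 0.0, 12.0, 10.0
for _ in range(50000):
    p = rem_price(p, b, x, c)
    b = max(0.0, b + x - c)            # queue grows while x > c
    x = 12.0 * (1.0 - mark_prob(p))
print(x, b)   # rate settles near capacity, backlog near zero
```

Because the price reacts to both backlog (α·b term) and rate mismatch (x − c term), the equilibrium has x = c and b = 0: high utilization and low delay/loss at once.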
Performance Comparison

 RED: high utilization OR low delay/loss
 REM: match rate AND clear buffer
 High utilization AND low delay/loss
Comparison with RED
[figures: goodput, queue and loss traces for REM vs RED]
 Goodput
 Queue
 Loss
Application: Wireless TCP

 Reno uses loss as congestion measure


 In wireless, significant losses due to
 Fading
 Interference
 Handover
 Not buffer overflow (congestion)
 Halving window too drastic
 Small throughput, low utilization
Proposed solutions
 Ideas
 Hide from source noncongestion losses
 Inform source of noncongestion losses
 Approaches
 Link layer error control
 Split TCP
 Snoop agent
 SACK+ELN (Explicit Loss Notification)
Third approach
 Problem
 Reno uses loss as congestion measure
 Two types of losses
Congestion loss: retransmit + reduce window
Noncongestion loss: retransmit
 Previous approaches
Hide noncongestion losses
Indicate noncongestion losses
 Our approach
Eliminates congestion losses (buffer overflows)
Third approach
 Router
 REM capable
 Host
 Do not use loss as congestion measure
[diagram: Vegas sources over REM routers]

 Idea
 REM clears buffer
 Only noncongestion losses
 Retransmits lost packets without reducing window
Performance
 Goodput
[figures: goodput comparison]
Food for thought
 How to tailor utility to application?
 Choosing congestion control automatically
fixes utility function
 Can use utility function to determine
congestion control
 Incremental deployment strategy?
 What if some, but not all, routers are ECN-capable?
Acronyms
ACK   Acknowledgement
AQM   Active Queue Management
ARP   Address Resolution Protocol
ARQ   Automatic Repeat reQuest
ATM   Asynchronous Transfer Mode
B     Byte (or octet) = 8 bits
bps   bits per second
BSD   Berkeley Software Distribution
CA    Congestion Avoidance
ECN   Explicit Congestion Notification
FIFO  First In First Out
FTP   File Transfer Protocol
HTTP  Hyper Text Transfer Protocol
IAB   Internet Architecture Board
ICMP  Internet Control Message Protocol
IETF  Internet Engineering Task Force
IP    Internet Protocol
ISOC  Internet Society
MSS   Maximum Segment Size
MTU   Maximum Transmission Unit
POS   Packet Over SONET
QoS   Quality of Service
RED   Random Early Detection/Discard
RFC   Request for Comment
RTT   Round Trip Time
RTO   Retransmission TimeOut
SACK  Selective ACKnowledgement
SONET Synchronous Optical NETwork
SS    Slow Start
SYN   Synchronization Packet
TCP   Transmission Control Protocol
UDP   User Datagram Protocol
VQ    Virtual Queue
WWW   World Wide Web