
Congestion and Flow Control

TCP

Release 0.2 May 2022

Giuseppe Citerna – CCIE#10503

Claudio Scala – CCNA – DevNet Professional

Ing. Aldo Menichelli


Disclaimer

This work was produced by IeXa Academy; the authors are Aldo Menichelli, Giuseppe Citerna and Claudio Scala.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

If you want to contribute to the document by adding or modifying content, please contact IeXa Academy (click on this link).

IeXa Academy S.r.l. is in no way affiliated with any of the companies mentioned, including Cisco Systems Inc. This document is not officially sponsored, affiliated, or endorsed by Cisco Systems Inc. IeXa Academy S.r.l. respects the trademarks of all other companies and institutions mentioned.

Some logos, third-party trademarks, product names, trade names, corporate names and companies mentioned in this work may be trademarks of their respective owners or registered trademarks of other companies; they are used purely for explanatory purposes, with no intention by IeXa Academy S.r.l. of infringing the copyrights in force.

1. INTRODUCTION
This study focuses on the role that TCP plays in the functioning of the Internet: thanks to its congestion control mechanisms, it regulates the data flow of millions of connections so that the network as a whole does not collapse.
The parameters that enable this process are the acknowledgement of the segments sent and the number of bytes that the receiving buffers (intermediate and final) can hold.

2. THE TCP PROTOCOL


The transport layer of the TCP/IP suite takes care of the end-to-end connection, distinguishing the traffic of the various processes active on the hosts. At this layer, addressing is based on port numbers, one assigned to each process (daemon).
TCP (Transmission Control Protocol) is the transport protocol on which the Internet is based. It encapsulates the data generated by applications into segments, whose header is structured as follows:

1. TCP Header

The basic fields for congestion control are the Sequence Number, the Acknowledgement Number and the Window Size.
The Sequence Number is the number of the first byte contained in the segment. It is initialized during the establishment of the connection.
The Acknowledgement Number indicates the next expected byte: for example, if a host received a segment of size 100 bytes and sequence number 1, the acknowledgement number placed in the response would be 101 (we could read it as "I received bytes 1-100, I am waiting for byte 101").

The Window field indicates how many bytes the recipient's buffer can hold and, as we will see, it is the element on which TCP flow control is based.
Since TCP is a connection-oriented protocol, the main steps of a TCP connection can be identified:
1. Three-Way Handshake, establishment of the connection
2. Data Transmission
3. Closing of the connection
In addition, four elements are distinguished for each connection:

2. Processes and TCP

1. Sender Processes
2. Sender TCP
3. Receiver Processes
4. Receiver TCP

Since the processes that produce bytes and the processes that consume them do not necessarily operate at the same speed, buffers are used both in transmission and in reception. These are divided into cells, each one byte in size.

2.1 ACK
In the previous paragraph we introduced the Acknowledgement Number, an essential parameter for all of TCP's connection-oriented features (flow control, error checking and congestion control). The primary purpose of the Ack is to inform the sending host that the transmitted segments have actually been received, by communicating the next expected byte.

3. Ack in the Three-Way Handshake

Each segment that carries a sequence number must be acknowledged. Only pure Ack segments are not acknowledged, because they do not consume sequence numbers:

4. Ack segment sequence number

TCP uses cumulative acks: it does not count out-of-sequence, discarded, duplicated, or lost segments. Simply put, the Acknowledgement Number contained in the header of a segment means that all previous bytes have been received.
There are some general rules for sending acknowledgements; the most common are listed below (a sketch of the resulting receiver policy follows the list):
o Piggybacking (reduces the number of segments): in two-way communication, the acknowledgement is placed in a segment that also carries data;
o Delayed Ack: the recipient waits for a timer (usually 200-500 ms) when it has a single in-sequence segment to acknowledge, in order to reduce protocol overhead (if another segment arrives before the timer expires, the roughly 40 bytes of a separate pure-ack segment are saved);
o There must never be more than two consecutive unacknowledged segments: when the recipient receives the second in-sequence segment, it sends an immediate ack;
o If it receives an out-of-sequence segment, the recipient immediately sends a duplicate ack carrying the expected sequence number (the sender reacts after three such duplicates);
o When a missing segment that has been retransmitted arrives, the receiver immediately sends an ack indicating the next expected segment;
o When a duplicate segment arrives, the receiver immediately sends an ack indicating the next expected segment.
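The policy above can be condensed into a short Python sketch. The class and variable names are illustrative (not taken from any real TCP stack), and the delayed-ack timer itself is omitted for brevity; this is only a model of the decisions listed above.

```python
# Minimal sketch of the receiver-side ack policy described above.
# Names and the 500 ms timer value are illustrative, not taken from a real stack.

DELAYED_ACK_TIMEOUT = 0.5  # seconds, upper bound of the 200-500 ms range mentioned above

class AckPolicy:
    def __init__(self):
        self.rcv_next = 1          # next byte expected (acknowledgement number)
        self.unacked_segments = 0  # in-sequence segments not yet acknowledged

    def on_segment(self, seq, length):
        """Return 'delay', 'ack' or 'dup-ack' for an arriving data segment."""
        if seq == self.rcv_next:                 # in-sequence segment
            self.rcv_next += length
            self.unacked_segments += 1
            if self.unacked_segments >= 2:       # never leave two segments unacknowledged
                self.unacked_segments = 0
                return "ack"                     # immediate cumulative ack
            return "delay"                       # wait for piggyback, a second segment or the timer
        elif seq > self.rcv_next:                # out-of-sequence: a hole has been detected
            return "dup-ack"                     # immediate duplicate ack (rcv_next unchanged)
        else:                                    # duplicate or retransmitted data
            return "ack"                         # immediate ack with the next expected byte

policy = AckPolicy()
print(policy.on_segment(seq=1, length=100))    # 'delay'  : first in-sequence segment
print(policy.on_segment(seq=101, length=100))  # 'ack'    : second one, immediate cumulative ack
print(policy.on_segment(seq=301, length=100))  # 'dup-ack': byte 201 is missing, signal the hole
```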

A cumulative ack acknowledges all bytes prior to the reported number (the next byte expected). More recent TCP implementations also allow Selective ACKs (carried in the Options field), which specify the first and the last byte of the non-contiguous blocks that have been received.

5. RFC 2018

2.2 TIMER
The transmission of data with TCP is regulated by some timers, based on the arrival of the acks.
The first quantity we will deal with is the Round Trip Time (RTT): the time that elapses between sending a segment and receiving its acknowledgement.
The RTT is calculated by the sender using the Timestamp Value and Timestamp Echo Reply fields (RFC
1323 and 7323):

6a. RFC 1323

The sender places its timestamp in the TSval field of the segment. When the recipient receives it, it copies the TSval value into the TSecr field of the Ack segment and puts its own sending timestamp in the TSval field.

At this point, the sender can calculate the RTT by computing the difference between the arrival time of the Ack (taken from the operating system clock) and the timestamp contained in the TSecr field.

6b. Timestamps and RTT

In Figure 6b, the PC sends a data segment and puts the instant 0.0000 s in the TSval field. The server then responds with an Ack, in which it copies the PC's timestamp into the TSecr field and puts its own sending instant, 0.0050 s, in TSval. The PC can then calculate the RTT:
RTT = Ack arrival time – TSecr = 0.0100 – 0.0000 = 0.0100 s
The clock-synchronization problem does not arise here, since each host uses only its own clock values.
Another very important timer for the operation of TCP is the Retransmission Time-Out (RTO), which governs segment retransmissions.
The sender initializes the timer for each segment sent. If it expires without the corresponding acknowledgement having been received, the sender retransmits the segment.

7. RTO and retransmission of a segment

In Figure 7, the PC starts an RTO of 3 s when it transmits a segment. Not having received an Ack before the timer expires, the PC retransmits the segment and this time receives a response after 2.5 s.
An exception to the use of this timer is the combination of duplicate Acks and Fast Retransmit: when the recipient receives an out-of-sequence segment (i.e. the sequence number of the received segment is greater than expected), it immediately sends a duplicate Ack for each such segment. If three duplicate Acks reach the sender before its time-out expires, the only segment retransmitted is the one requested by the duplicate acks.
Fast Retransmit: RTO > Time of Reception of Duplicate Acks

8. Duplicate Ack and Fast Retransmit

3. THE MATHEMATICS OF TCP: VAN JACOBSON AND KARN


This is the first chapter in which we look at the mathematical aspects of TCP. In particular, we will analyze the algorithms of Van Jacobson and Karn, used to calculate the RTO.
Van Jacobson's algorithm
One of the key concepts of TCP is the RTO, the timer that regulates segment retransmissions. It depends on the RTT (Round Trip Time), which indicates the time it takes for a segment to reach the destination and for its acknowledgement to come back. The RTT is a random (dynamic) variable that depends on network conditions.
The choice of the Retransmission Time-Out is crucial:
- if it is too short, the sender does not wait long enough for the segment to reach its destination and be acknowledged, and it congests the network with unnecessary retransmissions;

9. RTO too short

- if it is too long, efficiency is lost, because the sender remains idle longer than necessary before it can proceed with a possible retransmission.

10. RTO too long

The RTO calculation is a complex process based on Jacobson's algorithm, which involves a few parameters:
- RTTM, the measured RTT, i.e. the time required for a segment to reach the recipient and be acknowledged (a variable quantity);
- RTTS, the smoothed RTT, i.e. the weighted average of the new RTTM and the previous RTTS, with weight α (usually α = 1/8):
  RTTS = (1 - α) · RTTS + α · RTTM

  The weighted average yields an estimate of the mean value of the RTT while still giving weight to the previous samples;

- RTTD, the deviated RTT, an approximation of the standard deviation of the RTT, estimated with the parameter β (usually β = 1/4):
  RTTD = (1 - β) · RTTD + β · |RTTS - RTTM|
  This parameter provides a safety margin between the estimated time and the actual time, since it accounts for how much the samples differ from the estimate;

- RTO, calculated from the smoothed RTT and its deviation:
  RTO = RTTS + 4 · RTTD

This means that the most recent value of RTTS is taken and balanced with the deviation. If the fluctuations are small, RTO ≈ RTTS; otherwise RTO is much larger than RTTS.
This choice minimizes spurious timeouts, since less than 1% of samples fall beyond 4 standard deviations. When a TCP connection is initialized, the values of RTTM, RTTS and RTTD do not yet exist.

11. Example Van Jacobson's algorithm

The initial value of the RTO is set by the host sending the SYN segment. The RTO must then be computed dynamically, because of congestion and window variations: the idea is to have a value that is always up to date.
Jacobson's algorithm (Van Jacobson: network researcher at Berkeley, Cisco, Palo Alto) provides the best estimate of the RTT towards a destination. α and β are the smoothing factors that determine how much the previous values weigh (they "smooth" the samples, reducing the peaks and, therefore, the difference between actual and estimated values).
In the example, the sender sets the initial RTO to 6 s. When it receives the acknowledgement from the recipient, it measures an RTT of 1.5 s. With these values it can apply the algorithm to derive the RTO for the next transmission:

After the first measurement

RTTM = 1.5 s
RTTS = RTTM = 1.5 s
RTTD = RTTM / 2 = 0.75 s
RTO = RTTS + 4 × RTTD = (1.5 + 4 × 0.75) s = 4.5 s

The sender proceeds with the next transmission, setting the time-out to 4.5 s. When it receives the new acknowledgement, it measures an RTTM of 2.5 s.

After the second measurement

RTTM = 2.5 s
RTTS = (1 - α) RTTS + α RTTM = (1 - ⅛) × 1.5 + ⅛ × 2.5 = 1.625 s
RTTD = (1 - β) RTTD + β |RTTS - RTTM| = (1 - ¼) × 0.75 + ¼ |1.625 - 2.5| = 0.78 s
RTO = RTTS + 4 × RTTD = (1.625 + 4 × 0.78) s ≈ 4.75 s
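The same calculation can be expressed as a minimal Python sketch, with the α and β values and the sample values of the example above; the class and attribute names are illustrative, not those of a real TCP implementation.

```python
# Minimal sketch of Jacobson's RTO estimator with the values used above
# (alpha = 1/8, beta = 1/4); names are illustrative, not from a real stack.

ALPHA = 1 / 8   # weight of the new sample in the smoothed RTT
BETA = 1 / 4    # weight of the new sample in the RTT deviation

class RtoEstimator:
    def __init__(self):
        self.rtts = None  # smoothed RTT (RTTS)
        self.rttd = None  # RTT deviation (RTTD)

    def update(self, rttm):
        """Feed one measured RTT (RTTM) and return the new RTO."""
        if self.rtts is None:            # first measurement
            self.rtts = rttm
            self.rttd = rttm / 2
        else:                            # subsequent measurements
            self.rtts = (1 - ALPHA) * self.rtts + ALPHA * rttm
            self.rttd = (1 - BETA) * self.rttd + BETA * abs(self.rtts - rttm)
        return self.rtts + 4 * self.rttd # RTO = RTTS + 4 * RTTD

est = RtoEstimator()
print(est.update(1.5))   # 4.5  (first measurement in the example)
print(est.update(2.5))   # 4.75 (second measurement in the example)
```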

Karn's algorithm
The retransmission of a segment creates ambiguity about the acknowledgement: does it refer to the original segment or to the retransmitted copy?
Since the arrival time of the Ack is fundamental for the calculation of the RTT and, therefore, of the RTO, the sending host must not use the information obtained from such an ambiguous acknowledgement.
Karn's algorithm states that TCP does not use the RTT of a retransmitted segment to update the RTO value. In this case, most TCP implementations use the exponential back-off strategy: the RTO is doubled with each retransmission.

12. Karn algorithm example

In the example, the recipient does not receive the segment with sequence number 1601. Once the RTO expires, the sender retransmits the segment and uses twice the previous time-out as the new one. When it later receives an acknowledgement from which the RTT can be measured again, it resumes Jacobson's algorithm, using the values obtained before the loss of the segment:
RTTM = 4 s

RTTS = (1 - α) RTTS + α RTTM = (1 - ⅛) × 1.625 + ⅛ × 4 = 1.92 s

RTTD = (1 - β) RTTD + β |RTTS - RTTM| = (1 - ¼) × 0.78 + ¼ |1.92 - 4| = 1.105 s

RTO = RTTS + 4 × RTTD = (1.92 + 4 × 1.105) s = 6.34 s
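Karn's rule and the exponential back-off can be sketched on top of the RtoEstimator from the previous example. This is only an illustration under the stated assumptions; real implementations also clamp the RTO between a minimum and a maximum value.

```python
# Sketch of Karn's rule with exponential back-off, continuing the RtoEstimator sketch above.
# Illustrative code: real stacks also bound the RTO and track per-segment timers.

def on_ack(estimator, rttm, was_retransmitted, current_rto):
    """Update the RTO only for segments that were not retransmitted (Karn's rule)."""
    if was_retransmitted:
        return current_rto          # ambiguous sample: do not feed it to Jacobson's algorithm
    return estimator.update(rttm)   # clean sample: normal Jacobson update

def on_timeout(current_rto):
    """Exponential back-off: double the RTO every time the retransmission timer expires."""
    return current_rto * 2
```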

4. FLOW CONTROL
The first mechanism we are going to analyze is Flow Control. The purpose of this feature is to avoid overrunning the recipient's receive buffer.
It is therefore necessary to introduce two elements:
● Receive Window (RWND): it allows the recipient to receive segments correctly and send the related acknowledgements. It is defined by two parameters:
o Rn, Receive Next: the next expected segment
o Rsize, Receive Size: the window size
The size of the window is advertised in the Window field (at most 2^16 bytes without the Window Scale option). TCP allows the recipient to accept data at the desired rate (the receive buffer may contain bytes that have been acknowledged but still need to be consumed by the application):

RWND = buffer size – number of bytes waiting to be consumed

13. Receive Window

● Send Window: an imaginary box that contains the sequence numbers of the packets that can be sent before an Ack is received. The parameters that define the send window are:
o Sf, Sent First: the first outstanding segment, awaiting acknowledgement
o Sn, Send Next: the next segment to be sent
o Ssize, Send Size: the window size
At any moment it divides the possible sequence numbers into four areas: the left part corresponds to segments already sent and acknowledged; the second part to segments sent and awaiting acknowledgement; the third part to segments that can still be sent; the right part to sequence numbers that cannot yet be used.

14. Send Window

The concept of flow control is based on the Sliding Window: the sender must keep track of the segments sent but not yet acknowledged, and of the maximum number of segments that can still be sent before transmission has to stop, adapting the send window accordingly.
As the transmission progresses, the window slides forward. Initially its limits are 0 and RWND - 1. With each byte sent, the lower bound grows by one unit; when there are RWND bytes awaiting acknowledgement, transmission must stop. For each byte acknowledged the window closes (the left wall moves to the right) and the upper limit grows by one unit if the new RWND allows it (the right wall moves to the right); otherwise the window must shrink.

15. Sliding Window example

When the recipient sends an ack, communicating a new Acknowledgement Number and a new RWND, it must respect the following relationship:
new ackNo + new RWND ≥ last ackNo + last RWND
The left-hand side of the inequality gives the new position of the right wall of the send window. This rule is necessary to prevent the send window from shrinking to the point of excluding bytes that have already been sent.

16. Example of reducing the incorrect send window

In Figure 16, the sender adjusts the size of the send window according to the information received in the last Ack: Ack = 206 and RWND = 12. This allows it to keep transmitting up to byte 214 before receiving a new acknowledgement: Ack = 210 and RWND = 4. With this new information the send window is reduced, but byte 214, still awaiting acknowledgement, is cut off. This happened because the recipient did not respect the inequality above:

new ackNo + new RWND ≥ last ackNo + last RWND  →  210 + 4 ≥ 206 + 12  →  214 ≥ 218, which is false.

To keep the inequality satisfied, the sum of the new Ack and the new RWND must be at least 218: keeping RWND = 4, the Ack must be at least 214; keeping Ack = 210, RWND must be at least 8.
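The rule can be checked with a one-line function; the values below are those of Figure 16, and the function name is ours.

```python
# Sketch of the window-shrink check described above: the receiver must never move
# the right edge of the send window (ackNo + RWND) backwards.

def right_edge_ok(last_ack, last_rwnd, new_ack, new_rwnd):
    """True if the advertisement respects new_ack + new_rwnd >= last_ack + last_rwnd."""
    return new_ack + new_rwnd >= last_ack + last_rwnd

print(right_edge_ok(206, 12, 210, 4))   # False: 214 < 218, byte 214 would be cut off
print(right_edge_ok(206, 12, 214, 4))   # True:  keeping RWND = 4, the Ack must be at least 214
print(right_edge_ok(206, 12, 210, 8))   # True:  keeping Ack = 210, RWND must be at least 8
```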

4.1 ERROR CHECKING


TCP is a reliable protocol that ensures the transmission of intact segments, in order and without duplicates.
Error checking ensures reliability and includes mechanisms for detecting and recovering corrupted segments, retransmitting lost segments, discarding duplicates, and storing segments received out of sequence.
TCP uses three tools:

1. Checksum: each segment has a dedicated 16-bit field used to detect corrupted segments. When the recipient receives the segment, it recalculates the checksum; if the result does not match the value carried in the segment, it considers the segment corrupt and discards it (a sketch of this computation follows the list).

2. Acks: they confirm the receipt of data and control segments (all segments that carry a sequence number). Ack segments have no sequence number of their own and are not acknowledged. The Acknowledgement field in the TCP header is used for cumulative acks (it does not count out-of-sequence, discarded, duplicated, or lost segments).
3. Timeout: a fundamental element for the retransmission of segments. When a segment is sent, the sending TCP initializes the Retransmission Time-Out (RTO) timer.
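As an illustration of the first tool, here is a minimal sketch of the 16-bit one's complement checksum of the kind TCP uses; it is a simplified model under that assumption, since a real implementation also covers a pseudo-header built from the IP addresses, protocol and length fields.

```python
# Minimal sketch of a 16-bit one's complement (Internet) checksum over a byte string.
# Illustrative only: TCP also includes a pseudo-header in the computation.

def internet_checksum(data: bytes) -> int:
    if len(data) % 2:                 # pad odd-length data with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):  # sum the data as 16-bit words
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold any carry back into the sum
    return ~total & 0xFFFF            # one's complement of the final sum

segment = b"example TCP payload"
print(hex(internet_checksum(segment)))
# The receiver recomputes the checksum over the received segment: if the result does
# not match, the segment is considered corrupt and is discarded.
```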

5. CONGESTION CONTROL
TCP's congestion control mechanism addresses the problem of buffer overruns at intermediate nodes, which may be handling multiple data streams at the same time.
To detect congestion, the protocol uses two signals:
- RTO: the failure to receive prompt and regular acks is an indication of severe congestion;
- Duplicate Acks: they report the loss of a single segment and indicate mild network congestion, since subsequent segments have reached their destination.
The element that manages congestion is the Congestion Window (CWND), calculated through algorithms that probe network conditions.
To control the number of segments to transmit, TCP uses the variables cwnd and rwnd: the final size of the send window is the minimum of the two values.
Final Size = min(rwnd, cwnd)
TCP uses a three-step strategy to manage congestion, seeking the right compromise between avoiding overload and making the most of the network capacity.
1. Slow Start (cwnd = 1): the congestion window is initialized to one MSS (Maximum Segment Size). The algorithm grows at exponential speed: every time an ack arrives, cwnd increases by one MSS, so the window doubles every RTT. The limit of this phase is the variable ssthresh (Slow Start Threshold), which is initially set high enough to exploit all the available bandwidth (usually 2^16 bytes). When cwnd reaches this value, slow start ends and the next phase begins.

If Ack arrives, cwnd = cwnd + 1

At this stage, as soon as the sender receives an Ack, it can immediately send new segments.

Slow Start

2. Congestion Avoidance (additive increase): this phase involves a linear increase of the cwnd variable. Once the slow start threshold is reached, each time the sender receives an ack it increases cwnd by 1/cwnd, so the window grows by about one MSS per RTT:

If ack arrives, cwnd = cwnd + 1/cwnd


The size of the congestion window is increased linearly until congestion is detected.

With this algorithm, before sending a new burst of segments, the sender essentially waits for the acks of all the previous segments to arrive.

Congestion Avoidance

3. Fast Recovery: an optional phase of TCP that begins when three duplicate acks arrive. The algorithm then increases the size of the congestion window linearly:

If an ack arrives, cwnd = cwnd + 1/cwnd

The Fast Recovery algorithm is implemented together with Fast Retransmit: the missing segment is retransmitted without waiting for the timeout (a compact sketch of the three phases follows).
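The three phases can be condensed into a small Python sketch, with cwnd expressed in MSS units. It is a simplification close to the Reno behaviour described in the next chapter; the class name and the initial ssthresh of 64 segments are arbitrary example values, not taken from any standard.

```python
# Simplified sketch (in MSS units) of the three phases described above: slow start,
# congestion avoidance and fast recovery. Real stacks add many more details.

class CongestionControl:
    def __init__(self, ssthresh=64):
        self.cwnd = 1.0              # congestion window, in segments (MSS)
        self.ssthresh = ssthresh     # slow start threshold, in segments

    def on_ack(self):
        if self.cwnd < self.ssthresh:
            self.cwnd += 1               # slow start: +1 MSS per ack -> doubles every RTT
        else:
            self.cwnd += 1 / self.cwnd   # congestion avoidance: about +1 MSS per RTT

    def on_triple_dup_ack(self):     # mild congestion: fast retransmit / fast recovery
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.ssthresh + 3

    def on_timeout(self):            # severe congestion: back to slow start
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1.0
```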

6. SOME IMPLEMENTATIONS OF TCP


How TCP switches between these algorithms depends on the version of TCP that is implemented. In this document we will refer to the Tahoe, Reno and New Reno versions.

Tahoe (1988)
The first version of TCP congestion control. It uses only the Slow Start and Congestion Avoidance algorithms and reacts to timeouts and duplicate acks in the same way: it first sets the value of ssthresh (usually a multiple of MSS) and sets cwnd = 1. Each time an ack arrives, the congestion window is increased by 1 MSS (exponential increase per RTT). If congestion is detected, TCP restarts Slow Start, setting ssthresh to half the cwnd value reached (ssthreshNEW = CWND/2). If, instead, no congestion is detected and the Slow Start threshold is reached, TCP moves to Congestion Avoidance and, therefore, to a linear increase: the congestion window grows by 1 MSS each time a number of acks equal to its current size is received (e.g. with a size of 5 MSS, 5 acks are needed to grow to 6). No upper limit is set in this phase, which continues until the end of the connection or until congestion is detected. If congestion is detected, TCP sets the threshold to half the cwnd value (ssthreshNEW = CWND/2) and restarts Slow Start.

Reno (1990)
A later version of TCP that also implements the Fast Recovery algorithm. This results in a different reaction when three duplicate acks are received: instead of restarting Slow Start, TCP switches to the Fast Recovery phase. The initial value of CWND in this case is (CWND/2) + 3 MSS, and the linear increase coincides with that of Congestion Avoidance. As long as duplicate acks keep arriving, TCP remains in this phase. If a timeout occurs, it switches back to Slow Start. If a new (non-duplicate) ack arrives, TCP switches to the Congestion Avoidance state, bringing the window to CWND/2.

TCP Reno only partially solves the problem of non-congestion losses, and only when the losses are not strongly correlated with each other, that is, when at most one packet is lost within each window. This behaviour is problematic when entire bursts of packets are lost (a frequent situation, for example, on wireless links). In these cases TCP Reno may reduce the congestion window several times in a row (as many times as there are lost packets), causing a drastic degradation of the transmission speed of the TCP connection.

New Reno (2004)

Its behaviour is similar to TCP Reno, but it changes the response to partial acks, which indicate the loss of multiple segments: TCP stays in the Fast Recovery phase and retransmits all the lost segments.

The TCP congestion control mechanism is based on AIMD (Additive Increase / Multiplicative Decrease) logic: once a network congestion signal is received, the emission rate of TCP segments is reduced quickly (Multiplicative Decrease) and then, depending on whether or not the congestion persists, brought back more or less to the pre-congestion level at a gentler pace (Additive Increase).

This behaviour is due to the fact that in most cases congestion is detected and managed through duplicate ACKs. The result is that CWND follows the characteristic sawtooth profile.

Cubic (2008)
TCP Cubic differs slightly from TCP Reno, since it only changes the Congestion Avoidance phase. Let
Wmax be the size of TCP’s congestion control window when loss was detected, and let K be the future
point in time when TCP CUBIC’s window size will again reach Wmax, assuming no losses. Cubic
increases the congestion window as a function of the cube of the distance between the current time t
and K. Thus, when t is further from K, the congestion window size increases are much larger than when
t is close to K. In this way CUBIC quickly ramps up TCP’s sending rate to get close to pre-loss rate,
Wmax, and only then probes cautiously for bandwidth as it approaches Wmax. When t is greater than K, the cubic rule implies that CUBIC's congestion window increases are small when t is still close to K (which is good if the congestion level of the link causing loss hasn't changed much), but then increase rapidly as t exceeds K (which allows CUBIC to more quickly find a new operating point if the congestion level of the link that caused the loss has changed significantly).
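The cubic growth rule can be written explicitly. The sketch below follows the W(t) = C·(t − K)³ + Wmax function of RFC 8312, with the constants C = 0.4 and β = 0.7 suggested there; the function name and the Wmax value are illustrative.

```python
# Sketch of the CUBIC window function described above (in the spirit of RFC 8312):
# W(t) = C_cubic * (t - K)^3 + Wmax, where K is the time needed to return to Wmax.

C_CUBIC = 0.4   # scaling constant suggested by the RFC
BETA = 0.7      # multiplicative decrease factor applied when a loss is detected

def cubic_window(t, wmax):
    """Congestion window (in MSS) t seconds after the last loss event."""
    k = ((wmax * (1 - BETA)) / C_CUBIC) ** (1 / 3)   # time at which W(t) reaches Wmax again
    return C_CUBIC * (t - k) ** 3 + wmax

wmax = 100.0                              # window size when the loss was detected (example)
for t in (0.0, 2.0, 4.0, 6.0):
    print(f"t={t:.0f}s  cwnd ~ {cubic_window(t, wmax):.1f} MSS")
# Growth is fast far from Wmax, nearly flat around t = K, then fast again beyond it.
```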

Looking at the graph, we see the slow start phase ending at t0. Then, when congestion loss occurs at
t1, t2 and t3, CUBIC quickly ramps up close to Wmax (leading to a greater average throughput than TCP
Reno). TCP CUBIC attempts to maintain the flow for as long as possible just below the congestion
threshold. At t3, the congestion level has presumably decreased appreciably, allowing both TCP Reno
and TCP CUBIC to overcome the Wmax limit.

TCP CUBIC has recently gained wide deployment. Nearly 50% of the most popular Web Servers are
running a version of TCP CUBIC, which is also the default version of TCP used in Linux operating
systems.

QUIC (2020)

QUIC (Quick UDP Internet Connections) is a new transport protocol for the internet, developed by
Google.

Faster Connection

The QUIC protocol improves communication between client and server by managing a separate, independent stream for each request: in this way, even if one request is not handled quickly, it does not slow down the others. It initiates a connection with a single packet and communicates all the necessary TLS or HTTPS parameters to the server directly, without having to wait for a response, unlike TCP, which needs to first obtain and process a server confirmation.

Multiplexing Support

QUIC connections are identified by a 64-bit ID, randomly generated by the client, as opposed to TCP connections, which are identified by the combination of source address, source port, destination address and destination port.

This means that if a client changes IP addresses (for example, suddenly switching from a Wi-Fi
connection to a 3G one) all active TCP connections will no longer be valid.

The moment a QUIC client changes IP addresses, it can continue to use the old connection ID from
the new IP address without interrupting any ongoing requests and thus ensuring continuity.

Forward Error Correction

Through an error-correction system, there is no need to resend data when packets are lost during communication: the lost packets can be rebuilt at any time thanks to FEC (Forward Error Correction) packets, which act as a sort of backup copy. This is essentially how RAID 5 works, but applied at the network level. This management requires a compromise: each UDP packet carries more payload than strictly necessary (an estimated 10% overhead), to account for potentially lost packets, which can thus be recreated more easily.

Congestion control

TCP sends data at an incremental rate, which is an advantage on fast links but also contributes to increasing the packet loss rate. If a packet is not delivered correctly, the operation restarts by shrinking the transmission window, causing data to be sent in bursts. QUIC manages overhead efficiently and provides more complete information than TCP. Thanks to packet pacing, the sending rate is managed automatically, so overload is avoided even on unresponsive connections.

QUIC introduced improvements over the congestion control strategy:

● it widens the SACK/NACK (Selective Ack / Not Acked) range, allowing more information about lost packets to be provided to the sender;
● it introduces monotonically increasing packet numbers to disambiguate delayed packets from retransmissions (giving more accurate RTT measurements);
● it changes the way packet loss is detected: instead of the traditional three-duplicate-ACK strategy used by TCP, it uses packet-threshold and time-threshold variables that define the conditions under which a packet is presumed lost (if a packet is more than the packet threshold behind the last acknowledged packet, or it has not been acknowledged within a time frame that depends on the time threshold, the latest RTT, a smoothed RTT measurement and a minimum time threshold value). A sketch of these two conditions follows the list.
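Here is a hedged sketch of those two loss conditions, in the spirit of RFC 9002; the constant values and the function signature are illustrative defaults, not QUIC's actual code.

```python
# Hedged sketch of the two QUIC loss conditions described above (in the spirit of
# RFC 9002); thresholds and names are illustrative, not a complete implementation.

PACKET_THRESHOLD = 3       # packets of reordering tolerated before declaring a loss
TIME_THRESHOLD = 9 / 8     # multiplier applied to the RTT estimate
MIN_LOSS_DELAY = 0.001     # seconds; lower bound on the time threshold (timer granularity)

def is_presumed_lost(pkt_number, sent_time, largest_acked, now, latest_rtt, smoothed_rtt):
    """Return True if a still-unacknowledged packet should be presumed lost."""
    # 1. Packet threshold: the packet is too far behind the largest acknowledged number.
    if largest_acked - pkt_number >= PACKET_THRESHOLD:
        return True
    # 2. Time threshold: it has not been acknowledged within the allowed time frame.
    loss_delay = max(TIME_THRESHOLD * max(latest_rtt, smoothed_rtt), MIN_LOSS_DELAY)
    return now - sent_time >= loss_delay
```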

Authentication and encryption

One of the biggest problems with the TCP protocol is that the header of each packet travels as plaintext and is not authenticated; this exposes connections more easily to possible attacks or manipulation.

QUIC packets are always authenticated and for the most part encrypted: portions of the header that are
not encrypted are protected from alteration thanks to authentication by the recipient.

Increased compatibility

Another great advantage of QUIC is that its operation is not tied to the operating system, unlike TCP, which must be supported by the various platforms and devices for communication to be possible (this involves both hardware and software support). With QUIC the management sits directly at the application level, so its availability depends on software developers, who can decide whether or not to integrate the feature, as applications such as Google's servers and the Chrome and Opera browsers have already done.

TODO: add a section on BBR
https://managedserver.it/bbr-tcp-la-formula-magica-per-le-prestazioni-della-rete/#BBR_TCP_i_concetti_importanti

TCP tuning on Linux and Windows 10, real implementations:
https://cloud.google.com/architecture/tcp-optimization-for-network-performance-in-gcp-and-hybrid
https://docs.microsoft.com/it-it/windows-server/networking/technologies/network-subsystem/net-sub-performance-tuning-nics

Linux tuning:
https://netbeez.net/blog/how-to-use-the-linux-traffic-control/

7. THE MATHEMATICS OF TCP: THROUGHPUT AND MATHIS' FORMULA
Imagine that we have a point-to-point connection of 1GBit/s, which connects us directly to a server
with an unlimited buffer. In this way, there would be no possibility of overloading it, and TCP would not
have to intervene with flow control.

If we wanted to upload files to the server, our send window would be limited only by the amount of data that our transmission medium can carry in a given period: in our example, 1 Gbit in one second. This concept is called Network Throughput, and it indicates the actual capacity of the link, that is, the amount of traffic that reaches the destination per unit of time.
However, we must also consider the time it takes for the segments to reach their destination. This propagation delay, or simply Delay, is taken in this example to be 5 ms/km. We also know that TCP uses Acks, so we must also consider the time of arrival of the Ack: the total delay is therefore twice the one-way delay (the Round Trip Time). If the distance between our PC and the server were 5 km, we would get:

Throughput = 1 Gbit/s × 5 ms/km × 5 km = 25 Mbit
However, we are aware that our server cannot have an unlimited buffer and that the flow control
mechanism of TCP is essential. When we establish the TCP connection, we are notified of the size of
RWND.
The general formula for calculating Throughput is:

Throughput = Window / RTT
But how are all these concepts related?
Let's start by saying that if we have a bandwidth B (bytes/s), we obtain the maximum throughput when the TCP receive window does not limit the data exchange rate, that is, when the receive window has the value:

RWNDmax = B × RTT = BDP


The BDP (bandwidth-delay product) indicates the value that the receive window should take in order to exploit all the available physical bandwidth, and it constitutes the physical limit. However, TCP is a protocol based on a dynamic window mechanism, which controls both the flow and the congestion of the network. To calculate the throughput, therefore, we must consider the minimum between the actual receive window and the bandwidth-delay product.

TH = min[BDP, RWND] / RTT
The above is true if you assume zero Packet Loss, which is the percentage of packet loss in a data stream.
This situation occurs especially in point-to-point networks, where the Throughput of a TCP session
depends essentially on the RTT value and the width of the receiving window, and not on the available
physical bandwidth.
For example, consider the following network:

The PC wants to download a 1 MB file from the server. The transmission medium is private and provides a bandwidth B = 1 Gbit/s. The RTT of the connection is 8 ms.
Let's express the bandwidth B in GB/s (gigabytes per second, 1 B = 8 bit):

B = 1/8 GB/s

In order to take full advantage of the bandwidth, the send window should be:

BDP = 1/8 GB/s × 8 ms = 1 MB

But the RWND receive window has a maximum size of 65535 B (the largest number representable with the 16 bits of the Window field, without the Window Scale option), which in the formula we express in bits (65535 × 8 bit).
Therefore:

TH = min[BDP, RWND] / RTT = min[(1 × 8) Mbit, (65535 × 8) bit] / 8 ms = 65.535 Mbit/s

We can verify the calculations through the Network Throughput Calculator:

(https://wintelguy.com/wanperf.pl)
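The zero-loss bound can also be reproduced with a few lines of Python; the function name is ours, and the numbers are those of the example above.

```python
# Sketch of the zero-loss throughput computation used above: the achievable rate is
# bounded by the smaller of the bandwidth-delay product and the receive window.

def throughput_bps(bandwidth_bps, rtt_s, rwnd_bytes):
    """min(BDP, RWND) / RTT, with everything expressed in bits and seconds."""
    bdp_bits = bandwidth_bps * rtt_s          # bandwidth-delay product, in bits
    rwnd_bits = rwnd_bytes * 8
    return min(bdp_bits, rwnd_bits) / rtt_s

# Values from the example: B = 1 Gbit/s, RTT = 8 ms, RWND = 65535 B (no Window Scale).
print(throughput_bps(1e9, 0.008, 65535) / 1e6)   # ~65.535 Mbit/s, limited by RWND
```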

In reality, the send window depends mainly on congestion: the congestion control mechanism relies on segment losses to probe the actual network capacity and, therefore, the buffers of the intermediate nodes. This results in a non-zero Packet Loss.

Mathis' formula gives us the link between congestion control and packet loss:

TH(p) = (MSS / RTT) × (C / √p)

where the constant C captures the main characteristics of the network, such as the TCP implementation and the Ack strategy (delayed or not), while p represents the packet loss probability (a fraction of the packets sent, not a percentage).

The result of the research of Mathis and his team is a mathematical model based on the average probability of packet loss in the data stream. Since it is not based on instantaneous values, it does not represent the actual value of the Throughput, but an upper bound.
The higher the RTT and the probability of losing packets, the lower the Throughput.
From Mathis' formula we can derive the average congestion window (expressed in segments):

CWND = C / √p, so that TH = (CWND × MSS) / RTT

Therefore:

TH = min[CWND × MSS, RWND] / RTT

Let's now analyze some examples to understand in which cases TH = (MSS/RTT) × (C/√p) holds and in which cases TH = RWND/RTT holds.

In the situation above, the client-server link is a point-to-point connection. The probability of losing a packet is very low: suppose p = 0.0001 (0.01%) and C = 1.
With MSS = 1460 B (in the formula we use the value in bits, i.e. 1460 × 8 bit) and RWND = 65535 B, we get:

Mathis' formula

TH(p) = (MSS / RTT) × (C / √p) = (1460 × 8) bit / 8 ms × 1 / √0.0001 = 146 Mbit/s

We can verify the result with the TCP Throughput Calculator, which expresses packet loss as a percentage (so our values of p must be multiplied by 10² before being entered).

Calculation with RWND

TH = RWND / RTT = (65535 × 8) bit / 8 ms = 65.535 Mbit/s

This time, the maximum throughput is limited by the recipient's receive window:

TH = min[CWND × MSS, RWND] / RTT = min[146, 65.535] Mbit/s = 65.535 Mbit/s
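The Mathis bound combined with the receive-window limit can be reproduced in a few lines; the function names are ours, and the values are those used in the examples of this chapter.

```python
# Sketch of Mathis' upper bound combined with the receive-window limit, using the
# example values of this chapter (C = 1, MSS = 1460 B, RTT = 8 ms, RWND = 65535 B).

from math import sqrt

def mathis_bps(mss_bytes, rtt_s, p, c=1.0):
    """TH(p) = (MSS / RTT) * (C / sqrt(p)), returned in bit/s."""
    return (mss_bytes * 8 / rtt_s) * (c / sqrt(p))

def throughput_bps(mss_bytes, rtt_s, p, rwnd_bytes, c=1.0):
    """Effective bound: min(Mathis limit, RWND / RTT)."""
    return min(mathis_bps(mss_bytes, rtt_s, p, c), rwnd_bytes * 8 / rtt_s)

for p in (0.0001, 0.001, 0.002):                     # packet loss values of the examples
    print(p, round(throughput_bps(1460, 0.008, p, 65535) / 1e6, 3), "Mbit/s")
# 0.0001 -> 65.535 (limited by RWND); 0.001 -> 46.169; 0.002 -> 32.647 (limited by loss)
```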

Now let's consider this topology:

In this example, the client-server link has an intermediate node. The router buffer, which may handle multiple data streams, is the bottleneck of the network: in case of congestion it starts discarding packets, and the resulting increase in Packet Loss affects the Throughput.

We use the same data as in the previous example (MSS expressed in bits, 1460 × 8 bit), assuming p = 0.001 (0.1%) and C = 1.

Mathis' formula

TH(p) = (MSS / RTT) × (C / √p) = (1460 × 8) bit / 8 ms × 1 / √0.001 ≈ 46.17 Mbit/s

We verify the calculation using the TCP Throughput Calculator:

Calculation with RWND (unchanged)

TH = RWND / RTT = (65535 × 8) bit / 8 ms = 65.535 Mbit/s

In this case, the maximum Throughput is limited by the Packet Loss:

TH = min[CWND × MSS, RWND] / RTT = min[46.17, 65.535] Mbit/s ≈ 46.17 Mbit/s
Let's now add another node to our network and suppose the probability of packet loss doubles: p = 0.002 (0.2%), with C = 1.

Mathis' formula

TH(p) = (MSS / RTT) × (C / √p) = (1460 × 8) bit / 8 ms × 1 / √0.002 ≈ 32.65 Mbit/s

We verify the result using the TCP Throughput Calculator:

Calculation with RWND (unchanged)

TH = RWND / RTT = (65535 × 8) bit / 8 ms = 65.535 Mbit/s

Again, the maximum Throughput is limited by the Packet Loss:

TH = min[CWND × MSS, RWND] / RTT = min[32.65, 65.535] Mbit/s ≈ 32.65 Mbit/s

In conclusion, as the probability of packet loss increases because of network congestion, the limit given by Mathis' formula becomes the binding one; conversely, as the network becomes more reliable, the Throughput is limited by the RWND receive window.

8. WIRESHARK
Let's analyze some examples of TCP data streams captured with the Wireshark sniffer, in order to observe the congestion control mechanism in action.

8.1 Wireshark Capture 1


The client (192.168.1.102:1161) uploads a 150 KB file to the server (128.119.245.12:80) via the HTTP protocol.

The first three packets of the data stream constitute the Three-Way Handshake. Both hosts use the
initial value of the sequence number 0. In the SYN+ACK segment, the server communicates an RWND
of 5840 B.

The client starts the data transmission with packet 4. This is acknowledged by the server with packet 6, where an RWND of 6780 B is advertised. The RTTM of the first segment can be calculated as the difference between the instant at which it was transmitted and the instant at which it was acknowledged: RTTM = 0.053937 – 0.026477 = 0.02746 s.

Up to packet 13, the Bytes in Flight (i.e. bytes waiting for acknowledgement) amount to 5527.
We calculate the RTTM of packets 4-13:
Segment Number   Transmission Instant (s)   Ack Instant (s)   RTTM (s)
4                0.026477                   0.053937          0.027460
5                0.041737                   0.077294          0.035557
7                0.054026                   0.124085          0.070059
8                0.054690                   0.169118          0.114428
10               0.077405                   0.217299          0.139894
11               0.078157                   0.267802          0.189645
13               0.124185                   0.304807          0.180622

These are the packets sent during the Slow Start phase: the sender keeps sending packets even though it has not yet received all the Acks. This phase lasts about 0.305 s.

From the screenshot we can see that the Congestion Avoidance phase begins with packet 18: the client sends 6 segments (the send window has been reached; Bytes in Flight at packet 23: 8192 = 1460 × 5 + 892) before waiting for the corresponding acknowledgements.
We calculate the RTTM of packets 18-23:

Segment Number Transmission Istant (s) Ack Istant (s) RTTM (s)
18 0.305040 0.356437 0.051397
19 0.305813 0.400164 0.094351
20 0.306692 0.448613 0.141921
21 0.307571 0.500029 0.192458
22 0.308699 0.545052 0.236353
23 0.309553 0.576417 0. 266864

To analyze some statistics we can look at the Stevens graph, which plots the sequence number on the y-axis and the time, in seconds, on the x-axis.

The Stevens graph makes it possible to see how many packets are sent back to back at any given time and whether any retransmission occurred.
Simply select the client-to-server direction of the data stream to view the progress of the transmission. The anomaly of this connection is the lack of a linear increase of the send window, probably due to a limit imposed by the HTTP protocol: the sender keeps sending six segments at a time.

Stevens graph, Capture 1

[You can download the capture from here: https://gaia.cs.umass.edu/kurose_ross/wireshark.htm]

8.2 Wireshark Capture 2


The client (192.168.1.140:51342) wants to download a 1MB file via an HTTP request to the server
(174.143.213.184:80).

The first three packets of the data stream constitute the Three-Way Handshake. Both hosts use the
initial value of the sequence number 0 and communicate an RWND of 5840 B.

The client sends the HTTP GET in packet 4, which is acknowledged by the server with packet 5.

The server starts the data transmission with packet 6, at instant 0.097271 s.

Let's analyze the Stevens graph, focusing on the segments sent by the server:

Zooming into the area of the graph around packet 6, we can see that the next packet (number 8) was sent immediately afterwards. In fact:

The two packets were sent 0.123 ms apart. This tells us that the initial value of the server's CWND is 2. The RTT calculated over the whole window, i.e. between the instant packet 6 is sent and the instant packet 8 is acknowledged, is 0.097394 – 0.097271 = 0.000123 s.

After about 50 ms we see four packets being sent:

The size of CWND has doubled: we are in the Slow Start phase. The RTT calculated over this window is 0.149879 – 0.145584 = 0.004295 s.

After about another 50 ms the size of CWND doubles again: the server sends 8 packets.

The RTT calculated over this window is 0.200246 – 0.192716 = 0.00753 s.

After about 40 ms, the size of CWND is 26. The failure to double is due to the fact that the client uses delayed acks: a single Ack acknowledges multiple segments. The use of delayed Acks slows down Slow Start, which, remember, increases CWND by one MSS for each Ack received. The first packet of this window is 34 and the last is 60.
The next transmission occurs after about 35 ms.

The size of CWND is 26; the first packet is 61 and the last is 99.

Let's analyze the subsequent transmissions through the Stevens graph:

The size of CWND reaches its maximum value after about 487 ms, with the transmission of packets 449-963. This time, the exponential growth of the send window is much more evident in the Stevens graph. Throughout the connection, TCP remains in the Slow Start phase.

[You can download the capture from here: https://packetlife.net/media/blog/attachments/612/slow_start.cap]

