
TCP/UDP Checksums

A 12-byte pseudo header is included in the checksum computation on both sides and is discarded afterwards. If the datagram length is odd, a zero pad byte is appended, again only for the checksum computation. The sender computes the one's complement of the one's complement sum of all the 16-bit words in the segment and puts the result in the checksum field of the UDP segment. When the segment arrives, the receiver sums all 16-bit words together, including the checksum. If this sum equals 1111111111111111, no errors were detected.

From RFC 768 - Checksum is the 16-bit one's complement of the one's complement sum of a pseudo header of information from the IP header, the UDP header, and the data, padded with zero octets at the end (if necessary) to make a multiple of two octets. The pseudo header conceptually prefixed to the UDP header contains the source address, the destination address, the protocol, and the UDP length (this checksum procedure is the same as used in TCP):

     0      7 8     15 16    23 24    31
    +--------+--------+--------+--------+
    |          source address           |
    +--------+--------+--------+--------+
    |        destination address        |
    +--------+--------+--------+--------+
    |    0   |protocol|   UDP length    |
    +--------+--------+--------+--------+

If the computed checksum is zero, it is transmitted as all ones. A transmitted checksum of zero means that the transmitter generated no checksum. Because link layers usually carry their own error check, the main extra protection this checksum gives is against errors introduced inside a machine, between the transport layer and the network device.
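The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names `ones_complement_sum` and `udp_checksum`, the port numbers, and the addresses are hypothetical and chosen only for the example.

```python
import socket
import struct

def ones_complement_sum(data: bytes) -> int:
    """16-bit one's complement sum of data, zero-padded to an even length."""
    if len(data) % 2:
        data += b"\x00"                      # pad byte, used only for the computation
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)   # end-around carry
    return total

def udp_checksum(src_ip: str, dst_ip: str, udp_segment: bytes) -> int:
    """Checksum over pseudo header + UDP header + data (checksum field set to 0)."""
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, 17, len(udp_segment)))   # 17 = UDP
    checksum = ~ones_complement_sum(pseudo + udp_segment) & 0xFFFF
    return checksum or 0xFFFF                # a computed zero is sent as all ones

# Hypothetical datagram with an odd-length payload.
payload = b"hello"
header = struct.pack("!HHHH", 5000, 53, 8 + len(payload), 0)
csum = udp_checksum("192.168.5.10", "192.168.5.1", header + payload)
segment = header[:6] + struct.pack("!H", csum) + payload

# Receiver side: the sum over everything, checksum included, must be all ones.
pseudo = (socket.inet_aton("192.168.5.10") + socket.inet_aton("192.168.5.1")
          + struct.pack("!BBH", 0, 17, len(segment)))
verified = ones_complement_sum(pseudo + segment) == 0xFFFF
```

The pseudo header is built only for the computation and never transmitted, which matches the description above.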

UDP
     0      7 8     15 16    23 24    31
    +--------+--------+--------+--------+
    |     Source      |   Destination   |
    | Port (optional) |      Port       |
    +--------+--------+--------+--------+
    |                 |                 |
    |     Length      |    Checksum     |
    +--------+--------+--------+--------+
    |
    |          data octets ...
    +---------------- ...

         User Datagram Header Format

Length - the length in octets of this user datagram, including header and data. (This means the minimum value of the length is eight.)
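As a small illustration of the header layout, the four 16-bit fields can be packed and unpacked with Python's `struct` module. The function names and the sample ports are hypothetical; the checksum is left at zero here.

```python
import struct

# Source port, destination port, length, checksum: four 16-bit fields in network order.
UDP_HEADER = struct.Struct("!HHHH")

def build_udp_header(src_port: int, dst_port: int, data: bytes, checksum: int = 0) -> bytes:
    length = UDP_HEADER.size + len(data)     # header (8 octets) + data
    return UDP_HEADER.pack(src_port, dst_port, length, checksum)

def parse_udp_header(segment: bytes) -> dict:
    src, dst, length, checksum = UDP_HEADER.unpack_from(segment)
    return {"src_port": src, "dst_port": dst, "length": length, "checksum": checksum}

empty = parse_udp_header(build_udp_header(0, 53, b""))          # source port is optional (0)
parsed = parse_udp_header(build_udp_header(5000, 53, b"hello") + b"hello")
```

A datagram with no data has length 8, the minimum mentioned above.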

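NOTE (unrelated):
Big/little endian - TCP/IP uses big-endian byte order (multi-byte values are sent from the most significant byte to the least significant byte). x86 machines, which typically run Windows, use little-endian.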
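A quick way to see the difference is to pack the same 16-bit value in both byte orders. This is only an illustration; the value 0x1234 is arbitrary.

```python
import struct
import sys

value = 0x1234
network_order = struct.pack("!H", value)   # big-endian (network order): 12 34
little_endian = struct.pack("<H", value)   # little-endian: 34 12
host_order = sys.byteorder                 # "little" or "big", depending on this machine
```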

Subnets
A subnetwork, or subnet, describes networked computers and devices that have a common, designated IP address routing prefix. All hosts within a subnet can be reached in one "hop" (time to live = 1), implying that all hosts in a subnet are connected to the same link. An IPv4 address consists of a 32-bit address written, for human readability, as 4 octets, and a subnet mask of the same size and notation. In order to facilitate the routing process the address is divided into two pieces:
- The network prefix (network ID), which is significant for routing decisions at that particular topological point.
- The host identifier, which specifies a particular device in the network.

To determine which part of the address is the network address (netID) and which part is the host address, a device performs a bitwise "AND" operation between the IP address and the subnet mask.

Example:

                     Dot-decimal address   Binary
    IP address       192.168.5.10          11000000.10101000.00000101.00001010
    Subnet mask      255.255.255.0         11111111.11111111.11111111.00000000
    Network portion  192.168.5.0           11000000.10101000.00000101.00000000
    Host portion     0.0.0.10              00000000.00000000.00000000.00001010
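The AND operation from the example can be reproduced with the standard `ipaddress` module. The helper name `split_address` is hypothetical; the host portion is obtained here by ANDing with the inverted mask, which is one common way to do it.

```python
import ipaddress

def split_address(ip: str, mask: str) -> tuple:
    """Return (network portion, host portion) in dot-decimal notation."""
    ip_int = int(ipaddress.IPv4Address(ip))
    mask_int = int(ipaddress.IPv4Address(mask))
    network = ip_int & mask_int                   # bitwise AND with the mask
    host = ip_int & ~mask_int & 0xFFFFFFFF        # AND with the inverted mask
    return str(ipaddress.IPv4Address(network)), str(ipaddress.IPv4Address(host))

network_part, host_part = split_address("192.168.5.10", "255.255.255.0")
```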

IPv4 classes

    Class          Leading bits  Start      End              Default subnet mask
    A (CIDR /8)    0             0.0.0.0    127.255.255.255  255.0.0.0
    B (CIDR /16)   10            128.0.0.0  191.255.255.255  255.255.0.0
    C (CIDR /24)   110           192.0.0.0  223.255.255.255  255.255.255.0
    D              1110          224.0.0.0  239.255.255.255
    E              1111          240.0.0.0  255.255.255.254

While the 127.0.0.0/8 network is in the Class A area, it is designated for loopback and can't be assigned to a network. 2^7 class A netIDs are available, and 2^21 class C netIDs are available. Routers decide whether a packet belongs to their network by the netID.

Subnetting is the process of allocating bits from the host portion as a network portion.

Example:

                     Dot-decimal address   Binary
    IP address       192.168.5.130         11000000.10101000.00000101.10000010
    Subnet mask      255.255.255.192       11111111.11111111.11111111.11000000
    Network portion  192.168.5.128         11000000.10101000.00000101.10000000

In this example two bits were borrowed from the original host portion. This is beneficial because it allows this network to be split into four smaller networks: each borrowed bit can take the value 1 or 0, giving 4 possible subnets. A /24 prefix (Class C block) allows 254 hosts; split into four parts, the prefix is /26 and each part has 62 hosts.

The remaining bits after the subnet are used for addressing hosts within the subnet. In the above example the subnet mask consists of 26 bits, leaving 6 bits for the host address (32 - 26). This allows for 64 possible combinations (2^6); however, the all-zeros value and the all-ones value are reserved for the network ID and broadcast address respectively, leaving 62 addresses. In general the number of available hosts on a subnet can be calculated using the formula 2^n - 2, where n is the number of bits used for the host portion of the address.
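The subnetting example can be checked with the standard `ipaddress` module. This is a sketch; the helper `usable_hosts` is a hypothetical name that simply encodes the 2^n - 2 formula.

```python
import ipaddress

def usable_hosts(prefix_len: int) -> int:
    """2^n - 2, where n is the number of host bits (all-zeros and all-ones are reserved)."""
    host_bits = 32 - prefix_len
    return 2 ** host_bits - 2

# Borrowing two host bits splits the /24 block into four /26 subnets.
block = ipaddress.ip_network("192.168.5.0/24")
subnets = [str(net) for net in block.subnets(prefixlen_diff=2)]

# The subnet that 192.168.5.130 with mask 255.255.255.192 falls into.
iface = ipaddress.ip_interface("192.168.5.130/255.255.255.192")
network_of_host = str(iface.network)
```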

DNS
DNS resolving process - To map a name to an IP address, an application calls the resolver function (an OS service), which:
1. Looks for the name in the hosts file (hosts.txt).
2. Checks the local cache (entries are kept until their TTL expires).
3. Sends a UDP packet to the local name server (NS), which is configured automatically by DHCP when we connect to the internet and usually belongs to our ISP.

The local NS then handles the request:
1. If the domain being sought falls under the jurisdiction of the local NS, it returns the authoritative record.
2. If the local NS has no information about the domain (it is not authoritative and has nothing cached for it), it can use:
   - Iterative query: refer the resolver to a better-placed server (for example, a TLD server).
   - Recursive query: resolve the name on the resolver's behalf (not always supported).

DNS message format:

Identification - set by the client and returned by the server (a resolver can send several queries at a time and should be able to match an answer to the query).

Flags:
1. QR: 0 means the message is a query, 1 means it's a response.
2. Opcode: the normal value is 0 (a standard query); another value is 2 (server status request).
3. AA: means "authoritative answer".
4. TC: means "truncated". With UDP this means the total size of the reply exceeded 512 bytes, and only the first 512 bytes of the reply were returned.
5. RD: means "recursion desired".
6. RA: means "recursion available". Set to 1 in the response if the server supports recursion.
7. rcode: return code. The common values are 0 (no error) and 3 (name error). A name error is returned only from an authoritative name server and means the domain name does not exist.

The Question portion of a DNS request message contains:
1. Query name - the domain, with each label preceded by its length and a zero byte at the end [in the format: 3mta2ac2il0 for mta.ac.il].
2. Query type - A (1), NS (2), CNAME (5), PTR (12), MX (15).
3. Query class - IN (1) (for Internet address).
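To make the query format concrete, here is a minimal sketch that encodes a query name and assembles a request. The function names are hypothetical, and two details are assumptions taken from the standard DNS message layout rather than from the notes above: the bit positions of the flags inside the 16-bit flags word, and the four 16-bit counters (questions, answers, authority, additional) that follow the flags in the header.

```python
import struct

def encode_qname(domain: str) -> bytes:
    """Encode mta.ac.il as 3mta2ac2il0: length-prefixed labels plus a zero byte."""
    out = b""
    for label in domain.rstrip(".").split("."):
        out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00"

def build_query(domain: str, query_id: int, qtype: int = 1, recursion_desired: bool = True) -> bytes:
    flags = 0x0100 if recursion_desired else 0      # QR=0 (query), opcode=0, RD bit
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)   # one question
    question = encode_qname(domain) + struct.pack("!HH", qtype, 1)  # class IN (1)
    return header + question

def parse_flags(flags: int) -> dict:
    """Split the 16-bit flags word into the fields listed above."""
    return {
        "qr": (flags >> 15) & 1,
        "opcode": (flags >> 11) & 0xF,
        "aa": (flags >> 10) & 1,
        "tc": (flags >> 9) & 1,
        "rd": (flags >> 8) & 1,
        "ra": (flags >> 7) & 1,
        "rcode": flags & 0xF,
    }

query = build_query("mta.ac.il", query_id=0x1234)
name_error = parse_flags(0x8183)   # a response with RD, RA and rcode 3 (name error)
```

The identification value chosen by the client (0x1234 here) is what lets the resolver match a reply to this query.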

The Resource Record portion of a DNS response message contains:
1. Domain name - in the same format as described earlier for the query name field.
2. Type.
3. Class.
4. TTL - the number of seconds that the RR can be cached by the client. It is often 2 days.
5. Resource data length - specifies the amount of resource data.
6. Resource data.

1. DNS: [most of this item did not survive extraction; surviving terms: single point of failure, IP].
2. [Most of this item did not survive extraction; surviving terms: 13 root name servers, DNS root, TLD, local name server (the ISP's resolver), local cache, host cache, root server, authoritative name server, DNS cache, primary and secondary servers, TCP.]
3. nslookup:

   nslookup mta.ac.il returns the IP address as a non-authoritative answer:

       Server:  dns1.bezeqint.net
       Address: 192.115.106.31

       Non-authoritative answer:
       Name:    mta.ac.il
       Address: 192.116.64.2

   nslookup -type=NS mta.ac.il lists the authoritative name servers:

       Server:  dns1.bezeqint.net
       Address: 192.115.106.31

       Non-authoritative answer:
       mta.ac.il nameserver = aleph.mta.ac.il
       mta.ac.il nameserver = sdns.goldenlines.net.il
       mta.ac.il nameserver = pdns.goldenlines.net.il

   nslookup mta.ac.il aleph.mta.ac.il asks aleph.mta.ac.il, a DNS server that is authoritative for mta.ac.il, for the IP address:

       Server:  aleph.mta.ac.il
       Address: 192.116.64.15

       Name:    mta.ac.il
       Address: 192.116.64.2

TCP
TCP - Connection-Oriented Transport
TCP is full duplex, so host A may be receiving data from host B while it sends data to host B (as part of the same TCP connection). Each of the segments that arrive from host B has a sequence number for the data flowing from B to A.
- The header is 20 bytes (plus possible options), followed by an optional data part.
- Segment size limitation: the segment must fit in the IP payload (at most 65,515 bytes) and, in practice, within the MSS (derived from the MTU, typically about 1500 bytes).

The maximum amount of data that can be grabbed and placed in a segment is limited by the Maximum Segment Size (MSS). The MSS is typically set by first determining the length of the largest link-layer frame that can be sent by the local sending host (the maximum transmission unit, MTU), and then setting the MSS to ensure that a TCP segment (when encapsulated in an IP datagram) will fit into a single link-layer frame. When TCP sends a large file it typically breaks the file into chunks of size MSS (except for the last chunk, which will often be smaller than the MSS).

Segment structure
- Source port and destination port.
- The 32-bit sequence number field and the 32-bit acknowledgment number field are used by the TCP sender and receiver in implementing a reliable data transfer service.
  o The sequence number for a segment is the byte-stream number of the first byte in the segment.
  o The acknowledgment number that host A puts in its segment is the sequence number of the next byte host A is expecting from host B.
- The 16-bit window size field is used for flow control. It indicates the number of bytes that a receiver is willing to accept.
- The 4-bit length field specifies the length of the TCP header in 32-bit words (usually the value is 5 = 20 bytes). The TCP header can be of variable length due to the TCP options field.
- The optional and variable-length options field is used when a sender and receiver negotiate the maximum segment size (MSS) or as a window scaling factor for use in high-speed networks. A time stamping option is also defined.
- The flag field contains 6 bits: ACK, RST, SYN, FIN, PSH, and URG. When the PSH bit is set, this is an indication that the receiver should pass the data to the upper layer immediately. The URG bit is used to indicate there is data in this segment that the sending-side upper layer entity has marked as "urgent". The location of the last byte of this urgent data is indicated by the 16-bit urgent data pointer. TCP must inform the receiving-side upper layer entity when urgent data exists and pass it a pointer to the end of the urgent data. (In practice, the PSH, URG and pointer to urgent data are not used. However, we mention these fields for completeness.)
- Checksum.
- Urgent pointer.

TCP Connection Establishment and Termination


3-way handshake
1. The requesting end sends a SYN segment specifying the port number of the server that the client wants to connect to, and the client's initial sequence number (ISN, 1415531521 in this example).
2. The server responds with its own SYN segment containing the server's ISN. The server also acknowledges the client's SYN by ACKing the client's ISN plus one. A SYN consumes one sequence number.
3. The client must acknowledge this SYN from the server by ACKing the server's ISN plus one.

Connection Termination
Since a TCP connection is full-duplex, each direction must be shut down independently.

The receipt of a FIN only means there will be no more data flowing in that direction. A TCP can still send data after receiving a FIN. While it's possible for an application to take advantage of this half-close, in practice few TCP applications use it.

Normally:
1. A sends a FIN to B (meaning A has no more data to transfer).
2. B ACKs A's FIN; as a result this direction is closed for new data.
3-4. The same applies for the other direction: B sends a FIN and A ACKs it.

When both directions have been shut down the connection is released.

MSL - the maximum segment lifetime. When a connection is closed it must stay in the TIME_WAIT state for 2*MSL. While in TIME_WAIT, the socket pair associated with the connection cannot be reused: no other connection may use the same combination of protocol, source IP address, destination IP address, source port, and destination port for that period. TIME_WAIT ensures that enough time has passed so that any TCP segments that might have been misrouted or delayed are not delivered unexpectedly to a new, unrelated application with the same connection settings.

TCP TIMERS
TCP manages four different timers for each connection:
1. A retransmission timer is used when expecting an acknowledgment from the other end (500 ms).
2. A persist timer keeps window size information flowing even if the other end closes its window (200 ms).
3. A keepalive timer detects when the other end of an otherwise idle connection crashes or reboots.
4. A 2MSL timer measures the time a connection has been in the TIME_WAIT state.

Interactive Input (interactive data flow)


Each interactive keystroke normally generates a data packet. That is, the keystrokes are sent from the client to the server 1 byte at a time (not one line at a time). Furthermore, Rlogin has the remote system (the server) echo the characters that we (the client) type. This could generate four segments: (1) the interactive keystroke from the client, (2) an acknowledgment of the keystroke from the server, (3) the echo of the keystroke from the server, and (4) an acknowledgment of the echo from the client. Normally, however, segments 2 and 3 are combined: the acknowledgment of the keystroke is sent along with the echo. This technique is called delayed acknowledgments.

Delayed Acknowledgments
Normally TCP does not send an ACK the instant it receives data. Instead, it delays the ACK, hoping to have data going in the same direction as the ACK, so the ACK can be sent along with the data. (This is sometimes called having the ACK piggyback with the data.) Most implementations use a 200-ms delay; that is, TCP will delay an ACK up to 200 ms to see if there is data to send with the ACK.

Nagle Algorithm
In an Rlogin connection, 1 byte at a time normally flows from the client to the server. This generates 41-byte packets: 20 bytes for the IP header, 20 bytes for the TCP header, and 1 byte of data. These small packets (called tinygrams) can add to congestion on wide area networks. A simple and elegant solution is called the Nagle algorithm.

This algorithm says that a TCP connection can have only one outstanding small segment that has not yet been acknowledged. No additional small segments can be sent until the acknowledgment is received. Instead, small amounts of data are collected by TCP and sent in a single segment when the acknowledgment arrives. The beauty of this algorithm is that it is self-clocking: the faster the ACKs come back, the faster the data is sent. But on a slow WAN, where it is desired to reduce the number of tinygrams, fewer segments are sent ("small" here means less than the segment size).

The round-trip time on an Ethernet for a single byte to be sent, acknowledged, and echoed averaged around 16 ms. To generate data faster than this we would have to be typing more than 60 characters per second. This means we rarely encounter this algorithm when sending data between two hosts on a LAN. Things change, however, when the round-trip time (RTT) increases, typically across a WAN.

There are times when the Nagle algorithm needs to be turned off, for example when small messages (like mouse movements) must be delivered without delay to provide real-time feedback for interactive users. On LANs we mostly see delayed acknowledgments; on WANs the Nagle algorithm kicks in.
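On most socket APIs the Nagle algorithm is turned off per connection with the TCP_NODELAY option. The following is a minimal sketch using Python's standard `socket` module; the socket is not connected to anything, it only shows the option being set and read back.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# A non-zero value disables the Nagle algorithm for this socket,
# so small writes are sent without waiting for outstanding ACKs.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

nagle_disabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
```

Disabling Nagle trades more tinygrams on the network for lower latency, so it is usually reserved for the interactive cases described above.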

Summary Interactive data is normally transmitted in segments smaller than the MSS. With Rlogin a single byte of data is normally sent from the client to the server. Telnet allows for the input to be sent one line at a time, but most implementations today still send single characters of input. Delayed acknowledgments are used by the receiver of these small segments to see if the acknowledgment can be piggybacked along with data going back to the sender. This often reduces the number of segments, especially for an Rlogin session, where the server is echoing the characters typed at the client. On slower WANs the Nagle algorithm is often used to reduce the number of these small segments. This algorithm limits the sender to a single small packet of unacknowledged data at any time. But there are times when the Nagle algorithm needs to be disabled.

Bulk Data Flow


TCP uses a form of flow control called a sliding window protocol. It allows the sender to transmit multiple packets before it stops and waits for an acknowledgment. With TCP's sliding-window protocol the receiver does not have to acknowledge every received packet. The ACKs are cumulative: they acknowledge that the receiver has correctly received all bytes up through the acknowledged sequence number minus one.

Sliding Window
Window size is the number of bytes, starting with the one specified by the acknowledgment number field, that the receiver is willing to accept. (The window size is relative to the acknowledged sequence number.) Over time this sliding window moves to the right, as the receiver acknowledges data. The relative motion of the two ends of the window increases or decreases the size of the window. Three terms are used to describe the movement of the right and left edges of the window:
1. The window closes as the left edge advances to the right. This happens when data is sent and acknowledged.
2. The window opens when the right edge moves to the right, allowing more data to be sent. This happens when the receiving process on the other end reads acknowledged data, freeing up space in its TCP receive buffer.
3. The window shrinks when the right edge moves to the left (not recommended).

To sum up:
1. The sender does not have to transmit a full window's worth of data.
2. One segment from the receiver acknowledges data and slides the window to the right. This is because the window size is relative to the acknowledged sequence number.
3. The size of the window can decrease, but the right edge of the window must not move leftward.
4. The receiver does not have to wait for the window to fill before sending an ACK. We saw earlier that many implementations send an ACK for every two segments that are received.
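The window bookkeeping on the sender side can be sketched as follows. The class `SendWindow` and its field names are hypothetical, and the sequence numbers are made up for the example; the point is only to show how the left and right edges move.

```python
from dataclasses import dataclass

@dataclass
class SendWindow:
    left_edge: int      # first byte not yet acknowledged (latest ACK number)
    next_to_send: int   # sequence number of the next new byte to send
    advertised: int     # window size from the receiver's latest ACK

    @property
    def right_edge(self) -> int:
        return self.left_edge + self.advertised

    @property
    def usable(self) -> int:
        """Bytes that may still be sent before the window is full."""
        return max(0, self.right_edge - self.next_to_send)

    def send(self, nbytes: int) -> int:
        sent = min(nbytes, self.usable)
        self.next_to_send += sent
        return sent

    def on_ack(self, ack: int, window: int) -> str:
        """Apply an ACK and report how the right edge moved."""
        old_right = self.right_edge
        self.left_edge = max(self.left_edge, ack)   # the window closes from the left
        self.advertised = window
        if self.right_edge > old_right:
            return "opens"
        if self.right_edge < old_right:
            return "shrinks"
        return "unchanged"

w = SendWindow(left_edge=1, next_to_send=1, advertised=4096)
sent = w.send(2048)
first = w.on_ack(2049, 2048)    # data acknowledged but not yet read by the application
second = w.on_ack(2049, 4096)   # the application read the data, freeing buffer space
```

In the first ACK the advertised size drops while the right edge stays put, which is the allowed case from point 3 above.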

Round-Trip Time (RTT) and Retransmission Timeout (RTO) Measurements


Fundamental to TCP's timeout and retransmission is the measurement of the RTT experienced on a given connection. The RTT can change over time, and TCP should track these changes and modify its timeout accordingly.

RTT measurement (M) - the time between sending a byte with a particular sequence number and receiving an acknowledgment that covers that sequence number. (Because ACKs are cumulative, it is possible that M is the time between the transmission of segment X and the reception of an acknowledgment covering segment X+3.) Only one measurement is in progress at a time, so a packet sent while another measurement is in process won't be measured.

The smoothed RTT estimator (called R) is an integration of the Ms over time:

    $R \leftarrow aR + (1 - a)M$, with $a = 0.9$

Where a is a smoothing factor with a recommended value of 0.9. This smoothed RTT is updated every time a new measurement is made. Ninety percent of each new estimate is from the previous estimates and 10% is from the new measurement.

RTO (retransmission timeout)
The RTO value is rounded to whole seconds. If over time we see a large gap between RTT measurements, it is probably because of retransmissions. The first timeout is set for X seconds after the first transmission. After this, the timeout value is doubled for each retransmission, with an upper limit of 64 seconds. This doubling is called an exponential backoff. The time between the first transmission of the packet and the connection reset is about 9 minutes.

Given this smoothed estimator, which changes as the RTT changes, the recommended RTO is set to:

    $RTO = R\beta$, where $\beta$ is a delay variance factor (recommended $\beta = 2$)

Better RTO calculation: keep track of the deviation of the RTT measurements as well as the smoothed RTT. The mean deviation is a good approximation to the standard deviation, but easier to compute. (Calculating the standard deviation requires a square root.) This leads to the following equations that are applied to each RTT measurement M:

    $Err = M - RTT$
    $RTT \leftarrow RTT + g \cdot Err$, with $g = 1/8$
    $D \leftarrow D + h(|Err| - D)$, with $h = 1/4$
    $RTO = RTT + 4D$

The initial values are: RTT = 0, D = 3, RTO = 6 (seconds).

D is the smoothed mean deviation. Err is the difference between the measured value just obtained and the current RTT estimator. Both RTT and D are used to calculate the next retransmission timeout (RTO). The larger gain for the deviation makes the RTO go up faster when the RTT changes. Karn's Algorithm - when a timeout and retransmission occur, we cannot update the RTT estimators when the acknowledgment for the retransmitted data finally arrives. This is because we don't know to which transmission the ACK corresponds.
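A minimal sketch of this estimator, including Karn's rule, is shown below. The class name `RttEstimator` is hypothetical, and the gains g = 1/8 and h = 1/4 and the initial values (RTT = 0, D = 3, RTO = 6 seconds) are assumptions following the usual formulation of these equations.

```python
class RttEstimator:
    """Smoothed RTT and mean deviation, updated once per measurement M."""

    def __init__(self, g: float = 1 / 8, h: float = 1 / 4):
        self.g = g          # gain for the RTT average
        self.h = h          # (larger) gain for the deviation
        self.rtt = 0.0      # smoothed RTT, seconds
        self.d = 3.0        # smoothed mean deviation, seconds
        self.rto = 6.0      # initial retransmission timeout, seconds

    def update(self, m: float, retransmitted: bool = False) -> float:
        # Karn's algorithm: ignore measurements for retransmitted segments,
        # because we cannot tell which transmission the ACK belongs to.
        if retransmitted:
            return self.rto
        err = m - self.rtt
        self.rtt += self.g * err
        self.d += self.h * (abs(err) - self.d)
        self.rto = self.rtt + 4 * self.d
        return self.rto

est = RttEstimator()
rto_after_first = est.update(1.0)                       # first measurement: M = 1 s
rto_after_retx = est.update(5.0, retransmitted=True)    # ambiguous sample, ignored
```

Because h is larger than g, a sudden change in M moves D (and therefore the RTO) faster than it moves the smoothed RTT.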

Congestion Control
Network congestion occurs when a link or node is carrying so much data that its quality of service deteriorates.

Slow start
When both sides of the connection are on the same LAN, it is OK for the sender to start off by injecting multiple segments into the network, up to the window size advertised by the receiver. However, on slower links problems can arise. TCP is required to support slow start. It operates by observing that the rate at which new packets should be injected into the network is the rate at which the acknowledgments are returned by the other end.

Slow start adds another window to the sender's TCP: the congestion window, cwnd. When a new connection is established with a host on another network, the congestion window is initialized to one segment (= 1 MSS). Each time an ACK is received, cwnd is increased by one segment. The sender can transmit up to min(cwnd*MSS, advertised window size) bytes. The cwnd is flow control imposed by the sender, while the advertised window size is flow control imposed by the receiver.

Slow start and Congestion avoidance
There are two indications of packet loss: a timeout occurring and the receipt of duplicate ACKs. Congestion avoidance and slow start are independent algorithms with different objectives. But when congestion occurs we want to slow down the transmission rate of packets into the network, and then invoke slow start to get things going again. In practice they are implemented together. Congestion avoidance and slow start require that two variables be maintained for each connection: a congestion window, cwnd, and a slow start threshold size, ssthresh. The combined algorithm operates as follows:
1. Initialization for a given connection sets cwnd = 1 segment and ssthresh = 65535 bytes (64 KB).
2. The TCP can send up to min(cwnd, receiver's window).
3. On congestion (timeout or duplicate ACKs): ssthresh = 1/2 * min(cwnd, receiver's window). Additionally, if the congestion is indicated by a timeout, cwnd = 1 segment.
4. When an ACK is received we increase cwnd as follows:
   1. cwnd <= ssthresh: slow start is invoked (until at some point cwnd > ssthresh).
   2. cwnd > ssthresh: congestion avoidance dictates that cwnd (counted in segments) be incremented by 1/cwnd each time an ACK is received. This is an additive increase, compared to slow start's exponential increase.

Fast Retransmit and Fast Recovery Algorithms
TCP may generate an immediate acknowledgment (a duplicate ACK) when an out-of-order segment is received. This duplicate ACK should not be delayed. The purpose of this duplicate ACK is to let the other end know that a segment was received out of order, and to tell it what sequence number is expected.

Since TCP does not know whether a duplicate ACK is caused by a lost segment or just a reordering of segments, it waits for a small number of duplicate ACKs to be received. It is assumed that if there is just a reordering of the segments, there will be only one or two duplicate ACKs before the reordered segment is processed, which will then generate a new ACK. If three or more duplicate ACKs are received in a row, it is a strong indication that a segment has been lost. TCP then performs a retransmission of what appears to be the missing segment, without waiting for a retransmission timer to expire. This is the fast retransmit algorithm. Next, congestion avoidance, but not slow start, is performed. This is the fast recovery algorithm.
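The combined slow start and congestion avoidance rules can be traced with a small simulation. This is a simplified sketch that counts cwnd in segments. The function names are hypothetical, the starting ssthresh of 8 segments is chosen only to keep the trace short, the floor of two segments on ssthresh is an assumption, and the temporary window inflation used by real fast recovery is left out.

```python
def on_event(cwnd: float, ssthresh: float, event: str, rwnd: float = 64.0):
    """Return the new (cwnd, ssthresh), both in segments, after one event."""
    if event == "ack":
        if cwnd <= ssthresh:
            cwnd += 1.0            # slow start: one segment per ACK (doubles per RTT)
        else:
            cwnd += 1.0 / cwnd     # congestion avoidance: about one segment per RTT
    elif event in ("timeout", "dupacks"):
        ssthresh = max(2.0, min(cwnd, rwnd) / 2)   # half the current window
        if event == "timeout":
            cwnd = 1.0             # a timeout restarts slow start
    else:
        raise ValueError(f"unknown event: {event}")
    return cwnd, ssthresh

def run(events, cwnd: float = 1.0, ssthresh: float = 8.0):
    history = []
    for event in events:
        cwnd, ssthresh = on_event(cwnd, ssthresh, event)
        history.append((event, cwnd, ssthresh))
    return history

trace = run(["ack"] * 9 + ["timeout", "ack", "ack"])
```

In this trace the first eight ACKs grow cwnd by one segment each, the ninth adds only 1/cwnd, and the timeout halves ssthresh and drops cwnd back to one segment.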

TCP Persist Timer
If an acknowledgment that reopens a previously closed window is lost, we could end up with both sides waiting for the other (the receiver for data, the sender for the window update allowing it to send data). TCP uses a persist timer that causes the sender to query the receiver periodically, to find out if the window has been increased. These segments are called window probes.

The normal TCP exponential backoff is used when calculating the persist timer, but the result is always bounded between 5 and 60 seconds. The first timeout is calculated as 1.5 seconds for a typical LAN connection (raised to 5). This is multiplied by 2 for a second timeout value of 3 seconds (also raised to 5). A multiplier of 4 gives the next value of 6, a multiplier of 8 gives a value of 12, and so on up to the maximum of 60. The resulting intervals are 5, 5, 6, 12, 24, 48, 60, 60, ... seconds.

TCP never gives up sending window probes. These window probes continue to be sent at 60-second intervals until the window opens up or either of the applications using the connection is terminated. The window probes contain 1 byte of data. TCP is always allowed to send 1 byte of data beyond the end of a closed window. Notice, however, that the acknowledgments returned with the window size of 0 do not ACK this byte, but all the data sent before.
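The backoff calculation described above (a 1.5-second base, doubled each time, bounded between 5 and 60 seconds) can be written in one line. The function name and parameter names are hypothetical.

```python
def persist_intervals(count: int, base: float = 1.5, low: float = 5.0, high: float = 60.0) -> list:
    """Exponential backoff from `base`, with each value bounded to [low, high] seconds."""
    return [min(max(base * 2 ** k, low), high) for k in range(count)]

intervals = persist_intervals(8)
# -> [5.0, 5.0, 6.0, 12.0, 24.0, 48.0, 60.0, 60.0]
```

The first two raw values (1.5 and 3 seconds) fall below the lower bound, which is why the sequence starts with two 5-second intervals.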

Silly Window Syndrome (SWS)
When silly window syndrome occurs, small amounts of data are exchanged across the connection, instead of full-sized segments. It can be caused by either end:
- The receiver can advertise small windows (instead of waiting until a larger window could be advertised).
- The sender can transmit small amounts of data (instead of waiting for additional data, to send a larger segment).

Avoidance:
- The receiver must not advertise small windows. It advertises a larger window only when the window can be increased by min(MSS, 1/2 of the receiver's buffer space).
- The sender does not transmit unless one of the following is true:
  o a full-sized segment can be sent;
  o at least 1/2 of the largest window the receiver has ever advertised can be sent;
  o we can send everything we have, and either we are not expecting an ACK or the Nagle algorithm is disabled for this connection.

An example (figure 22.3 in our book): segment 11 advertised a window of 1533 bytes but the sender only transmitted 1024 bytes. If the acknowledgment in segment 13 advertised a window of 0, it would violate the TCP principle that a window cannot shrink by moving the right edge of the window to the left. That's why the small window of 509 bytes must be advertised. Next we see that the sender does not immediately transmit into this small window. This is silly window avoidance by the sender. Instead it waits for another persist timer to expire at time 20.151, when it sends 509 bytes. Even though it ends up sending this small segment with 509 bytes of data, it waits 5 seconds before doing so, to see if an ACK arrives that opens up the window more. These 509 bytes of data leave only 768 bytes of available space in the receive buffer, so the acknowledgment (segment 15) advertises a window of 0.

TCP Keepalive Timer
Normally this option is set by servers. If there is no activity on a given connection for 2 hours, the server sends a probe segment to the client. The client host must be in one of four states:
1. The client host is still up and running. The client's TCP responds and the server's TCP resets the keepalive timer.
2. The client's host is down. After 75 seconds another probe is sent, up to a total of 10 probes. After 10 failures the server terminates the connection.
3. The client's host has crashed and rebooted. The server will receive a response with the RST flag, causing the server to terminate the connection.
4. The client's host is up and running, but unreachable. Same as scenario 2.

Network layer
IPv4 packet structure
- Version - 4 bits. For IPv4, this has a value of 4 (hence the name IPv4).
- Header Length - 4 bits. The number of 32-bit words in the header. The minimum value for this field is 5 (= 20 bytes) and the maximum is 15 (= 60 bytes).
- TOS - not used.
- Total Length - 16 bits. Defines the entire datagram size, including header and data, in bytes. The minimum-length datagram is 20 bytes and the maximum is 65,535 bytes.
- Identification - used for uniquely identifying fragments of an original IP datagram.
- Flags - 3 bits. They are (in order, from high order to low order): Reserved (must be zero), Don't Fragment (DF), More Fragments (MF).
- Fragment Offset - 13 bits. Specifies the offset of a particular fragment relative to the beginning of the original unfragmented IP datagram, measured in units of 8-byte blocks.
- Time To Live (TTL) - 8 bits. This field limits a datagram's lifetime. It is specified in seconds, but in practice it has come to be a hop count field. Each packet switch (or router) that a datagram crosses decrements the TTL field by one. When the TTL field hits zero, the packet is discarded. Typically, an ICMP message is sent back to the sender to report that it has been discarded.
- Protocol - identifies the protocol the IP datagram is encapsulating (UDP = 17, TCP = 6, etc.).
- Header Checksum - since the TTL field is decremented at each hop and fragmentation is possible at each hop, the checksum has to be recomputed at each hop.
- Source IP address - 32 bits.
- Destination IP address - 32 bits.
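The fixed 20-byte part of the header can be decoded with `struct`, which makes the field sizes above easy to check. This is a sketch: the function name and the sample packet values are hypothetical, and options (header length greater than 5) are not parsed.

```python
import socket
import struct

IPV4_FIXED = "!BBHHHBBH4s4s"    # 20 bytes: the header without options

def parse_ipv4_header(packet: bytes) -> dict:
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack_from(IPV4_FIXED, packet)
    return {
        "version": ver_ihl >> 4,
        "header_bytes": (ver_ihl & 0xF) * 4,           # length field counts 32-bit words
        "total_length": total_len,
        "identification": ident,
        "dont_fragment": bool(flags_frag & 0x4000),
        "more_fragments": bool(flags_frag & 0x2000),
        "fragment_offset": (flags_frag & 0x1FFF) * 8,  # stored in 8-byte units
        "ttl": ttl,
        "protocol": proto,                              # 6 = TCP, 17 = UDP
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }

# Hypothetical header: DF set, TTL 64, carrying UDP, checksum left at 0.
sample = struct.pack(IPV4_FIXED, 0x45, 0, 28, 1, 0x4000, 64, 17, 0,
                     socket.inet_aton("192.168.5.10"), socket.inet_aton("192.168.5.1"))
fields = parse_ipv4_header(sample)
```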

IP routing table
Each host keeps a set of mappings between:
- A destination IP address.
- The first router needed in order to reach that destination.

There are two kinds of routing algorithms:
- Static routing algorithm (often called a global algorithm) - computes the least cost path between a source and destination using complete, global knowledge about the network. That is, the algorithm takes the connectivity between all nodes and all link costs as inputs. This requires that the algorithm somehow obtain this information before actually performing the calculation.
- Dynamic routing algorithm (often called a decentralized algorithm) - the calculation of the least cost path is carried out in an iterative, distributed manner. Each node begins with only the knowledge of the costs of its own directly attached links and then, through an iterative process of calculation and exchange of information with its neighboring nodes, gradually calculates the least cost path to a destination or set of destinations. A node never actually knows a complete path from source to destination, only the direction (which neighbor) in which it should forward a packet. Each node maintains a vector of estimates of costs (distances) to all other nodes in the network.

Distance-Vector (dynamic routing)


Each router maintains a table giving the best known distance to each destination and which line to use to get there. Every T msec, each router sends its estimates to its neighbors. When the estimates are received, the distance vector is updated. A distance-vector routing protocol uses the Bellman-Ford algorithm to calculate paths. The internet uses it under the name RIP.

A DV protocol is based on calculating the direction and distance to any destination in a network. The cost of reaching a destination is calculated using various route metrics (RIP uses the hop count). Updates are performed periodically: all or part of a router's routing table is sent to all its neighbors that are configured to use the same distance-vector routing protocol. Once a router has this information it is able to amend its own routing table to reflect the changes and then inform its neighbors of the changes. This process has been described as routing by rumor, because routers are relying on the information they receive from other routers and cannot determine whether the information is actually valid and true.

Count-to-infinity - The Bellman-Ford algorithm does not prevent routing loops from happening and suffers from the count-to-infinity problem. The core of the problem is that if A tells B that it has a path somewhere, there is no way for B to know whether the path has B as a part of it. To see the problem clearly, imagine a subnet connected like A-B-C-D-E-F, and let the metric between the routers be "number of jumps". Now suppose that A goes down (out of order). In the vector-update process B notices that its once very short route of 1 to A is down, because B does not receive the vector update from A. The problem is that B also gets an update from C, and C is still not aware of the fact that A is down, so it tells B that A is only two jumps from it, which is false. This slowly propagates through the network until the distance reaches infinity (at which point the algorithm corrects itself, due to the "relax property" of Bellman-Ford).

Partial solutions - RIP uses a maximum number of hops to counter the count-to-infinity problem. These measures avoid the formation of routing loops in some, but not all, cases. The addition of a hold time (refusing route updates for a few minutes after a route retraction) avoids loop formation in virtually all cases, but causes a significant increase in convergence times.
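The count-to-infinity behaviour on the line A-B-C-D-E-F can be reproduced with a short simulation. This sketch assumes synchronous update rounds and uses 16 as "infinity" (the limit RIP uses); the function name is hypothetical.

```python
INFINITY = 16   # distance treated as unreachable (RIP's hop limit)

def count_to_infinity(initial):
    """Distances to A held by B, C, D, ... on a line, starting just after A went down.

    Each round, every router takes the smallest distance its remaining
    neighbours advertise and adds one. Needs at least two routers.
    """
    dist = list(initial)
    history = [tuple(dist)]
    while any(d < INFINITY for d in dist):
        new = []
        for i in range(len(dist)):
            neighbours = []
            if i > 0:
                neighbours.append(dist[i - 1])
            if i < len(dist) - 1:
                neighbours.append(dist[i + 1])
            new.append(min(INFINITY, min(neighbours) + 1))
        dist = new
        history.append(tuple(dist))
    return history

# B..F believed A was 1..5 hops away before the failure.
history = count_to_infinity([1, 2, 3, 4, 5])
```

In the first round B replaces its lost route with C's stale estimate plus one (3 hops), and the estimates then creep upward round by round until every router reaches 16.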

Link state algorithm


Distance vector routing was later replaced by link state routing, for two reasons. First, it did not take bandwidth into consideration when choosing routes. It was possible to change the delay metric to include bandwidth, but there was a second problem as well: the algorithm often took too long to converge (the count-to-infinity problem).

In link state routing, the network topology and all link costs are known, i.e., available as input to the link state algorithm. This is accomplished by having each node broadcast link state packets to all other nodes in the network; each such packet contains the identities and costs of the sender's attached links. A node needs to know only the identities of, and costs to, its directly attached neighbors; it then learns about the topology of the rest of the network by receiving link state broadcasts from other nodes. The result of the nodes' link state broadcasts is that all nodes have an identical and complete view of the network. Each node can then run the link state algorithm and compute the same set of least cost paths as every other node. The link state algorithm presented below is known as Dijkstra's algorithm.

The idea behind link state routing is that each router must:
1. Discover its neighbors and learn their network addresses.
2. Measure the delay or cost to each of its neighbors.
3. Construct a packet telling all it has just learned.
4. Send this packet to all other routers.
5. Compute the shortest path to every other router.

Learning about the neighbors


The router sends a HELLO packet on each of its links to introduce itself. On a LAN the HELLO is broadcast. The router on the other end should answer with its own HELLO, so that both routers know about each other.

Measuring line cost


The delay to each neighbor is estimated by some form of echo/reply exchange.

Building link state packets


Once the information needed for the exchange has been collected, the next step for each router is to build a packet containing all the data.

The packet contains:
The identity of the sender (the ID of the node that created the LSP).
A list of neighbors and the delay to each of them.
A sequence number and an age (TTL) to control the flooding.

Distributing the link state packets


The fundamental idea is flooding. Each packet carries a sequence number that is incremented for each new packet its source sends. Routers keep a database of all the <source router, sequence> pairs they see. When receiving a packet:
i. If it is a duplicate, discard it.
ii. If its sequence number is lower than that of a previously received packet from the same source, discard it.
iii. If it is new, forward it on all lines except the one it arrived on.
A node sets the TTL (age) of a packet before flooding it to its neighbors. The TTL is decremented every second, and the packet is discarded when it reaches 0.
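The receive rules and the aging rule can be sketched as follows. This is a minimal illustration, not the packet format of any real protocol: the class names, the field names and the line labels such as "to_C" are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class LinkStatePacket:
    source: str       # ID of the router that created the packet
    seq: int          # sequence number chosen by the source
    age: int          # remaining lifetime in seconds (TTL)
    neighbors: dict   # neighbor ID -> measured delay

class FloodingRouter:
    def __init__(self, lines):
        self.lines = lines   # names of the attached lines
        self.database = {}   # source -> newest packet seen from it

    def receive(self, packet, arrived_on):
        """Return the list of lines the packet should be forwarded on."""
        known = self.database.get(packet.source)
        if known is not None and packet.seq <= known.seq:
            return []        # duplicate or older than what we hold: discard
        self.database[packet.source] = packet
        return [line for line in self.lines if line != arrived_on]

    def tick(self):
        """Called once per second: age every stored packet, drop expired ones."""
        for source in list(self.database):
            self.database[source].age -= 1
            if self.database[source].age <= 0:
                del self.database[source]

router = FloodingRouter(["to_A", "to_C", "to_D"])
lsp = LinkStatePacket("A", seq=7, age=2, neighbors={"B": 4})
print(router.receive(lsp, "to_A"))   # new packet: flooded on the other lines
print(router.receive(lsp, "to_C"))   # same packet again: discarded
```

Keeping only the newest packet per source is enough for both discard rules, because a duplicate and an out-of-date packet are both recognised by comparing sequence numbers.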

Computing the new routes


Once a router has all the link state packets, it can construct the graph of the network. A shortest path algorithm (Dijkstra's) can then be used to find the path to all destinations. The memory requirement and computation time are not negligible.
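As an illustration of this last step, here is a compact version of Dijkstra's algorithm that returns, for every destination, the cost and the first hop to use (which is what a routing table needs). The four-router graph and its link costs are invented for the example.

```python
import heapq

def shortest_paths(graph, root):
    """graph: node -> {neighbor: link cost}. Returns (dist, first_hop)."""
    dist = {root: 0}
    first_hop = {}
    done = set()
    heap = [(0, root, None)]          # (cost so far, node, first hop used)
    while heap:
        cost, node, hop = heapq.heappop(heap)
        if node in done:
            continue                  # stale heap entry
        done.add(node)
        if hop is not None:
            first_hop[node] = hop
        for neighbor, link_cost in graph[node].items():
            new_cost = cost + link_cost
            if neighbor not in done and new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                # Leaving the root, the first hop is the neighbor itself.
                next_hop = neighbor if hop is None else hop
                heapq.heappush(heap, (new_cost, neighbor, next_hop))
    return dist, first_hop

# Hypothetical topology assembled from received link state packets.
graph = {
    "A": {"B": 2, "C": 5},
    "B": {"A": 2, "C": 1},
    "C": {"A": 5, "B": 1, "D": 3},
    "D": {"C": 3},
}
dist, first_hop = shortest_paths(graph, "A")
print(dist)
print(first_hop)
```

In this graph the direct A-C link costs 5, but the path through B costs 3, so A's table ends up sending traffic for both C and D to B first.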

Hierarchical Routing
Not all routers execute the same routing algorithm, for at least two important reasons:

Scale. As the number of routers becomes large, the overhead involved in computing, storing, and communicating the routing table information (e.g., link state updates or least cost path changes) becomes prohibitive. Today's public Internet consists of millions of interconnected routers and more than 50 million hosts. Storing routing table entries for each of these hosts and routers would clearly require enormous amounts of memory. The overhead required to broadcast link state updates among millions of routers would leave no bandwidth for sending the data packets!

Administrative autonomy. Ideally, an organization should be able to run and administer its network as it wishes, while still being able to connect its network to other "outside" networks.

Both of these problems can be solved by aggregating routers into "regions" or "autonomous systems" (ASs). Routers within the same AS all run the same routing algorithm (LS or DV) and have full information about each other. The routing algorithm running within an autonomous system is called an intra-autonomous system routing protocol. It is necessary, of course, to connect ASs to each other, and thus one or more of the routers in an AS have the added task of routing packets to destinations outside the AS. These routers are called gateway routers. In order for gateway routers to route packets from one AS to another (possibly passing through multiple other ASs before reaching the destination AS), the gateways must know how to route (i.e., determine routing paths) among themselves. The routing algorithm that gateways use to route among the various ASs is known as an inter-autonomous system routing protocol.

Inter-AS routing (BGP)


BGP sees the world as ASes and the lines connecting them. An AS is one of:
Stub AS - connected to only one other autonomous system, through which it gains access to the Internet. It has only one connection to the BGP graph.
Multi-connected AS - could be used for transit traffic, but refuses to carry it.
Transit AS - willing to handle transit traffic, possibly with restrictions (e.g., payment).

BGP is a distance vector protocol, however:
Each BGP router keeps track of the full path to each destination, not just its cost.
Each BGP router tells its neighbors the exact path it is using (this solves the count-to-infinity problem).
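Why announcing the exact path removes count-to-infinity can be shown in a few lines: a router simply ignores any advertised path that already contains its own AS. The sketch below is a strong simplification that picks the shortest loop-free path; real BGP route selection is driven by policy, and the function name and AS labels are assumptions made for the example.

```python
def best_path(my_as, offers):
    """offers: list of AS paths advertised by neighbors, each starting with
    the neighbor's own AS and ending at the destination AS.
    Returns the chosen path with my_as prepended, or None."""
    loop_free = [path for path in offers if my_as not in path]
    if not loop_free:
        return None                      # no usable route to the destination
    shortest = min(loop_free, key=len)   # simplification: fewest ASes wins
    return [my_as] + shortest

offers = [
    ["AS2", "AS1", "AS5"],         # goes back through AS1: rejected
    ["AS3", "AS4", "AS5"],
    ["AS6", "AS7", "AS8", "AS5"],
]
print(best_path("AS1", offers))
```

The first offer is the shortest, but it passes through AS1 itself, so it is exactly the kind of stale route that a plain distance vector protocol would have accepted.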

BGP is carried over TCP.

The inter-AS (BGP) routing protocol has two tasks:
Obtaining reachability information from neighboring ASs.
Propagating the reachability information to all routers internal to the AS.

If a router needs to send a packet outside its AS to subnet x, and there is more than one gateway with a path to the destination, the gateway is chosen by hot potato routing: the AS gets rid of the packet as inexpensively as possible. The router forwards the packet to the gateway that has the smallest router-to-gateway cost among all gateways with a path to the destination, according to the data gathered from the intra-AS routing protocol, and then adds a new entry for x to its forwarding table.
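A minimal sketch of the hot potato choice, assuming the router already holds its intra-AS cost to every gateway and the set of gateways that advertise a path to x. The gateway names, costs and the function name are invented for the illustration.

```python
def choose_gateway(intra_as_cost, gateways_with_path):
    """Pick the gateway with the smallest router-to-gateway cost among
    those that can reach the destination subnet."""
    candidates = [g for g in gateways_with_path if g in intra_as_cost]
    if not candidates:
        return None
    return min(candidates, key=lambda g: intra_as_cost[g])

# Costs learned from the intra-AS protocol (e.g., RIP or OSPF).
intra_as_cost = {"GW1": 4, "GW2": 2, "GW3": 1}
# Gateways that learned, through inter-AS routing, a path to subnet x.
gateways_for_x = ["GW1", "GW2"]

forwarding_table = {}
forwarding_table["x"] = choose_gateway(intra_as_cost, gateways_for_x)
print(forwarding_table)
```

GW3 is the closest gateway overall, but it has no path to x, so the choice falls on GW2. Note that the decision ignores how expensive the rest of the path is once the packet has left the AS.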

Intra-AS routing protocol (RIP/OSPF)


RIP
RIP is a distance vector protocol. The version of RIP specified in RFC 1058 uses hop count as a cost metric, i.e., each link has a cost of 1, and limits the maximum cost of a path to 15 (counted from the source router to the destination subnet, inclusive). This limits the use of RIP to ASs that are fewer than 15 hops in diameter.

In distance vector protocols, neighboring routers exchange routing information with each other. In RIP, the routing tables are exchanged between neighbors every 30 seconds using a RIP response message, which contains a list of up to 25 subnets within the AS, as well as the sender's distance to each of those subnets. These response messages are also called advertisements. Each router maintains a RIP table, known as a routing table, which includes both the router's distance vector and the router's forwarding table.

If a router does not hear from a neighbor at least once every 180 seconds, that neighbor is considered to be no longer reachable. When that happens, RIP modifies the local routing table and advertises the change to its neighbors. A router can also request information about its neighbor's cost to a given destination using RIP's request message.

RIP is an application layer protocol. Routers send RIP request and response messages to each other over UDP using port 520. The UDP packet is carried between routers in a standard IP packet.
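The two numeric limits mentioned above (25 subnets per response message, 180 seconds of silence before a neighbor is declared unreachable) can be sketched like this. The table layout and function names are assumptions for the example, not the RIP wire format; 16 is used as "unreachable" because the largest valid cost is 15.

```python
MAX_ENTRIES_PER_MESSAGE = 25
NEIGHBOR_TIMEOUT = 180   # seconds
UNREACHABLE = 16         # one more than the maximum path cost of 15

def build_responses(table):
    """Split a routing table (subnet -> (next_hop, hops)) into
    advertisements of at most 25 (subnet, hops) entries each."""
    entries = sorted((subnet, hops) for subnet, (_, hops) in table.items())
    return [entries[i:i + MAX_ENTRIES_PER_MESSAGE]
            for i in range(0, len(entries), MAX_ENTRIES_PER_MESSAGE)]

def expire_neighbors(table, last_heard, now):
    """Mark routes through silent neighbors as unreachable.
    Returns True if the table changed and should be re-advertised."""
    silent = {n for n, t in last_heard.items() if now - t > NEIGHBOR_TIMEOUT}
    changed = False
    for subnet, (next_hop, hops) in list(table.items()):
        if next_hop in silent and hops != UNREACHABLE:
            table[subnet] = (next_hop, UNREACHABLE)
            changed = True
    return changed

table = {f"net{i:02d}": ("R2" if i % 2 else "R3", 1 + i % 5) for i in range(60)}
print([len(message) for message in build_responses(table)])
print(expire_neighbors(table, {"R2": 0, "R3": 100}, now=200))
```

With 60 subnets the table needs three response messages. At time 200 only R2 has been silent for more than 180 seconds, so only the routes that use R2 as next hop are invalidated.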

OSPF
OSPF is typically deployed in upper-tier ISPs. It is a link state protocol that uses flooding of link state information and Dijkstra's least cost path algorithm. With OSPF, a router constructs a complete topological map (i.e., a directed graph) of the entire autonomous system. The router then locally runs Dijkstra's shortest path algorithm to determine a shortest path tree to all subnets with itself as the root node, according to the individual link costs configured by the network administrator. The router's routing table is then obtained from this shortest path tree.

With OSPF, a router broadcasts routing information to all other routers in the autonomous system, not just to its neighbors. A router broadcasts link state information whenever there is a change in a link's state. It also broadcasts periodically (at least once every 30 minutes), even if there was no change.

An OSPF autonomous system can be configured into "areas". Each area runs its own OSPF link state routing algorithm, with each router in an area broadcasting its link state to all other routers in that area. The internal details of an area thus remain invisible to all routers outside the area. Within each area, one or more area border routers are responsible for routing packets outside the area. Exactly one OSPF area in the AS is configured to be the backbone area. The primary role of the backbone area is to route traffic between the other areas in the AS. The backbone always contains all area border routers in the AS and may contain non-border routers as well. Inter-area routing within the AS requires that the packet first be routed to an area border router (intra-area routing), then routed through the backbone to the area border router of the destination area, and then routed to the final destination.

Boundary router - talks to routers in other ASs
Backbone router - connects two or more areas (performs routing within the backbone)
Area border router (internal + backbone router) - belongs to both an area and the backbone
Internal router
A diagram of a hierarchically structured OSPF AS network with four areas is shown above. We can identify four types of OSPF routers:
Internal routers - are in a non-backbone area and only perform intra-AS routing.
Area border routers - belong to both an area and the backbone.
Backbone routers (non-border routers) - perform routing within the backbone but are not themselves area border routers. Within a non-backbone area, internal routers learn of the existence of routes to other areas from information (essentially a link state advertisement, but advertising the cost of a route to another area rather than a link cost) broadcast within the area by its backbone routers.
Boundary routers - exchange routing information with routers belonging to other ASs. Such a router might, for example, use BGP to perform inter-AS routing. It is through such a boundary router that other routers learn about paths to external networks.

Data-Link Layer
MTU (maximum transmission unit) - Ethernet and IEEE 802.3 have size limits of 1500 and 1492 bytes respectively.

The method used today is CSMA/CD:
CS (carrier sense) - protocols in which stations listen for ongoing transmissions and act accordingly.
MA (multiple access) - another name for broadcast channels. Several stations can listen/broadcast at the same time.
CD (collision detection) - protocols in which stations abort transmission as soon as they detect a collision.

Ethernet transmission, when a station wants to transmit:
i. The station senses the channel:
1. Busy - keeps sensing.
2. Free - starts transmitting.
ii. If a collision is detected:
1. Abort the transmission.
2. Wait a random amount of time and start transmitting again.
Ethernet frames must be at least 64 bytes long for proper collision detection.

Previously used methods:

Pure Aloha - success of 18%
i. Transmit whenever we feel like it.
ii. If collision detection is available, use it. Otherwise wait for an ack from the destination station.
iii. If the transmission was unsuccessful (no ack / collision detected), resend after a random amount of time to reduce the probability of another collision.

Slotted Aloha - success of 36%
i. Divide time into slots (a slot should be at least as long as the time it takes a frame to reach its destination).
ii. A station can transmit only at the beginning of a slot (once a station is transmitting, no one else will interfere; a collision occurs if two stations start transmitting at the same time, i.e., in the same slot).

1-Persistent CSMA - success of 54%
i. Sense the wire:
1. Free - start transmitting immediately.
2. Busy - wait until it is free and start transmitting immediately (greedy).
ii. On collision, wait a random amount of time and start over.

Non-Persistent CSMA - success close to 90%
i. Sense the wire:
1. Free - start transmitting immediately.
2. Busy - wait a random amount of time and start all over again (with sensing first).

P-Persistent CSMA - success close to 90% (the definitions found online differ considerably)
i. Time is divided into slots; a station can transmit in its slot.
ii. Sense the wire:
1. Free - start transmitting with probability P.
2. Busy - wait a random amount of time, then go to step 1.
3. Busy - wait for the next slot, then go to step 1.
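The 18% and 36% figures quoted for pure and slotted Aloha are the peaks of the classical throughput formulas, S = G·e^(-2G) and S = G·e^(-G), where G is the average number of transmission attempts per frame time. These formulas assume Poisson-distributed traffic, an assumption the notes do not state. A quick numerical check:

```python
import math

def pure_aloha_throughput(g):
    """Successful frames per frame time; a frame is vulnerable for 2 frame times."""
    return g * math.exp(-2 * g)

def slotted_aloha_throughput(g):
    """Successful frames per slot; a frame is vulnerable for 1 slot only."""
    return g * math.exp(-g)

def peak(throughput, steps=2000, g_max=4.0):
    """Scan offered loads in (0, g_max] and return (best load, best throughput)."""
    loads = [g_max * i / steps for i in range(1, steps + 1)]
    best = max(loads, key=throughput)
    return best, throughput(best)

for name, fn in [("pure", pure_aloha_throughput),
                 ("slotted", slotted_aloha_throughput)]:
    load, s = peak(fn)
    print(f"{name}: peak throughput {s:.3f} at G = {load:.2f}")
```

The peaks are 1/(2e), about 0.184, at G = 0.5 for pure Aloha and 1/e, about 0.368, at G = 1 for slotted Aloha. Halving the vulnerable period by aligning transmissions to slots is what doubles the achievable throughput.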

ARP
The ARP module converts IP addresses to MAC addresses. It keeps an ARP table (ARP cache) with the mappings it knows.

Address resolution process: the IP address is looked up in the ARP table.
i. Found - the MAC address from the table is returned.
ii. Not found - the caller is notified and an ARP request is broadcast.
1. If a machine on the Ethernet recognizes its own IP address in the ARP request, it sends an ARP reply containing both its IP and MAC addresses to the requesting host (unicast).
2. When the ARP reply is received, the information is stored in the ARP table.

ARP fields:

Hardware type - Each data link layer protocol is assigned a number used in this field. For example, Ethernet is 1.
Protocol type - Each protocol is assigned a number used in this field. For example, IP is 0x0800.
Hardware length - Length in bytes of a hardware address. A MAC address is 6 bytes long.
Protocol length - Length in bytes of a logical address. An IPv4 address is 4 bytes long.
Operation - Specifies the operation the sender is performing: 1 for request, 2 for reply.
Sender hardware address - MAC address of the sender.
Sender protocol address - Protocol address of the sender.
Target hardware address - MAC address of the destination. This field is ignored in requests.
Target protocol address - Protocol address of the intended receiver.

ARP operation: in a request, all fields are filled except the target hardware address (destination MAC). The replying system:
i. Fills in the target hardware address (destination MAC) with its own MAC address.
ii. Swaps the sender addresses with the target addresses.
iii. Sets the operation field to 2.
iv. Sends the reply.
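The field list and the reply steps map directly onto a 28-byte structure for Ethernet/IPv4. The sketch below packs a request and derives the reply from it with Python's `struct` module; it builds only the ARP payload (no Ethernet header, nothing is sent on a network), and the MAC and IP addresses are invented example values.

```python
import socket
import struct

# hardware type, protocol type, hardware length, protocol length, operation,
# sender MAC, sender IP, target MAC, target IP (network byte order)
ARP_FORMAT = "!HHBBH6s4s6s4s"
ETHERNET, IPV4, REQUEST, REPLY = 1, 0x0800, 1, 2

def build_request(sender_mac, sender_ip, target_ip):
    """All fields are filled except the target hardware address."""
    return struct.pack(ARP_FORMAT, ETHERNET, IPV4, 6, 4, REQUEST,
                       sender_mac, socket.inet_aton(sender_ip),
                       b"\x00" * 6, socket.inet_aton(target_ip))

def build_reply(request, my_mac):
    htype, ptype, hlen, plen, _, sha, spa, tha, tpa = struct.unpack(ARP_FORMAT, request)
    tha = my_mac                   # i.  fill in the target hardware address
    sha, tha = tha, sha            # ii. swap sender and target addresses
    spa, tpa = tpa, spa
    return struct.pack(ARP_FORMAT, htype, ptype, hlen, plen,
                       REPLY,      # iii. operation = 2
                       sha, spa, tha, tpa)

host_mac = bytes.fromhex("02aabbccdd01")     # example requester
server_mac = bytes.fromhex("02aabbccdd02")   # example owner of 10.0.0.2
request = build_request(host_mac, "10.0.0.1", "10.0.0.2")
reply = build_reply(request, server_mac)

fields = struct.unpack(ARP_FORMAT, reply)
print(len(request), fields[4], fields[5].hex(), socket.inet_ntoa(fields[6]))
```

After the swap, the reply's sender fields carry the answer the requester was looking for (the MAC address that belongs to 10.0.0.2), and its target fields name the original requester, which is why the reply can be sent unicast.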