
Nitte Education Trust

Nitte Meenakshi Institute of Technology


Yelahanka, Bangalore- 560064, Karnataka- India.

Department of Electronics and Communication Engineering

Semester: 6th C Staff In charge: Dr. B.S. Pavan


Subject Name: Computer Network and Applications Subject Code: 18EC61

UNIT 4 Notes:
Source
TEXT BOOKS
Text Book - Data Communications and Networking, 5/e By Behrouz A Forouzan,
published by McGraw-Hill

MULTIMEDIA DATA
Today, multimedia data consists of text, images, video, and audio, although the definition
is changing to include futuristic media types.

Text
The Internet stores a large amount of text that can be downloaded and used. One often refers
to plaintext as a linear form, and hypertext as a nonlinear form, of textual data. Text stored in
the Internet uses a character set, such as Unicode, to represent symbols in the underlying
language. To store a large amount of textual data, the text can be compressed using one of the
lossless compression methods.

Image
In multimedia parlance, an image (or a still image as it is often called) is the representation of
a photograph, a fax page, or a frame in a moving picture.
Digital Image
To use an image, it first must be digitized. Digitization in this case means to represent an image
as a two-dimensional array of dots, called pixels. Each pixel then can be represented as a
number of bits, referred to as the bit depth. In a black-and-white image, such as a fax page, the
bit depth is 1; each pixel is represented by a single bit, 0 (black) or 1 (white). In a grayscale
picture, one normally uses a bit depth of 8, giving 256 levels. In a color image, the image is
normally divided into three channels, with each channel representing one of the three primary
colors of red, green, or blue (RGB). In this case, the bit depth is 24 (8 bits for each color). Some
representations use a separate channel, called the alpha channel, to represent the background.
In a black and white image, this results in two channels; in a color image, this results in four
channels. It is obvious that moving from black-and-white, to grayscale, to color representation
tremendously increases the amount of information to transmit over the Internet. Images are
therefore compressed to save transmission time.
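
As a rough worked illustration (the image dimensions below are assumptions, not values from the
text), the uncompressed size of an image is simply width × height × bit depth. A minimal Python
sketch:

    # Illustrative calculation of raw (uncompressed) image sizes.
    def raw_image_bits(width, height, bit_depth):
        return width * height * bit_depth

    fax   = raw_image_bits(1728, 1145, 1)      # black-and-white fax page, 1 bit per pixel
    gray  = raw_image_bits(640, 480, 8)        # grayscale, 8 bits per pixel
    color = raw_image_bits(640, 480, 24)       # RGB, 8 bits per channel
    rgba  = raw_image_bits(640, 480, 32)       # RGB plus an alpha channel

    for name, bits in [("fax", fax), ("gray", gray), ("color", color), ("color+alpha", rgba)]:
        print(f"{name:12s}: {bits / 8 / 1024:8.1f} KB")
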
Image Compression: JPEG
Although there are both lossless and lossy compression algorithms for images, the most widely
used is the lossy method standardized by the Joint Photographic Experts Group (JPEG). The
JPEG standard can be used for
both color and gray images. In JPEG, a grayscale picture is divided into blocks of 8 × 8 pixels.
The compression and decompression each go through three steps, as shown in Figure 28.17.

The purpose of dividing the picture into blocks is to decrease the number of calculations.
Transformation
JPEG normally uses DCT in the first step in compression and inverse DCT in the last step in
decompression. Transformation and inverse transformation are applied on 8 × 8 blocks.
Quantization
The output of DCT transformation is a matrix of real numbers. The precise encoding of these
real numbers requires a lot of bits. JPEG uses a quantization step that not only rounds real
values in the matrix, but also changes some values to zeros. The zeros can be eliminated in the
encoding step to achieve a high compression rate.
Encoding
After quantization, the values are reordered in a zigzag sequence before being input into the
encoder. The zigzag reordering of the quantized values is done to let the values related to the
lower frequency feed into the encoder before the values related to the higher frequency.
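
The three steps can be sketched in a few lines of Python. This is only a minimal illustration of
the idea: the quantization table below is a uniform placeholder, not the table defined in the JPEG
standard, and the input block is an arbitrary smooth gradient.

    import numpy as np

    N = 8

    def dct_matrix(n=N):
        # Orthonormal DCT-II basis matrix: coefficients = C @ block @ C.T
        c = np.zeros((n, n))
        for k in range(n):
            alpha = np.sqrt(1.0 / n) if k == 0 else np.sqrt(2.0 / n)
            for i in range(n):
                c[k, i] = alpha * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
        return c

    def zigzag_order(n=N):
        # Visit anti-diagonals, alternating direction, so low frequencies come first
        order = []
        for s in range(2 * n - 1):
            diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
            order.extend(diag if s % 2 else diag[::-1])
        return order

    block = np.add.outer(np.arange(N), np.arange(N)) * 8.0 - 128   # smooth gradient, level-shifted
    C = dct_matrix()
    coeffs = C @ block @ C.T                                # 1. transformation (DCT)
    quantized = np.round(coeffs / 16.0).astype(int)         # 2. quantization (uniform step of 16)
    stream = [quantized[i, j] for i, j in zigzag_order()]   # 3. zigzag reordering before encoding
    print(stream)   # for a smooth block, most high-frequency values quantize to zero

The long runs of zeros at the end of the zigzag stream are what the final (lossless) encoding step
exploits to achieve a high compression rate.
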
Video
Video is composed of multiple frames; each frame is one image. This means that a video file
requires a high transmission rate.
Digitizing Video
A video consists of a sequence of frames. If the frames are displayed on the screen fast enough,
we get an impression of motion. The reason is that our eyes cannot distinguish the rapidly flashing
frames as individual ones. There is no single standard number of frames per second; about 25
frames per second (European PAL video) or 30 frames per second (North American NTSC video)
is common.
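
As a rough illustration of why compression is necessary (the resolution and frame rate below are
assumptions, not values from the text), the data rate of uncompressed video is frame size ×
bit depth × frames per second:

    # Rough data-rate estimate for uncompressed digital video
    width, height = 1024, 768          # pixels per frame (assumed)
    bit_depth     = 24                 # RGB colour
    fps           = 30                 # frames per second (assumed)

    bits_per_second = width * height * bit_depth * fps
    print(f"{bits_per_second / 1e6:.1f} Mbps uncompressed")   # about 566 Mbps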

Video Compression: MPEG


Motion Picture Experts Group (MPEG) is a method to compress video. In principle, a
motion picture is a rapid flow of a set of frames, where each frame is an image. In other words,
a frame is a spatial combination of pixels, and a video is a temporal combination of frames that
are sent one after another. Compressing video, then, means spatially compressing each frame
and temporally compressing a set of frames.

Spatial Compression: The spatial compression of each frame is done with JPEG (or a
modification of it). Each frame is a picture that can be independently compressed.

Temporal Compression: In temporal compression, redundant frames are removed. For example,
a television receives 50 frames per second, but most frames are almost the same as the ones
before or after them. To temporally compress data, the MPEG method
first divides a set of frames into three categories: I-frames, P-frames, and B-frames. Figure
28.22 shows how a set of frames (7 in the figure) are compressed to create another set of frames.

I-frames. An intracoded frame (I-frame) is an independent frame that is not related to any
other frame (not to the frame sent before or after). They are present at regular intervals. An I-
frame must appear periodically to handle some sudden change in the frame that the previous
and following frames cannot show.
❑ P-frames. A predicted frame (P-frame) is related to the preceding I-frame or P-frame. In
other words, each P-frame contains only the changes from the preceding frame. P-frames can
be constructed only from previous I- or P-frames. P-frames carry much less information than
other frame types and carry even fewer bits after compression.
❑ B-frames. A bidirectional frame (B-frame) is related to the preceding and following I-
frame or P-frame. In other words, each B-frame is relative to the past and the future. Note that
a B-frame is never related to another B-frame.
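
As a small illustration of the three frame types (the display-order pattern below is arbitrary, chosen
so that every B-frame has both a preceding and a following reference), the dependencies can be
computed as follows:

    # Map each frame in an example display-order pattern to the frames it depends on.
    pattern = "IBBPBBPBBI"   # illustrative group of pictures

    def references(pattern):
        deps = {}
        for n, ftype in enumerate(pattern):
            if ftype == "I":
                deps[n] = []                                  # independent frame
            elif ftype == "P":
                prev = max(i for i in range(n) if pattern[i] in "IP")
                deps[n] = [prev]                              # predicted from preceding I/P frame
            else:  # B-frame
                prev = max(i for i in range(n) if pattern[i] in "IP")
                nxt  = min(i for i in range(n + 1, len(pattern)) if pattern[i] in "IP")
                deps[n] = [prev, nxt]                         # related to past and future I/P frames
        return deps

    for n, d in references(pattern).items():
        print(n, pattern[n], "depends on frames", d)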

Audio
Audio (sound) signals are analog signals that need a medium to travel; they cannot travel
through a vacuum. The speed of the sound in the air is about 330 m/s (740 mph). The audible
frequency range for normal human hearing is from about 20 Hz to 20 kHz, with maximum
audibility around 3300 Hz.
Digitizing Audio
To be able to provide compression, analog audio signals are digitized using an analog-to-digital
converter, which involves two processes: sampling and quantizing. A common digitizing process,
known as pulse code modulation (PCM), samples the analog signal, quantizes the samples, and
codes the quantized values as a stream of bits. A voice signal is sampled at the
rate of 8,000 samples per second with 8 bits per sample; the result is a digital signal of 8,000 ×
8 = 64 kbps. Music is sampled at 44,100 samples per second with 16 bits per sample; the result
is a digital signal of 44,100 × 16 = 705.6 kbps for monaural and 1.411 Mbps for stereo.

Audio Compression
Both lossy and lossless compression algorithms are used in audio compression. Lossless audio
compression allows one to preserve an exact copy of the audio files; it has a small compression
ratio of about 2 and is mostly used for archival and editing purposes. Lossy algorithms provide
far greater compression ratios (5 to 20) and are used in mainstream consumer audio devices.
Lossy algorithms sacrifice a little bit of quality, but substantially reduce space and bandwidth
requirements. For example, on a CD, one can fit one hour of high fidelity music, 2 hours of
music using lossless compression, or 8 hours of music compressed with a lossy technique.

Compression techniques used for speech must have low latency because significant delays
degrade the communication quality in telephony. Compression algorithms used for music must
be able to produce high quality sound with lower numbers of bits. Two categories of techniques
are used in audio compressions: predictive coding and perceptual coding.

Predictive coding: Predictive coding techniques have low latency and therefore are popular in
speech coding for telephony where significant delays degrade the communication quality.
Predictive coding methods include delta modulation (DM), adaptive delta modulation (ADM),
differential PCM (DPCM), adaptive DPCM (ADPCM), and linear predictive coding (LPC).
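
The common idea behind these methods is to encode the difference between each sample and a
prediction of it rather than the sample itself. A minimal sketch of the DPCM idea follows (real
codecs also quantize the residual, and adaptive variants such as ADPCM adjust the step size):

    # Minimal DPCM sketch: transmit the difference between each sample and a
    # prediction (here simply the previous sample).
    def dpcm_encode(samples):
        prediction, residuals = 0, []
        for s in samples:
            residuals.append(s - prediction)   # small differences need fewer bits
            prediction = s
        return residuals

    def dpcm_decode(residuals):
        prediction, samples = 0, []
        for r in residuals:
            prediction += r
            samples.append(prediction)
        return samples

    samples = [100, 102, 105, 104, 101, 99]
    residuals = dpcm_encode(samples)
    assert dpcm_decode(residuals) == samples
    print(residuals)    # [100, 2, 3, -1, -3, -2]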

Perceptual Coding
Even at their best, the predictive coding methods cannot sufficiently compress a CD quality
audio for the multimedia application. The most common compression technique used to create
CD-quality audio is perceptual coding, which is based on the science of psychoacoustics.
Algorithms used in perceptual coding first transform the data from time domain to frequency
domain. Psychoacoustics is the study of subjective human perception of sound. Perceptual
coding takes advantage of flaws in the human auditory system. The lower limit of human
audibility is nominally 0 dB, but this holds only for sounds at frequencies of about 2.5 kHz and
5 kHz; the threshold dips below 0 dB between these two frequencies and rises for frequencies
outside this range, as shown in Figure 28.23a. We cannot hear any frequency whose power
is below this curve; thus, it is not necessary to code such a frequency.

For example, we can save bits, without loss of quality, by omitting any sound with frequency
of less than 100 Hz if its power is below 20 dB.

Other concepts are frequency masking and temporal masking. Frequency masking occurs
when a loud sound partially or totally masks a softer sound if the frequencies of the two are
close to each other. For example, we cannot hear our dance partner in a room where a loud
heavy metal band is performing. In Figure 28.23b, a loud masking tone, around 700 Hz, raises
the threshold of the audibility curve between frequencies of about 250 to 1500 Hz. In temporal
masking, a loud sound can numb our ear for a short time even after the sound has stopped.

MULTIMEDIA IN THE INTERNET


The audio and video services are divided into three broad categories: streaming stored
audio/video, streaming live audio/video, and interactive audio/video. Streaming means a user
can listen (or watch) the file after the downloading has started.

Streaming Stored Audio/Video


In the first category, streaming stored audio/video, the files are compressed and stored on a
server. A client downloads the files through the Internet. This is sometimes referred to as on-
demand audio/video. Examples of stored audio files are songs, symphonies, books on tape, and
famous lectures. Examples of stored video files are movies, TV shows, and music video clips.
Streaming stored audio/video refers to on-demand requests for compressed audio/video files.
Downloading these types of files from a Web server can be different from downloading other
types of files.
First Approach: Using a Web Server
A compressed audio/video file can be downloaded as a text file. The client (browser) can use
the services of HTTP and send a GET message to download the file. The Web server can send
the compressed file to the browser. The browser can then use a help application, normally
called a media player, to play the file. Figure 28.24 shows this approach.

This approach is very simple and does not involve streaming. However, it has a drawback. An
audio/video file is usually large even after compression. An audio file may contain tens of
megabits, and a video file may contain hundreds of megabits. In this approach, the file needs
to be downloaded completely before it can be played. At contemporary data rates, the user must
wait several seconds or tens of seconds before the file can be played.

Second Approach: Using a Web Server with a Metafile


In another approach, the media player is directly connected to the Web server for downloading
the audio/video file. The Web server stores two files: the actual audio/video file and a metafile
that holds information about the audio/video file. Figure 28.25 shows the steps in this approach.

1. The HTTP client accesses the Web server using the GET message.
2. The information about the metafile comes in the response.
3. The metafile is passed to the media player.
4. The media player uses the URL in the metafile to access the audio/video file.
5. The Web server responds.
Third Approach: Using a Media Server
The problem with the second approach is that the browser and the media player both use the
services of HTTP. HTTP is designed to run over TCP. This is appropriate for retrieving the
metafile, but not for retrieving the audio/video file. The reason is that TCP retransmits a lost
or damaged segment, which is counter to the philosophy of streaming. We need to dismiss TCP
and its error control and use UDP instead. However, HTTP, which accesses the Web server, and
the Web server itself are designed for TCP; we therefore need another server, a media server.
Figure 28.26 shows the concept.

1. The HTTP client accesses the Web server using a GET message.
2. The information about the metafile comes in the response.
3. The metafile is passed to the media player.
4. The media player uses the URL in the metafile to access the media server to download
the file. Downloading can take place by any protocol that uses UDP.
5. The media server responds.

Fourth Approach: Using a Media Server and RTSP


The Real-Time Streaming Protocol (RTSP) is a control protocol designed to add more
functionalities to the streaming process. Using RTSP, we can control the playing of the audio/video.
RTSP is an out-of-band control protocol that is like the second connection in FTP. Figure 28.27
shows a media server and RTSP.
1. The HTTP client accesses the Web server using a GET message.
2. The information about the metafile comes in the response.
3. The metafile is passed to the media player.
4. The media player sends a SETUP message to create a connection with the media server.
5. The media server responds.
6. The media player sends a PLAY message to start playing (downloading).
7. The audio/video file is downloaded using another protocol that runs over UDP.
8. The connection is broken using the TEARDOWN message.
9. The media server responds.

The media player can send other types of messages. For example, a PAUSE message
temporarily stops the downloading; downloading can be resumed with a PLAY message.
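
As an illustration of this exchange (the URL, port numbers, and session identifier below are
made-up values), the control requests sent by the media player might look like the following:

    SETUP rtsp://media.example.com/song.mp3 RTSP/1.0
    CSeq: 1
    Transport: RTP/AVP;unicast;client_port=5000-5001

    PLAY rtsp://media.example.com/song.mp3 RTSP/1.0
    CSeq: 2
    Session: 12345678

    TEARDOWN rtsp://media.example.com/song.mp3 RTSP/1.0
    CSeq: 3
    Session: 12345678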

28.3.2 Streaming Live Audio/Video


In the second category, streaming live audio/video, a user listens to (or watches) broadcast audio and video through the
Internet. Good examples of this type of application are Internet radio and Internet TV. There
are several similarities between streaming stored audio/video and streaming live audio/video.
They are both sensitive to delay; neither can accept retransmission. However, there is a
difference. In the first application, the communication is unicast and on-demand. In the second,
the communication is multicast and live. Live streaming is better suited to the multicast services
of IP and the use of protocols such as UDP and RTP. However, presently, live streaming is still
using TCP and multiple unicasting instead of multicasting. There is still much progress to be
made in this area.

Example: Internet Radio


Internet radio, or web radio, is an audio broadcasting service webcast over the Internet, offering
news, sports, talk, and music. It involves a streaming medium that is accessible from
anywhere in the world.
Example: Internet Television (ITV)
Internet television or ITV allows viewers to choose the show they want to watch from a library
of shows. The primary models for Internet television are streaming Internet TV or selectable
video on an Internet location.

Real-Time Interactive Audio/Video


In the third category, interactive audio/video, people use the Internet to interactively
communicate with one another. The Internet phone or voice over IP is an example of this type
of application. Video conferencing is another example that allows people to communicate
visually and orally.

Characteristics: Real-time audio/video communication.


Time Relationship: Real-time data on a packet-switched network require the preservation of
the time relationship between packets of a session. For example, let us assume that a real-time
video server creates live video images and sends them online. The video is digitized and
packetized. There are only three packets, and each packet holds 10 seconds of video
information. The first packet starts at 00:00:00, the second packet starts at 00:00:10, and the
third packet starts at 00:00:20. Also imagine that it takes 1 s (an exaggeration for simplicity)
for each packet to reach the destination (equal delay). The receiver can play back the first
packet at 00:00:01, the second packet at 00:00:11, and the third packet at 00:00:21. Although
there is a 1-s time difference between what the server sends and what the client sees on the
computer screen, the action is happening in real time. The time relationship between the
packets is preserved. The 1-s delay is not important. Figure 28.28 shows the idea.

But what happens if the packets arrive with different delays? For example, the first packet
arrives at 00:00:01 (1-s delay), the second arrives at 00:00:15 (5-s delay), and the third arrives
at 00:00:27 (7-s delay). If the receiver starts playing the first packet at 00:00:01, it will finish
at 00:00:11. However, the next packet has not yet arrived; it arrives 4 s later. There is a gap
between the first and second packets and between the second and the third as the video is
viewed at the remote site. This phenomenon is called jitter. Figure 28.29 shows the situation.

Timestamp
One solution to jitter is the use of a timestamp. If each packet has a timestamp that shows the
time it was produced relative to the first (or previous) packet, then the receiver can add this
time to the time at which it starts the playback. In other words, the receiver knows when each
packet is to be played. Imagine the first packet in the previous example has a timestamp of 0,
the second has a timestamp of 10, and the third a timestamp of 20. If the receiver starts playing
back the first packet at 00:00:08, the second will be played at 00:00:18, and the third at
00:00:28. There are no gaps between the packets. Figure 28.30 shows the situation.
Playback Buffer: To be able to separate the arrival time from the playback time, a buffer is
needed to store the data until they are played back. The buffer is referred to as a playback
buffer. When a session begins (the first bit of the first packet arrives), the receiver delays
playing the data until a threshold is reached. In the previous example, the first bit of the first
packet arrives at 00:00:01; the threshold is 7 s, and the playback time is 00:00:08. The threshold
is measured in time units of data. The replay does not start until the time units of data are equal
to the threshold value.

Data are stored in the buffer at a possibly variable rate, but they are extracted and played back
at a fixed rate. Note that the amount of data in the buffer shrinks or expands, but as long as the
delay is less than the time to play back the threshold amount of data, there is no jitter. Figure
28.31 shows the buffer at different times. To understand how a playback buffer can actually
remove jitter, note that the buffer introduces additional delay for each packet. If the amount of delay added to each packet
makes the total delay (the delay in the network and the delay in the buffer) for each packet the
same, then the packets are played back smoothly, as though there were no delay.
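
The numbers from the earlier three-packet example can be checked with a few lines of Python
(the arrival times, timestamps, and playback start time are taken from that example):

    # Playback scheduling from timestamps removes jitter as long as no packet
    # arrives later than its scheduled playback time.
    arrivals   = [1, 15, 27]     # arrival times in seconds (1-s, 5-s, 7-s delays)
    timestamps = [0, 10, 20]     # production times relative to the first packet
    playback_start = 8           # first packet is held in the buffer until t = 8 s

    for arrival, ts in zip(arrivals, timestamps):
        play_at = playback_start + ts
        assert arrival <= play_at, "packet arrived too late to be played on time"
        print(f"arrives at {arrival:2d} s, buffered {play_at - arrival} s, played at {play_at} s")

The output shows that the packet with the largest network delay spends the least time in the
buffer, so the total delay is the same (8 s) for every packet.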

Figure 28.32 shows the idea using the timeline for seven packets. The buffer delay for the first
packet needs to be selected in such a way that the two right-hand saw-tooth curves do
not overlap. As the figure shows, if the playback time for the first packet is selected properly,
then the total delay for all packets is the same. Packets that have a longer
transmission delay wait a shorter time in the buffer, and vice versa.
Ordering: In addition to time relationship information and timestamps for real-time traffic, one
more feature is needed. A sequence number is needed for each packet. The timestamp alone
cannot inform the receiver if a packet is lost. For example, suppose the timestamps are 0, 10,
and 20. If the second packet is lost, the receiver receives just two packets with timestamps 0
and 20. The receiver assumes that the packet with timestamp 20 is the second packet, produced
20 s after the first. The receiver has no way of knowing that the second packet has actually
been lost. A sequence number to order the packets is needed to handle this situation.

Multicasting: Multimedia play a primary role in audio and video conferencing. The traffic can
be heavy, and the data are distributed using multicasting methods. Conferencing requires two-
way communication between receivers and senders.
Translation: Sometimes real-time traffic needs translation. A translator is a computer that can
change the format of a high-bandwidth video signal to a lower-quality narrow-bandwidth
signal. This is needed, for example, when a source creates a high-quality video signal at 5 Mbps
and sends it to a recipient whose bandwidth is less than 1 Mbps. To receive the signal, a translator is
needed to decode the signal and encode it again at a lower quality that needs less bandwidth.

Mixing: If there is more than one source that can send data at the same time (as in a video or
audio conference), the traffic is made of multiple streams. To converge the traffic to one stream,
data from different sources can be mixed. A mixer mathematically adds signals coming from
different sources to create one single signal.
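
A minimal sketch of this idea (the sample values are arbitrary 16-bit numbers):

    # A mixer adds the digitized signals from several sources sample by sample
    # to produce one outgoing stream, clipping to the 16-bit sample range.
    def mix(*streams):
        mixed = []
        for samples in zip(*streams):
            total = sum(samples)
            mixed.append(max(-32768, min(32767, total)))
        return mixed

    mic1 = [1000, -2000, 3000,   400]
    mic2 = [ 500,  2500, -100, 32700]
    print(mix(mic1, mic2))    # [1500, 500, 2900, 32767]  (last value clipped)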

Example of a Real-Time Application: Skype

28.4 REAL-TIME INTERACTIVE PROTOCOLS


Real-time interactive multimedia has attracted a lot of attention in the Internet community, and
several application-layer protocols have been designed to handle it. A schematic
representation is shown in Figure 28.33.
Although it could have only one microphone and one audio player, today’s interactive real-
time application is normally made up of several microphones and several cameras. The audio
and video information (analog signals) are converted to digital data. The digital data created
from different sources are normally mixed and packetized. The packets are sent to the packet-
switched Internet. The packets are received at the destination, with different delays (jitter) and
some packets may also be corrupted or lost. A playback buffer replays packets based on the
timestamp on each packet. The result is sent to a digital-to-analog converter to recreate the
audio/video signals. The audio signal is sent to a speaker; the video signal to a display device.
Each microphone or camera at the source site is called a contributor and is given a 32-bit
identifier called the contributing source (CSRC) identifier. The mixer is also called the
synchronizer and is given another identifier called the synchronizing source (SSRC) identifier.

28.4.1 Rationale for New Protocols


This section discusses why new protocols are needed to handle interactive real-time multimedia
applications such as audio and video conferencing. The first three layers of the
TCP/IP protocol suite need not be changed (physical, data-link, and network layers) because
these three layers are designed to carry any type of data. The physical layer provides service to
the data-link layer, no matter the nature of the bits in a frame. The data-link layer is responsible
for node-to-node delivery of the network layer packets no matter what makes up the packet.
The network layer is likewise responsible for host-to-host delivery of datagrams, possibly with a
better quality of service for multimedia applications. Some application-layer protocols need to be
designed to encode and compress the multimedia data, considering the trade-off between the
quality, bandwidth requirement, and the complexity of mathematical operations for encoding
and compression. Application-layer protocols that can handle multimedia have some
requirements that can be handled by the transport layer instead of being individually handled
by each application protocol.

Application Layer: Some application-layer protocols need to be developed for interactive
real-time multimedia because the nature of audio and video conferencing is
different from that of applications such as file transfer and electronic mail. Some of the
applications, such as MPEG audio and MPEG video, use some standards defined for audio and
video data transfer. There is no specific standard that is used by all applications, and there is
no specific application protocol that can be used by everyone.

Transport Layer: The lack of a single standard and the general features of multimedia
applications raise some questions about the transport-layer protocol to be used for all
multimedia applications. The two common transport-layer protocols, UDP and TCP, were
developed at the time when no one even thought about the use of multimedia in the Internet.
Can we use UDP or TCP as a general transport-layer protocol for real-time multimedia
applications? To answer this question, we need to think about the requirements for this type of
multimedia application and then see whether either UDP or TCP can satisfy these requirements.

Transport-Layer Requirements for Interactive Real-Time Multimedia


❑ Sender-Receiver Negotiation.
The first requirement is related to the lack of a single standard for audio or video. If a sender
uses one encoding method and the receiver uses another one, the communication is impossible.
The application programs need to negotiate the standards used for audio/video before encoded
and compressed data can be transferred.
❑ Creation of Packet Stream.
UDP allows the application to packetize its message with clear-cut boundaries before
delivering the message to UDP. TCP, on the other hand, can handle streams of bytes without
the requirement from the application to put specific boundaries on the chunk of data. In other
words, UDP is suitable for those applications that need to send messages with clear-cut
boundaries, but TCP is suitable for those applications that send continuous streams of bytes.
When it comes to real-time multimedia, we need both features. Real-time multimedia is a
stream of frames or chunks of data in which each chunk or frame has a specific size or
boundary, but there are also relationships between the frames or chunks. It is clear that neither
UDP nor TCP is suitable for handling streams of frames in this case. UDP cannot provide a
relationship between frames; TCP provides relationships between bytes, but a byte is much
smaller than a multimedia frame or chunk.
❑ Source Synchronization.
If an application uses more than one source (both audio and video), there is a need for
synchronization between the sources. For example, in a teleconference that uses both audio
and video, such as Skype, the audio and video may be using different encoding and
compression methods with different rates. It is also possible that there is more than one source
for audio or video (using multiple microphones or multiple cameras). Source synchronization
is normally done using mixers.
❑ Error Control.
Handling errors (packet corruption and packet loss) needs special care in real-time multimedia
applications. We need to inject extra redundancy into the data to be able to reproduce lost or
corrupted packets without asking for them to be retransmitted. This implies that the TCP
protocol is not suitable for real-time multimedia applications.
❑ Congestion Control.
As in other applications, we need to provide some sort of congestion control for multimedia. Since
TCP cannot be used for multimedia (because of its retransmission policy), congestion control
needs to be implemented in some other way.
❑ Jitter Removal. One of the problems with realtime multimedia applications is the jitter
created at the receiver site, because the packet-switched service provided by the Internet may
create uneven delays for different packets in a stream. In the past, audio conferencing was
provided by the telephone network, which was originally designed as a circuit-switched
network, which is jitter free. If we gradually move all of these applications to the Internet,
jitter must be dealt with. One of the ways to alleviate jitter is to use playback buffers and
timestamping. The playback is implemented at the application layer at the receiver site, but the
transport layer needs to provide the application layer with timestamping and sequencing.
❑ Identifying Sender.
A subtle issue in multimedia applications, as in other applications, is identifying the sender at
the application layer. When we use the Internet, the parties are identified by their IP addresses,
as in the HTTP protocol or electronic mail.

Capability of UDP or TCP to Handle Real-Time Multimedia


Table 28.6 compares UDP and TCP with respect to these requirements. The first glance at
Table 28.6 reveals a very interesting fact: neither UDP nor TCP can respond to all
requirements. There are three choices:

1. Use a new transport-layer protocol, such as SCTP, that combines the features of UDP and
TCP. However, SCTP was introduced when many multimedia applications were already in
use. It may become the de facto transport layer in the future.
2. Use TCP and combine it with another transport facility to compensate for the
requirements that TCP cannot provide. However, this choice is somewhat
difficult because TCP uses a retransmission method that is not acceptable for real-
time applications. Another problem with TCP is that it does not do multicasting. A TCP
connection is only a two-party connection, whereas real-time interactive communication
may require multiparty connections.
3. Use UDP and combine it with another transport facility to compensate for the
requirements that UDP cannot provide. In other words, UDP can be used to
provide the client-server socket interface, with another protocol running on top of
UDP. This is the current choice for multimedia applications.
RTP: Real-time Transport Protocol (RTP) is the protocol designed to handle real-time
traffic on the Internet. RTP does not have a delivery mechanism (multicasting, port numbers,
and so on); it must be used with UDP. RTP stands between UDP and the multimedia
application. The literature and standards treat RTP as the transport protocol (not a transport-
layer protocol) that can be thought of as located in the application layer (see Figure 28.34). The
data from multimedia applications are encapsulated in RTP, which in turn passes them to the
transport layer. In other words, the socket interface is located between RTP and UDP.

RTP Packet Format: Figure 28.35 shows the format of the RTP packet header. The format is
very simple and general enough to cover all real-time applications. An application that needs
more information adds it to the beginning of its payload.

A description of each field follows.


❑ Ver. This 2-bit field defines the version number. The current version is 2.
❑ P. This 1-bit field, if set to 1, indicates the presence of padding at the end of the packet. In
this case, the value of the last byte in the padding defines the length of the padding. Padding is
the norm if a packet is encrypted. There is no padding if the value of the P field is 0.
❑ X. This 1-bit field, if set to 1, indicates an extension header between the basic header and
the data. There is no extension header if the value of this field is 0.
❑ Contributor count. This 4-bit field indicates the number of contributing sources (CSRCs).
It can have a maximum of 15 contributors because a 4-bit field only allows a number between
0 and 15. Note that in an audio or video conferencing, each active source (the source that sends
data instead of just listening) is called a contributor.
❑ M. This 1-bit field is a marker used by the application to indicate, for example, the end of
its data. The multimedia application is a stream of blocks or frames with an end of frame
marker. If this bit is set in an RTP packet, it means that the RTP packet carries this marker.
❑ Payload type. This 7-bit field indicates the type of the payload. Several payload types have
been defined so far. A list of common payload types is shown in Table 28.7.

Sequence number. This field is 16 bits in length. It is used to number the RTP packets. The
sequence number of the first packet is chosen randomly; it is incremented by 1 for each
subsequent packet. It is used by the receiver to detect lost or out of order packets.
❑ Timestamp. This is a 32-bit field that indicates the time relationship between packets. The
timestamp for the first packet is a random number. For each succeeding packet, the value is the
sum of the preceding timestamp plus the time the first byte is produced (sampled). The value
of the clock tick depends on the application. For example, audio applications normally generate
chunks of 160 bytes; the clock tick for this application is 160. The timestamp for this
application increases 160 for each RTP packet.
❑ Synchronization source (SSRC) identifier. If there is only one source, this 32-bit field
defines the source. However, if there are several sources, the mixer is the synchronization
source and the other sources are contributors. The value of the source identifier is a random
number chosen by the source. The protocol provides a strategy in case of conflict (two sources
choosing the same identifier).
❑ Contributing source (CSRC) identifier. Each of these 32-bit identifiers (a maximum of 15)
defines a source. When there is more than one source in a session, the mixer is the
synchronization source and the remaining sources are the contributors.
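
Putting the fixed part of the header together, a minimal sketch in Python (the sequence number,
timestamp, and SSRC values below are arbitrary; payload type 0 commonly denotes PCM µ-law
audio):

    # Build a minimal 12-byte RTP header (no CSRC list) with the field layout
    # described above.
    import struct

    def rtp_header(seq, timestamp, ssrc, payload_type=0, marker=0,
                   version=2, padding=0, extension=0, csrc_count=0):
        byte0 = (version << 6) | (padding << 5) | (extension << 4) | csrc_count
        byte1 = (marker << 7) | payload_type
        return struct.pack("!BBHII", byte0, byte1, seq, timestamp, ssrc)

    hdr = rtp_header(seq=1, timestamp=160, ssrc=0x12345678)
    print(hdr.hex())    # 80 00 0001 000000a0 12345678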

UDP Port
Although RTP is itself a transport-layer protocol, the RTP packet is not encapsulated directly
in an IP datagram. Instead, RTP is treated like an application program and is encapsulated in a
UDP user datagram. However, unlike other application programs, no well-known port is
assigned to RTP. The port can be selected on demand with only one restriction: The port
number must be an even number. The next number (an odd number) is used by the companion
of RTP, Real-time Transport Control Protocol (RTCP).
RTCP: RTP allows only one type of message, one that carries data from the source to the
destination. To really control the session, a separate protocol is developed called Real-time
Transport Control Protocol (RTCP). RTCP is in fact a sister protocol of RTP. This means
that the UDP, as the real transport protocol, sometimes carries RTP payloads and sometimes
RTCP payloads as though they belong to different upper-layer protocols. RTCP packets make
an out-of-band control stream that provides two-way feedback information between the senders
and receivers of the multimedia streams.
In particular, RTCP provides the following functions:
1. RTCP informs the sender or senders of multimedia streams about the network performance,
which can be directly related to the congestion in the network. Since multimedia applications
use UDP (instead of TCP), there is no way to control the congestion in the network at the
transport layer. This means that, if it is necessary to control the congestion, it should be done
at the application layer. RTCP gives the clues to the application layer to do so. If the congestion
is observed and reported by the RTCP, an application can use a more aggressive compression
method to reduce the number of packets and, therefore, to reduce congestion, for a trade-off in
quality. On the other hand, if no congestion is observed, the application program can use a less
aggressive compression method for a better quality service.
2. Information carried in the RTCP packets can be used to synchronize different streams
associated with the same source. A sender may use two different streams to carry audio and
video data. In addition, audio data may be collected from different microphones and video data
may be collected from different cameras. In general, two pieces of information are needed to
achieve synchronization:
a. Each sender needs an identity. Although each source may have a different SSRC, RTCP
provides one single identity, called a canonical name (CNAME) for each source.
The CNAME can be used to correlate the different sources and allows the receiver to combine
the different streams coming from the same sender. A CNAME is in the form of user@host, in
which user is normally the login name of the user and the host is the domain name of
the host.
b. The canonical name cannot per se provide synchronization. To synchronize the sources,
we need to know the absolute timing of the stream, in addition to the relative timing
provided by the timestamp field in each RTP packet. The timestamp information in each
packet gives the relative time relationship of the bits in the packet to the beginning of the
stream; it cannot relate one stream to another. The absolute time, the "wall clock" time as
it is sometimes referred to, needs to be sent by RTCP packets to enable synchronization.
3. An RTCP packet can carry extra information about the sender that can be useful for the
receiver, such as the name of the sender (beyond canonical name) or captions for a video.

RTCP Packets: Figure 28.36 shows five common packet types. The number next to each box
defines the numeric value of each packet.

Sender Report Packet


The sender report packet is sent periodically by the active senders in a session to report
transmission and reception statistics for all RTP packets sent during the interval.
The sender report packet includes the following information:
❑ The SSRC of the RTP stream.
❑ The absolute timestamp, which is the combination of the relative timestamp and the wall
clock time. The absolute timestamp allows the receiver to synchronize different RTP packets.
❑ The number of RTP packets and bytes sent from the beginning of the session.

Receiver Report Packet: The receiver report is issued by passive participants, those that do not
send RTP packets. The report informs the sender and other receivers about the quality of
service. The feedback information can be used for congestion control at the sender site. A
receiver report includes the following information:
❑ The SSRC of the RTP stream for which the receiver report has been generated.
❑ The fraction of packet loss.
❑ The last sequence number.
❑ The interarrival jitter.

Source Description Packet: The source periodically sends a source description packet to give
additional information about itself. The packet can include:
❑ The SSRC.
❑ The canonical name (CNAME) of the sender.
❑ Other information such as the real name, the e-mail address, the telephone number.
❑ The source description packet may also include extra data, such as captions used for video.

Bye Packet: A source sends a bye packet to shut down a stream. It allows the source to
announce that it is leaving the conference. Although other sources can detect the absence of a
source, this packet is a direct announcement. It is also very useful to a mixer.

Application-Specific Packet: This packet is for an application that wants to use functions not
defined in the standard. It allows the definition of a new packet type.

UDP Port
RTCP, like RTP, does not use a well-known UDP port. It uses a temporary port. The UDP port
chosen must be the number immediately following the UDP port selected for RTP, which
makes it an odd-numbered port.
Bandwidth Utilization
The RTCP packets are sent not only by the active senders, but also by passive receivers, whose
numbers are normally greater than the active senders. This means that if the RTCP traffic is
not controlled, it may get out of hand. To control the situation, RTCP uses a control mechanism
to limit its traffic to the small portion (normally 5 percent) of the traffic used in the session (for
both RTP and RTCP). A larger fraction of this small percentage, x, is then assigned to the
RTCP packets generated by the passive receivers; the smaller fraction, (1 − x), is assigned to
the RTCP packets generated by the active senders. The RTCP protocol uses a mechanism to define
the value of x based on the ratio of passive receivers to active senders.
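
A small numeric illustration (the session bandwidth and the 75/25 split below are assumed example
values, not figures from the text):

    # Illustrative split of the RTCP bandwidth budget
    session_bw  = 1_000_000              # 1 Mbps total session bandwidth (assumed)
    rtcp_bw     = 0.05 * session_bw      # about 5 percent reserved for RTCP
    receiver_bw = 0.75 * rtcp_bw         # larger share for the passive receivers
    sender_bw   = 0.25 * rtcp_bw         # smaller share for the active senders
    print(rtcp_bw, receiver_bw, sender_bw)   # 50000.0 37500.0 12500.0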

Requirement Fulfillment
1. The first requirement, sender-receiver negotiation, cannot be satisfied by the combination
of the RTP/RTCP protocols; it is handled by the signaling protocol (SIP) discussed later.
2. The second requirement, creation of a stream of chunks, is provided by encapsulating each
chunk in an RTP packet and giving a sequence number to each chunk. The M field in an RTP
packet also defines whether there is a specific type of boundary between chunks.
3. The third requirement, synchronization of sources, is satisfied by identifying each source by
a 32-bit identifier and using the relative timestamping in the RTP packet and the absolute
timestamping in the RTCP packet.
4. The fourth requirement, error control, is provided by using the sequence number in the RTP
packet and letting the application regenerate the lost packet using FEC methods.
5. The fifth requirement, congestion control, is met by the feedback from the receiver using the
receiver report packets (RTCP) that notify the sender about the number of lost packets. The
sender then can use a more aggressive compression technique to reduce the number of packets
sent and therefore alleviate the congestion.
6. The sixth requirement, jitter removal, is achieved by the timestamping and sequencing
provided in each RTP packet to be used in buffered playback of the data.
7. The seventh requirement, identification of source, is provided by using the CNAME included
in the source description packets (RTCP) sent by the sender.

Session Initiation Protocol (SIP)


Although RTP and RTCP can be used to provide these services, one component is missing: a
signaling system required to call the participants. Consider traditional audio conferencing (between
two or more people) using the traditional telephone system (the public switched telephone network,
or PSTN). To make a phone call, two telephone numbers are needed: that of the caller and that
of the callee. We then need to dial the telephone number of the callee and wait for her to
respond. The telephone conversation starts after the response of the callee. In other words,
regular telephone communication involves two phases: the signaling phase and the audio
communication phase.

The signaling phase in the telephone network is provided by a protocol called Signaling System
7 (SS7). The SS7 protocol is totally separate from the voice communication system. For
example, although the traditional telephone system uses analog signals carrying voice over a
circuit-switched network, SS7 carries the signaling information (such as the dialed digits) as
messages over a separate channel. SS7 today not only provides the calling service, it also provides other services,
such as call forwarding and error reporting. The combination of RTP/RTCP protocols is
equivalent to the voice communication provided by PSTN; to totally simulate this system over
the Internet, it needs a signaling system.

SIP is a protocol devised by the IETF to be used in conjunction with RTP/RTCP. It is an
application-layer protocol, like HTTP, that establishes, manages, and terminates a multimedia
session (call). It can be used to create two-party, multiparty, or multicast sessions. SIP is
designed to be independent of the underlying transport layer; it can run on UDP, TCP, or SCTP,
using the port 5060.
SIP can provide the following services:
❑ It establishes a call between users if they are connected to the Internet.
❑ It finds the location of the users (their IP addresses) on the Internet because the users may
be changing their IP addresses (think about mobile IP and DHCP).
❑ It finds out if the users are able or willing to take part in the conference call.
❑ It determines the users’ capabilities in terms of media to be used and the type of encoding
(the first requirement for multimedia communication).
❑ It establishes session setup by defining parameters such as port numbers to be used
(remember that RTP and RTCP use port numbers).
❑ It provides session management functions such as call holding, call forwarding, accepting
new participants, and changing the session parameters.

Communicating Parties
In an audio or video conference, the communication is between humans, not devices. For
example, in HTTP or FTP, the client needs to find the IP address of the server (using DNS)
before communication. There is no need to find a person before communicating. In SMTP,
the sender of an e-mail sends the message to the receiver's mailbox on a mail server without
controlling when the message will be picked up. In an audio or video conference, the caller
needs to find the callee. The callee can be sitting at her desk, can be walking in the street, or
can be totally unavailable. What makes the communication more difficult is that the device to
which the participant has access at a particular time may have a different capability than the
device being used at another time. The SIP protocol needs to find the location of the callee and
at the same time negotiate the capability of the devices the participants are using.

Addresses:
In a regular telephone communication, a telephone number identifies the sender, and another
telephone number identifies the receiver. SIP is very flexible. In SIP, an e-mail address, an IP
address, a telephone number, and other types of addresses can be used to identify the sender
and receiver. However, the address needs to be in SIP format (also called scheme). Figure 28.37
shows some common formats.
The SIP address is similar to a URL. In fact, the SIP addresses are URLs that can be included
in the web page of the potential callee. For example, Bob can include one of the above
addresses as his SIP address and, if someone clicks on it, the SIP protocol is invoked and calls
Bob. Other addresses are also possible, such as those that use first name followed by last name,
but all addresses need to be in the form sip:user@address.

Messages: SIP is a text-based protocol like HTTP. SIP, like HTTP, uses messages. Messages
in SIP are divided into two broad categories: requests and responses. The two categories are
described below.

Request Messages: IETF originally defined six request messages, but some new request
messages have been proposed to extend the functionality of SIP. The original six messages are
as follows:
❑ INVITE. The INVITE request message is used by a caller to initialize a session. Using this
request message, a caller invites one or more callees to participate in the conference.
❑ ACK. The ACK message is sent by the caller to confirm that the session initialization has
been completed.
❑ OPTIONS. The OPTIONS message queries a machine about its capabilities.
❑ CANCEL. The CANCEL message cancels an already started initialization process, but does
not terminate the call. A new initialization may start after the CANCEL message.
❑ REGISTER. The REGISTER message makes a connection when the callee is not available.
❑ BYE. The BYE message is used to terminate the session. Compare the BYE message with
the CANCEL message. The BYE message, which can be initiated from the caller or callee,
terminates the whole session.

Response Messages: IETF has also defined six types of response messages that can be sent to
request messages, but note that there is no one-to-one pairing between request and response
types; a response message can be sent in reply to any request message. As in other text-oriented
application protocols, the response messages are defined using three-digit numbers. The
response messages are briefly described below:
❑ Informational Responses. These responses are in the form SIP 1xx (the common ones are
100 trying, 180 ringing, 181 call forwarded, 182 queued, and 183 session progress).
❑ Successful Responses. These responses are in the form SIP 2xx (the common one is 200
OK).
❑ Redirection Responses. These responses are in the form SIP 3xx (the common ones are 301
moved permanently, 302 moved temporarily, 380 alternative service).
❑ Client Failure Responses. These responses are in the form SIP 4xx (the common ones are
400 bad request, 401 unauthorized, 403 forbidden, 404 not found, 405 method not allowed,
406 not acceptable, 415 unsupported media type, 420 bad extension, 486 busy here).
❑ Server Failure Responses. These responses are in the form SIP 5xx (the common ones are
500 server internal error, 501 not implemented, 503 service unavailable, 504 timeout, 505 SIP
version not supported).
❑ Global Failure Responses. These responses are in the form SIP 6xx (the common ones are
600 busy everywhere, 603 decline, 604 doesn’t exist, and 606 not acceptable).

First Scenario: Simple Session


In the first scenario, assume that Alice needs to call Bob and the communication uses the IP
addresses of Alice and Bob as the SIP addresses. The communication is divided into three
modules: establishing, communicating, and terminating. Figure 28.38 shows a simple session
using SIP.

Establishing a Session: Establishing a session in SIP requires a three-way handshake. Alice
sends an INVITE request message, using UDP, TCP, or SCTP, to begin the communication. If
Bob is willing to start the session, he sends a response (200 OK) message. To confirm that a
reply code has been received, Alice sends an ACK request message to start the audio
communication. The establishment section uses two request messages (INVITE and ACK) and
one response message (200 OK). The body of the message uses another protocol, the Session
Description Protocol (SDP), that defines the syntax (format) and semantics (meaning) of each
line. The first line in the body defines the sender of the message; the second line defines the
media (audio) and the port number to be used for RTP in the direction from Alice to Bob. The
response message defines the media (audio) and the port number to be used for RTP in the Bob
to Alice direction. After Alice confirms the establishment of the session with the ACK message
request (which does not need a response), the establishing session is finished and the
communication can start.
Communicating: After the session has been established, Alice and Bob can communicate using
two temporary ports defined in the establishing sessions. The even-numbered ports are used
for RTP; RTCP can use the odd-numbered ports.
Terminating the Session: The session can be terminated with a BYE message sent by either
party.

Second Scenario: Tracking the Callee


What happens if Bob is not sitting at his terminal? He may be away from his system or at
another terminal. He may not even have a fixed IP address if DHCP is being used. SIP has a
mechanism (similar to one in DNS) that finds the IP address of the terminal at which Bob is
sitting. To perform this tracking, SIP uses the concept of registration. SIP defines some servers
as registrars. At any moment a user is registered with at least one registrar server; this server
knows the IP address of the callee.

When Alice needs to communicate with Bob, she can use the e-mail address instead of the IP
address in the INVITE message. The message goes to a proxy server. The proxy server sends
a lookup message (not part of SIP) to some registrar server that has registered Bob. When the
proxy server receives a reply message from the registrar server, the proxy server takes Alice’s
INVITE message and inserts the newly discovered IP address of Bob. This message is then
sent to Bob. Figure 28.39 shows the process.
SIP Message Format and SDP Protocol
The SIP request and response messages are divided into four sections: start or status line,
header, a blank line, and the body.

Start Line: The start line is a single line that starts with the message request name, followed
by the address of the recipient and the SIP version. For example, an INVITE request message
has the following start line format:
INVITE sip:forouzan@roadrunner.com SIP/2.0

Status Line: The status line is a single line that contains the SIP version followed by the three-digit
response code and a reason phrase. For example, the 200 response message has the following
status line format:
SIP/2.0 200 OK

Header: A header, in the request or response message, can use several lines. Each line starts
with the line name followed by a colon and space and followed by the value. Some typical
header lines are: Via, From, To, Call-ID, Content-Type, Content-Length, and Expires. The Via
header defines the SIP device through which the message passes, including the sender. The
From header defines the sender and the To header defines the recipient. The Call-ID header is
a random number that defines the session. The Content-Type defines the type of body of the
message, which is normally SDP. The Content-Length defines the length of the body of the
message in bytes. The Expires header is normally used in a REGISTER message to define the
expiration of the information in the body.

Body: SIP uses another protocol, called Session Description Protocol (SDP), to define the
body. Each line in the body is made of an SDP code followed by an equal sign, and followed
by the value. The code is a single character that determines the purpose of the code.

Body is divided into several sections.


The first part of the body is normally general information. The codes used in this section are:
v (for version of SDP), and o (for origin of the message).

The second part of the body normally gives information to the recipient for making a decision
to take part in the session. The codes used in this section are: s (subject), i (information about
subject), u (for session URL), and e (the e-mail address of the person responsible for the
session).

The third part of the body gives the technical details to make the session possible. The codes
used in this part are: c (the unicast or multicast IP address that the user needs to join to be able
to take part in the session), t (the start time and end time of the session, encoded as integers),
m (the information about media such as audio, video, the port number, the protocol used).
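
Putting the pieces together, a hypothetical INVITE message with an SDP body might look like the
following (all names, addresses, port numbers, and identifiers are illustrative, not taken from the
text):

    INVITE sip:bob@biloxi.example.com SIP/2.0
    Via: SIP/2.0/UDP client.atlanta.example.com
    From: sip:alice@atlanta.example.com
    To: sip:bob@biloxi.example.com
    Call-ID: a84b4c76e66710
    Content-Type: application/sdp

    v=0
    o=alice 2890844526 2890844526 IN IP4 client.atlanta.example.com
    s=Audio call
    c=IN IP4 client.atlanta.example.com
    t=0 0
    m=audio 49170 RTP/AVP 0

Note the blank line that separates the header from the SDP body, and the even RTP port number
(49170) offered in the m line.
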
H.323: It is a standard designed by ITU to allow telephones on the public telephone network
to talk to computers (called terminals in H.323) connected to the Internet. Figure 28.40 shows
the general architecture of H.323 for audio, but it can also be used for video. A gateway
connects the Internet to the telephone network. In general, a gateway is a five-layer device that
can translate a message from one protocol stack to another. The gateway here does exactly the
same thing. It transforms a telephone network message into an Internet message. The
gatekeeper server on the local area network plays the role of the registrar server.

Protocols
H.323 uses a number of protocols to establish and maintain voice (or video) communication.
Figure 28.41 shows some of these protocols. H.323 uses G.711 or G.723.1 for compression. It
uses a protocol named H.245, which allows the parties to negotiate the compression method.
Protocol Q.931 is used for establishing and terminating connections. Another protocol, called
H.225, or Registration/Admission/Status (RAS), is used for registration with the
gatekeeper.
SIP is only a signaling protocol; it is normally combined with RTP and RTCP to create a
complete set of protocols for interactive real-time multimedia applications, but it can be used
with other protocols as well. H.323, on the other hand, is a complete set of protocols that
mandates the use of RTP and RTCP.

Operation: The following is a simple example that shows the operation of a telephone
communication using H.323. Figure 28.42 shows the steps used by a terminal to communicate
with a telephone.

1. The terminal sends a broadcast message to the gatekeeper. The gatekeeper responds with its
IP address.
2. The terminal and gatekeeper communicate, using H.225 to negotiate bandwidth.
3. The terminal, the gatekeeper, the gateway, and the telephone communicate using Q.931 to
set up a connection.
4. The terminal, the gatekeeper, the gateway, and the telephone communicate using H.245 to
negotiate the compression method.
5. The terminal, the gateway, and the telephone exchange audio using RTP under the
management of RTCP.
6. The terminal, the gatekeeper, the gateway, and the telephone communicate using Q.931 to
terminate the communication.
