
MMC

Unit-4

AV Compression

UNIT 4: AUDIO & VIDEO COMPRESSION


Contents: 1) Introduction 2) DPCM 3) ADPCM 4) APC 5) LPC 6) Video Compression 7) H.261 8) H.263 9) MPEG 10) MPEG-1 11) MPEG-2 12) MPEG-4

Instructional Objectives: At the end of the unit, the students should be able to: 1. Understand the various Audio Compression Techniques. 2. Distinguish the merits of various audio compression techniques. 3. Learn the various video compression techniques. 4. Distinguish the merits of various video compression techniques.

1 BSS ECE, REVA


1. INTRODUCTION:
An audio (sound) wave is a one-dimensional acoustic (pressure) wave. When an acoustic wave enters the ear, the eardrum vibrates, causing the tiny bones of the middle ear to vibrate along with it, sending nerve pulses to the brain. These pulses are perceived as sound by the listener. In a similar way, when an acoustic wave strikes a microphone, the microphone generates an electrical signal, representing the sound amplitude as a function of time. The representation, processing, storage, and transmission of such audio signals are a major part of the study of multimedia systems. The frequency range of the human ear runs from 20 Hz to 20,000 Hz. Some animals, notably dogs, can hear higher frequencies.
The ear hears logarithmically, so the ratio of two sounds with power A and B is conventionally expressed in dB (decibels) according to the formula dB = 10 log10(A/B). If we define the lower limit of audibility (a pressure of about 0.0003 dyne/cm²) for a 1-kHz sine wave as 0 dB, an ordinary conversation is about 50 dB and the pain threshold is about 120 dB, a dynamic range of a factor of 1 million in sound pressure. The ear is surprisingly sensitive to sound variations lasting only a few milliseconds. The eye, in contrast, does not notice changes in light level that last only a few milliseconds.
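The decibel formula above can be checked with a short calculation. The following is a minimal sketch only; the function names are illustrative and not part of any standard.

```python
import math

def power_ratio_to_db(a: float, b: float) -> float:
    """Level of power a relative to power b, in decibels: 10 log10(a / b)."""
    return 10.0 * math.log10(a / b)

def db_to_power_ratio(db: float) -> float:
    """Inverse of the formula: the power ratio that corresponds to a level in dB."""
    return 10.0 ** (db / 10.0)

print(power_ratio_to_db(100.0, 1.0))  # a 100:1 power ratio expressed in dB
print(db_to_power_ratio(50.0))        # power ratio of a 50 dB sound to the 0 dB reference
```

Because the scale is logarithmic, every additional 10 dB corresponds to a further factor of 10 in power.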
The result of this observation is that jitter of only a few milliseconds during a multimedia transmission affects the perceived sound quality more than it affects the perceived image quality.
Audio waves can be converted to digital form by an ADC (Analog-to-Digital Converter). An ADC takes an electrical voltage as input and generates a binary number as output. In Fig. 7-1(a) we see an example of a sine wave. To represent this signal digitally, we can sample it every T seconds, as shown by the bar heights.
If a sound wave is not a pure sine wave but a linear superposition of sine waves where the highest frequency component present is f, then the Nyquist theorem states that it is sufficient to make samples at a frequency 2f. Sampling more often is of no value since the higher frequencies that such sampling could detect are not present.

Digital samples are never exact. The samples of Fig. 7-1(c) allow only nine values, from -1.00 to +1.00 in steps of 0.25. An 8-bit sample would allow 256 distinct values. A 16-bit sample would allow 65,536 distinct values. The error introduced by the finite number of bits per sample is called the quantization noise. If it is too large, the ear detects it. Two well-known examples where sampled sound is used are the telephone and audio compact discs. Pulse code modulation, as used within the telephone system, uses 8-bit samples made 8000 times per second. In North America and Japan, 7 bits are for data and 1 is for control; in Europe all 8 bits are for data. This system gives a data rate of 56,000 bps or 64,000 bps. With only 8000 samples/sec, frequencies above 4 kHz are lost.
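The quantization step and the telephone data rates quoted above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the quantizer is a plain uniform quantizer and all function names are illustrative.

```python
def quantize(x: float, step: float = 0.25, lo: float = -1.0, hi: float = 1.0) -> float:
    """Uniform quantizer: clip x to [lo, hi] and snap it to the nearest allowed level."""
    clipped = max(lo, min(hi, x))
    return round((clipped - lo) / step) * step + lo

def levels(bits: int) -> int:
    """Number of distinct values that a sample of the given width can take."""
    return 2 ** bits

def pcm_bit_rate(samples_per_sec: int, bits_per_sample: int) -> int:
    """Data rate in bits per second of an uncompressed PCM stream."""
    return samples_per_sec * bits_per_sample

sample = 0.62
q = quantize(sample)
print(q, sample - q)                                  # quantized value and quantization error
print(levels(8), levels(16))                          # 8-bit and 16-bit samples
print(pcm_bit_rate(8000, 7), pcm_bit_rate(8000, 8))   # telephone PCM with 7 or 8 data bits
```

The quantization error is always at most half a step, which is why adding bits per sample (a smaller step) reduces the quantization noise.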
Audio CDs are digital with a sampling rate of 44,100 samples/sec, enough to capture frequencies up to 22,050 Hz, which is good enough for people, but bad for canine music lovers. The samples are 16 bits each and are linear over the range of amplitudes. Note that 16-bit samples allow only 65,536 distinct values, even though the dynamic range of the ear is about 1 million when measured in steps of the smallest audible sound. Thus, using only 16 bits per sample introduces some quantization noise (although the full dynamic range is not covered; CDs are not supposed to hurt). With 44,100 samples/sec of 16 bits each, an audio CD needs a bandwidth of 705.6 kbps for monaural and 1.411 Mbps for stereo. While this is lower than what video needs (see below), it still takes almost a full T1 channel to transmit uncompressed CD quality stereo sound in real time.
Digitized sound can be easily processed by computers in software. Dozens of programs exist for personal computers to allow users to record, display, edit, mix, and store sound waves from multiple sources. Virtually all professional sound recording and editing are digital nowadays.
Music, of course, is just a special case of general audio, but an important one. Another important special case is speech. Human speech tends to be in the 600-Hz to 6000-Hz range. Speech is made up of vowels and consonants, which have different properties. Vowels are produced when the vocal tract is unobstructed, producing resonances whose fundamental frequency depends on the size and shape of the vocal system and the position of the speaker's tongue and jaw. These sounds are almost periodic for intervals of about 30 msec. Consonants are produced when the vocal tract is partially blocked. These sounds are less regular than vowels.
Some speech generation and transmission systems make use of models of the vocal system to reduce speech to a few parameters (e.g., the sizes and shapes of various cavities), rather than just sampling the speech waveform.

2. DIFFERENTIAL PULSE CODE MODULATION:


Differential pulse code modulation (DPCM) is a derivative of standard PCM. It exploits the fact that the range of the differences in amplitude between successive samples of the audio waveform is less than the range of the actual sample amplitudes.
Hence fewer bits are needed to represent the difference signal.

Operation of DPCM: Encoder


The previously digitized sample is held in the register (R).
The DPCM signal is computed by subtracting the current register contents (Ro) from the new sample output by the ADC (PCM).
The register value is then updated before transmission.

Decoder
The decoder simply adds the received DPCM value to the previous register contents to recover the PCM sample. Since the ADC output contains noise, there will be cumulative errors in the value of the register signal.
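The encoder and decoder operation described above can be sketched as follows. This is a minimal illustration, assuming integer PCM samples, a register that starts at zero, and no separate quantization of the difference signal; the function names are illustrative.

```python
def dpcm_encode(pcm_samples):
    """Encoder: send the difference between each PCM sample and the register contents."""
    register = 0  # R: holds the previously digitized sample (assumed to start at 0)
    diffs = []
    for pcm in pcm_samples:
        diffs.append(pcm - register)  # DPCM = PCM - Ro
        register = pcm                # update the register for the next sample
    return diffs

def dpcm_decode(diffs):
    """Decoder: add each received difference to the previous register contents."""
    register = 0
    pcm_samples = []
    for d in diffs:
        register += d
        pcm_samples.append(register)
    return pcm_samples

samples = [100, 104, 109, 111, 108, 101]
encoded = dpcm_encode(samples)
print(encoded)               # after the first value, the differences are small
print(dpcm_decode(encoded))  # the decoder recovers the original samples
```

After the first sample the differences span a much smaller range than the samples themselves, which is what allows fewer bits per value.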

Third-order predictive DPCM signal encoder and decoder

Operation Principles:

To reduce this noise effect, predictive methods are used to form a more accurate version of the previous signal: the prediction uses not only the current signal but also varying proportions of a number of the preceding estimated signals. The proportions used are known as predictor coefficients. The difference signal is computed by subtracting varying proportions of the last three predicted values, held in registers R1, R2 and R3, from the current output of the ADC (PCM).
The value in register R1 is then transferred to R2 and that in R2 to R3, and the new predicted value goes into R1.

The decoder operates in a similar way, adding the same proportions of the last three computed PCM signals to the received DPCM signal.
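A third-order predictor of this kind can be sketched as below. The predictor coefficients (0.5, 0.25, 0.25) are hypothetical values chosen only for illustration, and the difference signal is again left unquantized so that the round trip is exact.

```python
COEFFS = (0.5, 0.25, 0.25)  # hypothetical predictor coefficients for R1, R2, R3

def predict(regs):
    """Weighted sum of the last three reconstructed values (R1, R2, R3)."""
    return round(sum(c * r for c, r in zip(COEFFS, regs)))

def encode(pcm_samples):
    regs = [0, 0, 0]  # R1, R2, R3, assumed to start at zero
    diffs = []
    for pcm in pcm_samples:
        prediction = predict(regs)
        diff = pcm - prediction
        diffs.append(diff)
        # Shift R2 -> R3 and R1 -> R2; the new reconstructed value goes into R1.
        regs = [prediction + diff, regs[0], regs[1]]
    return diffs

def decode(diffs):
    regs = [0, 0, 0]
    pcm_samples = []
    for diff in diffs:
        value = predict(regs) + diff  # same prediction as in the encoder
        pcm_samples.append(value)
        regs = [value, regs[0], regs[1]]
    return pcm_samples

samples = [100, 104, 109, 111, 108, 101]
print(encode(samples))
print(decode(encode(samples)))
```

The decoder stays in step with the encoder because both apply the same coefficients to the same three register values.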

3. ADAPTIVE DIFFERENTIAL PULSE CODE MODULATION:


Savings in bandwidth are possible by varying the number of bits used for the difference signal depending on its amplitude (fewer bits to encode smaller difference signals).
An international standard for this is defined in ITU-T Recommendation G.721. It is based on the same principle as DPCM, except that an eighth-order predictor is used and the number of bits used to quantize each difference is varied.
This can be either 6 bits, producing 32 kbps, to obtain a better quality output than with third-order DPCM, or 5 bits, producing 16 kbps, if lower bandwidth is more important.

ADPCM subband encoder and decoder schematic


Adaptive differential PCM varies the number of bits used for the difference signal depending on its amplitude.


A second ADPCM standard, which is a derivative of G.721, is defined in ITU-T Recommendation G.722 (better sound quality).


This uses subband coding, in which the input signal prior to sampling is passed through two filters: one which passes only signal frequencies in the range 50 Hz through to 3.5 kHz and the other only frequencies in the range 3.5 kHz through to 7 kHz.
By doing this the input signal is effectively divided into two separate equal-bandwidth signals, the first known as the lower subband signal and the second the upper subband signal.
Each is then sampled and encoded independently using ADPCM, the sampling rate of the upper subband signal being 16 ksps to allow for the presence of the higher frequency components in this subband.
The use of two subbands has the advantage that different bit rates can be used for each. In general the frequency components in the lower subband have a higher perceptual importance than those in the higher subband.
For example, with a bit rate of 64 kbps the lower subband is ADPCM encoded at 48 kbps and the upper subband at 16 kbps.
The two bit streams are then multiplexed together to produce the transmitted (64 kbps) signal in such a way that the decoder in the receiver is able to separate them again into two streams for decoding.
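The multiplexing of the two subband streams can be illustrated with a simple byte-packing sketch. The layout is an assumption made only for this example: each encoder is taken to emit 8000 codewords per second, 6 bits for the lower subband and 2 bits for the upper subband, packed into one byte. The actual framing is defined by the standard and is not reproduced here.

```python
CODEWORDS_PER_SEC = 8000      # assumed codeword rate for each subband encoder
LOWER_BITS, UPPER_BITS = 6, 2

def multiplex(lower_codes, upper_codes):
    """Pack one lower-subband and one upper-subband codeword into each byte."""
    return bytes(((lo & 0x3F) << 2) | (up & 0x03)
                 for lo, up in zip(lower_codes, upper_codes))

def demultiplex(stream):
    """Split the received bytes back into the two subband codeword streams."""
    lower = [b >> 2 for b in stream]
    upper = [b & 0x03 for b in stream]
    return lower, upper

lower = [5, 63, 0, 17]
upper = [1, 3, 0, 2]
stream = multiplex(lower, upper)
print(demultiplex(stream))
print(CODEWORDS_PER_SEC * LOWER_BITS,                 # lower subband bit rate
      CODEWORDS_PER_SEC * UPPER_BITS,                 # upper subband bit rate
      CODEWORDS_PER_SEC * (LOWER_BITS + UPPER_BITS))  # multiplexed bit rate
```

Giving the lower subband three quarters of the bits reflects its higher perceptual importance.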

4. LINEAR PREDICTIVE CODING:


Linear predictive coding (LPC) signal encoder and decoder


Linear predictive coding involves the source analyzing the audio waveform to determine a selection of the perceptual features it contains.
With this type of coding the perceptual features of an audio waveform are analysed first. These are then quantized and sent, and the destination uses them, together with a sound synthesizer, to regenerate a sound that is perceptually comparable with the source audio signal.
With this compression technique, although the speech can often sound synthetic, high levels of compression can be achieved. In terms of speech, the three features which determine the perception of a signal by the ear are its:

- Pitch: this is closely related to the frequency of the signal. It is important since the ear is more sensitive to signals in the range 2-5 kHz.
- Period: this is the duration of the signal.
- Loudness: this is determined by the amount of energy in the signal.

The input speech waveform is first sampled and quantized at a defined rate. A block of digitized samples, known as a segment, is then analysed to determine the various perceptual parameters of the speech that it contains.
The output of the encoder is a string of frames, one for each segment. Each frame contains fields for pitch and loudness (the period is determined by the sampling rate being used) and a notification of whether the signal is voiced (generated through the vocal cords) or unvoiced (vocal cords are opened).
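The analysis of one segment into frame fields can be sketched as follows. This is an illustrative assumption, not the method of any particular LPC standard: pitch is estimated from the autocorrelation peak, loudness is taken as mean power, and the voiced/unvoiced decision uses an arbitrary threshold of 0.3. The field names, the 8 kHz sampling rate and the 50-400 Hz search range are all hypothetical.

```python
import math

def analyse_segment(segment, sample_rate=8000, min_hz=50, max_hz=400):
    """Return hypothetical frame fields (pitch, loudness, voiced flag) for one segment."""
    n = len(segment)
    energy = sum(x * x for x in segment)
    loudness = energy / n  # mean power of the segment

    def autocorr(lag):
        return sum(segment[i] * segment[i + lag] for i in range(n - lag))

    # Candidate pitch periods, in samples, for the assumed pitch range.
    lags = range(sample_rate // max_hz, sample_rate // min_hz + 1)
    best_lag = max(lags, key=autocorr)
    # Strong periodicity is taken to mean a voiced sound (assumed threshold).
    voiced = energy > 0 and autocorr(best_lag) / energy > 0.3
    pitch = sample_rate / best_lag if voiced else 0.0
    return {"pitch_hz": pitch, "loudness": loudness, "voiced": voiced}

# A 30 ms segment (240 samples at 8 kHz) of a 200 Hz tone stands in for a vowel.
segment = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(240)]
frame = analyse_segment(segment)
print(frame)
```

Only these few fields per segment are sent, rather than the 240 samples themselves, which is where the high compression comes from.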

5. PERCEPTUAL CODING:
LPC and CELP are used for telephony applications and hence for the compression of speech signals. Perceptual coding (PC) is designed for the compression of general audio such as that associated with a digital television broadcast.
Using this approach, sampled segments of the source audio waveform are analysed but only those features that are perceptible to the ear are transmitted.
Although the human ear is sensitive to signals in the range 15 Hz to 20 kHz, the level of sensitivity to each signal is non-linear; that is, the ear is more sensitive to some signals than others.
Also, when multiple signals are present, as in audio, a strong signal may reduce the level of sensitivity of the ear to other signals which are near to it in frequency, an effect known as frequency masking.
When the ear hears a loud sound it takes a short but finite time before it can hear a quieter sound, an effect known as temporal masking.

Sensitivity of the ear


The dynamic range of the ear is defined as the ratio of the loudest sound it can hear to the quietest sound. The sensitivity of the ear varies with the frequency of the signal. The ear is most sensitive to signals in the range 2-5 kHz; hence the signals in this band are the quietest the ear is sensitive to.
The vertical axis gives all the other signal amplitudes relative to this signal (2-5 kHz). Signal A is above the hearing threshold and B is below the hearing threshold.


Perceptual properties of the human ear


Perceptual encoders have been designed for the compression of general audio such as that associated with a digital television broadcast.
Signal B is larger than signal A. This causes the basic sensitivity curve of the ear to be distorted in the region of signal B. Signal A will no longer be heard as it is within the distortion band.

Variation with frequency of effect of frequency masking


The width of each curve at a particular signal level is known as the critical bandwidth for that frequency.
It has been observed that for frequencies less than 500 Hz the critical bandwidth is around 100 Hz; for frequencies greater than 500 Hz, however, the bandwidth increases linearly in multiples of 100 Hz.
Hence if the magnitude of the frequency components that make up an audio sound can be determined, it becomes possible to determine those frequencies that will be masked and therefore do not need to be transmitted.

Temporal masking caused by loud signal


After the ear hears a loud sound, it takes a further short time before it can hear a quieter sound. This is known as temporal masking. After the loud sound ceases it takes a short period of time for the signal amplitude to decay. During this time, signals whose amplitudes are less than the decay envelope will not be heard and hence need not be transmitted.


In order to achieve this, the input audio waveform must be processed over a time period that is comparable with that associated with temporal masking.

5.1 MPEG Audio Coders:

MPEG perceptual coder schematic


The audio input signal is first sampled and quantized using PCM. The bandwidth available for transmission is divided into a number of frequency subbands using a bank of analysis filters.
The bank of filters maps each set of 32 (time-related) PCM samples into an equivalent set of 32 frequency samples.
Processing associated with both frequency and temporal masking is carried out by the psychoacoustic model.
In the basic encoder, the time duration of each sampled segment of the audio input signal is equal to the time taken to accumulate 12 successive sets of 32 PCM samples.
The 12 sets of 32 PCM samples are converted into frequency components using the DFT. The output of the psychoacoustic model is a set of what are known as signal-to-mask ratios (SMRs), which indicate the frequency components whose amplitude is below the related audible threshold.
This is done so that more bits are allocated to the regions of highest sensitivity than to the less sensitive regions.
In an encoder all the frequency components are carried in a frame.
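The size and duration of one encoder segment follow directly from the figures above (12 sets of 32 PCM samples). The short sketch below works this out; the three sampling rates are example values assumed for illustration, since the text does not state one.

```python
SUBBANDS = 32          # PCM samples mapped into frequency samples per set
SETS_PER_SEGMENT = 12  # successive sets accumulated per segment

def segment_samples() -> int:
    """Number of PCM samples in one encoder segment."""
    return SUBBANDS * SETS_PER_SEGMENT

def segment_duration_ms(sampling_rate_hz: int) -> float:
    """Time needed to accumulate one segment at the given sampling rate."""
    return 1000.0 * segment_samples() / sampling_rate_hz

for rate in (32000, 44100, 48000):  # assumed example sampling rates
    print(rate, segment_samples(), round(segment_duration_ms(rate), 2))
```

A segment of a few milliseconds is comparable with the time scale of temporal masking, which is the requirement stated earlier.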


MPEG perceptual coder schematic


MPEG audio is used primarily for the compression of general audio and, in particular, for the audio associated with various digital video applications.
The header contains information such as the sampling frequency that has been used. The quantization is performed in two stages using a form of companding. The peak amplitude level in each subband is first quantized using 6 bits, and a further 4 bits are then used to quantize each of the 12 frequency components in the subband relative to this level.
Collectively this is known as the subband sample (SBS) format. The ancillary data field at the end of the frame is optional and is used, for example, to carry additional coded samples associated with the surround sound that is present with some digital video broadcasts.
At the decoder the dequantizers determine the magnitude of each signal, and the synthesis filters then produce the PCM samples.
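The two-stage quantization of one subband can be illustrated as below. This is only a sketch: the linear mapping of the peak level onto 6 bits and of each component onto 4 bits is an assumption made for illustration, whereas a real coder uses the quantization tables defined by the standard.

```python
import math

PEAK_BITS, SAMPLE_BITS = 6, 4
PEAK_LEVELS = 2 ** PEAK_BITS - 1       # 63 steps for the peak level
HALF_RANGE = 2 ** (SAMPLE_BITS - 1) - 1  # signed 4-bit codes in -7..+7

def quantize_subband(samples, full_scale=1.0):
    """Stage 1: quantize the peak level. Stage 2: quantize samples relative to it."""
    peak = max(abs(s) for s in samples)
    peak_code = math.ceil(peak / full_scale * PEAK_LEVELS)
    scale = peak_code / PEAK_LEVELS * full_scale
    codes = [round(s / scale * HALF_RANGE) if scale else 0 for s in samples]
    return peak_code, codes

def dequantize_subband(peak_code, codes, full_scale=1.0):
    """Decoder side: rebuild the magnitudes from the peak code and relative codes."""
    scale = peak_code / PEAK_LEVELS * full_scale
    return [c / HALF_RANGE * scale for c in codes]

subband = [0.5 * math.sin(0.5 * k) for k in range(12)]  # 12 components of one subband
peak_code, codes = quantize_subband(subband)
restored = dequantize_subband(peak_code, codes)
print(peak_code, codes)
print(PEAK_BITS + 12 * SAMPLE_BITS)  # bits used for this subband under these assumptions
```

Quantizing relative to the peak means a quiet subband is still represented with the full 4-bit resolution.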

6. VIDEO COMPRESSION:
One approach to compressing a video source is to apply the JPEG algorithm to each frame independently. This is known as moving JPEG or MJPEG.


If a typical movie scene has a minimum duration of 3 seconds then, assuming a frame refresh rate of 60 frames/s, each scene is composed of a minimum of 180 frames. Hence, by sending only those segments of each frame that have movement associated with them, considerable additional savings in bandwidth can be made. There are two types of compressed frames:
- Those that are compressed independently (I-frames)
- Those that are predicted (P-frames and B-frames)

Example frame sequences I and P frames


In the context of compression, since video is simply a sequence of digitized pictures, video is also referred to as moving pictures and the terms frame and picture are used interchangeably.
I-frames (intracoded frames) are encoded without reference to any other frames. Each frame is treated as a separate picture and the Y, Cr and Cb matrices are encoded separately using JPEG.
For I-frames the level of compression is small. They are good for the first frame relating to a new scene in a movie. I-frames must be repeated at regular intervals, since a frame can be corrupted during transmission and the whole picture would otherwise be lost.
The number of frames/pictures between successive I-frames is known as a group of pictures (GOP). Typical values of GOP are 3 to 12.


The encoding of a P-frame is relative to the contents of either a preceding I-frame or a preceding P-frame.
P-frames are encoded using a combination of motion estimation and motion compensation. The accuracy of the prediction operation is determined by how well any movement between successive frames is estimated. This is known as motion estimation.
Since the estimation is not exact, additional information must also be sent to indicate any small differences between the predicted and actual positions of the moving segments involved. This is known as motion compensation.
The number of P-frames between I-frames is limited to avoid error propagation.


Frame Sequences I-, P- and B-frames


Each frame is treated as a separate (digitized) picture and the Y, Cb and Cr matrices are encoded independently using the JPEG algorithm (DCT, quantization, entropy encoding), except that the quantization threshold values that are used are the same for all DCT coefficients.

PB-Frames
A fourth type of frame known as a PB-frame has also been defined; it does not refer to a new frame type as such but rather to the way two neighbouring P- and B-frames are encoded as if they were a single frame.

Motion Estimation and Compensation:
Motion estimation involves comparing small segments of two consecutive frames for differences and, should a difference be detected, a search is carried out to determine to which neighbouring segment the original segment has moved.
To limit the search time, the comparison is limited to a few segments. This works well in slow-moving applications such as video telephony. For fast-moving video it does not work effectively. Hence B-frames (bi-directional) are used. Their contents are predicted using the past and the future frames.
B-frames provide the highest level of compression and, because they are not involved in the coding of other frames, they do not propagate errors.
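A frame sequence that mixes the three frame types can be generated with a small helper. This is an illustrative sketch: the parameter names `i_interval` (spacing of I-frames) and `p_interval` (spacing of I- or P-frames) and their default values are assumptions, not values taken from a standard.

```python
def frame_sequence(n_frames: int, i_interval: int = 12, p_interval: int = 3) -> str:
    """Frame types in display order: an I-frame every i_interval frames,
    a P-frame every p_interval frames in between, and B-frames elsewhere."""
    types = []
    for k in range(n_frames):
        position = k % i_interval
        if position == 0:
            types.append("I")
        elif position % p_interval == 0:
            types.append("P")
        else:
            types.append("B")
    return "".join(types)

print(frame_sequence(13))                # I-, P- and B-frames
print(frame_sequence(7, 6, 1))           # I- and P-frames only
```

Each B-frame in such a sequence sits between two I- or P-frames, which are the past and future frames used to predict it.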


P-frame encoding
The digitized contents of the Y matrix associated with each frame are first divided into a two-dimensional matrix of blocks of 16 x 16 pixels, each known as a macro block.
In the example here, 4 DCT blocks are used for the luminance signal and 1 each for the two chrominance signals.
To encode a P-frame, the contents of each macro block in the frame (known as the target frame) are compared on a pixel-by-pixel basis with the contents of the corresponding macro block in the preceding I- or P-frame (the reference frame).
If a close match is found then only the address of the macro block is encoded. If a match is not found the search is extended to cover an area around the macro block in the reference frame.
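The pixel-by-pixel comparison and the extended search can be sketched as a simple block-matching routine. The assumptions here are illustrative: the match criterion is the sum of absolute differences (SAD), the search is exhaustive over a small window that includes the corresponding position, and the toy frames are 12 x 12 with 4 x 4 blocks rather than 16 x 16 macro blocks.

```python
def sad(target, reference, ty, tx, ry, rx, size):
    """Sum of absolute differences between a target block and a reference block."""
    return sum(
        abs(target[ty + i][tx + j] - reference[ry + i][rx + j])
        for i in range(size)
        for j in range(size)
    )

def find_motion_vector(target, reference, ty, tx, size=16, search=4):
    """Search +/- `search` pixels around (ty, tx) in the reference frame and
    return the best (dy, dx) offset together with its match error."""
    height, width = len(reference), len(reference[0])
    best_vec, best_cost = None, None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = ty + dy, tx + dx
            if not (0 <= ry <= height - size and 0 <= rx <= width - size):
                continue  # candidate block would fall outside the frame
            cost = sad(target, reference, ty, tx, ry, rx, size)
            if best_cost is None or cost < best_cost:
                best_vec, best_cost = (dy, dx), cost
    return best_vec, best_cost

# Toy frames: the target is the reference shifted right by 2 pixels.
reference = [[13 * y + x for x in range(12)] for y in range(12)]
target = [[row[x - 2] if x >= 2 else 0 for x in range(12)] for row in reference]
vector, cost = find_motion_vector(target, reference, ty=4, tx=4, size=4, search=3)
print(vector, cost)
```

A zero error at a non-zero offset means the block has simply moved, so only its position needs to be sent.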


B-frame encoding
To encode a B-frame, any motion is estimated with reference to both the immediately preceding I- or P-frame and the immediately succeeding P- or I-frame.
The motion vector and difference matrices are computed using first the preceding frame as the reference frame and then the succeeding frame as the reference.
A third motion vector and set of difference matrices are then computed using the target and the mean of the two other predicted sets of values.
The set with the lowest set of difference matrices is chosen and encoded.
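The choice between the three candidate predictions can be sketched as follows. This is a minimal illustration, assuming each macro block is given as a flat list of pixel values and that "lowest" is measured by the sum of absolute differences; the function names are illustrative.

```python
def block_error(target, prediction):
    """Sum of absolute differences between the target block and a prediction."""
    return sum(abs(t - p) for t, p in zip(target, prediction))

def choose_b_prediction(target, forward, backward):
    """Pick the prediction (from the preceding frame, the succeeding frame, or
    their mean) that leaves the smallest difference values to encode."""
    interpolated = [(f + b) / 2 for f, b in zip(forward, backward)]
    candidates = {
        "forward": forward,            # predicted from the preceding I- or P-frame
        "backward": backward,          # predicted from the succeeding P- or I-frame
        "interpolated": interpolated,  # mean of the two predictions
    }
    best = min(candidates, key=lambda name: block_error(target, candidates[name]))
    difference = [t - p for t, p in zip(target, candidates[best])]
    return best, difference

target = [10, 20, 30, 40]
forward = [8, 18, 28, 38]
backward = [12, 22, 32, 42]
print(choose_b_prediction(target, forward, backward))
```

In this toy case the target lies midway between the two references, so the interpolated prediction leaves nothing to encode.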


Implementation schematic I-frames


The encoding procedure used for the macro blocks that make up an I-frame is the same as that used in the JPEG standard to encode each 8 x 8 block of pixels.

Implementation Issues:
In the case of P-frames, the encoding of each macro block is dependent on the output of the motion estimation unit which, in turn, depends on the contents of the macro block being encoded and the contents of the macro block in the search area of the reference frame that produces the closest match. There are three possibilities:
- If the two contents are the same, only the address of the macro block in the reference frame is encoded.
- If the two contents are very close, both the motion vector and the difference matrices associated with the macro block in the reference frame are encoded.
- If no close match is found, then the target macro block is encoded in the same way as a macro block in an I-frame.
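The three possibilities can be written as a small decision function. The match error is assumed to be a non-negative number such as a sum of absolute differences, and the threshold that defines "very close" is a hypothetical value chosen only for illustration.

```python
def classify_macro_block(best_match_error: int, close_threshold: int = 500) -> str:
    """Decide how a P-frame macro block is encoded from its best match error."""
    if best_match_error == 0:
        return "address_only"                  # contents are the same
    if best_match_error <= close_threshold:    # assumed meaning of "very close"
        return "motion_vector_and_difference"
    return "intra"                             # encoded as in an I-frame

for error in (0, 120, 5000):
    print(error, classify_macro_block(error))
```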


Implementation schematic P-frames


In order to carry out its role, the motion estimation unit, which contains the search logic, utilizes a copy of the (uncoded) reference frame.

Implementation schematic B-frames


The same procedure is followed for encoding B-frames, except that both the preceding (reference) frame and the frame succeeding the target frame are involved.



Example Macro block Encoded Bit stream Format


For each macro block it is necessary to identify the type of encoding that has been used. This is the role of the formatter.
- Type: indicates the type of frame encoded (I, P or B).
- Address: identifies the location of the macro block in the frame.
- Quantization value: the value used to quantize all the DCT coefficients in the macro block.
- Motion vector: the encoded vector.
- Block representation: indicates which of the six 8 x 8 blocks that make up the macro block are present.
- B1, B2, ... B6: the JPEG-encoded DCT coefficients for those blocks present.

MPEG-1 example frame sequence


MPEG-1 uses a similar video compression technique to H.261; the digitization format used is the source intermediate format (SIF), with progressive scanning and a refresh rate of 30 Hz (for NTSC) and 25 Hz (for PAL).

COMPRESSION:
Compression for I-frames is similar to that of JPEG for video, typically 10:1 through to 20:1 depending on the complexity of the frame contents.
P- and B-frames achieve higher compression, in the region of 20:1 through to 30:1 for P-frames and 30:1 through to 50:1 for B-frames.
MPEG-1 (ISO Recommendation 11172) uses a resolution of 352x288 pixels and is used for VHS-quality audio and video on CD-ROM at a bit rate of 1.5 Mbps.


MPEG-2 (ISO Recommendation 13818) is used in the recording and transmission of studio-quality audio and video. Different levels of video resolution are possible:
- Low: 352X288 pixels, comparable with MPEG-1.
- Main: 720X576 pixels, studio-quality video and audio, bit rate up to 15 Mbps.
- High: 1920X1152 pixels, used in wide-screen HDTV; bit rates of up to 80 Mbps are possible.
MPEG-4 is used for interactive multimedia applications over the Internet and over various entertainment networks.
The MPEG-4 standard contains features that enable a user not only to access a video sequence passively, using for example start/stop commands, but also to manipulate the individual elements that make up a scene within a video.
In MPEG-4 each video frame is segmented into a number of video object planes (VOPs), each of which corresponds to an AVO (audio-visual object) of interest.
Each audio and video object has a separate object descriptor associated with it which, provided the creator of the audio and/or video has provided the facility, allows the object to be manipulated by the viewer prior to it being decoded and played out.


MPEG-1 video bit stream structure: composition


The compressed bit stream produced by the video encoder is hierarchical: at the top level is the complete compressed video (sequence), which consists of a string of groups of pictures.

MPEG-1 video bit stream structure: format


In order for the decoder to decompress the received bit stream, each data structure must be clearly identified within the bit stream.


MPEG-4 coding principles


Content-based video coding principles, showing how a frame/scene is defined in the form of multiple video object planes.


MPEG 4 encoder/decoder schematic


Before being compressed, each scene is defined in the form of a background and one or more foreground audio-visual objects (AVOs).

MPEG VOP encoder


The audio associated with an AVO is compressed using one of the algorithms described before; the choice depends on the available bit rate of the transmission channel and the sound quality required.
