
CSE412

SELECTED TOPICS IN
COMPUTER ENGINEERING

DIGITAL VIDEO STANDARDS


MPEG: Moving Picture Experts Group
• MPEG-1 (1992)
• Compression for Storage
• 1.5 Mbps
• Frame-based Compression
• MPEG-2 (1994)
• Digital TV
• 6.0 Mbps
• Frame-based Compression
• MPEG-4 (1998)
• Multimedia Applications, digital TV, synthetic graphics
• Lower bit rate
• Object based compression
• MPEG-7
• Multimedia Content Description Interface
• MPEG-21
• Digital identification, Intellectual Property (IP) rights management
Basics of MPEG
Types of pictures
– I (intra) frame
• compressed using only intraframe coding
• Moderate compression but faster random access
– P (predicted) frame
• Coded with motion compensation using past I frames or P
frames
• Can be used as reference pictures for additional motion
compensation
– B (bidirectional) frame
• Coded by motion compensation by either past or future I or P
frames
– D (DC) frame
• Limited use: encodes only DC components of intraframe coding
MPEG Frame Types
• Intra (I) pictures: coded by themselves, as still
images. No temporal coding. No motion
vectors.
MPEG Frame Types
• Forward Motion Compensated predicted (P)
pictures – forward motion compensated from
the previous I or P frame
MPEG Frame Types
• Motion Compensated interpolated (B) pictures –
forward, backward, or interpolatively (average of
forward and backward) motion compensated from
previous and next I/P frames
MPEG Frame Structure Terminology
• A slice is a collection of macroblocks, tracing in
a raster scan from upper left to lower right
– The resynchronization unit
• A Group of Pictures (GOP) contains ≥ 1 I frame.
– The unit for random access into the
sequence
MPEG GOP Structure
• A Group of Pictures (GOP) may contain
– All I pictures
– I & P pictures only
– I, P, & B Pictures

I B B P B B P B B P B B P B B I
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Frame Ordering

– Display order (encoder input order):

B B I B B P B B P B B P B B P B B I
-1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

– But consider coding dependencies:


• Frame 2 (B) needs frame 4 (P) to be decoded
first, etc.
• So better transmit frame 4 before frame 2
I B B P B B P B B P B B P B B I B B
1 -1 0 4 2 3 7 5 6 10 8 9 13 11 12 16 14 15
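The reordering above can be sketched in a few lines of Python (an illustrative helper, not part of any standard): each B frame is held back until the anchor (I or P) frame it depends on has been transmitted.

```python
def coding_order(display):
    """Reorder (index, type) frames from display order to coding
    (transmission) order: each I/P anchor is sent before the B frames
    that precede it in display order, since they are predicted from it."""
    out, pending_b = [], []
    for idx, ftype in display:
        if ftype == "B":
            pending_b.append((idx, ftype))   # hold until the next anchor
        else:
            out.append((idx, ftype))         # emit the anchor first
            out.extend(pending_b)            # then the waiting B frames
            pending_b.clear()
    return out + pending_b

# Display order B(-1) B(0) I(1) B(2) B(3) P(4) becomes
# coding order I(1) B(-1) B(0) P(4) B(2) B(3), as on the slide.
```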
Coding Mode I (Inter-Coding)
Inter coding refers to coding with motion vectors

[Diagram: a macroblock in the current frame is matched to the
best-fitting block in the previous frame; the displacement between
them is the motion vector.]
Coding Mode II (Intra-Coding)

INTRA coding refers to coding without motion vectors


The Macro Block (MB) is coded all by itself, in a manner
similar to JPEG

[Diagram: the macroblock in the current frame is coded on its own,
without reference to the previous frame.]

I-Picture Coding

Use of macroblocks modifies block-scan order:


[Diagram: a 16×16 macroblock is split into four 8×8 blocks, which are
scanned in macroblock order rather than plain raster order.]
P-Picture Coding: many coding modes
– Motion compensated coding: Motion Vector (MV) only
– Motion compensated coding: MV plus difference macroblock
– Motion compensated coding: MV plus difference MB with modified
quantization scaling
B-Picture Coding
B pictures have even more possible modes:
– Forward prediction MV, no difference block
– Forward prediction MV, plus difference block
– Backward prediction MV, no difference block
– Backward prediction MV, plus difference block
– Interpolative prediction MV, no difference block
– Interpolative prediction MV, plus difference block
– Some of above with modified Quantization parameters
Group of Pictures
IIIII…: Every picture is intra-coded.
– Fully decodable without reference to any other picture
– Editing is straightforward
– Requires about 2.5× the bit rate of bidirectional coding

IBBPBBPB…: Forward and bidirectional


– Best compression factor
– Needs large decoder memory
– Hard to edit
– Most useful for final delivery of post-produced material
(e.g., broadcast) because no editing requirement
Group of Pictures
IPPPPIPP…: Forward predicted only.
– Needs less decoder memory
IBIBIB…: bidirectional compromise
– Some of the bit rate advantage of bidirectional coding
– Editable with moderate processing.
For example, if the video after a B picture is deleted, that B frame
is no longer decodable, since its backward reference is gone.
The solution is to decode the B frame first and re-encode it using
forward prediction only, at some quality loss.
MPEG: Video Encoding
[Block diagram: input → preprocessing → frame memory → subtract
motion-compensated prediction → DCT → quantizer (Q) → VLC encoder →
buffer → output. A feedback path (Q⁻¹ → IDCT → add prediction → frame
memory) reconstructs the reference frame; motion estimation supplies
motion vectors to the motion compensation stage, and a regulator
adjusts the quantizer from buffer fullness.]
MPEG: Video Encoding
– Interframe predictive coding (P-pictures)
• For each macroblock the motion estimator produces the best
matching macroblock
• The two macroblocks are subtracted and the difference is DCT
coded
– Interframe interpolative coding (B-pictures)
• The motion vector estimation is performed twice
• The encoder forms a prediction error macroblock from either of
them or from their average
• The prediction error is encoded using a block-based DCT
– The encoder needs to reorder pictures because B-frames
depend on anchor frames that come later in display order
MPEG-1 Video Layer
• A coded representation that can be used for compressing video
sequences (both 625-line and 525-line) to bitrates around 1.5
Mbit/s.

• Developed to operate from storage media offering a continuous
transfer rate of about 1.5 Mbit/s.

• Different techniques for video compression:
• Select an appropriate spatial resolution for the signal. Use block-based
motion compensation to reduce the temporal redundancy. Motion
compensation is used for causal prediction of the current picture from a
previous picture, for non-causal prediction of the current picture from a
future picture, or for interpolative prediction from past and future pictures.
• The difference signal, the prediction error, is further compressed using the
discrete cosine transform (DCT) to remove spatial correlation and is then
quantized.
• Finally, the motion vectors are combined with the DCT information, and
coded using variable length codes.
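The DCT-plus-quantization step above can be illustrated with a naive (unoptimized) 8×8 2-D DCT. The single quantizer step size is a simplification: MPEG-1 actually uses per-coefficient quantizer matrices.

```python
import math

def dct2(block):
    """Naive 8x8 2-D DCT-II over a pixel (or prediction-error) block."""
    n = 8
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            cu = math.sqrt(0.5) if u == 0 else 1.0
            cv = math.sqrt(0.5) if v == 0 else 1.0
            s = sum(block[x][y]
                    * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                    * math.cos((2 * y + 1) * v * math.pi / (2 * n))
                    for x in range(n) for y in range(n))
            out[u][v] = 0.25 * cu * cv * s
    return out

def quantize(coeffs, step):
    """Uniform quantization; a per-coefficient quantizer matrix is what
    the standard really uses, so one step size is illustrative only."""
    return [[round(c / step) for c in row] for row in coeffs]

flat = [[128] * 8 for _ in range(8)]   # a flat 8x8 block
coeffs = quantize(dct2(flat), 16)
# a flat block carries only a DC coefficient; every AC term quantizes to 0
```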
MPEG-1 Systems Layer
• Combines one or more data streams from the video and audio parts with
timing information to form a single stream suited to digital storage or
transmission.
MPEG-1
• I,B,P Frames
• Picture size, bitrate is variable
• No closed-captions

• Group of Pictures
• one I frame in every group
• 10-15 frames per group
• P frames depend on the preceding I or P frame; B frames
depend on the surrounding I/P frames
• The placement of B and P frames within a GoP is flexible
MPEG Video Filtering

I B B P B B P B B P B B P B B I

I B P B P B P B P B I

I P P P P I

I P P P I

I P P I

I I
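The thinned sequences above can be derived by filtering on frame type. Dropping B frames is always safe since no frame is predicted from a B frame; thinning P frames additionally requires that the remaining frames' prediction chains stay intact. A minimal sketch:

```python
def drop_frames(gop, keep):
    """Temporal filtering: keep only frames whose type is in `keep`.
    Dropping B frames is safe because no other frame is predicted
    from a B frame; the result plays at a lower frame rate."""
    return [f for f in gop if f in keep]

gop = list("IBBPBBPBBPBBPBBI")
# keeping I and P gives the "I P P P P I" sequence above;
# keeping only I gives "I I"
```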
MPEG-2
– Digital Television (4 - 9 Mb/s)
– Satellite dishes, digital cable video
– Larger data size
– includes closed-captions
– More complex encoding (longer encoding time)
– Support higher bit rates for HDTV instead of the 1.5Mbps
– Support a larger number of applications
– Different color subsampling modes e.g., 4:2:2, 4:2:0, 4:4:4
MPEG-2: Profiles and Levels
Profiles (chroma format): SNR (4:2:0), Spatial (4:2:0),
High (4:2:0 or 4:2:2), Multiview (4:2:0)

Levels (enhancement-layer resolutions and maximum bitrates, Mbit/s):
– High: 1920×1152/60 (High, Multiview profiles); bitrates up to
roughly 100–130
– High-1440: 1440×1152/60 (Spatial, High), 1920×1152/60 (Multiview);
up to roughly 60–100
– Main: 720×576/30 (SNR, High, Multiview); up to roughly 15–25
– Low: 352×288/30 (SNR, Multiview); up to roughly 4–8
MPEG-2 Applications
Digital Betacam: 90 Mbits/s video
MPEG-2
– Main Profile, Main Level, 4:2:0: 15 Mbits/s
– High Profile, High Level, 4:2:0: adequate, expensive
MPEG-4
• Similar to MPEG-2 but it includes the following features
• Interactive Graphics Applications
• Interactive multimedia (WWW), networked distribution
MPEG-4
• Bitrates from 5kb/s to 10Mb/s
• Several extension “profiles”
• Very high quality video
• Better compression than MPEG-1
• Low delay audio and error resilience
• Support for “objects”, e.g., Face Animation
• Support for efficient streaming
MPEG-4
Objective
– Standardize algorithms for audiovisual coding in
multimedia applications allowing for
• Interactivity
• High compression
• Scalability of audio and video content
• Support for natural and synthetic audio and video
The Idea
– An audiovisual scene is a coded representation of
audiovisual objects related in space and time
MPEG-4: Scenario

A/V object
– A video object within a scene
– The background
– An instrument or voice
– Coded independently
A/V scene
– Mixture of natural or synthetic objects
– Individual bitstreams multiplexed and transmitted
– One or more channels
– Each channel may have its own quality of service
MPEG-4: Video Object Plane (VOP)
• Video frame = sum of segmented regions with
arbitrary shape (VOP)
• Shape, motion, and texture information of VOPs
belonging to the same video object is encoded
into a video object layer (VOL)
• Encode
– VOL identifiers
– Composition information
• Overlapping configuration of VOPs
MPEG-4: Coding

Shape coding
– Shape information in alpha planes
– Transparency of shape encoded
– Inter and intra shape coding functions
– After shape coding each VOP in a VO is
partitioned into non-overlapping macroblocks
Motion coding
– Shift parameter with respect to reference window
– Standard macroblock
– Contour macroblock
MPEG-4: Coding
Texture coding
– For intra-VOPs and for residual errors from motion compensation,
texture is DCT coded as in MPEG-1
– P-VOPs (prediction error blocks) may not conform to VOP
boundary
• Pixels outside the active area are set to a constant value
• Standard compression
• Efficient prediction of DC and AC components from intra and
inter coded blocks
– Multiplexing
• Shape → motion → texture coded data
• Motion and DCT coefficients can be jointly or individually
coded
Composition of Audiovisual Objects
(AVOs)
• MPEG-4 provides a standardized way to describe a scene, allowing
the user to:
– place AVOs anywhere in a given coordinate system;
– apply transforms to change the geometrical or acoustical appearance
of an AVO;
– group primitive AVOs in order to form compound media objects;
– apply streamed data to AVOs, in order to modify their attributes;
– change interactively the user’s viewing and listening points anywhere
in the scene.

• With reference to the figure, for example, one can replace the
person with a different person; change her dress or hairstyle;
group the desk and the globe into a compound AVO, since they
are static; or change the background.
[Figure: an MPEG-4 audiovisual scene]
Video Objects
• MPEG-4 treats a video sequence as a collection of
video objects.
• A video object (VO) is an area of video scene that may
occupy an arbitrary-shaped region and may exist for an
arbitrary length of time.
• An instance of a VO at a particular point in time is a
video object plane (VOP).
• In the traditional video coding sense, a rectangular
video frame is a VOP and a video sequence is a VO.
MPEG-4 Encoder

[Block diagram: MPEG-4 encoder. The input minus a prediction is DCT
transformed, quantized (Q), and texture coded, then multiplexed with
the motion and shape data into the video bitstream. A feedback path
(Q⁻¹ → IDCT → add prediction → frame store) reconstructs the
reference; motion estimation drives a switch selecting among candidate
predictions (pred. 1/2/3), and a separate shape coder feeds the
multiplexer.]
VOP Prediction

[Diagram: VOP prediction. A P-VOP is forward predicted from the
preceding I- or P-VOP; a B-VOP is forward, backward, or
bidirectionally predicted from the surrounding I-/P-VOPs.]
MPEG-4 Profiles

• Simple Profile (coding of rectangular video frames):
I-VOP, P-VOP, MV, Intra prediction, video packets, data partitioning
• Core Profile (coding of arbitrary-shaped video objects):
Simple Profile tools plus B-VOP
• Scalable Profile (scalable coding of rectangular video frames or
video objects): Simple Profile tools plus temporal scalability,
spatial scalability, and fine granular scalability
Simple Profile (SP): Basic Coding Tools
I-VOP
• An I-VOP is a rectangular video frame encoded in Intra mode.

Encoder: Source frame → DCT → Q → Reorder → RLC → VLC
Decoder: VLD → RLD → Reorder → Q⁻¹ → IDCT → Decoded frame

DCT – Discrete cosine transform; IDCT – Inverse discrete cosine transform
Q – Quantization; Q⁻¹ – Inverse quantization
RLC – Run-length coding; RLD – Run-length decoding
VLC – Variable length coding; VLD – Variable length decoding
SP: Basic Coding Tools
• A coded I-VOP consists of a VOP header, optional video packet
headers and coded macroblocks.
• Each macroblock (MB) is coded with a header (defining the
macroblock type, identifying which blocks in the MB contain coded
coefficients, signalling changes in quantization parameter, etc.)
followed by coded coefficients for each 8×8 block.
• In the decoder, the sequence of VLCs is decoded to extract the
quantized transform coefficients, which are re-scaled and inverse
transformed to reconstruct the decoded I-VOP.
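The reorder and run-length steps can be sketched as follows. A 4×4 block keeps the example compact (MPEG-4 uses 8×8 blocks), and the "EOB" string stands in for the real end-of-block code; actual VLC tables are omitted.

```python
def zigzag_order(n=8):
    """(row, col) positions of an n-by-n block in zig-zag scan order."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else p[1]))

def run_length_encode(block):
    """Scan a quantized block in zig-zag order and emit (run, level)
    pairs: a run of zeros followed by a nonzero level, terminated by
    an end-of-block marker."""
    scan = [block[r][c] for r, c in zigzag_order(len(block))]
    pairs, run = [], 0
    for v in scan:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    pairs.append("EOB")
    return pairs
```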
SP: Basic Coding Tools
P-VOP
• A P-VOP is coded with Inter prediction from a previously encoded
I- or P-VOP (a reference VOP).

Encoder: Source frame → (subtract MCP) → DCT → Q → Reorder → RLC →
VLC, with ME supplying the motion vectors
Decoder: VLD → RLD → Reorder → Q⁻¹ → IDCT → (add MCR) → Decoded frame

ME – Motion estimation; MCP – Motion compensated prediction
MCR – Motion compensated reconstruction
SP: Basic Coding Tools
Motion Estimation and Compensation
• The basic motion compensation (MC) scheme is the block-based
compensation of 16×16 pixel blocks.
• The motion vector (MV) may have half-pixel resolution, where the
half-pixel positions are calculated by interpolating between pixels
at integer-pixel positions. The motion estimation (ME) method is not
defined by the standard and is left to the implementer.
• The residual MB is formed by subtracting the motion-compensated
MB (prediction) in the reference frame from the current MB.

• The residual MB is transformed with the DCT, quantized, zig-zag
scanned, and run-length coded.
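Since the ME method is left to the implementer, one simple choice is exhaustive block matching on the sum of absolute differences (SAD). The sketch below shrinks the block size from the standard 16×16 to 4×4 to keep the example small.

```python
def sad(cur, ref, bx, by, dx, dy, n=4):
    """Sum of absolute differences between the n-by-n block of the
    current frame at column bx / row by and the candidate block
    displaced by (dx, dy) in the reference frame."""
    return sum(abs(cur[by + r][bx + c] - ref[by + dy + r][bx + dx + c])
               for r in range(n) for c in range(n))

def full_search(cur, ref, bx, by, search=2, n=4):
    """Exhaustive block matching: test every integer displacement
    within +/-search pixels and return the MV minimizing SAD."""
    h, w = len(ref), len(ref[0])
    best, best_cost = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            if 0 <= by + dy <= h - n and 0 <= bx + dx <= w - n:
                cost = sad(cur, ref, bx, by, dx, dy, n)
                if cost < best_cost:
                    best_cost, best = cost, (dx, dy)
    return best, best_cost
```

Real encoders rarely use full search at 16×16; hierarchical or diamond searches trade a little accuracy for far less computation.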
SP: Coding Efficiency Tools
Four Motion Vectors per Macroblock
• The default block size for ME is 16×16 for luma pixels and 8×8 for
chroma pixels. This tool allows the encoder to choose a smaller ME
block size of 8×8 for luma and 4×4 for chroma pixels, giving 4 MVs
per MB.
• The mode can minimize the energy of the MC residual, particularly
in areas of complex motion or near the boundaries of moving
objects.
• There is an increase in overhead in sending the 4 MVs, and so the
encoder may choose to send one or four MVs on a MB-by-MB basis.
MPEG-4: Core Profile (CP)
• Simple Profile coding tools
• B-VOP (bidirectionally predicted Inter-coded VOP)
• Object-based coding (with Binary Shape)
Core Profile Coding Tools
B-VOP
• The block or macroblock (MB) may be predicted using (a) forward
prediction from the previous I- or P-VOP, (b) backward prediction
from the next I- or P-VOP, or (c) an average of forward and
backward predictions.
• This mode generally gives better coding efficiency than basic
forward prediction; however, the encoder must store multiple
frames prior to coding each B-VOP which increases the memory
requirements and the encoding delay.
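Mode (c), the average of the forward and backward predictions, is a per-pixel integer average. The round-half-up rounding shown here is a common choice but is illustrative; actual rounding rules are codec-specific.

```python
def bidirectional_prediction(fwd, bwd):
    """Per-pixel average of the forward and backward motion-compensated
    blocks. The +1 rounds ties upward (illustrative, not normative)."""
    return [[(f + b + 1) // 2 for f, b in zip(frow, brow)]
            for frow, brow in zip(fwd, bwd)]
```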
Core Profile Coding Tools
Example of direct mode prediction

[Diagram: direct mode prediction. B-VOPs B5 and B6 lie between anchors
I4 and P7; their macroblocks may use forward (MVF), backward (MVB), or
bidirectional prediction, and direct mode derives both vectors by
scaling the MV of the co-located macroblock in P7.]
Object-based Coding
• The most important functionality in the Core Profile (CP) is its
support for coding of arbitrary-shaped objects.
• Each MB position in the picture is classified as:
(1) opaque (fully ‘inside’ the VOP),
(2) transparent (not part of the VOP), or
(3) on the boundary of the VOP.
• In order to indicate the shape of the VOP to the decoder, alpha
mask information is sent for every MB.
• In the Core Profile, only binary alpha information is allowed, and
each pixel position in the VOP is defined either as opaque or
transparent.
• CP supports coding of binary alpha information, and provides
tools to encode texture within boundary MBs.
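The three-way macroblock classification above follows directly from the binary alpha mask; a minimal sketch:

```python
OPAQUE, TRANSPARENT, BOUNDARY = 1, 2, 3

def classify_mb(alpha):
    """Classify a macroblock from its binary alpha block:
    all ones  -> opaque (fully inside the VOP),
    all zeros -> transparent (outside the VOP),
    mixed     -> boundary."""
    flat = [v for row in alpha for v in row]
    if all(flat):
        return OPAQUE
    if not any(flat):
        return TRANSPARENT
    return BOUNDARY
```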
Object-based Coding

Binary Shape Coding

• The binary alpha mask indicates which pixels are part of the VOP
and which pixels are outside the VOP.
• The binary alpha mask for each macroblock is called a Binary Alpha
Block (BAB).

[Figure: binary alpha mask]
MPEG-4: Scalable Profile
• Scalable coding of video data enables a decoder to decode
selectively only part of the coded bitstream.
• The coded bitstream is arranged in a number of layers, including a
base layer and one or more enhancement layers.
• The base layer decodes a video with basic quality, while the
enhancement layer(s) together with the base layer delivers a high
quality video.
• MPEG-4 Scalable Profile supports:
1. Spatial scalability
2. Temporal scalability
3. Fine grain scalability
Scalable Video Coding

[Diagram: scalable video coding. The encoder produces a base layer
and enhancement layers 1…N. Decoder A decodes the base layer alone
into a basic-quality sequence; Decoder B decodes the base layer plus
enhancement layers into a high-quality sequence.]
Spatial Scalability
• The base layer contains a reduced-resolution version of each
frame. Decoding the base layer alone produces a low-
resolution output sequence, and decoding the base layer with
enhancement layer(s) produces a higher-resolution output.
• The procedure to encode a video sequence into two spatial
layers:
1. Subsample each input video frame (or video object)
horizontally and vertically.
2. Encode the reduced-resolution frame to form the base layer.
3. Decode the base layer and upsample to the original
resolution to form the prediction frame.
4. Subtract the full-resolution frame from this prediction
frame to form the residual.
5. Encode the residual to form the enhancement layer.
Spatial Scalability
• A single-layer decoder decodes only the base layer to produce a
reduced-resolution output sequence.
• To reconstruct the full-resolution sequence:
1. Decode the base layer and upsample to the original resolution.
2. Decode the enhancement layer to obtain the residual.
3. Add the decoded residual to the decoded base layer to form the
output frame.
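The two-layer encode/decode procedure above can be sketched end to end. Plain decimation and pixel repetition stand in for the standard's filters, and the encode/decode of each layer is elided, so the round trip here is lossless; a real codec quantizes both layers.

```python
def downsample(frame):
    """Step 1: subsample 2:1 horizontally and vertically (real encoders
    low-pass filter first; plain decimation keeps the sketch short)."""
    return [row[::2] for row in frame[::2]]

def upsample(frame):
    """Step 3: upsample back to full resolution (pixel repetition is an
    illustrative stand-in for the standard's interpolation filter)."""
    wide = [[v for v in row for _ in (0, 1)] for row in frame]
    return [row for row in wide for _ in (0, 1)]

def spatial_layers(frame):
    """Steps 1-5: split a frame into a base layer and a residual
    enhancement layer (per-layer encode/decode elided)."""
    base = downsample(frame)
    prediction = upsample(base)
    residual = [[o - p for o, p in zip(orow, prow)]
                for orow, prow in zip(frame, prediction)]
    return base, residual

def reconstruct(base, residual):
    """Decoder side: upsampled base plus residual restores the
    full-resolution frame."""
    prediction = upsample(base)
    return [[p + r for p, r in zip(prow, rrow)]
            for prow, rrow in zip(prediction, residual)]
```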
Temporal Scalability
• The basic idea of temporal scalability is to split the sequence into
two layers. The base layer is encoded at a lower frame rate; an
enhancement layer of I-, P- and/or B-VOPs can be decoded together
with the base layer to provide an increased video frame rate.
• The enhancement VOPs are predicted using motion-compensated
prediction according to the following rules as illustrated in
the following figures.
• An enhancement I-VOP is encoded without any prediction.
• An enhancement P-VOP is predicted from:
(i) the previous enhancement P-VOP/I-VOP; or
(ii) the previous base layer P-VOP/I-VOP; or
(iii) the next base layer P-VOP/I-VOP.
Temporal Scalability
• An enhancement B-VOP is predicted from:
(i) the previous enhancement and previous base layer P-VOP/I-VOP; or
(ii) the previous enhancement and next base layer P-VOP/I-VOP; or
(iii) the previous and next base layer P-VOP/I-VOP.
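The layer split itself is just a partition of the frame sequence by index (the prediction rules above then apply per enhancement VOP); a minimal sketch:

```python
def temporal_split(frames, factor=2):
    """Split a sequence into a low-frame-rate base layer (every
    factor-th frame) and an enhancement layer holding the rest."""
    base = frames[::factor]
    enhancement = [f for i, f in enumerate(frames) if i % factor]
    return base, enhancement

def temporal_merge(base, enhancement, factor=2):
    """Decoder side: interleave both layers back into display order,
    restoring the full frame rate."""
    out, b, e = [], iter(base), iter(enhancement)
    for i in range(len(base) + len(enhancement)):
        out.append(next(b) if i % factor == 0 else next(e))
    return out
```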
Temporal Scalability

[Diagram: temporal enhancement P-VOP prediction options. Enhancement
VOPs 0 and 2 are predicted (i) from the previous enhancement VOP,
(ii) from the previous base-layer VOP, or (iii) from the next
base-layer VOP (base-layer VOPs 1 and 3).]
Temporal Scalability

[Diagram: temporal enhancement B-VOP prediction options (i)-(iii),
with enhancement VOPs 0 and 2 and base-layer VOPs 1 and 3.]
Fine Granular Scalability
• Fine Granular Scalability (FGS) is a method of encoding a
sequence as a base layer and enhancement layer such that the
enhancement layer can be truncated during or after encoding to
give a highly flexible control over the transmitted bitrate.
• FGS is very useful in video streaming applications where the
channel bandwidth may change. When that happens, the
streaming server transmits the base layer and a truncated version
of the enhancement layer to match the available bandwidth,
hence maximizing the decoded video quality without the need to
re-encode the video sequence.

MPEG-7
• Data + Multimedia Content Description Scheme
• Description Definition Language (e.g., XML-based)
• Does not deal with data, but meta-data transmission
• Description Scheme + Content Description, e.g.:
• Table of content
• Still Images
• Summaries
• links
• etc.
• Focuses mainly on how descriptions of data are generated and how
they are used