SELECTED TOPICS IN
COMPUTER ENGINEERING
VIDEO ENCODING
ITU-R BT.601 Digital Video
Spatial Resolution
• A digital video can be obtained either by sampling a raster
scan, or directly from a digital video camera.
• The International Telecommunication Union – Radiocommunication
Sector (ITU-R) developed the BT.601 standard for digital
video formats that represent different analog TV video
signals, for both 4:3 and 16:9 aspect ratios.
ITU-R BT.601 Digital Video
Color Coordinate
• The BT.601 standard also defines a digital color coordinate, known
as YCbCr.
• Assuming the range of RGB values is 0-255 the transformation
matrix is:
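$$
\begin{bmatrix} Y \\ C_b \\ C_r \end{bmatrix}
=
\begin{bmatrix}
0.257 & 0.504 & 0.098 \\
-0.148 & -0.291 & 0.439 \\
0.439 & -0.368 & -0.071
\end{bmatrix}
\begin{bmatrix} R \\ G \\ B \end{bmatrix}
+
\begin{bmatrix} 16 \\ 128 \\ 128 \end{bmatrix}
$$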
• The inverse transformation is:
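The inverse follows from subtracting the offsets and multiplying by the inverse of the forward matrix. The Python sketch below computes that inverse numerically instead of quoting coefficients. It assumes the green coefficient of Cb is -0.291 (so that the Cb row sums to zero); the function names are illustrative only.

```python
# Forward BT.601 matrix and offsets for 8-bit RGB values in 0..255.
# Assumption: the Cb green coefficient is -0.291, so the Cb row sums to zero.
M = [[0.257, 0.504, 0.098],
     [-0.148, -0.291, 0.439],
     [0.439, -0.368, -0.071]]
OFFSET = [16.0, 128.0, 128.0]


def invert3(m):
    """Invert a 3x3 matrix with the adjugate (cofactor) formula."""
    a, b, c = m[0]
    d, e, f = m[1]
    g, h, i = m[2]
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[x / det for x in row] for row in adj]


M_INV = invert3(M)


def rgb_to_ycbcr(rgb):
    """Forward transform: matrix product plus offset."""
    return [sum(M[i][j] * rgb[j] for j in range(3)) + OFFSET[i] for i in range(3)]


def ycbcr_to_rgb(ycc):
    """Inverse transform: remove the offset, then apply the inverse matrix."""
    centered = [ycc[i] - OFFSET[i] for i in range(3)]
    return [sum(M_INV[i][j] * centered[j] for j in range(3)) for i in range(3)]
```

Converting a pixel forward and back with these two functions should return the original RGB values up to floating-point rounding. A real codec would also round and clip the results to the 8-bit range.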
ITU-R BT.601 Digital Video
Chrominance Subsampling
• Chrominance is usually sampled at a lower rate than
luminance, which gives rise to different formats.
• In the 4:2:2 format, chrominance is sampled at half the
luminance rate along each line, giving half as many Cb &
Cr pixels per line.
• To reduce the sampling rate further, the 4:1:1 format subsamples
chrominance by a factor of 4 along each line. However,
this format yields asymmetric resolutions in the horizontal and
vertical directions.
• To avoid this asymmetry, the 4:2:0 format subsamples the
chrominance components by a factor of 2 in both the horizontal
and vertical directions.
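The subsampling factors can be sketched in Python as plain decimation of a chroma plane. This is only an illustration: practical systems usually low-pass filter before subsampling, and the names `FORMATS`, `subsample` and `chroma_size` are hypothetical.

```python
# Horizontal and vertical chroma subsampling factors (fx, fy) per format.
FORMATS = {
    "4:4:4": (1, 1),  # no subsampling
    "4:2:2": (2, 1),  # 2:1 horizontally only
    "4:1:1": (4, 1),  # 4:1 horizontally only
    "4:2:0": (2, 2),  # 2:1 horizontally and vertically
}


def subsample(plane, fmt):
    """Keep one chroma sample per fx-by-fy group (decimation, no filtering)."""
    fx, fy = FORMATS[fmt]
    return [row[::fx] for row in plane[::fy]]


def chroma_size(width, height, fmt):
    """Size of each chroma plane for a luminance plane of width x height."""
    fx, fy = FORMATS[fmt]
    return width // fx, height // fy
```

For a 720×480 luminance plane, `chroma_size` gives 360×480 for 4:2:2, 180×480 for 4:1:1 and 360×240 for 4:2:0, which makes the horizontal/vertical asymmetry of 4:1:1 visible.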
ITU-R BT.601 Digital Video
BT.601 chrominance subsampling formats (Y pixels and Cb & Cr pixels):
• 4:4:4: for every 2×2 Y pixels, 4 Cb & 4 Cr pixels (no subsampling)
• 4:2:2: for every 2×2 Y pixels, 2 Cb & 2 Cr pixels (subsampling by 2:1
horizontally only)
• 4:1:1: for every 4×1 Y pixels, 1 Cb & 1 Cr pixel (subsampling by 4:1
horizontally only)
• 4:2:0: for every 2×2 Y pixels, 1 Cb & 1 Cr pixel (subsampling by 2:1 both
horizontally and vertically)
Chroma sub-sampling
Frequency Domain Characterization of
Video Signals
• A video is a 3-D signal, having two spatial dimensions (horizontal and
vertical) and one temporal dimension (time).
• The 2D spatial frequency is a measure of how fast the image intensity
or color changes in the 2D image plane.
Compressing Digital Video
• Exploit spatial redundancy within frames (like JPEG:
transforming, quantizing, variable length coding)
• Exploit temporal redundancy between frames
– Only the sun has changed position between these 2 frames
[Figure: Frame n and Frame n+1]
• Translation: simple movement of a typically rigid object
• Zooms in/out
– Camera zoom vs. object zoom (movement in/out)
Describing Motion
• Translational
– Move (object) from (x,y) to (x+dx,y+dy)
• Rotational
– Rotate (object) by (r rads) (counter-clockwise
/ clockwise)
• Zoom
– Move (in/out) from (object) to
increase/decrease its size by (t times)
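As a small illustration, the three motion descriptions can be applied to a single point (x, y). The sketch assumes a y-up coordinate system for the rotation direction and a caller-chosen center for rotation and zoom; the function names are hypothetical.

```python
import math


def translate(p, dx, dy):
    """Move a point from (x, y) to (x + dx, y + dy)."""
    x, y = p
    return (x + dx, y + dy)


def rotate(p, r, center=(0.0, 0.0)):
    """Rotate a point by r radians about center.

    Positive r is counter-clockwise in a y-up coordinate system;
    negative r is clockwise.
    """
    x, y = p[0] - center[0], p[1] - center[1]
    c, s = math.cos(r), math.sin(r)
    return (center[0] + c * x - s * y, center[1] + s * x + c * y)


def zoom(p, t, center=(0.0, 0.0)):
    """Scale the distance of a point from center by t (t > 1 enlarges)."""
    return (center[0] + t * (p[0] - center[0]),
            center[1] + t * (p[1] - center[1]))
```

Each description needs only a few parameters: (dx, dy) for translation, r for rotation, and t for zoom. Motion estimation is the task of determining those parameters.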
Motion Estimation
• Determining parameters for the motion
descriptions
• For some portion of the frame, estimate its
movement between 2 frames - the current
frame and the reference frame
General Idea
• For a region PC in the current frame, find a
region PR in the search window in the reference
frame so that Error(PR,PC) is minimized
[Figure: the portion of interest PC in the current frame and the search window in the reference frame]
Block-based Motion Estimation
• PC is a block of pixels (in the current frame)
• The search window is a rectangular segment (in
the reference frame)
[Figure: a block in the current frame, its search window in the reference frame, and the resulting motion vector]
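Motion Estimation
[Figure: the search window W(j,k) in frame (n−1) and the N×N block X(j,k) in frame n, with dimensions M and N marked]
$$
\mathrm{MAD} = \frac{1}{N^2} \sum_{j=1}^{N} \sum_{k=1}^{N} \left| W(j,k) - X(j,k) \right|
$$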
Motion Estimation
• Calculate the MAD for every position of the block X(j,k) within the
search window W(j,k), where R is the maximum search range.
• The position with the minimum MAD is the best match.
• The motion vector is denoted MV(vx, vy).
• If an operation consists of 1 subtraction, 1 comparison and 1 addition,
then the number of operations for a search block is $(2R+1)^2 N^2$.
• For an L×L image, the number of operations is $(2R+1)^2 L^2$.
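A minimal full-search sketch in Python, under these assumptions: frames are lists of rows of integers, the block has its top-left corner at (bx, by), and candidates whose block would fall outside the reference frame are skipped. `make_frames` is a hypothetical helper that builds a test pair with a known displacement.

```python
import random


def mad(ref, cur, bx, by, vx, vy, n):
    """Mean absolute difference between the n-by-n block of cur at (bx, by)
    and the block of ref displaced by (vx, vy)."""
    total = 0
    for j in range(n):
        for k in range(n):
            total += abs(ref[by + vy + j][bx + vx + k] - cur[by + j][bx + k])
    return total / (n * n)


def full_search(ref, cur, bx, by, n, r):
    """Test all (2r+1)^2 displacements; return (vx, vy, minimum MAD)."""
    h, w = len(ref), len(ref[0])
    best = (0, 0, float("inf"))
    for vy in range(-r, r + 1):
        for vx in range(-r, r + 1):
            # Skip candidates that leave the reference frame.
            if not (0 <= bx + vx <= w - n and 0 <= by + vy <= h - n):
                continue
            err = mad(ref, cur, bx, by, vx, vy, n)
            if err < best[2]:
                best = (vx, vy, err)
    return best


def make_frames(vx, vy, size=32, seed=0):
    """Hypothetical test helper: random reference frame, and a current frame
    with cur[y][x] = ref[y + vy][x + vx] (wrapping at the borders)."""
    rng = random.Random(seed)
    ref = [[rng.randint(0, 255) for _ in range(size)] for _ in range(size)]
    cur = [[ref[(y + vy) % size][(x + vx) % size] for x in range(size)]
           for y in range(size)]
    return ref, cur
```

With R = 7 and N = 8, the search visits (2·7+1)² = 225 candidate positions, each costing N² = 64 operations, which matches the operation count above.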
Motion Estimation - Fast Algorithms
3-step Search BMA
1. In the first step, 8 coarsely spaced pixels around the center pixel
z0 are tested, as shown in the figure.
2. In the second step, 8 pixels are again tested, now around the pixel of
minimum MAD found in step 1, with the stepsize halved.
3. The final step is basically a repetition of step 2, but with a
stepsize of 1.
4. The initial stepsize R0 is usually equal to or slightly larger than half
of the maximum search range R.
5. The number of steps required is $S = \log_2 R$. The number of search
points is $8S + 1$.
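The steps above can be sketched in Python as follows. The initial stepsize is taken as (R + 1) // 2, which is one way to read "equal to or slightly larger than half of R" and is an assumption of this sketch. `make_frames` is a hypothetical test helper, and the function also returns the number of points it evaluated.

```python
import random


def mad(ref, cur, bx, by, vx, vy, n):
    """Mean absolute difference for the block at (bx, by) displaced by (vx, vy)."""
    total = 0
    for j in range(n):
        for k in range(n):
            total += abs(ref[by + vy + j][bx + vx + k] - cur[by + j][bx + k])
    return total / (n * n)


def three_step_search(ref, cur, bx, by, n, r):
    """Return (vx, vy, MAD, points evaluated) for a stepsize-halving search."""
    h, w = len(ref), len(ref[0])
    cx = cy = 0
    best = mad(ref, cur, bx, by, 0, 0, n)
    points = 1
    step = (r + 1) // 2  # assumed initial stepsize R0
    while step >= 1:
        nx, ny = cx, cy
        # Test the 8 neighbours at the current stepsize around the center.
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                if dx == 0 and dy == 0:
                    continue
                vx, vy = cx + dx, cy + dy
                if not (0 <= bx + vx <= w - n and 0 <= by + vy <= h - n):
                    continue
                points += 1
                err = mad(ref, cur, bx, by, vx, vy, n)
                if err < best:
                    best, nx, ny = err, vx, vy
        cx, cy = nx, ny  # recenter on the best point found so far
        step //= 2       # halve the stepsize for the next step
    return cx, cy, best, points


def make_frames(vx, vy, size=32, seed=0):
    """Hypothetical test helper: cur[y][x] = ref[y + vy][x + vx], with wrap-around."""
    rng = random.Random(seed)
    ref = [[rng.randint(0, 255) for _ in range(size)] for _ in range(size)]
    cur = [[ref[(y + vy) % size][(x + vx) % size] for x in range(size)]
           for y in range(size)]
    return ref, cur
```

For R = 7 the stepsizes are 4, 2 and 1, so 8·3 + 1 = 25 points are evaluated instead of the 225 of a full search. The trade-off is that the search can settle on a local minimum when the error surface is not smooth.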
[Figure: 3-step search motion estimation on a ±7 search grid centered at z0, showing the points tested in steps 1, 2 and 3 and the resulting MV]
Motion Estimation - Fast Algorithms
2D-logarithmic Search BMA
1. The search starts from the center of a diamond-shaped search region.
Each step evaluates 5 search points: the center and the four corners
of the diamond, n pixels (the stepsize) away, as illustrated in the
figure.
2. The search in the next step is centered at the best matching
point of the current step.
3. The stepsize is halved if the best match is located at the center
or on the boundary of the search range. Otherwise it remains the
same.
4. When the stepsize has been reduced to 1, the final step has been
reached.
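A Python sketch of these rules, under stated assumptions: the initial stepsize is a parameter (default 2), the same five-point pattern is used in the final step, and candidates outside the search range or the frame are skipped. Some descriptions of the method test all eight neighbours in the final step; that variant is not shown here. `make_frames` is a hypothetical test helper.

```python
import random


def mad(ref, cur, bx, by, vx, vy, n):
    """Mean absolute difference for the block at (bx, by) displaced by (vx, vy)."""
    total = 0
    for j in range(n):
        for k in range(n):
            total += abs(ref[by + vy + j][bx + vx + k] - cur[by + j][bx + k])
    return total / (n * n)


def log_search(ref, cur, bx, by, n, r, step=2):
    """2D-logarithmic search sketch; returns (vx, vy, MAD)."""
    h, w = len(ref), len(ref[0])
    cx = cy = 0
    best = mad(ref, cur, bx, by, 0, 0, n)
    while True:
        nx, ny = cx, cy
        # Four diamond corners at the current stepsize (the center is `best`).
        for dx, dy in ((step, 0), (-step, 0), (0, step), (0, -step)):
            vx, vy = cx + dx, cy + dy
            inside = (abs(vx) <= r and abs(vy) <= r
                      and 0 <= bx + vx <= w - n and 0 <= by + vy <= h - n)
            if not inside:
                continue
            err = mad(ref, cur, bx, by, vx, vy, n)
            if err < best:
                best, nx, ny = err, vx, vy
        at_center = (nx, ny) == (cx, cy)
        at_boundary = abs(nx) == r or abs(ny) == r
        cx, cy = nx, ny  # the next step is centered at the best point
        if step == 1:    # this was the final step
            break
        if at_center or at_boundary:
            step = max(1, step // 2)
    return cx, cy, best


def make_frames(vx, vy, size=32, seed=0):
    """Hypothetical test helper: cur[y][x] = ref[y + vy][x + vx], with wrap-around."""
    rng = random.Random(seed)
    ref = [[rng.randint(0, 255) for _ in range(size)] for _ in range(size)]
    cur = [[ref[(y + vy) % size][(x + vx) % size] for x in range(size)]
           for y in range(size)]
    return ref, cur
```

The loop terminates because the center moves only when the error strictly decreases, and the stepsize is halved whenever the center stays put.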
[Figure: 2D-logarithmic search on a ±6 search grid centered at z0, showing the points tested in steps 1 to 4 and the resulting MV]
[Figure: motion vectors between frame (n − 1) and frame n]
Motion Compensation
• Consider a sequence of images,
S = f(t), t = 0, 1, 2, ...
[Figure: the current block Bm, its matching block, and the displacement dm = MV between them]
[Figure: motion field (half-pixel accuracy) and predicted image]
Hierarchical Motion Estimation
• The full-search motion estimation algorithm is extremely
computationally intensive.
• To reduce the number of computations, a hierarchical
approach first searches for the solution at a
coarse resolution and then refines it at a finer resolution
within a small neighborhood of the previous solution.
• The following figure illustrates a 3-level hierarchical block
matching algorithm in which each level has half the spatial
resolution, both horizontally and vertically, of the level below it in
the pyramid.
• We assume the same block size is used at all levels, so
the number of blocks is also halved in each dimension from one
level to the next.
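A coarse-to-fine sketch in Python. It makes several simplifying assumptions not stated above: the pyramid is built by 2×2 averaging, the block for level l is placed at (bx >> l, by >> l) with the same size n, the coarsest level searches ±`r_top`, and each finer level refines by ±1 around the doubled vector. All names are hypothetical.

```python
import random


def downsample(frame):
    """Halve the resolution in both directions by averaging 2x2 groups."""
    h, w = len(frame) // 2, len(frame[0]) // 2
    return [[(frame[2 * y][2 * x] + frame[2 * y][2 * x + 1]
              + frame[2 * y + 1][2 * x] + frame[2 * y + 1][2 * x + 1]) / 4
             for x in range(w)] for y in range(h)]


def mad(ref, cur, bx, by, vx, vy, n):
    """Mean absolute difference for the block at (bx, by) displaced by (vx, vy)."""
    total = 0
    for j in range(n):
        for k in range(n):
            total += abs(ref[by + vy + j][bx + vx + k] - cur[by + j][bx + k])
    return total / (n * n)


def local_search(ref, cur, bx, by, n, cx, cy, r):
    """Test every displacement within +/-r of (cx, cy); return (vx, vy, MAD)."""
    h, w = len(ref), len(ref[0])
    best = (cx, cy, float("inf"))
    for vy in range(cy - r, cy + r + 1):
        for vx in range(cx - r, cx + r + 1):
            if not (0 <= bx + vx <= w - n and 0 <= by + vy <= h - n):
                continue
            err = mad(ref, cur, bx, by, vx, vy, n)
            if err < best[2]:
                best = (vx, vy, err)
    return best


def hierarchical_search(ref, cur, bx, by, n, levels=3, r_top=2):
    """Search at the coarsest level first, then refine at each finer level."""
    refs, curs = [ref], [cur]
    for _ in range(levels - 1):
        refs.append(downsample(refs[-1]))
        curs.append(downsample(curs[-1]))
    vx = vy = 0
    err = float("inf")
    for lev in range(levels - 1, -1, -1):
        r = r_top if lev == levels - 1 else 1
        vx, vy, err = local_search(refs[lev], curs[lev],
                                   bx >> lev, by >> lev, n, vx, vy, r)
        if lev > 0:
            vx, vy = 2 * vx, 2 * vy  # scale the vector for the next finer level
    return vx, vy, err


def make_frames(vx, vy, size=64, seed=0):
    """Hypothetical test helper: cur[y][x] = ref[y + vy][x + vx], with wrap-around."""
    rng = random.Random(seed)
    ref = [[rng.randint(0, 255) for _ in range(size)] for _ in range(size)]
    cur = [[ref[(y + vy) % size][(x + vx) % size] for x in range(size)]
           for y in range(size)]
    return ref, cur
```

With three levels, a ±2 search at the top and ±1 refinements, the sketch tests 25 + 9 + 9 candidates while covering displacements of roughly ±11 at full resolution. A full search over the same range would test (2·11+1)² = 529.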
Hierarchical Motion Estimation
[Figure: 3-level hierarchical block matching pyramid]
Motion Compensation
[Figure: motion compensated frame n]
• The frame formed by gluing the best-match blocks
together is called the motion compensated frame.
• The encoder can also form the
difference between the motion
compensated frame and the actual
frame. This is called the motion
compensated difference frame.
• A difference frame formed using
MC should have less correlation
between pixels than a difference
frame formed without using MC.
Motion Compensated Difference Frames
[Figure: motion compensated frame n]
Encoding Difference Frames
With motion compensation:
• The encoder forms the motion compensated difference frame:
MCD(n) = F(n) – M(n)
• The encoder losslessly encodes MCD(n), then sends the motion
vectors and MCD(n)
• The decoder can then compute F(n) = MCD(n) + M(n)
→ it knows F(n) exactly
With no motion compensation:
• The encoder could form the plain frame difference:
FD(n) = F(n) – F(n-1)
• The encoder losslessly encodes FD(n) and sends FD(n)
• The decoder can then compute F(n) = FD(n) + F(n-1)
→ it knows F(n) exactly
[Figure: motion compensated frame M(n) and motion compensated difference image MCD(n) = F(n) – M(n)]
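The encoder and decoder arithmetic can be sketched in Python. The sketch assumes frames are lists of rows of integers and that motion vectors are stored in a dictionary keyed by each block's top-left corner; this layout and the function names are illustrative choices, not something the slides specify.

```python
def motion_compensate(ref, mvs, n):
    """Build M(n): for each n-by-n block, copy the reference block
    that its motion vector points to."""
    h, w = len(ref), len(ref[0])
    mc = [[0] * w for _ in range(h)]
    for (bx, by), (vx, vy) in mvs.items():
        for j in range(n):
            for k in range(n):
                mc[by + j][bx + k] = ref[by + vy + j][bx + vx + k]
    return mc


def subtract(a, b):
    """Pixel-wise difference a - b, e.g. MCD(n) = F(n) - M(n)."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def add(a, b):
    """Pixel-wise sum a + b, e.g. F(n) = MCD(n) + M(n)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
```

The decoder rebuilds M(n) from the same reference frame and motion vectors, so adding the losslessly coded MCD(n) recovers F(n) exactly. Setting every vector to (0, 0) reduces M(n) to F(n-1) and MCD(n) to FD(n).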
Motion estimation philosophy
• Goal of motion estimation is NOT to provide a
careful analysis of the actual motion
• Goal is to achieve a given quality of representation
of the video while globally minimizing the bit rate
required to send
– The motion information
– The prediction error information
• Most of the time, for a given representation quality
– it takes fewer bits to send MV + MCD(n) than to send FD(n)
– it takes fewer bits to send FD(n) than to send F(n) itself.
Motion Estimation/Compensation
Summary
At the encoder:
– For each block in the frame being coded,
examine the search window(s) in the
reference frame to find the best match block
– Form the MC difference image = original
image minus motion compensated image
At the decoder:
– Decode the reference frames
– For each block in a temporally coded frame,
use the motion vector to select a block from
the reference frame and glue it in place
– Add the difference image
Temporal Location of Reference
• The reference frame need not occur before the
temporally coded frames which use it
Flavors of Motion Estimation
1. Forward predicted blocks: the best-match block
occurs in the reference frame before the block’s
frame
2. Backward predicted blocks: the best-match block
occurs in the reference frame after the block’s frame
3. Interpolatively predicted blocks: the best-match
block is the average of the best-match blocks from
reference frames before & after
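The three flavors can be compared for one block with a short Python sketch. It assumes the best-match blocks from the earlier and later reference frames have already been found, and that the encoder simply picks the candidate with the smallest MAD. That selection rule and the names used are assumptions of this sketch.

```python
def block_mad(a, b):
    """Mean absolute difference between two equally sized blocks."""
    count = sum(len(row) for row in a)
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / count


def choose_prediction(cur_blk, before_blk, after_blk):
    """Return (mode, predicted block) with the smallest MAD.

    before_blk: best match from the reference frame before the block's frame.
    after_blk:  best match from the reference frame after the block's frame.
    """
    interp = [[(f + b) / 2 for f, b in zip(rf, rb)]
              for rf, rb in zip(before_blk, after_blk)]
    candidates = {
        "forward": before_blk,
        "backward": after_blk,
        "interpolated": interp,
    }
    mode = min(candidates, key=lambda m: block_mad(cur_blk, candidates[m]))
    return mode, candidates[mode]
```

Interpolated prediction tends to help when the block's content lies between the two references, for example during a gradual brightness change.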