Stereo Video Signals Acquisition, Representation and Compression

Note on the use of images and external sources of information
• Most of the pictures used throughout this presentation were not created by the
author of the presentation. They were obtained from the sources referenced at
the end and through Internet searches. For the sake of clarity and
simplicity of the slides, the source is not cited individually on each
slide.
• The author would like to thank all of the researchers who have published and
made their contents available
2019/2020 Processamento e Codificação de Informação Multimédia - MIEEC 2

Acquisition, representation and Compression of 3D signals
• How to create and deliver 3D content?

• stereo and multi view video mastering
• 3D content acquisition
• 3D content representation
• Stereo and multiview video compression
• texture-based coding
• texture plus depth

3D content chain
• an overall 3DTV system comprises elements for the capture,
representation, compression, distribution, and display of the signals
• mastering is the process of representing a 3D scene

• a 3D master comprises two uncompressed 2D files
• compression needs to be applied for transmission

3D acquisition
• Direct acquisition by stereo cameras

• cameras that have two lenses can record depth information because they
capture a separate data stream for each eye
• most often a pair of 2D cameras is used

• when using two 2D cameras - or an array of cameras for multi view capturing -
precise calibration and temporal synchronisation across cameras is vital
• cameras with active depth sensing

• which have sensors to estimate the distance between the camera and the object
by extracting phase information from received light stimuli (e.g., Microsoft
Kinect)
• converting from 2D to 3D
• the key is to consider several depth cues such as motion parallax, perspective or
camera motion to obtain a depth map

Direct acquisition by stereo cameras or pair of cameras
• baseline distance between visual axes (separation of the eyes - interocular
distance) is around 65 mm for an average viewer
• 3D cameras use the same separation for baseline (interaxial distance)
• the separation can be smaller or larger to accentuate the 3D effect of the displayed
material
• it will need to be varied for different focal length lenses and by the distance from
cameras to the subject
• nonetheless, depending on age and other factors, other baseline values
may have to be taken into account by content producers
[Figure: percentage of viewers versus baseline distance]
2D-to-3D conversion
• Why 2D to 3D conversion?
• there is still limited content originally captured in stereoscopic format
• whilst 3D-enabled TV sets are already being introduced at a rather large scale
• difficulty in capturing stereoscopic content

• large amount of 2D content available
• Where/When?
• Post-production: cinema and TV
• Broadcasting: live content and legacy material
• TV set: legacy content owned by viewer or 2D linear programming being broadcast
• How?
• Fully manual
• Semi-automatic
• Automatic off-line
• Automatic in real-time

2D-to-3D conversion strategy
• From one single-view video (normal flat 2D video) estimate depth and render
new virtual view
• Input: 2D video sequence
• Output: Stereoscopic/multi-view video sequence

• the input 2D sequence is treated, at the output, as the left video sequence
• the right video sequence is the new, virtually created sequence
[Figure: input 2D video ➟ left video sequence, shown on a stereoscopic display]

2D-to-3D conversion strategy (2)
• The key is to be able to construct good quality depth maps
• this is done taking into account depth cues
• the depth map is a grayscale picture, assigning values of brightness to each pixel
• the value of brightness assigned to a pixel specifies the distance that pixel is from
the camera or from the viewer
• this map must be obtained for each image of the input 2D video sequence
• The resulting stereo/3D video is generated from the corresponding depth maps
and the original 2D video
• by shifting each pixel of the 2D image to the left or to the right depending on
the corresponding depth map value and the type of stereo view (left or right)
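The pixel-shifting step can be sketched as follows. This is a minimal illustration, not the method of any particular product: it assumes NumPy, a greyscale left image, an 8-bit depth map in which brighter means closer, and a linear mapping from brightness to disparity. The function name and the `max_disparity` parameter are hypothetical.

```python
import numpy as np

def render_right_view(left, depth, max_disparity=8):
    """Shift pixels of the left view horizontally according to depth.

    left:  (H, W) greyscale image
    depth: (H, W) uint8 depth map, 255 = nearest (assumed convention)
    Returns the synthesised right view and a boolean mask of holes.
    """
    h, w = left.shape
    right = np.zeros_like(left)
    filled = np.zeros((h, w), dtype=bool)
    # Assumed mapping: disparity in pixels grows linearly with brightness.
    disparity = np.rint(depth.astype(np.float64) / 255.0 * max_disparity).astype(int)
    for y in range(h):
        # Visit pixels from far to near so nearer pixels overwrite farther ones.
        for x in np.argsort(disparity[y], kind="stable"):
            xr = x - disparity[y, x]  # near objects move left in the right view
            if 0 <= xr < w:
                right[y, xr] = left[y, x]
                filled[y, xr] = True
    return right, ~filled
```

Positions that no source pixel lands on are reported in the returned mask; these are the unknown areas that have to be filled afterwards.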

2D-to-3D conversion strategy (3)
• The most critical factors for obtaining good quality conversion
• accuracy of the depth maps
• using pictorial / monocular cues (from one single image)
• using motion cues (from a sequence of images)
• motion parallax, dynamic occlusion
• stereo generator quality
• involves shifting and filling
• most important and difficult step is filling unknown areas that appear after image
objects have been shifted
• techniques include interpolation and texture generation
• using information from uncovered pixels in image or adjacent images
• temporal inpainting
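A very simple form of filling can be sketched as below. It is only a stand-in for the interpolation and texture generation techniques listed above: each hole pixel copies the nearest valid pixel on its right in the same row (usually background when a right view is synthesised from a left view), falling back to the left neighbour at the image border. It assumes NumPy, a greyscale image and a boolean hole mask; the function name is hypothetical.

```python
import numpy as np

def fill_holes_horizontal(image, holes):
    """Fill hole pixels row by row with the nearest valid horizontal neighbour.

    image: (H, W) greyscale image containing undefined pixels
    holes: (H, W) boolean mask, True where the pixel is undefined
    """
    out = image.copy()
    remaining = holes.copy()
    h, w = out.shape
    for y in range(h):
        # Pass 1 scans right to left (copy from the right neighbour),
        # pass 2 scans left to right for holes touching the right border.
        for columns in (range(w - 1, -1, -1), range(w)):
            last = None
            for x in columns:
                if remaining[y, x]:
                    if last is not None:
                        out[y, x] = last
                        remaining[y, x] = False
                else:
                    last = out[y, x]
    return out
```

Copying a single neighbour produces visible streaks on textured backgrounds, which is why off-line workflows rely on more elaborate inpainting.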

On-line vs Off-line 2D-to-3D conversion
• Off-line
• Automatic, semi-automatic or manual
• if automatically generated, then converted content may need to be manually
corrected
• normally it is human-assisted (post production)

• delivers high quality but incurs high costs
• On-line
• must be fully automatic
• many problems may arise and degrade the quality

• artefacts may not be removed
• speed of conversion is critical (real-time)
• often, non-veridical depth information is used

Off-line standard approach to 2D-to-3D conversion
• Segmentation (selecting and tracing foreground objects)

• normally it is a semi-automatic operation
• Assigning depth
• usually a manual operation performed by the editor based on depth cues
• Displace objects
• shifting pixels according to the depth map
• could be manual or semi-automatic operation
• Fill-in holes
• filling unknown areas that appear after image objects have been shifted
according to their attributed depths
• semi-automatic operation with human supervision
[Figure: new image generation]

Off-line alternative approach to 2D-to-3D conversion
• Generate depth
• create a draft overall depth map
• automatic operation
• Select objects
• semi-automatic operation
• Edit depth map
• fine-tune the depth information for each foreground object
• manual or semi-automatic
• Render new view
• automatic
• called depth image based rendering
• Fill-in holes
• automatic operation with human supervision

On-line approach
• Imposes real-time constraints which demand compromises
• perception quality is more important than depth accuracy
• depth is often assigned based on perceptual outcome (prior knowledge) rather
than on real depth
• using previously constructed depth models
• a commonly used model is the one that assumes that the closer to the
bottom an object is, the closer to the viewer it is
• sophisticated inpainting algorithms need to be avoided
• Normally it trades off depth quality against real-time requirements
• Application in limited areas
• directly on 3D TV sets
• mobile devices
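The "closer to the bottom, closer to the viewer" model can be sketched as a precomputed depth map that is simply a vertical brightness ramp. This is a minimal illustration assuming NumPy and the convention that brighter means closer; the function name is hypothetical and real on-line converters may combine several such models.

```python
import numpy as np

def bottom_up_depth_model(height, width):
    """Depth map whose brightness grows linearly from the top row (0, far)
    to the bottom row (255, near), identical for every column."""
    column = np.linspace(0, 255, height).round().astype(np.uint8)
    return np.tile(column[:, None], (1, width))
```

Because the map does not depend on the picture content, it costs nothing at run time, which matches the real-time constraint at the expense of depth accuracy.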

Present stereoscopic image generation process
• Two main approaches can be employed relying on two different ways of representing
depth
• Depth Image Based Rendering (DIBR)
• it directly uses the constructed depth maps to create the new (shifted) image
• Image Warping
• depth is represented by a mathematical function (as a transformation)
• Input signals
• single original image
• usually taken to be the left image
• depth data
• map, structure or transformation to apply
• Output
• stereoscopic pair

DIBR
• the depth data provides disparity values that are used to displace pixels
[Figure: 2D (texture) + depth map; the 2 views are presented simultaneously for stereoscopic viewing]

Existing challenges in 2D to 3D conversion
• Quality and accuracy of extracted depth
• Extracting depth from video sequences with object motion involved
• Combining the use of different cues to derive depth
• or, how to merge the different depth information derived
from different motion cues?
• What is the ideal depth budget distribution for each scene ?
• Automatic in-painting algorithms for hole filling
• Determining correct depth order for all the objects in the scene

3D mastering - putting together the left and right images
• Once the two images have been obtained, how to fit in the left and right
images (or the 3D visual information) into one stream?
• for storage or transmission purposes
• two alternatives
• spatial
• side-by-side
• top-bottom
• row interleaved ; column interleaved ; checkerboard
• temporal
• doubled frame rates
• or, adopt a different paradigm

• video + metadata
• multiplex in the same stream the 2D visual data and information describing
stereoscopy

Spatial 3D mastering
• transmitting or storing using the same infrastructures or resources

• to use the same amount of bandwidth, given that we have twice the amount of
data, some loss will be needed
• in resolution
• normally vertical or horizontal
• but it could be diagonally!

• the human eye is less sensitive to loss of resolution in a diagonal direction
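The spatial packings can be sketched as below: each view gives up half of its columns (side-by-side) or half of its rows (top-bottom) so that the packed frame has the size of a single 2D frame. This is a minimal illustration assuming NumPy and views with even dimensions; it decimates without the low-pass filtering a real system would apply first, and the function names are hypothetical.

```python
import numpy as np

def pack_side_by_side(left, right):
    """Keep every second column of each view: horizontal resolution is halved."""
    return np.hstack([left[:, ::2], right[:, ::2]])

def pack_top_bottom(left, right):
    """Keep every second row of each view: vertical resolution is halved."""
    return np.vstack([left[::2, :], right[::2, :]])

def unpack_side_by_side(frame):
    """Split the packed frame and restore the width by pixel repetition."""
    half = frame.shape[1] // 2
    left = np.repeat(frame[:, :half], 2, axis=1)
    right = np.repeat(frame[:, half:], 2, axis=1)
    return left, right
```

The unpacked views have the original size again, but the discarded columns are only approximated, which is the resolution loss mentioned above.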

Stereo and multi view video compression approaches
• Different taxonomies can be used to identify the different coding and
representation formats, namely the following two
• texture only versus texture + depth

• in line with the mastering approaches
• two views versus multi-view

• two views is stereo or conventional 3D
• two views correspond to a single 3D perspective of the scene
• multi-view corresponds to n views with n > 2, hence to n/2 3D scene perspectives
• the two classes can be combined

• two views texture-based
• single view texture + depth
• multi-view texture-based
• multi-view texture + depth

Stereo and multi view video compression approaches (2)
• Formats and standard specifications

• Texture only (two full views)
• Frame Compatible Stereo
• Conventional Stereo Video
• Multiview Simulcasting
• Multiview Video
• MVC (extension of H.264)
• MV-HEVC standards
• Texture + Depth
• 2D (Texture) + Depth
• MPEG-C standard
• Multiview+Depth (MVD)
• 3D-HEVC standard

Mastering of texture based encoding
• Called Frame Compatible Stereo formats
• the stereo signal is a multiplex of the two images into a single frame or sequence
of frames
• each coded with 2D video coding solutions such as H.264
• signalling data referring to the frame compatible formats has been specified in
H.264 as Supplemental Enhancement Information (SEI) messages
• also called stereo interleaving or spatial/temporal multiplexing formats, as in
mastering
• the spatial multiplex leads to a loss of image spatial resolution
• the temporal multiplex offers full resolution but increased bandwidth
• higher frame rates
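The temporal multiplex can be sketched as a plain interleaving of the two views, L0, R0, L1, R1, and so on. Each frame keeps its full resolution, but the sequence carries twice as many frames per second. This is a minimal illustration in which frames are arbitrary Python objects; the function names are hypothetical.

```python
def temporal_multiplex(left_frames, right_frames):
    """Interleave two equally long frame lists as L0, R0, L1, R1, ..."""
    if len(left_frames) != len(right_frames):
        raise ValueError("both views need the same number of frames")
    sequence = []
    for left, right in zip(left_frames, right_frames):
        sequence.append(left)
        sequence.append(right)
    return sequence

def temporal_demultiplex(sequence):
    """Recover the two views: even positions are left, odd positions are right."""
    return sequence[0::2], sequence[1::2]
```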

Texture based encoding (2)
• spatial and temporal multiplexing packaging
• multiple spatial multiplexing approaches
• top-bottom approach halves vertical resolution whereas side-by-side halves horizontal
resolution
• thus, for interlaced image formats side-by-side is preferred

[Figure: preparing content (side-by-side, top-bottom) and recovering content]

• Conventional Stereo Format
• encodes 2 full spatial resolution images
• as in the temporal multiplexing approach
• however, to minimize bandwidth (as temporal multiplexing
doubles it due to the 2x higher frame rate) it exploits redundancy
between the left and right image
• MPEG-2 Video, MPEG-4 Visual and the MVC standards offer full stereo coding
solutions with increased compression efficiency adopting this approach

• Multiview Format
• similar to conventional stereo but with n pairs of views or cameras
• temporally synchronised cameras capturing the same scene from different
perspectives or views
• the encoder exploits inter-view redundancy
• offers the viewer the possibility to move in front of the screen, changing
viewpoint freely with multiple perspectives available
• delivers a more realistic 3D viewing experience
• the basic case of multi-view is stereo video (N = 2), generating the sensation of
depth to the viewer by having each image derived for projection into one eye
• H.264 and HEVC have extensions dealing with multi-view

Texture-based inter-view encoding
• exploiting inter-view inter-dependency / redundancy as in H.264 MVC and
[Figure: prediction across images (time) and across views]
• just like temporal redundancy between images is commonly exploited

• consecutive camera pairs are positioned very closely
• it is very likely for image i from view k to be very similar to image i from view
k+1 and so on
• coding is thus performed using inter-view prediction
• if redundancy between views is not exploited ➟ simulcast
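The idea of inter-view prediction can be sketched with a block-based, horizontal-only disparity search: each block of the current view is predicted from the same image of the adjacent view, and only the residual would be left to code. This is a toy illustration assuming NumPy and greyscale images; it is not the MVC prediction process itself, and the function name, block size and search range are hypothetical.

```python
import numpy as np

def inter_view_residual(ref_view, cur_view, block=4, search=4):
    """Disparity-compensated prediction of cur_view from ref_view.

    For every block of cur_view, try horizontal displacements in
    [-search, +search] inside ref_view and keep the residual with the
    smallest sum of absolute differences (SAD).
    """
    h, w = cur_view.shape
    ref = ref_view.astype(np.int32)
    cur = cur_view.astype(np.int32)
    residual = np.zeros_like(cur)
    for y in range(0, h, block):
        for x in range(0, w, block):
            target = cur[y:y + block, x:x + block]
            bw = target.shape[1]
            best_sad, best_diff = None, None
            for d in range(-search, search + 1):
                x0 = x + d
                if x0 < 0 or x0 + bw > w:
                    continue  # candidate block would fall outside the image
                diff = target - ref[y:y + block, x0:x0 + bw]
                sad = int(np.abs(diff).sum())
                if best_sad is None or sad < best_sad:
                    best_sad, best_diff = sad, diff
            residual[y:y + block, x:x + block] = best_diff
    return residual
```

When the two views differ mainly by a horizontal shift, the residual is close to zero and far cheaper to code than the view itself, which is what simulcast gives up.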

Texture-based inter-view encoding (2)
• inter-view prediction
• view 0 is coded using a conventional H.264
approach
• it can be decoded either by an H.264/AVC or an
H.264 MVC decoder
• constitutes the base or reference level
• the other views are coded in a similar way but
• reference images in those other views are
themselves predicted using the I or P images
from the adjacent view
[Figure: view 0 and view 1]
• for similar objective quality (PSNR), it leads to a bit-rate saving of
• circa 30% only on the dependent view, for stereo video (N = 2 views)
• circa 25% over all views for multi view (N > 2 views)
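Note that the stereo figure applies to the dependent view only. The short calculation below converts it into a saving over the whole stream, under the assumption (not stated in the slides) that every view would cost the same bit rate under simulcast; the function name is hypothetical.

```python
def total_saving_vs_simulcast(n_views, dependent_view_gain):
    """Fraction of the total bit rate saved relative to simulcast.

    The base view costs 1 unit; each of the other n_views - 1 views
    costs (1 - dependent_view_gain) units instead of 1.
    """
    simulcast = float(n_views)
    inter_view = 1.0 + (n_views - 1) * (1.0 - dependent_view_gain)
    return 1.0 - inter_view / simulcast

# Stereo: 30% on the dependent view gives 1.7 units instead of 2.0,
# i.e. 15% over the whole stream.
stereo_saving = total_saving_vs_simulcast(2, 0.30)
```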

• inter-view prediction: view progressive encoding
• redundancy between views is exploited only for the first frame of
each dependent view
• inter-view prediction: view full hierarchical encoding
• redundancy between views is exploited
bidirectionally and for all frames
[Figure: base view 0 and dependent view 1 (B only) for both encoding structures]

• the decoding process when using view full hierarchical encoding

Texture + Depth encoding
• the depth map is a grayscale image indicating the distance of objects to the
camera
• brighter areas indicate objects closer to the camera
• it can be seen as metadata as it provides information about the image
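One way to obtain such a greyscale map from metric distances is sketched below. It assumes NumPy and a quantisation that is linear in inverse depth between a near and a far clipping plane, a convention often used for depth maps but not stated in the slides; `z_near`, `z_far` and the function name are hypothetical.

```python
import numpy as np

def depth_to_grey(z, z_near, z_far):
    """Map distances z (same unit as z_near and z_far) to 8-bit values.

    z_near maps to 255 (brightest, closest) and z_far maps to 0.
    Quantisation is linear in 1/z, so nearby objects get finer steps.
    """
    z = np.clip(np.asarray(z, dtype=np.float64), z_near, z_far)
    v = 255.0 * (1.0 / z - 1.0 / z_far) / (1.0 / z_near - 1.0 / z_far)
    return np.rint(v).astype(np.uint8)
```

The result is an ordinary greyscale picture, one value per pixel, which can travel alongside the texture image.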

Texture + Depth encoding (2)
• depth information is obtained
• directly, using special cameras
• extracted from texture using depth cues (recall 2D-to-3D conversion)
• inherent to content if content was generated by a computer
• the solution texture+depth is suitable for generic 3D video solutions
• only one view (2D) is coded and transmitted alongside the metadata
• all necessary views for 3D display are generated from the received data, such as
using depth image based rendering (DIBR) (recall 2D-to-3D conversion)

texture + depth alternatives
• 2D texture + depth
• one single 2D view and metadata with the corresponding depth map
• depth enables the receiver to generate the missing neighbouring view
• standardised by MPEG as ISO/IEC 23002-3 (MPEG-C Part 3)
• advantages
• 2D video is backward compatible with legacy devices
• possibility to use diverse 2D video encoding formats (e.g., MPEG-2, H.264)
• requires minimal additional bandwidth with respect to 2D video
• limitations
• difficulties in rendering the second image and filling holes due to occlusions (see
2D-to-3D conversion)
• complex decoders

texture + depth alternatives (2)
• multi view texture + depth

• for each 3D perspective, it encodes one 2D view and the corresponding depth
map
• thus it produces N 2D textures and N depth maps
• for a multi-view signal with N perspectives
• the 2D views and the depth maps (greyscale images!) are encoded with MVC
• standardised as H.264 MVC + D or MVD
• key points
• an MVC + D decoder can reuse H.264/AVC or MVC hardware decoder
implementation modules
• supports bandwidth and device decoding capability adaptation
• allows depth maps to have different spatial resolution from texture
• requires typically about twice the bit rate of 2D video coded by H.264/AVC

Texture + Depth encoding and transmission

Texture + Depth encoding and transmission (2)
• Because the depth map is a grayscale image, it can be encoded using usual
video encoders
• and then both channels are joined together for transmission
• the main channel with the 2D textured image and the auxiliary channel with depth
information on a pixel basis
• MPEG-4 MAC (Multiple Auxiliary Component) has specified an approach for
transmitting jointly the two channels
• the depth channel is seen as an auxiliary component
• enables the re-use of standard video codecs
• delivers one single multiplexed stream at the output
[Figure: the 2D texture channel and the depth channel are combined by MPEG-4 MAC]

Most recent standards
• MV-HEVC
• Simple stereo/multi view extension of the HEVC standard
• includes encoding of depth maps as an additional color plane
• 3D-HEVC
• more efficient video + depth coding (in relation to MV-HEVC)
• Scalable stereo/multiview
• Combined coding of video and depth

• shared allocation of bits
• more suitable for view synthesis at the receiver side

• thus saves bandwidth by not requiring the transmission of all views
• and still enables larger view field ranges

References
• Anthony Vetro, Frame Compatible Formats for 3D Video Distribution.
Proceedings of the IEEE International Conference on Image Processing (ICIP),
2010.
• Daniel Minoli, 3D Television (3DTV) Technology, Systems, and Deployment:
Rolling Out the Infrastructure for the Next Generation Entertainment. CRC
Press, 2011.
• Ying Chen, Miska M. Hannuksela, Teruhiko Suzuki, Shinobu Hattori,
Overview of the MVC + D 3D video coding standard. Journal of Visual
Communication and Image Representation, Volume 25, Issue 4, May 2014,
Pages 679-688.
• Fernando Pereira, 3D video systems. Audio and Video Communication, IST,
2019/2020.
• http://www.yuvsoft.com/stereo-3d-technologies/2d-to-s3d-conversion-process/
• https://www.ntt-review.jp/archive/ntttechnical.php?contents=ntr201110gls.html
