

A Unified Scheme for Super-resolution and Depth Estimation from Asymmetric Stereoscopic Video

Jing Zhang, Student Member, IEEE, Yang Cao, Member, IEEE, Zheng-Jun Zha, Member, IEEE, Zhigang Zheng, Chang Wen Chen, Fellow, IEEE, and Zengfu Wang, Member, IEEE

Jing Zhang, Yang Cao, Zhigang Zheng and Zengfu Wang are with the Department of Automation, University of Science and Technology of China, Hefei, P.R. China. Zheng-Jun Zha and Zengfu Wang are with the Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, P.R. China. Chang Wen Chen is with the Department of Computer Science and Engineering, University at Buffalo, State University of New York, Buffalo, NY 14260-2000, USA. E-mail: {forrest, zfwang}@ustc.edu.cn.
This work is supported by the National Science and Technology Major Project of China (No. 2012GB102007) and NSFC (No. 61472380).

Abstract: Reconstructing a full-resolution stereoscopic video from an asymmetric stereoscopic video is a challenging task. The existing approaches assume that the depth information is available, which imposes an additional challenge in data acquisition. In this paper, we propose a novel scheme that is capable of obtaining super-resolution and depth estimation simultaneously from an asymmetric stereoscopic video. The proposed scheme models the video super-resolution and stereo matching with a unified energy function. Then, we apply an alternating optimization method to minimize this energy function, which can be implemented with a two-step algorithm. In the first step, we calculate the initial depth map by using a region-based cooperative optimization technique while considering the temporal consistency in video. In the second step, we resolve the super-resolution problem under the guidance of the depth information. This is effective because each step benefits from the incremental improvement of the other. We iteratively update the two steps until stable depth and super-resolution results are obtained. We have conducted a series of experiments on public stereoscopic video sequences to evaluate the performance of the proposed method. Both objective indices and subjective visual comparisons verify that the proposed scheme can achieve satisfactory super-resolution results and a high-quality depth map simultaneously. In particular, subjective evaluation experiments on a 3D monitor show that this scheme outperforms others and achieves the best visual sharpness.

Index Terms: asymmetric stereoscopic video, super-resolution, depth estimation, stereo matching

I. INTRODUCTION

WITH the recent developments in video capture and display technologies, 3D video communication and entertainment is one of the most promising services in video applications. It can bring new user experiences to a set of applications, such as mobile 3D TV, free-viewpoint video, and immersive teleconferencing [1], [2]. Along with these applications, the huge amount of data to be processed becomes a burden for both storage and transmission. Inspired by suppression theory [3], [4], the mixed-resolution approach, in which low-resolution and full-resolution images are jointly used, has been applied to reduce the amount of data to be compressed.

Recently, several mixed-resolution stereo-coding frameworks for mobile devices have been proposed in [5], [6], where one of the views is coded entirely at a lower resolution. Aksay et al. introduced temporal and spatial scalability into the mixed-resolution framework [7]. Chen et al. proposed to predict the macroblocks of the low-resolution view directly from the high-resolution view, and thus to reduce the computational cost of subsampling reference frames [8]. Very recently, Aflaki et al. proposed a modified MVC+D coding scheme that supports coding MVD data with a mixed-resolution texture representation [9]. Their method can provide fairly good bitrate reduction results. However, the above work does not compensate for the quality differences between the views [10], which might cause difficulties in post-processing applications, such as the production of sharp stereoscopic images and view synthesis.

As addressed in previous work, advanced super-resolution methods can lay a solid foundation for asymmetric stereoscopic video applications [9]. Reviews of image and video super-resolution methods are presented in [11]. Recently, Liu et al. proposed a Bayesian approach to single-view video super-resolution that simultaneously estimates the underlying motion, blur kernel, and noise [12]. Moreover, advancements in sparse representations have achieved outstanding super-resolution results [13], [14]. These single-view super-resolution algorithms can be directly applied to up-sample the low-resolution view of an asymmetric stereoscopic video. However, these approaches neglect the correspondence between the left and right views.

To solve this problem, the correspondence between the left and right views has been applied to enhance the particular low-resolution views. Garcia et al. proposed a super-resolution method for mixed-resolution multi-view video using depth information [10]. Their method leveraged high-frequency information from the full-resolution view to up-sample the low-resolution view based on the correspondences indicated by the associated depth maps. However, they did not discuss the acquisition of the depth map. Brust et al. proposed to render one of the views from the other view based on the estimated depth [15]. In their work, the depth map was precalculated from the original full-resolution stereo pairs by using the hybrid recursive matching (HRM) method [16]. Therefore, it is actually not a complete mixed-resolution approach. The most closely related previous work is presented in [17], in which Tian et al. proposed a dual regularization-based


Fig. 1. Flowchart of the proposed method. Inputs: full-resolution video of the left view $I_N^L$ and low-resolution video of the right view $I_N^{R(low)}$. Outputs: full-resolution video of the right view $I_N^R$ and the depth map $D_N$. The proposed method consists of two parts: depth estimation and video super-resolution. In the first part, the depth map is estimated by using cooperative optimization for an energy function, which is composed of a data term, an occlusion term, a smoothness term, and a temporal consistency term for modeling the temporal correlations between adjacent depth frames. Then, the estimated depth map is used for guiding the super-resolution process in the second part. Specifically, the super-resolved right view is estimated by using the conjugate gradient method to optimize a quadratic energy function, which is composed of a data term, a mapping term for modeling the pixel correspondence between the left and right views, and a nonlocal term for exploiting nonlocal self-similarity.
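To make the alternation in Fig. 1 concrete, the outer loop can be summarized in a few lines of Python. This is a minimal sketch under stated assumptions: `estimate_depth`, `super_resolve`, and `upsample` are hypothetical stand-ins for the cooperative-optimization matcher of Section III-A, the conjugate-gradient solver of Section III-B, and the bilinear initializer of Section III-C, and the 1% relative-change stopping rule anticipates the convergence analysis of Section IV-A.

```python
import numpy as np

def unified_sr_depth(I_L, I_R_low, estimate_depth, super_resolve, upsample,
                     tol=0.01, max_iter=10):
    """Outer alternating loop of the proposed scheme (schematic).

    estimate_depth(I_L, I_R)       -> depth map   (depth estimation step)
    super_resolve(I_L, I_R_low, D) -> right view  (super-resolution step)
    upsample(I_R_low)              -> initial full-resolution guess
    """
    I_R = upsample(I_R_low)              # bilinear initialization
    D = estimate_depth(I_L, I_R)         # depth from the interpolated view
    for _ in range(max_iter):
        I_R_new = super_resolve(I_L, I_R_low, D)   # depth-guided SR update
        D = estimate_depth(I_L, I_R_new)           # SR-guided matching update
        # stop once the relative change of the right view falls below 1%
        # (cf. the relative error RE of Eq. (16) in Section IV-A)
        change = np.mean(np.abs(I_R_new - I_R) /
                         np.maximum(np.abs(I_R_new), 1e-8))
        I_R = I_R_new
        if change < tol:
            break
    return I_R, D
```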

super-resolution approach for asymmetric stereoscopic images (i.e., mixed-resolution stereoscopic images). Their approach exploited two regularization functions: a saliency-based total variation regularization function and a depth-based pixel consistency regularization function. Moreover, they used a hierarchical approach [18] to estimate the disparity map based on the temporally reconstructed full-resolution image and the neighboring-view image.

Different from the existing methods, which deal with video super-resolution and depth estimation separately, the proposed algorithm reconstructs the low-resolution view while simultaneously performing stereo matching in a unified energy-minimization framework. On the one hand, the stereo correspondence guides how to borrow high-frequency information from the full-resolution view to enhance the quality of the low-resolution view. On the other hand, the enhanced stereo pair can generate more matching points and consequently improve the result of the stereo correspondence. In the proposed algorithm, the energy minimization is performed by using an alternating optimization method. We first calculate the stereo correspondence (depth map) by using a cooperative-optimization-based stereo matching algorithm [19] while considering the temporal consistency in video [20]. Then, we resolve the super-resolution problem under the guidance of the stereo correspondence. Each of the above two processes benefits from the gradual improvement of the result in the other. So, we iteratively perform the two processes until stable results are obtained. The proposed algorithm can recover the full-resolution stereoscopic video with high PSNR and SSIM. Furthermore, it can also provide a high-quality depth map, which can be applied in many video applications, such as multiview video coding [21] and video semantic segmentation [22]. Both objective and subjective experimental results verify that the proposed algorithm can achieve high-quality depth and super-resolution results while preserving a good 3D visual experience.

The paper is organized as follows. Section II presents the unified energy function for video super-resolution and stereo matching. Then, an iterative solution and algorithm are presented in Section III. Experimental results are presented in Section IV. Finally, we conclude this paper in Section V.

II. A UNIFIED ENERGY FUNCTION FOR VIDEO SUPER-RESOLUTION AND STEREO MATCHING

A. Overview

Given an asymmetric stereoscopic video, without loss of generality, we assume that the left view is of full resolution and the right view is of low resolution. The goal of the proposed method is to obtain a full-resolution video of the right view by exploiting the plentiful detail information of the left full-resolution video. The exploiting step builds upon the calculation of view correspondence using depth information. This method models the video super-resolution and stereo matching in a unified energy function. Then, we use an alternating optimization method to minimize the energy function. Fig. 1 shows the flowchart of the proposed method. It consists of two parts: depth estimation and video super-resolution. We alternately update these two steps until stable results are obtained.

B. Energy function for depth-based super-resolution

In this paper, we propose a new depth-based super-resolution method, which leverages the high-frequency information of the left full-resolution view for enhancing the


resolution of the right view. This method belongs to the category of reconstruction-based methods, which set up an energy function and then minimize it to obtain the optimal solution. The energy function usually consists of a data term and some other constraint terms.

Here, we start constructing the energy function with the following data term. Mathematically,

$$E_{data} = \left\| SKI_N^R - I_N^{R(low)} \right\|_2^2, \qquad (1)$$

where $S$ is a down-sampling operator, $K$ is a blurring operator, $I_N^R$ is a variable referring to the expected full-resolution right view of the $N$th frame, and $I_N^{R(low)}$ is the initial low-resolution input of the $N$th frame. $\|\cdot\|_2$ denotes the Euclidean norm. This is a common term used in reconstruction-based methods [23], [24]. It enforces a constraint on the expected full-resolution image $I_N^R$ so that it is consistent with the low-resolution input after the blurring and down-sampling process.

In addition, the high-frequency information of the left full-resolution view can be used to enhance the resolution of the right view, since the two views share many scene points. Therefore, once the correspondence between the left and right views is obtained, we can add a mapping term to the energy function. A similar disparity-based pixel mapping strategy is also applied in the methods in [10] and [17]. The explicit form of this term is:

$$E_{map} = \sum_{(m,n)\in\Omega} c_{mn} \left\| I_N^R(m,n) - I_N^L\big(m,\, n + \widetilde{D}_N(m,n)\big) \right\|_2^2 \qquad (2)$$

where $\Omega$ is the pixel index set of the image grid, $I_N^L$ is the $N$th frame of the left full-resolution view, and $\widetilde{D}_N$ denotes the stereo correspondence of the $N$th frame (the depth map^1 of $I_N^R$ relative to $I_N^L$). It can be obtained from the corresponding left-view depth map $D_N$ of $I_N^L$ relative to $I_N^R$, and we use linear interpolation to deal with the non-integer case. $c_{mn}$ is a binary confidence value for $\widetilde{D}_N(m,n)$. It is necessary since the depth map may not be accurate, especially in the occlusion regions and non-overlapping regions of the two views. In this paper, we determine $c_{mn}$ by measuring the similarity (mean square error, MSE) between the local patch centered at $I_N^R(m,n)$ and the local patch centered at $I_N^L\big(m,\, n + \widetilde{D}_N(m,n)\big)$.

^1 Depth and disparity are two interdependent terms in stereo vision. We use them interchangeably whenever appropriate.

Besides the above observation about the point-to-point mapping between the two views, there is another useful observation about natural images, the nonlocal prior [25]. This nonlocal prior is based on the observation that image content is likely to repeat itself within some neighborhood. This self-similarity of natural images is beneficial for solving the super-resolution problem, because it means that we can exploit the redundant information hidden in the full-resolution view. Leveraging the nonlocal prior, we enforce an additional nonlocal constraint between the left view and the right view under the guidance of stereo correspondences. The explicit form of this nonlocal regularization term is:

$$E_{nonlocal} = \sum_{(m,n)\in\Omega} c_{mn} \sum_{(p,q)\in nr(m,\, n+\widetilde{D}_N(m,n))} w_{mn,pq} \left\| T I_N^R(m,n) - T I_N^L(p,q) \right\|_2^2 \qquad (3)$$

Here $nr(i,j)$ denotes the nonlocal neighborhood at position $(i,j)$, whose size is $(2\rho_{nr}+1)\times(2\rho_{nr}+1)$. $T$ is a vectorization patch extraction operator, $T I_N^R(m,n)$ is the vectorized representation of a patch centered at $(m,n)$ in image $I_N^R$, and $w_{mn,pq}$ is the nonlocal weight calculated by measuring the similarity (mean square error, MSE) between patch $T I_N^R(m,n)$ and $T I_N^L(p,q)$ [25], [26].
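To make the patch machinery concrete, the sketch below computes MSE-based nonlocal weights in the spirit of Eq. (3) for a single right-view pixel. It is an illustration rather than the paper's implementation: the exponential mapping from patch MSE to weight and the normalization follow the usual nonlocal-means convention of [25], [26], the decay parameter `h` is an assumed free parameter, and image-boundary handling is omitted. The window radius `rho_nr = 5` (an 11x11 neighborhood) and patch radius 1 match the settings later reported in Section III-C.

```python
import numpy as np

def patch(img, m, n, r=1):
    # the operator T: a vectorized (2r+1) x (2r+1) patch centered at (m, n)
    return img[m - r:m + r + 1, n - r:n + r + 1].ravel()

def nonlocal_weights(I_R, I_L, m, n, d, rho_nr=5, r=1, h=10.0):
    """Weights w_{mn,pq} for a single right-view pixel (m, n) with
    disparity d = D_N(m, n); boundary handling is omitted."""
    ref = patch(I_R, m, n, r)
    cm, cn = m, n + d                     # matched position in the left view
    weights = {}
    for p in range(cm - rho_nr, cm + rho_nr + 1):
        for q in range(cn - rho_nr, cn + rho_nr + 1):
            mse = np.mean((ref - patch(I_L, p, q, r)) ** 2)
            weights[(p, q)] = np.exp(-mse / h**2)   # assumed exponential decay
    total = sum(weights.values())
    return {pq: w / total for pq, w in weights.items()}  # assumed normalization
```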
As can be seen, minimizing the above energy function relies on the calculation of stereo correspondence. We know that the result of a stereo matching algorithm depends on the co-occurrence of distinct details in both views. The more distinct details the two views share, the more reliable the matching result is. However, since we only have the mixed-resolution videos as inputs, the result of directly matching the left view with the interpolated right view may not meet expectations. We need to restore the details of the right view to obtain a reliable depth map. Therefore, we combine the calculation of stereo correspondence and the super-resolution together, and propose a unified function as follows:

$$E_{SR} = E_{data} + \lambda_1 E_{map} + \lambda_2 E_{nonlocal} + \lambda_3 E_{depth}, \qquad (4)$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are regularization parameters, and $E_{depth}$ is the depth energy function. We will give its explicit form in the following part.

C. Depth energy function

In [19], we proposed a region-based stereo matching algorithm using cooperative optimization. This method can achieve a high-quality depth map with relatively high efficiency. In this paper, we extend this method to the stereoscopic video case. By using the temporal consistency of depth information in stereoscopic video, this extended method can obtain temporally consistent depth maps. In the following part, we briefly describe the idea of the region-based stereo matching algorithm using cooperative optimization. We recommend referring to [19] for a detailed description.

Supposing that $R_1, \ldots, R_n$ are regions obtained by the Mean-shift segmentation algorithm [27], we define a total energy function, which can be decomposed into the sum of several subtarget energy functions. Mathematically,

$$E_{depth} = \sum_{i\in\Omega_{seg}} E^i, \qquad (5)$$

where $\Omega_{seg}$ is the index set of regions and $E^i$ is the energy function of the $i$th region $R_i$.

Next, we give the explicit form of every subtarget $E^i$. Here, we mainly concentrate on four aspects: data energy, occlusion energy, smoothness energy, and temporal consistency energy. Mathematically, we define the energy function of the $i$th region $R_i$ as follows:

$$E^i = E^i_{data} + E^i_{occlusion} + E^i_{smooth} + E^i_{consistency}. \qquad (6)$$

The first term is the data term. It evaluates the validity of the depth at position $(m,n)$ in region $R_i$ by calculating the color difference between two corresponding pixels. Its explicit form is:

$$E^i_{data} = \sum_{(m,n)\in V_R^i,\,(p,q)\in V_L^i} \left\| I_N^R(m,n) - I_N^L(p,q) \right\|_\infty, \qquad (7)$$

where $\|\cdot\|_\infty$ denotes the maximum norm (infinity norm), and $V_L^i$ and $V_R^i$ denote the visible pixel sets [19], [28] on the current region of the left and right images, respectively.


The second term is the occlusion energy. It imposes a fixed penalty on occlusion pixels. Its explicit form is:

$$E^i_{occlusion} = \big(|Occ_L| + |Occ_R|\big)\,\lambda_{occ}, \qquad (8)$$

where $|Occ_L|$ ($|Occ_R|$) is the number of left (right) occlusion pixels in region $R_i$ (please refer to Fig. 6 in [19]), and $\lambda_{occ}$ is a penalty constant and is set to 5 [19].

The third term is the smoothness energy. It is necessary since depth varies little in a small neighborhood. This term imposes a fixed penalty on the boundary pixels of each region whose values differ from those of their neighbors. Its explicit form is:

$$E^i_{smooth} = \sum_{(m,n)\in B^i} \begin{cases} \lambda_s & \text{if } \exists (p,q) \in \mathcal{N}_1(m,n) \text{ s.t. } |D_N(m,n) - D_N(p,q)| \geq 1 \\ 0 & \text{otherwise} \end{cases} \qquad (9)$$

where $B^i$ is the set of boundary pixels of region $R^i$, $D_N$ is the disparity map of the $N$th frame, $\mathcal{N}_1(m,n)$ indicates the 4-connected neighborhood at position $(m,n)$, and $\lambda_s$ is a penalty constant.

Furthermore, we consider the temporal consistency of depth information across frames. So, we add a consistency energy term that accounts for the penalty on temporal inconsistency. For a video, temporal consistency means the correspondence between adjacent frames, i.e., a scene point that appears in several adjacent frames should share the same depth value. This correspondence can be calculated according to the forward and backward optical flow fields^2 [29]. Therefore, we can construct a point-wise temporal consistency energy term. Its explicit form is:

$$E^i_{consistency} = \sum_{k\in\{-k_0,\ldots,-1\}} \sum_{(m,n)\in R_i} \left\| D_N(m,n) - D_{N+k}\big(\Phi_{N,N+k}(m,n)\big) \right\|_2 \qquad (10)$$

Here, $k_0$ is the range of adjacent frames. In this paper, we set $k_0$ to 2. $\Phi_{N,N+k}(m,n)$ is a mapping operator that maps a position index $(m,n)$ in the $N$th frame to its corresponding position index in the $(N+k)$th frame under the guidance of the forward/backward optical flow field. The assumption about the temporal consistency of depth information may be violated when there is large motion towards/away from the lens in the scene. In this case, the temporal consistency energy term should be excluded from the energy function, or be assigned a pixel-wise weight according to the optical flow information. In this paper, we simply excluded it from the energy function for the video sequence Bullinger.

^2 http://people.csail.mit.edu/celiu/OpticalFlow/

D. The unified energy function

Considering all the energy functions together, we have the explicit form of the unified energy function for depth-based super-resolution:

$$\begin{aligned}
E_{SR} = {} & \left\| SKI_N^R - I_N^{R(low)} \right\|_2^2 + \lambda_1 \sum_{(m,n)\in\Omega} c_{mn} \left\| I_N^R(m,n) - I_N^L\big(m,\, n + \widetilde{D}_N(m,n)\big) \right\|_2^2 \\
& + \lambda_2 \sum_{(m,n)\in\Omega} c_{mn} \sum_{(p,q)\in nr(m,\,n+\widetilde{D}_N(m,n))} w_{mn,pq} \left\| T I_N^R(m,n) - T I_N^L(p,q) \right\|_2^2 \\
& + \lambda_3 \sum_{i\in\Omega_{seg}} \Bigg( \sum_{(m,n)\in V_R^i,\,(p,q)\in V_L^i} \left\| I_N^R(m,n) - I_N^L(p,q) \right\|_\infty + \big(|Occ_L^i| + |Occ_R^i|\big)\lambda_{occ} + |B_{discon}^i|\,\lambda_s \\
& \qquad + \sum_{k\in\{-k_0,\ldots,-1\}} \sum_{(m,n)\in R_i} \left\| D_N(m,n) - D_{N+k}\big(\Phi_{N,N+k}(m,n)\big) \right\|_2 \Bigg) \qquad (11)
\end{aligned}$$

Minimizing the above energy function, we can obtain the optimal solution for $I_N^R$ and $D_N$. It can be described as the following optimization problem:

$$\{I_N^R, D_N\} = \arg\min_{\{I_N^R, D_N\}} E_{SR} = \arg\min_{\{I_N^R, D_N\}} \Big( E_{data} + \lambda_1 E_{map} + \lambda_2 E_{nonlocal} + \lambda_3 \sum_{i\in\Omega_{seg}} E^i \Big). \qquad (12)$$

III. ITERATIVE SOLUTION AND ALGORITHM

Directly optimizing the above energy function is difficult because $I_N^R$ and $D_N$ are coupled together. Therefore, we use an alternating optimization technique to solve the above optimization problem. We alternately optimize the unified energy function with respect to one variable while keeping the other one fixed. We employ a set of advanced optimization techniques so that the proposed method can effectively handle this challenging problem. The iterative updating solution is shown as follows.

A. Optimizing $D_N$

In this step, we optimize the energy function to obtain $D_N^{(t+1)}$ given the previous super-resolved result $I_N^{R(t)}$. In other words, when we perform the depth estimation part of the proposed method to obtain $D_N^{(t+1)}$, we keep the variable $I_N^R$ in the energy function as a constant, i.e., $I_N^{R(t)}$. By removing constant-value terms, the above optimization problem becomes:

$$D_N^{(t+1)} = \arg\min_{D_N} \sum_{i\in\Omega_{seg}} \Bigg( \sum_{(m,n)\in V_R^i,\,(p,q)\in V_L^i} \left\| I_N^{R(t)}(m,n) - I_N^L(p,q) \right\|_\infty + \big(|Occ_L^i| + |Occ_R^i|\big)\lambda_{occ} + |B_{discon}^i|\,\lambda_s + \sum_{k\in\{-k_0,\ldots,-1\}} \sum_{(m,n)\in R_i} \left\| D_N(m,n) - D_{N+k}\big(\Phi_{N,N+k}(m,n)\big) \right\|_2 \Bigg) \qquad (13)$$

While simplified, the energy function in Eq. (13) is still difficult to optimize by directly using a gradient descent algorithm. In the proposed method, we use the cooperative optimization technique to solve this optimization problem [30]. The principle of cooperative optimization is to decompose a complex target into some comparatively simple subtargets, and to optimize these subtargets individually while keeping the common parameters consistent. Namely, optimizing $\sum_{i\in\Omega_{seg}} E^i$ can be performed by minimizing these subtargets $E^i$ ($i\in\Omega_{seg}$) individually. To keep the depth consistency between regions, we minimize all energy functions of subtarget $E^i$ and its associated subtargets $E^j$, and then propagate the results via iterative computation. Mathematically, for each subtarget $E^i$, we in fact minimize the following energy function each time:

$$(1-\lambda^i)\, E^i + \lambda^i \sum_{j\neq i} w_{ij} E^j, \qquad (14)$$

where $j$ is the index of the adjacent regions of $R_i$, $\lambda^i$ is the regularization parameter subject to $0 \leq \lambda^i \leq 1$, and $0 \leq w_{ij} \leq 1$ is the corresponding weight. When optimizing Eq. (13), a local window-based matching method is used to determine the initial depth map [19].
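A schematic of the region sweeps implied by Eq. (14) is sketched below; it is not the implementation of [19] or [30]. The callable `E(i, d)` is assumed to return the subtarget energy of region `i` under the full disparity assignment `d` (through the smoothness term it may also depend on adjacent regions' disparities), the neighbor graph and candidate disparity sets come from the segmentation, and the uniform weights `w_ij` are an assumption.

```python
def cooperative_sweeps(regions, neighbors, E, candidates, lam=0.3, sweeps=5):
    """Per-region cooperative updates in the spirit of Eq. (14).

    regions:    iterable of region ids
    neighbors:  dict region -> list of adjacent region ids
    E(i, d):    energy of region i under the full assignment d
    candidates: dict region -> list of candidate disparities
    """
    # assume uniform weights over each region's neighbors
    w = {i: {j: 1.0 / max(len(neighbors[i]), 1) for j in neighbors[i]}
         for i in regions}
    d = {i: candidates[i][0] for i in regions}   # arbitrary initialization
    for _ in range(sweeps):
        for i in regions:
            best, best_val = d[i], float("inf")
            for di in candidates[i]:
                d[i] = di                        # trial assignment
                val = (1 - lam) * E(i, d) + lam * sum(
                    w[i][j] * E(j, d) for j in neighbors[i])
                if val < best_val:
                    best, best_val = di, val
            d[i] = best                          # commit and propagate
    return d
```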


B. Optimizing $I_N^R$

In this step, we optimize the energy function to obtain $I_N^{R(t+1)}$ under the guidance of the previous depth estimate $D_N^{(t+1)}$. In other words, when we perform the super-resolution part of the proposed method to obtain $I_N^{R(t+1)}$, we keep the variable $D_N$ in the energy function as a constant, i.e., $D_N^{(t+1)}$. The optimization problem Eq. (12) is then simplified to:

$$I_N^{R(t+1)} = \arg\min_{I_N^R} \left\| SKI_N^R - I_N^{R(low)} \right\|_2^2 + \lambda_1 \sum_{(m,n)\in\Omega} c_{mn} \left\| I_N^R(m,n) - I_N^L\big(m,\, n + D_N^{(t+1)}(m,n)\big) \right\|_2^2 + \lambda_2 \sum_{(m,n)\in\Omega} c_{mn} \sum_{(p,q)\in nr(m,\,n+D_N^{(t+1)}(m,n))} w_{mn,pq} \left\| T I_N^R(m,n) - T I_N^L(p,q) \right\|_2^2 \qquad (15)$$

where $c_{mn}$ and $w_{mn,pq}$ are evaluated at $I_N^R = I_N^{R(t)}$. Since the above problem is a quadratic optimization problem, it can be solved efficiently using the Conjugate Gradient Method [31].
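Because the problem is quadratic, its minimizer satisfies a linear system $Ax = b$ with a symmetric positive (semi-)definite $A$ assembled from $(SK)^{\top}(SK)$ and the weighted mapping and nonlocal terms. A generic matrix-free conjugate gradient routine is sketched below; assembling `apply_A` and `b` from the energy terms of Eq. (15) is omitted, so this illustrates the solver rather than reproducing the paper's code.

```python
import numpy as np

def conjugate_gradient(apply_A, b, x0=None, tol=1e-6, max_iter=200):
    """Solve A x = b for symmetric positive definite A, given only
    the matrix-vector product apply_A(x) (matrix-free CG)."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - apply_A(x)            # residual
    p = r.copy()                  # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rs / (p @ Ap)     # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p  # conjugate direction update
        rs = rs_new
    return x
```

In practice, `scipy.sparse.linalg.cg` with a `LinearOperator` would serve equally well.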
C. Parameters setting

The flowchart of the proposed method is shown in Fig. 1. The low-resolution videos are obtained by applying a blurring and down-sampling process. Here, we use a 3x3 Gaussian blur kernel with a standard deviation of 1, as well as a down-sampling operator that samples pixels from the odd rows and columns. In the following experiments, we use the bilinear interpolation method to obtain the initial full-resolution right view.
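For reference, this degradation is easy to reproduce with SciPy; the border mode and kernel truncation below are assumptions, since only the 3x3 Gaussian kernel with standard deviation 1 and the sampling of odd rows and columns are specified.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(img, factor=2):
    """Blur with a 3x3 Gaussian (sigma = 1) and keep every `factor`-th
    row and column, starting from the first one (the odd rows/columns
    in 1-based indexing)."""
    blurred = gaussian_filter(img.astype(np.float64), sigma=1.0,
                              truncate=1.0,     # radius 1 -> 3x3 support
                              mode="reflect")   # border mode is an assumption
    return blurred[::factor, ::factor]
```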
The parameters of the cooperative optimization technique are set the same as in [19]. The parameters in Eq. (3) are set according to [26]. The nonlocal neighborhood is set to 11x11. The patch radius of the vectorization patch extraction operator $T$ is 1. The regularization parameters $\lambda_1$ and $\lambda_2$ in Eq. (12) are set to 0.005 empirically. Since we alternately optimize $I_N^R$ and $D_N$, the setting of $\lambda_3$ is trivial. Thus, we just set it to 1.

IV. EXPERIMENTS

Since our approach consists of two parts, super-resolution and depth estimation, we tested the performance of the two parts separately. We first selected the following five stereoscopic video sequences of real scenes from the Mobile 3D TV project^3: Book arrival (100 frames: No.1-No.100, full resolution: 512x384), Door flowers (100 frames: No.1-No.100, full resolution: 512x384), Leaving laptop (100 frames: No.1-No.100, full resolution: 512x384), Bullinger (100 frames: No.101-No.200, full resolution: 432x240), and TU-Berlin (100 frames: No.101-No.200, full resolution: 360x288). Among these video sequences, the first four contain local motion, e.g., a walking man and waving hands, and the last contains global motion. We performed a series of experiments on these video sequences to test the validity and effectiveness of the super-resolution part of the proposed method.

^3 http://sp.cs.tut.fi/mobile3dtv/video-plus-depth/

Next, we downloaded five synthetic stereoscopic video sequences with ground truth disparities from the real-time spatiotemporal stereo matching project^4 [35], [36], including Book, Street, Tanks, Temple, and Tunnel. All these videos consist of 100 frames except Book (41 frames) and have the same full resolution: 400x300. We then compared the depth estimation part of the proposed method against two state-of-the-art alternatives on these video sequences: the fast cost-volume filtering method (denoted as FCV in this paper) [32] and the dual-cross-bilateral grid based method (denoted as DCBGrid2 in this paper) [33].

^4 http://www.cl.cam.ac.uk/research/rainbow/projects/dcbgrid/datasets/

A. Convergence analysis

First, we present an experimental analysis of the convergence of the proposed approach. Fig. 2 shows different indices of our results in each iteration for the video sequence Book arrival. The top row and bottom row of Fig. 2 show the results for down-sampling factors 2 and 4, respectively. To evaluate the convergence of the proposed iterative method, we computed the relative error (RE) between the super-resolution results of two successive iterations. Its explicit form is as follows:

$$RE^{(t)} = \frac{1}{N_0} \sum_{N=1}^{N_0} \frac{1}{|\Omega|} \sum_{(m,n)\in\Omega} \frac{\left| I_N^{R(t)}(m,n) - I_N^{R(t-1)}(m,n) \right|}{I_N^{R(t)}(m,n)}, \qquad (16)$$

where $N_0$ is the number of total frames in a test video sequence, $|\Omega|$ is the number of total pixels in a frame, and $I_N^{R(t)}$ and $I_N^{R(t-1)}$ are the estimated super-resolved right views of two successive iterations for the $N$th frame. The results are shown as the black line marked by a block in Fig. 2(a). Meanwhile, we computed the root mean square error (RMSE) and structural similarity (SSIM) index [34] of the super-resolution result in each iteration.
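The relative error of Eq. (16) translates directly into code; a minimal sketch over lists of per-frame arrays (the small epsilon guarding against division by zero is our addition):

```python
import numpy as np

def relative_error(frames_t, frames_prev, eps=1e-8):
    """RE of Eq. (16): mean per-pixel relative change between the
    super-resolved frames of two successive iterations, averaged
    over all frames of the sequence."""
    per_frame = [np.mean(np.abs(c - p) / (np.abs(c) + eps))
                 for c, p in zip(frames_t, frames_prev)]
    return float(np.mean(per_frame))
```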


Fig. 2. Indices of the super-resolution result and depth map in each iteration for video sequence Book arrival. The top row shows the results for down-sampling factor 2, and the bottom row shows the results for down-sampling factor 4.

The definition of RMSE is as follows:

$$RMSE^{(t)} = \sqrt{ \frac{1}{N_0} \sum_{N=1}^{N_0} \frac{1}{|\Omega|} \sum_{(m,n)\in\Omega} \left\| I_N^{R(t)}(m,n) - I_N^{R(GT)}(m,n) \right\|_2^2 }, \qquad (17)$$

where $I_N^{R(GT)}$ is the ground truth right view of the $N$th frame. These results are shown as the blue line marked by a cross in Fig. 2(a) and the black line marked by a block in Fig. 2(b), respectively. We also computed the relative error of the estimated depth map against the ground truth (REoD) in each iteration. It is defined as follows:

$$REoD^{(t)} = \frac{1}{N_0} \sum_{N=1}^{N_0} \frac{1}{|\Omega|} \sum_{(m,n)\in\Omega} \frac{\left| D_N^{(t)}(m,n) - D_N^{GT}(m,n) \right|}{\left| D_N^{GT}(m,n) \right|}, \qquad (18)$$

where $D_N^{(t)}$ is the estimated depth map of the $N$th frame, and $D_N^{GT}$ is the ground truth depth map of the $N$th frame. However, there are no ground truth depth maps for the five test video sequences from the Mobile 3D TV project. In this paper, we instead used the depth map obtained by matching the full-resolution left view and right view using the method in [19]. This surrogate ground truth depth map is denoted as $D_N^{GT}$ in this paper. These results are shown as the black line marked by a block in Fig. 2(c). (Note that values at iteration 0 denote the indices of the initial result.)

We can see that RE, as well as RMSE and REoD, gradually decrease as the iterations proceed, while SSIM gradually increases. Besides, they all show a gradual convergence tendency with the increase in iterations. This confirms that each step of the proposed method benefits from the improved result of the other step, until stable results are obtained. In general, if we set a threshold of 1% for RE, we can achieve a rather low RMSE and a high SSIM for the super-resolution result, as well as a rather low REoD for the estimated depth map. Empirically, it takes 2-3 iterations for RE to reach this threshold for down-sampling factor 2, and 5-6 iterations for down-sampling factor 4.

B. Evaluation on super-resolution results

1) Comparison with single-view super-resolution methods: To evaluate the performance of the super-resolution part of our proposed method, we conducted a contrastive experiment between the proposed method, bilinear interpolation, and the sparse coding method. The sparse coding method [14], which achieved state-of-the-art performance for single-image super-resolution, was used as the baseline method for the evaluation in this section. The parameters for the publicly available code^5 are set according to [14].

^5 http://www.ifp.illinois.edu/jyang29/ScSR.htm

Tables I and II show the PSNR and SSIM of the final super-resolution results obtained by the proposed method, bilinear interpolation, and the sparse coding method. The PSNR and SSIM indices were calculated over all frames of each test video and then averaged. The same applies to all the following experiments when calculating PSNR and SSIM. It can be seen that the proposed method significantly boosted the PSNR and SSIM. For instance, the proposed method achieved average PSNR gains of 5.5 dB and 4.16 dB over the bilinear interpolation method, and 2.25 dB and 2.24 dB over the sparse coding method, for down-sampling factors 2 and 4, respectively. As for SSIM, it achieved average gains of 0.0555 and 0.1190 over the bilinear interpolation method, and 0.0180 and 0.0746 over the sparse coding method, respectively.
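The RMSE of Eq. (17) and the REoD of Eq. (18) used throughout the following tables can be computed in the same style; the epsilon in REoD, guarding zero-disparity pixels, is our addition:

```python
import numpy as np

def rmse(frames_est, frames_gt):
    """RMSE of Eq. (17): root of the mean squared error over all
    pixels of all frames."""
    sq = [np.mean((e - g) ** 2) for e, g in zip(frames_est, frames_gt)]
    return float(np.sqrt(np.mean(sq)))

def reod(depths_est, depths_gt, eps=1e-8):
    """REoD of Eq. (18): mean relative disparity error against the
    (surrogate) ground truth."""
    per_frame = [np.mean(np.abs(d - g) / (np.abs(g) + eps))
                 for d, g in zip(depths_est, depths_gt)]
    return float(np.mean(per_frame))
```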


TABLE I
PSNR of super-resolution results obtained by the proposed method, bilinear interpolation, and the sparse coding method.

                      down-sampling factor 2                  down-sampling factor 4
Method           bilinear   sparse coding   proposed    bilinear   sparse coding   proposed
Book arrival     30.54      33.85           36.62       26.23      27.99           29.42
Door flowers     30.69      33.93           36.30       26.22      28.01           29.85
Leaving laptop   30.87      34.22           36.91       26.50      28.32           30.31
Bullinger        35.16      38.57           37.99       30.44      32.75           33.56
TU-Berlin        30.69      33.62           37.63       25.93      27.84           32.94
AVERAGE          31.59      34.84           37.09       27.06      28.98           31.22

TABLE II
SSIM of super-resolution results obtained by the proposed method, bilinear interpolation, and the sparse coding method.

                      down-sampling factor 2                  down-sampling factor 4
Method           bilinear   sparse coding   proposed    bilinear   sparse coding   proposed
Book arrival     0.9049     0.9464          0.9708      0.7724     0.8205          0.9062
Door flowers     0.9094     0.9478          0.9702      0.7836     0.8283          0.9051
Leaving laptop   0.9060     0.9464          0.9705      0.7766     0.8229          0.9063
Bullinger        0.9312     0.9539          0.9468      0.8532     0.8778          0.8870
TU-Berlin        0.9023     0.9467          0.9728      0.7481     0.8064          0.9242
AVERAGE          0.9108     0.9482          0.9662      0.7868     0.8312          0.9058


Fig. 3. Super-resolution results of bilinear interpolation, sparse coding, and the proposed method for the 81st frame of video sequence Door flowers with a down-sampling factor of 4. (a) Result of bilinear interpolation. (b) Result of the sparse coding method. (c) Our result. (d) The ground truth. (e)-(h) Close-up views of the red rectangular regions of (a)-(d). (i) The histograms of the residuals between (a)-(c) and (d).

The improvement over the initial interpolation result is obvious and can be clearly observed in Fig. 3 (the 81st frame of video sequence Door flowers with a down-sampling factor of 4). Fig. 3(c) shows that the proposed method obtained a better super-resolution result than the bilinear interpolation method (Fig. 3(a)) and the sparse coding method (Fig. 3(b)); it has clear details and is close to the ground truth (Fig. 3(d)). Fig. 3(e)-(h) shows the close-up views of the red rectangular regions of (a)-(d), respectively. Fig. 3(i) compares the histograms of the residuals between Fig. 3(a)-(c) and (d). More residuals of the proposed method are close to zero when compared with bilinear interpolation and the sparse coding method.

For video sequence Bullinger, we notice that the PSNR and SSIM of the proposed method are only comparable with those of the sparse coding method. The first reason is that the proposed super-resolution method is based on the calculation of stereo correspondence. However, stereo matching algorithms sometimes fail in textureless regions, e.g., the background in Fig. 4(a)-(b). The second reason is the slight color distortion between the original left image and right image. This distortion may be caused by slight illumination changes between the views of the cameras. Therefore, the pixel value borrowed from the full-resolution left view may be a little different from the original right view. Though the PSNR and SSIM of the proposed method are not the highest, the super-resolution result still exhibits plenty of details, as shown in Fig. 4(f). In addition, its color is a little more consistent with the left view than the bilinear interpolation result and the sparse coding result. (Please see the collar region: our result is a little whiter, which is close to the left view, and not as blue as in the right view.) This color consistency is also an influencing factor for the 3D visual experience.



Fig. 4. Super-resolution results of bilinear interpolation, the sparse coding method, and the proposed method for the 130th frame of video sequence Bullinger, with a down-sampling factor of 2. (a) Original full-resolution left view. (b) Original full-resolution right view. (c) The disparity map obtained by the proposed method, quantized with a shift of 20 disparities and a magnification factor of 8 for display. (d) Result of bilinear interpolation. (e) Result of the sparse coding method. (f) Our result.

2) Comparison with depth-based super-resolution methods: To compare with other depth-based super-resolution methods, i.e., the methods in [10] and [17], we first compared our proposed approach with the method in [10] using a known depth map. Then, the whole approach was compared with the method in [17], which also includes a depth map updating step in its iterative scheme.

In the first experiment, two disparity maps of different qualities were used. One was the disparity map obtained by matching the left view with the bilinear interpolation result of the right view (denoted as $D^{(0)}$), and the other was the disparity map obtained by matching the full-resolution left view and right view using the method in [19] (denoted as $D^{GT}$). For the sake of fair comparison, we kept the disparity map (either $D^{(0)}$ or $D^{GT}$) fixed and calculated the super-resolved image in only a single iteration by using the super-resolution part of the proposed method. The PSNR and SSIM indices are summarized in Tables III and IV. The proposed method achieved better results than the method in [10] in both cases for all sequences. Note that the scores of the proposed method with $D^{(0)}$ are even better than the scores of the method in [10] with $D^{GT}$ for down-sampling factor 2. The reason is that the method of [10] exploits a simple pixel-based high-frequency borrowing step to up-sample the low-resolution view; it requires a depth map of much higher accuracy. In contrast, the proposed method seeks the super-resolved image by optimizing an energy function, which includes a data term and a nonlocal term besides the mapping term. The data term guarantees that the super-resolved image has the same pixel values on the low-resolution image grid as the low-resolution input. In the nonlocal term, the depth map is only used to determine the nonlocal neighborhood in the full-resolution left view. Similar patches found in the neighborhood due to the nonlocal prior can provide ample information for super-resolution.

Fig. 5 shows the super-resolution results obtained by the proposed method and the method in [10] for the 81st frame of video sequence Door flowers with a down-sampling factor of 4. Fig. 5(e) shows the 81st-frame disparity map $D_{81}^{(0)}$ of low quality. Fig. 5(a) is the corresponding super-resolution result of the method in [10]. Only some details are restored, and most regions are still as blurry as the interpolation result (Fig. 3(a)). Our result (Fig. 5(c)) is better, but not as good as the results (Fig. 5(b) and (d)) obtained by using the disparity map $D_{81}^{GT}$ of high quality (Fig. 5(j)). Please see the close-up views for comparison (Fig. 5(f)-(i)).

In the second experiment, we compared the proposed method with the method in [17]. Since the method in [17] also includes a depth map updating step in its iterative scheme for image super-resolution, we compared the entire proposed method, including both the depth estimation part and the super-resolution part, against it. In [17], results are reported on three stereoscopic video sequences, i.e., Outdoor, Lab, and Bullinger, which are also from the Mobile 3D TV project. We adopted the same blurring and down-sampling parameters as [17], and tested the proposed method on the same video sequences. The PSNR indices are summarized in Table VI. As can be seen, the proposed method achieved higher PSNR than the method in [17], with an average gain of 1.57 dB. Three important properties of the proposed method lead to this result: i) better depth estimation due to the advanced stereo matching strategy; ii) better super-resolution reconstruction by exploiting the self-similarity of the full-resolution view; and iii) the gradual improvement from iteratively updating stereo matching and super-resolution reconstruction.

TABLE VI
PSNR of super-resolution results obtained by the proposed method and the method in [17].

Method      [17]    Proposed
Outdoor     32.81   35.65
Lab         34.55   36.14
Bullinger   36.16   36.46
AVERAGE     34.51   36.08


TABLE III
PSNR of super-resolution results obtained by the proposed method and the method in [10], given two depth maps of different qualities.

                               down-sampling factor 2                                      down-sampling factor 4
Method           [10]+D(0)   [10]+DGT   Proposed+D(0)   Proposed+DGT    [10]+D(0)   [10]+DGT   Proposed+D(0)   Proposed+DGT
Book arrival     32.17       33.21      35.02           36.91           26.51       29.50      27.43           30.62
Door flowers     32.31       33.06      35.22           37.01           26.42       29.06      27.44           30.47
Leaving laptop   32.94       33.58      35.91           37.34           26.91       29.70      27.82           30.97
Bullinger        34.39       34.98      35.11           38.10           29.69       31.49      30.64           33.03
TU-Berlin        32.57       34.87      35.85           38.32           25.95       31.99      27.87           31.14
AVERAGE          32.88       33.94      35.42           37.54           27.10       30.35      28.24           31.25

TABLE IV
SSIM of super-resolution results obtained by the proposed method and the method in [10], given two depth maps of different qualities.

                               down-sampling factor 2                                      down-sampling factor 4
Method           [10]+D(0)   [10]+DGT   Proposed+D(0)   Proposed+DGT    [10]+D(0)   [10]+DGT   Proposed+D(0)   Proposed+DGT
Book arrival     0.9386      0.9461     0.9661          0.9703          0.8091      0.9005     0.8477          0.9192
Door flowers     0.9380      0.9458     0.9664          0.9716          0.8116      0.8990     0.8480          0.9198
Leaving laptop   0.9393      0.9466     0.9673          0.9711          0.8142      0.9008     0.8496          0.9211
Bullinger        0.9080      0.9134     0.9054          0.9488          0.8235      0.8543     0.8457          0.8872
TU-Berlin        0.9409      0.9537     0.9689          0.9726          0.8197      0.9189     0.8597          0.9181
AVERAGE          0.9330      0.9411     0.9548          0.9669          0.8156      0.8947     0.8501          0.9131


Fig. 5. Super-resolution results obtained by the proposed method and the method in [10] given two depth maps of different qualities for the 81st frame of video sequence Door flowers with a down-sampling factor of 4. (a) Result obtained by the method in [10] with the disparity map $D_{81}^{(0)}$ of low quality. (b) Result obtained by the method in [10] with the disparity map $D_{81}^{GT}$ of high quality. (c) Our result with the initial disparity map $D_{81}^{(0)}$. (d) Our result with the disparity map $D_{81}^{GT}$. (e) Initial depth map $D_{81}^{(0)}$, obtained by matching the left view with the bilinear interpolation result of the right view. (f)-(i) Close-up views of the red rectangular regions of (a)-(d). (j) Disparity map $D_{81}^{GT}$, obtained by matching the full-resolution left view and right view using the method in [19]. The disparity maps are quantized with a magnification factor of 8 for display.

C. Evaluation on depth map

First, we present a visual and quantitative comparison between the initial depth map $D^{(0)}$ and the final depth map of the proposed method. Here, we computed the REoD indices according to Eq. (18). Since there are no ground truth depth maps for the five test video sequences from the Mobile 3D TV project, we used the surrogate ground truth depth maps $D^{GT}$ instead. Table VII shows the REoD statistics of the initial/final depth maps of the proposed method. We can see that the proposed method achieves a significant reduction of REoD through the iterative updating scheme. When compared with the initial depth maps, whose average relative errors are 3.94% and 9.74% for down-sampling factors 2 and 4, the proposed method reduces the relative errors of the final depth maps to 2.90% and 6.20%, respectively.

TABLE VII
REoD of the initial/final depth maps obtained by the proposed method.

                    down-sampling factor 2     down-sampling factor 4
Method           Initial    Final            Initial    Final
Book arrival     3.38%      2.82%            9.92%      6.86%
Door flowers     4.21%      3.18%            12.05%     8.29%
Leaving laptop   4.50%      3.19%            10.11%     6.75%
Bullinger        4.65%      3.66%            7.17%      5.90%
TU-Berlin        2.98%      1.63%            9.44%      3.20%
AVERAGE          3.94%      2.90%            9.74%      6.20%



Fig. 6. Super-resolution results and the estimated depth map for the 126th frame of video sequence TU-Berlin, with a down-sampling factor of 4. (a) Result of bilinear interpolation. (b) Our super-resolution result. (c) The ground truth right view. (d) The forward optical flow field (global motion). (e) The initial depth map $D_{126}^{(0)}$. (f) Our final depth map $D_{126}^{(6)}$. (g) The surrogate ground truth depth map $D_{126}^{GT}$. (h) The backward optical flow field (global motion).

TABLE V
REoD of the initial/final depth maps obtained by the depth estimation part of the proposed method with/without considering temporal consistency (TC) between adjacent depth frames, the fast cost-volume filtering method (FCV) [32], and the dichromatic dual-cross-bilateral grid based method (DCBGrid2) [33].

Video sequence   Down-sampling factor   Proposed without TC   Proposed with TC    FCV                 DCBGrid2
Book             2 (initial / final)    1.502% / 1.176%       1.432% / 1.120%     1.129% / 1.015%     8.586% / 8.436%
Book             4 (initial / final)    4.407% / 1.924%       3.889% / 1.694%     3.521% / 2.166%     10.024% / 8.761%
Street           2 (initial / final)    14.446% / 6.804%      9.240% / 3.122%     5.769% / 5.354%     19.886% / 14.541%
Street           4 (initial / final)    30.386% / 24.339%     22.880% / 14.997%   21.787% / 19.030%   35.088% / 34.476%
Tanks            2 (initial / final)    7.447% / 5.789%       7.025% / 5.837%     6.004% / 4.766%     15.391% / 10.088%
Tanks            4 (initial / final)    20.071% / 7.747%      16.373% / 7.382%    12.558% / 9.471%    30.042% / 13.236%
Temple           2 (initial / final)    5.563% / 3.311%       4.580% / 2.425%     4.781% / 4.522%     12.758% / 10.510%
Temple           4 (initial / final)    9.877% / 8.095%       8.605% / 6.711%     14.635% / 10.697%   22.067% / 15.520%
Tunnel           2 (initial / final)    3.249% / 1.578%       2.392% / 1.746%     4.057% / 2.477%     11.991% / 8.145%
Tunnel           4 (initial / final)    28.552% / 10.791%     21.394% / 3.750%    15.196% / 10.987%   28.873% / 13.985%
AVERAGE          2 (initial / final)    6.441% / 3.732%       4.934% / 2.850%     4.348% / 3.627%     13.722% / 10.344%
AVERAGE          4 (initial / final)    18.659% / 10.579%     14.628% / 6.907%    13.539% / 10.470%   25.219% / 17.195%

Fig. 6 shows the results for the 126th frame of video sequence TU-Berlin, which contains global motion (Fig. 6(d) and (h)). The proposed method significantly improved the initial depth map (Fig. 6(e)), and achieved a result (Fig. 6(f)) comparable to the surrogate ground truth depth map $D_{126}^{GT}$ (Fig. 6(g)). This high-quality depth map led to a good super-resolution result (Fig. 6(b)).

As mentioned previously, we also compared the depth estimation part of our proposed approach against two state-of-the-art methods, the fast cost-volume filtering method (FCV) [32] and the dual-cross-bilateral grid based method (DCBGrid2) [33]. For the sake of fair comparison, we simply replaced the corresponding depth estimation part with the two methods in our iterative scheme. Accordingly, the initial depth maps were estimated by matching the full-resolution left view and the bilinear interpolation result of the right view. Subsequently, we iteratively updated the depth map and the super-resolved right view until stable results were obtained. In addition, to illustrate the impact of the temporal consistency term, we also present the experimental results of our method with/without considering temporal consistency (TC) between adjacent depth frames. The test video sequences were downloaded from the homepage of the real-time spatiotemporal stereo matching project. Table V shows the REoD results of the above four methods. REoDs of the initial depth map and the final depth map are listed in each row, separated by a slash. Generally, stable results were obtained by the different methods after 3 and 5 iterations for down-sampling factors 2 and 4, respectively. It can be seen that the proposed method with TC achieved better results than the one without TC for both the initial and final cases. Overall, our methods on average achieved comparable or better results than FCV and DCBGrid2. Moreover, please note the significant reduction of REoD between the final results and the initial results of both FCV and DCBGrid2. This means that the proposed iterative scheme for simultaneously estimating the depth map and the super-resolved right view can also integrate other depth estimation methods and improve their depth estimation results. Fig. 7 shows a visual comparison of the above four methods for the 91st frame of video sequence Tunnel. As can be seen, our proposed method with the TC term is able to eliminate wrong pixels in the textured region and achieves better results.

Next, we conducted an experiment to compare the rendering



Fig. 7. Depth estimation results of the proposed method without/with considering temporal consistency (TC) and the state-of-the-art alternatives FCV and DCBGrid2 for the 91st frame of video sequence Tunnel. The down-sampling factor is 2. (a) The full-resolution left view. (f) The ground truth disparity map. (b) and (g) Initial/final disparity maps of the proposed method without considering TC. (c) and (h) Initial/final disparity maps of the proposed method with considering TC. (d) and (i) Initial/final disparity maps of FCV. (e) and (j) Initial/final disparity maps of DCBGrid2.

TABLE VIII
PSNR of rendering results of the right view from the original left view and different depth maps, including the initial depth map $D^{(0)}$, the final depth map $D^{(t)}$ generated by the proposed method, and the surrogate ground truth depth map $D^{GT}$.

                 down-sampling factor 2            down-sampling factor 4
Method           Rendering+D(0)  Rendering+D(t)    Rendering+D(0)  Rendering+D(t)    Rendering+DGT
Book arrival     25.38           25.98             22.79           24.82             26.15
Door flowers     25.92           26.45             22.62           25.18             26.66
Leaving laptop   26.09           26.33             22.99           25.15             26.43
Bullinger        27.54           28.57             25.64           27.85             28.71
TU-Berlin        28.94           29.37             24.60           28.93             29.84
AVERAGE          26.77           27.34             23.73           26.39             27.59


Fig. 8. Rendering results of the right view from the original left view and depth maps of different qualities for the 81st frame of video sequence Door flowers. (a)-(b) Rendering results with the initial depth map $D^{(0)}$ and the final depth map of the proposed method $D^{(t)}$, respectively, for down-sampling factor 2. (c)-(d) Rendering results with the initial depth map $D^{(0)}$ and the final depth map of the proposed method $D^{(t)}$, respectively, for down-sampling factor 4. (e) Rendering result with the surrogate ground truth depth map $D^{GT}$. (f)-(j) Close-up views of the red rectangular regions in (a)-(e).

results by using the original left view and depth maps of different qualities, including the initial depth map $D^{(0)}$, the final depth map $D^{(t)}$ generated by the proposed method, and the surrogate ground truth depth map $D^{GT}$. The rendering method was simple and straightforward: each pixel of the right view was interpolated from the corresponding one in the left view under the guidance of the depth map. Table VIII summarizes the PSNR of the rendering results, and Fig. 8 shows some rendering results for different depth maps for the 81st frame of video sequence Door flowers.
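This warping step admits a very small implementation. The sketch below is a simplified version under assumptions the paper does not spell out: purely horizontal disparities, nearest-neighbor rounding instead of interpolation, and no hole filling in occluded regions.

```python
import numpy as np

def render_right_from_left(I_L, D):
    """Warp the left view into the right view: each right-view pixel
    (m, n) is fetched from the left view at column n + D(m, n)."""
    H, W = D.shape
    rows = np.repeat(np.arange(H)[:, None], W, axis=1)
    cols = np.arange(W)[None, :] + np.round(D).astype(int)
    cols = np.clip(cols, 0, W - 1)   # clamp out-of-range correspondences
    return I_L[rows, cols]
```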
D. Subjective evaluation experiments

In [35], Jain et al. conducted three subjective experiments to compare two methods of mixed-resolution coding, single-eye and alternating-eye blur, in terms of overall quality for short exposures and visual fatigue level for long exposures. They gave a detailed description of the experimental procedure. The subjective experiments in this part were designed by referring to their work. To evaluate the quality of the full-resolution stereo videos generated by different super-resolution methods, we conducted two groups of subjective experiments


on five scenes: Book arrival, Door flowers, Leaving laptop, Bullinger, and TU-Berlin. These two groups of experiments correspond to the objective evaluation experiments in Sections IV-B1 and IV-B2.

Twenty naive subjects participated, all with normal or corrected-to-normal acuity and the ability to perceive stereoscopically defined depth. Moreover, all of them understood the term "sharpness"; this was verified by additional tests in which they distinguished a clear stereoscopic video from several blurred versions with different levels of blur. The experiments were conducted on a 22-inch Samsung SyncMaster 2233 LCD monitor with NVIDIA GeForce 3D Vision Ready equipment (3D vision shutter glasses and an infrared transmitter as controller), driven by an NVIDIA GeForce GTX 480 video card running at 1680×1050 resolution with a refresh rate of 120 Hz. Each test video sequence was composed of four videos corresponding to the different methods in a 2×2 arrangement, with the four videos placed in random order. The test sequences were saved in uncompressed AVI format to avoid introducing compression artifacts. All test sequences filled the vertical dimension of the screen (1050 pixels) but could leave black margins on the horizontal dimension. As recommended in [37], the viewing distance was set to 1.05 m.

In the first group, we generated each test stereoscopic video sequence in left-right format from four stereoscopic video sequences, corresponding to the full-resolution left view paired with the right views super-resolved by the bilinear interpolation method, the sparse coding method, and the proposed method, as well as with the ground truth right view. Each subject was asked to rank the four unnamed video sequences according to their sharpness. If they could not distinguish between two sequences, they were allowed to give them the same rank. In the second group, we generated the test stereoscopic video sequences in left-right format from four stereoscopic video sequences, corresponding to the full-resolution left view paired with the right views super-resolved by the method in [10] and by the proposed method given two depth maps of different qualities (D(0) and DGT). The experiment was then conducted in the same manner as the first one. Every subject completed four trials on each whole test condition (10 test video sequences corresponding to two down-sampling factors of 2 and 4).
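For reference, the left-right (side-by-side) packing used for the test sequences amounts to a horizontal concatenation of the two views per frame; the following trivial sketch uses our own names, not the authors' tooling:

    import numpy as np

    def pack_left_right(left, right):
        """Pack a stereo pair into one left-right (side-by-side) frame."""
        assert left.shape == right.shape
        return np.concatenate([left, right], axis=1)  # (H, 2W, C)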

For each trial, subjects were allowed to view the video sequences as many times as they wished until they completed the ranking; this usually took no more than five viewings.
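For clarity, the average ranks reported below can be obtained by simply averaging each method's rank over all subjects and trials; the following sketch uses a hypothetical array layout, not the authors' actual data format:

    import numpy as np

    def average_ranks(rank_table):
        """Mean rank per method over all subjects and trials.

        rank_table: (n_subjects, n_trials, n_methods) array of ranks,
                    1 = sharpest; tied methods may share a rank.
        """
        n_methods = rank_table.shape[-1]
        return rank_table.reshape(-1, n_methods).mean(axis=0)

    # Example: 20 subjects, 4 trials, 4 methods (as in Fig. 9).
    rng = np.random.default_rng(0)
    demo = rng.integers(1, 5, size=(20, 4, 4))
    print(average_ranks(demo))  # one average rank per method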
The average ranks from 20 subjects for the different test sequences and methods are plotted in Figs. 9 and 10. In Fig. 9(a), the average ranks over all five sequences for the bilinear interpolation method, the sparse coding method, the proposed method, and the ground truth are 2.75, 2.42, 1.9, and 1.79, respectively, for down-sampling factor 2. Generally, the degree of differentiation between these four kinds of results is low. This is especially true for TU-Berlin, which contains shadows and a black statue and whose overall contrast is not very high. Moreover, all the super-resolution results for down-sampling factor 2 are fairly good, so it is hard to distinguish their differences. However, we noticed that some subjects still preferred to give explicitly different ranks to these four results; the average ranks were influenced by this tendency and therefore fluctuated more than the others. For down-sampling factor 4, the average ranks over all five sequences are 3.1, 2.8, 2.04, and 1.49. These four kinds of results generally show a high degree of differentiation, and the proposed method is superior to the sparse coding method and bilinear interpolation in terms of visual sharpness.
In Fig. 10, the average ranks over all five sequences are 2.65, 2.67, 1.86, and 1.85 for down-sampling factor 2, and 3.44, 2.84, 2.10, and 1.56 for down-sampling factor 4, respectively. There is a slight discrepancy between the subjective ranks and the objective indices in Tables III and IV for down-sampling factor 4. Although the objective indices of the method in [10] with DGT are higher than those of the proposed method with D(0), the subjective ranks show the opposite preference. This is because the super-resolution results of the method in [10] exhibited more erroneous pixels in edge regions, which led to inconsistency between adjacent frames and degraded the visual experience, whereas the results of the proposed method were more stable. This phenomenon is particularly obvious for Bullinger.
V. CONCLUSION

In this paper, we have proposed a novel method for reconstructing the low-resolution view and simultaneously estimating depth information for asymmetric stereoscopic video. The proposed method models these two problems with a unified energy function and then minimizes it by using an alternating optimization technique. The alternating steps interact with each other to improve both the depth result and the super-resolution result. Objective experiments confirm that the proposed method can achieve high-quality depth and super-resolution results. In addition, subjective experiments show the superiority of the proposed method in terms of visual sharpness.
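For reference, the overall alternation can be sketched as follows, with estimate_depth, super_resolve, and upsample passed in as black boxes standing for the corresponding components; this mirrors the two-step iteration described above rather than reproducing the authors' implementation:

    import numpy as np

    def alternate(left_hr, right_lr, estimate_depth, super_resolve, upsample,
                  max_iters=10, tol=1e-3):
        """Alternate depth estimation and depth-guided super-resolution
        until the depth map stabilizes (placeholder callables, names ours)."""
        right_sr = upsample(right_lr)                  # crude initial right view
        depth = estimate_depth(left_hr, right_sr)      # initial depth map D(0)
        for _ in range(max_iters):
            # Step 1: super-resolve the right view under depth guidance.
            right_sr = super_resolve(right_lr, left_hr, depth)
            # Step 2: re-estimate depth from the improved right view.
            new_depth = estimate_depth(left_hr, right_sr)
            if np.mean(np.abs(new_depth - depth)) < tol:  # stable depth D(t)
                break
            depth = new_depth
        return right_sr, depth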
The computational cost of the proposed method is concentrated in three aspects: calculation of the optical flow field, stereo matching, and the nonlocal operation in the super-resolution part. We leave a fast implementation as future work. Moreover, an interesting direction for future work is to explore more depth cues, such as motion parallax, and incorporate them into the proposed framework to improve the depth estimation result. Another interesting direction arising from this work is that the estimated depth map can be used in many computer vision applications, such as multi-view video coding, video semantic segmentation, 3D salient object detection, image resizing, and somatosensory interaction.

REFERENCES

[1] Y. Chen, Y. Wang, K. Ugur, M. Hannuksela, J. Lainema, and M. Gabbouj, "3D video services with the emerging MVC standard," EURASIP Journal on Advances in Signal Processing, vol. 2009, pp. 1-8, 2009.
[2] K. Willner, K. Ugur, M. Salmimaa, A. Hallapuro, and J. Lainema, "Mobile 3D video using MVC and N800 internet tablet," in Proc. of 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video, May 2008, pp. 69-72.
[3] B. Julesz, Foundations of Cyclopean Perception. University of Chicago Press, 1971.
[4] W. Tam, "Image and depth quality of asymmetrically coded stereoscopic video for 3D-TV," JVT-W094, San Jose, CA, April 2007.
[5] H. Brust, A. Smolic, K. Mueller, G. Tech, and T. Wiegand, "Mixed-resolution coding of stereoscopic video for mobile devices," in Proc. of 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video, 2009, pp. 1-4.
[6] C. Fehn, P. Kauff, S. Cho, N. Hur, and J. Kim, "Asymmetric coding of stereoscopic video for transmission over T-DMB," in Proc. of 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video, May 2007.
[7] A. Aksay, C. Bilen, E. Kurutepe, T. Ozcelebi, G. Akar, R. Civanlar, and A. Tekalp, "Temporal and spatial scaling for stereoscopic video compression," in Proc. of 14th European Signal Processing Conference, 2006.
[8] Y. Chen, Y. Wang, M. Gabbouj, and M. Hannuksela, "Regionally adaptive filtering for asymmetric stereoscopic video coding," in Proc. of ISCAS, May 2009, pp. 2585-2588.
[9] P. Aflaki, W. Su, M. Joachimiak, D. Rusanovskyy, M. M. Hannuksela, H. Li, and M. Gabbouj, "Coding of mixed-resolution multiview video in 3D video application," in Proc. of IEEE Int. Conf. Image Processing (ICIP), 2013.
[10] D. Garcia, C. Dorea, and R. de Queiroz, "Super resolution for multiview images using depth information," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 9, pp. 1249-1256, Sept. 2012.
[11] S. Farsiu, D. Robinson, M. Elad, and P. Milanfar, "Advances and challenges in super-resolution," International Journal of Imaging Systems and Technology, vol. 14, no. 2, pp. 47-57, 2004.
[12] C. Liu and D. Sun, "A Bayesian approach to adaptive video super resolution," in Proc. of IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), 2011, pp. 209-216.
[13] J. Yang, J. Wright, T. Huang, and Y. Ma, "Image super-resolution as sparse representation of raw image patches," in Proc. of IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), 2008.
[14] J. Yang, J. Wright, T. Huang, and Y. Ma, "Image super-resolution via sparse representation," IEEE Trans. Image Processing, vol. 19, no. 11, pp. 2861-2873, 2010.
[15] H. Brust, G. Tech, and K. Müller, "Report on generation of mixed spatial resolution stereo data base," MOBILE3DTV project, Tech. Rep., 2009.
[16] N. Atzpadin, P. Kauff, and O. Schreer, "Stereo analysis by hybrid recursive matching for real-time immersive video conferencing," IEEE Trans. Circuits Syst. Video Technol., Special Issue on Immersive Telecommunications, vol. 14, no. 3, pp. 321-334, 2004.
[17] J. Tian, L. Chen, and Z. Liu, "Dual regularization-based image resolution enhancement for asymmetric stereoscopic images," Signal Processing, vol. 92, no. 2, pp. 490-497, 2012.
[18] T.-Y. Chung, S. Sull, and C.-S. Kim, "Frame loss concealment for stereoscopic video based on inter-view similarity of motion and intensity difference," in Proc. of IEEE Int. Conf. Image Processing (ICIP), 2010.
[19] Z. F. Wang and Z. G. Zheng, "A region based stereo matching algorithm using cooperative optimization," in Proc. of IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), 2008.
[20] C. M. Bishop, A. Blake, and B. Marthi, "Super-resolution enhancement of video," in Proc. of Artificial Intelligence and Statistics, 2003.
[21] S. Na, K. Oh, and Y. Ho, "Joint coding of multi-view video and corresponding depth map," in Proc. of IEEE Int. Conf. Image Processing (ICIP), Oct. 2008, pp. 2468-2471.


[22] N. D. Doulamis, A. D. Doulamis, Y. S. Avrithis, K. S. Ntalianis, and S. D. Kollias, "Efficient summarization of stereoscopic video sequences," IEEE Trans. Circuits Syst. Video Technol., vol. 10, no. 4, pp. 501-517, 2000.
[23] S. Dai, M. Han, W. Xu, Y. Wu, and Y. Gong, "Soft edge smoothness prior for alpha channel super resolution," in Proc. of IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), 2007.
[24] Q. Shan, Z. Li, J. Jia, and C. Tang, "Fast image/video upsampling," ACM Trans. Graphics (TOG), vol. 27, no. 5, 2008.
[25] A. Buades, B. Coll, and J. Morel, "A non-local algorithm for image denoising," in Proc. of IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), 2005.
[26] A. Buades et al., http://www.ipol.im/pub/art/2011/bcm%5Fnlm/.
[27] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 5, pp. 603-619, 2002.
[28] M. Bleyer and M. Gelautz, "A layered stereo matching algorithm using image segmentation and global visibility constraints," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 59, pp. 128-150, 2005.
[29] C. Liu, "Beyond pixels: exploring new representations and applications for motion analysis," Ph.D. dissertation, Massachusetts Institute of Technology, 2009.
[30] X. Huang, "Cooperative optimization for energy minimization: a case study of stereo matching," arXiv preprint cs/0701057, 2007, http://arxiv.org/abs/cs/0701057.
[31] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, United Kingdom, 2008.
[32] C. Rhemann, A. Hosni, M. Bleyer, C. Rother, and M. Gelautz, "Fast cost-volume filtering for visual correspondence and beyond," in Proc. of IEEE Int. Conf. Computer Vision and Pattern Recognition (CVPR), 2011.
[33] C. Richardt, D. Orr, I. Davies, A. Criminisi, and N. A. Dodgson, "Real-time spatiotemporal stereo matching using the dual-cross-bilateral grid," in Proc. of the European Conference on Computer Vision (ECCV), 2010.
[34] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Trans. Image Processing, vol. 13, no. 4, pp. 600-612, Apr. 2004.
[35] A. K. Jain, A. E. Robinson, and T. Q. Nguyen, "Comparing perceived quality and fatigue for two methods of mixed resolution stereoscopic coding," IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 3, pp. 418-429, 2014.
[36] J. Kowalczuk, E. Psota, and L. Perez, "Real-time temporal stereo matching using iterative adaptive support weights," in Proc. of 2013 IEEE International Conference on Electro/Information Technology (EIT), 2013.
[37] ITU-R, "Recommendation ITU-R BT.500-13: Methodology for the subjective assessment of the quality of television pictures," International Telecommunication Union, Tech. Rep., Jan. 2012.

Jing Zhang (S'13) received his B.S. degree in Applied Mathematics from Henan University, China, in 2010. He is now a Ph.D. student under the supervision of Prof. Zengfu Wang in the Department of Automation, University of Science and Technology of China. His research interests include image super-resolution, restoration and enhancement. He is a student member of the IEEE Signal Processing Society.

Yang Cao (M'13) was born in 1980. He received his B.S. degree and Ph.D. degree in information engineering from Northeastern University, China, in 1999 and 2004, respectively. Since 2004, he has been with the Department of Automation at University of Science and Technology of China, where he is now an assistant professor. His research interests are in image processing and computer vision. He is a member of the IEEE Signal Processing Society.

Zheng-Jun Zha (M'08) is a Professor with the Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, China. He previously worked as a Senior Research Fellow in the School of Computing, National University of Singapore (NUS). He received the B.E. degree in automation and the Ph.D. degree in pattern recognition and intelligent system from the University of Science and Technology of China (USTC), Hefei, China. His current research interests include multimedia analysis and retrieval, computer vision, and pattern recognition. He has published over 100 book chapters, journal articles, and conference papers in these areas, including TIP, TMM, TCSVT, TOMCCAP, ACM Multimedia, CVPR, and SIGIR. He received the Best Paper Award at the ACM International Conference on Multimedia (ACM Multimedia) 2009, the Best Demo Runner-Up Award at ACM Multimedia 2012, the Best Student Paper Award at ACM Multimedia 2013, and the Best Paper Award at the International Conference on Internet Multimedia Computing and Service 2013.

Zhigang Zheng received his B.S. degree and Ph.D. degree in Control Science and Engineering from University of Science and Technology of China, in 1999 and 2008, respectively. He joined University of Science and Technology of China in 2008, where he is currently an assistant professor in the Department of Automation. His current research focus is on computer vision and pattern recognition.

Chang Wen Chen (F'04) is a Professor of Computer Science and Engineering at the State University of New York at Buffalo, USA. Previously, he was Allen S. Henry Endowed Chair Professor at Florida Institute of Technology from 2003 to 2007, and a faculty member at the University of Missouri-Columbia from 1996 to 2003 and at the University of Rochester, Rochester, NY, from 1992 to 1996. He has been the Editor-in-Chief of IEEE Trans. Multimedia since 2014. He also served as the Editor-in-Chief of IEEE Trans. Circuits and Systems for Video Technology from January 2006 to December 2009 and as an Editor for Proceedings of the IEEE, IEEE T-MM, IEEE JSAC, IEEE JETCAS, and IEEE Multimedia Magazine. He and his students have received eight Best Paper Awards or Best Student Paper Awards and have been placed among Best Paper Award finalists many times. He is a recipient of the Sigma Xi Excellence in Graduate Research Mentoring Award in 2003, the Alexander von Humboldt Research Award in 2009, and the SUNY-Buffalo Exceptional Scholar - Sustained Achievements Award in 2012. He is an IEEE Fellow and an SPIE Fellow.

Zengfu Wang (M'13) was born in 1960 and received his B.S. degree in electronic engineering from University of Science and Technology of China in 1982 and his Ph.D. degree in control engineering from Osaka University, Japan, in 1992. He is currently a professor at both the Institute of Intelligent Machines, Chinese Academy of Sciences and the University of Science and Technology of China. He has published more than 180 journal articles and conference papers, including in IEEE Trans. on Image Processing, IEEE Trans. on Cybernetics, CVPR, and ECCV. He received the Best Paper Award at the ACM International Conference on Multimedia (ACM Multimedia) 2009. His research interests include computer vision, human computer interaction and intelligent robots.
