Fig. 1. The framework of the proposed WLMNC distance based human depth recovery method. In the remote stage, the human depth map is compressed
to a BS matrix using the WLMNC distance learning method based on skeleton joint information. In the local stage, the rough human depth map is recovered
based on WLMNC distance augmented clustering and then used as a guide weight in the AR model with the accompanying color image to recover a fine
depth map.
The correct occlusion relationship can enhance the viewer's sense of immersion in the surrounding environment, which helps the viewer to make correct judgments. According to [29], previous occlusion handling methods can be divided into three main categories: depth data based [30], modeling based [31], and image analysis based [32]. In tele-immersive systems based on composited video environments [3]–[7], the occlusion relationship between virtual and real objects is quickly handled using an image analysis based method [32], namely, directly laying the remote and local humans on top of the virtual background. The image analysis based method [32] enables distributed users to focus on the humans but easily causes incorrect overlap between the remote and local humans. To overcome this problem, in CuteChat [3] and People in Books [4], the layer and position of the remote and local humans in the virtual space are directly predefined. However, this leads to limited interaction between distributed users in the virtual space. To enhance the interaction experience of distributed users, a special RGB-D data transmission protocol is customized in Waazam [5] and Video Avatar [7] so that the depth data of the remote human can be transmitted synchronously with the video data. Then, using the depth data based method [30], the occlusion relationship between the remote and local humans can be correctly determined to achieve an occlusion consistent video composition. The customized RGB-D data transmission protocol greatly increases bandwidth consumption, which makes it difficult to obtain smooth transmission over the Internet and reduces the distributed users' quality of experience during remote video communication. To reduce bandwidth consumption, it is important to develop a strategy to highly compress the depth data of the remote human at the remote end and to accurately recover the depth data of the remote human under such a high upsampling rate at the local end.

B. Depth Recovery

Depth recovery is developed for upsampling, or inpainting, a low-quality depth map captured by depth sensors. Since stereo matching based depth acquisition methods require accurate image rectification and are inefficient for textureless areas [33], in recent years there has been enormous interest in depth sensor (including ToF camera and Kinect) based depth acquisition methods [34], [35]. Even though the new depth capturing techniques are promising, the use of depth cameras is limited by the low quality of the produced depth maps, e.g., low resolution, noise, and missing depth in some areas. To compensate for the missing and inaccurate depth measurements of Kinect, some image inpainting techniques have been developed [8], [9], [36]–[38]. These methods achieve good quality for smooth regions but may introduce artifacts, e.g., jagging, blurring, and ringing, around thin structures or sharp discontinuities.

To address the undersampling of ToF cameras, upsampling methods with a single depth map, such as [40] and [41], have been introduced. A low-resolution depth map can also be upsampled by integrating multiple low-resolution depth maps of the same scene, such as in [42] and [43]. Another popular approach is to upsample the noisy low-resolution depth map with the guidance of the accompanying high-resolution color image, such as [11], [18]–[22], [24], and [25]. These methods are mostly based on the assumption that depth discontinuities and image edges co-occur in the same scene [24], [43]. Image guided upsampling methods can yield much better upsampling quality than single depth map upsampling [39] and do not need any prior database, in contrast to the existing methods [40]. Additionally, they are not restricted to static scenes and do not require complicated camera calibration processes. However, when the color edges are inconsistent with the depth discontinuities, the upsampled depth map suffers from texture copy artifacts and blurred depth discontinuities.

To handle these two issues, most recent methods design complex guidance weights based on guide color images and heuristically take the initial interpolation of the input depth map into account. Park et al. proposed an edge-weighted NLM-regularization (Edge) [17], [21] method, which used a non-local term to regularize depth maps combined with a weighting scheme involving edge, gradient, and segmentation information extracted from high-quality color images and gradient information extracted from the bicubic interpolated depth map. However, jaggy artifacts still occurred along some boundaries. Yang et al. [16] proposed a color guided AR model that took an initial interpolated depth map as the definition of the AR coefficient. As reported in [16], the AR model can achieve good performance in handling the inconsistency between the color edge and the depth discontinuity, i.e., suppressing texture copy artifacts and preserving depth discontinuities when the corresponding color edge is weak. Dong et al. also proposed a color-guided depth recovery method via joint local structural and nonlocal low-rank regularization [10]. This method jointly exploited local and nonlocal color-depth dependencies and outperformed the AR model [16]. However, a complex guidance weight does not always help to improve the upsampling quality. Moreover, the initial depth map estimated by interpolating the noisy low-resolution depth map becomes unreliable, especially when the upsampling rate is very large.

To address this issue, Liu et al. [13], [14] proposed a robust weighted least squares (RWLS) model, which used an iteratively updated depth map as the guide weight. According to [13] and [14], using a depth map for the guide weight is the key element in improving depth recovery performance. Furthermore, as a guide depth map, the iteratively updated depth map is better than an initial interpolated depth map, which contributes to better performance in preserving sharp depth discontinuities. However, the iteratively updated depth map suffers in human depth recovery with depth data of only a few skeleton joints, as proved in our experiment. Li et al. [12] also proposed a depth recovery method guided by a cascadingly interpolated depth map. As clarified in [12], the cascaded scheme effectively addresses the potential structural inconsistency between the sparse input data and the guide image while preserving depth boundaries. However, when given a very sparse input dataset, the method tends to generate depth recovery results with texture copy artifacts.

In summary, existing depth recovery methods can only be applied to repair a small amount of missing or noisy depth
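The color-guided assumption these methods share, namely that depth edges co-occur with color edges, is easy to see in a minimal joint bilateral upsampling sketch (Python/NumPy here; none of the cited methods is exactly this filter, and the function and parameter names are ours):

```python
import numpy as np

def joint_bilateral_upsample(depth_lr, color_hr, scale, sigma_s=2.0, sigma_c=10.0, radius=2):
    """Upsample a low-res depth map guided by a high-res color image.

    Each high-res pixel's depth is a weighted average of nearby low-res
    depth samples; the weights combine spatial proximity and color
    similarity, so depth edges tend to follow color edges (and inherit
    their errors, i.e., texture copy when color and depth edges differ)."""
    H, W = color_hr.shape[:2]
    h_lr, w_lr = depth_lr.shape
    out = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            cy, cx = y // scale, x // scale          # nearest low-res sample
            num = den = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ly, lx = cy + dy, cx + dx
                    if not (0 <= ly < h_lr and 0 <= lx < w_lr):
                        continue
                    # spatial weight, measured in high-res coordinates
                    ws = np.exp(-((y - ly * scale) ** 2 + (x - lx * scale) ** 2)
                                / (2 * sigma_s ** 2 * scale ** 2))
                    # range weight from the guide color image
                    dc = color_hr[y, x] - color_hr[min(ly * scale, H - 1),
                                                   min(lx * scale, W - 1)]
                    wc = np.exp(-np.sum(dc ** 2) / (2 * sigma_c ** 2))
                    num += ws * wc * depth_lr[ly, lx]
                    den += ws * wc
            out[y, x] = num / max(den, 1e-12)
    return out
```

A uniform guide image reduces this to plain Gaussian interpolation, which is exactly why a weak color edge gives a blurred depth edge.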
HUANG et al.: WLMNC DISTANCE-BASED HUMAN DEPTH RECOVERY WITH LIMITED BANDWIDTH CONSUMPTION 5731
C. Distance Learning

Distance learning, which trains a new metric to satisfy the labels or constraints in the input data to enhance classification or clustering performance [44]–[47], is widely used in distance or similarity based machine learning, pattern recognition and data mining applications [48], [49].

In recent years, several distance learning methods have been developed, including distance metric learning using structured regularization [50], cosine similarity metric learning for face verification [51], Euclidean distance trained by shortest-path algorithms [52], KL divergence adapted using gradient descent [53], Bregman divergence trained using nonlinear learning [54], kernel similarity modified by incorporating the constraints in the objective function [55], [56] or using nonparametric approaches [57], [58], and Mahalanobis distances trained in [60] and [61].

The WLMNC distance learning method proposed in this paper is largely inspired by the recent work on Mahalanobis distance learning using convex optimization [61], especially the large margin nearest neighbor (LMNN) distance learning method [26] and the large margin nearest cluster (LMNC) distance learning method proposed by us in [27]. The LMNN distance learning method [26] aims to learn a generalized Mahalanobis distance such that the k-nearest neighbors always belong to the same class while samples from different classes are separated by a large margin, improving the classification performance of the k-nearest neighbor classification method. Different from LMNN, the LMNC distance learning method [27] is designed mainly for clustering problems, with the target of narrowing the distance between each sample and its cluster center while widening the distance between each sample and other heterogeneous cluster centers to achieve better clustering performance. The proposed WLMNC distance learning method is adapted from the LMNC distance learning method [27]; it learns an adaptive margin between different clusters by adjusting the penalty weight on heterogeneous clusters. By adopting the WLMNC distance to augment the clustering approach, we can greatly improve the skeletal block structure division performance among depth discontinuities due to occlusion between skeletal block structures and thus obtain more reliable depth estimation for the remote human. As shown in Fig. 2, the proposed method can provide good depth recovery performance for various types of human postures.

III. WLMNC DISTANCE BASED HUMAN DEPTH RECOVERY

In this section, we will introduce the WLMNC distance based human depth recovery method, which consists of two stages: 1) the remote stage, where a WLMNC distance learning method is proposed to highly compress the depth data of the remote human based on the skeleton joint information; and 2) the local stage, where a rough-to-fine depth recovery framework is proposed to accurately recover the depth data of the remote human based on the learned WLMNC distance, the skeleton joint information and the accompanying color image. To make the following description clearer, some complex symbol definition rules are shown in Table I.

A. Remote Stage

Since human posture can be divided into several skeletal block structures, e.g., head, hand and foot, and the depth data of each skeletal block structure is assumed to be smooth, the depth data of the remote human can be considered to be a piecewise linear function. Therefore, at the remote end, a natural way to compress the depth data of the remote human is to detect the skeleton joints and then transmit the depth data of only the extracted skeleton joints. Afterwards, at the local end, one method is to divide the remote human into several skeletal block structures using the received skeleton joints and then recover the depth data of the remote human based on the depth data of the skeletal block structures. However, when there are occlusions between different skeletal block structures, the skeletal block structure division results can be prone to error. To achieve better division performance in this situation, we propose a WLMNC distance learning method that can learn a matrix to preserve the skeletal block structure information of the remote human.

The rest of this section will introduce the classic method, i.e., nearest center (NC) clustering, which can be used for dividing human pixels into different skeletal block structures, and will then present our WLMNC distance learning method for augmenting classic NC clustering.
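The bandwidth saving of transmitting only skeleton-joint depths instead of the full human depth map can be made concrete with a back-of-the-envelope sketch (Python; all payload sizes below are our illustrative assumptions, not figures from the paper):

```python
# Toy bandwidth comparison for the remote stage: sending the full human
# depth map vs. only the skeleton-joint data plus a small learned matrix.
# All byte sizes are illustrative assumptions, not figures from the paper.

def payload_bytes(num_pixels, bytes_per_depth=2):
    """Raw payload: one depth value per human pixel."""
    return num_pixels * bytes_per_depth

def compressed_payload_bytes(num_joints, bytes_per_depth=2,
                             bytes_per_coord=4, matrix_dim=4,
                             bytes_per_float=4):
    """Compressed payload: 2-D joint coordinates + joint depths,
    plus a small matrix_dim x matrix_dim matrix (the learned distance
    matrix M in the paper is 4 x 4)."""
    joints = num_joints * (2 * bytes_per_coord + bytes_per_depth)
    matrix = matrix_dim * matrix_dim * bytes_per_float
    return joints + matrix

full = payload_bytes(100_000)         # e.g., a 100k-pixel human region
small = compressed_payload_bytes(21)  # e.g., 21 skeleton joints (J + 1)
print(full, small, full / small)      # ratio is on the order of hundreds
```

Even with generous per-joint overhead, the joint-based payload is smaller by two to three orders of magnitude, which is the motivation for recovering the dense depth map at the local end.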
5732 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 27, NO. 12, DECEMBER 2018
Fig. 3. Skeletal block structure division results of the Hug posture using (b) NC clustering and (c) WLMNC distance augmented clustering. The result of WLMNC distance augmented clustering clearly demonstrates its better performance in dividing skeletal block structures with occlusion compared with the result of NC clustering.

1) Classic Nearest Center Clustering: According to [25], human posture can be represented by several skeleton joints (see Fig. 2). Let $J+1$ denote the number of skeleton joints; the remote human can then be divided into $J$ skeletal block structures. Intuitively, each pixel of the remote human should belong to the nearest skeletal block structure, and the classic NC clustering approach is a natural choice for skeletal block structure division. Denote $\Omega = \{p_i\}_{i=1}^{N} \subset \mathbb{R}^2$ as the pixel set of the remote human and $\Omega_s = \{s_j\}_{j=1}^{J+1} \subset \Omega$ as the skeleton joint set, where $p_i$ and $s_j$ are pixels of two-dimensional image coordinates. For any positive integer $A$, denote $\langle A \rangle = \{1, 2, \cdots, A\}$. Let the skeleton joints associated with skeletal block structure $B_j$, $j \in \langle J \rangle$ be $\{s_{j1}, s_{j2}\}$, $s_{j1}, s_{j2} \in \Omega_s$, and define $B_j = [s_{j1}, s_{j2}]^T$. Then, for any pixel $p_i \in \Omega$, the distance between $p_i$, $i \in \langle N \rangle$ and skeletal block structure $B_j$, $j \in \langle J \rangle$, is defined as:

$$d_{2D}(p_i, B_j) = \sqrt{\|p_i - s_{j1}\|^2 + \|p_i - s_{j2}\|^2}, \quad (1)$$

which is equivalent to the Euclidean distance between vector $P_i = [p_i, p_i]^T$ and $B_j$. Here, each pixel $p_i$ is represented as a four-dimensional vector $P_i$ by repeating its two-dimensional image coordinate twice. Thus, each pixel $p_i$ can be divided into the corresponding skeletal block structure using the following NC clustering approach:

$$\min_{j \in \langle J \rangle} d_{2D}(p_i, B_j) = \min_{j \in \langle J \rangle} \|P_i - B_j\|_2. \quad (2)$$

2) WLMNC Distance Learning: Since two-dimensional image coordinates do not contain depth information, the nearest skeletal block structure of each pixel based on classic NC clustering may not be the nearest one in the three-dimensional physical space, especially in the case that there exist occlusions in the remote human. As shown in Fig. 2(a), suppose the remote human stretches her hands before her trunk to hug the local human; the skeletal block structures divided by classic NC clustering in Eq. (2) are then prone to error (see Fig. 3(b)). Some of the pixels belonging to the trunk block structures are clearly misclassified as pixels of the arm block structures.

To address the above issue, a WLMNC distance learning method is proposed in the remote stage to narrow the distance between each pixel and its real nearest skeletal block structure in the three-dimensional physical space while widening the distance between each pixel and other skeletal block structures with different depths.

The WLMNC distance is an adapted version of the LMNC distance proposed by us in [27]. The LMNC distance is a generalized Mahalanobis distance that is optimized by convex optimization with an objective integrating two parts: (a) minimizing the distance between the sample and its target cluster center; (b) maximizing the distance between the sample and other differently labeled cluster centers. The LMNC distance between human pixel $p_i$ and skeletal block structure $B_j$ is defined as follows:

$$d_M(p_i, B_j) = \sqrt{(P_i - B_j)^T M (P_i - B_j)}, \quad (3)$$

where $M$ is a $4 \times 4$ semidefinite matrix. When $M$ is an identity matrix, the LMNC distance is equivalent to the Euclidean distance in Eq. (2), which is used by classic NC clustering. The squared LMNC distance is denoted as follows:

$$D_M(p_i, B_j) = d_M^2(p_i, B_j). \quad (4)$$

Let $h(p_i) \in \langle J \rangle$ denote the label of pixel $p_i$'s real nearest skeletal block structure in the three-dimensional physical space, which is calculated as follows:

$$h(p_i) = \arg\min_{j \in \langle J \rangle} d_{3D}(p_i, B_j), \quad (5)$$

where

$$d_{3D}(p_i, B_j) = \big\{\|P_i - B_j\|^2 + (D^o(p_i) - D^o(s_{j1}))^2 + (D^o(p_i) - D^o(s_{j2}))^2\big\}^{1/2}. \quad (6)$$

In the above equation, $D^o(p_i)$, $D^o(s_{j1})$, and $D^o(s_{j2})$ denote the observed depth acquired at the remote end for pixels $p_i$, $s_{j1}$, and $s_{j2}$, respectively.

The matrix $M$ (LMNC distance) can then be learned by minimizing the following objective function:

$$\varepsilon(M) = \eta \sum_{ij} \delta_{ij} D_M(p_i, B_j) + (1 - \eta) \sum_{ijl} \delta_{ij} (1 - \delta_{il}) \, [1 + D_M(p_i, B_j) - D_M(p_i, B_l)]_+, \quad (7)$$

where $\delta_{ij} \in \{0, 1\}$ indicates whether label $h(p_i)$ is equal to $j$, $[\chi]_+ = \max(\chi, 0)$ denotes the standard hinge loss, and $\eta \leq 1$ is a positive constant for adjusting the two terms, which (a) penalize large distances between each pixel and its target skeletal block structure and (b) penalize small distances between each pixel and all other differently labeled skeletal block structures. For each pixel $p_i$, hinge loss occurs when the squared LMNC distance of $p_i$ to a skeletal block structure with a different label does not exceed the squared LMNC distance of $p_i$ to its target skeletal block structure plus one absolute unit of distance. In other words, the hinge loss makes the objective function in Eq. (7) potentially penalize triples $(i, j, l)$ as follows:

$$D_M(p_i, B_l) < D_M(p_i, B_j) + 1, \quad \delta_{ij} = 1, \; \delta_{il} = 0. \quad (8)$$

The benefit of using the hinge loss to construct the second term in Eq. (7) is to maintain a large margin between
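Equations (1)–(6) map directly to code. The sketch below (Python/NumPy, with our own variable names; the paper's implementation is in Matlab) labels a pixel both by 2-D NC clustering and by the depth-aware distance $d_{3D}$, illustrating how the two labels can disagree under occlusion:

```python
import numpy as np

def nc_label_2d(p, B):
    """Eq. (2): index of the skeletal block structure B_j = [s_j1, s_j2]
    nearest to pixel p under d_2D, i.e. ||P_i - B_j|| with P_i = [p, p]."""
    P = np.concatenate([p, p])            # 4-D vector P_i
    d = np.linalg.norm(B - P, axis=1)     # one distance per structure
    return int(np.argmin(d))

def label_3d(p, B, depth_p, depth_joints):
    """Eq. (5)/(6): the 'real' nearest structure once observed depths D^o
    are included (depth_joints[j] = (D^o(s_j1), D^o(s_j2)))."""
    P = np.concatenate([p, p])
    d3 = np.sqrt(np.sum((B - P) ** 2, axis=1)
                 + (depth_p - depth_joints[:, 0]) ** 2
                 + (depth_p - depth_joints[:, 1]) ** 2)
    return int(np.argmin(d3))

# Two structures: one near the pixel in 2-D but far away in depth, one
# farther in 2-D but at the right depth; Eq. (2) and Eq. (5) disagree.
B = np.array([[0., 0., 0., 2.],    # structure 0: joints (0,0) and (0,2)
              [3., 0., 3., 2.]])   # structure 1: joints (3,0) and (3,2)
depth_joints = np.array([[50., 50.],   # structure 0 lies at depth ~50
                         [10., 10.]])  # structure 1 lies at depth ~10
p = np.array([1., 1.])
print(nc_label_2d(p, B))                    # nearest in image coordinates
print(label_3d(p, B, 10.0, depth_joints))   # nearest in 3-D space
```

This mismatch between the 2-D label and $h(p_i)$ is exactly the occlusion failure case that the learned matrix $M$ is meant to correct.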
Fig. 4. Clustering error rate of skeletal block structures and MAD of depth recovery using WLMNC. This figure demonstrates that 1) the best skeletal block structure division result (A) does not contribute to the lowest MAD of depth recovery (B), and 2) WLMNC (B) can achieve better depth recovery performance than that of LMNC (C) by adaptively adjusting the penalty weight $\omega_{il}$ of the invading triples.

different skeletal block structures under the LMNC distance. This large margin is natural in general clustering problems because each cluster is always distributed in a different region of the sample space. However, when dividing skeletal block structures, clusters are always neighboring, which makes it difficult to maintain a large margin between different clusters. Moreover, this paper focuses on obtaining a better depth recovery result instead of better clustering performance. Therefore, misclassification among neighboring skeletal block structures with similar depths can be tolerated. Considering the above two factors, we propose a WLMNC distance learning method that minimizes the following objective function:

$$\varepsilon(M) = \eta \sum_{ij} \delta_{ij} D_M(p_i, B_j) + (1 - \eta) \sum_{ijl} \delta_{ij} (1 - \delta_{il}) \, \omega_{il} \, [1 + D_M(p_i, B_j) - D_M(p_i, B_l)]_+, \quad (9)$$

where an additional coefficient $\omega_{il}$ is introduced to adaptively adjust the penalty weight of the hinge loss caused by the invading triples defined above. Specifically, $\omega_{il}$ is defined as $1 - \xi_{il}$ with:

$$\xi_{il} = \exp\left(-\frac{(D^o(p_i) - D^o(s_{l1}))^2 + (D^o(p_i) - D^o(s_{l2}))^2}{2\sigma^2}\right), \quad (10)$$

which depends on the depth difference between pixel $p_i$ and skeletal block structure $B_l$; $\sigma$ is a predefined constant. The above definition of $\omega_{il}$ means that the objective function in Eq. (9) more heavily penalizes hinge loss caused by invading triples with more different depth values.

As shown in Fig. 4, the best depth recovery result (B) based on the WLMNC distance does not correspond to the lowest clustering error rate of skeletal block structures (A), which verifies the assumption that each invading triple does not have the same impact on the depth recovery performance. By adaptively adjusting the penalty weight $\omega_{il}$ of the invading triples according to depth differences, the proposed WLMNC distance can greatly reduce the invading triples among the depth discontinuities due to occlusion in the remote human, resulting in better depth recovery performance. As shown in Fig. 4, the best depth recovery result (in terms of MAD) based on the WLMNC distance with $\sigma = 1$ (B) is lower than that based on the LMNC distance, which is approximated by $\sigma = 2^{-10}$ (C). In fact, by introducing the penalty weight $\omega_{il}$, the proposed WLMNC distance learns an adaptive margin between two different skeletal block structures instead of the fixed large margin of the LMNC distance. As shown in Fig. 5, with the WLMNC distance, only skeletal block structures with different depths are separated with a large margin, and a small margin is maintained for those skeletal block structures with similar depths.

To improve the computational efficiency, we reformulate the optimization of Eq. (9) as an instance of semidefinite programming (SDP) [62]. An SDP problem is a linear programming problem with the additional constraint that a matrix, whose elements are linear in the unknown variables, is required to be semidefinite. According to [61], SDPs are convex and can be effectively solved. By introducing slack variables $\zeta_{ijl}$ to simplify the hinge loss in Eq. (9), the resulting SDP is given by:

$$\min \; \eta \sum_{ij} \delta_{ij} D_M(p_i, B_j) + (1 - \eta) \sum_{ijl} \delta_{ij} (1 - \delta_{il}) \, \omega_{il} \, \zeta_{ijl}$$
$$\text{s.t.} \quad M \succeq 0, \quad \zeta_{ijl} \geq 0, \quad D_M(p_i, B_l) - D_M(p_i, B_j) \geq 1 - \zeta_{ijl}. \quad (11)$$

Based on the gradient projection algorithm [63], this paper implements an expert solver for the above SDP problem. Let $C_{ij} = (P_i - B_j)(P_i - B_j)^T$; the squared WLMNC distance corresponding to the $M_t$ generated in iteration $t$ can then be defined as:

$$D_{M_t}(p_i, B_j) = \mathrm{tr}(M_t C_{ij}), \quad (12)$$

where $\mathrm{tr}(X)$ denotes the trace of matrix $X$. Hence, the objective function in Eq. (9) can be rewritten as follows:

$$\varepsilon(M_t) = \eta \sum_{ij} \delta_{ij} \, \mathrm{tr}(M_t C_{ij}) + (1 - \eta) \sum_{ijl} \delta_{ij} (1 - \delta_{il}) \, \omega_{il} \, [1 + \mathrm{tr}(M_t C_{ij}) - \mathrm{tr}(M_t C_{il})]_+. \quad (13)$$

Let $\Psi_t$ denote the set of all triples $(i, j, l)$ satisfying Eq. (8), i.e., making the second term of Eq. (13) greater than zero. The gradient of the objective function represented by Eq. (13) is
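A didactic re-implementation of the penalty weight in Eq. (10) and of the trace form of the objective in Eq. (13) might look as follows (Python/NumPy; this is not the authors' gradient projection solver, and all names are ours):

```python
import numpy as np

def penalty_weight(depth_p, depth_l1, depth_l2, sigma=1.0):
    """omega_il = 1 - xi_il with xi_il from Eq. (10): triples whose depths
    differ more are penalized more heavily."""
    xi = np.exp(-((depth_p - depth_l1) ** 2 + (depth_p - depth_l2) ** 2)
                / (2 * sigma ** 2))
    return 1.0 - xi

def wlmnc_objective(M, C, delta, omega, eta=0.5):
    """Eq. (13): eta * sum_ij delta_ij tr(M C_ij) + (1 - eta) *
    sum_ijl delta_ij (1 - delta_il) omega_il [1 + tr(M C_ij) - tr(M C_il)]_+
    where C[i][j] plays the role of C_ij = (P_i - B_j)(P_i - B_j)^T."""
    n, J = delta.shape
    # squared distances via traces, Eq. (12)
    D = np.array([[np.trace(M @ C[i][j]) for j in range(J)] for i in range(n)])
    pull = eta * np.sum(delta * D)            # first (target-structure) term
    push = 0.0
    for i in range(n):
        for j in range(J):
            if not delta[i, j]:
                continue
            for l in range(J):
                if delta[i, l]:
                    continue
                hinge = max(0.0, 1.0 + D[i, j] - D[i, l])  # standard hinge
                push += omega[i, l] * hinge
    return pull + (1 - eta) * push
```

Note that `penalty_weight` vanishes when the pixel and the invading structure share the same depth, which is precisely how misclassification between structures of similar depth is tolerated.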
Algorithm 1 The Pseudo-Code of the WLMNC Distance Learning Method in the Remote Stage

Algorithm 1 does not change for any $\omega_{i^*l^*} \in (\gamma_1, \gamma_2)$ and $(i^*, j^*, l^*) \in \Psi_t$. Define the output BS matrix of Algorithm 1 with $\omega^q_{i^*l^*} \in (\gamma_1, \gamma_2)$ after $t$ iterations as $M^q_t$, $q = 1, 2$. Then, if $\omega^1_{i^*l^*} \leq \omega^2_{i^*l^*}$, $d_{M^1_t}(p_{i^*}, B_{l^*}) \leq d_{M^2_t}(p_{i^*}, B_{l^*})$.

Fig. 7. Depth recovery results of the Tease posture: (a) ground truth depth map and color image and depth maps recovered by (b) WLMNC (MAD: 5.16), (c) GF (MAD: 5.45), (d) GF-AR (MAD: 4.54) and (e) WLMNC-AR (MAD: 4.15). This figure clearly demonstrates 1) the necessity of the fine depth recovery method when there exist a variety of occlusions in the remote human, as WLMNC-AR performs better than WLMNC in terms of preserving depth discontinuities in the red box area where skeletal block structures are incorrectly divided; 2) the effectiveness of using the rough depth map as the guide depth map of the AR model, as WLMNC and WLMNC-AR can suppress texture copy artifacts in the yellow box area where the color edge is weak, whereas GF and GF-AR cannot.

structure division using the WLMNC distance augmented clustering approach. As shown in Fig. 7, for the Tease posture, occlusion exists not only between the left arm block structure and the trunk block structure but also between the right arm block structure and the head block structure. In this situation, the rough depth map recovered by the proposed WLMNC method is prone to error in the red box area because several pixels of the trunk block structure are wrongly clustered into the left arm block structure of the remote human. To improve the depth recovery performance in this situation, we propose a fine depth recovery method based on a rough depth guided AR model, which takes the rough depth map estimated by WLMNC as a guide depth map in the AR model [16].

As discussed in Section II-B, the AR model [16] is a state-of-the-art color guided depth recovery model in terms of suppressing texture copy artifacts and preserving depth discontinuities by using an initial estimated depth map for the guide weight. However, the initial depth map estimated by interpolating the noisy low-resolution depth map becomes unreliable, especially when the upsampling rate is very high. This situation becomes worse in human depth recovery with depth data of only a few skeleton joints. As shown in Fig. 7(c), for the Tease posture, the depth map recovered by GF, an interpolation method with a Gaussian filter, is blurred among depth discontinuities in both highlighted regions. Therefore, when using the AR model guided with the depth map estimated by GF, the depth recovery result suffers from texture copy artifacts in the yellow box area where the color edge is weak; see Fig. 7(d). To preserve the depth discontinuities while suppressing the texture copy artifacts, we feed the above estimated rough depth map into the AR model for better depth recovery performance; see Fig. 7(e).

According to [16], the AR model is defined as follows:

$$D_f = \arg\min_D \Big\{ \sum_{s_j \in \Omega_s} \big(D(s_j) - D^o(s_j)\big)^2 + \lambda \sum_{p_i \in \Omega} \Big(D(p_i) - \sum_{p_j \in \mathcal{N}(p_i)} \alpha_{i,j} D(p_j)\Big)^2 \Big\}, \quad (26)$$

where the AR coefficient $\alpha_{i,j}$ is defined as

$$\alpha^{\hat{D}}_{i,j} = \exp\left(-\frac{(\hat{D}(p_i) - \hat{D}(p_j))^2}{2\sigma_1^2}\right)$$
$$\alpha^{I}_{i,j} = \exp\left(-\frac{\sum_{k \in \Theta} \|B_i \circ (P^k_i - P^k_j)\|^2}{2 \times 3 \times \sigma_2^2}\right)$$
$$B_i(i, j) = \exp\left(-\frac{\|p_i - p_j\|^2}{2\sigma_3^2}\right) \times \exp\left(-\frac{\sum_{k \in \Theta} (I^k(p_i) - I^k(p_j))^2}{2 \times 3 \times \sigma_4^2}\right),$$

where $\mathcal{N}(p_i)$ is the neighborhood of pixel $p_i$ in the $H \times H$ square patch centered at pixel $p_i$, and $\Lambda_i$ is the normalization factor that makes $\sum_{p_j \in \mathcal{N}(p_i)} \alpha_{i,j} = 1$. $\alpha^{\hat{D}}_{i,j}$ is a depth term defined on the guide depth map $\hat{D}$. $\alpha^{I}_{i,j}$ is a color term defined on the accompanying color image. Parameters $\sigma_1$, $\sigma_2$ are user defined constants. $\Theta = \{R, G, B\}$ or $\Theta = \{Y, U, V\}$ represents the different channels of the color image. $P^k_i$ denotes an operator that extracts a $W \times W$ patch centered at pixel $p_i$ in color channel $k$. "$\circ$" represents element-wise multiplication. $B_i$ is a bilateral filter kernel defined in the extracted $W \times W$ patch. $\sigma_3$ and $\sigma_4$ are user defined constants. We refer to [16] for details.

Different from the AR model in [16], which sets the guide depth map $\hat{D}$ in Eq. (26) as the initial interpolated depth map, the proposed rough depth guided AR model (denoted as WLMNC-AR) sets $\hat{D}$ as the rough depth map estimated by WLMNC. As shown in Fig. 7(e), compared with GF-AR (see Fig. 7(d)), WLMNC-AR can suppress texture copy artifacts in the yellow box area where the color edge is weak. On the other hand, with the guidance of the accompanying color image, WLMNC-AR achieves better performance than that of WLMNC in terms of preserving depth discontinuities in the red box area where misdivisions of skeletal block structures exist (see Fig. 7(b)). Algorithm 2 shows the pseudo-code of the rough-to-fine depth recovery method based on the WLMNC-AR model in the local stage.

As shown in Algorithm 2, there are four time-consuming parts in the rough-to-fine depth recovery algorithm based on the WLMNC-AR model: 1) computing the label $j \in \langle J \rangle$ of the nearest skeletal block structure for each remote human pixel $p_i$, $i \in \langle N \rangle$, which has a time complexity of $O(NJ)$; 2) computing the rough depth value $D_r(p_i)$ for each remote human pixel $p_i$, $i \in \langle N \rangle$, which has a time complexity of $O(N)$; 3) constructing the AR coefficients, which has a time complexity of $O(NH^2W^2)$; and 4) the quadratic optimization in solving the AR model using a matrix division solver, which has a time complexity of $O(N^3)$. Since $J, H, W \ll N$, the overall time complexity of the rough-to-fine depth recovery algorithm based on the WLMNC-AR model is $O(N^3)$.

IV. EXPERIMENTS AND RESULTS

In this section, we present the experimental results of the proposed WLMNC distance based human depth recovery method. All experimental methods are implemented using Matlab and tested on a desktop PC with an i7-4770 CPU
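Once the AR coefficients $\alpha_{i,j}$ are fixed, Eq. (26) is an unconstrained quadratic and can be solved from its normal equations. The toy solver below (Python/NumPy; a 1-D pixel chain with made-up coefficients, not the paper's Matlab implementation) shows that structure:

```python
import numpy as np

def solve_ar_model(alpha, joint_mask, d_obs, lam=1.0):
    """Solve Eq. (26) in closed form for a toy 1-D pixel ordering.

    alpha:      (N, N) AR coefficients, rows summing to 1 (alpha[i, i] = 0)
    joint_mask: (N,) 1 where a skeleton-joint depth is observed, else 0
    d_obs:      (N,) observed depths (only masked entries are used)

    Minimizing  sum_j m_j (D_j - D^o_j)^2 + lam * ||(I - alpha) D||^2
    gives the normal equations  (S + lam * A^T A) D = S D^o,
    with S = diag(joint_mask) and A = I - alpha."""
    N = len(d_obs)
    S = np.diag(joint_mask.astype(float))
    A = np.eye(N) - alpha
    lhs = S + lam * (A.T @ A)
    rhs = S @ (joint_mask * d_obs)
    return np.linalg.solve(lhs, rhs)

# Toy chain of 5 pixels: depths observed only at both ends; each AR
# coefficient averages the two neighbors, so the solution fills the
# unobserved pixels smoothly between the two observations.
N = 5
alpha = np.zeros((N, N))
for i in range(N):
    nbrs = [j for j in (i - 1, i + 1) if 0 <= j < N]
    for j in nbrs:
        alpha[i, j] = 1.0 / len(nbrs)
mask = np.array([1, 0, 0, 0, 1])
d_obs = np.array([0.0, 0.0, 0.0, 0.0, 8.0])
print(np.round(solve_ar_model(alpha, mask, d_obs), 2))
```

The dense `np.linalg.solve` mirrors the $O(N^3)$ "matrix division solver" cost noted above; a practical implementation would exploit the sparsity of $I - \alpha$.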
depth map of the remote human given only the depth data of skeleton joints, we consider the iterative Gaussian filtering method (denoted as GF), which performs iterative nearby interpolation with a Gaussian filter.

• To verify the effectiveness of the fine depth recovery method based on the rough depth guided AR model (WLMNC-AR), we compare WLMNC-AR with NC-AR, LMNC-AR and GF-AR, which denote the AR model guided with a rough depth map estimated by NC, LMNC and GF, respectively. We also compare WLMNC-AR with ten other state-of-the-art methods, as shown in Table IV. We omit the comparison with total generalized variation (TGV) [19], which does not converge within $3 \times 10^4$ steps when given the depth data of only a few skeleton joints.

In the comparison experiments, the mean absolute difference (MAD) between the estimated depth map and the ground truth depth map is employed to evaluate the depth recovery performance of the different methods. Since the motivation of this paper is to develop an efficient method to highly compress the depth data of the remote human at the remote end and to accurately recover the depth data of the remote human at the local end, we directly set the observed depth map of the remote human captured at the remote end as the ground truth depth. In the visual comparisons (Fig. 6, Fig. 7, Fig. 8, Fig. 9 and Fig. 10), regions highlighted by rectangles are enlarged, and the error maps are obtained by subtracting the ground truth depth from the recovered depth for easy visual assessment.

TABLE III: Parameter Settings for the WLMNC-AR Model and the Compared AR Models

C. Parameter Setup

For the rough depth recovery method based on WLMNC and the compared LMNC method, the depth recovery performance depends on the related distance, which is highly dependent on the parameters $\sigma$, $\eta$, $\varsigma$ and $T$. $\sigma$ determines the penalty weight $\omega_{il}$ on invading triples. $\eta$ adjusts the importance of the two terms of the objective function for training the WLMNC distance or the LMNC distance. The step parameter $\varsigma$ and the maximum iteration number $T$ control the convergence level. These four parameters directly impact the performance of the learned WLMNC distance or LMNC distance. In the experiments, we take $\eta = 2^{-10}$, $\varsigma = 0.5 \times 10^{-3}$, and $T = 80$ for both LMNC and WLMNC and set $\sigma = 1$ for WLMNC using the grid search technique. The grid search ranges for $\sigma$, $\eta$, $\varsigma$ and $T$ are $\{2^{-10}, 2^{-9}, \cdots, 2^{10}\}$, $\{2^{-10}, 2^{-9}, \cdots, 2^{0}\}$, $0.5 \times \{10^{-6}, 10^{-5}, \cdots, 10^{-2}\}$ and $\{10, 20, \cdots, 100\}$, respectively. The precision-tolerant parameter is empirically set to $10^{-3}$.

For the fine depth recovery method based on WLMNC-AR and the compared AR models, the depth recovery performance is highly dependent on the related guide depth map and the parameter setting in the AR model. Since $\sigma_1$ determines the weight of the guide depth, a grid search for $\sigma_1$ in $\{2^0, 2^1, \cdots, 2^7\}$ is employed for the AR model with different guide depths. As the best fine depth recovery results of these methods on each posture in the benchmark dataset correspond to very different value settings of $\sigma_1$, $\sigma_1$ is separately set to the optimal value for each posture for the sake of fairness. Table III shows the detailed optimal parameter settings. The other parameters, e.g., window size $H$ and patch size $W$, are set to the values in the implementation code provided by Yang et al. [16].

Fig. 8. Visual quality comparison of the depth recovery on the Hug and Shoulder to shoulder postures: (a) ground truth depth map and color image and depth maps recovered by (b) GF (MAD: 5.40; 3.99), (c) NC (MAD: 6.27; 3.77), (d) LMNC (MAD: 4.83; 3.42), and (e) WLMNC (MAD: 4.15; 3.07). The first and second MADs for each method are for the Hug and Shoulder to shoulder postures, respectively. This figure demonstrates that WLMNC can obtain a more accurate rough depth map than those of the other rough depth recovery methods when there are depth discontinuities in the remote human.

D. Results Analysis

1) Rough Depth Recovery Accuracy: Table IV shows the quantitative depth recovery results (in MAD) on the benchmark dataset using the proposed method and the other comparison methods. As shown in Table IV, compared with NC, WLMNC reduces the overall depth recovery error by more than 2 cm on the Hug posture labeled as A, which demonstrates that WLMNC can make more precise predictions of depth values by improving the division accuracy of the skeletal block structures among depth discontinuities using the trained WLMNC distance. As shown in Fig. 3, compared with classic NC clustering, WLMNC distance augmented clustering greatly reduces the number of trunk pixels in the highlighted area
HUANG et al.: WLMNC DISTANCE-BASED HUMAN DEPTH RECOVERY WITH LIMITED BANDWIDTH CONSUMPTION 5739
TABLE IV
T HE D EPTH R ECOVERY MAD ( CM ) OF THE B ENCHMARK D ATASET U SING D IFFERENT D EPTH R ECOVERY M ETHODS . T HE R ESULT “W ITHOUT N OISE ”
IS O BTAINED ON THE O RIGINAL B ENCHMARK D ATASET, W HILE THE R ESULT “W ITH N OISE ” I S O BTAINED ON THE D EGRADED B ENCHMARK
D ATASET, W HERE THE L OCATION OF S KELETON J OINTS AND THE A CCOMPANYING C OLOR I MAGE H AVE A DDED G AUSSIAN N OISE
W ITH A S TANDARD VARIANCE OF 5 AND 25, R ESPECTIVELY. T HE B EST R ESULTS A RE
IN B OLD , AND THE S ECOND B EST A RE UNDERLINED
that are misclustered into the arm block structures, which demonstrates the effectiveness of the proposed WLMNC distance in reducing the clustering error rate among depth discontinuities caused by occlusion between skeletal block structures. As shown in Table IV, the proposed WLMNC method also achieves a lower MAD than that of LMNC on all six human postures. This result indicates that by introducing the penalty weight, the WLMNC distance becomes more effective in widening the distance between the human pixels and other skeletal block structures with different depths. Therefore, the WLMNC distance augmented clustering approach can reduce misclustering between neighboring skeletal block structures with occlusion and contribute to better performance in depth recovery. We also find that by making use of the prior information of the skeletal block structure, the average depth recovery error of WLMNC is lower than that of the nearby interpolation method using GF.

Fig. 8 shows a visual comparison of the depth recovery results for the Hug and Shoulder to shoulder postures using different rough depth recovery methods. As shown in Fig. 8, the depth maps of both human postures recovered by GF are blurred among the arm block structures and the neighboring trunk structures, and the depth map recovered by NC is error-prone in those areas due to misdivision of the skeletal block structures. When augmented with the WLMNC (LMNC) distance, WLMNC (LMNC) achieves much better depth recovery results in those areas. By more heavily penalizing the invading triples with large depth differences, WLMNC outperforms LMNC in terms of preserving depth discontinuities.

2) Fine Depth Recovery Accuracy: As shown in Table IV, by employing the rough depth map recovered by WLMNC as the guide depth map of the AR model, the average resulting depth recovery error of WLMNC-AR is lower than that of the other AR models, which demonstrates that a better guide depth map is the key element in improving the depth recovery performance of the AR model. We also find that WLMNC-AR obtains performance comparable to that of WLMNC for the Hug and Kick postures, labeled as A and E, respectively, which further demonstrates the effectiveness of WLMNC.

As shown in Fig. 9, the results of NC-AR (GF-AR), which uses a rough depth map estimated by NC (GF) for the guide weight, suffer from blurred depth discontinuities. By contrast, guided with a more accurate rough depth map estimated by WLMNC, WLMNC-AR yields much better depth recovery results in the highlighted regions, as shown in Fig. 9.

By comparing Fig. 9(e) with Fig. 8(e), we can see that the depth maps recovered by WLMNC-AR and WLMNC are similar in the highlighted regions of the Hug and Shoulder to shoulder human postures, which demonstrates that WLMNC is competitive in preserving the depth discontinuities of a remote human with consistent occlusion between skeletal block structures. However, for the Tease posture, with a variety of occlusions in the remote human, the result of WLMNC is prone to error (see Fig. 7(b)) due to incorrect skeletal block structure division. By introducing the guidance of the accompanying color image, WLMNC-AR achieves
better depth recovery performance for this human posture (see Fig. 7(e)).

Fig. 9. Visual quality comparison for depth recovery on the Hug and Shoulder to shoulder postures: (a) ground truth depth map and color image and depth maps recovered by (b) GF-AR (MAD: 5.03; 4.05), (c) NC-AR (MAD: 5.69; 3.36), (d) LMNC-AR (MAD: 4.77; 3.06), and (e) WLMNC-AR (MAD: 4.18; 2.56). The first and second MADs for each method are for the Hug and Shoulder to shoulder postures, respectively. This figure demonstrates that, compared with other fine depth recovery methods, the proposed WLMNC-AR method can obtain much better depth recovery performance in terms of preserving depth discontinuities.

As shown in Table IV, the proposed WLMNC-AR method obtains the lowest average MAD on the benchmark dataset (especially for the (F) Shoulder to shoulder posture). As shown in Fig. 10, the proposed WLMNC-AR method outperforms the other state-of-the-art comparison methods in preserving depth discontinuities caused by various types of occlusion in the remote human. The depth maps recovered by the color guided depth recovery methods, namely, JBF [24], Guided [23], CLMF0 [22], CLMF1 [22], JGU [20] and WLS [18], [43], suffer from texture copy artifacts and blurred depth discontinuities. The depth recovery result of Edge [17], [21] also suffers from jaggy artifacts at the depth boundaries. According to [13] and [14], the quality of an iteratively updated depth map is much better than that of the initial depth map estimated by interpolation methods, which helps the RWLS (BA-RWLS) model not only to suppress texture copy artifacts but also to preserve sharper depth discontinuities than those of the AR model in [16]. However, as shown in Table IV, the average depth recovery MAD of the RWLS (BA-RWLS) model [13], [14] is higher than that of the GF-AR model [16] on the six human postures, which demonstrates that the iteratively updated depth map is not reliable in human depth recovery with depth data of only a few skeleton joints. In fact, for human depth recovery with such a high upsampling rate, only the color term of the RWLS (BA-RWLS) model [13], [14] and the GF-AR model [16] plays a significant role in the depth recovery performance. In contrast to the color term in the RWLS (BA-RWLS) model [13], [14], the color term in the GF-AR model [16] has a shape-adaptive neighborhood, which increases the opportunities to exploit more correlations for pixels around discontinuities. Therefore, the depth recovery performance of the GF-AR model [16] is slightly better than that of the RWLS (BA-RWLS) model [13], [14]. As shown in Table IV, the RWLS (BA-RWLS) model [13], [14] achieves almost the same performance as that of the WLS model [18], [43] on the six human postures. The depth discontinuities in the highlighted regions of the depth map recovered by the RWLS (BA-RWLS) model [13], [14] are severely blurred, as shown in Fig. 10(i) and Fig. 10(j). In contrast to the RWLS (BA-RWLS) model [13], [14], the FGI method [12] can preserve depth discontinuities and achieves the lowest MAD for the (A) Hug posture, see Table IV. As shown in Fig. 10(k), the depth maps recovered by the FGI method [12] show clear boundaries without blurring. However, when the color edges are not consistent with the depth discontinuities, the FGI method [12] tends to generate depth recovery results with texture copy artifacts, especially on the Kick and Shoulder to shoulder postures. By contrast, guided by a more accurate rough depth map and a more efficient color term, the proposed WLMNC-AR method can preserve depth discontinuities and suppress texture copy artifacts, as shown in Fig. 10(l).

3) Upsampling Rate: As shown in Table IV, the proposed WLMNC method can obtain a low depth recovery error when given pseudo 3D information (including coordinates and depth information) of only a few skeleton joints of the remote human and a learned WLMNC distance matrix. Compared with the RGB-D data transmission protocol used in [5], the proposed method can greatly reduce the amount of data transmitted and the bandwidth consumption, enabling smoother remote video interaction. In particular, for the six human postures captured by Kinect I, the depth map of the remote human is of VGA size, i.e., 640 × 480 = 307200 values, while the proposed method needs to transmit only the pseudo 3D information of 20 skeleton joints of the remote human and the learned WLMNC distance matrix, a total of 20 × 3 + 4 × 4 = 76 values, which is equivalent to downsampling the depth map of the remote human at a sampling rate of √(307200/76) ≈ 64×. Based on the undersampled data, the proposed method can accurately recover the depth data of the remote human at the local end, which demonstrates the effectiveness of the proposed method under such a high upsampling rate.

4) Stability Analysis: We also evaluated the stability of the proposed method to variation in the input information, including the locations of the skeleton joints and the accompanying color image. Table IV (b) shows the results on the degraded benchmark dataset, where the locations of the skeleton joints and the accompanying color image have added Gaussian noise with standard deviations of 5 and 25, respectively. These results clearly demonstrate that the proposed WLMNC-AR method is stable and outperforms the other state-of-the-art depth recovery methods.
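The bandwidth arithmetic in the Upsampling Rate discussion can be checked directly; a quick sanity check of the transmitted data count and the equivalent per-axis sampling rate:

```python
import math

depth_values = 640 * 480      # VGA depth map sent by an RGB-D transmission protocol
transmitted = 20 * 3 + 4 * 4  # 20 joints x 3 pseudo-3D values, plus a 4x4 WLMNC distance matrix

# Equivalent per-axis downsampling factor: sqrt of the data reduction ratio.
sampling_rate = math.sqrt(depth_values / transmitted)  # ~64x
```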
Fig. 10. Depth recovery results of the Tease posture. The depth recovery results of the five other human postures are listed in the supplementary material. (a) ground truth depth map and color image and depth maps (average MAD) recovered by (b) JBF [24] (4.21), (c) Guided [23] (4.30), (d) Edge [17], [21] (4.30), (e) CLMF0 [22] (4.29), (f) CLMF1 [22] (4.28), (g) JGU [20] (4.68), (h) WLS [18], [43] (4.21), (i) RWLS [13] (4.20), (j) BA-RWLS [14] (4.18), (k) FGI [12] (4.11), and (l) WLMNC-AR (3.56). This figure demonstrates that the proposed WLMNC-AR method outperforms the other state-of-the-art methods in terms of preserving depth discontinuities. Here, the average MAD is calculated based on the whole dataset.
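The MAD values quoted in these captions are mean absolute differences, in centimeters, between a recovered depth map and the ground truth. A minimal sketch with NumPy (any masking of invalid or background pixels, which the text does not detail, is omitted):

```python
import numpy as np

def mad_cm(recovered, ground_truth):
    """Mean absolute difference (cm) between two depth maps of equal shape."""
    diff = recovered.astype(np.float64) - ground_truth.astype(np.float64)
    return float(np.mean(np.abs(diff)))
```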
V. DISCUSSIONS AND FUTURE WORK

A. Acceleration Strategy

In our experiments, the Matlab implementation of the WLMNC distance learning algorithm, i.e., Algorithm 1, takes 7.93 seconds on average to learn the WLMNC distance for a remote human of VGA size with approximately 40000 pixels and 17 valid skeletal block structures, while the rough-to-fine depth recovery algorithm, i.e., Algorithm 2, takes 0.07 seconds and 2.5 minutes on average to obtain the rough and fine human depth maps, respectively. We suggest the following two aspects to reduce the computational complexity and make the proposed method more practical:
• Parallelizability. It is easy to check that the time-consuming steps in Algorithm 1 and Algorithm 2 can be parallelized. A preliminary GPU version of the AR model was implemented in [16] and took 2.8 seconds on average, approximately 40× faster than the CPU version. Thus, we expect a high acceleration ratio for a GPU version of the proposed method.
• Temporal information. Since the human depth recovery method introduced here is designed for tele-immersive video interaction systems, i.e., video processing, a common acceleration strategy is to use the temporal information of the video. The most time-consuming part of Algorithm 1 is the iteration scheme. For video processing, suppose the output BS matrix for frame i − 1 is M_{i−1}; we expect faster convergence if we set M_0 = M_{i−1} to learn the BS matrix M_i for frame i, because neighboring frames are similar. A similar idea can be found in the spatial-temporal recovery of depth sequences [65], [66].
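The warm start described in the Temporal information bullet can be sketched as follows; `learn_bs_matrix` is a hypothetical stand-in for the iterative scheme of Algorithm 1 that accepts an initial BS matrix `M0` and reports how many iterations it needed.

```python
def recover_video(frames, learn_bs_matrix, M_init):
    """Learn a BS matrix per frame, initializing each frame's iteration
    from the previous frame's converged result (M0 = M_{i-1})."""
    results, M_prev = [], M_init
    for frame in frames:
        # Warm start: neighboring frames are similar, so the previous
        # frame's converged matrix is a good initial guess.
        M_prev, iters = learn_bs_matrix(frame, M0=M_prev)
        results.append((M_prev, iters))
    return results
```

With similar consecutive frames, every frame after the first should converge in noticeably fewer iterations than a cold start.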
B. Practical Challenges

This paper focused on algorithm development and its theoretical analysis. However, there are at least two practical challenges:
• Chroma key based coding. Chroma key based coding is an important distortion source when WLMNC-AR is applied to practical tele-immersive video interaction systems. It may affect the quality of the guide color image under different settings, such as the data size of the compressed images and the transmission bandwidth of the video stream. Although a simple numerical stability analysis has been established in this paper, the influence of this distortion source on the proposed depth recovery method should be investigated in more detail.
• Occlusion handling performance. The experimental results have demonstrated the effectiveness of the proposed method in human depth recovery, especially for the situation of self-occlusion. However, better depth recovery performance does not mean more accurate occlusion handling results; the latter should also consider occlusion with the local human. An additional experiment on the impact of WLMNC-AR on occlusion handling with various types of interaction behaviors should be performed.

C. Future Work

To greatly reduce bandwidth consumption, the WLMNC distance learning method introduced here uses a unified distance for all clusters. According to [27], the learned distance formulated by Eq. (3) employs a linear operator on the input samples, which leads to limited performance in situations with complex occlusion. Thus, we will consider introducing a nonlinear operator, e.g., a neural network, to construct the distance in future work.

VI. CONCLUSION

This paper presented a WLMNC distance based human depth recovery method that can accurately recover the depth map of a remote human with depth data of only a few skeleton joints to obtain occlusion-consistent composition results for tele-immersive video interaction systems in low-bandwidth environments. Specifically, at the remote end, we first used the skeleton joint information to highly compress the depth data of the remote human to a BS matrix by using a WLMNC distance learning method, which was equivalent to downsampling the remote human depth map with a 64× sampling rate. At the local end, we first proposed a rough depth recovery method based on WLMNC distance augmented clustering, which can yield better depth recovery results than those of classic NC clustering when depth discontinuities exist in the remote human. Then, we employed the rough estimated depth map in the AR model [16] as a guide depth map to obtain a fine depth map that can preserve depth discontinuities and
suppress texture copy artifacts. A theoretical analysis was also conducted to guarantee the effectiveness of the proposed method. To benchmark human depth recovery methods, a novel dataset containing various types of human postures with self-occlusion was built. Comparisons with the state-of-the-art depth recovery methods demonstrated the effectiveness of the proposed method for human depth recovery with a high upsampling rate on the benchmark dataset.

ACKNOWLEDGMENT

The authors thank Dr. Zhecai Chen, Yang He, and XiaoYi Zhang for participating in building the dataset. The authors are grateful to the referees for their valuable comments and suggestions, which have helped us to significantly improve the presentation of this paper.

REFERENCES

[1] S.-Y. Lee, I.-J. Kim, S. C. Ahn, M.-T. Lim, and H.-G. Kim, "Toward immersive telecommunication: 3D video avatar with physical interaction," in Proc. ICAT, 2005, pp. 56–61.
[2] T. Ogi, T. Yamada, K. Tamagawa, M. Kano, and M. Hirose, "Immersive telecommunication using stereo video avatar," in Proc. VR, Mar. 2001, p. 45.
[3] J. Lu, V. A. Nguyen, Z. Niu, B. Singh, Z. Luo, and M. N. Do, "CuteChat: A lightweight tele-immersive video chat system," in Proc. MM, 2011, pp. 1309–1312.
[4] S. Follmer, R. Ballagas, H. Raffle, M. Spasojevic, and H. Ishii, "People in books: Using a FlashCam to become part of an interactive book for connected reading," in Proc. CSCW, 2012, pp. 685–694.
[5] S. E. Hunter, P. Maes, A. Tang, K. M. Inkpen, and S. M. Hessey, "WaaZam!: Supporting creative play at a distance in customized video environments," in Proc. CHI, 2014, pp. 1197–1206.
[6] M. Huang, Y. Chen, L. Yin, and W. Ji, "Ti-photograph: A tele-immersive photograph system for distributed parents and children," in Proc. 15th ACM Int. Conf. Ubiquitous Comput. (Ubicomp Adjunct Publication), 2013, pp. 259–262.
[7] S. Liu, C. Yu, and Y. Shi, "Video avatar-based remote video collaboration," J. Beijing Univ. Aeronaut. Astronaut., vol. 41, no. 6, pp. 1087–1094, 2015.
[8] H. Lu et al., "Depth map reconstruction for underwater Kinect camera using inpainting and local image mode filtering," IEEE Access, vol. 5, pp. 7115–7122, 2017.
[9] N. Yu et al., "Super resolving of the depth map for 3D reconstruction of underwater terrain using Kinect," in Proc. IEEE Int. Conf. Parallel Distrib. Syst., Dec. 2016, pp. 1237–1240.
[10] W. Dong, G. Shi, X. Li, K. Peng, J. Wu, and Z. Guo, "Color-guided depth recovery via joint local structural and nonlocal low-rank regularization," IEEE Trans. Multimedia, vol. 19, no. 2, pp. 293–301, Feb. 2017.
[11] D. Chetverikov, Image-Guided ToF Depth Upsampling: A Survey. New York, NY, USA: Springer-Verlag, 2017.
[12] Y. Li, D. Min, M. N. Do, and J. Lu, "Fast guided global interpolation for depth and motion," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 717–733.
[13] W. Liu, X. Chen, J. Yang, and Q. Wu, "Robust weighted least squares for guided depth upsampling," in Proc. ICIP, Sep. 2016, pp. 559–563.
[14] W. Liu, X. Chen, J. Yang, and Q. Wu, "Robust color guided depth map restoration," IEEE Trans. Image Process., vol. 26, no. 1, pp. 315–327, Jan. 2017.
[15] H. H. Kwon, Y.-W. Tai, and S. Lin, "Data-driven depth map refinement via multi-scale sparse representation," in Proc. CVPR, Jun. 2015, pp. 159–167.
[16] J. Yang, X. Ye, K. Li, C. Hou, and Y. Wang, "Color-guided depth recovery from RGB-D data using an adaptive autoregressive model," IEEE Trans. Image Process., vol. 23, no. 8, pp. 3443–3458, Aug. 2014.
[17] J. Park, H. Kim, Y.-W. Tai, M. S. Brown, and I. S. Kweon, "High-quality depth map upsampling and completion for RGB-D cameras," IEEE Trans. Image Process., vol. 23, no. 12, pp. 5559–5572, Dec. 2014.
[18] D. Min, S. Choi, J. Lu, B. Ham, K. Sohn, and M. Do, "Fast global image smoothing based on weighted least squares," IEEE Trans. Image Process., vol. 23, no. 12, pp. 5638–5653, Dec. 2014.
[19] D. Ferstl, C. Reinbacher, R. Ranftl, M. Ruether, and H. Bischof, "Image guided depth upsampling using anisotropic total generalized variation," in Proc. ICCV, Dec. 2013, pp. 993–1000.
[20] M.-Y. Liu, O. Tuzel, and Y. Taguchi, "Joint geodesic upsampling of depth images," in Proc. CVPR, Jun. 2013, pp. 169–176.
[21] J. Park, H. Kim, Y.-W. Tai, M. S. Brown, and I. Kweon, "High quality depth map upsampling for 3D-TOF cameras," in Proc. ICCV, Nov. 2012, pp. 1623–1630.
[22] J. Lu, K. Shi, D. Min, L. Lin, and M. N. Do, "Cross-based local multipoint filtering," in Proc. CVPR, Jun. 2012, pp. 430–437.
[23] K. He, J. Sun, and X. Tang, "Guided image filtering," in Proc. ECCV, 2010, pp. 1–14.
[24] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele, "Joint bilateral upsampling," ACM Trans. Graph., vol. 26, no. 3, p. 96, Jul. 2007.
[25] J. Shotton et al., "Real-time human pose recognition in parts from single depth images," in Proc. CVPR, Jun. 2011, pp. 1297–1304.
[26] K. Q. Weinberger and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," J. Mach. Learn. Res., vol. 10, pp. 207–244, Feb. 2009.
[27] M. Huang, Y. Chen, B.-W. Chen, J. Liu, S. Rho, and W. Ji, "A semi-supervised privacy-preserving clustering algorithm for healthcare," Peer-Peer Netw. Appl., vol. 9, no. 5, pp. 864–875, 2016.
[28] R. Azuma, Y. Baillot, R. Behringer, S. Feiner, S. Julier, and B. MacIntyre, "Recent advances in augmented reality," IEEE Comput. Graph. Appl., vol. 21, no. 6, pp. 34–47, Nov. 2001.
[29] W. Xu, Y. Wang, Y. Liu, and D. Weng, "Survey on occlusion handling in augmented reality," J. Comput.-Aided Des. Comput. Graph., vol. 25, no. 11, pp. 1635–1642, 2013.
[30] J. Zhu, Z. Pan, C. Sun, and W. Chen, "Handling occlusions in video-based augmented reality using depth information," Comput. Animation Virtual Worlds, vol. 21, no. 5, pp. 509–521, 2010.
[31] R. A. Newcombe et al., "KinectFusion: Real-time dense surface mapping and tracking," in Proc. ISMAR, Oct. 2011, pp. 127–136.
[32] B. V. Lu, T. Kakuta, R. Kawakami, T. Oishi, and K. Ikeuchi, "Foreground and shadow occlusion handling for outdoor augmented reality," in Proc. ISMAR, Oct. 2010, pp. 109–118.
[33] D. Scharstein, R. Szeliski, and R. Zabih, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," in Proc. IEEE Workshop Stereo Multi-Baseline Vision, Dec. 2002, pp. 7–42.
[34] A. Kolb, E. Barth, R. Koch, and R. Larsen, "Time-of-flight cameras in computer graphics," Comput. Graph. Forum, vol. 29, no. 1, pp. 141–159, 2010.
[35] K. Lai, L. Bo, X. Ren, and D. Fox, "A large-scale hierarchical multi-view RGB-D object dataset," in Proc. ICRA, May 2011, pp. 1817–1824.
[36] C. Ti, G. Xu, Y. Guan, and Y. Teng, "Depth recovery for Kinect sensor using contour-guided adaptive morphology filter," IEEE Sensors J., vol. 17, no. 14, pp. 4534–4543, Jul. 2017.
[37] A. Atapour-Abarghouei and T. P. Breckon, "DepthComp: Real-time depth image completion based on prior semantic scene segmentation," in Proc. 28th Brit. Mach. Vis. Conf. (BMVC), London, U.K., Sep. 2017. [Online]. Available: http://dro.dur.ac.uk/22375/1/22375.pdf?DDD10+qhww73+d700tmt
[38] H.-T. Zhang, J. Yu, and Z.-F. Wang, "Probability contour guided depth map inpainting and superresolution using non-local total generalized variation," Multimedia Tools Appl., vol. 77, no. 7, pp. 9003–9020, 2017.
[39] M. Hornácek, C. Rhemann, M. Gelautz, and C. Rother, "Depth super resolution by rigid body self-similarity in 3D," in Proc. CVPR, Jun. 2013, pp. 1123–1130.
[40] J. Li, Z. Lu, G. Zeng, R. Gan, and H. Zha, "Similarity-aware patchwork assembly for depth image super-resolution," in Proc. CVPR, Jun. 2014, pp. 3374–3381.
[41] U. Hahne and M. Alexa, "Exposure fusion for time-of-flight imaging," Comput. Graph. Forum, vol. 30, no. 7, pp. 1887–1894, 2011.
[42] Q. Wang, S. Li, H. Qin, and A. Hao, "Super-resolution of multi-observed RGB-D images based on nonlocal regression and total variation," IEEE Trans. Image Process., vol. 25, no. 3, pp. 1425–1440, Mar. 2016.
[43] J. Diebel and S. Thrun, "An application of Markov random fields to range sensing," in Proc. NIPS, 2005, pp. 291–298.
[44] S. C. H. Hoi, W. Liu, and S.-F. Chang, "Semi-supervised distance metric learning for collaborative image retrieval and clustering," ACM Trans. Multimedia Comput., Commun., Appl., vol. 6, no. 3, 2010, Art. no. 18.
[45] M. Guillaumin, J. Verbeek, and C. Schmid, "Multiple instance metric learning from automatically labeled bags of faces," in Proc. ECCV, 2010, pp. 634–647.
[46] L. Wu, S. C. H. Hoi, R. Jin, J. Zhu, and N. Yu, "Distance metric learning from uncertain side information with application to automated photo tagging," in Proc. ACM Int. Conf. Multimedia (MM), 2009, pp. 135–144.
[47] M. Bilenko, S. Basu, and R. J. Mooney, "Integrating constraints and metric learning in semi-supervised clustering," in Proc. ICML, 2004, pp. 81–88.
[48] A. Bellet, A. Habrard, and M. Sebban, "A survey on metric learning for feature vectors and structured data," Comput. Sci., 2013.
[49] L. Yang and R. Jin, "Distance metric learning: A comprehensive survey," Michigan State Univ., East Lansing, MI, USA, Tech. Rep., 2006.
[50] Q. Qian, J. Hu, R. Jin, J. Pei, and S. Zhu, "Distance metric learning using dropout: A structured regularization approach," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2014, pp. 323–332.
[51] H. V. Nguyen and L. Bai, "Cosine similarity metric learning for face verification," in Proc. ACCV, 2010, pp. 709–720.
[52] D. Klein, S. D. Kamvar, and C. D. Manning, "From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering," in Proc. ICML, 2002, pp. 307–314.
[53] D. Cohn, R. Caruana, and A. McCallum, "Semi-supervised clustering with user feedback," in Constrained Clustering: Advances in Algorithms, Theory, and Applications, vol. 4. 2003, pp. 17–32.
[54] L. Wu, S. C. H. Hoi, R. Jin, J. Zhu, and N. Yu, "Learning Bregman distance functions for semi-supervised clustering," IEEE Trans. Knowl. Data Eng., vol. 24, no. 3, pp. 478–491, Mar. 2012.
[55] C. Domeniconi, J. Peng, and B. Yan, "Composite kernels for semi-supervised clustering," Knowl. Inf. Syst., vol. 28, no. 1, pp. 99–116, 2011.
[56] Y. Chen, M. Rege, M. Dong, and J. Hua, "Incorporating user provided constraints into document clustering," in Proc. ICDM, Oct. 2007, pp. 103–112.
[57] M. S. Baghshah and S. B. Shouraki, "Kernel-based metric learning for semi-supervised clustering," Neurocomputing, vol. 73, nos. 7–9, pp. 1352–1361, 2010.
[58] S. C. H. Hoi, R. Jin, and M. R. Lyu, "Learning nonparametric kernel matrices from pairwise constraints," in Proc. ICML, 2007, pp. 361–368.
[59] H.-J. Ye, D.-C. Zhan, X.-M. Si, and Y. Jiang, "Learning Mahalanobis distance metric: Considering instance disturbance helps," in Proc. 26th Int. Joint Conf. Artif. Intell. (IJCAI), 2017, pp. 3315–3321.
[60] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, "Information-theoretic metric learning," in Proc. ICML, 2007, pp. 209–216.
[61] K. Q. Weinberger, F. Sha, and L. K. Saul, "Convex optimizations for distance metric learning and pattern classification [applications corner]," IEEE Signal Process. Mag., vol. 27, no. 3, pp. 146–158, May 2010.
[62] L. Vandenberghe and S. Boyd, "Semidefinite programming," SIAM Rev., vol. 38, no. 1, pp. 49–95, 1996.
[63] D. P. Bertsekas, "On the Goldstein-Levitin-Polyak gradient projection method," IEEE Trans. Autom. Control, vol. 21, no. 2, pp. 174–184, Apr. 1976.
[64] M. Huang, Y. Chen, W. Ji, and C. Miao, "Accurate and robust moving-object segmentation for telepresence systems," ACM Trans. Intell. Syst. Technol., vol. 6, no. 2, Mar. 2015, Art. no. 17.
[65] J. Zhu, L. Wang, J. Gao, and R. Yang, "Spatial-temporal fusion for high accuracy depth maps using dynamic MRFs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 5, pp. 899–909, May 2010.
[66] D. Min, J. Lu, and M. N. Do, "Depth video enhancement based on weighted mode filtering," IEEE Trans. Image Process., vol. 21, no. 3, pp. 1176–1190, Mar. 2012.

Meiyu Huang received the B.S. degree in computer science and technology from the Huazhong University of Science and Technology, Wuhan, China, in 2010, and the Ph.D. degree in computer application technology from the University of Chinese Academy of Sciences, Beijing, China, in 2016. She is currently an Assistant Researcher with the Qian Xuesen Laboratory of Space Technology, China Academy of Space Technology, Beijing. Her research interests include machine learning, ubiquitous computing, human–computer interaction, computer vision, and image processing.

Xueshuang Xiang received the B.S. degree in computational mathematics from Wuhan University, Wuhan, China, in 2009, and the Ph.D. degree in computational mathematics from the Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China, in 2014. In 2016, he was a Post-Doctoral Researcher with the Department of Mathematics, National University of Singapore, Singapore. He is currently an Associate Researcher with the Qian Xuesen Laboratory of Space Technology, China Academy of Space Technology, Beijing. His research interests include numerical methods for partial differential equations, image processing, and deep learning.

Yiqiang Chen received the B.S. and M.S. degrees in computer science from Xiangtan University, Xiangtan, China, in 1996 and 1999, respectively, and the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing, China, in 2003. In 2004, he was a Visiting Scholar Researcher with the Department of Computer Science, The Hong Kong University of Science and Technology, Hong Kong. He is currently a Professor and the Director of the Pervasive Computing Research Center, Institute of Computing Technology, CAS. His research interests include artificial intelligence, pervasive computing, and human–computer interaction.

Da Fan received the B.S. degree in measurement-control technology and instrumentation from Tsinghua University, Beijing, China, in 2008, and the Ph.D. degree in instrument science and technology from Tsinghua University, Beijing, in 2013. He is currently an Assistant Researcher with the Qian Xuesen Laboratory of Space Technology, China Academy of Space Technology, Beijing. His research interests include machine learning, deep learning, computer vision, and intelligent control.