
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 27, NO. 12, DECEMBER 2018

Weighted Large Margin Nearest Center Distance-Based Human Depth Recovery With Limited Bandwidth Consumption

Meiyu Huang, Xueshuang Xiang, Yiqiang Chen, Member, IEEE, and Da Fan

Abstract—This paper proposes a weighted large margin nearest center (WLMNC) distance-based human depth recovery method for tele-immersive video interaction systems with limited bandwidth consumption. In the remote stage, the proposed method highly compresses the depth data of the remote human into skeletal block structures by learning the WLMNC distance, which is equivalent to downsampling the human depth map at 64× the sampling rate. In the local stage, the method first recovers a rough human depth map based on a WLMNC distance augmented clustering approach and then obtains a fine depth map based on a rough depth-guided autoregressive model to preserve the depth discontinuities and suppress texture copy artifacts. The proposed WLMNC distance is learned by the large margin clustering problem with a weighted hinge loss to balance the clustering accuracy and depth recovery accuracy and is verified to be able to preserve depth discontinuities between skeletal block structures with occlusion. A theoretical analysis is conducted to verify the effectiveness of using the weighted hinge loss. Furthermore, a novel data set containing various types of human postures with self-occlusion is built to benchmark the human depth recovery methods. The quantitative comparison with the state-of-the-art depth recovery methods on the introduced benchmark data set demonstrates the effectiveness of the proposed method for human depth recovery with such a high upsampling rate.

Index Terms—Tele-immersive interaction, occlusion handling, depth recovery, distance learning.

I. INTRODUCTION

REMOTE video communication and collaboration through the Internet has become a popular application with the development of social networks. Instead of providing only a live window, as in the conventional video conferencing technologies (like Skype), tele-immersive video interaction technology aims to enhance the feeling of immersion through merging distributed users' video into the same virtual space [1], [2] and is widely applied to video chat [3], story reading [4] and remote play [5]–[7]. The essential task to create an immersive sense in tele-immersive video interaction systems is to correctly handle the occlusion relationship between the remote and local humans (user video) on the virtual background.

Traditional tele-immersive video interaction systems, such as CuteChat [3] and People in Books [4], predefine the layer and position of the remote and local humans on the virtual background to avoid occlusion yet limit body language interaction. Equipped with a Kinect or Creative Senz3D depth sensor, the newly developed remote collaboration systems [5], [7] can correctly determine the occlusion relationship of the remote and local humans. However, these systems require transmission of the co-registered depth data along with the video data, resulting in additional bandwidth consumption.

To reduce the bandwidth requirement and make remote collaboration systems [5], [7] more practical, it is important to highly compress the depth data at the remote end and to employ depth recovery methods to accurately recover the compressed depth data at the local end. Existing state-of-the-art depth recovery methods [8]–[24] focus on recovering the conventional depth map and can work well only for upsampling rates up to 16×. By contrast, the background of a depth map is subtracted in tele-immersive video interaction systems, and only the depth data of the remote human needs to be recovered. Moreover, different from the conventional depth map, the depth data of the remote human can be assumed to be piecewise linear using the skeleton joint information [25]. Considering these two factors, this paper proposes a weighted large margin nearest center (WLMNC) distance based human depth recovery method that deeply exploits the prior information, i.e., the skeleton joint information of human postures, such that it can achieve high recovery accuracy even for upsampling rates up to 64× (see discussion in Section IV-D.3

Manuscript received November 13, 2017; revised April 12, 2018, June 4, 2018, and June 29, 2018; accepted June 29, 2018. Date of publication July 12, 2018; date of current version September 4, 2018. This work was supported in part by the National Natural Science Foundation of China under Grant 61702520, Grant 61773383, and Grant 61572471, in part by the National Key Research and Development Program of China under Grant 2017YFC0803401, in part by the Beijing Municipal Science and Technology Commission under Grant Z171100000117017, and in part by the Innovation Foundation of Qian Xuesen Laboratory of Space Technology. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Jie Liang. (Corresponding author: Xueshuang Xiang.)

M. Huang, X. Xiang, and D. Fan are with the Qian Xuesen Laboratory of Space Technology, China Academy of Space Technology, Beijing 100094, China (e-mail: huangmeiyu@qxslab.cn; xiangxueshuang@qxslab.cn; fanda@qxslab.cn).

Y. Chen is with the Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100049, China.

This paper has supplementary downloadable material available at http://ieeexplore.ieee.org, provided by the author. The material includes the depth recovery results of the proposed WLMNC-AR method and the comparison methods on the other five human postures in the benchmark dataset. Contact huangmeiyu@qxslab.cn for further questions about this work.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2018.2855414
1057-7149 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Fig. 1. The framework of the proposed WLMNC distance based human depth recovery method. In the remote stage, the human depth map is compressed
to a BS matrix using the WLMNC distance learning method based on skeleton joint information. In the local stage, the rough human depth map is recovered
based on WLMNC distance augmented clustering and then used as a guide weight in the AR model with the accompanying color image to recover a fine
depth map.

for details). As shown in Fig. 1, the proposed depth recovery method consists of two stages:
• Remote Stage: Based on the human depth map and skeleton joint information, i.e., human pose detected by [25], the proposed method compresses the depth data of the remote human to a block structure (BS) matrix by using the WLMNC distance learning method. Then, the BS matrix, skeleton joint information and the accompanying color image are transmitted to the local end.
• Local Stage: Under the piecewise linear assumption on remote human depth data, the proposed method recovers a rough human depth map based on WLMNC distance augmented clustering. The recovered rough depth map and the accompanying color image are then used as a guide weight in the AR model to recover a fine depth map for preserving depth discontinuities and suppressing texture copy artifacts.
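To put the 64× rate in perspective, the arithmetic below counts the depth samples that survive the remote-stage compression. This is our own back-of-the-envelope sketch using the VGA resolution mentioned in the Fig. 2 caption, not a measurement from the paper:

```python
# Back-of-the-envelope sample counts for the remote stage (our own
# arithmetic): a VGA depth map versus a representation whose sampling
# grid is downsampled at 64x the sampling rate.
width, height = 640, 480       # VGA depth map (see Fig. 2)
rate = 64                      # downsampling rate of the sampling grid
full = width * height          # depth samples per frame before compression
compressed = full // rate      # samples surviving 64x downsampling
print(full, compressed)        # 307200 vs 4800 samples per frame
```

Even before entropy coding, transmitting roughly 1/64 of the depth samples (plus the skeleton joints) is what makes the depth channel affordable next to the video stream.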
Fig. 2. Remote human depth maps recovered by the proposed method for various types of human postures collected in our experiment: (a) Hug, (b) Tease, (c) Handshake, (d) Hand in hand, (e) Kick, (f) Shoulder to shoulder. Each human posture includes a color image, depth map and skeleton joint information of the remote human. Note that the depth maps and color images are actually of VGA size. For easy visual inspection, only the region of the remote human is shown in this paper. The depth recovery results of the other state-of-the-art comparison methods are shown in Fig. 10 and the supplementary material.

The contribution of our work is summarized into the following three aspects:
• Proposing a depth recovery framework for human posture: In contrast to depth upsampling problems in 3DTV or 3D reconstruction applications [8]–[10], [12], [14]–[17], we investigate the depth recovery problem for the remote human in tele-immersive video interaction systems, and the proposed method can accurately recover the highly compressed depth data of the remote human. In addition, to benchmark human depth recovery methods, we build a novel dataset containing various types of human postures with self-occlusion, as shown in Fig. 2.
• Using skeleton joint information to highly compress the human depth data at the remote end: We use skeleton joint information, i.e., human pose detected by [25], to compress the human depth data into a BS matrix at the remote end, which is equivalent to downsampling the remote human depth map at 64× the sampling rate, more than the 16× considered in the depth recovery literature [10], [12], [14]–[17].
• Proposing a weighted hinge loss in the large margin nearest clustering problem: Instead of using the normal hinge loss in [26] and [27], we use the weighted hinge loss to balance the clustering accuracy and depth recovery accuracy, i.e., tolerating the misclustering error of two skeletal block structures with similar depths while heavily penalizing the misclustering error of two skeletal block structures with different depths. This idea is verified to be able to preserve depth discontinuities between skeletal block structures with occlusion. A theoretical analysis is conducted to verify the effectiveness of using the weighted hinge loss.

II. RELATED WORK

A. Occlusion Handling

The concept of occlusion handling was first mentioned in augmented reality systems [28] primarily to achieve the correct occlusion relationship between virtual and real objects.

The correct occlusion relationship can enhance the viewer's sense of immersion in the surrounding environment, which helps the viewer to make correct judgments. According to [29], previous occlusion handling methods can be divided into three main categories: depth data based [30], modeling based [31], and image analysis based [32]. In tele-immersive systems based on composited video environments [3]–[7], the occlusion relationship between virtual and real objects is quickly handled using an image analysis based method [32], namely, directly laying the remote and local humans on top of the virtual background. The image analysis based method [32] enables distributed users to focus on the humans but easily causes incorrect overlap between the remote and local humans. To overcome this problem, in CuteChat [3] and People in Books [4], the layer and position of the remote and local humans in the virtual space are directly predefined. However, this leads to limited interaction between distributed users in the virtual space. To enhance the interaction experience of distributed users, a special RGB-D data transmission protocol is customized in Waazam [5] and Video Avatar [7] so that the depth data of the remote human can be transmitted synchronously with the video data. Then, using the depth data based method [30], the occlusion relationship between the remote and local humans can be correctly determined to achieve an occlusion consistent video composition. The customized RGB-D data transmission protocol greatly increases bandwidth consumption, which makes it difficult to obtain smooth transmission over the Internet and reduces the distributed users' quality of experience during remote video communication. To reduce bandwidth consumption, it is important to develop a strategy to highly compress the depth data of the remote human at the remote end and to accurately recover the depth data of the remote human under such a high upsampling rate at the local end.

B. Depth Recovery

Depth recovery is developed for upsampling, or inpainting, a low-quality depth map captured by depth sensors. Since stereo matching based depth acquisition methods require accurate image rectification and are inefficient for textureless areas [33], in recent years, there has been enormous interest in depth sensor (including ToF camera and Kinect) based depth acquisition methods [34], [35]. Even though the new depth capturing techniques are promising, the use of depth cameras is limited by the low quality of the produced depth maps, e.g., low resolution, noise, and missing depth in some areas. To compensate for the missing and inaccurate depth measurements of Kinect, some image inpainting techniques have been developed [8], [9], [36]–[38]. These methods achieve good quality for smooth regions but may introduce artifacts, e.g., jagging, blurring, and ringing, around thin structures or sharp discontinuities.

To address the undersampling of ToF cameras, upsampling methods with a single depth map, such as [40] and [41], have been introduced. A low-resolution depth map can also be upsampled by integrating multiple low-resolution depth maps of the same scene, such as in [42] and [43]. Another popular approach is to upsample the noisy low-resolution depth map with the guidance of the accompanying high-resolution color image, such as [11], [18]–[22], [24], and [25]. These methods are mostly based on the assumption that depth discontinuities and image edges co-occur in the same scene [24], [43]. Image guided upsampling methods can yield much better upsampling quality than single depth map upsampling [39] and do not need any prior database, in contrast to the existing methods [40]. Additionally, they are not subject to static scenes and do not require complicated camera calibration processes. However, when the color edges are inconsistent with the depth discontinuities, the upsampled depth map suffers from texture copy artifacts and blurred depth discontinuities.

To handle these two issues, most recent methods design complex guidance weights based on guide color images and heuristically take the initial interpolation of the input depth map into account. Park et al. proposed an edge-weighted NLM-regularization (Edge) [17], [21] method. The method used a non-local term to regularize depth maps combined with a weighting scheme that involved edge, gradient, and segmentation information extracted from high-quality color images and gradient information extracted from the bicubic interpolated depth map. However, jaggy artifacts still occurred in some boundaries. Yang et al. [16] proposed a color guided AR model that took an initial interpolated depth map as the definition of the AR coefficient. As reported in [16], the AR model can achieve good performance for handling the inconsistency between the color edge and the depth discontinuity, i.e., suppressing texture copy artifacts and preserving depth discontinuities when the corresponding color edge is weak. Dong et al. also proposed a color-guided depth recovery method via joint local structural and nonlocal low-rank regularization [10]. This method jointly exploited local and nonlocal color-depth dependencies and outperformed the AR model [16]. However, a complex guidance weight does not always help to improve the upsampling quality. Moreover, the initial depth map estimated by interpolating the noisy low-resolution depth map becomes unreliable, especially when the upsampling rate is very large.

To address this issue, Liu et al. [13], [14] proposed a robust weighted least squares (RWLS) model, which used an iteratively updated depth map as the guide weight. According to [13] and [14], using a depth map for the guide weight is the key element in improving depth recovery performance. Furthermore, as a guide depth map, the iteratively updated depth map is better than an initial interpolated depth map, which contributes to better performance in preserving sharp depth discontinuities. However, the iteratively updated depth map suffers in human depth recovery with depth data of only a few skeleton joints, as proved in our experiment. Li et al. [12] also proposed a depth recovery method guided by a cascadingly interpolated depth map. As clarified in [12], the cascaded scheme effectively addresses the potential structural inconsistency between the sparse input data and the guide image while preserving depth boundaries. However, when given a very sparse input dataset, the method tends to generate depth recovery results with texture copy artifacts.

In summary, existing depth recovery methods can only be applied to repair a small amount of missing or noisy depth

TABLE I
SYMBOL DEFINITION RULES

values or to upsample a depth map with an upsampling rate up to 16× and thus cannot be applied to depth recovery problems in cases where there is little known depth information. To generate a better guide depth map in this specific situation, we propose a rough depth recovery method based on WLMNC distance augmented clustering. This method is similar to the work in [37], which nonparametrically filled missing depth values based on prior semantic scene segmentation. However, the work in [37] was designed for inpainting a conventional depth map and performed scene segmentation over a co-registered color image. The proposed method is designed for human depth recovery and performs block structure segmentation with prior skeleton joint information of human bodies.

C. Distance Learning

Distance learning, which trains a new metric to satisfy the labels or constraints in the input data to enhance classification or clustering performance [44]–[47], is widely used in distance or similarity based machine learning, pattern recognition and data mining applications [48], [49].

In recent years, several distance learning methods have been developed, including distance metric learning using structured regularization [50], cosine similarity metric learning for face verification [51], Euclidean distance trained by shortest-path algorithms [52], KL divergence adapted using gradient descent [53], Bregman divergence trained using nonlinear learning [54], kernel similarity modified by incorporating the constraints in the objective function [55], [56] or using nonparametric approaches [57], [58], and Mahalanobis distances trained in [60] and [61].

The WLMNC distance learning method proposed in this paper is largely inspired by the recent work on Mahalanobis distance learning using convex optimization [61], especially the large margin nearest neighbor (LMNN) distance learning method [26] and the large margin nearest cluster (LMNC) distance learning method proposed by us in [27]. The LMNN distance learning method [26] aims to study a generalized Mahalanobis distance with the goal that the k-nearest neighbors always belong to the same class while samples from different classes are separated by a large margin to improve the classification performance of the k-nearest neighbor classification method. Different from LMNN, the LMNC distance learning method [27] is designed mainly for clustering problems with the target of narrowing the distance between each sample and its cluster center while widening the distance between each sample and other heterogeneous cluster centers to achieve better clustering performance. The proposed WLMNC distance learning method is adapted from the LMNC distance learning method [27], which learns an adaptive margin between different clusters by adjusting the penalty weight on heterogeneous clusters. By adopting the WLMNC distance to augment the clustering approach, we can greatly improve the skeletal block structure division performance among depth discontinuities due to occlusion between skeletal block structures and thus obtain more reliable depth estimation for the remote human. As shown in Fig. 2, the proposed method can provide good depth recovery performance for various types of human postures.

III. WLMNC DISTANCE BASED HUMAN DEPTH RECOVERY

In this section, we will introduce the WLMNC distance based human depth recovery method, which consists of two stages: 1) Remote stage, where a WLMNC distance learning method is proposed to highly compress the depth data of the remote human based on the skeleton joint information; 2) Local stage, where a rough-to-fine depth recovery framework is proposed to accurately recover the depth data of the remote human based on the learned WLMNC distance, skeleton joint information and the accompanying color image. To make the following description clearer, some complex symbol definition rules are shown in Table I.

A. Remote Stage

Since human posture can be divided into several skeletal block structures, e.g., head, hand and foot, and the depth data of each skeletal block structure is assumed to be smooth, the depth data of the remote human can be considered to be a piecewise linear function. Therefore, at the remote end, a natural way to compress the depth data of the remote human is to detect the skeleton joints and then transmit the depth data of only the extracted skeleton joints. Afterwards, at the local end, one method is to divide the remote human into several skeletal block structures using the received skeleton joints and then recover the depth data of the remote human based on the depth data of the skeletal block structures. However, when there are occlusions between different skeletal block structures, the skeletal block structure division results can be prone to error. To achieve better division performance in this situation, we propose a WLMNC distance learning method that can learn a matrix to preserve the skeletal block structure information of the remote human.

The rest of this section will introduce the classic method, i.e., nearest center (NC) clustering, which can be used for dividing human pixels into different skeletal block structures, and will then present our WLMNC distance learning method for augmenting classic NC clustering.
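The divide-then-recover procedure just described can be sketched in a few lines. The following is our own minimal illustration under the paper's piecewise linear assumption, with hypothetical helper names rather than the authors' implementation: a pixel is assigned to the nearest skeletal block structure (sum of squared distances to the block's two joints), and its depth is then linearly interpolated between the observed depths of those joints.

```python
import numpy as np

def assign_block(p, blocks):
    """Assign pixel p to the nearest skeletal block structure, measuring the
    distance to a block as the sum of squared distances to its two joints."""
    d2 = [np.sum((p - s1) ** 2) + np.sum((p - s2) ** 2) for (s1, s2) in blocks]
    return int(np.argmin(d2))

def recover_depth(p, block, joint_depths):
    """Piecewise linear recovery: project p onto the bone connecting the two
    joints and interpolate their observed depths."""
    (s1, s2), (d1, d2) = block, joint_depths
    bone = s2 - s1
    # Parameter along the bone, clamped to the segment [s1, s2].
    t = np.clip(np.dot(p - s1, bone) / np.dot(bone, bone), 0.0, 1.0)
    return (1.0 - t) * d1 + t * d2

# Toy skeleton: two "bones" sharing a joint, with observed joint depths.
joints = [np.array([0.0, 0.0]), np.array([0.0, 4.0]), np.array([3.0, 4.0])]
blocks = [(joints[0], joints[1]), (joints[1], joints[2])]
depths = [(2.0, 3.0), (3.0, 4.0)]          # observed depths per bone

p = np.array([0.0, 2.0])                   # pixel halfway along the first bone
j = assign_block(p, blocks)
print(j, recover_depth(p, blocks[j], depths[j]))  # 0 2.5
```

As the paper notes, this plain Euclidean assignment is exactly what breaks down under occlusion, which is what motivates replacing the distance in `assign_block` with the learned WLMNC distance.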

Fig. 3. Skeletal block structure division results of the Hug posture using (b) NC clustering and (c) WLMNC distance augmented clustering. The result of WLMNC distance augmented clustering clearly demonstrates its better performance in dividing skeletal block structures with occlusion compared with the result of NC clustering.

1) Classic Nearest Center Clustering: According to [25], human posture can be represented by several skeleton joints (see Fig. 2). Let J + 1 denote the number of skeleton joints; the remote human can then be divided into J skeletal block structures. Intuitively, each pixel of the remote human should belong to the nearest skeletal block structure, and the classic NC clustering approach is a natural choice for skeletal block structure division. Denote \Omega = \{p_i\}_{i=1}^{N} \subset R^2 as the pixel set of the remote human and \Omega_s = \{s_j\}_{j=1}^{J+1} \subset \Omega as the skeleton joint set, where p_i and s_j are pixels of two-dimensional image coordinates. For any positive integer A, denote \langle A \rangle = \{1, 2, \cdots, A\}. Let the skeleton joints associated with skeletal block structure B_j, j \in \langle J \rangle, be \{s_{j1}, s_{j2}\}, s_{j1}, s_{j2} \in \Omega_s, and define B_j = [s_{j1}, s_{j2}]^T. Then, for any pixel p_i \in \Omega, the distance between p_i, i \in \langle N \rangle, and skeletal block structure B_j, j \in \langle J \rangle, is defined as:

    d_{2D}(p_i, B_j) = \sqrt{\|p_i - s_{j1}\|^2 + \|p_i - s_{j2}\|^2},    (1)

which is equivalent to the Euclidean distance between vector P_i = [p_i, p_i]^T and B_j. Here, each pixel p_i is represented as a four-dimensional vector P_i by repeating its two-dimensional image coordinate twice. Thus, each pixel p_i can be divided into the corresponding skeletal block structure using the following NC clustering approach:

    \min_{j \in \langle J \rangle} d_{2D}(p_i, B_j) = \min_{j \in \langle J \rangle} \|P_i - B_j\|_2.    (2)

2) WLMNC Distance Learning: Since two-dimensional image coordinates do not contain depth information, the nearest skeletal block structure of each pixel based on classic NC clustering may not be the nearest one in the three-dimensional physical space, especially in the case that there exist occlusions in the remote human. As shown in Fig. 2(a), suppose the remote human stretches her hands before her trunk to hug the local human; the result of the skeletal block structures divided by classic NC clustering in Eq. (2) is prone to error (see Fig. 3(b)). Some of the pixels belonging to the trunk block structures are clearly misclassified as pixels of the arm block structures.

To address the above issue, a WLMNC distance learning method is proposed in the remote stage to narrow the distance between each pixel and its real nearest skeletal block structure in the three-dimensional physical space while widening the distance between each pixel and other skeletal block structures with different depths.

The WLMNC distance is an adapted version of the LMNC distance proposed by us in [27]. The LMNC distance is a generalized Mahalanobis distance that is optimized by convex optimization with the objective integrating two parts: (a) minimizing the distance between the sample and its target cluster center; (b) maximizing the distance between the sample and other differently labeled cluster centers. The LMNC distance between human pixel p_i and skeletal block structure B_j is defined as follows:

    d_M(p_i, B_j) = \sqrt{(P_i - B_j)^T M (P_i - B_j)},    (3)

where M is a 4 \times 4 semidefinite matrix. When M is an identity matrix, the LMNC distance is equivalent to the Euclidean distance in Eq. (2), which is used by classic NC clustering. The squared LMNC distance is denoted as follows:

    D_M(p_i, B_j) = d_M^2(p_i, B_j).    (4)

Let h(p_i) \in \langle J \rangle denote the label of pixel p_i's real nearest skeletal block structure in the three-dimensional physical space, which is calculated as follows:

    h(p_i) = \arg\min_{j \in \langle J \rangle} d_{3D}(p_i, B_j),    (5)

where

    d_{3D}(p_i, B_j) = \{\|P_i - B_j\|^2 + (D^o(p_i) - D^o(s_{j1}))^2 + (D^o(p_i) - D^o(s_{j2}))^2\}^{1/2}.    (6)

In the above equation, D^o(p_i), D^o(s_{j1}), and D^o(s_{j2}) denote the observed depth acquired at the remote end for pixels p_i, s_{j1}, and s_{j2}, respectively.

The matrix M (LMNC distance) can then be learned by minimizing the following objective function:

    \varepsilon(M) = \eta \sum_{ij} \delta_{ij} D_M(p_i, B_j) + (1 - \eta) \sum_{ijl} \delta_{ij} (1 - \delta_{il}) [1 + D_M(p_i, B_j) - D_M(p_i, B_l)]_+,    (7)

where \delta_{ij} \in \{0, 1\} indicates whether label h(p_i) is equal to j, [\chi]_+ = \max(\chi, 0) denotes the standard hinge loss, and \eta \le 1 is a positive constant for adjusting the two terms by (a) penalizing large distances between each pixel and its target skeletal block structure and (b) penalizing small distances between each pixel and all other differently labeled skeletal block structures. For each pixel p_i, hinge loss occurs when the squared LMNC distance of p_i to a skeletal block structure with a different label does not exceed the squared LMNC distance of p_i to its target skeletal block structure plus one absolute unit of distance. In other words, hinge loss makes the objective function in Eq. (7) potentially penalize triples (i, j, l) as follows:

    D_M(p_i, B_l) < D_M(p_i, B_j) + 1,  \delta_{ij} = 1,  \delta_{il} = 0.    (8)

The benefit of using the hinge loss to construct the second term in Eq. (7) is to maintain a large margin between

Fig. 5. Illustrations of the comparison of the traditional LMNC distance


learning method and the proposed WLMNC distance learning method. We can
clearly see that the proposed WLMNC distance aims to learn an adaptive
margin between two different clusters (skeletal block structures) instead of a
fixed large margin as the LMNC distance. Clusters with similar depths (color)
can be separated with a small margin.

Fig. 4. Clustering error rate of skeletal block structures and MAD of depth
recovery using WLMNC. This figure demonstrates that 1) the best skeletal
block structure division result (A) does not contribute to the lowest MAD of in Fig. 4, the best depth recovery result (in terms of MAD)
depth recovery (B), and 2) WLMNC (B) can achieve better depth recovery based on WLMNC distance with σ = 1 (B) is lower
performance than that of LMNC (C) by adaptively adjusting the penalty
weight ωil of the invading triples. than that based on LMNC distance, which is approximately
σ = 2−10 (C). In fact, by introducing the penalty weight w j l ,
the proposed WLMNC distance learns an adaptive margin
different skeletal block structures under the LMNC distance. between two different skeletal block structures instead of a
This large margin is natural in general clustering problems fixed large margin as the LMNC distance. As shown in Fig. 5,
because each cluster is always distributed in a different region with the WLMNC distance, only skeletal block structures with
of the sample space. However, when dividing skeletal block different depths are separated with a large margin, and a small
structures, clusters are always neighboring, which makes it margin is maintained for those skeletal block structures with
difficult to maintain a large margin between different clusters. similar depths.
Moreover, this paper focuses on obtaining a better depth recov- To improve the computational efficiency, we reformulate the
ery result instead of better clustering performance. Therefore, optimization of Eq. (9) as an instance of semidefinite program-
misclassification among neighboring skeletal block structures ming (SDP) [62]. A SDP problem is a linear programming
with similar depths can be tolerated. Considering the above two factors, we propose a WLMNC distance learning method that minimizes the following objective function:

\varepsilon(M) = \eta \sum_{i,j} \delta_{ij} D_M(p_i, B_j) + (1-\eta) \sum_{i,j,l} \delta_{ij}(1-\delta_{il})\,\omega_{il}\,[1 + D_M(p_i, B_j) - D_M(p_i, B_l)]_+ ,   (9)

where an additional coefficient ω_il is introduced to adaptively adjust the penalty weight of the hinge loss caused by the invading triples defined above. Specifically, ω_il is defined as 1 − ξ_il with

\xi_{il} = \exp\Big(-\frac{(D^o(p_i) - D^o(s_{l1}))^2 + (D^o(p_i) - D^o(s_{l2}))^2}{2\sigma^2}\Big),   (10)

which depends on the depth difference between pixel p_i and skeletal block structure B_l; σ is a predefined constant. The above definition of ω_il means that the objective function in Eq. (9) penalizes more heavily the hinge loss caused by invading triples with larger depth differences.

As shown in Fig. 4, the best depth recovery result (B) based on the WLMNC distance does not correspond to the lowest clustering error rate of skeletal block structures (A), which verifies the assumption that each invading triple does not have the same impact on the depth recovery performance. By adaptively adjusting the penalty weight ω_il of the invading triples according to depth differences, the proposed WLMNC distance can greatly reduce the invading triples among the depth discontinuities due to occlusion in the remote human, resulting in better depth recovery performance. As shown

problem with the additional constraint that a matrix, whose elements are linear in the unknown variables, is required to be semidefinite. According to [61], SDPs are convex and can be effectively solved. By introducing slack variables ζ_ijl to simplify the hinge loss in Eq. (9), the resulting SDP is given by:

\min\; \eta \sum_{i,j} \delta_{ij} D_M(p_i, B_j) + (1-\eta) \sum_{i,j,l} \delta_{ij}(1-\delta_{il})\,\omega_{il}\,\zeta_{ijl}
\quad \text{s.t.}\; M \succeq 0,\; \zeta_{ijl} \ge 0,\; D_M(p_i, B_l) - D_M(p_i, B_j) \ge 1 - \zeta_{ijl}.   (11)

Based on the gradient projection algorithm [63], this paper implements an expert solver for the above SDP problem. Let C_{ij} = (P_i − B_j)(P_i − B_j)^T; the squared WLMNC distance corresponding to the M_t generated in iteration t can then be written as

D_{M_t}(p_i, B_j) = \mathrm{tr}(M_t C_{ij}),   (12)

where tr(X) denotes the trace of matrix X. Hence, the objective function in Eq. (9) can be rewritten as follows:

\varepsilon(M_t) = \eta \sum_{i,j} \delta_{ij}\,\mathrm{tr}(M_t C_{ij}) + (1-\eta) \sum_{i,j,l} \delta_{ij}(1-\delta_{il})\,\omega_{il}\,[1 + \mathrm{tr}(M_t C_{ij}) - \mathrm{tr}(M_t C_{il})]_+ .   (13)

Let T_t denote the set of all triples (i, j, l) satisfying Eq. (8), i.e., making the second term of Eq. (13) greater than zero.
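As a concrete illustration of Eqs. (9)-(10), the objective can be evaluated numerically as sketched below. This is an illustrative Python/NumPy fragment under simplified, hypothetical data structures (pixel features, block-structure representatives and target assignments are placeholders); the paper's actual implementation is in Matlab.

```python
import numpy as np

def wlmnc_objective(M, P, B, target, joint_depths, pixel_depths,
                    eta=2 ** -10, sigma=1.0):
    """Evaluate the WLMNC objective of Eq. (9).

    M : (d, d) candidate BS matrix
    P : (N, d) pixel feature vectors p_i
    B : (J, d) skeletal block structure representatives B_j
    target : (N,) index of the target block structure h(p_i)
    joint_depths : (J, 2) observed joint depths D^o(s_j1), D^o(s_j2)
    pixel_depths : (N,) observed pixel depths D^o(p_i)
    """
    N = P.shape[0]
    diff = P[:, None, :] - B[None, :, :]              # (N, J, d)
    # squared Mahalanobis distances D_M(p_i, B_j)
    D = np.einsum('njd,de,nje->nj', diff, M, diff)

    pull = D[np.arange(N), target].sum()              # first term of Eq. (9)

    # penalty weights omega_il = 1 - xi_il, Eq. (10)
    d1 = (pixel_depths[:, None] - joint_depths[None, :, 0]) ** 2
    d2 = (pixel_depths[:, None] - joint_depths[None, :, 1]) ** 2
    omega = 1.0 - np.exp(-(d1 + d2) / (2 * sigma ** 2))

    push = 0.0
    for i in range(N):
        j = target[i]
        for l in range(B.shape[0]):
            if l != j:                                # delta_ij (1 - delta_il)
                hinge = max(0.0, 1.0 + D[i, j] - D[i, l])
                push += omega[i, l] * hinge
    return eta * pull + (1 - eta) * push
```

With η close to 1 the pull term dominates; the grid-searched setting η = 2⁻¹⁰ used later in the paper weights the hinge term heavily instead.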
Algorithm 1 The Pseudo-Code of the WLMNC Distance Learning Method in the Remote Stage

The gradient of the objective function represented by Eq. (13) is computed as:

G_t = \frac{\partial \varepsilon(M_t)}{\partial M} = \eta \sum_{i,j} \delta_{ij} C_{ij} + (1-\eta) \sum_{(i,j,l)\in T_t} \omega_{il}\,(C_{ij} - C_{il}).   (14)

By introducing the triple set T_t, the constraint of the minimization problem of the cost function in Eq. (9), or of the SDP problem represented by Eq. (11), reduces to requiring that the matrix M_t be semidefinite. Denote by ρ(X) the projection of matrix X onto the semidefinite cone. The pseudo-code of the final solver for the WLMNC distance is shown in Algorithm 1. Since the WLMNC distance, represented by the matrix M, is the key point in dividing skeletal block structures, we call M the block structure (BS) matrix.

As shown in Algorithm 1, there are three time-consuming parts in the WLMNC distance learning algorithm: 1) computing the label h(p_i) ∈ {1, …, J} of the target skeletal block structure for each remote human pixel p_i, i ∈ {1, …, N}, which has O(NJ) time complexity; 2) computing the invading triple set T_t based on M_t, which has O(NJ) time complexity in each iteration; and 3) computing the gradient G_t in each iteration, which has O(NJ) time complexity. Suppose that the maximum iteration number is T; then, the WLMNC distance learning algorithm has an overall time complexity of O(NJT).

The following theorem indicates that the proposed WLMNC distance learning method can maintain a small margin between two different skeletal block structures B_j and B_l of the same depth by reducing the weight ω_il when the target skeletal block structure of pixel p_i is B_j, which is consistent with the illustrations in Fig. 5.

Theorem 1: Suppose B_{j*} has the same depth as B_{l*}. Assume that there exists a pixel p_{i*}, a value domain (γ_1, γ_2) and a parameter set (ω_il, ς, η) such that the triple set T_t of Algorithm 1 does not change for any ω_{i*l*} ∈ (γ_1, γ_2) and (i*, j*, l*) ∈ T_t. Define the output BS matrix of Algorithm 1 with ω^q_{i*l*} ∈ (γ_1, γ_2) after t iterations as M^q_t, q = 1, 2. Then, if ω^1_{i*l*} ≤ ω^2_{i*l*}, then d_{M^1_t}(p_{i*}, B_{l*}) ≤ d_{M^2_t}(p_{i*}, B_{l*}).

Proof: By an induction argument, we need to prove only that if d_{M^1_t}(p_{i*}, B_{l*}) ≤ d_{M^2_t}(p_{i*}, B_{l*}), then d_{M^1_{t+1}}(p_{i*}, B_{l*}) ≤ d_{M^2_{t+1}}(p_{i*}, B_{l*}). Combined with M_0 = E in Algorithm 1, this completes the proof.

Suppose d_{M^1_t}(p_{i*}, B_{l*}) ≤ d_{M^2_t}(p_{i*}, B_{l*}). By Eq. (4) and Eq. (12), we have

\mathrm{tr}(M^1_t C_{i^*l^*}) \le \mathrm{tr}(M^2_t C_{i^*l^*}).   (15)

In practice, we can choose a sufficiently small step parameter ς such that the projection of matrix X onto the semidefinite cone satisfies ρ(X) = X; then, we have M^q_{t+1} = M^q_t − ς G^q_t, where G^q_t is defined according to Eq. (14) with weight ω^q_{i*l*}, q = 1, 2. Then, under the assumption that T_t does not change for ω^q_{i*l*}, q = 1, 2, we have

M^1_{t+1} - M^2_{t+1} = M^1_t - M^2_t + \varsigma(1-\eta)(\omega^2_{i^*l^*} - \omega^1_{i^*l^*})(C_{i^*j^*} - C_{i^*l^*}).   (16)

If we can prove

\mathrm{tr}((C_{i^*j^*} - C_{i^*l^*})\,C_{i^*l^*}) \le 0,   (17)

then by combining Eq. (16) with Eq. (15), and with ς > 0, η < 1 and ω^1_{i*l*} ≤ ω^2_{i*l*}, we can obtain tr(M^1_{t+1} C_{i*l*}) ≤ tr(M^2_{t+1} C_{i*l*}), such that d_{M^1_{t+1}}(p_{i*}, B_{l*}) ≤ d_{M^2_{t+1}}(p_{i*}, B_{l*}). Then, the proof is complete.

By direct calculation, the left side of Eq. (17) is

\mathrm{tr}((C_{i^*j^*} - C_{i^*l^*})C_{i^*l^*}) = \mathrm{tr}(C_{i^*j^*}C_{i^*l^*} - C_{i^*l^*}C_{i^*l^*})
= \mathrm{tr}\big((P_{i^*} - B_{j^*})(P_{i^*} - B_{j^*})^T (P_{i^*} - B_{l^*})(P_{i^*} - B_{l^*})^T - (P_{i^*} - B_{l^*})(P_{i^*} - B_{l^*})^T (P_{i^*} - B_{l^*})(P_{i^*} - B_{l^*})^T\big)
= (P_{i^*} - B_{j^*})^T (P_{i^*} - B_{l^*})\,\mathrm{tr}\big((P_{i^*} - B_{j^*})(P_{i^*} - B_{l^*})^T\big) - (P_{i^*} - B_{l^*})^T (P_{i^*} - B_{l^*})\,\mathrm{tr}\big((P_{i^*} - B_{l^*})(P_{i^*} - B_{l^*})^T\big).   (18)

It is easy to check that

\mathrm{tr}\big((P_{i^*} - B_{j^*})(P_{i^*} - B_{l^*})^T\big) = (P_{i^*} - B_{j^*})^T (P_{i^*} - B_{l^*}),
\mathrm{tr}\big((P_{i^*} - B_{l^*})(P_{i^*} - B_{l^*})^T\big) = (P_{i^*} - B_{l^*})^T (P_{i^*} - B_{l^*}).   (19)

Then, Eq. (18) yields

\mathrm{tr}((C_{i^*j^*} - C_{i^*l^*})C_{i^*l^*}) = \big((P_{i^*} - B_{j^*})^T (P_{i^*} - B_{l^*})\big)^2 - \|P_{i^*} - B_{l^*}\|^4.   (20)

Since B_{j*} has the same depth as B_{l*}, we have D^o(s_{j*1}) = D^o(s_{l*1}) and D^o(s_{j*2}) = D^o(s_{l*2}). Thus, by the definition of h(p_{i*}) in Eq. (5), Eq. (6) and h(p_{i*}) = j*, we have

\|P_{i^*} - B_{j^*}\|^2 \le \|P_{i^*} - B_{l^*}\|^2.   (21)

With Eq. (21) and the Cauchy inequality, we obtain

(P_{i^*} - B_{j^*})^T (P_{i^*} - B_{l^*}) \le \frac{\|P_{i^*} - B_{j^*}\|^2 + \|P_{i^*} - B_{l^*}\|^2}{2} \le \|P_{i^*} - B_{l^*}\|^2.   (22)

Then, according to the above inequality and Eq. (20), we obtain the inequality in Eq. (17). This completes the proof.
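The iteration analyzed above, a gradient step following Eqs. (12)-(14) and then the projection ρ onto the semidefinite cone, can be sketched as follows. This is an illustrative Python/NumPy fragment with simplified placeholder data structures, not the paper's Matlab solver; the projection is realized, as is standard, by zeroing negative eigenvalues.

```python
import numpy as np

def gradient_projection_step(M, C, target, omega, eta=2 ** -10, step=0.5e-3):
    """One iteration of Algorithm 1 (sketch).

    M : (d, d) current BS matrix M_t
    C : (N, J, d, d) outer products C_ij = (P_i - B_j)(P_i - B_j)^T
    target : (N,) target block structure index h(p_i)
    omega : (N, J) penalty weights omega_il
    """
    N, J = C.shape[0], C.shape[1]
    D = np.einsum('de,njed->nj', M, C)        # Eq. (12): tr(M_t C_ij)

    G = np.zeros_like(M)
    for i in range(N):
        j = target[i]
        G += eta * C[i, j]                    # first term of Eq. (14)
        for l in range(J):
            # triples with an active hinge in Eq. (13), i.e. invading triples
            if l != j and 1.0 + D[i, j] - D[i, l] > 0:
                G += (1 - eta) * omega[i, l] * (C[i, j] - C[i, l])

    M_new = M - step * G
    # projection rho(X) onto the semidefinite cone
    w, V = np.linalg.eigh((M_new + M_new.T) / 2)
    return (V * np.maximum(w, 0.0)) @ V.T
```

Running this step to a fixed iteration budget T, with the hinge-active set recomputed each iteration, mirrors the structure of Algorithm 1.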
Here, we note that the assumption in Theorem 1 is reasonable. The key algorithm for solving the SDP problem in Eq. (11) is the gradient projection algorithm [63]. As discussed in [63], under the Lipschitz continuity assumption on the function ε(·) and with a suitable step parameter, the gradient projection algorithm makes the iteration in Algorithm 1 converge to a stationary point, denoted by M_{t*}. By the definition of a stationary point in [63, eq. (8)], the stationary point M_{t*} satisfies

\sum_{m,n} (G_{t^*})_{mn} (X - M_{t^*})_{mn} \ge 0, \quad \forall X \in \Psi,   (23)

where Ψ is the set of 4 × 4 semidefinite matrices and m, n ∈ {1, …, 4} are scalar indices. Since M_{t*} is a semidefinite matrix, there must exist an index m̂ such that (M_{t*})_{m̂m̂} > 0. Thus, we can choose a sufficiently small θ and an X with the same entries as M_{t*}, except (X)_{m̂m̂} = (M_{t*})_{m̂m̂} ± θ, which when combined with Eq. (23) yields (G_{t*})_{m̂m̂} = 0. By the definition of G_{t*} in Eq. (14), the triple set T_{t*} should not be empty; otherwise,

\sum_{i,j} \delta_{ij} (C_{ij})_{\hat{m}\hat{m}} = 0,

which means that (P_i)_{m̂} = (B_j)_{m̂} for all pixels p_i with target skeletal block structure B_j. Namely, the skeletal block structure B_j is a line along the horizontal or vertical dimension, which clearly contradicts the situation of concern. Thus, there must exist at least one pixel p_{i*} and two block structures B_{j*} and B_{l*} such that h(p_{i*}) = j* and (i*, j*, l*) ∈ T_{t*}. The above discussion therefore demonstrates that the assumption in Theorem 1 is reasonable.

B. Local Stage

Aiming to accurately recover the depth data of the remote human based on the learned BS matrix, the skeleton joint information and the accompanying color image from the remote stage, we propose a rough-to-fine depth recovery framework in the local stage. The framework consists of a rough depth recovery method based on the divided skeletal block structures using WLMNC distance augmented clustering and the piecewise linear assumption on remote human depth data, as well as a fine depth recovery method based on the AR model guided with the depth map estimated by the rough depth recovery method.

1) Rough Depth Recovery Based on WLMNC Distance Augmented Clustering: Based on the BS matrix M learned in the remote stage, we divide each remote human pixel p_i into the corresponding skeletal block structure according to the following WLMNC distance augmented clustering approach:

\min_{j \in \{1, \dots, J\}} d_M(p_i, B_j).   (24)

Since the WLMNC distance can narrow the distance between each human pixel p_i and its target skeletal block structure B_{h(p_i)} while widening the distance between p_i and other skeletal block structures with different depths, the WLMNC distance augmented clustering approach can improve the accuracy of skeletal block structure division among depth discontinuities. As shown in Fig. 3(c), the number of pixels belonging to the trunk block structure misclassified into the arm block structures is greatly reduced.

According to the physiological characteristics of the human body, each skeletal block structure can be regarded as a rigid body structure; thus, the depth range of each skeletal block structure can be determined by the depth of the two skeleton joints associated with it. Therefore, we propose to use a linear function to estimate each remote human pixel's depth value based on the skeletal block structure divided by WLMNC distance augmented clustering. Assuming that pixel p_i is divided into skeletal block structure B_j, the rough depth D_r(p_i) of pixel p_i can be estimated as follows:

D_r(p_i) = w(p_i)\,D^o(s_{j1}) + (1 - w(p_i))\,D^o(s_{j2}), \quad w(p_i) = \frac{d_M(p_i, P_{j2})}{d_M(p_i, P_{j1}) + d_M(p_i, P_{j2})},   (25)

where P_{j1} = [s_{j1}, s_{j1}]^T and P_{j2} = [s_{j2}, s_{j2}]^T. As shown in Fig. 6, compared with the rough depth recovery method based on NC clustering (see Fig. 6(b)), the proposed rough depth recovery method based on WLMNC distance augmented clustering (abbreviated as WLMNC) performs better in terms of preserving depth discontinuities (see Fig. 6(c)) by improving the clustering accuracy of skeletal block structures with occlusion (see Fig. 3(c)).

Fig. 6. Depth recovery results of the Hug posture: (a) ground truth depth map and color image and depth maps recovered by (b) NC (MAD: 6.27) and (c) WLMNC (MAD: 4.15). The result of WLMNC demonstrates its better performance in preserving depth discontinuities between skeletal block structures with occlusion compared with the result of NC.

2) Fine Depth Recovery Based on Rough Depth Guided AR Model: The above proposed WLMNC method can achieve precise depth recovery for the remote human in the case of accurate skeletal block structure division (see Fig. 6(c)). According to [27], the learned WLMNC distance formulated by Eq. (3) employs a uniform linear transform on the samples, which can separate only invading skeletal block structure pairs with consistent occlusion.
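The rough depth recovery of Eqs. (24)-(25) is a nearest-block-structure assignment followed by a linear blend of the two joint depths. A minimal Python/NumPy sketch under a hypothetical data layout (not the paper's Matlab code; d_M is taken here as the square root of the learned quadratic form):

```python
import numpy as np

def d_M(M, x, y):
    """WLMNC distance between feature vectors x and y (sketch)."""
    diff = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(diff @ M @ diff))

def rough_depth(p, M, blocks, joint_feats, joint_depths):
    """Eqs. (24)-(25): assign pixel feature p to the nearest skeletal
    block structure under d_M, then blend the depths of its two joints.

    blocks : (J, d) block structure representatives B_j
    joint_feats : (J, 2, d) features of the two joints of each block
    joint_depths : (J, 2) observed joint depths D^o(s_j1), D^o(s_j2)
    """
    # Eq. (24): nearest skeletal block structure under the WLMNC distance
    j = int(np.argmin([d_M(M, p, b) for b in blocks]))
    d1 = d_M(M, p, joint_feats[j, 0])
    d2 = d_M(M, p, joint_feats[j, 1])
    w = d2 / (d1 + d2)          # Eq. (25): weight the closer joint more
    return w * joint_depths[j, 0] + (1.0 - w) * joint_depths[j, 1]
```

A pixel midway between the two joints of its block receives the average of their depths, consistent with the piecewise linear assumption.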
Fig. 7. Depth recovery results of the Tease posture: (a) ground truth depth map and color image and depth maps recovered by (b) WLMNC (MAD: 5.16), (c) GF (MAD: 5.45), (d) GF-AR (MAD: 4.54) and (e) WLMNC-AR (MAD: 4.15). This figure clearly demonstrates 1) the necessity of the fine depth recovery method when there exist a variety of occlusions in the remote human, as WLMNC-AR performs better than WLMNC in terms of preserving depth discontinuities in the red box area where skeletal block structures are incorrectly divided; and 2) the effectiveness of using the rough depth map as the guide depth map of the AR model, as WLMNC-AR can suppress texture copy artifacts in the yellow box area where the color edge is weak, whereas GF and GF-AR cannot.

However, there can be a variety of occlusions in the remote human in practical applications; in this case, it is difficult to achieve accurate skeletal block structure division using the WLMNC distance augmented clustering approach. As shown in Fig. 7, for the Tease posture, occlusion exists not only between the left arm block structure and the trunk block structure but also between the right arm block structure and the head block structure. In this situation, the rough depth map recovered by the proposed WLMNC method is prone to error in the red box area because several pixels of the trunk block structure are wrongly clustered into the left arm block structure of the remote human. To improve the depth recovery performance in this situation, we propose a fine depth recovery method based on a rough depth guided AR model, which takes the rough depth map estimated by WLMNC as the guide depth map in the AR model [16].

As discussed in Section II-B, the AR model [16] is a state-of-the-art color guided depth recovery model in terms of suppressing texture copy artifacts and preserving depth discontinuities by using an initially estimated depth map for the guide weight. However, the initial depth map estimated by interpolating the noisy low-resolution depth map becomes unreliable, especially when the upsampling rate is very high. This situation becomes worse in human depth recovery with depth data of only a few skeleton joints. As shown in Fig. 7(c), for the Tease posture, the depth map recovered by GF, an interpolation method with a Gaussian filter, is blurred among depth discontinuities in both highlighted regions. Therefore, when using the AR model guided with the depth map estimated by GF, the depth recovery result suffers from texture copy artifacts in the yellow box area where the color edge is weak; see Fig. 7(d). To preserve the depth discontinuities while suppressing the texture copy artifacts, we employ the above estimated rough depth map in the AR model for better depth recovery performance; see Fig. 7(e).

According to [16], the AR model is defined as follows:

D_f = \arg\min_{D} \Big\{ \sum_{s_j \in s} \big(D(s_j) - D^o(s_j)\big)^2 + \lambda \sum_{p_i \in p} \Big(D(p_i) - \sum_{p_j \in \mathcal{N}(p_i)} \alpha_{i,j} D(p_j)\Big)^2 \Big\},   (26)

where λ is a positive constant, and the weight \alpha_{i,j} = \frac{1}{Z_i}\,\alpha^{\hat{D}}_{i,j}\,\alpha^{I}_{i,j} is defined by

\alpha^{\hat{D}}_{i,j} = \exp\Big(-\frac{\|\hat{D}(p_i) - \hat{D}(p_j)\|^2}{2\sigma_1^2}\Big),
\alpha^{I}_{i,j} = \exp\Big(-\frac{\sum_{k \in \Theta} \|B_i \circ (P^k_i - P^k_j)\|^2}{2 \times 3 \times \sigma_2^2}\Big),
B_i(i, j) = \exp\Big(-\frac{\|p_i - p_j\|^2}{2\sigma_3^2}\Big) \times \exp\Big(-\frac{\sum_{k \in \Theta} \|I^k(p_i) - I^k(p_j)\|^2}{2 \times 3 \times \sigma_4^2}\Big),

where \mathcal{N}(p_i) is the neighborhood of pixel p_i in the H × H square patch centered at p_i, and Z_i is the normalization factor that makes \sum_{p_j \in \mathcal{N}(p_i)} \alpha_{i,j} = 1. \alpha^{\hat{D}}_{i,j} is a depth term defined on the guide depth map D̂, and \alpha^{I}_{i,j} is a color term defined on the accompanying color image. The parameters σ_1 and σ_2 are user-defined constants. Θ = {R, G, B} or Θ = {Y, U, V} represents the different channels of the color image. P^k_i denotes an operator that extracts a W × W patch centered at pixel p_i in color channel k, and "∘" represents element-wise multiplication. B_i is a bilateral filter kernel defined on the extracted W × W patch, and σ_3 and σ_4 are user-defined constants. We refer to [16] for details.

Different from the AR model in [16], which sets the guide depth map D̂ in Eq. (26) as the initial interpolated depth map, the proposed rough depth guided AR model (denoted as WLMNC-AR) sets D̂ as the rough depth map estimated by WLMNC. As shown in Fig. 7(e), compared with GF-AR (see Fig. 7(d)), WLMNC-AR can suppress texture copy artifacts in the yellow box area where the color edge is weak. On the other hand, with the guidance of the accompanying color image, WLMNC-AR achieves better performance than WLMNC in terms of preserving depth discontinuities in the red box area where misdivisions of skeletal block structures exist (see Fig. 7(b)). Algorithm 2 shows the pseudo-code of the rough-to-fine depth recovery method based on the WLMNC-AR model in the local stage.

As shown in Algorithm 2, there are four time-consuming parts in the rough-to-fine depth recovery algorithm based on the WLMNC-AR model: 1) computing the label j ∈ {1, …, J} of the nearest skeletal block structure for each remote human pixel p_i, i ∈ {1, …, N}, which has a time complexity of O(NJ); 2) computing the rough depth value D_r(p_i) for each remote human pixel p_i, i ∈ {1, …, N}, which has a time complexity of O(N); 3) constructing the AR coefficients, which has a time complexity of O(NH²W²); and 4) the quadratic optimization in solving the AR model using a matrix division solver, which has a time complexity of O(N³). Since J, H, W ≪ N, the overall time complexity of the rough-to-fine depth recovery algorithm based on the WLMNC-AR model is O(N³).
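The weight construction in Eq. (26) can be sketched for a single pixel as below. This is an illustrative Python/NumPy fragment: the three-channel patch-based color term α^I is reduced here to a single-channel bilateral kernel for brevity, and all parameter values are hypothetical, not the paper's settings.

```python
import numpy as np

def ar_weights(guide_patch, color_patch, sigma1=8.0, sigma3=2.0, sigma4=10.0):
    """Sketch of the AR weights alpha_{i,j} of Eq. (26) for one pixel p_i.

    guide_patch : (H, H) guide depth D-hat around p_i
                  (the rough depth map D_r in WLMNC-AR)
    color_patch : (H, H) single-channel intensities around p_i
    """
    H = guide_patch.shape[0]
    c = H // 2
    yy, xx = np.mgrid[0:H, 0:H]

    # depth term alpha^D: similarity of the guide depth to the center pixel
    a_d = np.exp(-(guide_patch - guide_patch[c, c]) ** 2 / (2 * sigma1 ** 2))
    # color term (simplified): spatial x range kernel, as in a bilateral filter
    spatial = np.exp(-((yy - c) ** 2 + (xx - c) ** 2) / (2 * sigma3 ** 2))
    rng = np.exp(-(color_patch - color_patch[c, c]) ** 2 / (2 * sigma4 ** 2))

    a = a_d * spatial * rng
    a[c, c] = 0.0                  # p_i itself is not in N(p_i)
    return a / a.sum()             # normalization: weights sum to 1
```

In WLMNC-AR the guide patch comes from the rough depth map, which is what suppresses texture copy where the color edge is weak while the color kernel preserves discontinuities where the block division is wrong.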
Algorithm 2 The Pseudo-Code of the Rough-to-Fine Depth Recovery Method Based on WLMNC-AR in the Local Stage

IV. EXPERIMENTS AND RESULTS

In this section, we present the experimental results of the proposed WLMNC distance based human depth recovery method. All experimental methods are implemented using Matlab and tested on a desktop PC with an i7-4770 CPU and 4 GB RAM. Next, we introduce the proposed benchmark dataset for human depth recovery and then present the comparison methods, parameter setup and results analysis. To encourage further comparison and future work, the novel dataset and the MATLAB code of our method are available at the project website.¹

A. Dataset

To evaluate the effectiveness of the proposed WLMNC distance based human depth recovery method, we first re-implement the Kinect I based tele-immersive video interaction system proposed by us in [6]. In the system, Kinect is used to capture live video and a depth stream, which is then processed with the video object cutout method proposed in [64] to segment the human in real time. Next, each human's skeleton joint information is detected by the pose detection method introduced in [25]. Then, the video and depth stream of the remote human are encoded using chroma key based video coding technology and are sent out through the public Internet with the skeleton joint information. At the local end, the remote video is decoded, and the remote human is recovered using the chroma key. Then, the remote and local humans are adaptively integrated into a selected virtual background based on the depth data. Finally, we obtain the occlusion consistent co-presence of distributed participants in a shared video space.

The proposed human depth recovery method is designed to correctly handle the occlusion relationship of the remote and local humans in a low-bandwidth environment when depth discontinuities exist in the remote human due to self-occlusion. Therefore, we collect a dataset containing six such types of human postures based on the above system in an indoor environment with a depth range of 5 m. The six human postures are hug, tease, handshake, hand in hand, kick and shoulder to shoulder, which satisfy the goal of the proposed method, as shown in Fig. 2. Each posture includes a color image, a depth map and the skeleton joint information of the remote human. The captured depth maps and color images are 640 × 480 in size and are registered to the same viewpoint. The original physical resolution of Kinect I's depth map is 320 × 240; however, the SDK for Kinect I provides a method² to resize the depth map to VGA size and to register the depth map to the same viewpoint as the corresponding color image. A detailed description of the collected dataset is given in Table II.

TABLE II: Benchmark Dataset Description

In our experiment, several skeleton joints detected by Kinect I fall outside the region of the remote human. Moreover, for those skeleton joints within the region of the remote human, their depth is not consistent with the depth of the pixels located at their coordinates. Since the detection accuracy of skeleton joints is important for skeletal block structure division and the depth of skeleton joints determines the depth recovery performance, skeleton joints outside the region of the remote human and those with much different depths from the corresponding pixels are automatically removed. The depth of the pixels located at the coordinates of the remaining skeleton joints is used for depth recovery.

B. Method Comparison

To demonstrate the performance advantages of the proposed human depth recovery method, we conduct two sets of comparative experiments:

• To verify the effectiveness of the rough depth recovery method based on WLMNC distance augmented clustering (WLMNC), we compare WLMNC with classic NC clustering (NC), LMNC distance augmented clustering (LMNC) and an interpolation method (GF). Since it is impossible to use bicubic interpolation to recover the

¹ https://github.com/beautifuljade/WLMNC-AR
² https://msdn.microsoft.com/en-us/library/jj663856.aspx
depth map of the remote human given only the depth data of skeleton joints, we consider the iterative Gaussian filtering method (denoted as GF), which performs iterative nearby interpolation with a Gaussian filter.

• To verify the effectiveness of the fine depth recovery method based on the rough depth guided AR model (WLMNC-AR), we compare WLMNC-AR with NC-AR, LMNC-AR and GF-AR, which denote the AR model guided with a rough depth map estimated by NC, LMNC and GF, respectively. We also compare WLMNC-AR with ten other state-of-the-art methods, as shown in Table IV. We omit the comparison with total generalized variation (TGV) [19], which does not converge within 3 × 10⁴ steps when given the depth data of only a few skeleton joints.

In the comparison experiments, the mean absolute difference (MAD) between the estimated depth map and the ground truth depth map is employed to evaluate the depth recovery performance of the different depth recovery methods. Since the motivation of this paper is to develop an efficient method to highly compress the depth data of the remote human at the remote end and to accurately recover it at the local end, we directly set the observed depth map of the remote human captured at the remote end as the ground truth depth. In the visual comparisons (Fig. 6, Fig. 7, Fig. 8, Fig. 9 and Fig. 10), regions highlighted by rectangles are enlarged, and the error maps are obtained as the difference between the recovered depth and the ground truth depth for easy visual assessment.

C. Parameter Setup

For the rough depth recovery method based on WLMNC and the compared LMNC method, the depth recovery performance depends on the related distance, which is highly dependent on the parameters σ, η, ς and T. σ determines the penalty weight ω_il on invading triples. η adjusts the importance of the two terms of the objective function for training the WLMNC distance or the LMNC distance. The step parameter ς and the maximum iteration number T control the convergence level. These four parameters directly impact the performance of the learned WLMNC distance or LMNC distance. In the experiments, we take η = 2⁻¹⁰, ς = 0.5 × 10⁻³ and T = 80 for both LMNC and WLMNC and set σ = 1 for WLMNC using the grid search technique. The grids searched for σ, η, ς and T are {2⁻¹⁰, 2⁻⁹, …, 2¹⁰}, {2⁻¹⁰, 2⁻⁹, …, 2⁰}, 0.5 × {10⁻⁶, 10⁻⁵, …, 10⁻²} and {10, 20, …, 100}, respectively. The precision-tolerant parameter is empirically set to 10⁻³.

For the fine depth recovery method based on WLMNC-AR and the compared AR models, the depth recovery performance is highly dependent on the related guide depth map and the parameter setting in the AR model. Since σ₁ determines the weight of the guide depth, a grid search for σ₁ in {2⁰, 2¹, …, 2⁷} is employed for the AR model with the different guide depths. As the best fine depth recovery results of those methods on each posture in the benchmark dataset correspond to very different value settings of σ₁, σ₁ is separately set to the optimal value for each posture for the sake of fairness. Table III shows the detailed optimal parameter settings. The other parameters, e.g., the window size H and the patch size W, are set to the values in the implementation code provided by Yang et al. [16].

TABLE III: Parameter Settings for the WLMNC-AR Model and the Compared AR Models

Fig. 8. Visual quality comparison of the depth recovery on the Hug and Shoulder to shoulder postures: (a) ground truth depth map and color image and depth maps recovered by (b) GF (MAD: 5.40; 3.99), (c) NC (MAD: 6.27; 3.77), (d) LMNC (MAD: 4.83; 3.42), and (e) WLMNC (MAD: 4.15; 3.07). The first and second MADs for each method are for the Hug and Shoulder to shoulder postures, respectively. This figure demonstrates that WLMNC can obtain a more accurate rough depth map than those of the other rough depth recovery methods when there are depth discontinuities in the remote human.

D. Results Analysis

1) Rough Depth Recovery Accuracy: Table IV shows the quantitative depth recovery results (in MAD) on the benchmark dataset using the proposed method and the other comparison methods. As shown in Table IV, compared with NC, WLMNC reduces the overall depth recovery error by more than 2 cm on the Hug posture labeled as A, which demonstrates that WLMNC can make more precise predictions of depth values by improving the division accuracy of the skeletal block structures among depth discontinuities using the trained WLMNC distance.
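The parameter selection above and the comparisons that follow are both driven by the MAD criterion, which reduces to a few lines. An illustrative Python/NumPy sketch, where the `recover` callable standing in for any of the compared methods is hypothetical:

```python
import numpy as np

def mad(depth_est, depth_gt):
    """Mean absolute difference between estimated and ground truth depth."""
    return float(np.mean(np.abs(np.asarray(depth_est) - np.asarray(depth_gt))))

def best_sigma1(recover, depth_gt, grid=tuple(2.0 ** k for k in range(8))):
    """Pick sigma_1 over the grid {2^0, ..., 2^7} by minimizing MAD;
    `recover` maps a sigma_1 value to a recovered depth map."""
    return min(grid, key=lambda s1: mad(recover(s1), depth_gt))
```

The same loop, run per posture, corresponds to the per-posture optimal settings reported in Table III.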
TABLE IV: The depth recovery MAD (cm) on the benchmark dataset using different depth recovery methods. The result "without noise" is obtained on the original benchmark dataset, while the result "with noise" is obtained on the degraded benchmark dataset, where the locations of the skeleton joints and the accompanying color image have added Gaussian noise with a standard variance of 5 and 25, respectively. The best results are in bold, and the second best are underlined.

As shown in Fig. 3, compared with classic NC clustering, WLMNC distance augmented clustering greatly reduces the number of trunk pixels in the highlighted area that are misclustered into the arm block structures, which demonstrates the effectiveness of the proposed WLMNC distance in reducing the clustering error rate among depth discontinuities caused by occlusion between skeletal block structures. As shown in Table IV, the proposed WLMNC method also achieves a lower MAD than LMNC on all six human postures. This result indicates that, by introducing the penalty weight, the WLMNC distance becomes more effective in widening the distance between the human pixels and other skeletal block structures with different depths. Therefore, the WLMNC distance augmented clustering approach can reduce misclustering between neighboring skeletal block structures with occlusion and contributes to better performance in depth recovery. We also find that, by making use of the prior information of the skeletal block structure, the average depth recovery error of WLMNC is lower than that of the nearby interpolation method using GF.

Fig. 8 shows a visual comparison of the depth recovery results for the Hug and Shoulder to shoulder postures using different rough depth recovery methods. As shown in Fig. 8, the depth map of both human postures recovered by GF is blurred among the arm block structures and the neighboring trunk structures, and the depth map recovered by NC is error-prone in those areas due to misdivision of skeletal block structures. When augmented with the WLMNC (LMNC) distance, WLMNC (LMNC) achieves much better depth recovery results in those areas. By more heavily penalizing the invading triples with large depth differences, WLMNC outperforms LMNC in terms of preserving depth discontinuities.

2) Fine Depth Recovery Accuracy: As shown in Table IV, by employing the rough depth map recovered by WLMNC as the guide depth map of the AR model, the average depth recovery error of WLMNC-AR is lower than that of the other AR models, which demonstrates that a better guide depth map is the key element in improving the depth recovery performance of the AR model. We also find that WLMNC-AR obtains performance comparable to that of WLMNC for the Hug and Kick postures, labeled as A and E, respectively, which further demonstrates the effectiveness of WLMNC.

As shown in Fig. 9, the results of NC-AR (GF-AR), which uses a rough depth map estimated by NC (GF) for the guide weight, suffer from blurred depth discontinuities. By contrast, guided with the more accurate rough depth map estimated by WLMNC, WLMNC-AR yields much better depth recovery results in the highlighted regions, as shown in Fig. 9.

By comparing Fig. 9(e) with Fig. 8(e), we can see that the depth maps recovered by WLMNC-AR and WLMNC are similar in the highlighted regions of the Hug and Shoulder to shoulder postures, which demonstrates that WLMNC is competitive in preserving the depth discontinuities of a remote human with consistent occlusion between skeletal block structures. However, for the Tease posture, with a variety of occlusions in the remote human, the result of WLMNC is prone to error (see Fig. 7(b)) due to the incorrect skeletal block structure division. By introducing the guidance of the accompanying color image, WLMNC-AR achieves

only the color term of the RWLS (BA-RWLS) model [13], [14]
and GF-AR model [16] plays a significant role in the depth
recovery performance. In contrast to the color term in the
RWLS (BA-RWLS) model [13], [14], the color term in the
GF-AR model [16] has a shape-adaptive neighborhood, which
increases the opportunities to exploit more correlations for
pixels around discontinuities. Therefore, the depth recovery
performance of the GF-AR model [16] is slightly better than
that of the RWLS (BA-RWLS) model [13], [14]. As shown
in Table IV, the RWLS (BA-RWLS) model [13], [14]
achieves almost the same performance as that of the
WLS model [18], [43] on the six human postures. The depth
discontinuities in the highlighted regions of the recovered
depth map of the RWLS (BA-RWLS) model [13], [14] are
severely blurred, as shown in Fig. 10(i) and Fig. 10(j).
In contrast to the RWLS (BA-RWLS) model [13], [14],
the FGI method [12] can preserve depth discontinuities
and achieves the lowest MAD for the (A) Hug posture,
see Table IV. As shown in Fig. 10(k), the depth maps recov-
ered by the FGI method [12] show clear boundaries without
blurring. However, when the color edges are not consistent
with the depth discontinuities, the FGI method [12] tends to
Fig. 9. Visual quality comparison for depth recovery on the Hug and generate depth recovery results with texture copy artifacts,
Shoulder to shoulder postures: (a) ground truth depth map and color image
and depth maps recovered by (b) GF-AR (MAD: 5.03; 4.05), (c) NC-AR
especially on the Kick and Shoulder to shoulder postures.

[Figure caption, continued] (MAD: 5.69; 3.36), (d) LMNC-AR (MAD: 4.77; 3.06), and (e) WLMNC-AR (MAD: 4.18; 2.56). The first and second MADs for each method are for the Hug and Shoulder to shoulder postures, respectively. This figure demonstrates that, compared with other fine depth recovery methods, the proposed WLMNC-AR method can obtain much better depth recovery performance in terms of preserving depth discontinuities.

better depth recovery performance for this human posture (see Fig. 7(e)).

As shown in Table IV, the proposed WLMNC-AR method obtains the lowest average MAD on the benchmark dataset (especially for the (F) Shoulder to shoulder posture). As shown in Fig. 10, the proposed WLMNC-AR method outperforms the other state-of-the-art comparison methods in preserving the depth discontinuities caused by various types of occlusion in the remote human. The depth maps recovered by the color-guided depth recovery methods, namely, JBF [24], Guided [23], CLMF0 [22], CLMF1 [22], JGU [20], and WLS [18], [43], suffer from texture copy artifacts and blurred depth discontinuities. The depth recovery result of Edge [17], [21] also suffers from jaggy artifacts at the depth boundaries. According to [13] and [14], the quality of an iteratively updated depth map is much better than that of the initial depth map estimated by interpolation methods, which helps the RWLS (BA-RWLS) model not only suppress texture copy artifacts but also preserve sharper depth discontinuities than those of the AR model in [16]. However, as shown in Table IV, the average depth recovery MAD of the RWLS (BA-RWLS) model [13], [14] is higher than that of the GF-AR model [16] on the six human postures, which demonstrates that the iteratively updated depth map is not reliable for human depth recovery with depth data of only a few skeleton joints, i.e., at such a high upsampling rate. By contrast, guided by a more accurate rough depth map and a more efficient color term, the proposed WLMNC-AR method can preserve depth discontinuities and suppress texture copy artifacts, as shown in Fig. 10 (l).

3) Upsampling Rate: As shown in Table IV, the proposed WLMNC method can obtain a low depth recovery error when given the pseudo 3D information (including coordinates and depth information) of only a few skeleton joints of the remote human and a learned WLMNC distance matrix. Compared with the RGB-D data transmission protocol used in [5], the proposed method can greatly reduce the amount of data transmitted and the bandwidth consumption, enabling smoother remote video interaction. In particular, for the six human postures captured by Kinect I, the depth map of the remote human is of VGA size, i.e., 640 × 480 = 307200 values, while the proposed method needs to transmit the pseudo 3D information of only 20 skeleton joints of the remote human and the learned WLMNC distance matrix, a total of 20 × 3 + 4 × 4 = 76 values, which is equivalent to downsampling the depth map of the remote human at √(307200/76) ≈ 64× the sampling rate. Based on the undersampled data, the proposed method can accurately recover the depth data of the remote human at the local end, which demonstrates the effectiveness of the proposed method under such a high upsampling rate.

4) Stability Analysis: We also evaluated the stability of the proposed method under variation in the input information, including the locations of the skeleton joints and the accompanying color image. Table IV (b) shows the results on the degraded benchmark dataset, where the locations of the skeleton joints and the accompanying color image have added Gaussian noise with standard deviations of 5 and 25, respectively. These results clearly demonstrate that the proposed WLMNC-AR method is stable and outperforms the other state-of-the-art depth recovery methods.
Fig. 10. Depth recovery results of the Tease posture. The depth recovery results of the five other human postures are listed in the supplementary material. (a) Ground truth depth map and color image, and depth maps (average MAD) recovered by (b) JBF [24] (4.21), (c) Guided [23] (4.30), (d) Edge [17], [21] (4.30), (e) CLMF0 [22] (4.29), (f) CLMF1 [22] (4.28), (g) JGU [20] (4.68), (h) WLS [18], [43] (4.21), (i) RWLS [13] (4.20), (j) BA-RWLS [14] (4.18), (k) FGI [12] (4.11), and (l) WLMNC-AR (3.56). This figure demonstrates that the proposed WLMNC-AR method outperforms the other state-of-the-art methods in terms of preserving depth discontinuities. Here, the average MAD is calculated over the whole dataset.
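The MAD scores quoted in the caption and in Table IV compare each recovered depth map against the ground truth. A minimal sketch of this metric (the function name and toy arrays are ours, not from the paper):

```python
import numpy as np

def mean_absolute_difference(recovered, ground_truth, mask=None):
    """MAD between a recovered depth map and the ground-truth depth map.

    `mask` optionally restricts the evaluation to valid pixels, e.g., the
    human region, since depth is only recovered for the remote human.
    """
    recovered = np.asarray(recovered, dtype=np.float64)
    ground_truth = np.asarray(ground_truth, dtype=np.float64)
    diff = np.abs(recovered - ground_truth)
    if mask is not None:
        diff = diff[mask]
    return diff.mean()

# Toy example on a 2 x 2 "depth map":
gt = np.array([[10.0, 20.0], [30.0, 40.0]])
rec = np.array([[12.0, 18.0], [30.0, 44.0]])
print(mean_absolute_difference(rec, gt))  # (2 + 2 + 0 + 4) / 4 = 2.0
```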
V. DISCUSSIONS AND FUTURE WORK

A. Acceleration Strategy

In our experiments, the Matlab implementation of the WLMNC distance learning algorithm, i.e., Algorithm 1, takes 7.93 seconds on average to learn the WLMNC distance for a remote human of VGA size with approximately 40000 pixels and 17 valid skeletal block structures, while the rough-to-fine depth recovery algorithm, i.e., Algorithm 2, takes 0.07 seconds and 2.5 minutes on average to obtain the rough and fine human depth maps, respectively. We suggest the following two aspects to reduce the computational complexity and make the proposed method more practical:
• Parallelizability. It is easy to check that the time-consuming steps in Algorithm 1 and Algorithm 2 can be parallelized. A preliminary GPU version of the AR model was implemented in [16] and took 2.8 seconds on average, approximately 40× faster than the CPU version. Thus, we expect a high acceleration ratio for a GPU version of the proposed method.
• Temporal information. Since the human depth recovery method introduced here is designed for tele-immersive video interaction systems, i.e., video processing, a common acceleration strategy is to use the temporal information of the video. The most time-consuming part of Algorithm 1 is the iteration scheme. For video processing, suppose the output BS matrix for frame i − 1 is M_{i−1}; we expect faster convergence if we set M_0 = M_{i−1} to learn the BS matrix M_i for frame i, because neighboring frames are similar. A similar idea can be found in the spatial-temporal recovery of depth sequences [65], [66].

B. Practical Challenges

This paper focused on algorithm development and its theoretical analysis. However, there are at least two practical challenges:
• Chroma key based coding. Chroma key based coding is an important distortion source when WLMNC-AR is applied to practical tele-immersive video interaction systems. It may affect the quality of the guide color image under different settings, such as the data size of the compressed images and the transmission bandwidth of the video stream. Although a simple numerical stability analysis has been established in this paper, the influence of this distortion source on the proposed depth recovery method should be investigated in more detail.
• Occlusion handling performance. The experimental results have demonstrated the effectiveness of the proposed method in human depth recovery, especially in the presence of self-occlusion. However, better depth recovery performance does not imply more accurate occlusion handling results; the latter must also consider occlusion with the local human. An additional experiment on the impact of WLMNC-AR on occlusion handling with various types of interaction behaviors should be performed.

C. Future Work

To greatly reduce bandwidth consumption, the WLMNC distance learning method introduced here uses a unified distance for all clusters. According to [27], the learned distance formulated by Eq. (3) employs a linear operator on the input samples, which leads to limited performance in situations with complex occlusion. Thus, we will consider introducing a nonlinear operator, e.g., a neural network, to construct the distance in future work.

VI. CONCLUSION

This paper presented a WLMNC distance based human depth recovery method that can accurately recover the depth map of a remote human with depth data of only a few skeleton joints to obtain occlusion-consistent composition results for tele-immersive video interaction systems in low-bandwidth environments. Specifically, at the remote end, we first used the skeleton joint information to highly compress the depth data of the remote human to a BS matrix by using a WLMNC distance learning method, which was equivalent to downsampling the remote human depth map with a 64× sampling rate. At the local end, we first proposed a rough depth recovery method based on WLMNC distance augmented clustering, which can yield better depth recovery results than those of classic NC clustering when depth discontinuities exist in the remote human. Then, we employed the rough estimated depth map in the AR model [16] as a guide depth map to obtain a fine depth map that can preserve depth discontinuities and
suppress texture copy artifacts. A theoretical analysis was also conducted to guarantee the effectiveness of the proposed method. To benchmark human depth recovery methods, a novel dataset containing various types of human postures with self-occlusion was built. Comparisons with the state-of-the-art depth recovery methods demonstrated the effectiveness of the proposed method for human depth recovery with a high upsampling rate on the benchmark dataset.

ACKNOWLEDGMENT

The authors thank Dr. Zhecai Chen, Yang He, and XiaoYi Zhang for participating in building the dataset. The authors are grateful to the referees for their valuable comments and suggestions, which have helped us to significantly improve the presentation of this paper.

REFERENCES

[1] S.-Y. Lee, I.-J. Kim, S. C. Ahn, M.-T. Lim, and H.-G. Kim, "Toward immersive telecommunication: 3D video avatar with physical interaction," in Proc. ICAT, 2005, pp. 56–61.
[2] T. Ogi, T. Yamada, K. Tamagawa, M. Kano, and M. Hirose, "Immersive telecommunication using stereo video avatar," in Proc. VR, Mar. 2001, p. 45.
[3] J. Lu, V. A. Nguyen, Z. Niu, B. Singh, Z. Luo, and M. N. Do, "CuteChat: A lightweight tele-immersive video chat system," in Proc. MM, 2011, pp. 1309–1312.
[4] S. Follmer, R. Ballagas, H. Raffle, M. Spasojevic, and H. Ishii, "People in books: Using a FlashCam to become part of an interactive book for connected reading," in Proc. CSCW, 2012, pp. 685–694.
[5] S. E. Hunter, P. Maes, A. Tang, K. M. Inkpen, and S. M. Hessey, "WaaZam!: Supporting creative play at a distance in customized video environments," in Proc. CHI, 2014, pp. 1197–1206.
[6] M. Huang, Y. Chen, L. Yin, and W. Ji, "Ti-photograph: A tele-immersive photograph system for distributed parents and children," in Proc. 15th ACM Int. Conf. Ubiquitous Comput. (Ubicomp Adjunct Publication), 2013, pp. 259–262.
[7] S. Liu, C. Yu, and Y. Shi, "Video avatar-based remote video collaboration," J. Beijing Univ. Aeronaut. Astronaut., vol. 41, no. 6, pp. 1087–1094, 2015.
[8] H. Lu et al., "Depth map reconstruction for underwater Kinect camera using inpainting and local image mode filtering," IEEE Access, vol. 5, pp. 7115–7122, 2017.
[9] N. Yu et al., "Super resolving of the depth map for 3D reconstruction of underwater terrain using Kinect," in Proc. IEEE Int. Conf. Parallel Distrib. Syst., Dec. 2016, pp. 1237–1240.
[10] W. Dong, G. Shi, X. Li, K. Peng, J. Wu, and Z. Guo, "Color-guided depth recovery via joint local structural and nonlocal low-rank regularization," IEEE Trans. Multimedia, vol. 19, no. 2, pp. 293–301, Feb. 2017.
[11] D. Chetverikov, Image-Guided ToF Depth Upsampling: A Survey. New York, NY, USA: Springer-Verlag, 2017.
[12] Y. Li, D. Min, M. N. Do, and J. Lu, "Fast guided global interpolation for depth and motion," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 717–733.
[13] W. Liu, X. Chen, J. Yang, and Q. Wu, "Robust weighted least squares for guided depth upsampling," in Proc. ICIP, Sep. 2016, pp. 559–563.
[14] W. Liu, X. Chen, J. Yang, and Q. Wu, "Robust color guided depth map restoration," IEEE Trans. Image Process., vol. 26, no. 1, pp. 315–327, Jan. 2017.
[15] H. H. Kwon, Y.-W. Tai, and S. Lin, "Data-driven depth map refinement via multi-scale sparse representation," in Proc. CVPR, Jun. 2015, pp. 159–167.
[16] J. Yang, X. Ye, K. Li, C. Hou, and Y. Wang, "Color-guided depth recovery from RGB-D data using an adaptive autoregressive model," IEEE Trans. Image Process., vol. 23, no. 8, pp. 3443–3458, Aug. 2014.
[17] J. Park, H. Kim, Y.-W. Tai, M. S. Brown, and I. S. Kweon, "High-quality depth map upsampling and completion for RGB-D cameras," IEEE Trans. Image Process., vol. 23, no. 12, pp. 5559–5572, Dec. 2014.
[18] D. Min, S. Choi, J. Lu, B. Ham, K. Sohn, and M. Do, "Fast global image smoothing based on weighted least squares," IEEE Trans. Image Process., vol. 23, no. 12, pp. 5638–5653, Dec. 2014.
[19] D. Ferstl, C. Reinbacher, R. Ranftl, M. Ruether, and H. Bischof, "Image guided depth upsampling using anisotropic total generalized variation," in Proc. ICCV, Dec. 2013, pp. 993–1000.
[20] M.-Y. Liu, O. Tuzel, and Y. Taguchi, "Joint geodesic upsampling of depth images," in Proc. CVPR, Jun. 2013, pp. 169–176.
[21] J. Park, H. Kim, Y.-W. Tai, M. S. Brown, and I. Kweon, "High quality depth map upsampling for 3D-TOF cameras," in Proc. ICCV, Nov. 2012, pp. 1623–1630.
[22] J. Lu, K. Shi, D. Min, L. Lin, and M. N. Do, "Cross-based local multipoint filtering," in Proc. CVPR, Jun. 2012, pp. 430–437.
[23] K. He, J. Sun, and X. Tang, "Guided image filtering," in Proc. ECCV, 2010, pp. 1–14.
[24] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele, "Joint bilateral upsampling," ACM Trans. Graph., vol. 26, no. 3, p. 96, Jul. 2007.
[25] J. Shotton et al., "Real-time human pose recognition in parts from single depth images," in Proc. CVPR, Jun. 2011, pp. 1297–1304.
[26] K. Q. Weinberger and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," J. Mach. Learn. Res., vol. 10, pp. 207–244, Feb. 2009.
[27] M. Huang, Y. Chen, B.-W. Chen, J. Liu, S. Rho, and W. Ji, "A semi-supervised privacy-preserving clustering algorithm for healthcare," Peer-to-Peer Netw. Appl., vol. 9, no. 5, pp. 864–875, 2016.
[28] R. Azuma, Y. Baillot, R. Behringer, S. Feiner, S. Julier, and B. MacIntyre, "Recent advances in augmented reality," IEEE Comput. Graph. Appl., vol. 21, no. 6, pp. 34–47, Nov. 2001.
[29] W. Xu, Y. Wang, Y. Liu, and D. Weng, "Survey on occlusion handling in augmented reality," J. Comput.-Aided Des. Comput. Graph., vol. 25, no. 11, pp. 1635–1642, 2013.
[30] J. Zhu, Z. Pan, C. Sun, and W. Chen, "Handling occlusions in video-based augmented reality using depth information," Comput. Animation Virtual Worlds, vol. 21, no. 5, pp. 509–521, 2010.
[31] R. A. Newcombe et al., "KinectFusion: Real-time dense surface mapping and tracking," in Proc. ISMAR, Oct. 2011, pp. 127–136.
[32] B. V. Lu, T. Kakuta, R. Kawakami, T. Oishi, and K. Ikeuchi, "Foreground and shadow occlusion handling for outdoor augmented reality," in Proc. ISMAR, Oct. 2010, pp. 109–118.
[33] D. Scharstein, R. Szeliski, and R. Zabih, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," in Proc. IEEE Workshop Stereo Multi-Baseline Vision, Dec. 2002, pp. 7–42.
[34] A. Kolb, E. Barth, R. Koch, and R. Larsen, "Time-of-flight cameras in computer graphics," Comput. Graph. Forum, vol. 29, no. 1, pp. 141–159, 2010.
[35] K. Lai, L. Bo, X. Ren, and D. Fox, "A large-scale hierarchical multi-view RGB-D object dataset," in Proc. ICRA, May 2011, pp. 1817–1824.
[36] C. Ti, G. Xu, Y. Guan, and Y. Teng, "Depth recovery for Kinect sensor using contour-guided adaptive morphology filter," IEEE Sensors J., vol. 17, no. 14, pp. 4534–4543, Jul. 2017.
[37] A. Atapour-Abarghouei and T. P. Breckon, "DepthComp: Real-time depth image completion based on prior semantic scene segmentation," in Proc. 28th Brit. Mach. Vis. Conf. (BMVC), London, U.K., Sep. 2017. [Online]. Available: http://dro.dur.ac.uk/22375/1/22375.pdf?DDD10+qhww73+d700tmt
[38] H.-T. Zhang, J. Yu, and Z.-F. Wang, "Probability contour guided depth map inpainting and superresolution using non-local total generalized variation," Multimedia Tools Appl., vol. 77, no. 7, pp. 9003–9020, 2017.
[39] M. Hornácek, C. Rhemann, M. Gelautz, and C. Rother, "Depth super resolution by rigid body self-similarity in 3D," in Proc. CVPR, Jun. 2013, pp. 1123–1130.
[40] J. Li, Z. Lu, G. Zeng, R. Gan, and H. Zha, "Similarity-aware patchwork assembly for depth image super-resolution," in Proc. CVPR, Jun. 2014, pp. 3374–3381.
[41] U. Hahne and M. Alexa, "Exposure fusion for time-of-flight imaging," Comput. Graph. Forum, vol. 30, no. 7, pp. 1887–1894, 2011.
[42] Q. Wang, S. Li, H. Qin, and A. Hao, "Super-resolution of multi-observed RGB-D images based on nonlocal regression and total variation," IEEE Trans. Image Process., vol. 25, no. 3, pp. 1425–1440, Mar. 2016.
[43] J. Diebel and S. Thrun, "An application of Markov random fields to range sensing," in Proc. NIPS, 2005, pp. 291–298.
[44] S. C. H. Hoi, W. Liu, and S.-F. Chang, "Semi-supervised distance metric learning for collaborative image retrieval and clustering," ACM Trans. Multimedia Comput., Commun., Appl., vol. 6, no. 3, 2010, Art. no. 18.
[45] M. Guillaumin, J. Verbeek, and C. Schmid, "Multiple instance metric learning from automatically labeled bags of faces," in Proc. ECCV, 2010, pp. 634–647.
[46] L. Wu, S. C. H. Hoi, R. Jin, J. Zhu, and N. Yu, "Distance metric learning from uncertain side information with application to automated photo tagging," in Proc. ACM Int. Conf. Multimedia (MM), 2009, pp. 135–144.
[47] M. Bilenko, S. Basu, and R. J. Mooney, "Integrating constraints and metric learning in semi-supervised clustering," in Proc. ICML, 2004, pp. 81–88.
[48] A. Bellet, A. Habrard, and M. Sebban, "A survey on metric learning for feature vectors and structured data," Comput. Sci., 2013.
[49] L. Yang and R. Jin, "Distance metric learning: A comprehensive survey," Michigan State Univ., East Lansing, MI, USA, Tech. Rep., 2006.
[50] Q. Qian, J. Hu, R. Jin, J. Pei, and S. Zhu, "Distance metric learning using dropout: A structured regularization approach," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2014, pp. 323–332.
[51] H. V. Nguyen and L. Bai, "Cosine similarity metric learning for face verification," in Proc. ACCV, 2010, pp. 709–720.
[52] D. Klein, S. D. Kamvar, and C. D. Manning, "From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering," in Proc. ICML, 2002, pp. 307–314.
[53] D. Cohn, R. Caruana, and A. McCallum, "Semi-supervised clustering with user feedback," in Constrained Clustering: Advances in Algorithms, Theory, and Applications, vol. 4. 2003, pp. 17–32.
[54] L. Wu, S. C. H. Hoi, R. Jin, J. Zhu, and N. Yu, "Learning Bregman distance functions for semi-supervised clustering," IEEE Trans. Knowl. Data Eng., vol. 24, no. 3, pp. 478–491, Mar. 2012.
[55] C. Domeniconi, J. Peng, and B. Yan, "Composite kernels for semi-supervised clustering," Knowl. Inf. Syst., vol. 28, no. 1, pp. 99–116, 2011.
[56] Y. Chen, M. Rege, M. Dong, and J. Hua, "Incorporating user provided constraints into document clustering," in Proc. ICDM, Oct. 2007, pp. 103–112.
[57] M. S. Baghshah and S. B. Shouraki, "Kernel-based metric learning for semi-supervised clustering," Neurocomputing, vol. 73, nos. 7–9, pp. 1352–1361, 2010.
[58] S. C. H. Hoi, R. Jin, and M. R. Lyu, "Learning nonparametric kernel matrices from pairwise constraints," in Proc. ICML, 2007, pp. 361–368.
[59] H.-J. Ye, D.-C. Zhan, X.-M. Si, and Y. Jiang, "Learning Mahalanobis distance metric: Considering instance disturbance helps," in Proc. 26th Int. Joint Conf. Artif. Intell. (IJCAI), 2017, pp. 3315–3321.
[60] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, "Information-theoretic metric learning," in Proc. ICML, 2007, pp. 209–216.
[61] K. Q. Weinberger, F. Sha, and L. K. Saul, "Convex optimizations for distance metric learning and pattern classification [applications corner]," IEEE Signal Process. Mag., vol. 27, no. 3, pp. 146–158, May 2010.
[62] L. Vandenberghe and S. Boyd, "Semidefinite programming," SIAM Rev., vol. 38, no. 1, pp. 49–95, 1996.
[63] D. P. Bertsekas, "On the Goldstein–Levitin–Polyak gradient projection method," IEEE Trans. Autom. Control, vol. 21, no. 2, pp. 174–184, Apr. 1976.
[64] M. Huang, Y. Chen, W. Ji, and C. Miao, "Accurate and robust moving-object segmentation for telepresence systems," ACM Trans. Intell. Syst. Technol., vol. 6, no. 2, Mar. 2015, Art. no. 17.
[65] J. Zhu, L. Wang, J. Gao, and R. Yang, "Spatial-temporal fusion for high accuracy depth maps using dynamic MRFs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 5, pp. 899–909, May 2010.
[66] D. Min, J. Lu, and M. N. Do, "Depth video enhancement based on weighted mode filtering," IEEE Trans. Image Process., vol. 21, no. 3, pp. 1176–1190, Mar. 2012.

Meiyu Huang received the B.S. degree in computer science and technology from the Huazhong University of Science and Technology, Wuhan, China, in 2010, and the Ph.D. degree in computer application technology from the University of Chinese Academy of Sciences, Beijing, China, in 2016. She is currently an Assistant Researcher with the Qian Xuesen Laboratory of Space Technology, China Academy of Space Technology, Beijing. Her research interests include machine learning, ubiquitous computing, human–computer interaction, computer vision, and image processing.

Xueshuang Xiang received the B.S. degree in computational mathematics from Wuhan University, Wuhan, China, in 2009, and the Ph.D. degree in computational mathematics from the Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China, in 2014. In 2016, he was a Post-Doctoral Researcher with the Department of Mathematics, National University of Singapore, Singapore. He is currently an Associate Researcher with the Qian Xuesen Laboratory of Space Technology, China Academy of Space Technology, Beijing. His research interests include numerical methods for partial differential equations, image processing, and deep learning.

Yiqiang Chen received the B.S. and M.S. degrees in computer science from Xiangtan University, Xiangtan, China, in 1996 and 1999, respectively, and the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing, China, in 2003. In 2004, he was a Visiting Scholar Researcher with the Department of Computer Science, The Hong Kong University of Science and Technology, Hong Kong. He is currently a Professor and the Director of the Pervasive Computing Research Center, Institute of Computing Technology, CAS. His research interests include artificial intelligence, pervasive computing, and human–computer interaction.

Da Fan received the B.S. degree in measurement-control technology and instrumentation from Tsinghua University, Beijing, China, in 2008, and the Ph.D. degree in instrument science and technology from Tsinghua University, Beijing, in 2013. He is currently an Assistant Researcher with the Qian Xuesen Laboratory of Space Technology, China Academy of Space Technology, Beijing. His research interests include machine learning, deep learning, computer vision, and intelligent control.