
Received March 3, 2019, accepted March 16, 2019, date of current version April 13, 2019.

Digital Object Identifier 10.1109/ACCESS.2019.2907071

RGB-D-Based Object Recognition Using Multimodal Convolutional Neural Networks: A Survey
MINGLIANG GAO1,2, JUN JIANG3,4, GUOFENG ZOU1, VIJAY JOHN5, AND ZHENG LIU2, (Senior Member, IEEE)
1 School of Electrical and Electronic Engineering, Shandong University of Technology, Zibo 255000, China
2 Faculty of Applied Science, The University of British Columbia, Vancouver, BC V1V 1V7, Canada
3 School of Computer Science and Technology, Southwest Petroleum University, Chengdu 610500, China
4 School of Computer Science, Sichuan University of Science and Engineering, Zigong 643000, China
5 Intelligent Information Processing Laboratory, Toyota Technological Institute, Nagoya 468-8511, Japan

Corresponding author: Mingliang Gao (mlgao@sdut.edu.cn)


This work was supported in part by the National Natural Science Foundation of China under Grant 61601266 and Grant 61801272, in part
by the China Postdoctoral Science Foundation under Grant 2017M612306, in part by the Young Teacher Development Support Program of
Shandong University of Technology, and in part by the Shandong Province-sponsored Overseas Study Program.

ABSTRACT Object recognition in real-world environments is one of the fundamental and key tasks in the computer vision and robotics communities. With advanced sensing technologies and low-cost depth sensors, high-quality RGB and depth images can be recorded synchronously, and object recognition performance can be improved by jointly exploiting them. RGB-D-based object recognition has evolved from early methods using hand-crafted representations to the current state-of-the-art deep learning-based methods. With the undeniable success of deep learning, especially convolutional neural networks (CNNs), in the visual domain, the natural progression of deep learning research points to problems involving larger and more complex multimodal data. In this paper, we provide a comprehensive survey of recent multimodal CNNs (MMCNNs)-based approaches that have demonstrated significant improvements over previous methods. We highlight two key issues, namely, training data deficiency and multimodal fusion. In addition, we summarize and discuss the publicly available RGB-D object recognition datasets and present a comparative performance evaluation of the proposed methods on these benchmark datasets. Finally, we identify promising avenues of research in this rapidly evolving field. This survey will not only enable researchers to get a good overview of the state-of-the-art methods for RGB-D-based object recognition but also provide a reference for other multimodal machine learning applications, e.g., multimodal medical image fusion, audio-visual speech recognition, and multimedia retrieval and generation.

INDEX TERMS Convolutional neural network, multimodal fusion, object recognition, RGB-D, survey.

The associate editor coordinating the review of this manuscript and approving it for publication was Yu Zhang.

I. INTRODUCTION
The task of object recognition is to recognize an object as being a member of a class. In other words, given a set of possible class labels and an object to classify, the object recognition task is to assign one of the labels to the object. Object recognition in real-world environments is a fundamental and important task in computer vision, robotics, etc. It can assist semantic scene understanding [1]–[4], scene labeling [5]–[8], object grasping [9]–[12], etc. Due to its usefulness, object recognition has received significant consideration in the computer vision and robotics communities [13]–[15].

Various works have achieved exciting results on several RGB-based object recognition challenges [16]–[18], and it is currently a relatively mature research area. However, object recognition using only RGB information has several limitations in many real-world applications, because it projects the 3D world into a 2D space and leads to inevitable data loss. Meanwhile, the appearance of objects changes significantly with variations in viewpoint, illumination and occlusion. To remedy these shortcomings, using depth images as a supplement is a plausible approach.

In recent years, RGB-D sensors, e.g., Microsoft's Kinect, Asus's Xtion PRO and Intel's RealSense, have been introduced to provide high-quality depth images [19]. They have been widely used in daily life because they are inexpensive, independent of complicated hardware, and widely supported by open source software.


The depth information has many extra advantages, being invariant to lighting and color variations, allowing better separation from the background, and providing pure geometry and shape cues [20], [21]. Therefore, combining the information from the RGB and depth modalities can improve the performance of object recognition dramatically. Although RGB-D data provides rich multimodal information to depict an object, how to effectively represent each modality and fuse the two modalities remains an open problem [22].

Many successful RGB-D object recognition methods have been proposed. Early works mainly focused on the design of hand-crafted features in RGB-D image descriptors [19]. The hand-crafted features, such as SIFT [23] and spin images [24], extracted from the RGB and depth modalities have been studied in detail. However, hand-crafted features are often dataset-specific and require a strong understanding of domain-specific knowledge [25]. To reduce the dependency on hand-crafted features, traditional machine learning-based features, e.g. hierarchical matching pursuit (HMP) [26] and single-layer convolutional operators [22], [27], have been proposed in a data-driven fashion for RGB-D object recognition.

Recently, deep learning has shown its power in representation learning using architectures consisting of multiple non-linear transformation layers [28]–[30]. Particularly, due to the recent development of large public image repositories such as ImageNet [31] and high-performance computing systems such as graphics processing units (GPUs), CNNs have become extremely popular [32]–[34]. Many successful CNN architectures, e.g. AlexNet [35], VGGNet [36], GoogleNet [37], ResNet [38] and DenseNet [39], have been proposed and extensively applied to various vision tasks. Meanwhile, due to their strong abilities in image characterization, CNNs have been applied to RGB-D object recognition problems and proven to perform excellently [5], [40]–[42]. Discriminative features can be extracted independently from the RGB and depth modalities by taking advantage of CNNs, and the features of the two modalities are fused and fed to a classifier for recognition. A general framework of MMCNNs based RGB-D object recognition is shown in Fig. 1.

FIGURE 1. General framework of MMCNNs for RGB-D object recognition.

In the MMCNNs based methods, training data deficiency and multimodal fusion are two key issues and challenges.

(1) Training data deficiency
The number of RGB-D datasets, especially labeled data, is much more scarce compared to RGB images, which makes the training of data-hungry algorithms such as CNNs unfeasible. To solve this problem, the most common strategy is to use an off-the-shelf CNN architecture pre-trained on ImageNet and fine-tune it using the labeled RGB-D dataset [40], [43]. This can be considered a supervised method. While initializing an RGB network from a pre-trained network is straightforward, using such a network for processing depth data is not, because of the different color distributions between the two modalities. An effective strategy is transfer learning across modalities, which employs an encoding method for the depth images such that the encoded depth images closely emulate the distribution of the corresponding RGB images [44], [45]. Besides, to leverage a small set of labeled data and a large amount of unlabeled data, more and more semi-supervised and weakly-supervised methods have been proposed and proved to be very powerful for RGB-D object recognition [46], [47].

(2) Multimodal fusion
Multimodal fusion is an active research topic in multimedia and has benefited a diversity of applications [48], [49]. Analysis techniques for multimodal data fusion have been investigated by the research communities for a long time [49]–[56]. For example, Liu et al. [54] established a projective dictionary learning framework for weakly paired multimodal data fusion. Zhou et al. [55], [56] presented a three-stage deep feature learning and fusion framework and a multiview latent space learning framework with feature redundancy minimization. Generally, multimodal data can be classified into incomplete and complete data. For incomplete multimodal data, some modalities are often missing or incomplete. Thung et al. [57] proposed a multi-task deep learning method for multi-stage diagnosis of Alzheimer's disease with incomplete multimodal data. Zhu et al. [58] put forward a maximum mean discrepancy based multiple kernel learning method for incomplete multimodal neuroimaging data. Huang et al. [59] proposed a multi-label conditional restricted Boltzmann machine (ML-CRBM), which handles modality completion, fusion, and multi-label prediction in a unified framework.

This paper focuses on the RGB and depth modalities, which can be considered complete multimodal data. In the literature, early fusion and late fusion are the two most popular fusion schemes [50], [52], [60] for complete multimodal data. Early fusion, also known as feature fusion, integrates data from different modalities before it is passed to a classifier. Late fusion, also known as decision fusion, integrates the responses obtained after a model has been learned for each individual feature descriptor, at the last stage. In this survey, we categorize RGB and depth fusion into early fusion and late fusion, and further refine early fusion into convolutional layer fusion and fully-connected layer fusion depending on the stage of feature fusion.

Although RGB-based object recognition has been studied extensively and there are many surveys or reviews in this domain [61]–[63], there are relatively fewer surveys on RGB-D-based object recognition, especially based on CNNs. The surveys most closely associated with this work are those presented in [64]–[68]. These surveys covered related themes but came with certain limitations. The surveys in [64]–[67] mainly focused on hand-crafted feature representations. For example, [64], [65] centered on local descriptors for 3D object recognition. The work in [66], [67] surveyed 3D features, especially keypoint-based methods from 3D point clouds, for RGB-D object recognition. Ioannidou et al. [68] presented a survey on deep learning with 3D data in computer vision. However, the focus is not specific to RGB-D object recognition and more details are needed.

To the best of our knowledge, the present survey is the first comprehensive study on MMCNNs based RGB-D object recognition. The large body of work on CNNs based RGB-D object recognition, besides the lack of such an extensive study on state-of-the-art techniques, prompts us to review the recent advances in this domain. This survey is developed in a problem-oriented pattern and provides an insightful analysis of the solutions to training data deficiency and the multimodal fusion strategies in MMCNNs for RGB-D object recognition. To this end, the proposed methods are broadly categorized into supervised and semi-supervised schemes considering the labels of the training data. Furthermore, we analyze the multimodal fusion strategies and categorize them into early fusion and late fusion considering the fusion level. Meanwhile, the influence of different depth encoding methods, different CNN models and different multimodal fusion strategies on the performance is analyzed. Compared to the former surveys, this survey covers the most recent and advanced work on MMCNNs based RGB-D object recognition and provides the readers with the state-of-the-art methods.

The rest of the paper is organized as follows: Section II briefly reviews the hand-crafted feature based methods and the traditional feature learning based methods in early RGB-D based object recognition. Meanwhile, the MMCNN based methods are introduced briefly. Section III presents a detailed survey on MMCNNs based methods through supervised learning. Section IV reviews the MMCNNs based methods through semi-supervised learning. In Section V, recently published RGB-D object recognition datasets are introduced along with the comparative results and discussion. We discuss several promising avenues for achieving further progress in Section VI. Finally, concluding remarks are made in Section VII.

II. RGB-D BASED OBJECT RECOGNITION
There are two main procedures in RGB-D object recognition, namely object feature representation and object classification. Compared with object classification, object feature representation affects the performance of the object recognition system significantly because real-world objects usually suffer from large intra-class discrepancy and inter-class affinity.

A variety of methods have been proposed for RGB-D object representation, and they can be mainly classified into three categories: hand-crafted feature based methods, traditional feature learning based methods and MMCNNs based methods.

A. HAND-CRAFTED FEATURE BASED METHODS
Earlier work utilized hand-engineered features for both the RGB and depth modalities. These features are fused and fed into classifiers along with their labels for the recognition task [65]. The features extracted from the RGB modality, e.g. Scale Invariant Feature Transform (SIFT) [23], [25], Speeded Up Robust Features (SURF) [69] and Histograms of Oriented Gradients (HOG) [70], have been studied in detail and widely used in RGB-based object recognition [16]. The features extracted from the depth modality are mainly 3D shape descriptors over 3D point clouds [71], such as spin images [24], 3D Shape Context (3DSC) [72], Rotation Invariant Feature Transform (RIFT) [73], Point Feature Histograms (PFH) [74], Point Pair Feature (PPF) [75], Signature of Histograms of Orientations (SHOT) [76], etc.

Lai et al. [77] introduced the well-known Washington RGB-D dataset and set a baseline for RGB-D object recognition. In their work, they extracted many hand-crafted features, e.g., SIFT, color histograms, texton histograms, and the mean and standard deviation of color channels from the RGB modality, and spin images and the 3D bounding box from the depth modality. These features are combined with the efficient match kernel (EMK) [78]. Then, the fused features are reduced using principal component analysis (PCA) and fed to three classifiers, i.e., a linear support vector machine (LinSVM), a Gaussian kernel SVM (kSVM) and a random forest (RF), for performance comparison. Experiments showed that the proposed features with kSVM performed best. On this basis, Lai et al. [79] proposed a sparse instance distance learning (IDL) with regularization to provide a distance measure for evaluating the similarity of all views of a particular object. Comparative results showed that the IDL-based method outperformed the exemplar-based local distance method [77]. Paulk et al. [80] extracted nine color-based features from the RGB channel, and three geometric features and one volumetric feature from the depth channel. The multimodal features were concatenated directly and fed to AdaBoost, SVMs and an ANN for comparative studies. Browatzki et al. [81] extracted four 2D descriptors from the RGB images, namely, SURF [69], pyramids of histograms of oriented gradients (PHOG) [82], the self-similarity feature (SSF) [83] and CIELAB color histograms, and four 3D descriptors from the depth images, namely, 3DSC [72], depth buffer [84], shape-index histograms (SIH) [85] and shape distributions [86]. Then a bag-of-words (BoWs) representation was utilized to transform the local features into a global descriptor, and multiple SVMs learnt on separate features were fused through a multilayer perceptron for classification. Liu et al. [87] calculated features of compactness, symmetry, global convexity, uniqueness and smoothness from the contour-based images (2D) and the point cloud data (3D), respectively. Then the five 2D shape features and five 3D shape features were concatenated and fed into an SVM for classification.


TABLE 1. Summary of hand-crafted feature based RGB-D object recognition methods.

Kernel descriptors [88]–[91] learn patch-level or image-level features based on kernels, and they have been successfully used in RGB-D object recognition. Bo et al. [88] presented a kernel descriptor method which combines three RGB kernels (color, gradient and LBP) and five depth kernels (size kernels, kernel PCA, spin kernels, gradient kernels and LBP kernels) to capture different recognition cues. They used pyramid EMK [78] to aggregate local kernel descriptors into object-level features, and a LinSVM was utilized for recognition. Based on this, Bo et al. [92] proposed hierarchical kernel descriptors (HKD) that apply kernel descriptors recursively to form image-level features and thus provide a conceptually simple and consistent way to generate image-level features from pixel attributes.

3D shape descriptors have been widely studied and used in RGB-D object recognition [66], [67], [93], [94]. Yu et al. [95] proposed the hierarchical sparse shape descriptor (HSSD), which learns shape primitives in multiple hierarchies from the dataset. They achieved this by transforming the spin image and fast point feature histograms (FPFH) [96] into a filter-pooling framework to generalize them for learning shape representations. Redondo-Cabrera et al. [97] proposed an approach to recognize object categories using quantized 3D spatial pyramid matching kernels (3D SPMK) in point clouds. In their work, 3D SURF local descriptors [98] were extracted and quantized into 3D visual words by K-means. Then a LinSVM was utilized for object recognition. Logoglu et al. [99] proposed two spatially enhanced local 3D descriptors for object recognition tasks, namely, histograms of spatial concentric surflet-pairs (SPAIR) and colored SPAIR (CoSPAIR). The query descriptors were then matched and a majority vote classifier [100] was utilized for classification. Experiments demonstrated that the introduced CoSPAIR descriptor outperformed the state-of-the-art descriptors in both category-level and instance-level recognition tasks.

A summary of hand-crafted feature representations combined with classifiers for RGB-D object recognition is shown in Table 1.
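The pipeline shared by most of the methods in Table 1 can be summarized as: extract a descriptor from each modality, concatenate the descriptors, and train a classifier. The following minimal sketch illustrates that pattern only; HOG for the RGB image and a depth-gradient histogram are illustrative stand-ins chosen here (not the exact descriptors of any surveyed method), and a linear SVM is used as in several of the surveyed baselines.

# Minimal sketch of the generic hand-crafted RGB-D pipeline: one descriptor per
# modality, concatenation, and a linear SVM. The descriptors are illustrative only.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def rgb_descriptor(rgb):                      # rgb: HxWx3 array
    gray = rgb.mean(axis=2)                   # simple grayscale conversion
    return hog(gray, pixels_per_cell=(16, 16), cells_per_block=(2, 2))

def depth_descriptor(depth, bins=32):         # depth: HxW array
    gy, gx = np.gradient(depth.astype(np.float32))
    mag = np.hypot(gx, gy)                    # crude shape cue from depth gradients
    hist, _ = np.histogram(mag, bins=bins, range=(0, 1), density=True)
    return hist

def rgbd_feature(rgb, depth):
    return np.concatenate([rgb_descriptor(rgb), depth_descriptor(depth)])

def train(rgb_images, depth_images, labels):
    feats = np.stack([rgbd_feature(c, d) for c, d in zip(rgb_images, depth_images)])
    return LinearSVC(C=1.0).fit(feats, labels)   # LinSVM as in several surveyed baselines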
B. TRADITIONAL FEATURE LEARNING BASED METHODS
Although hand-crafted features can boost object recognition accuracy, the feature design process requires a strong understanding of domain-specific knowledge, and the features have to be re-designed for new data types. Another shortcoming of hand-crafted features is that they only capture a subset of the features in the raw data that are discriminative for object recognition, and other useful cues are ignored during feature design [43]. To reduce the dependency on hand-crafted features, several feature learning based methods which learn features from raw data directly have been proposed for RGB-D object recognition.


TABLE 2. Summary of traditional feature learning based methods for RGB-D object recognition.

Hierarchical matching pursuit (HMP) [26] is an unsupervised feature learning method which learns dictionaries over image and depth patches via K-SVD [101] to represent observations as sparse combinations of codewords. With the learnt dictionary, feature hierarchies are built from scratch, layer by layer, using orthogonal matching pursuit and spatial pyramid pooling. Bo et al. [102] used a two-layer HMP to learn features from the RGB, gray, depth, and surface normal channels. Then, the features were concatenated and fed to a LinSVM for training and classification.

PCA and its variants are powerful tools for unsupervised feature learning from possibly high-dimensional data sets and have been widely used in RGB-D object recognition. Chen et al. [103] treated the RGB-D data as a quaternion signal and proposed an algorithm for RGB-D object recognition based on bi-directional 2D kernel quaternion principal component analysis [104] and quaternion representation (QR) [105]. Sun et al. [106] proposed an RGB-D object recognition method combining PCA and canonical correlation analysis (CCA). In their work, two stages of cascaded filter layers, one learning features by PCA and the other maximizing correlation by CCA, were constructed and followed by binary hashing and block histograms. Experiments proved that the introduced method was efficient and accurate.

Single-layer convolutional operators have also been widely used to learn low-level features from the RGB and depth modalities in an unsupervised fashion. Blum et al. [27] proposed a convolutional k-means descriptor (CKM) to learn a meaningful local feature description automatically. They calculated the feature responses, accumulated them into feature histograms and utilized a LinSVM for object recognition. Similarly, Cheng et al. [22] proposed a convolutional Fisher kernel (CFK) method for RGB-D object recognition. They learnt convolutional features from the RGB and depth modalities via a single-layer CNN, utilized Gaussian mixture models (GMMs) to model the feature distribution, and exploited Fisher kernel encoding [107] to generate an image-level feature vector. Last, two LinSVMs were separately trained for the RGB and depth modalities, and the combined scores were used to predict the category. Li et al. [108] applied multiple scales of filters coupled with different pooling granularities, and utilized color as an additional pooling domain to reduce the sensitivity to spatial deformations. Cheng et al. [109] extracted a 128-dimensional convolutional descriptor [110] from the RGB modality and a 200-dimensional kernel descriptor consisting of the gradient kernel and LBP kernel from the depth modality [88]. These descriptors were converted to a 256-dimensional feature vector by PCA. A query adaptive similarity measurement (QASM) and a learning-to-combination strategy were proposed to balance the RGB and depth cues, and majority vote classifiers [100] were used for recognition.

A summary of traditional feature learning based methods for RGB-D object recognition is shown in Table 2.

C. MMCNNS BASED METHODS
Due to their powerful representational capacities, CNNs have taken the place of traditional feature learning methods and have become the mainstream approach to RGB-D object recognition. More and more MMCNNs based methods have been proposed and have achieved exciting results. According to whether the training examples are labeled or unlabeled, machine learning methods are broadly categorized into supervised, semi-supervised and unsupervised methods [29], [111]. CNNs based object recognition methods often depend on a great deal of labeled data to train the model in a supervised way. To circumvent the burdensome labeling task, semi-supervised RGB-D object recognition methods have been proposed. Although many methods such as the auto-encoder (AE) are widely used to learn deep features in an unsupervised fashion [112], [113], the classifier should still be trained using labeled data in a supervised way. Essentially, these methods are still supervised methods. To the best of our knowledge, there is no truly unsupervised method to solve the RGB-D object recognition problem.


1) SUPERVISED LEARNING BASED METHODS
In the supervised learning methods, an RGB-CNN and a Depth-CNN are constructed, or two off-the-shelf CNNs are fine-tuned, based on the labeled RGB and depth data. Then, a multimodal fusion strategy is adopted to fuse the two modalities, and finally a classifier is trained for recognition. An overall structure of supervised MMCNNs for RGB-D object recognition is shown in Fig. 2.

FIGURE 2. Overall structure of supervised learning based MMCNNs for RGB-D object recognition.

Overall, most state-of-the-art works were based on off-the-shelf CNNs pre-trained on ImageNet, except for the work in [43], [114], [115], in which CNNs were newly designed and trained. The reasons to utilize off-the-shelf CNNs are twofold. First, the number of sufficiently large training datasets of depth images is much more scarce compared to color images. Second, it has been proven that off-the-shelf CNNs can be utilized to address different tasks in computer vision by fine-tuning to a new dataset [116]–[118]. To leverage a pre-trained CNN, the RGB modality needs no further processing because it follows an input distribution similar to that of natural camera images. However, the depth images have to be transferred to the RGB domain to benefit from the features learnt in the off-the-shelf CNNs, which were pre-trained on RGB images. More detailed descriptions of supervised MMCNNs for RGB-D object recognition are given in Section III.

2) SEMI-SUPERVISED LEARNING BASED METHODS
Although supervised learning based CNNs have so far made great progress in this domain, they still suffer from a lack of large-scale manually labeled RGB-D data. Labeling a large-scale RGB-D dataset is an expensive, time-consuming and boring task, because it requires the efforts of experienced human annotators. Thus it is necessary to develop a new effective training framework for deep learning that can benefit from the massive unlabeled RGB-D data. To address this problem, a natural idea is to incorporate conventional semi-supervised learning methods into the deep learning framework. An overview of semi-supervised learning based MMCNNs for RGB-D object recognition is shown in Fig. 3.

FIGURE 3. Overall structure of semi-supervised learning based MMCNNs for RGB-D object recognition.

An RGB classifier and a depth classifier are trained to model the two modalities using the limited labeled training sets. Then, the trained classifiers are applied to predict the examples from the unlabeled training sets (U). The most confidently predicted instances of each class by the two classifiers are transferred from the unlabeled training sets to the labeled training sets (L) for the next round of training. The algorithm runs until it reaches the maximum number of iterations or the unlabeled pool is empty. More detailed descriptions are given in Section IV.

3) MULTIMODAL FUSION STRATEGIES
The underlying motivation to use multimodal data is that complementary information yields a richer representation that can be used to produce much improved performance compared to a single modality. The multimodal fusion research communities have achieved substantial advances [49]. The fusion of different modalities is generally performed at two levels, namely early level (or feature level) fusion and late level (or decision level) fusion [50], [52], [60].

(1) Early level fusion
In the early fusion approach, the RGB and depth features are extracted from two separate CNNs. Then the features from the two modalities are fused to meet the requirements for a deeper understanding of the captured information. Early level fusion is advantageous in that it can utilize the correlation between multiple features from different modalities at an early stage, which helps in better task accomplishment. An overview of early level fusion in RGB-D object recognition is shown in Fig. 4.

FIGURE 4. Overall architecture of early level fusion for RGB-D object recognition.
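As a concrete illustration of the feature-level pattern in Fig. 4, the following minimal sketch concatenates per-modality CNN features into one vector before a single classifier sees them. It illustrates the general scheme rather than any specific surveyed method: rgb_cnn and depth_cnn are assumed to be arbitrary pre-trained feature extractors, and logistic regression stands in for the softmax classifier.

# Minimal sketch of early (feature-level) fusion: per-modality CNN features are
# concatenated into a single vector before one classifier is trained on the result.
import numpy as np
from sklearn.linear_model import LogisticRegression  # softmax stand-in classifier

def early_fusion_features(rgb_batch, depth_batch, rgb_cnn, depth_cnn):
    f_rgb = np.stack([rgb_cnn(x) for x in rgb_batch])        # N x D1
    f_depth = np.stack([depth_cnn(x) for x in depth_batch])  # N x D2
    return np.concatenate([f_rgb, f_depth], axis=1)          # N x (D1 + D2)

def train_early_fusion(rgb_batch, depth_batch, labels, rgb_cnn, depth_cnn):
    fused = early_fusion_features(rgb_batch, depth_batch, rgb_cnn, depth_cnn)
    return LogisticRegression(max_iter=1000).fit(fused, labels)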


(2) Late level fusion
Late level fusion refers to the aggregation of decisions from multiple classifiers, each trained on a separate modality. This fusion architecture is favored because errors from multiple classifiers tend to be uncorrelated and it is feature independent. In late level fusion, the final object recognition decision is made by combining the classification probabilities of different complementary CNN-based RGB-D object recognition models. Compared with early level fusion, late level fusion is much easier because the classification probabilities of the different complementary models, at the semantic stage, usually have the same representation. An overall architecture of late level fusion in RGB-D object recognition is shown in Fig. 5.

FIGURE 5. Overall architecture of late level fusion for RGB-D object recognition.
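A minimal sketch of the decision-level scheme in Fig. 5 is given below: each modality keeps its own classifier, and only the class-probability vectors are combined. A weighted average is one common combination rule; the weight used here is a hypothetical hyperparameter.

# Minimal sketch of late (decision-level) fusion: per-modality class probabilities
# are combined by a weighted average and the fused score decides the class.
import numpy as np

def late_fusion_predict(p_rgb, p_depth, alpha=0.5):
    """p_rgb, p_depth: N x C class-probability matrices from the two classifiers."""
    p_fused = alpha * p_rgb + (1.0 - alpha) * p_depth
    return p_fused.argmax(axis=1)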
III. SUPERVISED LEARNING BASED MMCNNS FOR RGB-D OBJECT RECOGNITION
Among the MMCNNs based RGB-D object recognition methods, the most common one is to utilize two off-the-shelf CNN architectures pre-trained on ImageNet and fine-tune them using the labeled RGB-D dataset in a supervised fashion. To this end, no further processing is required for the RGB modality, but the depth modality is often encoded as a rendered RGB image to emulate the distribution of the corresponding RGB images.

A. DEPTH ENCODING
Recent work has focused on transferring the knowledge of CNN models pre-trained on ImageNet to extract discriminative features from depth images [119], [120]. Many depth encoding methods have been proposed, which can be grouped into two categories, namely, hand-crafted encoding methods and learning based encoding methods.

1) HAND-CRAFTED BASED DEPTH ENCODING METHODS
The hand-crafted based methods are very popular because they are intuitive and easy to implement. Many hand-crafted mappings have been proposed to colorize depth data and obtained impressive improvements, especially on the Washington RGB-D object dataset [77].

(1) Surface normals
Within the RGB-D classification literature, surface normals [102] is the most widely used encoding method. Bo et al. [102] first converted the depth map to surface normals and then re-interpreted it as RGB values. Due to the inherent noisiness of the depth channel, such a direct conversion results in noisy images in the color space. To address this issue, Aakerberg et al. [121] modified it with a recursive median filter and a bilateral filter.

(2) ColorJet
ColorJet is a simple color mapping method that assigns different colors to different depth values. It was first introduced to RGB-D object recognition by Eitel et al. [40]. In ColorJet, the original depth map is first normalized to 0-255 values. The colorization then works by mapping the lowest value to the blue channel and the highest value to the red channel. The value in the middle is mapped to green and the intermediate values are arranged accordingly. To increase the invariance of the generated images against camera pitch angle changes, Schwarz et al. [119] improved the ColorJet model with an optional re-projection step in a coordinate system.

(3) HHA
HHA was first introduced by Gupta et al. [122] for RGB-D object detection and segmentation. HHA is a mapping method in which the first channel encodes the horizontal disparity, the second channel encodes the height above ground, and the third channel encodes the pixel-wise angle between the surface normal and the gravity vector. Horizontal disparity calculates the difference between the images of the left and right camera, which helps to understand the closeness of objects to the camera. Height above the ground depicts the possible positions of objects in the scene with respect to the ground plane. The angle quantifies the shape of the object in the scene. The HHA representation encodes properties of geocentric pose that emphasize complementary discontinuities in the image (depth, surface normal and height). This embedding method is strictly geocentric, and such information is not always available in object-centric recognition [120].

(4) Embedded depth
The embedded depth method was first proposed by Zaki et al. [120]. This method takes shape information into the depth embedding. In the embedded depth method, a single-channel depth map is convolved with vertical and horizontal Prewitt kernels. Then the gradient magnitude G_m and gradient direction G_θ are calculated, respectively, by:

G_m = \sqrt{G_x^2 + G_y^2},   (1)
G_θ = \arctan(G_x, G_y),   (2)

where G_x and G_y are the vertical and horizontal derivatives of the depth image. Last, the three-channel depth map is constructed as the concatenation of the original single-channel depth map with the gradient magnitude and direction maps.
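A minimal sketch of the embedded-depth encoding following Eqs. (1)-(2) is given below: Prewitt responses yield the gradient magnitude and direction, which are stacked with the raw depth map into a three-channel image. The per-channel rescaling to [0, 1] is an assumption added here for illustration, not part of the original formulation.

# Sketch of embedded-depth encoding: Prewitt gradients (Eqs. (1)-(2)) stacked with
# the raw depth map into a three-channel image.
import numpy as np
from scipy.ndimage import prewitt

def embed_depth(depth):
    d = depth.astype(np.float32)
    gx = prewitt(d, axis=1)                 # horizontal Prewitt response
    gy = prewitt(d, axis=0)                 # vertical Prewitt response
    g_mag = np.sqrt(gx ** 2 + gy ** 2)      # Eq. (1)
    g_dir = np.arctan2(gx, gy)              # Eq. (2), two-argument arctangent
    def norm(c):                            # rescale each channel to [0, 1] (assumption)
        rng = c.max() - c.min()
        return (c - c.min()) / rng if rng > 0 else np.zeros_like(c)
    return np.dstack([norm(d), norm(g_mag), norm(g_dir)])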
It is worth mentioning that the depth modality can be encoded by a single encoding method or by a combination of encoding methods. For example, Rahman et al. [123] utilized both the surface normals and ColorJet methods to encode the depth images. Aakerberg et al. [124] encoded the depth modality by combining surface normals, ColorJet and fine-tuning of the surface normals in an ensemble learning framework.


2) LEARNING BASED DEPTH ENCODING METHODS
Although hand-crafted depth encoding methods perform well on specific RGB-D datasets, they are sub-optimal because one has to make strong assumptions on what information, and to which extent, should be preserved in the transfer learning towards the RGB modality. To overcome these shortcomings, learning based depth encoding methods were proposed recently, inspired by recent work on the colorization of gray-scale photographs [125]–[127].

Carlucci et al. [128] proposed a deep network architecture termed ''Deep Depth Colorization'' ((DE)²CO) to map the depth images to RGB images by exploiting a residual paradigm. The (DE)²CO method takes advantage of the residual approach and learns how to map between the two modalities by leveraging a reference database for any given architecture. Experimental results proved that the proposed method outperformed ColorJet and HHA. An overview of the (DE)²CO colorization network is shown in Fig. 6.

FIGURE 6. Overview of the (DE)²CO depth colorization network [128].

The different depth encoding methods are summarized in Table 3.

TABLE 3. Summary of different depth encoding methods.

B. EARLY LEVEL FUSION
Most state-of-the-art works in this domain adopt the early level fusion strategy because the features extracted from CNNs are discriminative and easy to fuse. Sanchez-Riera et al. [60] argued that early fusion outperforms late fusion in RGB-D based hand gesture recognition and object recognition. Early level fusion can be further categorized into fully-connected layer fusion and convolutional layer fusion based on the stage of feature fusion.

1) FULLY-CONNECTED LAYER FUSION
The advantage of fully-connected layer fusion lies in that features extracted from fully-connected layers carry rich semantic information [116], which is crucial in image classification [134], [135]. Also, they are easily fused before being fed to a softmax classifier in an end-to-end fashion.

(1) Concatenated fusion
Of the various ways of fully-connected layer fusion, a straightforward approach is to concatenate the fully-connected layers of the RGB and depth modalities directly. A representative work on this point was presented by Eitel et al. [40]. In their work, two CaffeNets [136] pre-trained on ImageNet were fine-tuned using the RGB and depth images (encoded by ColorJet [40] and HHA [122], respectively) in the RGB-D dataset. After that, the two fc7 layers were concatenated and merged into a fusion network. Then the fusion network was further trained with a softmax classifier for recognition. Extensive experimental results demonstrated the accurate performance of the introduced approach in the task of recognizing objects in noisy scenes. An overall architecture of RGB-D object recognition proposed by Eitel et al. [40] is shown in Fig. 7.

FIGURE 7. Overall architecture of RGB-D object recognition proposed by Eitel et al. [40].
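A sketch of the concatenated fully-connected-layer fusion pattern in PyTorch is given below. The backbones, feature dimensions and fusion head are placeholders (ResNet-18 stands in for the CaffeNets, and a two-layer fusion head for the fc7-concatenation network), so this is an illustration of the scheme rather than the exact configuration used by Eitel et al. [40].

# Sketch of concatenated fully-connected-layer fusion: two backbones produce feature
# vectors, which are concatenated and passed to a small fusion network ending in logits.
import torch
import torch.nn as nn
import torchvision.models as models

class ConcatFusionNet(nn.Module):
    def __init__(self, num_classes, feat_dim=512):
        super().__init__()
        rgb = models.resnet18(weights=None)    # stand-in for the RGB stream
        depth = models.resnet18(weights=None)  # stand-in for the (encoded) depth stream
        rgb.fc = nn.Identity()                 # keep the pooled 512-d features
        depth.fc = nn.Identity()
        self.rgb_stream, self.depth_stream = rgb, depth
        self.fusion = nn.Sequential(           # fusion layers over concatenated features
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_classes),  # logits; softmax is applied in the loss
        )

    def forward(self, x_rgb, x_depth):
        f = torch.cat([self.rgb_stream(x_rgb), self.depth_stream(x_depth)], dim=1)
        return self.fusion(f)

# usage: logits = ConcatFusionNet(num_classes)(rgb_batch, depth_batch)
#        loss = nn.CrossEntropyLoss()(logits, labels)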

Most of the follow-up works were inspired by Eitel et al.'s work [40]. There are four main differences:
- Different CNN architectures
- Different depth encoding methods
- Different fully-connected layers
- Different classifiers

Wang et al. [132] fine-tuned two CaffeNets [136] using the RGB and depth images, which was similar to [40]. However, the depth images were encoded using HHA [122], and the features from the first fully-connected layer (fc6) were concatenated to train a LinSVM for classification. They also carried out a comparative study by concatenating the second fully-connected layer (fc7), and the results indicated that using fusion features from the fc6 layer outperformed fc7 layer fusion.

To increase the invariance of the generated images against camera pitch angle changes, Schwarz et al. [119] pre-processed the RGB images with a fading operation, incorporated the depth images by rendering objects from a canonical perspective, and colorized the depth channel according to the distance from the object center.


Two pre-trained CaffeNets [136] were utilized to extract the features from the RGB and depth modalities, respectively. The last two fully-connected layers (fc7 and fc8) were combined and fed to a LinSVM for classification. Experiments proved that although the fc7 and fc8 layers were connected before modality fusion, which produced larger features in each modality, the result was worse than single fc7 fusion [40]. An overall architecture proposed by Schwarz et al. [119] is shown in Fig. 8.

FIGURE 8. Overall architecture of RGB-D object recognition proposed by Schwarz et al. [119].

Aakerberg et al. [121] used the VGG-16 architecture [36] for the RGB stream and CaffeNet [136] for the depth stream. The two networks were fine-tuned, the fc7 layer of CaffeNet and the second fully-connected layer of VGG-16 (fc7) were fused, and a softmax classifier was trained for recognition. Experimental results showed that the recognition accuracy increased by 6% compared with Eitel et al.'s work [40]. Carlucci et al. [128] adopted the same networks as [121] to model the RGB and depth modalities, but the depth stream was encoded by a colorization network architecture as depicted in Fig. 6. Comparative results showed that the introduced deep-network-based depth encoding method outperformed ColorJet [40] and HHA [122].

To fully exploit the discriminative features of the depth modality, Rahman et al. [123] utilized both the surface normals [102] and ColorJet [40] methods to encode the depth modality, and proposed a three-stream MMCNNs based deep network architecture (GoogLeNet [37] for the RGB stream and CaffeNet [136] for both depth streams), as shown in Fig. 9. The last fully-connected layers of the three CNNs were removed and a new fully-connected layer named RGB-FC was added in the RGB stream. After that, a concatenation layer was added to combine all the stream networks, followed by a softmax layer for final classification. Experiments proved that combining two depth encoding methods performed better than a single depth encoding method.

To design a more robust model to extract features from the depth images, Aakerberg et al. [124] improved Eitel et al.'s work [40] by using ensemble learning. In their work, depth values were encoded by surface normals [102], ColorJet [40], and fine-tuning of the surface normals. The softmax probabilities of the CNNs were weight averaged to create the final prediction:

P = \frac{1}{n} \sum_{i=1}^{n} a^i \, softmax(i),   (3)

where softmax(i) is the softmax score vector of each depth-CNN and a^i is the weight of the i-th depth model. They proved that forming an ensemble by combining the softmax probabilities increased the recognition performance compared to using a single baseline model.

To enhance the object recognition performance, Zhou et al. [129] adopted a deeper CNN architecture, DenseNet-121 [39], to model the RGB and depth modalities. To retrain the model, they proposed a multimodal self-augmentation and adversarial network (MSANet) to augment the RGB and depth data while keeping the annotations, as shown in Fig. 10. In their work, a series of transformations were leveraged to generate class-agnostic examples for each instance, and two pre-trained DenseNet-121s were fine-tuned based on the self-augmented RGB and depth instances (encoded by surface normals). Meanwhile, they trained an adversarial occlusion neural network (AONN) on a feature map space in which the self-augmented instances were processed by projecting the generated mask to generate hard positive examples. Then MSANet-RGB and MSANet-Depth were concatenated by discarding their own classification layers and fed into a softmax for classification. Experimental results indicated that MSANet achieved 94.2% top-1 test accuracy on the Washington RGB-D object dataset. Also, they proved that MSANet could be adopted in other CNNs, e.g. AlexNet, VGG-16 and ResNet-101, and a plausible improvement can be obtained.

(2) Multimodal sharable and modality-specific fusion
The aforementioned methods connect the features from the RGB and depth modalities directly. The major shortcoming is that the relation between the two modalities is ignored and the complementary nature of the modalities cannot be fully exploited. Also, it leads to very large input vectors that may contain redundancies. To address these problems, an alternative is to fuse the fully-connected layers in a multimodal sharable and modality-specific fashion.

Wang et al. [114] proposed a multimodal fusion layer that uses matrix transformations to explicitly enforce a common part to be shared by features of different modalities while retaining modality-specific learning. In their work, a CNN architecture very similar to AlexNet [35], except for the number of kernels in the convolutional layers, was constructed, as shown in Fig. 11. The network was pre-trained on ImageNet and fine-tuned using the Washington RGB-D dataset [77]. After that, the second fully-connected layers (fc7) of the two modalities were extracted and fed into a multimodal feature learning framework to exploit the modality-specific and modality-common features, as shown in Fig. 12. Experiments proved that the shared common patterns and modality-specific patterns of the different modalities increased the recognition accuracy by 4% compared with simply connecting the two fully-connected layers.


FIGURE 9. Overview of the three-stream MMCNNs framework by Rahman et al. [123].

FIGURE 10. Overview of the MSANet architecture by Zhou et al. [129].

Their follow-up work [130] was similar to [114] in learning correlated and individual features, except that they utilized a deeper network, ResNet-50 [38], to extract features. Experimental results proved that the deeper network further increased the recognition accuracy.

Canonical correlation analysis (CCA) [137]–[140] is a powerful statistical technique to find linear mappings that maximize the cross-correlation between two feature sets. In their follow-up work, Wang et al. [43] adopted the same CNN architecture as proposed in [114] to extract features and utilized CCA to maximize the correlations between the RGB and depth features, as well as to force the distance between same-class objects to be small and the distance between different-class objects to be large. Experiments proved that the CCA-based method further improved the recognition accuracy compared with the matrix transformation-based method [114].
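The following sketch illustrates the general idea of CCA-based fusion: project fc-level RGB and depth features into a maximally correlated subspace and classify on the concatenated projections. scikit-learn's CCA is used here as a stand-in for the specific formulation of [43], and the number of components is a hypothetical hyperparameter.

# Sketch of CCA-based fusion: learn correlated projections of the two feature sets,
# concatenate them, and train a linear SVM on the fused representation.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.svm import LinearSVC

def train_cca_fusion(f_rgb, f_depth, labels, n_components=128):
    """f_rgb: N x D1, f_depth: N x D2 feature matrices from the two CNN streams."""
    cca = CCA(n_components=n_components, max_iter=1000)
    z_rgb, z_depth = cca.fit_transform(f_rgb, f_depth)      # correlated projections
    clf = LinearSVC().fit(np.concatenate([z_rgb, z_depth], axis=1), labels)
    return cca, clf

def predict_cca_fusion(cca, clf, f_rgb, f_depth):
    z_rgb, z_depth = cca.transform(f_rgb, f_depth)
    return clf.predict(np.concatenate([z_rgb, z_depth], axis=1))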
Motivated by the success of reconstruction independent component analysis (RICA) in object recognition tasks [141], Jin et al. [142] utilized RICA to learn partially common-semantic learning (PCSL) features, which were assumed to be composed of some specific latent factors for the RGB and depth modalities and partially common latent factors shared across both modalities. Specifically, two pre-trained CaffeNets [136] were fine-tuned using the labeled RGB-D dataset and the second fully-connected layers (fc7) were utilized to learn compact latent representations by RICA. Last, a LinSVM classifier was employed for object recognition. Experiments indicated that the concatenation of specific and common semantic content could jointly exploit the consistency and complementary properties of the RGB and depth modalities. An overview of Jin et al.'s work [142] is shown in Fig. 13.

2) CONVOLUTIONAL LAYER FUSION
Recent studies proved that there are several inherent advantages of the convolutional layers [143]. First, the activations of the convolutional layers contain more spatial information. Second, the convolutional features can be extracted from an image of any size and aspect ratio. Third, it has been proved that the convolutional layers also contain a degree of semantically meaningful features [144]. For example, Liu et al. [145] exploited CNNs with deeply local description for remote sensing image classification and proved that deeply local descriptors outperformed the features extracted from fully connected layers. Owing to these promising advantages, many recent studies have shifted to fully exploit the benefits of the convolutional layers for RGB-D object recognition.

Socher et al. [146] proposed a deep feature learning model combining convolutional and recursive neural networks (CNN-RNNs). A single CNN layer [147] learnt low-level features from the RGB and depth modalities. Then these features were average pooled and assembled by multiple RNNs [148] with a fixed tree structure to construct a higher-order representation. The RNN descriptors extracted from each modality were merged and fed to a joint softmax for classification. Experimental results proved that the CNN-RNN model outperformed a two-layer CNN. An overall framework of the CNN-RNN model is shown in Fig. 14.

After the work of Socher et al. [146], many improved methods were proposed. Bai et al. [113] improved Socher's work by proposing a subset based sparse auto-encoder to learn low-level features from the RGB and depth modalities and improved the performance. Yin et al. [149] also adopted a single CNN layer to learn low-level features, but the fusion took place at an earlier stage. In their work, a single CNN layer [147] and an average pooling strategy were utilized to learn low-level features from the RGB and depth modalities. Then the low-level features were fused into a united feature map and input to another CNN layer to extract high-level features. Last, a kernel extreme learning machine (KELM) [150], [151] was trained for classification. Comparative results showed that the proposed method outperformed the CNN-RNN model [146]. An overall framework by Yin et al. [149] is shown in Fig. 15.


FIGURE 11. The configuration of the CNN model proposed by Wang et al. [114].

FIGURE 12. The framework of matrix transformations for multimodal feature learning by Wang et al. [114]. T_1 = [T_1s; T_c] and T_2 = [T_2s; T_c] denote the learnt features for the color and depth modalities, respectively. T_1s and T_2s denote the RGB and depth modal-specific parts. T_c is the common part. T = [T_1s; T_c; T_2s]. W_1 = [W_1s; W_c] and W_2 = [W_2s; W_c] are the transformation matrices for the RGB and depth modalities. W_1s and W_2s denote the transformation matrices for the RGB and depth modal-specific parts. W_c is the transformation matrix for the common part. W is an additional matrix that maps the feature representation T to the actual label L.

FIGURE 13. The PCSL-based multimodal feature learning framework for RGB-D object recognition by Jin et al. [142].

FIGURE 14. Overview of the CNN-RNN model by Socher et al. [146].

FIGURE 15. The CNN-RNN model for RGB-D object recognition by Yin et al. [149].

FIGURE 16. The framework of the multimodal ELM-LRF model proposed by Liu et al. [152].

Liu et al. [152] improved Yin et al.'s work [149] by leveraging a local receptive field extreme learning machine (LRF-ELM) [153], in which the links between the input and hidden layer nodes are sparse and bounded by corresponding receptive fields, to extract low-level features in each modality. Then the features of each modality were concatenated and fed into another LRF-ELM to extract high-level features for recognition. An overview of Liu's multimodal ELM-LRF model is shown in Fig. 16.

The aforementioned methods extracted low-level features based on a single-layer CNN, and used either multiple RNNs [146] or another single-layer CNN [149], [152] to learn high-level features. Different from these methods, which used shallow neural networks, Carlucci et al. [154] utilized the CaffeNet architecture [136] in two different ways to model the RGB and depth modalities. Specifically, the CaffeNet for the RGB modality was trained on ImageNet and named ''Caffe-ImageNet''. Instead, the CaffeNet for the depth modality was trained using a synthetic depth dataset (the VANDAL dataset, which contains 4.1 million synthesized depth images) and named ''DepthNet''.


FIGURE 17. DepthNet (left) and the associated RGB-D object classification framework (right) proposed by Carlucci et al. [154].

Then the fifth convolutional layers (Conv5) of the two networks were combined and fed into a multi-kernel learning classifier [155] for recognition, as shown in Fig. 17. Experimental results indicated that feature fusion in the Conv5 layer performed better than in fc6 and fc7.

To fully exploit the convolutional layer features, Zaki et al. [133] proposed a deeply supervised multimodal embedding (DSMME) model. To be specific, they used a pre-trained CNN-M [156] as the backbone network to initialize both the RGB and the embedded depth streams. The activations of the feature extractors for both streams were combined in a cross-modality fashion via a bilinear operation [157] at various feature scales. The resultant shared representation at each network branch was passed to an independent softmax classifier for multi-scale joint deeply supervised learning. Then the last normalized fully-connected bilinear activation was extracted and its dimensionality was reduced using PCA. Last, an ELM was utilized as the multi-class classifier for object recognition [150]. An overview of the DSMME model is shown in Fig. 18.

FIGURE 18. Overview of the DSMME model introduced by Zaki et al. [133].

Loghmani et al. [131] utilized two streams of pre-trained ResNet-18 [38] to process the RGB and depth modalities (encoded by surface normals), and volumetric features were extracted at different convolutional levels. Then the volumetric features were individually transformed into a vector through projection blocks consisting of two convolutional layers and a global max pooling layer. The RGB-D features extracted from the different levels were sequentially fed into a recurrent neural network (RNN) to produce a descriptive and compact multimodal feature. Last, the output of the RNN was used by the softmax classifier for the final prediction. Experimental results showed that the multi-level feature fusion method outperformed the single-level feature fusion method. An overall framework of Loghmani's recurrent convolutional fusion (RCFusion) is shown in Fig. 19.

FIGURE 19. Architecture of RCFusion proposed by Loghmani et al. [131].

C. LATE LEVEL FUSION
Most RGB-D object recognition methods leverage early fusion because it can utilize the correlation between multiple features from different modalities at an early stage, which helps in better task accomplishment [158]. However, the features should be represented in the same format before fusion. Unlike in early level fusion, the decisions (at the semantic level) usually have the same representation, which makes decision fusion much easier.

1) CASCADE FUSION
Usually, a fusion operation is performed once in a process. A two-stage fusion scheme was presented by Zaki et al. [120]. In their work, a pre-trained CNN-F model [156] was applied as the feature extractor for RGB, embedded depth and embedded point cloud, respectively. For the case of late fusion, the hypercube pyramid features of the convolutional layers F_hc and the fully-connected layer features (fc6) F_fc were first fed to two ELM classifiers to predict the class probabilities y_hc and y_fc. Then, the concatenation of these new vectors, F_c = [y_hc, y_fc], was used as an input vector to another ELM for final classification. Experimental results proved that the late fusion scheme outperformed the early fusion scheme, which simply concatenated F_hc and F_fc. An overview of Zaki's cascaded late fusion framework is shown in Fig. 20.
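The cascade pattern can be sketched as follows: first-stage classifiers produce class-probability vectors from each feature type, the probabilities are concatenated, and a second-stage classifier makes the final decision. Logistic regression stands in here for the ELM classifiers of [120], and the feature matrices are placeholders, so this is an illustration of the two-stage scheme rather than the original implementation.

# Sketch of cascade late fusion: per-feature classifiers -> concatenated class
# probabilities F_c = [y_hc, y_fc] -> second-stage classifier for the final decision.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_cascade(F_hc, F_fc, labels):
    clf_hc = LogisticRegression(max_iter=1000).fit(F_hc, labels)   # hypercube-pyramid features
    clf_fc = LogisticRegression(max_iter=1000).fit(F_fc, labels)   # fc6 features
    F_c = np.concatenate([clf_hc.predict_proba(F_hc),
                          clf_fc.predict_proba(F_fc)], axis=1)     # F_c = [y_hc, y_fc]
    clf_final = LogisticRegression(max_iter=1000).fit(F_c, labels)
    return clf_hc, clf_fc, clf_final

def predict_cascade(clf_hc, clf_fc, clf_final, F_hc, F_fc):
    F_c = np.concatenate([clf_hc.predict_proba(F_hc),
                          clf_fc.predict_proba(F_fc)], axis=1)
    return clf_final.predict(F_c)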
2) HIERARCHICAL FUSION
In hierarchical fusion, the class probabilities are determined by the fusion of class probabilities estimated at different hierarchical levels. A typical hierarchical fusion scheme for RGB-D object recognition was proposed by Asif et al. [5].


FIGURE 20. Overview of cascaded late fusion framework for RGB-D object recogniton by Zaki et al. [120].

They divided the classification into image level and pixel level. The final object classification loss function L_Object was a weighted combination of a pixel-level classification loss function L_Pixel and an image-level classification loss function L_Image:

L_{Object} = \alpha L_{Pixel} + (1 - \alpha) L_{Image},    (4)

where α controls the relative influence of the pixel-level regularization on the overall training objective function.
In their work, the image-level classification branch, named CNN-I, was composed of five successive convolutional layers followed by two fully-connected layers, FC6 and FC7, as shown in Fig. 21. The pixel-level classification branch, named CNN-P, originated from the Conv5 layer of the CNN-I branch and was composed of two convolution layers, Conv6 and Conv7. Both the CNN-I and the CNN-P branches learnt modality-specific features from VGG-11 [36]. Then, two fusion branches, namely CNN-FV and FS2, were constructed to fuse the two modalities at the image level and pixel level, respectively. The CNN-FV branch consisted of a Fisher encoding module, a concatenation module, and two fully-connected layers. The fusion branch FS2 was composed of two convolution layers and a deconvolution layer. Finally, the weighted loss function in Eq. 4 was used to combine the pixel-level and image-level classification losses, as sketched below.
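For illustration, the weighted objective in Eq. 4 can be written down in a few lines. The following is a minimal PyTorch-style sketch, not the implementation of [5]; the tensor shapes, the loss choice (cross-entropy) and the value of alpha are assumptions made here for the example.

```python
import torch.nn.functional as F

def object_loss(image_logits, pixel_logits, image_labels, pixel_labels, alpha=0.5):
    """Weighted combination of pixel-level and image-level losses (cf. Eq. 4).

    image_logits: (B, C) class scores from the image-level branch.
    pixel_logits: (B, C, H, W) per-pixel class scores from the pixel-level branch.
    alpha: relative weight of the pixel-level term (placeholder value).
    """
    loss_image = F.cross_entropy(image_logits, image_labels)          # L_Image
    loss_pixel = F.cross_entropy(pixel_logits, pixel_labels)          # L_Pixel
    return alpha * loss_pixel + (1.0 - alpha) * loss_image            # L_Object
```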
A summary of supervised MMCNN methods for RGB-D recognition is shown in Table 4.

IV. SEMI-SUPERVISED LEARNING BASED MMCNNS FOR RGB-D OBJECT RECOGNITION
Supervised learning based CNNs have so far achieved great success in RGB-D object recognition, but they are highly dependent on large-scale manually labeled RGB-D data, whose annotation is an expensive, time-consuming and tedious task. It is therefore necessary to develop effective training frameworks that allow deep learning to benefit from the massive amount of unlabeled RGB-D data, which is cheap and readily available. To address the lack of large-scale annotated RGB-D datasets, a natural idea is to incorporate semi-supervised learning methods into the deep learning framework to automatically exploit unlabeled data [159].
Semi-supervised methods address the problem of learning a better classifier by combining a small set of labeled data with a large amount of unlabeled data [159]. Many semi-supervised learning methods have been proposed in the literature [159], e.g., self-training [160], [161], co-training [162], [163], expectation-maximization (EM) [164], [165] and graph-based methods [166], [167]. Among these methods, co-training was theoretically proved to be very appropriate and successful in combining labeled and unlabeled data under three strong assumptions [168]. Specifically, (i) the features can be split into two sets; (ii) each sub-feature set is sufficient to train a good classifier; and (iii) the two sets are conditionally independent given the class. RGB-D data meets these three assumptions very well. First, RGB-D data contains two distinct views, RGB and depth. Second, both of them provide useful cues for object recognition. Third, the image capturing modes of RGB (e.g., RGB cameras) and depth (e.g., infrared cameras) are very different.
From the standpoint of multimodal fusion, the semi-supervised methods are also grouped into early level fusion and late level fusion. In this section, late level fusion is presented first, considering that it was introduced earlier in the literature and is more straightforward.




FIGURE 21. Overview of hierarchical fusion for RGB-D object recognition by Asif et al. [5].

TABLE 4. Summary of supervised MMCNN methods for RGB-D object recognition. In the "Fusion level" column, "EF" and "LF" represent Early Fusion and Late Fusion, "CL" and "FCL" denote the Convolutional Layer and Fully-Connected Layer in early level fusion, and "C" and "H" represent Cascade fusion and Hierarchical fusion in late level fusion, respectively.

A. WEIGHTED AVERAGE BASED LATE LEVEL FUSION
The weighted average based fusion is the simplest but the most commonly used late level fusion method in semi-supervised learning based MMCNNs for RGB-D object recognition. It was first proposed by Cheng et al. [169].
In Cheng et al.'s work [169], they adopted the co-training [168] scheme to learn from the unlabeled data. Specifically, they utilized a CNN-RNN model [146] to learn features from the RGB and depth modalities and trained two SVM classifiers on the labeled training sets. The trained classifiers C_RGB and C_Depth were applied to predict the examples from the unlabeled training sets. The most confidently predicted instances of each class by the two classifiers were transferred from the unlabeled dataset to the labeled dataset for the next round of training. The algorithm ran until it reached the maximum number of iterations or the unlabeled pool was empty. The category of an input instance was determined by the weighted sum of the two probability scores generated by the two classifiers:

c = \arg\max_{c_i \in \chi} \left( \alpha P_{C_{RGB}}^{c_i} + (1 - \alpha) P_{C_{Depth}}^{c_i} \right),    (5)

where P_{C_{RGB}} and P_{C_{Depth}} are the probability scores generated by the RGB and depth classifiers, and α is a weighting parameter determined by cross-validation. Experimental results on the Washington RGB-D dataset [77] showed that using 20% of the labeled Washington RGB-D training set (7,000 images) together with the 80% unlabeled training set outperformed many state-of-the-art methods that use 100% of the labeled training set [88], [146]. An overview of Cheng's work is shown in Fig. 22, and a minimal sketch of the co-training loop is given below.

FIGURE 22. Co-training based semi-supervised learning for RGB-D object recognition by Cheng et al. [169].
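The sketch below illustrates the general co-training loop with weighted-average late fusion (Eq. 5). It is a simplification under stated assumptions rather than the code of [169]: it assumes pre-extracted feature matrices for each modality, uses scikit-learn SVMs with probability outputs, and replaces the per-class selection described above with a global top-k selection.

```python
import numpy as np
from sklearn.svm import SVC

def co_train(f_rgb_l, f_d_l, y_l, f_rgb_u, f_d_u, n_select=50, max_rounds=10):
    """Simplified co-training over RGB and depth features (cf. the scheme in [169])."""
    clf_rgb = SVC(probability=True).fit(f_rgb_l, y_l)
    clf_d = SVC(probability=True).fit(f_d_l, y_l)
    for _ in range(max_rounds):
        if len(f_rgb_u) == 0:                      # unlabeled pool exhausted
            break
        p = clf_rgb.predict_proba(f_rgb_u) + clf_d.predict_proba(f_d_u)
        # Move the most confidently predicted unlabeled samples into the
        # labeled pool together with their pseudo-labels.
        picked = np.argsort(p.max(axis=1))[-n_select:]
        pseudo = clf_rgb.classes_[p[picked].argmax(axis=1)]
        f_rgb_l = np.vstack([f_rgb_l, f_rgb_u[picked]])
        f_d_l = np.vstack([f_d_l, f_d_u[picked]])
        y_l = np.concatenate([y_l, pseudo])
        keep = np.setdiff1d(np.arange(len(f_rgb_u)), picked)
        f_rgb_u, f_d_u = f_rgb_u[keep], f_d_u[keep]
        clf_rgb = SVC(probability=True).fit(f_rgb_l, y_l)   # retrain for next round
        clf_d = SVC(probability=True).fit(f_d_l, y_l)
    return clf_rgb, clf_d

def fuse_predict(clf_rgb, clf_d, f_rgb, f_d, alpha=0.5):
    """Weighted-average late fusion of the two probability scores (cf. Eq. 5)."""
    p = alpha * clf_rgb.predict_proba(f_rgb) + (1 - alpha) * clf_d.predict_proba(f_d)
    return clf_rgb.classes_[p.argmax(axis=1)]
```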

In their follow-up work, Cheng et al. [170] also leveraged the co-training [168] scheme to learn from the unlabeled data iteratively, but the features from the RGB and depth modalities were improved by using a spatial pyramid matching (SPM) layer [171] to address the cropping problem in the CNN-RNN model [146]. The category of an input instance was determined in a late-fusion fashion in which the two probability scores generated by the two SVM classifiers were weighted averaged. Experiments indicated that the introduced method further improved their previous work [169] by adding the SPM layer. Similarly, Yin and Li [172] used co-training and a weighted-average late level fusion scheme to learn from the unlabeled data and fuse the two modalities. The difference is that they utilized an ELM classifier instead of an SVM classifier for classification in each modality.

B. CONCATENATED BASED EARLY LEVEL FUSION
The semi-supervised methods mentioned above utilized a CNN-RNN based model to extract features in each modality, and the final results were obtained by aggregating the decisions from each modality. The reason why they did not adopt deeper CNN architectures and an early fusion scheme is that a small labeled dataset is infeasible to supervise the training of a deep CNN model for object recognition, so the learned features are unreliable.
To overcome this barrier, Cheng et al. [46] utilized two reconstruction networks, which consisted of 5 convolutional layers and 12 fully-connected layers, to decode each channel of the inputs using the labeled and unlabeled data. The configuration of the 5 convolutional layers in the reconstruction network was the same as in AlexNet [35]. The input of the network was a re-scaled RGB or depth image denoted as x ∈ R^{148×148×3}, and the output was a reconstructed map R(x) ∈ R^{64×64×3}. The reconstruction network was trained by minimizing the mean square reconstruction error:

Loss_R^{v} = \frac{1}{M+N} \sum_{x \in \{L^v, U^v\}} \sum_{ch=1}^{3} \left\| \tilde{x}_{ch} - R_{ch}(x) \right\|^{2},    (6)

where M and N are the sizes of the labeled and unlabeled pools, ch denotes the channel index of the input, L and U denote the labeled and unlabeled datasets, and v represents the RGB or depth modality. x̃ is the ground truth obtained by re-sizing x via bilinear interpolation; a minimal sketch of this objective is given below.
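The following PyTorch-style sketch illustrates the reconstruction objective of Eq. 6 for one modality. The decoder `recon_net`, its input/output sizes and the batch-wise averaging are assumptions for illustration; it is not the implementation of [46].

```python
import torch.nn.functional as F

def reconstruction_loss(recon_net, x):
    """Mean-square reconstruction error of Eq. 6 for one modality (sketch).

    recon_net: placeholder decoder mapping (B, 3, 148, 148) -> (B, 3, 64, 64).
    x: a batch drawn from the union of the labeled and unlabeled pools.
    """
    r = recon_net(x)                                         # R(x)
    x_tilde = F.interpolate(x, size=(64, 64), mode="bilinear",
                            align_corners=False)             # ground truth x~
    # Squared error summed over channels and pixels, averaged over the batch,
    # which approximates the 1/(M+N) average over the two pools.
    return ((x_tilde - r) ** 2).sum(dim=(1, 2, 3)).mean()
```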
FIGURE 23. Overall framework of the diversity preserving co-training method by Cheng et al. [46].

When the reconstruction network of each modality achieved convergence, the parameters of its convolutional layers were utilized to initialize the corresponding convolutional layers of that modality in the proposed framework, as shown in Fig. 23. Then a diversity preserving co-training algorithm was proposed to select highly confident examples with predicted labels from the unlabeled pool, while ensuring that these newly labeled examples preserve intra-class diversity. Last, a fusion classification layer was added by concatenating the two fc7 layers, and the entire model was optimized end-to-end. They utilized 5% of the labeled Washington RGB-D training set (1,750 images) to train the model. Experiments showed that the introduced method outperformed their earlier work [169], [170] and that the results were comparable to those of state-of-the-art supervised methods [22], [40], [114].




TABLE 5. Summary of semi-supervised MMCNNs based methods for RGB-D object recognition. (The labeled data size is specific to Washington RGB-D
dataset.)

FIGURE 24. The configuration of the Depth-Net proposed by Sun et al. [47].

Sun et al. [47] proposed a weakly-supervised learning architecture for RGB-D based object recognition. This work can be regarded as a semi-supervised learning scheme because it also utilizes a small number of labeled samples and large-scale unlabeled samples [173]. In their work, a VGG-16 architecture [36] was leveraged as the RGB-Net. The Depth-Net was newly designed, as shown in Fig. 24, and pre-trained on 290,000 virtual depth images which were projected from a 40-class subset of the Princeton ModelNet dataset [174]. Then, the fc7 layers of the two modalities were concatenated into an 8192-dimension vector and fed to a Gaussian process classification (GPC) [175]. After the GPC was trained and its hyper-parameters were optimized using a small labeled dataset (0.3% of the Washington RGB-D training set with 1750 images), the GPC was utilized to propagate labels to the large-scale unlabeled dataset. Last, the GPC was replaced by a softmax layer connected with the fully-connected layer fc8, and the whole CNN model was trained in an end-to-end fashion. Experiments proved that the introduced method outperformed the co-training based methods [46], [169], [170], [172]. An overview of Sun's work is shown in Fig. 25, and a sketch of the label propagation step is given below. A summary of semi-supervised MMCNN based methods for RGB-D object recognition is shown in Table 5.

FIGURE 25. The architecture of the multimodal CNN-GPC model proposed by Sun et al. [47].
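The fusion-and-propagation step in this pipeline can be illustrated with a small sketch: concatenate the fc7 features of the two modalities, fit a Gaussian process classifier on the few labeled samples, and use it to pseudo-label the unlabeled pool. The sketch uses scikit-learn's GaussianProcessClassifier as a generic stand-in, not the authors' implementation, and the array names are placeholders.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

def propagate_labels(fc7_rgb_l, fc7_depth_l, y_l, fc7_rgb_u, fc7_depth_u):
    """Concatenate per-modality fc7 features (8192-D) and propagate labels via a GPC."""
    x_l = np.hstack([fc7_rgb_l, fc7_depth_l])   # labeled features
    x_u = np.hstack([fc7_rgb_u, fc7_depth_u])   # unlabeled features
    gpc = GaussianProcessClassifier(kernel=RBF(length_scale=1.0))
    gpc.fit(x_l, y_l)                           # trained on the small labeled subset
    return gpc.predict(x_u)                     # pseudo-labels for the unlabeled pool
```

Note that an exact GPC scales poorly with the number of labeled samples, which is one reason this strategy pairs naturally with a very small labeled set.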
V. BENCHMARK DATASETS AND RESULTS
A. DATASETS FOR RGB-D OBJECT RECOGNITION
A number of RGB-D benchmark datasets have been collected and made publicly available [19], [176], [177]. Firman [176] catalogued a considerable quantity of RGB-D datasets available for researchers to use. Guo et al. [177] presented a contemporary summary of the existing benchmark datasets in 3D computer vision. Cai et al. [19] systematically surveyed popular RGB-D datasets for different applications, including object recognition, scene classification, hand gesture recognition, 3D simultaneous localization, and pose estimation.
Although many RGB-D object datasets exist in the literature, the datasets appropriate for object recognition evaluation are still limited, for several reasons. Some RGB-D datasets, e.g., SUN RGB-D [178], NYU V1 [179], NYU V2 [180] and Cornell RGB-D [181], were built for scene recognition tasks. Some datasets are not publicly accessible, for example, the hand-held RGB-D object dataset (HOD) by Lv et al. [182]–[184]. The BigBIRD dataset [185] is one of the biggest RGB-D object datasets we considered, containing 121 object instances and 75,000 images. Unfortunately, it is hard to evaluate depth features with this dataset because many objects are extremely similar, and many are boxes, which are indistinguishable without texture information [128]. To the best of our knowledge, only Li et al. [108] evaluated object recognition performance on this dataset. Therefore, we review the three public and most widely used RGB-D datasets for object recognition evaluation, namely, the Washington RGB-D object dataset [77], the 2D/3D object dataset [81] and the JHUIT-50 object dataset [186].




FIGURE 26. Object samples from the Washington RGB-D object dataset (a), and examples of RGB and depth images of an object (the top of (b)) and RGB-D scene images (the bottom of (b)).

(1) Washington RGB-D object dataset¹
The Washington RGB-D object dataset [77] is the most widely used dataset for RGB-D object recognition; it was collected by the University of Washington in 2011. The dataset contains color and depth information of 300 objects from 51 categories of household objects, which are organized in a hierarchical structure.
Images were captured with a Kinect-style camera at a resolution of 640 × 480. Each object was recorded from 3 viewing heights (30°, 45° and 60° above the horizon). Each video sequence was recorded at 20 Hz and contained around 250 frames, giving 150 views per object and nearly 210,000 RGB-D images. Each sample consists of a portable network graphics (PNG) file and the corresponding point cloud data (PCD) file. Meanwhile, the objects provided in the dataset have been previously segmented from their background. Fig. 26 illustrates some selected objects from this dataset as well as examples of RGB-D images.
(2) 2D/3D object dataset²
The 2D/3D object dataset was collected by Browatzki et al. [81] from the Max Planck Institute for Biological Cybernetics, Germany, in 2011. It contains 154 objects from 14 categories that are likely to be encountered by a robot operating in a household environment.

FIGURE 27. Examples of RGB (the first two rows) and depth images (the last two rows) from the 2D/3D object dataset.

Each object was put on a step-motor-controlled turntable and recorded every 10° around the vertical axis, yielding 36 views per object, with a PMD CamCube 2.0 time-of-flight camera. In total, the dataset contains 5,544 views, with every view consisting of two RGB images (1388 × 1038) and depth images (204 × 204). The 2D/3D dataset is very challenging for object recognition because of the huge variance of views. Fig. 27 shows some RGB images and the corresponding depth images from the 2D/3D object dataset.
(3) JHUIT-50 object dataset³
JHUIT-50 [186] was collected by Johns Hopkins University. It contains 14,698 RGB-D images capturing 50 common workshop tools, such as clamps and screwdrivers, as shown in Fig. 28. The PrimeSense Carmine 1.08 depth sensor was used to capture the depth images. All the objects were segmented from the background. Training sequences were captured under three fixed viewing angles (30°, 45° and 60°), and testing sequences were collected under random viewpoints of the camera. JHUIT-50 is a challenging dataset that focuses on the problem of fine-grained classification.

FIGURE 28. Examples of the 50 industrial objects in the JHUIT-50 dataset.

B. COMPARATIVE RESULTS
Broadly, RGB-D object recognition deals with two different problems, namely, category recognition and instance recognition [77].

¹ https://rgbd-dataset.cs.washington.edu/people.html
² http://www.kyb.mpg.de/nc/employee/details/browatbn.html
³ https://cirl.lcsr.jhu.edu/research/human-machine-collaborative-systems/visual-perception/jhu-visual-perception-datasets/




(1) Category level recognition
Category level recognition involves classifying objects as belonging to certain categories even though the system has not seen them before. To this end, the system is trained on a set of labeled objects. At test time, the system is presented with an RGB and depth image pair containing an object that is not present in the training set, and the task is to assign a category label to this object. The accuracies are averaged across many trials with alternating contiguous frames, and the result is reported as the mean accuracy with standard deviation.
(2) Instance level recognition
Instance level recognition is to identify whether an object is physically the same one as that previously seen, based on its unique appearance. The system is trained on a subset of views of each object, and the task here is to distinguish between object instances. At test time, the system is presented with a pair of RGB and depth images that contains a previously unseen view of one of the objects and assigns an instance label to the image.
The Washington RGB-D dataset can be evaluated for both category recognition and instance recognition. For category recognition, one object is chosen randomly from each category for testing, and the classifiers are trained on all views of the remaining objects. For object instance recognition, the leave-sequence-out method is adopted, in which the video sequences of each object recorded at 30° and 60° are used for training and the sequences recorded at 45° are leveraged for instance recognition. Each split consists of roughly 35,000 labeled training images and 7,000 testing images [77]. It is worth mentioning that for the semi-supervised learning methods, the number of labeled training images was set to 7,000 for [169], [170], [172], 1,500 for [46], and 105 for [47]. The accuracy is averaged across 10 trials for category recognition and instance recognition with alternating contiguous frames. A sketch of the two split protocols is given below.
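The following sketch makes the two Washington RGB-D protocols concrete: hold out one randomly chosen instance per category for category recognition, and use the 30°/60° sequences for training and the 45° sequence for testing in the leave-sequence-out instance protocol. The frame records and their field names are illustrative, not part of the dataset's on-disk format.

```python
import random

def category_split(frames, seed=0):
    """Hold out one randomly chosen instance per category (category recognition)."""
    rng = random.Random(seed)
    instances = {}                      # category -> set of instance ids
    for f in frames:                    # f: dict with 'category', 'instance', 'angle'
        instances.setdefault(f["category"], set()).add(f["instance"])
    test_ids = {rng.choice(sorted(ids)) for ids in instances.values()}
    train = [f for f in frames if f["instance"] not in test_ids]
    test = [f for f in frames if f["instance"] in test_ids]
    return train, test

def instance_split(frames):
    """Leave-sequence-out: train on the 30/60 degree sequences, test on 45 degrees."""
    train = [f for f in frames if f["angle"] in (30, 60)]
    test = [f for f in frames if f["angle"] == 45]
    return train, test
```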
The 2D/3D object dataset is evaluated for category recognition only. The dataset is randomly split into training and test sets. Six objects of each class are used for training and the remaining objects are used for testing (the only exception is the class scissors, which has fewer than six instances; in this case, at least one instance is available for validation). For each training object, only 18 of the 36 views are used. Eventually, 82 objects in 1,476 RGB-D images are chosen as training data, while 74 objects in 1,332 RGB-D images are used for testing. It is worth mentioning that for the semi-supervised learning method [170], the labeled training set is 20% of the training data (7,000 images). The experiment is repeated 30 times, each time randomly splitting each category of objects into training and test sets, and the final accuracy is averaged across the 30 trials.
The JHUIT-50 object dataset is used for instance level object recognition only. On the training side, each object is placed on a turntable and rotated in increments of 7.2° at three fixed camera viewing angles of 30°, 45° and 60°. This amounts to 360/7.2 × 3 = 150 object views (7,500 images) in total for training. For the testing data, the camera is manually moved around the objects to sample another 150 random views of each object from the whole viewing sphere.
Tables 6, 7 and 8 show the comparative results of different methods on the Washington RGB-D dataset, the 2D/3D object dataset and the JHUIT-50 object dataset, respectively. The best results are shown in bold.

C. DISCUSSION ON THE RESULTS
We make the following observations regarding the results:
(1) Intuitively, visual features play a dominant role in category level and instance level recognition, and combining the RGB and depth modalities improves the recognition accuracy and stability compared with a single modality.
(2) Generally, the traditional feature learning based methods outperform the hand-crafted feature based methods across all datasets, and the MMCNN based methods further improve the object recognition performance.
(3) The category recognition accuracies of some semi-supervised methods, e.g. [47], [170], [172], are above 90%, which is comparable to the state-of-the-art supervised methods. Moreover, the GPC in [47] is a recommendable way to learn from unlabeled data, which outperforms the traditional co-training based methods [46], [170], [172].
(4) For fully-connected layer fusion in supervised learning based MMCNNs, the best and most commonly used strategy is to combine the RGB-CNN and Depth-CNN by discarding their own classification layers. For example, the fc7 layer of CaffeNet [40], [123], [124], [132], [142] and VGGNet [121], [128], Dense Block (4) of DenseNet [129], and the conv5_x layer of ResNet [130] are extracted as features from the RGB and depth modalities. Schwarz et al. [119] carried out a comparative experiment by connecting the fc7 layer with the classification layer fc8 in CaffeNet before feature fusion. Comparative results showed that combining the fc7 and fc8 layers achieved a recognition accuracy of 89.4 ± 1.3%, which was worse than the result of fc7 fusion alone (91.3 ± 1.4%) [40]. A minimal sketch of this fc7-level fusion is given below.
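The fc7-level fusion summarized in observation (4) amounts to cutting both streams at their last fully-connected feature layer and concatenating the two feature vectors before a shared classifier. The PyTorch-style sketch below illustrates this pattern under stated assumptions (placeholder backbones truncated after fc7, a 4096-D feature per stream, 51 categories as in the Washington dataset); the surveyed works differ in backbones and training details.

```python
import torch
import torch.nn as nn

class FC7Fusion(nn.Module):
    """Late fusion of two modality streams at the fc7 level (sketch)."""

    def __init__(self, rgb_net, depth_net, feat_dim=4096, num_classes=51):
        super().__init__()
        self.rgb_net = rgb_net          # backbone truncated after fc7
        self.depth_net = depth_net      # backbone truncated after fc7
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, rgb, depth):
        fused = torch.cat([self.rgb_net(rgb), self.depth_net(depth)], dim=1)
        return self.classifier(fused)   # class scores from the fused feature
```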




TABLE 6. Comparative performance of different methods on Washington RGB-D dataset.




TABLE 7. Comparative performance of different methods on 2D/3D object dataset (Category level (%)).

FIGURE 29. Category recognition accuracy of RGB modality with different CNN models.

TABLE 9. Category recognition accuracy of depth modality with different encoding methods (the first four rows) and the same encoding method with different CNN models (the first row and the last two rows). The embedded depth encoding method is combined with the CNN-F model, which is very similar to CaffeNet except for the number of kernels in the convolutional layers.

TABLE 8. Comparative performance of different methods on JHUIT-50 dataset (Instance level (%)).

TABLE 10. Runtime performance of different methods on Washington RGB-D dataset.

(5) For the RGB modality, a common approach is to utilize an off-the-shelf CNN architecture pre-trained on ImageNet and fine-tune it using the labeled RGB images from the RGB-D dataset. The category recognition accuracies of the RGB modality with different CNN models are compared in Fig. 29. They indicate that a deeper network often yields a better recognition result. This is also verified by Zhou et al. [129]: in Zhou's work, four architectures, namely AlexNet, VGG-16, ResNet-101 and DenseNet-121, were adopted in the MSANet framework, and comparative results showed that DenseNet-121 performed best, followed in turn by ResNet-101, VGG-16 and AlexNet. However, this is not always the case for depth images, because the performance of the depth modality is related not only to the network model but also to the depth encoding method. To probe the influence of the encoding methods and the CNN models on depth-based category recognition accuracy, we compared the category recognition accuracy of the same CNN model with different encoding methods and of the same encoding method with different CNN models. Comparative results are shown in Table 9. They indicate that the effect of the encoding method on depth recognition accuracy is greater than that of the CNN model. Meanwhile, they show that surface normals outperform the other depth encoding methods; a minimal sketch of such an encoding is given below.
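Since Table 9 indicates that surface normals are the strongest of the compared depth encodings, the sketch below shows one generic way to obtain such an encoding: approximate the normals from depth gradients and rescale them to a three-channel image that a pre-trained RGB CNN can consume. This is an illustrative approximation, not the exact encoding used in the compared works.

```python
import numpy as np

def depth_to_normals(depth):
    """Encode a depth map (H, W), in metres, as a 3-channel surface-normal image."""
    dz_dv, dz_du = np.gradient(depth)            # derivatives along rows and columns
    # Normal direction ~ (-dz/du, -dz/dv, 1), normalized per pixel.
    n = np.dstack([-dz_du, -dz_dv, np.ones_like(depth)])
    n /= np.linalg.norm(n, axis=2, keepdims=True) + 1e-8
    # Map the components from [-1, 1] to [0, 255] so the result resembles an RGB image.
    return ((n + 1.0) * 0.5 * 255).astype(np.uint8)
```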
(6) The runtime performance of the MMCNN based methods is closely related to many factors, e.g., the CNN architecture, the training strategy and the computing hardware, especially GPU performance. Considering that different methods involve different factors, there are relatively few comparative results on this point. Table 10 compares the runtime performance of six MMCNN based methods on the Washington RGB-D dataset. It shows that the training time has been reduced to 3 hours with four Nvidia Titan X GPUs [123] and that the testing time has reached 200 ms [47], [119]. However, there is still a certain gap with respect to real-time practical applications. More efforts are highly required in this respect, which will be discussed in Section VI.

VI. FUTURE RESEARCH DIRECTIONS
Although a significant number of MMCNN based RGB-D object recognition methods have been proposed, there are several areas of investigation that may be explored in the future. We identify some of them as follows:
(1) Large-scale RGB-D object datasets: With the development of data-hungry deep learning approaches, there is an urgent demand for large-scale RGB-D datasets. Although low-cost RGB-D sensors can be easily used by the robotics research community, the number of sufficiently large training datasets of depth images is still limited in comparison with RGB images. Although transfer learning and semi-supervised methods can alleviate the problem of insufficient training data to some extent, more quantities and variations of RGB-D training datasets are highly desired and important, and innovations in collecting and annotating these datasets are also needed.




(2) Zero/one-shot learning: As aforementioned, it is not always easy to collect large-scale labeled data like ImageNet, which includes more than a million color images from a thousand object categories. Learning from a few examples remains a key challenge in machine learning. How to adopt deep learning methods for zero/one-shot RGB-D based object recognition is an interesting research direction. Zero/one-shot learning aims to recognize classes that have never been seen before, or for which only one training sample per class is available [188], [189]. In the past few years, there have been some works on zero/one-shot learning for classification and recognition tasks. For example, Long et al. [190] proposed a zero-shot learning framework that mapped semantic embeddings to a discriminative representation space and achieved promising performance on both recognition and retrieval tasks. Mettes and Snoek [191] proposed a spatial-aware object embedding for zero-shot action localization and classification. Zhang et al. [192] studied one-shot gesture recognition on RGB-D data recorded with Microsoft's Kinect and proposed a novel bag-of-manifold-words feature representation on symmetric positive definite manifolds. However, how to effectively adopt deep learning methods for zero/one-shot RGB-D based object recognition remains a topic for future research.
(3) Unsupervised learning: Collecting labeled datasets is time-consuming and costly, hence learning from unlabeled video data is required. Mobile robots mounted with RGB-D cameras need to continuously learn from the environment without human intervention. How to automatically learn from unlabeled data to improve the learning capability of deep networks would be a fruitful and useful research direction. Recently, the generative adversarial network (GAN) has achieved great success in image generation tasks, such as face generation [193], [194] and text/image-to-image translation [195], [196]. It can also be used for recognition tasks; for example, Tran et al. [197] proposed a disentangled representation learning generative adversarial network (DR-GAN) for pose-invariant face recognition. Therefore, we believe that GAN-based techniques can reduce the heavy dependence on richly annotated training data, which is an exciting direction for RGB-D object recognition.
(4) Depth data processing and representation: Depth data is a critical modeling element in an RGB-D vision system. Although RGB-D data is now widespread and has been incorporated into many applications, the depth data is still typically noisy and incomplete. Depth data processing techniques still lag behind RGB data processing techniques. Robust processing of depth data, especially denoising, inpainting, segmentation, and hole-filling, is highly required. Meanwhile, more work needs to be carried out to find the best methods for representing depth. Currently, the depth modality is usually treated as a 2D image feature, either by stacking it directly as a fourth pixel channel [198] or by training a network to operate on 2D depth images enhanced with geocentric encodings [122]. However, these approaches do not make full use of the geometric information and make it difficult to integrate information across viewpoints. Rather than treating depth as a 2D mapping, 3D volumetric representations, either through 3D voxel grids [199] or through 3D volumes [200], [201], provide a rich and powerful representation of 3D shapes and give a more fundamental view of objects and scenes. To exploit discriminative features from 3D volumetric inputs, more sophisticated volumetric neural networks and effective 3D data augmentation methods are highly required.
(5) Real-time applications: Real-time RGB-D object recognition is required in practical applications. CNNs usually impose a great computational burden, which limits their use in real-time applications. To address this problem, one feasible approach is to develop new architectures that allow running CNNs in real time. He and Sun [202] conducted a series of experiments under constrained time cost and proposed models that were fast for real-world applications, yet competitive with existing CNN models. Li et al. [203] eliminated all the redundant computations in the forward and backward propagation of CNNs, which resulted in a speedup of over 1,500 times. Another trend is to develop deep learning chips for mobile devices. Recently, the idea of deep learning chips has emerged and drawn great attention. For example, Bong et al. [204] proposed a low-power convolutional neural network for user authentication on smart devices, which consumed 0.62 mW to evaluate one face at 1 fps and achieved 97% accuracy on the LFW dataset. Krestinskaya et al. [205] proposed analog backpropagation learning circuits for various memristive learning architectures, such as deep neural networks, binary neural networks, multiple neural networks, hierarchical temporal memory, and long short-term memory. It is expected that these methodologies will consolidate in the following few years and potentially make it practical to run neural networks locally on mobile devices in real-time applications.
(6) CNN feature interpretation and visualization: CNNs have achieved superior performance in many visual tasks. However, the end-to-end learning strategy makes CNN representations a black box: apart from the final network output, it is difficult to understand the logic of the CNN predictions hidden inside the network. In recent years, researchers have realized that high model interpretability is of significant value in both theory and practice, and have developed models with interpretable knowledge representations [206], for example, visualization of CNN representations in intermediate network layers [207]–[209], diagnosis of CNN representations [210]–[213], disentanglement of the mixture of patterns encoded in each filter of CNNs [214], [215], and building explainable models [216], [217]. More explicit and concise CNN feature interpretation and visualization technologies are required for better use of CNN features in vision tasks.




VII. CONCLUSIONS
In this paper, we reviewed recent advances in MMCNN-based RGB-D object recognition. We provided an insightful analysis of the solutions for training data deficiency and of multimodal fusion strategies. We categorized the general methods into supervised and semi-supervised schemes according to whether the training data are labeled or unlabeled. Furthermore, we analyzed the multimodal fusion strategies, categorized them into early level fusion and late level fusion based on the fusion level, and presented more detailed analyses. Meanwhile, the influence of different depth encoding methods, different CNN models and different multimodal fusion methods on RGB-D object recognition performance was analyzed as well. Finally, we identified some compelling challenges and future research directions that confront research in RGB-D computer vision using CNN-based approaches.

REFERENCES
[16] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10, pp. 1615–1630, Oct. 2005.
[17] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8697–8710.
[18] Z. Wu, C. Shen, and A. Van Den Hengel, "Wider or deeper: Revisiting the ResNet model for visual recognition," Pattern Recognit., vol. 90, pp. 119–133, Jun. 2019.
[19] Z. Cai, J. Han, L. Liu, and L. Shao, "RGB-D datasets using Microsoft Kinect or similar sensors: A survey," Multimedia Tools Appl., vol. 76, no. 3, pp. 4313–4355, 2017.
[20] D. Du, X. Xu, T. Ren, and G. Wu, "Depth images could tell us more: Enhancing depth discriminability for RGB-D scene recognition," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2018, pp. 1–6.
[21] X. Liu, D. Zhai, R. Chen, X. Ji, D. Zhao, and W. Gao, "Depth restoration from RGB-D data via joint adaptive regularization and thresholding on manifolds," IEEE Trans. Image Process., vol. 28, no. 3, pp. 1068–1079, Mar. 2019.
[22] Y. Cheng, R. Cai, X. Zhao, and K. Huang, "Convolutional Fisher kernels for RGB-D object recognition," in Proc. Int. Conf. 3D Vis. (3DV), 2015, pp. 135–143.
[23] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
REFERENCES Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[1] M. Naseer, S. Khan, and F. Porikli, ‘‘Indoor scene understanding [24] A. E. Johnson and M. Hebert, ‘‘Using spin images for efficient object
in 2.5/3D for autonomous agents: A survey,’’ IEEE Access, vol. 7, recognition in cluttered 3D scenes,’’ IEEE Trans. Pattern Anal. Mach.
pp. 1859–1887, 2019. Intell., vol. 21, no. 5, pp. 433–449, May 1999.
[2] S. Gupta, P. Arbeláez, R. Girshick, and J. Malik, ‘‘Indoor scene under- [25] L. Zheng, Y. Yang, and Q. Tian, ‘‘SIFT meets CNN: A decade survey of
standing with RGB-D images: Bottom-up segmentation, object detec- instance retrieval,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 5,
tion and semantic segmentation,’’ Int. J. Comput. Vis., vol. 112, no. 2, pp. 1224–1244, May 2018.
pp. 133–149, 2015. [26] L. Bo, X. Ren, and D. Fox, ‘‘Hierarchical matching pursuit for image
[3] M. Naseer, S. H. Khan, and F. Porikli. (2018). ‘‘Indoor scene understand- classification: Architecture and fast algorithms,’’ in Proc. Adv. Neural Inf.
ing in 2.5/3D for autonomous agents: A survey.’’ [Online]. Available: Process. Syst. (NIPS), 2011, pp. 2115–2123.
https://arxiv.org/abs/1803.03352 [27] M. Blum, J. T. Springenberg, J. Wülfing, and M. Riedmiller, ‘‘A learned
[4] C. Sakaridis, D. Dai, and L. Van Gool, ‘‘Semantic foggy scene under- feature descriptor for object recognition in RGB-D data,’’ in Proc. IEEE
standing with synthetic data,’’ Int. J. Comput. Vis., vol. 126, no. 9, Int. Conf. Robot. Autom. (ICRA), May 2012, pp. 1298–1303.
pp. 973–992, 2018. [28] Y. Bengio, A. Courville, and P. Vincent, ‘‘Representation learning:
[5] U. Asif, M. Bennamoun, and F. A. Sohel, ‘‘A multi-modal, discriminative A review and new perspectives,’’ IEEE Trans. Pattern Anal. Mach. Intell.,
and spatially invariant CNN for RGB-D object labeling,’’ IEEE Trans. vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
Pattern Anal. Mach. Intell., vol. 40, no. 9, pp. 2051–2065, Sep. 2018. [29] X. Liu, W. Liu, Y. Liu, Z. Wang, N. Zeng, and F. E. Alsaadi, ‘‘A survey
[6] M. Brucker et al., ‘‘Semantic labeling of indoor environments from 3D of deep neural network architectures and their applications,’’ Neurocom-
RGB maps,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2018, puting, vol. 234, pp. 11–26, Apr. 2017.
pp. 1871–1878. [30] Q. Zhang, L. T. Yang, Z. Chen, and P. Li, ‘‘A survey on deep learning for
[7] Q. Cheng, Q. Zhang, P. Fu, C. Tu, and S. Li, ‘‘A survey and analysis on big data,’’ Inf. Fusion, vol. 42, pp. 146–157, Jul. 2018.
automatic image annotation,’’ Pattern Recognit., vol. 79, pp. 242–259, [31] O. Russakovsky et al., ‘‘ImageNet large scale visual recognition
Jul. 2018. challenge,’’ Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252,
[8] D. Kouřil et al., ‘‘Labels on levels: Labeling of multi-scale multi-instance Dec. 2015.
and crowded 3D biological environments,’’ IEEE Trans. Vis. Comput. [32] S. Khan, H. Rahmani, S. A. A. Shah, and M. Bennamoun, ‘‘A guide
Graphics, vol. 25, no. 1, pp. 977–986, Jan. 2019. to convolutional neural networks for computer vision,’’ Synth. Lectures
[9] U. Asif, M. Bennamoun, and F. A. Sohel, ‘‘RGB-D object recognition and Comput. Vis., vol. 8, no. 1, pp. 1–207, 2018.
grasp detection using hierarchical cascaded forests,’’ IEEE Trans. Robot., [33] Z. Qin, F. Yu, C. Liu, and X. Chen, ‘‘How convolutional neural networks
vol. 33, no. 3, pp. 547–564, Jun. 2017. see the world—A survey of convolutional neural network visualization
[10] J. Shen and N. Gans, ‘‘Robot-to-human feedback and automatic object methods,’’ Math. Found. Comput., vol. 1, no. 2, pp. 149–180, 2018.
grasping using an RGB-D camera–projector system,’’ Robotica, vol. 36, [34] J. Gu et al., ‘‘Recent advances in convolutional neural networks,’’ Pattern
no. 2, pp. 241–260, 2018. Recognit., vol. 77, pp. 354–377, May 2018.
[11] P. Schmidt, N. Vahrenkamp, M. Wächter, and T. Asfour, ‘‘Grasping [35] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ‘‘ImageNet classification
of unknown objects using deep convolutional neural networks based with deep convolutional neural networks,’’ in Advances in Neural Infor-
on depth images,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), mation Processing Systems, F. Pereira, C. J. C. Burges, L. Bottou, and
May 2018, pp. 6831–6838. K. Q. Weinberger, Eds. New York, NY, USA: Curran Associates, 2012,
[12] I. González-Díaz, J. Benois-Pineau, J.-P. Domenger, D. Cattaert, and pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-
A. de Rugy, ‘‘Perceptually-guided deep neural networks for ego-action imagenet-classification-with-deep-convolutional-neural-networks.pdf
prediction: Object grasping,’’ Pattern Recognit., vol. 88, pp. 223–235, [36] K. Simonyan and A. Zisserman. (2014). ‘‘Very deep convolutional
Apr. 2019. networks for large-scale image recognition.’’ [Online]. Available:
[13] A. Andreopoulos and J. K. Tsotsos, ‘‘50 years of object recognition: http://arxiv.org/abs/1409.1556
Directions forward,’’ Comput. Vis. Image Understand., vol. 117, no. 8, [37] C. Szegedy et al., ‘‘Going deeper with convolutions,’’ in Proc. IEEE Conf.
pp. 827–891, 2013. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[14] S. A. A. Shah, M. Bennamoun, F. Boussaid, and L. While, ‘‘Evolution- [38] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for
ary feature learning for 3-D object recognition,’’ IEEE Access, vol. 6, image recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
pp. 2434–2444, 2018. (CVPR), Jun. 2016, pp. 770–778.
[15] G. Pasquale, C. Ciliberto, F. Odone, L. Rosasco, and L. Natale, ‘‘Are [39] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, ‘‘Densely
we done with object recognition? The iCub robot’s perspective?’’ Robot. connected convolutional networks,’’ in Proc. IEEE Conf. Comput. Vis.
Auto. Syst., vol. 112, pp. 260–281, Feb. 2019. Pattern Recognit. (CVPR), Jun. 2017, pp. 2261–2269.




[40] A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, and [63] P. G. Schyns, ‘‘Object recognition: Complexity of recognition strategies,’’
W. Burgard, ‘‘Multimodal deep learning for robust RGB-D object Current Biol., vol. 28, no. 7, pp. R313–R315, 2018.
recognition,’’ in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), [64] E. Rachmawati, I. Supriana, and M. L. Khodra, ‘‘Review of local descrip-
Sep. 2015, pp. 681–687. tor in RGB-D object recognition,’’ Telkomnika, vol. 12, no. 4, p. 1132,
[41] L. Song et al., ‘‘A deep multi-modal CNN for multi-instance multi- 2014.
label image classification,’’ IEEE Trans. Image Process., vol. 27, no. 12, [65] Y. Guo, M. Bennamoun, F. Sohel, M. Lu, and J. Wan, ‘‘3D object
pp. 6025–6038, Dec. 2018. recognition in cluttered scenes with local surface features: A survey,’’
[42] X. Song, S. Jiang, L. Herranz, and C. Chen, ‘‘Learning effective RGB- IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 11, pp. 2270–2287,
D representations for scene recognition,’’ IEEE Trans. Image Process., Nov. 2014.
vol. 28, no. 2, pp. 980–993, Feb. 2019. [66] M. Hashimoto, S. Akizuki, and S. Takei, ‘‘A survey and technology trends
[43] A. Wang, J. Lu, J. Cai, T.-J. Cham, and G. Wang, ‘‘Large-margin multi- of 3D features for object recognition,’’ IEEJ Trans. Electron., Inf. Syst.,
modal deep learning for RGB-D object recognition,’’ IEEE Trans. Multi- vol. 136, no. 8, pp. 1038–1046, 2016.
media, vol. 17, no. 11, pp. 1887–1898, Nov. 2015. [67] L. A. Alexandre, ‘‘3D descriptors for object and category recognition:
[44] S. J. Pan and Q. Yang, ‘‘A survey on transfer learning,’’ IEEE Trans. A comparative evaluation,’’ in Proc. IEEE/RSJ Int. Conf. Intell. Robots
Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010. Syst. (IROS), Oct. 2012, vol. 1, no. 3, p. 7.
[45] C. J. Debono et al., ‘‘Efficient depth-based coding,’’ in Proc. 3D Vis. [68] A. Ioannidou, E. Chatzilari, S. Nikolopoulos, and I. Kompatsiaris, ‘‘Deep
Content Creation, Coding Del. Cham, Switzerland: Springer, 2019, learning advances in computer vision with 3D data: A survey,’’ ACM
pp. 97–114. Comput. Surv., vol. 50, no. 2, pp. 1–38, Apr. 2017.
[46] Y. Cheng, X. Zhao, R. Cai, Z. Li, K. Huang, and Y. Rui, ‘‘Semi-supervised [69] H. Bay, T. Tuytelaars, and L. Van Gool, ‘‘SURF: Speeded up robust fea-
multimodal deep learning for RGB-D object recognition,’’ in Proc. 25th tures,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV). Graz, Austria: Springer,
Int. Joint Conf. Artif. Intell. (IJCAI), 2016, pp. 3345–3351. 2006, pp. 404–417.
[47] L. Sun, C. Zhao, and R. Stolkin. (2017). ‘‘Weakly-supervised [70] N. Dalal and B. Triggs, ‘‘Histograms of oriented gradients for human
DCNN for RGB-D object recognition in real-world applications detection,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR),
which lack large-scale annotated training data.’’ [Online]. Available: Jun. 2005, pp. 886–893.
https://arxiv.org/abs/1703.06370 [71] R. B. Rusu and S. Cousins, ‘‘3D is here: Point cloud library (PCL),’’ in
[48] R. S. Blum and Z. Liu, Multi-Sensor Image Fusion and its Applications. Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2011, pp. 1–4.
Boca Raton, FL, USA: CRC Press, 2005. [72] M. Körtgen, G.-J. Park, M. Novotni, and R. Klein, ‘‘3D shape matching
[49] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, ‘‘Multimodal machine learn- with 3D shape contexts,’’ in Proc. 7th Central Eur. Seminar Comput.
ing: A survey and taxonomy,’’ IEEE Trans. Pattern Anal. Mach. Intell., Graph., Budmerice, Slovakia, 2003, pp. 5–17.
vol. 41, no. 2, pp. 423–443, Feb. 2019. [73] S. Lazebnik, C. Schmid, and J. Ponce, ‘‘A sparse texture representation
[50] P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli, ‘‘Mul- using local affine regions,’’ IEEE Trans. Pattern Anal. Mach. Intell.,
timodal fusion for multimedia analysis: A survey,’’ Multimedia Syst., vol. 27, no. 8, pp. 1265–1278, Aug. 2005.
vol. 16, no. 6, pp. 345–379, 2010. [74] R. B. Rusu, N. Blodow, Z. C. Marton, and M. Beetz, ‘‘Aligning point
[51] D. Lahat, T. Adali, and C. Jutten, ‘‘Multimodal data fusion: An overview cloud views using persistent feature histograms,’’ in Proc. IEEE/RSJ Int.
of methods, challenges, and prospects,’’ Proc. IEEE, vol. 103, no. 9, Conf. Intell. Robots Syst. (IROS), Sep. 2008, pp. 3384–3391.
pp. 1449–1477, Sep. 2015. [75] B. Drost, M. Ulrich, N. Navab, and S. Ilic, ‘‘Model globally, match
[52] D. Ramachandram and G. W. Taylor, ‘‘Deep multimodal learning: A sur- locally: Efficient and robust 3D object recognition,’’ in Proc. IEEE Conf.
vey on recent advances and trends,’’ IEEE Signal Process. Mag., vol. 34, Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010, pp. 998–1005.
no. 6, pp. 96–108, Nov. 2017. [76] F. Tombari, S. Salti, and L. Di Stefano, ‘‘Unique signatures of histograms
[53] Z. Liu, E. Blasch, G. Bhatnagar, V. John, W. Wu, and R. S. Blum, for local surface description,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV).
‘‘Fusing synergistic information from multi-sensor images: An overview Heraklion, Crete: Springer, 2010, pp. 356–369.
from implementation to performance assessment,’’ Inf. Fusion, vol. 42, [77] K. Lai, L. Bo, X. Ren, and D. Fox, ‘‘A large-scale hierarchical multi-view
pp. 127–145, Jul. 2018. RGB-D object dataset,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA),
[54] H. Liu, Y. Wu, F. Sun, B. Fang, and D. Guo, ‘‘Weakly paired multimodal May 2011, pp. 1817–1824.
fusion for object recognition,’’ IEEE Trans. Autom. Sci. Eng., vol. 15, [78] L. Bo and C. Sminchisescu, ‘‘Efficient match kernel between sets of
no. 2, pp. 784–795, Apr. 2018. features for visual recognition,’’ in Proc. Adv. Neural Inf. Process. Syst.
[55] T. Zhou, K.-H. Thung, X. Zhu, and D. Shen, ‘‘Effective feature learn- (NIPS), 2009, pp. 135–143.
ing and fusion of multimodality data using stage-wise deep neural net- [79] K. Lai, L. Bo, X. Ren, and D. Fox, ‘‘Sparse distance learning for object
work for dementia diagnosis,’’ Hum. Brain Mapping, vol. 40, no. 3, recognition combining RGB and depth information,’’ in Proc. IEEE Int.
pp. 1001–1016, 2019. Conf. Robot. Autom. (ICRA), May 2011, pp. 4007–4013.
[56] T. Zhou, C. Zhang, C. Gong, H. Bhaskar, and J. Yang, ‘‘Multiview latent [80] D. Paulk, V. Metsis, C. McMurrough, and F. Makedon, ‘‘A supervised
space learning with feature redundancy minimization,’’ IEEE Trans. learning approach for fast object recognition from RGB-D data,’’ in Proc.
Cybern., to be published. 7th Int. Conf. Pervasive Technol. Rel. Assistive Environ. (PETRA), 2014,
[57] K.-H. Thung, P.-T. Yap, and D. Shen, ‘‘Multi-stage diagnosis of p. 5.
Alzheimer’s disease with incomplete multimodal data via multi-task [81] B. Browatzki, J. Fischer, B. Graf, H. H. Bülthoff, and C. Wallraven,
deep learning,’’ in Deep Learning in Medical Image Analysis and Mul- ‘‘Going into depth: Evaluating 2D and 3D cues for object classification
timodal Learning for Clinical Decision Support. Québec City, QC, on a new, large-scale object dataset,’’ in Proc. IEEE Int. Conf. Comput.
Canada: Springer, 2017, pp. 160–168. Vis. Workshops (ICCV), Nov. 2011, pp. 1189–1195.
[58] X. Zhu, K.-H. Thung, E. Adeli, Y. Zhang, and D. Shen, ‘‘Maximum mean [82] A. Bosch, A. Zisserman, and X. Munoz, ‘‘Representing shape with a
discrepancy based multiple kernel learning for incomplete multimodality spatial pyramid kernel,’’ in Proc. 6th ACM Int. Conf. Image Video Retr.
neuroimaging data,’’ in Proc. Int. Conf. Med. Image Comput. Comput.- (ICIVR), Jul. 2007, pp. 401–408.
Assist. Intervent. Québec City, QC, Canada: Springer, 2017, pp. 72–80. [83] E. Shechtman and M. Irani, ‘‘Matching local self-similarities across
[59] Y. Huang, W. Wang, and L. Wang, ‘‘Unconstrained multimodal multi- images and videos,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
label learning,’’ IEEE Trans. Multimedia, vol. 17, no. 11, pp. 1923–1935, (CVPR), Jul. 2007, pp. 1–8.
Nov. 2015. [84] M. Heczko, D. Keim, D. Saupe, and D. Vranic, ‘‘Methods for similarity
[60] J. Sanchez-Riera, K.-L. Hua, Y.-S. Hsiao, T. Lim, S. C. Hidayati, and search on 3D databases,’’ Datenbank-Spektrum, vol. 2, no. 2, pp. 54–63,
W.-H. Cheng, ‘‘A comparative study of data fusion for RGB-D based 2002.
visual recognition,’’ Pattern Recognit. Lett., vol. 73, pp. 1–6, Apr. 2016. [85] J. J. Koenderink and A. J. Van Doorn, ‘‘Surface shape and curvature
[61] P. N. Druzhkov and V. D. Kustikova, ‘‘A survey of deep learning methods scales,’’ Image Vis. Comput., vol. 10, no. 8, pp. 557–564, 1992.
and software tools for image classification and object detection,’’ Pattern [86] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin, ‘‘Shape distribu-
Recognit. Image Anal., vol. 26, no. 1, pp. 9–15, 2016. tions,’’ ACM Trans. Graph., vol. 21, no. 4, pp. 807–832, 2002.
[62] W. Rawat and Z. Wang, ‘‘Deep convolutional neural networks for image [87] Z. Liu, C. Zhao, X. Wu, and W. Chen, ‘‘An effective 3D shape descriptor
classification: A comprehensive review,’’ Neural Comput., vol. 29, no. 9, for object recognition with RGB-D sensors,’’ Sensors, vol. 17, no. 3,
pp. 2352–2449, 2017. p. 451, 2017.




[88] L. Bo, X. Ren, and D. Fox, ‘‘Kernel descriptors for visual recognition,’’ [113] J. Bai, Y. Wu, J. Zhang, and F. Chen, ‘‘Subset based deep learning for
in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2010, pp. 244–252. RGB-D object recognition,’’ Neurocomputing, vol. 165, pp. 280–292,
[89] S. Liu, S. Wang, L. Wu, and S. Jiang, ‘‘Multiple feature fusion based hand- Oct. 2015.
held object recognition with RGB-D data,’’ in Proc. Int. Conf. Internet [114] A. Wang, J. Cai, J. Lu, and T.-J. Cham, ‘‘MMSS: Multi-modal sharable
Multimedia Comput. Service (ICIMCS), 2014, p. 303. and specific feature learning for RGB-D object recognition,’’ in Proc.
[90] Y. Gu, J. Chanussot, X. Jia, and J. A. Benediktsson, ‘‘Multiple Kernel IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1125–1133.
learning for hyperspectral image classification: A review,’’ IEEE Trans. [115] K. Nguyen, C. Fookes, A. Ross, and S. Sridharan, ‘‘Iris recognition with
Geosci. Remote Sens., vol. 55, no. 11, pp. 6547–6565, Nov. 2017. off-the-shelf CNN features: A deep learning perspective,’’ IEEE Access,
[91] S. S. Bucak, R. Jin, and A. K. Jain, ‘‘Multiple kernel learning for visual vol. 6, pp. 18848–18855, 2018.
object recognition: A review,’’ IEEE Trans. Pattern Anal. Mach. Intell., [116] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, ‘‘CNN features
vol. 36, no. 7, pp. 1354–1369, Jul. 2014. off-the-shelf: An astounding baseline for recognition,’’ in Proc. IEEE
[92] L. Bo, K. Lai, X. Ren, and D. Fox, ‘‘Object recognition with hierarchical Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014, pp. 806–813.
kernel descriptors,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. [117] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, ‘‘How transferable are
(CVPR), Jun. 2011, pp. 1729–1736. features in deep neural networks?’’ in Proc. Adv. Neural Inf. Process. Syst.
[93] R. Rostami, F. S. Bashiri, B. Rostami, and Z. Yu, ‘‘A survey on data- (NIPS), 2014, pp. 3320–3328.
driven 3D shape descriptors,’’ Comput. Graph. Forum, vol. 38, no. 1, [118] F. Radenović, G. Tolias, and O. Chum, ‘‘Fine-tuning CNN image retrieval
pp. 356–393, 2018. with no human annotation,’’ IEEE Trans. Pattern Anal. Mach. Intell., to
[94] I. K. Kazmi, L. You, and J. J. Zhang, ‘‘A survey of 2D and 3D shape be published.
descriptors,’’ in Proc. 10th Int. Conf. Comput. Graph., Imag. Vis. (CGIV), [119] M. Schwarz, H. Schulz, and S. Behnke, ‘‘RGB-D object recognition
2013, pp. 1–10. and pose estimation based on pre-trained convolutional neural network
[95] K.-T. Yu, S.-H. Tseng, and L.-C. Fu, ‘‘Learning hierarchical representa- features,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2015,
tion with sparsity for RGB-D object recognition,’’ in Proc. IEEE/RSJ Int. pp. 1329–1335.
Conf. Intell. Robots Syst. (IROS), Oct. 2012, pp. 3011–3016. [120] H. F. Zaki, F. Shafait, and A. Mian, ‘‘Convolutional hypercube pyramid
[96] R. B. Rusu, N. Blodow, and M. Beetz, ‘‘Fast point feature histograms for accurate RGB-D object category and instance recognition,’’ in Proc.
(FPFH) for 3D registration,’’ in Proc. IEEE Int. Conf. Robot. Autom. IEEE Int. Conf. Robot. Autom. (ICRA), May 2016, pp. 1685–1692.
(ICRA), May 2009, pp. 3212–3217. [121] A. Aakerberg, K. Nasrollahi, and T. Heder, ‘‘Depth value pre-processing
[97] C. Redondo-Cabrera, R. J. López-Sastre, J. Acevedo-Rodriguez, and for accurate transfer learning based RGB-D object recognition,’’ in Proc.
S. Maldonado-Bascón, ‘‘SURFing the point clouds: Selective 3D spatial Int. Joint Conf. Comput. Intell. (IJCCI), 2017, pp. 121–128.
pyramids for category-level object recognition,’’ in Proc. IEEE Conf. [122] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, ‘‘Learning rich features
Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 3458–3465. from RGB-D images for object detection and segmentation,’’ in Proc.
[98] J. Knopp, M. Prasad, G. Willems, R. Timofte, and L. Van Gool, ‘‘Hough Eur. Conf. Comput. Vis. (ECCV). Zurich, Switzerland: Springer, 2014,
transform and 3D surf for robust three dimensional classification,’’ in pp. 345–360.
Proc. Eur. Conf. Comput. Vis. (ECCV). Heraklion, Crete: Springer, 2010,
[123] M. M. Rahman, Y. Tan, J. Xue, and K. Lu, ‘‘RGB-D object recognition
pp. 589–602.
with multimodal deep convolutional neural networks,’’ in Proc. IEEE Int.
[99] K. B. Logoglu, S. Kalkan, and A. Temizel, ‘‘CoSPAIR: Colored his-
Conf. Multimedia Expo (ICME), Jul. 2017, pp. 991–996.
tograms of spatial concentric surflet-pairs for 3D object recognition,’’
[124] A. Aakerberg, K. Nasrollahi, and T. Heder, ‘‘Improving a deep learning
Robot. Auton. Syst., vol. 75, pp. 558–570, Jan. 2016.
based RGB-D object recognition model by ensemble learning,’’ in Proc.
[100] G. James, ‘‘Majority vote classifiers: Theory and applications,’’ Ph.D.
7th Int. Conf. Image Process. Theory, Tools Appl. (IPTA), 2017, pp. 1–6.
dissertation, Dept. Statist., Stanford Univ., Stanford, CA, USA, 1998.
[125] S. Iizuka, E. Simo-Serra, and H. Ishikawa, ‘‘Let there be color!: Joint end-
[101] M. Aharon, M. Elad, and A. Bruckstein, ‘‘SVD: An algorithm for design-
to-end learning of global and local image priors for automatic image col-
ing overcomplete dictionaries for sparse representation,’’ IEEE Trans.
orization with simultaneous classification,’’ ACM Trans. Graph., vol. 35,
Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.
no. 4, p. 110, 2016.
[102] L. Bo, X. Ren, and D. Fox, ‘‘Unsupervised feature learning for RGB-D
[126] G. Larsson, M. Maire, and G. Shakhnarovich, ‘‘Learning representations
based object recognition,’’ in Experimental Robotics. Cham, Switzerland:
for automatic colorization,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV).
Springer, 2013, pp. 387–402.
Amsterdam, The Netherlands: Springer, 2016, pp. 577–593.
[103] B. Chen, J. Yang, B. Jeon, and X. Zhang, ‘‘Kernel quaternion principal
component analysis and its application in RGB-D object recognition,’’ [127] Z. Cheng, Q. Yang, and B. Sheng, ‘‘Deep colorization,’’ in Proc. IEEE
Neurocomputing, vol. 266, pp. 293–303, Nov. 2017. Int. Conf. Comput. Vis. (ICCV), Jun. 2015, pp. 415–423.
[104] Y. Sun, S. Chen, and B. Yin, ‘‘Color face recognition based on quater- [128] F. M. Carlucci, P. Russo, and B. Caputo, ‘‘(DE)2 CO: Deep depth coloriza-
nion matrix representation,’’ Pattern Recognit. Lett., vol. 32, no. 4, tion,’’ IEEE Robot. Autom. Lett., vol. 3, no. 3, pp. 2386–2393, 2018.
pp. 597–605, 2011. [129] F. Zhou, Y. Hu, and X. Shen, ‘‘MSANet: Multimodal self-augmentation
[105] B. Chen, H. Shu, G. Coatrieux, G. Chen, X. Sun, and J. L. Coatrieux, and adversarial network for RGB-D object recognition,’’ Vis. Comput.,
‘‘Color image analysis by quaternion-type moments,’’ J. Math. Imag. Vis., pp. 1–12, May 2018.
vol. 51, no. 1, pp. 124–144, 2015. [130] Z. Wang et al. (2016). ‘‘Correlated and individual multi-modal
[106] S. Sun, N. An, X. Zhao, and M. Tan, ‘‘A PCA–CCA network for RGB- deep learning for RGB-D object recognition.’’ [Online]. Available:
D object recognition,’’ Int. J. Adv. Robot. Syst., vol. 15, no. 1, pp. 1–12, https://arxiv.org/abs/1604.01655
2018. [131] M. R. Loghmani, M. Planamente, B. Caputo, and M. Vincze. (2018).
[107] F. Perronnin and C. Dance, ‘‘Fisher kernels on visual vocabularies for ‘‘Recurrent convolutional fusion for RGB-D object recognition.’’
image categorization,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recog- [Online]. Available: https://arxiv.org/abs/1806.01673
nit. (CVPR), Jun. 2007, pp. 1–8. [132] J. Wang, J. Lu, W. Chen, and X. Wu, ‘‘Convolutional neural network for
[108] C. Li, A. Reiter, and G. D. Hager, ‘‘Beyond spatial pooling: Fine-grained 3D object recognition based on RGB-D dataset,’’ in Proc. Ind. Electron.
representation learning in multiple domains,’’ in Proc. IEEE Conf. Com- Appl. (ICIEA), 2015, pp. 34–39.
put. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 4913–4922. [133] H. F. Zaki, F. Shafait, and A. Mian, ‘‘Learning a deeply supervised
[109] Y. Cheng et al., ‘‘Query adaptive similarity measure for RGB-D object multi-modal RGB-D embedding for semantic scene and object category
recognition,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, recognition,’’ Robot. Auto. Syst., vol. 92, pp. 41–52, Jun. 2017.
pp. 145–153. [134] M. D. Zeiler and R. Fergus, ‘‘Visualizing and understanding convo-
[110] A. Coates, A. Ng, and H. Lee, ‘‘An analysis of single-layer networks lutional networks,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV). Zurich,
in unsupervised feature learning,’’ in Proc. 14th Int. Conf. Artif. Intell. Switzerland: Springer, 2014, pp. 818–833.
Statist. (AISTATS), Jun. 2011, pp. 215–223. [135] P. Agrawal, R. Girshick, and J. Malik, ‘‘Analyzing the performance of
[111] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep Learning, multilayer neural networks for object recognition,’’ in Proc. Eur. Conf.
vol. 1. Cambridge, MA, USA: MIT Press, 2016. Comput. Vis. (ECCV). Zurich, Switzerland: Springer, 2014, pp. 329–344.
[112] H. F. Zaki, F. Shafait, and A. Mian, ‘‘Localized deep extreme learning [136] Y. Jia et al., ‘‘Caffe: Convolutional architecture for fast feature
machines for efficient RGB-D object recognition,’’ in Proc. Int. Conf. embedding,’’ in Proc. 22nd ACM Int. Conf. Multimedia (ICM), 2014,
Digit. Image Comput., Techn. Appl. (DICTA), 2015, pp. 1–8. pp. 675–678.




[137] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, ‘‘Canonical correlation analysis: An overview with application to learning methods,’’ Neural Comput., vol. 16, no. 12, pp. 2639–2664, 2004.
[138] N. M. Correa, T. Adali, Y.-O. Li, and V. D. Calhoun, ‘‘Canonical correlation analysis for data fusion and group inferences,’’ IEEE Signal Process. Mag., vol. 27, no. 4, pp. 39–50, Jun. 2010.
[139] Y. Jiao, Y. Zhang, Y. Wang, B. Wang, J. Jin, and X. Wang, ‘‘A novel multilayer correlation maximization model for improving CCA-based frequency recognition in SSVEP brain–computer interface,’’ Int. J. Neural Syst., vol. 28, no. 4, p. 1750039, 2018.
[140] Y. Zhang et al., ‘‘Correlated component analysis for enhancing the performance of SSVEP-based brain–computer interface,’’ IEEE Trans. Neural Syst. Rehabil. Eng., vol. 26, no. 5, pp. 948–956, May 2018.
[141] Q. V. Le, A. Karpenko, J. Ngiam, and A. Y. Ng, ‘‘ICA with reconstruction cost for efficient overcomplete feature learning,’’ in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2011, pp. 1017–1025.
[142] L. Jin, Z. Li, X. Shu, S. Gao, and J. Tang, ‘‘Partially common-semantic pursuit for RGB-D object recognition,’’ in Proc. 23rd ACM Int. Conf. Multimedia (ICM), 2015, pp. 959–962.
[143] Y. Guo, Y. Liu, S. Lao, E. M. Bakker, L. Bai, and M. S. Lew, ‘‘Bag of surrogate parts feature for visual recognition,’’ IEEE Trans. Multimedia, vol. 20, no. 6, pp. 1525–1536, Jun. 2018.
[144] L. Liu, C. Shen, and A. van den Hengel, ‘‘The treasure beneath convolutional layers: Cross-convolutional-layer pooling for image classification,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 4749–4757.
[145] N. Liu, L. Wan, Y. Zhang, T. Zhou, H. Huo, and T. Fang, ‘‘Exploiting convolutional neural networks with deeply local description for remote sensing image classification,’’ IEEE Access, vol. 6, pp. 11215–11228, 2018.
[146] R. Socher, B. Huval, B. Bath, C. D. Manning, and A. Y. Ng, ‘‘Convolutional-recursive deep learning for 3D object classification,’’ in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 656–664.
[147] K. Jarrett, K. Kavukcuoglu, M. A. Ranzato, and Y. LeCun, ‘‘What is the best multi-stage architecture for object recognition?’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), May 2009, pp. 2146–2153.
[148] R. Socher, C. C. Lin, C. Manning, and A. Y. Ng, ‘‘Parsing natural scenes and natural language with recursive neural networks,’’ in Proc. Int. Conf. Mach. Learn. (ICML), 2011, pp. 129–136.
[149] Y. Yin, H. Li, and X. Wen, ‘‘Multi-model convolutional extreme learning machine with kernel for RGB-D object recognition,’’ Proc. SPIE, vol. 10605, Nov. 2017, Art. no. 106051Z.
[150] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, ‘‘Extreme learning machine for regression and multiclass classification,’’ IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 2, pp. 513–529, Apr. 2012.
[151] Y. Zhang et al., ‘‘Multi-kernel extreme learning machine for EEG classification in brain-computer interfaces,’’ Expert Syst. Appl., vol. 96, pp. 302–310, Apr. 2018.
[152] H. Liu, F. Li, X. Xu, and F. Sun, ‘‘Multi-modal local receptive field extreme learning machine for object recognition,’’ Neurocomputing, vol. 277, pp. 4–11, Feb. 2018.
[153] G.-B. Huang, Z. Bai, L. L. C. Kasun, and C. M. Vong, ‘‘Local receptive fields based extreme learning machine,’’ IEEE Comput. Intell. Mag., vol. 10, no. 2, pp. 18–29, May 2015.
[154] F. M. Carlucci, P. Russo, and B. Caputo, ‘‘A deep representation for depth images from synthetic data,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May/Jun. 2017, pp. 1362–1369.
[155] F. Orabona, L. Jie, and B. Caputo, ‘‘Online-batch strongly convex multi kernel learning,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010, pp. 787–794.
[156] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. (2014). ‘‘Return of the devil in the details: Delving deep into convolutional nets.’’ [Online]. Available: https://arxiv.org/abs/1405.3531
[157] J. B. Tenenbaum and W. T. Freeman, ‘‘Separating style and content with bilinear models,’’ Neural Comput., vol. 12, no. 6, pp. 1247–1283, Jun. 2000.
[158] C. G. M. Snoek, M. Worring, and A. W. Smeulders, ‘‘Early versus late fusion in semantic video analysis,’’ in Proc. 13th Annu. ACM Int. Conf. Multimedia (ICM), 2005, pp. 399–402.
[159] X. Zhu, ‘‘Semi-supervised learning literature survey,’’ Comput. Sci., Univ. Wisconsin-Madison, vol. 2, no. 3, p. 4, 2006.
[160] D. Wu et al., ‘‘Self-training semi-supervised classification based on density peaks of data,’’ Neurocomputing, vol. 275, pp. 180–191, Jan. 2018.
[161] Y. He and D. Zhou, ‘‘Self-training from labeled features for sentiment analysis,’’ Inf. Process. Manage., vol. 47, no. 4, pp. 606–616, 2011.
[162] M.-F. Balcan, A. Blum, and K. Yang, ‘‘Co-training and expansion: Towards bridging theory and practice,’’ in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2005, pp. 89–96.
[163] J. Wu, L. Li, and W. Y. Wang. (2018). ‘‘Reinforced co-training.’’ [Online]. Available: https://arxiv.org/abs/1804.06035
[164] H. Padhiyar and P. Rekh, ‘‘An improved expectation maximization based semi-supervised email classification using Naive Bayes and k-nearest neighbor,’’ Int. J. Comput. Appl., vol. 101, no. 6, pp. 7–11, 2014.
[165] B. Nigam, P. Ahirwal, S. Salve, and S. Vamney. (2011). ‘‘Document classification using expectation maximization with semi supervised learning.’’ [Online]. Available: https://arxiv.org/abs/1112.2028
[166] R. Sheikhpour, E. Sheikhpour, and M. A. Sarram, ‘‘Semi-supervised sparse feature selection via graph Laplacian based scatter matrix for regression problems,’’ Inf. Sci., vol. 468, pp. 14–28, Nov. 2018.
[167] E. Faerman, F. Borutta, J. Busch, and M. Schubert. (2018). ‘‘Semi-supervised learning on graphs based on local label distributions.’’ [Online]. Available: https://arxiv.org/abs/1802.05563
[168] A. Blum and T. Mitchell, ‘‘Combining labeled and unlabeled data with co-training,’’ in Proc. 11th Annu. Conf. Comput. Learn. Theory (COLT), 1998, pp. 92–100.
[169] Y. Cheng, X. Zhao, K. Huang, and T. Tan, ‘‘Semi-supervised learning for RGB-D object recognition,’’ in Proc. Int. Conf. Pattern Recognit. (ICPR), Aug. 2014, pp. 2377–2382.
[170] Y. Cheng, X. Zhao, K. Huang, and T. Tan, ‘‘Semi-supervised learning and feature evaluation for RGB-D object recognition,’’ Comput. Vis. Image Understand., vol. 139, pp. 149–160, Oct. 2015.
[171] S. Lazebnik, C. Schmid, and J. Ponce, ‘‘Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2006, pp. 2169–2178.
[172] Y. Yin and H. Li, ‘‘Multi-view CSPMPR-ELM feature learning and classifying for RGB-D object recognition,’’ Cluster Comput.-J. Netw. Softw. Tools Appl., 2018, pp. 1–11. doi: 10.1007/s10586-018-1695-0.
[173] Z.-H. Zhou, ‘‘A brief introduction to weakly supervised learning,’’ Nat. Sci. Rev., vol. 5, no. 1, pp. 44–53, 2017.
[174] Z. Wu et al., ‘‘3D ShapeNets: A deep representation for volumetric shapes,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1912–1920.
[175] C. E. Rasmussen, ‘‘Gaussian processes in machine learning,’’ in Summer School on Machine Learning (Advanced Lectures on Machine Learning). Berlin, Germany: Springer, 2004, pp. 63–71.
[176] M. Firman, ‘‘RGBD datasets: Past, present and future,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 19–31.
[177] Y. Guo, J. Zhang, M. Lu, J. Wan, and Y. Ma, ‘‘Benchmark datasets for 3D computer vision,’’ in Proc. IEEE 9th Conf. Ind. Electron. Appl. (ICIEA), Jun. 2014, pp. 1846–1851.
[178] S. Song, S. P. Lichtenberg, and J. Xiao, ‘‘SUN RGB-D: A RGB-D scene understanding benchmark suite,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 567–576.
[179] N. Silberman and R. Fergus, ‘‘Indoor scene segmentation using a structured light sensor,’’ in Proc. IEEE Int. Conf. Comput. Vis. Workshops (ICCV), Nov. 2011, pp. 601–608.
[180] S. Gupta, P. Arbeláez, and J. Malik, ‘‘Perceptual organization and recognition of indoor scenes from RGB-D images,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2013, pp. 564–571.
[181] H. S. Koppula, A. Anand, T. Joachims, and A. Saxena, ‘‘Semantic labeling of 3D point clouds for indoor scenes,’’ in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2011, pp. 244–252.
[182] X. Lv, S.-Q. Jiang, L. Herranz, and S. Wang, ‘‘RGB-D hand-held object recognition based on heterogeneous feature fusion,’’ J. Comput. Sci. Technol., vol. 30, no. 2, pp. 340–352, 2015.
[183] X. Lv, X. Liu, X. Li, X. Li, S. Jiang, and Z. He, ‘‘Modality-specific and hierarchical feature learning for RGB-D hand-held object recognition,’’ Multimedia Tools Appl., vol. 76, no. 3, pp. 4273–4290, 2017.
[184] X. Lv, S. Jiang, L. Herranz, and S. Wang, ‘‘Hand-object sense: A hand-held object recognition system based on RGB-D information,’’ in Proc. 23rd ACM Int. Conf. Multimedia (ICM), 2015, pp. 765–766.
[185] A. Singh, J. Sha, K. S. Narayan, T. Achim, and P. Abbeel, ‘‘BigBIRD: A large-scale 3D database of object instances,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May/Jun. 2014, pp. 509–516.
[186] C. Li, J. Bohren, E. Carlson, and G. D. Hager, ‘‘Hierarchical semantic parsing for object pose estimation in densely cluttered scenes,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2016, pp. 5068–5075.
[187] U. Asif, M. Bennamoun, and F. Sohel, ‘‘Discriminative feature learning for efficient RGB-D object recognition,’’ in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Sep./Oct. 2015, pp. 272–279.
[188] L. Fei-Fei, R. Fergus, and P. Perona, ‘‘One-shot learning of object categories,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 4, pp. 594–611, Apr. 2006.
[189] Y. Fu, T. Xiang, Y.-G. Jiang, X. Xue, L. Sigal, and S. Gong, ‘‘Recent advances in zero-shot recognition: Toward data-efficient understanding of visual content,’’ IEEE Signal Process. Mag., vol. 35, no. 1, pp. 112–125, Jan. 2018.
[190] T. Long, X. Xu, F. Shen, L. Liu, N. Xie, and Y. Yang, ‘‘Zero-shot learning via discriminative representation extraction,’’ Pattern Recognit. Lett., vol. 109, pp. 27–34, Jul. 2018.
[191] P. Mettes and C. G. M. Snoek, ‘‘Spatial-aware object embeddings for zero-shot localization and classification of actions,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Jul. 2017, pp. 4453–4462.
[192] L. Zhang et al., ‘‘BoMW: Bag of manifold words for one-shot learning gesture recognition from Kinect,’’ IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 10, pp. 2562–2573, Oct. 2018.
[193] Y. Shen, P. Luo, J. Yan, X. Wang, and X. Tang, ‘‘FaceID-GAN: Learning a symmetry three-player GAN for identity-preserving face synthesis,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2018, pp. 821–830.
[194] J. Choe, S. Park, K. Kim, J. H. Park, D. Kim, and H. Shim, ‘‘Face generation for low-shot learning using generative adversarial networks,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1940–1948.
[195] Z. Yi, H. R. Zhang, P. Tan, and M. Gong, ‘‘DualGAN: Unsupervised dual learning for image-to-image translation,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Jun. 2017, pp. 2868–2876.
[196] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. (2016). ‘‘Generative adversarial text to image synthesis.’’ [Online]. Available: https://arxiv.org/abs/1605.05396
[197] L. Tran, X. Yin, and X. Liu, ‘‘Disentangled representation learning GAN for pose-invariant face recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1415–1424.
[198] C. Couprie, C. Farabet, L. Najman, and Y. LeCun. (2013). ‘‘Indoor semantic segmentation using depth information.’’ [Online]. Available: https://arxiv.org/abs/1301.3572
[199] S. Zia, B. Yüksel, D. Yüret, and Y. Yemez, ‘‘RGB-D object recognition using deep convolutional neural networks,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 887–894.
[200] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser, ‘‘Semantic scene completion from a single depth image,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2017, pp. 190–198.
[201] D. Maturana and S. Scherer, ‘‘VoxNet: A 3D convolutional neural network for real-time object recognition,’’ in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Sep./Oct. 2015, pp. 922–928.
[202] K. He and J. Sun, ‘‘Convolutional neural networks at constrained time cost,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 5353–5360.
[203] H. Li, R. Zhao, and X. Wang. (2014). ‘‘Highly efficient forward and backward propagation of convolutional neural networks for pixelwise classification.’’ [Online]. Available: https://arxiv.org/abs/1412.4526
[204] K. Bong, S. Choi, C. Kim, D. Han, and H.-J. Yoo, ‘‘A low-power convolutional neural network face recognition processor and a CIS integrated with always-on face detector,’’ IEEE J. Solid-State Circuits, vol. 53, no. 1, pp. 115–123, Jan. 2018.
[205] O. Krestinskaya, K. N. Salama, and A. P. James, ‘‘Learning in memristive neural network architectures using analog backpropagation circuits,’’ IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 2, pp. 719–732, Feb. 2019.
[206] Q.-S. Zhang and S.-C. Zhu, ‘‘Visual interpretability for deep learning: A survey,’’ Frontiers Inf. Technol. Electron. Eng., vol. 19, no. 1, pp. 27–39, 2018.
[207] C. Olah, A. Mordvintsev, and L. Schubert, ‘‘Feature visualization,’’ Distill, vol. 2, no. 11, p. e7, 2017.
[208] A. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski, ‘‘Plug & play generative networks: Conditional iterative generation of images in latent space,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2017, pp. 4467–4477.
[209] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. (2014). ‘‘Object detectors emerge in deep scene CNNs.’’ [Online]. Available: https://arxiv.org/abs/1412.6856
[210] C. Szegedy et al. (2013). ‘‘Intriguing properties of neural networks.’’ [Online]. Available: https://arxiv.org/abs/1312.6199
[211] R. C. Fong and A. Vedaldi, ‘‘Interpretable explanations of black boxes by meaningful perturbation,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Jun. 2017, pp. 3429–3437.
[212] H. Lakkaraju, E. Kamar, R. Caruana, and E. Horvitz, ‘‘Identifying unknown unknowns in the open world: Representations and policies for guided exploration,’’ in Proc. 31st AAAI Conf. Artif. Intell., 2017, pp. 1–9.
[213] Q. Zhang, W. Wang, and S.-C. Zhu, ‘‘Examining CNN representations with respect to dataset bias,’’ in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 1–10.
[214] Q. Zhang, R. Cao, F. Shi, Y. N. Wu, and S.-C. Zhu, ‘‘Interpreting CNN knowledge via an explanatory graph,’’ in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 1–10.
[215] Q. Zhang, R. Cao, Y. N. Wu, and S.-C. Zhu, ‘‘Growing interpretable part graphs on ConvNets via multi-shot learning,’’ in Proc. 31st AAAI Conf. Artif. Intell., 2017, pp. 1–9.
[216] Q. Zhang, Y. N. Wu, and S.-C. Zhu, ‘‘Interpretable convolutional neural networks,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 8827–8836.
[217] T. Wu, W. Sun, X. Li, X. Song, and B. Li. (2017). ‘‘Towards interpretable R-CNN by unfolding latent structures.’’ [Online]. Available: https://arxiv.org/abs/1711.05226

MINGLIANG GAO received the Ph.D. degree in communication and information systems from Sichuan University, Chengdu, China, in 2013. From 2018 to 2019, he was a Visiting Scholar with the School of Engineering, The University of British Columbia, Okanagan. He is currently an Associate Professor with the School of Electrical and Electronic Engineering, Shandong University of Technology, Zibo, China. His main research interests include image fusion and deep learning.

JUN JIANG received the Ph.D. degree in communication and information systems from Sichuan University, Chengdu, China, in 2015. He is currently an Associate Professor with the School of Computer Science, Southwest Petroleum University, Chengdu. His main research interests include machine vision and deep learning.

GUOFENG ZOU received the Ph.D. degree in pattern recognition and intelligent system from the College of Automation, Harbin Engineering University, Harbin, in 2013. He is currently a Lecturer with the College of Electrical and Electronic Engineering, Shandong University of Technology, Zibo, China. His current research interests include pattern recognition, digital image processing and analysis, and machine learning.
VIJAY JOHN received the M.S. degree in information technology and robotics from Carnegie Mellon University, Pittsburgh, PA, USA, in 2005, and the Ph.D. degree in computer vision from the University of Dundee, Dundee, U.K., in 2007. In 2011, he joined the University of Amsterdam, Amsterdam, The Netherlands, as a Postdoctoral Researcher. In 2012, he joined Philips Research, Eindhoven, The Netherlands, as a Visiting Researcher. Since 2014, he has been a Postdoctoral Research Fellow with the Toyota Technological Institute, Nagoya, Japan. His research interests include computer vision, machine learning with applications in autonomous vehicles, and human motion analysis.

ZHENG LIU (SM'02) received the Ph.D. degree in engineering from Kyoto University, Kyoto, Japan, in 2000, and the Ph.D. degree from the University of Ottawa, Canada, in 2007. From 2000 to 2001, he was a Research Fellow with Nanyang Technological University, Singapore. Then, he joined the Institute for Aerospace Research (IAR), National Research Council Canada, Ottawa, ON, Canada, as a Governmental Laboratory Visiting Fellow nominated by NSERC, for five years. He was a Research Officer with the NRC Institute for Research in Construction. From 2012 to 2015, he was a Full Professor with the Toyota Technological Institute, Nagoya, Japan. He is currently with the School of Engineering, The University of British Columbia, Okanagan. His research interests include image/data fusion, computer vision, pattern recognition, sensor/sensor networks, condition-based maintenance, and non-destructive inspection and evaluation. He is a member of SPIE. He is the Chair of the IEEE IMS Technical Committee on Industrial Inspection (TC-36). He holds a Professional Engineer License in British Columbia and Ontario. He serves on the editorial board for the IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, IEEE Instrumentation and Measurement Magazine, the IEEE JOURNAL OF RFID, Information Fusion, Machine Vision and Applications, and Intelligent Industrial Systems.