Professional Documents
Culture Documents
ABSTRACT Object recognition in real-world environments is one of the fundamental and key tasks in
computer vision and robotics communities. With the advanced sensing technologies and low-cost depth
sensors, the high-quality RGB and depth images can be recorded synchronously, and the object recognition
performance can be improved by jointly exploiting them. RGB-D-based object recognition has evolved from
early methods that using hand-crafted representations to the current state-of-the-art deep learning-based
methods. With the undeniable success of deep learning, especially convolutional neural networks (CNNs)
in the visual domain, the natural progression of deep learning research points to problems involving
larger and more complex multimodal data. In this paper, we provide a comprehensive survey of recent
multimodal CNNs (MMCNNs)-based approaches that have demonstrated significant improvements over
previous methods. We highlight two key issues, namely, training data deficiency and multimodal fusion.
In addition, we summarize and discuss the publicly available RGB-D object recognition datasets and present
a comparative performance evaluation of the proposed methods on these benchmark datasets. Finally,
we identify promising avenues of research in this rapidly evolving field. This survey will not only enable
researchers to get a good overview of the state-of-the-art methods for RGB-D-based object recognition but
also provide a reference for other multimodal machine learning applications, e.g., multimodal medical image
fusion, audio-visual speech recognition, and multimedia retrieval and generation.
INDEX TERMS Convolutional neural network, multimodal fusion, object recognition, RGB-D, survey.
2169-3536
2019 IEEE. Translations and content mining are permitted for academic research only.
43110 Personal use is also permitted, but republication/redistribution requires IEEE permission. VOLUME 7, 2019
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
M. Gao et al.: RGB-D-Based Object Recognition Using MMCNNs: A Survey
Although RGB-based object recognition has been stud- A variety of methods have been proposed for RGB-D
ied extensively and there are many surveys or reviews in object representation and they can be mainly classified into
this domain [61]–[63], there are relatively fewer surveys on three categories: hand-crafted feature based methods, tradi-
RGB-D-based object recognition, especially based on CNNs. tional feature learning based methods and MMCNNs based
The surveys most closely associated to this work are those methods.
presented in [64]–[68]. These surveys covered related themes
but came with certain limitations. The surveys in [64]–[67] A. HAND-CRAFTED FEATURE BASED METHODS
mainly focused on hand-craft feature representations. For Earlier work utilized hand engineered features for both RGB
example, [64], [65] centered on local descriptors for 3D and depth modalities. These features are fused and fed
object recognition. The work in [66], [67] surveyed 3D into classifiers along with their labels for the recognition
features especially keypoint-based methods from 3D point task [65]. The features extracted from the RGB modality,
clouds for RGB-D object recognition. Ioannidou et al. [68] e.g. Scale Invariant Feature Transform (SIFT) [23], [25],
presented a survey on deep learning with 3D data in computer Speeded Up Robust Features (SURF) [69], Histograms of
vision. However, the focus is not specific to RGB-D object Oriented Gradients(HOG) [70], have been studied in detail
recognition and more details are needed. and widely used in the RGB-based object recognition [16].
To the best of our knowledge, the present survey is the The features extracted from the depth modality are mainly
first comprehensive study on MMCNNs based RGB-D object 3D shape descriptors over 3D point cloud [71] such as spin
recognition. The large body of work on CNNs based RGB-D images [24], 3D Shape Context(3DSC) [72], Rotation Invari-
object recognition besides the lack of such an extensive ant Feature Transform (RIFT) [73], Point Feature Histograms
study on state-of-the-art techniques prompts us to review the (PFH) [74], Point Pair Feature (PPF) [75], Signature of His-
recent advances in this domain. This survey is developed in a tograms of Orientations(SHOT) [76], etc.
problem-oriented pattern and provides an insightful analysis Lai et al. [77] introduced a well-known Washington
on the solutions to training data deficiency and multimodal RGB-D dataset and made a baseline for RGB-D object recog-
fusion strategies in MMCNNs for RGB-D object recognition. nition. In their work, they extracted many hand-crafted fea-
To this end, the proposed methods are broadly categorized tures, e.g., SIFT, color histograms, texton histogram, mean
into supervised and semi-supervised schemes considering the and standard deviation of color channels from RGB modality,
label of the training data. Furthermore, we analysis the multi- and spin image, 3D bounding box from the depth modality.
modal fusion strategies and categorize them into early fusion These features are combined with efficient match kernel
and late fusion considering the fusion level. Meanwhile, (EMK) [78]. Then, the fused features are reduced using prin-
the influence of different depth encoding methods, different cipal component analysis (PCA) and fed to three classifiers,
CNN models and different multimodal fusion strategies on i.e., linear support vector machine (LinSVM), Gaussian ker-
the performance are analyzed. Compared to the former sur- nel SVM (kSVM) and random forest (RF) for performance
veys, this survey covers the most recent and advanced work comparison. Experiments showed that the proposed features
on MMCNNs based RGB-D object recognition and provides with kSVM performed best. On this basis, Lai et al. [79]
the readers with the state-of-the-art methods. proposed a sparse instance distance learning (IDL) with reg-
Rest of the paper is organized as follows: Section II ularization to provide a distance measure for evaluating the
briefly reviews the hand-crafted feature based methods and similarity of all views of a particular object. Comparative
the traditional feature learning based methods in the early results showed that the IDL-based method outperformed the
RGB-D based object recognition. Meanwhile, the MMCNN exemplar-based local distance method [77]. Paulk et al. [80]
based methods are introduced briefly. Section III presents a extracted nine color-based features from RGB channel, three
detailed survey on MMCNNs based methods through super- geometric features and one volumetric feature from depth
vised learning. Section IV reviews the MMCNNs based channel. The multimodal features were concatenated directly
methods through semi-supervised learning. In Section V, and fed to AdaBoost, SVMs and ANN for comparative
recently published RGB-D object recognition datasets are studies. Browatzki et al. [81] extracted four 2D descriptors
introduced along with the comparative results and discussion. from the RGB images, namely, SURF [69], pyramids of
We discuss several promising avenues for achieving further histograms of oriented gradients (PHOG) [82], self-similarity
progress in Section VI. Finally, concluding remarks are made feature (SSF) [83] and CIELAB color histograms, and four
in Section VII. 3D descriptors from the depth images, namely, 3DSC [72],
depth buffer [84] , shape-index histograms (SIH) [85] and
II. RGB-D BASED OBJECT RECOGNITION shape distributions [86]. Then a bag-of-words (BoWs) rep-
There are two main procedures in RGB-D object recognition, resentation was utilized to transform the local features into
namely object feature representation and object classifica- a global descriptor and multiple SVMs learnt on separate
tion. Compared with object classification, object feature rep- features were fused through multilayer perception for classi-
resentation affects the performance of the object recognition fication. Liu et al. [87] calculated the features of compact-
system significantly because real-world objects usually suffer ness, symmetry, global convexity, uniqueness, smoothness
from large intra-class discrepancy and inter-class affinity. from the contour-based images (2D) and point cloud
data (3D), respectively. Then the five 2D shape features and descriptors [98] were extracted and quantized to 3D visual
five 3D shape features were concatenated and fed into SVM words by K-means. Then a LinSVM was utilized for
for classification. object recognition. Logoglu et al. [99] proposed two spa-
Kernel descriptors [88]–[91] learn patch level or image tially enhanced local 3D descriptors for object recognition
level features based on kernels and they were successfully tasks, namely, histograms of spatial concentric surflet-pairs
used in RGB-D object recognition. Bo et al. [88] presented a (SPAIR) and colored SPAIR (CoSPAIR). Then the query
kernel descriptor method which combines three RGB kernels descriptors were matched and a majority vote classifier [100]
(color, gradient and LBP) and five depth kernel (size kernels, was utilized for classification. Experiments demonstrated that
kernel PCA, spin kernels, gradient kernels and LBP kernels) the introduced CoSPAIR descriptor outperformed the state-
to capture different recognition cues. They used pyramid of-the-art descriptors in both category-level and instance-
EMK [78] to aggregate local kernel descriptors into object- level recognition tasks.
level features and a LinSVM was utilized for recognition. A summary of hand-craft feature representations combing
Based on this, Bo et al. [92] proposed hierarchical kernel classifiers for RGB-D object recognition is shown in Table 1.
descriptors (HKD) that applied kernel descriptors recursively
to form image-level features and thus provided a conceptually B. TRADITIONAL FEATURE LEARNING BASED METHODS
simple and consistent way to generate image level features Although hand-crafted features could boost object recogni-
from pixel attributes. tion accuracy, the feature design process requires a strong
3D shape descriptors have been widely studied and understanding of domain-specific knowledge and the features
used in RGB-D object recognition [66], [67], [93], [94]. should be re-designed for new data types. Another shortcom-
Yu et al. [95] proposed hierarchical sparse shape descriptor ing for hand-crafted features is that they only capture a subset
(HSSD), which learnt shape primitives in multiple hierar- of features from raw data that are discriminative for object
chies from the dataset. They achieved this by transforming recognition and other useful cues are ignored during feature
spin image and fast point feature histograms (FPFH) [96] design [43]. To reduce the dependency on hand-crafted fea-
into filter-pooling framework to generalize them for learn- tures, several feature learning based methods which learn fea-
ing shape representation. Redondo-Cabrera et al. [97] pro- tures from raw data directly have been proposed for RGB-D
posed an approach to recognize object categories using object recognition.
quantized 3D spatial pyramid matching kernels (3D SPMK) Hierarchical matching pursuit (HMP) [26] is an unsu-
in point clouds. In their work, the 3D SURF local pervised feature learning method which learns dictionaries
TABLE 2. Summary of traditional feature learning based methods for RGB-D object recognition.
over image and depth patches via K-SVD [101] to represent RGB and depth modalities, and the combined scores were
observations as sparse combinations of codewords. With the used to predict the category. Li et al. [108] applied multiple
learnt dictionary, feature hierarchies are built from scratch, scales of filters coupled with different pooling granularities,
layer by layer, using orthogonal matching pursuit and spatial and utilized color as an additional pooling domain to reduce
pyramid pooling. Bo et al. [102] used a two-layers HMP to the sensitivity to spatial deformations. Cheng et al. [109]
learn features from RGB, gray, depth, and surface normals extracted a 128-dimensional convolutional descriptor [110]
channels. Then, the features were concatenated and fed to a from RGB modality and a 200-dimensional kernel descrip-
LinSVM for training and classification. tor consisting of the gradient kernel and LBP kernel from
PCA and their variants are powerful tools for unsupervised depth modality [88]. These descriptors were converted to a
feature learning techniques from possibly high-dimensional 256-dimensional feature vector by PCA. A query adap-
data set and have been widely used in RGB-D object recogni- tive similarity measurement (QASM) and a learning-to-
tion. Chen et al. [103] treated the RGB-D data as a quaternion combination strategy were proposed to balance the RGB and
signal and proposed an algorithm for RGB-D object recog- depth cues, and majority vote classifiers [100] were used for
nition based on bi-directional 2D kernel quaternion princi- recognition.
pal component analysis [104] and quaternion representation A summary of traditional feature learning based methods
(QR) [105]. Sun et al. [106] proposed an RGB-D object for RGB-D object recognition is shown in Table 2.
recognition method combing PCA and canonical correlation
analysis (CCA). In their work, two stages of cascaded filter C. MMCNNS BASED METHODS
layers, one layer learnt features by PCA and another layer Due to their powerful representational capacities, CNNs have
maximized correlation by CCA, were constructed and fol- taken the place of traditional feature learning methods and
lowed by binary hashing and block histograms. Experiments have been considered to be mainstream approaches to RGB-D
proved that the introduced method was efficient and accurate. object recognition. More and more MMCNNs based methods
Single-layer convolutional operators were also widely used were proposed and have achieved exciting results. Accord-
in learning low level features from RGB and depth modal- ing to whether the training examples are labeled or unla-
ities in an unsupervised fashion. Blum et al. [27] proposed beled, the machine learning methods are broadly categorized
a convolutional k-means descriptor (CKM) to learn mean- into supervised, semi-supervised and unsupervised methods
ingful local feature description automatically. They calcu- [29], [111]. The CNNs based object recognition methods
lated the features response, accumulated them into feature often depend on a great deal of labeled data to train the
histograms and utilized a LinSVM for object recognition. model in a supervised way. To circumvent the burdensome
Similarly, Cheng et al. [22] proposed a convolutional Fisher labeling task, semi-supervised RGB-D object recognition
kernel (CFK) method for RGB-D object recognition. They methods have been proposed. Although, many methods such
learnt convolutional features from RGB and depth modalities as auto-encoder (AE) are widely used to learn deep fea-
via a single-layer CNN, utilized Gaussian mixture models tures in an unsupervised fashion [112], [113], the classifier
(GMMs) to model the feature distribution, and exploited should still be trained using the labeled data in a super-
Fisher kernel encoding [107] to generate image-level feature vised way. Essentially, these methods are still supervised
vector. Last, two LinSVMs were separately trained for the methods. To the best of our knowledge, there is no real
(2) Late level fusion Bo et al. [102] first converted the depth map to surface
Late level fusion refers to the aggregation of decisions from normals and then re-interpreted it as RGB values. Due to the
multiple classifiers, each trained on separate modalities. This inherent noisiness of the depth channel, such a direct conver-
fusion architecture is favored because errors from multiple sion results in noisy images in the color space. To address this
classifiers tend to be uncorrelated and it is feature indepen- issue, Akerberg et al. [121] modified it with recursive median
dent. In the late level fusion, the final object recognition filter and bilateral filter.
decision is performed by combining the classification prob- (2) ColorJet
abilities of different complementary CNN-based RGB-D ColorJet is a simple color mapping method by assign-
object recognition models. Compared with the early level ing different colors to different depth values. It was first
fusion, the late level fusion is much easier because the classi- introduced in RGB-D object recognition by Eitel et al. [40].
fication probabilities of different complementary in semantic In ColorJet, the original depth map is firstly normalized
stage usually have the same representation. An overall archi- 0-255 values. Then the colorization works by mapping the
tecture of late level fusion in RGB-D object recognition is lowest value to the blue channel and the highest value to the
shown in Fig. 5. red channel. The value in the middle is mapped to green and
the intermediate values are arranged accordingly. To increase
invariance of the generated images against camera pitch angle
changes, Schwarz et al. [119] improved the ColorJet model
by an optional re-projection step in a coordinate system.
(3) HHA
HHA was first introduced by Gupta et al. [122] for
RGB-D object detection and segmentation. HHA is a map-
ping method in which the first channel encodes the horizon-
tal disparity, the second channel encodes the height above
ground and the third channel encodes the pixel-wise angle
between the surface normal and the gravity vector. Horizontal
FIGURE 5. Overall architecture of late level fusion for RGB-D object disparity calculates the difference between the images of the
recognition.
left and right camera, which helps to understand the closeness
of objects from the camera. Height above the ground depicts
III. SUPERVISED LEARNING BASED MMCNNS the possible positions of objects in the scene concerning the
FOR RGB-D OBJECT RECOGNITION ground plane. The angle quantifies the shape of object in the
Among the MMCNNs based RGB-D object recognition scene. The HHA representation encodes properties of geo-
methods, the most common one is to utilize two off-the- centric pose that emphasize complementary discontinuities
shelf CNN architectures pre-trained on ImageNet and fine- in the image (depth, surface normal and height). This embed-
tune them using the labeled RGB-D dataset in a supervised ding method is strictly geocentric and such information is not
fashion. To this end, no further processing is required for always available in object-centric recognition [120].
the RGB modality, but the depth modality is often encoded (4) Embedded depth
as a rendered RGB image to emulate the distribution of the Embedded depth method was first proposed by
corresponding RGB images. Zaki et al. [120]. This method takes the shape informa-
tion into the depth embedding. In embedded depth method,
A. DEPTH ENCODING a single channel depth map is convoluted by vertical and
Recent work have focused on transferring the knowledge of horizontal Prewitt kernels. Then the gradient magnitude Gm
the CNN models pre-trained on ImageNet to extract discrim- and gradient direction Gθ are calculated, respectively by:
inative features from depth images [119], [120]. Many depth q
encoding methods were proposed which can be grouped into Gm = G2x + G2y , (1)
two categories, namely, hand-crafted encoding methods and Gθ = arctan(Gx , Gy ), (2)
learning based encoding methods
where Gx and Gy are the vertical and horizontal derivative of
1) HAND-CRAFTED BASED DEPTH ENCODING METHODS depth image. Last, the three-channel depth map is constructed
The hand-crafted based methods are very popular because as the concatenation of the original single channel depth map
they are intuitive and easy to be implemented. Many hand- with the gradient magnitude and direction maps.
crafted mappings were proposed to colorize depth data and It’s worth mentioning that the depth modality can be
obtained impressive improvements especially over the Wash- encoded by a certain encoding method or by a combination
ington RGB-D object dataset [77]. of the encoding methods. For example, Rahman et al. [123]
(1) Surface normals utilized both surface normals and ColorJet methods to encode
Within the RGB-D classification literature, surface nor- the depth images. Aakerberg et al. [124] encoded the depth
mals [102] is the most widely used encoding method. modality by combing the surface normals, ColorJet and
TABLE 3. Summary of different depth encoding methods. FIGURE 7. Overall architecture of RGB-D object recognition proposed by
Eitel et al. [40].
FIGURE 11. The configuration of the CNN model proposed by Wang et al. [114].
1) CASCADE FUSION
Usually, a fusion operation is performed once in a process.
A two-stage fusion scheme is presented by Zaki et al. [120].
In their work, a pre-trained CNN-F model [156] was applied
as the feature extractor for RGB, embedded depth and embed-
ded point cloud, respectively. For the case of late fusion,
the hypercube pyramid features of convolutional layers Fhc
and fully-connected layer features (fc6) Ffc were first fed
to two ELM classifiers to predict the class probabilities yhc
and yfc . Then, the concatenation of these new vectors Fc =
FIGURE 18. Overview of DSMME model introduced by Zaki et al. [133]. [yhc , yfc ] was used as an input vector to another ELM for
final classification. Experimental results proved that the late
Loghmani et al. [131] utilized two streams of pre-trained fusion scheme outperformed the early fusion scheme which
ResNet-18 [38] to process the RGB and depth modality concatenated Fhc and Ffc simply. An overview of Zaki’s
(Encoded by surface normals), and the volumetric features cascaded late fusion framework is shown in Fig. 20.
were extracted at different convolutional levels. Then the vol-
umetric features were individually transformed into a vector 2) HIERARCHICAL FUSION
through projection blocks consisting of two convolutional In the hierarchical fusion, the class probabilities are deter-
layers and a global max pooling layer. RGB-D features mined by the fusion of class probabilities estimated at differ-
extracted from different levels were sequentially fed into an ent hierarchical levels. A typically hierarchical fusion fusion
recurrent neural network (RNN) to produce a descriptive and for RGB-D object recognition was proposed by Asif et al. [5].
FIGURE 20. Overview of cascaded late fusion framework for RGB-D object recogniton by Zaki et al. [120].
They divided the classification into image-level and pixel- dependent on large-scale manually labeled RGB-D data
level. The final object classification loss function LObject which is an expensive, time-consuming and boring task. Thus
was a weighted combination of a pixel-level classification it is necessary to develop a new effective training frame-
loss function LPixel and an image-level classification loss work for deep learning to benefit from the massive unlabeled
function LImage : RGB-D data, which is often cheap and available. To address
LObject = αLPixel + (1 − α)LImage (4) the problem of large-scale annotated RGB-D dataset, a natu-
ral idea is to incorporate the semi-supervised learning meth-
where α controls the relative influence of the pixel-level
ods into the deep learning framework to automatically exploit
regularization to the overall training objective function.
unlabeled data [159].
In their work, the image-level classification, named CNN-I
Semi-supervised based methods address the problem of
branch was composed of five successive convolutional layers
learning a better classifier by combining a small set of
followed by two fully-connected layers FC6 and FC7 as
labeled data and large amount of unlabeled data [159].
shown in Fig. 21. The pixel-level classification, named CNN-
Many semi-supervised learning methods were proposed in
P branch originated from the Conv5 layer of the CNN-I
the literature [159], e.g. self-training [160], [161], co-training
branch and composed of two convolution layers Conv6 and
[162], [163], expectation-maximization (EM) [164], [165]
Conv7. Both the CNN-I and the CNN-P branches learnt
and graph based methods [166], [167]. Among these meth-
modality specific features from VGG-11 [36]. Then, two
ods, co-training was theoretically proved to be very appropri-
fusion branches, namely, CNN-FV and FS2 were constructed
ate and successful in combining the labeled and unlabeled
to fused the two modalities in image-level and pixel-level,
data under three strong assumptions in [168]. Specifically,
respectively. The CNN-FV branch consisted of a Fisher
(i) features can be split into two sets; (ii) each sub-feature
encoding module, a concatenation module, and two fully-
set is sufficient to train a good classifier; (iii) the two sets are
connected layers. The fusion branch FS2 composed of
conditionally independent given the class. RGB-D data meets
two convolution layers, and a deconvolution layer. Then,
the three assumptions very well. First, RGB-D data contains
a weighted loss function as shown in Eq.4 was defined to
two distinct views, RGB and depth. Second, both of them can
combine the pixel-level and image-level classification loss
provide useful cues for object recognition. Third, the image
functions.
capturing modes of RGB (e.g., RGB cameras) and depth
A summary of supervised MMCNN methods for RGB-D
(e.g., infrared cameras) are very different.
recognition is shown in Table 4.
From the standpoint of multimodal fusion, the semi-
IV. SEMI-SUPERVISED LEARNING BASED MMCNNS supervised methods are also grouped into early level fusion
FOR RGB-D OBJECT RECOGNITION and late level fusion. In this section, the late level fusion is
Supervised learning based CNNs have so far made great first presented considering that it was first introduced in the
successes in RGB-D object recognition but they are highly literature and more straightforward.
FIGURE 21. Overview of hierarchical fusion for RGB-D object recognition by Asif et al. [5].
TABLE 4. Summary of supervised MMCNNs methods for RGB-D object recognition. In the column of ‘‘Fusion level’’, ‘‘EF’’ and ‘‘LF’’ represent the Early
Fusion and Late Fusion, ‘‘CL’’ and ‘‘FCL’’ denote the Convolutional Layer and Full Connected Layer in early level fusion, ‘‘C’’ and ‘‘H’’ represent the Cascade
fusion and Hierarchical fusion in late level fusion, respectively.
A. WEIGHTED AVERAGE BASED LATE LEVEL FUSION semi-supervised learning based MMCNNs for RGB-D object
The weighted average based fusion is the simplest but recognition. It was first proposed by Cheng et al. [169].
the most commonly used late level fusion method in In Cheng et al.’s work [169], they adopted the
VOLUME 7, 2019 43123
M. Gao et al.: RGB-D-Based Object Recognition Using MMCNNs: A Survey
co-training [168] scheme to learn from the unlabeled data. decisions from each modality. The reason why they did not
Specifically, they utilized a CNN-RNN model [146] to learn adopt deeper CNN architectures and early fusion scheme is
features from RGB and depth modalities and trained two that the small labeled dataset is infeasible to supervise the
SVM classifiers from the labeled training sets. The trained training of a deep CNN model for object recognition, thus
classifiers CRGB and CDepth were applied to predict the exam- the feature is unreliable.
ples from the unlabeled training sets. The most confidently To overcome this barrier, Cheng et al. [46] utilized two
predicted instances of each class by the two classifiers were reconstruction networks which consisted of 5 convolutional
transferred from unlabeled dataset to labeled dataset for the layers and 12 fully-connected layers to decode each chan-
next round training. The algorithm ran until it reached the nel of the inputs using the labeled and unlabeled data. The
maximum number of iteration or the unlabeled pool was configuration of 5 convolutional layers in the reconstruction
empty. The category of the input instance was determined network was the same as the AlexNet [35]. The input of the
by the weighted sum of the two probability scores generated network was a re-scaled RGB or depth image denoted as
by the two classifiers: x ∈ R148×148×3 , and the output was a reconstructed map
R(x) ∈ R64×64×3 . The reconstruction network was trained
c = argci ∈χ Max(αPcCiRGB + (1 − α)PcCiDepth ), (5)
by minimizing the mean square reconstruction error:
where PCRGB and PCDepth are the probability scores generated 3
1
2
by the RGB and depth classifiers. α is the weighted param-
X X
LossvR =
ch
x̃ − Rch (x)
(6)
eter which determined by cross-validation. Experimental M +N
x∈{Lv ,U v } ch=1
results on Washington RGB-D dataset [77] showed that using
where M and N are the size of the labeled and unlabeled
20% labeled Washington RGB-D training set (7000 images)
pool, ch denotes the channel index of the input, L and U
together with 80% unlabeled training sets outperformed many
denote the labeled and unlabeled datasets and v represents the
state-of-the-art methods using 100% labeled training sets
RGB or depth modality. x̃ is the ground truth by re-sizing x
[88], [146]. An overview of Cheng’s work is shown in Fig. 22.
via a bilinear interpolation.
TABLE 5. Summary of semi-supervised MMCNNs based methods for RGB-D object recognition. (The labeled data size is specific to Washington RGB-D
dataset.)
FIGURE 27. Examples of RGB (The first two rows) and depth images (The
last two rows) from the 2D/3D object dataset.
(1) Category level recognition around objects to sample another 150 random views of the
Category level recognition involves classifying objects as object from the whole viewing sphere as the testing data.
belonging to some categories even though the system has not Tables 6,7 and 8 show the comparative results of differ-
seen them before. To this end, the system is trained on a set ent methods on Washington RGB-D dataset, 2D/3D object
of labeled objects. At test time, the system is presented with dataset and JHUIT-50 object dataset, respectively. The best
an RGB and depth image pair containing an object that is not results are shown in bold.
present in training and the task is to assign a category label
to this object. The accuracies are averaged across many trials C. DISCUSSION ON THE RESULTS
with alternating contiguous frames and the result is evaluated We make the following observations regarding the results:
by the mean accuracies with standard deviation. (1) Intuitively, visual features play a dominant role in
(2) Instance level recognition category level and instance level recognition and combining
Instance level recognition is to identify whether an object the RGB and depth modalities can improve the recognition
is physically the same one as that has previously been seen accuracy and stability compared with single modality.
based on its unique appearance. The system is trained on (2) Generally, the traditional feature learning based meth-
a subset of views of each object and the task here is to ods outperform the hand-crafted feature based methods
distinguish between object instances. At test time, the system across all datasets, and the MMCNNs based methods can
is presented with a pair RGB and depth image that contains a further improve the object recognition performance.
previously unseen view of one of the objects and assigns an (3) The category recognition accuracies of some semi-
instance label to the image. supervised methods, e.g. [47], [170], [172] are more than
The Washington RGB-D dataset can be evaluated from 90% which are comparative to the state-of-the-art supervised
both the aspects of category recognition and instance recog- methods. Moreover, the GPC in [47] is a recommendable
nition. For category recognition, one object is chosen ran- method to learn from unlabeled data which outperforms the
domly from each category for testing and train the classifiers traditional co-training based methods [46], [170], [172].
on all views of the remaining objects. For object instance (4) For the fully-connected layer fusion in supervised
recognition, the leave-sequence-out method is adopted in learning based MMCNNs, the best and most commonly
which video sequences of each object recorded on 30◦ and used strategy is to combine the RGB-CNN and Depth-
60◦ are trained and the sequences recorded on 45◦ are lever- CNN together via discarding their own classification lay-
aged for instance recognition. Each split consists of roughly ers. For example, the fc7 layer of the CaffeNet [40], [123],
35,000 labeled training images and 7,000 testing images [77]. [124], [132], [142], and VGGNet [121], [128], the Dense
It is worth mentioning that for the semi-supervised learn- Block(4) of the DenseNet [129], and the conv5_x layer of the
ing methods, the number of label training data was set ResNet [130] are extracted as feature from RGB and depth
7,000 for [169], [170], [172],1500 for [46]. and 105 for [47]. modalities. Schwarz et al. [119] carried on a comparative
The accuracy is averaged across 10 trials for category recog- experiment by connecting the fc7 layer with the classification
nition and instance recognition with alternating contiguous layers fc8 in the CaffeNet before feature fusion. Comparative
frames. results showed that combing the fc7 and fc8 layers achieved
The 2D/3D object dataset is evaluated for category recogni- a recognition accuracy with 89.4 ± 1.3(%) which was worse
tion only. The dataset is randomly split into training and test than the result by single fc7 fusion (91.3 ± 1.4(%)) [40].
sets. Six objects of each class are used for training and the (5) For the RGB modality, a common method is to uti-
remaining objects are used for testing (The only exception is lize an off-the-shelf CNN architecture pre-trained on Ima-
the class scissors which has less than six instances. In this geNet and fine-tune it using the labeled RGB images from
case, at least one instance is available for validation). For the RGB-D dataset. The category recognition accuracy of
each training object, only 18 views out of 36 views are used. RGB modality with different CNN models are compared
Eventually 82 objects in 1,476 RGB-D images are chosen as and shown in Fig. 29. It indicates that a deeper net-
training data, while 74 objects in 1,332 RGB-D images are work often yields a better recognition result. This is also
used for testing. It is worth mentioning that for the semi- verified by Zhou et al. [129]. In Zhou’s work, four
supervised learning methods [170], the label training set is architectures namely, AlexNet, VGG-16, ResNet-101 and
20% of the training data (7000 images). The experiment DenseNet-121 were adopted in the MSANet framework and
repeated 30 times with random splitting each category of comparative results showed that the DenseNet-121 performed
objects into training and test set and the final accuracy is best, and the rest in turn, were ResNet-101, VGG-16 and
averaged category across 30 trials. AlexNet. However, this is not always the case for depth image
The JHUIT-50 object dataset is used for instance level because the performance based on depth modality is not only
object recognition only. On the training side, each object is related to the network models but also closely connected
placed on a turntable in increments of 7.2◦ at three fixed with the depth encoding methods. To probe into the influence
camera viewing angles with 30◦ , 45◦ and 60◦ . This amounts of encoding methods and the CNN models on the depth-
to 360
7.2 × 3 = 150 object views (7500 images) in total for based category recognition accuracy, we compared the cat-
training. For testing data, the camera is manually moved egory recognition accuracy of the same CNN model with
problem to some extent, more quantities and variations of image feature either through stacking as a fourth pixel feature
RGB-D training datasets are highly desired and important, directly [198] or through training a network to operate on
and innovations in collecting and annotating these datasets 2D depth images enhanced with geocentric encodings [122].
are also needed. However, these approaches do not make full use of the
(2) Zero/one-shot learning: As aforementioned, it is not geometric information and makes it difficult to integrate
always easy to collect large scale labeled data like ImageNet information across viewpoints. Rather than treating depth
which includes more than a million color images with a as a 2D mapping, the 3D volumetric representation models
thousand object categories. Learning from a few examples either through 3D voxel grids [199] or through 3D volumes
remains a key challenge in machine learning. How to adopt [200], [201] provide a rich and powerful representation of 3D
deep learning methods for zero/one shot RGB-D based object shapes and give a more fundamental view of objects and
recognition is an interesting research direction. Zero/one-shot scenes. To exploit the discriminative features from 3D volu-
learning is to recognize classes that are never seen or only metric inputs, more complex volumetric neural networks and
one training sample per class before [188], [189]. In the notable 3D data augmentation methods are highly required.
past few years, there were some works on zero/one-shot (5) Real-time applications: Real-time RGB-D object
learning for classification and recognition tasks. For example, recognition is required in practical applications. CNNs usu-
Long et al. [190] proposed a zero-shot learning framework ally require great computational burden which limits their
that mapped semantic embeddings to a discriminative rep- access to real-time applications. To address this problem, one
resentation space and achieved promising performances on feasible approach is to develop new architectures which allow
both recognition and retrieval tasks. Mettes and Snoek [191] running CNNs in real-time. He and Sun [202] conducted a
proposed a spatial-aware object embedding for zero-shot series of experiments under constrained time cost, and pro-
action localization and classification. Zhang et al. [192] stud- posed models that were fast for real-world applications, yet
ied the one-shot learning gesture recognition on RGB-D data were competitive with existing CNN models. Li et al. [203]
recorded from Microsoft’s Kinect and proposed a novel bag eliminated all the redundant computations in the forward and
of manifold words based feature representation on symmet- backward propagation in CNNs, which resulted in a speed up
ric positive definite manifolds. However, how to effectively of over 1500 times. Another trend is to develop deep learning
adopt deep learning methods for zero/one shot RGB-D based chips for mobile devices. Recently, the idea of deep learning
object recognition would be remain a topic for future research chips has emerged and drawn great attentions. For example,
direction. Bong et al. [204] proposed a low-power convolutional neural
(3) Unsupervised learning: Collecting labeled datasets is network for the user authentication in smart devices which
time-consuming and costly, hence learning from unsuper- consumed 0.62 mW to evaluate one face at 1 fps and achieved
vised video data is required. Mobile robots mounted with 97% accuracy in LFW dataset. Krestinskaya et al. [205]
RGB-D cameras need to continuously learn from the environ- proposed analog backpropagation learning circuits for var-
ment without human intervention. How to automatically learn ious memristive learning architectures, such as deep neural
from the unlabeled data to improve the learning capability network, binary neural network, multiple neural network,
of deep networks would be a fruitful and useful research hierarchical temporal memory, and long short-term memory.
direction. Recently, generative adversarial network (GAN) It is expected that these methodologies will consolidate in
has achieved great success in image generation tasks, such the following few years, and potentially make it practical to
as face generation [193], [194] and text/image-to-image run neural networks locally on mobile devices in real-time
[195], [196] translation. Also, it can be used for recognition applications.
task. For example, Tran et al. [197] proposed a disentan- (6) CNN feature interpretation and visualization: CNNs
gled representation learning generative adversarial networks have achieved superior performance in many visual tasks.
(DR-GAN) for pose-invariant face recognition. Therefore, However, the end-to-end learning strategy makes CNNs rep-
we believe the GAN-based techniques can remove the heavy resentations a black box. With the exception of the final
dependence on richly annotated training data, which is a great network output, it is difficult to understand the logic of
exciting direction for RGB-D object recognition. CNN predictions hidden inside the network. In recent years,
(4) Depth data processing and representation: Depth data researchers have realized that high model interpretability
is a critical modeling element in an RGB-D vision sys- is of significant value in both theory and practice, and
tem. Although the RGB-D data is now widespread and has have developed models with interpretable knowledge rep-
been incorporated into many applications, the depth data resentations [206]. For example, visualization of CNN rep-
is still typically noisy and incomplete. The depth data pro- resentations in intermediate network layers [207]–[209],
cessing techniques still lag behind the RGB data processing diagnosis of CNN representations [210]–[213], disentan-
techniques. The robust processing of depth data, especially glement of the mixture of patterns encoded in each fil-
the denoising, inpainting, segmentation, and hole-filling are ter of CNNs [214], [215] and building explainable models
highly required. Meanwhile, more works need to be car- [216], [217]. More explicit and concise CNN feature interpre-
ried out to find the best methods for representing depth. tation and visualization technologies are required for better
Currently, the depth modality is usually treated as a 2D use of CNNs features in vision tasks.
VII. CONCLUSIONS [16] K. Mikolajczyk and C. Schmid, ‘‘A performance evaluation of local
In this paper, we reviewed recent advances in MMCNNs descriptors,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10,
pp. 1615–1630, Oct. 2005.
based RGB-D object recognition. We provided an insightful [17] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, ‘‘Learning transferable
analysis on the solutions for training data deficiency and architectures for scalable image recognition,’’ in Proc. IEEE Conf. Com-
multimodal fusion strategies. We categorized the general put. Vis. Pattern Recognit., Jun. 2018, pp. 8697–8710.
[18] Z. Wu, C. Shen, and A. Van Den Hengel, ‘‘Wider or deeper: Revisiting
methods into supervised and semi-supervised schemes con- the ResNet model for visual recognition,’’ Pattern Recognit., vol. 90,
sidering whether the training data was labeled or unlabeled. pp. 119–133, Jun. 2019.
Furthermore, we analyzed the multimodal fusion strategies [19] Z. Cai, J. Han, L. Liu, and L. Shao, ‘‘RGB-D datasets using Microsoft
kinect or similar sensors: A survey,’’ Multimedia Tools Appl., vol. 76,
and categorized them into early level fusion and late level no. 3, pp. 4313–4355, 2017.
fusion based on the fusion level and more detailed analyses [20] D. Du, X. Xu, T. Ren, and G. Wu, ‘‘Depth images could tell us
were presented. Meanwhile, the influence of different depth more: Enhancing depth discriminability for RGB-D scene recogni-
tion,’’ in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2018,
encoding methods, different CNN models and different mul-
pp. 1–6.
timodal fusion methods on the RGB-D object recognition [21] X. Liu, D. Zhai, R. Chen, X. Ji, D. Zhao, and W. Gao, ‘‘Depth restoration
performance were analyzed as well. Finally, we identified from RGB-D data via joint adaptive regularization and thresholding on
some compelling challenges and future research directions manifolds,’’ IEEE Trans. Image Process., vol. 28, no. 3, pp. 1068–1079,
Mar. 2019.
that confront research in RGB-D computer vision using [22] Y. Cheng, R. Cai, X. Zhao, and K. Huang, ‘‘Convolutional fisher kernels
CNN-based approaches. for RGB-D object recognition,’’ in Proc. Int. Conf. 3D Vis. (3DV), 2015,
pp. 135–143.
[23] D. G. Lowe, ‘‘Distinctive image features from scale-invariant keypoints,’’
REFERENCES Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[1] M. Naseer, S. Khan, and F. Porikli, ‘‘Indoor scene understanding [24] A. E. Johnson and M. Hebert, ‘‘Using spin images for efficient object
in 2.5/3D for autonomous agents: A survey,’’ IEEE Access, vol. 7, recognition in cluttered 3D scenes,’’ IEEE Trans. Pattern Anal. Mach.
pp. 1859–1887, 2019. Intell., vol. 21, no. 5, pp. 433–449, May 1999.
[2] S. Gupta, P. Arbeláez, R. Girshick, and J. Malik, ‘‘Indoor scene under- [25] L. Zheng, Y. Yang, and Q. Tian, ‘‘SIFT meets CNN: A decade survey of
standing with RGB-D images: Bottom-up segmentation, object detec- instance retrieval,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 5,
tion and semantic segmentation,’’ Int. J. Comput. Vis., vol. 112, no. 2, pp. 1224–1244, May 2018.
pp. 133–149, 2015. [26] L. Bo, X. Ren, and D. Fox, ‘‘Hierarchical matching pursuit for image
[3] M. Naseer, S. H. Khan, and F. Porikli. (2018). ‘‘Indoor scene understand- classification: Architecture and fast algorithms,’’ in Proc. Adv. Neural Inf.
ing in 2.5/3D for autonomous agents: A survey.’’ [Online]. Available: Process. Syst. (NIPS), 2011, pp. 2115–2123.
https://arxiv.org/abs/1803.03352 [27] M. Blum, J. T. Springenberg, J. Wülfing, and M. Riedmiller, ‘‘A learned
[4] C. Sakaridis, D. Dai, and L. Van Gool, ‘‘Semantic foggy scene under- feature descriptor for object recognition in RGB-D data,’’ in Proc. IEEE
standing with synthetic data,’’ Int. J. Comput. Vis., vol. 126, no. 9, Int. Conf. Robot. Autom. (ICRA), May 2012, pp. 1298–1303.
pp. 973–992, 2018. [28] Y. Bengio, A. Courville, and P. Vincent, ‘‘Representation learning:
[5] U. Asif, M. Bennamoun, and F. A. Sohel, ‘‘A multi-modal, discriminative A review and new perspectives,’’ IEEE Trans. Pattern Anal. Mach. Intell.,
and spatially invariant CNN for RGB-D object labeling,’’ IEEE Trans. vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
Pattern Anal. Mach. Intell., vol. 40, no. 9, pp. 2051–2065, Sep. 2018. [29] X. Liu, W. Liu, Y. Liu, Z. Wang, N. Zeng, and F. E. Alsaadi, ‘‘A survey
[6] M. Brucker et al., ‘‘Semantic labeling of indoor environments from 3D of deep neural network architectures and their applications,’’ Neurocom-
RGB maps,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2018, puting, vol. 234, pp. 11–26, Apr. 2017.
pp. 1871–1878. [30] Q. Zhang, L. T. Yang, Z. Chen, and P. Li, ‘‘A survey on deep learning for
[7] Q. Cheng, Q. Zhang, P. Fu, C. Tu, and S. Li, ‘‘A survey and analysis on big data,’’ Inf. Fusion, vol. 42, pp. 146–157, Jul. 2018.
automatic image annotation,’’ Pattern Recognit., vol. 79, pp. 242–259, [31] O. Russakovsky et al., ‘‘ImageNet large scale visual recognition
Jul. 2018. challenge,’’ Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252,
[8] D. Kouřil et al., ‘‘Labels on levels: Labeling of multi-scale multi-instance Dec. 2015.
and crowded 3D biological environments,’’ IEEE Trans. Vis. Comput. [32] S. Khan, H. Rahmani, S. A. A. Shah, and M. Bennamoun, ‘‘A guide
Graphics, vol. 25, no. 1, pp. 977–986, Jan. 2019. to convolutional neural networks for computer vision,’’ Synth. Lectures
[9] U. Asif, M. Bennamoun, and F. A. Sohel, ‘‘RGB-D object recognition and Comput. Vis., vol. 8, no. 1, pp. 1–207, 2018.
grasp detection using hierarchical cascaded forests,’’ IEEE Trans. Robot., [33] Z. Qin, F. Yu, C. Liu, and X. Chen, ‘‘How convolutional neural networks
vol. 33, no. 3, pp. 547–564, Jun. 2017. see the world—A survey of convolutional neural network visualization
[10] J. Shen and N. Gans, ‘‘Robot-to-human feedback and automatic object methods,’’ Math. Found. Comput., vol. 1, no. 2, pp. 149–180, 2018.
grasping using an RGB-D camera–projector system,’’ Robotica, vol. 36, [34] J. Gu et al., ‘‘Recent advances in convolutional neural networks,’’ Pattern
no. 2, pp. 241–260, 2018. Recognit., vol. 77, pp. 354–377, May 2018.
[11] P. Schmidt, N. Vahrenkamp, M. Wächter, and T. Asfour, ‘‘Grasping [35] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ‘‘ImageNet classification
of unknown objects using deep convolutional neural networks based with deep convolutional neural networks,’’ in Advances in Neural Infor-
on depth images,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), mation Processing Systems, F. Pereira, C. J. C. Burges, L. Bottou, and
May 2018, pp. 6831–6838. K. Q. Weinberger, Eds. New York, NY, USA: Curran Associates, 2012,
[12] I. González-Díaz, J. Benois-Pineau, J.-P. Domenger, D. Cattaert, and pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-
A. de Rugy, ‘‘Perceptually-guided deep neural networks for ego-action imagenet-classification-with-deep-convolutional-neural-networks.pdf
prediction: Object grasping,’’ Pattern Recognit., vol. 88, pp. 223–235, [36] K. Simonyan and A. Zisserman. (2014). ‘‘Very deep convolutional
Apr. 2019. networks for large-scale image recognition.’’ [Online]. Available:
[13] A. Andreopoulos and J. K. Tsotsos, ‘‘50 years of object recognition: http://arxiv.org/abs/1409.1556
Directions forward,’’ Comput. Vis. Image Understand., vol. 117, no. 8, [37] C. Szegedy et al., ‘‘Going deeper with convolutions,’’ in Proc. IEEE Conf.
pp. 827–891, 2013. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[14] S. A. A. Shah, M. Bennamoun, F. Boussaid, and L. While, ‘‘Evolution- [38] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for
ary feature learning for 3-D object recognition,’’ IEEE Access, vol. 6, image recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
pp. 2434–2444, 2018. (CVPR), Jun. 2016, pp. 770–778.
[15] G. Pasquale, C. Ciliberto, F. Odone, L. Rosasco, and L. Natale, ‘‘Are [39] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, ‘‘Densely
we done with object recognition? The iCub robot’s perspective?’’ Robot. connected convolutional networks,’’ in Proc. IEEE Conf. Comput. Vis.
Auto. Syst., vol. 112, pp. 260–281, Feb. 2019. Pattern Recognit. (CVPR), Jun. 2017, pp. 2261–2269.
[40] A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, and [63] P. G. Schyns, ‘‘Object recognition: Complexity of recognition strategies,’’
W. Burgard, ‘‘Multimodal deep learning for robust RGB-D object Current Biol., vol. 28, no. 7, pp. R313–R315, 2018.
recognition,’’ in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), [64] E. Rachmawati, I. Supriana, and M. L. Khodra, ‘‘Review of local descrip-
Sep. 2015, pp. 681–687. tor in RGB-D object recognition,’’ Telkomnika, vol. 12, no. 4, p. 1132,
[41] L. Song et al., ‘‘A deep multi-modal CNN for multi-instance multi- 2014.
label image classification,’’ IEEE Trans. Image Process., vol. 27, no. 12, [65] Y. Guo, M. Bennamoun, F. Sohel, M. Lu, and J. Wan, ‘‘3D object
pp. 6025–6038, Dec. 2018. recognition in cluttered scenes with local surface features: A survey,’’
[42] X. Song, S. Jiang, L. Herranz, and C. Chen, ‘‘Learning effective RGB- IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 11, pp. 2270–2287,
D representations for scene recognition,’’ IEEE Trans. Image Process., Nov. 2014.
vol. 28, no. 2, pp. 980–993, Feb. 2019. [66] M. Hashimoto, S. Akizuki, and S. Takei, ‘‘A survey and technology trends
[43] A. Wang, J. Lu, J. Cai, T.-J. Cham, and G. Wang, ‘‘Large-margin multi- of 3D features for object recognition,’’ IEEJ Trans. Electron., Inf. Syst.,
modal deep learning for RGB-D object recognition,’’ IEEE Trans. Multi- vol. 136, no. 8, pp. 1038–1046, 2016.
media, vol. 17, no. 11, pp. 1887–1898, Nov. 2015. [67] L. A. Alexandre, ‘‘3D descriptors for object and category recognition:
[44] S. J. Pan and Q. Yang, ‘‘A survey on transfer learning,’’ IEEE Trans. A comparative evaluation,’’ in Proc. IEEE/RSJ Int. Conf. Intell. Robots
Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010. Syst. (IROS), Oct. 2012, vol. 1, no. 3, p. 7.
[45] C. J. Debono et al., ‘‘Efficient depth-based coding,’’ in Proc. 3D Vis. [68] A. Ioannidou, E. Chatzilari, S. Nikolopoulos, and I. Kompatsiaris, ‘‘Deep
Content Creation, Coding Del. Cham, Switzerland: Springer, 2019, learning advances in computer vision with 3D data: A survey,’’ ACM
pp. 97–114. Comput. Surv., vol. 50, no. 2, pp. 1–38, Apr. 2017.
[46] Y. Cheng, X. Zhao, R. Cai, Z. Li, K. Huang, and Y. Rui, ‘‘Semi-supervised [69] H. Bay, T. Tuytelaars, and L. Van Gool, ‘‘SURF: Speeded up robust fea-
multimodal deep learning for RGB-D object recognition,’’ in Proc. 25th tures,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV). Graz, Austria: Springer,
Int. Joint Conf. Artif. Intell. (IJCAI), 2016, pp. 3345–3351. 2006, pp. 404–417.
[47] L. Sun, C. Zhao, and R. Stolkin. (2017). ‘‘Weakly-supervised [70] N. Dalal and B. Triggs, ‘‘Histograms of oriented gradients for human
DCNN for RGB-D object recognition in real-world applications detection,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR),
which lack large-scale annotated training data.’’ [Online]. Available: Jun. 2005, pp. 886–893.
https://arxiv.org/abs/1703.06370 [71] R. B. Rusu and S. Cousins, ‘‘3D is here: Point cloud library (PCL),’’ in
[48] R. S. Blum and Z. Liu, Multi-Sensor Image Fusion and its Applications. Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2011, pp. 1–4.
Boca Raton, FL, USA: CRC Press, 2005. [72] M. Körtgen, G.-J. Park, M. Novotni, and R. Klein, ‘‘3D shape matching
[49] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, ‘‘Multimodal machine learn- with 3D shape contexts,’’ in Proc. 7th Central Eur. Seminar Comput.
ing: A survey and taxonomy,’’ IEEE Trans. Pattern Anal. Mach. Intell., Graph., Budmerice, Slovakia, 2003, pp. 5–17.
vol. 41, no. 2, pp. 423–443, Feb. 2019. [73] S. Lazebnik, C. Schmid, and J. Ponce, ‘‘A sparse texture representation
[50] P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli, ‘‘Mul- using local affine regions,’’ IEEE Trans. Pattern Anal. Mach. Intell.,
timodal fusion for multimedia analysis: A survey,’’ Multimedia Syst., vol. 27, no. 8, pp. 1265–1278, Aug. 2005.
vol. 16, no. 6, pp. 345–379, 2010. [74] R. B. Rusu, N. Blodow, Z. C. Marton, and M. Beetz, ‘‘Aligning point
[51] D. Lahat, T. Adali, and C. Jutten, ‘‘Multimodal data fusion: An overview cloud views using persistent feature histograms,’’ in Proc. IEEE/RSJ Int.
of methods, challenges, and prospects,’’ Proc. IEEE, vol. 103, no. 9, Conf. Intell. Robots Syst. (IROS), Sep. 2008, pp. 3384–3391.
pp. 1449–1477, Sep. 2015. [75] B. Drost, M. Ulrich, N. Navab, and S. Ilic, ‘‘Model globally, match
[52] D. Ramachandram and G. W. Taylor, ‘‘Deep multimodal learning: A sur- locally: Efficient and robust 3D object recognition,’’ in Proc. IEEE Conf.
vey on recent advances and trends,’’ IEEE Signal Process. Mag., vol. 34, Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010, pp. 998–1005.
no. 6, pp. 96–108, Nov. 2017. [76] F. Tombari, S. Salti, and L. Di Stefano, ‘‘Unique signatures of histograms
[53] Z. Liu, E. Blasch, G. Bhatnagar, V. John, W. Wu, and R. S. Blum, for local surface description,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV).
‘‘Fusing synergistic information from multi-sensor images: An overview Heraklion, Crete: Springer, 2010, pp. 356–369.
from implementation to performance assessment,’’ Inf. Fusion, vol. 42, [77] K. Lai, L. Bo, X. Ren, and D. Fox, ‘‘A large-scale hierarchical multi-view
pp. 127–145, Jul. 2018. RGB-D object dataset,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA),
[54] H. Liu, Y. Wu, F. Sun, B. Fang, and D. Guo, ‘‘Weakly paired multimodal May 2011, pp. 1817–1824.
fusion for object recognition,’’ IEEE Trans. Autom. Sci. Eng., vol. 15, [78] L. Bo and C. Sminchisescu, ‘‘Efficient match kernel between sets of
no. 2, pp. 784–795, Apr. 2018. features for visual recognition,’’ in Proc. Adv. Neural Inf. Process. Syst.
[55] T. Zhou, K.-H. Thung, X. Zhu, and D. Shen, ‘‘Effective feature learn- (NIPS), 2009, pp. 135–143.
ing and fusion of multimodality data using stage-wise deep neural net- [79] K. Lai, L. Bo, X. Ren, and D. Fox, ‘‘Sparse distance learning for object
work for dementia diagnosis,’’ Hum. Brain Mapping, vol. 40, no. 3, recognition combining RGB and depth information,’’ in Proc. IEEE Int.
pp. 1001–1016, 2019. Conf. Robot. Autom. (ICRA), May 2011, pp. 4007–4013.
[56] T. Zhou, C. Zhang, C. Gong, H. Bhaskar, and J. Yang, ‘‘Multiview latent [80] D. Paulk, V. Metsis, C. McMurrough, and F. Makedon, ‘‘A supervised
space learning with feature redundancy minimization,’’ IEEE Trans. learning approach for fast object recognition from RGB-D data,’’ in Proc.
Cybern., to be published. 7th Int. Conf. Pervasive Technol. Rel. Assistive Environ. (PETRA), 2014,
[57] K.-H. Thung, P.-T. Yap, and D. Shen, ‘‘Multi-stage diagnosis of p. 5.
Alzheimer’s disease with incomplete multimodal data via multi-task [81] B. Browatzki, J. Fischer, B. Graf, H. H. Bülthoff, and C. Wallraven,
deep learning,’’ in Deep Learning in Medical Image Analysis and Mul- ‘‘Going into depth: Evaluating 2D and 3D cues for object classification
timodal Learning for Clinical Decision Support. Québec City, QC, on a new, large-scale object dataset,’’ in Proc. IEEE Int. Conf. Comput.
Canada: Springer, 2017, pp. 160–168. Vis. Workshops (ICCV), Nov. 2011, pp. 1189–1195.
[58] X. Zhu, K.-H. Thung, E. Adeli, Y. Zhang, and D. Shen, ‘‘Maximum mean [82] A. Bosch, A. Zisserman, and X. Munoz, ‘‘Representing shape with a
discrepancy based multiple kernel learning for incomplete multimodality spatial pyramid kernel,’’ in Proc. 6th ACM Int. Conf. Image Video Retr.
neuroimaging data,’’ in Proc. Int. Conf. Med. Image Comput. Comput.- (ICIVR), Jul. 2007, pp. 401–408.
Assist. Intervent. Québec City, QC, Canada: Springer, 2017, pp. 72–80. [83] E. Shechtman and M. Irani, ‘‘Matching local self-similarities across
[59] Y. Huang, W. Wang, and L. Wang, ‘‘Unconstrained multimodal multi- images and videos,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
label learning,’’ IEEE Trans. Multimedia, vol. 17, no. 11, pp. 1923–1935, (CVPR), Jul. 2007, pp. 1–8.
Nov. 2015. [84] M. Heczko, D. Keim, D. Saupe, and D. Vranic, ‘‘Methods for similarity
[60] J. Sanchez-Riera, K.-L. Hua, Y.-S. Hsiao, T. Lim, S. C. Hidayati, and search on 3D databases,’’ Datenbank-Spektrum, vol. 2, no. 2, pp. 54–63,
W.-H. Cheng, ‘‘A comparative study of data fusion for RGB-D based 2002.
visual recognition,’’ Pattern Recognit. Lett., vol. 73, pp. 1–6, Apr. 2016. [85] J. J. Koenderink and A. J. Van Doorn, ‘‘Surface shape and curvature
[61] P. N. Druzhkov and V. D. Kustikova, ‘‘A survey of deep learning methods scales,’’ Image Vis. Comput., vol. 10, no. 8, pp. 557–564, 1992.
and software tools for image classification and object detection,’’ Pattern [86] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin, ‘‘Shape distribu-
Recognit. Image Anal., vol. 26, no. 1, pp. 9–15, 2016. tions,’’ ACM Trans. Graph., vol. 21, no. 4, pp. 807–832, 2002.
[62] W. Rawat and Z. Wang, ‘‘Deep convolutional neural networks for image [87] Z. Liu, C. Zhao, X. Wu, and W. Chen, ‘‘An effective 3D shape descriptor
classification: A comprehensive review,’’ Neural Comput., vol. 29, no. 9, for object recognition with RGB-D sensors,’’ Sensors, vol. 17, no. 3,
pp. 2352–2449, 2017. p. 451, 2017.
[88] L. Bo, X. Ren, and D. Fox, ‘‘Kernel descriptors for visual recognition,’’ [113] J. Bai, Y. Wu, J. Zhang, and F. Chen, ‘‘Subset based deep learning for
in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2010, pp. 244–252. RGB-D object recognition,’’ Neurocomputing, vol. 165, pp. 280–292,
[89] S. Liu, S. Wang, L. Wu, and S. Jiang, ‘‘Multiple feature fusion based hand- Oct. 2015.
held object recognition with RGB-D data,’’ in Proc. Int. Conf. Internet [114] A. Wang, J. Cai, J. Lu, and T.-J. Cham, ‘‘MMSS: Multi-modal sharable
Multimedia Comput. Service (ICIMCS), 2014, p. 303. and specific feature learning for RGB-D object recognition,’’ in Proc.
[90] Y. Gu, J. Chanussot, X. Jia, and J. A. Benediktsson, ‘‘Multiple Kernel IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1125–1133.
learning for hyperspectral image classification: A review,’’ IEEE Trans. [115] K. Nguyen, C. Fookes, A. Ross, and S. Sridharan, ‘‘Iris recognition with
Geosci. Remote Sens., vol. 55, no. 11, pp. 6547–6565, Nov. 2017. off-the-shelf CNN features: A deep learning perspective,’’ IEEE Access,
[91] S. S. Bucak, R. Jin, and A. K. Jain, ‘‘Multiple kernel learning for visual vol. 6, pp. 18848–18855, 2018.
object recognition: A review,’’ IEEE Trans. Pattern Anal. Mach. Intell., [116] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, ‘‘CNN features
vol. 36, no. 7, pp. 1354–1369, Jul. 2014. off-the-shelf: An astounding baseline for recognition,’’ in Proc. IEEE
[92] L. Bo, K. Lai, X. Ren, and D. Fox, ‘‘Object recognition with hierarchical Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014, pp. 806–813.
kernel descriptors,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. [117] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, ‘‘How transferable are
(CVPR), Jun. 2011, pp. 1729–1736. features in deep neural networks?’’ in Proc. Adv. Neural Inf. Process. Syst.
[93] R. Rostami, F. S. Bashiri, B. Rostami, and Z. Yu, ‘‘A survey on data- (NIPS), 2014, pp. 3320–3328.
driven 3D shape descriptors,’’ Comput. Graph. Forum, vol. 38, no. 1, [118] F. Radenović, G. Tolias, and O. Chum, ‘‘Fine-tuning CNN image retrieval
pp. 356–393, 2018. with no human annotation,’’ IEEE Trans. Pattern Anal. Mach. Intell., to
[94] I. K. Kazmi, L. You, and J. J. Zhang, ‘‘A survey of 2D and 3D shape be published.
descriptors,’’ in Proc. 10th Int. Conf. Comput. Graph., Imag. Vis. (CGIV), [119] M. Schwarz, H. Schulz, and S. Behnke, ‘‘RGB-D object recognition
2013, pp. 1–10. and pose estimation based on pre-trained convolutional neural network
[95] K.-T. Yu, S.-H. Tseng, and L.-C. Fu, ‘‘Learning hierarchical representa- features,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2015,
tion with sparsity for RGB-D object recognition,’’ in Proc. IEEE/RSJ Int. pp. 1329–1335.
Conf. Intell. Robots Syst. (IROS), Oct. 2012, pp. 3011–3016. [120] H. F. Zaki, F. Shafait, and A. Mian, ‘‘Convolutional hypercube pyramid
[96] R. B. Rusu, N. Blodow, and M. Beetz, ‘‘Fast point feature histograms for accurate RGB-D object category and instance recognition,’’ in Proc.
(FPFH) for 3D registration,’’ in Proc. IEEE Int. Conf. Robot. Autom. IEEE Int. Conf. Robot. Autom. (ICRA), May 2016, pp. 1685–1692.
(ICRA), May 2009, pp. 3212–3217. [121] A. Aakerberg, K. Nasrollahi, and T. Heder, ‘‘Depth value pre-processing
[97] C. Redondo-Cabrera, R. J. López-Sastre, J. Acevedo-Rodriguez, and for accurate transfer learning based RGB-D object recognition,’’ in Proc.
S. Maldonado-Bascón, ‘‘SURFing the point clouds: Selective 3D spatial Int. Joint Conf. Comput. Intell. (IJCCI), 2017, pp. 121–128.
pyramids for category-level object recognition,’’ in Proc. IEEE Conf. [122] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, ‘‘Learning rich features
Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 3458–3465. from RGB-D images for object detection and segmentation,’’ in Proc.
[98] J. Knopp, M. Prasad, G. Willems, R. Timofte, and L. Van Gool, ‘‘Hough Eur. Conf. Comput. Vis. (ECCV). Zurich, Switzerland: Springer, 2014,
transform and 3D surf for robust three dimensional classification,’’ in pp. 345–360.
Proc. Eur. Conf. Comput. Vis. (ECCV). Heraklion, Crete: Springer, 2010,
[123] M. M. Rahman, Y. Tan, J. Xue, and K. Lu, ‘‘RGB-D object recognition
pp. 589–602.
with multimodal deep convolutional neural networks,’’ in Proc. IEEE Int.
[99] K. B. Logoglu, S. Kalkan, and A. Temizel, ‘‘CoSPAIR: Colored his-
Conf. Multimedia Expo (ICME), Jul. 2017, pp. 991–996.
tograms of spatial concentric surflet-pairs for 3D object recognition,’’
[124] A. Aakerberg, K. Nasrollahi, and T. Heder, ‘‘Improving a deep learning
Robot. Auton. Syst., vol. 75, pp. 558–570, Jan. 2016.
based RGB-D object recognition model by ensemble learning,’’ in Proc.
[100] G. James, ‘‘Majority vote classifiers: Theory and applications,’’ Ph.D.
7th Int. Conf. Image Process. Theory, Tools Appl. (IPTA), 2017, pp. 1–6.
dissertation, Dept. Statist., Stanford Univ., Stanford, CA, USA, 1998.
[125] S. Iizuka, E. Simo-Serra, and H. Ishikawa, ‘‘Let there be color!: Joint end-
[101] M. Aharon, M. Elad, and A. Bruckstein, ‘‘SVD: An algorithm for design-
to-end learning of global and local image priors for automatic image col-
ing overcomplete dictionaries for sparse representation,’’ IEEE Trans.
orization with simultaneous classification,’’ ACM Trans. Graph., vol. 35,
Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.
no. 4, p. 110, 2016.
[102] L. Bo, X. Ren, and D. Fox, ‘‘Unsupervised feature learning for RGB-D
[126] G. Larsson, M. Maire, and G. Shakhnarovich, ‘‘Learning representations
based object recognition,’’ in Experimental Robotics. Cham, Switzerland:
for automatic colorization,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV).
Springer, 2013, pp. 387–402.
Amsterdam, The Netherlands: Springer, 2016, pp. 577–593.
[103] B. Chen, J. Yang, B. Jeon, and X. Zhang, ‘‘Kernel quaternion principal
component analysis and its application in RGB-D object recognition,’’ [127] Z. Cheng, Q. Yang, and B. Sheng, ‘‘Deep colorization,’’ in Proc. IEEE
Neurocomputing, vol. 266, pp. 293–303, Nov. 2017. Int. Conf. Comput. Vis. (ICCV), Jun. 2015, pp. 415–423.
[104] Y. Sun, S. Chen, and B. Yin, ‘‘Color face recognition based on quater- [128] F. M. Carlucci, P. Russo, and B. Caputo, ‘‘(DE)2 CO: Deep depth coloriza-
nion matrix representation,’’ Pattern Recognit. Lett., vol. 32, no. 4, tion,’’ IEEE Robot. Autom. Lett., vol. 3, no. 3, pp. 2386–2393, 2018.
pp. 597–605, 2011. [129] F. Zhou, Y. Hu, and X. Shen, ‘‘MSANet: Multimodal self-augmentation
[105] B. Chen, H. Shu, G. Coatrieux, G. Chen, X. Sun, and J. L. Coatrieux, and adversarial network for RGB-D object recognition,’’ Vis. Comput.,
‘‘Color image analysis by quaternion-type moments,’’ J. Math. Imag. Vis., pp. 1–12, May 2018.
vol. 51, no. 1, pp. 124–144, 2015. [130] Z. Wang et al. (2016). ‘‘Correlated and individual multi-modal
[106] S. Sun, N. An, X. Zhao, and M. Tan, ‘‘A PCA–CCA network for RGB- deep learning for RGB-D object recognition.’’ [Online]. Available:
D object recognition,’’ Int. J. Adv. Robot. Syst., vol. 15, no. 1, pp. 1–12, https://arxiv.org/abs/1604.01655
2018. [131] M. R. Loghmani, M. Planamente, B. Caputo, and M. Vincze. (2018).
[107] F. Perronnin and C. Dance, ‘‘Fisher kernels on visual vocabularies for ‘‘Recurrent convolutional fusion for RGB-D object recognition.’’
image categorization,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recog- [Online]. Available: https://arxiv.org/abs/1806.01673
nit. (CVPR), Jun. 2007, pp. 1–8. [132] J. Wang, J. Lu, W. Chen, and X. Wu, ‘‘Convolutional neural network for
[108] C. Li, A. Reiter, and G. D. Hager, ‘‘Beyond spatial pooling: Fine-grained 3D object recognition based on RGB-D dataset,’’ in Proc. Ind. Electron.
representation learning in multiple domains,’’ in Proc. IEEE Conf. Com- Appl. (ICIEA), 2015, pp. 34–39.
put. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 4913–4922. [133] H. F. Zaki, F. Shafait, and A. Mian, ‘‘Learning a deeply supervised
[109] Y. Cheng et al., ‘‘Query adaptive similarity measure for RGB-D object multi-modal RGB-D embedding for semantic scene and object category
recognition,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, recognition,’’ Robot. Auto. Syst., vol. 92, pp. 41–52, Jun. 2017.
pp. 145–153. [134] M. D. Zeiler and R. Fergus, ‘‘Visualizing and understanding convo-
[110] A. Coates, A. Ng, and H. Lee, ‘‘An analysis of single-layer networks lutional networks,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV). Zurich,
in unsupervised feature learning,’’ in Proc. 14th Int. Conf. Artif. Intell. Switzerland: Springer, 2014, pp. 818–833.
Statist. (AISTATS), Jun. 2011, pp. 215–223. [135] P. Agrawal, R. Girshick, and J. Malik, ‘‘Analyzing the performance of
[111] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep Learning, multilayer neural networks for object recognition,’’ in Proc. Eur. Conf.
vol. 1. Cambridge, MA, USA: MIT Press, 2016. Comput. Vis. (ECCV). Zurich, Switzerland: Springer, 2014, pp. 329–344.
[112] H. F. Zaki, F. Shafait, and A. Mian, ‘‘Localized deep extreme learning [136] Y. Jia et al., ‘‘Caffe: Convolutional architecture for fast feature
machines for efficient RGB-D object recognition,’’ in Proc. Int. Conf. embedding,’’ in Proc. 22nd ACM Int. Conf. Multimedia (ICM), 2014,
Digit. Image Comput., Techn. Appl. (DICTA), 2015, pp. 1–8. pp. 675–678.
[137] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, ‘‘Canonical correlation [161] Y. He and D. Zhou, ‘‘Self-training from labeled features for sentiment
analysis: An overview with application to learning methods,’’ Neural analysis,’’ Inf. Process. Manage., vol. 47, no. 4, pp. 606–616, 2011.
Comput., vol. 16, no. 12, pp. 2639–2664, 2004. [162] M.-F. Balcan, A. Blum, and K. Yang, ‘‘Co-training and expansion:
[138] N. M. Correa, T. Adali, Y.-O. Li, and V. D. Calhoun, ‘‘Canonical correla- Towards bridging theory and practice,’’ in Proc. Adv. Neural Inf. Process.
tion analysis for data fusion and group inferences,’’ IEEE Signal Process. Syst. (NIPS), 2005, pp. 89–96.
Mag., vol. 27, no. 4, pp. 39–50, Jun. 2010. [163] J. Wu, L. Li, and W. Y. Wang. (2018). ‘‘Reinforced co-training.’’ [Online].
[139] Y. Jiao, Y. Zhang, Y. Wang, B. Wang, J. Jin, and X. Wang, ‘‘A novel Available: https://arxiv.org/abs/1804.06035
multilayer correlation maximization model for improving cca-based fre- [164] H. Padhiyar and P. Rekh, ‘‘An improved expectation maximization based
quency recognition in SSVEP brain–computer interface,’’ Int. J. Neural semi-supervised email classification using Naive Bayes and k-nearest
Syst., vol. 28, no. 04, p. 1750039, 2018. neighbor,’’ Int. J. Comput. Appl., vol. 101, no. 6, pp. 7–11, 2014.
[140] Y. Zhang et al., ‘‘Correlated component analysis for enhancing the perfor- [165] B. Nigam, P. Ahirwal, S. Salve, and S. Vamney. (2011). ‘‘Document
mance of SSVEP-based brain-computer interface,’’ IEEE Trans. Neural classification using expectation maximization with semi supervised learn-
Syst. Rehabil. Eng., vol. 26, no. 5, pp. 948–956, May 2018. ing.’’ [Online]. Available: https://arxiv.org/abs/1112.2028
[141] Q. V. Le, A. Karpenko, J. Ngiam, and A. Y. Ng, ‘‘ICA with reconstruction [166] R. Sheikhpour, E. Sheikhpour, and M. A. Sarram, ‘‘Semi-supervised
cost for efficient overcomplete feature learning,’’ in Proc. Adv. Neural Inf. sparse feature selection via graph Laplacian based scatter matrix for
Process. Syst. (NIPS), 2011, pp. 1017–1025. regression problems,’’ Inf. Sci., vol. 468, pp. 14–28, Nov. 2018.
[142] L. Jin, Z. Li, X. Shu, S. Gao, and J. Tang, ‘‘Partially common-semantic [167] E. Faerman, F. Borutta, J. Busch, and M. Schubert. (2018). ‘‘Semi-
pursuit for RGB-D object recognition,’’ in Proc. 23rd ACM Int. Conf. supervised learning on graphs based on local label distributions.’’
Multimedia (ICM), 2015, pp. 959–962. [Online]. Available: https://arxiv.org/abs/1802.05563
[143] Y. Guo, Y. Liu, S. Lao, E. M. Bakker, L. Bai, and M. S. Lew, ‘‘Bag of [168] A. Blum and T. Mitchell, ‘‘Combining labeled and unlabeled data with
surrogate parts feature for visual recognition,’’ IEEE Trans. Multimedia, co-training,’’ in Proc. 11th Annu. Conf. Comput. Learn. Theory (COLT),
vol. 20, no. 6, pp. 1525–1536, Jun. 2018. 1998, pp. 92–100.
[144] L. Liu, C. Shen, and A. van den Hengel, ‘‘The treasure beneath con- [169] Y. Cheng, X. Zhao, K. Huang, and T. Tan, ‘‘Semi-supervised learning for
volutional layers: Cross-convolutional-layer pooling for image classifi- RGB-D object recognition,’’ in Proc. Int. Conf. Pattern Recognit. (ICPR),
cation,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Aug. 2014, pp. 2377–2382.
Jun. 2015, pp. 4749–4757.
[170] Y. Cheng, X. Zhao, K. Huang, and T. Tan, ‘‘Semi-supervised learning and
[145] N. Liu, L. Wan, Y. Zhang, T. Zhou, H. Huo, and T. Fang, ‘‘Exploiting feature evaluation for RGB-D object recognition,’’ Comput. Vis. Image
convolutional neural networks with deeply local description for remote Understand., vol. 139, pp. 149–160, Oct. 2015.
sensing image classification,’’ IEEE Access, vol. 6, pp. 11215–11228,
[171] S. Lazebnik, C. Schmid, and J. Ponce, ‘‘Beyond bags of features:
2018.
Spatial pyramid matching for recognizing natural scene categories,’’ in
[146] R. Socher, B. Huval, B. Bath, C. D. Manning, and A. Y. Ng,
Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2006,
‘‘Convolutional-recursive deep learning for 3D object classification,’’ in
pp. 2169–2178.
Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 656–664.
[172] Y. Yin and H. Li, ‘‘Multi-view CSPMPR-ELM feature learning and
[147] K. Jarrett, K. Kavukcuoglu, M. A. Ranzato, Y. LeCun, ‘‘What is the best
classifying for RGB-D object recognition,’’ in Cluster Comput.-J. Netw.
multi-stage architecture for object recognition?’’ in Proc. IEEE Int. Conf.
Softw. Tools Appl., 2018, pp. 1–11. doi: 10.1007/s10586-018-1695-0.
Comput. Vis. (ICCV), May 2009, pp. 2146–2153.
[173] Z.-H. Zhou, ‘‘A brief introduction to weakly supervised learning,’’ Nat.
[148] R. Socher, C. C. Lin, C. Manning, and A. Y. Ng, ‘‘Parsing natural scenes
Sci. Rev., vol. 5, no. 1, pp. 44–53, 2017.
and natural language with recursive neural networks,’’ in Proc. Int. Conf.
Mach. Learn. (ICML), 2011, pp. 129–136. [174] Z. Wu et al., ‘‘3D ShapeNets: A deep representation for volumetric
shapes,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR),
[149] Y. Yin, H. Li, and X. Wen, ‘‘Multi-model convolutional extreme learn-
Jun. 2015, pp. 1912–1920.
ing machine with kernel for RGB-D object recognition,’’ Proc. SPIE,
vol. 10605, Nov. 2017, Art. no. 106051Z. [175] C. E. Rasmussen, ‘‘Gaussian processes in machine learning,’’ in Summer
[150] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, ‘‘Extreme learning School on Machine Learning (Advanced Lectures on Machine Learning).
machine for regression and multiclass classification,’’ IEEE Trans. Syst., Berlin, Germany: Springer, 2004, pp. 63–71.
Man, Cybern. B, Cybern., vol. 42, no. 2, pp. 513–529, Apr. 2012. [176] M. Firman, ‘‘RGBD datasets: Past, present and future,’’ in Proc. IEEE
[151] Y. Zhang et al., ‘‘Multi-kernel extreme learning machine for EEG clas- Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 19–31.
sification in brain-computer interfaces,’’ Expert Syst. Appl., vol. 96, [177] Y. Guo, J. Zhang, M. Lu, J. Wan, and Y. Ma, ‘‘Benchmark datasets for 3D
pp. 302–310, Apr. 2018. computer vision,’’ in Proc. IEEE 9th Conf. Ind. Electron. Appl. (ICIEA),
[152] H. Liu, F. Li, X. Xu, and F. Sun, ‘‘Multi-modal local receptive field Jun. 2014, pp. 1846–1851.
extreme learning machine for object recognition,’’ Neurocomputing, [178] S. Song, S. P. Lichtenberg, and J. Xiao, ‘‘SUN RGB-D: A RGB-D
vol. 277, pp. 4–11, Feb. 2018. scene understanding benchmark suite,’’ in Proc. IEEE Conf. Comput. Vis.
[153] G.-B. Huang, Z. Bai, L. L. C. Kasun, and C. M. Vong, ‘‘Local receptive Pattern Recognit. (CVPR), Jun. 2015, pp. 567–576.
fields based extreme learning machine,’’ IEEE Comput. Intell. Mag., [179] N. Silberman and R. Fergus, ‘‘Indoor scene segmentation using a struc-
vol. 10, no. 2, pp. 18–29, May 2015. tured light sensor,’’ in Proc. IEEE Int. Conf. Comput. Vis. Workshops
[154] F. M. Carlucci, P. Russo, and B. Caputo, ‘‘A deep representation for depth (ICCV), Nov. 2011, pp. 601–608.
images from synthetic data,’’ in Proc. IEEE Int. Conf. Robot. Autom. [180] S. Gupta, P. Arbeláez, and J. Malik, ‘‘Perceptual organization and recog-
(ICRA), May/Jun. 2017, pp. 1362–1369. nition of indoor scenes from RGB-D images,’’ in Proc. IEEE Conf.
[155] F. Orabona, L. Jie, and B. Caputo, ‘‘Online-batch strongly convex multi Comput. Vis. Pattern Recognit. (CVPR), Jun. 2013, pp. 564–571.
kernel learning,’’ in Proc. Proc. IEEE Conf. Comput. Vis. Pattern Recog- [181] H. S. Koppula, A. Anand, T. Joachims, and A. Saxena, ‘‘Semantic label-
nit. (CVPR), Jun. 2010, pp. 787–794. ing of 3D point clouds for indoor scenes,’’ in Proc. Adv. Neural Inf.
[156] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. (2014). Process. Syst. (NIPS), 2011, pp. 244–252.
‘‘Return of the devil in the details: Delving deep into convolutional nets.’’ [182] X. Lv, S.-Q. Jiang, L. Herranz, and S. Wang, ‘‘RGB-D hand-held object
[Online]. Available: https://arxiv.org/abs/1405.3531 recognition based on heterogeneous feature fusion,’’ J. Comput. Sci.
[157] J. B. Tenenbaum and W. T. Freeman, ‘‘Separating style and content Technol., vol. 30, no. 2, pp. 340–352, 2015.
with bilinear models,’’ Neural Comput., vol. 12, no. 6, pp. 1247–1283, [183] X. Lv, X. Liu, X. Li, X. Li, S. Jiang, and Z. He, ‘‘Modality-specific and
Jun. 2000. hierarchical feature learning for RGB-D hand-held object recognition,’’
[158] C. G. M. Snoek, M. Worring, and A. W. Smeulders, ‘‘Early versus late Multimedia Tools Appl., vol. 76, no. 3, pp. 4273–4290, 2017.
fusion in semantic video analysis,’’ in Proc. 13th Annu. ACM Int. Conf. [184] X. Lv, S. Jiang, L. Herranz, and S. Wang, ‘‘Hand-object sense: A hand-
Multimedia (ICM), 2005, pp. 399–402. held object recognition system based on RGB-D information,’’ in Proc.
[159] X. Zhu, ‘‘Semi-supervised learning literature survey,’’ Comput. Sci., Univ. 23rd ACM Int. Conf. Multimedia (ICM), 2015, pp. 765–766.
Wisconsin-Madison, vol. 2, no. 3, p. 4, 2006. [185] A. Singh, J. Sha, K. S. Narayan, T. Achim, and P. Abbeel, ‘‘BigBIRD:
[160] D. Wu et al., ‘‘Self-training semi-supervised classification based on den- A large-scale 3D database of object instances,’’ in Proc. IEEE Int. Conf.
sity peaks of data,’’ Neurocomputing, vol. 275, pp. 180–191, Jan. 2018. Robot. Autom. (ICRA), May/Jun. 2014, pp. 509–516.
[186] C. Li, J. Bohren, E. Carlson, and G. D. Hager, ‘‘Hierarchical semantic [209] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. (2014).
parsing for object pose estimation in densely cluttered scenes,’’ in Proc. ‘‘Object detectors emerge in deep scene CNNs.’’ [Online]. Available:
IEEE Int. Conf. Robot. Autom. (ICRA), May 2016, pp. 5068–5075. https://arxiv.org/abs/1412.6856
[187] U. Asif, M. Bennamoun, and F. Sohel, ‘‘Discriminative feature learning [210] C. Szegedy et al. (2013). ‘‘Intriguing properties of neural networks.’’
for efficient RGB-D object recognition,’’ in Proc. IEEE/RSJ Int. Conf. [Online]. Available: https://arxiv.org/abs/1312.6199
Intell. Robots Syst. (IROS), Sep./Oct. 2015, pp. 272–279. [211] R. C. Fong and A. Vedaldi, ‘‘Interpretable explanations of black boxes by
[188] L. Fei-Fei, R. Fergus, and P. Perona, ‘‘One-shot learning of object meaningful perturbation,’’ in Proc. IEEE Int. Conf. Comput. Vis. (CVPR),
categories,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 4, Jun. 2017, pp. 3429–3437.
pp. 594–611, Apr. 2006. [212] H. Lakkaraju, E. Kamar, R. Caruana, and E. Horvitz, ‘‘Identifying
[189] Y. Fu, T. Xiang, Y.-G. Jiang, X. Xue, L. Sigal, and S. Gong, ‘‘Recent unknown unknowns in the open world: Representations and policies
advances in zero-shot recognition: Toward data-efficient understand- for guided exploration,’’ in Proc. 31st AAAI Conf. Artif. Intell., 2017,
ing of visual content,’’ IEEE Signal Process. Mag., vol. 35, no. 1, pp. 1–9.
pp. 112–125, Jan. 2018. [213] Q. Zhang, W. Wang, and S.-C. Zhu, ‘‘Examining CNN representations
[190] T. Long, X. Xu, F. Shen, L. Liu, N. Xie, and Y. Yang, ‘‘Zero-shot with respect to dataset bias,’’ in Proc. 32nd AAAI Conf. Artif. Intell., 2018,
learning via discriminative representation extraction,’’ Pattern Recognit. pp. 1–10.
Lett., vol. 109, pp. 27–34, Jul. 2018. [214] Q. Zhang, R. Cao, F. Shi, Y. N. Wu, and S.-C. Zhu, ‘‘Interpreting CNN
[191] P. Mettes and C. G. M. Snoek, ‘‘Spatial-aware object embeddings for knowledge via an explanatory graph,’’ in Proc. 32nd AAAI Conf. Artif.
zero-shot localization and classification of actions,’’ in Proc. IEEE Int. Intell., 2018, pp. 1–10.
Conf. Comput. Vis. (ICCV), Jul. 2017, pp. 4453–4462. [215] Q. Zhang, R. Cao, Y. N. Wu, and S.-C. Zhu, ‘‘Growing interpretable part
[192] L. Zhang et al., ‘‘BoMW: Bag of manifold words for one-shot learning graphs on ConvNets via multi-shot learning,’’ in Proc. 31st AAAI Conf.
gesture recognition from kinect,’’ IEEE Trans. Circuits Syst. Video Tech- Artif. Intell., 2017, pp. 1–9.
nol., vol. 28, no. 10, pp. 2562–2573, Oct. 2018. [216] Q. Zhang, Y. N. Wu, and S.-C. Zhu, ‘‘Interpretable convolutional neural
networks,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR),
[193] Y. Shen, P. Luo, J. Yan, X. Wang, and X. Tang, ‘‘FaceID-GAN: Learning
Jun. 2018, pp. 8827–8836.
a symmetry three-player GAN for identity-preserving face synthesis,’’
[217] T. Wu, W. Sun, X. Li, X. Song, and B. Li. (2017). ‘‘Towards inter-
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2018,
pretable R-CNN by unfolding latent structures.’’ [Online]. Available:
pp. 821–830.
https://arxiv.org/abs/1711.05226
[194] J. Choe, S. Park, K. Kim, J. H. Park, D. Kim, and H. Shim, ‘‘Face
generation for low-shot learning using generative adversarial networks,’’
in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1940–1948.
[195] Z. Yi, H. R. Zhang, P. Tan, and M. Gong, ‘‘DualGAN: Unsupervised
dual learning for image-to-image translation,’’ in Proc. IEEE Int. Conf.
Comput. Vis. (ICCV), Jun. 2017, pp. 2868–2876.
[196] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. (2016). MINGLIANG GAO received the Ph.D. degree
‘‘Generative adversarial text to image synthesis.’’ [Online]. Available: in communication and information systems from
https://arxiv.org/abs/1605.05396 Sichuan University, Chengdu, China, in 2013.
[197] L. Tran, X. Yin, and X. Liu, ‘‘Disentangled representation learning GAN From 2018 to 2019, he was a Visiting Scholar
for pose-invariant face recognition,’’ in Proc. IEEE Conf. Comput. Vis. with the School of Engineering, The University of
Pattern Recognit. (CVPR), Jul. 2017, pp. 1415–1424. British Columbia, Okanagan. He is currently an
[198] C. Couprie, C. Farabet, L. Najman, and Y. LeCun. (2013). ‘‘Indoor Associate Professor with the School of Electrical
semantic segmentation using depth information.’’ [Online]. Available: and Electronic Engineering, Shandong University
https://arxiv.org/abs/1301.3572
of Technology, Zibo, China. His main research
[199] S. Zia, B. Yüksel, D. Yüret, and Y. Yemez, ‘‘RGB-D object recognition
interests include image fusion and deep learning.
using deep convolutional neural networks,’’ in Proc. IEEE Int. Conf.
Comput. Vis. (ICCV), Oct. 2017, pp. 887–894.
[200] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser,
‘‘Semantic scene completion from a single depth image,’’ in Proc. IEEE
Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2017, pp. 190–198.
[201] D. Maturana and S. Scherer, ‘‘VoxNet: A 3D convolutional neural net-
work for real-time object recognition,’’ in Proc. IEEE/RSJ Int. Conf. JUN JIANG received the Ph.D. degree in com-
Intell. Robots Syst. (IROS), Sep./Oct. 2015, pp. 922–928. munication and information systems from Sichuan
[202] K. He and J. Sun, ‘‘Convolutional neural networks at constrained time University, Chengdu, China, in 2015. He is cur-
cost,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), rently an Associate Professor with the School of
Jun. 2015, pp. 5353–5360. Computer Science, Southwest Petroleum Univer-
[203] H. Li, R. Zhao, and X. Wang. (2014). ‘‘Highly efficient forward and sity, Chengdu. His main research interests include
backward propagation of convolutional neural networks for pixelwise machine vision and deep learning.
classification.’’ [Online]. Available: https://arxiv.org/abs/1412.4526
[204] K. Bong, S. Choi, C. Kim, D. Han, and H.-J. Yoo, ‘‘A low-power convo-
lutional neural network face recognition processor and a CIS integrated
with always-on face detector,’’ IEEE J. Solid-State Circuits, vol. 53, no. 1,
pp. 115–123, Jan. 2018.
[205] O. Krestinskaya, K. N. Salama, and A. P. James, ‘‘Learning in memristive
neural network architectures using analog backpropagation circuits,’’
IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 2, pp. 719–732,
Feb. 2019. GUOFENG ZOU received the Ph.D. degree in
[206] Q.-S. Zhang and S.-C. Zhu, ‘‘Visual interpretability for deep learn- pattern recognition and intelligent system from
ing: A survey,’’ Frontiers Inf. Technol. Electron. Eng., vol. 19, no. 1, the College of Automation, Harbin Engineering
pp. 27–39, 2018. University, Harbin, in 2013. He is currently a Lec-
[207] C. Olah, A. Mordvintsev, and L. Schubert, ‘‘Feature visualization,’’ Dis- turer with the College of Electrical and Electronic
till, vol. 2, no. 11, p. e7, 2017. Engineering, Shandong University of Technology,
[208] A. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski, ‘‘Plug Zibo, China. His current research interests include
& play generative networks: Conditional iterative generation of images pattern recognition, digital image processing and
in latent space,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. analysis, and machine learning.
(CVPR), Jun. 2017, pp. 4467–4477.
VIJAY JOHN received the M.S. degree in infor- ZHENG LIU (SM’02) received the Ph.D. degree in
mation technology and robotics from Carnegie engineering from Kyoto University, Kyoto, Japan,
Mellon University, Pittsburgh, PA, USA, in 2005, in 2000, and the Ph.D. degree from the University
and the Ph.D. degree in computer vision from of Ottawa, Canada, in 2007. From 2000 to 2001,
the University of Dundee, Dundee, U.K., in 2007. he was a Research Fellow with Nanyang Techno-
In 2011, he joined the University of Amster- logical University, Singapore. Then, he joined the
dam, Amsterdam, The Netherlands, as a Post- Institute for Aerospace Research (IAR), National
doctoral Researcher. In 2012, he joined Philips Research Council Canada, Ottawa, ON, Canada,
Research, Eindhoven, The Netherlands, as a Vis- as a Governmental Laboratory Visiting Fellow
iting Researcher. Since 2014, he has been a Post- nominated by NSERC, for five years. He was a
doctoral Research Fellow with the Toyota Technological Institute, Nagoya, Research Officer with the NRC Institute for Research in Construction.
Japan. His research interests include computer vision, machine learning with From 2012 to 2015, he was a Full Professor with the Toyota Technological
applications in autonomous vehicles, and human motion analysis. Institute, Nagoya, Japan. He is currently with the School of Engineering, The
University of British Columbia, Okanagan. His research interests include
image/data fusion, computer vision, pattern recognition, senor/sensor net-
works, condition-based maintenance, and non-destructive inspection and
evaluation. He is a member of SPIE. He is the Chair of the IEEE IMS Tech-
nical Committee on Industrial Inspection (TC-36). He holds a Professional
Engineer License in British Columbia and Ontario. He serves on the editorial
board for the IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT,
IEEE Instrumentation and Measurement Magazine, the IEEE JOURNAL OF
RFID, Information Fusion, Machine Vision and Applications, and Intelligent
Industrial Systems.