
Deep Learning Approach for Object Detection Using CNNs

Scholar, Department of Computer Science

Abstract: Deep learning now plays a major role in AI-related tasks that were previously addressed with shallow learning techniques. Deep learning lends itself well to training and delivers better performance than shallow learning approaches. As a result, computers are now able to drive deeper, wider and more powerful models. State-of-the-art CNNs have achieved human-like performance in several recognition tasks, such as handwritten character recognition, face recognition, scene labelling, object detection and image classification, among others. Meanwhile, mobile devices have become powerful enough to handle the computations required to deploy CNN models in near real-time. Here we use CNNs to handle the task of object detection.

Introduction: In this case study, we address object detection on mobile devices (tablets, smartphones, ...) for domain-specific case studies. We aim to develop interactive interfaces on mobile platforms based on object recognition. As a case study, we have developed a collaboration with a museum (”Musée National de la Marine de Brest”). The general objective is to automatically trigger or suggest interactive content (i.e., 3D models, video, audio, etc.) on a mobile device. The detection of an object of interest from the camera of the mobile device is a key step. Here, detection comprises both the identification and localisation of an object in an image, while classification consists in predicting which elements of an image belong to a certain class. From a methodological point of view, as briefly reviewed below, we investigate deep learning models, more precisely Fully Convolutional Neural Networks (FCNN), and their deployment on mobile devices for domain-specific real-world applications.

During the last decade, Machine Learning (ML) techniques have been commonly employed to address such tasks, making the choice of visual features a crucial factor. With the discovery of the Scale Invariant Feature Transform (SIFT) [1], multiple opportunities for vocabulary learning techniques have been successfully developed, including for instance Bag of Features (BoF) [2] and Improved Fisher Vector Encoding (IFV) [3], among others. These techniques are simple yet effective and can be summarised in well-defined steps: dense sampling of local descriptors, encoding into a high-dimensional representation and finally pooling to create a single descriptor per image [4]. Despite their simplicity, these methods are hand-crafted and require a certain amount of engineering. These techniques are known as shallow, where learning is done only at mid-level by training classifiers such as Support Vector Machines (SVM), Random Forests or Naive Bayes classifiers. Deep learning models, and CNNs in particular, have become in the last few years the state of the art for a variety of large-scale pattern recognition problems [5]. CNNs are regarded as deep architectures as they involve a hierarchy of layers, such that the outputs of a layer are connected to the next layer’s inputs. The exploitation of a large number of layers, for instance up to 22 for the GoogLeNet model [6], has led to very significant gains in visual recognition tasks [7] compared to shallow strategies. The use of such models for domain-specific (and small-scale) case studies is an active topic, as deep architectures typically require large-scale datasets for their learning.

CONVOLUTIONAL NEURAL NETWORKS: Convolutional Neural Networks (CNNs) refer to a family of statistical learning models which use the convolution operator as a basis to abstract, encapsulate and learn information [5]. CNNs are used to estimate or approximate functions that can depend on a large number of inputs. Compared to original neural networks, deep CNNs involve a large number of layers (typically varying from 8 to 22 [7], [6]). Whereas the first layers are composed of convolution, normalisation, pooling and activation (nonlinear function) layers, the top layers generally involve fully-connected layers. For classification purposes, the last layer is typically a SoftMax function that acts as a classifier. We briefly review below the different layers involved in CNNs as well as learning strategies for CNNs. For an in-depth description of CNNs, we refer the reader to [5].
A. CNN Layers
A layer of a CNN is composed of nodes (or neurons) connected to nodes of the previous layers, such that the
output at a given node for layer L is a function of outputs of nodes in layer L − 1.
CNNs involve four main types of layers:
• Convolution layers: Convolution layers are characterised by their weights (filter values). There are multiple convolutions per layer with a fixed size, and each kernel is applied over the entire image with a fixed step (stride). The first convolution layers learn low-level features such as edges, lines and corners. Subsequent layers learn more complex representations (e.g., parts and models). The deeper the network, the higher-level the learnt features.
• Pooling layers: Pooling layers perform a nonlinear down-sampling.
• Activation layers: Activation functions mimic the behaviour of the neuron’s axon, which fires a signal when a specific stimulus is presented. Some of the most common activation functions are the Hyperbolic Tangent, the Sigmoid and the Rectified Linear Unit (ReLU), among others. ReLU has emerged as a key feature of CNNs. It is defined as f(x) = max(0, x).
• Fully-connected layers: a fully-connected (FC) layer differs from the layers mentioned above in that all outputs of the previous layer are connected to all inputs of the FC layer. These layers can be mathematically represented by inner products. A minimal sketch of these four types of operations is given after this list.
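
The following is an illustrative NumPy sketch of the four layer types, not the paper's implementation: a single-channel 2-D convolution, a max-pooling step, the ReLU activation and a fully-connected layer written as an inner product. All shapes and filter values are arbitrary toy choices.

import numpy as np

def conv2d(x, kernel, stride=1):
    # Valid convolution of a 2-D input with a single 2-D kernel (the layer's weights).
    kh, kw = kernel.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

def max_pool(x, size=2):
    # Nonlinear down-sampling: keep the maximum of each non-overlapping size x size block.
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

def relu(x):
    # Rectified Linear Unit: f(x) = max(0, x).
    return np.maximum(0.0, x)

def fully_connected(x, weights, bias):
    # Inner product of the flattened input with the FC layer's weight matrix.
    return weights @ x.ravel() + bias

image = np.random.rand(16, 16)             # toy single-channel input
kernel = np.random.randn(3, 3)             # one learned filter
features = relu(max_pool(conv2d(image, kernel, stride=1)))
scores = fully_connected(features, np.random.randn(10, features.size), np.zeros(10))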

B. Learning Convolutional Neural Networks


The learning stage for CNNs amounts to estimating the weights of the different layers. In a supervised setting, it relies on the backpropagation technique, which provides a gradient-based algorithm for some predefined cost function, in our case the misclassification rate. Backpropagation exploits the structure of the CNN to compute the gradient of the cost function with respect to the CNN weights, as an error-propagation process from the final layer of the CNN back to the input layer. The learning stage then iterates between a feed-forward pass, which computes the filter responses, pooling and nonlinear activations, and the backpropagation of the cost error. The backpropagation technique benefits from activation functions that are smooth and differentiable. Dropout layers are in charge of randomly neglecting outputs of hidden or visible units. Generally, a unit has a probability p = 0.5 of being kept. By setting its output to zero, a dropped neuron contributes neither to the forward pass nor to the backpropagation of the error function [13]. These layers remain active only during the learning stage and are ignored at inference time.
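
A minimal sketch of such a dropout layer is given below; it is not the authors' code, and it uses the "inverted dropout" convention (activations are rescaled at training time so the layer can simply be skipped at inference time), which is one common implementation choice.

import numpy as np

def dropout(activations, keep_prob=0.5, training=True):
    if not training:
        return activations                        # layer is ignored at inference time
    mask = np.random.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob         # zeroed units contribute nothing to either pass

hidden = np.random.rand(4, 8)                     # toy hidden-layer activations
dropped = dropout(hidden, keep_prob=0.5, training=True)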

NETWORK ARCHITECTURES: Different architectures have been proposed recently. Here, we focus on three state-of-the-art architectures, namely the AlexNet network [7] and two fully convolutional networks called Network In Network (NIN) [12] and GoogLeNet [6].

A. AlexNet: A widely used CNN architecture is the award-winning AlexNet network presented in [7]. It has been selected as a starting point for multiple applications, for instance [14] and [15]. The network consists of a combination of 5 convolutional and 3 fully-connected layers. The final FC layer is connected to a soft-max classifier which produces a 1000-class distribution output. Input colour images are size-normalised to a square shape of 256x256 pixels. In the first convolutional layer, 96 filters of size 11x11x3 are applied to the input image with a stride of 4 pixels. The output of this layer is then normalised and max-pooled. Outputs from this layer are fed to 256 kernels of size 5x5x48, then normalised and pooled again. The third, fourth and fifth layers are connected without any normalisation or pooling layers. The third convolutional layer has 384 kernels of size 3x3x256 connected to the outputs of the second layer. The fourth convolutional layer has 384 kernels of size 3x3x192, and the fifth has 256 kernels of size 3x3x192. The fully-connected layers have 4096 neurons each. A SoftMax layer is connected on top to obtain probability estimates for classification purposes. The network maximises the multinomial logistic regression objective, which is equivalent to maximising the average across training cases of the log-probability of the correct label under the prediction distribution [7]. This network has been trained on the ImageNet challenge database, which contains about 1.3 million images [16]. The accuracy obtained by the model for the top-one prediction is 57.4%. Over 60 million parameters have to be estimated in this network. Despite the significant number of labels to predict (1000) and the very large amount of training data, the model is prone to overfitting. To avoid this, dropout strategies and data augmentation techniques are applied: left-to-right mirroring, top-to-bottom reflections, and 224x224 image crops from the centre and the corners of the images.
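
The cropping and mirroring augmentation can be sketched as follows. This is an illustrative NumPy version under the assumption of a 256x256 input and 224x224 crops from the four corners and the centre, each also mirrored left to right; it is not the authors' preprocessing code.

import numpy as np

def augment(image, crop=224):
    h, w = image.shape[:2]                                   # expects a 256x256(x3) input
    offsets = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
               ((h - crop) // 2, (w - crop) // 2)]           # four corners plus the centre
    crops = [image[y:y + crop, x:x + crop] for (y, x) in offsets]
    crops += [np.fliplr(c) for c in crops]                   # add the mirrored versions
    return crops                                             # 10 augmented views per image

views = augment(np.random.rand(256, 256, 3))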
B. Network In Network: Network In Network is an FCNN that uses micro neural networks with complex structures to abstract the data within the receptive field [12]. In this architecture no fully-connected layers are used, which results in a considerable reduction of the number of parameters to estimate. Overall, this network involves about 8 million parameters. A layer that mimics a Multilayer Perceptron (MLP) is employed in this network. An MLP layer is regarded as a ”micro network” and is approximated by stacking two convolution layers with a 1x1 convolution kernel. These short convolutions reproduce the effect of a Cross Channel Parametric (CCP) pooling layer, which outputs a weighted linear recombination of the input feature maps [12]. The final architecture of this network is obtained by stacking three MLP layers. Instead of using FC + SoftMax layers for classification, the network obtains class confidence scores by globally averaging the output of the final MLP layer. The convolution layers respectively contain 96, 256 and 384 kernels of sizes 11x11, 5x5 and 3x3, while the pairs of CCP layers are coupled to have the same number of filters but with a fixed size of 1x1. Model training also involves an aggressive dropout strategy to avoid overfitting. When trained on the ImageNet dataset, the network obtains a 59.36% top-one prediction accuracy.
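
The two NIN ingredients described above can be sketched in NumPy as follows (not the paper's implementation): a 1x1 CCP convolution, which is a weighted linear recombination of the input feature maps at every spatial position, and the global average pooling that replaces FC + SoftMax for classification. The 50-class output and the 6x6 map size are illustrative assumptions.

import numpy as np

def ccp_1x1(feature_maps, weights):
    # feature_maps: (C_in, H, W); weights: (C_out, C_in) -> output (C_out, H, W).
    c_in, h, w = feature_maps.shape
    return (weights @ feature_maps.reshape(c_in, -1)).reshape(-1, h, w)

def global_average_pool(feature_maps):
    # One confidence score per class feature map.
    return feature_maps.mean(axis=(1, 2))

maps = np.random.rand(384, 6, 6)                         # output of a convolutional stage
class_maps = ccp_1x1(maps, np.random.randn(50, 384))     # one feature map per class
scores = global_average_pool(class_maps)
predicted_class = int(np.argmax(scores))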

C. GoogLeNet: This network is called Inception and is widely known as GoogLeNet [6]. It is inspired by the tight convolutions of size 1x1 introduced in [12]. The use of these short convolutions is twofold: 1) reduction of dimensionality and 2) removal of computational bottlenecks. The network decomposes convolutional filters with a wide receptive field into groups of parallel small convolutions (1x1, 3x3 and 5x5) and pooling layers. The outputs of these parallel blocks are then concatenated. Each arrangement of this kind is called an inception module, and the final architecture is composed by stacking multiple modules. The final output is obtained by using average pooling and an extra linear layer. Overall, GoogLeNet involves about 7 million parameters. Dropout remains a key factor for regularisation, and additional intermediate outputs are used for classification. It was shown to reach state-of-the-art performance on the ImageNet dataset [6].
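
The wiring of an inception module can be sketched as follows. Only the structure is illustrated: the parallel branches are stubbed out with random outputs of plausible channel counts (an assumption of ours), and only the channel-wise concatenation of their outputs is real.

import numpy as np

def inception_module(x, branches):
    # x: (C, H, W); branches: callables mapping (C, H, W) -> (C_b, H, W) with the same H, W.
    outputs = [branch(x) for branch in branches]       # 1x1, 3x3, 5x5 convolutions, pooling, ...
    return np.concatenate(outputs, axis=0)             # stack the feature maps channel-wise

x = np.random.rand(192, 28, 28)
branches = [lambda t: np.random.rand(64, 28, 28),      # stand-in for a 1x1 convolution branch
            lambda t: np.random.rand(128, 28, 28),     # stand-in for a 3x3 convolution branch
            lambda t: np.random.rand(32, 28, 28),      # stand-in for a 5x5 convolution branch
            lambda t: np.random.rand(32, 28, 28)]      # stand-in for a pooling branch
merged = inception_module(x, branches)                 # shape (256, 28, 28)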

LEARNING TRANSFER: For real-world and domain-specific scenarios, it may not be possible to train a model from scratch, especially when dealing with relatively small training datasets. However, it has been shown that features learned across the layers of a deep architecture can be transferred to domain-specific applications [17]. Two main transfer scenarios can be investigated: the fine-tuning of a previously trained network and the use of a deep network as a feature extraction scheme. We review these two strategies below.

A. Fine-tuning of a pre-trained network: Fine-tuning [17] consists in using a previously trained network as the initialisation of the training step for a given task and training dataset. Compared to a purely random initialisation of the network weights, fine-tuning reduces the learning time, as it converges in fewer iterations, and makes it feasible to train a network from a relatively small dataset. We apply here a common fine-tuning strategy [17]. It exploits a pre-trained network as the initialisation for all layers except the top fully-connected layer, whose weights are randomly initialised. Besides, the learning rate of this layer is set higher than the learning rates of the other layers. It may be noticed that the top fully-connected layer may contain a number of outputs different from the original network, which allows the model to adjust to new classification tasks.
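
A minimal sketch of this recipe is given below, assuming plain gradient-descent updates: the pre-trained weights initialise every layer except the top FC layer, which is re-initialised randomly (here with 50 outputs matching the new task) and updated with a higher learning rate than the rest of the network. The layer names, shapes and the x10 multiplier are illustrative assumptions; this is not the authors' Caffe configuration.

import numpy as np

def fine_tune_step(layers, gradients, base_lr=0.001, top_lr_mult=10.0):
    # layers / gradients: dicts of weight arrays keyed by layer name.
    for name in layers:
        lr = base_lr * top_lr_mult if name == "fc_top" else base_lr
        layers[name] -= lr * gradients[name]
    return layers

pretrained = {"conv1": np.random.randn(96, 363),        # toy pre-trained weights
              "fc7": np.random.randn(512, 512)}
pretrained["fc_top"] = 0.01 * np.random.randn(50, 512)  # new randomly initialised head
grads = {k: np.random.randn(*v.shape) for k, v in pretrained.items()}
fine_tune_step(pretrained, grads, base_lr=0.001, top_lr_mult=10.0)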

B. CNNs as feature extractors: Another strategy for application to domain-specific classification tasks is to combine classical machine learning techniques, e.g. SVMs or random forests, with a feature space derived from a trained CNN [18]. The output of each layer of a CNN may be regarded as a description of the input image; the deeper in the hierarchy, the higher-level the associated image information. These outputs of the different layers of a CNN are referred to as CNN codes [19], which may provide a relevant feature space for classification tasks. The dimension of each code is the number of nodes of the associated layer. For instance, with the AlexNet CNN, if one considers the input of the fully-connected layers as a feature vector, one obtains a 4096-dimensional feature vector. Given the feature space defined from the selected CNN codes, any supervised machine learning model may be relevant. Here, we consider linear SVMs. CNN codes may be further compressed through a principal component analysis without a significant accuracy loss. Whereas the training of the CNN typically requires very large image databases, the training of linear SVMs is efficient for small image datasets, which makes them appealing for domain-specific applications.
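
This strategy can be sketched as follows with scikit-learn, a library choice of ours that the paper does not name: 4096-dimensional FC-layer activations are optionally compressed by PCA and then fed to a linear SVM. The function extract_cnn_codes is a hypothetical placeholder for whatever forward pass produces the FC activations, and the dataset sizes and labels are random stand-ins.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

def extract_cnn_codes(images):
    # Placeholder: returns one 4096-D CNN code per image (random here).
    return np.random.rand(len(images), 4096)

train_images = list(range(2250))                         # e.g. 45 images x 50 classes
train_labels = np.random.randint(0, 50, len(train_images))
codes = extract_cnn_codes(train_images)

pca = PCA(n_components=1024).fit(codes)                  # optional compression of the codes
svm = LinearSVC(C=1.0).fit(pca.transform(codes), train_labels)

test_codes = extract_cnn_codes(list(range(10)))
predictions = svm.predict(pca.transform(test_codes))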
EXPERIMENTAL FRAMEWORK: In this study, we report an experimental evaluation of CNN-based strategies for classification applications on mobile platforms (i.e., smartphones and tablets), with a specific emphasis on domain-specific case studies. The reported experiments compare the considered CNN models against the shallow techniques mentioned in Section I. In addition to classification performance, we also evaluate the computational complexity of the considered models for three types of implementation: CPU, GPU and mobile devices. All experiments for CNNs have been conducted using the CAFFE library [20] and the associated C++ and Python interfaces. A custom Objective-C version of the library was used to deploy models on mobile devices [21]. For the shallow methods, we used the Matlab interface of the VLFeat library [22].

A. Image databases:
We consider two different datasets:

The Caltech-101 database [11] contains 101 object categories, with roughly 50 images per category. While this benchmark dataset is large enough to train shallow classification algorithms, it is too small for the direct application of deep learning strategies.

The Marine Museum database. This database, described below, is part of a collaborative project with the Marine Museum in Brest for the development of new interactive services on mobile devices for visitors. This database provides a real-world case study for the considered models.

The ”Marine Museum” database involves 50 different classes (including a background class) corresponding to a variety of objects (e.g., statues, mock-ups, small objects) and materials (e.g., wooden and stone objects). For each object, the database contains at least 90 images taken under different viewpoints and lighting conditions. Different cameras were also used to acquire images of different qualities. Representative examples of the database are reported. The database is randomly split into training, testing and validation subsets, which respectively contain 45, 25 and 20 images of each object. All images in the database undergo a preprocessing step, which includes mean subtraction and size normalisation. Besides, the data augmentation techniques (mirroring and cropping) mentioned in Section III are applied to the training and validation subsets.

B. Shallow Methods: As baseline approaches, we implement and compare different shallow models in our experiments. These models exploit dense SIFT features extracted on a regular grid with a step of 4 pixels, at 7 scales with a scaling factor of √2, and with bins 8 pixels wide. As encodings, we consider two classical methods, namely Bag of Visual Words (BOVW) [2] and Improved Fisher Vectors (IFV) [3]. BOVW uses 4096 vector-quantised visual words.
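
The BOVW encoding step can be sketched as follows. This is an illustrative NumPy version, not the VLFeat pipeline used in the experiments: each local descriptor is assigned to its nearest visual word and the image is represented by the normalised histogram of word counts. The random descriptors and random vocabulary are stand-ins; in practice the vocabulary would be learned (e.g. by k-means) over training descriptors.

import numpy as np

def bovw_histogram(descriptors, vocabulary):
    # descriptors: (N, D) local features; vocabulary: (K, D) visual words.
    d2 = ((descriptors ** 2).sum(axis=1)[:, None]
          + (vocabulary ** 2).sum(axis=1)[None, :]
          - 2.0 * descriptors @ vocabulary.T)            # squared distances, shape (N, K)
    words = d2.argmin(axis=1)                            # nearest visual word per descriptor
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)                   # L1-normalised image descriptor

sift_like = np.random.rand(500, 128)                     # stand-in for dense SIFT descriptors
vocab = np.random.rand(4096, 128)                        # vocabulary of 4096 visual words
image_descriptor = bovw_histogram(sift_like, vocab)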

C. Deep Models:
We investigate three categories of deep models:

Fully trained deep models: Deep models with the architectures described in Sections III-A and III-B are first fully trained on each database. Optimisation is carried out by batch Stochastic Gradient Descent (BSGD). The parameters are set as follows: momentum m = 0.9, weight decay of 0.0005, starting with a high learning rate lr = 0.01 and a step learning-rate update policy. Optimisation is performed across 20 epochs. The resulting trained models are hereafter referred to as AlexNet and NIN.
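
The corresponding update rule can be sketched as follows with the quoted hyper-parameters (momentum 0.9, weight decay 0.0005, base learning rate 0.01). This is a generic momentum SGD step, not the authors' solver configuration; the step size and decay factor of the "step" policy, and the toy weight shapes, are illustrative assumptions.

import numpy as np

def sgd_step(weights, grad, velocity, lr, momentum=0.9, weight_decay=0.0005):
    # Momentum update with L2 weight decay folded into the gradient.
    velocity = momentum * velocity - lr * (grad + weight_decay * weights)
    return weights + velocity, velocity

def step_lr(base_lr, iteration, step_size=100, gamma=0.1):
    # "step" learning-rate policy: scale the rate down by gamma every step_size iterations.
    return base_lr * (gamma ** (iteration // step_size))

w = np.random.randn(64, 256)                     # toy weight matrix standing in for one layer
v = np.zeros_like(w)
for it in range(300):                            # toy loop standing in for 20 epochs of batches
    grad = 1e-3 * np.random.randn(*w.shape)      # placeholder for a backpropagated gradient
    w, v = sgd_step(w, grad, v, lr=step_lr(0.01, it))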

Finely-tuned deep models: as mentioned in Section IV, we use pre-trained models as initialisation to obtain a more discriminative model through a fine-tuning specific to each case-study database. Our fine-tuning strategy consists in learning the last two FC layers in the case of AlexNet, the last two CCP layers for the NIN model, and the FC layers of the multiple outputs in GoogLeNet. Learning rates in these final layers are set ten times higher than in the rest of the network. The optimisation is carried out over fewer iterations than in the full training since faster convergence is expected. The learning rate (BSGD step) is set to 0.001, which is ten times smaller than the learning rate used for full training. The resulting deep models are referred to as AlexNet+F, NIN+F, and GoogLeNet+F. For the GoogLeNet model, we report the output of the deepest classification layer; intermediate output layers are omitted, as we noticed that they lead to similar classification performance.
Deep models as feature extraction schemes: we explore the use of AlexNet as a feature extraction scheme. We consider input features from layers FC5, FC6 and FC7. These features are used to train a multiclass linear SVM model with a fixed parameter C = 1.0. From cross-validation experiments, the best classification performances were obtained with features extracted from layer FC7. We only report these results, referred to as AlexNet FC. A combination with Principal Component Analysis (PCA) dimension reduction, keeping the first 1024 principal components, is also investigated and referred to as AlexNet+PCA.

EXPERIMENTAL RESULTS:
A. Classification performance: We report the synthesis of the classification performance of the considered models in Table I. For the Caltech-101 database, classification performances range from 43% to 87%. Whereas the baseline approaches, BOVW and IFV encodings combined with linear SVMs, reach 73% correct classification, fully trained deep models behave poorly due to the limited size of the training dataset, both for the AlexNet and NIN architectures (below 50%). The best results are obtained for the fine-tuned models (respectively, 87.63% correct classification for AlexNet and 87.22% for NIN). The gain appears significant compared to the use of AlexNet as a feature extraction scheme. Similar conclusions can be drawn for the ”Marine Museum” database. Whereas fully trained models reach up to 86% correct classification, the fine-tuned version of the NIN model reaches more than 98% correct classification. The significantly higher correct classification rates reported for fully trained models compared to the Caltech-101 case study may relate to the greater visual separation between object classes in the ”Marine Museum” database.

Table 1. Recognition performance for the different databases

By contrast, the GoogLeNet and NIN models are associated with a significantly lower memory requirement than AlexNet (Table II). However, large differences exist in terms of computational time. Whereas GoogLeNet leads to a processing time close to 1 s per frame, the NIN model requires less than 300 ms per frame. These differences may be explained by the large number of convolution operations involved in GoogLeNet; such operations are not optimised for a CPU-only implementation. Overall, NIN appears as the best choice for the considered domain-specific real-world application, with: 1) a smaller memory footprint, 2) slightly better accuracy results in our application, 3) a lower computational time, and 4) potential use for localisation as described in Section VI-B.

Table 2. Average computational time and memory requirements for the different CNN architectures

B. Object Localisation Task: In this section, we report object localisation experiments using NIN. FCNNs retain a strong spatial input-output relationship. Based on this property, we propose to roughly localise objects using the feature maps (FMs) produced by the final MLP layer. The class assigned to the image directly depends on the output of the average pooling, which forces the network to learn correspondences between feature maps and categories [12]. Thus, the class index attributed to the image determines which FM is used to perform the detection. In this way, localisation comes free of computational burden because the FMs are already built at inference time. Feature maps are considerably smaller than the input image, because of the multiple nonlinear down-samplings performed by the pooling operations; consequently, we can only retrieve a rough estimate of the object's position. We first normalise the selected FM, then up-scale it by bilinear interpolation to match the size of the original image, and finally apply a threshold th = mean(FM) to segment objects.
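
This localisation step can be sketched as follows, assuming a SciPy bilinear up-scaling and a 6x6 feature map (both illustrative assumptions, not the paper's code): the feature map of the predicted class is normalised, up-scaled to the image size and thresholded at its mean value.

import numpy as np
from scipy.ndimage import zoom

def localise(feature_map, image_shape):
    fm = feature_map.astype(float)
    fm = (fm - fm.min()) / (fm.max() - fm.min() + 1e-8)                # normalise the selected FM
    factors = (image_shape[0] / fm.shape[0], image_shape[1] / fm.shape[1])
    upscaled = zoom(fm, factors, order=1)                              # bilinear up-scaling
    return upscaled > upscaled.mean()                                  # threshold th = mean(FM)

class_fm = np.random.rand(6, 6)          # feature map selected by the predicted class index
mask = localise(class_fm, image_shape=(224, 224))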

DISCUSSION: Deep learning has led to significant improvements in object recognition and image classification performance over shallow techniques for complex datasets such as ImageNet. Deep models typically rely on very large datasets to learn models with a very large number of parameters (up to 60 million parameters for AlexNet). This raises scientific questions for their application to real case studies, where the amount of training data remains relatively small. Similarly, the memory storage required by deep models may question their applicability on mobile devices. Within the context of interactive services on mobile devices in museums, we have investigated object recognition on mobile platforms using deep models. Following [4], we demonstrated that a highly accurate domain-specific object recognition pipeline can be run in near real-time on mobile devices.
Deep models, more specifically fine-tuning strategies, were shown to significantly outperform classical shallow strategies, with a computational time lower than 300 ms per frame for the NIN model. This model is superior in terms of memory storage, computational complexity and recognition performance (compared with AlexNet and GoogLeNet). The computational time and memory requirements of the other two models may be prohibitive for real-world applications or require additional optimisations. We also showed that object recognition can be complemented by object localisation at no additional computational cost from the considered model. Future work will further investigate the optimisation of implementation issues on mobile devices. Our current solution uses a CPU-only implementation, which achieves near real-time classification. Computational time could be further reduced by adopting a mobile GPU implementation. Ongoing research also includes the extension of the proposed model for object recognition on mobile devices from RGB-D images, as depth sensors will be embedded in the next generation of mobile devices.

REFERENCES
[1] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vision, vol. 60, no. 2, pp. 91–110, Nov. 2004.
[2] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray, “Visual categorization with bags of
keypoints,” in In Workshop on Statistical Learning in Computer Vision, ECCV, 2004, pp. 1–22.
[3] F. Perronnin, J. Sánchez, and T. Mensink, “Improving the fisher kernel for large-scale image classification,” in Proceedings of the 11th European Conference on Computer Vision: Part IV, ser. ECCV’10, 2010, pp. 143–156.
[4] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving
deep into convolutional nets,” in British Machine Vision Conference, 2014.
[5] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A.
Rabinovich, “Going deeper with convolutions,” CoRR, vol. abs/1409.4842, 2014.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
[8] S. Gammeter, A. Gassmann, L. Bossard, T. Quack, and L. V. Gool, “Server-side object recognition and
client-side object tracking for mobile augmented reality,” in IEEE Conference on Computer Vision and
Pattern Recognition, CVPR, Workshops 2010, 2010, pp. 1–8.
[9] S. S. Kumar, M. Sun, and S. Savarese, “Mobile object detection through client-server based vote
transfer,” in CVPR, 2012.
[10] L. Czúni, P. J. Kiss, Á. Lipovits, and M. Gál, “Lightweight mobile object recognition,” in 2014 IEEE
International Conference on Image Processing, ICIP, 2014, pp. 3426–3428.
[11] L. Fei-Fei, R. Fergus, and P. Perona, “One-shot learning of object categories,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 594–611, April 2006.
[12] M. Lin, Q. Chen, and S. Yan, “Network in network,” CoRR, vol. abs/1312.4400, 2013.
[13] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to
prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958,
2014.
[14] J. Zhang, S. Ma, M. Sameki, S. Sclaroff, M. Betke, Z. Lin, X. Shen, B. Price, and R. Mĕch, “Salient
object subitizing,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[15] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, “Learning rich features from RGB-D images for object
detection and segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV),
2014.
[16] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical
Image Database,” in CVPR09, 2009.
[17] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural
networks?” CoRR, vol. abs/1411.1792, 2014.
[18] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: an astounding
baseline for recognition,” CoRR, vol. abs/1403.6382, 2014.
[19] A. Babenko, A. Slesarev, A. Chigorin, and V. S. Lempitsky, “Neural codes for image retrieval,” CoRR,
vol. abs/1404.1777, 2014.
[20] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell,
“Caffe: Convolutional architecture for fast feature embedding,” pp. 675–678, 2014.
[21] A. Isaza, “Caffe for ios: A wrapper for the caffe library,” https://github.com/aleph7/caffe, 2015.
[22] A. Vedaldi and B. Fulkerson, “VLFeat: An open and portable library of computer vision algorithms,” http://www.vlfeat.org/, 2008.
[23] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman, “The devil is in the details: an evaluation of
recent feature encoding methods,” in British Machine Vision Conference, 2011.
