
Object Detection using Deep Learning

Research Area: Image Processing, Machine Learning

Abstract

In the field of autonomous robotics, path planning is an important element that enables a robot to
move from one point to a target destination along the shortest path. Using the shortest path avoids
undesirable turning and braking, which leads to less computation time and lower cost. Object detection
is important in robotics because, during path planning, mapping and task execution, a robot needs
information about the surrounding environment and the locations of objects. The working principle of
object recognition is to detect real-world objects in an input image of the surroundings, using prior
information about the object models stored in an object database. The ability of a robot to recognise a
known object from the object database from different points of view is helpful in search and rescue
activities, for locating the position of a victim in particular or of objects in general (Ekvall, Kragic and
Jensfelt, 2007).

Deep learning is a machine learning technique that trains an artificial neural network composed of
several convolutional layers to learn complex functions. By automating training across multiple deep
neural networks, deep learning improves performance and reduces training time. These networks vary
in how they structure and interpret the data processing and in the method of communication between
the convolutional layers (Shabbir and Anwer, 2018).

The aim of this research project is to develop deep learning models for object recognition. Various
modelling techniques in deep neural network design will be studied and applied, and their effectiveness
in improving the performance of deep neural networks for object detection will be evaluated. The focus
of this project is object detection for robot navigation. The outcome of this project will be a presentation
of deep learning architectures which may enable faster and more accurate object detection, which may
in turn support robust robot navigation.

Objectives

i) To investigate and understand deep neural network (DNN) designs

ii) To investigate the effectiveness of deep learning in object detection for robotic navigation

iii) To improve the performance of the deep learning architecture in object detection

Literature Review

Autonomous navigation in robotics implies that the robot is capable of extracting information,
such as corners and edges, from the real-world environment and then performing specific
responses according to the information obtained from its sensors. The robot therefore carries out
object detection from time to time to acquire information about the unknown external
environment. However, the object detection process often consumes a lot of time, as a large
amount of trial and error is needed to fine-tune the parameters used. Deep learning may help to
make the system more robust by speeding up the process and improving its performance.
The Region-based Convolutional Network (R-CNN) combines the selective search
method for region detection with deep learning for object identification. This model improves
computational efficiency compared with exhaustive search, as it starts from small regions and
later merges them, based on colour spaces and similarity metrics, into a final group that contains
the object detected in the entire image (Uijlings et al., 2013).
The R-CNN model and its SVM classifiers are trained on the ImageNet and PASCAL VOC
datasets. After features are extracted from the image and fed into the CNN, the SVM classifier detects
possible objects, and a linear regressor then adjusts the coordinates of the bounding box formed around
each detected object. The R-CNN model has been shown to achieve a 62.4% mean average precision
(mAP) score on the 2012 PASCAL VOC test dataset and a 31.4% mAP score on the 2013 ImageNet
dataset (Uijlings et al., 2013).
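
As a rough sketch of this pipeline, the Python fragment below pairs OpenCV's selective search implementation with a pre-trained torchvision backbone standing in for the trained CNN. The image path, the 100-proposal cap and the ResNet-18 choice are illustrative assumptions (and the weights argument assumes a recent torchvision); the per-class SVM and box-regression stages are only indicated in the final comment.

import cv2                               # requires opencv-contrib-python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Stage 1: selective search generates class-agnostic region proposals
img = cv2.imread("scene.jpg")            # placeholder image path
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()
proposals = ss.process()                 # array of (x, y, w, h) boxes

# Stage 2: a pre-trained CNN acts as the feature extractor (fc layer removed)
backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([T.ToTensor(),
                        T.Resize((224, 224)),   # warp each region to a fixed size
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

features = []
with torch.no_grad():
    for (x, y, w, h) in proposals[:100]:        # cap proposals for speed
        crop = cv2.cvtColor(img[y:y + h, x:x + w], cv2.COLOR_BGR2RGB)
        features.append(backbone(preprocess(crop).unsqueeze(0)))
# Each feature vector would then be scored by per-class SVMs, and the
# surviving boxes refined by a linear bounding-box regressor.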

The performance of the R-CNN model was then improved by the Fast Region-based Convolutional
Network (Fast R-CNN) developed by R. Girshick. Regions of Interest (RoIs) are used in this model to
reduce the size of the input to the fully connected layers by extracting the features of interest first. A
softmax classifier scores the feature vectors and a linear regressor refines the coordinates of the bounding
boxes. The Fast R-CNN model achieved a 68.4% mAP score on the 2012 PASCAL VOC test dataset
(Girshick, 2015).
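
The RoI pooling idea behind this speed-up can be illustrated with torchvision's ready-made operator. The feature map, box coordinates and the 1/16 scale below are made-up values chosen only to show how every region, whatever its size, is pooled to the same fixed grid before the fully connected layers.

import torch
from torchvision.ops import roi_pool

# One feature map from a conv backbone: batch 1, 256 channels, 50x50 grid
feature_map = torch.randn(1, 256, 50, 50)

# Two RoIs in image coordinates (x1, y1, x2, y2); list form = one tensor per image
rois = [torch.tensor([[ 40.,  40., 400., 300.],
                      [120.,  80., 640., 480.]])]

# spatial_scale maps image coordinates onto the downsampled feature map
# (here 1/16, as with a VGG-16 backbone); every RoI comes out 7x7
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)   # torch.Size([2, 256, 7, 7])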

The Region Proposal Network (RPN) was then introduced to substitute the selective search
method, reducing computation time and improving the performance of the R-CNN model. The Faster
Region-based Convolutional Network (Faster R-CNN) combines an RPN with the Fast R-CNN model.
The RPN speeds up the generation of region proposals from the input image and thus enhances the
training and testing of the Fast R-CNN model. The RPN is built on a model pre-trained for classification
on the ImageNet dataset and fine-tuned on the PASCAL VOC dataset (Ren et al., 2015).

The Faster R-CNN model achieved a 75.9% mAP score on the 2012 PASCAL VOC test dataset
and has been shown to be 34% faster than the Fast R-CNN model, which relies on the selective search
method. The performance comparison between the R-CNN, Fast R-CNN and Faster R-CNN models is
given in Table 1 below (Ren et al., 2015).
Table 1: Comparison of performance between the R-CNN, Fast R-CNN and Faster R-CNN models

Model           mAP score on 2012 PASCAL VOC test dataset (%)
R-CNN           62.4
Fast R-CNN      68.4
Faster R-CNN    75.9
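
For experimentation, a pre-trained Faster R-CNN is available directly in torchvision. The sketch below assumes a recent torchvision version (for the weights argument) and a placeholder image file, and simply thresholds the returned detection scores.

import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Pre-trained Faster R-CNN with a ResNet-50 FPN backbone (COCO weights)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = to_tensor(Image.open("scene.jpg").convert("RGB"))   # placeholder image
with torch.no_grad():
    prediction = model([img])[0]     # the model takes a list of images

# Keep detections the internal RPN + classifier scored above 0.8
keep = prediction["scores"] > 0.8
print(prediction["boxes"][keep], prediction["labels"][keep])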

Dai, Li, He and Sun (2016) applied deep learning in fully convolutional layers to introduce
the Region-based Fully Convolutional Network (R-FCN), which allows backward propagation
for training. The R-FCN model detects not only the object itself but also its position, using an
RPN to compute scores for each region. The last layer of the R-FCN model outputs
position-sensitive score maps to detect certain characteristics of objects based on the image
classification; for example, one score map specialises in detecting dogs, another in detecting
humans, and so on.
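
torchvision also exposes the position-sensitive pooling operator at the heart of R-FCN. The sketch below uses random score maps and made-up sizes (21 PASCAL VOC classes, a 3 x 3 position grid, 1/16 scale) purely to show how the per-position score maps are pooled into one score per class for each region.

import torch
from torchvision.ops import ps_roi_pool

num_classes = 21           # 20 PASCAL VOC classes + background
k = 3                      # each RoI is split into a k x k grid of position bins

# Position-sensitive score maps: k*k maps per class, one per relative position
score_maps = torch.randn(1, num_classes * k * k, 40, 40)

rois = [torch.tensor([[32., 32., 320., 320.]])]   # one RoI in image coordinates

# Each output cell is pooled only from the score map that "votes" for that
# relative position (top-left, centre, ...), giving [num_rois, classes, k, k]
scores = ps_roi_pool(score_maps, rois, output_size=(k, k), spatial_scale=1.0 / 16)
print(scores.shape)              # torch.Size([1, 21, 3, 3])
print(scores.mean(dim=(2, 3)))   # averaged position votes -> one score per class
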
Deep learning has also proven useful in real-time applications through the You Only Look Once
(YOLO) model. The YOLO model is simple in that it evaluates the input image, predicts bounding boxes
and estimates class probabilities with a single network. It works by dividing the input image into an S x S
grid; bounding boxes are predicted for each cell and classification is performed over the most confident
ones. The YOLO model is made up of 24 convolutional layers followed by 2 fully connected layers
(Redmon et al., 2016).
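
The grid-based prediction can be made concrete with a small decoding loop. The random tensor below stands in for the network output, and the 0.25 confidence threshold is an assumed value for illustration, not YOLO's published setting.

import torch

S, B, C = 7, 2, 20                     # grid size, boxes per cell, classes (YOLOv1)
pred = torch.rand(S, S, B * 5 + C)     # stand-in for the network output

boxes, scores, labels = [], [], []
for row in range(S):
    for col in range(S):
        cell = pred[row, col]
        class_probs = cell[B * 5:]                 # class distribution shared by the cell
        for b in range(B):
            x, y, w, h, conf = cell[b * 5: b * 5 + 5]
            # (x, y) are offsets inside the cell; (w, h) are relative to the image
            cx, cy = (col + x) / S, (row + y) / S
            score, cls = (conf * class_probs).max(dim=0)
            if score > 0.25:                       # assumed confidence threshold
                boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
                scores.append(score.item())
                labels.append(cls.item())
# The surviving boxes are then pruned with non-maximum suppression.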

The Fast YOLO model is a simpler version of the YOLO model, comprising only nine
convolutional layers and fewer filters. A downside of the YOLO model, however, is that too many
bounding boxes are predicted, so some of them contain no object. Therefore, the non-maximum
suppression (NMS) method is applied at the last layer of the network to merge highly overlapping
bounding boxes of the same object into one bounding box (Redmon et al., 2016).
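
torchvision provides NMS as a standalone operator, which makes this merging step easy to demonstrate; the boxes and scores below are toy values.

import torch
from torchvision.ops import nms

# Three candidate boxes (x1, y1, x2, y2); the first two overlap heavily
boxes = torch.tensor([[100., 100., 210., 210.],
                      [105., 105., 215., 215.],
                      [400., 400., 500., 500.]])
scores = torch.tensor([0.9, 0.8, 0.7])

# Boxes overlapping a higher-scoring box by IoU > 0.5 are suppressed
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)          # tensor([0, 2]) -- the second box is merged away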

Another application of deep learning in object detection is the Mask Region-based Convolutional
Network (Mask R-CNN), an extension of the Faster R-CNN model in which pixel-level object
segmentation is performed on the input image. The Mask R-CNN model covers four tasks: instance
segmentation, bounding box detection, object detection and keypoint detection. Its outputs are the class
label, the bounding box offset and the object mask. An RPN is used to generate the bounding box
proposals, and the three outputs are generated for each region of interest (RoI). An overview of the mAP
scores on the 2012 PASCAL VOC dataset and the 2015 and 2016 COCO datasets is given in Table 2
below (He et al., 2017).
Table 2: Overview of mAP scores on the PASCAL VOC and COCO datasets

Model           PASCAL VOC 2012   COCO 2015   COCO 2016   Real-time speed
R-CNN           -                 -           -           No
Fast R-CNN      68.4%             -           -           No
Faster R-CNN    75.9%             -           -           No
R-FCN           -                 31.5%       -           No
YOLO            57.9%             -           -           Yes
Mask R-CNN      -                 -           39.8%       No
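
As with Faster R-CNN, a pre-trained Mask R-CNN can be tried directly from torchvision. The sketch below again assumes a recent torchvision version and a placeholder image, and shows that each detection additionally carries a per-pixel mask alongside its box and label.

import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Pre-trained Mask R-CNN (Faster R-CNN + a parallel mask head), COCO weights
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = to_tensor(Image.open("scene.jpg").convert("RGB"))   # placeholder image
with torch.no_grad():
    out = model([img])[0]

# Alongside boxes and labels, each detection carries a per-pixel soft mask
keep = out["scores"] > 0.8
masks = out["masks"][keep] > 0.5        # threshold soft masks to binary masks
print(out["boxes"][keep].shape, masks.shape)   # [N, 4] and [N, 1, H, W]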

There are generally two types of data that can be extracted from the external environment:
data representing terrain features, such as lakes and roads, and data representing traversability,
such as obstacles and pedestrians. These features are then grouped and trained according to their
respective output class labels. There are several types of training techniques (Otte, 2015):
1. Offline, from a human-labelled training database
2. Online, using external sensors, which is also called self-supervised learning
3. Offline, using sensor data stored in database files
The advantage of offline training is access to a huge training database within a short period of
time, whereas its disadvantage is the inaccuracy of the trained model when deployed in a real-world
environment that differs from the training database. Online training, on the other hand, allows the model
to be constructed according to the current state of the real-world environment; since training must be
conducted in real time, the complexity of the model is somewhat limited. Online training often uses
simple models such as histograms and AdaBoost (Happold et al., 2006), whereas offline training uses
more complex models such as neural networks and cascades of Gaussians (Erkan et al., 2007).
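
A minimal sketch of the offline case, assuming scikit-learn and synthetic features standing in for sensor data stored in database files, could look as follows; the feature dimensions, labels and network size are all made up for illustration.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for sensor features stored in database files:
# each row is a feature vector; label 1 = traversable, 0 = obstacle
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Offline training: the whole stored dataset is available at once, so a
# more complex model (here a small neural network) can be fitted without
# real-time constraints
clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))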

All the research findings discussed above strongly support the view that deep learning is
powerful in object detection applications, as it enables models to be trained end to end with
backpropagation. According to Zhao, Zheng, Xu and Wu (2018), object detection with deep
learning shows promising improvement alongside developments in neural networks and related
deep learning modelling techniques.

References

Ekvall, S., Kragic, D. and Jensfelt, P. (2007). Object detection and mapping for service robot tasks. [online] Available
at: http://doi.org/10.1017/S0263574706003237 [Accessed 20 Sep. 2018].

Girshick, R. (2015). Fast R-CNN. [online] Available at: http://doi.org/10.1109/ICCV.2015.169 [Accessed 19 Sep. 2018].

He, K., Gkioxari, G., Dollar, P. and Girshick, R. (2017). Mask R-CNN.

Redmon, J., Divvala, S., Girshick, R. and Farhadi, A. (2016). You Only Look Once: Unified, Real-Time Object
Detection.

Ren, S., He, K., Girshick, R. and Sun, J. (2015). Faster R-CNN: Towards Real-Time Object Detection with Region
Proposal Networks.

Shabbir, J. and Anwer, T. (2018). A Survey of Deep Learning Techniques for Mobile Robot Applications. Journal of
LaTeX Class Files, 14(8).

Uijlings, J., van de Sande, K., Gevers, T. and Smeulders, A. (2013). Selective Search for Object Recognition.

Zhao, Z., Zheng, P., Xu, S. and Wu, X. (2018). Object Detection with Deep Learning: A Review. Computer Vision
and Pattern Recognition.

Research Methodology

a. Background Study and Preliminary Work (Objectives 1, 2, 3)


The following will be accomplished here:
1. Literature review: the initial study will be completed within the first four months and will include
the image processing concepts related to object detection and recognition.
2. Study of traditional machine learning and deep learning models.
3. Literature review of open-source machine learning libraries.
4. Acquisition of equipment, and hardware/software installation.

b. Model Design, Training and Validation (Objectives 1, 2, 3)


This step will take another four months. Different models will be designed and implemented, and object
detection frameworks on base networks will be implemented.

c. Performance Comparison and Analysis of Architectures for Object Detection for Robot Navigation
(Objectives 1, 2, 3)
This step will cover the last four months. In this step, a detailed analysis of the different architectures and
modelling techniques will be carried out in the context of robot navigation, considering both accuracy
and real-time navigation. The project documentation will be finalised in this stage.

The flow chart is attached in Appendix A.
