
Survey on Object Detection Framework: Evolution of Algorithms
Ms. Yogitha R¹, Dr. G. Mathivanan²
¹Department of Computer Science and Engineering, ²Department of Information Technology,

Sathyabama Institute of Science and Technology, Chennai – 600 119.


yogitha.ravi1915@gmail.com

Abstract— Object Detection is one among the most emerging and effective fields of interest in the broad domain of Artificial Intelligence. Given a set of images or a video (a sequence of frames) as input, object detection involves detecting and identifying a wide range of artifacts in the corresponding image or frame. This in turn includes various categories of detecting an artifact, such as detecting an object based on its location or detecting an object based on its identity. This categorization of the object detection framework is important because of the range of applications where these techniques can be applied effectively. This paper establishes a detailed survey which explains the different techniques and algorithms that have paved the way into the object detection framework: starting from the historical mathematical technique of dynamic programming and the geometric techniques that followed it, through techniques like SIFT and HOG that perform feature-level detection, to networks like D-CNN, R-CNN, RFCN, YOLO etc. which employ a pipeline of networks to detect and segment instances of objects in images.

Index Terms— Object Detection, Scale Invariant Feature Transform (SIFT), Histogram of Oriented Gradients (HOG), Deep Convolutional Neural Network (D-CNN), Region based Convolutional Neural Network (R-CNN) and You Only Look Once (YOLO).
I. INTRODUCTION

Object Detection has a wide range of real-time applications. For instance, given video surveillance footage from an ATM centre where a burglary happened, we need a technique that detects the face of the person who committed the burglary at the given time, so that the person responsible can be identified and caught. In another instance, given video footage taken from any public space, and taking the pandemic situation into consideration, there is a need to monitor the video to ensure proper social distancing between the individuals present in the frame. For these kinds of applications, we need to employ a technique that detects the location of the objects in the frame.

Likewise, based on each application's needs, the object detection framework has evolved and expanded throughout the years. This paper focuses on the most important techniques that have been employed to detect objects over the years, the applications that have paved the way towards these innovations, and finally the applications that need more accurate and better algorithms for building a user-friendly, smart environment.

In general, the word "object" in the context of object detection algorithms might refer to a wide range of objects that can be seen in a picture: non-living things (bat, ball, hat, jug, bottle, bucket etc.), animals (cat, dog, lion, cheetah, gorilla, giraffe etc.), human beings doing various activities (walking, talking, sleeping, moving, standing etc.), vehicles of various kinds (two-wheelers, four-wheelers, airways, railways etc.) and scenic surroundings (clouds, sky, grass, sand etc.). These artifacts occur in the images based on the surrounding spatial environment in which the picture was taken. Mostly, however, we need algorithms that can detect the artifacts of interest, such as detecting pedestrians on a road, a vehicle number from the vehicle's number plate, or fishes from satellite images. The algorithms should have the capability to ignore unnecessary content such as scenic surroundings in the input. This enables the algorithm to achieve better accuracy with optimized time and space complexity.

Most real-time applications that use object detection algorithms emphasize detecting the spatial locations of the artifacts along with classifying the artifacts present in the corresponding image. This is technically termed drawing a bounding box over the object of interest. The bounding box represents the spatial coordinates of the object in a two-dimensional space. The bounding box can be a rectangular box drawn closely enclosing the object of interest, or it can be a mask of pixels that encloses the exact boundary of the object of interest in a precise manner.

There are several challenges that have to be considered while building an algorithm for an object detection framework. Some of them include the inevitable deformities that occur, such as occlusion in the images, blurring of parts of images, inevitable cluttering in images, improper levels of illumination, motion of objects that makes the image disoriented, different types of instances under the same artifact class (e.g. different types of two-wheelers), little variation among inter-class object representations (e.g. different species of animals) and unnecessary image noise that occurs in the input. Amidst all these challenges, the goal is to build an object detection algorithm with high accuracy in terms of both classification and localization, and with high efficiency in terms of time, memory and storage.
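The rectangular bounding-box representation described above is usually stored as corner coordinates (x1, y1, x2, y2), and detection quality is commonly scored by the overlap between a predicted box and a ground-truth box (Intersection over Union, IoU). The following is a minimal Python sketch of that computation; the coordinate convention and the example boxes are illustrative assumptions, not taken from any of the surveyed works.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical predicted and ground-truth boxes in pixel coordinates.
print(iou((10, 10, 60, 60), (30, 30, 80, 80)))  # approx. 0.22
```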
II. REVIEW THROUGH LITERATURE – GEOMETRICAL METHODS

Table 1 specifies a series of object detection algorithms that are based on the geometry of the objects present in the image. The technique of object detection started long ago, and it was first based on simple mathematical concepts built around the objects' geometrical aspects. Until 1990 these techniques were popular and widely applied for all kinds of object detection tasks. After 1990, further approaches were identified that had a different way of interpretation compared to the previous techniques.

Table 1: Review through literature – Geometric methods

Ref. | Year | Concept | Technique | Dataset
[1] | 1973 | Find out the real object in any given photograph with the help of its predefined description | Pattern matching and Dynamic programming | Face
[2] | 1973 | Assembly processing system used to recognize the objects by continuously programming by showing the objects | Structure Matching Algorithm | Different parts of objects
[3] | 1981 | Given an image, detect the lines and curves (analytic and non-analytic) from the image | Hough Transform | Grey-level images
[4][5] | 1981 | A filtering technique to filter out the gross errors and detect the cylindrical shapes | Random Sample Consensus (RANSAC) filtering technique | Images of cylindrical objects
[6] | 1982 | Based on certain local features of images of industrial parts, partially visible parts were located | CAD models for identifying key features of objects and clustering for mapping local features | Images with industrial parts
[7] | 1983 | Formulated a set of criteria for edge detection to detect the edges and lines in images as directly as possible | Set of operators varied in width, length and orientation | Images with intensity changes
[8] | 1984 | Given an already known set of objects, measuring surface normals and positions in three dimensions helps in object identification | Model-based recognition algorithm | Image data obtained from a triangulation range sensor
[9] | 1985 | By detecting the five properties of 2D images – collinearity, cotermination, curvature, symmetry, parallelism – a set of modest components can be derived | RBC (Recognition by Components) approach | 2D image data
[10] | 1986 | Hierarchical approach that takes the objects in a scene, verifies them, configures essential information out of them and arranges them based on that | 3DPO approach | 3D objects in images
[11] | 1986 | Object identification approach for objects that are lying on a flat surface | HYPER approach (Hypothesis predicted and evaluated recursively) | Isolated objects positioned here and there
[12][13] | 1987 | Constructing vertex pairs from the images that define the transformation from 2D to 3D | Affine transformation and clustering approach | Outdoor images
[14] | 1990 | Construct an exact aspect graph for objects that have curves, observed under orthographic projection | Ray tracing and numerical curve tracing | Images with curved objects
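As an illustration of the geometric era summarized in Table 1, line detection with the Hough Transform (row [3]) is still directly available in modern libraries. The sketch below uses OpenCV's probabilistic Hough transform on an edge map; the file name and parameter values are illustrative assumptions, and the original works of course predate this API.

```python
import cv2

# Hypothetical input image; any grey-level image with straight edges will do.
image = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

# Edge map first, then vote for line segments in Hough space.
edges = cv2.Canny(image, threshold1=50, threshold2=150)
lines = cv2.HoughLinesP(edges, rho=1, theta=3.14159 / 180,
                        threshold=80, minLineLength=40, maxLineGap=5)

# Each detected segment is returned as its two end points.
if lines is not None:
    for (x1, y1, x2, y2) in lines[:, 0]:
        print(f"line from ({x1}, {y1}) to ({x2}, {y2})")
```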
III. REVIEW THROUGH LITERATURE – FEATURE BASED METHODS

After the geometric methods, feature-level object detection started to bloom, roughly from the early 1990s through the mid-2000s. Features were first represented globally, which gave more accurate results compared to geometric object detection. To identify and localise more intricate objects, the methods were then further fine-tuned into identifying the local features of objects using various mathematical computations, which in turn gave more accurate results. The classes of techniques that were most prominent among global and local feature representations are listed in Table 2. These techniques are considered milestones of the era before experimenting with classifiers based on a region pipeline.

Table 2: Review through literature – Feature Based Methods

Ref. | Year | Concept | Technique | Dataset
[15] | 1991 | For visual skill development in robots, color histograms were used for identification and localisation of the objects | Global feature representation using Histogram Intersection and Histogram Backprojection | Models obtained from a large database
[16] | 1991 | Human face identification and recognition by projecting the face in two-dimensional space and considering it as a global shape altogether in upright view | Global feature representation using Eigenfaces | Low-resolution images of human faces
[17] | 1995 | Object recognition done through visual appearance rather than matching the shape of the object | Global feature representation using Eigenspace | Objects with complex appearance characteristics
[18] | 1999 | Staged filtering approach used for feature identification and detection | Scale Invariant Feature Transform on local features | Partially occluded and cluttered images
[19] | 2001 | (1) Representation of the image using an integral image, (2) AdaBoost learning algorithm, (3) Cascade combination of complex classifiers | Haar-like features and cascading features | Face detection images
[20] | 2002 | Attaching a Shape Context descriptor to each point and measuring the similarity index between various shapes of the objects | Aligning transform estimation using the Shape Context descriptor | Handwritten digits and COIL dataset
[21] | 2002 | Identify a common and powerful local image texture feature using the Local Binary Pattern and compute the histogram for the same | Local Binary Pattern | Images with different rotation variance
[22] | 2004 | Extraction of distinct local invariant features from images having objects in different views and scenes | SIFT to identify and localise features, and Hough transform to identify and group objects that come under a single category | Occluded and cluttered images
[23] | 2005 | Visual recognition of objects in a robust manner using HOG and SVM that outperforms usual detectors based on edges and gradients | Histogram of Oriented Gradients descriptor for local features and Support Vector Machine for final classification | MIT pedestrian dataset and 1800 annotated images of humans in various poses and backgrounds
[24] | 2006 | Classification of texture and object detection using a local region covariance descriptor | Integral-image-based fast covariance computation and covariance calculation of derivatives based on color and intensity | Images with large rotations and different levels of illumination change
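The HOG-plus-SVM pipeline summarized in row [23] of Table 2 can be reproduced in a few lines with present-day libraries. Below is a minimal, hedged sketch assuming scikit-image and scikit-learn are available; the 64×128 window size and the random training arrays are illustrative placeholders rather than the original MIT pedestrian setup.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_descriptor(window):
    """HOG feature vector for a single 128x64 grayscale detection window."""
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

# Placeholder training data: random windows standing in for pedestrian /
# non-pedestrian crops; a real setup would use annotated image patches.
rng = np.random.default_rng(0)
windows = rng.random((20, 128, 64))
labels = np.array([1] * 10 + [0] * 10)

features = np.stack([hog_descriptor(w) for w in windows])
classifier = LinearSVC().fit(features, labels)

# At detection time, the same descriptor is computed for each sliding window
# and the linear SVM decides object vs. background.
print(classifier.predict(features[:3]))
```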
To aggregate the local features identified using the approaches mentioned in Table 2, several algorithms were used. Among them, a few yielded high accuracy, such as Bag of Visual Words (BOV) [25][26], which involved a text-retrieval approach by building a visual vocabulary; Spatial Pyramid Pooling, an extension of the BOV algorithm based on geometric correspondence in a global manner; and the Fisher Kernel, another extension of the BOV algorithm that computes features beyond simple statistics. These algorithms play a key role in object detection frameworks involving identification of local and intricate features.
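A Bag of Visual Words encoder of the kind described above can be sketched as follows: cluster local descriptors (e.g. SIFT vectors) into a visual vocabulary with k-means, then represent each image as a histogram of word occurrences. This is a minimal illustration assuming scikit-learn; the descriptor arrays and vocabulary size are placeholders, not the configurations used in [25][26].

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder local descriptors (e.g. 128-D SIFT vectors) pooled from a
# training set of images; in practice these come from a feature detector.
rng = np.random.default_rng(0)
all_descriptors = rng.random((1000, 128))

# Step 1: build the visual vocabulary by clustering descriptors.
vocabulary_size = 50
kmeans = KMeans(n_clusters=vocabulary_size, n_init=10).fit(all_descriptors)

def bov_histogram(image_descriptors):
    """Encode one image's descriptors as a normalized word-count histogram."""
    words = kmeans.predict(image_descriptors)
    hist = np.bincount(words, minlength=vocabulary_size).astype(float)
    return hist / max(hist.sum(), 1.0)

# One hypothetical image with 80 local descriptors.
print(bov_histogram(rng.random((80, 128))).shape)  # (50,)
```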
IV. REVIEW THROUGH LITERATURE – CONVOLUTION

The object detection framework can be mainly categorized into two groups based on the detection pipeline. The two categories are the one-stage and two-stage detector frameworks, shown in Figure 1. The main difference between the two is that in a one-stage framework, the input image is given directly to the convolutional and max-pooling layers, whereas in a two-stage framework the input image is first given to a separate network that finds regions of interest; the region proposals generated from the image are then given to the convolutional and max-pooling layers.

Figure 1: Two categories of Object Detection Framework
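The two-stage pipeline (region proposals followed by classification, as in Faster R-CNN in Table 3 below) is exposed ready-made in torchvision. The sketch below is a minimal inference example under the assumption of a recent torchvision version; the image file name is a placeholder and the exact weights argument may differ across library versions.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Pre-trained two-stage detector: a region proposal network feeds ROI heads.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# Hypothetical input image.
image = to_tensor(Image.open("street.jpg").convert("RGB"))

with torch.no_grad():
    # The model returns one dict per image with boxes, labels and scores.
    prediction = model([image])[0]

for box, label, score in zip(prediction["boxes"], prediction["labels"],
                             prediction["scores"]):
    if score > 0.5:
        print(label.item(), score.item(), box.tolist())
```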

Table 3: Review through literature – Convolution

Ref. | Year | Concept | Technique | Dataset
[27] | 2013 | DCNN: Applying deep neural network concepts to localize the objects using an object mask in the given input image | DetectorNet (Deep Neural Network) | Pascal VOC 2007
[28] | 2013 | RCNN: Technique proposed to selectively search regions of interest from the given input image, so that only the required objects are searched in the image | CNN and SVM | Pascal VOC
[29] | 2013 | OverFeat: Networks formed with the idea that integrated recognition, localization and detection of objects are possible in a given input image | ConvNet for classification, localization and detection | ImageNet
[30][31] | 2006 | SPPNet: Extension of the bag-of-features algorithm. Features are computed locally using histograms and partitioned at increasing resolutions, forming a pyramid shape | Histogram | Caltech 101
[32] | 2015 | Fast RCNN: Along with ROI generation, a region-of-interest pooling layer is added between the last convolutional layer and the first fully connected layer | Fast R-CNN | Pascal VOC 2012
[33] | 2015 | Faster RCNN: A faster, efficient and accurate region proposal network was proposed for generating regions of interest | Faster R-CNN | ImageNet, COCO 2015
[34] | 2016 | YOLO: Features are generated directly from the image instead of using a pipeline for generating regions of interest | You Only Look Once | Picasso and People-Art datasets
[35] | 2016 | SSD: Combining the ideas from YOLO and RPN, features are generated in one flow instead of using two different pipelines | Single Shot Detector | Pascal VOC, MS COCO and ImageNet
[36] | 2016 | RFCN: All convolutional layers in the network are used while constructing the region proposals | R-FCN | Pascal VOC
[37] | 2018 | Mask RCNN: A mask is generated covering the objects in the image by using pixel-wise segmentation | Mask R-CNN | COCO 2016
[38] | 2020 | Mask Scoring RCNN for detecting the text present in scenes (scene text detection) | MaskS R-CNN | ICDAR 2017
[39] | 2021 | Detection accuracy of Faster R-CNN and Mask R-CNN is greater than 80%, and greater than 75% with ResNet50 | Faster R-CNN, Mask R-CNN | Vehicle dataset
V. CONCLUSION

There has been tremendous development in the field of object detection because of the innovation and development in various deep learning techniques. This paper has contributed a survey spanning historic techniques up to recent developments in the field of object detection, its evaluation criteria and the well-known datasets used to implement those techniques and methods. The final objective would be to apply these techniques to real-world applications that would be useful to the surrounding environment.

REFERENCES

[1] Ambler, H. Barrow, C. Brown, R. Burstall, and R. Popplestone. A Versatile Computer-Controlled Assembly System. In International Joint Conference on Artificial Intelligence, pages 298–307, 1973.
[2] Mundy J. (2006) Object recognition in the geometric era: A retrospective. In: Toward Category-Level Object Recognition, edited by J. Ponce, M. Hebert, C. Schmid and A. Zisserman, pp. 3–28
[3] Fischler M., Elschlager R. (1973) The representation and matching of pictorial structures. IEEE Transactions on Computers 100(1):67–92
[4] D. Ballard. Generalizing the Hough Transform to Detect Arbitrary Shapes. Pattern Recognition, 13(2):111–122, 1981.
[5] R. C. Bolles and M. A. Fischler. A RANSAC-based approach to model fitting and its application to finding cylinders in range data. In International Joint Conference on Artificial Intelligence, pages 637–643, Vancouver, Canada, August 1981.
[6] W. E. L. Grimson and T. Lozano-Pérez. Model-based recognition and localization from sparse range or tactile data. International Journal of Robotics Research, 3(3):3–35, 1984.
[7] R. Bolles and R. Horaud. 3DPO: A Three-Dimensional Part Orientation System. International Journal of Robotics Research, 5(3):3–26, 1986.
[8] I. Biederman. Human Image Understanding: Recent Research and a Theory. Computer Vision, Graphics and Image Processing, 32:29–73, 1985.
[9] D. W. Thompson and J. L. Mundy. Three-dimensional model matching from an unconstrained viewpoint. In Proceedings of the International Conference on Robotics and Automation, Raleigh, NC, pages 208–220, 1987.
[10] D. Kriegman and J. Ponce. Computing exact aspect graphs of curved objects: solids of revolution. The International Journal of Computer Vision, 5(2):119–136, November 1990.
[11] N. Ayache and O. Faugeras. HYPER: A New Approach for the Recognition and Positioning of Two-Dimensional Objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(1):44–54, January 1986.
[12] R. Bolles and R. Cain. Recognizing and locating partially visible objects: The local-feature-focus method. International Journal of Robotics Research, 1(3):57–82, 1982.
[13] J. F. Canny. Finding edges and lines in images. Technical Report AI-TR-720, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, June 1983.
[14] Mundy J. (2006) Object recognition in the geometric era: A retrospective. In: Toward Category-Level Object Recognition, edited by J. Ponce, M. Hebert, C. Schmid and A. Zisserman, pp. 3–28
[15] Murase H., Nayar S. (1995) Visual learning and recognition of 3D objects from appearance. IJCV 14(1):5–24
[16] Swain M., Ballard D. (1991) Color indexing. IJCV 7(1):11–32
[17] Turk M. A., Pentland A. (1991) Face recognition using eigenfaces. In: CVPR, pp. 586–591
[18] Lowe D. (1999) Object recognition from local scale-invariant features. In: ICCV, vol 2, pp. 1150–1157
[19] Viola P., Jones M. (2001) Rapid object detection using a boosted cascade of simple features. In: CVPR, vol 1, pp. 1–8
[20] Lowe D. (2004) Distinctive image features from scale-invariant keypoints. IJCV 60(2):91–110
[21] Belongie S., Malik J., Puzicha J. (2002) Shape matching and object recognition using shape contexts. IEEE TPAMI 24(4):509–522
[22] Dalal N., Triggs B. (2005) Histograms of oriented gradients for human detection. In: CVPR, vol 1, pp. 886–893
[23] Ojala T., Pietikäinen M., Mäenpää T. (2002) Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE TPAMI 24(7):971–987
[24] Tuzel O., Porikli F., Meer P. (2006) Region covariance: A fast descriptor for detection and classification. In: ECCV, pp. 589–600
[25] Sivic J., Zisserman A. (2003) Video Google: A text retrieval approach to object matching in videos. In: International Conference on Computer Vision (ICCV), vol 2, pp. 1470–1477
[26] Csurka G., Dance C., Fan L., Willamowski J., Bray C. (2004) Visual categorization with bags of keypoints. In: ECCV Workshop on Statistical Learning in Computer Vision
[27] Szegedy C., Toshev A., Erhan D. (2013) Deep neural networks for object detection. In: NIPS, pp. 2553–2561
[28] Girshick R., Donahue J., Darrell T., Malik J. (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR, pp. 580–587
[29] Sermanet P., Eigen D., Zhang X., Mathieu M., Fergus R., LeCun Y. (2014) OverFeat: Integrated recognition, localization and detection using convolutional networks.
[30] Grauman K., Darrell T. (2005) The pyramid match kernel: Discriminative classification with sets of image features.
[31] Lazebnik S., Schmid C., Ponce J. (2006) Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories.
[32] Girshick R. (2015) Fast R-CNN.
[33] Ren S., He K., Girshick R., Sun J. (2015) Faster R-CNN: Towards real-time object detection with region proposal networks.
[34] Dai J., Li Y., He K., Sun J. (2016) R-FCN: Object detection via region-based fully convolutional networks.
[35] He K., Gkioxari G., Dollár P., Girshick R. (2017) Mask R-CNN.
[36] Redmon J., Divvala S., Girshick R., Farhadi A. (2016) You Only Look Once: Unified, real-time object detection.
[37] Liu W., Anguelov D., Erhan D., Szegedy C., Reed S., Fu C., Berg A. (2016) SSD: Single shot multibox detector.
[38] Duan P., Pan J., Rao W. (2020) MaskS R-CNN Text Detector.
[39] Tahir H., Khan M. S., Tariq M. O. (2021) Performance Analysis and Comparison of Faster R-CNN, Mask R-CNN and ResNet50 for the Detection and Counting of Vehicles.
