
Object Detection Using the YOLO Algorithm

Ashmita Dey, Electronics and Communication Engineering, R.V College of Engineering, Bengaluru, India (ashmitadey.ec18@rvce.edu.in)

Kiran Prabhakar, Digital Communications, R.V College of Engineering, Bengaluru, India (kiranp.ldc20@rvce.edu.in)

Sriyansh Hetamsaria, Industrial Engineering and Management, R.V College of Engineering, Bengaluru, India (sriyanshh.im19@rvce.edu.in)

Mentor Name: Dr. B. Roja Reddy


Designation: Associate Professor
Department: TCE

Abstract

Object recognition describes a collection of related computer vision tasks that involve identifying objects in digital photographs. Image classification involves predicting the class of one object in an image. Object localization refers to identifying the location of one or more objects in an image and drawing a bounding box around their extent. Object detection combines these two tasks, localizing and classifying one or more objects in an image. When a user or practitioner refers to the term "object recognition", they often mean "object detection". Since it can be challenging for beginners to distinguish between these related computer vision tasks, the three can be separated as follows. Image classification: predict the type or class of an object in an image; input: an image containing a single object, such as a photograph; output: a class label (e.g. one or more integers that are mapped to class labels). Object localization: locate the presence of objects in an image and indicate their location with a bounding box; input: an image containing one or more objects, such as a photograph; output: one or more bounding boxes (e.g. defined by a point, width, and height). Object detection: locate the presence of objects with a bounding box and predict the types or classes of the located objects in an image. Region-based Convolutional Neural Networks (R-CNNs) are a family of techniques for addressing object localization and recognition tasks, designed for model performance. You Only Look Once (YOLO) is a second family of techniques for object recognition, designed for speed and real-time use. The aim of object detection is to detect all instances of objects from a known class, such as people, cars, or faces, in an image. Generally, only a small number of instances of the object are present in the image, but there is a very large number of possible locations and scales at which they can occur, and these must somehow be explored. Each detection is reported with some form of pose information, which may be as simple as the location of the object, a location and scale, or the extent of the object defined in terms of a bounding box; in other situations, the pose information is more detailed and contains the parameters of a linear or non-linear transformation.

Key Terms

Classification, convolutional neural networks, deep learning technology, bounding box, confidence score, accuracy, intersection over union

1. Introduction

Object detection refers to the capability of computer and software systems to locate objects in an image or scene and identify each object. It has been widely used for face detection, vehicle detection, pedestrian counting, web images, security systems, and driverless cars, and it can be applied in many other fields of practice. Like every other computer technology, a wide range of creative and remarkable uses of object detection will come from the efforts of computer programmers and software developers. However, using modern object detection methods in applications and systems, as well as building new applications based on these methods, is not a straightforward task. Early implementations of object detection involved classical algorithms, like the ones supported in OpenCV, the popular computer vision library, but these classical algorithms could not achieve sufficient performance under varying conditions. The breakthrough and rapid adoption of deep learning in 2012 brought into existence modern, highly accurate object detection algorithms and methods such as R-CNN, Fast R-CNN, Faster R-CNN, and RetinaNet, as well as fast yet highly accurate ones like SSD and YOLO. Using these deep-learning-based methods and algorithms requires a substantial understanding of mathematics and of deep learning frameworks. Millions of expert computer programmers and software developers want to integrate and create new products that use object detection, but the technology is kept out of their reach by the extra and complicated path to understanding and making practical use of it.

YOLO is an abbreviation of the term 'You Only Look Once'. It is an algorithm that detects and recognizes various objects in a picture in real time. YOLO treats object detection as a regression problem and provides the class probabilities of the detected objects. The algorithm employs convolutional neural networks (CNNs) to detect objects in real time; as the name suggests, it requires only a single forward propagation through the neural network to detect objects. This means that prediction over the entire image is done in a single algorithm run, with the CNN predicting the various class probabilities and bounding boxes simultaneously. The YOLO algorithm has various variants; some of the common ones include tiny YOLO and YOLOv3.
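
To make the single-pass idea concrete, the following minimal sketch computes the shape of the output tensor produced by one forward pass, assuming the original YOLOv1 layout (an S x S grid with B boxes per cell and C class probabilities); the specific values are the ones used in the YOLOv1 paper for PASCAL VOC.

# Shape of the detection tensor produced by a single forward pass,
# assuming the original YOLOv1 layout.
S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (YOLOv1 on PASCAL VOC)

# Each box carries 5 values (x, y, w, h, confidence), so each grid cell
# predicts B * 5 + C numbers.
output_shape = (S, S, B * 5 + C)
print(output_shape)  # (7, 7, 30)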

2. Related Works
1. C. Liu, Y. Tao, J. Liang, K. Li and Y. Chen, "Object Detection Based on YOLO Network," 2018 IEEE 4th
Information Technology and Mechatronics Engineering Conference (ITOEC), 2018, pp. 799-803, doi:
10.1109/ITOEC.2018.8740604:
A generalized object detection network was developed by applying complex degradation processes on training
sets like noise, blurring, rotating and cropping of images. The model was trained with the degraded training sets
which resulted in better generalizing ability and higher robustness.

2. W. Lan, J. Dang, Y. Wang and S. Wang, "Pedestrian Detection Based on YOLO Network Model," 2018
IEEE International Conference on Mechatronics and Automation (ICMA), 2018, pp. 1547-1551, doi:
10.1109/ICMA.2018.8484698:
The network structure of the YOLO algorithm was improved and a new network structure, YOLO-R, was proposed to increase the network's ability to extract shallow pedestrian features, by adding passthrough layers to the original YOLO network.

3. J. Redmon, S. Divvala, R. Girshick and A. Farhadi, "You Only Look Once: Unified, Real-Time Object
Detection," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779-788,
doi: 10.1109/CVPR.2016.91:
A fast and simple approach to real-time object detection, You Only Look Once, was introduced. The model was built to detect objects accurately and quickly, and to differentiate between artwork and real images.

4. R. Zhang, Y. Yang, W. Wang, L. Zeng, J. Chen and S. McGrath, "An Algorithm for Obstacle Detection based on
YOLO and Light Field Camera," 2018 12th International Conference on Sensing Technology (ICST), 2018, pp.
223-226, doi: 10.1109/ICSensT.2018.8603600:
An obstacle detection algorithm for indoor environments is proposed that combines the YOLO object detection algorithm with a light field camera to classify objects into categories and mark them in the image.
5. X. Zhou, W. Gong, W. Fu and F. Du, "Application of deep learning in object detection," 2017 IEEE/ACIS
16th International Conference on Computer and Information Science (ICIS), 2017, pp. 631-634, doi:
10.1109/ICIS.2017.7960069:

This work expresses the importance of deep learning applications and the impact of the dataset on deep learning, through the use of faster models on new datasets. Experimental data shows that deep learning is an effective tool for moving from hand-crafted features driven by experience to learned features driven by data.

3. Object Detection Using the YOLO Algorithm

3.1 Problem statement

Implementation of an object detection system that detects objects efficiently based on the YOLO algorithm, and application of the algorithm to image and video data.

3.2 Methodology
3.2.1 Comparison of other Algorithms
1. Fast R-CNN

Fast R-CNN is an improved version of R-CNN, which suffered from several disadvantages: a multi-stage training pipeline, high cost in space and time, and slow object detection. To remove these, Fast R-CNN introduced a new structure. It takes the entire image as input along with the object proposals. The algorithm first runs a CNN on the input image and forms a feature map using a series of convolutional and max-pooling layers. Then, for each object proposal, a Region of Interest (RoI) pooling layer extracts a fixed-length feature vector and feeds it into fully connected (FC) layers. These further branch into two output layers: one produces the softmax probability for each class plus a "background" class, while the other outputs four real numbers for each class, defining the bounding box for that class.

2. Single Shot MultiBox Detector (SSD)

SSD is based on a feed-forward convolutional network that outputs a fixed-size collection of bounding boxes together with scores for the presence of object class instances in those boxes, and uses non-max suppression to produce the final detections. The architecture of SSD is quite simple: the initial layers of the model are standard ConvNet layers used for image classification (the "base network" in SSD terminology), and on top of this base network some auxiliary layers are added to produce detections, taking into account multi-scale feature maps, default boxes, and aspect ratios.

3. Retina-Net

RetinaNet is a single, unified network composed of a backbone network and two task-specific subnetworks. The backbone, an off-the-shelf convolutional network, is responsible for computing a convolutional feature map over the entire input image. The first subnet performs classification on the backbone's output; the second performs convolutional bounding box regression. RetinaNet uses a Feature Pyramid Network (FPN) backbone on top of a feedforward ResNet architecture to generate a rich, multi-scale convolutional feature pyramid, which is then fed to the two subnets: one classifies the anchor boxes, and the other regresses from the anchor boxes to the ground-truth boxes.

3.2.2 YOLO Algorithm


YOLO algorithm is important because of the following reasons:

Speed: This algorithm improves the speed of detection because it can predict objects in real-time.

High accuracy: YOLO is a predictive technique that provides accurate results with minimal
background errors.

Learning capabilities: The algorithm has excellent learning capabilities that enable it to learn the
representations of objects and apply them in object detection.

YOLO algorithm works using the following three techniques:

● Residual blocks
● Bounding box regression
● Intersection Over Union (IOU)

Residual blocks
First, the input image is divided into an S x S grid of cells. Each grid cell is responsible for detecting objects whose centers fall within it.
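
As a small illustration (a minimal sketch, not code from this project), the snippet below maps a normalized box center to the grid cell responsible for detecting it, assuming S = 7 as in the original YOLO paper.

# Map a box center to the grid cell responsible for detecting it, assuming
# an S x S grid over an image whose coordinates are normalized to [0, 1).
S = 7  # grid size from the original YOLO paper

def responsible_cell(bx, by, s=S):
    """Return the (row, column) of the grid cell containing (bx, by)."""
    return int(by * s), int(bx * s)

print(responsible_cell(0.52, 0.31))  # (2, 3)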

Bounding box regression


A bounding box is an outline that highlights an object in an image. Every bounding box in the image consists of the following attributes:

● Width (bw)
● Height (bh)
● Class (for example, person, car, traffic light, etc.), represented by the letter c
● Bounding box center (bx, by)

YOLO uses a single bounding box regression to predict the height, width, center, and class of objects, together with a confidence score pc that represents the probability of an object appearing in the bounding box.
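
A minimal sketch of what one such prediction vector could look like, following the (pc, bx, by, bh, bw, class scores) layout described above; the numbers and the three-class label set are invented for illustration.

import numpy as np

# One predicted box in the (pc, bx, by, bh, bw, class scores) layout.
# All values below are invented for illustration only.
prediction = np.array([
    0.91,              # pc: probability that an object appears in the box
    0.52, 0.31,        # bx, by: box center, normalized to the image size
    0.40, 0.25,        # bh, bw: box height and width, normalized
    0.05, 0.90, 0.05,  # class scores for, e.g., (person, car, traffic light)
])

pc = prediction[0]
class_id = int(np.argmax(prediction[5:]))
print(pc, class_id)  # 0.91 1, i.e. "car" in this toy label set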

Intersection over union (IOU)


Intersection over union (IOU) is a measure used in object detection that describes how much boxes overlap. YOLO uses IOU to provide an output box that surrounds the object as tightly as possible. Each grid cell is responsible for predicting bounding boxes and their confidence scores. The IOU is equal to 1 if the predicted bounding box is the same as the real box; this mechanism eliminates bounding boxes that deviate from the real box.

In the figure accompanying the original report, two bounding boxes are shown: the blue box is the predicted box, while the green box is the real (ground-truth) box. YOLO aims to make the two bounding boxes coincide.

In summary, the image is first divided into grid cells. Each grid cell forecasts B bounding boxes and provides their confidence scores, and the cells predict class probabilities to establish the class of each object. Intersection over union then ensures that the predicted bounding boxes match the real boxes of the objects, eliminating unnecessary bounding boxes that do not meet the characteristics of the objects (such as height and width). The final detection consists of unique bounding boxes that fit the objects perfectly.
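
A minimal sketch of the IOU computation described above (not code from this project), for boxes given in (x, y, width, height) format with (x, y) the top-left corner:

def iou(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    y2 = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / union if union > 0 else 0.0

# A predicted box identical to the real box gives IOU = 1.
print(iou((50, 50, 100, 80), (50, 50, 100, 80)))   # 1.0
# A partially overlapping prediction gives IOU < 1.
print(iou((50, 50, 100, 80), (100, 60, 100, 80)))  # 0.28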

3.2.3 Software Implementation


Software Requirements:

● Spyder: Spyder is a free and open source scientific environment written in Python, for Python,
and designed by and for scientists, engineers and data analysts.
● Anaconda: a free and open-source distribution of the Python and R programming languages. The distribution comes with the Python interpreter and various packages related to machine learning and data science.
● Dataset: COCO
● Webcam for OpenCV

CODE execution:
(The code listing and execution screenshots from the original report are not reproduced here; a sketch of the script follows the explanation below.)
CODE EXPLANATION:
1. First, we import the required packages and parse some command-line arguments using argparse.
2. Command-line arguments allow us to change the inputs to our script from the terminal, which is convenient because we don't have to hardcode the paths to our model and the input image. After parsing the arguments, we continue by loading the labels, creating a random color for each label, and loading the model using OpenCV's dnn module.
3. The YOLO model outputs the center coordinates and the width and height of each bounding box; we transform the output to get the upper-left corner coordinates instead. Lastly, we draw the bounding boxes on top of the image using a custom method called draw_bounding_boxes, after which we can display and/or save the image depending on the command-line arguments.
4. The draw_bounding_boxes method draws the bounding boxes and confidences onto the image using the cv2.rectangle and cv2.putText methods.
5. Extending the script to also work for video streams is quite simple. First, we add a new command-line argument called --video_path, and combine the --image_path and --video_path arguments into a mutually exclusive group so only one of the two can be specified. If image_path is specified, we execute the image code; if video_path is specified, we run the detection code repeatedly until the video is over; if both arguments are empty, the script uses a webcam.
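
The sketch below is a minimal, self-contained version of the script described in the steps above; it is not the authors' exact code. The detect helper name, the file names (yolov3.cfg, yolov3.weights, coco.names), the 416 x 416 input size, and the confidence/NMS thresholds are assumptions based on a common YOLOv3 + OpenCV dnn setup; draw_bounding_boxes follows the method named in step 4.

import argparse
import numpy as np
import cv2

# Steps 1-2: parse mutually exclusive --image_path / --video_path arguments.
parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group()
group.add_argument("--image_path", help="path to an input image")
group.add_argument("--video_path", help="path to an input video")
args = parser.parse_args()

# Load the class labels and create a random color for each label.
labels = open("coco.names").read().strip().split("\n")
colors = np.random.randint(0, 255, size=(len(labels), 3)).tolist()

# Load the YOLOv3 model with OpenCV's dnn module and find its output layers.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
layer_names = net.getLayerNames()
out_layers = [layer_names[i - 1]
              for i in np.array(net.getUnconnectedOutLayers()).flatten()]

def detect(image, conf_threshold=0.5, nms_threshold=0.4):
    """Run one forward pass and return [(box, confidence, class_id), ...]."""
    h, w = image.shape[:2]
    blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    boxes, confidences, class_ids = [], [], []
    for output in net.forward(out_layers):
        for det in output:
            scores = det[5:]
            class_id = int(np.argmax(scores))
            confidence = float(scores[class_id])
            if confidence > conf_threshold:
                # Step 3: YOLO outputs the box center plus width/height, scaled
                # to [0, 1]; convert to upper-left corner pixel coordinates.
                cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
                boxes.append([int(cx - bw / 2), int(cy - bh / 2),
                              int(bw), int(bh)])
                confidences.append(confidence)
                class_ids.append(class_id)
    keep = cv2.dnn.NMSBoxes(boxes, confidences, conf_threshold, nms_threshold)
    return [(boxes[i], confidences[i], class_ids[i])
            for i in np.array(keep).flatten()]

def draw_bounding_boxes(image, detections):
    """Step 4: draw boxes and confidences with cv2.rectangle and cv2.putText."""
    for (x, y, bw, bh), confidence, class_id in detections:
        color = colors[class_id]
        cv2.rectangle(image, (x, y), (x + bw, y + bh), color, 2)
        cv2.putText(image, f"{labels[class_id]}: {confidence:.2f}", (x, y - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
    return image

# Step 5: image, video, or webcam (device 0) depending on the arguments.
if args.image_path:
    frame = cv2.imread(args.image_path)
    cv2.imshow("detections", draw_bounding_boxes(frame, detect(frame)))
    cv2.waitKey(0)
else:
    capture = cv2.VideoCapture(args.video_path if args.video_path else 0)
    while capture.isOpened():
        ok, frame = capture.read()
        if not ok:
            break
        cv2.imshow("detections", draw_bounding_boxes(frame, detect(frame)))
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    capture.release()
cv2.destroyAllWindows()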

3.3 Results
● A model was implemented on the COCO dataset for object detection.
● Real-time object detection using a webcam was also implemented: bounding boxes around a book, a bottle, and a pair of scissors were created and correctly predicted by the model from the webcam feed.
● Research on various object detection algorithms was carried out. Observation: the YOLO algorithm performs much better in real-time detection than other CNN-based algorithms in terms of speed and accuracy.
● The YOLOv3 algorithm was implemented to detect objects in images and in real-time video.
● The objects provided in the training dataset were correctly predicted while running real-time detection.
● Some objects were misclassified; for example, a stapler was detected as a knife by the algorithm.

4. Future scope
The efficiency of the object recognition system can be increased by enlarging the set of features, local or global, used for recognition. Geometric properties of the image can be included in the feature vector. An unsupervised classifier could be used instead of a supervised classifier for recognizing objects. The proposed object recognition system uses grey-scale images and discards the colour information; this colour information could also be used for recognition, and colour-based object recognition plays a vital role in robotics. Although the visual tracking algorithm proposed here is robust under many conditions, it can be made more robust by eliminating the following limitations. In single visual tracking, the size of the template remains fixed; if the size of the object shrinks over time, the background becomes more dominant than the object being tracked, and the object may be lost. A fully occluded object cannot be tracked and is treated as a new object in the next frame. Foreground object extraction depends on binary segmentation, which is carried out by applying thresholding techniques, so blob extraction and tracking depend on the threshold value. Splitting and merging cannot be handled well in all conditions with a single camera, owing to the loss of information when a 3D object is projected onto a 2D image. For night-time visual tracking, a night vision mode should be available as a built-in feature of the CCTV camera. To make the system fully automatic and to overcome the above limitations, multi-view tracking using multiple cameras can be implemented in future work.

5. Conclusion
Based on the experimental results presented in this work, we are able to detect objects precisely and identify each object individually, together with its exact (x, y) location in the picture. This paper also provides experimental results for different object detection and identification methods and compares their efficiency. We discussed the main aspects of object detection along with the challenges faced in the domain, then examined algorithms that address some of these challenges but fail at the most crucial one: real-time detection (speed in fps). We then studied the YOLO algorithm, which outperforms the other models on the challenges considered: it is fast, works well for real-time object detection, and follows a regression approach. Improvements are still being made to the algorithm; there are currently four generations of YOLO, from v1 to v4, along with a smaller variant, YOLO-tiny, which is specifically designed to achieve an incredibly high speed of 220 fps.
