
OBJECT DETECTION

A
Minor Project Report
Submitted in partial fulfillment of the requirement for the award of degree of

Bachelor of Technology
In
Computer Science & Engineering

Submitted to
RAJIV GANDHI PROUDYOGIKI VISHWAVIDYALAYA,
BHOPAL (M.P.)

Guided By: Prof. Jayesh Umre

Submitted By:
Kratagya Mourya (0875CS191054)
Anmol Saxena (0875CS191014)
Abhinav Shrivastava (0875CS191004)
Aryan Patel (0875CS191018)

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING


SHIVAJIRAO KADAM INSTITUTE OF TECHNOLOGY & MANAGEMENT –
TECHNICAL CAMPUS, INDORE (M.P.) - 452020
2021-2022
Declaration
I hereby declare that the work presented in the project entitled Object Detection, in
partial fulfillment of the requirement for the award of the degree of Bachelor of
Engineering, submitted in the Department of Computer Science & Engineering at
Shivajirao Kadam Institute of Technology & Management – Technical Campus,
Indore, is an authentic record of my own work carried out under the supervision of
"Prof. Jayesh Umre". I have not submitted the matter embodied in this report for the
award of any other degree.

Kratagya Mourya (0875CS191054)


Anmol Saxena (0875CS191014)
Abhinav Shrivastava (0875CS191004)
Aryan Patel (0875CS191018)

Prof. Jayesh Umre

Supervisor

Project Approval Form
I hereby recommend that the project Object Detection prepared under my supervision by
Kratagya Mourya (0875CS191054), Anmol Saxena (0875CS191014), Abhinav
Shrivastava (0875CS191004) and Aryan Patel (0875CS191018) be accepted in partial
fulfillment of the requirement for the degree of Bachelor of Engineering in Computer
Science & Engineering.

Prof. Jayesh Umre

Supervisor

Recommendation concurred in 2021-2022

Prof. Virendra Dani

Project In-charge

Prof. Deepak Singh Chouhan

Project Coordinator

Shivajirao Kadam Institute of Technology & Management –
Technical Campus

Department of Computer Science & Engineering

Certificate
The project work entitled Object Detection submitted by Kratagya Mourya
(0875CS191054), Anmol Saxena (0875CS191014), Abhinav Shrivastava
(0875CS191004) and Aryan Patel (0875CS191018) is approved as partial fulfillment
for the award of the degree of Bachelor of Engineering in Computer Science &
Engineering by Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal (M.P.).

Internal Examiner External Examiner

Name:………………. Name: ……………..

Date: …./…/……….. Date: …./…/………..

Acknowledgement

With boundless love and appreciation, we would like to extend our heartfelt gratitude to
the people who helped us bring this work to reality. We would like to devote some space
to acknowledging them.

Foremost, we would like to express our sincere gratitude to our supervisor, Prof. Jayesh
Umre, whose expertise, consistent guidance, ample time, and constant advice helped us
bring this study to success.

To the project in-charge, Prof. Virendra Dani, for their constructive comments,
suggestions, and critique even in hardship.

To the project coordinator, Prof. Deepak Singh Chouhan, for their consistent guidance,
coordination, and scheduling.

To the honorable Dr. Rashmi Yadav, Head, Department of Computer Science &
Engineering, for her favorable responses regarding the study and for providing the
necessary facilities.

To the honorable Dr. Sanjay T. Purkar, Director, Shivajirao Kadam Institute of
Technology & Management – Technical Campus, Indore, for his unending support,
advice, and efforts to make this work possible.

Finally, we would like to thank the faculty members and staff of the Department of
Computer Science & Engineering for their timely help and support.

We also thank our parents for their eternal love, support, and prayers; without them,
this would not have been possible.

Kratagya Mourya (0875CS191054)

Anmol Saxena (0875CS191014)

Abhinav Shrivastava (0875CS191004)

Aryan Patel (0875CS191018)

Abstract

Object detection methods aim to identify all target objects in the target image and
determine the categories and position information to achieve machine vision
understanding. Numerous approaches have been proposed to solve this problem, mainly
inspired by methods of computer vision and deep learning.

Real-time object detection is a vast, vibrant and complex area of computer vision. If there
is a single object to be detected in an image, it is known as Image Localization and if
there are multiple objects in an image, then it is Object Detection. This detects the
semantic objects of a class in digital images and videos. The applications of real-time
object detection include tracking objects, video surveillance, pedestrian detection, people
counting, self-driving cars, face detection, ball tracking in sports, and many more.
Convolutional neural networks are a representative deep-learning tool for detecting objects,
used here with OpenCV (Open Source Computer Vision), a library of programming
functions mainly aimed at real-time computer vision.

The main purpose of object detection is to identify and locate one or more effective
targets from still image or video data. It comprehensively includes a variety of important
techniques, such as image processing, pattern recognition, artificial intelligence, and
machine learning. It has broad application prospects in such areas as road traffic accident
prevention, warnings of dangerous goods in factories, military restricted area monitoring,
and advanced human-computer interaction. Since the application scenarios of multi-target
detection in the real world are usually complex and variable, balancing the relationship
between accuracy and computing costs is a difficult task.

Object detection comprises various approaches such as Fast R-CNN, RetinaNet, and the
Single-Shot MultiBox Detector (SSD). Although these approaches have addressed the
challenges of data limitation and modeling in object detection, they are not able to detect
objects in a single algorithm run. The YOLO algorithm has gained popularity because of its
superior performance over the aforementioned object detection techniques.

The objective is to detect objects using the You Only Look Once (YOLO) approach. This
method has several advantages compared to other object detection algorithms. Other
algorithms such as R-CNN and Fast R-CNN do not look at the image as a whole, whereas
YOLO looks at the image completely, predicting the bounding boxes with a convolutional
network together with the class probabilities for those boxes, and thus detects objects faster
than the other algorithms.

YOLO algorithm employs convolutional neural networks (CNN) to detect objects in real-
time. As the name suggests, the algorithm requires only a single forward propagation
through a neural network to detect objects. This means that prediction in the entire image
is done in a single algorithm run. The CNN is used to predict various class probabilities
and bounding boxes simultaneously.

Table of Content
List of Figures...........................................................................................................................................

List of Tables.............................................................................................................................................

Abbreviations............................................................................................................................................

Chapter 1: Introduction ..............................................................................................................................

1.1 Goal.......................................................................................................................................................

1.2 Objective...............................................................................................................................................

1.3 Methodology.........................................................................................................................................

1.4 Role.......................................................................................................................................................

1.5 Contribution of Project.........................................................................................................................

1.5.1 Market Potential:.....................................................................................................................

1.5.2 Innovativeness.........................................................................................................................

1.5.3 Usefulness...............................................................................................................................

1.6 Report Organisation..............................................................................................................................

Chapter 2: Requirement Engineering..........................................................................................

2.1 User Roles & Responsibilities..............................................................................................................

2.2 Requirement Collection........................................................................................................................

2.2.1 Functional Requirements.........................................................................................................

2.2.2 Non-Functional Requirements................................................................................................

Chapter 3: Analysis and Design.................................................................................................................

3.1 Use-case Diagrams...........................................................................................................................

3.2 Activity Diagrams.................................................................................................................................

3.3 System Architecture..............................................................................................................................

3.4 Sequence Diagram................................................................................................................................

3.5 Class Diagram.......................................................................................................................................

3.6 E-R Diagram.........................................................................................................................................

3.7 Database Design...................................................................................................................................

3.7.1 Schema Definitions.......................................................................................................

3.7.2 Integrity Constraints.....................................................................................................

Chapter 4: Methodology.............................................................................................................................

4.1 Dataset Discussion and Interpretation..................................................................................................

4.2 Proposed Algorithm..............................................................................................................................

4.3 Tools Required......................................................................................................................................

Chapter 5: Construction..............................................................................................................................

5.1 Implementation and Testing................................................................................................................

5.1.1 Implemented Classes..........................................................................................................

5.1.2 Implemented Functions......................................................................................................

5.1.3 Logical Structure of Data…………………………………………………………………

5.2 Validation and Verification..................................................................................................................

5.3 Testing Approach..................................................................................................................................

5.3.1 Unit Testing.............................................................................................................................

5.3.1.1 Test Cases.....................................................................................................................

5.3.2 Integration Testing..................................................................................................................

5.3.2.1 Test Cases.....................................................................................................................

Chapter 6: Result and Discussion………………………………………………………………………...

Conclusion & Future Works……………………………………………………………………………...

Appendix A (Project Synopsis)…………………………………………………………………………..

Appendix B (Guide Interaction Report)………………………………………………………………….

Appendix C (Project Snapshots)…..……………………………………………………………………..

References………………………………………………………………………………………………...

CHAPTER 1
INTRODUCTION

1.1 Goal:
Blind people do lead a normal life, with their own style of doing things, but they definitely face
trouble due to inaccessible infrastructure and social challenges. The biggest challenge for a blind
person, especially one with complete loss of vision, is to navigate around places. Of course, blind
people move easily around their own house without any help because they know the position of
everything in it; outside it, they have a tough time finding objects around them. So, we decided to
build a REAL TIME OBJECT DETECTION system. We became interested in this project after
going through a few papers in this area, and as a result we are highly motivated to develop a
system that recognizes objects in a real-time environment.

1.2 Objective:
The motive of object detection is to recognize and locate all known objects in a scene, preferably
in 3D space; recovering the pose of objects in 3D is very important for robotic control systems.

Imparting intelligence to machines and making robots more and more autonomous and
independent has been a sustained technological dream of mankind. It is our dream to let robots
take on tedious, boring, or dangerous work so that we can commit our time to more creative
tasks. Unfortunately, the intelligent part still seems to be lagging behind. To achieve this goal in
real life, besides hardware development, we need software that can give a robot the intelligence
to do the work and act independently. One of the crucial components in this regard is vision,
apart from other types of intelligence such as learning and cognitive thinking. A robot cannot be
truly intelligent if it cannot see and adapt to a dynamic environment.

The searching or recognition process in a real-time scenario is very difficult, and so far no fully
effective solution has been found for this problem. Despite a lot of research in this area, the
methods developed so far are not efficient, require long training times, are not suitable for
real-time application, and are not scalable to a large number of classes. Object detection is
relatively simpler if the machine is looking for one particular object. However, recognizing all
objects inherently requires the ability to differentiate one object from another, even though they
may be of the same type. Such a problem is very difficult for machines if they do not know about
the various possible appearances of objects.

1.3 Methodology:
The YOLO algorithm gives much better performance on all the parameters discussed, along with
a high FPS for real-time usage. YOLO is a regression-based algorithm: instead of selecting the
interesting parts of an image, it predicts classes and bounding boxes for the whole image in one
run of the algorithm.

YOLO is an abbreviation for the term ‘You Only Look Once’. This is an algorithm that detects
and recognizes various objects in a picture (in real-time). Object detection in YOLO is done as a
regression problem and provides the class probabilities of the detected images.

YOLO algorithm employs convolutional neural networks (CNN) to detect objects in real-time.
As the name suggests, the algorithm requires only a single forward propagation through a neural
network to detect objects.

This means that prediction in the entire image is done in a single algorithm run. The CNN is used
to predict various class probabilities and bounding boxes simultaneously.

The YOLO algorithm has various variants; some of the common ones include Tiny YOLO and
YOLOv3.
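As an illustrative sketch only (not part of the original implementation), the variants mentioned above can be loaded and run with OpenCV's DNN module; the configuration, weight, and image file names below are placeholders.

import cv2

# Load a pretrained YOLOv3 network in Darknet format (file names are placeholders).
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")

image = cv2.imread("sample.jpg")
# YOLOv3 expects a 416 x 416 input, scaled to [0, 1], with channels in RGB order.
blob = cv2.dnn.blobFromImage(image, scalefactor=1 / 255.0, size=(416, 416),
                             swapRB=True, crop=False)
net.setInput(blob)

# A single forward pass through the three YOLO output layers.
outputs = net.forward(net.getUnconnectedOutLayersNames())
for layer_output in outputs:
    for detection in layer_output:      # one row per predicted box
        scores = detection[5:]          # per-class probabilities
        class_id = scores.argmax()
        confidence = scores[class_id]
        if confidence > 0.5:
            print(class_id, float(confidence))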

1.4 Role:
The most significant advantage of object detection systems is that they can be more accurate than
human vision. The human brain is astounding, so much so that it can complete pictures based on
only a few snippets of information. But this can sometimes also keep us from seeing what is
actually there: the completed picture is not always accurate, because human brains make
assumptions.

Object detection systems react to images based only on the data presented, not just snippets of it
like the human brain. Although they can make assumptions based on patterns, they do not share
the human brain's tendency to leap to conclusions that may not be accurate.

Object detection also operates at the pixel level, which the human brain cannot process. This
allows object detection systems to provide more accurate results.

1.5 Contribution of Project:

1.5.1 Market Potential:

Today, object recognition is the core of most vision-based AI software and programs. Object
detection plays an important role in scene understanding, which is popular in security,
transportation, medical, and military use cases.

Object detection in Retail. Strategically placed people-counting systems throughout multiple
retail stores are used to gather information about how customers spend their time and about
customer footfall. AI-based customer analysis to detect and track customers with cameras helps
to gain an understanding of customer interaction and customer experience, optimize the store
layout, and make operations more efficient. A popular use case is the detection of queues to
reduce waiting time in retail stores.

Autonomous Driving. Self-driving cars depend on object detection to recognize pedestrians,
traffic signs, other vehicles, and more. For example, Tesla's Autopilot AI heavily utilizes object
detection to perceive environmental and surrounding threats such as oncoming vehicles or
obstacles.

Animal detection in Agriculture. Object detection is used in agriculture for tasks such as
counting, animal monitoring, and evaluation of the quality of agricultural products. Damaged
produce can be detected while it is being processed, using machine learning algorithms.

People detection in Security. A wide range of security applications in video surveillance are
based on object detection, for example, to detect people in restricted or dangerous areas, for
suicide prevention, or to automate inspection tasks at remote locations with computer vision.

1.5.2 Innovativeness:
As a real-time object detection system, YOLO object detection utilizes a single neural network.
The latest release of ImageAI v2.1.0 now supports training a custom YOLO model to detect any
kind and number of objects. Convolutional neural networks are instances of classifier-based
systems where the system repurposes classifiers or localizers to perform detection and applies
the detection model to an image at multiple locations and scales. Using this process, “high
scoring” regions of the image are considered detections. Simply put, the regions which look most
like the training images given are identified positively.

As a single-stage detector, YOLO performs classification and bounding box regression in one
step, making it much faster than most convolutional neural networks. For example, YOLO object
detection is more than 1000x faster than R-CNN and 100x faster than Fast R-CNN.

YOLOv3 achieves 57.9% mAP on the MS COCO dataset, compared to 53.3% for DSSD513 and
61.1% for RetinaNet. YOLOv3 uses multi-label classification with overlapping patterns for
training, so it can be used for object detection in complex scenarios. Because of its multi-class
prediction capabilities, YOLOv3 can be used for small-object classification, while it shows worse
performance for detecting large or medium-sized objects.

1.5.3 Usefulness:

Object detection is one of the fundamental problems of computer vision. It forms the basis of
many other downstream computer vision tasks, for example, instance segmentation, image
captioning, object tracking, and more. Specific object detection applications include pedestrian
detection, people counting, face detection, text detection, pose detection, or number-plate
recognition.

CHAPTER 2
ANALYSIS AND DESIGN

3.1 Use Case Diagrams

A use case diagram in the Unified Modeling Language (UML) is a type of behavioural diagram
defined by, and created from, a use-case analysis. Its purpose is to present a graphical overview
of the functionality provided by a system in terms of actors and their goals (represented as use
cases), and any dependencies between those use cases. The main purpose of a use case diagram is
to show which system functions are performed for which actor, and the roles of the actors in the
system can also be depicted.

3.2 Activity Diagram

3.3 System Architecture
The architecture has been worked out keeping in view that general-purpose algorithms are usually
implemented in OpenCV, the popular computer vision library. Earlier work on object
identification also used these algorithms. However, the latest applications built on these
traditional algorithms are often not sufficiently useful or accurate; hence, such algorithms could
not meet the required performance and work efficiency under certain circumstances.

3.3.1 System Architecture of YOLOv3
YOLOv2 used a feature extractor known as the Darknet-19, which consisted of 19 convolutional
layers. The newer version of this algorithm, YOLOv3 uses a new feature extractor known as
Darknet-53 which, as the name suggests, uses 53 convolutional layers while the overall
algorithm consists of 75 convolutional layers and 31 other layers making it a total of 106 layers
[36]. Pooling layers have been removed from the architecture and replaced by another
convolutional layer with stride ‘2’, for the purpose of down-sampling. This key change has been
made to prevent the loss of features during the process of pooling. Figure 2.8, created by
'CyberailAB', clearly depicts the architecture of the YOLOv3 algorithm.

YOLOv3 performs detections at three different scales, as shown in the Figure 2.8. 1 x 1
detection kernels are applied on the feature maps with three unique sizes located at three unique
places in the network. The shape of the detection kernel is 1x1x (B ∗ (4 + 1 + C)), where ‘B’ is
the number of bounding boxes that can be predicted by a cell located on the feature map, ‘4’
represents the number of bounding box attributes, ‘1’ represents the object confidence and ‘C’
represents the number of classes. Figure 2.7 depicts the splitting of an image and bounding-box
prediction in YOLOv3 and Figure 2.8 depicts the architecture of YOLOv3 algorithm trained on
COCO dataset which has 80 classes and bounding boxes are considered to be 3. Therefore, the
kernel size would be 1 x 1 x 255 [37]. In YOLOv3, the dimensions of the input image are down
sampled by 32, 16, and 8 to make predictions at scales 3, 2, and 1 respectively.
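A quick sketch of the arithmetic described above, for a 416 x 416 input and the 80-class COCO dataset:

B, C = 3, 80                      # boxes per cell and number of classes
kernel_depth = B * (4 + 1 + C)    # 4 box attributes + 1 objectness + C class scores
print(kernel_depth)               # 255

for stride in (32, 16, 8):        # the three detection scales
    print(416 // stride)          # feature map sizes: 13, 26, 52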

In the Figure 2.8, the size of the input image is 416 x 416. As mentioned in the earlier section,
the total number of layers in YOLOv3 is 106. As shown in the network architecture diagram
Figure 2.8, the input image is down sampled by the network for the first 81 layers. Since the 81st
layer has a stride of 32, the 82nd layer performs the first detection with a feature map of size 13
x 13. Since a 1 x 1 kernel is used to perform the detection, the size of the resulting detection
feature map is 13 x 13 x 255 which is responsible for the detection of objects at scale 3.

Following this, the feature map from 79th layer is up sampled by 2x after subjecting it to a few
convolutional layers, resulting in the dimensions 26 x 26. This is then concatenated with the
feature map from 61st layer. The features are fused by subjecting the concatenated feature map
to a few more 1 x 1 convolutional layers. As a result, the 94th layer performs the second
detection with a feature map of 26 x 26 x 255, which is responsible for the detection of objects at
scale 2.

Following the second detection, the feature map from 91st layer is up sampled by 2x after
subjecting it to a few convolutional layers, resulting in the dimensions 52 x 52. This is then
concatenated with the feature map from 36th layer. The features are fused by subjecting the
concatenated feature map to a few more 1 x 1 convolutional layers. As a result, the 106th layer

performs the third and final detection with a feature map of 52 x 52 x 255, which is responsible
for the detection of objects at scale 1. As a result, YOLOv3 is better at detecting smaller objects
when compared to its predecessors YOLOv2 and YOLO.

3.3.2 System Architecture for Tiny-YOLOv3
Tiny-YOLOv3 is the smaller and simplified version of YOLOv3. Even though the number of
layers in Tiny-YOLOv3 is far smaller than that of YOLOv3, the accuracy of the model is almost
the same as that of its bigger counterpart when high frame rates are considered. Tiny-YOLOv3
consists of only 13 convolutional layers and 8 max-pooling layers, far fewer than YOLOv3, and
therefore requires minimal memory to run. The major difference between YOLOv3 and
Tiny-YOLOv3 is that the former is designed to detect objects at three different scales while the
latter can only detect objects at two different scales. Apart from these differences, the working of
both variants is similar. Figure 2.11 shows the architecture details of Tiny-YOLOv3.

Compared to YOLOv3, the number of convolutional layers is greatly reduced in Tiny-YOLOv3.


The primary structure of Tiny-YOLOv3 has only 13 convolutional layers, while the overall
number of layers is 23. A limited number of 1 x 1 and 3 x 3 kernels are utilized to extract the
features in Tiny-YOLOv3 [12]. Unlike YOLOv3, which uses convolutional layers of stride 2 for
the purpose of down sampling, Tiny-YOLOv3 uses pooling layers. The convolutional layer
structure of Tiny-YOLOv3 is similar to that of YOLOv3.

Tiny-YOLOv3 performs detections at two different scales, as shown in the Figure 2.11. 1 x 1
detection kernels are applied on the feature maps with two unique sizes located at two unique
places in the network. The shape of the detection kernel is 1x1x (B ∗ (4 + 1 + C)), where ‘B’ is
the number of bounding boxes that can be predicted by a cell located on the feature map, ‘4’
represents the number of bounding box attributes, ‘1’ represents the object confidence and ‘C’
represents the number of classes. Figure 2.11 depicts the architecture of Tiny-YOLOv3
algorithm trained on COCO dataset which has 80 classes and bounding boxes are considered to
be 3. Therefore, the kernel size would be 1 x 1 x 255 [37].

In the Figure 2.11, the size of the input image is 416 x 416. As mentioned in the earlier section,
the total number of layers in Tiny-YOLOv3 is 23. As shown in the network architecture diagram
Figure 2.11, the input image is max-pooled by the network for the first 15 layers. The 15th layer
performs the first detection with a feature map of size 13 x 13. Since a 1 x 1 kernel is used to
perform the detection, the size of the resulting detection feature map is 13 x 13 x 255 which is
responsible for the detection of objects at scale 2.

Following this, the feature map from 14th layer is up sampled by 2x after subjecting it to a
convolutional layer, resulting in the dimensions 26 x 26. This is then concatenated with the
feature map from 9th layer. The features are fused by subjecting the concatenated feature map to
a 1 x 1 and a 3 x 3 convolutional layer. As a result, the 23rd layer performs the second and final
detection with a feature map of 26 x 26 x 255, which is responsible for the detection of objects at
scale 1.

3.4 Sequence Diagram

A sequence diagram in Unified Modeling Language (UML) is a kind of interaction diagram that
shows how processes operate with one another and in what order. It is a construct of a Message
Sequence Chart. Sequence diagrams are sometimes called event diagrams, event scenarios, and
timing diagrams.

3.5 E-R Diagram

CHAPTER 3
METHODOLOGY

4.1 Dataset Discussion and Interpretation


The MS COCO dataset is a large-scale object detection, segmentation, and captioning dataset
published by Microsoft. Machine Learning and Computer Vision engineers popularly use the
COCO dataset for various computer vision projects.

Understanding visual scenes is a primary goal of computer vision; it involves recognizing what
objects are present, localizing the objects in 2D and 3D, determining the object’s attributes, and
characterizing the relationship between objects. Therefore, algorithms for object detection and
object classification can be trained using the dataset.

COCO stands for Common Objects in Context; the image dataset was created with the goal of
advancing image recognition. The COCO dataset contains challenging, high-quality visual data
for computer vision, mostly used to train and benchmark state-of-the-art neural networks.

For example, COCO is often used to benchmark algorithms to compare the performance of real-
time object detection. The format of the COCO dataset is automatically interpreted by
advanced neural network libraries.

The COCO dataset classes for object detection and tracking include the following 80 pre-trained
object classes:

'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light',
'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep',
'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase',
'frisbee', 'skis’, ‘snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove',
'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon',
'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut',
'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse',
'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator',
'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
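As a small illustrative sketch, these labels are usually read from a plain-text file with one class name per line; "coco.names" below is the conventional Darknet file name and should be treated as a placeholder.

with open("coco.names") as f:
    class_names = [line.strip() for line in f if line.strip()]

print(len(class_names))    # 80
print(class_names[0])      # 'person'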

The COCO key points include 17 different pre-trained key points (classes) that are annotated
with three values (x,y,v). The x and y values mark the coordinates, and v indicates the visibility
of the key point (visible, not visible).

"nose", "left_eye", "right_eye", "left_ear", "right_ear", "left_shoulder",


"right_shoulder", "left_elbow", "right_elbow", "left_wrist", "right_wrist", "left_hip",
"right_hip", "left_knee", "right_knee", "left_ankle", "right_ankle"

The large dataset comprises annotated photos of everyday scenes of common objects in their
natural context. Those objects are labeled using pre-defined classes such as “chair” or “banana”.
The process of labeling, also called image annotation, is a very popular technique in
computer vision.

While other object recognition datasets have focused on (1) image classification, (2) object
bounding-box localization, or (3) semantic pixel-level segmentation, the MS COCO dataset
focuses on (4) segmenting individual object instances.

With COCO, Microsoft introduced a visual dataset that contains a massive number of photos
depicting common objects in complex everyday scenes. This sets COCO apart from other object
recognition datasets that may focus on specific sectors of artificial intelligence, such as image
classification, object bounding-box localization, or semantic pixel-level segmentation.

Meanwhile, the annotations of COCO are mainly focused on the segmentation of multiple,
individual object instances. This broader focus allows COCO to be used in more instances than
other popular datasets like CIFAR-10 and CIFAR-100. However, compared to the OID dataset,
COCO does not stand out too much and in most cases, both could be used.

With 2.5 million labeled instances in 328k images, COCO is a very large and expansive dataset
that allows many uses. However, this amount does not compare to Google’s OID, which contains
a whopping 9 million annotated images.

Google’s 9 million annotated images were manually annotated, while OID discloses that it
generated object bounding boxes and segmentation masks using automated and computerized
methods. Both COCO and OID have not disclosed bounding box accuracy, so it remains up to
the user whether they assume automated bounding boxes would be more precise than manually
made ones.

4.2 Proposed Algorithm
Object detection can be carried out using a convolutional neural network as the underlying
algorithm. Detection goes beyond classification: the image is not only assigned the label of a
particular class, but the location at which each object is placed is also identified. The algorithm
not only helps to divide the picture into parts but also identifies the different objects located in
the image. This convolutional neural network uses only a single network to process the complete
image, dividing the image into separate portions and predicting bounding boxes and probabilities
for each portion. These bounding boxes in the image are then weighted by the predicted
probabilities.

The image in which objects are to be identified is resized by YOLO to a fixed input size. In
general, we consider only a fixed size, which keeps the evaluation of the detection process
consistent across algorithms.

This algorithm first fixes the image's height and width as the input and produces the output. It
lists all the boxes found in the frame together with their multi-label classes. Each box in the
frame is described by the parameters (pp, bx, by, bh, bw), where 'pp' is the probability (between
0 and 1) that an object of class 'p' is present in the box, 'bx' and 'by' define the midpoint of the
box, and 'bh' and 'bw' define the height and width of the box respectively.
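As an illustrative sketch (the helper below is hypothetical), the (bx, by, bh, bw) values, given as fractions of the image size, can be converted into pixel corner coordinates for drawing:

def to_corners(bx, by, bh, bw, img_w, img_h):
    cx, cy = bx * img_w, by * img_h            # box centre in pixels
    w, h = bw * img_w, bh * img_h              # box size in pixels
    x1, y1 = int(cx - w / 2), int(cy - h / 2)  # top-left corner
    x2, y2 = int(cx + w / 2), int(cy + h / 2)  # bottom-right corner
    return x1, y1, x2, y2

print(to_corners(0.5, 0.5, 0.2, 0.1, 416, 416))   # (187, 166, 228, 249)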

Workflow of YOLOv3

The above shows the workflow of YOLOv3. In this, feature-prediction mapping is done on each
box available in the frame. Each box is treated as three individual boxes, which is what defines
v3. Each box's attributes contain 'box coordinates', 'probability scores', and 'multi-label scores'
for bounding all the boxes available in the frame. The 'red' cell is in the fifth row, sixth column
of the grid image; feature mapping is applied on it to detect the person.

Algorithm:
Step 1: if (setModel=YoloV3( )|| TinyYoloV3( ))

Step 2: Set the execution path

Step 3: Set the model path to load the model

Step 4: Set timer as default timer to get the elapsed time

Step 5: Detect the objects from input to output image

Step 6: join the detector on inputImage and outputImage

Step 7: obj := 0, accur := 0

Step 8: for each object in detection do

Step 9: {

Step 10: obj := obj + 1; accur := accur + percentProb;

Step 11: }

Step 12: eTime := defaultTimer – startTime;
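One way the steps above can be sketched in code is with the ImageAI library mentioned in Section 1.5.2; the model and image file names below are placeholders, and the mapping to the numbered steps is indicated in the comments.

from timeit import default_timer
from imageai.Detection import ObjectDetection

detector = ObjectDetection()
detector.setModelTypeAsYOLOv3()              # Step 1: or setModelTypeAsTinyYOLOv3()
detector.setModelPath("yolo.h5")             # Steps 2-3: execution and model path
detector.loadModel()

start_time = default_timer()                 # Step 4: default timer
detections = detector.detectObjectsFromImage(    # Steps 5-6: run the detector
    input_image="input.jpg",
    output_image_path="output.jpg")

obj, accur = 0, 0.0                          # Step 7
for detection in detections:                 # Steps 8-11: count objects, accumulate accuracy
    obj += 1
    accur += detection["percentage_probability"]

elapsed = default_timer() - start_time       # Step 12: elapsed time
print(obj, accur / max(obj, 1), elapsed)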

After applying the frame's echoing technique to obtain the two-dimensional shape, we computed
the probability score for each box by applying the product. A sample example of how to get the
probability scores is as follows:
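As a minimal numeric sketch (the values are made up for illustration), the score of each class in a box is the product of the box's objectness and the corresponding class probability:

import numpy as np

objectness = 0.9                          # confidence that the box contains an object
class_probs = np.array([0.7, 0.2, 0.1])   # placeholder class probabilities

scores = objectness * class_probs         # element-wise product
best = scores.argmax()
print(best, scores[best])                 # 0 0.63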

CHAPTER 4

CONSTRUCTION

An object detection technique lets you understand the details of an image or a video, as it allows
for the recognition, localization, and detection of multiple objects within an image.
It is usually utilized in applications such as image retrieval, security, surveillance, and advanced
driver assistance systems (ADAS). Object detection can be carried out in many ways:

 Feature Based Object Detection
 Viola Jones Object Detection
 SVM Classifications with HOG Features
 Deep Learning Object Detection

DIGITAL IMAGE PROCESSING

Digital image processing is an area characterized by the need for extensive experimental
work to establish the viability of proposed solutions to a given problem. An important
characteristic underlying the design of image processing systems is the significant level of
testing and experimentation that is typically required before arriving at an acceptable solution.

This characteristic implies that the ability to formulate approaches and quickly prototype
candidate solutions generally plays a major role in reducing the cost and time required to arrive
at a viable system implementation.

Processing on image:

Processing on an image can be of three types: low-level, mid-level, and high-level.

Low-level Processing:

 Preprocessing to remove noise.

 Contrast enhancement.

 Image sharpening.

Medium Level Processing:

 Segmentation.

 Edge detection

 Object extraction.

High Level Processing:

 Image analysis

 Scene interpretation
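A brief sketch of the low- and medium-level steps listed above, written with OpenCV ("input.jpg" is a placeholder path):

import cv2
import numpy as np

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

denoised = cv2.GaussianBlur(img, (5, 5), 0)     # preprocessing to remove noise
contrast = cv2.equalizeHist(denoised)           # contrast enhancement

sharpen_kernel = np.array([[0, -1, 0],
                           [-1, 5, -1],
                           [0, -1, 0]])
sharpened = cv2.filter2D(contrast, -1, sharpen_kernel)   # image sharpening

edges = cv2.Canny(sharpened, 100, 200)          # edge detection (medium level)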

YOLO: Real-Time Object Detection


You only look once (YOLO) is a state-of-the-art, real-time object detection system. On a
Pascal Titan X it processes images at 30 FPS and has a mAP of 57.9% on COCO test-
dev.

Comparison to Other Detectors

YOLOv3 is extremely fast and accurate. In mAP measured at .5 IOU YOLOv3 is on par
with Focal Loss but about 4x faster. Moreover, you can easily tradeoff between speed and
accuracy simply by changing the size of the model, no retraining required!

Performance on the COCO Dataset

Model               Train          Test      mAP   FLOPS       FPS

SSD300              COCO trainval  test-dev  41.2  -           46
SSD500              COCO trainval  test-dev  46.5  -           19
YOLOv2 608x608      COCO trainval  test-dev  48.1  62.94 Bn    40
Tiny YOLO           COCO trainval  test-dev  23.7  5.41 Bn     244
SSD321              COCO trainval  test-dev  45.4  -           16
DSSD321             COCO trainval  test-dev  46.1  -           12
R-FCN               COCO trainval  test-dev  51.9  -           12
SSD513              COCO trainval  test-dev  50.4  -           8
DSSD513             COCO trainval  test-dev  53.3  -           6
FPN FRCN            COCO trainval  test-dev  59.1  -           6
RetinaNet-50-500    COCO trainval  test-dev  50.9  -           14
RetinaNet-101-500   COCO trainval  test-dev  53.1  -           11
RetinaNet-101-800   COCO trainval  test-dev  57.5  -           5
YOLOv3-320          COCO trainval  test-dev  51.5  38.97 Bn    45
YOLOv3-416          COCO trainval  test-dev  55.3  65.86 Bn    35
YOLOv3-608          COCO trainval  test-dev  57.9  140.69 Bn   20
YOLOv3-tiny         COCO trainval  test-dev  33.1  5.56 Bn     220
YOLOv3-spp          COCO trainval  test-dev  60.6  141.45 Bn   20

How It Works

Prior detection systems repurpose classifiers or localizers to perform detection. They apply the
model to an image at multiple locations and scales. High-scoring regions of the image are
considered detections.

We use a totally different approach. We apply a single neural network to the full image.
This network divides the image into regions and predicts bounding boxes and
probabilities for each region. These bounding boxes are weighted by the predicted
probabilities. Our model has several advantages over classifier-based systems. It looks at
the whole image at test time so its predictions are informed by global context in the
image. It also makes predictions with a single network evaluation unlike systems like R-
CNN which require thousands for a single image. This makes it extremely fast, more than
1000x faster than R-CNN and 100x faster than Fast R-CNN. See our paper for more
details on the full system.
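In practice, the "high scoring" boxes that overlap on the same object are pruned so that only the strongest one remains. A minimal sketch of this filtering step using OpenCV's built-in non-maximum suppression (the box and score values are made up for illustration):

import cv2

boxes = [[50, 50, 100, 80], [55, 52, 100, 80], [200, 120, 60, 60]]   # x, y, w, h
scores = [0.92, 0.85, 0.60]

keep = cv2.dnn.NMSBoxes(boxes, scores, score_threshold=0.5, nms_threshold=0.4)
print(keep)   # indices of the boxes that survive suppression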

What's New in Version 3?

YOLOv3 uses a few tricks to improve training and increase performance, including:
multi-scale predictions, a better backbone classifier, and more. The full details are in our
paper!

Detection Using a Pre-Trained Model

This section will guide you through detecting objects with the YOLO system using a pre-
trained model. If you don't already have Darknet installed, you should do that first.

Algorithm:

Yolo

Step 1: if (setModel=YoloV3( )|| TinyYoloV3( ))

Step 2: Set the execution path

Step 3: Set the model path to load the model

Step 4: Set timer as default timer to get the elapsed time

Step 5: Detect the objects from input to output image

Step 6: join the detector on inputImage and outputImage

Step 7: obj := 0, accur := 0

Step 8: for each object in detection do

Step 9: {

Step 10: obj := obj + 1; accur := accur + percentProb;

Step 11: }

Step 12: eTime := defaultTimer – startTime;

5.1 Implementation and testing:

Real-Time Detection on a Webcam


Running YOLO on test data isn't very interesting if you can't see the result. Instead of
running it on a bunch of images let's run it on the input from a webcam!

To run this demo, you will need to compile Darknet with CUDA and OpenCV. Then run
the command:

./darknet detector demo cfg/coco.data cfg/yolov3.cfg yolov3.weights

YOLO will display the current FPS and predicted classes as well as the image with
bounding boxes drawn on top of it.

You will need a webcam connected to the computer that OpenCV can connect to or it
won't work. If you have multiple webcams connected and want to select which one to use
you can pass the flag -c <num> to pick (OpenCV uses webcam 0 by default).

You can also run it on a video file if OpenCV can read the video:

./darknet detector demo cfg/coco.data cfg/yolov3.cfg yolov3.weights <video file>
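If Darknet is not available, a similar webcam demo can be sketched in Python with OpenCV's DNN module; this is only a sketch, and the configuration and weight file paths are placeholders.

import cv2

net = cv2.dnn.readNetFromDarknet("cfg/yolov3.cfg", "yolov3.weights")

cap = cv2.VideoCapture(0)                      # webcam 0, as with the -c flag above
while True:
    ok, frame = cap.read()
    if not ok:
        break
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())
    # ... decode boxes from `outputs`, apply NMS, and draw rectangles on `frame` ...
    cv2.imshow("YOLOv3", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):      # press 'q' to quit
        break
cap.release()
cv2.destroyAllWindows()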

Implemented Classes:

Some of the main classes that have been implemented in the project are described below. There
are many kinds of classes present in the weights section of YOLOv3, but the main ones used
here are as follows.

ClassIndex: Used for holding the index of the particular class within the array of class names
that we created from the COCO dataset in our project.

Confidence: Used for holding the confidence score of each detection, i.e. how certain the model
is that the detected object belongs to the predicted class.

Bbox: Used for creating the boxes around the objects that have been detected using YOLO. They
can be green or any colour we want.
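Assuming OpenCV's DetectionModel wrapper is used (an assumption, since the report does not show the loading code), its detect() call returns exactly these three values for every detected object; file paths below are placeholders.

import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

frame = cv2.imread("input.jpg")
class_ids, confidences, boxes = model.detect(frame, confThreshold=0.5, nmsThreshold=0.4)

for class_id, confidence, box in zip(np.array(class_ids).flatten(),
                                     np.array(confidences).flatten(), boxes):
    x, y, w, h = box
    # draw a green bounding box around each detected object
    cv2.rectangle(frame, (int(x), int(y)), (int(x + w), int(y + h)), (0, 255, 0), 2)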

Implemented Functions

Weighted Sum

Inputs to a neuron can either be features from a training set or outputs from the neurons
of a previous layer. Each connection between two neurons has a unique synapse with a
unique weight attached. If you want to get from one neuron to the next, you have to travel
along the synapse and pay the “toll” (weight). The neuron then applies an activation
function to the sum of the weighted inputs from each incoming synapse. It passes the
result on to all the neurons in the next layer. When we talk about updating weights in a
network, we’re talking about adjusting the weights on these synapses.

A neuron’s input is the sum of weighted outputs from all the neurons in the previous
layer. Each input is multiplied by the weight associated with the synapse connecting the
input to the current neuron. If there are 3 inputs or neurons in the previous layer, each
neuron in the current layer will have 3 distinct weights: one for each synapse.
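A minimal numeric sketch of this weighted sum, with three inputs from the previous layer and one weight per synapse (the numbers are made up for illustration):

import numpy as np

inputs = np.array([0.5, 0.1, 0.9])      # outputs of the previous layer
weights = np.array([0.4, -0.2, 0.7])    # one weight per synapse

z = np.dot(inputs, weights)             # the neuron's weighted sum: 0.81
activation = 1 / (1 + np.exp(-z))       # e.g. a sigmoid applied to that sum
print(z, activation)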


Activation function

In a nutshell, the activation function of a node defines the output of that node.

The activation function (or transfer function) translates the input signals to output signals.
It maps the output values onto a range like 0 to 1 or -1 to 1. It is an abstraction that
represents the rate of action-potential firing in the cell: a number that represents the
likelihood that the cell will fire. At its simplest, the function is binary: yes (the neuron
fires) or no (the neuron doesn't fire). The output can be either 0 or 1 (on/off or yes/no), or
it can be anywhere in a range. If you were using a function that maps a range between 0
and 1 to determine the likelihood that an image is a cat, for example, an output of 0.9
would indicate a 90% probability that the image is, in fact, a cat.

Threshold function

This is a step function. If the summed value of the inputs is below a certain threshold, the
function passes on 0; if it is equal to or above the threshold, it passes on 1. It is a very
rigid, straightforward, yes-or-no function.

Sigmoid function

This function is used in logistic regression. Unlike the threshold function, it is a smooth,
gradual progression from 0 to 1. It is especially useful in the output layer and is used
heavily for logistic regression.

Hyperbolic Tangent Function

This function is very similar to the sigmoid function. But unlike the sigmoid function
which goes from 0 to 1, the value goes below zero, from -1 to 1. Even though this isn’t a
lot like what happens in a brain, this function gives better results when it comes to
training neural networks. Neural networks sometimes get “stuck” during training with the
sigmoid function. This happens when there’s a lot of strongly negative input that keeps
the output near zero, which messes with the learning process.

Rectifier function

This might be the most popular activation function in the universe of neural networks.
It is efficient and biologically plausible. Even though it has a kink at zero, it is smooth
and gradual after the kink. This means, for example, that your output would be either
"no" or a percentage of "yes." This function doesn't require normalization or other
complicated calculations.
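The four functions discussed above can be sketched in a few lines of NumPy:

import numpy as np

def threshold(x):      # step function: 0 below the threshold (here 0), 1 at or above it
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):        # smooth, gradual progression from 0 to 1
    return 1 / (1 + np.exp(-x))

def tanh(x):           # like the sigmoid, but ranges from -1 to 1
    return np.tanh(x)

def relu(x):           # rectifier: 0 for negative input, the input itself otherwise
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (threshold, sigmoid, tanh, relu):
    print(f.__name__, f(x))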

This comes under machine learning, in which machines can acquire skills and learn from
past experience without any human involvement. Deep learning is a branch of machine
learning in which artificial neural networks, algorithms inspired by the human brain,
learn from large amounts of data.

5.2 Validation and Verification

Although object detection is a well-studied task, it remains a challenging one. It plays an
essential role in numerous applications such as image identification, automatic image
annotation, and scene understanding. To address the vision problem faced by visually
impaired persons, the proposed work can be used effectively to detect objects, along with
their design patterns, and to identify each of them individually among the multiple
objects in a captured input image with high accuracy and reliable guidance, by locating
the objects on the X-Y plane, accurately calculating their detection percentages, and also
supporting the transformation of input images to speech. The object detection system
also reports its results on multiple objects and compares various methodologies for
discovering artefacts, identifying and collating each step for its effectiveness.

CHAPTER 5

RESULTS AND DISCUSSION


The image or video can be loaded into the object detection model. The interface supports
loading an image, running a module to execute the program, showing the number of
objects detected by the module, and playing audio for the better understanding of
visually impaired persons.

Figure 7: Interface to run the detection modules.

Figure 8: Object detection in an outdoor environment with multi labelling.

Figure 8 shows the loaded image of an outdoor environment on one side; on the other
side, the model has marked all the objects available in the picture with blue-coloured
frames.

Figure 9: Accuracy values of the objects available in the image. Figure 9 shows all the
available objects in the image with their accuracies. The detection module observed that
five bottles, one chair, and eight persons were detected in the loaded image. On playing
the audio, the module says, "Hey! There are five bottles, one chair and eight people
before you."

Figure 10: Object detection in a traffic environment.

Figure 10 shows all the available objects with labels in the traffic-signal environment.

In this case, the model detected seven cars, two trucks, one person, and one bicycle in
front of the user, along with the accuracy of each detection.

Figure 11: Playing the audio output. Figure 11 shows the accuracy of the objects available
in the loaded image. By playing the "play audio" module, visually impaired people can
listen to the types of objects in the surrounding environment and their count.

Figure 12: Object detection at a traffic signal.

Figure 12 shows all the detected objects at a traffic signal, used for a comparative
analysis of the results of RetinaNet, YOLOv3, and Tiny YOLO on the same image.

CONCLUSION AND FUTURE WORKS

Although object detection is a well-studied task, it remains a challenging one. It plays an
essential role in numerous applications such as image identification, automatic image annotation,
and scene understanding. To address the vision problem faced by visually impaired persons, the
proposed work can be used effectively to detect objects, along with their design patterns, and to
identify each of them individually among the multiple objects in a captured input image with
high accuracy and reliable guidance, by locating the objects on the X-Y plane, accurately
calculating their detection percentages, and also supporting the transformation of input images to
speech. The object detection system also reports its results on multiple objects and compares
various methodologies for discovering artefacts, identifying and collating each step for its
effectiveness.

For humans and many other animals, visual perception is one of the most important senses; we
heavily rely on vision whenever we interact with our environment. In order to pick up a glass, we
need to first determine which part of our visual impression corresponds to the glass before we
can find out where we have to move our hands in order to grasp it.

The same code that can be used to recognize stop signs or pedestrians in a self-driving vehicle
can also be used to find cancer cells in a tissue biopsy.

If we want to recognize another human, we first have to find out which part of the image we see
represents that individual, as well as any distinguishing factors of their face.

Notably, we generally do not actively consider these basic steps, but these steps pose a major
challenge for artificial systems dealing with image processing.

SHIVAJIRAO KADAM INSTITUTE OF TECHNOLOGY AND MANAGEMENT –
TECHNICAL CAMPUS, INDORE
Department of Computer Science & Engineering
Synopsis
On
Object Detection

1. Problem Domain
People who can’t see, face problems in recognizing things by feeling them. If there
will be some other way or felicity in recognizing things like by someone telling
them or any software that tells them the name and distance of thing from a respective
point.

2. Solution Domain
A person cannot remain with a blind person all the time to assist them in recognizing
objects.
Our project will aid blind people in recognizing things by telling them the name of each
thing.
This is an object detection system which currently works only for recognizing the
name or type of things, not their distance from a point.

3. System Domain
A smart device with a camera and a speaker.
The camera first captures a single frame from the live feed, and then the code scans the
frame to identify the objects. Once object recognition is complete, the names or
information are sent to the speaker. Finally, the speaker dictates the names or
information it has.
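A minimal sketch of this capture-detect-speak pipeline, assuming the pyttsx3 text-to-speech package; the detect() helper below is hypothetical and stands in for the YOLO detector described in this report.

import cv2
import pyttsx3

def detect(frame):
    # hypothetical placeholder for the YOLO-based detector; it should return a list
    # of detected object names, e.g. ["person", "chair"]
    return []

cap = cv2.VideoCapture(0)
engine = pyttsx3.init()

ok, frame = cap.read()                  # capture a single frame from the live feed
if ok:
    names = detect(frame)               # scan the frame to identify objects
    if names:
        engine.say("I can see " + ", ".join(names))   # send the names to the speaker
        engine.runAndWait()
cap.release()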

4. Application Domain
Every blind person can get help from this.

5. Expected Outcomes/ Benefits:


(a) Improve Accuracy. The most significant advantage of object detection systems is that
they can be more accurate than human vision.
(b) Deliver Faster Results.
(c) Reduce Costs.
(d) Provide Unbiased Results.
(e) Offer a Unique Customer Experience.

6. References:
Kaggle
Darknet

Guided By: Prof. Jayesh Umre

Group Members:
Anmol Saxena (0875CS191014)
Kratagya Mourya (0875CS191054)
Abhinav Shrivastava (0875CS191004)
Aryan Patel (0875CS191018)

SNAPSHOTS OF THE CODE AND THE OUTPUT:

REFERENCES

1. Asa Gautam, Anjana Kumari, Pankaj Singh, "The Concept of Object Recognition",
International Journal of Advanced Research in Computer Science and Software Engineering,
Volume 5, Issue 3, March 2015.

2. Tatsuro U., Hirohiko K., Tetsuo T., Akihisa O., Shin'ichi Y., "Visual Information Assist System
Using 3D SOKUIKI Sensor for Blind People", Proceedings of the 32nd Annual Conference of the
IEEE Industrial Electronics Society (IECON), 2006.

3. TutorialsPoint, "Supervised Learning", 2020. [Online]. Available:
https://www.tutorialspoint.com/artificial_neural_network/artificial_neural_network_supervised_learning.htm

4. V. Marco, "What Is Machine Learning? A Definition", 2017. [Online]. Available:
www.expertsystem.com/machine-learning-definition/

5. C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

