
YOU ONLY LOOK ONCE-YOLO

SEMINAR REPORT

Submitted by

DEVI.P

Reg.No:19TH0412

in partial fulfillment for the award of the degree

of

BACHELOR OF TECHNOLOGY

in

DEPARTMENT OF INFORMATION TECHNOLOGY


MANAKULA VINAYAGAR INSTITUTE OF TECHNOLOGY

KALITHEERTHALKUPPAM, PUDUCHERRY- 605 107

PONDICHERRY UNIVERSITY

DECEMBER 2022

MANAKULA VINAYAGAR INSTITUTE OF TECHNOLOGY
KALITHEERTHAL KUPPAM, PUDUCHERRY- 605 107

PONDICHERRY UNIVERSITY

DEPARTMENT OF INFORMATION TECHNOLOGY

BONAFIDE CERTIFICATE

This is to certify that the Seminar Report titled “YOU ONLY LOOK ONCE-YOLO” is a bonafide record of work done by DEVI.P (Reg. No: 19TH0412) of Seventh Semester B.TECH in INFORMATION TECHNOLOGY for the SEMINAR REPORT during the academic year 2022-2023.

Staff in charge                                        Head of the Department
TABLE OF CONTENTS

CHAPTER NO    TITLE

1    INTRODUCTION
     1.1 HISTORY OF YOLO
     1.2 INTRODUCTION TO YOLO

2    VERSIONS OF YOLO
     2.1 YOLOV1
     2.2 YOLOV2
     2.3 YOLOV3
     2.4 YOLOV4
     2.5 YOLOV5

3    WORKING OF YOLO
     3.1 BASIC WORKING OF YOLO OBJECT DETECTOR MODELS
         3.1.1 RESIDUAL BLOCKS
         3.1.2 BOUNDING BOX REGRESSION
         3.1.3 INTERSECTION OVER UNION (IOU)

4    IMPORTANCE OF YOLO
     4.1 IMPORTANCE OF YOLO

5    CONCLUSION

REFERENCES
CHAPTER-1
INTRODUCTION

1.1 HISTORY OF YOLO


Originally introduced by Joseph Redmon in Darknet, YOLO has come a long way. Here are a few things that made YOLO's first version stand out against R-CNN and DPM:
● Real-time frame processing at 45 FPS
● Fewer false positives on the background
● Higher detection accuracy (although lower accuracy on localization)
Any computer vision enthusiast has surely heard of the YOLO models for object detection. Ever since YOLOv1 was introduced in 2015, it has garnered immense popularity within the computer vision community. Subsequently, YOLOv2, YOLOv3, YOLOv4, and YOLOv5 were released, albeit by different people. This report gives a brief background on all the object detection models of the YOLO family, from YOLOv1 to YOLOv5. The algorithm has continued to evolve ever since its initial release. Both YOLOv2 and YOLOv3 were written by Joseph Redmon. After YOLOv3, new authors anchored their own goals in every subsequent YOLO release.

YOLOv2: Released in 2017, this version earned an honorable mention at CVPR 2017 because of significant improvements, including anchor boxes and higher-resolution input.

YOLOv3: The 2018 release added an objectness score to the bounding box predictions and connections to the backbone network layers. It also provided improved performance on small objects thanks to the ability to run predictions at three different levels of granularity.

YOLOv4: The April 2020 release was the first paper not authored by Joseph Redmon. Here Alexey Bochkovskiy introduced novel improvements, including the Mish activation function and improved feature aggregation.

YOLOv5: Glenn Jocher continued to make further improvements in his June 2020 release,
focusing on the architecture itself.

1.2 INTRODUCTION TO YOLO
Standing for You Only Look Once, YOLO is a regression-based algorithm that falls under the class of real-time object detection methods, with a multitude of computer vision applications. The YOLO algorithm employs convolutional neural networks (CNNs) to detect objects in real time. As the name suggests, the algorithm requires only a single forward propagation through a neural network to detect objects. This means that prediction over the entire image is done in a single algorithm run.

The CNN is used to predict various class probabilities and bounding boxes simultaneously. The YOLO algorithm has various variants; some of the common ones include Tiny YOLO and YOLOv3.

The algorithm uses a single bounding box regression to predict attributes like height, width, centre, and object class. It cornered the market because of its accuracy, demonstrated speed, and ability to detect objects in a single run, surpassing Fast R-CNN, RetinaNet, and the Single-Shot MultiBox Detector (SSD). The R-CNN family was too slow: it took longer to find the proposed regions for the bounding boxes, train a model, detect and classify regions, and then check for refined outputs, all in separate steps.

In many tasks, extreme levels of accuracy are not imperative, so it is reasonable to rely on slightly less accurate but faster-to-train methods. Hence YOLO's unprecedented emergence. First, it improves detection time, given that it predicts objects in real time. Second, YOLO provides accurate results with minimal background errors. And finally, the algorithm has excellent learning capabilities that enable it to learn the representations of objects and apply them in object detection tasks.

Figure:1.1 Object detection using YOLO

CHAPTER-2
VERSIONS OF YOLO

2.1 YOLOV1 – THE BEGINNING


The first YOLO model was introduced by Joseph Redmon et al. in their 2015 paper titled “You Only Look Once: Unified, Real-Time Object Detection”. Until then, R-CNN models were the most sought-after models for object detection. Although the R-CNN family of models was accurate, it was relatively slow, because it involved a multi-step process of finding the proposed regions for the bounding boxes, then classifying these regions, and finally post-processing to refine the output. YOLO was created with the goal of doing away with this multi-stage pipeline and performing object detection in just a single stage, thus reducing the inference time.
2.1.1 PERFORMANCE
YOLOv1 sported a 63.4 mAP with an inference speed of 45 frames per second (22 ms per image). At the time, this was a huge improvement in speed over the R-CNN family, whose inference times ranged from 143 ms to 20 seconds per image.

Figure:2.1 Performance of YOLOV1


The basic working of the YOLO model relies upon its unified detection technique, which groups together the different components of object detection into a single feed-forward neural network. The model divides an incoming image into a grid of cells and calculates the probability of an object residing inside each cell. After that, the algorithm groups nearby high-probability cells as a single object. Low-value predictions are discarded using a technique called Non-Max Suppression (NMS).
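
To make the grid idea concrete, the following minimal sketch (an illustration, not the original implementation) computes the shape of the YOLOv1 output tensor for its standard configuration of a 7 x 7 grid, 2 boxes per cell, and the 20 PASCAL VOC classes:

    # Each of the S x S grid cells predicts B boxes (x, y, w, h, confidence)
    # plus C conditional class probabilities.
    S, B, C = 7, 2, 20           # standard YOLOv1 configuration on PASCAL VOC
    depth = B * 5 + C            # 2 * 5 + 20 = 30 values per cell
    print((S, S, depth))         # (7, 7, 30) -> 1470 predictions per image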

The model is trained in a similar fashion: the centre of each detected object is compared with the ground truth in order to check whether the model is correct, and the weights are adjusted accordingly.

Figure:2.2 YOLOV1

2.2 YOLOV2 – BETTER, FASTER, STRONGER


YOLOv2 was released by Joseph Redmon and Ali Farhadi in 2016 in their paper titled “YOLO9000: Better, Faster, Stronger”. The 9000 signified that YOLOv2 was able to detect over 9000 categories of objects. This version brought various improvements over the previous version, YOLOv1.

2.2.1 PERFORMANCE
YOLOv2 registered 78.6 mAP on the VOC 2012 dataset. We can see in the table below that it performed very well on the VOC 2012 dataset compared to other object detection models.

Figure:2.3 Accuracy comparison: state-of-the-art accuracy with 2-10 times better inference rates

2.2.2 TECHNICAL IMPROVEMENTS
The YOLOv2 version introduced the concept of anchor boxes. Anchor boxes are predefined boxes for an image that illustrate the idealized positions and shapes of the objects to be detected. We calculate the intersection over union (IoU) of the predicted bounding box and the predefined anchor box. The IoU value acts as a threshold to decide whether the probability of the detected object is sufficient to make a prediction or not.

Figure:2.4 Intersection over Union (IoU)

But in the case of YOLO, anchor boxes are not computed randomly. Instead, the YOLO algorithm examines the training data and performs clustering on the ground-truth box dimensions. All this is done to ensure that the anchor boxes that are used represent the data on which the model will be trained. This helps enhance the accuracy a lot.
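
As an illustration of this dimension clustering, here is a minimal k-means sketch in the spirit of YOLOv2, using 1 - IoU as the distance between a box and a cluster centroid. The input array of ground-truth (width, height) pairs is a hypothetical stand-in for a real dataset:

    import numpy as np

    def iou_wh(boxes, centroids):
        # IoU computed on (width, height) only, as if all boxes
        # shared the same top-left corner
        w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
        h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
        inter = w * h
        union = (boxes[:, 0] * boxes[:, 1])[:, None] \
              + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
        return inter / union

    def kmeans_anchors(boxes, k, iters=100):
        centroids = boxes[np.random.choice(len(boxes), k, replace=False)]
        for _ in range(iters):
            # assign each box to the centroid with the smallest 1 - IoU
            assign = np.argmax(iou_wh(boxes, centroids), axis=1)
            centroids = np.array([boxes[assign == i].mean(axis=0)
                                  if np.any(assign == i) else centroids[i]
                                  for i in range(k)])
        return centroids

    boxes = np.random.rand(1000, 2)      # hypothetical (w, h) pairs
    print(kmeans_anchors(boxes, k=5))    # five representative anchor shapes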

Figure:2.5 Anchor Boxes and Dimension Clusters.

2.2.3 ADDITIONAL IMPROVEMENTS


In order to adapt to different aspect ratios, the YOLOv2 model is randomly resized throughout the training process. To make the model more robust, YOLOv2 was trained on a combination of the COCO detection dataset and the ImageNet classification dataset. When the model processes an image with detection labels, both the detection and classification errors are calculated, whereas when the model sees an image with classification labels only, it back-propagates the classification error only. This joint hierarchy of labels is called the WordTree. Inference speeds of up to 200 FPS and an mAP of 75.3 were achieved using a classification network architecture called Darknet-19.

2.3 YOLOV3 – AN INCREMENTAL IMPROVEMENT


In 2018, Joseph Redmon and Ali Farhadi introduced YOLOv3, the third version, in their paper “YOLOv3: An Incremental Improvement”. This model was a little bigger than the earlier ones but more accurate, and it was still fast enough.

2.3.1 PERFORMANCE
YOLOv3-320 has an mAP of 28.2 with an inference time of 22 milliseconds (on the COCO dataset). This is three times faster than the SSD object detection technique, with similar accuracy.

Figure:2.6 Difference between YOLO v1,v2,v3

2.3.2 TECHNICAL IMPROVEMENTS


YOLOv3 consists of 75 convolutional layers, without fully connected or pooling layers, which greatly reduces the model size and weight. It provides the best of both worlds, i.e. using residual blocks for multi-scale feature learning together with a feature pyramid network, while maintaining minimal inference times.
● A feature pyramid network (FPN) is a feature extractor that extracts features of different types, forms, and sizes for a single image. It concatenates all the features so that the model can learn both local and general features (a minimal sketch is given after this list).
● By employing logistic classifiers and activations, the class predictions of YOLOv3 go above and beyond RetinaNet-50 and RetinaNet-101 in terms of accuracy.
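
As a rough sketch of the feature-pyramid idea (a minimal illustration, not YOLOv3's exact neck; the channel counts and map sizes are arbitrary assumptions), the module below merges a coarse, semantically strong feature map into a finer one via 1 x 1 lateral convolutions, upsampling, and concatenation:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyFPN(nn.Module):
        # Merge two backbone feature maps top-down, FPN-style
        def __init__(self, c_fine=256, c_coarse=512, c_out=128):
            super().__init__()
            self.lat_fine = nn.Conv2d(c_fine, c_out, kernel_size=1)
            self.lat_coarse = nn.Conv2d(c_coarse, c_out, kernel_size=1)
            self.smooth = nn.Conv2d(2 * c_out, c_out, kernel_size=3, padding=1)

        def forward(self, fine, coarse):
            top = self.lat_coarse(coarse)                         # reduce channels
            top_up = F.interpolate(top, scale_factor=2.0)         # upsample 2x
            merged = torch.cat([self.lat_fine(fine), top_up], 1)  # concatenate
            return self.smooth(merged)

    fine = torch.randn(1, 256, 52, 52)     # finer, shallower feature map
    coarse = torch.randn(1, 512, 26, 26)   # coarser, deeper feature map
    print(TinyFPN()(fine, coarse).shape)   # torch.Size([1, 128, 52, 52])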
2.4 YOLOV4- SPEED AND ACCURACY OF OBJECT DETECTION
YOLOv4 was not released by Joseph Redmon but by Alexey Bochkovskiy et al. in their 2020 paper “YOLOv4: Optimal Speed and Accuracy of Object Detection”.

2.4.1 PERFORMANCE
The YOLOv4 model stands atop other detection models like EfficientDet and ResNeXt-50. It uses the CSPDarknet53 backbone, an evolution of the Darknet-53 backbone used by YOLOv3.

2.4.2 TECHNICAL IMPROVEMENTS


YOLOv4 introduced the concepts of the bag of freebies (techniques that bring about an enhancement in model performance without increasing the inference cost) and the bag of specials (techniques that increase accuracy at some increase in computation cost). It has a speed of 62 frames per second with an mAP of 43.5 percent on the COCO dataset.

MODEL      LAYERS    FPS     mAP

YOLOv1     26        45      89.4
YOLOv2     32        42      91.2
YOLOv3     106       20      94.6
YOLOv4     53        22.6    84.3
YOLOv5     16        62      91.2

Figure:2.7 Difference between YOLO v1, v2, v3, v4 and v5

2.5 YOLOV5
YOLOv5 is supposedly the next member of the YOLO family, released in 2020 by the company Ultralytics just a few days after YOLOv4. No paper has been released, and there is debate in the community about whether it justifies the YOLO branding, as it is essentially a PyTorch implementation of YOLOv3.
2.5.1 PERFORMANCE
Since there is no official paper yet, the authenticity of the reported performance cannot be guaranteed. It achieves the same if not better accuracy (mAP of 55.6) than the other YOLO models while requiring less computing power.

Figure:2.8 Performance of YOLOV5

2.5.2 TECHNICAL IMPROVEMENTS


● Better data augmentation and loss calculations (now that the base of the model has shifted from C to PyTorch)
● Auto-learning of anchor boxes (they no longer need to be specified manually)
● Use of cross-stage partial connections (CSP) in the backbone
● Use of a path aggregation network (PAN) in the neck of the model
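
Since YOLOv5 ships as a PyTorch repository rather than a paper, the simplest way to try it is through torch.hub. The snippet below is a minimal usage sketch based on the Ultralytics examples; the model name, the sample image path, and the results API are assumptions that may change between releases:

    import torch

    # Load the small YOLOv5 variant from the Ultralytics repository
    model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

    # Single forward pass on an image (a path, URL, or array)
    results = model("zidane.jpg")   # hypothetical sample image

    results.print()                 # summary of detected classes and speed
    print(results.xyxy[0])          # (x1, y1, x2, y2, confidence, class) rows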

CHAPTER-3
WORKING OF YOLO

3.1 BASIC WORKING OF YOLO OBJECT DETECTOR MODELS


YOLO-based models continue to take over the space, and the way these models operate is based on three fundamental techniques: the residual block, bounding box regression, and intersection over union.
3.1.1 RESIDUAL BLOCKS
First, the image is divided into grid cells. The grid has a dimension of S x S cells. The following image shows how an input image is divided into the grid.

Figure:3.1 Residual Block

In the image, there are many grid cells of equal dimension. Every grid cell detects objects that appear within it. For example, if an object's centre appears within a certain grid cell, then this cell is responsible for detecting it.
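
As a small illustration of this "responsible cell" rule, the helper below (a sketch; the image and grid sizes are assumed) maps an object's centre in pixel coordinates to the grid cell that must detect it:

    def responsible_cell(cx, cy, img_w, img_h, S=7):
        # Return the (row, col) of the grid cell containing the object centre
        col = int(cx / img_w * S)   # which of the S columns the centre falls in
        row = int(cy / img_h * S)
        # clamp in case the centre lies exactly on the right/bottom edge
        return min(row, S - 1), min(col, S - 1)

    # A centre at (200, 300) in a 448 x 448 image falls in cell (4, 3)
    print(responsible_cell(200, 300, 448, 448))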

3.1.2 BOUNDING BOX REGRESSION


A bounding box is an outline that highlights an object in an image. Every bounding box in the image consists of the following attributes: width, height, class, and bounding box centre. The following image shows an example of a bounding box, represented by a yellow outline. YOLO uses a single bounding box regression to predict the height, width, centre, and class of objects, along with a value pc that represents the probability of an object appearing in the bounding box.
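
To make these attributes concrete, the sketch below converts a predicted box from the common centre/width/height convention (assumed here to be normalised to the image size) into pixel corner coordinates for drawing:

    def to_corners(x, y, w, h, img_w, img_h):
        # Convert a normalised (centre-x, centre-y, width, height) box
        # into pixel (x1, y1, x2, y2) corner coordinates
        x1 = (x - w / 2) * img_w
        y1 = (y - h / 2) * img_h
        x2 = (x + w / 2) * img_w
        y2 = (y + h / 2) * img_h
        return x1, y1, x2, y2

    # A box centred mid-image, half the image wide and a third tall:
    print(to_corners(0.5, 0.5, 0.5, 0.33, 448, 448))
    # (112.0, 150.08, 336.0, 297.92)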

Figure:3.2 Bounding Box Regression

3.1.3 INTERSECTION OVER UNION (IOU)


Intersection over union (IoU) is a measure used in object detection to describe how much two boxes overlap. YOLO uses IoU to provide an output box that surrounds the object tightly. Each grid cell is responsible for predicting bounding boxes and their confidence scores. The IoU is equal to 1 if the predicted bounding box is the same as the real box. This mechanism eliminates bounding boxes that deviate from the real box. The following image provides a simple example of how IoU works. In the image below, there are two bounding boxes, one in green and the other one in blue. The blue box is the predicted box, while the green box is the real box. YOLO trains its predictions so that the two boxes coincide as closely as possible.
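
A minimal IoU implementation (a sketch assuming boxes given as (x1, y1, x2, y2) corner coordinates) looks like this:

    def iou(box_a, box_b):
        # Intersection over union of two (x1, y1, x2, y2) boxes
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    print(iou((0, 0, 100, 100), (50, 50, 150, 150)))   # 0.142857...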

Figure:3.3 Intersection Over Union

As with every ML-based model, precision and recall are very important for judging its accuracy and robustness. Thus, the creators of YOLO kept trying to come up with an object detection model that maximises mAP (mean average precision). Besides this, the architectures of all the YOLO models share a similar set of components, outlined below.

1. Backbone: A convolutional neural network that accumulates and produces visual features of different shapes and sizes. Classification models like ResNet, VGG, and EfficientNet are used as feature extractors.
2. Neck: A set of layers that integrate and blend features before passing them on to the prediction layer. Examples: feature pyramid network (FPN), path aggregation network (PAN), and Bi-FPN.
3. Head: Takes in features from the neck along with the bounding box predictions, and performs classification and regression on the features and bounding box coordinates to complete the detection process. For each box it outputs four values, generally the x, y coordinates along with the width and height. (A schematic sketch of this three-part layout follows this list.)
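
The sketch below shows this three-part layout schematically in PyTorch. The layer sizes and counts are arbitrary placeholders, not the architecture of any particular YOLO version:

    import torch
    import torch.nn as nn

    S, B, C = 7, 2, 20                       # grid size, boxes per cell, classes

    # Backbone: extracts visual features (a stand-in for ResNet/Darknet/etc.)
    backbone = nn.Sequential(
        nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
        nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
    )
    # Neck: integrates and blends features (a single conv as a placeholder)
    neck = nn.Sequential(nn.Conv2d(128, 256, 3, padding=1), nn.LeakyReLU(0.1))
    # Head: maps features to per-cell boxes, objectness, and class scores
    head = nn.Sequential(
        nn.AdaptiveAvgPool2d((S, S)),
        nn.Conv2d(256, B * 5 + C, 1),        # (x, y, w, h, conf) per box + classes
    )

    x = torch.randn(1, 3, 448, 448)          # a dummy input image
    print(head(neck(backbone(x))).shape)     # torch.Size([1, 30, 7, 7])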
So the next obvious question is: how does YOLO work? Say we have a CNN that has been trained to recognise several classes, including a traffic light, a car, a person, and a truck. We give it two types of anchor boxes, a tall one and a wide one, so that it can handle overlapping objects of different shapes. Once the CNN has been trained, we can detect objects in images by feeding in new test images.

YOLO can work well for multiple objects where each object is associated with one grid cell. But in the case of overlap, in which one grid cell actually contains the centre points of two different objects, we can use something called anchor boxes to allow one grid cell to detect multiple objects.
Figure:3.4 Anchor boxes in action

In the image above, we see a person and a car overlapping, so part of the car is obscured. We can also see that the centres of both bounding boxes, the car's and the pedestrian's, fall in the same grid cell. Since the output vector of each grid cell can only have one class, it would be forced to pick either the car or the person. But by defining anchor boxes, we can create a longer grid cell vector and associate multiple classes with each grid cell.

Anchor boxes have a defined aspect ratio, and they try to detect objects that fit nicely into a box with that ratio. For example, since we are detecting a wide car and a standing person, we define one anchor box that is roughly the shape of a car (wider than it is tall) and another anchor box that can fit a standing person inside it (taller than it is wide). The test image is first broken up into a grid, and the network then produces output vectors, one for each grid cell. These vectors tell us whether a cell has an object in it, what class the object is, and the bounding boxes for the object. Since we are using two anchor boxes, we get two predicted anchor boxes for each grid cell. In fact, most of the predicted anchor boxes will have a low pc value.

After producing these output vectors, we use non-maximal suppression to get rid of unlikely bounding boxes. For each class, non-maximal suppression removes the bounding boxes that have a pc value lower than some given threshold. YOLO uses Non-Maximal Suppression (NMS) to keep only the best bounding box.

The first step in NMS is to remove all the predicted bounding boxes that have a detection probability less than a given NMS threshold. In the code below, we set this NMS threshold to 0.6, meaning that all predicted bounding boxes with a detection probability below 0.6 are removed. After removing these, the second step in NMS is to select the bounding box with the highest detection probability and eliminate all the bounding boxes whose Intersection over Union (IoU) with it exceeds a given IoU threshold. In the code below, we set this IoU threshold to 0.4, meaning that all predicted boxes with an IoU greater than 0.4 with respect to the best box are removed.
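
The code this passage refers to is not included in the report, so the following is a minimal self-contained sketch of the same two-step procedure, using the 0.6 detection threshold and 0.4 IoU threshold mentioned above:

    def iou(a, b):
        # IoU of two (x1, y1, x2, y2) boxes, as in Section 3.1.3
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = (a[2] - a[0]) * (a[3] - a[1]) \
              + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / union

    def non_max_suppression(detections, nms_thresh=0.6, iou_thresh=0.4):
        # detections: list of (box, pc, class_id) tuples
        # Step 1: drop boxes whose detection probability is below 0.6
        detections = [d for d in detections if d[1] >= nms_thresh]
        # Step 2: keep the most confident box per class, remove heavy overlaps
        detections.sort(key=lambda d: d[1], reverse=True)
        kept = []
        while detections:
            best = detections.pop(0)
            kept.append(best)
            detections = [d for d in detections
                          if d[2] != best[2] or iou(d[0], best[0]) <= iou_thresh]
        return kept

    dets = [((0, 0, 100, 100), 0.9, 0), ((10, 10, 110, 110), 0.8, 0),
            ((200, 200, 300, 300), 0.7, 1), ((0, 0, 50, 50), 0.3, 0)]
    print(non_max_suppression(dets))    # keeps the 0.9 and 0.7 detections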

NMS thus repeatedly selects the bounding box with the highest pc value and removes the boxes that are too similar to it, until all of the non-maximal bounding boxes have been removed for every class. The end result looks like the image below, where we can see that YOLO has effectively detected many objects in the image, such as cars and people.

Figure:3.5 YOLO Object Detection

CHAPTER-4
IMPORTANCE OF YOLO
4.1 IMPORTANCE OF YOLO
4.1.1 SPEED
● This algorithm improves the speed of detection because it can predict objects in real-time.
4.1.2 HIGH ACCURACY
● YOLO is a predictive technique that provides accurate results with minimal background
errors.
4.1.3 LEARNING CAPABILITIES
● The algorithm has excellent learning capabilities that enable it to learn the representations
of objects and apply them in object detection.

The YOLO algorithm can be used in various applications such as autonomous driving, security, and wildlife monitoring.
● Autonomous driving: The YOLO algorithm can be used in autonomous cars to detect objects around the car, such as vehicles, people, and parking signals. Object detection in autonomous cars is done to avoid collisions, since no human driver is controlling the car.
● Wildlife: The algorithm is used to detect various types of animals in forests. This type of detection is used by wildlife rangers and journalists to identify animals in videos (both recorded and real-time) and images. Some of the animals that can be detected include giraffes, elephants, and bears.
● Security: YOLO can also be used in security systems to enforce security in an area. Suppose people have been restricted from passing through a certain area for security reasons. If someone passes through the restricted area, the YOLO algorithm will detect them, prompting the security personnel to take further action.

CHAPTER-5
CONCLUSION

This report has provided an overview of the YOLO algorithm and how it is used in object detection. The technique presented in the original paper provides improved detection results compared to other object detection techniques such as Fast R-CNN and RetinaNet. As with all other computer vision algorithms, due to various unpredictable factors in real-world applications (lighting conditions, the human factor), there is no single model for every problem, including the problem of store shelf detection. The YOLO algorithm is trained on entire image inputs; it does not isolate and identify specific objects in stages but rather processes the entire image at once. This enables it to encode not only class appearance data but also contextual data, which is why the YOLO algorithm is not easily affected by noise or background data when trying to detect specific targets in real time. This seminar also covered the various versions YOLOv1, YOLOv2, YOLOv3, YOLOv4, and YOLOv5. Through this seminar you have gained an overview of object detection and the YOLO algorithm, gone through the main reasons why the YOLO algorithm is important, learned how the YOLO algorithm works, gained an understanding of the main techniques used by YOLO to detect objects, and seen real-life applications of YOLO.

REFERENCES

[1] Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi, "You Only Look Once: Unified, Real-Time Object Detection", 2016 IEEE Conference on Computer Vision and Pattern Recognition.

[2] N. Murali Krishna, Ramidi Yashwanth Reddy, Mallu Sai Chandra Reddy, Kasibhatla Phani Madhav, Gaikwad Sudham, "Object Detection and Tracking Using Yolo", 2021 Third International Conference on Inventive Research in Computing Applications.

[3] Rachel Huang, Jonathan Pedoeem, Cuixian Chen, "YOLO-LITE: A Real-Time Object Detection Algorithm Optimised for Non-GPU Computers", 2018 IEEE International Conference on Big Data.

[4] Irvine Valiant Fanthony, Zaenal Husin, Hera Hikmarika, Suci Dwijayanti, Bhakti Yudho Suprapto, "YOLO Algorithm-Based Surrounding Object Identification on Autonomous Electric Vehicle", 2021 8th International Conference on Electrical Engineering, Computer Science and Informatics.

[5] Pranav Adarsh, Pratibha Rathi, Manoj Kumar, "YOLO v3-Tiny: Object Detection and Recognition Using One Stage Improved Model", 2020 6th International Conference on Advanced Computing and Communication Systems.

[6] Fan Wu, Guoqing Jin, Mingyu Gao, Zhiwei He, Yuxiang Yang et al., "Helmet Detection Based on Improved YOLO V3 Deep Model", 2019 IEEE 16th International Conference on Networking, Sensing and Control (ICNSC).
