
YOU ONLY LOOK ONCE-YOLO

SEMINAR REPORT

Submitted by

DEVI.P

Reg.No:19TH0412

in partial fulfillment for the award of the degree

of

BACHELOR OF TECHNOLOGY

in

DEPARTMENT OF INFORMATION TECHNOLOGY


MANAKULA VINAYAGAR INSTITUTE OF TECHNOLOGY

KALITHEERTHALKUPPAM, PUDUCHERRY- 605 107

PONDICHERRY UNIVERSITY

DECEMBER 2022

MANAKULA VINAYAGAR INSTITUTE OF TECHNOLOGY
KALITHEERTHAL KUPPAM, PUDUCHERRY- 605 107

PONDICHERRY UNIVERSITY

DEPARTMENT OF INFORMATION TECHNOLOGY

BONAFIDE CERTIFICATE

This is to certify that the Seminar Report titled “YOU ONLY LOOK ONCE-YOLO” is a bonafide record of work done by DEVI.P (Reg. No: 19TH0412) of Seventh Semester B.TECH in INFORMATION TECHNOLOGY for the SEMINAR REPORT during the academic year 2022-2023.

Staff in charge                                        Head of the Department
TABLE OF CONTENTS

CHAPTER NO    TITLE

1    INTRODUCTION
     1.1 HISTORY OF YOLO
     1.2 INTRODUCTION TO YOLO

2    VERSIONS OF YOLO
     2.1 YOLOV1
     2.2 YOLOV2
     2.3 YOLOV3
     2.4 YOLOV4
     2.5 YOLOV5

3    WORKING OF YOLO
     3.1 BASIC WORKING OF YOLO OBJECT DETECTOR MODELS
         3.1.1 RESIDUAL BLOCKS
         3.1.2 BOUNDING BOX REGRESSION
         3.1.3 INTERSECTION OVER UNION (IOU)

4    IMPORTANCE OF YOLO
     4.1 IMPORTANCE OF YOLO

5    CONCLUSION

REFERENCES
CHAPTER-1
INTRODUCTION

1.1 HISTORY OF YOLO


Originally introduced by Joseph Redmon in Darknet, YOLO has come a long way. Here are a few things that made YOLO's first version stand out against R-CNN and DPM:
● Real-time frame processing at 45 FPS
● Fewer false positives on the background
● Higher detection accuracy (although lower accuracy on localization)
Any computer vision enthusiast has surely heard of the YOLO models for object detection. Ever since YOLOv1 was introduced in 2015, it has garnered immense popularity within the computer vision community. Subsequently, YOLOv2, YOLOv3, YOLOv4, and YOLOv5 were released, albeit by different people. This report gives a brief background on all the object detection models of the YOLO family, from YOLOv1 to YOLOv5. The algorithm has continued to evolve ever since its initial release. Both YOLOv2 and YOLOv3 were written by Joseph Redmon. After YOLOv3, new authors anchored their own goals in every subsequent YOLO release.

YOLOv2: Released in 2017, this version earned an honorable mention at CVPR 2017 because of significant improvements, including anchor boxes and higher-resolution input.

YOLOv3: The 2018 release added an objectness score to the bounding box predictions and connections to the backbone network layers. It also provided improved performance on small objects thanks to the ability to run predictions at three different levels of granularity.

YOLOv4: The April 2020 release was the first paper not authored by Joseph Redmon. Here Alexey Bochkovskiy introduced novel improvements, including the Mish activation function and improved feature aggregation.

YOLOv5: Glenn Jocher continued to make further improvements in his June 2020 release,
focusing on the architecture itself.

1.2 INTRODUCTION TO YOLO
Standing for You Only Look Once, YOLO is a regression-based algorithm that falls under the class of real-time object detection methods, with a multitude of computer vision applications. The YOLO algorithm employs convolutional neural networks (CNNs) to detect objects in real time. As the name suggests, the algorithm requires only a single forward propagation through a neural network to detect objects. This means that prediction over the entire image is done in a single algorithm run.

The CNN is used to predict various class probabilities and bounding boxes simultaneously. The YOLO algorithm has various variants; some of the common ones include Tiny YOLO and YOLOv3.

The algorithm uses a single bounding box regression to predict attributes like height, width, centre, and object class. It cornered the market because of its accuracy, demonstrated speed, and ability to detect objects in a single run, surpassing Fast R-CNN, RetinaNet, and the Single-Shot MultiBox Detector (SSD). The R-CNN family was too slow: it took longer to find the proposed regions for the bounding boxes, train a model, detect and classify regions, and then check for refined outputs, all in separate steps.

In many tasks, extreme levels of accuracy are not imperative, so it is reasonable to rely on slightly less accurate but faster-to-train methods. Hence YOLO's unprecedented emergence. First, it improves detection time, given that it predicts objects in real time. Second, YOLO provides accurate results with minimal background errors. And finally, the algorithm has excellent learning capabilities that enable it to learn the representations of objects and apply them in object detection tasks.

Figure:1.1 Object detection using YOLO

CHAPTER-2
VERSIONS OF YOLO

2.1 YOLOV1 – THE BEGINNING


The first YOLO model was introduced by Joseph Redmon et al. in their 2015 paper titled “You Only Look Once: Unified, Real-Time Object Detection”. Until then, R-CNN models were the most sought-after models for object detection. Although the R-CNN family of models was accurate, it was relatively slow, because it involved a multi-step process of finding the proposed regions for the bounding boxes, then classifying these regions, and finally post-processing to refine the output. YOLO was created with the goal of doing away with this multi-stage pipeline and performing object detection in just a single stage, thus reducing the inference time.
2.1.1 PERFORMANCE
YOLOv1 sported a 63.4 mAP with an inference speed of 45 frames per second (22 ms per image). At the time, this was a huge improvement in speed over the R-CNN family, whose inference times ranged from 143 ms to 20 seconds per image.

Figure:2.1 Performance of YOLOV1


The basic working of the YOLO model relies upon its unified detection technique, which groups together the different components of object detection into a single feed-forward neural network. The model divides an incoming image into a grid of cells and calculates the probability of an object residing inside each cell. After that, the algorithm groups nearby high-probability cells as a single object. Low-value predictions are discarded using a technique called Non-Max Suppression (NMS).
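
To make the grid idea concrete, the following minimal sketch (an illustration, not the original implementation) computes the shape of the YOLOv1 output tensor for its standard configuration of a 7 x 7 grid, 2 boxes per cell, and the 20 PASCAL VOC classes:

    # Each of the S x S grid cells predicts B boxes (x, y, w, h, confidence)
    # plus C conditional class probabilities.
    S, B, C = 7, 2, 20           # standard YOLOv1 configuration on PASCAL VOC
    depth = B * 5 + C            # 2 * 5 + 20 = 30 values per cell
    print((S, S, depth))         # (7, 7, 30) -> 1470 predictions per image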

The model is trained in a similar fashion: the centre of each detected object is compared with the ground truth in order to check whether the model is correct, and the weights are adjusted accordingly.

Figure:2.2 YOLOV1

2.2 YOLOV2 – BETTER, FASTER, STRONGER


YOLOv2 was released by Joseph Redmon and Ali Farhadi in 2016 in their paper titled “YOLO9000: Better, Faster, Stronger”. The 9000 signified that YOLOv2 was able to detect over 9000 categories of objects. This version brought various improvements over the previous version, YOLOv1.

2.2.1 PERFORMANCE
YOLOv2 registered 78.6 mAP on the VOC 2012 dataset. We can see in the table below that it performed very well on the VOC 2012 dataset compared to other object detection models.

Figure:2.3 Accuracy comparison: state-of-the-art accuracy with 2-10 times better inference rates

2.2.2 TECHNICAL IMPROVEMENTS
The YOLOv2 version introduced the concept of anchor boxes. Anchor boxes are predefined boxes for an image that illustrate the idealized positions and shapes of the objects to be detected. We calculate the intersection over union (IoU) of the predicted bounding box and the predefined anchor box. The IoU value acts as a threshold to decide whether the probability of the detected object is sufficient to make a prediction or not.

Figure:2.4 Intersection over Union (IoU)

But in the case of YOLO, anchor boxes are not computed randomly. Instead, the YOLO algorithm examines the training data and performs clustering on the ground-truth box dimensions. All this is done to ensure that the anchor boxes that are used represent the data on which the model will be trained. This helps enhance the accuracy a lot.
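
As an illustration of this dimension clustering, here is a minimal k-means sketch in the spirit of YOLOv2, using 1 - IoU as the distance between a box and a cluster centroid. The input array of ground-truth (width, height) pairs is a hypothetical stand-in for a real dataset:

    import numpy as np

    def iou_wh(boxes, centroids):
        # IoU computed on (width, height) only, as if all boxes
        # shared the same top-left corner
        w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
        h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
        inter = w * h
        union = (boxes[:, 0] * boxes[:, 1])[:, None] \
              + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
        return inter / union

    def kmeans_anchors(boxes, k, iters=100):
        centroids = boxes[np.random.choice(len(boxes), k, replace=False)]
        for _ in range(iters):
            # assign each box to the centroid with the smallest 1 - IoU
            assign = np.argmax(iou_wh(boxes, centroids), axis=1)
            centroids = np.array([boxes[assign == i].mean(axis=0)
                                  if np.any(assign == i) else centroids[i]
                                  for i in range(k)])
        return centroids

    boxes = np.random.rand(1000, 2)      # hypothetical (w, h) pairs
    print(kmeans_anchors(boxes, k=5))    # five representative anchor shapes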

Figure:2.5 Anchor Boxes and Dimension Clusters.

2.2.3 ADDITIONAL IMPROVEMENTS


In order to adapt to different aspect ratios, the YOLOv2 model is randomly resized throughout the training process. To make the model more robust, YOLOv2 was trained on a combination of the COCO detection dataset and the ImageNet classification dataset. When the model processes an image with detection labels, both the detection and classification errors are calculated, whereas when the model sees an image with classification labels only, it back-propagates the classification error only. This joint hierarchy of labels is called the WordTree. Inference speeds of up to 200 FPS and an mAP of 75.3 were achieved using a classification network architecture called Darknet-19.

2.3 YOLOV3 – AN INCREMENTAL IMPROVEMENT


In 2018, Joseph Redmon and Ali Farhadi introduced YOLOv3, the third version, in their paper “YOLOv3: An Incremental Improvement”. This model was a little bigger than the earlier ones but more accurate, and it was still fast enough.

2.3.1 PERFORMANCE
YOLOv3-320 has an mAP of 28.2 with an inference time of 22 milliseconds (on the COCO dataset). This is three times faster than the SSD object detection technique, with similar accuracy.

Figure:2.6 Difference between YOLO v1,v2,v3

2.3.2 TECHNICAL IMPROVEMENTS


YOLOv3 consists of 75 convolutional layers, without fully connected or pooling layers, which greatly reduces the model size and weight. It provides the best of both worlds, i.e. using residual blocks for multi-scale feature learning together with a feature pyramid network, while maintaining minimal inference times.
● A feature pyramid network (FPN) is a feature extractor that extracts features of different types, forms, and sizes for a single image. It concatenates all the features so that the model can learn both local and general features (a minimal sketch is given after this list).
● By employing logistic classifiers and activations, the class predictions of YOLOv3 go above and beyond RetinaNet-50 and RetinaNet-101 in terms of accuracy.
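
As a rough sketch of the feature-pyramid idea (a minimal illustration, not YOLOv3's exact neck; the channel counts and map sizes are arbitrary assumptions), the module below merges a coarse, semantically strong feature map into a finer one via 1 x 1 lateral convolutions, upsampling, and concatenation:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyFPN(nn.Module):
        # Merge two backbone feature maps top-down, FPN-style
        def __init__(self, c_fine=256, c_coarse=512, c_out=128):
            super().__init__()
            self.lat_fine = nn.Conv2d(c_fine, c_out, kernel_size=1)
            self.lat_coarse = nn.Conv2d(c_coarse, c_out, kernel_size=1)
            self.smooth = nn.Conv2d(2 * c_out, c_out, kernel_size=3, padding=1)

        def forward(self, fine, coarse):
            top = self.lat_coarse(coarse)                         # reduce channels
            top_up = F.interpolate(top, scale_factor=2.0)         # upsample 2x
            merged = torch.cat([self.lat_fine(fine), top_up], 1)  # concatenate
            return self.smooth(merged)

    fine = torch.randn(1, 256, 52, 52)     # finer, shallower feature map
    coarse = torch.randn(1, 512, 26, 26)   # coarser, deeper feature map
    print(TinyFPN()(fine, coarse).shape)   # torch.Size([1, 128, 52, 52])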
2.4 YOLOV4- SPEED AND ACCURACY OF OBJECT DETECTION
YOLOv4 was not released by Joseph Redmon but by Alexey Bochkovskiy et al. in their 2020 paper “YOLOv4: Optimal Speed and Accuracy of Object Detection”.

2.4.1 PERFORMANCE
The YOLOv4 model stands atop other detection models like EfficientDet and ResNeXt-50. It uses the CSPDarknet53 backbone, an evolution of the Darknet-53 backbone used by YOLOv3.

2.4.2 TECHNICAL IMPROVEMENTS


YOLOv4 introduced the concepts of the bag of freebies (techniques that bring about an enhancement in model performance without increasing the inference cost) and the bag of specials (techniques that increase accuracy at some increase in computation cost). It has a speed of 62 frames per second with an mAP of 43.5 percent on the COCO dataset.

MODEL      LAYERS    FPS     mAP

YOLOv1     26        45      89.4
YOLOv2     32        42      91.2
YOLOv3     106       20      94.6
YOLOv4     53        22.6    84.3
YOLOv5     16        62      91.2

Figure:2.7 Difference between YOLO v1, v2, v3, v4 and v5

2.5 YOLOV5
YOLOv5 is supposedly the next member of the YOLO family, released in 2020 by the company Ultralytics just a few days after YOLOv4. No paper has been released, and there is debate in the community about whether it justifies the YOLO branding, as it is essentially a PyTorch implementation of YOLOv3.
2.5.1 PERFORMANCE
Since there is no official paper yet, the authenticity of the reported performance cannot be guaranteed. It achieves the same if not better accuracy (mAP of 55.6) than the other YOLO models while requiring less computing power.

Figure:2.8 Performance of YOLOV5

2.5.2 TECHNICAL IMPROVEMENTS


● Better data augmentation and loss calculations (now that the base of the model has shifted from C to PyTorch)
● Auto-learning of anchor boxes (they no longer need to be specified manually)
● Use of cross-stage partial connections (CSP) in the backbone
● Use of a path aggregation network (PAN) in the neck of the model
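
Since YOLOv5 ships as a PyTorch repository rather than a paper, the simplest way to try it is through torch.hub. The snippet below is a minimal usage sketch based on the Ultralytics examples; the model name, the sample image path, and the results API are assumptions that may change between releases:

    import torch

    # Load the small YOLOv5 variant from the Ultralytics repository
    model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

    # Single forward pass on an image (a path, URL, or array)
    results = model("zidane.jpg")   # hypothetical sample image

    results.print()                 # summary of detected classes and speed
    print(results.xyxy[0])          # (x1, y1, x2, y2, confidence, class) rows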

CHAPTER-3
WORKING OF YOLO

3.1 BASIC WORKING OF YOLO OBJECT DETECTOR MODELS


YOLO-based models continue to take over the space, and the way these models operate is based on three fundamental techniques: the residual block, bounding box regression, and intersection over union.
3.1.1 RESIDUAL BLOCKS
First, the image is divided into grid cells. The grid has a dimension of S x S cells. The following image shows how an input image is divided into the grid.

Figure:3.1 Residual Block

In the image, there are many grid cells of equal dimension. Every grid cell detects objects that appear within it. For example, if an object's centre appears within a certain grid cell, then this cell is responsible for detecting it.
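
As a small illustration of this "responsible cell" rule, the helper below (a sketch; the image and grid sizes are assumed) maps an object's centre in pixel coordinates to the grid cell that must detect it:

    def responsible_cell(cx, cy, img_w, img_h, S=7):
        # Return the (row, col) of the grid cell containing the object centre
        col = int(cx / img_w * S)   # which of the S columns the centre falls in
        row = int(cy / img_h * S)
        # clamp in case the centre lies exactly on the right/bottom edge
        return min(row, S - 1), min(col, S - 1)

    # A centre at (200, 300) in a 448 x 448 image falls in cell (4, 3)
    print(responsible_cell(200, 300, 448, 448))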

3.1.2 BOUNDING BOX REGRESSION


A bounding box is an outline that highlights an object in an image. Every bounding box in the image consists of the following attributes: width, height, class, and bounding box centre. The following image shows an example of a bounding box, represented by a yellow outline. YOLO uses a single bounding box regression to predict the height, width, centre, and class of objects, along with a value pc that represents the probability of an object appearing in the bounding box.
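
To make these attributes concrete, the sketch below converts a predicted box from the common centre/width/height convention (assumed here to be normalised to the image size) into pixel corner coordinates for drawing:

    def to_corners(x, y, w, h, img_w, img_h):
        # Convert a normalised (centre-x, centre-y, width, height) box
        # into pixel (x1, y1, x2, y2) corner coordinates
        x1 = (x - w / 2) * img_w
        y1 = (y - h / 2) * img_h
        x2 = (x + w / 2) * img_w
        y2 = (y + h / 2) * img_h
        return x1, y1, x2, y2

    # A box centred mid-image, half the image wide and a third tall:
    print(to_corners(0.5, 0.5, 0.5, 0.33, 448, 448))
    # (112.0, 150.08, 336.0, 297.92)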

Figure:3.2 Bounding Box Regression

3.1.3 INTERSECTION OVER UNION (IOU)


Intersection over union (IoU) is a measure used in object detection to describe how much two boxes overlap. YOLO uses IoU to provide an output box that surrounds the object tightly. Each grid cell is responsible for predicting bounding boxes and their confidence scores. The IoU is equal to 1 if the predicted bounding box is the same as the real box. This mechanism eliminates bounding boxes that deviate from the real box. The following image provides a simple example of how IoU works. In the image below, there are two bounding boxes, one in green and the other one in blue. The blue box is the predicted box, while the green box is the real box. YOLO trains its predictions so that the two boxes coincide as closely as possible.
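
A minimal IoU implementation (a sketch assuming boxes given as (x1, y1, x2, y2) corner coordinates) looks like this:

    def iou(box_a, box_b):
        # Intersection over union of two (x1, y1, x2, y2) boxes
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    print(iou((0, 0, 100, 100), (50, 50, 150, 150)))   # 0.142857...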

Figure:3.3 Intersection Over Union

As with every ML-based model, precision and recall are very important for judging its accuracy and robustness. Thus, the creators of YOLO kept trying to come up with an object detection model that maximises mAP (mean average precision). Besides this, the architectures of all the YOLO models share a similar set of components, outlined below.

1. Backbone: A convolutional neural network that accumulates and produces visual features of different shapes and sizes. Classification models like ResNet, VGG, and EfficientNet are used as feature extractors.
2. Neck: A set of layers that integrate and blend features before passing them on to the prediction layer. Examples: feature pyramid network (FPN), path aggregation network (PAN), and Bi-FPN.
3. Head: Takes in features from the neck along with the bounding box predictions, and performs classification and regression on the features and bounding box coordinates to complete the detection process. For each box it outputs four values, generally the x, y coordinates along with the width and height. (A schematic sketch of this three-part layout follows this list.)
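
The sketch below shows this three-part layout schematically in PyTorch. The layer sizes and counts are arbitrary placeholders, not the architecture of any particular YOLO version:

    import torch
    import torch.nn as nn

    S, B, C = 7, 2, 20                       # grid size, boxes per cell, classes

    # Backbone: extracts visual features (a stand-in for ResNet/Darknet/etc.)
    backbone = nn.Sequential(
        nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
        nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
    )
    # Neck: integrates and blends features (a single conv as a placeholder)
    neck = nn.Sequential(nn.Conv2d(128, 256, 3, padding=1), nn.LeakyReLU(0.1))
    # Head: maps features to per-cell boxes, objectness, and class scores
    head = nn.Sequential(
        nn.AdaptiveAvgPool2d((S, S)),
        nn.Conv2d(256, B * 5 + C, 1),        # (x, y, w, h, conf) per box + classes
    )

    x = torch.randn(1, 3, 448, 448)          # a dummy input image
    print(head(neck(backbone(x))).shape)     # torch.Size([1, 30, 7, 7])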
So the next obvious question is: how does YOLO work? Say we have a CNN that has been trained to recognise several classes, including a traffic light, a car, a person, and a truck. We give it two types of anchor boxes, a tall one and a wide one, so that it can handle overlapping objects of different shapes. Once the CNN has been trained, we can detect objects in images by feeding in new test images.

YOLO can work well for multiple objects where each object is associated with one grid cell. But in the case of overlap, in which one grid cell actually contains the centre points of two different objects, we can use something called anchor boxes to allow one grid cell to detect multiple objects.
Figure:3.4 Anchor boxes in action

In the image above, we see a person and a car overlapping, so part of the car is obscured. We can also see that the centres of both bounding boxes, the car's and the pedestrian's, fall in the same grid cell. Since the output vector of each grid cell can only have one class, it would be forced to pick either the car or the person. But by defining anchor boxes, we can create a longer grid cell vector and associate multiple classes with each grid cell.

Anchor boxes have a defined aspect ratio, and they try to detect objects that fit nicely into a box with that ratio. For example, since we are detecting a wide car and a standing person, we define one anchor box that is roughly the shape of a car (wider than it is tall) and another anchor box that can fit a standing person inside it (taller than it is wide). The test image is first broken up into a grid, and the network then produces output vectors, one for each grid cell. These vectors tell us whether a cell has an object in it, what class the object is, and the bounding boxes for the object. Since we are using two anchor boxes, we get two predicted anchor boxes for each grid cell. In fact, most of the predicted anchor boxes will have a low pc value.

After producing these output vectors, we use non-maximal suppression to get rid of unlikely bounding boxes. For each class, non-maximal suppression removes the bounding boxes that have a pc value lower than some given threshold. YOLO uses Non-Maximal Suppression (NMS) to keep only the best bounding box.

The first step in NMS is to remove all the predicted bounding boxes that have a detection probability less than a given NMS threshold. In the code below, we set this NMS threshold to 0.6, meaning that all predicted bounding boxes with a detection probability below 0.6 are removed. After removing these, the second step in NMS is to select the bounding box with the highest detection probability and eliminate all the bounding boxes whose Intersection over Union (IoU) with it exceeds a given IoU threshold. In the code below, we set this IoU threshold to 0.4, meaning that all predicted boxes with an IoU greater than 0.4 with respect to the best box are removed.
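
The code this passage refers to is not included in the report, so the following is a minimal self-contained sketch of the same two-step procedure, using the 0.6 detection threshold and 0.4 IoU threshold mentioned above:

    def iou(a, b):
        # IoU of two (x1, y1, x2, y2) boxes, as in Section 3.1.3
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = (a[2] - a[0]) * (a[3] - a[1]) \
              + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / union

    def non_max_suppression(detections, nms_thresh=0.6, iou_thresh=0.4):
        # detections: list of (box, pc, class_id) tuples
        # Step 1: drop boxes whose detection probability is below 0.6
        detections = [d for d in detections if d[1] >= nms_thresh]
        # Step 2: keep the most confident box per class, remove heavy overlaps
        detections.sort(key=lambda d: d[1], reverse=True)
        kept = []
        while detections:
            best = detections.pop(0)
            kept.append(best)
            detections = [d for d in detections
                          if d[2] != best[2] or iou(d[0], best[0]) <= iou_thresh]
        return kept

    dets = [((0, 0, 100, 100), 0.9, 0), ((10, 10, 110, 110), 0.8, 0),
            ((200, 200, 300, 300), 0.7, 1), ((0, 0, 50, 50), 0.3, 0)]
    print(non_max_suppression(dets))    # keeps the 0.9 and 0.7 detections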

NMS thus repeatedly selects the bounding box with the highest pc value and removes the boxes that are too similar to it, until all of the non-maximal bounding boxes have been removed for every class. The end result looks like the image below, where we can see that YOLO has effectively detected many objects in the image, such as cars and people.

Figure:3.5 YOLO Object Detection

CHAPTER-4
IMPORTANCE OF YOLO
4.1 IMPORTANCE OF YOLO
4.1.1 SPEED
● This algorithm improves the speed of detection because it can predict objects in real-time.
4.1.2 HIGH ACCURACY
● YOLO is a predictive technique that provides accurate results with minimal background
errors.
4.1.3 LEARNING CAPABILITIES
● The algorithm has excellent learning capabilities that enable it to learn the representations
of objects and apply them in object detection.

The YOLO algorithm can be used in various applications such as autonomous driving, security, and wildlife monitoring.
● Autonomous driving: The YOLO algorithm can be used in autonomous cars to detect objects around the car, such as vehicles, people, and parking signals. Object detection in autonomous cars is done to avoid collisions, since no human driver is controlling the car.
● Wildlife: The algorithm is used to detect various types of animals in forests. This type of detection is used by wildlife rangers and journalists to identify animals in videos (both recorded and real-time) and images. Some of the animals that can be detected include giraffes, elephants, and bears.
● Security: YOLO can also be used in security systems to enforce security in an area. Suppose people have been restricted from passing through a certain area for security reasons. If someone passes through the restricted area, the YOLO algorithm will detect them, prompting the security personnel to take further action.

CHAPTER-5
CONCLUSION

This report has provided an overview of the YOLO algorithm and how it is used in object detection. The technique presented in the original paper provides improved detection results compared to other object detection techniques such as Fast R-CNN and RetinaNet. As with all other computer vision algorithms, due to various unpredictable factors in real-world applications (lighting conditions, the human factor), there is no single model for every problem, including the problem of store shelf detection. The YOLO algorithm is trained on entire image inputs; it does not isolate and identify specific objects in stages but rather processes the entire image at once. This enables it to encode not only class appearance data but also contextual data, which is why the YOLO algorithm is not easily affected by noise or background data when trying to detect specific targets in real time. This seminar also covered the various versions YOLOv1, YOLOv2, YOLOv3, YOLOv4, and YOLOv5. Through this seminar you have gained an overview of object detection and the YOLO algorithm, gone through the main reasons why the YOLO algorithm is important, learned how the YOLO algorithm works, gained an understanding of the main techniques used by YOLO to detect objects, and seen real-life applications of YOLO.

REFERENCES

[1] Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi, "You Only Look Once: Unified, Real-Time Object Detection", 2016 IEEE Conference on Computer Vision and Pattern Recognition.

[2] N. Murali Krishna, Ramidi Yashwanth Reddy, Mallu Sai Chandra Reddy, Kasibhatla Phani Madhav, Gaikwad Sudham, "Object Detection and Tracking Using Yolo", 2021 Third International Conference on Inventive Research in Computing Applications.

[3] Rachel Huang, Jonathan Pedoeem, Cuixian Chen, "YOLO-LITE: A Real-Time Object Detection Algorithm Optimised for Non-GPU Computers", 2018 IEEE International Conference on Big Data.

[4] Irvine Valiant Fanthony, Zaenal Husin, Hera Hikmarika, Suci Dwijayanti, Bhakti Yudho Suprapto, "YOLO Algorithm-Based Surrounding Object Identification on Autonomous Electric Vehicle", 2021 8th International Conference on Electrical Engineering, Computer Science and Informatics.

[5] Pranav Adarsh, Pratibha Rathi, Manoj Kumar, "YOLO v3-Tiny: Object Detection and Recognition Using One Stage Improved Model", 2020 6th International Conference on Advanced Computing and Communication Systems.

[6] Fan Wu, Guoqing Jin, Mingyu Gao, Zhiwei He, Yuxiang Yang et al., "Helmet Detection Based on Improved YOLO V3 Deep Model", 2019 IEEE 16th International Conference on Networking, Sensing and Control (ICNSC).
