YOU ONLY LOOK ONCE-YOLO
SEMINAR REPORT
Submitted by
DEVI.P
Reg.No:19TH0412
of
BACHELOR OF TECHNOLOGY
PONDICHERRY UNIVERSITY
DECEMBER 2022
MANAKULA VINAYAGAR INSTITUTE OF TECHNOLOGY
KALITHEERTHAL KUPPAM, PUDUCHERRY- 605 107
PONDICHERRY UNIVERSITY
BONAFIDE CERTIFICATE
This is to certify that the Seminar Report Presentation Titled “YOU ONLY LOOK
TABLE OF CONTENTS
1 INTRODUCTION
2 VERSIONS OF YOLO
2.1 YOLOV1
2.2 YOLOV2
2.3 YOLOV3
2.4 YOLOV4
2.5 YOLOV5
3 WORKING OF YOLO
4 IMPORTANCE OF YOLO
5 CONCLUSION
REFERENCES
CHAPTER-1
INTRODUCTION
YOLOv2: Released in 2017, this version earned an honorable mention at CVPR 2017 thanks to
significant improvements such as anchor boxes and higher-resolution inputs.
YOLOv3: The 2018 release added an objectness score to the bounding box prediction and
connections to the backbone network layers. It also improved performance on small objects
because it makes predictions at three different scales (levels of granularity).
YOLOv4: Released in April 2020, this was the first YOLO paper not authored by Joseph Redmon.
In it, Alexey Bochkovskiy introduced novel improvements, including the Mish activation function
and improved feature aggregation.
YOLOv5: Glenn Jocher continued to make further improvements in his June 2020 release,
focusing on the architecture itself.
1.2 INTRODUCTION TO YOLO
Standing for You Only Look Once, YOLO is a real-time object detection algorithm that treats
detection as a regression problem and has a multitude of computer vision applications. The YOLO
algorithm employs a convolutional neural network (CNN) to detect objects in real time. As the
name suggests, the algorithm requires only a single forward propagation through the network to
detect objects, which means that predictions for the entire image are made in a single run of the
algorithm.
The CNN predicts the class probabilities and the bounding boxes simultaneously. The YOLO
algorithm has several variants; common ones include Tiny YOLO and YOLOv3.
The algorithm uses a single bounding box regression to predict the height, width, centre, and
class of each object. It came to dominate the field because of its accuracy, speed, and ability to
detect objects in a single run, surpassing Fast R-CNN, RetinaNet, and the Single-Shot MultiBox
Detector (SSD). The R-CNN family was too slow: proposing candidate regions for the bounding
boxes, training a model, detecting and classifying regions, and refining the outputs were carried
out as separate steps, which took longer.
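To make the single-pass idea concrete, the sketch below computes the shape of the one output tensor that YOLOv1 predicts. It assumes the setup described in the original YOLOv1 paper: an S × S grid in which every cell predicts B boxes of five numbers each (x, y, w, h, confidence) plus C class probabilities shared by the cell. The function names are illustrative and are not part of any library.

```python
from typing import List, Sequence, Tuple


def yolov1_output_shape(S: int, B: int, C: int) -> Tuple[int, int, int]:
    """Shape of the single tensor YOLOv1 predicts in one forward pass."""
    # Every grid cell holds B boxes x 5 numbers, plus C class probabilities.
    return (S, S, B * 5 + C)


def split_cell_vector(cell: Sequence[float], B: int, C: int):
    """Split one grid cell's prediction vector into its boxes and class probabilities."""
    if len(cell) != B * 5 + C:
        raise ValueError("cell vector has the wrong length")
    # Each box is (x, y, w, h, confidence).
    boxes: List[Tuple[float, ...]] = [tuple(cell[5 * b:5 * b + 5]) for b in range(B)]
    # The class probabilities are shared by all B boxes of the cell.
    class_probs = list(cell[5 * B:])
    return boxes, class_probs


# PASCAL VOC setting used in the YOLOv1 paper: 7 x 7 grid, 2 boxes per cell, 20 classes.
print(yolov1_output_shape(S=7, B=2, C=20))  # (7, 7, 30)
```

All 7 × 7 × 2 = 98 candidate boxes and their class scores come out of this one tensor, which is why no separate region-proposal stage is needed.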
In many tasks, extreme levels of accuracy (such as those provided by slower, region-based CNN
detectors) are not imperative, so it is reasonable to rely on less accurate but faster methods. This
explains YOLO's rapid rise. First, it reduces detection time because it predicts objects in real time.
Second, YOLO provides accurate results with few background errors. Finally, the algorithm has
strong learning capabilities that enable it to learn representations of objects and apply them in
object detection tasks.
CHAPTER-2
VERSIONS OF YOLO
The model is trained in a similar fashion: the centre of each detected object is compared with
the ground truth to check whether the prediction is correct, and the weights are adjusted
accordingly.
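Comparing predicted centres with the ground truth corresponds to the localisation term of the YOLOv1 loss. The minimal sketch below shows that term for a single box that is responsible for an object. It assumes the formulation from the YOLOv1 paper (squared error on the centre, squared error on the square roots of width and height, weighted by λ_coord = 5) and deliberately leaves out the confidence and classification terms of the full loss.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), all normalised to [0, 1]

LAMBDA_COORD = 5.0  # weight on localisation errors reported in the YOLOv1 paper


def localisation_loss(pred: Box, truth: Box, lambda_coord: float = LAMBDA_COORD) -> float:
    """Localisation part of the YOLOv1 loss for one box responsible for an object."""
    px, py, pw, ph = pred
    tx, ty, tw, th = truth
    # Squared error between the predicted and ground-truth centre.
    centre_err = (px - tx) ** 2 + (py - ty) ** 2
    # Square roots make a given absolute size error cost more for small boxes.
    size_err = (math.sqrt(pw) - math.sqrt(tw)) ** 2 + (math.sqrt(ph) - math.sqrt(th)) ** 2
    return lambda_coord * (centre_err + size_err)


# A centre that is off by 0.1 in x costs 5 * 0.1**2.
print(round(localisation_loss((0.6, 0.5, 0.25, 0.25), (0.5, 0.5, 0.25, 0.25)), 4))  # 0.05
```

The gradient of this term with respect to the network's outputs is what drives the weight adjustment described above.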
Figure:2.2 YOLOV1
2.2.1 PERFORMANCE
YOLOv2 registered 78.6 mAP on the PASCAL VOC 2007 test set. As the comparison in Figure 2.3
shows, it performed very well compared with other object detection models.
Figure:2.3 Accuracy Comparison: State-of-the-art accuracy with 2-10 times better inference rates
2.2.2 TECHNICAL IMPROVEMENTS
YOLOv2 introduced the concept of anchor boxes to YOLO. Anchor boxes are simply predefined
regions of an image that illustrate the idealized positions of the objects to be detected. We
calculate the intersection over union (IoU), that is, the area of overlap divided by the area of union,
of the predicted bounding box and the predefined anchor box. The IoU value is compared against a
threshold to decide whether the probability of the detected object is sufficient to make a prediction
or not.
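The IoU calculation described above can be sketched as follows. The sketch assumes boxes given as corner coordinates (x1, y1, x2, y2); YOLO itself predicts a centre, width, and height, which can be converted to corners first. The 0.5 threshold in the example is an illustrative choice, not a value taken from this report.

```python
from typing import Tuple

Corners = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 < x2, y1 < y2


def iou(box_a: Corners, box_b: Corners) -> float:
    """Intersection over Union: area of overlap divided by area of union."""
    # Corners of the intersection rectangle (it may be empty).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


IOU_THRESHOLD = 0.5  # illustrative value only

predicted = (0.0, 0.0, 2.0, 2.0)
anchor = (1.0, 1.0, 3.0, 3.0)
overlap = iou(predicted, anchor)   # 1 / (4 + 4 - 1) = 1/7, about 0.143
print(overlap >= IOU_THRESHOLD)    # False: too little overlap
```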
In YOLOv2, however, the anchor boxes are not chosen arbitrarily by hand. Instead, the algorithm
examines the training data and performs k-means clustering on the dimensions of its ground-truth
bounding boxes. This ensures that the anchor boxes represent the data on which the model will be
trained, which noticeably improves accuracy.
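The clustering step can be sketched as k-means over the widths and heights of the ground-truth boxes. The YOLOv2 paper replaces Euclidean distance with d = 1 − IoU so that large boxes do not dominate the clustering; this sketch follows that idea, compares boxes as if they shared a centre, and updates each centroid with the cluster mean. The function names, the random initialisation, and the toy data are assumptions made for illustration.

```python
import random
from typing import List, Tuple

WH = Tuple[float, float]  # (width, height) of a ground-truth box


def wh_iou(a: WH, b: WH) -> float:
    """IoU of two boxes that share the same centre, given only as (w, h)."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes: List[WH], k: int, iters: int = 100, seed: int = 0) -> List[WH]:
    """Cluster (w, h) pairs into k anchor shapes using d = 1 - IoU as the distance."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)  # start from k training boxes (an assumption)
    for _ in range(iters):
        # Assignment: each box joins the centroid it overlaps most (smallest 1 - IoU).
        clusters: List[List[WH]] = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda j: wh_iou(box, centroids[j]))
            clusters[best].append(box)
        # Update: move each centroid to the mean width and height of its cluster.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                new_centroids.append((sum(w for w, _ in members) / len(members),
                                      sum(h for _, h in members) / len(members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's old centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return sorted(centroids)


# Toy data: three wide (car-like) and three tall (person-like) boxes.
toy = [(4.0, 1.0), (4.2, 1.1), (3.8, 0.9), (1.0, 3.0), (1.1, 3.2), (0.9, 2.8)]
print(kmeans_anchors(toy, k=2))
```

With the toy data the two anchors settle on a tall, person-like shape and a wide, car-like shape. YOLOv2 itself used k = 5 clusters.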
When the model sees an image that has no bounding-box labels (a classification-only image), it
backpropagates only the classification error. The hierarchical label structure used for this joint
training is called WordTree. Inference speeds of up to 200 FPS and an mAP of 75.3 were achieved
using a classification network architecture called Darknet-19.
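The rule that classification-only images contribute only their classification error can be sketched as a masked sum of loss terms. The `ImageLoss` record, its field names, and the loss values are hypothetical placeholders; a real implementation computes these terms from the network outputs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ImageLoss:
    """Hypothetical per-image loss terms produced by a detector."""
    classification: float
    localisation: float
    has_box_labels: bool  # False for classification-only images


def joint_training_loss(batch: List[ImageLoss]) -> float:
    """Sum the losses, using only the classification error for images without boxes."""
    total = 0.0
    for item in batch:
        total += item.classification
        if item.has_box_labels:
            # Detection images also backpropagate the bounding-box error.
            total += item.localisation
    return total


batch = [
    ImageLoss(classification=0.4, localisation=1.5, has_box_labels=True),   # detection image
    ImageLoss(classification=0.7, localisation=9.9, has_box_labels=False),  # classification-only
]
print(joint_training_loss(batch))  # only 0.4 + 1.5 + 0.7 contribute; 9.9 is ignored
```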
2.3.1 PERFORMANCE
On the COCO dataset, YOLOv3-320 achieves an mAP of 28.2 with an inference time of 22
milliseconds. This is three times faster than the SSD object detection technique, with similar
accuracy.
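The speed comparison can be checked with simple arithmetic: frames per second equal 1000 divided by the per-image latency in milliseconds. The SSD latency below is not a measured figure; it is only what "three times faster" implies for the 22 ms quoted above.

```python
def fps_from_latency(ms_per_image: float) -> float:
    """Convert a per-image inference time in milliseconds to frames per second."""
    return 1000.0 / ms_per_image


yolov3_320_ms = 22.0                # YOLOv3-320 inference time quoted above
implied_ssd_ms = 3 * yolov3_320_ms  # what "three times faster than SSD" implies

print(round(fps_from_latency(yolov3_320_ms), 1))   # 45.5
print(round(fps_from_latency(implied_ssd_ms), 1))  # 15.2
```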
● By using independent logistic classifiers instead of a softmax for class prediction, YOLOv3
matches or exceeds RetinaNet-50 and RetinaNet-101 on the COCO AP50 metric while running considerably faster.
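The difference between a softmax and independent logistic classifiers shows up on a single set of scores. With a softmax, the classes compete for one unit of probability; with independent sigmoids, overlapping labels (the YOLOv3 paper's example is "Woman" and "Person") can both score highly. The logit values and label names here are made up for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Mutually exclusive class probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def independent_logistic(logits: List[float]) -> List[float]:
    """One sigmoid per class, as in YOLOv3: each class is a separate yes/no decision."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


labels = ["person", "woman", "car"]  # hypothetical overlapping labels
logits = [3.0, 3.0, -2.0]            # made-up scores

soft = softmax(logits)
sig = independent_logistic(logits)
print([name for name, p in zip(labels, soft) if p > 0.5])  # []
print([name for name, p in zip(labels, sig) if p > 0.5])   # ['person', 'woman']
```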
2.4 YOLOV4- SPEED AND ACCURACY OF OBJECT DETECTION
YOLOv4 was released not by Joseph Redmon but by Alexey Bochkovskiy et al. in their 2020 paper
“YOLOv4: Optimal Speed and Accuracy of Object Detection”.
2.4.1 PERFORMANCE
The YOLOv4 model stands above other detection models such as EfficientDet and ResNeXt50 in its
speed/accuracy trade-off. It uses the CSPDarknet53 backbone, a Cross Stage Partial variant of the
Darknet-53 backbone used in YOLOv3.
YOLOV1 26 45 89.4
YOLOV2 32 42 91.2
YOLOV5 16 62 91.2
2.5 YOLOV5
YOLOv5 is presented as the next member of the YOLO family; it was released in 2020 by the
company Ultralytics a few weeks after YOLOv4. No paper has been released, and there is debate in
the community about whether it justifies the YOLO branding, as it is largely a PyTorch
implementation of YOLOv3.
2.5.1 PERFORMANCE
Its performance cannot be independently verified, as there is no official paper yet. It reportedly
achieves the same or better accuracy (mAP of 55.6) than the other YOLO models while requiring
less computing power.
CHAPTER-3
WORKING OF YOLO
In the image below, the input image is divided into many grid cells of equal size. Each grid cell
detects the objects that appear within it: if an object's centre falls within a certain grid cell, that
cell is responsible for detecting it.
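The rule that the cell containing an object's centre is responsible for it can be sketched in a few lines. The sketch assumes pixel coordinates for the centre and a square S × S grid; the 448 × 448 input and S = 7 match the YOLOv1 paper, while the object position is made up.

```python
from typing import Tuple


def responsible_cell(cx: float, cy: float, img_w: float, img_h: float, S: int) -> Tuple[int, int]:
    """Return (row, col) of the grid cell that contains the object's centre (cx, cy)."""
    col = min(int(cx / img_w * S), S - 1)  # clamp so a centre on the right edge stays in the grid
    row = min(int(cy / img_h * S), S - 1)  # clamp so a centre on the bottom edge stays in the grid
    return row, col


def offset_in_cell(cx: float, cy: float, img_w: float, img_h: float, S: int) -> Tuple[float, float]:
    """Centre position relative to its cell (between 0 and 1), which is what YOLO regresses."""
    gx, gy = cx / img_w * S, cy / img_h * S
    row, col = responsible_cell(cx, cy, img_w, img_h, S)
    return gx - col, gy - row


# An object centred at pixel (300, 100) in a 448 x 448 image with a 7 x 7 grid.
print(responsible_cell(300, 100, 448, 448, S=7))  # (1, 4)
```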
Figure:3.2 Bounding Box Regression
As with every ML-based model, precision and recall are essential for judging its accuracy and
robustness. The creators of YOLO therefore kept trying to build an object detection model that
maximises mAP (mean average precision). Besides this, the architectures of all the YOLO models
share a common set of components, outlined below:
1. Backbone: A convolutional neural network that accumulates and produces visual
features with different shapes and sizes. Classification models like ResNet, VGG, and
EfficientNet are used as feature extractors.
2. Neck: A set of layers that integrates and blends features before passing them on to
the prediction layer. Examples: feature pyramid network (FPN), path aggregation
network (PAN), and BiFPN.
3. Head: Takes in features from the neck and produces the bounding box predictions.
It performs classification and bounding box coordinate regression on the features to
complete the detection process. For each box it outputs 4 values, generally the x, y
coordinates along with the width and height.
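How the head's four numbers become a box depends on the YOLO version. As an illustration, the sketch below uses the parameterisation introduced in YOLOv2 and kept in YOLOv3: the centre is the grid-cell offset plus a sigmoid of the raw output, and the width and height scale an anchor (prior) box exponentially. Variable names follow the YOLOv2/YOLOv3 papers; the numbers in the example are made up.

```python
import math
from typing import Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def decode_box(t: Tuple[float, float, float, float],
               cell: Tuple[int, int],
               prior: Tuple[float, float]) -> Tuple[float, float, float, float]:
    """Map raw head outputs (tx, ty, tw, th) to a box (bx, by, bw, bh) in grid units.

    cell  = (cx, cy): top-left corner of the grid cell, in grid units
    prior = (pw, ph): width and height of the anchor box, in grid units
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = prior
    bx = cx + sigmoid(tx)   # sigmoid keeps the centre inside the cell
    by = cy + sigmoid(ty)
    bw = pw * math.exp(tw)  # exp keeps the size positive and scales the prior
    bh = ph * math.exp(th)
    return bx, by, bw, bh


# Raw outputs of zero put the centre in the middle of cell (3, 2) with the prior's size.
print(decode_box((0.0, 0.0, 0.0, 0.0), cell=(3, 2), prior=(1.5, 2.0)))  # (3.5, 2.5, 1.5, 2.0)
```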
So the next obvious question is: how does YOLO work? Say we have a CNN that has been trained to
recognise several classes, including a traffic light, a car, a person, and a truck. We give it two
types of anchor boxes, a tall one and a wide one, so that it can handle overlapping objects of
different shapes. Once the CNN has been trained, we can detect objects in images by feeding it new
test images.
YOLO can work well for multiple objects where each object is associated with one grid
cell. But in the case of overlap, in which one grid cell actually contains the center points of two
different objects, we can use something called anchor boxes to allow one grid cell to detect multiple
objects.
Figure:3.4 Anchor boxes in action
In the image above, a person and a car overlap, so part of the car is obscured. We can also see
that the centres of both bounding boxes, the car's and the pedestrian's, fall in the same grid cell.
Since the output vector of each grid cell can hold only one class, the cell would be forced to pick
either the car or the person. By defining anchor boxes, however, we can create a longer grid cell
vector and associate multiple classes with each grid cell.
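The longer grid cell vector can be pictured as one slot per anchor box. The sketch assumes the layout used in this walk-through: each slot holds PC (the object probability), four box values, and one score per class, here for the four example classes (traffic light, car, person, truck) and two anchors. The ordering inside a slot and the numbers in the example are illustrative assumptions; real implementations differ.

```python
from typing import Dict, List, Sequence

CLASSES = ["traffic light", "car", "person", "truck"]  # example classes from the text


def split_by_anchor(cell_vector: Sequence[float], num_anchors: int,
                    classes: List[str] = CLASSES) -> List[Dict]:
    """Split one grid cell's output into one slot per anchor box."""
    slot_len = 5 + len(classes)  # PC, four box values, then one score per class
    if len(cell_vector) != num_anchors * slot_len:
        raise ValueError("vector length does not match anchors x slot size")
    slots = []
    for a in range(num_anchors):
        s = cell_vector[a * slot_len:(a + 1) * slot_len]
        scores = list(s[5:])
        slots.append({
            "pc": s[0],
            "box": tuple(s[1:5]),
            "class": classes[scores.index(max(scores))],  # most likely class for this anchor
        })
    return slots


# Two anchors (tall, then wide) in the cell where the person and the car overlap.
cell = [0.9, 0.5, 0.4, 0.8, 0.2, 0.0, 0.1, 0.9, 0.0,   # tall anchor: a person
        0.8, 0.5, 0.6, 0.3, 0.9, 0.0, 0.8, 0.1, 0.1]   # wide anchor: a car
for slot in split_by_anchor(cell, num_anchors=2):
    print(slot["pc"], slot["class"])
```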
Anchor boxes have a defined aspect ratio, and they try to detect objects that fit nicely into a box
with that ratio. For example, since we're detecting a wide car and a standing person, we'll define
one anchor box that is roughly the shape of a car, which will be wider than it is tall, and another
anchor box that can fit a standing person inside it, which will be taller than it is wide. The test
image is first broken up into a grid, and the network then produces output vectors, one for each grid
cell. These vectors tell us whether a cell contains an object, what class the object is, and the
bounding boxes for the object. Since we're using two anchor boxes, we'll get two predicted anchor
boxes for each grid cell. Some, in fact most, of the predicted anchor boxes will have a low PC value
(the probability that the box contains an object).
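During training, each ground-truth object is matched to the anchor whose shape fits it best, which is what lets the tall anchor specialise in people and the wide one in cars. One common way to measure the fit, used in YOLOv2 and YOLOv3, is the IoU between the two boxes when they share a centre. The anchor sizes and object sizes below are hypothetical.

```python
from typing import List, Tuple

WH = Tuple[float, float]  # (width, height)


def shape_iou(a: WH, b: WH) -> float:
    """IoU of two boxes placed on the same centre, so only their shapes matter."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def best_anchor(obj: WH, anchors: List[WH]) -> int:
    """Index of the anchor whose shape overlaps the object the most."""
    return max(range(len(anchors)), key=lambda i: shape_iou(obj, anchors[i]))


anchors = [(3.0, 1.5), (1.0, 2.5)]       # hypothetical wide (car) and tall (person) anchors
print(best_anchor((2.8, 1.2), anchors))  # 0: the car goes to the wide anchor
print(best_anchor((0.8, 2.2), anchors))  # 1: the person goes to the tall anchor
```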
After producing these output vectors, we use non-maximal suppression to get rid of unlikely
bounding boxes. For each class, non-maximal suppression gets rid of the bounding boxes that have
a PC value lower than some given threshold. YOLO uses Non-Maximal Suppression (NMS) to keep
only the best bounding boxes.
The first step in NMS is to remove all the predicted bounding boxes whose detection probability is
less than a given threshold (called the NMS threshold here). In this example, the threshold is set to
0.6, so all predicted bounding boxes with a detection probability below 0.6 are removed. After the
low-probability boxes have been removed, the second step in NMS is to select the bounding box with
the highest detection probability and eliminate all the bounding boxes whose Intersection Over
Union (IOU) with it is higher than a given IOU threshold. Here the IOU threshold is set to 0.4, so all
predicted bounding boxes that have an IOU greater than 0.4 with respect to the best bounding box
are removed.
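The two NMS steps can be sketched as a greedy loop that runs separately for each class, using the thresholds from the text (0.6 for the detection probability and 0.4 for IOU). Boxes are assumed to be (x1, y1, x2, y2) corners, and the function names are illustrative; libraries such as torchvision ship their own optimised NMS.

```python
from typing import List, Sequence, Tuple

Corners = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Detection = Tuple[Corners, float, str]       # (box, detection probability, class label)


def iou(a: Corners, b: Corners) -> float:
    """Intersection over Union of two corner-format boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Corners], scores: Sequence[float],
        score_threshold: float = 0.6, iou_threshold: float = 0.4) -> List[int]:
    """Greedy non-maximal suppression for ONE class; returns indices of kept boxes."""
    # Step 1: discard boxes whose detection probability is below the threshold.
    order = [i for i, s in enumerate(scores) if s >= score_threshold]
    # Step 2: keep the best remaining box, drop boxes overlapping it too much, repeat.
    order.sort(key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


def nms_per_class(detections: List[Detection]) -> List[Detection]:
    """Run NMS independently for every class label."""
    kept = []
    for label in sorted({d[2] for d in detections}):
        idx = [i for i, d in enumerate(detections) if d[2] == label]
        survivors = nms([detections[i][0] for i in idx], [detections[i][1] for i in idx])
        kept.extend(detections[idx[k]] for k in survivors)
    return kept


detections = [
    ((0, 0, 10, 10), 0.90, "car"),
    ((1, 1, 10, 10), 0.80, "car"),    # IOU 0.81 with the first car: suppressed
    ((20, 20, 30, 30), 0.70, "car"),  # no overlap: kept
    ((0, 0, 10, 10), 0.50, "car"),    # below 0.6: discarded in step 1
    ((2, 0, 8, 10), 0.95, "person"),  # different class: handled separately
]
for box, score, label in nms_per_class(detections):
    print(label, score, box)
```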
NMS then selects the bounding boxes with the highest PC value and removes the bounding boxes that
are too similar to them. This is repeated until all of the non-maximal bounding boxes have been
removed for every class. The end result will look like the image below: YOLO has effectively
detected many objects in the image, such as cars and people.
CHAPTER-4
IMPORTANCE OF YOLO
4.1 IMPORTANCE OF YOLO
4.1.1 SPEED
● This algorithm improves the speed of detection because it can predict objects in real-time.
4.1.2 HIGH ACCURACY
● YOLO is a predictive technique that provides accurate results with minimal background
errors.
4.1.3 LEARNING CAPABILITIES
● The algorithm has excellent learning capabilities that enable it to learn the representations
of objects and apply them in object detection.
The YOLO algorithm can be used in various applications, such as autonomous driving, wildlife
detection, and security.
● Autonomous driving: The YOLO algorithm can be used in autonomous cars to detect objects
around the car, such as vehicles, people, and parking signals. Object detection in autonomous
cars is used to avoid collisions, since no human driver is controlling the car.
● Wildlife: This algorithm is used to detect various types of animals in forests. This type of
detection is used by wildlife rangers and journalists to identify animals in videos (both
recorded and real-time) and images. Some of the animals that can be detected include
giraffes, elephants, and bears.
● Security: YOLO can also be used in security systems to enforce security in an area. Suppose
that people have been restricted from passing through a certain area for security reasons. If
someone passes through the restricted area, the YOLO algorithm will detect them, and the
security personnel can then take further action.
CHAPTER-5
CONCLUSION
This report has provided an overview of the YOLO algorithm and how it is used in object
detection. The technique described in the paper provides improved detection results compared
with other object detection techniques such as Fast R-CNN and RetinaNet. As with all other
computer vision algorithms, various unpredictable factors in real-world applications (lighting
conditions, the human factor) mean that there is no single model for every problem, including the
problem of store shelf detection. Because YOLO is trained on entire images, it does not isolate and
identify specific objects but instead processes the entire image at once. This enables it to encode
not only class appearance data but also contextual data, which is why the YOLO algorithm is less
affected by noise or background data when detecting specific targets in real time. This seminar also
covered the versions YOLOv1, YOLOv2, YOLOv3, YOLOv4, and YOLOv5. Through this seminar you
have gained an overview of object detection and the YOLO algorithm, gone through the main
reasons why the YOLO algorithm is important, learned how the YOLO algorithm works, and gained
an understanding of the main techniques YOLO uses to detect objects. You may also have learned
about the real-life applications of YOLO.
REFERENCES
[1] Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi, "You Only Look Once: Unified,
Real-Time Object Detection", 2016 IEEE Conference on Computer Vision and Pattern Recognition.
[2] N. Murali Krishna, Ramidi Yashwanth Reddy, Mallu Sai Chandra Reddy, Kasibhatla Phani
Madhav, Gaikwad Sudham, "Object Detection and Tracking Using Yolo", 2021 Third International
Conference on Inventive Research in Computing Applications.
[6] Fan Wu, Guoqing Jin, Mingyu Gao, Zhiwei He, Yuxiang Yang, "Helmet Detection Based On
Improved YOLO V3 Deep Model", 2019 IEEE 16th International Conference on Networking,