
Review of YOLO: drawbacks and improvements from v1 to v3
Posted on April 25, 2020

1. Introduction

In this post, I'd like to review the three YOLO papers. The main purpose is to understand the design of YOLO and how the authors tried to improve it. For the details of implementation, such as learning rates and training tricks, please read the experiment sections in the papers[1][2][3]. This post is organized as:

1. The drawbacks of the region proposal approach, such as R-CNN, from the perspective of optimization
2. The design of YOLO V1
3. The drawbacks of each previous version and the improvements in YOLO V2 and YOLO V3

2. Drawbacks of the region proposal approach

The region proposal approach is also called the 2-shot or 2-stage approach. The 2 steps are: propose regions which may contain objects, then do classification on the proposed regions. As discussed in my previous post, Review of LeNet-5 (2020-04-19-review_lenet.md), a non-globally-trainable system, such as the region proposal approach, is hard to optimize because the individual modules must be trained separately. In this post, I would like to analyse the 2-shot approach from another point of view: optimization.

To perform a generic optimization, we check the 3 following aspects (BUD): bottleneck, unnecessary work, duplicated work. Let's do the analysis for R-CNN, whose workflow is shown in Image 1[5].

Image 1: Work flow of R-CNN

1. Bottleneck: R-CNN firstly (1) proposes thousands of regions and then (2) does classification of objects as well as regression to refine the bounding box on each proposed region. Since the classification and regression are repeated thousands of times, many researchers focus on the optimization of (2).
2. Unnecessary work: I think there is no unnecessary work in the workflow of R-CNN.
3. Duplicated work: Too much duplicated work can be found in R-CNN, which uses a CNN to extract features for classification. There are 4 yellow rectangles in step 2 of Image 1. Let's focus on the 2 rectangles in the middle: the higher one and the lower one share some overlapping region, so the extraction of features on the overlapping region is repeated 2 times.

What can we suggest to reduce the duplicated work? Let's look at YOLO.

3. Design of YOLO V1

3-1. Unified detection

YOLO splits the image into S ∗ S grid cells, as shown in Image 2[1].
Image 2: Grid split

The region of an object (the green rectangle) in the image is represented as a box. The center of the box (the green point) is defined as the center of the object. As the image is split into several grid cells, the center of an object falls into one grid cell. We consider this grid cell as the predictor, which predicts the class of the object and the four corners of the box.

Each cell predicts B possible boxes. The prediction of each box includes the center point coordinates x and y, the width w and height h of the box, and a probability to indicate how accurately the box is predicted. The probability is defined formally as p(Object) ∗ IOU_pred^truth, where p(Object) is the probability to have an object in the cell and IOU_pred^truth is the intersection over union between the predicted region and the ground truth.

Moreover, every grid cell predicts conditional probabilities for C classes, marked as p(Class_i|Object). To sum up, each grid cell predicts B ∗ 5 + C values, and YOLO with S ∗ S cells predicts S ∗ S ∗ (B ∗ 5 + C) values.

Attention: each grid cell in YOLO V1 can predict only one object.

3-2. How to parametrize the predicted values?

The coordinates of the box center, x and y, are the offset to a particular grid cell and their values are between 0 and 1. The width w and height h are measured relative to the image width and height, so their values are also between 0 and 1.

3-3. The architecture of YOLO network

Image 3: The architecture of YOLO network

Image 3[1] shows clearly the architecture of YOLO. During the implementation, the last fully connected layer takes a 4096 vector as input and outputs a vector of 7 ∗ 7 ∗ 30 values, which is then reshaped to a tensor with dimension of 7 ∗ 7 ∗ 30. In YOLO V1:

1. No Batch Normalization layer.
2. Activation function: LeakyReLU for all layers, except the last one, which uses a linear activation.
3. Dropout: a dropout layer with rate=0.5 after the first fully connected layer.
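The cell and box encoding described in 3-1 and 3-2 can be sketched as follows. This is a minimal sketch, assuming S=7, B=2, C=20 as in the paper; the function name and tensor layout (B boxes of 5 values, then C class probabilities) are my own illustration, not taken from the official code.

```python
import numpy as np

def encode_target(boxes, labels, S=7, B=2, C=20):
    """Encode ground-truth boxes into the S * S * (B*5 + C) target tensor.

    boxes: list of (x, y, w, h), all values normalized to [0, 1]
    relative to the full image; labels: one class index per box.
    """
    target = np.zeros((S, S, B * 5 + C), dtype=np.float32)
    for (x, y, w, h), cls in zip(boxes, labels):
        col, row = int(x * S), int(y * S)      # cell containing the box center
        cx, cy = x * S - col, y * S - row      # center offset inside the cell, in [0, 1]
        for b in range(B):                     # fill every box slot of the predictor cell
            target[row, col, b * 5: b * 5 + 5] = [cx, cy, w, h, 1.0]
        target[row, col, B * 5 + cls] = 1.0    # conditional class probability
    return target

t = encode_target([(0.5, 0.5, 0.2, 0.3)], [7])
# the center (0.5, 0.5) falls into cell (3, 3) of the 7x7 grid
```

Note how one object writes to exactly one cell: this is the reason each grid cell in YOLO V1 can predict only one object.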


3-4. Loss function

The loss function of YOLO is composed of the squared errors of box localization and classification, and is the sum of the following 5 parts:

1. λ_coord ∑_{i=0}^{S²} ∑_{j=0}^{B} 1_{ij}^{obj} [(x_i − x̂_i)² + (y_i − ŷ_i)²]
2. λ_coord ∑_{i=0}^{S²} ∑_{j=0}^{B} 1_{ij}^{obj} [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²]
3. ∑_{i=0}^{S²} ∑_{j=0}^{B} 1_{ij}^{obj} (C_i − Ĉ_i)²
4. λ_noobj ∑_{i=0}^{S²} ∑_{j=0}^{B} 1_{ij}^{noobj} (C_i − Ĉ_i)²
5. ∑_{i=0}^{S²} 1_i^{obj} ∑_{c∈classes} (p_i(c) − p̂_i(c))²

where 1_i^{obj} denotes if an object appears in cell i and 1_{ij}^{obj} means the j-th bounding box predictor in cell i is in charge of the prediction. λ_coord and λ_noobj are used to remedy the problem of imbalanced data, because there are more cells without objects than cells with objects. In the paper, λ_coord = 5 and λ_noobj = 0.5.

In part 2, the square roots of width and height are used. Since the same prediction error of width and height has a different influence for a small box and a large box, the square root amplifies the influence when the box is small relative to a large box. For example, an error of 0.05 moves √w from √0.10 ≈ 0.316 to √0.15 ≈ 0.387 (a difference of 0.071), but only from √0.80 ≈ 0.894 to √0.85 ≈ 0.922 (a difference of 0.028).
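As a sanity check, the 5 parts above can be written as a direct, unvectorized sketch. It assumes the prediction and target share the S × S × (B∗5 + C) layout from section 3-1, and, for simplicity, reads the indicator 1_{ij}^{obj} from the target's confidence slot, whereas the paper assigns responsibility to the box with the highest IOU at training time.

```python
import numpy as np

def yolo_v1_loss(pred, target, S=7, B=2, C=20,
                 lambda_coord=5.0, lambda_noobj=0.5):
    """Sum of the 5 squared-error terms (simplified responsibility rule)."""
    loss = 0.0
    for i in range(S):
        for j in range(S):
            obj = target[i, j, 4] > 0              # an object appears in this cell
            for b in range(B):
                p = pred[i, j, b * 5: b * 5 + 5]
                t = target[i, j, b * 5: b * 5 + 5]
                if obj:
                    # part 1: center coordinates
                    loss += lambda_coord * ((p[0] - t[0])**2 + (p[1] - t[1])**2)
                    # part 2: square roots of width and height
                    loss += lambda_coord * ((np.sqrt(p[2]) - np.sqrt(t[2]))**2
                                            + (np.sqrt(p[3]) - np.sqrt(t[3]))**2)
                    # part 3: confidence of responsible boxes
                    loss += (p[4] - t[4])**2
                else:
                    # part 4: confidence of empty cells, down-weighted by lambda_noobj
                    loss += lambda_noobj * (p[4] - t[4])**2
            if obj:
                # part 5: conditional class probabilities
                loss += np.sum((pred[i, j, B * 5:] - target[i, j, B * 5:])**2)
    return loss
```

A perfect prediction gives a loss of 0, and a single spurious confidence of 1.0 in an empty cell contributes only λ_noobj, which is how the imbalance between empty and occupied cells is remedied.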

3-5. Benefits and drawbacks of YOLO V1

Benefits:

1. Fast
2. The globally trainable system eases the optimization
3. More generalized when testing on other databases

Drawbacks:

1. Bad performance when there are groups of small objects, because each grid cell can only detect one object.
2. The main error of YOLO comes from localization, because the ratios of bounding boxes are totally learned from data and YOLO makes errors on boxes with unusual ratios.

4. YOLO V2

In order to remedy the 2 problems mentioned before and improve the performance, the authors made several modifications in YOLO V2.

4-1. Batch Normalization

Adding Batch Normalization helps to improve mAP by 2%.

4-2. High Resolution classifier

YOLO V2 is trained on high resolution images for the first 10 epochs. This improves the mAP by 4%.

4-3. Convolutional with Anchor Boxes

One of the drawbacks of YOLO V1 is the bad performance in localization of boxes, because bounding boxes are learned totally from data. In YOLO V2, the authors add priors (anchor boxes) to help the localization. In order to introduce the anchors, some modifications are made to the architecture of the network:

1. Remove the fully connected layers.
2. Shrink the input image dimension, so the feature maps after the final convolutional layers have the same dimension as the grid (S ∗ S).
3. Move the prediction of the class of the object from the grid cell to the bounding box. This means a cell can predict different objects with different boxes. So the prediction values for each box include 4 localization values, 1 confidence score for the object, and C conditional probabilities for the class. In this case, every cell predicts N ∗ (C + 5) values, where N is the number of anchors used for one cell.

This improvement increases the recall value.

4-4. Dimension Clusters

This process decides the number of anchors and the predefined ratios for the anchors.

4-5. Direct location prediction

The values of the box center coordinates, x and y, are between 0 and 1, but YOLO V1 can predict values smaller than 0 or bigger than 1. The authors add a logistic activation to force the prediction to fall into this range.

4-6. Fine-Grained Features

This is a similar idea to a feature pyramid. The authors reshape the feature maps with 26 ∗ 26 resolution to 13 ∗ 13 and concatenate them to the output of the last convolutional layer. The last convolutional layer outputs a 13 ∗ 13 ∗ 1024 tensor and the reshaped feature map has a dimension of 13 ∗ 13 ∗ 2048, so the concatenation results in a dimension of 13 ∗ 13 ∗ 3072.

4-7. Multi-scale Training

The model is trained with images of different dimensions. The dimension of the images is randomly chosen every 10 batches during the training.

4-8. New Architecture DarkNet-19

Image 4[2] shows the new architecture.

Image 4: The architecture of DarkNet-19

4-9. New mechanism

By introducing WordNet's tree representation of object labels, YOLO V2 can train with data from different databases. This gives YOLO the ability to predict about 9000 different objects. I think this is not an improvement in the design of YOLO itself, so it does not belong to the purpose of this post. Please check the paper for more information.

4-10. Conclusion of YOLO V2

Compared to YOLO V1, YOLO V2 becomes better, faster and stronger.

5. YOLO V3

The authors of YOLO have tried many techniques to improve the accuracy and performance. Four of them work well for YOLO.

5-1. Bounding Box Prediction


YOLO V2 considers the confidence score as the multiplication of p(Object) and IOU(b, Object): first we calculate the probability of objectness, then take into account the IOU to adjust the probability. YOLO V3 gives a more intuitive interpretation, which considers the confidence score of a box directly as the probability of objectness. The target of the objectness is then defined as the IOU of the predicted box and the ground truth. By using a logistic activation, the prediction is forced into the range between 0 and 1.
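The decoding with logistic activation can be sketched as follows, using the box parametrization from the YOLO V2/V3 papers (bx = σ(tx) + cx, bw = pw · e^tw); the helper names are my own.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(tx, ty, tw, th, to, cx, cy, pw, ph):
    """Decode raw network outputs into a box.

    (cx, cy) is the top-left corner of the grid cell and (pw, ph)
    the anchor (prior) dimensions; the logistic activation forces
    the center offsets and the objectness into (0, 1).
    """
    bx = cx + sigmoid(tx)      # center cannot leave the cell
    by = cy + sigmoid(ty)
    bw = pw * math.exp(tw)     # width/height rescale the anchor
    bh = ph * math.exp(th)
    objectness = sigmoid(to)   # confidence constrained to (0, 1)
    return bx, by, bw, bh, objectness

decode_box(0, 0, 0, 0, 0, cx=3, cy=3, pw=2, ph=2)
# -> (3.5, 3.5, 2.0, 2.0, 0.5): zero outputs land in the cell center
#    with the anchor's own dimensions and a neutral objectness
```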

5-2. Class Prediction


YOLO V2 uses a softmax when calculating the conditional probabilities of the classes. The softmax imposes the assumption that every box has only one label. But in reality, we often give an object several labels, which can overlap. To solve this multi-label problem, YOLO V3 uses an independent logistic classifier for each class, and the binary cross-entropy is used to calculate the loss for each class.
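A small sketch of the difference (the scores are illustrative, not from the paper): with a softmax, two overlapping labels such as "woman" and "person" are forced to share one unit of probability, while independent logistic classifiers can both be confident.

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]   # probabilities sum to 1: one label wins

def independent_logistic(scores):
    # one sigmoid per class, each treated as its own binary problem
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

def binary_cross_entropy(p, y):
    # loss for a single class, used per class in YOLO V3
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

scores = [2.0, 1.5, -3.0]   # e.g. "woman" and "person" both score high
print(softmax(scores))              # the two high scores split the mass
print(independent_logistic(scores)) # both can be close to 1 at the same time
```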

5-3. Prediction Across Scales


YOLO V3 predicts 3 boxes at each scale, so the tensor is S ∗ S ∗ [3 ∗ (4 + 1 + 80)]. Unfortunately, the authors don't give detailed information about how the feature maps for the 3 scales are calculated. (I will write the details in my next post about the implementation of YOLO V3.) In general, the 3 scales are:

1. The last convolutional layer, which has stride 32 compared to the input dimension
2. The layer with stride 16
3. The layer with stride 8

If we have input images with dimension of 416 * 416, the 3 scales are in dimension
of 13 * 13, 26 * 26, and 52 * 52.
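This relation can be checked with a small sketch, assuming the COCO setting of 80 classes and 3 boxes per scale from the tensor shape above:

```python
def scale_grids(input_size, strides=(32, 16, 8), boxes_per_scale=3, num_classes=80):
    """Grid size and channel count at each of the 3 scales.

    Assumes the input dimension is divisible by 32, as in the paper.
    """
    channels = boxes_per_scale * (4 + 1 + num_classes)
    return [(input_size // s, input_size // s, channels) for s in strides]

print(scale_grids(416))
# [(13, 13, 255), (26, 26, 255), (52, 52, 255)]
```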

5-4. New Architecture


The new architecture is named DarkNet-53, shown in Image 5[3].

Image 5: The architecture of DarkNet-53

6. Conclusion
From YOLO V1 to YOLO V3, the authors showed us:

1. One-shot object detection


2. Move the prediction from cell to box to enable the detection of several objects in one cell
3. Use logistic activation for box prediction to constrain the range
4. Use logistic classification to enable multi-labeling
5. Use a feature pyramid to reinforce small object detection

Reference

[1] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You Only Look Once: Unified, Real-Time Object Detection. CVPR 2016

[2] Joseph Redmon, Ali Farhadi. YOLO9000: Better, Faster, Stronger. CVPR 2017

[3] Joseph Redmon, Ali Farhadi. YOLOv3: An Incremental Improvement. Tech report

[4] Real-time Object Detection with YOLO, YOLOv2 and now YOLOv3 (https://medium.com/@jonathan_hui/real-time-object-detection-with-yolo-yolov2-28b1b93e2088)

[5] Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014
Sheng FANG • 2020