YOLO: improvement from V1 to V3
Posted on April 25, 2020
1. Introduction
In this post, I'd like to review the 3 YOLO papers. The main purpose is to understand the design of YOLO and how the authors improved it from version to version. For the details of implementation, such as learning rates and training tricks, please read the experiments parts of the papers[1][2][3]. This post is organized as:

1. The drawbacks of the region proposal approach, as in R-CNN, from the perspective of optimization
2. The design of YOLO V1
3. The drawbacks of each previous version and the improvements in YOLO V2 and YOLO V3

2. Drawbacks of region proposal approach

The region proposal approach is also called the 2-shot or 2-stage approach. The 2 steps are: propose regions which may contain objects, then classify the objects in the proposed regions. As discussed in my previous post, Review of LeNet-5 (2020-04-19-review_lenet.md), a non-globally-trainable system, such as the region proposal approach, is hard to optimize because the individual modules have to be trained separately. In this post, I would like to analyse the 2-shot approach from another point of view: optimization.

To perform a generic optimization, we check the 3 following aspects (BUD): bottleneck, unnecessary work, duplicated work. Let's do this analysis for R-CNN, whose work flow is shown in Image 1[5].

Image 1: Work flow of R-CNN

1. Bottleneck: R-CNN firstly (1) proposes thousands of regions and then (2) does object classification as well as regression to refine the bounding box on each proposed region. Since the classification and regression are repeated thousands of times, many researchers focus on the optimization of (2).
2. Unnecessary work: I think there is no unnecessary work in the work flow of R-CNN.
3. Duplicated work: Much duplicated work can be found in R-CNN. R-CNN uses a CNN to extract features for classification. There are 4 yellow rectangles in step 2 of Image 1. Let's focus on the 2 rectangles in the middle: the higher one and the lower one share an overlapping region, so the feature extraction on the overlapping region is repeated 2 times.

What can we suggest to reduce the duplicated work? Let's look at YOLO.

3. Design of YOLO V1

3-1. Unified detection

YOLO splits the image into S ∗ S grid cells, as shown in Image 2[1].
The regions of objects (the green rectangles) in the image are represented as boxes. The center of a box (the green point) is defined as the center of an object. As the image is split into several grid cells, the center of an object falls into one grid cell. We consider this grid cell the predictor, which predicts the class of the object and the four corners of the box.

Each cell predicts B possible boxes. The prediction of each box includes the center coordinates x and y, the width w and height h of the box, and a confidence score that indicates how accurate the predicted box is.

The coordinates of the box center, x and y, are offsets relative to the grid cell, and their values are between 0 and 1. The width w and height h are measured relative to the image width and height, so their values are also between 0 and 1.

Image 3: The architecture of the YOLO network

Image 3[1] clearly shows the architecture of YOLO. In the implementation, the last fully connected layer takes a 4096-dimensional vector as input and outputs a vector of 7 ∗ 7 ∗ 30 values, which we then reshape into a tensor of dimension 7 ∗ 7 ∗ 30.
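The coordinate encoding described above can be sketched in Python. This is a minimal illustration, not the authors' code; the function name and the example box are made up for the demonstration:

```python
def encode_box(cx, cy, w, h, img_w, img_h, S):
    """Encode an absolute box (center cx, cy and size w, h in pixels)
    into YOLO V1-style targets for an S x S grid."""
    # Which grid cell does the object's center fall into?
    col = int(cx / img_w * S)   # cell column index
    row = int(cy / img_h * S)   # cell row index
    # x, y: offset of the center inside that cell, in [0, 1]
    x = cx / img_w * S - col
    y = cy / img_h * S - row
    # w, h: box size relative to the whole image, in [0, 1]
    return row, col, x, y, w / img_w, h / img_h

# A 100 x 60 box centered at (208, 208) in a 416 x 416 image, S = 7:
# the center falls into cell (3, 3), exactly in its middle (x = y = 0.5).
row, col, x, y, w, h = encode_box(208, 208, 100, 60, 416, 416, 7)
```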
The loss function of YOLO V1 is the sum of 5 parts:

1. $$\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right]$$

2. $$\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right]$$

3. $$\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} (C_i - \hat{C}_i)^2$$

4. $$\lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} (C_i - \hat{C}_i)^2$$

5. $$\sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2$$

where $\mathbb{1}_{i}^{obj}$ denotes if an object appears in cell $i$ and $\mathbb{1}_{ij}^{obj}$ means the $j$-th bounding box predictor in cell $i$ is responsible for that prediction.

3-5. Benefits and drawbacks of YOLO V1

Benefits:

1. Fast
2. The globally trainable system eases the optimization
3. More generalized when testing on other databases

Drawbacks:

1. Each grid cell predicts boxes of only one class, so groups of small objects whose centers fall into the same cell are hard to detect.
2. Localization error, especially for small boxes.

4. Design of YOLO V2

4-1. Batch Normalization

Adding Batch Normalization helps to improve mAP by 2%.

4-2. High resolution classifier

YOLO V2 is trained on high resolution images for the first 10 epochs. This improves mAP by almost 4%.

4-3. Anchor boxes

To predict with anchor boxes, YOLO V2 makes 3 changes:

1. Remove the fully connected layers.
2. Shrink the input image dimension, so the feature maps after the final convolutional layers own the same dimension as the grid (S ∗ S).
3. Move the prediction of the class of object from the grid cell to the bounding box. This means a cell can predict different objects with different boxes. So the prediction values for each box include 4 localization values for the object, 1 confidence score for the object, and C conditional probabilities for the classes. In this case, every cell predicts N ∗ (C + 5) values, where N means the number of anchors used for one cell.
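The size of the prediction per cell can be sketched as a quick calculation. The values below (5 anchors, 20 classes) are just illustrative choices, not prescribed by the post:

```python
def yolo_v2_output_shape(S, N, C):
    """Shape of a YOLO V2-style output feature map:
    an S x S grid where every cell predicts N anchor boxes,
    each with 4 localization values, 1 confidence score,
    and C conditional class probabilities."""
    return (S, S, N * (C + 5))

# e.g. a 13 x 13 grid, 5 anchors, 20 classes:
print(yolo_v2_output_shape(13, 5, 20))  # (13, 13, 125)
```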
5. Design of YOLO V3

5-1. Objectness

In YOLO V1 and V2, the confidence score takes into account the IOU to adjust the probability, while YOLO V3 gives a more intuitive understanding, which considers the confidence score of the box directly as the probability of objectness. Then the probability of objectness is defined as the IOU of the predicted box and the ground truth. By using a logistic activation, the prediction is forced into the range between 0 and 1.
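The logistic activation mentioned above can be sketched as follows. This is a generic sigmoid, not code from the papers:

```python
import math

def logistic(t):
    """Squash a raw network output into (0, 1), so that it can be
    read directly as the probability of objectness."""
    return 1.0 / (1.0 + math.exp(-t))

# Any real-valued prediction is forced between 0 and 1:
print(logistic(0.0))  # 0.5
print(logistic(4.0))  # ~0.982
```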
5-2. Multi-scale prediction

The paper[3] does not give much information about how to calculate the feature maps for the 3 scales. (I will write the details in my next post about the implementation of YOLO V3.) In general, the 3 scales are obtained by downsampling the input by strides of 32, 16, and 8. If we have input images with a dimension of 416 ∗ 416, the 3 scales have dimensions of 13 ∗ 13, 26 ∗ 26, and 52 ∗ 52.
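These scale dimensions can be checked with a quick calculation, assuming downsampling strides of 32, 16, and 8, which reproduce the numbers above:

```python
def scale_dims(input_size, strides=(32, 16, 8)):
    """Feature-map side length at each detection scale,
    assuming the input is downsampled by the given strides."""
    return [input_size // s for s in strides]

print(scale_dims(416))  # [13, 26, 52]
```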
6. Conclusion

From YOLO V1 to YOLO V3, the authors showed us how to remove the duplicated work of the 2-stage approach with a unified, globally trainable detector, and how to improve it step by step: Batch Normalization, high resolution training, anchor boxes, and multi-scale prediction.
Reference

[1] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You Only Look Once: Unified, Real-Time Object Detection. CVPR 2016.
[2] Joseph Redmon and Ali Farhadi. YOLO9000: Better, Faster, Stronger. CVPR 2017.
[3] Joseph Redmon and Ali Farhadi. YOLOv3: An Incremental Improvement. arXiv 2018.
[4] Real-time Object Detection with YOLO, YOLOv2 and now YOLOv3 (https://medium.com/@jonathan_hui/real-time-object-detection-with-yolo-yolov2-28b1b93e2088)
Tags: review, review_det, selected

Sheng FANG • 2020