Professional Documents
Culture Documents
1 Introduction
Faster R-CNN is proposed to shorten the time spent on region proposal step during test time. Previous
region proposal method such as Selective Search (SS), is an order slower than state-of-the-art object detection
network (Fast R-CNN). Hence the authors of Faster R-CNN came with the idea that region proposal step
can be combined with Fast R-CNN, so that it can take advantage of GPUs as well as sharing computations.
Faster R-CNN is composed of two modules. The first one is Region Proposal Network (RPN), which share
convolutional layers with the second module, Fast R-CNN.
The RPN first generates a set of bounding boxes with one or different size scales and ratios on each pixel.
Since there will be many boxes with a relatively large overlap area, we will use Non-Maximum Suppression
(NMS) technique to reduce the number of candidate boxes. Then, bounding boxes with Intersection-over-
Union (IoU) score (with ground-truth box) higher than 0.7 are selected as foreground, and that lower than 0.3
as background. To optimize the RPN during the training, both classification loss and bounding box regression
loss will be calculated. Classification Loss is the entropy loss on whether it’s foreground or background. For
Bounding Box Regression Loss, the difference between the regression of foreground box and that of ground
truth box is plugged into L1 function, and then sum all terms together[1].
For the second module (or detection network), Fast R-CNN, there is a ”head” part which processes the
original image using several convolutional layers and max pooling layers to produce a feature map. The
”tail” part takes object proposals (from RPN) to the region of interest (RoI) pooling layer and extracts a
fixed-length feature vector from the feature map. Then, the feature vector is put into a sequence of fully
connected layers which will branch into two output layers. One will produce softmax probability to estimate
over 20 object classes and a “background” class. Another will output four real-valued numbers for each of
the 20 object classes. Note that a set of 4 values will be used to calculate refined bounding box positions for
a certain class [2].
The author introduced three training methods. The first one is 4-Step Alternating Training: initialize
the convolutional layers in RPN network by a pre-trained model and train the RPN network; initialize the Fast
R-CNN convolutional layers with the pre-trained model; use the proposals generated from the trained RPN to
train the Fast R-CNN; use the trained RPN to train the layers above the shared convolutional layers of Fast
R-CNN. The second one is Approximate joint training, which is the one we use to train out model, and it will
be introduced later. The third one is Non-approximate joint training. It involves the derivative w.r.t. the
proposal boxes’ coordinates that is ignored in the second method.
2 Implementation
In this section, we will briefly introduce the main idea of the code our group developed for this project. We
will explain the architecture of the code along with some important hyper-parameters as well as the training
method we used for this network.
2.1 Architecture
The architecture of the network is explained in great detail in the original paper [1]. Thus, we mainly follow
the code structure of Girshick, one of the paper author, from his github repository [1].
The structure of the network is that a convolutional neural network is on the base and its feature map
from the last layer is used to feed a region proposal network (RPN). We used two kinds of base net in our
1
project: the VGG16 and Res101. On the other hand, we generate initial bounding boxes for each pixels
through the image, (the initial bounding boxes are called anchors) and RPN learns to extract foreground
anchors and adjust its shape to get bounding boxes. Then we apply non-maximum suppression (NMS) layer
to the proposed bounding boxes to eliminate their numbers. The suppression is done to eliminate boxes that
overlap greatly and only keeps those with the top foreground scores.
So far, the network has proposed bounding boxes for the region of interest, then we will use them to find
classes for the object int he bounding boxes as well as a class dependent regression for the bounding boxes.
We use crop pooling to extract feature maps to a given shape, so that all bounding boxes will give feature
map with the same shape. As a result, we can pass the feature maps of standard shape from the bounding
boxes to the last classifier layer to have class and bounding box regression coefficients for the bounding boxes.
Both of the classifier and the bounding box regression layer are one layer linear fully connected network.
2.3 Hyper-parameters
Above is the basic structure of our network, and we will introduce some important hyper-parameters such as
the anchor initialization, the maximum number of bounding boxes kept and learning rate decay. We initialize
our anchors on each pixel of the feature map got from the base net. This corresponds to put anchors on
every 16 pixels on each dimension. For each pixel of the feature map, there will be 9 anchors initialized. The
9 anchors are generated by three different scale and three different height-width ratio. Our scales are 8, 16
and 32. Our ratios are 1 : 2, 1 : 1 and 2 : 1.
Also, in training, we need to use the target bounding boxes as labels to train our RPN. We choose the
anchors whose overlap with ground truth box is greater than RPN POSITIVE OVERLAP = 0.7 as foreground
label and whose overlap with any ground truth box is lower than RPN NEGATIVE OVERLAP = 0.3 as negative
2
Net Optimizer step size decay epoch total epoch batch size # GPU
VGG16 SGD 4.0 × 10−3 8 12 4 2
Res101 SGD 4.0 × 10−3 8 10 4 2
label or background label. Those who does not fulfill any of the two conditions are not considered. Thus,
we have chosen the the appropriate number of anchors to train the RPN network. With the reduced number
of anchors, we will have reduced number of ROIs which are bounding boxes generated from the chosen
anchors by applying transformation calculated by RPN. However, we will choose ROIs whose overlap with
the foreground higher than FG THRESH = 0.5 as foreground ROIs and use those ROIs to do crop pooling and
then get classifier’s classification loss and bounding box regression loss.
Although we have many other hyperparameters, we do not list all of them here. Table 1 shows the training
hyperparameters for two nets we trained, VGG16 and Res101. Since we also did some other measurements
and comparison with the original paper[1], we changed the above mentioned hyperparameters in those
experiments and the particular choice will be mentioned when we introduce the experiment results.
3 Experiments
In this section, we will introduce all experiments we did with our developed Faster R-CNN code. We will
show that the loss decreasing trend with trained epochs, the mean Average Precision (mAP) of two trained
model (Res101 and VGG16). We will also compare those results with the paper’s results as well as the image
detection time. We will also compare settings of anchors as did in the original paper [1].
3.1 Dataset
In the original paper, the authors did experiments on PASCAL VOC 2007, 2012, and COCO detection
benchmarks. While in our project, we choose only to do experiments on the smaller dataset PASCAL VOC
2007, due to the time limit. In total this dataset consists of 9,963 images (containing 24,640 annotated ob-
jects) over 20 object categories. It has been split into 50% for training and 50% for testing. The distributions
of images and objects by class are approximately equal across the training and test sets. [3]
3.2 Loss
As mentioned in section 2.2, we trained the network by decreasing the total loss. In Figure 1 in the next
page, we plotted the total loss and the four components for a training session on net VGG16 for 20 epochs. In
this training session, we used SGD optimizer and the initial step size is 1.0 × 10−3 . The step size decrease by
a factor of 10 for every 5 epochs. This is our first trial and thus the step size schedule may not be optimal
since the decrease of rpn box loss and rcnn box loss is not dramatic.
The accuracy of detection of VGG16 is comparable with the original paper result [1]. In that paper, the
published number is 69.9. This is considerable since in the original paper, the net was trained for 240k
3
1.0 11.0 21.0
0.77
total loss
rpn class loss 0.38
4
rpn box loss
rcnn class loss 0.33
rcnn box loss
3 0.17
3.44
2
1.72
1 0.41
0.20
0
0.00
1.0 6.0 11.0 16.0 21.0
epochs
Figure 1: Total loss and its four components vs. trained epochs for VGG16
iterations which is approximately 50 epochs for VOC 2007 and our net was trained for 12 epochs. Besides,
our detection time is also considerably good since in the original paper, the reported detection rate is 17 fps
and we achieve similar number and even higher.
Table 3 shows mAP for our best trained model on different classes of objects. From the table, we can see
that even though in general, Res101 does better than VGG16, on detecting bus and cars, VGG16 does better.
Also, the two kinds of net share same preferences on classes. For example, both of them are not good at
detecting chair potted plant and boat.
Net mAP areo bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
VGG16 68.7 67.0 78.2 65.8 52.2 51.6 79.8 84.4 78.2 45.4 78.4 61.1 78.8 79.3 74.8 77.9 44.4 66.8 64.6 75.1 70.3
Res101 74.1 70.4 80.1 78.2 64.1 57.6 78.8 79.8 87.2 54.3 82.2 71.1 85.6 86.3 79.4 78.9 48.1 76.8 74.9 77.2 71.9
Table 3: Results on PASCAL VOC 2007 test set with Fast R-CNN detectors and VGG16 or Res101.
4
3.5 Selected Examples of Object Detection Results on PASCAL VOC 2007 Test
Set
The following figure shows some results on the PASCAL VOC 2007 test set. The left one is done with VGG16,
and the right one with Res101. As indicated by the mAP metric, the one using Res101 structure is slightly
better. For instance, there are more boats detected in the image locating on column 2, row 3.
Figure 2: Selected Examples of Object Detection Results on PASCAL VOC 2007 Test Set
References
[1] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object
detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS),
2015.
[2] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages
1440–1448, 2015.